Matthew Dempsky from Mochi mailed me some suggestions for nodefinder:
- multicast_ttl is now an application parameter, so multicast routing can be used;
- separate send and receive sockets are employed to ensure the right source address on BSD;
- the discovery packet is signed using the cookie as a secret, which prevents (hopeless but annoyingly logged) connect attempts to nodes with different cookies, and also prevents a list_to_atom flood from packets that happen to match.
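For the curious, here's a minimal sketch of the signing idea (invented module and function names, not nodefinder's actual code): HMAC the announced node name with the cookie as the key, and drop anything that doesn't verify before ever attempting a connect or a list_to_atom.

```erlang
-module (discovery_sign_sketch).
-export ([ sign/1, verify/1 ]).

%% sketch only: the real nodefinder code is structured differently

secret () ->
  erlang:atom_to_binary (erlang:get_cookie (), latin1).

%% sign the announced node name with an HMAC keyed by the cookie
sign (NodeName) when is_list (NodeName) ->
  Payload = list_to_binary (NodeName),
  Mac = crypto:mac (hmac, sha256, secret (), Payload),
  <<Mac/binary, Payload/binary>>.

%% a bad MAC means a different cookie, so we never connect
%% and never call list_to_atom on the contents
verify (<<Mac:32/binary, Payload/binary>>) ->
  case crypto:mac (hmac, sha256, secret (), Payload) of
    Mac -> { ok, binary_to_list (Payload) };
    _ -> ignore
  end;
verify (_) ->
  ignore.
```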
Thanks Matt!
Wednesday, March 26, 2008
Friday, March 21, 2008
network partition ... oops
well, we had our first network partition on ec2 yesterday evening. we think it was amazon doing maintenance, because it happened in both our production and development clusters.
it turns out things went pretty well considering we've done nothing yet to address network partitioning. two islands formed, but they never tried to join back up (it turns out we only try to connect nodes when schemafinder starts). after a while each island's schemafinder decided that the other island was dead and removed it from the schema. so we ended up with two disjoint mnesia instances. each had complete data, so each was able to serve. of course, they were also creating data and diverging over time. right now we're not doing anything that important, so it didn't really matter: we just picked one island to survive and restarted the nodes in the other island after nuking their mnesia directories.
the experience has convinced us, however, that we need to move our network partition recovery strategy up the priority list. the good news is that we have an out-of-band source of truth about what the set of nodes should be (via ec2-describe-instances), so it should be possible to detect that a network partition has occurred. the other good news is that we care more about high availability than transactional integrity, so we're willing to set master nodes on tables based upon simple heuristics.
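as a rough sketch of the direction we're thinking (not production code; the expected node list would be derived from ec2-describe-instances, and the module and function names here are made up):

```erlang
-module (partition_sketch).
-export ([ partition_suspected/1, prefer_island/1 ]).

%% compare what mnesia can see against the out-of-band node list
partition_suspected (ExpectedNodes) ->
  Running = mnesia:system_info (running_db_nodes),
  lists:sort (Running) =/= lists:sort (ExpectedNodes).

%% favour availability: declare one island authoritative for all
%% tables, then wipe and rejoin the nodes on the losing side
prefer_island (SurvivingNodes) ->
  ok = mnesia:set_master_nodes (SurvivingNodes).
```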
Saturday, March 15, 2008
walkenfs improvements
I've made some progress towards my goal of making walkenfs usable by those who know nothing about Erlang or Mnesia.
- schemafinder, which provides Mnesia auto-configuration, has been open sourced.
- mount and umount work (although the mount line is very baroque), so there's no need to start an Erlang emulator explicitly.
- the distribution includes an example init script very close to the one we actually use.
Wednesday, March 12, 2008
mnesia and ec2
Mnesia is Erlang's built-in multi-master distributed database, and it's one of the reasons we chose Erlang for our startup. While it is mostly a great fit for EC2, it needs some tweaks to work there. In particular, we wanted the following to be automatic:
- Discover nodes entering or leaving the system
- Have them join the Mnesia database
- Have them take responsibility for a portion of the data
It's actually surprisingly complicated, given that at root the way to add a node to a running database is to call mnesia:change_config/2 as mnesia:change_config(extra_db_nodes, NodeList). However, while adding nodes is easy, removing nodes is harder. Detecting nodes to remove is not that hard: we go with the (EC2-centric) strategy of periodically updating a record in a table to "stay alive" (which also gives us a way to mark clean shutdown); nodes that fail to check in are eventually considered dead. To handle multiple node failures we patch mnesia to provide mnesia_schema:del_table_copies/2, the multiple-table analog to mnesia:del_table_copy/2. To guard against rejoining the global schema after having been removed, but without having removed one's mnesia database directory (this is a no-no), we check with the other database nodes to see if they think we've been removed.
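The join-and-stay-alive core boils down to something like the sketch below (schemafinder's real code does more bookkeeping; the alive table, record, and function names are invented for illustration):

```erlang
-module (join_sketch).
-export ([ join/1, heartbeat/0 ]).

-record (alive, { node, last_seen }).

%% join an already-running cluster: point mnesia at any known db node,
%% then keep the schema on disc locally
join (KnownDbNodes) ->
  { ok, _ } = mnesia:change_config (extra_db_nodes, KnownDbNodes),
  mnesia:change_table_copy_type (schema, node (), disc_copies).

%% the periodic "stay alive" write; a node whose record goes stale is
%% eventually declared dead and removed from the schema
heartbeat () ->
  mnesia:dirty_write (#alive{ node = node (),
                              last_seen = erlang:system_time (second) }).
```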
Even with all this complication, we have yet to address the problem of handling network partitioning automatically. We'll see if we find time for that before it bites us in the butt.
Now available on google code.
Wednesday, March 5, 2008
controlling mnesia db size
genexpire is something quick that we rolled together to satisfy our paranoia about out-of-control databases (ram-based dbs are especially worrisome). Basically, it runs periodically and enforces a maximum size for each database fragment. The per-fragment limit depends upon how many fragments are on the node, so in practice you get the space of your most crowded node in a distributed config. (With fragmentron things are balanced to within one fragment, so with enough fragments this is ok.)
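A toy version of the enforcement loop for a single fragment might look like this sketch (genexpire itself is smarter about which records it deletes and about splitting the budget across the fragments on a node):

```erlang
-module (expire_sketch).
-export ([ enforce/2 ]).

%% trim one fragment down to MaxWords of memory by deleting keys
%% until it fits; illustrative only
enforce (Frag, MaxWords) ->
  case mnesia:table_info (Frag, memory) of
    Mem when Mem > MaxWords ->
      case mnesia:dirty_first (Frag) of
        '$end_of_table' -> ok;
        Key ->
          ok = mnesia:dirty_delete (Frag, Key),
          enforce (Frag, MaxWords)
      end;
    _ ->
      ok
  end.
```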
Now available on google code.
Sunday, March 2, 2008
Procfs
Our next sprint's theme is "monitoring", so this weekend I thought I'd slap together a procfs-style filesystem with fuserl. I've now got the contents of erlang:processes (), erlang:process_info (), erlang:ports (), erlang:port_info (), and erlang:system_info () exposed in a filesystem format. This is less useful than I envisioned on Friday, but it was pedagogical.
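The fuserl plumbing aside, the interesting part is just rendering the introspection BIFs as file content; a minimal sketch (invented module, not the actual filesystem code):

```erlang
-module (procfs_sketch).
-export ([ ls/0, cat/1 ]).

%% one directory entry per process
ls () ->
  [ erlang:pid_to_list (P) || P <- erlang:processes () ].

%% render process_info/1 as key/value lines, roughly what a read ()
%% on the corresponding file would return
cat (PidString) ->
  Pid = erlang:list_to_pid (PidString),
  lists:flatten ([ io_lib:format ("~p: ~p~n", [ K, V ])
                   || { K, V } <- erlang:process_info (Pid) ]).
```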
Now available on google code.