Wednesday, April 29, 2009

erlrc erlang factory talk

I'll be giving a talk about erlrc at the Erlang Factory. erlrc is our scheme for integrating the Erlang VM with the OS package manager (we use deb, but it should be possible to integrate with rpm and others as well).

The google code project contains the powerpoint slides as well as other useful documentation.

In the talk I indicate that erlrc is componentized for embedding in other build systems and launch processes, and I also mention that we built our own automake-based build system (which is used for all the languages we develop in, not just Erlang). That system is called framewerk and is also available from google code.

I also claim that once your entire software state is represented in OS packages, simple and effective strategies for deployment and provisioning emerge. Two simple tools that we've developed along these lines are ec2-do and apt-snapshot.

ec2-do is an escript that provides rolling window execution of an arbitrary command across a set of ec2 machines (typically, an ec2 group). When we want to launch we do something like this to sync all the machines with the package archive,
ec2-do -g production -m 1000 -b -d apt-get update
and then something like this to install the package
ec2-do -g production -m 3 -b -d apt-get install <newpackage>
ec2-do is available from the erlrc downloads page.

apt-snapshot is even simpler than ec2-do (just a shell script), but very useful for provisioning when using deb packages. It can associate the current package state of a server with a symbolic tag which can later be restored (possibly from another host). To create a tag we do something like
apt-snapshot create <tag>
and to restore it later we do
apt-snapshot restore <host>:<tag>
We use this in three distinct ways. First, we have a cron job creating regular snapshots on every machine with timestamp tag names in case of snafu (to allow us to rollback). Second, when we launch something particularly risky we create a symbolic tag manually for rollback. Third, when we start up new machines, we get them in the correct state quickly by tagging one of the currently running machines and then telling the new machines to restore that tag from that host (our EC2 images are sensitive to the extra arguments passed to ec2-run-instance, but you can also just use ec2-do for this purpose, since restoring a tag is idempotent).

apt-snapshot is available from the erlrc downloads page.

Wednesday, August 6, 2008

scaling mnesia with local_content

so we've been spending the last two weeks trying to scale our mnesia application for a whole bunch of new traffic that's about to show up. previously we had to run alot of ec2 instances since we were using ets (RAM-based) storage; once we solved that we started to wonder how many machines we could turn off.

initial indications were disappointing in terms of capacity. none of cpu, network, memory, or disk i/o seemed particularly taxed. for instance, even though raw sequential performance on an ec2 instance can hit 100mb/s, we were unable to hit above 1mb/s of disk utilization. one clue: we did get about 6mb/s of disk utilization when we started a node from scratch. in that case, 100 parallel mnesia_loader processes grab table copies from remote nodes. thus even under an unfavorable access pattern (writing 100 different tables at once), the machines were capable of more.

one problem we suspected was the registered process mnesia_tm, since all table updates go through this process. the mnesia docs do say that it is "primarily intended to be a memory-resident database". so one thought was that mnesia_tm was hanging out waiting for disk i/o to finish and this was introducing latency and lowering throughput; with ets tables, updates are cpu bound so this design would not be so problematic. (we already have tcerl using the async_thread_pool, but that just means the emulator can do something else, not the mnesia_tm process in particular). thus, we added an option to tcerl to not wait for the return value of the linked-in driver before returning from an operation (and therefore not to check for errors). that didn't have much impact on i/o utilization.

we'd long ago purged transactions from the serving path, but we use sync_dirty alot. we thought maybe mnesia_tm was hanging out waiting for acknowledgements from remote mnesia_tm processes and this was introducing latency and lowering throughput. so we tried async_dirty. well that helped, except that under load the message queue length for the mnesia_tm process began to grow and eventually we would have run out of memory.

then we discovered local_content, which causes a table to have the same definition everywhere, but different content. as a side effect, replication is short-circuited. so with very minor code changes we tested this out and saw a significant performance improvement. of course, we couldn't do this to every table we have; only for data for which we were ok with losing if the node went down. however it's neat because now there are several types of data that can be managed within mnesia, in order of expense:
  1. transactional data. distributed transactions are extremely expensive, but sometimes necessary.
  2. highly available data. when dirty operations are ok, but multiple copies of the data have to be kept, because the data should persist across node failures.
  3. useful data. dirty operations are ok, and it's ok to lose some of the data if a node fails.
the erlang efficiency guide says "For non-persistent database storage, prefer Ets tables over Mnesia local_content tables.", i.e., bypass mnesia for fastest results. so we might do that, but right now it's convenient to have these tables acting like all our other tables.

interestingly, i/o utilization didn't go up that much even though overall capacity improved alot. we're writing about 1.5 mb/s now to local disks. instead we appear cpu bound now; we don't know why yet.

Monday, June 23, 2008

Tokyocabinet and Mnesia

As Daisy has already indicated, it is now possible to plug arbitrary storage strategies into Mnesia. For those who are familiar with mnesia_access, this is different; mnesia_access only covers reads and writes, not schema manipulations, and has other deficiencies that it render it useless for adding a new storage type in practice (what mnesia_access is great for is changing the semantics of mnesia operations, e.g., mnesia_frag). This project lets you make tables that are essentially indistinguishable from the built-in mnesia table types (ram_copies, disc_copies, disc_only_copies).

Anyway our goal was to get a good on-disk ordered_set table type, since we've found ordered_sets very useful for the kinds of problems we're solving but we're tired of being limited by memory. After looking around for a while Tokyocabinet emerged as our favorite for underlying implementation. We considered BDB and libmysql, but Tokyocabinet seemed simple and faster, and had a more accommodating license. So we ported Tokyocabinet to Erlang and then used the above storage API to connect to Mnesia.

As a side benefit Tokyocabinet might also be preferred to dets even for set-type applications because of the lack of file size limit and high performance. Tokyocabinet actually has a set-type storage strategy that we'd like to define an Erlang Term Store for, but as of this post, the set-type store doesn't support cursor positioning based upon a key, which makes the implementation of next tedious (although not impossible). So I'm waiting on the author to add that call; if that happens, we could have a nicer on-disk set-type table as well.

Everything is available on google code: mnesiaex (storage API) and tcerl (erlang port of Tokyocabinet).

Also Daisy (aka Joel Reymont) is really nice to work with.