Wednesday, August 6, 2008

scaling mnesia with local_content

so we've been spending the last two weeks trying to scale our mnesia application for a whole bunch of new traffic that's about to show up. previously we had to run a lot of ec2 instances since we were using ets (RAM-based) storage; once we solved that, we started to wonder how many machines we could turn off.

initial indications were disappointing in terms of capacity. none of cpu, network, memory, or disk i/o seemed particularly taxed. for instance, even though raw sequential performance on an ec2 instance can hit 100mb/s, we were unable to hit above 1mb/s of disk utilization. one clue: we did get about 6mb/s of disk utilization when we started a node from scratch. in that case, 100 parallel mnesia_loader processes grab table copies from remote nodes. thus even under an unfavorable access pattern (writing 100 different tables at once), the machines were capable of more.

one problem we suspected was the registered process mnesia_tm, since all table updates go through it. the mnesia docs do say that mnesia is "primarily intended to be a memory-resident database". so one thought was that mnesia_tm was hanging out waiting for disk i/o to finish, which was introducing latency and lowering throughput; with ets tables, updates are cpu bound, so this design would not be so problematic. (we already have tcerl using the async_thread_pool, but that just means the emulator can do something else, not the mnesia_tm process in particular.) thus, we added an option to tcerl to not wait for the return value of the linked-in driver before returning from an operation (and therefore not to check for errors). that didn't have much impact on i/o utilization.

we'd long ago purged transactions from the serving path, but we use sync_dirty a lot. we thought maybe mnesia_tm was hanging out waiting for acknowledgements from remote mnesia_tm processes, again introducing latency and lowering throughput. so we tried async_dirty. well, that helped, except that under load the message queue of the mnesia_tm process began to grow, and eventually we would have run out of memory.
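
to make the difference concrete, here's the same write in both contexts (standard mnesia API; the counter table is made up):

bump (Key, Value) ->
  Write = fun () -> mnesia:write ({ counter, Key, Value }) end,
  %% sync_dirty returns after remote mnesia_tm processes have
  %% acknowledged the update; async_dirty returns as soon as the
  %% update is queued, which is where the queue growth came from.
  mnesia:sync_dirty (Write).
  %% mnesia:async_dirty (Write) is the fire-and-forget variant.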

then we discovered local_content, which causes a table to have the same definition everywhere but different content on each node. as a side effect, replication is short-circuited. so with very minor code changes we tested this out and saw a significant performance improvement. of course, we couldn't do this to every table we have; only to data we were ok with losing if the node went down. however it's neat because now there are several types of data that can be managed within mnesia, in order of expense:
  1. transactional data. distributed transactions are extremely expensive, but sometimes necessary.
  2. highly available data. when dirty operations are ok, but multiple copies of the data have to be kept, because the data should persist across node failures.
  3. useful data. dirty operations are ok, and it's ok to lose some of the data if a node fails.
the erlang efficiency guide says "For non-persistent database storage, prefer Ets tables over Mnesia local_content tables.", i.e., bypass mnesia for fastest results. so we might do that, but right now it's convenient to have these tables acting like all our other tables.
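
for reference, switching a table over is a one-option change; a minimal sketch (the stats table is made up):

%% same definition on every node, content stays node-local,
%% no replication traffic.
mnesia:create_table (stats,
                     [ { local_content, true },
                       { type, ordered_set },
                       { attributes, [ key, value ] } ]).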

interestingly, i/o utilization didn't go up that much even though overall capacity improved a lot. we're writing about 1.5 mb/s now to local disks. instead we appear to be cpu bound; we don't know why yet.

Monday, June 23, 2008

Tokyocabinet and Mnesia

As Daisy has already indicated, it is now possible to plug arbitrary storage strategies into Mnesia. For those who are familiar with mnesia_access, this is different; mnesia_access only covers reads and writes, not schema manipulations, and has other deficiencies that render it useless in practice for adding a new storage type (what mnesia_access is great for is changing the semantics of mnesia operations, e.g., mnesia_frag). This project lets you make tables that are essentially indistinguishable from the built-in mnesia table types (ram_copies, disc_copies, disc_only_copies).

Anyway, our goal was to get a good on-disk ordered_set table type, since we've found ordered_sets very useful for the kinds of problems we're solving, but we're tired of being limited by memory. After looking around for a while, Tokyocabinet emerged as our favorite for the underlying implementation. We considered BDB and libmysql, but Tokyocabinet seemed simpler and faster, and had a more accommodating license. So we ported Tokyocabinet to Erlang and then used the above storage API to connect it to Mnesia.

As a side benefit, Tokyocabinet might also be preferred to dets even for set-type applications because of its high performance and lack of a file size limit. Tokyocabinet actually has a set-type storage strategy that we'd like to define an Erlang Term Store for, but as of this post the set-type store doesn't support cursor positioning based upon a key, which makes the implementation of next tedious (although not impossible). So I'm waiting on the author to add that call; if that happens, we could have a nicer on-disk set-type table as well.

Everything is available on google code: mnesiaex (storage API) and tcerl (erlang port of Tokyocabinet).

Also Daisy (aka Joel Reymont) is really nice to work with.

Sunday, June 15, 2008

generating app files

We automatically generate our .app files around here from analysis of the source code. However, this trick is buried inside fw-template-erlang, which most (all?) people outside our group do not use. I thought I'd highlight how it works.

A typical OTP application specification file looks like:

{ application, walkenfs,
  [
    { description, "Distributed filesystem." },
    { vsn, "0.1.7" },
    { modules, [ walkenfssup, walkenfsfragmentsrv, walkenfs, walkenfssrv ] },
    { registered, [ walkenfsfragmentsrv ] },
    { applications, [ kernel, stdlib ] },
    { mod, { walkenfs, [] } },
    { env, [ { linked_in, false },
             { mount_opts, "" },
             { mount_point, "/walken" },
             { prefix, walken },
             { read_data_context, async_dirty },
             { write_data_context, sync_dirty },
             { read_meta_context, transaction },
             { write_meta_context, sync_transaction },
             { attr_timeout_ms, 0 },
             { entry_timeout_ms, 0 },
             { reuse_node, true },
             { start_timeout, infinity },
             { stop_timeout, 1800000 },
             { init_fragments, 7 },
             { init_copies, 3 },
             { copy_type, n_disc_only_copies },
             { buggy_ets, auto_detect },
             { port_exit_stop, true } ] }
  ]
}.
There are a lot of bits here that can be filled in automatically. It helps to think in terms of the template:

{ application, @FW_PACKAGE_NAME@,
  [
    { description, "@FW_PACKAGE_SHORT_DESCRIPTION@" },
    { vsn, "@FW_PACKAGE_VERSION@" },
    { modules, [ @FW_ERL_APP_MODULES@ ] },
    { registered, [ @FW_ERL_APP_REGISTERED@ ] },
    { applications, [ kernel, stdlib @FW_ERL_PREREQ_APPLICATIONS@ ] },
    @FW_ERL_APP_MOD_LINE@
    { env, @FW_ERL_APP_ENVIRONMENT@ }
    @FW_ERL_APP_EXTRA@
  ]
}.
This is an "automake" style template where @VARIABLE@ indicates something that will be substituted.

Now we can say where each of these template values comes from when using fw-template-erlang:
  1. FW_PACKAGE_NAME: specified by the developer. if you're using automake you've had to specify this already via AC_INIT, so it's good to reuse that rather than require duplicate specification.
  2. FW_PACKAGE_SHORT_DESCRIPTION: specified by the developer.
  3. FW_PACKAGE_VERSION: specified by the developer; like the package name, if you're using automake this information is already contained in the arguments to AC_INIT so reuse that.
  4. FW_ERL_APP_MODULES: generated by analysis of the source code directory.
  5. FW_ERL_APP_REGISTERED: generated by analysis of the source code directory.
  6. FW_ERL_PREREQ_APPLICATIONS: specified by the developer.
  7. FW_ERL_APP_MOD_LINE: partially generated by analysis of the source code directory, partially specified by the developer (see below).
  8. FW_ERL_APP_ENVIRONMENT: specified by the developer.
  9. FW_ERL_APP_EXTRA: specified by the developer. this is for unusual extra directives like included applications.
So the interesting bits come down to FW_ERL_APP_MODULES, FW_ERL_APP_REGISTERED, and FW_ERL_APP_MOD_LINE being generated automatically via inspection of the source code.

First, a note on intervention. Anytime you are attempting to automate a task that was previously done by humans, it helps to allow a human to override any of the automatic settings that you generate by default. That way, you can be correct only 95% of the time and still be a huge timesaver. In practice we've found that although it is easy to come up with code examples that flummox the automatic strategies, such code is never actually written by people in the normal course of their work. In fact, we've found that application files we get from the universe are often missing registered process names that our automatic strategy finds. So empirically we're running at 100% correct so far, but fw-template-erlang still retains the ability to override any automatically generated value.

Now the actual trick to computing all this stuff is that Erlang contains an Erlang compiler as a library. Here's an example of leveraging this to generate all the module names from a set of files; any file that contains a
-fwskip()
directive will be ignored.
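
For example, a module can opt out like this (the module name and the attribute's argument are arbitrary; only the attribute name is matched):

-module (scratch_helper).
-fwskip ([]).  %% any -fwskip (...) attribute excludes this file

And here's the module-finding code itself: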

-module (find_modules).
-export ([ main/1 ]).

is_skipped ([]) -> false;
is_skipped ([ { attribute, _, fwskip, _ } | _ ]) -> true;
is_skipped ([ _ | Rest ]) -> is_skipped (Rest).

print_module (Dir, F) ->
  case compile:file (F, [ binary, 'E', { outdir, Dir } ]) of
    { ok, Mod, { Mod, _, Forms } } ->
      case is_skipped (Forms) of
        true ->
          ok;
        false ->
          port_command (get (three), io_lib:format ("~p~n", [ Mod ]))
      end;
    _ ->
      ok
  end.

main ([ Dir | Rest ]) ->
  ok = file:make_dir (Dir),

  try
    Three = open_port ({ fd, 0, 3 }, [ out ]),
    % ugh ... don't want to have to change all these function signatures,
    % so i'm gonna be dirty
    put (three, Three),
    lists:foreach (fun (F) -> print_module (Dir, F) end, Rest)
  after
    { ok, FileNames } = file:list_dir (Dir),
    lists:foreach (fun (F) -> file:delete (Dir ++ "/" ++ F) end, FileNames),
    file:del_dir (Dir)
  end;
main ([]) ->
  Port = open_port ({ fd, 0, 2 }, [ out ]),
  port_command (Port, "usage: find-modules.esc tmpdir filename [filename ...]\n"),
  halt (1).
The output is on file descriptor 3 in order to isolate the desired output from the various messages emitted by the Erlang runtime. This was originally an escript, but to be compatible with older versions of Erlang we changed it to run as follows:

#! /bin/sh

# NB: this script is called find-modules.sh

# unfortunately, escript is a recent innovation ...

ERL_CRASH_DUMP=/dev/null
export ERL_CRASH_DUMP

erl -pa "${FW_ROOT}/share/fw/template/erlang/" -pa "${FW_ROOT}/share/fw.local/template/erlang" -pa "${top_srcdir}/fw/template/erlang/" -pa "${top_srcdir}/fw.local/template/erlang" -noshell -noinput -eval '
find_modules:main (init:get_plain_arguments ()),
halt (0).
' -extra "$@" 3>&1 >/dev/null


In practice this gets executed like

find src -name '*.erl' -print | xargs find-modules.sh "$tmpdir" |
perl -ne 'chomp;
next if $seen{$_}++;
print ", " if $f;
print $_;
$f = 1;'


Ok, that was the easy one! In fact, the problem of finding all the module names seems so easy that one could be tempted to solve it with grep (hmmm: what about preprocessor directives?). However, the point is that nothing parses Erlang as well as Erlang, so why reinvent the wheel?

Computing the mod line in the application specification is just slightly harder. Basically, we scan the source code for a module which has a
-behaviour(application)
attribute, and assume that it is the start module. The Erlang compiler allows behaviour to be (mis)spelt American-style, so we allow that as well.
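
For reference, the shape being scanned for is just a standard OTP application callback module; a minimal sketch (the module and supervisor names here are arbitrary):

-module (my_app).
-behaviour (application).  %% -behavior (application) is matched too
-export ([ start/2, stop/1 ]).

start (_Type, _Args) -> my_app_sup:start_link ().
stop (_State) -> ok.

The scanner itself: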

-module (find_start_module).
-export ([ main/1 ]).

is_application ([]) -> false;
is_application ([ { attribute, _, behaviour, [ application ] } | _ ]) -> true;
is_application ([ { attribute, _, behavior, [ application ] } | _ ]) -> true;
is_application ([ _ | Rest ]) -> is_application (Rest).

is_skipped ([]) -> false;
is_skipped ([ { attribute, _, fwskip, _ } | _ ]) -> true;
is_skipped ([ _ | Rest ]) -> is_skipped (Rest).

find_start_module (_, []) ->
  ok;
find_start_module (Dir, [ F | Rest ]) ->
  case compile:file (F, [ binary, 'E', { outdir, Dir } ]) of
    { ok, Mod, { Mod, _, Forms } } ->
      case is_application (Forms) and not is_skipped (Forms) of
        true ->
          port_command (get (three), io_lib:format ("~p~n", [ Mod ]));
        false ->
          find_start_module (Dir, Rest)
      end;
    _ ->
      find_start_module (Dir, Rest)
  end.

main ([ Dir | Rest ]) ->
  ok = file:make_dir (Dir),

  try
    Three = open_port ({ fd, 0, 3 }, [ out ]),
    % ugh ... don't want to have to change all these function signatures,
    % so i'm gonna be dirty
    put (three, Three),
    find_start_module (Dir, Rest)
  after
    { ok, FileNames } = file:list_dir (Dir),
    lists:foreach (fun (F) -> file:delete (Dir ++ "/" ++ F) end, FileNames),
    file:del_dir (Dir)
  end;
main ([]) ->
  Port = open_port ({ fd, 0, 2 }, [ out ]),
  port_command (Port, "usage: find-start-module.esc tmpdir filename [filename ...]\n"),
  halt (1).

If there are no applications in the set of files being analyzed, we output nothing for the mod line in the application file; otherwise we output
{ mod, { @FW_ERL_APP_START_MODULE@, @FW_ERL_APP_START_ARGS@ } },
where FW_ERL_APP_START_MODULE is the output of the above command and FW_ERL_APP_START_ARGS is specified by the programmer (we use [] by default and in practice never use the field). Obviously, if multiple modules implementing the application behaviour are found, this will not work. Our style has been to put each application we make in a separate directory to facilitate automatic analysis.

Anyway now we can tackle the more challenging problem of finding registered processes. There are many library calls that end up registering a process name; the ones we recognize are:
  • supervisor child specs that contain calls of the form
    • { _, start, [ { local, xxx }, ... ] } -> xxx being registered
    • { _, start_link, [ { local, xxx }, ... ] } -> xxx being registered
  • function calls of the form
    • Module:start ({ local, xxx }, ...) -> xxx being registered
    • Module:start_link ({ local, xxx }, ...) -> xxx being registered
  • calls to erlang:register (xxx, ...) -> xxx being registered
There's a lot of code here, but the general idea is the same: walk the forms generated by compile and pattern match on one of the above cases, outputting discovered registered process names to file descriptor 3.

-module (find_registered).
-export ([ main/1 ]).

%% looks like fun_clauses are the same as clauses (?)
print_registered_fun_clauses (Clauses) ->
  print_registered_clauses (Clauses).

%% looks like icr_clauses are the same as clauses (?)
print_registered_icr_clauses (Clauses) ->
  print_registered_clauses (Clauses).

print_registered_inits (Inits) ->
  lists:foreach (fun ({ record_field, _, _, E }) ->
                   print_registered_expr (E);
                 (_) ->
                   ok
                 end,
                 Inits).

print_registered_upds (Upds) ->
  lists:foreach (fun ({ record_field, _, _, E }) ->
                   print_registered_expr (E);
                 (_) ->
                   ok
                 end,
                 Upds).

% hmmm ... pretty sure patterns are not supposed to have side-effects ...
print_registered_pattern (_) -> ok.
print_registered_pattern_group (_) -> ok.

print_registered_quals (Qs) ->
  lists:foreach (fun ({ generate, _, P, E }) ->
                   print_registered_pattern (P),
                   print_registered_expr (E);
                 ({ b_generate, _, P, E }) ->
                   print_registered_pattern (P),
                   print_registered_expr (E);
                 (E) ->
                   print_registered_expr (E)
                 end,
                 Qs).

print_registered_expr ({ cons, _, H, T }) ->
  print_registered_expr (H),
  print_registered_expr (T);
print_registered_expr ({ lc, _, E, Qs }) ->
  print_registered_expr (E),
  print_registered_quals (Qs);
print_registered_expr ({ bc, _, E, Qs }) ->
  print_registered_expr (E),
  print_registered_quals (Qs);

%% Ok, here's some meat:
%% the "supervisor child spec" rule
%% { _, start, [ { local, xxx }, ... ] } -> xxx being registered
%% { _, start_link, [ { local, xxx }, ... ] } -> xxx being registered
print_registered_expr ({ tuple, _,
                         Exprs=[ _,
                                 { atom, _, Func },
                                 { cons,
                                   _,
                                   { tuple, _, [ { atom, _, local },
                                                 { atom, _, Name } ] },
                                   _ } ] }) when (Func =:= start) or
                                                 (Func =:= start_link) ->
  port_command (get (three), io_lib:format ("~p~n", [ Name ])),
  print_registered_exprs (Exprs);

print_registered_expr ({ tuple, _, Exprs }) ->
  print_registered_exprs (Exprs);
print_registered_expr ({ record_index, _, _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ record, _, _, Inits }) ->
  print_registered_inits (Inits);
print_registered_expr ({ record_field, _, E0, _, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr ({ record, _, E, _, Upds }) ->
  print_registered_expr (E),
  print_registered_upds (Upds);
print_registered_expr ({ record_field, _, E0, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr ({ block, _, Exprs }) ->
  print_registered_exprs (Exprs);
print_registered_expr ({ 'if', _, IcrClauses }) ->
  print_registered_icr_clauses (IcrClauses);
print_registered_expr ({ 'case', _, E, IcrClauses }) ->
  print_registered_expr (E),
  print_registered_icr_clauses (IcrClauses);
print_registered_expr ({ 'receive', _, IcrClauses }) ->
  print_registered_icr_clauses (IcrClauses);
print_registered_expr ({ 'receive', _, IcrClauses, E, Exprs }) ->
  print_registered_icr_clauses (IcrClauses),
  print_registered_expr (E),
  print_registered_exprs (Exprs);
print_registered_expr ({ 'try', _, Exprs0, IcrClauses0, IcrClauses1, Exprs1 }) ->
  print_registered_exprs (Exprs0),
  print_registered_icr_clauses (IcrClauses0),
  print_registered_icr_clauses (IcrClauses1),
  print_registered_exprs (Exprs1);
print_registered_expr ({ 'fun', _, Body }) ->
  case Body of
    { clauses, Cs } ->
      print_registered_fun_clauses (Cs);
    _ ->
      ok
  end;

%% Ok, here's some meat:
%% Module:start ({ local, xxx }, ...) -> xxx being registered
%% Module:start_link ({ local, xxx }, ...) -> xxx being registered

print_registered_expr ({ call,
                         _,
                         E={ remote, _, _, { atom, _, Func } },
                         Exprs=[ { tuple, _, [ { atom, _, local },
                                               { atom, _, Name } ] } | _ ] })
    when (Func =:= start) or
         (Func =:= start_link) ->
  port_command (get (three), io_lib:format ("~p~n", [ Name ])),
  print_registered_expr (E),
  print_registered_exprs (Exprs);

%% Ok, here's some meat:
%% erlang:register (xxx, ...) -> xxx being registered

print_registered_expr ({ call,
                         _,
                         { remote,
                           _,
                           { atom, _, erlang },
                           { atom, _, register } },
                         Exprs=[ { atom, _, Name } | _ ] }) ->
  port_command (get (three), io_lib:format ("~p~n", [ Name ])),
  print_registered_exprs (Exprs);

print_registered_expr ({ call, _, E, Exprs }) ->
  print_registered_expr (E),
  print_registered_exprs (Exprs);
print_registered_expr ({ 'catch', _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ 'query', _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ match, _, P, E }) ->
  print_registered_pattern (P),
  print_registered_expr (E);
print_registered_expr ({ bin, _, PatternGrp }) ->
  print_registered_pattern_group (PatternGrp);
print_registered_expr ({ op, _, _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ op, _, _, E0, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr ({ remote, _, E0, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr (_) ->
  ok.

print_registered_exprs (Exprs) ->
  lists:foreach (fun (E) -> print_registered_expr (E) end, Exprs).

print_registered_clauses (Clauses) ->
  lists:foreach (fun ({ clause, _, _, _, Exprs }) ->
                   print_registered_exprs (Exprs);
                 (_) ->
                   ok
                 end,
                 Clauses).

print_registered_forms (Forms) ->
  lists:foreach (fun ({ function, _, _, _, Clauses }) ->
                   print_registered_clauses (Clauses);
                 (_) ->
                   ok
                 end,
                 Forms).

is_skipped ([]) -> false;
is_skipped ([ { attribute, _, fwskip, _ } | _ ]) -> true;
is_skipped ([ _ | Rest ]) -> is_skipped (Rest).

print_registered (Dir, F) ->
  case compile:file (F, [ binary, 'E', { outdir, Dir } ]) of
    { ok, _, { _, _, Forms } } ->
      case is_skipped (Forms) of
        true ->
          ok;
        false ->
          print_registered_forms (Forms)
      end;
    _ ->
      ok
  end.

main ([ Dir | Rest ]) ->
  ok = file:make_dir (Dir),

  try
    Three = open_port ({ fd, 0, 3 }, [ out ]),
    % ugh ... don't want to have to change all these function signatures,
    % so i'm gonna be dirty
    put (three, Three),
    lists:foreach (fun (F) -> print_registered (Dir, F) end, Rest)
  after
    { ok, FileNames } = file:list_dir (Dir),
    lists:foreach (fun (F) -> file:delete (Dir ++ "/" ++ F) end, FileNames),
    file:del_dir (Dir)
  end;
main ([]) ->
  Port = open_port ({ fd, 0, 2 }, [ out ]),
  port_command (Port, "usage: find-registered.esc tmpdir filename [filename ...]\n"),
  halt (1).
It's definitely possible to fool the above code. For instance:

Hidden = therealname,
erlang:register (Hidden, SomePid).

However, in practice developers tend either to name their registered processes explicitly or at worst to obscure them with a macro (which, by going through the Erlang compiler, we get expanded for free), so this automatic strategy has been working very well.

You can download fw-template-erlang from google code to get the complete source code from above, plus some intuition about how it is called. One thing that's been on our TODO list is to combine all these analyses into one pass over the source code; currently we make three passes, which is clearly suboptimal, although the process is fast enough that it is only noticeable when analyzing larger projects (like when we converted yaws and mnesia to fw-template-erlang for our internal use).

Friday, June 13, 2008

Erlang is so clear.

Ah the clearness of erlang.

Someone on the yaws list gets errors and asks:
Can any one please explain me what this error is about and how to
overcome this one?
and gets the reply

Well - it's fairly clear. It says:

{undef,[{compile,file,
["/home/phanikar/.yaws/yaws/default/m1.erl",
[binary,return_errors, ...

That means that when .yaws page has been transformed into an
.erl page, the yaws server cannot compile the generated .erl file
You need to have the erlang compiler in the load path
I'm going to assume the author was being sarcastic about that being clear. I mean, I get that it couldn't compile something, but there's no hint of why it couldn't compile.

maybe to quote the awesome 'tude of the dynamic window manager guys:
This keeps its userbase small and elitist. No novices asking stupid questions.

Well, despite some of these problems, erlang is super awesome. It just needs a little more new blood to inject new attitudes of usability [ or some other language will eventually just steal its good ideas, which would be a shame.. we have them now. ] Also new attitudes around easier sharing of libraries. [ the erlware guys get that. ]

I'm pretty proud to see Uncle Jesse and the gang here trying to contribute improvements all the time. Better mnesia on-disk tables in the works, yaws and erlang patches, erlrc, automatic mnesia fragmentation management, and loads more. I mean, we're building a big and real distributed mnesia-backed system, so we're learning a lot, and we're certainly not the only ones contributing lately.

So, as a novice who asks stupid questions, I'm telling the rest of you out there not to give up if you run into a few weird error messages or attitudes from language scholars who are so entrenched they don't see what seems weird at first. Despite those few small speed bumps, there is some really powerful, awesome stuff going on here. So don't let 'em be elitist; barge in there and get some erlang for yourself.

P.S. Pattern matching rules!!!

Wednesday, June 11, 2008

I'm in ur Erlangz, upgrading ur applicationz

Hot code loading is one of the sexiest features of Erlang; after all, who wants downtime? If you're running mnesia, it's also very convenient, because as we've learned mnesia does not like to be shut down (more precisely, it makes no attempt to incrementally replicate changes across brief outages, instead doing full table (re)copying; something probably nobody cares about because they never turn it off).

Erlang's hot code loading support consists of low-level language and runtime features, plus a high-level strategy embodied by the release handler. The latter is a very Java-esque strategy of bundling up the complete new state of the system and upgrading everything. Without being normative, let's just say that we enjoy using our OS package manager to manage the software on our boxes, and so we were interested in integrating Erlang hot code loading with the OS package manager.

The result is erlrc, which when operating looks something like this:
pmineiro@ub32srvvmw-199% sudo apt-get install egerl
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
egerl
1 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Need to get 0B/113kB of archives.
After unpacking 0B of additional disk space will be used.
WARNING: The following packages cannot be authenticated! egerl
Install these packages without verification [y/N]? y
(Reading database ... 48221 files and directories currently installed.)
Preparing to replace egerl 4.0.1 (using .../archives/egerl_4.1.0_all.deb) ...
Unpacking replacement egerl ...
erlrc-upgrade: Upgrading 'egerl': (cb8eec1a1b85ec017517d3e51c5aee7b) upgraded
Setting up egerl (4.1.0) ...
erlrc-start: Starting 'egerl': (cb8eec1a1b85ec017517d3e51c5aee7b) already_running
Basically, I issued an apt command, and the result was a hot upgrade of the corresponding Erlang application on the running Erlang node on the machine.

In addition to integration with the OS package manager, erlrc also features an extensible boot procedure (basically, using a boot directory instead of a boot file to determine what to start, which makes packaging easier); erlrc will also automatically generate appup files corresponding to most of the common cases outlined in the appup cookbook. Of course, Erlang packages made with fw-template-erlang are automatically instrumented for erlrc (but the instrumentation is inert if erlrc is not installed).

If you're interested, there's more extensive documentation on google code.

Wednesday, March 26, 2008

Nodefinder improvements

Matthew Dempsky from Mochi mailed me some suggestions for nodefinder. Now the multicast_ttl is an application parameter, so multicast routing can be used; separate send and receive sockets are employed to ensure the right source address on BSD; and the discovery packet is signed using the cookie as a secret, which prevents (hopeless, but annoyingly logged) connect attempts to nodes with different cookies and also prevents a list_to_atom flood from packets that happen to match.
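
The signing idea is easy to sketch; this is not nodefinder's actual wire format, just the shape of it:

%% prepend an md5 of the cookie plus payload; drop anything that
%% doesn't verify before list_to_atom or connect is attempted.
sign (Payload) ->
  [ erlang:md5 ([ atom_to_list (erlang:get_cookie ()), Payload ]), Payload ].

verify (<<Mac:16/binary, Payload/binary>>) ->
  case erlang:md5 ([ atom_to_list (erlang:get_cookie ()), Payload ]) of
    Mac -> { ok, Payload };
    _ -> drop
  end;
verify (_) ->
  drop.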

Thanks Matt!

Friday, March 21, 2008

network partition ... oops

well we had our first network partition on ec2 yesterday evening. we think it was amazon doing maintenance because it happened in both our production and development clusters.

it turns out things went pretty well, considering we've done nothing yet to address network partitioning. two islands formed, but they never tried to join back up (it turns out we only try to connect nodes when schemafinder starts). after a while each island's schemafinder decided that the other island was dead and removed it from the schema. so we ended up with two disjoint mnesia instances. each had complete data, so each was able to serve. of course, they were also creating data and diverging over time. however, right now we're not doing anything that important, so it didn't really matter. we just picked one island to survive and restarted the other nodes after nuking their mnesia directories.

however, the experience has convinced us that we need to prioritize our network partition recovery strategy. the good news is that we have an out-of-band source of truth about what the set of nodes should be (via ec2-describe-instances), so it should be possible for us to detect that a network partition has occurred. the other good news is that we care more about high availability than transactional integrity, so we're willing to set master nodes on tables based upon simple heuristics.
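
mnesia already has the primitive for that last part; a minimal sketch (choosing SurvivingNodes is the heuristic we still owe):

%% declare the surviving island authoritative for a table and
%% force-load from it.
heal (Table, SurvivingNodes) ->
  ok = mnesia:set_master_nodes (Table, SurvivingNodes),
  mnesia:force_load_table (Table).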

Saturday, March 15, 2008

walkenfs improvements

I've made some progress towards my goal of making walkenfs usable by those who know nothing about Erlang or Mnesia.
  • schemafinder has been open-sourced, which provides Mnesia auto-configuration.
  • mount and umount work (although the mount line is very baroque), so no need to start an Erlang emulator explicitly.
  • the distribution includes an example init script very close to the one we actually use.
Of course, walkenfs could still have a showstopping bug in it. In our recent use it was mostly fuse mount options that gave us trouble; the shared writeable mmap problem also surfaced, because rrdtool uses mmap by default. Fortunately, with rrdtool it was a simple configure option to disable mmap. We did have a data corruption scare in development, but that turned out to be due to the default redundancy setting being 1 copy (we changed that to 3, now that fragmentron can initialize under infeasible conditions). Otherwise, so far so good, knock on wood.

Wednesday, March 12, 2008

mnesia and ec2

Mnesia is Erlang's built-in multi-master distributed database. It's one of the reasons why we chose Erlang for our startup. And while it is mostly a great fit for EC2, it needs some tweaks to work. In particular we wanted the following to be automatic:
  1. Discover nodes entering or leaving the system
  2. Have them join the Mnesia database
  3. Have them take responsibility for a portion of the data
We've already talked about and published solutions to the first problem (nodefinder) and the third problem (fragmentron). Now I've just published a solution to the second problem on google code (schemafinder). It took a while to get this part out, partially because we kept finding small problems with it, but mostly because we're very busy.

It's actually surprisingly complicated, given that at root, adding a node to a running database is just a call to mnesia:change_config/2, as mnesia:change_config (extra_db_nodes, NodeList). However, while adding nodes is easy, removing nodes is harder. Detecting nodes to remove is not that hard; we go with the (EC2-centric) strategy of periodically updating a record in a table to "stay alive" (which also gives us a way to mark clean shutdown); nodes that fail to check in eventually are considered dead. To handle multiple node failure we patch mnesia to provide mnesia_schema:del_table_copies/2, the multiple-table analog of mnesia:del_table_copy/2. To guard against rejoining the global schema after having been removed, but without having removed one's mnesia database directory (this is a no-no), we check with the other database nodes to see if they think we've been removed.
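
The happy path of joining, stripped of all that machinery, is just two standard calls (a sketch, run on the node being added):

%% connect to a node already in the schema, then make our schema
%% copy disc-resident so the membership survives restarts.
join (ExistingDbNode) ->
  { ok, _ } = mnesia:change_config (extra_db_nodes, [ ExistingDbNode ]),
  mnesia:change_table_copy_type (schema, node (), disc_copies).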

Even with all this complication, we have yet to address the problem of handling network partitioning automatically. We'll see if we have time to address that before it bites us in the butt.

Now available on google code.

Wednesday, March 5, 2008

controlling mnesia db size

genexpire is something quick that we rolled together to satisfy our paranoia about out-of-control databases (ram-based dbs are especially worrisome). Basically, it runs periodically and enforces a maximum size for each database fragment (which depends upon how many fragments are on the node; in practice this means you get the space of your most crowded node in a distributed config. with fragmentron things are balanced to within one fragment, so with enough fragments this is ok).
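
The enforcement loop is simple to sketch (this isn't genexpire's actual policy; it assumes an ordered_set fragment whose oldest keys sort first):

%% trim a fragment back under budget by deleting oldest-first;
%% table_info memory is in words for ram/disc copies.
trim (Frag, MaxWords) ->
  case mnesia:table_info (Frag, memory) > MaxWords of
    true ->
      case mnesia:dirty_first (Frag) of
        '$end_of_table' -> ok;
        Oldest ->
          ok = mnesia:dirty_delete (Frag, Oldest),
          trim (Frag, MaxWords)
      end;
    false -> ok
  end.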

Now available on google code.

Sunday, March 2, 2008

Procfs

Our next sprint's theme is "monitoring", so this weekend I thought I'd slap together a procfs-style filesystem with fuserl. I've now got the contents of erlang:processes (), erlang:process_info (), erlang:ports (), erlang:port_info (), and erlang:system_info () exposed in a filesystem format. This is less useful than I envisioned on Friday, but it was pedagogical.

Now available on google code.

Sunday, February 24, 2008

Erlang and Automake

There were some inquiries on the erlang-questions mailing list recently about using Gnu Automake with Erlang. That reminded me to release a build system we've been using for development.

My fascination with build systems predates my obsession with Erlang, so it is more general than just Erlang. It is organized around project templates, and there is an Erlang project template that has gotten pretty well developed. Every time we figure out a new best practice around Erlang development, we incorporate it into the template so that all of our projects benefit. So, for instance, we now have cover integration, eunit integration, and edoc integration.

It is now available on google code.

Friday, February 22, 2008

New book from Erlang Training

So, the guys at Erlang Consulting and Training are making a new book.

and on the erlang-questions mailing list:

On Feb 21, 2008, at 1:50 PM, Francesco Cesarini wrote:
Our main goal is to cover Erlang in depth. Make sure that newbies struggling with recursion can easily understand it or developers being exposed to pattern matching for the first time use it optimally. The contents of the book are pretty much outlined by the introductory Erlang courses Jan-Henry and I have been giving in the last decade, and based on the knowledge and experience gained when teaching. We know [where] students struggle...
On Feb 21, 2008, at 4:29 PM, some guy wrote:
It's just that, no matter how good it is, with Joe's book already out there - it's not going to be a must read for me.

On Feb 21, 2008, at 3:17 PM, someone else wrote:
Another book on just the Erlang programming language would not find a place on my bookshelf.
These responses kind of made me upset.

I applaud Francesco and Jan-Henry for getting another book out there. Obviously they have learned from their teaching history the areas where newcomers to erlang have trouble, and they're making a book to plug that hole. They are of course asking you to look back on your experience coming into erlang, think about what concepts you had trouble with, and let them know; this would both help confirm the topics for their book and possibly give them new topics, perhaps different from what they've already been exposed to, because these would come from people who don't buy erlang training courses.

Just because you don't need the book now doesn't mean you don't have something to add.

And sure, we all have frustrations with the OTP stuff; it seems like it might be very, very good, and then again it seems pretty opaque. I encourage anyone reading this who knows more OTP to publish more stuff about it.

[ I know I'm calling myself RoscoeOTPColtrain, but right now I'm more roscoe than OTP, but perhaps I'll get there. ]

Wednesday, February 13, 2008

Automatic Node Discovery

One of the first problems we ran into on EC2 was having Erlang nodes find each other. Erlang ships with a node discovery strategy based upon a static hosts file. On EC2, the idea is to start and stop nodes frequently, and EC2 will typically hand out a different hostname each time (I guess repeats are possible, but unlikely). So a static hosts file won't work.

Fortunately, Erlang is ready for the set of nodes to change constantly; it is just not spelled out how to find them. I've noticed this about Erlang: often an excellent design is coupled with a simplistic stub implementation. It's the "Tommy" school of software: even the most casual observer is supposed to be able to figure out the next step.

Initially we thought we'd use multicast UDP for discovery, which was extremely easy to do, but then we discovered that EC2 does not support multicast. So then I overthought the problem and came up with a discovery protocol that uses S3 as a blackboard. That worked, but it was overkill, since the output of ec2-describe-instances contains all the information needed, and parsing it allows EC2 security groups to define the Erlang node groups. This is convenient because the EC2 security group(s) can be determined dynamically from the instance metadata.

As an aside, the parsing of the output of ec2-describe-instances is done with a list comprehension and built-in pattern matching only (parse_host_list/2 and parse_host_list/4).
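
The gist is something like this (not the actual parse_host_list; the field position within INSTANCE lines is an assumption about ec2-describe-instances output):

%% pull hostnames out of INSTANCE lines; lines that don't match the
%% pattern in the second generator are skipped automatically.
hosts (Output) ->
  [ Host || Line <- string:tokens (Output, "\n"),
            [ "INSTANCE", _Id, _Ami, Host | _ ]
              <- [ string:tokens (Line, "\t") ] ].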

All three strategies are now available on google code. If you're on EC2, ec2nodefinder is the recommendation. Note that you can still use a static hosts file (or any other node discovery strategy) in addition, e.g., if you have some non-EC2 hosts you'd like to connect as well.

Tuesday, February 12, 2008

howdy

we're the dukes. basically 3 guys at a brand new startup (more about that later). we want to have a go with Erlang and EC2 as the two seem to go together pretty nicely. anyway we thought it'd be fun to blog about things as we figure them out, mostly about cloud computing with Erlang, but occasionally maybe about some dramatic startup stuff like not being able to make payroll.

walkenfs on google code

We Dukes are into cloud computing. We especially like our machines to be operationally symmetric so we don't have to think that hard. Erlang makes that easy but unfortunately sometimes we have to deal with software not written in Erlang.

In particular, Fess lately started to put together a monitoring system using rrdtool, which is great but wants to operate on files. We'd done all this work to make mnesia play nicely on EC2 (more on that later), and suddenly we seemed back at square one. We considered storing data in mnesia and rendering files at the last minute for rrdtool; but then we noticed fuse. Sweet! Now we can store data in mnesia and make it look like a filesystem at the same time.

First we needed Erlang bindings for fuse, so I wrote those. That turned out to involve writing C code, a moderately frustrating experience that served as a nice reminder of why we don't call ourselves the "Dukes of C". Once the nasty bits were subdued, writing the actual Erlang code backing a filesystem with mnesia was comparatively quick and painless. I'm still trying to figure out the best way to support posix locking (right now, I'm looking for a satisfying way to have locks released when an entire node dies unexpectedly), but everything else seems to be working.

It still needs a lot of polish. As a filesystem, one would expect a mkfs tool and the ability to put entries in /etc/fstab that work. However, it's usable as part of an Erlang system right now.