Wednesday, December 23, 2009

erlrc and rpm

Two of the Dukes now work at OpenX, which is starting to dip its toe into the Erlang waters. They use CentOS so they agreed to fund porting framewerk and erlrc to rpm; previously I'd only used them with deb.

A bit of background: erlrc is a set of Erlang modules and shell scripts that are designed to be integrated into packaging hooks so that installation, upgrade, or removal of a package causes corresponding hot-code activity to happen inside registered Erlang VMs on the box. Since erlrc was designed to easily integrate with multiple package managers, getting it to work with rpm was mostly about me understanding rpm's package hooks model. The result is an experience like this,

% sudo yum -q -y install lyet
erlrc-start: Starting 'lyet': (erlang) started
% sudo yum -q -y remove lyet
erlrc-stop: Stopping 'lyet': (erlang) unloaded

i.e., installing an rpm causes a running Erlang VM on the box to hot-code load the new modules and start the associated application, and removing the rpm causes the associated application to be stopped and the corresponding modules to be hot-code unloaded.

If you use fw-template-erlang than the appropriate packaging hooks are added for you automatically, both for deb and now rpm. However even manual creation of rpm spec files is pretty easy:
  • erlrc-stop is called in %preun if the installation count indicates removal
  • erlrc-upgrade is called in %preun if the installation count indicates upgrade
  • erlrc-start is called in %posttrans
Also, the erlrc shell scripts want to know the previously installed version, so I call rpm -q in a %pretrans hook and save the result. Longer term, erlrc should probably ask the Erlang VM it is talking to what version is running to eliminate the need for this argument (I was a bit surprised that rpm doesn't provide this argument to the package hook like debian does; it seems very useful for creating version-specific upgrade fixes).

Tuesday, October 6, 2009

Linear Programming with Erlang

So you have to solve a linear program, so naturally the first language you think of is Erlang. Actually, it's not a natural first choice for most people, but if you are solving a linear program as part of an automatic control strategy for an internet facing application, the choice is better motivated.

Since I faced this situation recently I wrote a binding for GLPK for Erlang. Writing port drivers is a drag so I actually wrote a program to generate the C and Erlang for me. Perhaps with the new FFI these sort of games will not be necessary, but I was happy with the approach because I anticipate experimenting with several linear programming packages, which should be much easier to accommodate.

Available at Google code.

Thursday, September 10, 2009

Removing warnings from ct_expand

I use ct_expand alot (can you tell?). The released version outputs warnings about unused variables, which are harmless, except ... that it conditioned me to ignore unused variable warnings from the compiler. Then the other day I had a real bug which the compiler was warning me about (via an unused variable) that took me quite a bit of time to find. Lesson learned: make sure that a correct compile is as quiet as possible, so problems stand out.

Ulf gave me some pointers on how to prevent ct_expand from emitting these warnings. Basically, ensure that the temporary variables created by ct_expand are prefixed with an underbar. Here's a patch:

--- src/ct_expand.erl 5 Jun 2009 21:07:32 -0000 1.1
+++ src/ct_expand.erl 10 Sep 2009 04:12:07 -0000
@@ -139,6 +139,9 @@
{Fname, Arity} = erl_syntax_lib:analyze_function(Form),
VarNames = erl_syntax_lib:new_variable_names(
+ fun (N) ->
+ list_to_atom ("_V" ++ integer_to_list (N))
+ end,
{Form, [{function, Fname},
{arity, Arity},

Tuesday, September 8, 2009

Application Variables during Upgrade

A quick note related to my previous post: it turns out the application controller gives "old values" for application variables during the code upgrade process. Therefore the call to

element (2, application:get_key (drurly, modules))

returns the old list of modules, i.e., is missing new modules. ct_expand to the rescue! I get the list of modules at compile time from the application specification via

Modules =
ct_expand:term (
{ ok, [ { application, drurly, PropList } ] } =
file:consult (""),
case proplists:get_value (modules, PropList) of
Y when is_list (Y) -> Y

For this to work, you must ensure the application specification is generated prior to compiling the module with this function in it.

Wednesday, August 26, 2009

Dynamically Loading Webmachine Resources

I've been using webmachine lately, which is fabulous for REST server development. It has a modular concept of resources and the design could be summarized as a "RESTlet container". Because Erlang has hot code loading I was interested in doing this dynamically; in other words, I want to write a module which provides a webmachine resource, hot code load it into the system, and have webmachine start dispatching requests to it.

The good news is, you can change the dispatch list at any time, via
application:set_env (webmachine, dispatch_list, NewDispatchList).
This does not upset webmachine and it will see the change immediately. So for a single application leveraging webmachine, this seemed as simple as having a code_change handler execute something like
lists:foldl (fun (Mod, Acc) ->
case catch Mod:dispatch_rules () of
{ 'EXIT', _ } -> Acc;
X -> X ++ Acc
element (2, application:get_key (drurly, modules)))).
(drurly is the name of the application in this case). Now my modules export a function like

dispatch_rules () ->
[ { [ "clip" ], ?MODULE, [] },
{ [ "clip", id ], ?MODULE, [] }
if they are webmachine resources.

The problem is that order is important: the webmachine_dispatch_list is consulted in order, and the first match is executed. To solve this I created a sort function which orders the dispatch rules by "specificity"
path_spec_priority ('*') -> 3;
path_spec_priority (X) when is_atom (X) -> 2;
path_spec_priority (X) when is_list (X) -> 1.

dispatch_specificity ({ PathSpecA, _, _ },
{ PathSpecB, _, _ }) ->
case erlang:length (PathSpecA) - erlang:length (PathSpecB) of
X when X > 0 ->
X when X < 0 ->
_ ->
PrioPathSpecA = [ path_spec_priority (X) || X <- PathSpecA ],
PrioPathSpecB = [ path_spec_priority (X) || X <- PathSpecB ],

case PrioPathSpecA =< PrioPathSpecB of
false ->
true ->
FullPathSpecA = [ { path_spec_priority (X), X } || X <- PathSpecA ],
FullPathSpecB = [ { path_spec_priority (X), X } || X <- PathSpecB ],

FullPathSpecA =< FullPathSpecB
  • Longer dispatch paths come first.
  • If two dispatch paths have equal length, the more specific one comes first, where specificity is defined by examining the elements left-to-right, with
    • string literals are most specific
    • atoms except '*' are the next most specific
    • '*' is the least specific
  • If two dispatch paths have equal length and specificity, sort by Erlang term order (effectively, break ties arbitrarily)
The code change handler now becomes:
(fun dispatch_specificity/2,
lists:foldl (fun (Mod, Acc) ->
case catch Mod:dispatch_rules () of
{ 'EXIT', _ } -> Acc;
X -> X ++ Acc
element (2, application:get_key (drurly, modules))))).
This has proven sufficient for a single application running webmachine.

For multiple applications that want to run under the same webmachine, I suspect the right way to go is to have a gen_server which
  • contains webmachine dispatch configuration "fragments" per key, where the key is intended to be the application name;
  • accepts commands to replace or delete a particular fragment by key;
  • for any replace or delete, rebuilds the complete dispatch list by concatenating all fragments together and sorting by specificity, and then updates the webmachine application env var.
For multiple applications that want to run under different webmachines, well unfortunately webmachine uses some global application settings under the fixed atom webmachine and thus currently, like the highlander, there can be only one. (In fact, you can only listen on one ip/port combination with webmachine right now. I might have to patch it to accept multiple ip/port combinations to listen to, since a standard trick of mine is to have nginx handle both regular and ssl connections and connect to a different back-end port to indicate whether or not the connection is secure.)

Saturday, August 22, 2009

Metaprogramming with ct_expand

Yesterday I posted about putting arbitrary Erlang terms into HTTP cookies. I suggested that constructing the base64 codec at compile time was an excellent application for ct_expand, but I didn't provide any details. Therefore, here is a follow-up.

ct_expand provides a simple interface to Erlang metaprogramming: evaluating an arbitrary Erlang term at compile time and substituting the results into the source code during compilation. This is especially useful for initializing a data structure at compile time, and in particular, it can be used to construct the forward and inverse maps for the base64 codec. We will specify our codec by providing a list of 64 unique characters, and ct_expand will do the rest. The following code is functionally identical to the termcookie module presented previously.

-module (termcookie2).
-compile ({ parse_transform, ct_expand }).
-export ([ decode/2,
encode/2 ]).

-define (CODEC, "abcdefghijklmnopqrstuvwxyz"

% Public

decode (Encoded, Secret) when is_binary (Encoded) ->
<<Signature:28/binary, Payload/binary>> = Encoded,
Signature = to_base64 (crypto:sha ([ Payload, Secret ])),
erlang:binary_to_term (from_base64 (Payload)).

encode (Term, Secret) ->
Payload =
(erlang:term_to_binary (Term,
[ compressed,
{ minor_version, 1 } ])),
Signature = to_base64 (crypto:sha ([ Payload, Secret ])),
<<Signature/binary, Payload/binary>>.

% Private

to_base64 (Bin) when (8 * byte_size (Bin)) rem 6 =:= 0 ->
to_base64_padded (Bin);
to_base64 (Bin) when (8 * byte_size (Bin)) rem 6 =:= 2 ->
to_base64_padded (<<Bin/binary, 0:16>>);
to_base64 (Bin) when (8 * byte_size (Bin)) rem 6 =:= 4 ->
to_base64_padded (<<Bin/binary, 0:8>>).

to_base64_padded (Bin) ->
<< <<(element (N + 1,
ct_expand:term (
64 = length (?CODEC),
64 = length (lists:usort (?CODEC)),
list_to_tuple (?CODEC)
|| <<N:6>> <= Bin >>.

from_base64 (Bin) ->
<< <<(element
(N + 1,
(fun (X, { K, T }) ->
{ K + 1, setelement (X + 1, T, K) }
{ 0, erlang:make_tuple (256, -1) },
|| <<N:8>> <= Bin >>.

Some notes:
  • Lines 6-8 contain the specification of the codec.
  • Lines 43-44 are compile-time assertions on the codec specification, namely that it consists of 64 unique characters. If you modify the specification to violate these assertions, the module will not compile, although the resulting error message will be nearly unintelligible (try it!).
  • Line 45 constructs the forward mapping from the specification; the result is a (constant) tuple which is consulted at run time.
  • Lines 56-65 construct the inverse mapping from the specification; the result is a (constant) tuple which is consulted at run time.
Note you cannot use any functions from the current module inside the ct_expand:term/1 argument, because the current module has not been compiled yet! If you need to do something really complicated and don't like an unwieldy inline expression you can place helper code in a separate module which is compiled first.

The resulting software is easier to maintain than the original version, because if the codec needs to be changed, only the specification is modified. For instance we can replace lines 5-8 with

-define (CHARS, "abcdefghijklmnopqrstuvwxyz"
-define (CODEC,
ct_expand:term (
fun () ->
random:seed (1, 2, 3),
[ X ||
{ _, X } <-
lists:sort (
[ { random:uniform (), Y }
|| Y <- ?CHARS
end ()

which results in a proper codec utilizing a permutation of the original codec. Note the call to random:seed/3 is happening at compile time, and is setting the random seed of the compilation process. Therefore this is a stable codec definition. (Unfortunately, it doesn't really increase the opacity of the encoding scheme; the bits in an encoded Erlang term are highly degenerate so any interested party would be able to deduce the permutation given enough cookies).

Friday, August 21, 2009

Erlang Terms in Cookies

Erlang is an increasingly popular choice for web development. Such projects tend to heavily leverage HTTP cookies, and because Erlang defines an external format for any Erlang term, it turns out to be very easy to store an arbitrary term with a cookie (within cookie size limitations), which can be a useful trick.

The technique outlined here generates a signed but not encrypted cookie. That means it's fairly simple for anyone possessing one of your cookies to determine the contents, but difficult for them to forge a novel cookie. The latter is typically important for the application, but has additional importance here because we will be calling erlang:binary_to_term/1 on the cookie value and passing arbitrary data to erlang:binary_to_term/1 is a bad idea (for example, this could cause an extremely large memory allocation).

Here's the code:

-module (termcookie).
-export ([ decode/2,
encode/2 ]).

% Public

decode (Encoded, Secret) when is_binary (Encoded) ->
<<Signature:28/binary, Payload/binary>> = Encoded,
Signature = to_base64 (crypto:sha ([ Payload, Secret ])),
erlang:binary_to_term (from_base64 (Payload)).

encode (Term, Secret) ->
Payload =
(erlang:term_to_binary (Term,
[ compressed,
{ minor_version, 1 } ])),
Signature = to_base64 (crypto:sha ([ Payload, Secret ])),
<<Signature/binary, Payload/binary>>.

% Private

to_base64 (Bin) when (8 * byte_size (Bin)) rem 6 =:= 0 ->
to_base64_padded (Bin);
to_base64 (Bin) when (8 * byte_size (Bin)) rem 6 =:= 2 ->
to_base64_padded (<<Bin/binary, 0:16>>);
to_base64 (Bin) when (8 * byte_size (Bin)) rem 6 =:= 4 ->
to_base64_padded (<<Bin/binary, 0:8>>).

to_base64_padded (Bin) ->
<< <<(to_base64_char (N)):8>> || <<N:6>> <= Bin >>.

to_base64_char (N) when N >= 0, N =< 25 -> $a + N;
to_base64_char (N) when N >= 26, N =< 51 -> $A + (N - 26);
to_base64_char (N) when N >= 52, N =< 61 -> $0 + (N - 52);
to_base64_char (62) -> $.;
to_base64_char (63) -> $,.

from_base64 (Bin) ->
<< <<(from_base64_char (N)):6>> || <<N:8>> <= Bin >>.

from_base64_char (N) when N >= $a, N =< $z -> N - $a;
from_base64_char (N) when N >= $A, N =< $Z -> 26 + (N - $A);
from_base64_char (N) when N >= $0, N =< $9 -> 52 + (N - $0);
from_base64_char ($.) -> 62;
from_base64_char ($,) -> 63.
We are taking advantage of the fact that erlang:binary_to_term/1 will ignore extra bytes at the end, which allows us to mindlessly pad for base 64 encoding.

If you really like to squeeze the last few drops of efficiency out of code, you can change those to_base64_char/1 and from_base64_char/1 functions into tuple lookups. If you are extra cool you can use Ulf Wiger's ct_expand parse transform to construct the tuples at compile time from a specified character list.

This code will throw an exception if anything is amiss with the input, including a signature fail.

% erl
Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> crypto:start ().
2> termcookie:encode ({ "omg", erlang, rulz }, "wazzup").
3> termcookie:decode (termcookie:encode ({ "omg", erlang, rulz }, "wazzup"), "wazzup").
4> termcookie:decode (termcookie:encode ({ "omg", erlang, rulz }, "huh"), "wazzup").
** exception error: no match of right hand side value <<"nQCwmuMgeK3bTPzBqKDSmSylIcia">>
in function termcookie:decode/2

Saturday, August 8, 2009

Implementing a DSL in Erlang

If you are willing to abide by Erlang syntax, then you can leverage erl_parse, erl_scan, and erl_eval to quickly whip up a domain specific language (DSL). You can manipulate the semantics via a combination of transformations on the parse tree (with the help of erl_syntax) and interception of function calls (which comes with erl_eval).

For instance, suppose we want a domain specific language which is just like Erlang, except that it has destructive assignment. This can be done in three steps: 1) parse the input using the Erlang parser, 2) transform the parse tree so that match expressions are rewritten as a special local function call, and 3) eval the result, intercept the special local function call and implement the destructive assignment by recursive eval.

Here's the code:

-module (dsl).
-export ([ compile/1,
run/2 ]).

%- Public -

compile (String) ->
{ ok, Scanned, _ } =
erl_scan:string (maybe_append_dot (String)),
{ ok, Parsed } = erl_parse:parse_exprs (Scanned),
{ dsl, transform (Parsed) }.

run ({ dsl, Parsed }, Bindings) ->
eval (Parsed, Bindings).

%- eval callbacks -

local ('_assign', [ Pattern, Expr ], Bindings) ->
{ value, Rhs, NewBindings } =
eval ([ Expr ], Bindings),
MatchExpr =
erl_syntax:match_expr (Pattern,
erl_syntax:abstract (Rhs)),
{ value, _Lhs, MatchBindings } =
eval ([ erl_syntax:revert (MatchExpr) ],
erl_eval:new_bindings ()),
{ value,
destructive_merge (NewBindings, MatchBindings) }.

%- Private -

destructive_merge (Bindings, MatchBindings) ->
(fun ({ Key, Val }, Acc) ->
erl_eval:del_binding (Key, Acc))
erl_eval:bindings (MatchBindings)).

eval (Parsed, Bindings) ->
erl_eval:exprs (Parsed,
{ eval, fun local/3 }).

maybe_append_dot (String) ->
case lists:last (string:strip (String,
right)) of
$. -> String;
_ -> String ++ "."

postorder (F, Tree) ->
F (case erl_syntax:subtrees (Tree) of
[] ->
List ->
[[ postorder (F, Subtree)
|| Subtree <- Group ]
|| Group <- List ])

transform (Parsed) ->
F = fun (Node) ->
case erl_syntax:type (Node) of
% match_expr: rewrite as (destructive) assignment
match_expr ->
erl_syntax:atom ("_assign"),
[ erl_syntax:match_expr_pattern (Node),
erl_syntax:match_expr_body (Node) ]);
_ ->

[ erl_syntax:revert (postorder (F, X))
|| X <- Parsed ].

dsl:compile/1 scans and parses the string in the standard way, but transforms the resulting parse tree rewriting match expressions as calls to a fictitious local function _assign/2. This local function call is intercepted by specifying a local function handler to erl_eval:exprs/3. At that point (lines 22-33), the sequence is: 1) evaluate the right hand side using the current bindings to reduce the right hand side to a constant, 2) evaluate a match of the left hand side to the constant expression given an empty set of bindings, and 3) merge the new bindings with the existing bindings, overwriting as necessary. By proceeding in this fashion, we can still error out if we have a structural mismatch, or if an unbound variable is used on a right hand side.

% erl
Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> c (dsl).
2> dsl:run (dsl:compile ("A = 7, A = A + 3"), erl_eval:new_bindings ()).
3> dsl:run (dsl:compile ("B = 7, A = A + 3"), erl_eval:new_bindings ()).
** exception error: variable 'A' is unbound
4> dsl:run (dsl:compile ("A = 7, { _, A } = A + 3"), erl_eval:new_bindings ()).
** exception error: no match of right hand side value 10

Sunday, July 26, 2009

Ets Ordered Set Select Efficiency

Ets (and tcerl) ordered_sets have the useful property that range queries on them can be resolved more efficiently than full-table scan. For instance, given the operation

ets:select (Tab, [ { { { foo, 1, '_' }, bar }, [], [ '$_' ] } ])

only the range of keys corresponding to 3-element tuples whose first two elements are 'foo' and 1 will be examined, and this is potentially far less than the entire set of keys.

Recently I've been working on a project that requires a range of partially bound keys. This would correspond to an operation such as

ets:select (Tab, [ { { { foo, '$1', '_' }, bar }, [ { '>=', '$1', 1 }, { '=<', '$1', 5 } ], [ '$_' ] } ])

and so I made the modifications to tcerl to be able to detect match conditions of this form and constrain the query range. I started to wonder if I was reinventing the wheel and whether ets had sophisticated match specification analysis under the hood somewhere. I asked the mailing list but answers were inconclusive so I decided to do some measurements. Using the following module:

-module (etsselect).
-compile (export_all).

populate_ets (Tab, N) ->
ets:delete_all_objects (Tab),
lists:foreach (fun (M) ->
ets:insert (Tab, { { foo, M div 10, M }, bar })
lists:seq (1, N)).

repeat_factor (Size) when Size < 1000 -> 100;
repeat_factor (Size) when Size < 100000 -> 30;
repeat_factor (_) -> 10.

select (MatchSpec, Tab, Repeat) when Repeat > 0 ->
ets:select (Tab, MatchSpec),
select (MatchSpec, Tab, Repeat - 1);
select (_, _, _) ->

select_time (MatchSpec) ->
Tab = ets:new (?MODULE, [ ordered_set ]),
[ { N,
(fun () ->
populate_ets (Tab, N),
{ Time, _ } = timer:tc (?MODULE, select, [ MatchSpec, Tab, Repeat ]),
Time / Repeat
end) ()
|| X <- lists:seq (1, 9),
N <- [ trunc (math:pow (5, X)) ],
Repeat <- [ repeat_factor (N) ]
ets:delete (Tab)

I did some timing tests with different match specifications and table sizes. The match specifications were:

[{ '_', [], [ '$_' ] }] % unconstrained
[{{{ foo, 1, '_' }, bar }, [], [ '$_' ] }] % prefix bound
[{{{ foo, '$1', '_' }, bar },
[{ '>=', '$1', 1 }, { '=<', '$1', 1 }],
[ '$_' ]
}] % prefix range bound

The results (using R12B5) indicate that the prefix range bound condition is not being optimized by ets.
The tcerl timings indicate the prefix range bound condition is being optimized (they also indicate much worse constant factors then ets).
Note I haven't released these changes yet, but I will soon as 1.3.1h. Versions of tcerl prior to 1.3.1h will do a full table scan in the prefix range bound case.

Wednesday, July 1, 2009


I just released the first version of osmos, a pure Erlang library that provides on-disk ordered set tables which allow thousands of updates per second with ACID properties.

It achieves that update rate using a rather different structure from a traditional B-tree based table (like the ones used by most RDBMSs or provided by DBM libraries like BDB or tokyocabinet): an incremental merge sort with user-defined merge semantics.


Ordinarily, the rate of updates to an on-disk table is limited by the need to seek around and update an index in place. With a typical seek time on the order of 10 ms, this makes it challenging to scale past about 100 updates/s. Most strategies for going beyond that involve some kind of partitioning, either over multiple disks, multiple machines, or both.

However, a few key observations[1] point to a different strategy:
  1. The reason for updating an index on every write is the expectation that reads are much more frequent than writes, so that read efficiency is the dominating factor. But if writes are far more frequent than reads, you can use some kind of lazy updating to delay the work until absolutely necessary, and combine the work from multiple updates.
  2. An extreme example of a write-dominated, lazily updated database is a full-text inverted index for a search engine. To build one, you might typically read in billions of term-document pairs, sort them by term using some form of external merge sort, and then create a highly optimized index before ever handling a single query.
  3. A merge sort can operate continuously, by injecting new records in sorted batches, and then merging the batches as necessary to maintain a set of sorted files with exponentially increasing sizes. And crucially, this kind of incremental merge sort process can allow relatively efficient searches of the data while it's operating, by binary-searching each sorted file. (An example of this is the incremental indexing provided by Lucene.)
This gives you an ordered set table with a slight increase in the cost of a search (with N records, maybe an extra factor of log N). But the cost of a write is tiny, and mostly delayed: about log N future comparisons during merging, and log N future disk writes, but since all the disk writes are sequential, they will be buffered, and writing to the table requires no explicit seeking at all.[2]

User-defined merging

Things get even more interesting when you let the user control how records are merged. In the osmos model, there is at most one record for any given key; if two records with the same key are encountered during the merge sort, the user's merge function is called to merge the two records into a single record.

The merge function can be any function

Merge(Key, EarlierValue, LaterValue) -> MergedValue
that is associative, i.e.,

Merge(K, Merge(K, V1, V2), V3) =:=
Merge(K, V1, Merge(K, V2, V3))
for any key K and any consecutive sequence of values V1, V2, V3.

This allows a wide variety of semantics for writing to the table. For example:
  • If the merge function always returns the later value, then a write replaces any previous value, like an ordinary key-value store.
  • If the values are integers, and the merge function returns the sum of the two values, then writing to the table acts like transactionally incrementing a counter.
Similarly, you could use any associative function of two numbers; you could apply such a function to each element of a vector of numbers; or you could apply a different function to each element. For example, to keep a minimum, maximum, and average over some set of keys, you could use something like:

merge(_K, N1, N2)
when is_number(N1), is_number(N2) ->
{min(N1, N2), max(N1, N2), N1 + N2, 2};
merge(_K, N1, {Min2, Max2, Sum2, Count2})
when is_number(N1) ->
{min(N1, Min2), max(N1, Max2),
N1 + Sum2, 1 + Count2};
merge(_K, {Min1, Max1, Sum1, Count1}, N2)
when is_number(N2) ->
{min(Min1, N2), max(Max1, N2),
Sum1 + N2, Count1 + 1};
merge(_K, {Min1, Max1, Sum1, Count1},
{Min2, Max2, Sum2, Count2}) ->
{min(Min1, Min2), max(Max1, Max2),
Sum1 + Sum2, Count1 + Count2}.
This lets you write single numbers as values, but read back either {Min, Max, Sum, Count} tuples (if more than one number has been written for a key) or single numbers (if that was the only value written). To do this with an ordinary key-value table and multiple writers would require expensive transactions, but with osmos, operations like this are no more expensive than replacement, but still ACID.

As you can see, keeping statistics for reporting (when the reports are queried infrequently relative to data collection) is one of the killer applications for a merge sort table.

Among the possibilities for even wackier merge operations are:
  • Always return the earlier value. (What was the first value that occurred for this key?)
  • Take the union of two sets. (What are all the values that have occurred for this key?)
  • Take the intersection of two sets. (What values have always occurred for this key?)
  • Multiply two NxN matrices, e.g., to keep track of a series of rotations applied to a vector in RN.
  • Compose a series of arbitrary operations applied to a space of replaceable objects, e.g.:

    merge(_K, _, V) when ?is_value(V) ->
    merge(_K, V, Op) when ?is_value(V),
    ?is_operation(Op) ->
    do_operation(Op, V);
    merge(_K, V, Ops) when ?is_value(V),
    is_list(Ops) ->
    lists:foldl (fun (Op, V) ->
    do_operation(Op, V)
    merge(_K, Op1, Op2) when ?is_operation(Op1),
    ?is_operation(Op2) ->
    [Op1, Op2];
    merge(_K, Op, Ops) when ?is_operation(Op),
    is_list(Ops) ->
    [Op | Ops];
    merge(_K, Ops, Op) when is_list(Ops),
    ?is_operation(Op) ->
    Ops ++ [Op];
    merge(_K, Ops1, Ops2) when is_list(Ops1),
    is_list(Ops2) ->
    Ops1 ++ Ops2.
    The values could be employee records, and the operations could be things like, “change street address to X,” “increase salary by Y%.” Using this pattern, you can get extremely cheap transactional safety for any single-key operation, as long as your merge function implements it.


The basic API is quite simple:

{ok, Table} = osmos:open(Table, [{directory, D}, {format, F}])
to open a table named Table with the format F in the directory D;

ok = osmos:write(Table, Key, Value)
to write a record to the table;

case osmos:read(Table, Key) of
{ok, Value} -> ...;
not_found -> ...
to read the record for a key; and

ok = osmos:close(Table)
to close the table.

You can also iterate over a range of keys in chunks using osmos:select_range/5 and osmos:select_continue/3. The results from a select provide a consistent snapshot of the table, meaning that the results always reflect the contents of the table at the time of the original call to select_range. (In other words, any writes that happen between subsequent calls to select_continue won't affect the results.)

A table format is a record with the following fields:
  • block_size::integer(): block size of the table in bytes, controlling the size of disk reads (which are always whole blocks), and the fanout of the on-disk search trees.
  • key_format::#osmos_format{}: on-disk format for keys. (A pair of functions to convert some set of terms to binaries and back.)
  • key_less::(KeyA, KeyB) -> bool(): comparison function defining the order of keys in the table. Takes two native-format keys, and returns true if the first argument is less than the second argument.
  • value_format::#osmos_format{}: on-disk format for values.
  • merge::(Key, EarlierValue, LaterValue) -> MergedValue: the merge function described above.
  • short_circuit::(Key, Value) -> bool(): function which allows searches of the table to be terminated early (short-circuited) if it can be determined from a record that any earlier records with the same key are irrelevant.
  • delete::(Key, Value) -> bool(): function controlling when records are deleted from the table.
There are several pre-canned formats available from the function osmos_table_format:new/3, or you can build your own #osmos_table_format{} record as needed.


The file tests/osmos_benchmark_1.erl in the source distribution contains a benchmark that uses variable-length binaries as keys (some more frequent than others, with an average length of 15 bytes), and nonnegative integers as values, encoded in 64 bits, where merging takes the sum of the two values. One process writes random keys and values as fast as it can, while another process reads random keys with a 10 ms sleep between reads.

I ran the benchmark for 15 minutes on a 2.2 GHz dual-core MacBook, and got the following numbers:
  • 5028735 total records written, for an average of 5587 writes/second (including the time to compute random keys, etc.)
  • an average of 109.8 microseconds per write call, which would mean a theoretical maximum write rate of 9111 writes/second (for this table format, machine, etc.)
  • the median time per write call was 30 microseconds, and the 90th percentile was 54 microseconds, indicating that the vast majority of writes are extremely quick
  • an average of 1900 microseconds (1.9 ms) per read call
  • the median time per read call was 979 microseconds, and the 90th percentile was 2567 microseconds

I reran the benchmark for 2 hours on the same machine, and got the following numbers:
  • 17247676 total records written, for an average of 2396 writes/second
  • an average of 329.7 microseconds per write call, for a theoretical maximum write rate of 3033 writes/second
  • the median time per write call was 39 microseconds, and the 90th percentile was 88 microseconds
  • an average of 13076 microseconds (13 ms) per read call
  • the median time per read call was 6081 microseconds, and the 90th percentile was 34094 microseconds
The table had about 400 MB of data on the disk (in 7 files) at the end of the 2-hour run. This shows that read performance does start to suffer a bit as the amount of data on the disk grows, but writes remain very fast. (In fact, if there were no reads competing with writes, I wouldn't expect the write times to degrade even that much, since all that's happening synchronously during a write call is a buffered write to a journal file, and an insert into an in-memory tree.)

[1] I have to thank my good friend Dave Benson for introducing me to these ideas, and the generalized merge sort table. His excellent library GSK provides a very similar table API (GskTable) for people who want to write single-threaded servers in C. (I ripped off several of his design ideas, including prefix compression and the variable-length integer encoding.)
[2] Of course there may be implicit seeking due to the need to access multiple files at the same time. But for sequential access, the OS and disk hardware can mitigate this to a large degree as long as the number of files isn't too large.

Wednesday, June 17, 2009

Keeping it simple with flatula

Paul has blogged about overcoming mnesia performance issues in the past, but I don't think we've talked much about the ultimate strategy -- keeping data out of mnesia altogether.

When we first started serving ads, we stored information about every single ad impression in a huge mnesia database, for retrieval on click, and for building behavioral profiles. Almost needless to say, this didn't scale very far. We spent many a day last summer delving into mnesia internals, fixing corrupted table fragments after node crashes, bemoaning how long it took new nodes to join the schema under heavy load, and so on.

One of the simplest and most effective changes that got us out of this mess was not to store any per-impression data in mnesia at all -- instead, we started logging the data to flat files on disk, and storing a small pointer to the data in a cookie so we could read it back the next time we saw the user. Hardly a revolutionary solution . . . it's well-known that disk seeking is the enemy of performance. The hardest part was coming to realizations like, "Hmm, I guess we don't really care if a node goes down and we lose part of that data!"

We've open-sourced one of the main components that enabled this strategy: flatula, an Erlang application that manages write-once "tables" that are really just collections of flat files. It looks a bit like dets, except that it doesn't support deletions, updates, or iteration, and you can't make up the keys. But when you don't need those things, it's hard to imagine a more efficient way to store data.

If you'd like to learn more, there's a brief tutorial on the Google Code site.

Wednesday, June 10, 2009

Let parse transform

So the problem of intermediate variable naming came up again on erlang questions.

Subject:Versioned variable names
From: Attila Rajmund Nohl
Date: Tue, 9 Jun 2009 17:12:34 +0200


I think there wasn't any grumbling this month about the
immutable local variables in Erlang, so here's real world
code I've found just today:

% Take away underscore and replace with hyphen
MO1 = re:replace(MO, "_", "-", [{return, list}, global]),
MO2 = toupper(MO1),
% Replace zeros
MO3 = re:replace(MO2,
[{return, list}, global]),
% Insert hyphen if missing
MO5 = case re:run(MO3, "-", [{capture, none}]) of
nomatch ->
match ->


Mikael Pettersson pointed out that this really has less to do with immutable local variables and more to do with the lack of a let expression. That was insightful, and since a let expression can be considered syntactic sugar for a lambda expression, I realized that a parse transform could provide let like functionality. Let is a reserved keyword in Erlang so I used lyet instead.

Essentially the parse transform rewrites
lyet:lyet (A = B, C)
(fun (A) -> C end) (B)
so the above code could be rewritten as

Result = lyet:lyet (
% Take away underscore and replace with hyphen
MO = re:replace(MO, "_", "-", [{return, list}, global]),
MO = toupper(MO),
% Replace zeros
MO = re:replace(MO,
[{return, list}, global]),
% Insert hyphen if missing
case re:run(MO, "-", [{capture, none}]) of
nomatch ->
match ->

You must provide at least one argument to lyet:lyet. All but the last argument to lyet:lyet must be an assignment, and the last argument has to be a single expression (but you can use begin and end for a block of expressions inside the lyet). As you can see above, you can reuse a variable name across the assignment arguments to lyet:lyet. You can even use lyet:lyet on the right hand side of the assignments, or as part of the expression argument. Some examples of usage are present in the unit test.

Update: per Ulf's suggestion, the parse transform also recognizes the local call let_ in addition to the remote call lyet:lyet. It definitely looks nicer with let_.

The software is available on Google code.

Tuesday, May 19, 2009

Automatic .app file generation

At our startup, we have our own build system (framewerk) and our deployment framework (erlrc) which play well together. As I learned at the Bay Area Erlang Factory, mostly people have their own processes already, so what they want to extract some of the useful functionality and incorporate it into their way of doing things. This led to exposing automatic .appup file generation from erlrc in a reusable fashion; that technique requires .app files to be correct which is why we automatically generate them in framewerk. To compliment, I've isolated our automatic .app file generation and released it in a standalone form.

The escript is called fwte-makeappfile, and it basically takes a set of Erlang source code which comprise an application and does three things for you: 1) attempts to automatically discover registered processes, 2) attempts to automatically discover all the module names, and 3) attempts to automatically discover the start module for the application (if present). You can override these choices if you don't like them. Here's an example of how it works:

% ./fwte-makeappfile --application nitrogen --description 'Nitrogen Web Framework' --version '0.2009.05.11' ~/src/nitrogen-git/src/**/*.erl
[{description,"Nitrogen Web Framework"},
We use this in our build system, where the .app file is always automatically generated, but the developer can set overrides if the autodetection is f-ing up. I recommend this as a general strategy: you need to be able to manually specify for edge cases, but you don't want to count on your developers maintaining the routine cases correctly.

Tuesday, May 12, 2009

Erlang R12B-5 for Fink (Mac OS/X)

It looks like Fink is a bit behind the Erlang releases. Here's an file and an erlang-otp.patch file. You can put these in /sw/fink/dists/unstable/main/finkinfo/languages and then do a fink install erlang-otp . I'll ping the maintainer to get these added to Fink.

When you fink install these right now, you'll get prompted because the tarballs are not on the fink mirrors, so you'll have to select "download from original source URL".

Update: if you have pcre installed it can confuse the build. You can use instead. Rename this to in the /sw/fink/dists/unstable/main/finkinfo/languages directory.

Saturday, May 9, 2009

Automatic .appup file generation

My talk at the Erlang Factory covered several topics, but mostly people were only interested in the last 5 minutes, where I talk about how we automatically generate appup files when we are installing our software. (Makes you wonder how to do a minimum viable talk).

I created a direct (escript) interface to the automatic appup generation that doesn't require you use erlrc to manage your nodes (although, you do need erlrc installed so the escript can access the magic functions). The idea is you invoke this at the right time during whatever installation procedure you are using, and redirect the output to create the appup file (e.g., either when building a new release from an old release, or "just in time" when you are installing a new application). When things go right, it looks something like this:

% ./erlrc-makeappup erlrc 0.1.2 0.2.0 /usr/lib/erlang/lib/erlrc-0.1.2 /Users/pmineiro/tmp/usr/lib/erlang/lib/erlrc-0.2.0
That's the appup file that erlrc generates for itself when it is being upgraded from version 0.1.2 to 0.2.0. The (surprisingly simple) appup generation algorithm is outlined in the documentation.

This is available starting in version 0.2.0 of erlrc which is available on google code.

Friday, May 1, 2009

The Need for Language Specific Packaging

We use apt to manage all our software at our startup, and we like being able to manage components from a heterogeneous set of languages in a uniform way. In other words, some of our software is in Erlang, some is not, and the apt (plus erlrc) lets us treat them similarly. One might guess that I would be against an Erlang-specific packaging solution such as erlware.

Not so! We need erlware and I love it.

At Yahoo we had a proprietary package format that nobody outside Yahoo used. However we could algorithmically convert any CPAN package to our internal package format, so in effect, we had access to all of CPAN. Analogously, companies might use rpm or deb or something else entirely; but by releasing all Erlang software in a standard form, it should be possible to transduce. Right now, when we get stuff from the universe, we end up repackaging them for debian and placing them into our private package archive. Everybody does things slightly differently so it's difficult to generalize this process.

Furthermore, there are exciting benefits to having a single place to look for open-source Erlang software which enforces project quality standards and provides a cross-platform way to start playing around with Erlang.

Wednesday, April 29, 2009

erlrc erlang factory talk

I'll be giving a talk about erlrc at the Erlang Factory. erlrc is our scheme for integrating the Erlang VM with the OS package manager (we use deb, but it should be possible to integrate with rpm and others as well).

The google code project contains the powerpoint slides as well as other useful documentation.

In the talk I indicate that erlrc is componentized for embedding in other build systems and launch processes, and I also mention that we built our own automake-based build system (which is used for all the languages we develop in, not just Erlang). That system is called framewerk and is also available from google code.

I also claim that once your entire software state is represented in OS packages, simple and effective strategies for deployment and provisioning emerge. Two simple tools that we've developed along these lines are ec2-do and apt-snapshot.

ec2-do is an escript that provides rolling window execution of an arbitrary command across a set of ec2 machines (typically, an ec2 group). When we want to launch we do something like this to sync all the machines with the package archive,
ec2-do -g production -m 1000 -b -d apt-get update
and then something like this to install the package
ec2-do -g production -m 3 -b -d apt-get install <newpackage>
ec2-do is available from the erlrc downloads page.

apt-snapshot is even simpler than ec2-do (just a shell script), but very useful for provisioning when using deb packages. It can associate the current package state of a server with a symbolic tag which can later be restored (possibly from another host). To create a tag we do something like
apt-snapshot create <tag>
and to restore it later we do
apt-snapshot restore <host>:<tag>
We use this in three distinct ways. First, we have a cron job creating regular snapshots on every machine with timestamp tag names in case of snafu (to allow us to rollback). Second, when we launch something particularly risky we create a symbolic tag manually for rollback. Third, when we start up new machines, we get them in the correct state quickly by tagging one of the currently running machines and then telling the new machines to restore that tag from that host (our EC2 images are sensitive to the extra arguments passed to ec2-run-instance, but you can also just use ec2-do for this purpose, since restoring a tag is idempotent).

apt-snapshot is available from the erlrc downloads page.