It achieves that update rate using a rather different structure from a traditional B-tree based table (like the ones used by most RDBMSs or provided by DBM libraries like BDB or tokyocabinet): an incremental merge sort with user-defined merge semantics.
Motivation
Ordinarily, the rate of updates to an on-disk table is limited by the need to seek around and update an index in place. With a typical seek time on the order of 10 ms, this makes it challenging to scale past about 100 updates/s. Most strategies for going beyond that involve some kind of partitioning, either over multiple disks, multiple machines, or both. However, a few key observations[1] point to a different strategy:
- The reason for updating an index on every write is the expectation that reads are much more frequent than writes, so that read efficiency is the dominating factor. But if writes are far more frequent than reads, you can use some kind of lazy updating to delay the work until absolutely necessary, and combine the work from multiple updates.
- An extreme example of a write-dominated, lazily updated database is a full-text inverted index for a search engine. To build one, you might typically read in billions of term-document pairs, sort them by term using some form of external merge sort, and then create a highly optimized index before ever handling a single query.
- A merge sort can operate continuously, by injecting new records in sorted batches, and then merging the batches as necessary to maintain a set of sorted files with exponentially increasing sizes. And crucially, this kind of incremental merge sort process can allow relatively efficient searches of the data while it's operating, by binary-searching each sorted file. (An example of this is the incremental indexing provided by Lucene.)
User-defined merging
Things get even more interesting when you let the user control how records are merged. In the osmos model, there is at most one record for any given key; if two records with the same key are encountered during the merge sort, the user's merge function is called to merge the two records into a single record. The merge function can be any function

Merge(Key, EarlierValue, LaterValue) -> MergedValue

that is associative, i.e.,

Merge(K, Merge(K, V1, V2), V3) =:= Merge(K, V1, Merge(K, V2, V3))

for any key K and any consecutive sequence of values V1, V2, V3.
This allows a wide variety of semantics for writing to the table. For example:
- If the merge function always returns the later value, then a write replaces any previous value, like an ordinary key-value store.
- If the values are integers, and the merge function returns the sum of the two values, then writing to the table acts like transactionally incrementing a counter.
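To make those two bullets concrete, here are minimal merge callbacks for each. This is only a sketch; the function names are illustrative, and any function with this shape can be plugged into a table format.

%% Replacement semantics: the later write wins, like a plain key-value store.
merge_replace(_Key, _Earlier, Later) ->
    Later.

%% Counter semantics: values are integers, and every write adds to the total.
merge_count(_Key, Earlier, Later) ->
    Earlier + Later.

Both are trivially associative, which is all the merge sort needs.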
As a richer example, the merge function below lets you write single numbers as values, but read back either a {Min, Max, Sum, Count} tuple (if more than one number has been written for a key) or a single number (if that was the only value written). Doing this with an ordinary key-value table and multiple writers would require expensive transactions; with osmos, an operation like this is no more expensive than a simple replacement, and is still ACID.
%% Both values are raw numbers: start a {Min, Max, Sum, Count} summary.
merge(_K, N1, N2) when is_number(N1), is_number(N2) ->
    {min(N1, N2), max(N1, N2), N1 + N2, 2};
%% The earlier value is a raw number and the later one is already a summary.
merge(_K, N1, {Min2, Max2, Sum2, Count2}) when is_number(N1) ->
    {min(N1, Min2), max(N1, Max2), N1 + Sum2, 1 + Count2};
%% The earlier value is a summary and the later one is a raw number.
merge(_K, {Min1, Max1, Sum1, Count1}, N2) when is_number(N2) ->
    {min(Min1, N2), max(Max1, N2), Sum1 + N2, Count1 + 1};
%% Both values are summaries: combine them.
merge(_K, {Min1, Max1, Sum1, Count1}, {Min2, Max2, Sum2, Count2}) ->
    {min(Min1, Min2), max(Max1, Max2), Sum1 + Sum2, Count1 + Count2}.
As you can see, keeping statistics for reporting (when the reports are queried infrequently relative to data collection) is one of the killer applications for a merge sort table.
Among the possibilities for even wackier merge operations are:
- Always return the earlier value. (What was the first value that occurred for this key?)
- Take the union of two sets. (What are all the values that have occurred for this key?)
- Take the intersection of two sets. (What values have always occurred for this key?)
- Multiply two N×N matrices, e.g., to keep track of a series of rotations applied to a vector in R^N.
- Compose a series of arbitrary operations applied to a space of replaceable objects, e.g.:
The values could be employee records, and the operations could be things like, “change street address to X,” “increase salary by Y%.” Using this pattern, you can get extremely cheap transactional safety for any single-key operation, as long as your merge function implements it.
%% The later record is a plain value: it replaces everything before it.
merge(_K, _, V) when ?is_value(V) ->
    V;
%% A single operation applied to an existing value.
merge(_K, V, Op) when ?is_value(V), ?is_operation(Op) ->
    do_operation(Op, V);
%% A list of pending operations applied to an existing value, oldest first.
merge(_K, V, Ops) when ?is_value(V), is_list(Ops) ->
    lists:foldl(fun (Op, Acc) -> do_operation(Op, Acc) end, V, Ops);
%% No value seen yet: accumulate the operations in order until one arrives.
merge(_K, Op1, Op2) when ?is_operation(Op1), ?is_operation(Op2) ->
    [Op1, Op2];
merge(_K, Op, Ops) when ?is_operation(Op), is_list(Ops) ->
    [Op | Ops];
merge(_K, Ops, Op) when is_list(Ops), ?is_operation(Op) ->
    Ops ++ [Op];
merge(_K, Ops1, Ops2) when is_list(Ops1), is_list(Ops2) ->
    Ops1 ++ Ops2.
API
The basic API is quite simple. To open a table named Table with the format F in the directory D:

{ok, Table} = osmos:open(Table, [{directory, D}, {format, F}])

To write a record to the table:

ok = osmos:write(Table, Key, Value)

To read the record for a key:

case osmos:read(Table, Key) of
    {ok, Value} -> ...;
    not_found   -> ...
end

And to close the table:

ok = osmos:close(Table)
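Putting those calls together, a counter table (using the summing merge function from earlier) might be exercised as in the sketch below. The table name counters and the directory are arbitrary, Format stands for a table format built as described below, and the osmos application is assumed to be running already.

%% Sketch: increment a counter key twice and read back the merged total.
%% Assumes Format is a table format whose merge function sums the values.
counter_demo(Format) ->
    {ok, T} = osmos:open(counters, [{directory, "/tmp/counters"},
                                    {format, Format}]),
    ok = osmos:write(T, <<"page_hits">>, 1),
    ok = osmos:write(T, <<"page_hits">>, 1),
    {ok, Total} = osmos:read(T, <<"page_hits">>),  %% Total =:= 2 with a sum merge
    ok = osmos:close(T),
    Total.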
You can also iterate over a range of keys in chunks using osmos:select_range/5 and osmos:select_continue/3. The results from a select provide a consistent snapshot of the table, meaning that the results always reflect the contents of the table at the time of the original call to select_range. (In other words, any writes that happen between subsequent calls to select_continue won't affect the results.)
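As an illustration, a chunked scan over a key range might look like the sketch below. The argument meanings assumed here — select_range(Table, LessLo, LessHi, Select, ChunkSize), where LessLo(Key) is true for keys below the range and LessHi(Key) is true for keys below its end, with both calls returning {ok, Results, Continuation}, Results being a list of {Key, Value} pairs, and an empty result list signalling the end — are my reading of the API, so check the module documentation before relying on them.

%% Sketch: print every {Key, Value} with Lo =< Key < Hi, 100 records at a time.
dump_range(Table, Lo, Hi) ->
    LessLo = fun (Key) -> Key < Lo end,  %% true for keys before the range
    LessHi = fun (Key) -> Key < Hi end,  %% true for keys before the end of the range
    Select = fun (_Key, _Value) -> true end,
    {ok, Results, Cont} = osmos:select_range(Table, LessLo, LessHi, Select, 100),
    dump_chunks(Table, Results, Cont).

dump_chunks(_Table, [], _Cont) ->
    ok;
dump_chunks(Table, Results, Cont) ->
    lists:foreach(fun ({Key, Value}) ->
                          io:format("~p: ~p~n", [Key, Value])
                  end, Results),
    {ok, More, NextCont} = osmos:select_continue(Table, Cont, 100),
    dump_chunks(Table, More, NextCont).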
A table format is a record with the following fields:
- block_size::integer(): block size of the table in bytes, controlling the size of disk reads (which are always whole blocks), and the fanout of the on-disk search trees.
- key_format::#osmos_format{}: on-disk format for keys. (A pair of functions to convert some set of terms to binaries and back.)
- key_less::(KeyA, KeyB) -> bool(): comparison function defining the order of keys in the table. Takes two native-format keys, and returns true if the first argument is less than the second argument.
- value_format::#osmos_format{}: on-disk format for values.
- merge::(Key, EarlierValue, LaterValue) -> MergedValue: the merge function described above.
- short_circuit::(Key, Value) -> bool(): function which allows searches of the table to be terminated early (short-circuited) if it can be determined from a record that any earlier records with the same key are irrelevant.
- delete::(Key, Value) -> bool(): function controlling when records are deleted from the table.
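For a concrete, though heavily hedged, picture of how those fields fit together, here is what a format for the counter table might look like. The record name #osmos_table_format{} and the #osmos_format{} field names to_binary/from_binary are assumptions on my part, and osmos may ship ready-made formats that cover this; the point is only to show one plausible value for each field listed above.

%% Sketch only: the record and field names below are assumed, not checked
%% against the osmos headers.
counter_format() ->
    TermFormat = #osmos_format{ to_binary   = fun erlang:term_to_binary/1,
                                from_binary = fun erlang:binary_to_term/1 },
    #osmos_table_format{
      block_size    = 4096,                            %% bytes per disk read
      key_format    = TermFormat,
      key_less      = fun (A, B) -> A < B end,         %% plain Erlang term order
      value_format  = TermFormat,
      merge         = fun (_K, N1, N2) -> N1 + N2 end, %% counter semantics
      short_circuit = fun (_K, _V) -> false end,       %% never cut a search short
      delete        = fun (_K, _V) -> false end        %% never delete records
    }.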
Performance
The file tests/osmos_benchmark_1.erl in the source distribution contains a benchmark that uses variable-length binaries as keys (some more frequent than others, with an average length of 15 bytes), and nonnegative integers as values, encoded in 64 bits, where merging takes the sum of the two values. One process writes random keys and values as fast as it can, while another process reads random keys with a 10 ms sleep between reads. I ran the benchmark for 15 minutes on a 2.2 GHz dual-core MacBook, and got the following numbers:
- 5028735 total records written, for an average of 5587 writes/second (including the time to compute random keys, etc.)
- an average of 109.8 microseconds per write call, which would mean a theoretical maximum write rate of 9111 writes/second (for this table format, machine, etc.)
- the median time per write call was 30 microseconds, and the 90th percentile was 54 microseconds, indicating that the vast majority of writes are extremely quick
- an average of 1900 microseconds (1.9 ms) per read call
- the median time per read call was 979 microseconds, and the 90th percentile was 2567 microseconds
I reran the benchmark for 2 hours on the same machine, and got the following numbers:
- 17247676 total records written, for an average of 2396 writes/second
- an average of 329.7 microseconds per write call, for a theoretical maximum write rate of 3033 writes/second
- the median time per write call was 39 microseconds, and the 90th percentile was 88 microseconds
- an average of 13076 microseconds (13 ms) per read call
- the median time per read call was 6081 microseconds, and the 90th percentile was 34094 microseconds
[1] I have to thank my good friend Dave Benson for introducing me to these ideas, and the generalized merge sort table. His excellent library GSK provides a very similar table API (GskTable) for people who want to write single-threaded servers in C. (I ripped off several of his design ideas, including prefix compression and the variable-length integer encoding.)
[2] Of course there may be implicit seeking due to the need to access multiple files at the same time. But for sequential access, the OS and disk hardware can mitigate this to a large degree as long as the number of files isn't too large.
Comments

This looks pretty cool. I've been using mnesia ordered_set tables to store an inverted index using tuple keys, so records look like {{string, docid}, weight}. This makes select matching really fast, but reindexing can easily overload mnesia and lead to out of memory crashes. Sounds like osmos was made to solve the reindexing problem. How could I transition from using mnesia to osmos? In other words, what would be the osmos way of creating inverted index records for optimal select_range?
Hi, this looks pretty hot :) For fun, can you find out your MB/sec write speed and compare it to what the bonnie tool can get your harddrive to write?
Jacob: for your problem, ideally the string would be the only key, but right now there's no support for multiple records with the same key. (I have thought about adding that.)
So for now, I think you could use {string,docid} as the key, and weight as the value. Then you could pass upper and lower bound functions to select_range that only pay attention to the string part of the key, to select the range of records with a given string as the first element of the key.
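For illustration, the two bound functions for a given term might look something like this sketch (Term is the string being looked up, and I'm assuming each bound function returns true for keys strictly below its bound):

%% Sketch: select exactly the records whose key is {Term, _AnyDocId}.
LessLo = fun ({String, _DocId}) -> String < Term end   %% below the range
LessHi = fun ({String, _DocId}) -> String =< Term end  %% below the end of the range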
You should add qlc support by defining a table interface, that might help some people transition from mnesia.
Does the implementation require all records to be kept in memory?
philipp: no, records are only in memory for a brief time after they are written to the table (a fixed number are kept in an in-memory tree and then periodically flushed to disk). All other records are on the disk. So memory usage is bounded, and quite small.
Why fw-pkgin?
Why fw-bootstrap?
Why don't you guys package this in the traditional make/automake way?
Why does every single one try to use something sophisticated like rake or this? What's wrong with make and regular bash?
We already have enough pain in the ass. I couldn't understand this...
Hey Kunthar,
Source tarballs that we release with 'make dist' build in the typical automake fashion (./configure; make; make install).
It sounds like perhaps you are trying to build directly from a source code checkout. For that you need framewerk, so I would suggest grabbing one of the source tarballs from the download page instead.
As to why we use framewerk, it does a lot of things for us beyond what automake does, too many to mention here; check out http://code.google.com/p/fwtemplates/wiki/FwTemplateErlangWalkthrough for an overview.
Hi, not sure if you guys noticed, but I posted a defect at http://code.google.com/p/osmos/issues/detail?id=1. The highest key is not selected during select_range iteration. Or I'm misunderstanding how select_range works. Either way, it's a major block for switching from mnesia to osmos.
Jacob: no you weren't misunderstanding, that was a real bug. Fixed in release 0.0.1. Please let me know if you run into any more problems.
Excellent, thank you. Glad I was understanding select_range correctly :)
Now I can see about replacing mnesia with osmos text indexing tables, and hopefully produce some benchmarks in the process.
Any chance someone could put up a wiki on how to get it running in the erlang shell? Since I think you're using http://code.google.com/p/fwtemplates/wiki/FwTemplateErlangWalkthrough
I'm not even sure how to set up that.
I have a mixed network and am thinking about using this for the local disk storage for my distributed application.
@Jeff if you download the release package, you can do the standard ./configure && make, then include the src/ directory (should have the beam files after make) in your erlang path (erl -pa PATH/TO/osmos-0.0.1/src)
Hope that helps
Any update on osmos? Any experiences of long-running installations of osmos, any code updates?
PS. Please allow anonymous comments. Thanks.