Sunday, July 26, 2009

Ets Ordered Set Select Efficiency

Ets (and tcerl) ordered_sets have the useful property that range queries on them can be resolved more efficiently than full-table scan. For instance, given the operation

ets:select (Tab, [ { { { foo, 1, '_' }, bar }, [], [ '$_' ] } ])

only the range of keys corresponding to 3-element tuples whose first two elements are 'foo' and 1 will be examined, and this is potentially far less than the entire set of keys.

Recently I've been working on a project that requires a range of partially bound keys. This would correspond to an operation such as

ets:select (Tab, [ { { { foo, '$1', '_' }, bar }, [ { '>=', '$1', 1 }, { '=<', '$1', 5 } ], [ '$_' ] } ])

and so I made the modifications to tcerl to be able to detect match conditions of this form and constrain the query range. I started to wonder if I was reinventing the wheel and whether ets had sophisticated match specification analysis under the hood somewhere. I asked the mailing list but answers were inconclusive so I decided to do some measurements. Using the following module:

-module (etsselect).
-compile (export_all).

populate_ets (Tab, N) ->
ets:delete_all_objects (Tab),
lists:foreach (fun (M) ->
ets:insert (Tab, { { foo, M div 10, M }, bar })
end,
lists:seq (1, N)).

repeat_factor (Size) when Size < 1000 -> 100;
repeat_factor (Size) when Size < 100000 -> 30;
repeat_factor (_) -> 10.

select (MatchSpec, Tab, Repeat) when Repeat > 0 ->
ets:select (Tab, MatchSpec),
select (MatchSpec, Tab, Repeat - 1);
select (_, _, _) ->
ok.

select_time (MatchSpec) ->
Tab = ets:new (?MODULE, [ ordered_set ]),
try
[ { N,
(fun () ->
populate_ets (Tab, N),
{ Time, _ } = timer:tc (?MODULE, select, [ MatchSpec, Tab, Repeat ]),
Time / Repeat
end) ()
}
|| X <- lists:seq (1, 9),
N <- [ trunc (math:pow (5, X)) ],
Repeat <- [ repeat_factor (N) ]
]
after
ets:delete (Tab)
end.

I did some timing tests with different match specifications and table sizes. The match specifications were:

[{ '_', [], [ '$_' ] }] % unconstrained
[{{{ foo, 1, '_' }, bar }, [], [ '$_' ] }] % prefix bound
[{{{ foo, '$1', '_' }, bar },
[{ '>=', '$1', 1 }, { '=<', '$1', 1 }],
[ '$_' ]
}] % prefix range bound

The results (using R12B5) indicate that the prefix range bound condition is not being optimized by ets.
The tcerl timings indicate the prefix range bound condition is being optimized (they also indicate much worse constant factors then ets).
Note I haven't released these changes yet, but I will soon as 1.3.1h. Versions of tcerl prior to 1.3.1h will do a full table scan in the prefix range bound case.

1 comment:

Djui said...

Hej,

I stumbled upon the same problem for my project working with ETS tables. I have to do range queries and wondered, if ETS will handled prefix range bound efficient.
Could you elaborate a bit more how you have fixed tcerl and what needs to be done to apply a better timing?
And do you know if these ideas are incorporated in newer releases than 12b5?

Thanks,
Uwe