Sunday, June 15, 2008

generating app files

We automatically generate our .app files around here from analysis of the source code. However, this trick is buried inside fw-template-erlang, which most (all?) people outside our group do not use. I thought I'd highlight how it works.

A typical OTP application specification file looks like:

{ application, walkenfs,
  [
    { description, "Distributed filesystem." },
    { vsn, "0.1.7" },
    { modules, [ walkenfssup, walkenfsfragmentsrv, walkenfs, walkenfssrv ] },
    { registered, [ walkenfsfragmentsrv ] },
    { applications, [ kernel, stdlib ] },
    { mod, { walkenfs, [] } },
    { env, [ { linked_in, false },
             { mount_opts, "" },
             { mount_point, "/walken" },
             { prefix, walken },
             { read_data_context, async_dirty },
             { write_data_context, sync_dirty },
             { read_meta_context, transaction },
             { write_meta_context, sync_transaction },
             { attr_timeout_ms, 0 },
             { entry_timeout_ms, 0 },
             { reuse_node, true },
             { start_timeout, infinity },
             { stop_timeout, 1800000 },
             { init_fragments, 7 },
             { init_copies, 3 },
             { copy_type, n_disc_only_copies },
             { buggy_ets, auto_detect },
             { port_exit_stop, true } ] }
  ]
}.
There are a lot of bits here that can be filled in automatically. It helps to think in terms of the template:

{ application, @FW_PACKAGE_NAME@,
  [
    { description, "@FW_PACKAGE_SHORT_DESCRIPTION@" },
    { vsn, "@FW_PACKAGE_VERSION@" },
    { modules, [ @FW_ERL_APP_MODULES@ ] },
    { registered, [ @FW_ERL_APP_REGISTERED@ ] },
    { applications, [ kernel, stdlib @FW_ERL_PREREQ_APPLICATIONS@ ] },
    @FW_ERL_APP_MOD_LINE@
    { env, @FW_ERL_APP_ENVIRONMENT@ }
    @FW_ERL_APP_EXTRA@
  ]
}.
This is an "automake" style template where @VARIABLE@ indicates something that will be substituted.
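To make the substitution step concrete, here is a minimal sketch in Erlang. It is not how automake performs the substitution; the module name subst_sketch and the function substitute/2 are made up for illustration, and the code assumes string:replace/4 is available (OTP 20 or later).

```erlang
-module(subst_sketch).
-export([substitute/2]).

%% Replace every @KEY@ placeholder in Template with the matching value.
%% Vars is a list of {Key, Value} string pairs (hypothetical input format);
%% placeholders without an entry in Vars are left untouched.
substitute(Template, Vars) ->
    lists:foldl(fun({Key, Value}, Acc) ->
                        Placeholder = "@" ++ Key ++ "@",
                        %% string:replace/4 returns chardata, so flatten it
                        %% back into a plain string for the next round.
                        lists:flatten(string:replace(Acc, Placeholder, Value, all))
                end,
                Template,
                Vars).
```

Calling subst_sketch:substitute/2 on a line such as { vsn, "@FW_PACKAGE_VERSION@" } with the pair {"FW_PACKAGE_VERSION", "0.1.7"} fills in the version string and leaves the rest of the line alone.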

Now we can say where each of these template values comes from when using fw-template-erlang:
  1. FW_PACKAGE_NAME: specified by the developer. If you're using automake you've already had to specify this via AC_INIT, so it's good to reuse that rather than require duplicate specification.
  2. FW_PACKAGE_SHORT_DESCRIPTION: specified by the developer.
  3. FW_PACKAGE_VERSION: specified by the developer. Like the package name, if you're using automake this information is already contained in the arguments to AC_INIT, so reuse that.
  4. FW_ERL_APP_MODULES: generated by analysis of the source code directory.
  5. FW_ERL_APP_REGISTERED: generated by analysis of the source code directory.
  6. FW_ERL_PREREQ_APPLICATIONS: specified by the developer.
  7. FW_ERL_APP_MOD_LINE: partially generated by analysis of the source code directory, partially specified by the developer (see below).
  8. FW_ERL_APP_ENVIRONMENT: specified by the developer.
  9. FW_ERL_APP_EXTRA: specified by the developer. This is for unusual extra directives like included applications.
So the interesting bits come down to FW_ERL_APP_MODULES, FW_ERL_APP_REGISTERED, and FW_ERL_APP_MOD_LINE being generated automatically via inspection of the source code.

First, a note on intervention. Any time you are attempting to automate a task that was previously done by humans, it helps to allow a human to override any of the automatic settings that you generate by default. That way, you can be correct only 95% of the time and still be a huge timesaver. In practice we've found that although it is easy to come up with code examples that flummox the automatic strategies, such code is never actually written by people in the normal course of their work. In fact, we've found that application files we get from the universe are often missing registered process names that our automatic strategy finds. So empirically we're running at 100% correct so far, but fw-template-erlang still contains the capability to override any automatically generated value.

Now the actual trick to computing all the stuff is that Erlang contains an Erlang compiler as a library. Here's an example of leveraging this to generate all the module names from a set of files; any file that contains a -fwskip() directive will be ignored.

-module (find_modules).
-export ([ main/1 ]).

is_skipped ([]) -> false;
is_skipped ([ { attribute, _, fwskip, _ } | _ ]) -> true;
is_skipped ([ _ | Rest ]) -> is_skipped (Rest).

print_module (Dir, F) ->
  case compile:file (F, [ binary, 'E', { outdir, Dir } ]) of
    { ok, Mod, { Mod, _, Forms } } ->
      case is_skipped (Forms) of
        true ->
          ok;
        false ->
          port_command (get (three), io_lib:format ("~p~n", [ Mod ]))
      end;
    _ ->
      ok
  end.

main ([ Dir | Rest ]) ->
  ok = file:make_dir (Dir),

  try
    Three = open_port ({ fd, 0, 3 }, [ out ]),
    % ugh ... don't want to have to change all these function signatures,
    % so i'm gonna be dirty
    put (three, Three),
    lists:foreach (fun (F) -> print_module (Dir, F) end, Rest)
  after
    { ok, FileNames } = file:list_dir (Dir),
    lists:foreach (fun (F) -> file:delete (Dir ++ "/" ++ F) end, FileNames),
    file:del_dir (Dir)
  end;
main ([]) ->
  Port = open_port ({ fd, 0, 2 }, [ out ]),
  port_command (Port, "usage: find-modules.esc tmpdir filename [filename ...]\n"),
  halt (1).
The output is on file descriptor 3 in order to isolate the desired output from various messages being output by the Erlang code. This was originally an escript but to be compatible with older versions of Erlang we changed it to run as follows:

#! /bin/sh

# NB: this script is called find-modules.sh

# unfortunately, escript is a recent innovation ...

ERL_CRASH_DUMP=/dev/null
export ERL_CRASH_DUMP

erl -pa "${FW_ROOT}/share/fw/template/erlang/" \
    -pa "${FW_ROOT}/share/fw.local/template/erlang" \
    -pa "${top_srcdir}/fw/template/erlang/" \
    -pa "${top_srcdir}/fw.local/template/erlang" \
    -noshell -noinput -eval '
  find_modules:main (init:get_plain_arguments ()),
  halt (0).
' -extra "$@" 3>&1 >/dev/null


In practice this gets executed like

find src -name '*.erl' -print | xargs find-modules.sh "$tmpdir" |
  perl -ne 'chomp;
            next if $seen{$_}++;
            print ", " if $f;
            print $_;
            $f = 1;'
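For readers who do not read Perl: the filter drops duplicate lines and joins what is left with ", " so the result can be pasted into the modules list. A rough Erlang equivalent is sketched here; the module name join_sketch is made up, and the code assumes lists:join/2 (OTP 19 or later).

```erlang
-module(join_sketch).
-export([join_unique/1]).

%% Drop repeated names (keeping the first occurrence of each) and join
%% the survivors with ", ". Names is a list of strings.
join_unique(Names) ->
    lists:flatten(lists:join(", ", unique(Names, []))).

%% Order-preserving de-duplication; Seen is kept in reverse order.
unique([], Seen) ->
    lists:reverse(Seen);
unique([Name | Rest], Seen) ->
    case lists:member(Name, Seen) of
        true  -> unique(Rest, Seen);
        false -> unique(Rest, [Name | Seen])
    end.
```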


OK, that was the easy one! In fact, the problem of finding all the module names seems so easy that one could be tempted to solve it with grep (hmmm: what about preprocessor directives?). However, the point is that nothing parses Erlang as well as Erlang, so why reinvent the wheel?
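As a small illustration of that point, the sketch below reads a module name with the scanner and parser from stdlib rather than with a regular expression. It is not part of fw-template-erlang: the module name module_name_sketch is made up, and unlike compile:file it does not run the preprocessor, so it only handles source text whose first form is the module attribute.

```erlang
-module(module_name_sketch).
-export([module_name/1]).

%% Return {ok, Module} if the first form in Source is a -module attribute,
%% otherwise none. Comments are dropped by the scanner and odd spacing is
%% irrelevant to the parser, which is where a grep pattern tends to slip.
module_name(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    case erl_parse:parse_form(Tokens) of
        {ok, {attribute, _, module, Mod}} -> {ok, Mod};
        _ -> none
    end.
```

Given the text "%% -module(decoy).\n- module ( walkenfs ) .\n", a naive grep for -module( would pick up the comment and miss the real attribute, while the parser sees only the attribute.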

Computing the mod line in the application specification is just slightly harder. Basically we scan source code for a module which has a -behaviour(application) attribute and assume that is the start module. The Erlang compiler allows behaviour to be (mis)spelt American-style so we allow that as well.

-module (find_start_module).
-export ([ main/1 ]).

is_application ([]) -> false;
is_application ([ { attribute, _, behaviour, [ application ] } | _ ]) -> true;
is_application ([ { attribute, _, behavior, [ application ] } | _ ]) -> true;
is_application ([ _ | Rest ]) -> is_application (Rest).

is_skipped ([]) -> false;
is_skipped ([ { attribute, _, fwskip, _ } | _ ]) -> true;
is_skipped ([ _ | Rest ]) -> is_skipped (Rest).

find_start_module (_, []) ->
  ok;
find_start_module (Dir, [ F | Rest ]) ->
  case compile:file (F, [ binary, 'E', { outdir, Dir } ]) of
    { ok, Mod, { Mod, _, Forms } } ->
      case is_application (Forms) and not is_skipped (Forms) of
        true ->
          port_command (get (three), io_lib:format ("~p~n", [ Mod ]));
        false ->
          find_start_module (Dir, Rest)
      end;
    _ ->
      find_start_module (Dir, Rest)
  end.

main ([ Dir | Rest ]) ->
  ok = file:make_dir (Dir),

  try
    Three = open_port ({ fd, 0, 3 }, [ out ]),
    % ugh ... don't want to have to change all these function signatures,
    % so i'm gonna be dirty
    put (three, Three),
    find_start_module (Dir, Rest)
  after
    { ok, FileNames } = file:list_dir (Dir),
    lists:foreach (fun (F) -> file:delete (Dir ++ "/" ++ F) end, FileNames),
    file:del_dir (Dir)
  end;
main ([]) ->
  Port = open_port ({ fd, 0, 2 }, [ out ]),
  port_command (Port, "usage: find-modules.esc tmpdir filename [filename ...]\n"),
  halt (1).

If there are no applications in the set of files being analyzed, we output nothing for the mod line in the application file; otherwise we output
{ mod, { @FW_ERL_APP_START_MODULE@, @FW_ERL_APP_START_ARGS@ } },
where FW_ERL_APP_START_MODULE is the output of the above command and FW_ERL_APP_START_ARGS is indicated by the programmer (we use [] by default for FW_ERL_APP_START_ARGS and in practice never use the field). Obviously, if multiple modules implementing the application behaviour are found then this will not work. Our style has been to put each application we make in a separate directory to facilitate automatic analysis.
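The behaviour check and the resulting line can be tried out in isolation with a sketch like the following. Everything here is illustrative: the module name mod_line_sketch, the function names, and the none / {ok, Module} convention for "no start module found" are assumptions, not part of fw-template-erlang. The code above matches the attribute value [ application ], while forms taken straight from erl_parse carry the bare atom, so the sketch accepts both.

```erlang
-module(mod_line_sketch).
-export([parse_form/1, is_application/1, mod_line/2]).

%% Parse a single form from source text (no macros, no includes).
parse_form(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    {ok, Form} = erl_parse:parse_form(Tokens),
    Form.

%% True for the application behaviour attribute in either spelling.
is_application({attribute, _, Spelling, Value})
  when Spelling =:= behaviour; Spelling =:= behavior ->
    Value =:= application orelse Value =:= [application];
is_application(_) ->
    false.

%% No start module: emit nothing. One start module: emit the mod line,
%% trailing comma included, as in the template.
mod_line(none, _Args) ->
    "";
mod_line({ok, Mod}, Args) ->
    lists:flatten(io_lib:format("{ mod, { ~p, ~p } },", [Mod, Args])).
```

With the walkenfs example, mod_line_sketch:mod_line({ok, walkenfs}, []) produces the same mod line as the application file at the top of this post, followed by a comma.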

Anyway now we can tackle the more challenging problem of finding registered processes. There are many library calls that end up registering a process name; the ones we recognize are:
  • supervisor child specs that contain calls of the form
    • { _, start, [ { local, xxx }, ... ] } -> xxx being registered
    • { _, start_link, [ { local, xxx }, ... ] } -> xxx being registered
  • function calls of the form
    • Module:start ({ local, xxx }, ...) -> xxx being registered
    • Module:start_link ({ local, xxx }, ...) -> xxx being registered
  • calls to erlang:register (xxx, ...) -> xxx being registered
There's a lot of code here but the general idea is the same: walk the forms generated by compile and pattern match on one of the above cases, outputting discovered registered process names to file descriptor 3.
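Before the full walker, it may help to see what the three cases look like as abstract forms. The sketch below parses a single expression from a string and matches only the outermost node; it does not recurse, which is the job of the much longer module that follows. The module name registered_sketch and its functions are made up for illustration.

```erlang
-module(registered_sketch).
-export([parse_expr/1, registered_name/1]).

%% Parse one expression (terminated by a dot) into its abstract form.
parse_expr(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    {ok, [Expr]} = erl_parse:parse_exprs(Tokens),
    Expr.

%% Supervisor child spec style: { _, start | start_link, [ { local, Name } | _ ] }
registered_name({tuple, _, [_, {atom, _, Func},
                            {cons, _, {tuple, _, [{atom, _, local},
                                                  {atom, _, Name}]}, _}]})
  when Func =:= start; Func =:= start_link ->
    {ok, Name};
%% Module:start ({ local, Name }, ...) and Module:start_link ({ local, Name }, ...)
registered_name({call, _, {remote, _, _, {atom, _, Func}},
                 [{tuple, _, [{atom, _, local}, {atom, _, Name}]} | _]})
  when Func =:= start; Func =:= start_link ->
    {ok, Name};
%% erlang:register (Name, ...)
registered_name({call, _, {remote, _, {atom, _, erlang}, {atom, _, register}},
                 [{atom, _, Name} | _]}) ->
    {ok, Name};
registered_name(_) ->
    none.
```

Each pattern only fires when the name is a literal atom. An expression whose name is held in a variable parses to a var node instead, so registered_name/1 returns none for it.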

-module (find_registered).
-export ([ main/1 ]).

%% looks like fun_clauses are the same as clauses (?)
print_registered_fun_clauses (Clauses) ->
  print_registered_clauses (Clauses).

%% looks like icr_clauses are the same as clauses (?)
print_registered_icr_clauses (Clauses) ->
  print_registered_clauses (Clauses).

print_registered_inits (Inits) ->
  lists:foreach (fun ({ record_field, _, _, E }) ->
                       print_registered_expr (E);
                     (_) ->
                       ok
                 end,
                 Inits).

print_registered_upds (Upds) ->
  lists:foreach (fun ({ record_field, _, _, E }) ->
                       print_registered_expr (E);
                     (_) ->
                       ok
                 end,
                 Upds).

% hmmm ... pretty sure patterns are not supposed to have side-effects ...
print_registered_pattern (_) -> ok.
print_registered_pattern_group (_) -> ok.

print_registered_quals (Qs) ->
  lists:foreach (fun ({ generate, _, P, E }) ->
                       print_registered_pattern (P),
                       print_registered_expr (E);
                     ({ b_generate, _, P, E }) ->
                       print_registered_pattern (P),
                       print_registered_expr (E);
                     (E) ->
                       print_registered_expr (E)
                 end,
                 Qs).

print_registered_expr ({ cons, _, H, T }) ->
  print_registered_expr (H),
  print_registered_expr (T);
print_registered_expr ({ lc, _, E, Qs }) ->
  print_registered_expr (E),
  print_registered_quals (Qs);
print_registered_expr ({ bc, _, E, Qs }) ->
  print_registered_expr (E),
  print_registered_quals (Qs);

%% Ok, here's some meat:
%% the "supervisor child spec" rule
%% { _, start, [ { local, xxx }, ... ] } -> xxx being registered
%% { _, start_link, [ { local, xxx }, ... ] } -> xxx being registered
print_registered_expr ({ tuple, _,
                         Exprs=[ _,
                                 { atom, _, Func },
                                 { cons,
                                   _,
                                   { tuple, _, [ { atom, _, local },
                                                 { atom, _, Name } ] },
                                   _ } ] }) when (Func =:= start) or
                                                 (Func =:= start_link) ->
  port_command (get (three), io_lib:format ("~p~n", [ Name ])),
  print_registered_exprs (Exprs);

print_registered_expr ({ tuple, _, Exprs }) ->
  print_registered_exprs (Exprs);
print_registered_expr ({ record_index, _, _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ record, _, _, Inits }) ->
  print_registered_inits (Inits);
print_registered_expr ({ record_field, _, E0, _, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr ({ record, _, E, _, Upds }) ->
  print_registered_expr (E),
  print_registered_upds (Upds);
print_registered_expr ({ record_field, _, E0, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr ({ block, _, Exprs }) ->
  print_registered_exprs (Exprs);
print_registered_expr ({ 'if', _, IcrClauses }) ->
  print_registered_icr_clauses (IcrClauses);
print_registered_expr ({ 'case', _, E, IcrClauses }) ->
  print_registered_expr (E),
  print_registered_icr_clauses (IcrClauses);
print_registered_expr ({ 'receive', _, IcrClauses }) ->
  print_registered_icr_clauses (IcrClauses);
print_registered_expr ({ 'receive', _, IcrClauses, E, Exprs }) ->
  print_registered_icr_clauses (IcrClauses),
  print_registered_expr (E),
  print_registered_exprs (Exprs);
print_registered_expr ({ 'try', _, Exprs0, IcrClauses0, IcrClauses1, Exprs1 }) ->
  print_registered_exprs (Exprs0),
  print_registered_icr_clauses (IcrClauses0),
  print_registered_icr_clauses (IcrClauses1),
  print_registered_exprs (Exprs1);
print_registered_expr ({ 'fun', _, Body }) ->
  case Body of
    { clauses, Cs } ->
      print_registered_fun_clauses (Cs);
    _ ->
      ok
  end;

%% Ok, here's some meat:
%% Module:start ({ local, xxx }, ...) -> xxx being registered
%% Module:start_link ({ local, xxx }, ...) -> xxx being registered

print_registered_expr ({ call,
                         _,
                         E={ remote, _, _, { atom, _, Func } },
                         Exprs=[ { tuple, _, [ { atom, _, local },
                                               { atom, _, Name } ] } | _ ] })
  when (Func =:= start) or
       (Func =:= start_link) ->
  port_command (get (three), io_lib:format ("~p~n", [ Name ])),
  print_registered_expr (E),
  print_registered_exprs (Exprs);

%% Ok, here's some meat:
%% erlang:register (xxx, ...) -> xxx being registered

print_registered_expr ({ call,
                         _,
                         { remote,
                           _,
                           { atom, _, erlang },
                           { atom, _, register } },
                         Exprs=[ { atom, _, Name } | _ ] }) ->
  port_command (get (three), io_lib:format ("~p~n", [ Name ])),
  print_registered_exprs (Exprs);

print_registered_expr ({ call, _, E, Exprs }) ->
  print_registered_expr (E),
  print_registered_exprs (Exprs);
print_registered_expr ({ 'catch', _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ 'query', _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ match, _, P, E }) ->
  print_registered_pattern (P),
  print_registered_expr (E);
print_registered_expr ({ bin, _, PatternGrp }) ->
  print_registered_pattern_group (PatternGrp);
print_registered_expr ({ op, _, _, E }) ->
  print_registered_expr (E);
print_registered_expr ({ op, _, _, E0, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr ({ remote, _, E0, E1 }) ->
  print_registered_expr (E0),
  print_registered_expr (E1);
print_registered_expr (_) ->
  ok.

print_registered_exprs (Exprs) ->
  lists:foreach (fun (E) -> print_registered_expr (E) end, Exprs).

print_registered_clauses (Clauses) ->
  lists:foreach (fun ({ clause, _, _, _, Exprs }) ->
                       print_registered_exprs (Exprs);
                     (_) ->
                       ok
                 end,
                 Clauses).

print_registered_forms (Forms) ->
  lists:foreach (fun ({ function, _, _, _, Clauses }) ->
                       print_registered_clauses (Clauses);
                     (_) ->
                       ok
                 end,
                 Forms).

is_skipped ([]) -> false;
is_skipped ([ { attribute, _, fwskip, _ } | _ ]) -> true;
is_skipped ([ _ | Rest ]) -> is_skipped (Rest).

print_registered (Dir, F) ->
  case compile:file (F, [ binary, 'E', { outdir, Dir } ]) of
    { ok, _, { _, _, Forms } } ->
      case is_skipped (Forms) of
        true ->
          ok;
        false ->
          print_registered_forms (Forms)
      end;
    _ ->
      ok
  end.

main ([ Dir | Rest ]) ->
  ok = file:make_dir (Dir),

  try
    Three = open_port ({ fd, 0, 3 }, [ out ]),
    % ugh ... don't want to have to change all these function signatures,
    % so i'm gonna be dirty
    put (three, Three),
    lists:foreach (fun (F) -> print_registered (Dir, F) end, Rest)
  after
    { ok, FileNames } = file:list_dir (Dir),
    lists:foreach (fun (F) -> file:delete (Dir ++ "/" ++ F) end, FileNames),
    file:del_dir (Dir)
  end;
main ([]) ->
  Port = open_port ({ fd, 0, 2 }, [ out ]),
  port_command (Port, "usage: find-modules.esc tmpdir filename [filename ...]\n"),
  halt (1).
It's definitely possible to fool the above code. For instance:

Hidden = therealname,
erlang:register (Hidden, SomePid).

However, in practice developers tend to either name their registered processes explicitly or at worst obscure them with a macro (which, by going through the Erlang compiler, we get expanded for free), so this automatic strategy has been working very well.

You can download fw-template-erlang from Google Code and get the complete source code from above plus some intuition about how it is called. One thing that's been on our TODO list is to combine all these analyses into one pass over the source code; currently we pass through the source code three times, which is clearly suboptimal, although the process is speedy enough that it is only noticeable when analyzing larger projects (like when we converted yaws and mnesia to fw-template-erlang for our internal use).
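For what it's worth, here is one way such a single pass might look. This is a sketch under several assumptions, not code from fw-template-erlang: the module name one_pass_sketch and the result map are made up, the input is source text parsed with erl_scan and erl_parse (so macros and includes are not expanded, unlike with compile:file), and the hand-written expression walker is swapped for a generic traversal of every tuple and list in the abstract code, which also visits patterns and guards.

```erlang
-module(one_pass_sketch).
-export([analyze/1]).

%% Analyze source text once; returns the module name, whether it declares
%% the application behaviour, and the registered names that were spotted.
analyze(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    Forms = [parse(Ts) || Ts <- split_forms(Tokens, [])],
    lists:foldl(fun analyze_form/2,
                #{module => undefined, is_application => false, registered => []},
                Forms).

parse(Tokens) ->
    {ok, Form} = erl_parse:parse_form(Tokens),
    Form.

%% Cut the token stream into one token list per form; each form ends at a
%% dot. Trailing tokens without a closing dot are ignored.
split_forms([], _Incomplete) ->
    [];
split_forms([{dot, _} = Dot | Rest], Acc) ->
    [lists:reverse([Dot | Acc]) | split_forms(Rest, [])];
split_forms([Token | Rest], Acc) ->
    split_forms(Rest, [Token | Acc]).

analyze_form({attribute, _, module, Mod}, Acc) ->
    Acc#{module := Mod};
analyze_form({attribute, _, Spelling, Value}, Acc)
  when (Spelling =:= behaviour orelse Spelling =:= behavior),
       (Value =:= application orelse Value =:= [application]) ->
    Acc#{is_application := true};
analyze_form({function, _, _, _, Clauses}, Acc = #{registered := Names}) ->
    Acc#{registered := Names ++ registered(Clauses)};
analyze_form(_, Acc) ->
    Acc.

%% Generic walk: test each tuple against the known shapes, then descend.
registered(Term) when is_tuple(Term) ->
    name_of(Term) ++ registered(tuple_to_list(Term));
registered([Head | Tail]) ->
    registered(Head) ++ registered(Tail);
registered(_) ->
    [].

name_of({call, _, {remote, _, {atom, _, erlang}, {atom, _, register}},
         [{atom, _, Name} | _]}) ->
    [Name];
name_of({call, _, {remote, _, _, {atom, _, Func}},
         [{tuple, _, [{atom, _, local}, {atom, _, Name}]} | _]})
  when Func =:= start; Func =:= start_link ->
    [Name];
name_of({tuple, _, [_, {atom, _, Func},
                    {cons, _, {tuple, _, [{atom, _, local}, {atom, _, Name}]}, _}]})
  when Func =:= start; Func =:= start_link ->
    [Name];
name_of(_) ->
    [].
```

The fold visits each form exactly once, so the module name, the behaviour check, and the registered names all come out of the same traversal. The -fwskip() handling is left out to keep the sketch short.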
