Previous Up Next

7  Profiling

With MLton and mlprof, you can profile your program to find out for each function how much time it spends or how many bytes it allocates during an entire run of the program. To profile your program, compile with either -profile alloc or -profile time (you can not do both simultaneously). Then, run the executable. While it runs, the profiler maintains counts (bytes or clock ticks) for each source function in the program. When the program finishes, it automatically writes the counts to an mlmon.out file. You can then run mlprof on the executable and the mlmon.out file to see the percentage of the total (allocation or time) spent in each function.

Here is an example of time profiling, run from within the examples/profiling directory.
% mlton -profile time tak.sml
% ./tak
% mlprof tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
function     cur 
------------- -----
Tak.tak1.tak2 75.8%
Tak.tak1      24.2%
This is a contrived example with two mutually recursive copies of the tak function, defined within the Tak structure. This examples shows how mlprof indicates lexical nesting via a sequence of period-separated names indicating the structures and functions in which a function definition is nested. The profiling data shows that roughly three-quarters of the time is spent in the Tak.tak1.tak2 function, while the rest is spent in Tak.tak1.

You can display raw counts in addition to percentages with -raw true.
% mlprof -raw true tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
  function     cur    raw  
------------- ----- -------
Tak.tak1.tak2 75.8% (4.55s)
Tak.tak1      24.2% (1.45s)
You can display the file name and line number for each function in addition to its name with -show-line true.
% mlprof -show-line true tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
        function           cur 
------------------------- -----
Tak.tak1.tak2  tak.sml: 5 75.8%
Tak.tak1  tak.sml: 3      24.2%
Allocation profiling is very similar to time profiling. Here is an example run from within the examples/profiling directory.
% mlton -profile alloc list-rev.sml
% ./list-rev
% mlprof -show-line true list-rev mlmon.out
6,030,136 bytes allocated (108,336 bytes by GC)
       function          cur 
----------------------- -----
append  list-rev.sml: 1 97.6%
<gc>                     1.8%
<main>                   0.4%
rev  list-rev.sml: 6     0.2%
The data shows that most of the allocation is done by the append function defined on line 1 of list-rev.sml. The table also shows how special functions like gc and main are handled: they are printed with surrounding brackets. C functions are displayed similarly. Here, as is usually the case, the allocation done by the garbage collector is due to stack growth.

Time profiling typically has a very small performance impact. However, the performance impact of allocation profiling is noticeable, because it inserts additional C calls for object allocation.

7.1  Profiling the stack

For both allocation and time profiling, you can use -profile-stack true to count the time spent (or bytes allocated) while a function is on the stack. Here is an example.
% mlton -profile alloc -profile-stack true list-rev.sml
% ./list-rev
% mlprof -show-line true list-rev mlmon.out
6,030,136 bytes allocated (108,336 bytes by GC)
       function          cur  stack  GC 
----------------------- ----- ----- ----
append  list-rev.sml: 1 97.6% 97.6% 1.4%
<gc>                     1.8%  0.0% 1.8%
<main>                   0.4% 98.2% 1.8%
rev  list-rev.sml: 6     0.2% 97.6% 1.8%
In the above table, we see that rev, defined on line 6 of list-rev.sml, is only responsible for 0.2% of the allocation, but is on the stack while 97.6% of the allocation is done by the user program and while 1.8% of the allocation is done by the garbage collector.

The performance impact of -profile-stack true can be noticeable since there is some extra bookkeeping at every nontail call and return.

7.2  Call graphs

For easier visualization of profiling data, mlprof can create a call graph of the program in dot format, from which you can use the graphviz software package to create a postscript graph. For example, mlprof -call-graph foo.dot foo mlmon.out will create foo.dot with a complete call graph. For each source function, there will be one node in the graph that contains the function name (and source position with -show-line true), as well as the percentage of ticks. If you want to create a call graph for your program without any profiling data, you can simply call mlprof without any mlmon.out files, e.g. mlprof -call-graph foo.dot foo.

Because SML has higher-order functions, the call graph is is dependent on MLton's analysis of which functions call each other. This analysis depends on many implementation details and might display spurious edges that a human could conclude are impossible. However, in practice, the call graphs tend to be very accurate.

Because call graphs can get big, mlprof provides the -keep option to specify the nodes that you would like to see. This option also controls which functions appear in the table that mlprof prints. The argument to -keep is an expression describing a set of source functions (i.e. graph nodes), taken from the following grammar.

e ::= all
  | "s"
  | (and e ...)
  | (from e)
  | (not e)
  | (or e ...)
  | (pred e)
  | (succ e)
  | (to e)
  | (thresh x)
  | (thresh-gc x)
  | (thresh-stack x)

In the grammar, all denotes the set of all nodes. "s" is a regular expression denoting the set of functions whose name (followed by a space and the source position) has a prefix matching the regexp. The and, not, and or expressions denote intersection, complement, and union, respectively. The pred and succ expressions add the set of immediate predecessors or successors to their argument, respectively. The from and to expressions denote the set of nodes that have paths from or to the set of nodes denoted by their arguments, respectively. Finally, thresh, thresh-gc, and thresh-stack denote the set of nodes whose percentage of ticks, gc ticks, or stack ticks, respectively, is greater than or equal to the real number x.

For example, if you want to see the entire call graph for a program, you can use -keep all (this is the default). If you want to see all nodes reachable from function foo in your program, you would use -keep '(from "foo")'. Or, if you want to see all the functions defined in subdirectory bar of your project that used at least 1% of the ticks, you would use -keep '(and ".*/bar/" (thresh 1.0))'. To see all functions with ticks above a threshold, you can also use -thresh x, which is an abbreviation for -keep '(thresh x)'. You can not use multiple -keep arguments or both -keep and -thresh. When you use -keep to display a subset of the functions, mlprof will add dashed edges to the call graph to indicate a path in the original call graph from one function to another.

When compiling with -profile-stack true, you can use mlprof -gray true to make the nodes darker or lighter depending on whether their stack percentage is higher or lower.

MLton's optimizer may duplicate source functions for any of a number of reasons (functor duplication, monomorphisation, polyvariance, inlining). By default, all duplicates of a function are treated as one. If you would like to treat the duplicates separately, you can use mlprof -split regexp, which will cause all duplicates of functions whose name has a prefix matching the regular expression to be treated separately. This can be especially useful for higher-order utility functions like General.o.

7.3  Using MLton.Profile

To profile individual portions of your program, you can use the MLton.Profile structure (see Section 9.2.15). This allows you to create many units of profiling data (essentially, mappings from functions to counts) during a run of a program, to switch between them while the program is running, and to output multiple mlmon.out files. Executing mlprof with multiple mlmon.out files sums the profiling data to produce the output profiling information.

Here is an example, run from within the examples/profiling directory, showing how to profile the executions of the fib and tak functions separately.
% mlton -profile time fib-tak.sml
% ./fib-tak
% mlprof fib-tak mlmon.fib.out
5.77 seconds of CPU time (0.00 seconds GC)
function   cur 
--------- -----
fib       96.9%
<unknown>  3.1%
% mlprof fib-tak mlmon.tak.out
0.68 seconds of CPU time (0.00 seconds GC)
function  cur  
-------- ------
tak      100.0%
% mlprof fib-tak mlmon.fib.out mlmon.tak.out mlmon.out
6.45 seconds of CPU time (0.00 seconds GC)
function   cur 
--------- -----
fib       86.7%
tak       10.5%
<unknown>  2.8%

7.4  Profiling details

Technically speaking, mlprof produces a call-stack graph rather than a call graph, because it describes the set of possible call stacks. The difference is in how tail calls are displayed. For example if f nontail calls g and g tail calls h, then the call-stack graph has edges from f to g and f to h, while the call graph has edges from f to g and g to h. That is, a tail call from g to h removes g from the call stack and replaces it with h.

Conceptually, both allocation and time profiling work in the same way. The compiler produces information that maps machine code positions to source functions that the profiler uses while the program is running to find out the current source function. With -profile-stack true, the profiler also keeps track of which functions are on the stack. The profiler associates counters (either clock ticks or byte counts) with source functions. For allocation profiling, the compiler inserts code whenever an object is allocated to call a C function to increment the appropriate counter. For time profiling, the profiler catches the SIGPROF signal 100 times per second and increments the appropriate counter. When the program finishes, the profiler writes the counts out to the mlmon.out file. Then, mlprof uses source information stored in the executable to associate the counts in the mlmon.out file with source functions.

With -profile time, use of the following in your program will cause a run-time error, since they would interfere with the profiler signal handler.
MLton.Itimer.set (MLton.Itimer.Prof, ...)}
MLton.Signal.setHandler (MLton.Signal.prof, ...)

Also, because of the random sampling used to implement -profile time, it is best to have a long running program (at least tens of seconds) in order to get reasonable time data.

There may be a few missed clock ticks or bytes allocated at the very end of the program after the data is written.

Profiling has not been tested with signals or threads. In particular, stack profiling may behave strangely.
Previous Up Next