Callgrind: A call-graph generating Cache Simulator and Profiler

Last updated for Version 0.9.8

Callgrind (previously named Calltree) is a Valgrind tool that runs applications under supervision to generate profiling data. Additionally, two command line tools (Perl scripts) are provided; one of them, callgrind_control, allows interactive control of an application running under Callgrind and is described below.

To use this tool, you must specify --tool=callgrind on the Valgrind command line or use the supplied script callgrind.

This tool is heavily based on the Cachegrind tool of the Valgrind package. Read the documentation of Cachegrind first; this page only describes the features Callgrind supports in addition to Cachegrind's.

Detailed technical documentation on how Callgrind works is available separately. If you only want to know how to use it, this page is all you need to read.

1. Purpose

1.1 Profiling as Part Of Application Development

When you develop a program, one of the last steps is usually to make it as fast as possible (while still being correct). You don't want to waste your time optimizing rarely used functions, so you need to know in which parts of your program most of the time is spent.

This is done with a technique called profiling. The program is run under the control of a profiling tool, which reports the distribution of time among the functions executed in the run. After examining the program's profile, you know where to optimize, and afterwards you verify the success of the optimization with another profiling run.

1.2 Profiling Tools

The best known is the GCC profiling tool GProf: you compile your program with the option "-pg"; running the program generates a file "gmon.out", which can be transformed into human-readable form with the command line tool "gprof". A disadvantage is the required compilation step for a prepared executable, which has to be statically linked.

Another profiling tool is Cachegrind, part of Valgrind. It uses the processor emulation of Valgrind to run the executable, and catches all memory accesses for the trace. The user program does not need to be recompiled; it can use shared libraries and plugins, and the profile measuring doesn't influence the trace results. The trace includes the number of instruction/data memory accesses and 1st/2nd level cache misses, and relates them to the source lines and functions of the run program. A disadvantage is the slowdown caused by the processor emulation: the program runs around 50 times slower.

Cachegrind can only deliver a flat profile: no call relationships among the functions of an application are stored. Thus, inclusive costs, i.e. costs of a function including the costs of all functions called from it, cannot be calculated. Callgrind extends Cachegrind by storing call relationships and the exact event counts spent while a call is active.

Because Callgrind is based on simulation, the slowdown due to the preprocessing of events during collection does not influence the results. See the next chapter for more details on the possibilities.

2. Usage

2.1 Basics

To start a profile run for a program, execute "valgrind --tool=callgrind" followed by the program and its arguments. After program termination, a profile dump file named "callgrind.out.pid" is generated, with pid being the process ID number of the profile run.
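For example, a complete profile run could look like this (a sketch, assuming valgrind with the callgrind tool is installed and using "ls -l" as a stand-in for your program):

```shell
# Run the program under Callgrind; the program runs slower but unmodified.
valgrind --tool=callgrind ls -l

# After termination, a dump file "callgrind.out.<pid>" appears
# in the current directory:
ls callgrind.out.*
```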

This will collect information

  1. on memory accesses of your program, and if an access can be satisfied by loading from 1st/2nd level cache,
  2. on the calls made in your program among the functions executed.

If you are only interested in the first item, it's enough to use Cachegrind from Valgrind. If you are only interested in the second item, use Callgrind with the option "--simulate-cache=no". This will only count events of type Instruction Read Accesses, but it significantly speeds up profiling, typically by a factor of 2 or 3. If the program section you want to profile is somewhere in the middle of the run, it is beneficial to fast-forward to this section without any profiling at all, and switch profiling on later. This is achieved by starting with "--instr-atstart=no" and interactively running "callgrind_control -i on" before the interesting code section is about to be executed.
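The two speed-ups mentioned above could be combined as follows ("./myprog" is a hypothetical program name used for illustration):

```shell
# Skip the cache simulation: only instruction read accesses are
# counted, typically 2-3x faster than a full simulation run.
valgrind --tool=callgrind --simulate-cache=no ./myprog

# Start with instrumentation switched off, then enable it from a
# second shell just before the interesting section executes:
valgrind --tool=callgrind --instr-atstart=no ./myprog &
callgrind_control -i on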

2.2 Multiple dumps from one program run

Often, you aren't interested in the time characteristics of a full program run, but only of a small part of it (e.g. the execution of one algorithm). If there are multiple algorithms, or one algorithm runs with different input data, it can even be useful to get separate profile information for multiple parts of one program run.

The generated dump files are named

callgrind.out.pid.part-threadID

where pid is the PID of the running program, part is a number incremented on each dump (".part" is skipped for the dump at program termination), and threadID is a thread identification ("-threadID" is only used if you request dumps of individual threads).

There are different ways to generate multiple profile dumps while a program is running under the supervision of Callgrind. Still, all methods trigger the same action: "dump all profile information since the last dump or program start, and zero the cost counters afterwards". To allow for zeroing cost counters without dumping, there exists a second action: "zero all cost counters now". The different methods are: specifying dump-triggering functions with the options "--dump-before=function", "--dump-after=function", or "--zero-before=function" (see the option reference below); requesting a dump externally via callgrind_control; or inserting client requests into the program code.

If you are running a multi-threaded application and specify the command line option "--dump-threads=yes", every thread will be profiled on its own and will create its own profile dump. Thus, the last two methods will only generate one dump of the currently running thread. With the other methods, you will get multiple dumps (one for each thread) on a dump request.
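Externally triggered dumps can be requested from a second shell while the program runs under Callgrind (again using the hypothetical "./myprog"):

```shell
# Shell 1: start the profile run.
valgrind --tool=callgrind ./myprog &

# Shell 2: dump all profile data collected so far (counters are
# zeroed afterwards), producing a "callgrind.out.pid.part" file:
callgrind_control -d

# Zero all cost counters without dumping:
callgrind_control -z
```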

2.3 Limiting range of event collection

You can control for which part of your program you want to collect event costs by using --toggle-collect=funcprefix. This will toggle the collection state on entering and leaving a function. When this option is specified, the default collection state at program start is "off". Thus, only events happening while running inside of funcprefix will be collected. Recursive calls of funcprefix don't influence collection at all.
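For instance, to collect costs only inside one function of interest (the function name "compute" and the program "./myprog" are hypothetical placeholders):

```shell
# Collection starts "off" and is toggled on entering/leaving compute(),
# so only events inside compute() and its callees are collected.
valgrind --tool=callgrind --toggle-collect=compute ./myprog
```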

2.4 Avoiding cycles

A group of functions in which any two members are connected by a call chain is called a cycle. E.g. with A calling B, B calling C, and C calling A, the three functions A, B, C form one cycle.

If a call chain goes around a cycle multiple times, you can't distinguish costs coming from the first round from those of the second. Thus, it makes no sense to attach any cost to a call among functions in one cycle: if "A > B" appears multiple times in a call chain, you have no way to partition the one big sum over all appearances of "A > B". Thus, for profile data presentation, all functions of a cycle are seen as one big virtual function.

Unfortunately, if your application uses a callback mechanism (as any GUI program does), or even ordinary polymorphism (as in OO languages like C++), it is quite possible to get large cycles. As it is often impossible to say anything about performance behaviour inside cycles, it is useful to have mechanisms to avoid cycles in call graphs altogether. This is done either by treating the same function as different functions depending on the current execution context, giving them different names, or by ignoring calls to certain functions entirely.

There is an option to ignore calls to a function: "--fn-skip=funcprefix". E.g., you usually don't want to see the trampoline functions of the PLT sections, used for calls to functions in shared libraries; you can see the difference if you profile with "--skip-plt=no". If a call is ignored, its cost events are attached to the enclosing function.
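As an illustration ("my_log" is a hypothetical function-name prefix and "./myprog" a placeholder program):

```shell
# Ignore calls to functions whose names start with "my_log"
# (funcprefix matching); their costs go to the enclosing caller.
valgrind --tool=callgrind --fn-skip=my_log ./myprog

# For comparison, keep the PLT trampoline functions visible:
valgrind --tool=callgrind --skip-plt=no ./myprog
```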

If you have a recursive function, you can distinguish the first 10 recursion levels by specifying "--fn-recursion10=funcprefix", or for all functions with "--fn-recursion=10", but the latter will give you much bigger profile dumps. In the profile data, you will see the recursion levels of "func" as separate functions with the names "func", "func'2", "func'3" and so on.

If you have call chains "A > B > C" and "A > C > B" in your program, you usually get a "false" cycle "B <> C". Use "--fn-caller2=B --fn-caller2=C", and the functions "B" and "C" will be treated as different functions depending on the direct caller. Using the apostrophe for appending this "context" to the function name, you get "A > B'A > C'B" and "A > C'A > B'C", and there will be no cycle. Use "--fn-caller=3" to get a 2-caller dependency for all functions. Again, this will multiply the profile data size.
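Both context-separation mechanisms could be used like this ("fib", "B", "C", and "./myprog" are placeholders for your own function and program names):

```shell
# Separate the first 10 recursion levels of fib() into
# fib, fib'2, fib'3, ... in the profile data:
valgrind --tool=callgrind --fn-recursion10=fib ./myprog

# Treat B and C as different functions per direct caller,
# breaking the false cycle "B <> C":
valgrind --tool=callgrind --fn-caller2=B --fn-caller2=C ./myprog
```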

3. Command line option reference

--base=<prefix>

--simulate-cache=yes|no

--instr-atstart=yes|no

--collect-atstart=yes|no

--skip-plt=no|yes

--fn-skip=<function>

--fn-group<number>=<function>

--fn-recursion<number>=<function>

--fn-caller<number>=<function>

--dump-before=<function>

--zero-before=<function>

--dump-after=<function>

--toggle-collect=<function>

--fn-recursion=<level>

--fn-caller=<callers>

--mangle-names=no|yes

--dump-threads=no|yes

--compress-strings=no|yes

--dump-bbs=no|yes

--dumps=<count>

--dump-instr=no|yes

--trace-jump=no|yes

4. Profile data file format

The header has an arbitrary number of lines of the format "key: value". After the header, position specifications "spec=position" and cost lines can appear; a cost line starts with a number of position columns (as given by the "positions:" header field), followed by space-separated cost numbers. Empty lines are always allowed.

Possible key values for the header are:

As said above, there also exist lines of the form "spec=position". The values for position specifications are arbitrary strings. When starting with "(" and a digit, it's a string in compressed format; otherwise it's the real position string. This allows file and symbol names to be used as position strings, as these never start with "(" followed by a digit. The compressed format is either "(" number ")" space position, or only "(" number ")". The first form relates position to number, in the context of the given position specification, from this line to the end of the file; it makes "(number)" an alias for position. The compressed format is always optional.
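As an illustration, a minimal dump file mixing plain and compressed position specifications could look like the hand-written sketch below (the event name, file names, and costs are invented, not produced by a real run; "fl" names a source file, "fn" a function, per the position specifications listed next):

```shell
# Write a minimal hand-made profile data file. "(1)" introduces a
# compressed alias for the position string "file.c"; a later "fl=(1)"
# refers back to it. Cost lines are "<line> <cost>" since the header
# declares one position column ("positions: line") and one event.
cat > callgrind.out.example <<'EOF'
positions: line
events: Ir
fl=(1) file.c
fn=main
16 20
17 10
fn=helper
fl=(1)
3 5
EOF
cat callgrind.out.example
```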

Position specifications allowed: