Callgrind (previously named Calltree) is a Valgrind tool that runs applications under supervision to generate profiling data. Additionally, two command-line tools (Perl scripts) are provided. To start a profile run, specify "--tool=callgrind" on the Valgrind command line, or use the supplied script callgrind.
This tool is heavily based on the Cachegrind tool of the Valgrind package. Read the documentation of Cachegrind first; this page only describes the features supported in addition to Cachegrind's features.
Detailed technical documentation on how Callgrind works is available here. If you want to know how to use it, you only need to read this page.
This is done with a technique called profiling. The program is run under the control of a profiling tool, which gives you the time distribution among the functions executed in the run. After examining the program's profile, you probably know where to optimise, and afterwards you can verify the success of the optimisation with another profile run.
Another profiling tool is Cachegrind, part of Valgrind. It uses the processor emulation of Valgrind to run the executable, and catches all memory accesses for the trace. The user program does not need to be recompiled; it can use shared libraries and plugins, and the profile measuring doesn't influence the trace results. The trace includes the number of instruction/data memory accesses and 1st/2nd level cache misses, and relates them to source lines and functions of the run program. A disadvantage is the slowdown involved in the processor emulation: the program runs around 50 times slower.
Cachegrind can only deliver a flat profile. No call relationships among the functions of an application are stored. Thus, inclusive costs, i.e. costs of a function including the cost of all functions called from it, cannot be calculated. Callgrind extends Cachegrind by recording call relationships and the exact event counts spent while doing a call.
Because Callgrind is based on simulation, the slowdown due to some preprocessing of events while collecting does not influence the results. See the next chapter for more details on the possibilities.
This will collect information on
- the cache and memory access behaviour of your application, related to source lines and functions, and
- the call relationships among functions, together with call counts.
If you are only interested in the first item, it's enough to use Cachegrind from Valgrind. If you are only interested in the second item, use Callgrind with the option "--simulate-cache=no". This only counts events of type Instruction Read Access, but it significantly speeds up profiling, typically by a factor of 2 or 3. If the program section you want to profile is somewhere in the middle of the run, it is beneficial to fast forward to this section without any profiling at all, and switch profiling on later. This is achieved by using "--instr-atstart=no" and interactively running "callgrind_control -i on" just before the interesting code section is about to be executed.
The generated dump files are named
There are different ways to generate multiple profile dumps while a program is running under supervision of Callgrind. Still, all methods trigger the same action "dump all profile information since last dump or program start, and zero cost counters afterwards". To allow for zeroing cost counters without dumping, there exists a second action "zero all cost counters now". The different methods are:
If you are using KCachegrind for browsing of profile information, you can use the toolbar button "Force dump". This will create the file "cachegrind.cmd" and will trigger a reload after the dump is written.
You can specify these options multiple times for different function prefixes.
In Valgrind terminology, this mechanism is called "client requests". The given macros generate a special instruction pattern with no effect at all (i.e. a NOP). Only when run under Valgrind does the CPU simulation engine detect the special instruction pattern and trigger special actions like the ones described above.
If a call chain goes multiple times around inside of a cycle, you can't distinguish costs coming from the first round or the second. Thus, it makes no sense to attach any cost to a call among functions in one cycle: if "A > B" appears multiple times in a call chain, there is no way to partition the one big sum over all appearances of "A > B". Thus, for profile data presentation, all functions of a cycle are seen as one big virtual function.
Unfortunately, if you have an application using some callback mechanism (like any GUI program), or even with normal polymorphism (as in OO languages like C++), it's quite possible to get large cycles. As it is often impossible to say anything about performance behaviour inside of cycles, it is useful to introduce mechanisms to avoid cycles in call graphs altogether. This is done by treating the same function as different functions depending on the current execution context (giving them different names), or by ignoring calls to certain functions altogether.
There is an option to ignore calls to a function with "--fn-skip=funcprefix". E.g., you usually don't want to see the trampoline functions in the PLT sections used for calls into shared libraries; you can see the difference if you profile with "--skip-plt=no". If a call is ignored, its cost events are attached to the enclosing function.
If you have a recursive function, you can distinguish the first 10 recursion levels by specifying "--fn-recursion10=funcprefix". Or, for all functions, with "--fn-recursion=10", but this will give you much bigger profile dumps. In the profile data, you will see the recursion levels of "func" as the different functions with names "func", "func'2", "func'3" and so on.
If you have call chains "A > B > C" and "A > C > B" in your program, you usually get a "false" cycle "B <> C". Use "--fn-caller2=B --fn-caller2=C", and functions "B" and "C" will be treated as different functions depending on the direct caller. Using the apostrophe for appending this "context" to the function name, you get "A > B'A > C'B" and "A > C'A > B'C", and there will be no cycle. Use "--fn-caller=2" to get a two-caller dependency for all functions. Again, this will multiply the profile data size.
--base=<prefix>
This option is especially useful if your application changes its working directory. Usually, the dump file is generated in the current working directory of the application at program termination. By giving an absolute path with the base specification, you can force a fixed directory for the dump files.
--simulate-cache=yes|no
Note, however, that estimating how much real time your program will need from instruction read counts alone is impossible. Use this option if you want to find out how many times different functions are called and their call relations.
--instr-atstart=yes|no
For cache simulation, results will be a little bit off when switching on instrumentation later in the program run, as the simulator starts with an empty cache at that moment. To reduce this error, switch on event collection only some time after instrumentation, so that the simulated cache has warmed up.
--collect-atstart=yes|no
To only look at parts of your program, you have two possibilities:
Collection state can be toggled at entering and leaving of a given function with the option "--toggle-collect=<function>". For this, collection state should be switched off at the beginning. Note that the specification of --toggle-collect implicitly sets --collect-atstart=no.
Collection state can also be toggled by using a Valgrind client request in your application. For this, include valgrind/callgrind.h and place the macro CALLGRIND_TOGGLE_COLLECT at the needed positions. This will only have an effect when run under supervision of the Callgrind tool.
--skip-plt=no|yes
--fn-skip=<function>
Ignore calls to/from a given function? E.g. if you have a call chain A > B > C, and
you specify function B to be ignored, you will only see A > C.
This is very convenient to skip functions handling callback behaviour. E.g., for the SIGNAL/SLOT
mechanism in Qt, you only want to see the function emitting a signal to call the slots connected
to that signal. First, determine the real call chain to see the functions that need to be skipped,
then use this option.
--fn-group<number>=<function>
Put a function into separation group <number>.
--fn-recursion<number>=<function>
Separate <number> recursions for <function>
--fn-caller<number>=<function>
Separate <number> callers for <function>
--dump-before=<function>
Dump when entering <function>
--zero-before=<function>
Zero all costs when entering <function>
--dump-after=<function>
Dump when leaving <function>
--toggle-collect=<function>
Toggle collection on enter/leave <function>
--fn-recursion=<level>
Separate function recursions, maximal <level> [2]
--fn-caller=<callers>
Separate functions by callers [0]
--mangle-names=no|yes
Mangle separation into names? [yes]
--dump-threads=no|yes
Dump traces per thread? [no]
--compress-strings=no|yes
Compress strings in profile dump? [yes]
--dump-bbs=no|yes
Dump basic block info? [no]. This needs an update of the KCachegrind importer!
--dumps=<count>
Dump trace each <count> basic blocks [0=never]
--dump-instr=no|yes
This specifies the granularity of the profile information.
Note that if you dump at instruction level, ct_annotate currently
is not able to show you the data. You have to use KCachegrind to
get annotated disassembled code. [no]
--trace-jump=no|yes
This specifies whether information for (conditional) jumps
should be collected. Same as above, ct_annotate currently
is not able to show you the data. You have to use KCachegrind to
get jump arrows in the annotated code. [no]
4. Profile data file format
The header has an arbitrary number of lines of the format
"key: value". Afterwards, position specifications of the
form "spec=position" and cost lines can appear. A cost
line starts with a number of position columns (as given
by the "positions:" header field), followed by
space-separated cost numbers.
Empty lines are always allowed.
Possible key values for the header are:
- version: major.minor [Callgrind]
This is used to distinguish future trace file formats.
A major version of 0 or 1 is supposed to be upwards
compatible with Cachegrind 1.0.x format.
It is optional; if it does not appear, original Cachegrind
1.0.x format is assumed.
If present, this has to be the first header line.
- pid: process id [Callgrind]
This specifies the process ID of the supervised
application for which this profile was generated.
- cmd: program name + args [Cachegrind]
This specifies the full command line of the supervised
application for which this profile was generated.
- part: number [Callgrind]
This specifies a sequentially incremented number for
each dump generated, starting at 1.
- desc: type: value [Cachegrind]
This specifies various information for this dump.
For some types, the semantic is defined, but any
description type is allowed. Unknown types should
be ignored.
There are the types "I1 cache", "D1 cache", "L2 cache",
which specify parameters used for the cache simulator.
These are the only types originally used by Cachegrind.
Additionally, Callgrind uses the following types:
"Timerange" gives a rough range of the basic
block counter, for which the cost of this dump
was collected. Type "Trigger" states the reason
of why this trace was generated.
E.g. program termination or forced interactive dump.
- positions: [instr] [line] [Callgrind]
For cost lines, this defines the semantic of the
first numbers. Any combination of "instr", "bb" and
"line" is allowed, but has to be in this order which
corresponds to position numbers at the start of the
cost lines later in the file.
If "instr" is specified, the position is the address
of an instruction whose execution raised the events
given later on the line. This address is relative to
the load offset of the binary/shared library file, so
that no relocation info has to be specified.
For "line", the position is the line number of a
source file, which is responsible for the events
raised. Note that the mapping of "instr" and "line"
positions are given by the debugging line information
produced by the compiler.
This field is optional. If not specified, "line"
is assumed.
- events: event type abbreviations [Cachegrind]
A list of short names of the event types logged in this
file. The order is the same as in cost lines.
The first event type is the second or third number in a
cost line, depending on the value of "positions".
Callgrind does not add additional cost types.
Specify exactly once.
Cost types from original Cachegrind are
- Ir
Instruction read access
- I1mr
Instruction Level 1 read cache miss
- I2mr
Instruction Level 2 read cache miss
- ...
- summary: costs [Callgrind]
- totals: costs [Cachegrind]
The total number of events covered by this trace file.
Both keys have the same meaning, but the "totals:" line happens
to be at the end of the file, while "summary:" appears in the header.
The "summary:" line was added to allow postprocessing tools to know
the total cost in advance. The two lines always give the same cost counts.
As said above, there also exist lines "spec=position".
The values for position specifications are arbitrary
strings.
When starting with "(" and a digit, it's a string in
compressed format.
Otherwise it's the real position string. This allows for
file and symbol names as position strings, as these
never start with "(" + digit.
The compressed format is either "(" number ")"
space position, or only "(" number ")".
The first form associates the number with the position
from this line to the end of the file, making "(number)"
an alias for the position; the second form refers to a
position that was defined this way earlier.
Using the compressed format is always optional.
Position specifications allowed:
- ob= [Callgrind]
The ELF object where the cost of next cost lines happens.
- fl= [Cachegrind]
- fi= [Cachegrind]
- fe= [Cachegrind]
The source file including the code which is responsible for
the cost of next cost lines. "fi="/"fe=" is used when the source
file changes inside of a function, i.e. for inlined code.
- fn= [Cachegrind]
The name of the function where the cost of next cost lines happens.
- cob= [Callgrind]
The ELF object of the target of the next call cost lines.
- cfl= [Callgrind]
The source file including the code of the target of the
next call cost lines.
- cfn= [Callgrind]
The name of the target function of the next call cost lines.
- calls= [Callgrind]
The number of nonrecursive calls which are responsible for the cost
specified by the next cost line. After "calls=" there MUST be a cost
line; it gives the cost spent inside the called function, i.e. the
inclusive cost of these calls. The first number of that cost line is
the source line from where the call happened.
- jump=count target position [Callgrind]
Unconditional jump, executed count times, to the given target position.
- jcnd=exe.count jumpcount target position [Callgrind]
Conditional jump, executed exe.count times with jumpcount jumps
to the given target position.
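Putting the pieces together, a minimal hand-written example file (illustrative only; names and numbers are invented) using "positions: line" and a single "Ir" event type could look as follows:

```
version: 1
pid: 1234
cmd: ./example
part: 1
positions: line
events: Ir

fl=example.c
fn=main
16 20
cfn=square
calls=1 5
16 400

fn=square
5 400

totals: 420
```

Here main has a self cost of 20 instruction reads on line 16, calls square once (target at line 5), and that call contributes an inclusive cost of 400, raised at the call site on line 16; the "totals:" line sums the self costs of main and square (20 + 400 = 420).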