
Profiling and optimising with Flamegraph


FlameGraph is a tool that generates SVG graphics to visualise stack-sampling-based profiles. It processes data collected with tools such as Linux perf, SystemTap and DTrace.

For the impatient:
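A minimal recipe, using the perf-based pipeline detailed later in this post:

# Sample the application's stack at 99 Hz (on-CPU time):
perf record -F99 --call-graph dwarf myapp

# Turn the samples into an SVG flame graph:
perf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg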


Profiling by sampling the stack

The idea is that in order to know where your application is spending its CPU time, you should sample its stack. You can get one sample of the stack(s) of a process with GDB:

# Sample the stack of the main (first) thread of a process:
gdb -ex "set pagination 0" -ex "bt" -batch -p $(pidof okular)

# Sample the stack of all threads of the process:
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $(pidof okular)

This generates backtraces such as:

[...]
Thread 2 (Thread 0x7f4d7bd56700 (LWP 15156)):
#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028
#2  g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729
#3  0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795
#4  0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#5  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#6  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7  0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#8  0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#9  0x00007f4d9705c2bf in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f4d997ab780 (LWP 15150)):
#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028
#2  g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729
#3  0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795
#4  0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#5  0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4
#6  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#8  0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#9  0x00000000004082d6 in ?? ()
#10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000409181 in _start ()
[...]

By doing this a few times, you should get an idea of what is taking time in your process (or thread).

Using FlameGraph for visualising stack samples

Taking a few random stack samples of the process might be enough in some cases, but for more accurate information you will want to take a lot of stack samples. FlameGraph helps you visualise them.

How does FlameGraph work?

FlameGraph reads stack samples from its standard input in a simple format where each line represents one call stack and the number of samples for that stack:

main;init;init_boson_processor;malloc  2
main;init;init_logging;malloc          4
main;processing;compute_value          8
main;cleanup;free                      3

FlameGraph generates a corresponding SVG representation:

Example Flamegraph
Corresponding FlameGraph output
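If the sample data above is saved in a file (say stacks.folded, a name chosen here for illustration), the SVG can be produced directly, since flamegraph.pl reads this folded format on its standard input:

flamegraph.pl < stacks.folded > stacks.svg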

FlameGraph ships with a set of preprocessing scripts (stackcollapse-*.pl) used to convert data from various performance/profiling tools into this simple format, which means you can use FlameGraph with perf, DTrace, SystemTap or your own tool:

your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg

It is very easy to add support for a new tool with a few lines of script. I wrote a preprocessor for the GDB backtrace output (as produced by the poor man's profiler script shown below), which is now available in the main repository.
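For illustration, here is a rough sketch of what such a preprocessor can look like (a simplified stand-in for the real stackcollapse-gdb.pl, doing only minimal parsing of the GDB frame lines):

#!/usr/bin/env bash
# collapse-gdb-sketch.sh - fold "gdb ... -ex 'thread apply all bt'" output
# into the "frame1;frame2;...;frameN count" format expected by flamegraph.pl.
awk '
  function flush() {
    if (depth == 0) return
    # GDB prints the innermost frame first; FlameGraph wants the root first.
    stack = frames[depth]
    for (i = depth - 1; i >= 1; i--)
      stack = stack ";" frames[i]
    counts[stack]++
    depth = 0
  }
  /^#[0-9]+[ \t]/ {
    # Frame lines look like "#0  0xdeadbeef in poll () from ..." or
    # "#2  g_main_context_iterate (context=...) at ...".
    frames[++depth] = ($3 == "in") ? $4 : $2
    next
  }
  { flush() }   # any non-frame line (blank, "Thread N ...") ends the stack
  END {
    flush()
    for (s in counts)
      print s, counts[s]
  }
'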

As FlameGraph uses a tool-neutral line-oriented format, it is very easy to add generic filters after the preprocessor (using sed, grep, etc.):

the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg

Update 2015-08-22: Elfutils ships a stack program (called eu-stack on Debian) which seems to be much faster than GDB when used as a poor man's profiler in a shell script. I wrote a script to feed its output to FlameGraph.
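That script is not reproduced here, but the sampling loop is essentially the same as the GDB-based one shown below, with eu-stack doing the stack walking (a sketch with arbitrary sample count and interval):

# Sample all threads of a running process 500 times, every 0.1 s:
pid=$(pidof mycommand)
for i in $(seq 1 500); do
  eu-stack -p "$pid" 2> /dev/null
  sleep 0.1
done > mycommand.eu-stack

The resulting backtraces then need their own small stackcollapse-style preprocessor before being piped into flamegraph.pl.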

Using FlameGraph with perf

perf is a very powerful Linux tool for performance analysis of programs. For example, here is how we can generate an on-CPU FlameGraph of an application using perf:

# Use perf to do a time based sampling of an application (on-CPU):
perf record -F99 --call-graph dwarf myapp

# Turn the data into a cute SVG:
perf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg

This samples the on-CPU time, excluding time when the process is not scheduled (idle, waiting on a semaphore, etc.), which may not be what you want. It is possible to sample off-CPU time as well with perf.
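One possible approach (a sketch, not the command used in this post: it traces scheduler tracepoints, usually needs root, and weights stacks by the number of context switches rather than by time spent blocked):

# Take a stack sample at each context switch of the process, for 10 seconds:
perf record -e sched:sched_switch --call-graph dwarf -p $(pidof myapp) -- sleep 10
perf script | stackcollapse-perf.pl | flamegraph.pl > myapp-offcpu.svg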

The simple and fast solution[1] is to use the frame pointer to unwind the stack frames (--call-graph fp). However, the frame pointer tends to be omitted these days (it is not mandated by the x86_64 ABI): this might not work very well unless you recompile your code and its dependencies without omitting the frame pointer (-fno-omit-frame-pointer).

Another solution is to use CFI to unwind the stack (with --call-graph dwarf): this uses either the DWARF CFI (.debug_frame section) or the runtime stack-unwinding data (.eh_frame section). The CFI must be present in the application and its shared objects (compiled with -fasynchronous-unwind-tables or -g). On x86_64, .eh_frame should be enabled by default.
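As a generic illustration (not project-specific commands), the two unwinding strategies map to different build flags:

# Frame-pointer unwinding: keep the frame pointer when building the code
# (and ideally its dependencies):
gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c
perf record -F99 --call-graph fp ./myapp

# DWARF/CFI unwinding: make sure unwind tables or debug info are present:
gcc -O2 -g -fasynchronous-unwind-tables -o myapp myapp.c
perf record -F99 --call-graph dwarf ./myapp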

Update 2015-09-19: Another solution on recent Intel chips (and recent kernels) is to use the hardware LBR (Last Branch Record) registers (with --call-graph lbr).
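The invocation is the same as before, only the unwinding method changes (assuming the hardware and kernel support LBR):

perf record -F99 --call-graph lbr myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg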

Transforming and filtering the data

As FlameGraph uses a simple line-oriented format, it is very easy to filter or transform the data by placing a filter between the stackcollapse preprocessor and FlameGraph:

# I am only interested in what is happening in MAIN():
perf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg

# I am not interested in what is happening in init():
perf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg

# Let's pretend that realloc() is the same thing as malloc():
perf script | stackcollapse-perf.pl | sed 's/realloc/malloc/g' | flamegraph.pl > alloc.svg

If you have recursive calls you might want to merge them in order to have a more readable view. This is implemented in my branch by stackfilter-recursive.pl:

# I want to merge recursive calls:
perf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl

Update 2015-10-16: this has been merged upstream.

Using FlameGraph with the poor man's profiler (based on GDB)

Sometimes you might not be able to get relevant information with perf. This might be because you do not have debugging symbols for some of the libraries you are using, so you end up with missing information in the stack traces. In this case, you might want to use GDB instead, with the poor man's profiler method, because GDB tends to be better at unwinding the stack when frame pointers and debugging information are missing:

# Sample an already running process:
pmp 500 0.1 $(pidof mycommand) > mycommand.gdb

# Or:
mycommand my_arguments &
pmp 500 0.1 $!

# Generate the SVG:
cat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg

Where pmp is a poor man's profiler script such as:

#!/bin/bash
# pmp - "Poor man's profiler" - Inspired by http://poormansprofiler.org/
# See also: http://dom.as/tag/gdb/

nsamples=$1
sleeptime=$2
pid=$3

# Sample stack traces:
for x in $(seq 1 $nsamples); do
  gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid 2> /dev/null
  sleep $sleeptime
done

Using this technique will slow the application down a lot.

Compared to the example with perf, this approach samples both on-CPU and off-CPU time.

A real world example of optimisation with FlameGraph

Here are some figures obtained when I was optimising the Simgrid model checker on a given application, using the poor man's profiler to sample the stack.

Here is the original profile before optimisation:

FlameGraph before optimisation

Avoid looking up data in a hash table

82% of the time is spent in get_type_description(). In fact, the model checker spends its time looking up type descriptions in hash tables over and over again.

Let's fix this and store a pointer to the type description instead of a type identifier, in order to avoid looking up those types over and over again:

FlameGraph after avoiding the type lookups

Cache the memory areas addresses

After this modification, 32% of the time is spent in libunwind's get_proc_name() (looking up function names from values of the instruction pointer) and 13% is spent reading and parsing the output of cat /proc/self/maps over and over again (in xbt_getline()). Let's fix the second issue first because it is simple: we can cache the memory mapping of the process in order to avoid parsing /proc/self/maps all the time.

FlameGraph after caching the /proc/self/maps output

Speed up function resolution

Now, let's fix the other issue by resolving the functions ourselves. It turns out we already had the address range of each function in memory (parsed from the DWARF information). All we have to do is a binary search in order to get a nice O(log n) lookup[2].

FlameGraph after optimising the function lookups

Avoid looking up data in a hash table (again)

Still, 17% of the time is spent looking up type descriptions from type identifiers in a hash table. Let's store a reference to the type descriptions and avoid this:

FlameGraph after avoiding some remaining type lookups

Result

The non-optimised version was taking 2 minutes to complete. With those optimisations, it takes only 6 seconds 😮. There is still room for optimisation here as 30% of the time is now spent in malloc()/free() managing heap information.

Remaining stuff

Sampling other events

Perf can sample many other kinds of events (hardware performance counters, software performance counters, tracepoints, etc.). You can get the list of available events with perf list. If you run it as root, you will see a lot more events (all the kernel tracepoints).

Here are some interesting events:

More information about some perf events can be found in perf_event_open(2).

You can then sample an event with:

perf record --call-graph dwarf -e cache-misses myapp
[FlameGraph for cache misses]
FlameGraph of cache misses

Ideas

Extra tips




  1. When using frame pointer unwinding, the kernel unwinds the stack itself and only gives the instruction pointer of each frame to perf record. This behaviour is triggered by the PERF_SAMPLE_CALLCHAIN sample type.

    When using DWARF unwinding, the kernel takes a snapshot of (a part of) the stack and gives it to perf record: perf record stores it in a file and the DWARF unwinding is done afterwards by the perf tools. This uses PERF_SAMPLE_STACK_USER. PERF_SAMPLE_CALLCHAIN is used as well, but only for the kernel-side stack (exclude_callchain_user). ↩︎

  2. Cache friendliness could probably be better however. See for example Cache-friendly binary search. ↩︎