{"version": "https://jsonfeed.org/version/1", "title": "/dev/posts/ - Tag index - optimisation", "home_page_url": "https://www.gabriel.urdhr.fr", "feed_url": "/tags/optimisation/feed.json", "items": [{"id": "http://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "title": "Profiling and optimising with Flamegraph", "url": "https://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "date_published": "2014-05-23T00:00:00+02:00", "date_modified": "2014-05-23T00:00:00+02:00", "tags": ["simgrid", "optimisation", "profiling", "computer", "flamegraph", "unix", "gdb", "perf"], "content_html": "

FlameGraph\nis a tool which generates SVG graphics\nto visualise stack-sampling-based\nprofiles. It processes data collected with tools such as Linux perf,\nSystemTap or DTrace.

\n

For the impatient:

\n\n
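A minimal perf-based pipeline (each step is detailed later in this\npost):

\n
# Sample the stacks of the application, on-CPU, at 99 Hz:\nperf record -F99 --call-graph dwarf myapp\n\n# Turn the samples into a FlameGraph SVG:\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg\n
\n\n\n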

Table of Contents

\n
\n\n
\n

Profiling by sampling the stack

\n

The idea is that in order to know where your application is using CPU\ntime, you should sample its stack. You can get one sample of the\nstack(s) of a process with GDB:

\n
# Sample the stack of the main (first) thread of a process:\ngdb -ex \"set pagination 0\" -ex \"bt\" -batch -p $(pidof okular)\n\n# Sample the stack of all threads of the process:\ngdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $(pidof okular)\n
\n\n\n

This generates backtraces such as:

\n
[...]\nThread 2 (Thread 0x7f4d7bd56700 (LWP 15156)):\n#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2  g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3  0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4  0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#6  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7  0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8  0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9  0x00007f4d9705c2bf in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0\n#11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6\n\nThread 1 (Thread 0x7f4d997ab780 (LWP 15150)):\n#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2  g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3  0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4  0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5  0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4\n#6  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8  0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9  0x00000000004082d6 in ?? ()\n#10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6\n#11 0x0000000000409181 in _start ()\n[...]\n
\n\n\n

By doing this a few times, you should get a rough idea of\nwhat's taking time in your process (or thread).

\n

Using FlameGraph for visualising stack samples

\n

Taking a few random stack samples of the process might be enough in\nsome cases, but for more accurate information you will want to take a\nlot of stack samples. FlameGraph helps you visualise those stack\nsamples.

\n

How does FlameGraph work?

\n

FlameGraph reads a file from the standard input representing stack\nsamples in a simple format where each line represents a type of stack\nand the number of samples:

\n
main;init;init_boson_processor;malloc  2\nmain;init;init_logging;malloc          4\nmain;processing;compute_value          8\nmain;cleanup;free                      3\n
\n\n\n

FlameGraph generates a corresponding SVG representation:

\n
\n\n \"[corresponding\n\n
Corresponding FlameGraph output
\n
\n\n

FlameGraph ships with a set of preprocessing scripts\n(stackcollapse-*.pl) used to convert data from various\nperformance/profiling tools into this simple format,\nwhich means you can use FlameGraph with perf, DTrace,\nSystemTap or your own tool:

\n
your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg\n
\n\n\n

It is very easy to add support for a new tool in a few lines of\nscript. I wrote a\npreprocessor\nfor the GDB backtrace output (produced by the previous poor man's\nprofiler script), which is now available\nin the main repository.

\n
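Here is, for illustration, a minimal sketch of what such a\npreprocessor can look like for GDB backtraces, written as an awk\nfilter (illustrative only; the real stackcollapse-gdb.pl handles more\ncorner cases):

\n
# Sketch: fold GDB backtraces into FlameGraph's input format.\n# Frames look like: #0  0xADDR in func ()  or  #1  func (args) at file.c:42\nawk '\n  /^#[0-9]+/ {                             # one stack frame per line\n      frames[n++] = ($2 ~ /^0x/) ? $4 : $2  # extract the function name\n      next\n  }\n  (/^Thread/ || /^$/) && n > 0 {           # a new thread header or an empty\n      s = frames[--n]                      # line ends the stack: fold it,\n      while (n > 0) s = s \";\" frames[--n]  # outermost frame first\n      count[s]++\n  }\n  END {                                    # flush the last stack and\n      if (n > 0) {                         # print the folded counts\n          s = frames[--n]\n          while (n > 0) s = s \";\" frames[--n]\n          count[s]++\n      }\n      for (s in count) print s, count[s]\n  }' mycommand.gdb\n
\n\n\n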

As FlameGraph uses a tool-neutral line-oriented format, it is very\neasy to add generic filters after the preprocessor (using sed,\ngrep\u2026):

\n
the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg\n
\n\n\n

Update 2015-08-22:\nElfutils ships a stack program\n(called eu-stack on Debian) which seems to be much faster than GDB\nwhen used as a poor man's profiler in a shell script. I wrote a\nscript in order to feed its output to\nFlameGraph.

\n
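A sampling loop based on eu-stack might look like this (a sketch;\neu-stack's output format differs from GDB's, hence the dedicated\npreprocessing script):

\n
# Sample the stacks of all the threads of a process with elfutils' eu-stack:\nnsamples=500\nsleeptime=0.1\npid=$(pidof mycommand)\n\nfor x in $(seq 1 $nsamples); do\n  eu-stack -p $pid 2> /dev/null\n  sleep $sleeptime\ndone > mycommand.stacks\n
\n\n\n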

Using FlameGraph with perf

\n

perf is a very powerful Linux tool for analysing the performance of\nprograms. For example, here's how we can generate an\non-CPU\nFlameGraph of an application using perf:

\n
# Use perf to do a time based sampling of an application (on-CPU):\nperf record -F99 --call-graph dwarf myapp\n\n# Turn the data into a cute SVG:\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg\n
\n\n\n

This samples the on-CPU time, excluding time when the process is not\nscheduled (idle, waiting on a semaphore\u2026), which may not be what you\nwant. It is possible to sample\noff-CPU\ntime as well with\nperf.

\n
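For example, a rough way to look at where a process blocks is to record\nscheduler tracepoints instead of time-based samples (a sketch;\ntracepoints usually require root, and see the linked article for a\ncomplete off-CPU method):

\n
# Record scheduler switch events, with call graphs, for 10 seconds:\nperf record -e sched:sched_switch --call-graph dwarf -p $(pidof myapp) sleep 10\n\n# Generate the FlameGraph as usual:\nperf script | stackcollapse-perf.pl | flamegraph.pl > offcpu.svg\n
\n\n\n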

The simple and fast solution1 is to use the frame pointer\nto unwind the stack frames (--call-graph fp). However, the frame pointer\ntends to be omitted these days (it is not mandated by the x86_64 ABI):\nunwinding might not work very well unless you recompile the code and its\ndependencies without omitting the frame pointer\n(-fno-omit-frame-pointer).

\n
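For example (assuming a C application, myapp, that you can rebuild):

\n
# Rebuild without omitting the frame pointer:\ngcc -O2 -fno-omit-frame-pointer -o myapp myapp.c\n\n# Use the (fast) frame-pointer-based unwinder:\nperf record -F99 --call-graph fp ./myapp\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp-fp.svg\n
\n\n\n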

Another solution is to use CFI to unwind the stack (with --call-graph\ndwarf): this uses either the DWARF CFI (.debug_frame section) or\nthe runtime stack-unwinding data (.eh_frame section). The CFI must be\npresent in the application and its shared objects (compiled with\n-fasynchronous-unwind-tables or -g). On x86_64, .eh_frame should\nbe enabled by default.

\n
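A quick way to check whether a binary and its libraries carry unwind\ntables before using --call-graph dwarf:

\n
# Look for the .eh_frame/.debug_frame sections:\nreadelf -S myapp | grep -E 'eh_frame|debug_frame'\nreadelf -S /lib/x86_64-linux-gnu/libc.so.6 | grep -E 'eh_frame|debug_frame'\n
\n\n\n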

Update 2015-09-19: Another solution on recent Intel chips (and\nrecent kernels) is to use the hardware LBR\nregisters (with --call-graph\nlbr).

\n
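A sketch (assuming a CPU and a kernel with LBR support):

\n
# Use the hardware last-branch-record registers for the call graph:\nperf record -F99 --call-graph lbr myapp\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp-lbr.svg\n
\n\n\n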

Transforming and filtering the data

\n

As FlameGraph uses a simple line-oriented format, it is very easy to\nfilter/transform the data by placing a filter between the\nstackcollapse preprocessor and FlameGraph:

\n
# I'm only interested in what's happening in MAIN():\nperf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg\n\n# I'm not interested in what's happening in init():\nperf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg\n\n# Let's pretend that realloc() is the same thing as malloc():\nperf script | stackcollapse-perf.pl | sed s/realloc/malloc/ | flamegraph.pl > alloc.svg\n
\n\n\n

If you have recursive calls you might want to merge them in order to\nhave a more readable view. This is implemented in my\nbranch\nby stackfilter-recursive.pl:

\n
# I want to merge recursive calls:\nperf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl\n
\n\n\n

Update 2015-10-16: this has been merged upstream.

\n

Using FlameGraph with the poor man's profiler (based on GDB)

\n

Sometimes you might not be able to get relevant information with\nperf. This might be because you do not have debugging symbols for\nsome libraries you are using: you will end up with missing\ninformation in the stack trace. In this case, you might want to use GDB\ninstead, using the poor man's profiler\nmethod, because it tends to be better at unwinding the stack without\nframe pointers and debugging information:

\n
# Sample an already running process:\npmp 500 0.1 $(pidof mycommand) > mycommand.gdb\n\n# Or:\nmycommand my_arguments &\npmp 500 0.1 $!\n\n# Generate the SVG:\ncat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg\n
\n\n\n

Where pmp is a poor man's profiler script such as:

\n
#!/bin/bash\n# pmp - \"Poor man's profiler\" - Inspired by http://poormansprofiler.org/\n# See also: http://dom.as/tag/gdb/\n\nnsamples=$1\nsleeptime=$2\npid=$3\n\n# Sample stack traces:\nfor x in $(seq 1 $nsamples); do\n  gdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $pid 2> /dev/null\n  sleep $sleeptime\ndone\n
\n\n\n

Using this technique will slow down the application a lot.

\n

Compared to the example with perf, this approach samples both on-CPU\nand off-CPU time.

\n

A real world example of optimisation with FlameGraph

\n

Here are some figures obtained when I was optimising the\nSimgrid\nmodel checker\non a given application\nusing the poor man's profiler to sample the stack.

\n

Here is the original profile before optimisation:

\n
[Image: FlameGraph before optimisation]
\n
\n\n

Avoid looking up data in a hash table

\n

Nearly 65% of the time is spent in get_type_description(). In fact, the\nmodel checker spends its time looking up type descriptions in some hash tables\nover and over again.

\n

Let's fix this and store a pointer to the type description instead of\na type identifier in order to avoid looking up those types over\nand over again:

\n
\n\n \"[profile\n\n
FlameGraph after avoiding the type lookups
\n
\n\n

Cache the memory areas addresses

\n

After this modification,\n32% of the time is spent in libunwind's get_proc_name() (looking up\nfunction names from given values of the instruction pointer) and\n12% is spent reading and parsing the output of cat\n/proc/self/maps over and over again. Let's fix the second issue first\nbecause it is simple: we cache the memory mapping of the process in\norder to avoid parsing /proc/self/maps all the time.

\n
\n\n \"[profile\n\n
FlameGraph after caching the /proc/self/maps output
\n
\n\n

Speed up function resolution

\n

Now, let's fix the other issue by resolving the functions\nourselves. It turns out we already had the address range of each function\nin memory (parsed from the DWARF information). All we have to do is use a\nbinary search in order to get a nice O(log n) lookup.

\n
\n\n \"[profile\n\n
FlameGraph after optimising the function lookups
\n
\n\n

Avoid looking up data in a hash table (again)

\n

10% of the time is still spent looking up type descriptions from type\nidentifiers in hash tables. Let's store references to the type\ndescriptions and avoid this as well:

\n
\n\n \"profile\n\n
FlameGraph after avoiding some remaining type lookups
\n
\n\n

Result

\n

The non-optimised version was taking 2 minutes to complete. With\nthose optimisations, it takes only 6 seconds \ud83d\ude2e. There is\nstill room for optimisation here, as 30% of the time is now spent in\nmalloc()/free() managing heap information.

\n

Remaining stuff

\n

Sampling other events

\n

Perf can sample many other kinds of events (hardware performance\ncounters, software performance counters, tracepoints\u2026). You can get\nthe list of available events with perf list. If you run it as\nroot you will see a lot more events (all the kernel tracepoints).

\n

Here are some interesting events:

\n\n

More information about some perf events can be found in\nperf_event_open(2).

\n

You can then sample an event with:

\n
perf record --call-graph dwarf -e cache-misses myapp\n
\n\n\n
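And generate the SVG as before:

\n
perf script | stackcollapse-perf.pl | flamegraph.pl > cache-misses.svg\n
\n\n\n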
\n\n \"[FlameGraphe\n\n
FlameGraph of cache misses
\n
\n\n

Ideas

\n\n

Extra tips

\n\n

References

\n\n
\n
\n
    \n
  1. \n

    When using frame pointer unwinding, the kernel unwinds the stack\nitself and only gives the instruction pointer of each frame to\nperf record. This behaviour is triggered by the\nPERF_SAMPLE_CALLCHAIN sample type.

    \n

When using DWARF unwinding, the kernel takes a snapshot of (a\npart of) the stack and gives it to perf record: perf record\nstores it in a file and the DWARF unwinding is done afterwards by\nthe perf tools. This uses\nPERF_SAMPLE_STACK_USER. PERF_SAMPLE_CALLCHAIN is used as well,\nbut for the kernel-side stack (exclude_callchain_user).\u00a0\u21a9

    \n
\n
"}]}