{"version": "https://jsonfeed.org/version/1", "title": "/dev/posts/ - Tag index - optimisation", "home_page_url": "https://www.gabriel.urdhr.fr", "feed_url": "/tags/optimisation/feed.json", "items": [{"id": "http://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "title": "Profiling and optimising with Flamegraph", "url": "https://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "date_published": "2014-05-23T00:00:00+02:00", "date_modified": "2014-05-23T00:00:00+02:00", "tags": ["simgrid", "optimisation", "profiling", "computer", "flamegraph", "unix", "gdb", "perf"], "content_html": "
FlameGraph is a tool which generates SVG graphics to visualise stack-sampling-based profiles. It processes data collected with tools such as Linux perf, SystemTap and DTrace.
For the impatient:
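Here is a minimal end-to-end sketch using perf (the details are explained below; it assumes the FlameGraph scripts are in your `PATH` and `myapp` stands for the program to profile):

```sh
# Sample the stacks of the application 99 times per second (on-CPU):
perf record -F99 --call-graph dwarf myapp

# Fold the samples and render the SVG:
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
```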
The idea is that in order to know where your application is spending its CPU time, you should sample its stack. You can get one sample of the stack(s) of a process with GDB:
```sh
# Sample the stack of the main (first) thread of a process:
gdb -ex "set pagination 0" -ex "bt" -batch -p $(pidof okular)

# Sample the stacks of all threads of the process:
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $(pidof okular)
```

This generates backtraces such as:

```
[...]
Thread 2 (Thread 0x7f4d7bd56700 (LWP 15156)):
#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028
#2  g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729
#3  0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795
#4  0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#5  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#6  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7  0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#8  0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#9  0x00007f4d9705c2bf in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f4d997ab780 (LWP 15150)):
#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028
#2  g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729
#3  0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795
#4  0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#5  0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4
#6  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#8  0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#9  0x00000000004082d6 in ?? ()
#10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000409181 in _start ()
[...]
```
By doing this a few times, you should get an idea of what is taking time in your process (or thread).

Taking a few random stack samples of the process might be fine and help you in some cases, but in order to get more accurate information you might want to take a lot of stack samples. FlameGraph helps you visualise those stack samples.
FlameGraph reads a file from its standard input representing the stack samples in a simple format where each line represents a stack and its number of samples:
```
main;init;init_boson_processor;malloc 2
main;init;init_logging;malloc 4
main;processing;compute_value 8
main;cleanup;free 3
```
FlameGraph generates a corresponding SVG visualisation of those stacks.
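Assuming those sample lines are stored in a file (say `stacks.folded`, a hypothetical name), the graph can be rendered directly:

```sh
flamegraph.pl stacks.folded > graph.svg
```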
FlameGraph ships with a set of preprocessing scripts (`stackcollapse-*.pl`) used to convert the data from various performance/profiling tools into this simple format, which means you can use FlameGraph with perf, DTrace, SystemTap or your own tool:
```sh
your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg
```
It is very easy to add support for a new tool with a few lines of script. I wrote a preprocessor for the GDB backtrace output (produced by the poor man's profiler script shown below), which is now available in the main repository.
As FlameGraph uses a tool-neutral line-oriented format, it is very easy to add generic filters after the preprocessor (using `sed`, `grep`, etc.):
```sh
the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg
```
Update 2015-08-22: elfutils ships a `stack` program (called `eu-stack` on Debian) which seems to be much faster than GDB when used as a poor man's profiler in a shell script. I wrote a script in order to feed its output to FlameGraph.
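Such a sampling loop might look like this (a sketch; the sample count, interval and process name are placeholders to adapt):

```sh
#!/bin/sh
# Repeatedly dump the stacks of a running process with eu-stack (elfutils):
pid=$(pidof myapp)
for i in $(seq 1 500); do
    eu-stack -p "$pid"
    sleep 0.1
done > myapp.stacks
```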
perf is a very powerful Linux tool for doing performance analysis of programs. For example, here is how we can generate an on-CPU FlameGraph of an application using perf:
```sh
# Use perf to do time-based sampling of an application (on-CPU):
perf record -F99 --call-graph dwarf myapp

# Turn the data into a cute SVG:
perf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg
```
This samples the on-CPU time, excluding the time when the process is not scheduled (idle, waiting on a semaphore, etc.), which may not be what you want. It is possible to sample off-CPU time as well with perf.
The simple and fast solution[1] is to use the frame pointer to unwind the stack frames (`--call-graph fp`). However, the frame pointer tends to be omitted these days (it is not mandated by the x86_64 ABI): it might not work very well unless you recompile the code and its dependencies without omitting the frame pointer (`-fno-omit-frame-pointer`), as sketched below.
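For example (a sketch; `myapp.c` stands for your own code and build system):

```sh
# Rebuild with the frame pointer kept in every function:
gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c

# The fast frame-pointer unwinder can then be used:
perf record -F99 --call-graph fp myapp
```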
Another solution is to use CFI to unwind the stack (with `--call-graph dwarf`): this uses either the DWARF CFI (`.debug_frame` section) or the runtime stack-unwinding data (`.eh_frame` section). The CFI must be present in the application and in the shared objects (compiled with `-fasynchronous-unwind-tables` or `-g`). On x86_64, `.eh_frame` should be enabled by default.
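You can check whether a given binary carries this unwind information by looking for those sections (one way to do it):

```sh
# Look for CFI sections in the binary:
readelf -S myapp | grep -E '\.eh_frame|\.debug_frame'
```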
Update 2015-09-19: another solution on recent Intel chips (and recent kernels) is to use the hardware LBR (Last Branch Record) registers (with `--call-graph lbr`).
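Usage is the same as for the other unwinding modes (on supported hardware):

```sh
perf record -F99 --call-graph lbr myapp
```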
As FlameGraph uses a simple line-oriented format, it is very easy to filter/transform the data by placing a filter between the `stackcollapse` preprocessor and FlameGraph:
```sh
# I am only interested in what is happening in MAIN():
perf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg

# I am not interested in what is happening in init():
perf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg

# Let's pretend that realloc() is the same thing as malloc():
perf script | stackcollapse-perf.pl | sed 's/realloc/malloc/' | flamegraph.pl > alloc.svg
```
If you have recursive calls, you might want to merge them in order to get a more readable view. This is implemented in my branch by `stackfilter-recursive.pl`:

```sh
# I want to merge recursive calls:
perf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl > MAIN.svg
```

Update 2015-10-16: this has been merged upstream.
Sometimes you might not be able to get relevant information with perf. This might be because you do not have debugging symbols for some libraries you are using: you will end up with missing information in the stack trace. In this case, you might want to use GDB instead, using the poor man's profiler method, because it tends to be better at unwinding the stack when frame pointers and debugging information are missing:
```sh
# Sample an already running process:
pmp 500 0.1 $(pidof mycommand) > mycommand.gdb

# Or:
mycommand my_arguments &
pmp 500 0.1 $! > mycommand.gdb

# Generate the SVG:
cat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg
```
where `pmp` is a poor man's profiler script such as:
```sh
#!/bin/bash
# pmp - "Poor man's profiler" - Inspired by http://poormansprofiler.org/
# See also: http://dom.as/tag/gdb/

nsamples=$1
sleeptime=$2
pid=$3

# Sample stack traces:
for x in $(seq 1 $nsamples); do
    gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid 2> /dev/null
    sleep $sleeptime
done
```
Using this technique will slow the application down a lot.

Compared to the perf example, this approach samples both on-CPU and off-CPU time.
Here are some figures obtained when I was optimising the SimGrid model checker on a given application, using the poor man's profiler to sample the stack.

Here is the original profile, before optimisation:
82% of the time is spent in `get_type_description()`. In fact, the model checker spends its time looking up type descriptions in some hash tables over and over again.
Let's fix this and store a pointer to the type description, instead of a type identifier, in order to avoid looking those types up over and over again:
After this modification, 32% of the time is spent in the libunwind `get_proc_name()` function (looking up function names from given values of the instruction pointer) and 13% is spent reading and parsing the output of `cat /proc/self/maps` over and over again (in `xbt_getline()`). Let's fix the second issue first because it is simple: we can cache the memory mapping of the process in order to avoid parsing `/proc/self/maps` all the time.
Now, let's fix the other issue by resolving the function names ourselves. It turns out we already had the address range of each function in memory (parsed from the DWARF information). All we have to do is use a binary search to get a nice O(log n) lookup[2].
Still, 17% of the time is spent looking up type descriptions from type identifiers in a hash table. Let's store a reference to the type descriptions and avoid this as well:
The non-optimised version was taking 2 minutes to complete. With those optimisations, it takes only 6 seconds 😮. There is still room for optimisation here, as 30% of the time is now spent in `malloc()`/`free()` managing heap information.
Perf can sample many other kinds of events (hardware performance counters, software performance counters, tracepoints, etc.). You can get the list of available events with `perf list`. If you run it as root, you will see a lot more events (all the kernel tracepoints).
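For example (the glob argument is optional):

```sh
# List the events available to the current user:
perf list

# As root, the kernel tracepoints are listed as well;
# a glob can be used to filter the list:
sudo perf list 'sched:*'
```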
Here are some interesting events:

* `cache-misses`: in general, last-level cache misses (the data is not in any cache and must be fetched from RAM, which is much slower);
* `page-faults`.

More information about some perf events can be found in `perf_event_open(2)`.
You can then sample an event with:

```sh
perf record --call-graph dwarf -e cache-misses myapp
```
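The resulting data feeds into the same pipeline as before, giving a flame graph of the code paths that trigger the cache misses:

```sh
perf script | stackcollapse-perf.pl | flamegraph.pl > cache-misses.svg
```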
A few additional notes:

* If you get mangled C++ names (such as `_ZTSSt9bad_alloc@@GLIBCXX_3.4`), `c++filt` can be used after the `stackcollapse` script to demangle them (see the example below).
* A reversed flame graph can be generated with the `--reverse` flag of `flamegraph.pl`.
* Off-CPU time can be sampled with perf as well.
* The kernel stack of a process can be found in `/proc/$pid/stack`.
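The demangling filter slots into the pipeline like any other (a sketch of the first note above):

```sh
perf script | stackcollapse-perf.pl | c++filt | flamegraph.pl > demangled.svg
```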
[1] When using frame-pointer unwinding, the kernel unwinds the stack itself and only gives the instruction pointer of each frame to `perf record`. This behaviour is triggered by the `PERF_SAMPLE_CALLCHAIN` sample type.

When using DWARF unwinding, the kernel takes a snapshot of (a part of) the stack and gives it to `perf record`: `perf record` stores it in a file and the DWARF unwinding is done afterwards by the perf tools. This uses `PERF_SAMPLE_STACK_USER`. `PERF_SAMPLE_CALLCHAIN` is used as well, but only for the kernel-side stack (`exclude_callchain_user`).
[2] Cache friendliness could probably be better, however. See for example Cache-friendly binary search.