SimGrid model checker);
then the remaining stuff.
The idea is that in order to know where your application is using CPU time, you should sample its stack. You can get one sample of the stack(s) of a process with GDB:
# Sample the stack of the main (first) thread of a process: gdb -ex "set pagination 0" -ex "bt" -batch -p $(pidof okular) # Sample the stack of all threads of the process: gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $(pidof okular)
This generates backtraces such as:
[...] Thread 2 (Thread 0x7f4d7bd56700 (LWP 15156)): #0 0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028 #2 g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729 #3 0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795 #4 0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #5 0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #6 0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #7 0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #8 0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #9 0x00007f4d9705c2bf in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6 Thread 1 (Thread 0x7f4d997ab780 (LWP 15150)): #0 0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028 #2 g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729 #3 0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795 #4 0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #5 0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4 #6 0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #7 0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #8 0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4 #9 0x00000000004082d6 in ?? () #10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #11 0x0000000000409181 in _start () [...]
By doing this a few times, you should be able to have an idea of what's taking time in your process (or thread).
Taking a few random stack samples of the process might be fine and help you in some cases but in order to have more accurate information, you might want to take a lot of stack samples. FlameGraph can help you visualize those stack samples.
FlameGraph reads a file from the standard input representing stack samples in a simple format where each line represents a type of stack and the number of samples:
main;init;init_boson_processor;malloc 2 main;init;init_logging;malloc 4 main;processing;compyte_value 8 main;cleanup;free 3
FlameGraph generates a corresponding SVG representation:
FlameGraph ships with a set of preprocessing scripts (
stackcollapse-*.pl) used to convert data from various performance/profiling tools into this simple format which means you can use FlameGraph with perf, DTrace, SystemTap or your own tool:
your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg
It is very easy to add support for a new tool in a few lines of scripts. I wrote a preprocessor for the GDB
backtrace output (produced by the previous poor man's profiler script) which is now available in the main repository.
As FlameGraph uses a tool-neutral line-oriented format, it is very easy to add generic filters after the preprocessor (using
the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg
Update 2015-08-22: Elfutils ships a
stack program (called
eu-stack on Debian) which seems to be much faster than GDB for using as a Poor man's Profiler in a shell script. I wrote a script in order to feed its output to FlameGraph.
perf is a very powerful tool for Linux to do performance analysis of programs. For example, here's how we can generate a on-CPU FlameGraph of an application using perf:
# Use perf to do a time based sampling of an application (on-CPU): perf record -F99 --call-graph dwarf myapp # Turn the data into a cute SVG: perf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg
This samples the on-CPU time, excluding time when the process in not scheduled (idle, waiting on a semaphore…) which may not be what you want. It is possible to sample off-CPU time as well with perf.
The simple and fast solution1 is to use the frame pointer to unwind the stack frames (
--call-graph fp). However, frame pointer tends to be omitted these days (it is not mandated by the x86_64 ABI): it might not work very well unless you recompile code and dependencies without omitting the frame pointer (
Another solution is to use CFI to unwind the stack (with
--call-graph dwarf): this uses either the DWARF CFI (
.debug_frame section) or runtime stack unwinding (
.eh_frame section). The CFI must be present in the application and shared-objects (with
-g). On x86_64,
.eh_frame should be enabled by default.
Update 2015-09-19: Another solution on recent Intel chips (and recent kernels) is to use the hardware LBR registers (with
As FlameGraph uses a simple line oriented format, it is very easy to filter/transform the data by placing a filter between the
stackcollapse preprocessor and FlameGraph:
# I'm only interested in what's happening in MAIN(): perf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg # I'm not interested in what's happening in init(): perf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg # Let's pretend that realloc() is the same thing as malloc(): perf script | stackcollapse-perf.pl | sed/realloc/malloc/ | flamegraph.pl > alloc.svg
If you have recursive calls you might want to merge them in order to have a more readable view. This is implemented in my branch by
# I want to merge recursive calls: perf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl
Update 2015-10-16: this has been merged upstream.
Sometimes you might not be able to get relevant information with
perf. This might be because you do not have debugging symbols for some libraries you are using: you will end up with missing information in the stacktrace. In this case, you might want to use GDB instead using the poor man's profiler method because it tends to be better at unwinding the stack without frame pointer and debugging information:
# Sample an already running process: pmp 500 0.1 $(pidof mycommand) > mycommand.gdb # Or: mycommand my_arguments & pmp 500 0.1 $! # Generate the SVG: cat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg
pmp is a poor man's profiler script such as:
#!/bin/bash # pmp - "Poor man's profiler" - Inspired by http://poormansprofiler.org/ # See also: http://dom.as/tag/gdb/ nsamples=$1 sleeptime=$2 pid=$3 # Sample stack traces: for x in $(seq 1 $nsamples); do gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid 2> /dev/null sleep $sleeptime done
Using this technique will slow the application a lot.
Compared to the example with perf, this approach samples both on-CPU and off-CPU time.
Here are some figures obtained when I was optimising the Simgrid model checker on a given application using the poor man's profiler to sample the stack.
Here is the original profile before optimisation:
Nearly 65% of the time is spent in
get_type_description(). In fact, the model checker spends its time looking up type description in some hash tables again and over again.
Let's fix this and store a pointer to the type description instead of a type identifier in order to avoid looking up those type over and over again:
After this modification, 32% of the time is spent in libunwind
get_proc_name() (looking up functions name from given values of the instruction pointer) and 12% is spent reading and parsing the output of
cat /proc/self/maps over and over again. Let's fix the second issue first because it is simple, we cache the memory mapping of the process in order to avoid parsing
/proc/self/maps all of time.
Now, let's fix the other issue by resolving the functions ourselves. It turns out, we already had the address range of each function in memory (parsed from DWARF informations). All we have to do is use a binary search in order to have a nice O(log n) lookup.
Still 10% of the time is spent looking up type descriptions from type identifiers in a hash tables. Let's store the reference to the type descriptions and avoid this:
The non-optimised version was taking 2 minutes to complete. With those optimisations, it takes only 6 seconds . There is still room for optimisation here as 30% of the time is now spent in
free() managing heap information.
Perf can sample many other kind of events (hardware performance counters, software performance counters, tracepoints…). You can get the list of available events with
perf list. If you run it as root you will have a lot more events (all the kernel tracepoints).
Here are some interesting events:
cache-misses are in general last level cache misses (the data in not in any cache and must be fetched from RAM which is much slower).
More information about some perf events can be found in
You can then sample an event with:
perf record --call-graph dwarf -e cache-misses myapp
If you have mangled C++ names (such as
c++filt can be used after the
stackcollapse script to demangle them.
--reverse flag of
perf Examples is a good introduction/reference to
perf wiki, with documentation of the subcommands, a tutorial.
Peeking into Linux kernel-land using /proc filesystem for quick'n'dirty troubleshooting has a section about profiling what's happening in the kernel with
gprof2dot is another tool which generate call (dot-based) graphs from the profiler tool data.
When using frame pointer unwinding, the kernel unwinds the stack itself and only gives the instruction pointer of each frame to
perf record. This behaviour is triggered by the
PERF_SAMPLE_CALLCHAIN sample type.
When using DWARF unwinding, the kernels takes a snaphots of (a part of) the stack, gives it to
perf record stores it in a file and the DWARF unwinding is done afterwards by the perf tools. This uses
PERF_SAMPLE_CALLCHAIN is used as well but for the kernel-side stack (