{"version": "https://jsonfeed.org/version/1", "title": "/dev/posts/ - Tag index - gdb", "home_page_url": "https://www.gabriel.urdhr.fr", "feed_url": "/tags/gdb/feed.json", "items": [{"id": "http://www.gabriel.urdhr.fr/2015/11/25/rr-use-after-free/", "title": "Debugging use-after-free with RR reverse execution", "url": "https://www.gabriel.urdhr.fr/2015/11/25/rr-use-after-free/", "date_published": "2015-11-25T00:00:00+01:00", "date_modified": "2015-11-25T00:00:00+01:00", "tags": ["computer", "debug", "gdb", "rr", "simgrid"], "content_html": "

RR is a very useful tool for debugging. It\ncan record the execution of a program and then replay the exact same\nexecution at will inside a debugger. One very useful extra power\navailable since 4.0 is the support for efficient reverse\nexecution\nwhich can be used to find the root cause of a bug in your program\nby rewinding time. In this example, we reverse-execute a program from a\ncase of use-after-free in order to find where the block of memory was\nfreed.

\n

TLDR

\n
\n$ rr record ./foo my_args\n$ rr replay\n(rr) continue\n(rr) break free if $rdi == some_address\n(rr) reverse-continue\n
\n\n

Problem

\n

We have a case of use-after-free:

\n
$ gdb --args java -classpath \"$classpath\" surfCpuModel/TestCpuModel \\\n  small_platform.xml surfCpuModelDeployment.xml \\\n  --cfg=host/model:compound\n\n(gdb) run\n[\u2026]\n\nProgram received signal SIGSEGV, Segmentation fault.\n[Switching to Thread 0x7ffff7fbb700 (LWP 12766)]\n0x00007fffe4fe3fb7 in xbt_dynar_map (dynar=0x7ffff0276ea0, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603\n603     op(elm);\n\n(gdb) p *dynar\n$2 = {size = 2949444837771837443, used = 3415824664728436765,\n      elmsize = 3414970357536090483, data = 0x646f4d2f66727573,\n      free_f = 0x56295a443b6c65}\n
\n\n

The fields of this structure are all wrong and we suspect that this\nblock of heap memory was already freed and reused by another allocation.

\n

We could use GDB with a conditional breakpoint on free(ptr) with\nptr == dynar, but this approach poses a few problems:

\n
    \n
  1. \n

    in the new execution of the program this address might be\n completely different because of different sources of indeterminism,\n such as:

    \n
    - \n

    ASLR, which we could disable with setarch -R;

    \n
    - \n

    the scheduling of the different threads (and Java usually spawns quite\n a few threads);

    \n
  2. \n

    there could be a lot of calls to free() for this specific\n address from previous allocations before we reach the correct one.

    \n

Using RR

\n

Deterministic recording

\n

RR can be used to create a recording of a given execution of the\nprogram. This execution can then be replayed exactly inside a\ndebugger. This fixes our first problem.

\n

Let's record our crash in RR:

\n
$ rr record java -classpath \"$classpath\" surfCpuModel/TestCpuModel \\\n  small_platform.xml surfCpuModelDeployment.xml \\\n    --cfg=host/model:compound\n[\u2026]\n# A fatal error has been detected by the Java Runtime Environment:\n[\u2026]\n
\n\n

Now we can replay the exact same execution over and over again in a special\nGDB session:

\n
$ rr replay\n(rr) continue\nContinuing.\n[\u2026]\n\nProgram received signal SIGSEGV, Segmentation fault.\n[Switching to Thread 12601.12602]\n0x00007fe94761efb7 in xbt_dynar_map (dynar=0x7fe96c24f350, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603\n603     op(elm);\n
\n\n

Reverse execution to the root cause of the problem

\n

We want to know who freed this block of memory. RR 4.0 provides\nsupport for efficient reverse-execution which can be used to solve our\nsecond problem.

\n

Let's set a conditional breakpoint on free():

\n
(rr) p dynar\n$1 = (const xbt_dynar_t) 0x7fe96c24f350\n\n(rr) break free if $rdi == 0x7fe96c24f350\n
\n\n

Note: This is for x86_64.\nIn the x86_64 ABI,\nthe RDI register is used to pass the first integer or pointer parameter.

\n

Now we can use RR's super powers by reverse-executing the program until\nwe find who freed this block of memory:

\n
\n(rr) reverse-continue\nContinuing.\nProgram received signal SIGSEGV, Segmentation fault.\n[\u2026]\n\n(rr) reverse-continue\nContinuing.\nBreakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n2917    malloc.c: No such file or directory.\n\n(rr) backtrace\n#0  __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n#1  0x00007fe96b18486d in ZIP_FreeEntry (jz=0x7fe96c0f43d0, ze=0x7fe96c24f6e0) at ../../../src/share/native/java/util/zip/zip_util.c:1104\n#2  0x00007fe968191d78 in ?? ()\n#3  0x00007fe96818dcbb in ?? ()\n#4  0x0000000000000002 in ?? ()\n#5  0x00007fe96c24f6e0 in ?? ()\n#6  0x000000077ab0c2d8 in ?? ()\n#7  0x00007fe970641a80 in ?? ()\n#8  0x0000000000000000 in ?? ()\n\n(rr) reverse-continue\nContinuing.\nBreakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n2917    in malloc.c\n\n(rr) backtrace\n#0  __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n#1  0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691\n#2  0x00007fe946b98a2f in SwigDirector_CpuModel::createCpu (this=0x7fe96c14d850, name=0x7fe96c156862 \"Tremblay\", power_peak=0x7fe96c24f350, pstate=0, \n    power_scale=1, power_trace=0x0, core=1, state_initial=SURF_RESOURCE_ON, state_trace=0x0, cpu_properties=0x0)\n    at /home/gabriel/simgrid/src/bindings/java/org/simgrid/surf/surfJAVA_wrap.cxx:1571\n#3  0x00007fe947531615 in cpu_parse_init (host=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/cpu_interface.cpp:44\n#4  0x00007fe947593f88 in sg_platf_new_host (h=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/sg_platf.c:138\n#5  0x00007fe9475e54fb in ETag_surfxml_host () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:481\n#6  0x00007fe9475da1dc in surf_parse_lex () at src/surf/simgrid_dtd.c:7093\n#7  0x00007fe9475e84f2 in _surf_parse () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:1068\n#8  0x00007fe9475e8cfa in parse_platform_file (file=0x7fe96c14f1e0 
\"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n    at /home/gabriel/simgrid/src/surf/surfxml_parseplatf.c:172\n#9  0x00007fe9475142f4 in SIMIX_create_environment (file=0x7fe96c14f1e0 \"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n    at /home/gabriel/simgrid/src/simix/smx_environment.c:39\n#10 0x00007fe9474cd98f in MSG_create_environment (file=0x7fe96c14f1e0 \"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n    at /home/gabriel/simgrid/src/msg/msg_environment.c:37\n#11 0x00007fe94686c473 in Java_org_simgrid_msg_Msg_createEnvironment (env=0x7fe96c00a1d8, cls=0x7fe9706459a8, jplatformFile=0x7fe9706459b8)\n    at /home/gabriel/simgrid/src/bindings/java/jmsg.c:203\n#12 0x00007fe968191d78 in ?? ()\n#13 0x00000007fffffffe in ?? ()\n#14 0x00007fe970645958 in ?? ()\n#15 0x00000007f5cd1100 in ?? ()\n#16 0x00007fe9706459b8 in ?? ()\n#17 0x00000007f5cd1738 in ?? ()\n#18 0x0000000000000000 in ?? ()\n
\n\n

Now that we have found the offending free() call we can inspect the state\nof the program:

\n
\n(rr) frame 1\n#1  0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691\n691   free(dynar);\n\n(rr) list\n686 {\n687   void *res;\n688   xbt_dynar_shrink(dynar, 1);\n689   memset(xbt_dynar_push_ptr(dynar), 0, dynar->elmsize);\n690   res = dynar->data;\n691   free(dynar);\n692   return res;\n693 }\n694\n695 /** @brief Compare two dynars\n
\n\n

If necessary, we could continue reverse-executing in order to better\nunderstand what caused the problem.

\n

Using GDB

\n

While GDB has built-in support for reverse\nexecution,\ndoing the same thing in GDB is much slower. Moreover, recording\nthe execution fills the GDB record buffer quite rapidly, which prevents\nus from recording a long execution: with GDB's native support\nwe would probably need to narrow down the region where the bug appears\nin order to record (and then reverse-execute) only a small part of the\nexecution of the program.

\n

References

\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/07/17/sample-watchpoint/", "title": "Sample watchpoints or breakpoints with GDB (and FlameGraph)", "url": "https://www.gabriel.urdhr.fr/2014/07/17/sample-watchpoint/", "date_published": "2014-07-17T00:00:00+02:00", "date_modified": "2014-07-17T00:00:00+02:00", "tags": ["gdb", "debug", "computer", "flamegraph"], "content_html": "

GDB can be used to get the stack each time a breakpoint is reached.

\n

GDB in batch mode can be used to get the stack each time a\nbreakpoint/watchpoint is hit:

\n

Sampling breakpoints

\n
# This is sample.gdb:\n\n# my_function may not be available straight away.\n# First get into main:\nbreak main\nrun\ndelete\n\n# Now, we can set the breakpoint:\nbreak my_function\ncommands\n  silent\n  backtrace\n  continue\nend\n\n# Resume the program:\ncontinue\n
\n\n\n

And run it with:

\n
gdb --batch -x sample.gdb ./my_program > my_program.txt\n
\n\n\n

Sampling breakpoints as a Python script

\n

We can use Python\ninstead, which is simpler and more flexible:

\n
class MyBreakpoint(gdb.Breakpoint):\n    def stop(self):\n        gdb.execute(\"backtrace\")\n        return False\n\nMyBreakpoint(\"my_function\")\ngdb.execute(\"run\")\n
\n\n\n

The script must end in .py in order to be recognised by GDB as a\nPython script:

\n
gdb --batch -x sample.py ./my_program > my_program.txt\n
\n\n\n

Example

\n

Here is an example with a watchpoint I used to find where an unexpected\nvalue in a program was coming from:

\n
# Run the program a first time.\n# The program calls abort() when the unexpected value is reached.\n# A conditional breakpoint could be used instead.\nrun\n\n# At this point, set up a watchpoint on the location of this unexpected value:\nframe 3\n# -l is used in order to set the breakpoint on a given address:\nwatch -l *(int*) ((char*)heap_region1->start_addr + ((char*)&heapinfo1->type-(char*)heap_region1->data))\ncommands\n  backtrace\n  continue\nend\n\n# Restart the application in order to figure out who writes at this address:\nrun\n
\n\n\n

In order to have stable addresses between the two runs, we need to disable\nASLR:

\n
setarch x86_64 -R gdb --batch -x sample.gdb ./my_program > my_program.txt\n
\n\n\n

We can even generate a FlameGraph:

\n
\n\n \n\n
FlameGraph generated from this watchpoint
\n
"}, {"id": "http://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "title": "Profiling and optimising with Flamegraph", "url": "https://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "date_published": "2014-05-23T00:00:00+02:00", "date_modified": "2014-05-23T00:00:00+02:00", "tags": ["simgrid", "optimisation", "profiling", "computer", "flamegraph", "unix", "gdb", "perf"], "content_html": "

FlameGraph\nis a tool which generates SVG graphics\nto visualise stack-sampling based\nprofiles. It processes data collected with tools such as Linux perf,\nSystemTap, and DTrace.

\n

For the impatient:

\n\n

Table of Contents

\n
\n\n
\n

Profiling by sampling the stack

\n

The idea is that in order to know where your application is spending CPU\ntime, you should sample its stack. You can get one sample of the\nstack(s) of a process with GDB:

\n
# Sample the stack of the main (first) thread of a process:\ngdb -ex \"set pagination 0\" -ex \"bt\" -batch -p $(pidof okular)\n\n# Sample the stack of all threads of the process:\ngdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $(pidof okular)\n
\n\n\n

This generates backtraces such as:

\n
[...]\nThread 2 (Thread 0x7f4d7bd56700 (LWP 15156)):\n#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2  g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3  0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4  0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#6  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7  0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8  0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9  0x00007f4d9705c2bf in ?? 
() from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0\n#11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6\n\nThread 1 (Thread 0x7f4d997ab780 (LWP 15150)):\n#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2  g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3  0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4  0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5  0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4\n#6  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8  0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9  0x00000000004082d6 in ?? ()\n#10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6\n#11 0x0000000000409181 in _start ()\n[...]\n
\n\n\n

By doing this a few times, you should be able to have an idea of\nwhat's taking time in your process (or thread).

\n

Using FlameGraph for visualising stack samples

\n

Taking a few random stack samples of the process might be fine and\nhelp you in some cases, but in order to have more accurate information\nyou might want to take a lot of stack samples. FlameGraph can help you\nvisualise those stack samples.

\n

How does FlameGraph work?

\n

FlameGraph reads a file from the standard input representing stack\nsamples in a simple format where each line represents one stack\nand its number of samples:

\n
main;init;init_boson_processor;malloc  2\nmain;init;init_logging;malloc          4\nmain;processing;compute_value          8\nmain;cleanup;free                      3\n
\n\n\n

FlameGraph generates a corresponding SVG representation:

\n
\n\n \n\n
Corresponding FlameGraph output
\n
\n\n
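The folded format above is simple enough to generate yourself. Here is a minimal sketch in Python (with made-up frame names, not tied to any particular tool) of how raw stack samples collapse into it:

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse raw stack samples (outermost frame first) into
    FlameGraph's folded format: one 'frame1;frame2;... count' line
    per distinct stack."""
    counts = Counter(";".join(stack) for stack in samples)
    return ["%s %d" % (stack, n) for stack, n in sorted(counts.items())]

# Made-up samples for illustration:
samples = [
    ["main", "init", "init_logging", "malloc"],
    ["main", "init", "init_logging", "malloc"],
    ["main", "processing", "compute_value"],
]
for line in fold_stacks(samples):
    print(line)
# main;init;init_logging;malloc 2
# main;processing;compute_value 1
```

This is exactly the job the stackcollapse-*.pl preprocessors do for each supported tool's output.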

FlameGraph ships with a set of preprocessing scripts\n(stackcollapse-*.pl) used to convert data from various\nperformance/profiling tools into this simple format\nwhich means you can use FlameGraph with perf, DTrace,\nSystemTap or your own tool:

\n
your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg\n
\n\n\n

It is very easy to add support for a new tool with a few lines of\nscript. I wrote a\npreprocessor\nfor the GDB backtrace output (produced by the previous poor man's\nprofiler script) which is now available\nin the main repository.

\n

As FlameGraph uses a tool-neutral line-oriented format, it is very\neasy to add generic filters after the preprocessor (using sed,\ngrep\u2026):

\n
the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg\n
\n\n\n

Update 2015-08-22:\nElfutils ships a stack program\n(called eu-stack on Debian) which seems to be much faster than GDB\nfor use as a poor man's profiler in a shell script. I wrote a\nscript in order to feed its output to\nFlameGraph.

\n

Using FlameGraph with perf

\n

perf is a very powerful tool for Linux to do performance analysis of\nprograms. For example, here's how we can generate an\non-CPU\nFlameGraph of an application using perf:

\n
# Use perf to do a time based sampling of an application (on-CPU):\nperf record -F99 --call-graph dwarf myapp\n\n# Turn the data into a cute SVG:\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg\n
\n\n\n

This samples the on-CPU time, excluding time when the process is not\nscheduled (idle, waiting on a semaphore\u2026) which may not be what you\nwant. It is possible to sample\noff-CPU\ntime as well with\nperf.

\n

The simple and fast solution1 is to use the frame pointer\nto unwind the stack frames (--call-graph fp). However, the frame pointer\ntends to be omitted these days (it is not mandated by the x86_64 ABI):\nthis might not work very well unless you recompile your code and dependencies\nwithout omitting the frame pointer (-fno-omit-frame-pointer).

\n

Another solution is to use CFI to unwind the stack (with --call-graph\ndwarf): this uses either the DWARF CFI (.debug_frame section) or\nruntime stack unwinding (.eh_frame section). The CFI must be present\nin the application and shared-objects (with\n-fasynchronous-unwind-tables or -g). On x86_64, .eh_frame should\nbe enabled by default.

\n

Update 2015-09-19: Another solution on recent Intel chips (and\nrecent kernels) is to use the hardware LBR\nregisters (with --call-graph\nlbr).

\n

Transforming and filtering the data

\n

As FlameGraph uses a simple line oriented format, it is very easy to\nfilter/transform the data by placing a filter between the\nstackcollapse preprocessor and FlameGraph:

\n
# I'm only interested in what's happening in MAIN():\nperf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg\n\n# I'm not interested in what's happening in init():\nperf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg\n\n# Let's pretend that realloc() is the same thing as malloc():\nperf script | stackcollapse-perf.pl | sed s/realloc/malloc/ | flamegraph.pl > alloc.svg\n
\n\n\n

If you have recursive calls you might want to merge them in order to\nhave a more readable view. This is implemented in my\nbranch\nby stackfilter-recursive.pl:

\n
# I want to merge recursive calls:\nperf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl\n
\n\n\n
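The idea behind merging recursive calls can be illustrated in a few lines. This is only a Python sketch of the concept, not the actual stackfilter-recursive.pl logic:

```python
def merge_recursive(folded_line):
    """Collapse runs of consecutive identical frames in one
    folded-stack line, e.g. 'main;eval;eval;eval;apply 12'
    becomes 'main;eval;apply 12'."""
    stack, count = folded_line.rsplit(" ", 1)
    frames = stack.split(";")
    # Keep a frame only when it differs from its predecessor:
    merged = [f for i, f in enumerate(frames) if i == 0 or f != frames[i - 1]]
    return ";".join(merged) + " " + count

print(merge_recursive("main;eval;eval;eval;apply 12"))
# main;eval;apply 12
```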

Update 2015-10-16: this has been merged upstream.

\n

Using FlameGraph with the poor man's profiler (based on GDB)

\n

Sometimes you might not be able to get relevant information with\nperf. This might be because you do not have debugging symbols for\nsome libraries you are using: you will end up with missing\ninformation in the stacktrace. In this case, you might want to use GDB\ninstead, using the poor man's profiler\nmethod, because it tends to be better at unwinding the stack without\nframe pointers and debugging information:

\n
# Sample an already running process:\npmp 500 0.1 $(pidof mycommand) > mycommand.gdb\n\n# Or:\nmycommand my_arguments &\npmp 500 0.1 $!\n\n# Generate the SVG:\ncat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg\n
\n\n\n

Where pmp is a poor man's profiler script such as:

\n
#!/bin/bash\n# pmp - \"Poor man's profiler\" - Inspired by http://poormansprofiler.org/\n# See also: http://dom.as/tag/gdb/\n\nnsamples=$1\nsleeptime=$2\npid=$3\n\n# Sample stack traces:\nfor x in $(seq 1 $nsamples); do\n  gdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $pid 2> /dev/null\n  sleep $sleeptime\ndone\n
\n\n\n

Using this technique will slow down the application a lot.

\n

Compared to the example with perf, this approach samples both on-CPU\nand off-CPU time.

\n

A real world example of optimisation with FlameGraph

\n

Here are some figures obtained when I was optimising the\nSimgrid\nmodel checker\non a given application\nusing the poor man's profiler to sample the stack.

\n

Here is the original profile before optimisation:

\n
\n\n \n\n
FlameGraph before optimisation
\n
\n\n

Avoid looking up data in a hash table

\n

Nearly 65% of the time is spent in get_type_description(). In fact, the\nmodel checker spends its time looking up type descriptions in some hash tables\nover and over again.

\n

Let's fix this and store a pointer to the type description instead of\na type identifier in order to avoid looking up those types over\nand over again:
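The shape of this optimisation can be sketched as follows. This is a hypothetical Python illustration of the idea (the real code is SimGrid's C, and the names here are made up): resolve the identifier once and keep a direct reference.

```python
# Hypothetical illustration, not SimGrid's actual code.
# The hash table mapping type identifiers to descriptions:
type_descriptions = {"my_struct_t": {"size": 128}}

class VariableBefore:
    """Stores a type identifier: every access pays a hash-table lookup."""
    def __init__(self, type_id):
        self.type_id = type_id
    def type_size(self):
        return type_descriptions[self.type_id]["size"]

class VariableAfter:
    """Stores a direct reference: the lookup is done once, at creation."""
    def __init__(self, type_id):
        self.type = type_descriptions[type_id]
    def type_size(self):
        return self.type["size"]
```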

\n
\n\n \n\n
FlameGraph after avoiding the type lookups
\n
\n\n

Cache the memory area addresses

\n

After this modification,\n32% of the time is spent in libunwind get_proc_name() (looking up\nfunction names from given values of the instruction pointer) and\n12% is spent reading and parsing the output of cat\n/proc/self/maps over and over again. Let's fix the second issue first\nbecause it is simple: we cache the memory mapping of the process in\norder to avoid parsing /proc/self/maps all the time.
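The caching itself is straightforward; here is a Python sketch of the idea (the actual fix lives in SimGrid's C code): parse the mapping once and reuse it.

```python
import functools
import re

def parse_maps(text):
    """Parse /proc/<pid>/maps content into (start, end, perms, path) tuples."""
    entries = []
    for line in text.splitlines():
        m = re.match(r"([0-9a-f]+)-([0-9a-f]+) (\S+) \S+ \S+ \S+\s*(.*)", line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16),
                            m.group(3), m.group(4)))
    return entries

@functools.lru_cache(maxsize=1)
def memory_map():
    """Read and parse /proc/self/maps only once; later calls hit the cache."""
    with open("/proc/self/maps") as f:
        return parse_maps(f.read())
```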

\n
\n\n \n\n
FlameGraph after caching the /proc/self/maps output
\n
\n\n

Speed up function resolution

\n

Now, let's fix the other issue by resolving the functions\nourselves. It turns out we already had the address range of each function\nin memory (parsed from the DWARF information). All we have to do is use a\nbinary search in order to have a nice O(log n) lookup.
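The lookup can be sketched as follows, in Python with made-up addresses (the real implementation is in SimGrid's C code): keep the ranges sorted by start address and bisect.

```python
import bisect

# Function address ranges sorted by start address (made-up values),
# assumed non-overlapping, as parsed from the DWARF information:
functions = [
    (0x400000, 0x400800, "main"),
    (0x400800, 0x401000, "xbt_dynar_map"),
    (0x401000, 0x402000, "get_proc_name"),
]
starts = [start for start, end, name in functions]

def resolve_function(ip):
    """Resolve an instruction pointer to a function name in O(log n)."""
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and functions[i][0] <= ip < functions[i][1]:
        return functions[i][2]
    return None  # not inside any known function

print(resolve_function(0x400a10))
# xbt_dynar_map
```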

\n
\n\n \n\n
FlameGraph after optimising the function lookups
\n
\n\n

Avoid looking up data in a hash table (again)

\n

Still, 10% of the time is spent looking up type descriptions from type\nidentifiers in hash tables. Let's store references to the type\ndescriptions and avoid this:

\n
\n\n \n\n
FlameGraph after avoiding some remaining type lookups
\n
\n\n

Result

\n

The non-optimised version was taking 2 minutes to complete. With\nthose optimisations, it takes only 6 seconds \ud83d\ude2e. There is\nstill room for optimisation here as 30% of the time is now spent in\nmalloc()/free() managing heap information.

\n

Remaining stuff

\n

Sampling other events

\n

Perf can sample many other kinds of events (hardware performance\ncounters, software performance counters, tracepoints\u2026). You can get\nthe list of available events with perf list. If you run it as\nroot you will have a lot more events (all the kernel tracepoints).

\n

Here are some interesting events:

\n\n

More information about some perf events can be found in\nperf_event_open(2).

\n

You can then sample an event with:

\n
perf record --call-graph dwarf -e cache-misses myapp\n
\n\n\n
\n\n \n\n
FlameGraph of cache misses
\n
\n\n

Ideas

\n\n

Extra tips

\n\n

References

\n\n
\n
\n
    \n
  1. \n

    When using frame pointer unwinding, the kernel unwinds the stack\nitself and only gives the instruction pointer of each frame to\nperf record. This behaviour is triggered by the\nPERF_SAMPLE_CALLCHAIN sample type.

    \n

    When using DWARF unwinding, the kernel takes a snapshot of (a\npart of) the stack and gives it to perf record: perf record\nstores it in a file and the DWARF unwinding is done afterwards by\nthe perf tools. This uses\nPERF_SAMPLE_STACK_USER. PERF_SAMPLE_CALLCHAIN is used as well\nbut for the kernel-side stack (exclude_callchain_user).\u00a0\u21a9

    \n
\n
"}]}