Debugging use-after-free with RR reverse execution

RR is a very useful tool for debugging. It can record the execution of a program and then replay the exact same execution at will inside a debugger. One very useful extra power available since 4.0 is the support for efficient reverse execution which can be used to find the root cause of a bug in your program by rewinding time. In this example, we reverse-execute a program from a case of use-after-free in order to find where the block of memory was freed.

TLDR

$ rr record ./foo my_args
$ rr replay
(rr) continue
(rr) break free if $rdi == some_address
(rr) reverse-continue

Problem

We have a case of use-after-free:

$ gdb --args java -classpath "$classpath" surfCpuModel/TestCpuModel \
  small_platform.xml surfCpuModelDeployment.xml \
  --cfg=host/model:compound

(gdb) run
[…]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fbb700 (LWP 12766)]
0x00007fffe4fe3fb7 in xbt_dynar_map (dynar=0x7ffff0276ea0, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603
603     op(elm);

(gdb) p *dynar
$2 = {size = 2949444837771837443, used = 3415824664728436765,
      elmsize = 3414970357536090483, data = 0x646f4d2f66727573,
      free_f = 0x56295a443b6c65}

The fields of this structure are all wrong and we suspect than this block of heap was already freed and reused by another allocation.

We could use GDB with a conditional breakpoint of free(ptr) with ptr == dynar but this approach poses a few problems:

  1. in the new execution of the program this address might be completely different because of different source of indeterminism such as,

  2. ASLR which we could disable with setarch -R,

  3. scheduling of the different threads (and Java usually spawns quite a few threads);

  4. there could be a lot of calls of free() for this specific address for previous allocations before we reach the correct one.

Using RR

Deterministic recording

RR can be used to create a recording of a given execution of the program. This execution can then be replayed exactly inside a debugger. This fixes our first problem.

Let's record our crash in RR:

$ rr record java -classpath "$classpath" surfCpuModel/TestCpuModel \
  small_platform.xml surfCpuModelDeployment.xml \
    --cfg=host/model:compound
[…]
# A fatal error has been detected by the Java Runtime Environment:
[…]

Now we can replay the exact same execution over and over gain in a special GDB session:

$ rr replay
(rr) continue
Continuing.
[…]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 12601.12602]
0x00007fe94761efb7 in xbt_dynar_map (dynar=0x7fe96c24f350, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603
603     op(elm);

Reverse execution to the root cause of the problem

We want to know who freed this block of memory. RR 4.0 provides support for efficient reverse-execution which can be used to solve our second problem.

Let's set a conditional breakpoint on free():

(rr) p dynar
$1 = (const xbt_dynar_t) 0x7fe96c24f350

(rr) break free if $rdi == 0x7fe96c24f350

Note: This is for x86_64. In the x86_64 ABI, the RDI register is used to pass the first parameter.

Now we can use RR super powers by reverse-executing the program until we find who freed this block of memory:

(rr) reverse-continue
Continuing.
Program received signal SIGSEGV, Segmentation fault.
[…]

(rr) reverse-continue
Continuing.
Breakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917
2917    malloc.c: Aucun fichier ou dossier de ce type.

(bt) backtrace
#0  __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917
#1  0x00007fe96b18486d in ZIP_FreeEntry (jz=0x7fe96c0f43d0, ze=0x7fe96c24f6e0) at ../../../src/share/native/java/util/zip/zip_util.c:1104
#2  0x00007fe968191d78 in ?? ()
#3  0x00007fe96818dcbb in ?? ()
#4  0x0000000000000002 in ?? ()
#5  0x00007fe96c24f6e0 in ?? ()
#6  0x000000077ab0c2d8 in ?? ()
#7  0x00007fe970641a80 in ?? ()
#8  0x0000000000000000 in ?? ()

(rr) reverse-continue
Continuing.
Breakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917
2917    in malloc.c

(rr) backtrace
#0  __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917
#1  0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691
#2  0x00007fe946b98a2f in SwigDirector_CpuModel::createCpu (this=0x7fe96c14d850, name=0x7fe96c156862 "Tremblay", power_peak=0x7fe96c24f350, pstate=0, 
    power_scale=1, power_trace=0x0, core=1, state_initial=SURF_RESOURCE_ON, state_trace=0x0, cpu_properties=0x0)
    at /home/gabriel/simgrid/src/bindings/java/org/simgrid/surf/surfJAVA_wrap.cxx:1571
#3  0x00007fe947531615 in cpu_parse_init (host=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/cpu_interface.cpp:44
#4  0x00007fe947593f88 in sg_platf_new_host (h=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/sg_platf.c:138
#5  0x00007fe9475e54fb in ETag_surfxml_host () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:481
#6  0x00007fe9475da1dc in surf_parse_lex () at src/surf/simgrid_dtd.c:7093
#7  0x00007fe9475e84f2 in _surf_parse () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:1068
#8  0x00007fe9475e8cfa in parse_platform_file (file=0x7fe96c14f1e0 "/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml")
    at /home/gabriel/simgrid/src/surf/surfxml_parseplatf.c:172
#9  0x00007fe9475142f4 in SIMIX_create_environment (file=0x7fe96c14f1e0 "/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml")
    at /home/gabriel/simgrid/src/simix/smx_environment.c:39
#10 0x00007fe9474cd98f in MSG_create_environment (file=0x7fe96c14f1e0 "/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml")
    at /home/gabriel/simgrid/src/msg/msg_environment.c:37
#11 0x00007fe94686c473 in Java_org_simgrid_msg_Msg_createEnvironment (env=0x7fe96c00a1d8, cls=0x7fe9706459a8, jplatformFile=0x7fe9706459b8)
    at /home/gabriel/simgrid/src/bindings/java/jmsg.c:203
#12 0x00007fe968191d78 in ?? ()
#13 0x00000007fffffffe in ?? ()
#14 0x00007fe970645958 in ?? ()
#15 0x00000007f5cd1100 in ?? ()
#16 0x00007fe9706459b8 in ?? ()
#17 0x00000007f5cd1738 in ?? ()
#18 0x0000000000000000 in ?? ()

Now that we have found the offending free() call we can inspect the state of the program:

(rr) frame 1
#1  0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691
691   free(dynar);

(rr) list
686 {
687   void *res;
688   xbt_dynar_shrink(dynar, 1);
689   memset(xbt_dynar_push_ptr(dynar), 0, dynar->elmsize);
690   res = dynar->data;
691   free(dynar);
692   return res;
693 }
694
695 /** @brief Compare two dynars

If necessary we could continue reverse-executing in order to understand better what caused the problem.

Using GDB

While GDB has builtin support for reverse execution, doing the same thing in GDB is much slower. Moreover, recording the execution fills the GDB record buffer quite rapidly which prevents us from recording a large execution: with the native support of GDB we would probably need to narrow down the region when the bug appeared in order to only record (and the reverse-execute) a small part of the execution of the program.

References