Better isolation for SimGridMC

| 🤔 | 👍 | 👎 |

In an attempt to simplify the development around the SimGrid model-checker, we were thinking about moving the model-checker out in a different process. Another different approach would be to use a dynamic-linker isolation of the different components of the process. Here's a summary of the goals, problems and design issues surrounding these topics.

Table of Content

Current state

SimGrid architecture

The design if the SimGrid simulator is based on the design of a operating system.

In a typical OS, we have a kernel managing a global state and a several userspace processes running on top of the kernel. The kernel schedules the execution of the different processes (and their threads) on the available CPUs. The kernel provides an API to the processes made of several system calls.

Process Process Process Process System calls OS kernel

SimGrid simulated a distributed system: it simulates a network and let the different processes of the simulated system use this simulated network. Each simulated process runs on top of the SimGrid kernel. The SimGrid kernel schedules the execution of the different processes on the available OS threads. The SimGrid kernel provides an API to the processes made of several simulation calls.

Process Process Process Process Simulation calls SimGrid kernel

In order to reduce the cost of context switching between the different processes, in the current implementation of SimGrid all the simulated processes and the SimGrid kernel are in the same OS process: there is no MMU-enforced separation of memory between the simulated processes but they are expected to only communicate between each other using only the means provided by the SimGrid kernel (the simulation calls) and should not share mutable memory.

Process Process Process Process Simulation calls SimGrid kernel System calls OS kernel

The SimGrid kernel has a dedicated stack and each simulated process has its own stack: cooperative multitasking (fibers, ucontext) is used to switch between the different contexts (SimGrid kernel/process) and is used by the SimGrid kernel to schedule the execution of the different processes.

The same (libc) heap is shared between the SimGrid kernel and the simulated processes.

SimGridMC architecture

The SimGrid model-checker is a dynamic analysis component for SimGrid. It explores the different possible interleavings of execution of the simulated processes (depending on the execution of their transitions i.e. the different possible orderings of their communications).

In order to do this, the MC saves at each node of the graph of the possible executions the state of the system:

  • the global variables;

  • the stack of each process;

  • the heap;

  • the file descriptors.

Those states are then used to:

  • restore a previously visited state in order to explore other parts of the execution graph;

  • detect equivalent states which is necessary in order to,

  • avoid exploring redundant parts of the execution graphs,

  • check liveness properties (using Double Depth Frist Search).

In the current implementation, the model-checker lives in the same process as the main SimGrid process (the SimGrid kernel and the processes):

Process Process Process Process Simulation calls SimGrid kernel Model-checker

Multiple heaps

However, the model-checker needs to maintain its own state: the state of the model-checker must not be saved, compared and restored with the rest of the state.

In order to do this, the state of the model-checker is maintained in a second heap:

  • the main heap is used for the SimGrid kernel and the simulated processes;

  • a model-checker heap is used for the model-checker and is used for all the state of the model-checker (current state of the execution graph).

This is implemented by overriding the malloc(), free() and friends in order to support multiple heap. A global variable is used to choose the current working heap:

// Simplified code
xbt_mheap_t __mmalloc_current_heap = NULL;

void *malloc(size_t n)
  return mmalloc(__mmalloc_current_heap, n);

void free(void *ptr)
  return mfree(__mmalloc_current_heap, ptr);

Limitation of the approach

The current implementation is complicated and not easy to understand and maintain:

  • Because of the two heaps, the code of the MC is littered with fragile code to switch between the heaps. An error at this level can lead to weird bugs or break the snapshot/restoration mechanism and the state detection mechanism.

  • There are many opportunities of breakage in the state snapshot/restore/equality mechanism: using a static variable as a cache in the model-checker, sharing data bewteen the model-checker and the simulated process.

  • Because of the two heaps, there is a strong dependency on the mmalloc() implementation. This implementation is probably not as efficient as a more modern malloc() implementation: for example, it does not use a per-thread arena or any sort of thread-friendly approach but a single mutex per heap. Avoiding to have multiple heaps per process would remove a dependency on the malloc() implementation (we would still have a dependency on the mmalloc() metadata format) and would make it easier to switch to another malloc() implementation.

A first motivation for modifying the architecture of SimGridMC, is to incraase the maintainability of the SimGridMC codebase.

Another related goal is to simplify the debugging experience (of the simulated application, the SimGrid kernel and the model-checker). For example, the current version of SimGridMC does not work under valgrind. A solution which would provide a more powerful debugging experience would be a valuable tool for the SimGridMC devs but more importantly for the users of SimGridMC.

Process-based isolation

For all these reasons, we would like to move the model-checker in a separate process: a model-checker process maintains the model-checker state and control the execution of a model-checked process.

Process Process Process Process Simulation calls SimGrid kernel Model-checking interface Model-checker

Memory snapshot/restoration

The snapshoting/restoration of the model-checked process memory can be done using /proc/${pid}/mem or process_vm_readv() and process_vm_writev().

As long as the OS threads are living on stacks which are not managed by the state snapshot/restoration mechanism, they will not be affected: we must take care that the OS threads switch to unmanaged stacks when we are doing the state snapshots/restorations.

Another solution would be to use ptrace() with PTRACE_GETREGSET and PTRACE_SETREGSET in order to snapshot/restore the registers of each thread but we would like to avoid this in order to be able to use ptrace() for debugging or other purposes.

File descriptors restoration

Linux does not provide a way to change the file descriptors of another process: the restoration of the file descriptors must be done in the taret OS process and cannot be done from the model-checker process. Cooperation of the model-checked process is needed for the file descriptors restoration.

We could abuse ptrace()-based syscall rewriting techniques or some sort of parasite injection in order to achieve this.

Dynamic-linker based isolation

Another idea would be to create a custom dynamic linker with namespace support in order to be able to link multiple instances of the same library and provide isolation between different parts of the process.

This could be used to:

  • load multiple instances of the same application compiled as a shared object (in PIC) as an alternative to our mmap() based SMPI privatisation technique;

  • separate the libc of the model-checker and the main application in order to have a cleaner separation of the two heaps;

  • run the simulated applications with overloaded functions hooked into the SimGrid kernel in order as an alternative to ptrace()-based system call interception.

Prior art in DCE with dlmopen()

It turns out that DCE already uses a similar approach to load multiples application instances along with Linux kernel implementations (and its network stack) on top of the NS3 packet level network simulator in the same process: the applications and Linux kernel are compiled as shared objects, the latter forming a Library OS shared object and loaded multiple times in the same process alongside with the NS3 instance.

Among several alternative strategies, DCE uses the dlmopen() function. This is a variant of dlopen() originating from SunOS/Solaris and implemented on the GNU userland which allows to load dynamic libraries in separated namespaces:

  • each namespace has its own symbols;

  • each namespace can have a distinct instance of a same shared object.

An alternative implementation of the dynamic linker, elf-loader, is used which provides additional features:

  • unlimited number of namespaces;

  • remapping libraries and symbols.

More information about dlmopen() can be found in old version of Sun Linkers and Libraries Guide.

A custom dynamic loader/linker on top of libc

However, I was envisioning something slightly different: instead of writing a replacement of (using raw system calls), I was thinking about building the custom dynamic linker on top of libc and libdl in order to be able to use libc (malloc()), libdl and libelf instead of using the raw system calls.

Impact on debuggability

In a split process design, the model-checker could be a quite standard application avoiding weird hacks (such as introspection with /proc/self/maps and DWARF, snapshoting/restoration of the state with memcpy(), custom mmalloc() implementation with multiple heaps). Once a relevant trajectory of the model-checked application has been identified, it could be replayed outside of the model-checker and debugged in this simpler mode.

However, having a single process could lead to a better debugging experience: by being able to combines breakpoints in the model-checker, the SimGrid kernel and the simulated application with conditions spanning all those components.

At the same time, using multiple dynamic-linking namespaces could make the debugging experience more complicated. I'm not sure how well it is supported by the different available debugging tools. The DCE tools seems to show that it is reasonably well supported by GDB and valgrind.


So we have two possible directions:

  • moving the model-checker in a separate process in order to isolate it;

  • using dynamic linking tricks in order to isolate the model-checker from the rest of the application,

    • using elf-loader;

    • using a new dynamic linker built on top of libc.

    This is quite a large undertaking but being able to use shared libraries could simplify its code base.

The first solution provides a better isolation of the model-checker. The second solution is closer to the current implementation and should have better performances by avoiding the context switches and IPC in favour of direct memory access and function calls. Moreover, the dynamic-linker-based isolation could be reused for other parts of the projects (such as the isolation of the simulated MPI processes).

It is not clear which solution would provide the better debugging experience for the user and which solution would be better for the maintainability of SimGridMC.

Appendix: dlmopen() quick demo

This simple program creates three new namespaces and loads libpthread in those namespaces:

#define _GNU_SOURCE
#include <dlfcn.h>

#include <unistd.h>

int main(int argc, const char** argv)
  size_t i;
  for (i=0; i!=3; ++i) {
    void* x = dlmopen(LM_ID_NEWLM, "", RTLD_NOW);
    if (!x)
      return 1;
  while(1) sleep(200000);
  return 0;

We see that libpthread is loaded thrice. Each instance has its own libc instance as well (and a fourth one is loaded for the main program):

00400000-00401000 r-xp 00000000 08:06 7603474                            /home/myself/temp/a.out
00600000-00601000 rw-p 00000000 08:06 7603474                            /home/myself/temp/a.out
0173a000-0175b000 rw-p 00000000 00:00 0                                  [heap]
7fca7ac7d000-7fca7ae1c000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7ae1c000-7fca7b01c000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b01c000-7fca7b020000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b020000-7fca7b022000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b022000-7fca7b026000 rw-p 00000000 00:00 0
7fca7b026000-7fca7b03e000 r-xp 00000000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b03e000-7fca7b23d000 ---p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b23d000-7fca7b23e000 r--p 00017000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b23e000-7fca7b23f000 rw-p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b23f000-7fca7b243000 rw-p 00000000 00:00 0
7fca7b243000-7fca7b3e2000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b3e2000-7fca7b5e2000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b5e2000-7fca7b5e6000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b5e6000-7fca7b5e8000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b5e8000-7fca7b5ec000 rw-p 00000000 00:00 0
7fca7b5ec000-7fca7b604000 r-xp 00000000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b604000-7fca7b803000 ---p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b803000-7fca7b804000 r--p 00017000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b804000-7fca7b805000 rw-p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7b805000-7fca7b809000 rw-p 00000000 00:00 0
7fca7b809000-7fca7b9a8000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7b9a8000-7fca7bba8000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7bba8000-7fca7bbac000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7bbac000-7fca7bbae000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7bbae000-7fca7bbb2000 rw-p 00000000 00:00 0
7fca7bbb2000-7fca7bbca000 r-xp 00000000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7bbca000-7fca7bdc9000 ---p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7bdc9000-7fca7bdca000 r--p 00017000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7bdca000-7fca7bdcb000 rw-p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/
7fca7bdcb000-7fca7bdcf000 rw-p 00000000 00:00 0
7fca7bdcf000-7fca7bf6e000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7bf6e000-7fca7c16e000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7c16e000-7fca7c172000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7c172000-7fca7c174000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/
7fca7c174000-7fca7c178000 rw-p 00000000 00:00 0
7fca7c178000-7fca7c17b000 r-xp 00000000 08:01 2626017                    /lib/x86_64-linux-gnu/
7fca7c17b000-7fca7c37a000 ---p 00003000 08:01 2626017                    /lib/x86_64-linux-gnu/
7fca7c37a000-7fca7c37b000 r--p 00002000 08:01 2626017                    /lib/x86_64-linux-gnu/
7fca7c37b000-7fca7c37c000 rw-p 00003000 08:01 2626017                    /lib/x86_64-linux-gnu/
7fca7c37c000-7fca7c39c000 r-xp 00000000 08:01 2625993                    /lib/x86_64-linux-gnu/
7fca7c568000-7fca7c56b000 rw-p 00000000 00:00 0
7fca7c59a000-7fca7c59c000 rw-p 00000000 00:00 0
7fca7c59c000-7fca7c59d000 r--p 00020000 08:01 2625993                    /lib/x86_64-linux-gnu/
7fca7c59d000-7fca7c59e000 rw-p 00021000 08:01 2625993                    /lib/x86_64-linux-gnu/
7fca7c59e000-7fca7c59f000 rw-p 00000000 00:00 0
7fffa8481000-7fffa84a2000 rw-p 00000000 00:00 0                          [stack]
7fffa85f5000-7fffa85f7000 r-xp 00000000 00:00 0                          [vdso]
7fffa85f7000-7fffa85f9000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

The new namespaces are probably not fully functional in this state: there are probably conflicts to solve in the different instances. For example, each libc probably tries to manage the same heap with sbrk().