Better isolation for SimGridMC
Published:
Updated:
In an attempt to simplify the development around the SimGrid model-checker, we were thinking about moving the model-checker out in a different process. Another different approach would be to use a dynamic-linker isolation of the different components of the process. Here is a summary of the goals, problems and design issues surrounding these topics.
Table of content
Current state
SimGrid architecture
The design if the SimGrid simulator is based on the design of a operating system.
In a typical OS, we have a kernel managing a global state and a several userspace processes running on top of the kernel. The kernel schedules the execution of the different processes (and their threads) on the available CPUs. The kernel provides an API to the processes made of several system calls.
SimGrid simulated a distributed system: it simulates a network and let the different processes of the simulated system use this simulated network. Each simulated process runs on top of the SimGrid kernel. The SimGrid kernel schedules the execution of the different processes on the available OS threads. The SimGrid kernel provides an API to the processes made of several simulation calls.
In order to reduce the cost of context switching between the different processes, in the current implementation of SimGrid all the simulated processes and the SimGrid kernel are in the same OS process: there is no MMU-enforced separation of memory between the simulated processes but they are expected to only communicate between each other using only the means provided by the SimGrid kernel (the simulation calls) and should not share mutable memory.
The SimGrid kernel has a dedicated stack and each simulated process has its own stack: cooperative multitasking (fibers, ucontext
) is used to switch between the different contexts (SimGrid kernel/process) and is used by the SimGrid kernel to schedule the execution of the different processes.
The same (libc
) heap is shared between the SimGrid kernel and the simulated processes.
SimGridMC architecture
The SimGrid model-checker is a dynamic analysis component for SimGrid. It explores the different possible interleavings of execution of the simulated processes (depending on the execution of their transitions i.e., the different possible orderings of their communications).
In order to do this, the MC saves at each node of the graph of the possible executions the state of the system:
- the global variables;
- the stack of each process;
- the heap;
- the file descriptors.
Those states are then used to:
- restore a previously visited state in order to explore other parts of the execution graph;
- detect equivalent states which is necessary in order to,
- avoid exploring redundant parts of the execution graphs,
- check liveness properties (using Double Depth Frist Search).
In the current implementation, the model-checker lives in the same process as the main SimGrid process (the SimGrid kernel and the processes):
Multiple heaps
However, the model-checker needs to maintain its own state: the state of the model-checker must not be saved, compared and restored with the rest of the state.
In order to do this, the state of the model-checker is maintained in a second heap:
- the main heap is used for the SimGrid kernel and the simulated processes;
- a model-checker heap is used for the model-checker and is used for all the state of the model-checker (current state of the execution graph).
This is implemented by overriding the malloc()
, free()
and friends in order to support multiple heap. A global variable is used to choose the current working heap:
// Simplified code
xbt_mheap_t __mmalloc_current_heap = NULL;
void *malloc(size_t n)
{
return mmalloc(__mmalloc_current_heap, n);
}
void free(void *ptr)
{
return mfree(__mmalloc_current_heap, ptr);
}
Limitation of the approach
The current implementation is complicated and not easy to understand and maintain:
- Because of the two heaps, the code of the MC is littered with fragile code to switch between the heaps. An error at this level can lead to weird bugs or break the snapshot/restoration mechanism and the state detection mechanism.
- There are many opportunities of breakage in the state snapshot/restore/equality mechanism: using a
static
variable as a cache in the model-checker, sharing data between the model-checker and the simulated process. - Because of the two heaps, there is a strong dependency on the
mmalloc()
implementation. This implementation is probably not as efficient as a more modernmalloc()
implementation: for example, it does not use a per-thread arena or any sort of thread-friendly approach but a single mutex per heap. Avoiding to have multiple heaps per process would remove a dependency on themalloc()
implementation (we would still have a dependency on themmalloc()
metadata format) and would make it easier to switch to anothermalloc()
implementation.
A first motivation for modifying the architecture of SimGridMC, is to incraase the maintainability of the SimGridMC codebase.
Another related goal is to simplify the debugging experience (of the simulated application, the SimGrid kernel and the model-checker). For example, the current version of SimGridMC does not work under valgrind. A solution which would provide a more powerful debugging experience would be a valuable tool for the SimGridMC devs but more importantly for the users of SimGridMC.
Process-based isolation
For all these reasons, we would like to move the model-checker in a separate process: a model-checker process maintains the model-checker state and control the execution of a model-checked process.
Memory snapshot/restoration
The snapshoting/restoration of the model-checked process memory can be done using /proc/${pid}/mem
or process_vm_readv()
and process_vm_writev()
.
As long as the OS threads are living on stacks which are not managed by the state snapshot/restoration mechanism, they will not be affected: we must take care that the OS threads switch to unmanaged stacks when we are doing the state snapshots/restorations.
Another solution would be to use ptrace()
with PTRACE_GETREGSET
and PTRACE_SETREGSET
in order to snapshot/restore the registers of each thread but we would like to avoid this in order to be able to use ptrace()
for debugging or other purposes.
File descriptors restoration
Linux does not provide a way to change the file descriptors of another process: the restoration of the file descriptors must be done in the taret OS process and cannot be done from the model-checker process. Cooperation of the model-checked process is needed for the file descriptors restoration.
We could abuse ptrace()
-based syscall rewriting techniques or some sort of parasite injection in order to achieve this.
Dynamic-linker based isolation
Another idea would be to create a custom dynamic linker with namespace support in order to be able to link multiple instances of the same library and provide isolation between different parts of the process.
This could be used to:
- load multiple instances of the same application compiled as a shared object (in PIC) as an alternative to our
mmap()
based SMPI privatisation technique; - separate the
libc
of the model-checker and the main application in order to have a cleaner separation of the two heaps; - run the simulated applications with overloaded functions hooked into the SimGrid kernel in order as an alternative to
ptrace()
-based system call interception.
Prior art in DCE with dlmopen()
It turns out that DCE (Direct Code Execution) already uses a similar approach to load multiples application instances along with Linux kernel implementations (and its network stack) on top of the NS3 packet level network simulator in the same process: the applications and Linux kernel are compiled as shared objects, the latter forming a Library OS liblinux.so
shared object and loaded multiple times in the same process alongside with the NS3 instance.
Among several alternative strategies, DCE uses the dlmopen()
function. This is a variant of dlopen()
originating from SunOS/Solaris and implemented on the GNU userland which allows to load dynamic libraries in separated namespaces:
- each namespace has its own symbols;
- each namespace can have a distinct instance of a same shared object.
An alternative implementation of the ld.so
dynamic linker, elf-loader
, is used which provides additional features:
- unlimited number of namespaces;
- remapping libraries and symbols.
More information about dlmopen()
can be found in old version of Sun Linkers and Libraries Guide.
A custom dynamic loader/linker on top of libc
However, I was envisioning something slightly different: instead of writing a replacement of ld.so
(using raw system calls), I was thinking about building the custom dynamic linker on top of libc
and libdl
in order to be able to use libc
(malloc()
), libdl
and libelf
instead of using the raw system calls.
Impact on debuggability
In a split process design, the model-checker could be a quite standard application avoiding weird hacks (such as introspection with /proc/self/maps
and DWARF, snapshoting/restoration of the state with memcpy()
, custom mmalloc()
implementation with multiple heaps). Once a relevant trajectory of the model-checked application has been identified, it could be replayed outside of the model-checker and debugged in this simpler mode.
However, having a single process could lead to a better debugging experience: by being able to combines breakpoints in the model-checker, the SimGrid kernel and the simulated application with conditions spanning all those components.
At the same time, using multiple dynamic-linking namespaces could make the debugging experience more complicated. I am not sure how well it is supported by the different available debugging tools. The DCE tools seems to show that it is reasonably well supported by GDB and valgrind.
Conclusion
So we have two possible directions:
- moving the model-checker in a separate process in order to isolate it;
- using dynamic linking tricks in order to isolate the model-checker from the rest of the application,
- using
elf-loader
; - using a new dynamic linker built on top of
libc
. (This is quite a large undertaking but being able to use shared libraries could simplify its code base.)
- using
The first solution provides a better isolation of the model-checker. The second solution is closer to the current implementation and should have better performances by avoiding the context switches and IPC in favour of direct memory access and function calls. Moreover, the dynamic-linker-based isolation could be reused for other parts of the projects (such as the isolation of the simulated MPI processes).
It is not clear which solution would provide the better debugging experience for the user and which solution would be better for the maintainability of SimGridMC.
Appendix: dlmopen()
quick demo
This simple program creates three new namespaces and loads libpthread
in those namespaces:
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>
int main(int argc, const char** argv)
{
size_t i;
for (i=0; i!=3; ++i) {
void* x = dlmopen(LM_ID_NEWLM, "libpthread.so.0", RTLD_NOW);
if (!x)
return 1;
}
while(1) sleep(200000);
return 0;
}
We see that libpthread
is loaded thrice. Each instance has its own libc
instance as well (and a fourth one is loaded for the main program):
00400000-00401000 r-xp 00000000 08:06 7603474 /home/myself/temp/a.out 00600000-00601000 rw-p 00000000 08:06 7603474 /home/myself/temp/a.out 0173a000-0175b000 rw-p 00000000 00:00 0 [heap] 7fca7ac7d000-7fca7ae1c000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7ae1c000-7fca7b01c000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b01c000-7fca7b020000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b020000-7fca7b022000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b022000-7fca7b026000 rw-p 00000000 00:00 0 7fca7b026000-7fca7b03e000 r-xp 00000000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b03e000-7fca7b23d000 ---p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b23d000-7fca7b23e000 r--p 00017000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b23e000-7fca7b23f000 rw-p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b23f000-7fca7b243000 rw-p 00000000 00:00 0 7fca7b243000-7fca7b3e2000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b3e2000-7fca7b5e2000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b5e2000-7fca7b5e6000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b5e6000-7fca7b5e8000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b5e8000-7fca7b5ec000 rw-p 00000000 00:00 0 7fca7b5ec000-7fca7b604000 r-xp 00000000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b604000-7fca7b803000 ---p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b803000-7fca7b804000 r--p 00017000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b804000-7fca7b805000 rw-p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7b805000-7fca7b809000 rw-p 00000000 00:00 0 7fca7b809000-7fca7b9a8000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7b9a8000-7fca7bba8000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7bba8000-7fca7bbac000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7bbac000-7fca7bbae000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7bbae000-7fca7bbb2000 rw-p 00000000 00:00 0 7fca7bbb2000-7fca7bbca000 r-xp 00000000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7bbca000-7fca7bdc9000 ---p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7bdc9000-7fca7bdca000 r--p 00017000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7bdca000-7fca7bdcb000 rw-p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so 7fca7bdcb000-7fca7bdcf000 rw-p 00000000 00:00 0 7fca7bdcf000-7fca7bf6e000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7bf6e000-7fca7c16e000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7c16e000-7fca7c172000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7c172000-7fca7c174000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so 7fca7c174000-7fca7c178000 rw-p 00000000 00:00 0 7fca7c178000-7fca7c17b000 r-xp 00000000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so 7fca7c17b000-7fca7c37a000 ---p 00003000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so 7fca7c37a000-7fca7c37b000 r--p 00002000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so 7fca7c37b000-7fca7c37c000 rw-p 00003000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so 7fca7c37c000-7fca7c39c000 r-xp 00000000 08:01 2625993 /lib/x86_64-linux-gnu/ld-2.19.so 7fca7c568000-7fca7c56b000 rw-p 00000000 00:00 0 7fca7c59a000-7fca7c59c000 rw-p 00000000 00:00 0 7fca7c59c000-7fca7c59d000 r--p 00020000 08:01 2625993 /lib/x86_64-linux-gnu/ld-2.19.so 7fca7c59d000-7fca7c59e000 rw-p 00021000 08:01 2625993 /lib/x86_64-linux-gnu/ld-2.19.so 7fca7c59e000-7fca7c59f000 rw-p 00000000 00:00 0 7fffa8481000-7fffa84a2000 rw-p 00000000 00:00 0 [stack] 7fffa85f5000-7fffa85f7000 r-xp 00000000 00:00 0 [vdso] 7fffa85f7000-7fffa85f9000 r--p 00000000 00:00 0 [vvar] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
The new namespaces are probably not fully functional in this state: there are probably conflicts to solve in the different instances. For example, each libc
probably tries to manage the same heap with sbrk()
.