SimGrid model checker explores the graph of possible executions of a simulated distributed application in order to verify safety and liveness properties. The model checker needs to store the state of the application in each node of the execution graph in order to detect cycles. However, saving the whole state of the application at each node of the graph leads to huge memory consumption and in some cases most of the time is spent copying data in order to take the snapshots of the application. We will see how we could solve this problem, using copy-on-write.
SimGrid simulates a distributed application on a single machine in a single OS process: this allows very efficient task switching as it can be done completely in userspace. All simulated process uses a shared heap and each one uses its own stack which is allocated on this shared heap.
The model checker lives in the same OS process and uses a separate heap. Each time it needs to take a snapshot of the application, the model checker makes a copy (using
memcpy()) of each memory area which is considered to contain a part of the state of the application:
the main heap (which includes the simulated processes stacks);
the read/write segment of the binary of the application (which which contains the
.data section containing the global variables of the application);
the read/write segment of libsimgrid (which contains the global variables of the SimGrid simulator).
Saving a lot of snapshots of the application can use a lot of memory. Some of the applications we are trying to model-check use the whole 256 GiB of RAM of the machines we are using. Moreover in some applications, most of the time is spent copying the data (the diagram is made with FlameGraph):
smpirun -wrapper "perf record -g -e cycles" -hostfile hostfile -platform msg_platform.xml -np 4 --cfg=model-check:1 --cfg=model-check/reduction:none --cfg=model-check/communications_determinism:1 --cfg=smpi/send_is_detached_thres:0 --cfg=model-check/max_depth:100000 --cfg=smpi/running_power:1e9 --cfg=contexts/factory:ucontext --cfg=model-check/visited:100 ./sp.S.4 perf script | ~/src/FlameGraph/stackcollapse-perf.pl | grep -v '^\[unknown\];' | ~/src/FlameGraph/flamegraph.pl > sp.S.4.svg
In practice, in many applications, only a small part of the memory of the application has changed between successive states. In order to evaluate this, I modified the model-checker to use the Linux soft-dirty mechanism:
after each snapshot, each memory page of the application is marked as soft-clean;
before doing the snapshot every page which is still soft-clean has not been modified by the application since the previous snapshot and could be shared with the previous snapshot.
On the previous benchmark (
sp.S.4 from the NAS Parallel Benchmarks Version 3.3), 99% of the memory pages of the state of the application were not touched between successive snapshots: at least 99% of the memory could be shared between successive snapshots when analysing this application.
Based on this observation, we would like to find a smarter way to take snapshots of the application with the following goals in mind:
share memory between the common parts of the snapshots;
avoid copying data as much as possible;
being able to share memory even after state restoration;
make an efficient restoration of the state when the model-checker needs to backtrack in the graph of executions.
The KSM (Kernel Samepage Merging) mechanism of the Linux kernel can be used to enable automatic page sharing between snapshots: the kernel finds memory pages with the same content, merges them and uses copy-on-write to unshare them if one of the virtual page is modified later on.
In order to do this, the application must mark each memory region where it wants the kernel to detect mergeable pages:
madvise(start, length, MADV_MERGEABLE);
KSM must be enabled system-wide (as root) with:
# Enable KSM: echo 1 > /sys/kernel/mm/ksm/run # Scan more pages: echo 10000 > /sys/kernel/mm/ksm/pages_to_scan
This solution is quite nonintrusive and has been implemented.
However, it does not address our second goal (avoid copying the data), the page must be completely copied and then the KSM kernel process will scan it to unmerge it. Moreover, scanning the pages in order to find duplicates in quite CPU intensive. As a result (this part needs to be verified), the memory pages are deduplicated slower than they are allocated which means that the memory reduction is very limited in practice.
This leads to the idea of doing explicit copy-on-write instead.
Copy-on-write in used on most POSIXish systems by the
fork() function. In the case of single-threaded application, a forked process could be seen as a snaphot of the simulated application. However, the snapshot memory does not live in the same virtual address space and is not easily available to the model checker without copying it back into the main process.
A possible solution to implement copy-on-write would be to implement it in userspace using
mprotect() is used to protect the pages against copy;
a signal handler is installed in order to handle the pagefaults (page deduplication, remove the protection on the page).
A memory-backed (tmpfs) file is used as an intermediate level between logical pages and physical pages: physical memory → file memory → virtual memory or swap. The
remap_file_pages() Linux system call can be used to create a non-linear mapping between physical pages and file pages.
However, this does not seem a suitable solution:
mprotect() is not designed to be used on a page granularity on Linux but to be used on the
vm_area_struct which is the structure used to in the kernel to represent a homogeneous memory mapping (they are seen in
mprotect() on a part of a
vm_area_struct will split it in parts: using
mprotect() at the page granularity will generate a lot of
vm_area_struct splits and merges and a in general a lot of
vm_area_struct for the process. Having a lot of
vm_area_struct have a bad impact on the performance of the application.
remap_file_pages() could be used to generate a non-linear mapping between physical pages and local pages but it does not seem to be designed to work at the page granularity as well: one system call per memory page must be used in this case.
Handling the pagefault is much more heavyweight than the standard copy-on-write. At least one system call is needed inside the signal handler to remove the protection on the unshared page leading to at least 2 round trips between userspace and kernelspace.
Update: remap_file_page is deprecated.
Some operating systems expose an in-process copy-on-write functionality. Some Mach-based systems expose it using the
vm_remap() Mach call. However, the only 64 bits OS supporting this seems to be XNU/Darwin/MacOS X/IOS: porting the model checker to XNU systems would take a lot of time (and it seems Darwin without MacOS X is quite dead anyway). It sounded like a good excuse to try the Hurd which is based on a Mach kernel but I discovered that it does not work wit 64 bits systems.
Linux does not expose an in-process copy-on-write functionality. I could try to add this feature to the Linux kernel: the copy-on-write logic would not be touched, the only missing bits is code to setup the copy-on-write regions properly inside the same process and a interface (syscall option…) to trigger it from userspace. I'm not sure our chances of merging this feature would be very high. but this might be a solution worth exploring in the future.
A native copy-on-write solution should address all of our goals. Page faults with page deduplication will slow the application down: in practice if a small number of pages are modified between different snapshots this should not be a big issue and I expect that it would still be a big win compared to the current implementation.
In the next episode, we will have a look at non copy-on-write solutions based on userspace-managed page-level snapshots.