Per-page shallow snapshots for the SimGrid model checker

Published: Jun 3 2014

Updated: Jun 3 2014

I looked at my options to achieve efficient/cheap snapshots of the simulated application for the Simgrid model checker using copy-on-write. Here I look at another solution to achieve this without using copy-on-write.

Checkpointing

Basic idea

The idea is to save each page of the state of the application independently: when a snapshot page is stored, the snapshoting logic first checks if a page with the same content is already stored in the snapshot pages:

if this is the case, a reference count of this page is simply incremented and the same page is shared between the snapshots;
otherwise, the page is copied in a free storage page and its reference count is set to 1.

The memory pages are only shared between the different snapshots but are never shared with the simulated application: copy-on-write is not used which means that the simulated application will not be slowned down by the unsharing page faults. As a result, the basic solution can be implemented purely in userspace.

The first snapshot will be a full snapshot. Other snapshots will usually be shallow: if 98% of the memory pages are not touched between successive snapshots, all those pages will be shared and only 2% of the pages will be copied in the second snapshot.

A hash of the content of the page can be used to limit the comparison of the new memory page with only a subset of the stored memory pages.

Better snapshots with soft-dirty page tracking

It is still necessary to scan and hash all the pages of the state of the simulated process each time a snapshot is done which seems to be quite inefficient. We can use the soft-dirty feature of the Linux kernel to detect which pages have been written since the previous snapshot and only try to store the modified ones.

After each snapshot, each page of the process is marked as soft-clean and protected against write. Each time a soft-clean page is touched, a page fault is raised: the kernel marks the page as soft-dirty and remove the protection on the page. On the next snapshot, it is possible to find which pages are soft-dirty (i.e., were modified since the previous snapshot) and only save those pages.

Incremental snapshot restoration

Even when restoring the state of a snapshot, we might use the soft-dirty information to avoid copying data which have not changed:

each soft-dirty page will have to be restored;
soft-clean pages have the same state as the previous snapshot (or previously restored snapshot) so the page needs to be restored only if the page is not the same in the previous snapshot as in the target snapshot.

If a lot of pages do not change between snapshots, this technique reduces the number of pages which needs to be copied to restore a snapshot (and avoid the related soft-dirty page faults).

Memory address translation

Once an efficient snapshoting strategy is implemented, I expect that in many cases, most of the time will be spent in the state comparison code: we need to find a solution to avoid spending too much time translating between the addresses of the simulated application and the addresses of the snapshots.

Create a linear view of the areas of the snapshots

We might want to create a linear view of the areas of each snapshot of in order to have a simple code for the address translation:

find_snapshot_address(real_address, snapshot)
{
  memory_area             ← find_memory_area(real_address)
  offset                  ← real_address - memory_area.start
  snapshot_area           ← find_snapshot_area(memory_area)
  snapshot_address        ← snapshot_area.start + offset
  return snapshot_address
}

Moreover and probably more importantly, as long as we are in the same memory region, we can apply an offset in the real address translate into the same offset in the linear view of the snapshot. This case happens all the time when we are comparing the states:

the local variables in the stack are at given offset from the base of the frame;
the field of a structure are at given offsets from the base of the structure;
the cells of an array are at given offsets from the base of the array;

In all thoses cases, we could apply a simple offset from the base snapshot address: if the memory pages of the snapshot are scattered in the virtual memory space, the model checker will have to apply the offset to the real base address and then translate the resulting address.

using non-linear memory mappings

One solution to create a linear view of a snapshot memory region, would be to use a non-linear memory mapping (remap_file_pages) of the snapshot memory:

a in-memory (tmpfs) file is used as a backend for the snapshot pages;
remap_file_pages() can be used to create a linear view of a given snapshot memory.

However, one remap_file_pages() call will be necessary per each memory page so I do not expect this solution to be very promising unless a more efficient version of this system call is added in a later release of the Linux kernel.

Update: remap_file_page is deprecated.

using incremental snapshot reconstruction

Another solution, is to create a linear copy of the snapshot areas. We incrementally update those copies to reflect different snapshots but only updating the pages which are different between the different snapshots.

When we want to compare the current state against another one, we first have to recreate a linear view of the snapshot of the latter by copying in the linear view all the memory pages which are different from the previous view.

We want avoid reconstructing the state memory. This can be done by creating a global hash of the state of the simulated application based on key characteristics of the state (such as the numberf of processes, the instruction pointers of each process in its stack frame, etc.).

Software MMU

The other solution is to replicate the algorithm of the MMU in software to translate from virtual pages into file pages:

find_snapshot_address(real_address, snapshot)
{
  page_number             ← get_page_number(address)
  offset                  ← get_offset(address)
  snapshot_page_number    ← get_snapshot_page_number(snapshot, page_number)
  snapshot_page_address   ← get_page_address(snapshot_page_number)
  snapshot_address        ← snapshot_page_address + offset
  return snapshot_address
}

As I said earlier, this might impact the performance of the state comparison.

Other granularity

We might use another granularity instead of the page: for example we might snapshot at the malloc() granularity:

the examples seem to indicate that the page granularity should be enough;
snapshoting without a fixed page would make the address translation algorithm much more complex;
with the page granularity we can potentially use virtual-memory tricks to try to speed things up (soft-dirty page tracking, non-linear memory mapping);
by using a smaller granularity we could increase the sharing of pages between states but it would increase the amount of memory for metadata about those chunks (reference counts, free chunks, which chunks are used in each snapshot, etc.).

Conclusion

This approach seems quite promising:

it does not need modification of the kernel;
the basic solution even works without any special system feature (probably quite poorly);
advanced Linux features such as soft-dirty page tracking might to enhance the performance of the application;
no copy-on-write page faults but more lightweight soft-dirty page faults.

It is not clear which variation will be the more efficient. I am probably going to implement the software MMU approach.