Results on same-page-merging snapshots

In the previous episode, I talked about the implementation of a same-page-merging page store. On top of this, we can build same-page-merging snapshots for the SimGrid model checker.

Implementation

SimGrid agnostic layer

The next layer, on top of the page store, is generic logic for saving and restoring a contiguous area of memory pages:

/** @brief Take a per-page snapshot of a region
 *
 *  @param data            The start of the region (must be at the beginning of a page)
 *  @param page_count      Number of pages of the region
 *  @param pagemap         Linux kernel pagemap values for this region (or NULL)
 *  @param reference_pages Snapshot page numbers of the previous mc_softdirty_reset() (or NULL)
 *  @return                Snapshot page numbers of this new snapshot
 */
size_t* mc_take_page_snapshot_region(
  void* data, size_t page_count,
  uint64_t* pagemap, size_t* reference_pages);

/** @brief Restore a snapshot of a region
 *
 *  If possible, the restoration will be incremental
 *  (the modified pages will not be touched).
 *
 *  @param start_addr        Address of the first page where we have to restore the page
 *  @param page_count        Number of pages of the region
 *  @param pagenos           Array of page indices from the global page store
 *  @param pagemap           Linux kernel pagemap values for this region (or NULL)
 *  @param reference_pagenos Snapshot page numbers of the previous mc_softdirty_reset() (or NULL)
 */
void mc_restore_page_snapshot_region(
  void* start_addr, size_t page_count,
  size_t* pagenos,
  uint64_t* pagemap, size_t* reference_pagenos);

/** @brief Free memory of a page store
 */
void mc_free_page_snapshot_region(
  size_t* pagenos, size_t page_count);

/** @brief Reset the soft-dirty bits
 *
 *  This is done after checkpointing and after checkpoint restoration
 *  (if per-page checkpointing is used) in order to know which pages were
 *  modified.
 *
 *  See https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt
 */
void mc_softdirty_reset(void);
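
To make the intent of this API more concrete, here is a minimal sketch of taking, restoring and freeing a per-page snapshot on top of a toy page store. It is not the SimGrid implementation: every `toy_` name, the fixed 4 KiB page size, the fixed store capacity and the linear search for identical pages are assumptions made for illustration, and the pagemap and reference-page optimisations are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096      /* assumed page size */
#define TOY_STORE_CAPACITY 64   /* assumed (tiny) store capacity */

/* Hypothetical page store: each distinct page content is kept only once. */
typedef struct {
  unsigned char pages[TOY_STORE_CAPACITY][TOY_PAGE_SIZE];
  size_t refcount[TOY_STORE_CAPACITY];
  size_t used;                  /* slots handed out so far (never reused here) */
} toy_page_store_t;

/* Store one page, sharing it with an identical live page if there is one. */
size_t toy_store_page(toy_page_store_t* store, const void* page)
{
  for (size_t i = 0; i < store->used; i++) {
    if (store->refcount[i] > 0 &&
        memcmp(store->pages[i], page, TOY_PAGE_SIZE) == 0) {
      store->refcount[i]++;
      return i;
    }
  }
  assert(store->used < TOY_STORE_CAPACITY);
  memcpy(store->pages[store->used], page, TOY_PAGE_SIZE);
  store->refcount[store->used] = 1;
  return store->used++;
}

/* Take a snapshot: returns one page number per page of the region. */
size_t* toy_take_snapshot(toy_page_store_t* store,
                          const void* data, size_t page_count)
{
  assert(page_count > 0);
  size_t* pagenos = malloc(page_count * sizeof *pagenos);
  assert(pagenos != NULL);
  for (size_t i = 0; i < page_count; i++)
    pagenos[i] = toy_store_page(
        store, (const unsigned char*)data + i * TOY_PAGE_SIZE);
  return pagenos;
}

/* Restore a snapshot, leaving pages that already match untouched. */
void toy_restore_snapshot(const toy_page_store_t* store, void* start_addr,
                          size_t page_count, const size_t* pagenos)
{
  for (size_t i = 0; i < page_count; i++) {
    unsigned char* dst = (unsigned char*)start_addr + i * TOY_PAGE_SIZE;
    const unsigned char* src = store->pages[pagenos[i]];
    if (memcmp(dst, src, TOY_PAGE_SIZE) != 0)
      memcpy(dst, src, TOY_PAGE_SIZE);
  }
}

/* Drop the snapshot's references and free its page number array. */
void toy_free_snapshot(toy_page_store_t* store,
                       size_t* pagenos, size_t page_count)
{
  for (size_t i = 0; i < page_count; i++) {
    assert(store->refcount[pagenos[i]] > 0);
    store->refcount[pagenos[i]]--;
  }
  free(pagenos);
}
```

In this sketch the restoration is made incremental by comparing page contents. The API above takes `pagemap` and `reference_pagenos` arguments instead, so that unmodified pages can be recognised without reading them.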

SimGrid snapshot layer

The next layer is SimGrid-specific and handles part of the snapshotting logic.

State comparison layer

The most invasive part of this modification in the SimGrid codebase is the logic for reading data from the snapshots. Without this feature, a simple offset was applied to find the base of a variable in the snapshot; now, a software MMU algorithm must be applied, because a variable can be split across different non-contiguous memory pages. The whole logic for reading from snapshots had to be modified to handle this.
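
The software MMU lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the `mmu_` names, the snapshot structure and the 4 KiB page size are hypothetical, not the ones used in SimGrid. The address is split into a page index and an offset within the page, and a read that crosses a page boundary is assembled piece by piece into a caller-provided buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MMU_PAGE_BITS 12                            /* assumed 4 KiB pages */
#define MMU_PAGE_SIZE ((size_t)1 << MMU_PAGE_BITS)
#define MMU_OFFSET_MASK (MMU_PAGE_SIZE - 1)

/* Hypothetical description of a snapshotted region. */
typedef struct {
  uintptr_t start_addr;               /* page-aligned address in the live process */
  size_t page_count;                  /* number of pages of the region */
  const unsigned char* const* pages;  /* pages[i]: stored copy of the i-th page */
} mmu_region_snapshot_t;

/* Copy `size` bytes found at live address `addr` from the snapshot into `buf`.
 * The stored pages may be anywhere in memory and in any order. */
void* mmu_snapshot_read(const mmu_region_snapshot_t* snap,
                        uintptr_t addr, void* buf, size_t size)
{
  assert(addr >= snap->start_addr);
  size_t offset = addr - snap->start_addr;
  assert(offset + size <= snap->page_count * MMU_PAGE_SIZE);

  unsigned char* out = buf;
  while (size > 0) {
    size_t pageno = offset >> MMU_PAGE_BITS;     /* which page of the region */
    size_t in_page = offset & MMU_OFFSET_MASK;   /* offset inside that page */
    size_t chunk = MMU_PAGE_SIZE - in_page;      /* bytes left in this page */
    if (chunk > size)
      chunk = size;
    memcpy(out, snap->pages[pageno] + in_page, chunk);
    out += chunk;
    offset += chunk;
    size -= chunk;
  }
  return buf;
}
```

When the requested range fits inside a single page, the copy could be avoided by returning a pointer directly into the stored page. The sketch always copies to keep the code short.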

Results

These results were obtained with the following command:

# COMMAND: sendrecv2, mprobe or sendall
# SPARSE, SOFTDIRTY: yes or no
cd teshsuite/smpi/mpich3-test/pt2pt/
export TIME="clock:%e user:%U sys:%S swapped:%W exitval:%x max:%Mk"
setarch x86_64 -R time smpirun -hostfile ../hostfile -platform $(find ../../../.. -name small_platform_with_routers.xml) --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI --cfg=network/TCP_gamma:4194304 -np 4 --cfg=model-check:1 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 --cfg=model-check/reduction:none --cfg=model-check/visited:100000 --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:$SPARSE --cfg=model-check/soft-dirty:$SOFTDIRTY $COMMAND

They were run on a laptop with a quad-core Intel® Core™ i7-3687U CPU @ 2.10GHz and 8GiB of RAM. Note that the memory reported is the RSS and does not include swapped-out memory.

sendrecv2

In this example, we observe an 80% reduction in memory consumption in exchange for a slight slowdown. Soft-dirty tracking does not improve performance: some time is saved in user space by not comparing memory pages, but at least as much time is spent in kernel space tracking the soft-clean/soft-dirty pages.

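| Type                                                | clock  | user  | system | Max. RSS (KiB) |
|-----------------------------------------------------|--------|-------|--------|----------------|
| Simple snapshot                                     | 9.96s  | 9.16s | 0.78s  | 3 332 788      |
| Same-page-merging snapshot w/o soft-dirty tracking  | 10.02s | 9.82s | 0.19s  | 540 420        |
| Same-page-merging snapshot with soft-dirty tracking | 10.70s | 8.86s | 1.80s  | 540 936        |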

mprobe

Similar results here:

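| Type                                                | clock  | user   | system | Max. RSS (KiB) |
|-----------------------------------------------------|--------|--------|--------|----------------|
| Simple snapshot                                     | 13.41s | 13.00s | 0.40s  | 1 692 492      |
| Same-page-merging snapshot w/o soft-dirty tracking  | 14.12s | 13.89s | 0.14s  | 414 916        |
| Same-page-merging snapshot with soft-dirty tracking | 14.44s | 13.16s | 1.25s  | 415 028        |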

sendflood

In this example, without same-page-merging snapshots the process starts swapping (the RSS does not include the swapped-out memory). In this case, using same-page-merging snapshots is faster because the process does not swap. Soft-dirty tracking does not help here either: a lot of time is lost marking the pages as soft-dirty/soft-clean.

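| Type                                                | clock  | user   | system | Max. RSS (KiB) |
|-----------------------------------------------------|--------|--------|--------|----------------|
| Simple snapshot                                     | 73.31s | 56.34s | 5.26s  | 7 213 956      |
| Same-page-merging snapshot w/o soft-dirty tracking  | 59.12s | 56.87s | 2.22s  | 1 570 312      |
| Same-page-merging snapshot with soft-dirty tracking | 82.74s | 53.71s | 29.06s | 1 609 048      |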

Conclusion

This approach achieves a substantial reduction in memory consumption without a significant impact on performance. With this technique, we should be able to handle bigger application problems and save more states of the application. These tests were run on applications where a lot of pages change between snapshots. On applications where many pages are not modified, the reduction in memory consumption should be much greater.

Soft-dirty tracking does not seem to be very efficient in our tests. It might be useful when the application is swapping, as it could avoid swapping while taking a snapshot. This feature will probably be disabled by default and might be removed in the future.

It should be possible to increase the efficiency of the method by increasing page sharing:

It should be possible to speed up the process by:

We used the memory page as the granularity, but this is not strictly necessary. We might use a finer granularity in order to increase the sharing between snapshots. The granularity (the size of the chunks) should be regular and a power of 2 (in order to be able to apply the MMU algorithm). However, the memory overhead would be greater (index of the page chunk store, number of page chunk indices stored for each snapshot).
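
The trade-off can be made concrete with a little arithmetic. The sketch below uses hypothetical `chunk_` helpers and assumes one `size_t` index per chunk. It shows why a power-of-two chunk size keeps the address split down to a shift and a mask, and how the number of indices stored per snapshot grows as the chunks shrink.

```c
#include <assert.h>
#include <stddef.h>

/* Split an offset in a region into (chunk number, offset inside the chunk).
 * chunk_bits is log2 of the chunk size, e.g. 12 for 4 KiB pages. */
void chunk_split(size_t offset, unsigned chunk_bits,
                 size_t* chunk_no, size_t* chunk_offset)
{
  size_t mask = ((size_t)1 << chunk_bits) - 1;
  *chunk_no = offset >> chunk_bits;
  *chunk_offset = offset & mask;
}

/* Number of chunk indices a snapshot must store for a region (rounded up). */
size_t chunk_index_count(size_t region_size, unsigned chunk_bits)
{
  size_t chunk_size = (size_t)1 << chunk_bits;
  return (region_size + chunk_size - 1) >> chunk_bits;
}

/* Bytes of per-snapshot index, assuming one size_t index per chunk. */
size_t chunk_index_overhead(size_t region_size, unsigned chunk_bits)
{
  return chunk_index_count(region_size, chunk_bits) * sizeof(size_t);
}
```

For a 1 MiB region, a snapshot stores 256 indices with 4 KiB pages and 4096 indices with 256-byte chunks. The per-snapshot index is therefore 16 times larger, before counting the bigger index needed by the chunk store itself.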