Results on same-page-merging snapshots

Published: Jul 22 2014

Updated: Jul 22 2014

In the previous episode, I talked about the implementation of a same-page-merging page store. On top of this, we can build same-page-merging snapshots for the SimGrid model checker.

Implementation

SimGrid agnostic layer

The next layer on top of the page store is generic logic for saving and restoring a contiguous area of memory pages:

/** @brief Take a per-page snapshot of a region
 *
 *  @param data            The start of the region (must be at the beginning of a page)
 *  @param page_count      Number of pages of the region
 *  @param pagemap         Linux kernel pagemap values for this region (or NULL)
 *  @param reference_pages Snapshot page numbers of the previous mc_softdirty_reset() (or NULL)
 *  @return                Snapshot page numbers of this new snapshot
 */
size_t* mc_take_page_snapshot_region(
  void* data, size_t page_count,
  uint64_t* pagemap, size_t* reference_pages);

/** @brief Restore a snapshot of a region
 *
 *  If possible, the restoration will be incremental
 *  (the unmodified pages will not be touched).
 *
 *  @param start_addr        Address of the first page where we have to restore the page
 *  @param page_count        Number of pages of the region
 *  @param pagenos           Array of page indices from the global page store
 *  @param pagemap           Linux kernel pagemap values for this region (or NULL)
 *  @param reference_pagenos Snapshot page numbers of the previous mc_softdirty_reset() (or NULL)
 */
void mc_restore_page_snapshot_region(
  void* start_addr, size_t page_count,
  size_t* pagenos,
  uint64_t* pagemap, size_t* reference_pagenos);

/** @brief Free the page store entries referenced by a region snapshot
 */
void mc_free_page_snapshot_region(
  size_t* pagenos, size_t page_count);

/** @brief Reset the soft-dirty bits
 *
 *  This is done after checkpointing and after checkpoint restoration
 *  (if per-page checkpointing is used) in order to know which pages were
 *  modified.
 *
 *  See https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt
 */
void mc_softdirty_reset(void);
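
For reference, here is a minimal sketch of the kernel interface behind soft-dirty tracking, as described in the soft-dirty documentation linked above: writing "4" to /proc/self/clear_refs clears the soft-dirty bits, and bit 55 of each /proc/self/pagemap entry reports whether the page was written to since then. The `example_*` function names and the `PAGEMAP_SOFT_DIRTY` macro are hypothetical and are not the SimGrid implementation; the sketch assumes a Linux kernel built with soft-dirty support.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag. */
#define PAGEMAP_SOFT_DIRTY (UINT64_C(1) << 55)

/* Return 1 if the pagemap entry has its soft-dirty bit set, 0 otherwise. */
int example_pagemap_is_soft_dirty(uint64_t entry)
{
  return (entry & PAGEMAP_SOFT_DIRTY) != 0;
}

/* Clear the soft-dirty bits of the calling process.
 * Returns 0 on success, -1 on failure (e.g. no kernel support). */
int example_softdirty_reset(void)
{
  int fd = open("/proc/self/clear_refs", O_WRONLY);
  if (fd < 0)
    return -1;
  ssize_t written = write(fd, "4", 1);
  close(fd);
  return written == 1 ? 0 : -1;
}

/* Read the pagemap entries of page_count pages starting at addr
 * (which must be page-aligned). Returns 0 on success, -1 on failure. */
int example_read_pagemap(const void* addr, size_t page_count, uint64_t* entries)
{
  long page_size = sysconf(_SC_PAGESIZE);
  assert(page_size > 0 && (uintptr_t) addr % (uintptr_t) page_size == 0);

  int fd = open("/proc/self/pagemap", O_RDONLY);
  if (fd < 0)
    return -1;
  /* There is one 64-bit entry per virtual page, indexed by page number. */
  off_t offset = (off_t) ((uintptr_t) addr / (uintptr_t) page_size * sizeof(uint64_t));
  size_t bytes = page_count * sizeof(uint64_t);
  ssize_t got = pread(fd, entries, bytes, offset);
  close(fd);
  return got == (ssize_t) bytes ? 0 : -1;
}
```

A snapshot routine built on this would call `example_softdirty_reset()` after each snapshot, then read the pagemap before the next one and only look at the pages for which `example_pagemap_is_soft_dirty()` returns 1.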

SimGrid snapshot layer

The next layer is SimGrid-specific and handles part of the snapshotting logic:

resetting the soft-dirty bits by calling mc_softdirty_reset() after taking or restoring a snapshot;
generating SimGrid data-structures;
etc.

State comparison layer

The most invasive part of this modification in the SimGrid codebase is the logic to read data from the snapshots. Without this feature, a simple offset was applied to find the base of a variable in the snapshot: now, a software MMU algorithm must be applied. A variable can now be split across different non-contiguous memory pages. The whole logic of reading from snapshots had to be modified to handle this.
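
To make the software MMU idea concrete, here is a minimal sketch of reading a range of bytes out of a per-page snapshot. It is not the SimGrid code: the function name, the `EXAMPLE_*` constants, the fixed 4 KiB page size and the representation of the page store as one flat array are all assumptions made for the illustration. The offset within the region is split into a page number (translated through the snapshot's page index array) and an offset within the page, and the copy proceeds page by page because consecutive pages of the region may be stored anywhere in the page store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed page size for the example: 4 KiB. */
#define EXAMPLE_PAGE_BITS 12
#define EXAMPLE_PAGE_SIZE ((size_t) 1 << EXAMPLE_PAGE_BITS)

/* Copy `size` bytes found at `offset` inside a snapshotted region into `dest`.
 *
 * store   : base address of a (hypothetical) flat page store
 * pagenos : for each page of the region, its index in the page store
 * offset  : offset of the data from the start of the region
 */
void* example_snapshot_read(void* dest, const char* store,
                            const size_t* pagenos, size_t offset, size_t size)
{
  assert(dest != NULL && store != NULL && pagenos != NULL);
  char* out = dest;
  while (size > 0) {
    /* Split the offset into (page of the region, offset in that page). */
    size_t region_page = offset >> EXAMPLE_PAGE_BITS;
    size_t in_page = offset & (EXAMPLE_PAGE_SIZE - 1);

    /* Never read past the end of the current page. */
    size_t chunk = EXAMPLE_PAGE_SIZE - in_page;
    if (chunk > size)
      chunk = size;

    /* Translate the region page into its location in the page store. */
    const char* src = store + (pagenos[region_page] << EXAMPLE_PAGE_BITS) + in_page;
    memcpy(out, src, chunk);

    out += chunk;
    offset += chunk;
    size -= chunk;
  }
  return dest;
}
```

A variable which fits in a single page could be read in place without a copy; the copy is only needed when the variable straddles a page boundary.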

Results

These results were obtained with the command:

# COMMAND: sendrecv2, mprobe or sendall
# SPARSE, SOFTDIRTY: yes or no
cd teshsuite/smpi/mpich3-test/pt2pt/
export TIME="clock:%e user:%U sys:%S swapped:%W exitval:%x max:%Mk"
setarch x86_64 -R time smpirun -hostfile ../hostfile -platform $(find ../../../.. -name small_platform_with_routers.xml) --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI --cfg=network/TCP_gamma:4194304 -np 4 --cfg=model-check:1 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 --cfg=model-check/reduction:none --cfg=model-check/visited:100000 --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:$SPARSE --cfg=model-check/soft-dirty:$SOFTDIRTY $COMMAND

They were run on a laptop with a quad-core Intel® Core™ i7-3687U CPU @ 2.10GHz and 8GiB of RAM. Note that the memory reported is the RSS, which does not include swapped-out memory.

`sendrecv2`

In this example, we observe a reduction of more than 80% in memory consumption for a slight slowdown. Using soft-dirty tracking does not have a positive impact on performance: some time is gained in user land by not comparing memory pages, but at least as much time is spent in kernel space tracking the soft-clean/soft-dirty pages.

Type	clock	user	system	Max. RSS (KiB)
Simple snapshot	9.96s	9.16s	0.78s	3 332 788
Same-page-merging snapshot w/o soft-dirty tracking	10.02s	9.82s	0.19s	540 420
Same-page-merging snapshot with soft-dirty tracking	10.70s	8.86s	1.80s	540 936

`mprobe`

Type	clock	user	system	Max. RSS (KiB)
Simple snapshot	13.41s	13.00s	0.40s	1 692 492
Same-page-merging snapshot w/o soft-dirty tracking	14.12s	13.89s	0.14s	414 916
Same-page-merging snapshot with soft-dirty tracking	14.44s	13.16s	1.25s	415 028

`sendflood`

In this example, without the same-page-merging snapshot we hit the swap limit (the RSS does not include the swapped-out memory). In this case, using same-page-merging snapshots is faster because the process does not swap. Using soft-dirty tracking does not have a beneficial impact in this case either: a lot of time is lost marking the pages as soft-dirty/soft-clean.

Type	clock	user	system	Max. RSS (KiB)
Simple snapshot	73.31s	56.34s	5.26s	7 213 956
Same-page-merging snapshot w/o soft-dirty tracking	59.12s	56.87s	2.22s	1 570 312
Same-page-merging snapshot with soft-dirty tracking	82.74s	53.71s	29.06s	1 609 048

Conclusion

This approach achieves a substantial reduction of memory consumption without a significant impact on performance. With this technique we should be able to handle bigger application problems and save more states of the application. These tests were run on applications where a lot of pages change between snapshots. On applications where many pages are not modified, the reduction of memory consumption should be much bigger.

Soft-dirty tracking does not seem to be very efficient in our tests. It might be useful if the application is swapping, as it could avoid swapping when taking a snapshot. This feature will probably be disabled by default and might be removed in the future.

It should be possible to increase the efficiency of the method by increasing page sharing:

by setting to 0 the bytes of the heap which are not used (for example in free());
by setting to 0 the unused part of the stacks (and using a reference to a zero page instead);
by segregating data which do not change at the same time in different pages (in the SimGrid code);
by using compression and/or some delta encoding when reaching the limit of available RAM.
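
As an illustration of the zero-page idea in the list above, a snapshot routine could check whether a page is entirely zero before storing it, and record a reference to a single shared zero page when it is. The helper below is only a sketch of that check; its name is hypothetical and nothing like it is claimed to exist in SimGrid.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the page_size bytes at `page` are all zero, 0 otherwise.
 * A page for which this returns 1 does not need to be stored:
 * the snapshot can reference one shared zero page instead. */
int example_page_is_zero(const unsigned char* page, size_t page_size)
{
  assert(page != NULL);
  for (size_t i = 0; i < page_size; i++) {
    if (page[i] != 0)
      return 0;
  }
  return 1;
}
```

This only pays off if the unused memory really is zeroed, which is why the list suggests clearing freed heap blocks and unused stack space.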

It should be possible to speed up the process:

by scanning the heap metadata to avoid saving and restoring pages which are known to be unused;
by not saving or restoring the unused pages of the stacks (and using a reference to a zero page instead).

We used the granularity of the memory page but it is not strictly necessary. We might use a finer granularity in order to increase the sharing between snapshots. The granularity (the size of the chunks) should be regular and a power of 2 (in order to be able to apply the MMU algorithm). However, the memory overhead would be greater (a larger index for the chunk store and more chunk indices stored for each snapshot).
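
The per-snapshot part of this overhead is easy to estimate. The sketch below assumes that each snapshot stores one `size_t` index per chunk, as the page-based API above does with its arrays of page numbers; the function name is hypothetical, and the index of the chunk store itself is not counted.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes of chunk indices stored by one snapshot of a region
 * of region_size bytes, for chunks of (1 << chunk_bits) bytes. */
size_t example_index_bytes(size_t region_size, unsigned chunk_bits)
{
  assert(chunk_bits < sizeof(size_t) * 8);
  size_t chunk_size = (size_t) 1 << chunk_bits;
  /* Round up: a partially filled last chunk still needs an index. */
  size_t chunk_count = (region_size + chunk_size - 1) >> chunk_bits;
  return chunk_count * sizeof(size_t);
}
```

With an 8-byte `size_t`, a 1 GiB region needs 2 MiB of indices per snapshot with 4 KiB chunks, and 32 MiB with 256-byte chunks: dividing the chunk size by 16 multiplies this overhead by 16.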