Avoiding cleaning the stack

Published: Nov 3 2014

Updated: Nov 3 2014

In two previous posts, I looked into cleaning the stack frame of a function before using it by adding assembly at the beginning of each function. This was done either by modifying LLVM with a custom codegen pass or by rewriting the assembly between the compiler and the assembler. The current implementation adds a loop at the beginning of every function. Here we look at the impact of this modification on the performance of the application.

Update: this is an updated version of the post with fixed code and updated results (the original version of the code was broken).

Initial results

Here are the initial results:

Test	Normal	Stack cleaning
`ctest` (complete testsuite)	348.06s	387.53s
`ctest -R mc-bugged1-liveness-visited-ucontext-sparse`	1.53s	2.00s
`run_test comm dup 4`	42.54s	127.80s

On big problems, the overhead of the stack-cleaning modification becomes very significant: `run_test comm dup 4` takes about three times as long (127.80s instead of 42.54s).

Optimisation

We would like to avoid the overhead of the stack-cleaning code. In order to do this we can use the following facts:

most of the time of SimGridMC is usually spent in the model-checker (MC), which lives on the main stack (with the simulator);
we only need to clean the stack in the code of the simulated applications, which are executed on their own stacks;
the applications are not executed while the simulator and model-checker are running.

Thus, we can disable stack-cleaning if we detect that we are not executing the application code. This can be implemented in two ways:

by adding a global variable which is set before executing application code;
by checking the address of the current stack (in %rsp).

In order to evaluate the efficiency of this approach, we use a simple comparison of %rsp with a constant value. When %rsp is at or above the constant (as on the main stack), the cleaning loop is skipped:

	movq $0x7fff00000000, %r11
	cmpq %r11, %rsp
	jae .Lstack_cleaner_done0
	movabsq $3, %r11
.Lstack_cleaner_loop0:
	movq    $0, -32(%rsp,%r11,8)
	subq    $1, %r11
	jne     .Lstack_cleaner_loop0
.Lstack_cleaner_done0:
	# Main code of the function goes here

The value is hardcoded in this prototype but it could be loaded from a global variable instead.
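
As a rough illustration of the global-variable variant, here is a minimal shell sketch of a generator for such a prologue. The names `emit_stack_cleaner` and `stack_cleaner_limit` are hypothetical and are not taken from the actual clean-stack-filter script. The offset formula is inferred from the single example above (3 words, offset -32). The sketch also assumes that the symbol can be reached with a RIP-relative load from the code being rewritten.

```shell
# Hypothetical helper: print a stack-cleaning prologue which loads the
# limit from a global variable instead of using a hardcoded constant.
#   $1: numeric suffix used to keep the labels unique
#   $2: number of 8-byte words to clear below %rsp (must be >= 1)
emit_stack_cleaner() {
  id=$1
  words=$2
  # With 0 words the countdown loop would never hit zero on its first
  # pass, so emit nothing in that case.
  if [ "$words" -lt 1 ]; then
    return 0
  fi
  # Inferred from the example above: 3 words <-> offset -32.
  offset=$(( 8 * (words + 1) ))
  cat <<EOF
    movq stack_cleaner_limit(%rip), %r11
    cmpq %r11, %rsp
    jae .Lstack_cleaner_done${id}
    movabsq \$${words}, %r11
.Lstack_cleaner_loop${id}:
    movq    \$0, -${offset}(%rsp,%r11,8)
    subq    \$1, %r11
    jne     .Lstack_cleaner_loop${id}
.Lstack_cleaner_done${id}:
EOF
}

# Same shape as the prototype above, with the limit read from memory.
emit_stack_cleaner 0 3
```

Only the first instruction differs from the hardcoded version. The price is one extra memory load per function call. The model-checker would have to set the (hypothetical) `stack_cleaner_limit` variable before any application code runs.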

Here are the results with this optimisation:

Test	Normal	Stack cleaning (optimised)
`ctest` (complete testsuite)	348.06s	372.95s
`ctest -R mc-bugged1-liveness-visited-ucontext-sparse`	1.53s	1.53s
`run_test comm dup 4`	42.54s	36.68s
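
To make the two tables easier to compare, the timings can be turned into relative overheads. This is a small sketch: the `overhead` helper is a hypothetical name and the figures are the ones from the tables.

```shell
# Relative overhead (in percent) of a measured time over a baseline.
#   $1: baseline time in seconds ("Normal" column)
#   $2: measured time in seconds ("Stack cleaning" column)
overhead() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t - base) / base * 100 }'
}

# Without the %rsp check:
overhead 348.06 387.53   # ctest                  -> 11.3
overhead 1.53 2.00       # single liveness test   -> 30.7
overhead 42.54 127.80    # run_test comm dup 4    -> 200.4

# With the %rsp check:
overhead 348.06 372.95   # ctest                  -> 7.2
overhead 1.53 1.53       # single liveness test   -> 0.0
overhead 42.54 36.68     # run_test comm dup 4    -> -13.8
```

A negative value means the instrumented build was measured as faster than the normal one for that run.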

Appendix: reproducibility

Those results were generated with:

MAKEFLAGS="-j$(nproc)"

git clone https://gforge.inria.fr/git/simgrid/simgrid.git
cd simgrid
git checkout cd84ed2b393b564f5d8bfdaae60b814f81f24dc4
simgrid="$(pwd)"

mkdir build-normal
cd build-normal
cmake .. -Denable_model-checking=ON -Denable_documentation=OFF \
  -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON
make $MAKEFLAGS
cd ..

mkdir build-zero
cd build-zero
cmake .. -Denable_model-checking=ON -Denable_documentation=OFF \
  -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON \
  -DCMAKE_C_COMPILER="$simgrid/tools/stack-cleaner/cc" \
  -DCMAKE_CXX_COMPILER="$simgrid/tools/stack-cleaner/c++" \
  -DGFORTRAN_EXE="$simgrid/tools/stack-cleaner/fortran"
make $MAKEFLAGS
cd ..

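# Run from a build directory (build-normal or build-zero):
#   run_test <test directory> <test binary> <number of processes>
# For example: run_test comm dup 4
run_test() {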
  (
  platform=$(find $simgrid -name small_platform_with_routers.xml)
  hostfile=$(find $simgrid | grep mpich3-test/hostfile$)

  local base
  base=$(pwd)
  cd $base/teshsuite/smpi/mpich3-test/$1/

  $base/bin/smpirun -hostfile $hostfile -platform $platform \
    --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \
    --cfg=network/TCP_gamma:4194304 \
    -np $3 --cfg=model-check:1 \
    --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \
    --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 \
    --cfg=model-check/reduction:none --cfg=model-check/visited:100000 \
    --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:yes \
    --cfg=model-check/soft-dirty:no ./$2 > /dev/null
  )
}

The results without the optimisation are obtained by removing the relevant assembly from the clean-stack-filter script.