Avoiding cleaning the stack
In two previous posts, I looked into cleaning the stack frame of a function before using it by adding assembly at the beginning of each function. This was done either by modifying LLVM with a custom codegen pass or by rewriting the assembly between the compiler and the assembler. The current implementation adds a loop at the beginning of every function. In this post, we look at the impact of this modification on the performance of the application.
Update: this is an updated version of the post with fixed code and updated results (the original version of the code was broken).
Initial results
Here are the initial results:
Test | Normal | Stack cleaning |
---|---|---|
ctest (complete testsuite) | 348.06s | 387.53s |
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | 1.53s | 2.00s |
run_test comm dup 4 | 42.54s | 127.80s |
On big problems, the overhead of the stack-cleaning modification becomes very significant: with stack cleaning, run_test comm dup 4 takes about three times as long as the normal build.
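To put a number on each row, the slowdown factor can be computed directly from the timings in the table. The following Python sketch does this; the names `initial_results` and `slowdown` are only illustrative.

```python
# Timings in seconds copied from the table above: (normal, stack cleaning).
initial_results = {
    "ctest (complete testsuite)": (348.06, 387.53),
    "ctest -R mc-bugged1-liveness-visited-ucontext-sparse": (1.53, 2.00),
    "run_test comm dup 4": (42.54, 127.80),
}


def slowdown(normal, cleaned):
    """Return the stack-cleaning run time as a multiple of the normal run time."""
    return cleaned / normal


for test, (normal, cleaned) in initial_results.items():
    print(f"{test}: x{slowdown(normal, cleaned):.2f}")
```

The full test suite is slowed down by roughly a tenth, whereas the larger `comm dup` run is the one that suffers most.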
Optimisation
We would like to avoid the overhead of the stack-cleaning code. In order to do this we can use the following facts:
- most of the time of SimGridMC is usually spent in the model-checker (MC), which runs on the main stack (together with the simulator);
- we only need to clean the stack in the code of the simulated applications, which run on their own stacks;
- the applications are not executed while the simulator and the model-checker are running.
Thus, we can disable stack-cleaning if we detect that we are not executing the application code. This can be implemented in two ways:
- by adding a global variable which is set before executing application code;
- by checking the address of the current stack (in %rsp).
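The two options can be modelled in a few lines of Python. This is only a conceptual sketch, not SimGrid code: the flag `in_application_code`, the helper `running_application` and the parameter `main_stack_start` are hypothetical names, and the address check assumes that the main stack sits at higher addresses than the application stacks.

```python
from contextlib import contextmanager

# Hypothetical global flag: True only while simulated application code runs.
in_application_code = False


@contextmanager
def running_application():
    """Set the flag while control is inside application code, then restore it."""
    global in_application_code
    previous = in_application_code
    in_application_code = True
    try:
        yield
    finally:
        in_application_code = previous


def should_clean_by_flag():
    """Option 1: clean only when the global flag says application code is running."""
    return in_application_code


def should_clean_by_address(rsp, main_stack_start):
    """Option 2: clean only when the stack pointer is below the main stack.

    Assumption: the main stack occupies the addresses at or above main_stack_start.
    """
    return rsp < main_stack_start
```

The first option costs a memory load in every function prologue and requires the flag to be updated on every context switch; the second only needs the value already held in the stack pointer.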
In order to evaluate the efficiency of this approach, we use a simple comparison of %rsp with a constant value:
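# Skip the cleaning when %rsp is at or above the threshold (main stack):
movq $0x7fff00000000, %r11
cmpq %r11, %rsp
jae .Lstack_cleaner_done0
# Otherwise, zero the three quadwords just below %rsp (-8, -16 and -24):
movabsq $3, %r11
.Lstack_cleaner_loop0:
movq $0, -32(%rsp,%r11,8)
subq $1, %r11
jne .Lstack_cleaner_loop0
.Lstack_cleaner_done0:
# Main code of the function goes here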
The value is hardcoded in this prototype but it could be loaded from a global variable instead.
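To make the control flow of this prologue easier to follow, here is a small Python model of it. It is an illustration only, not code from SimGrid: the function name `simulate_prologue`, the `memory` dictionary and the `threshold` parameter (standing in for a value that would be loaded from a global variable) are assumptions made for this sketch.

```python
# Constant hardcoded in the prototype above.
MAIN_STACK_THRESHOLD = 0x7FFF00000000


def simulate_prologue(rsp, memory, threshold=MAIN_STACK_THRESHOLD):
    """Model the prologue for a frame of three quadwords.

    `memory` maps addresses to 64-bit values and is modified in place.
    Returns the list of addresses that were zeroed, in order.
    """
    # cmpq %r11, %rsp ; jae .Lstack_cleaner_done0
    if rsp >= threshold:
        return []

    cleared = []
    r11 = 3  # movabsq $3, %r11
    while True:
        address = rsp - 32 + r11 * 8  # effective address of -32(%rsp,%r11,8)
        memory[address] = 0           # movq $0, -32(%rsp,%r11,8)
        cleared.append(address)
        r11 -= 1                      # subq $1, %r11
        if r11 == 0:                  # jne loops only while the result is non-zero
            break
    return cleared


# An application stack somewhere below the threshold (made-up address).
stack = {}
print([hex(a) for a in simulate_prologue(0x700000001000, stack)])
```

With a stack pointer below the threshold, the three slots at `%rsp-8`, `%rsp-16` and `%rsp-24` are zeroed; with a stack pointer at or above it, nothing is touched and the function only pays for the comparison and the branch.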
Here are the results with this optimisation:
Test | Normal | Stack cleaning |
---|---|---|
ctest (complete testsuite) | 348.06s | 372.95s |
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | 1.53s | 1.53s |
run_test comm dup 4 | 42.54s | 36.68s |
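The effect of the check can be quantified by comparing the stack-cleaning column of the two tables. The sketch below computes, for each test, the share of the run time saved by the optimised prologue; the names `cleaning_times` and `time_saved` are illustrative only.

```python
# Stack-cleaning timings in seconds: (without the check, with the check).
cleaning_times = {
    "ctest (complete testsuite)": (387.53, 372.95),
    "ctest -R mc-bugged1-liveness-visited-ucontext-sparse": (2.00, 1.53),
    "run_test comm dup 4": (127.80, 36.68),
}


def time_saved(before, after):
    """Return the fraction of the run time removed by the optimisation."""
    return 1.0 - after / before


for test, (before, after) in cleaning_times.items():
    print(f"{test}: {time_saved(before, after) * 100:.1f}% faster")
```

The gain is small on the complete test suite and large on the `comm dup` run, which is where the unoptimised prologue was most expensive.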
Appendix: reproducibility
These results were generated with:
MAKEFLAGS="-j$(nproc)"
git clone https://gforge.inria.fr/git/simgrid/simgrid.git
cd simgrid
git checkout cd84ed2b393b564f5d8bfdaae60b814f81f24dc4
simgrid="$(pwd)"
mkdir build-normal
cd build-normal
cmake .. -Denable_model-checking=ON -Denable_documentation=OFF \
-Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON
make $MAKEFLAGS
cd ..
mkdir build-zero
cd build-zero
cmake .. -Denable_model-checking=ON -Denable_documentation=OFF \
-Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON \
-DCMAKE_C_COMPILER="$simgrid/tools/stack-cleaner/cc" \
-DCMAKE_CXX_COMPILER="$simgrid/tools/stack-cleaner/c++" \
-DGFORTRAN_EXE="$simgrid/tools/stack-cleaner/fortran"
make $MAKEFLAGS
cd ..
# Usage (from a build directory): run_test <test directory> <test binary> <number of processes>
run_test() {
  (
    platform=$(find $simgrid -name small_platform_with_routers.xml)
    hostfile=$(find $simgrid | grep mpich3-test/hostfile$)
    local base
    base=$(pwd)
    cd $base/teshsuite/smpi/mpich3-test/$1/
    $base/bin/smpirun -hostfile $hostfile -platform $platform \
      --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \
      --cfg=network/TCP_gamma:4194304 \
      -np $3 --cfg=model-check:1 \
      --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \
      --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 \
      --cfg=model-check/reduction:none --cfg=model-check/visited:100000 \
      --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:yes \
      --cfg=model-check/soft-dirty:no ./$2 > /dev/null
  )
}
The results without the optimisation are obtained by removing the relevant assembly from the clean-stack-filter script.