Cleaning the stack in a LLVM pass
Published:
Updated:
In the previous episode, we implemented a LLVM pass which does nothing. Now we are trying to modify this to create a (proof-of-concept) LLVM pass which fills the current stack frame with zero before using it.
Table of content
Structure of the x86-64 stack
Basic structure
The top (in fact the bottom) of the stack is stored in the %rsp register: a push operation decrements the value of %rsp and store the value in the resulting address; conversely a pop operation increments the value of %rsp. Stack variables are allocated by decrementing %rsp.
A function call (call) pushes the current value of the instruction (%rip) pointer on the stack. A return instruction (ret) pops a value from the stack into %rip.
A typical call frame contains in order:
- function parameters if needed (most parameters are passed by register however);
- return address to the caller;
- local variables.
- parameter for f()
- parameter for f()
- return address to caller of f()
- local variable for f()
- local variable for f()
- parameter for g()
- parameter for g()
- return address to f()caller ofg()
- local variable for g()
- local variable for g()←%rsp
f() calls g()For example this C code,
int f();
int main(int argc, char** argv) {
  int i = 42;
  f();
  return 0;
}
is compiled (with clang -S -fomit-frame-poiner example.c) into this (using AT&T syntax):
main:
	subq	$24, %rsp
	movl	$0, 20(%rsp)
	movl	%edi, 16(%rsp)
	movq	%rsi, 8(%rsp)
	movl	$42, 4(%rsp)
	movb	$0, %al
	callq	f
	movl	$0, %edi
	movl	%eax, (%rsp)
	movl	%edi, %eax
	addq	$24, %rsp
	ret
Memory is allocated on the stack using subq. Local variables are usually referenced by offsets from the stack pointer, OFFSET(%rsp).
Frame pointer
The x86 (32 bit) ABI uses the %rbp as the base of the stack. This is not mandatory in the x86-64 ABI but the compiler might still use a frame pointer. The base of the stack frame in stored in %rbp.
- parameter for f()
- parameter for f()
- return address to caller of f()
- saved %rbpfrom caller off()← saved%rbp
- local variable for f()
- local variable for f()
- parameter for g()
- parameter for g()
- return address to f()caller ofg()
- saved %rbpfromf()←%rbp
- local variable for g()
- local variable for g()←%rsp
f() calls g() with frame pointer Here is the same program compiled with -fno-omit-frame-pointer:
main:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$32, %rsp
	movl	$0, -4(%rbp)
	movl	%edi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movl	$42, -20(%rbp)
	movb	$0, %al
	callq	f
	movl	$0, %edi
	movl	%eax, -24(%rbp)
	movl	%edi, %eax
	addq	$32, %rsp
	popq	%rbp
	ret
When a frame pointer is used, stack memory is usually referenced as fixed offset from %rsp: OFFSET(%rsp).
Red zone
The x86 32-bit ABI did not allow the code of the function to use variables after the top of the stack: a signal handler could at any moment use any memory after the top of the stack.
The standard x86-64 ABI allows the code of the current function to use the 128 bytes (the red zone) after the top the stack. A signal handler must be instantiated by the OS after the red zone. The red zone can be used for temporary variables or for local variables for leaf functions (functions which do not call other functions).
- parameter for f()
- parameter for f()
- return address to caller of f()
- local variable for f()
- local variable for f()
- parameter for g()
- parameter for g()
- return address to f()caller ofg()
- local variable for g()
- local variable for g()←%rsp
- red zone
- …
- red zone
f() calls g() (with the red zone)Note: Windows systems do not use the standard x86-64 ABI: the usage of the register is different and there is no red zone.
Let's make main() a leaf function:
int main(int argc, char** argv) {
  int i = 42;
  return 0;
}
The variables are allocated in the red zone (negative offsets from the stack pointer):
main:
        movl    $0, %eax
        movl    $0, -4(%rsp)
        movl    %edi, -8(%rsp)
        movq    %rsi, -16(%rsp)
        movl    $42, -20(%rsp)
        ret
Cleaning the stack
Assembly
Here is the code we are going to add at the beginning of each function:
	movq $QSIZE, %r11
.Lloop:
        movq $0, OFFSET(%rsp,%r11,8)
        subq $1, %r11
        jne  .Lloop
for some suitable values of QSIZE and OFFSET.
The %r11 is defined by the System V x86-64 ABI (as well as the Windows ABI) as a scratchpad register: at the beginning of the function we are free to use it without saving it first.
LLVM pass
This is implemented by a StackCleaner machine pass whose runOnMachineFunction() works similarly to the NopInserter pass.
Parameter computation
We compute the parameters of the generate native code from the size of the stack frame:
- fn.getFrameInfo()->getStackSize()is the size of the stack used by this function (excluding the red zone);
- we do not need to care about the red zone as is it only used by LLVM on leaf functions (at least on this version of LLVM, see X86FrameLowering.cpp) and SimGridMC does not analyse the stack of leaf functions (we would just have to add 128 tosizein order to clean up the red zone as well);
- dynamic-size allocations (alloca()) are not counted here.
int size = fn.getFrameInfo()->getStackSize();
int qsize = size / sizeof(uint64_t);
if (size==0) {
  // No stack to clean, we do not modify the function:
  return false;
}
int offset = - size - sizeof(uint64_t);
Basic blocks
For LLVM, a functions is represented as a collection of basic blocks. A basic block is a sequence of instructions where:
- only the first instruction (the block leader) can be a target of a branching instruction (it is the only instructions which is allowed to have a label);
- only the last instruction is a branching instruction (jump or return).
Our assembly snippet is made of two basic blocks:
- the first instruction;
- the end of the snippet.
MachineBasicBlock* bb0 = fn.begin();
MachineBasicBlock* bb1 = fn.CreateMachineBasicBlock();
MachineBasicBlock* bb2 = fn.CreateMachineBasicBlock();
fn.push_front(bb2);
fn.push_front(bb1);
A functions is a Control Flow Graph of basic blocks. We need to complete the arcs in this graph:
bb1->addSuccessor(bb1);
bb2->addSuccessor(bb2);
bb2->addSuccessor(bb0);
Machine instruction generation
We generate the machine instructions:
// First basic block (initialisation):
// movq $QSIZE, %r11
llvm::BuildMI(*bb1, bb1->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64ri),
  X86::R11).addImm(qsize);
// Second basic block (.Lloop):
// movq $0, OFFSET(%rsp,%r11,8)
llvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64mi32))
  .addReg(X86::RSP).addImm(8).addReg(X86::R11).addImm(offset).addReg(0)
  .addImm(0);
// subq $1, %r11
llvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::SUB64ri8),
  X86::R11)
  .addReg(X86::R11)
  .addImm(1);
// jne  .Lloop
llvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::JNE_4))
  .addMBB(bb2);
The instructions have suffix on the argument size and types:
- 64for instructions working on 64-bit values;
- rfor register;
- ifor immediate;
- ifor memory.
Modification notification
The function has been modified:
return true;
Result
Generated assembly
Here is the generated assembly for our test code:
main:
	movabsq	$3, %r11
.LBB0_1:
	movq	$0, -32(%rsp,%r11,8)
	subq	$1, %r11
	jne	.LBB0_1
	subq	$24, %rsp
	movl	$0, 20(%rsp)
	movl	%edi, 16(%rsp)
	movq	%rsi, 8(%rsp)
	movl	$42, 4(%rsp)
	movb	$0, %al
	callq	f
	movl	$0, %edi
	movl	%eax, (%rsp)
	movl	%edi, %eax
	addq	$24, %rsp
	retq
Test program
Here is a simple test program using unitialized stack variables:
#include <stdio.h>
void f() {
  int i;
  int data[16];
  for(i=0; i!=16; ++i)
    printf("%i ", data[i]);
  printf("\n");
  for(i=0; i!=16; ++i)
    data[i] = i;
}
void g() {
  int i, j, k, l, m, n, o, p;
  printf("%i %i %i %i %i %i %i %i\n", i, j, k, l, m, n, o, p);
}
int main(int argc, char** argv) {
  f();
  f();
  g();
  return 0;
}
This is the output of a normal compilation:
-1 0 -812203224 32767 -406470232 32655 -400476992 32655 -400465496 32655 0 0 1 0 4195997 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 0 15774463 15 14 13 12
And with our stack-cleaning clang:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Result on SimGrid
The whole SimGrid test suite works without compiling SimgridMC support.
At this point, I discovered that SimGrid fails to run when compiled with clang (or DragonEgg) with support for SimGridMC. I need to fix this first before testing the impact of cleaning the stack on SimGridMC state comparison.
In the next episode, I'll try another implementation of the same concept using a few scripts in order to process the generated assembly between the compiler and the assembler which should work with a standard GCC and with SimGridMC.