Cleaning the stack in a LLVM pass

| πŸ€” | πŸ‘ | πŸ‘Ž |

Previous episode:

Next episode:

In the previous episode, we implemented a LLVM pass which does nothing. Now we are trying to modify this to create a (proof-of-concept) LLVM pass which fills the current stack frame with zero before using it.

Table of Content

Structure of the x86-64 stack

Basic structure

The top (in fact the bottom) of the stack is stored in the %rsp register: a push operation decrements the value of %rsp and store the value in the resulting address; conversely a pop operation increments the value of %rsp. Stack variables are allocated by decrementing %rsp.

A function call (call) pushes the current value of the instruction (%rip) pointer on the stack. A return instruction (ret) pops a value from the stack into %rip.

A typical call frame contains in order:

  • function parameters if needed (most parameters are passed by register however);

  • return address to the caller;

  • local variables.

  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. local variable for f()
  5. local variable for f()
  6. parameter for g()
  7. parameter for g()
  8. return address to f() caller of g()
  9. local variable for g()
  10. local variable for g() ← %rsp
x86-64 stack structure for f() calls g()

For example this C code,

int f();

int main(int argc, char** argv) {
  int i = 42;
  f();
  return 0;
}

is compiled (with clang -S -fomit-frame-poiner example.c) into this (using AT&T syntax):

main:
    subq    $24, %rsp
    movl    $0, 20(%rsp)
    movl    %edi, 16(%rsp)
    movq    %rsi, 8(%rsp)
    movl    $42, 4(%rsp)
    movb    $0, %al
    callq   f
    movl    $0, %edi
    movl    %eax, (%rsp)
    movl    %edi, %eax
    addq    $24, %rsp
    ret

Memory is allocated on the stack using subq. Local variables are usually referenced by offsets from the stack pointer, OFFSET(%rsp).

Frame pointer

The x86 (32 bit) ABI uses the %rbp as the base of the stack. This is not mandatory in the x86-64 ABI but the compiler might still use a frame pointer. The base of the stack frame in stored in %rbp.

  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. saved %rbp from caller of f() ← saved %rbp
  5. local variable for f()
  6. local variable for f()
  7. parameter for g()
  8. parameter for g()
  9. return address to f() caller of g()
  10. saved %rbp from f() ← %rbp
  11. local variable for g()
  12. local variable for g() ← %rsp
x86-64 stack structure for f() calls g() with frame pointer

Here is the same program compiled with -fno-omit-frame-pointer:

main:
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $32, %rsp
    movl    $0, -4(%rbp)
    movl    %edi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movl    $42, -20(%rbp)
    movb    $0, %al
    callq   f
    movl    $0, %edi
    movl    %eax, -24(%rbp)
    movl    %edi, %eax
    addq    $32, %rsp
    popq    %rbp
    ret

When a frame pointer is used, stack memory is usually referenced as fixed offset from %rsp: OFFSET(%rsp).

Red zone

The x86 32-bit ABI did not allow the code of the function to use variables after the top of the stack: a signal handler could at any moment use any memory after the top of the stack.

The standard x86-64 ABI allows the code of the current function to use the 128 bytes (the red zone) after the top the stack. A signal handler must be instantiated by the OS after the red zone. The red zone can be used for temporary variables or for local variables for leaf functions (functions which do not call other functions).

  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. local variable for f()
  5. local variable for f()
  6. parameter for g()
  7. parameter for g()
  8. return address to f() caller of g()
  9. local variable for g()
  10. local variable for g() ← %rsp
  11. red zone
  12. …
  13. red zone
x86-64 stack structure for f() calls g() (with the red zone)

Note: Windows systems do not use the standard x86-64 ABI: the usage of the register is different and there is no red zone.

Let's make main() a leaf function:

int main(int argc, char** argv) {
  int i = 42;
  return 0;
}

The variables are allocated in the red zone (negative offsets from the stack pointer):

main:
        movl    $0, %eax
        movl    $0, -4(%rsp)
        movl    %edi, -8(%rsp)
        movq    %rsi, -16(%rsp)
        movl    $42, -20(%rsp)
        ret

Cleaning the stack

Assembly

Here is the code we are going to add at the beginning of each function:

    movq $QSIZE, %r11
.Lloop:
        movq $0, OFFSET(%rsp,%r11,8)
        subq $1, %r11
        jne  .Lloop

for some suitable values of QSIZE and OFFSET.

The %r11 is defined by the System V x86-64 ABI (as well as the Windows ABI) as a scratchpad register: at the beginning of the function we are free to use it without saving it first.

LLVM pass

This is implemented by a StackCleaner machine pass whose runOnMachineFunction() works similarly to the NopInserter pass.

Parameter computation

We compute the parameters of the generate native code from the size of the stack frame:

  • fn.getFrameInfo()->getStackSize() is the size of the stack used by this function (excluding the red zone);

  • we do not need to care about the red zone as is it only used by LLVM on leaf functions (at least on this version of LLVM, see X86FrameLowering.cpp) and SimGridMC does not analyse the stack of leaf functions (we would just have to add 128 to size in order to clean up the red zone as well);

  • dynamic-size allocations (alloca()) are not counted here.

int size = fn.getFrameInfo()->getStackSize();
int qsize = size / sizeof(uint64_t);
if (size==0) {
  // No stack to clean, we do not modify the function:
  return false;
}
int offset = - size - sizeof(uint64_t);

Basic blocks

For LLVM, a functions is represented as a collection of basic blocks. A basic block is a sequence of instructions where:

  • only the first instruction (the block leader) can be a target of a branching instruction (it is the only instructions which is allowed to have a label);

  • only the last instruction is a branching instruction (jump or return).

Our assembly snippet is made of two basic blocks:

  1. the first instruction;

  2. the end of the snippet.

MachineBasicBlock* bb0 = fn.begin();
MachineBasicBlock* bb1 = fn.CreateMachineBasicBlock();
MachineBasicBlock* bb2 = fn.CreateMachineBasicBlock();

fn.push_front(bb2);
fn.push_front(bb1);

A functions is a Control Flow Graph of basic blocks. We need to complete the arcs in this graph:

bb1->addSuccessor(bb1);
bb2->addSuccessor(bb2);
bb2->addSuccessor(bb0);

Machine instruction generation

We generate the machine instructions:

// First basic block (initialisation):

// movq $QSIZE, %r11
llvm::BuildMI(*bb1, bb1->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64ri),
  X86::R11).addImm(qsize);

// Second basic block (.Lloop):

// movq $0, OFFSET(%rsp,%r11,8)
llvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64mi32))
  .addReg(X86::RSP).addImm(8).addReg(X86::R11).addImm(offset).addReg(0)
  .addImm(0);

// subq $1, %r11
llvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::SUB64ri8),
  X86::R11)
  .addReg(X86::R11)
  .addImm(1);

// jne  .Lloop
llvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::JNE_4))
  .addMBB(bb2);

The instructions have suffix on the argument size and types:

  • 64 for instructions working on 64-bit values;

  • r for register;

  • i for immediate;

  • i for memory.

Modification notification

The function has been modified:

return true;

Result

Generated assembly

Here is the generated assembly for our test code:

main:
    movabsq $3, %r11
.LBB0_1:
    movq    $0, -32(%rsp,%r11,8)
    subq    $1, %r11
    jne .LBB0_1
    subq    $24, %rsp
    movl    $0, 20(%rsp)
    movl    %edi, 16(%rsp)
    movq    %rsi, 8(%rsp)
    movl    $42, 4(%rsp)
    movb    $0, %al
    callq   f
    movl    $0, %edi
    movl    %eax, (%rsp)
    movl    %edi, %eax
    addq    $24, %rsp
    retq

Test program

Here is a simple test program using unitialized stack variables:

#include <stdio.h>

void f() {
  int i;
  int data[16];

  for(i=0; i!=16; ++i)
    printf("%i ", data[i]);
  printf("\n");

  for(i=0; i!=16; ++i)
    data[i] = i;
}

void g() {
  int i, j, k, l, m, n, o, p;
  printf("%i %i %i %i %i %i %i %i\n", i, j, k, l, m, n, o, p);
}

int main(int argc, char** argv) {
  f();
  f();
  g();
  return 0;
}

This is the output of a normal compilation:

-1 0 -812203224 32767 -406470232 32655 -400476992 32655 -400465496 32655 0 0 1 0 4195997 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 0 0 15774463 15 14 13 12

And with our stack-cleaning clang:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

Result on SimGrid

The whole SimGrid test suite works without compiling SimgridMC support.

At this point, I discovered that SimGrid fails to run when compiled with clang (or DragonEgg) with support for SimGridMC. I need to fix this first before testing the impact of cleaning the stack on SimGridMC state comparison.

In the next episode, I'll try another implementation of the same concept using a few scripts in order to process the generated assembly between the compiler and the assembler which should work with a standard GCC and with SimGridMC.

References