{"version": "https://jsonfeed.org/version/1", "title": "/dev/posts/ - Tag index - compilation", "home_page_url": "https://www.gabriel.urdhr.fr", "feed_url": "/tags/compilation/feed.json", "items": [{"id": "http://www.gabriel.urdhr.fr/2014/11/03/not-cleaning-the-stack/", "title": "Avoiding to clean the stack", "url": "https://www.gabriel.urdhr.fr/2014/11/03/not-cleaning-the-stack/", "date_published": "2014-11-03T00:00:00+01:00", "date_modified": "2014-11-03T00:00:00+01:00", "tags": ["computer", "simgrid", "compilation", "assembly", "x86_64"], "content_html": "

In two previous posts, I looked into cleaning the stack frame of a\nfunction before using it by adding assembly at the beginning of each\nfunction. This was done either by modifying LLVM with a custom\ncodegen pass or by\nrewriting the\nassembly\nbetween the compiler and the assembler. The current implementation\nadds a loop at the beginning of every function. We look at the impact\nof this modification on the performance of the application.

\n

Update: this is an updated version of the post with fixed\ncode and updated results (the original version of the code was\nbroken).

\n

Initial results

\n

Here are the initial results:

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Test | Normal | Stack cleaning
ctest (complete testsuite) | 348.06s | 387.53s
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | 1.53s | 2.00s
run_test comm dup 4 | 42.54s | 127.80s
\n

On big problems, the overhead of the stack-cleaning modification\nbecomes very significant.

\n

Optimisation

\n

We would like to avoid the overhead of the stack-cleaning code. In order\nto do this we can use the following facts:

\n\n

Thus, we can disable stack-cleaning if we detect that we are not\nexecuting the application code. This can be implemented in two ways:

\n\n

In order to evaluate the efficiency of this approach, we use a simple\ncomparison of %rsp with a constant value:

\n
    movq $0x7fff00000000, %r11\n    cmpq %r11, %rsp\n    jae .Lstack_cleaner_done0\n    movabsq $3, %r11\n.Lstack_cleaner_loop0:\n    movq    $0, -32(%rsp,%r11,8)\n    subq    $1, %r11\n    jne     .Lstack_cleaner_loop0\n.Lstack_cleaner_done0:\n    # Main code of the function goes here\n
\n\n\n

The value is hardcoded in this prototype but it could be loaded from a\nglobal variable instead.
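
The guard at the top of this snippet can be modelled in C as follows. This is only a sketch: the names should_clean_stack and guard_example are hypothetical, and reading addresses at or above the threshold as not belonging to the application stacks is an interpretation of the snippet, not something the prototype states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the cmpq/jae pair: jae is an unsigned comparison,
 * so the cleaning loop is skipped when rsp >= threshold. */
static bool should_clean_stack(uint64_t rsp, uint64_t threshold)
{
    return rsp < threshold;
}

/* Usage example with the constant hardcoded in the prototype. */
static void guard_example(void)
{
    const uint64_t threshold = 0x7fff00000000ULL;
    assert(should_clean_stack(0x7ffe00001000ULL, threshold));   /* below: clean */
    assert(!should_clean_stack(0x7fff00001000ULL, threshold));  /* above: skip */
}
```

Loading the threshold from a global variable, as suggested above, would only change where the second argument comes from.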

\n

Here are the results with this optimisation:

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Test | Normal | Stack cleaning
ctest (complete testsuite) | 348.06s | 372.95s
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | 1.53s | 1.53s
run_test comm dup 4 | 42.54s | 36.68s
\n

Appendix: reproducibility

\n

Those results were generated with:

\n
MAKEFLAGS=\"-j$(nproc)\"\n\ngit clone https://gforge.inria.fr/git/simgrid/simgrid.git\ngit checkout cd84ed2b393b564f5d8bfdaae60b814f81f24dc4\ncd simgrid\nsimgrid=\"$(pwd)\"\n\nmkdir build-normal\ncd build-normal\ncmake .. -Denable_model-checking=ON -Denable_documentation=OFF \\\n  -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON\nmake $MAKEFLAGS\ncd ..\n\nmkdir build-zero\ncd build-zero\ncmake .. -Denable_model-checking=ON -Denable_documentation=OFF \\\n  -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON \\\n  -DCMAKE_C_COMPILER=\"$simgrid/tools/stack-cleaner/cc\" \\\n  -DCMAKE_CXX_COMPILER=\"$simgrid/tools/stack-cleaner/c++\" \\\n  -DGFORTRAN_EXE=\"$simgrid/tools/stack-cleaner/fortran\"\nmake $MAKEFLAGS\ncd ..\n\nrun_test() {\n  (\n  platform=$(find $simgrid -name small_platform_with_routers.xml)\n  hostfile=$(find $simgrid | grep mpich3-test/hostfile$)\n\n  local base\n  base=$(pwd)\n  cd $base/teshsuite/smpi/mpich3-test/$1/\n\n  $base/bin/smpirun -hostfile $hostfile -platform $platform \\\n    --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n    --cfg=network/TCP_gamma:4194304 \\\n    -np $3 --cfg=model-check:1 \\\n    --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n    --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 \\\n    --cfg=model-check/reduction:none --cfg=model-check/visited:100000 \\\n    --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:yes \\\n    --cfg=model-check/soft-dirty:no ./$2 > /dev/null\n  )\n}\n
\n\n\n

The results without the optimisation are obtained by removing the\nrelevant assembly from the clean-stack-filter script.

"}, {"id": "http://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-in-a-llvm-pass/", "title": "Cleaning the stack in a LLVM pass", "url": "https://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-in-a-llvm-pass/", "date_published": "2014-10-06T00:00:00+02:00", "date_modified": "2014-10-06T00:00:00+02:00", "tags": ["computer", "simgrid", "llvm", "compilation", "assembly", "x86_64"], "content_html": "

In the previous episode, we implemented a LLVM pass which does\nnothing. Now we are trying to modify\nthis to create a (proof-of-concept) LLVM pass which fills the current\nstack frame with zero before using it.

\n


\n
\n\n
\n

Structure of the x86-64 stack

\n

Basic structure

\n

The top (in fact the bottom) of the stack is stored in the %rsp\nregister: a push operation decrements the value of %rsp and stores\nthe value at the resulting address; conversely a pop operation\nincrements the value of %rsp. Stack variables are allocated by\ndecrementing %rsp.

\n

A function call (call) pushes the return address (the value of the\ninstruction pointer, %rip, after the call) on the stack. A return instruction (ret) pops a\nvalue from the stack into %rip.

\n

A typical call frame contains in order:

\n\n
\n
    \n
  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. local variable for f()
  5. local variable for f()
  6. parameter for g()
  7. parameter for g()
  8. return address to f(), caller of g()
  9. local variable for g()
  10. local variable for g() \u2190 %rsp

x86-64 stack structure when f() calls g()
\n
\n\n

For example this C code,

\n
int f();\n\nint main(int argc, char** argv) {\n  int i = 42;\n  f();\n  return 0;\n}\n
\n\n\n

is compiled (with clang -S -fomit-frame-pointer example.c) into this\n(using AT&T\nsyntax):

\n
main:\n    subq    $24, %rsp\n    movl    $0, 20(%rsp)\n    movl    %edi, 16(%rsp)\n    movq    %rsi, 8(%rsp)\n    movl    $42, 4(%rsp)\n    movb    $0, %al\n    callq   f\n    movl    $0, %edi\n    movl    %eax, (%rsp)\n    movl    %edi, %eax\n    addq    $24, %rsp\n    ret\n
\n\n\n

Memory is allocated on the stack using subq. Local variables are\nusually referenced by offsets from the stack pointer, OFFSET(%rsp).

\n

Frame pointer

\n

The x86 (32-bit) ABI uses %ebp as the base of the stack frame. This is\nnot mandatory in the x86-64\nABI but the\ncompiler might still use a frame pointer. The base of the stack frame\nis then stored in %rbp.

\n
\n
    \n
  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. saved %rbp from caller of f() \u2190 saved %rbp
  5. local variable for f()
  6. local variable for f()
  7. parameter for g()
  8. parameter for g()
  9. return address to f(), caller of g()
  10. saved %rbp from f() \u2190 %rbp
  11. local variable for g()
  12. local variable for g() \u2190 %rsp

x86-64 stack structure when f() calls g() with frame pointer
\n
\n\n

Here is the same program compiled with -fno-omit-frame-pointer:

\n
main:\n    pushq   %rbp\n    movq    %rsp, %rbp\n    subq    $32, %rsp\n    movl    $0, -4(%rbp)\n    movl    %edi, -8(%rbp)\n    movq    %rsi, -16(%rbp)\n    movl    $42, -20(%rbp)\n    movb    $0, %al\n    callq   f\n    movl    $0, %edi\n    movl    %eax, -24(%rbp)\n    movl    %edi, %eax\n    addq    $32, %rsp\n    popq    %rbp\n    ret\n
\n\n\n

When a frame pointer is used, stack memory is usually referenced as a\nfixed offset from %rbp: OFFSET(%rbp).

\n

Red zone

\n

The x86 32-bit ABI did not allow the code of the function to use\nvariables after the top of the stack: a signal handler could at any\nmoment use any memory after the top of the stack.

\n

The standard x86-64\nABI allows the\ncode of the current function to use the 128 bytes (the red zone) after\nthe top of the stack. A signal handler must be instantiated by the OS\nafter the red zone. The red zone can be used for temporary variables\nor for local variables for leaf functions (functions which do not call\nother functions).

\n
\n
    \n
  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. local variable for f()
  5. local variable for f()
  6. parameter for g()
  7. parameter for g()
  8. return address to f(), caller of g()
  9. local variable for g()
  10. local variable for g() \u2190 %rsp
  11. red zone
  12. \u2026
  13. red zone

x86-64 stack structure when f() calls g() (with the red zone)
\n
\n\n

Note: Windows systems do not use the standard x86-64 ABI: the\nusage of the registers is different and there is no red zone.

\n

Let's make main() a leaf function:

\n
int main(int argc, char** argv) {\n  int i = 42;\n  return 0;\n}\n
\n\n\n

The variables are allocated in the red zone (negative offsets from the\nstack pointer):

\n
main:\n        movl    $0, %eax\n        movl    $0, -4(%rsp)\n        movl    %edi, -8(%rsp)\n        movq    %rsi, -16(%rsp)\n        movl    $42, -20(%rsp)\n        ret\n
\n\n\n

Cleaning the stack

\n

Assembly

\n

Here is the code we are going to add at the beginning of each\nfunction:

\n
    movq $QSIZE, %r11\n.Lloop:\n        movq $0, OFFSET(%rsp,%r11,8)\n        subq $1, %r11\n        jne  .Lloop\n
\n\n\n

for some suitable values of QSIZE and OFFSET.

\n

The %r11 register is defined by the System V x86-64 ABI (as well as the\nWindows ABI) as a scratch register: at the beginning of the\nfunction we are free to use it without saving it first.

\n

LLVM pass

\n

This is implemented by a StackCleaner machine pass whose\nrunOnMachineFunction() works similarly to the NoopInserter pass.

\n

Parameter computation

\n

We compute the parameters of the generated native code from the size of\nthe stack frame:

\n\n
int size = fn.getFrameInfo()->getStackSize();\nint qsize = size / sizeof(uint64_t);\nif (size==0) {\n  // No stack to clean, we do not modify the function:\n  return false;\n}\nint offset = - size - sizeof(uint64_t);\n
\n\n\n
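
To see which bytes the loop touches for these parameters, here is a small C model. It is a sketch under assumptions: the name clean_frame is hypothetical, the stack is modelled as a plain byte array with rsp as an index into it, and size is assumed to be a non-zero multiple of 8.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the loop: for r11 = qsize down to 1, zero the
 * 8-byte word at rsp + offset + 8*r11, with offset = -size - 8. */
static void clean_frame(unsigned char *stack, long rsp, long size)
{
    assert(size > 0 && size % 8 == 0);
    long qsize = size / 8;
    long offset = -size - 8;
    for (long r11 = qsize; r11 != 0; --r11)
        memset(stack + rsp + offset + 8 * r11, 0, 8);
}
```

With size = 24 this gives qsize = 3 and offset = -32, so the words at rsp-8, rsp-16 and rsp-24 are zeroed: exactly the 24 bytes below %rsp that the function is about to allocate.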

Basic blocks

\n

For LLVM, a function is represented as a collection\nof basic\nblocks. A basic block is a sequence of instructions where:

\n\n

Our assembly snippet is made of two basic blocks:

\n
    \n
  1. the first instruction;
  2. the end of the snippet.
\n
MachineBasicBlock* bb0 = fn.begin();\nMachineBasicBlock* bb1 = fn.CreateMachineBasicBlock();\nMachineBasicBlock* bb2 = fn.CreateMachineBasicBlock();\n\nfn.push_front(bb2);\nfn.push_front(bb1);\n
\n\n\n

A function is a control flow graph of basic blocks. We need to\ncomplete the arcs in this graph:

\n
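// bb1 (initialisation) falls through to bb2 (the loop):\nbb1->addSuccessor(bb2);\n// bb2 either jumps back to itself or falls through to bb0:\nbb2->addSuccessor(bb2);\nbb2->addSuccessor(bb0);\n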
\n\n\n

Machine instruction generation

\n

We generate the machine instructions:

\n
// First basic block (initialisation):\n\n// movq $QSIZE, %r11\nllvm::BuildMI(*bb1, bb1->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64ri),\n  X86::R11).addImm(qsize);\n\n// Second basic block (.Lloop):\n\n// movq $0, OFFSET(%rsp,%r11,8)\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64mi32))\n  .addReg(X86::RSP).addImm(8).addReg(X86::R11).addImm(offset).addReg(0)\n  .addImm(0);\n\n// subq $1, %r11\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::SUB64ri8),\n  X86::R11)\n  .addReg(X86::R11)\n  .addImm(1);\n\n// jne  .Lloop\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::JNE_4))\n  .addMBB(bb2);\n
\n\n\n

The instruction names have suffixes describing the size and types of their arguments:

\n\n

Modification notification

\n

The function has been modified:

\n
return true;\n
\n\n\n

Result

\n

Generated assembly

\n

Here is the generated assembly for our test code:

\n
main:\n    movabsq $3, %r11\n.LBB0_1:\n    movq    $0, -32(%rsp,%r11,8)\n    subq    $1, %r11\n    jne .LBB0_1\n    subq    $24, %rsp\n    movl    $0, 20(%rsp)\n    movl    %edi, 16(%rsp)\n    movq    %rsi, 8(%rsp)\n    movl    $42, 4(%rsp)\n    movb    $0, %al\n    callq   f\n    movl    $0, %edi\n    movl    %eax, (%rsp)\n    movl    %edi, %eax\n    addq    $24, %rsp\n    retq\n
\n\n\n

Test program

\n

Here is a simple test program using uninitialized stack variables:

\n
#include <stdio.h>\n\nvoid f() {\n  int i;\n  int data[16];\n\n  for(i=0; i!=16; ++i)\n    printf(\"%i \", data[i]);\n  printf(\"\\n\");\n\n  for(i=0; i!=16; ++i)\n    data[i] = i;\n}\n\nvoid g() {\n  int i, j, k, l, m, n, o, p;\n  printf(\"%i %i %i %i %i %i %i %i\\n\", i, j, k, l, m, n, o, p);\n}\n\nint main(int argc, char** argv) {\n  f();\n  f();\n  g();\n  return 0;\n}\n
\n\n\n

This is the output of a normal compilation:

\n
-1 0 -812203224 32767 -406470232 32655 -400476992 32655 -400465496 32655 0 0 1 0 4195997 0\n0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n16 0 0 15774463 15 14 13 12\n
\n\n

And with our stack-cleaning clang:

\n
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n0 0 0 0 0 0 0 0\n
\n\n

Result on SimGrid

\n

The whole SimGrid test suite works when compiled without SimGridMC\nsupport.

\n

At this point, I discovered that SimGrid fails to run when compiled\nwith clang (or DragonEgg) with support for SimGridMC. I need to fix\nthis first before testing the impact of cleaning the stack on\nSimGridMC state comparison.

\n

In the next episode, I'll try another implementation of the same\nconcept using a few scripts in order to process the generated\nassembly between the compiler and the\nassembler\nwhich should work with a standard GCC and with SimGridMC.

\n


\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-by-filtering-the-assembly/", "title": "Cleaning the stack by filtering the assembly", "url": "https://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-by-filtering-the-assembly/", "date_published": "2014-10-06T00:00:00+02:00", "date_modified": "2014-10-06T00:00:00+02:00", "tags": ["computer", "simgrid", "unix", "compilation", "assembly", "x86_64"], "content_html": "

In order to help the SimGridMC state comparison code, I wrote a\nproof-of-concept LLVM pass which cleans each stack\nframe before using\nit. However, SimGridMC currently does not work properly when compiled\nwith clang/LLVM. We can do the same thing by pre-processing the\nassembly generated by the compiler before passing it to the assembler:\nthis is done by inserting a script between the compiler and the\nassembler. This script will rewrite the generated assembly by\nprepending stack-cleaning code at the beginning of each function.

\n


\n
\n\n
\n

Summary

\n

In a typical compilation process, the compiler (here cc1) reads the\ninput source file and generates assembly. This assembly is then passed\nto the assembler (as) which generates native binary code:

\n
cat foo.c | cc1  | as      > foo.o\n#         \u2191      \u2191         \u2191\n#         Source Assembly  Native\n
\n\n\n

We can achieve our goal without depending on LLVM by adding a simple\nassembly-rewriting script to this pipeline between the compiler\nand the assembler:

\n
cat foo.c | cc1  | clean-stack-filter | as     > foo.o\n#         \u2191      \u2191                    \u2191        \u2191\n#         Source Assembly             Assembly Native\n
\n\n\n

By doing this, our modification can be used for any compiler as long\nas it sends assembly to an external assembler instead of generating\nthe native binary code directly.

\n

This will be done in three components:

\n\n

Assembly rewriting script

\n

The first step is to write a simple UNIX program taking as input the\nassembly code of a source file and producing as output the same\nassembly with a stack-cleaning pre-prolog added.

\n

Here is the generated assembly for the test function of the previous\nepisode (compiled with GCC):

\n
main:\n.LFB0:\n    .cfi_startproc\n    subq    $40, %rsp\n    .cfi_def_cfa_offset 48\n    movl    %edi, 12(%rsp)\n    movq    %rsi, (%rsp)\n    movl    $42, 28(%rsp)\n    movl    $0, %eax\n    call    f\n    movl    $0, %eax\n    addq    $40, %rsp\n    .cfi_def_cfa_offset 8\n    ret\n    .cfi_endproc\n
\n\n\n

We can use .cfi_startproc to find the beginning of a function and\neach pushq and subq $x, %rsp instruction to estimate the stack\nsize used by this function (excluding the red zone and alloca() as\npreviously). Each time we see the beginning of a function we\nneed to buffer each line until we are ready to emit the stack-cleaning\ncode.

\n
#!/usr/bin/perl -w\n# Transform assembly in order to clean each stack frame for X86_64.\n\nuse strict;\n$SIG{__WARN__} = sub { die @_ };\n\n# Whether we are still scanning the content of a function:\nour $scanproc = 0;\n\n# Save lines of the function:\nour $lines = \"\";\n\n# Size of the stack for this function:\nour $size = 0;\n\n# Counter for assigning unique ids to labels:\nour $id=0;\n\nsub emit_code {\n    my $qsize = $size / 8;\n    my $offset = - $size - 8;\n\n    if($size != 0) {\n      print(\"\\tmovabsq \\$$qsize, %r11\\n\");\n      print(\".Lstack_cleaner_loop$id:\\n\");\n      print(\"\\tmovq    \\$0, $offset(%rsp,%r11,8)\\n\");\n      print(\"\\tsubq    \\$1, %r11\\n\");\n      print(\"\\tjne     .Lstack_cleaner_loop$id\\n\");\n    }\n\n    print $lines;\n\n    $id = $id + 1;\n    $size = 0;\n    $lines = \"\";\n    $scanproc = 0;\n}\n\nwhile (<>) {\n  if ($scanproc) {\n      $lines = $lines . $_;\n      if (m/^[ \\t]*.cfi_endproc$/) {\n      emit_code();\n      } elsif (m/^[ \\t]*pushq/) {\n      $size += 8;\n      } elsif (m/^[ \\t]*subq[\\t *]\\$([0-9]*),[ \\t]*%rsp$/) {\n          my $val = $1;\n          $val = oct($val) if $val =~ /^0/;\n          $size += $val;\n          emit_code();\n      }\n  } elsif (m/^[ \\t]*.cfi_startproc$/) {\n      print $_;\n\n      $scanproc = 1;\n  } else {\n      print $_;\n  }\n}\n
\n\n\n
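
The size estimation done by this script can be modelled in C as follows. This is a sketch, not a port of the script: the insn type, the insn_kind values and estimate_frame_size are hypothetical names, and the instructions are assumed to be already classified instead of being matched with regular expressions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the lines seen after .cfi_startproc. */
enum insn_kind { INSN_PUSHQ, INSN_SUBQ_RSP, INSN_OTHER };

struct insn {
    enum insn_kind kind;
    long imm; /* immediate of subq $imm, %rsp (unused for other kinds) */
};

/* Add 8 bytes per pushq until the first subq $imm, %rsp is found, then add
 * imm and stop, as the script does. Reaching the end of the array plays the
 * role of .cfi_endproc. */
static long estimate_frame_size(const struct insn *insns, size_t n)
{
    long size = 0;
    assert(insns != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (insns[i].kind == INSN_PUSHQ) {
            size += 8;
        } else if (insns[i].kind == INSN_SUBQ_RSP) {
            size += insns[i].imm;
            break;
        }
    }
    return size;
}
```

For the main function shown earlier, the only relevant instruction is subq $40, %rsp, so the estimate is 40 bytes, which gives 5 quadwords to clean.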

This is used as:

\n
# Use either of:\nclean-stack-filter < helloworld.s\ngcc -o- -S helloworld.c | clean-stack-filter | gcc -x assembler -r -o helloworld\n
\n\n\n

And this produces:

\n
main:\n.LFB0:\n    .cfi_startproc\n    movabsq $5, %r11\n.Lstack_cleaner_loop0:\n    movq    $0, -48(%rsp,%r11,8)\n    subq    $1, %r11\n    jne     .Lstack_cleaner_loop0\n    subq    $40, %rsp\n    .cfi_def_cfa_offset 48\n    movl    %edi, 12(%rsp)\n    movq    %rsi, (%rsp)\n    movl    $42, 28(%rsp)\n    movl    $0, %eax\n    call    f\n    movl    $0, %eax\n    addq    $40, %rsp\n    .cfi_def_cfa_offset 8\n    ret\n    .cfi_endproc\n
\n\n\n

Assembler wrapper

\n

A second step is to write an extended assembler (as) program which\naccepts an extra argument --filter my_shell_command. We could\nhardcode the filtering script in this wrapper but a generic assembler\nwrapper might be reused somewhere else.

\n

We need to:

\n
    \n
  1. interpret a part of the as command line arguments and our extra argument;
  2. apply the specified filter on the input assembly;
  3. pass the resulting assembly to the real assembler.
\n
#!/usr/bin/ruby\n# Wrapper around the real `as` which adds filtering capabilities.\n\nrequire \"tempfile\"\nrequire \"fileutils\"\n\ndef wrapped_as(argv)\n\n  args=[]\n  input=nil\n  as=\"as\"\n  filter=\"cat\"\n\n  i = 0\n  while i<argv.size\n    case argv[i]\n\n    when \"--as\"\n      as = argv[i+1]\n      i = i + 1\n    when \"--filter\"\n      filter = argv[i+1]\n      i = i + 1\n\n    when \"-o\", \"-I\"\n      args.push(argv[i])\n      args.push(argv[i+1])\n      i = i + 1\n    when /^-/\n      args.push(argv[i])\n    else\n      if input\n        exit 1\n      else\n        input = argv[i]\n      end\n    end\n    i = i + 1\n  end\n\n  if input==nil\n    # We dont handle pipe yet:\n    exit 1\n  end\n\n  # Generate temp file\n  tempfile = Tempfile.new(\"as-filter\")\n  unless system(filter, 0 => input, 1 => tempfile)\n    status=$?.exitstatus\n    FileUtils.rm tempfile\n    exit status\n  end\n  args.push(tempfile.path)\n\n  # Call the real assembler:\n  res = system(as, *args)\n  status = if res != nil\n             $?.exitstatus\n           else\n             1\n           end\n  FileUtils.rm tempfile\n  exit status\n\nend\n\nwrapped_as(ARGV)\n
\n\n\n

This is used like this:

\n
tools/as --filter \"sed s/world/abcde/\" helloworld.s\n
\n\n\n

We now can ask the compiler to use our assembler wrapper instead of\nthe real system assembler:

\n\n
gcc -B tools/ -Wa,--filter,'sed s/world/abcde/' \\\n  helloworld.c -o helloworld-modified-gcc\n
\n\n\n
clang -no-integrated-as -B tools/ -Wa,--filter,'sed s/world/abcde/' \\\n  helloworld.c -o helloworld-modified-clang\n
\n\n\n

Which produces:

\n
\n$ ./helloworld\nHello world!\n$ ./helloworld-modified-gcc\nHello abcde!\n$ ./helloworld-modified-clang\nHello abcde!\n
\n\n

By combining the two tools, we can get a compiler with stack-cleaning enabled:

\n
gcc -B tools/ -Wa,--filter,'clean-stack-filter' \\\n  helloworld.c -o helloworld\n
\n\n\n

Compiler wrapper

\n

Now we can write compiler wrappers which do this job automatically:

\n
#!/bin/sh\npath=$(dirname \"$0\")\nexec gcc -B \"$path\" -Wa,--filter,\"$path\"/clean-stack-filter \"$@\"\n
\n\n\n
#!/bin/sh\npath=$(dirname \"$0\")\nexec g++ -B \"$path\" -Wa,--filter,\"$path\"/clean-stack-filter \"$@\"\n
\n\n\n
\n

Warning

\n

As the assembly modification is implemented in as,\nthis compiler wrapper will output the unmodified assembly when using\ncc -S, which can be surprising. You need to objdump the .o file in\norder to see the effect of the filter.

\n
\n

Result

\n

The whole test suite of SimGrid with model-checking works with this\nimplementation. The next step is to see the impact of this\nmodification on the state comparison of SimGridMC.

"}, {"id": "http://www.gabriel.urdhr.fr/2014/09/26/adding-a-llvm-pass/", "title": "Adding a basic LLVM pass", "url": "https://www.gabriel.urdhr.fr/2014/09/26/adding-a-llvm-pass/", "date_published": "2014-09-26T00:00:00+02:00", "date_modified": "2014-09-26T00:00:00+02:00", "tags": ["computer", "simgrid", "llvm", "compilation", "assembly", "x86_64"], "content_html": "

The SimGrid model checker uses memory introspection (of the heap,\nstack and global variables) in order to detect the equality of the\nstate of a distributed application at the different nodes of its\nexecution graph. One difficulty is to deal with uninitialised\nvariables. The uninitialised global variables are usually not a big\nproblem as their initial value is 0. The heap variables are dealt with\nby memseting to 0 the content of the buffers returned by malloc\nand friends. The case of uninitialised stack variables is more\nproblematic as their value is whatever was at this place on the stack\nbefore. In order to evaluate the impact of those uninitialised\nvariables, we would like to clean each stack frame before using\nit. This could be done with a LLVM plugin. Here's my first attempt\nto write a LLVM pass to modify the code of a function.

\n

A solution for this would be to include, at compilation time,\ninstructions to clean the stack frame at the beginning of each\nfunction. This could be implemented as a LLVM\npass:

\n\n

This is mostly relevant when the generated code is not optimised. In\noptimised code, local variables do not need to live on the stack.

\n


\n
\n\n
\n

LLVM overview

\n

A good high level introduction to the LLVM architecture (LLVM IR and\npasses) can be found in The Architecture of Open Source\nApplications.

\n

IR generation

\n

LLVM uses an intermediate language, LLVM\nIR to optimise and generate native\ncode.

\n

For example, a simple hello world like this,

\n
#include <stdio.h>\n\nint main(int argc, char** argv) {\n  puts(\"Hello world!\");\n  return 0;\n}\n
\n\n\n

is turned into this LLVM IR:

\n
; ModuleID = 'helloworld.c'\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n@.str = private unnamed_addr constant [13 x i8] c\"Hello world!\\00\", align 1\n\n; Function Attrs: nounwind uwtable\ndefine i32 @main(i32 %argc, i8** %argv) #0 {\n  %1 = alloca i32, align 4\n  %2 = alloca i32, align 4\n  %3 = alloca i8**, align 8\n  store i32 0, i32* %1\n  store i32 %argc, i32* %2, align 4\n  store i8** %argv, i8*** %3, align 8\n  %4 = call i32 @puts(i8* getelementptr inbounds ([13 x i8]* @.str, i32 0, i32 0))\n  ret i32 0\n}\n\ndeclare i32 @puts(i8*) #1\n\nattributes #0 = { nounwind uwtable \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #1 = { \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\n\n!llvm.ident = !{!0}\n\n!0 = metadata !{metadata !\"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"}\n
\n\n\n

by

\n
clang -S -emit-llvm helloworld.c -o helloworld.ll\n
\n\n\n

The generated LLVM IR can be target-dependent as the type of the\nvariables may depend on the architecture/OS:

\n\n

The initial generation of LLVM IR is not done in LLVM but by the\nfrontend (clang, dragonegg\u2026).

\n

LLVM IR passes

\n

Many LLVM optimisations are implemented in an architecture-independent\nway by IR passes which transform/optimise IR:

\n
opt -std-compile-opts -S helloworld.ll -o helloworld.opt.ll --time-passes 2> opt.log\n
\n\n\n

Generated IR:

\n
; ModuleID = 'helloworld.ll'\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n@.str = private unnamed_addr constant [13 x i8] c\"Hello world!\\00\", align 1\n\n; Function Attrs: nounwind uwtable\ndefine i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {\n  %1 = tail call i32 @puts(i8* getelementptr inbounds ([13 x i8]* @.str, i64 0, i64 0)) #2\n  ret i32 0\n}\n\n; Function Attrs: nounwind\ndeclare i32 @puts(i8* nocapture readonly) #1\n\nattributes #0 = { nounwind uwtable \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #1 = { nounwind \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #2 = { nounwind }\n\n!llvm.ident = !{!0}\n\n!0 = metadata !{metadata !\"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"}\n
\n\n\n

CodeGen passes

\n

This optimized LLVM IR is then used to generate assembly/binary code\nfor the target architecture:

\n
llc  helloworld.opt.ll -o helloworld.s --time-passes 2> llc.log\n
\n\n\n

Generated assembly:

\n
        .text\n        .file   \"/home/foo/temp/helloworld.opt.ll\"\n        .globl  main\n        .align  16, 0x90\n        .type   main,@function\nmain:                                   # @main\n        .cfi_startproc\n# BB#0:\n        pushq   %rbp\n.Ltmp0:\n        .cfi_def_cfa_offset 16\n.Ltmp1:\n        .cfi_offset %rbp, -16\n        movq    %rsp, %rbp\n.Ltmp2:\n        .cfi_def_cfa_register %rbp\n        movl    $.L.str, %edi\n        callq   puts\n        xorl    %eax, %eax\n        popq    %rbp\n        retq\n.Ltmp3:\n        .size   main, .Ltmp3-main\n        .cfi_endproc\n\n        .type   .L.str,@object          # @.str\n        .section        .rodata.str1.1,\"aMS\",@progbits,1\n.L.str:\n        .asciz  \"Hello world!\"\n        .size   .L.str, 13\n\n\n        .ident  \"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"\n        .section        \".note.GNU-stack\",\"\",@progbits\n
\n\n\n

Summary

\n

An LLVM-based compiler uses the following\nphases:

\n
    \n
  1. code analysis (preprocessing, lexing, parsing, semantic analysis\u2026);
  2. LLVM IR generation (by the compiler);
  3. LLVM IR transformation/optimisation (by applying IR passes);
  4. native code generation from IR (by applying CodeGen passes).
\n

Steps 1 and 2 are parts of the code of the compiler. Steps 3 and 4 are\nhandled by the LLVM framework (configurable/pluggable by the\ncompiler).

\n

As we want to touch the content of the stack, we want to add a CodeGen\npass.

\n

Adding a CodeGen pass

\n

Let's first try to add a pass to insert a NOP into every function.

\n

Header

\n

Let's create a new NoopInserter pass (NoopInserter.h). There are\nmany kinds of passes. This pass is a MachineFunction pass: it is\ncalled (runOnMachineFunction) on each generated native function\nand can modify it before it is passed to the next pass.

\n
#include <llvm/PassRegistry.h>\n#include <llvm/CodeGen/MachineFunctionPass.h>\n\nnamespace llvm {\n\n  class NoopInserter : public llvm::MachineFunctionPass {\n  public:\n    static char ID;\n    NoopInserter();\n    virtual bool runOnMachineFunction(llvm::MachineFunction &Fn);\n  };\n\n}\n
\n\n\n

The ID is used as a reference to the pass in LLVM: the value of this\nvariable is not important, only its address is used.

\n

Implementation

\n
#include \"NoopInserter.h\"\n\n#include <llvm/CodeGen/MachineInstrBuilder.h>\n#include <llvm/Target/TargetMachine.h>\n#include <llvm/Target/TargetInstrInfo.h>\n#include <llvm/PassManager.h>\n#include <llvm/Transforms/IPO/PassManagerBuilder.h>\n#include <llvm/CodeGen/Passes.h>\n#include <llvm/Target/TargetSubtargetInfo.h>\n#include \"llvm/Pass.h\"\n\n#define GET_INSTRINFO_ENUM\n#include \"../Target/X86/X86GenInstrInfo.inc\"\n\n#define GET_REGINFO_ENUM\n#include \"../Target/X86/X86GenRegisterInfo.inc.tmp\"\n\nnamespace llvm {\n  char NoopInserter::ID = 0;\n\n  NoopInserter::NoopInserter() : llvm::MachineFunctionPass(ID) {\n  }\n\n  bool NoopInserter::runOnMachineFunction(llvm::MachineFunction &fn) {\n    const llvm::TargetInstrInfo &TII = *fn.getSubtarget().getInstrInfo();\n    MachineBasicBlock& bb = *fn.begin();\n    llvm::BuildMI(bb, bb.begin(), llvm::DebugLoc(), TII.get(llvm::X86::NOOP));\n    return true;\n  }\n\n  char& NoopInserterID = NoopInserter::ID;\n}\n\nusing namespace llvm;\n\nINITIALIZE_PASS_BEGIN(NoopInserter, \"noop-inserter\",\n  \"Insert a NOOP\", false, false)\nINITIALIZE_PASS_DEPENDENCY(PEI)\nINITIALIZE_PASS_END(NoopInserter, \"noop-inserter\",\n  \"Insert a NOOP\", false, false)\n
\n\n\n

The runOnMachineFunction method finds the beginning of the function\nand inserts an X86 NOOP instruction. The method returns true in order\nto tell the LLVM framework that this function has been modified by\nthis pass. This implementation will only work on X86/AMD64 targets. A\nreal pass should be target independent or at least check the target.

\n

The INITIALIZE_PASS macros declare the pass and declare its\ndependencies. Here, we are declaring a dependency on PEI a.k.a\nPrologEpilogInserter which adds the prolog and epilog to the code of\nnative function. Those macros define a function:

\n
void initializeNoopInserterPass(PassRegistry &Registry);\n
\n\n\n

The NoopInserterID may be used by other passes to refer to this\npass.

\n

Declarations

\n

We have to add a few declarations of this pass.

\n

In include/llvm/CodeGen/Passes.h:

\n
// NoopInserter - This pass inserts a NOOP instruction\nextern char &NoopInserterID;\n
\n\n\n

In include/llvm/InitializePasses.h:

\n
void initializeNoopInserterPass(PassRegistry &Registry);\n
\n\n\n

Registration

\n

The pass must be registered in llvm::initializeCodeGen()\n(lib/CodeGen/CodeGen.cpp):

\n
initializeNoopInserterPass(Registry);\n
\n\n\n

Result

\n
clang -O3 helloworld.c -S -o-\n
\n\n\n

We have a nice NOOP:

\n
    .text\n    .file   \"/home/foo/temp/helloworld.c\"\n    .globl  main\n    .align  16, 0x90\n    .type   main,@function\nmain:                                   # @main\n    .cfi_startproc\n# BB#0:                                 # %entry\n    nop\n    pushq   %rax\n.Ltmp0:\n    .cfi_def_cfa_offset 16\n    movl    $.L.str, %edi\n    callq   puts\n    xorl    %eax, %eax\n    popq    %rdx\n    retq\n.Ltmp1:\n    .size   main, .Ltmp1-main\n    .cfi_endproc\n\n    .type   .L.str,@object          # @.str\n    .section    .rodata.str1.1,\"aMS\",@progbits,1\n.L.str:\n    .asciz  \"Hello world!\"\n    .size   .L.str, 13\n\n\n    .ident  \"clang version 3.6.0 \"\n    .section    \".note.GNU-stack\",\"\",@progbits\n
\n\n\n

The program still works:

\n
$ clang -O3 helloworld.c\n$ ./a.out\nHello world!\n
\n\n

Conclusion

\n

I successfully managed to add a pass in order to (actively) do nothing\nin each generated native function. In the next episode, I'll try to do\nsomething useful\ninstead.

"}]}