LLVM Machine Passes | Marcin Copik

LLVM has a very rich set of analysis and transformation passes dedicated to operate on the Intermediate Representation (IR) format which is used as a common abstraction between the frontend operating on the very high level of source code, and the backend responsible for lowering the IR into a respective target ISA. The IR is provided in a static single assignment form (SSA)¹. The primary feature of SSA is the requirement of a single assignment to every variable. Although LLVM as a project suffers from many undocumented features and tools, the documentation on IR passes is suprisingly good, including a propably complete list of available passes² and a quite good introduction³ on implementing passes, although the content on pass manager and specifying requirements could be updated to reflect the difference between legacy and modern pass manager.

But what happens when we try to analyze the machine representation of the code on the target side? According to code generator documentation⁴ and the avery good article ‘Life of an instruction in LLVM’⁵, IR building blocks have their counterparts in the backend. An IR function is represented by a MachineFunction on the backend side. MachineFunction contains possibly multiple MachineBasicBlock instances, and machine basic blocks contain instances of MachineInstruction. These three types have different relationships with their IR counterparts. A MachineFunction always represents an IR Function while the MachineBasicBlock might map a BasicBlock but it does not have to! That’s why getFunction() returns a reference and getBasicBlock() returns a pointer which might be null. Finally, the machine instruction does not correspond directly to IR Instruction at all.

But why is there a one-to-many mapping in basic blocks? The example below explains how the abstracted view of the machine in the IR can’t match perfectly the actual assembly.

    
double f(double * arr, int len)
{
    double sum = 1.0;
    for(int i = 0; i < len; ++i)
        sum += arr[i];
    return sum;
}

This code is compiled in clang with flags -O3 -fno-vectorize -fno-unroll-loops to keep the code simple; loop vectorization and unrolling will create additional machine basic blocks as well but the representation is harder to understand.

    
; Function Attrs: norecurse nounwind readonly uwtable
define dso_local double @_Z1fPdi(double* nocapture readonly, i32) local_unnamed_addr {
  %3 = icmp sgt i32 %1, 0
  br i1 %3, label %4, label %6

; <label>:4:                                      ; preds = %2
  %5 = zext i32 %1 to i64
  br label %8

; <label>:6:                                      ; preds = %8, %2
  %7 = phi double [ 1.000000e+00, %2 ], [ %13, %8 ]
  ret double %7

; <label>:8:                                      ; preds = %8, %4
  %9 = phi i64 [ 0, %4 ], [ %14, %8 ]
  %10 = phi double [ 1.000000e+00, %4 ], [ %13, %8 ]
  %11 = getelementptr inbounds double, double* %0, i64 %9
  %12 = load double, double* %11, align 8
  %13 = fadd double %10, %12
  %14 = add nuw nsw i64 %9, 1
  %15 = icmp eq i64 %14, %5
  br i1 %15, label %6, label %8
}

This is the most basic representation of our code - if the second parameter length is greater than zero, the control flow reaches loop preheader with label 4 and then loop with label 8 where the actual computation is performed. The exit block 6 defines the final result returned from the function with a phi node which selects either a constant 1 or the loop result, depending on whether the loop has executed at least one iteration. PHI nodes are a very fundamental concepts in the SSA representation but they don’t exist in hardware. Therefore, in this case the basic block is duplicated as we can see on the listing below.

    
# Machine code for function _Z1fPdi: NoPHIs, TracksLiveness, NoVRegs
Constant Pool:
  cp#0: 1.000000e+00, align=8
Function Live Ins: $rdi, $esi

bb.0 (%ir-block.2):
  successors: %bb.3(0x50000000), %bb.1(0x30000000); %bb.3(62.50%), %bb.1(37.50%)
  liveins: $esi, $rdi
  TEST32rr renamable $esi, renamable $esi, implicit-def $eflags
  JLE_1 %bb.1, implicit $eflags

bb.3 (%ir-block.4):
; predecessors: %bb.0
  successors: %bb.4(0x80000000); %bb.4(100.00%)
  liveins: $esi, $rdi
  renamable $eax = MOV32rr killed renamable $esi, implicit-def $rax
  renamable $xmm0 = MOVSDrm $rip, 1, $noreg, %const.0, $noreg :: (load 8 from constant-pool)
  renamable $ecx = XOR32rr undef $ecx(tied-def 0), undef $ecx, implicit-def dead $eflags, implicit-def $rcx

bb.4 (%ir-block.8, align 4):
; predecessors: %bb.3, %bb.4
  successors: %bb.2(0x04000000), %bb.4(0x7c000000); %bb.2(3.12%), %bb.4(96.88%)
  liveins: $rax, $rcx, $rdi, $xmm0
  renamable $xmm0 = ADDSDrm killed renamable $xmm0(tied-def 0), renamable $rdi, 8, renamable $rcx, 0, $noreg :: (load 8 from %ir.scevgep, !tbaa !2)
  renamable $rcx = nuw nsw ADD64ri8 killed renamable $rcx(tied-def 0), 1, implicit-def dead $eflags
  CMP64rr renamable $rax, renamable $rcx, implicit-def $eflags
  JNE_1 %bb.4, implicit $eflags

bb.2 (%ir-block.6):
; predecessors: %bb.4
  liveins: $xmm0
  RETQ $xmm0

bb.1:
; predecessors: %bb.0

  renamable $xmm0 = MOVSDrm $rip, 1, $noreg, %const.0, $noreg :: (load 8 from constant-pool)
  RETQ $xmm0

# End machine code for function _Z1fPdi.

The exit block from IR corresponds to the machine basic block bb.2 which returns the value in vector register xmm0. Additionally, a second exit block bb.1 is provided for the case when loop is not executed at all. A similar pattern can be seen in the assembly for x86 target where the new block moves the constant value to the register before executing retq.

From an LLVM bitcode, we can go directly generate object file which in our case is stored as an x86-64 ELF format, by using the LLVM static compiler llc

llc -filetype=obj file.bc -o file.o

with the option -filetype=obj. One can also generate an assembly listing

llc -S file.bc -o file.s
llvm-mc -filetype=obj file.bc -o file.o

Additionaly, the flag -asm-show-inst generates the listing with mapping to LLVM machine instructions and operands in comments. When building C/C++ programs, these steps are wrapped inside clang tool and hidden from the user ⁶.

The well-documented and properly explained infrastracture ends as soon as we try to write and run passes on machine code representation. It sounds from this November 2015 discussion on the mailing list that there’s no way of running the pass dynamically during target code generation. llc docs explain that the -load option, known also in opt, is used to dynamically load targets, not passes. For the purpose of this tutorial, we want to write a pass printing machine blocks and basic statistics on size and correspondence to IR block. llc already offers an ability to print out machine instructions after every optimization step by using the flag -print-machineinstrs. The contents of this tutorial are mostly based on the information found on LLVM mailing list⁷.

The code generation operates in several phases which are responsible for selecting instructions, allocating registers. performing late optimizations such as peephole optimizations and emitting the code⁸. We want to run our pass quite late, just after target optimizations and before the lowering of machine instructions to target code. Since there’s no way to dynamically load the pass, we need to statically embed our pass in X86 target generation chain.

First, we implement our machine pass that visits functions in the code. As the naming scheme suggests, a machine counterpart of a FunctionPass is simply a MachineFunctionPass. Only a single function runOnFunction needs to be implemented . Since we want to print a basic information on block size, it might look like this:

    
struct MachinePrinter: public llvm::MachineFunctionPass {

  static char ID;

  MachinePrinter():
    MachineFunctionPass(ID) {}

  bool runOnMachineFunction(llvm::MachineFunction& F)
      override
  { 
    for(auto & MBB : F) {
      const llvm::BasicBlock * bb = MBB.getBasicBlock();
      if(bb)
        llvm::outs() << bb->size() << ' '
            << MBB.size() << '\n';
      else
        llvm::outs() << MBB.getName() << ' '
            << MBB.size() << '\n';
      for(auto & inst : BB)
        llvm::outs() << inst << '\n';
      }
      return false;
  }
};

char MachinePrinter::ID = 0;

namespace llvm {
    void initializeMachinePrinterPass(llvm::PassRegistry &);
}

using llvm::PassRegistry;
using llvm::PassInfo;
using llvm::callDefaultCtor;
INITIALIZE_PASS(MachinePrinter, "x86-printer", "X86 Printer", true, true)

namespace llvm {
    FunctionPass * createMachinePrinter() {
        return new MachinePrinter();
    }
}

We use the macro INITIALIZE_PASS as a first step of pass registration. It defines a function initializePASSNAMEPass in namespace llvm and we need to provide an additional declaration. Furthermore, we need to supply a function returning a dynamically allocated pass object. Both function are used by the target configuration to add our pass to the pipeline.

The we need a forward declaration of the create function in lib/Target/X86/X86.h

namespace llvm {
    ...
    FunctionPass *createMachinePrinter();
}

Forward decalarations allow compiling the target configuration independently from passes implementation. Thus, passes can be developed and improved without a significant project rebuild. Now we need to insert our analysis pass into X86 target configuration. We start in lib/Target/X86/X86TargetMachine.cpp with a forward declaration and a call to initialization function in lib/Target/X86/X86TargetMachine.cpp.

namespace llvm {
  void initializeMachinePrinterPass(PassRegistry &);
}

extern "C" void LLVMInitializeX86Target() {
  PassRegistry &PR = *PassRegistry::getPassRegistry();
  ...
  initializeMachinePrinterPass(PR);
}

Since I want to look at machine code after all implementations have been finished, the pass is called just at the very end:

void X86PassConfig::addPreEmitPass2() {
  ...
  addPass(createMachinePrinter());
}

As the last step we modify the CMakeLists.txt in lib/Target/X86/CMakeLists.txt to add the new .cpp file to sources list. The new pass should be available when using llc to build x86 targets.

As we see, this process is not complicated and one can set up a machine pass prototype in just a few minutes. It’s a pity that this process is not so well documented and supported as IR passes. And as far as machine basic block statistics are concerned, this method is reliable only when static information is concerned. Dynamic block counts need profiling information that is available on the IR level, and the one-to-many mapping between IR and machine basic block prevent us from obtaining a complete and accurate result. For that purpose one needs to use other instrumentation tool, for example the Pin tool which is capable of counting basic block instances.

SSA on Wikipedia, a free textbook on SSA provided by Inria - work in progress! ↩
LLVM Passes ↩
Writing an LLVM Pass ↩
The LLVM Target-Independent Code Generator ↩
Life of an instruction LLVM, Eli Bendersky ↩
January 2018 mailing list discussion explains the differences between building an application with various pipelines based on clang, opt and llc. ↩
November 2015 discussion on writing a machine pass, April 2014 discussion which brings the problem of finding correspondence between IR and machine instructions. The mentioned -debug-ir was deprecated and removed in November 2014. As of October 2018, there’s an open review for a new debug pass. July 2016 question on matching machine instructions with IR, no answers unfortunately. ↩
High-level design of code generator ↩

Enjoy Reading This Article?