C++ Toolchain with Taint Analysis

Clang comes with a set of tools known as sanitizers that provide a runtime verification of common problems such as memory issues and undefined behavior. An interesting and a not well-known one is DfSan, a dataflow sanitizer. The name does not really reveal the true purpose: the library brings a compiler pass and runtime to implement taint analysis1. The main purpose is to taint specific memory regions and automatically propagate taint labels to other locations in memory that are affected by originally tainted regions. Taint labels are stored separately, in a so-called shadow memory, and compiler pass instruments codes with taint propagation. Thus, it is possible for every variable in a program to detect which values have affected it. The analysis is not only fully inter-procedural but it is completely memory agnostic, which is a common issue for static compiler analyses. The sanitizer implements data-flow tainting, the most common way of propagating labels:


void f(int * x)
{
  // Taint label input_parameter detected in y!
  int y = *x;
  do_something_important(y);
}

int input_parameter = 1;
taint(&input_parameter, "input_parameter");
int x = input_parameter * 10;
f(&x);

As a sidenote, we notice that variables can be affected through control-flow decision as well:

int m, n;
taint(&m, "m");
taint(&n, "n");
int sum = 0;
// sum is m * n
// sum should have both `m` and `n` as taint labels
for(int i = 0; i < m; ++i)
  sum += n;

The feature of control-flow tainting2 has not been implemented in DfSan yet.

To be fully precise, DfSan needs to instrument all library functions by propagating taint labels across function arguments and return values. As a result, it creates a wrapper function for each function found in the program. When handling functions that cannot be instrumented, e.g. library functions for which only declaration is available, the function will be considered as potentially problematic unless dfsan is notified through an ABI blacklist that this function does not write into user-accessible memory. For such function there can be no taint propagation or the taint label of return value from the function is overapproximated as a combination of taint labels in all input variables.

To make sure that we cover everything, we have to instrument all libraries used by the program, including the standard C++ library.

Goal

The main goal is to assemble a complete C++ toolchain with a support for dataflow (taint) analysis. According to Clang documentation3, the complete pipeline includes not only a compiler, but a linker, language standard library implementation and a runtime library. First, we’re going to build Clang with compiler runtime. Afterwards, we’re going to use the new build of clang to prepare sanitized build of libc++. Finally, we build libunwind, the library implementing stack unwinding which is necessary for exception handling, and assemble the toolchain in a Docker image to form a single distribution of a C++ dfsan pipeline. To demonstrate the tool usability for HPC software, we’re going to take a look on builds with OpenMP and MPI.

Shared libraries

Static libraries are prefered for dfsan to avoid issues with memory mappings. However, statical link of libraries might lead to serious issues when a specific library is linked more than once. For example, an application might link both OpenMP and a library already using OpenMP, leading to a scenario where two OpenMP runtimes are present. That could have catastrophic effects on the performance but it’s not something that we should worry about when trying to instrument an application4.

Using DfSan

Dfsan might produce a huge number of warnings related on uninstrumented functions. It’s quite benefitial to look through the log at least. Although a significant part of libc is uninstrumented and sqrt computing square root is marked as functional, implying that taint label of return value is infered from labels of input arguments, the cubic root function cbrt is marked only as uninstrumented. Afterwards it’s better to disable the output log entirely with the environment variable:

DFSAN_OPTIONS=warn_unimplemented=0

Build Clang with sanitizers

First, we need to get LLVM and clang with the support to sanitizers - we’re going to use dataflow sanitizer (dfsan). There are three ways to achieve that:

  • get prebuilt packages via APT for Debian-based systems: https://apt.llvm.org/
  • build from sources, starting with cloning the repository and switching to branch corresponding to selected release
  • download sources or prebuilt binaries from sources. The important part is we need to build Clang with compiler-rt since the latter provides sanitizers.

The build itself is quite straightforward as long as all projects are placed together, just like in a clone from directory, or in a common directory i.e. with directories clang and compiler-rt along LLVM. The CMake configuration with sanitizers and in release mode is as follows:

cmake -DLLVM_TARGETS_TO_BUILD="X86"\
  && -DLLVM_ENABLE_PROJECTS="clang;compiler-rt"\
  && -DCMAKE_BUILD_TYPE=Release\
  && -DCMAKE_INSTALL_PREFIX=/path/to/install\
  && /path/to/llvm

We can use dataflow sanitizer to propagate taint labels in a dataflow manner! We do a sanity check by running a shortened version of the simple example from docs:

#include <sanitizer/dfsan_interface.h>
#include <assert.h>

int main(void)
{
  int i = 1;
  dfsan_label i_label = dfsan_create_label("i", 0);
  dfsan_set_label(i_label, &i, sizeof(i));

  int j = 2;
  dfsan_label j_label = dfsan_create_label("j", 0);
  dfsan_set_label(j_label, &j, sizeof(j));

  int test = i + j;
  dfsan_label test_label = dfsan_read_label(&test, sizeof(test));
  assert(dfsan_has_label(test_label, i_label));
  assert(dfsan_has_label(test_label, j_label));

  return 0;
}

Which can be built with just a single additional compiler flag:

export PATH=/install/path/bin:$PATH
# make sure we use the correct version
which clang++
clang++ -fsanitize=dataflow dfsan_test.cpp -o dfsan_test.exe
./dfsan_test.exe

The Docker image with the complete build is available as mcopik/clang-dfsan:clang-${VERSION}.

C++ Standard Library

We want to build both libcxx and libcxxabi to provide a fully sanitized standard library implementation that can be used by C++ applications. The libraries are easily built as a part of LLVM tree by selectin LLVM projects libcxx and libcxxabi, similarly to the previous setup. However, this requires generation of targets for the entire LLVM project and we can’t easily reuse previous build step because compiler flags are now different.

Fortunately, the libraries support a standalone build although it is a bit more complicated due to a cyclic dependency: libcxx requires libcxxabi whereas libcxxabi requires existence of libcxx headers. The most generic approach would be to build libcxx against a different ABI, such as libstdc++, build libcxxabi and then finally rebuild libcxx with a proper ABI. But we can take a shortcut here since libcxxabi requires only headers from the other library, and for that reason we don’t need a full build.

First, let’s create a static build of libcxxabi

cmake -G "Ninja"\
  -DCMAKE_BUILD_TYPE=MinSizeRel\
  -DCMAKE_INSTALL_PREFIX=/path/to/install\
  -DCMAKE_C_COMPILER=clang\
  -DCMAKE_CXX_COMPILER=clang++\
  -DCMAKE_C_FLAGS=-fsanitize=dataflow\
  -DCMAKE_CXX_FLAGS=-fsanitize=dataflow\
  -DLLVM_PATH=/path/to/llvm/install\
  -DLIBCXXABI_ENABLE_SHARED=NO\
  -DLIBCXXABI_LIBCXX_PATH=../libcxx\
  ../libcxxabi

This will create a single static library file lib/libc++abi.a. Then, we can get a libc++. In addition, we can use the experimental option LIBCXX_ENABLE_STATIC_ABI_LIBRARY to link the contents of libcxxabi with our static copy of libcxx. This simplifies further usage since libc++abi.a does not have to be specified every time as link-time dependency.

cmake -G "Ninja"\
  -DCMAKE_BUILD_TYPE=MinSizeRel\
  -DCMAKE_INSTALL_PREFIX=/opt/llvm\
  -DCMAKE_C_COMPILER=clang\
  -DCMAKE_CXX_COMPILER=clang++\
  -DCMAKE_C_FLAGS=-fsanitize=dataflow\
  -DCMAKE_CXX_FLAGS=-fsanitize=dataflow\
  -DLIBCXX_ENABLE_SHARED=OFF\
  -DLIBCXX_CXX_ABI=libcxxabi\
  -DLIBCXX_ENABLE_STATIC_ABI_LIBRARY=ON\
  -DLIBCXX_CXX_ABI_INCLUDE_PATHS=../libcxxabi/include/\
  -DLIBCXX_CXX_ABI_LIBRARY_PATH=../build_libcxxabi/lib/\
  ../libcxx

Afterwards we should only libc++.a in the ${INSTALL}/lib directory. Let’s do a simple sanity check to verify that our new build works correctly:

#include <vector>
#include <numeric>
#include <cassert>

#include <sanitizer/dfsan_interface.h>

int main(void)
{

  int i = 1;
  dfsan_label i_label = dfsan_create_label("i", 0);
  dfsan_set_label(i_label, &i, sizeof(i));

  int j = 2;
  dfsan_label j_label = dfsan_create_label("j", 0);
  dfsan_set_label(j_label, &j, sizeof(j));

  std::vector<int> vec{1, 2, i};
  dfsan_label test_label = dfsan_read_label(&vec.at(2), sizeof(int));
  assert(dfsan_has_label(test_label, i_label));

  int test = std::accumulate(vec.begin(), vec.end(), j);
  test_label = dfsan_read_label(&test, sizeof(test));
  assert(dfsan_has_label(test_label, i_label));
  assert(dfsan_has_label(test_label, j_label));

  return 0;
}

To build this, we need to specify in clang arguments that libc++ should be used. Furthermore, we need to provide include and library directories to libc++, unless the standard library was installed in the same directory tree as clang. In such case, the compiler will be able to pick up paths automatically. The link with libc++abi is unnecesary when libraries were merged in the last build step.

clang++ -I /path/to/libc++/installation/include/c++/v1\
  -fsanitize=dataflow\
  -L /path/to/libc++/installation/lib\
  -Wl,--start-group,-lc++abi\
  -stdlib=libc++ dfsan_test.cpp\
  -o dfsan_test.exe

The Docker image with the complete build is available as `mcopik/clang-dfsan:libcxx-${VERSION}.

Complete Toolchain

Finally, we can assemble a complete C++ toolchain. In addition to standard C++ library, the compilation workflow has several dependencies such as linker, runtime library or implementation of exceptions. One could use standard, widely-used and easily available tools or try to build a pure LLVM pipeline.

GCC

The required packages on a Ubuntu-based distro are binutils for standard GNU linker ld, libc-dev for standard C library and libgcc-${VERSION}-dev to provide runtime library and stack unwinder. Atomics require additional libraries2.

Clang

On Debian-based distributions, I recommend to use libc-6-dev since it’s perfectly compatible with clang programs. The necessary runtime functions are already provided with compiler-rt. Common alternatives to GNU ld linker are GNU gold or LLVM lld.

Again, the Docker image with the complete build is available as mcopik/clang-dfsan:dfsan-${VERSION}. It comes with two wrappers, clang-dfsan and clang++-dfsan, that include flags necessary to enable dataflow sanitization and link against dedicated build of libc++.

OpenMP

LLVM has a separate implementation of OpenMP, a widely-used library for multithreading and shared-memory parallelism. The build steps are similar to other project, with minor differences. First, an additional CMake parameter LIBOMP_ENABLE_SHARED=Off is used to enforce building a static library. Second, there are several functions that cannot be instrumented because their implementations are provided only in assembly. On Linux, these functions are provided in runtime/src/z_Linux_asm.S and dfsan blacklist has to be updated:

fun:__kmp_x86_cpuid=uninstrumented
fun:__kmp_x86_cpuid=discard
fun:__kmp_store_x87_fpu_control_word=uninstrumented
fun:__kmp_store_x87_fpu_control_word=discard
fun:__kmp_load_x87_fpu_control_word=uninstrumented
fun:__kmp_load_x87_fpu_control_word=discard
fun:__kmp_clear_x87_fpu_status_word=uninstrumented
fun:__kmp_clear_x87_fpu_status_word=functional
fun:__kmp_hardware_timestamp=uninstrumented
fun:__kmp_hardware_timestamp=discard

fun:__kmp_fork_call=uninstrumented
fun:__kmp_fork_call=functional
fun:__kmp_serialized_parallel=uninstrumented
fun:__kmp_serialized_parallel=functional
fun:__kmp_invoke_microtask=uninstrumented
fun:__kmp_invoke_microtask=functional

The library can be build similarly to the previous step:

cmake -G "Ninja"\
  -DCMAKE_BUILD_TYPE=MinSizeRel\
  -DCMAKE_INSTALL_PREFIX=/opt/llvm\
  -DCMAKE_C_COMPILER=clang\
  -DCMAKE_CXX_COMPILER=clang++\
  -DCMAKE_C_FLAGS="-fsanitize=dataflow -fsanitize-blacklist=/dfsan_abilist.txt"\
  -DCMAKE_CXX_FLAGS="-fsanitize=dataflow -fsanitize-blacklist=/dfsan_abilist.txt"\
  -DLIBCXX_ENABLE_SHARED=OFF\
  -DLIBCXX_CXX_ABI=libcxxabi\
  -DLIBCXX_ENABLE_STATIC_ABI_LIBRARY=ON\
  -DLIBCXX_CXX_ABI_INCLUDE_PATHS=../libcxxabi/include/\
  -DLIBCXX_CXX_ABI_LIBRARY_PATH=../build_libcxxabi/lib/\
  ../libcxx

Finally, we can try to compile a program parallelized with OpenMP by adding the flags -I/path/to/openmp/include -fopenmp /path/to/openmp/lib/libiomp5.a.

#include <vector>
#include <numeric>
#include <cassert>

#include <omp.h>

#include <sanitizer/dfsan_interface.h>

int main(void)
{

  int m = 1;
  dfsan_label m_label = dfsan_create_label("m", 0);
  dfsan_set_label(m_label, &m, sizeof(m));
  int n = 100;

  #pragma omp parallel
  {
    int * sum = new int[omp_get_num_threads()];
    #pragma omp for
    for(int i = 0; i < n; ++i)
      sum[omp_get_thread_num()] += m;
    dfsan_label test_label = dfsan_read_label(&sum[omp_get_thread_num()], sizeof(int));
    assert(dfsan_has_label(test_label, m_label));
    delete[] sum;
  }

  return 0;
}

MPI

The MPI can be compiled with dfsan support, an example of build for OpenMPI:

CC=${CLANG} CFLAGS="-fsanitize=dataflow"\
  && ${OPENMPI_DIR}/configure\
  && --enable-static=yes\
  && --enable-shared=false

Compiling the library requires additional blacklist update because small assembly functions are used during the configuration to test the compiler:

fun:gsym_test_func=uninstrumented

Apart from that change, the compilation should work seamlessly. C++ compiler should be unnecessary because C++ bindings are deprecated and not built by default

An entirely different question is if instrumenting MPI is even necessary? MPI operations tend to overwrite memory content with data received from network and it requires an additional library to handle label propagation through MPI messages. Furthermore, using a dedicated build prevents from running the program on a supercomputer. Thus, a better option might be to simply put MPI functions in the blacklist.

References




    Enjoy Reading This Article?

    Here are some more articles you might like to read next:

  • String formatting with optional values
  • LLVM Machine Passes
  • Read the Freaking Documentation
  • Yet another lesson on undefined behavior
  • Minimal LLVM lit configuration