RDMA + FaaS = rFaaS

Solving the high-latency invocation problem of FaaS.

The utilization of CPU cores in the Piz Daint supercomputer, April 2021.

The world of supercomputing is dominated by the MPI programming model and batch systems. These solutions provide the high performance needed for scalable scientific computing, but they make it particularly difficult to achieve high machine utilization. As shown on the CPU utilization graph from Piz Daint (above), many resources are available only for a short time, and long-running, persistent batch jobs cannot use them.

rFaaS combines the flexibility of FaaS with the high performance of batch jobs.

FaaS provides an elastic resource management system that could address this utilization problem. However, serverless struggles to achieve the performance needed in high-performance computing: slow invocations, low network bandwidth, and the overheads of the FaaS management system make it difficult to incorporate serverless functions when every millisecond counts. Therefore, we decided to combine the best of both worlds: the elasticity of FaaS and the high performance of cluster batch systems. We built a new FaaS platform with RDMA-accelerated network transport.

The lifetime of rFaaS invocations.

In rFaaS, the centralized schedulers and API gateway are replaced with a decentralized allocation mechanism. Instead of using a traditional cloud trigger, MPI applications query executor servers, obtain a resource allocation, and establish RDMA connections to remote workers. Every function is invoked by writing input data directly to the memory of the worker. This allows us to achieve single-digit microsecond hot invocation latency: hot invocations add less than 350 nanoseconds of overhead on top of the fastest available network transmission.
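The flow described above (query executors, obtain an allocation, invoke by writing into worker memory) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the rFaaS implementation: plain Python threads and a `bytearray` stand in for RDMA connections and registered memory, and the names `Worker`, `ExecutorServer`, `allocate`, `invoke`, and the header layout are all hypothetical. It shows why a hot worker needs no gateway or scheduler on the invocation path: the worker polls its own buffer and the client simply writes into it.

```python
import struct
import threading
import time

# Hypothetical buffer layout: an 8-byte header (invocation id, payload length)
# followed by the payload. Invocation id 0 means "no pending invocation".
HEADER = struct.Struct("<II")


class Worker:
    """Simulated hot worker that polls a buffer clients write into."""

    def __init__(self, func, size=4096):
        self.func = func
        self.inbox = bytearray(size)  # stands in for RDMA-registered memory
        self.results = {}             # invocation id -> result bytes
        threading.Thread(target=self._poll, daemon=True).start()

    def _poll(self):
        # Spin on the header instead of waiting for a trigger or gateway.
        while True:
            inv_id, length = HEADER.unpack_from(self.inbox, 0)
            if inv_id == 0:
                time.sleep(0)  # yield to other Python threads
                continue
            start = HEADER.size
            payload = bytes(self.inbox[start:start + length])
            HEADER.pack_into(self.inbox, 0, 0, 0)  # buffer is free again
            self.results[inv_id] = self.func(payload)


class ExecutorServer:
    """Hypothetical executor server that leases idle workers to clients."""

    def __init__(self, workers):
        self._idle = list(workers)
        self._lock = threading.Lock()

    def lease(self):
        with self._lock:
            return self._idle.pop() if self._idle else None


def allocate(executors):
    """Client side: query executors directly until one grants a worker."""
    for executor in executors:
        worker = executor.lease()
        if worker is not None:
            return worker
    raise RuntimeError("no idle worker available")


def invoke(worker, inv_id, payload):
    """Write the payload, then the header, into the worker's buffer."""
    if inv_id == 0 or HEADER.size + len(payload) > len(worker.inbox):
        raise ValueError("invalid invocation id or payload too large")
    start = HEADER.size
    worker.inbox[start:start + len(payload)] = payload  # simulated RDMA write
    HEADER.pack_into(worker.inbox, 0, inv_id, len(payload))  # publish the call
    while inv_id not in worker.results:
        time.sleep(0)
    return worker.results.pop(inv_id)


if __name__ == "__main__":
    executors = [ExecutorServer([]),
                 ExecutorServer([Worker(lambda data: data[::-1])])]
    worker = allocate(executors)
    print(invoke(worker, 1, b"rfaas"))
```

The client writes the payload before the header so that the worker never sees a valid header with incomplete input. A real RDMA transport would need its own ordering and completion handling, which this simulation does not model.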

The integration of rFaaS with cluster batch systems and MPI applications.

rFaaS comes with a dedicated resource manager that notifies MPI jobs about changing resource availability in an efficient and scalable manner. In addition, the accounting system uses RDMA atomic operations to minimize the overheads of user management. In rigid MPI jobs, processes (ranks) allocate and use functions independently, which allows them to offload parts of the computation to idle cluster capacity in a decentralized way.
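To give a feel for accounting with atomic operations, here is a minimal simulation of several ranks charging usage to one shared counter through fetch-and-add. It is a sketch under assumptions, not a description of the rFaaS accounting system: the text above says only that RDMA atomics are used, so the `RemoteCounter` class, the per-invocation `cost`, and the idea that each rank updates a single counter are all hypothetical. A lock models the atomicity that the network hardware would provide.

```python
import threading


class RemoteCounter:
    """Simulated 64-bit word in remote memory (hypothetical).

    fetch_add mimics the semantics of an atomic fetch-and-add: it returns
    the old value and adds `delta` in one indivisible step.
    """

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # models hardware-provided atomicity

    def fetch_add(self, delta):
        with self._lock:
            old = self._value
            self._value = (old + delta) % 2**64  # wrap like a 64-bit word
            return old

    def read(self):
        with self._lock:
            return self._value


def run_rank(counter, invocations, cost):
    # Each simulated rank charges its own usage with one atomic per call;
    # no request/response exchange with a manager process is modeled.
    for _ in range(invocations):
        counter.fetch_add(cost)


def simulate(ranks=4, invocations=1000, cost=3):
    """Run several ranks concurrently and return the total charged usage."""
    counter = RemoteCounter()
    threads = [
        threading.Thread(target=run_rank, args=(counter, invocations, cost))
        for _ in range(ranks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.read()


if __name__ == "__main__":
    print(simulate())
```

Because every update is a single indivisible operation, concurrent ranks cannot lose each other's updates, and the final total equals the sum of all charges regardless of interleaving.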

Invocation latencies of FaaS frameworks.

We have shown that rFaaS invocations provide the lowest latency and the highest bandwidth compared to state-of-the-art solutions. Furthermore, we have shown that rFaaS functions can be integrated into scientific applications to provide cheap and easy acceleration. More insights and results can be found in the paper that has been accepted at IPDPS 2023.