# Work-stealing prefix scan

Developing new prefix scan algorithms for unbalanced workloads.

In my Master thesis, supervised by Prof. Paolo Bientinesi and Prof. Benjamin Berkels, Iâ€™ve had the opportunity to work on the problem of **electron microscopy image registration**. Instead of acquiring a single and high-quality image, which requires a long acquisition with a huge amount of energy affecting the sample, one could acquire a series of low-energy images. The images are a bit noisy, the sample is drifting while the acquisition takes place, and frames need to be *registered* by aligning them to a single, common coordinate system.

To align images \(f_0\) and \(f_1\), we need to find a transformation \(\phi_{0,1}\). Due to the periodical behavior of samples, we are not allowed to align any two images directly - we need to consider all subsequent steps. For example, alignment of \(f_0\) and \(f_3\) requires computing the deformations \(\phi_{0, 1}, \phi_{1, 2}\) to produce \(\phi_{0, 2}\), and estimating the transformation \(\phi_{0,3}\) using \(\phi_{0,2}\) and \(\phi_{2,3}\).

To a computer scientist, this problem is a classic example of a **prefix sum (scan)**. In a prefix computation, a sequence of elements \(x_0, x_1, \dots, x_{n}\) is transformed to produce elements \(y_i = x_0 \odot x_1 \odot \dots \odot x_{i-1}\). Prefix sums have been studied extensively since the 60â€™s because binary addition is a prefix sum problem, and scan parallelizations allow to build faster binary adders. Prefix scans are considered a classic **parallelism pattern** - many seemingly sequential problems can be parallelized when represented as a prefix scan.

While our problem can be represented as a prefix scan, its characteristics are quite different from the scans analyzed in the literature: the image registration operator is **computationally intensive** and has **unpredictable and highly imbalanced execution cost**. In the Master thesis, I analyzed the prefix sum algorithms and their distributed implementations, and I found out that the `MPI_Scan`

performance is unsatisfactory. The work has been presented as a poster at the Student Research Competition at SC â€˜17.

To provide a fast and efficient parallelization strategy for the image registration, we developed a novel **work-stealing prefix scan** algorithm to provide **load balance** for the tightly constrained scan algorithms. We have shown that the new approach provides up to a two-fold decrease in computation time and up to a 2.25x decrease in CPU resources and energy consumption. The research paper has been published in the IEEE Transactions on Parallel and Distributed Systems.