Publications
2025
- [CGO] A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications. Lukas Trümper, Philipp Schaad, Berke Ates, Alexandru Calotoiu, Marcin Copik, and Torsten Hoefler. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, 2025.
2024
- [HPDC] FaaSKeeper: Learning from Building Serverless Services with ZooKeeper as an Example. Marcin Copik, Alexandru Calotoiu, Pengyu Zhou, Konstantin Taranov, and Torsten Hoefler. In Proceedings of the ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2024.
FaaS (Function-as-a-Service) brought a fundamental shift into cloud computing: (persistent) virtual machines have been replaced with dynamically allocated resources, trading locality and statefulness for a pay-as-you-go model more suitable for varying and infrequent workloads. However, adapting services to functions in the serverless paradigm while still fulfilling functional requirements is challenging. In this work, we demonstrate how ZooKeeper, a centralized coordination service that offers a safe and wait-free consensus mechanism, can be redesigned to benefit from serverless computing. We define synchronization primitives to extend the capabilities of scalable cloud storage and contribute a set of requirements for efficient and scalable FaaS computing. We introduce FaaSKeeper, the first coordination service built on serverless functions and cloud-native services, and share serverless design lessons based on our experiences of implementing a ZooKeeper model deployable to clouds today. FaaSKeeper provides the same consistency guarantees and interface as ZooKeeper, with a serverless price model that lowers costs by up to 450 times on infrequent workloads.
- Software Resource Disaggregation for HPC with Serverless Computing. Marcin Copik, Marcin Chrapek, Larissa Schmid, Alexandru Calotoiu, and Torsten Hoefler. In Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium, 2024.
Aggregated HPC resources have rigid allocation systems and programming models which struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses. In this paper, we improve the utilization of supercomputers by employing the new cloud paradigm of serverless computing. We show how serverless functions provide fine-grained access to the resources of batch-managed cluster nodes. We present an HPC-oriented Function-as-a-Service (FaaS) that satisfies the requirements of high-performance applications. We demonstrate a software resource disaggregation approach where placing functions on unallocated and underutilized nodes allows idle cores and accelerators to be utilized while retaining near-native performance.
- [IEEE CiSE] XaaS: Acceleration as a Service to Enable Productive High-Performance Cloud Computing. Torsten Hoefler, Marcin Copik, Pete Beckman, Andrew Jones, Ian Foster, Manish Parashar, Daniel Reed, Matthias Troyer, Thomas Schulthess, Dan Ernst, and Jack Dongarra. 2024.
HPC and Cloud have evolved independently, specializing their innovations into performance or productivity. Acceleration as a Service (XaaS) is a recipe to empower both fields with a shared execution platform that provides transparent access to computing resources, regardless of the underlying cloud or HPC service provider. Bridging HPC and cloud advancements, XaaS presents a unified architecture built on performance-portable containers. Our converged model concentrates on low-overhead, high-performance communication and computing, targeting resource-intensive workloads from climate simulations to machine learning. XaaS lifts the restricted allocation model of Function-as-a-Service (FaaS), allowing users to benefit from the flexibility and efficient resource utilization of serverless while supporting long-running and performance-sensitive workloads from HPC.
- [SoCC] Process-as-a-Service: Unifying Elastic and Stateful Clouds with Serverless Processes. Marcin Copik, Alexandru Calotoiu, Gyorgy Rethy, Roman Böhringer, Rodrigo Bruno, and Torsten Hoefler. In Proceedings of the 2024 ACM Symposium on Cloud Computing, Redmond, WA, USA, 2024.
Fine-grained and ephemeral functions power many new applications that benefit from elastic scaling and a pay-as-you-use billing model with minimal infrastructure management overhead. To achieve these properties, Function-as-a-Service (FaaS) platforms disaggregate compute and state and, consequently, introduce non-trivial costs due to the loss of data locality, complex control plane interactions, and expensive communication to access state. We revisit the foundations of FaaS and propose a new cloud abstraction, the cloud process, that retains all the benefits of FaaS while significantly reducing the overheads that result from disaggregation. We show how established operating system abstractions can be adapted to provide powerful granular computing on dynamically provisioned cloud resources, and we use them to build our Process-as-a-Service (PraaS) platform. PraaS improves on current FaaS by offering data locality, fast invocations, and efficient communication. PraaS delivers remote invocations up to 17 times faster and reduces communication overhead by up to 99%.
- [IXPUG @ SC] Protocol Buffer Deserialization DPU Offloading in the RPC Datapath. Raphaël Frantz, Jerónimo Sánchez García, Marcin Copik, Idelfonso Tafur Monroy, Juan José Vegas Olmos, Gil Bloch, and Salvatore Di Girolamo. IXPUG Workshop at ACM/IEEE Supercomputing, 2024.
In the microservice paradigm, monolithic applications are decomposed into finer-grained modules invoked independently in a data-flow fashion. The different modules communicate through remote procedure calls (RPCs), which constitute a critical component of the infrastructure. To ensure portable passage of RPC metadata, arguments, and return values between different microservices, RPCs involve serialization/deserialization activities, part of the RPC data center tax. We demonstrate how RPC server logic, including serialization/deserialization, can be offloaded to Data Processing Units (DPUs). This effectively reduces the RPC data center tax on the host, where applications’ business logic runs. While we focus on offloading Protocol Buffers deserialization used by the popular gRPC framework, our findings can be applied to other RPC infrastructures. Our experimental results demonstrate that RPC offloading performs similarly to traditional methods while significantly reducing CPU usage.
- [SC] MIGnificient: Fast, Isolated, and GPU-Enabled Serverless Functions. Marcin Copik, Alexandru Calotoiu, Pengyu Zhou, Lukas Tobler, and Torsten Hoefler. Research Poster at ACM/IEEE Supercomputing, 2024. Best Research Poster Award.
The security of High-Performance Computing is becoming more important with new applications in machine learning and medical data processing. At the same time, the convergence of HPC and cloud computing brings a demand for workload co-location and resource sharing. Instead of providing security guarantees through exclusive resource locations and physical isolation, HPC systems must offer new methods that retain high utilization. These must include GPUs, which have become essential in HPC systems: accelerators are used by 68% of the top 50 systems on the TOP500 list. However, GPUs are often underutilized, as even workloads like machine learning training spend a significant fraction of their time on communication and CPU tasks. The growing capabilities of each new GPU generation make it even more difficult to saturate the device with a single application. These devices could be shared in Function-as-a-Service (FaaS), a new cloud programming model designed around fine-grained functions. There, functions execute on resources assigned by the load balancer, allowing system operators to boost utilization through dynamic scheduling. While traditional functions use containers and microVMs to share CPUs, a new model is needed to securely and efficiently co-locate computations on GPUs.
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows. Larissa Schmid, Marcin Copik, Alexandru Calotoiu, Laurin Brandner, Anne Koziolek, and Torsten Hoefler. 2024.
Serverless computing has emerged as a prominent paradigm, with a significant adoption rate among cloud customers. While this model offers advantages such as abstraction from deployment and resource scheduling, it also poses limitations in handling complex use cases due to the restricted nature of individual functions. Serverless workflows address this limitation by orchestrating multiple functions into a cohesive application. However, existing serverless workflow platforms exhibit significant differences in their programming models and infrastructure, making fair and consistent performance evaluations difficult in practice. To address this gap, we propose SeBS-Flow, the first serverless workflow benchmarking suite, providing a platform-agnostic workflow model that enables consistent benchmarking across various platforms. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations. We make our benchmark suite open source, enabling rigorous and comparable evaluations of serverless workflows over time.
- Demystifying Chains, Trees, and Graphs of Thoughts. Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Aidan O’Mahony, Onur Mutlu, and Torsten Hoefler. 2024.
The field of natural language processing (NLP) has witnessed significant progress in recent years, with a notable focus on improving large language models’ (LLM) performance through innovative prompting techniques. Among these, prompt engineering coupled with structures has emerged as a promising paradigm, with designs such as Chain-of-Thought, Tree of Thoughts, or Graph of Thoughts, in which the overall LLM reasoning is guided by a structure such as a graph. As illustrated with numerous examples, this paradigm significantly enhances the LLM’s capability to solve numerous tasks, ranging from logical or mathematical reasoning to planning or creative writing. To facilitate the understanding of this growing field and pave the way for future developments, we devise a general blueprint for effective and efficient LLM reasoning schemes. For this, we conduct an in-depth analysis of the prompt execution pipeline, clarifying and clearly defining different concepts. We then build the first taxonomy of structure-enhanced LLM reasoning schemes. We focus on identifying fundamental classes of harnessed structures, and we analyze the representations of these structures, algorithms executed with these structures, and many others. We refer to these structures as reasoning topologies, because their representation becomes to a degree spatial, as they are contained within the LLM context. Our study compares existing prompting schemes using the proposed taxonomy, discussing how certain design choices lead to different patterns in performance and cost. We also outline theoretical underpinnings, relationships between prompting and other parts of the LLM ecosystem such as knowledge bases, and the associated research challenges. Our work will help to advance future prompt engineering techniques.
2023
- FMI: Fast and Cheap Message Passing for Serverless Functions. Marcin Copik, Roman Böhringer, Alexandru Calotoiu, and Torsten Hoefler. In Proceedings of the 37th International Conference on Supercomputing, Orlando, FL, USA, 2023.
Serverless functions provide elastic scaling and a fine-grained billing model, making Function-as-a-Service (FaaS) an attractive programming model. However, for distributed jobs that benefit from large-scale and dynamic parallelism, the lack of fast and cheap communication is a major limitation. Individual functions cannot communicate directly, group operations do not exist, and users resort to manual implementations of storage-based communication. This results in communication times multiple orders of magnitude slower than those found in HPC systems. We overcome this limitation and present the FaaS Message Interface (FMI). FMI is an easy-to-use, high-performance framework for general-purpose point-to-point and collective communication in FaaS applications. We support different communication channels and offer a model-driven channel selection according to performance and cost expectations. We model the interface after MPI and show that message passing can be integrated into serverless applications with minor changes, providing portable communication closer to that offered by high-performance systems. In our experiments, FMI can speed up communication for a distributed machine learning FaaS application by up to 162x, while simultaneously reducing cost by up to 397 times.
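For a flavor of the programming model, below is a minimal sketch of MPI-style collectives inside a function handler. The fmi:: names and signatures are illustrative assumptions rather than FMI's published API, and the stubbed allreduce stands in for the cost-model-driven channel selection described above.

```cpp
// Illustrative only: the fmi:: names and signatures below are assumptions,
// not FMI's published API. The sketch shows the MPI-style pattern from the
// abstract: an allreduce across concurrently running serverless functions.
#include <cstdio>
#include <vector>

namespace fmi {
struct Communicator {
    Communicator(int rank, int world_size) : rank(rank), world_size(world_size) {}
    // A real implementation would pick a channel (TCP, cloud storage, ...)
    // from a cost model and combine contributions of all peers. This stub
    // only handles the degenerate single-function case so the sketch runs.
    void allreduce(std::vector<double>& data) const {
        if (world_size == 1) return; // nothing to combine
    }
    int rank;
    int world_size;
};
} // namespace fmi

// Per-invocation handler: every function contributes a partial result and
// receives the globally reduced vector, with no manual storage polling.
std::vector<double> handler(const fmi::Communicator& comm, std::vector<double> partial) {
    comm.allreduce(partial);
    return partial;
}

int main() {
    fmi::Communicator comm(/*rank=*/0, /*world_size=*/1);
    auto result = handler(comm, {1.0, 2.0, 3.0});
    std::printf("result[0] = %g\n", result[0]);
}
```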
- rFaaS: Enabling High Performance Serverless with RDMA and Leases. Marcin Copik, Konstantin Taranov, Alexandru Calotoiu, and Torsten Hoefler. In Proceedings of the 37th IEEE International Parallel and Distributed Processing Symposium, 2023.
High performance is needed in many computing systems, from batch-managed supercomputers to general-purpose cloud platforms. However, scientific clusters lack elastic parallelism, while clouds cannot offer competitive costs for high-performance applications. In this work, we investigate how modern cloud programming paradigms can bring the elasticity needed to allocate idle resources, decreasing computation costs and improving overall data center efficiency. Function-as-a-Service (FaaS) brings the pay-as-you-go execution of stateless functions, but its performance characteristics cannot match coarse-grained cloud and cluster allocations. To make serverless computing viable for high-performance and latency-sensitive applications, we present rFaaS, an RDMA-accelerated FaaS platform. We identify critical limitations of serverless - centralized scheduling and inefficient network transport - and improve the FaaS architecture with allocation leases and microsecond invocations. We show that our remote functions add only negligible overhead on top of the fastest available networks, and we decrease the execution latency by orders of magnitude compared to contemporary FaaS systems. Furthermore, we demonstrate the performance of rFaaS by evaluating real-world FaaS benchmarks and parallel applications. Overall, our results show that new allocation policies and remote memory access help FaaS applications achieve high performance and bring serverless computing to HPC.
- High-Performance Serverless for HPC and Clouds. Marcin Copik and Torsten Hoefler. PhD Forum at the 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2023.
Function-as-a-Service (FaaS) computing brought a fundamental shift in resource management. It allowed for new and better solutions to the problem of low resource utilization, an issue that has been known for decades in data centers. The problem persists as the frequently changing resource availability cannot be addressed entirely with the techniques employed so far, such as persistent cloud allocations and batch jobs. The elastic fine-grained tasking and largely unconstrained scheduling of FaaS create new opportunities. Still, modern serverless platforms struggle to achieve the high performance needed for the most demanding and latency-critical workloads. Furthermore, many applications cannot be “FaaSified” without non-negligible loss in performance, and the short and stateless functions must be easy to program, debug, and optimize. By solving the fundamental performance challenges of FaaS, we can build a fast and efficient programming model that brings innovative cloud techniques into HPC data centers, allowing users to benefit from pay-as-you-go billing and helping operators to decrease running costs and their environmental impact. My PhD research attempts to bridge the gap between high-performance programming and modern FaaS computing frameworks. I have been working on tailored solutions for different levels of the FaaS computing stack: from computing and network devices to high-level optimizations and efficient system designs.
- [SC] High-Performance Serverless for HPC and Clouds. Marcin Copik and Torsten Hoefler. Doctoral Showcase at ACM/IEEE Supercomputing (SC23), 2023.
Function-as-a-Service (FaaS) computing brought a fundamental shift in resource management. It allowed for new and better solutions to the problem of low resource utilization, an issue that has been known for decades in data centers. The problem persists as the frequently changing resource availability cannot be addressed entirely with the techniques employed so far, such as persistent cloud allocations and batch jobs. The elastic fine-grained tasking and largely unconstrained scheduling of FaaS create new opportunities. Still, modern serverless platforms struggle to achieve the high performance needed for the most demanding and latency-critical workloads. Furthermore, many applications cannot be “FaaSified” without non-negligible loss in performance, and the short and stateless functions must be easy to program, debug, and optimize. By solving the fundamental performance challenges of FaaS, we can build a fast and efficient programming model that brings innovative cloud techniques into HPC data centers, allowing users to benefit from pay-as-you-go billing and helping operators to decrease running costs and their environmental impact. My PhD research attempts to bridge the gap between high-performance programming and modern FaaS computing frameworks. I have been working on tailored solutions for different levels of the FaaS computing stack: from computing and network devices to high-level optimizations and efficient system designs.
- [Big Data] User-guided Page Merging for Memory Deduplication in Serverless Systems. Wei Qiu, Marcin Copik, Yun Wang, Alexandru Calotoiu, and Torsten Hoefler. 2023.
Serverless computing is an emerging cloud paradigm that offers an elastic and scalable allocation of computing resources with pay-as-you-go billing. In the Function-as-a-Service (FaaS) programming model, applications comprise short-lived and stateless serverless functions executed in isolated containers or microVMs, which can quickly scale to thousands of instances and process terabytes of data. This flexibility comes at the cost of duplicated runtimes, libraries, and user data spread across many function instances, and cloud providers do not utilize this redundancy. The memory footprint of serverless computing forces the removal of idle containers to make space for new ones, which decreases performance through more cold starts and fewer data-caching opportunities. We address this issue by proposing to deduplicate memory pages of serverless workers with identical content, based on the content-based page-sharing concept of Linux Kernel Same-page Merging (KSM). We replace the background memory scanning process of KSM, as it is too slow to locate sharing candidates in short-lived functions. Instead, we design User-Guided Page Merging (UPM), a built-in Linux kernel module that leverages the madvise system call: we enable users to advise the kernel of memory areas that can be shared with others. We show that UPM reduces memory consumption by up to 55% on 16 concurrent containers executing a typical image recognition function, more than doubling the density for containers of the same function that can run on a system.
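For context, stock KSM already lets a process opt memory regions into merging through madvise(2); the sketch below shows that standard Linux opt-in. UPM keeps this user-facing pattern but replaces the background scanner; its specific advice flag is internal to the paper and not shown here.

```cpp
// Linux-only sketch of the stock KSM opt-in (requires CONFIG_KSM and an
// active ksmd). A process marks a region as mergeable; the kernel may then
// share identical pages copy-on-write. UPM keeps this madvise-based pattern
// but merges immediately instead of waiting for the background scanner.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t len = 1 << 20; // 1 MiB, page-aligned because mmap'd
    void* region = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Advise the kernel that identical pages in this region (e.g., duplicated
    // runtimes and libraries across function instances) may be deduplicated.
    if (madvise(region, len, MADV_MERGEABLE) != 0) std::perror("madvise");

    munmap(region, len);
    return 0;
}
```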
- Cppless: Productive and Performant Serverless Programming in C++. Lukas Möller, Marcin Copik, Alexandru Calotoiu, and Torsten Hoefler. 2023.
The rise of serverless computing introduced a new class of scalable, elastic, and highly available parallel workers in the cloud. Many systems and applications benefit from offloading computations and parallel tasks to dynamically allocated resources. However, developers of C++ applications have found it difficult to integrate functions due to complex deployment, a lack of compatibility between client and cloud environments, and loosely typed input and output data. To enable single-source and efficient serverless acceleration in C++, we introduce Cppless, an end-to-end framework for implementing serverless functions that handles the creation, deployment, and invocation of functions. Cppless is built on top of LLVM and requires only two compiler extensions to automatically extract C++ function objects and deploy them to the cloud. We demonstrate that offloading parallel computations from a C++ application to serverless workers can provide up to 30x speedup, requiring only minor code modifications and costing less than one cent per computation.
2022
- Work-stealing prefix scan: Addressing load imbalance in large-scale image registration. Marcin Copik, Tobias Grosser, Torsten Hoefler, Paolo Bientinesi, and Benjamin Berkels. IEEE Transactions on Parallel and Distributed Systems, 2022.
Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series.
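The enabling observation is that composing per-frame transforms is associative, which turns the seemingly sequential registration into a prefix scan. A minimal sketch, assuming a toy 1-D affine transform in place of the paper's real deformation model:

```cpp
// A minimal sketch, assuming a toy 1-D affine transform (x -> scale*x + offset)
// in place of the real deformations. Because composition is associative, the
// cumulative transforms form a prefix scan, so std::inclusive_scan (or a
// hierarchical work-stealing scan) applies directly.
#include <cstdio>
#include <numeric>
#include <vector>

struct Transform { double scale, offset; }; // x -> scale * x + offset

// Associative composition: apply 'a' first, then 'b'.
Transform compose(const Transform& a, const Transform& b) {
    return { b.scale * a.scale, b.scale * a.offset + b.offset };
}

int main() {
    // One transform per consecutive frame pair of the image series.
    std::vector<Transform> step = { {1.0, 0.5}, {1.0, -0.2}, {1.0, 0.1} };
    std::vector<Transform> cumulative(step.size());
    std::inclusive_scan(step.begin(), step.end(), cumulative.begin(), compose);
    for (const auto& t : cumulative)
        std::printf("scale=%g offset=%g\n", t.scale, t.offset);
}
```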
- Performance-Detective: Automatic Deduction of Cheap and Accurate Performance Models. Larissa Schmid, Marcin Copik, Alexandru Calotoiu, Dominik Werle, Andreas Reiter, Michael Selzer, Anne Koziolek, and Torsten Hoefler. In Proceedings of the 36th ACM International Conference on Supercomputing, Virtual Event, 2022.
The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users in understanding system performance and choosing a fast configuration. Existing performance modeling approaches for applications and configurable systems either require a full-factorial experiment design or a sampling design based on heuristics. This results in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use the insights to derive the smallest necessary experiment design and to avoid repetitions of measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective using two case studies where we reduce the number of measurements from up to 3125 to only 25, decreasing cost to only 2.9% of the previously needed core hours, while maintaining a model accuracy of 91.5%, compared to 93.8% when using all 3125 measurements.
- [ACM SRC] Software Resource Disaggregation for HPC with Serverless Computing. Marcin Copik, Alexandru Calotoiu, and Torsten Hoefler. ACM Student Research Competition at ACM/IEEE Supercomputing, 2022. Gold Medal in the competition.
Aggregated HPC resources have rigid allocation systems and programming models and struggle to adapt to diverse and changing workloads. Thus, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses. In this project, we use the new cloud paradigm of serverless computing to improve the utilization of supercomputers. We show that the FaaS programming model satisfies the requirements of high-performance applications and how idle memory helps resolve cold startup issues. We demonstrate a software resource disaggregation approach where the co-location of functions allows idle cores and accelerators to be utilized while retaining near-native performance.
- [IMPACT] MOM: Matrix Operations in MLIR. Lorenzo Chelini, Henrik Barthels, Paolo Bientinesi, Marcin Copik, Tobias Grosser, and Daniele G. Spampinato. 12th International Workshop on Polyhedral Compilation Techniques, 2022.
Modern research in code generators for dense linear algebra computations has shown the ability to produce optimized code whose performance matches and often exceeds that of state-of-the-art implementations by domain experts. However, the underlying infrastructure is often developed in isolation, making the interconnection of logically combinable systems complicated, if not impossible. In this paper, we propose to leverage MLIR as a unifying compiler infrastructure for the optimization of dense linear algebra operations. We propose a new MLIR dialect for expressing linear algebraic computations, including matrix properties, to enable high-level algorithmic transformations. The integration of this new dialect in MLIR enables end-to-end compilation of matrix computations via conversion to existing lower-level dialects already provided by the framework.
2021
- [Middleware] SeBS: A Serverless Benchmark Suite for Function-as-a-Service Computing. Marcin Copik, Grzegorz Kwasniewski, Maciej Besta, Michal Podstawski, and Torsten Hoefler. In Proceedings of the 22nd International Middleware Conference, Québec City, Canada, 2021.
Function-as-a-Service (FaaS) is one of the most promising directions for the future of cloud services, and serverless functions have immediately become a new middleware for building scalable and cost-efficient microservices and applications. However, the quickly moving technology hinders reproducibility, and the lack of a standardized benchmarking suite leads to ad-hoc solutions and microbenchmarks being used in serverless research, further complicating meta-analysis and comparison of research solutions. To address this challenge, we propose the Serverless Benchmark Suite: the first benchmark for FaaS computing that systematically covers a wide spectrum of cloud resources and applications. Our benchmark consists of the specification of representative workloads, the accompanying implementation and evaluation infrastructure, and the evaluation methodology that facilitates reproducibility and enables interpretability. We demonstrate that the abstract model of a FaaS execution environment ensures the applicability of our benchmark to multiple commercial providers such as AWS, Azure, and Google Cloud. Our work facilitates the experimental evaluation of serverless systems and delivers a standardized, reliable, and evolving evaluation methodology of performance, efficiency, scalability, and reliability of middleware FaaS platforms.
- Extracting Clean Performance Models from Tainted Programs. Marcin Copik, Alexandru Calotoiu, Tobias Grosser, Nicolas Wicki, Felix Wolf, and Torsten Hoefler. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, 2021.
Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides reasoning about program behavior, such models can also be automatically derived from performance data. This is called empirical performance modeling. While this sounds simple at first glance, the approach faces several serious interrelated challenges, including expensive performance measurements, inaccuracies inflicted by noisy benchmark data, and overall complex experiment design, starting with the selection of the right parameters. The more parameters one considers, the more experiments are needed and the stronger the impact of noise. In this paper, we show how taint analysis, a technique borrowed from the domain of computer security, can substantially improve the modeling process, lowering its cost, improving model quality, and helping validate performance models and experimental setups.
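For context, empirical modeling tools in this line of work, such as Extra-P, fit measurements to a performance model normal form; taint analysis then tells the modeler which parameters can be pruned before fitting:

```latex
% Performance model normal form (PMNF), as used by Extra-P-style empirical
% modeling: each of the n terms combines a polynomial and a polylogarithmic
% factor in the parameter p (e.g., the process count); the c_k are fitted.
% Taint analysis rules out parameters that provably cannot influence a region.
f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)
```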
- [VLDB] GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra. Maciej Besta, Zur Vonarburg-Shmaria, Yannick Schaffner, Leonardo Schwarz, Grzegorz Kwasniewski, Lukas Gianinazzi, Jakub Beranek, Kacper Janda, Tobias Holenstein, Sebastian Leisinger, Peter Tatkowski, Esref Ozdemir, Adrian Balla, Marcin Copik, Philipp Lindenberger, Pavel Kalvoda, Marek Konieczny, Onur Mutlu, and Torsten Hoefler. In Proceedings of the 47th International Conference on Very Large Data Bases (VLDB’21), 2021.
We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by up to >2x), maximal clique listing (by up to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x), also obtaining better theoretical performance bounds.
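To illustrate the set-algebra decomposition, here is a self-contained sketch of one canonical building block: counting triangles by intersecting sorted neighbor lists, an example of the set-intersection primitive that GMS exposes as a swappable module.

```cpp
// A sketch of the set-algebra building block: triangle counting reduces to
// intersecting sorted neighbor lists, the kind of primitive that GMS lets
// users swap between representations and implementations.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <vector>

using AdjList = std::vector<std::vector<int>>; // sorted neighbor arrays

std::uint64_t count_triangles(const AdjList& g) {
    std::uint64_t triangles = 0;
    std::vector<int> common;
    for (int u = 0; u < (int)g.size(); ++u)
        for (int v : g[u]) {
            if (v <= u) continue; // visit each edge (u,v) once, with u < v
            common.clear();
            // Every w in N(u) intersect N(v) closes a triangle over edge (u,v).
            std::set_intersection(g[u].begin(), g[u].end(),
                                  g[v].begin(), g[v].end(),
                                  std::back_inserter(common));
            for (int w : common)
                if (w > v) ++triangles; // enforce u < v < w: count each once
        }
    return triangles;
}

int main() {
    const AdjList g = { {1, 2}, {0, 2}, {0, 1, 3}, {2} }; // one triangle: 0-1-2
    std::printf("triangles = %llu\n", (unsigned long long)count_triangles(g));
}
```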
- [MICRO] SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems. Maciej Besta, Raghavendra Kanakagiri, Grzegorz Kwasniewski, Rachata Ausavarungnirun, Jakub Beránek, Konstantinos Kanellopoulos, Kacper Janda, Zur Vonarburg-Shmaria, Lukas Gianinazzi, Ioana Stefan, Juan Gómez Luna, Marcin Copik, Lukas Kapp-Schwoerer, Salvatore Di Girolamo, Marek Konieczny, Onur Mutlu, and Torsten Hoefler. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021.
Simple graph algorithms such as PageRank have recently been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this with a simple yet surprisingly powerful observation: operations on sets of vertices, such as intersection or union, form a large part of many complex graph mining algorithms, and can offer rich and simple parallelism at multiple levels. This observation drives our cross-layer design, in which we (1) expose set operations using a novel programming paradigm, (2) express and execute these operations efficiently with carefully designed set-centric ISA extensions called SISA, and (3) use PIM to accelerate SISA instructions. The key design idea is to alleviate the bandwidth needs of SISA instructions by mapping set operations to two types of PIM: in-DRAM bulk bitwise computing for bitvectors representing high-degree vertices, and near-memory logic layers for integer arrays representing low-degree vertices. Set-centric SISA-enhanced algorithms are efficient and outperform hand-tuned baselines, offering more than 10x speedup over the established Bron-Kerbosch algorithm for listing maximal cliques. We deliver more than 10 SISA set-centric algorithm formulations, illustrating SISA’s wide applicability.
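A plain-software analogue of the two set representations that SISA maps onto the two PIM types, assuming a toy 64-vertex universe: dense bitvectors processed with bulk bitwise AND, and sorted integer arrays intersected by merging.

```cpp
// Software analogue of SISA's two set representations: dense bitvectors
// (mapped to in-DRAM bulk bitwise operations for high-degree vertices) and
// sorted integer arrays (mapped to near-memory merge logic for low-degree
// vertices). The 64-vertex universe is a toy assumption.
#include <bitset>
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t N = 64; // toy vertex-universe size

// High-degree case: intersection is one AND per machine word.
std::bitset<N> intersect_dense(const std::bitset<N>& a, const std::bitset<N>& b) {
    return a & b;
}

// Low-degree case: linear merge of two sorted neighbor arrays.
std::vector<int> intersect_sparse(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}

int main() {
    std::bitset<N> a, b;
    for (int v : {1, 5, 9})  a.set(v);
    for (int v : {5, 9, 12}) b.set(v);
    std::printf("dense  |a & b| = %zu\n", intersect_dense(a, b).count());
    std::printf("sparse |a & b| = %zu\n", intersect_sparse({1, 5, 9}, {5, 9, 12}).size());
}
```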
2020
- [Book Chapter] ExtraPeak: Advanced Automatic Performance Modeling for HPC Applications. Alexandru Calotoiu, Marcin Copik, Torsten Hoefler, Marcus Ritter, Sergei Shudler, and Felix Wolf. In Software for Exascale Computing - SPPEXA 2016-2019, 2020.
Performance models are powerful tools allowing developers to understand the behavior of their applications, and empower them to address performance issues already during the design or prototyping phase. Unfortunately, the difficulty of creating such models manually and the effort involved render performance modeling a topic limited to a relatively small community of experts. This article summarizes the results of the two projects Catwalk, which aimed to create tools that automate key activities of the performance modeling process, and ExtraPeak, which built upon the results of Catwalk and worked toward making this powerful methodology more flexible, streamlined, and easy to use. The two projects both provide accessible tools and methods that bring performance modeling to a wider audience of HPC application developers. Since its outcome represents the final state of the two projects, we expand to a greater extent on the results of ExtraPeak.
2019
- [ACM SRC] perf-taint: Taint Analysis for Automatic Many-Parameter Performance Modeling. Marcin Copik and Torsten Hoefler. ACM Student Research Competition at ACM/IEEE Supercomputing, 2019. Gold Medal in the competition.
Performance modeling is a well-known technique for understanding the scaling behavior of an application. Although the modeling process is today often automatic, it still relies on a domain expert selecting program parameters and deciding on relevant sampling intervals. Since existing empirical methods attempt black-box modeling, the decision on which parameters influence a selected part of the program is based on measured data, making empirical modeling sensitive to human errors and instrumentation noise. We introduce a hybrid analysis to mitigate the current limitations of empirical modeling, combining the confidence of static analysis with the ability of dynamic taint analysis to capture the effects of control-flow and memory operations. We construct models of computation and communication volumes that help the modeler remove the effects of noise and improve the correctness of estimated models. Our automatic analysis prunes irrelevant program parameters and brings an understanding of parameter dependencies, which helps in designing the experiment.
2018
- [CGO] The Generalized Matrix Chain Algorithm. Henrik Barthels, Marcin Copik, and Paolo Bientinesi. In Proceedings of the 2018 International Symposium on Code Generation and Optimization, Vienna, Austria, 2018.
In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of work. The standard matrix chain problem consists of finding the parenthesization of a matrix product M := A1 A2 ⋯ An that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 9. The motivation for this work comes from the fact that, despite great advances in the development of compilers, the task of mapping linear algebra problems to optimized kernels is still done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.
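For reference, the classical O(n^3) dynamic program that this work generalizes is sketched below; the generalized algorithm replaces the scalar-multiplication cost with mappings onto kernels that also handle transposition, inversion, and matrix properties.

```cpp
// A sketch of the classical matrix chain DP. Dimensions d[0..n] describe
// A1 (d0 x d1) ... An (d(n-1) x dn); m[i][j] is the minimal number of scalar
// multiplications needed to compute the product of matrices i+1 through j+1.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

std::uint64_t matrix_chain_cost(const std::vector<std::uint64_t>& d) {
    const std::size_t n = d.size() - 1; // number of matrices
    std::vector<std::vector<std::uint64_t>> m(n, std::vector<std::uint64_t>(n, 0));
    for (std::size_t len = 2; len <= n; ++len)       // chain length
        for (std::size_t i = 0; i + len <= n; ++i) { // chain start
            const std::size_t j = i + len - 1;
            m[i][j] = std::numeric_limits<std::uint64_t>::max();
            for (std::size_t k = i; k < j; ++k)      // split point
                m[i][j] = std::min(m[i][j],
                                   m[i][k] + m[k + 1][j] + d[i] * d[k + 1] * d[j + 1]);
        }
    return m[0][n - 1];
}

int main() {
    // (10x30)(30x5)(5x60): the best parenthesization, (A1 A2) A3, costs 4500.
    std::printf("%llu\n", (unsigned long long)matrix_chain_cost({10, 30, 5, 60}));
}
```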
2017
- [IWOCL] Using SYCL as an Implementation Framework for HPX.Compute. Marcin Copik and Hartmut Kaiser. In Proceedings of the 5th International Workshop on OpenCL, Toronto, Canada, 2017.
The recent advancements in High Performance Computing and the ongoing research to reach exascale have been heavily supported by introducing dedicated massively parallel accelerators. Programmers wishing to maximize the utilization of current supercomputers are required to develop software that not only scales across multiple nodes but is also capable of offloading data-parallel computation to dedicated hardware such as graphics processors. The introduction of new types of hardware has been followed by the development of new languages, extensions, compilers, and libraries. Unfortunately, none of those solutions seem to be fully portable and independent of a specific vendor and type of hardware. HPX.Compute, a programming model developed on top of HPX, a C++ standards library for concurrency and parallelism, uses existing and proposed C++ language and library capabilities to support various types of parallelism. It aims to provide a generic interface that allows writing code portable between hardware architectures. We have implemented a new backend for HPX.Compute based on SYCL, a Khronos standard for single-source programming of OpenCL devices in C++. We present how this runtime may be used to target OpenCL devices through our C++ API. We have evaluated the performance of the new implementation on graphics processors with the STREAM benchmark and compared the results with the existing CUDA-based implementation.
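As an illustration of the kind of kernel such a backend dispatches, here is a STREAM-style triad written against the later SYCL 2020 API; the paper predates this revision, so the names below differ from the implementation evaluated above.

```cpp
// A STREAM-style triad (a[i] = b[i] + s * c[i]) in single-source C++.
// Written against the SYCL 2020 API; the 2017 implementation used the
// earlier cl::sycl interface, so this is not the paper's code.
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t n = 1 << 20;
    const double s = 3.0;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    sycl::queue q; // default device selection: a GPU if one is available
    {
        sycl::buffer<double> ba(a.data(), sycl::range<1>(n));
        sycl::buffer<double> bb(b.data(), sycl::range<1>(n));
        sycl::buffer<double> bc(c.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor A(ba, h, sycl::write_only);
            sycl::accessor B(bb, h, sycl::read_only);
            sycl::accessor C(bc, h, sycl::read_only);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                A[i] = B[i] + s * C[i]; // one multiply-add per element
            });
        });
    } // buffers go out of scope: results are copied back into 'a'
    std::printf("a[0] = %g (expected 7)\n", a[0]);
}
```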
- [ACM SRC] Parallel Prefix Algorithms for the Registration of Arbitrarily Long Electron Micrograph Series. Marcin Copik, Paolo Bientinesi, and Benjamin Berkels. ACM Student Research Competition at ACM/IEEE Supercomputing, 2017.
Recent advances in the technology of transmission electron microscopy have allowed for a more precise visualization of materials and physical processes, such as metal oxidation. Nevertheless, the quality of information is limited by the damage caused by an electron beam, movement of the specimen, or other environmental factors. A novel registration method has been proposed to remove those limitations by acquiring a series of low-dose microscopy frames and performing a computational registration on them to understand and visualize the sample. This process can be represented as a prefix sum with a complex and computationally intensive binary operator, and parallelization is necessary to enable the processing of long series of microscopy images. With our parallelization scheme, the time to register the results of ten seconds of microscopy acquisition has been decreased from almost thirteen hours to less than seven minutes on 512 Intel IvyBridge cores.
- [Thesis] Parallel Prefix Algorithms for the Registration of Arbitrarily Long Electron Micrograph Series. Marcin Copik. Master's Thesis, 2017.
Recent advances in the technology of transmission electron microscopy have allowed for a more precise visualization of materials and physical processes, such as metal oxidation. Nevertheless, the quality of information is limited by the damage caused by an electron beam, movement of the specimen, or other environmental factors. A novel registration method has been proposed to remove those limitations by acquiring a series of low-dose microscopy frames and performing a computational registration on them to understand and visualize the sample. This process can be represented as a prefix sum with a complex and computationally intensive binary operator, and parallelization is necessary to enable the processing of long series of microscopy images. With our parallelization scheme, the time to register the results of ten seconds of microscopy acquisition has been decreased from almost thirteen hours to less than seven minutes on 512 Intel IvyBridge cores.
2016
- [CS&P] A GPGPU-based Simulator for Prism: Statistical Verification of Results of PMC (extended abstract). Marcin Copik, Artur Rataj, and Bozena Wozna-Szczesniak. In Proceedings of the 25th International Workshop on Concurrency, Specification and Programming, Rostock, Germany, September 28-30, 2016.
We describe a GPGPU-based Monte Carlo simulator integrated with Prism. It supports Markov chains with discrete or continuous time and a subset of properties expressible in PCTL, CSL, and their variants extended with rewards. The simulator allows an automated statistical verification of results obtained using Prism’s formal methods.
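The statistical core of such a simulator can be sketched in a few lines: sample many independent trajectories and report an estimate with a confidence interval. The Bernoulli toy model and all constants below are assumptions for illustration; the actual simulator generates samples from PRISM models on the GPU.

```cpp
// A sketch of statistical verification: estimate the probability that a
// property holds by sampling independent trajectories, then report a 95%
// confidence interval. The Bernoulli step model and all constants are toy
// assumptions standing in for a real PRISM model.
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    std::mt19937_64 rng(42);
    std::bernoulli_distribution event(0.3); // toy per-step event probability

    const int samples = 100000;
    int holds = 0;
    for (int i = 0; i < samples; ++i) {
        bool sat = false;                    // one simulated trajectory:
        for (int t = 0; t < 5 && !sat; ++t)  // does the event occur within 5 steps?
            sat = event(rng);
        holds += sat;
    }
    const double p = double(holds) / samples;
    const double half = 1.96 * std::sqrt(p * (1.0 - p) / samples); // normal approx.
    std::printf("P(property) = %.4f +/- %.4f\n", p, half);
}
```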
2014
- [Thesis] GPU-accelerated stochastic simulator engine for PRISM model checker. Marcin Copik. Bachelor's Thesis, 2014.
This project provides a new simulator engine for the PRISM model checker, an enhancement allowing for faster approximate model checking. The simulator is designed as a substitute for the current engine, allowing simple integration with the GUI and CLI. The engine was implemented with OpenCL, an open standard for massively parallel computing on heterogeneous platforms, and generates a dedicated OpenCL kernel for each PRISM model, which then executes on OpenCL devices. This approach enables the generation of samples on both CPUs and GPUs. The performance and correctness tests included three case studies taken from the official PRISM benchmark suite. The results showed a huge gain in performance over the existing simulator; in the most extreme case, the new engine, running on a seven-year-old NVIDIA GPU, verified a test property in 20 seconds, where the existing simulator engine needed over two hours.
- Methods for abdominal respiratory motion tracking. Dominik Spinczyk, Adam Karwan, and Marcin Copik. Computer Aided Surgery, 2014.
Non-invasive surface registration methods have been developed to register and track breathing motions in a patient’s abdomen and thorax. We evaluated several different registration methods, including marker tracking using a stereo camera, chessboard image projection, and abdominal point clouds. Our point cloud approach was based on a time-of-flight (ToF) sensor that tracked the abdominal surface. We tested different respiratory phases using additional markers as landmarks for the extension of the non-rigid Iterative Closest Point (ICP) algorithm to improve the matching of irregular meshes. Four variants for retrieving the correspondence data were implemented and compared. Our evaluation involved 9 healthy individuals (3 females and 6 males) with point clouds captured in opposite breathing phases (i.e., inhalation and exhalation). We measured three factors: surface distance, correspondence distance, and marker error. To evaluate different methods for computing the correspondence measurements, we defined the number of correspondences for every target point and the average correspondence assignment error of the points nearest the markers.