CHEOPS Workshop at EuroSys 2025

The Fifth Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS'25)

Held in conjunction with ASPLOS & EuroSys 2025 on March 31st 2025, Rotterdam, Netherlands

The workshop will take place in Leeuwen Room I at the Postillion Hotel & Convention Centre WTC Rotterdam, Beursplein 37, 3011 AA Rotterdam, The Netherlands.

The proceedings will be available in the ACM Digital Library.

Photos

Agenda

Time	Speaker / Authors (Affiliation)	Content
8:30-9:00		Coffee and Registration
9:00-9:10	Suren Byna	CHEOPS Welcome
9:10-10:00	Keynote - Gustavo Alonso (ETH Zurich, Switzerland)	Vertically integrated storage systems
10:00-10:30	Louis-Marie NICOLAS (Lab-STICC, CNRS UMR 6285, ENSTA, Institut Polytechnique de Paris), Salim MIMOUNI (Atos BDS R&D Data Management), Philippe COUVEE (Atos BDS R&D Data Management), Jalil BOUKHOBZA (Lab-STICC, CNRS UMR 6285, ENSTA, Institut Polytechnique de Paris)	Characterizing the Use of DVFS for HPC I/O Optimization: A Microbenchmarking Approach (Paper)
10:30-11:00		Coffee break
11:00-11:30	Robin Vonk (Delft University of Technology), Joost Hoozemans (Voltron Data), Zaid Al-Ars (Delft University of Technology)	GSST: Parallel string decompression at 191 GB/s on GPU (Paper)
11:30-12:00	Shadi Ibrahim (Inria, Rennes), Jad Darrous (Inria, Rennes)	Erasure Coding Aware Block Placement for Data-Intensive Applications (Paper)
12:00-12:30	Invited talk - Jean-Thomas Acquaviva (DDN Storage)	From HPC to AI: A Data Journey
12:30-14:00		Lunch break
14:00-14:30	Zebin Ren (Vrije Universiteit Amsterdam), Krijn Doekemeijer (Vrije Universiteit Amsterdam), Tiziano De Matteis (Vrije Universiteit Amsterdam), Christian Pinto (IBM Research Europe), Radu Stoica (IBM Research Europe), Animesh Trivedi (IBM Research Europe)	An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD (Paper)
14:30-15:00	Invited talk - Yang Zheng (Huawei Technologies)	Reliability challenges and opportunities for AI infra: from industry perspective
15:00-15:30	Invited talk - Shadi Ibrahim (Inria, Rennes)	Scalable and Efficient Big Data Processing in Clouds: Addressing Performance Variability
15:30-16:00		Coffee break
16:00-16:15	Joost Hoozemans (Voltron Data, Delft University of Technology), Robin Vonk (Delft University of Technology), Johan Peltenburg (Voltron Data), Felipe Aramburu (Voltron Data), Zaid Al-Ars (Delft University of Technology)	Using GPU Direct Storage with High-Performance Distributed Filesystems
16:15-16:30	Pınar Tözün (IT University of Copenhagen), Karl B. Torp (Samsung Denmark Research Center), Simon A. F. Lund (Samsung Denmark Research Center)	A Quest to Reduce Dependency on CPUs in Deep Learning Data Pipelines
16:30-16:50	All participants	Discussion
16:50-17:00	CHEOPS Organizers	Closing remarks
18:00-19:30		Welcome Reception

Keynote Speaker

Professor Gustavo Alonso

Keynote: Vertically integrated storage systems

Abstract

Storage systems are often seen as being separated from compute systems. In this talk, I will argue that storage should be vertically integrated into compute units, i.e., storage should be an integral and seamless component actively participating in the processing of data. The motivation to do so is obvious from the sheer amount of data that needs to be processed these days, not only on ML/LLM/AI applications but also in more conventional data analytics. And the concept of vertically integrating storage applies whether it is a local disk or disaggregated storage in the cloud. In fact, the performance characteristics of cloud storage has already led to pushing some amount of data processing down to the storage layer to minimize the amount of data to be transferred across the network and to compute nodes. In the talk, I will argue that these initial steps should be expanded to create a computational pipeline from storage to the compute node memory that includes active storage, smart NICs, and accelerators. While some of these ideas have been pursued in isolation and are often centered around a particular type of technology, today we have the opportunity to think about such designs end-to-end taking advantage of innovations such as CXL. In the talk I will motivate the idea, suggest ways to implement initial prototypes, and discuss its integration into software systems – often the biggest bottleneck when trying to take advantage of hardware advances.

Bio

Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich where he is a member of the Systems Group (www.systems.ethz.ch) and the head of the Institute of Computing Platforms. He leads the AMD HACC (Heterogeneous Accelerated Compute Cluster) deployment at ETH (https://github.com/fpgasystems/hacc), with several hundred users worldwide, a research facility that supports exploring data center hardware-software co-design. His research interests include data management, cloud computing architecture, and building systems on modern hardware. Gustavo holds degrees in telecommunication from the Madrid Technical University and a MS and PhD in Computer Science from UC Santa Barbara. Previous to joining ETH, he was a research scientist at IBM Almaden in San Jose, California. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievements Award from the European Chapter of ACM SIGOPS (EuroSys).

Invited Speakers

Dr. Jean-Thomas Acquaviva, DDN Storage

Invited Talk: From HPC to AI: a data Journey

Abstract

Two technological revolutions have recently marked the storage community: the mass availability of Flash and the rise of object-type APIs. These revolutions seem to be coming to an end, and a common architecture is emerging. Data centers tend to become more and more “data centric” with different computing services attached to a central data space. Due of this centrality, the datahub must meet five main criteria: extreme scalability, uncompromising performance, operational efficiency, 24/7 availability and ease of allocation of shared resources. To which point this architecture differ from AI and HPC? In this talk, we will discuss from an industrial standpoints the main area of convergence and differenciation depending on the dominant workload.

Bio

Jean-Thomas successively worked for Intel, the University of Versailles and the French Atomic Commission (CEA). He participated to the creation of their joint laboratory on Exascale Research. At DDN, Jean-Thomas’ role includes overseeing research collaborations in Europe as well as product management for some advanced DDN’s solutions.

Dr. Shadi Ibrahim, Inria - Rennes

Invited Talk: Scalable and Efficient Big Data Processing in Clouds: Addressing Performance Variability

Abstract

Clouds can provide abundant resources to meet the ever-increasing computing and storage demands of data processing, making Clouds the primary infrastructure for Big Data processing. However, the unprecedented scale of Clouds and multi-tenancy nature leads to the critical challenge of performance variability and causes stragglers that delay Big Data processing. MapReduce uses speculative execution to mitigate stragglers. State-of-the-art straggler mitigation mechanisms assume homogeneous resources and is not efficient in Clouds (for both detection and processing). In the talk, I will show that providing adequate and timely resources for starting speculative copies is important to improve the effectiveness and efficiency of straggler mitigation mechanisms.

Bio

Shadi Ibrahim is a Research Scientist at Inria. He obtained his Ph.D. degree in Computer Science from Huazhong University of Science and Technology (HUST) in Wuhan, China in 2011. His current research interests are in scalable data management, distributed storage systems, data-intensive computing, parallel I/O and distributed file systems, virtualization technology and Cloud/Fog computing. His work has been published in prestigious international journals and conferences in distributed and parallel systems (IEEE TPDS, ACM computing surveys, Elsevier FGCS, ACM SoCC, ICDCS, SC, IPDPS, VEE, CCGrid, Cluster). He has been the recipient of the IEEE TCSC Award for Excellence in Scalable Computing (Middle Career Researcher) in 2020. He has published more than 60 papers in peer-reviewed conferences workshops, and journals; and has served in the Program Committees of more than 80 conferences and workshops in his area, as the General (co)chair for SSDBM 2024, CHEOPS 2024, PDSW 2021 and IEEE SmartData-2020, as the Program (co)chair for CHEOPS 2022, IEEE HPCC 2021, PDSW 2020, IEEE BigDataSE 2019, IEEE SmartData-2019, IEEE BigData Congress 2018, ICA3PP-2017. He currently serves in the Editorial Board of IEEE Internet Computing magazine.He is a Distinguished member of the ACM and a Senior member of IEEE.

Dr. Yang Zheng , Huawei Technologies

Invited Talk: Reliability challenges and opportunities for AI infra: from industry perspective

Abstract

AI clusters are emerging as a critical infrastructure and technological frontier. As models grow in size following scaling laws, ensuring stable and reliable operation of large-scale model tasks on massive AI clusters has become a significant challenge in the industry.

Training and inference tasks for large models are highly coupled and low-fault-tolerant systems. Distributed training involves frequent communication between nodes, strong dependencies across parallel domains, and requirements for proper computational accuracy. These factors lead to frequent training interruptions due to hardware failures, slow recovery, and fail-slow. Additionally, silent data corruptions can result in model non-convergence. As the scale of training expands, reliability becomes a major bottleneck.

The key challenge is to build a highly available AI system architecture capable of supporting scenarios such as training on clusters with hundreds of thousands of cards, inference on super-nodes with hundreds or thousands of cards, and integrated training-inference tasks. Achieving “zero” perception of fault recovery in business operations is essential for ensuring the reliability of large model infrastructures. Addressing these challenges will be critical for advancing the scalability and robustness of AI systems in the future.

Bio

Dr. Yang Zheng is a principle Engineer of Reliability Technology Lab of Huawei Technologies Co., Ltd.. Dr Zheng is also currently a member of reliable AI infra project, focus on research on AI Infra testing, monitoring and recovery. Dr Yang Zheng received his PhD degree from Imperial College London in UK. Research interest includes elastic training/inference, silent data corruption.

Workshop Description

The fifth workshop on “Challenges and Opportunities of Efficient and Performant Storage Systems” (CHEOPS) is aimed at researchers, developers of scientific applications, engineers and everyone interested in the evolution of storage systems. As the developments of computing power, storage and network technologies continue to diverge, the bandwidth performance gap between them widens. This trend, combined with the ever-growing data volumes and data-driven computing such as machine learning, results in I/O and storage limitations, impacting the scalability and efficiency of current and future computing systems. Some of these challenges are quantitative, such as scale to match exascale system requirements, or latency reduction of the software stack to efficiently integrate new generations of hardware like storage class memory (SCM). Some other issues are more subtle and arise with the increased complexity of the storage solutions, like new smarter and more potent data management tools, monitoring systems or interoperability between I/O components or data formats.

The main objective of this workshop is to discuss state-of-the-art research, innovative ideas and experiences that focus on the design and implementation of storage systems in both academic and industrial worlds.

Important Dates

Abstract Submission: ~~Jan 10th~~ Jan 17th, 2025 (Anywhere on Earth) [Extended deadline]
Paper Submission: ~~Jan 17th~~ Jan 24th, 2025 (Anywhere on Earth) [Extended deadline]
Notification to Authors: ~~Feb 7th~~ Feb 14th, 2025
Camera-Ready Deadline: Feb 28th, 2025
Workshop Date: March 31st, 2025

Submission Guidelines

In order to guarantee the quality of the submissions, we have formed a globally distributed, diverse program committee. All submissions will be reviewed by the program committee. We will use HotCRP to manage the submissions. The reviewing process will be double blind with at least 3 reviews for each submission. An online discussion will determine which papers to accept.

Only original and novel work not currently under review in other venues will be considered for publication. Submissions can either be full papers (6 pages) or short papers (4 pages). The page count includes the title, text, figures, appendices but excludes the references. They must be submitted electronically as PDF files formatted according to the submission rules of EuroSys. Accepted submissions will have to comply with the EuroSys proceedings format. One author of each accepted paper is required to register for the workshop and present the paper. Extended versions of selected papers will be considered for publication in the ACM SIGOPS Operating Systems Review journal.

Camera-Ready Format

You should use the acmart document class (https://www.acm.org/publications/proceedings-template, the same as for submission), as follows: \documentclass[sigplan,10pt]{acmart}

As mentioned above, you will receive the instruction regarding some LaTeX directives (\setcopyright, \acmConference, \acmDOI etc.) after completing the copyright form.

All accepted papers can use up to 2 additional pages for the camera-ready version, for a final limit of 8 (full papers) or 6 (short papers) pages, references not included.

Note that Type 1 fonts (scalable) should be used, not Type 3 (bitmapped), and that all fonts must be embedded. Type and embedding of fonts can be checked with various tools including pdffonts. Page numbers should be suppressed. Make also sure that the PDF is searchable by testing the search function in a PDF reader.

Information regarding the Rights forms and Uploading Final versions is coming soon.

Topics of Interest

Submissions may be more hands-on than research papers and we therefore explicitly encourage submissions in the early stages of research. Topics of interest include, but are not limited to:

Operating system optimizations
Kernel and user space file/storage systems
- Including virtual file systems
Cloud, parallel and distributed file/storage systems
- Network challenges, such as scalability, QoS and partitionability
Approaches for low-latency and heterogeneous storage systems
- Such as SCM and NVRAM combined with HDDs
Metadata management
Machine Learning and Artificial Intelligence
- Storage requirements of ML and AI applications (including LLMs)
- Using ML and AI within storage systems (e.g., to replace heuristics, to optimize storage and I/O systems)
Hybrid solutions using file systems and databases
- Approaches using query and database interfaces, including key-value stores
- Optimized indexing techniques
Data organizations to support online workflows
Data privacy and data security
Domain-specific data management solutions
- Application I/O characterization
Storage systems modeling and analysis tools
Data reduction techniques
- Lossless and lossy compression, deduplication, dimensionality reduction, surrogate modeling
UI/UX for storage systems
Related experiences from users: what worked, what didn’t?
- Feedback and empirical evaluation of storage systems

WORK IN PROGRESS (WIP) SESSION

There will be a WIP session where presenters provide brief (5-minute) talks on their on-going work, with fresh problems/solutions. WIP will not be included in the proceedings. A one-page abstract is required.

DEADLINES

Work in Progress (WIP) submissions due: Feb 21st, 2025
Notification: Mar 1st, 2025

Submissions by email: Please email your submission as a PDF attachment of the one-page abstract to Amelie Chi Zhou (amelieczhou@hkbu.edu.hk) and Kira Duwe (kira.duwe@epfl.ch). Put “CHEOPS 2025 WIP” as the first part of the message subject.

Organization

Steering Committee

Jean-Thomas Acquaviva - DDN, France
Jalil Boukhobza - National Institute of Advanced Technologies of Brittany (ENSTA Bretagne), France
Suren Byna - The Ohio State University, USA
Konstantinos Chasapis - DDN, France
Kira Duwe - École polytechnique fédérale de Lausanne (EPFL), Switzerland
Shadi Ibrahim - Inria, France
Michael Kuhn - Otto von Guericke University Magdeburg (OVGU), Germany

General Chair

Suren Byna - The Ohio State University, USA

Program Chairs

Amelie Chi Zhou - Hong Kong Baptist University, Hong Kong
Kira Duwe - École polytechnique fédérale de Lausanne (EPFL), Switzerland

Program Committee

Anastasios Papagiannis, Isovalent at Cisco
Animesh Trivedi, IBM Research Europe, Zurich
Christos Kozanitis, FORTH-ICS
Diana Moise, Hewlett Packard Enterprise (HPE)
Jalil Boukhobza, ENSTA Bretagne
Jay Lofstead, Sandia National Laboratories
Jean Luca Bez, Lawrence Berkeley National Laboratory
Jerry Chou, National Tsing Hua University
Marc-André Vef, DDN
Marcus Paradies, LMU Munich
Matthieu Dorier, Argonne National Laboratory
Thomas Lambert, Université de Lorraine
Yusen Li Nankai University