The Fifth Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS'25)
Held in conjunction with ASPLOS & EuroSys 2025 on March 31st, 2025, in Rotterdam, Netherlands
The workshop will take place in Leeuwen Room I at the Postillion Hotel & Convention Centre WTC Rotterdam, Beursplein 37, 3011 AA Rotterdam, The Netherlands.
Agenda
Time | Speaker / Authors (Affiliation) | Content |
---|---|---|
8:30-9:00 | Coffee and Registration | |
9:00-9:10 | Suren Byna | CHEOPS Welcome |
9:10-10:00 | Keynote - Gustavo Alonso (ETH Zurich, Switzerland) | Vertically integrated storage systems |
10:00-10:30 | Louis-Marie NICOLAS (Lab-STICC, CNRS UMR 6285, ENSTA, Institut Polytechnique de Paris), Salim MIMOUNI (Atos BDS R&D Data Management), Philippe COUVEE (Atos BDS R&D Data Management), Jalil BOUKHOBZA (Lab-STICC, CNRS UMR 6285, ENSTA, Institut Polytechnique de Paris) | Characterizing the Use of DVFS for HPC I/O Optimization: A Microbenchmarking Approach |
10:30-11:00 | Coffee break | |
11:00-11:30 | Robin Vonk (Delft University of Technology), Joost Hoozemans (Voltron Data), Zaid Al-Ars (Delft University of Technology) | GSST: Parallel string decompression at 191 GB/s on GPU |
11:30-12:00 | Shadi Ibrahim (Inria, Rennes), Jad Darrous (Inria, Rennes) | Erasure Coding Aware Block Placement for Data-Intensive Applications |
12:00-12:30 | Invited talk - Jean-Thomas Acquaviva (DDN Storage) | From HPC to AI: A Data Journey |
12:30-14:00 | Lunch break | |
14:00-14:30 | Zebin Ren (Vrije Universiteit Amsterdam), Krijn Doekemeijer (Vrije Universiteit Amsterdam), Tiziano De Matteis (Vrije Universiteit Amsterdam), Christian Pinto (IBM Research Europe), Radu Stoica (IBM Research Europe), Animesh Trivedi (IBM Research Europe) | An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD |
14:30-15:00 | Invited talk - Yang Zheng (Huawei Technologies) | Reliability challenges and opportunities for AI infra: from industry perspective |
15:00-15:30 | Invited talk - Shadi Ibrahim (Inria, Rennes) | TBD |
15:30-16:00 | Coffee break | |
16:00-16:15 | Joost Hoozemans (Voltron Data, Delft University of Technology), Robin Vonk (Delft University of Technology), Johan Peltenburg (Voltron Data), Felipe Aramburu (Voltron Data), Zaid Al-Ars (Delft University of Technology) | Using GPU Direct Storage with High-Performance Distributed Filesystems |
16:15-16:30 | Pınar Tözün (IT University of Copenhagen), Karl B. Torp (Samsung Denmark Research Center), Simon A. F. Lund (Samsung Denmark Research Center) | A Quest to Reduce Dependency on CPUs in Deep Learning Data Pipelines |
16:30-16:50 | All participants | Discussion |
16:50-17:00 | CHEOPS Organizers | Closing remarks |
18:00-19:30 | Welcome Reception | |
Keynote Speaker
Professor Gustavo Alonso
Keynote: Vertically integrated storage systems
Abstract
Storage systems are often seen as separate from compute systems. In this talk, I will argue that storage should be vertically integrated into compute units, i.e., storage should be an integral and seamless component actively participating in the processing of data. The motivation for doing so is obvious from the sheer amount of data that needs to be processed these days, not only in ML/LLM/AI applications but also in more conventional data analytics. The concept of vertically integrating storage applies whether it is a local disk or disaggregated storage in the cloud. In fact, the performance characteristics of cloud storage have already led to pushing some data processing down to the storage layer to minimize the amount of data transferred across the network to the compute nodes. In the talk, I will argue that these initial steps should be expanded to create a computational pipeline from storage to compute node memory that includes active storage, smart NICs, and accelerators. While some of these ideas have been pursued in isolation and are often centered around a particular type of technology, today we have the opportunity to think about such designs end-to-end, taking advantage of innovations such as CXL. In the talk, I will motivate the idea, suggest ways to implement initial prototypes, and discuss its integration into software systems, which are often the biggest bottleneck when trying to take advantage of hardware advances.
Bio
Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich, where he is a member of the Systems Group (www.systems.ethz.ch) and the head of the Institute of Computing Platforms. He leads the AMD HACC (Heterogeneous Accelerated Compute Cluster) deployment at ETH (https://github.com/fpgasystems/hacc), a research facility with several hundred users worldwide that supports exploring data center hardware-software co-design. His research interests include data management, cloud computing architecture, and building systems on modern hardware. Gustavo holds degrees in telecommunication from the Madrid Technical University and an MS and a PhD in Computer Science from UC Santa Barbara. Before joining ETH, he was a research scientist at IBM Almaden in San Jose, California. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, and a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievement Award from the European Chapter of ACM SIGOPS (EuroSys).
Invited Speakers
Dr. Jean-Thomas Acquaviva, DDN Storage
Invited Talk: From HPC to AI: A Data Journey
Abstract
Two technological revolutions have recently marked the storage community: the mass availability of Flash and the rise of object-type APIs. These revolutions seem to be coming to an end, and a common architecture is emerging. Data centers are becoming more and more “data centric”, with different computing services attached to a central data space. Due to this centrality, the data hub must meet five main criteria: extreme scalability, uncompromising performance, operational efficiency, 24/7 availability, and ease of allocation of shared resources. To what extent does this architecture differ between AI and HPC? In this talk, we will discuss, from an industrial standpoint, the main areas of convergence and differentiation depending on the dominant workload.
Bio
Jean-Thomas successively worked for Intel, the University of Versailles, and the French Atomic Commission (CEA). He participated in the creation of their joint laboratory on Exascale Research. At DDN, Jean-Thomas’ role includes overseeing research collaborations in Europe as well as product management for some of DDN’s advanced solutions.
Dr. Shadi Ibrahim, Inria - Rennes
Invited Talk: TBD
Abstract
Bio
Dr. Yang Zheng, Huawei
Invited Talk: Reliability challenges and opportunities for AI infra: from industry perspective
Abstract
AI clusters are emerging as a critical infrastructure and technological frontier. As models grow in size following scaling laws, ensuring stable and reliable operation of large-scale model tasks on massive AI clusters has become a significant challenge in the industry.
Training and inference tasks for large models are highly coupled systems with low fault tolerance. Distributed training involves frequent communication between nodes, strong dependencies across parallel domains, and strict requirements on computational accuracy. These factors lead to frequent training interruptions due to hardware failures, slow recovery, and fail-slow behavior. Additionally, silent data corruption can prevent models from converging. As training scales up, reliability becomes a major bottleneck.
The key challenge is to build a highly available AI system architecture capable of supporting scenarios such as training on clusters with hundreds of thousands of cards, inference on super-nodes with hundreds or thousands of cards, and integrated training-inference tasks. Making fault recovery imperceptible (“zero” perception) to business operations is essential for ensuring the reliability of large-model infrastructures. Addressing these challenges will be critical for advancing the scalability and robustness of AI systems in the future.
Bio
Dr. Yang Zheng is a Principal Engineer at the Reliability Technology Lab of Huawei Technologies Co., Ltd. Dr. Zheng is also a member of the reliable AI infra project, focusing on research into AI infrastructure testing, monitoring, and recovery. Dr. Zheng received his PhD from Imperial College London in the UK. His research interests include elastic training/inference and silent data corruption.
Workshop Description
The fifth workshop on “Challenges and Opportunities of Efficient and Performant Storage Systems” (CHEOPS) is aimed at researchers, developers of scientific applications, engineers, and everyone interested in the evolution of storage systems. As the developments of computing power, storage, and network technologies continue to diverge, the bandwidth performance gap between them widens. This trend, combined with ever-growing data volumes and data-driven computing such as machine learning, results in I/O and storage limitations that impact the scalability and efficiency of current and future computing systems. Some of these challenges are quantitative, such as scaling to match exascale system requirements or reducing the latency of the software stack to efficiently integrate new generations of hardware like storage class memory (SCM). Other issues are more subtle and arise from the increased complexity of storage solutions, such as the need for smarter and more powerful data management tools, monitoring systems, or interoperability between I/O components and data formats.
The main objective of this workshop is to discuss state-of-the-art research, innovative ideas, and experiences that focus on the design and implementation of storage systems in both academia and industry.
Important Dates
- Abstract Submission: Jan 17th, 2025 (Anywhere on Earth) [Extended deadline; originally Jan 10th]
- Paper Submission: Jan 24th, 2025 (Anywhere on Earth) [Extended deadline; originally Jan 17th]
- Notification to Authors: Feb 14th, 2025 (originally Feb 7th)
- Camera-Ready Deadline: Feb 28th, 2025
- Workshop Date: March 31st, 2025
Submission Guidelines
To guarantee the quality of the submissions, we have formed a globally distributed, diverse program committee, which will review all submissions. We will use HotCRP to manage the submissions. The reviewing process will be double-blind, with at least 3 reviews for each submission. An online discussion will determine which papers to accept.
Only original and novel work not currently under review in other venues will be considered for publication. Submissions can either be full papers (6 pages) or short papers (4 pages). The page count includes the title, text, figures, and appendices, but excludes the references. Submissions must be made electronically as PDF files formatted according to the submission rules of EuroSys. Accepted submissions will have to comply with the EuroSys proceedings format. One author of each accepted paper is required to register for the workshop and present the paper. Extended versions of selected papers will be considered for publication in the ACM SIGOPS Operating Systems Review journal.
Camera-Ready Format
You should use the acmart document class (https://www.acm.org/publications/proceedings-template, the same as for submission), as follows: \documentclass[sigplan,10pt]{acmart}
You will receive the instructions regarding some LaTeX directives (\setcopyright, \acmConference, \acmDOI, etc.) after completing the copyright form.
All accepted papers can use up to 2 additional pages for the camera-ready version, for a final limit of 8 (full papers) or 6 (short papers) pages, references not included.
Note that Type 1 fonts (scalable) should be used, not Type 3 (bitmapped), and that all fonts must be embedded.
Type and embedding of fonts can be checked with various tools, including pdffonts.
Page numbers should be suppressed.
Also make sure that the PDF is searchable by testing the search function in a PDF reader.
Information regarding the rights forms and the uploading of final versions is coming soon.
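The camera-ready rules above can be collected into a minimal acmart skeleton. This is only a sketch under assumptions: the title, author, and affiliation values are placeholders, `\settopmatter{printfolios=false}` is assumed here as the acmart way to suppress page numbers, and the rights-management directives are left as a comment because their values arrive only after the copyright form is completed.

```latex
% Class line as required for the camera-ready version.
\documentclass[sigplan,10pt]{acmart}

% Paste the \setcopyright, \acmConference, \acmDOI, ... directives
% received after completing the copyright form here (values not shown).

% Assumption: acmart's printfolios key controls page numbers (folios);
% setting it to false suppresses them.
\settopmatter{printfolios=false}

\begin{document}

\title{Placeholder Paper Title}

\author{Placeholder Author}
\affiliation{%
  \institution{Placeholder Institution}
  \city{Placeholder City}
  \country{Placeholder Country}}

% In acmart, the abstract must appear before \maketitle.
\begin{abstract}
Placeholder abstract text.
\end{abstract}

\maketitle

\section{Introduction}
Placeholder body text.

\end{document}
```

After compiling, running pdffonts on the resulting PDF is one way to confirm that every listed font is Type 1 and embedded.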
Topics of Interest
Submissions may be more hands-on than research papers, and we therefore explicitly encourage submissions in the early stages of research. Topics of interest include, but are not limited to:
- Operating system optimizations
- Kernel and user space file/storage systems
  - Including virtual file systems
- Cloud, parallel and distributed file/storage systems
- Network challenges, such as scalability, QoS and partitionability
- Approaches for low-latency and heterogeneous storage systems
  - Such as SCM and NVRAM combined with HDDs
- Metadata management
- Machine Learning and Artificial Intelligence
  - Storage requirements of ML and AI applications (including LLMs)
  - Using ML and AI within storage systems (e.g., to replace heuristics, to optimize storage and I/O systems)
- Hybrid solutions using file systems and databases
- Approaches using query and database interfaces, including key-value stores
- Optimized indexing techniques
- Data organizations to support online workflows
- Data privacy and data security
- Domain-specific data management solutions
- Application I/O characterization
- Storage systems modeling and analysis tools
- Data reduction techniques
  - Lossless and lossy compression, deduplication, dimensionality reduction, surrogate modeling
- UI/UX for storage systems
- Related experiences from users: what worked, what didn’t?
- Feedback and empirical evaluation of storage systems
WORK IN PROGRESS (WIP) SESSION
There will be a WIP session where presenters give brief (5-minute) talks on their ongoing work, with fresh problems/solutions. WIP talks will not be included in the proceedings. A one-page abstract is required.
DEADLINES
- Work in Progress (WIP) submissions due: Feb 21st, 2025
- Notification: Mar 1st, 2025
Submissions by email: Please email your submission as a PDF attachment of the one-page abstract to Amelie Chi Zhou (amelieczhou@hkbu.edu.hk) and Kira Duwe (kira.duwe@epfl.ch). Put “CHEOPS 2025 WIP” as the first part of the message subject.
Organization
Steering Committee
- Jean-Thomas Acquaviva - DDN, France
- Jalil Boukhobza - National Institute of Advanced Technologies of Brittany (ENSTA Bretagne), France
- Suren Byna - The Ohio State University, USA
- Konstantinos Chasapis - DDN, France
- Kira Duwe - École polytechnique fédérale de Lausanne (EPFL), Switzerland
- Shadi Ibrahim - Inria, France
- Michael Kuhn - Otto von Guericke University Magdeburg (OVGU), Germany
General Chair
- Suren Byna - The Ohio State University, USA
Program Chairs
- Amelie Chi Zhou - Hong Kong Baptist University, Hong Kong
- Kira Duwe - École polytechnique fédérale de Lausanne (EPFL), Switzerland
Program Committee
- Anastasios Papagiannis, Isovalent at Cisco
- Animesh Trivedi, IBM Research Europe, Zurich
- Christos Kozanitis, FORTH-ICS
- Diana Moise, Hewlett Packard Enterprise (HPE)
- Jalil Boukhobza, ENSTA Bretagne
- Jay Lofstead, Sandia National Laboratories
- Jean Luca Bez, Lawrence Berkeley National Laboratory
- Jerry Chou, National Tsing Hua University
- Marc-André Vef, DDN
- Marcus Paradies, LMU Munich
- Matthieu Dorier, Argonne National Laboratory
- Thomas Lambert, Université de Lorraine
- Yusen Li, Nankai University