
In Open Access: Selection of Five Articles on Data Storage Solutions with Tape

Cold data support for the CERN Open Data Portal, evolution of the CERN Tape Archive scheduling system, challenges of repack in the era of the high-capacity tape cartridge, archive metadata for efficient data collocation on tape, and a tape RSE for extremely large data collection backups

EPJ Web of Conferences has published online, in Open Access, the papers of the 27th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2024). Below is a selection of five papers on experiences with data storage on tape. The full papers are available at this webpage.

Cold Data Support for the CERN Open Data Portal

The paper describes the challenges, presents a prototype solution, and outlines future developments aimed at enhancing the accessibility, efficiency, and resilience of the CERN Open Data Portal's data ecosystem.

Paper written by Jose Benito Gonzalez Lopez, Pablo Saiz, and Zacharias Zacharodimos, CERN, Esplanade des Particules 1, 1211 Geneva 23, Switzerland.

Abstract: The CERN Open Data Portal holds over 5 petabytes of high-energy physics experiment data, serving as a hub for global scientific collaboration. Committed to Open Science principles, the portal aims to democratize access to these datasets for outreach, training, education, and independent research. Recognizing the limitations of current disk-based storage, we are starting a project to expand our data storage methodologies. Our approach involves integrating hot storage (such as spinning disks) for immediate data access and cold storage (such as tape, or even interfaces to the experiment frameworks) for cost-effective long-term preservation. This innovative strategy will significantly expand the portal’s capacity to accommodate more experiment data. However, we anticipate challenges in navigating technical complexities and logistical hurdles. These challenges include the latency to access cold data, monitoring and automating the transitions between hot and cold storage, and ensuring the long-term preservation of data in the experiment frameworks. The strategy is to integrate existing solutions like EOS, FTS, CTA and Rucio. The paper describes these challenges, presents our prototype solution, and outlines future developments aimed at enhancing the accessibility, efficiency, and resilience of the CERN Open Data Portal’s data ecosystem.
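As a rough illustration of the hot/cold model the abstract describes, here is a minimal Python sketch of a tiering policy: a record becomes a candidate for disk eviction once it is safely on tape and has gone unaccessed for a while, and a request for cold-only data queues a recall. All names and the 30-day threshold are assumptions for illustration, not the portal's actual design.

import time
from dataclasses import dataclass

# Assumed policy: demote records not accessed for 30 days.
HOT_TTL_SECONDS = 30 * 24 * 3600

@dataclass
class Record:
    name: str
    on_disk: bool          # hot copy on EOS-like disk storage
    on_tape: bool          # cold copy on CTA-like tape storage
    last_access: float     # Unix timestamp of last download

def wants_demotion(rec: Record, now: float) -> bool:
    """A record can move to cold-only once it is safely on tape and idle."""
    return rec.on_disk and rec.on_tape and (now - rec.last_access) > HOT_TTL_SECONDS

def on_user_request(rec: Record) -> str:
    """Serve hot data immediately; queue a recall for cold-only data."""
    rec.last_access = time.time()
    if rec.on_disk:
        return "redirect-to-disk"
    # In a real deployment this would be an FTS-mediated tape recall; here it is a stub.
    return "recall-queued"

if __name__ == "__main__":
    rec = Record("cms-2016-dataset", on_disk=True, on_tape=True,
                 last_access=time.time() - 60 * 24 * 3600)
    print(wants_demotion(rec, time.time()))  # True: candidate for disk eviction
    rec.on_disk = False
    print(on_user_request(rec))              # 'recall-queued'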


Evolution of the CERN Tape Archive Scheduling System

The paper discusses an alternative Scheduler DB implementation, based on relational database technology, and the authors include a status report and roadmap.

Paper written by Jaroslav Guenther, João Afonso, Richard Bachmann, Vladimír Bahyl, Niels Bügel, Pablo Oliver Cortés, Michael Davis, Idriss Larbi, Julien Leduc, Sergio Alfageme Perez, Konstantina Skovola, and David Smith, CERN, Esplanade des Particules 1, 1211 Geneva 23, Switzerland.

Abstract: The CERN Tape Archive (CTA) scheduling system implements the workflow and lifecycle of archive, retrieve and repack requests. The transient metadata for queued requests is stored in the Scheduler backend store (Scheduler DB). In our previous work, we presented the CTA Scheduler together with an object-store backend implementation of the Scheduler DB. Now, with four years of experience in production, the strengths and limitations of this approach are better understood. While highly efficient for FIFO queueing operations (archive/retrieve), non-FIFO operations (delete, priority queues) require some workarounds. Additionally, this backend imposes constraints on how the CTA Scheduler code can be modified and is an additional software dependency and technology for developers to learn. This paper discusses an alternative Scheduler DB implementation, based on relational database technology. We include a status report and roadmap.
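The appeal of a relational backend is that the non-FIFO operations the abstract mentions become ordinary SQL. The self-contained sketch below (using SQLite for portability) shows a request queue supporting FIFO pops, priority jumps, and arbitrary cancellation in single statements; the schema and column names are illustrative assumptions, not CTA's actual Scheduler DB.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE request (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        kind     TEXT NOT NULL,            -- 'archive' | 'retrieve' | 'repack'
        priority INTEGER NOT NULL DEFAULT 0,
        payload  TEXT NOT NULL
    )
""")

def enqueue(kind, payload, priority=0):
    db.execute("INSERT INTO request (kind, priority, payload) VALUES (?, ?, ?)",
               (kind, priority, payload))

def pop_next(kind):
    """Highest priority first, FIFO within a priority level."""
    row = db.execute(
        "SELECT id, payload FROM request WHERE kind = ? "
        "ORDER BY priority DESC, id ASC LIMIT 1", (kind,)).fetchone()
    if row:
        db.execute("DELETE FROM request WHERE id = ?", (row[0],))
    return row

def cancel(request_id):
    """Arbitrary deletion, awkward on a FIFO object store, is one statement here."""
    db.execute("DELETE FROM request WHERE id = ?", (request_id,))

enqueue("retrieve", "file-A")
enqueue("retrieve", "file-B", priority=5)  # jumps the queue
print(pop_next("retrieve"))  # (2, 'file-B')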


Challenges of Repack in the Era of the High-capacity Tape Cartridge

The contribution details these problems, describes the various approaches the authors have taken to mitigate and solve them, and includes a roadmap for future repack developments.

Paper written by João Afonso, Michael Davis, Julien Leduc, Vladimír Bahyl, Jaroslav Guenther, Pablo Oliver Cortés, Jorge Camarero Vera, Konstantina Skovola, Niels Alexander Bugel, and Richard Bachmann, CERN, Esplanade des Particules 1, 1211 Geneva 23, Switzerland.

Abstract: The latest tape drive technologies (LTO-9, IBM TS1170) impose new constraints on the management of data archived to tape. In the past, new drives could read/write the previous one or even two generations of media, but this is no longer the case. This means that repacking older media to new media must be carried out on a more aggressive schedule than in the past. An additional challenge is the large capacity of the newer media. A 50 TB tape can contain a vast number of files, whose metadata must be tracked during repacking.

Repacking an entire tape also requires a significant amount of disk storage. At CERN Tier-0, these challenges have created new operational problems to solve, in particular contention for resources between physics archival and repack operations. This contribution details these problems and describes the various approaches we have taken to mitigate and solve them. We include a roadmap for future repack developments.
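For a sense of the contention involved, the back-of-the-envelope sketch below estimates how many tapes could be repacked concurrently when repack staging shares a disk buffer with physics archival. All capacities and the reservation policy are invented for illustration, not CERN Tier-0's actual configuration.

# Each in-flight repack may need a full tape's worth of staging disk space.
TAPE_CAPACITY_TB = 50          # e.g. a 50 TB high-capacity cartridge
BUFFER_TB = 300                # shared staging disk pool (assumed)
ARCHIVAL_RESERVE_TB = 100      # headroom kept free for physics archival (assumed)

def max_concurrent_repacks(buffer_tb, reserve_tb, tape_tb):
    usable = buffer_tb - reserve_tb
    return max(0, usable // tape_tb)

print(max_concurrent_repacks(BUFFER_TB, ARCHIVAL_RESERVE_TB, TAPE_CAPACITY_TB))  # 4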


Archive Metadata for Efficient Data Collocation on Tape

The authors present the implementation and deployment in the CERN Tape Archive and their preliminary experiences of consuming Archive Metadata at the WLCG Tier-0.

Paper written by Julien Leduc, Niels Bügel, Pablo Oliver Cortés, Jaroslav Guenther, Konstantina Skovola, Michael Davis, João Afonso, Richard Bachmann, Vladimír Bahyl, and Sergio Alfageme Perez, CERN, Esplanade des Particules 1, 1211 Geneva 23, Switzerland.

Abstract: Due to the increasing volume of physics data being produced, the LHC experiments are making more active use of archival storage. Constraints on available disk storage have motivated the evolution towards the “data carousel” and similar models. Datasets on tape are recalled multiple times for reprocessing and analysis, and this trend is expected to accelerate during the Hi-Lumi era (LHC Run-4 and beyond).

Currently, storage endpoints are optimised for efficient archival, but it is becoming increasingly important to optimise for efficient retrieval. This problem has two dimensions. To reduce unnecessary tape mounts, the spread of each dataset – the number of tapes containing files which will be recalled at the same time – should be minimised. To reduce seek times, files from the same dataset should be physically collocated on the tape. The Archive Metadata specification is an agreed format for experiments to provide scheduling and collocation hints to storage endpoints to achieve these goals.

This contribution describes the motivation, the review process with the various stakeholders and the constraints that led to the Archive Metadata proposal. We present the implementation and deployment in the CERN Tape Archive and our preliminary experiences of consuming Archive Metadata at WLCG Tier-0.
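As a toy illustration of how a storage endpoint might consume collocation hints, the sketch below groups incoming files by a dataset hint and steers each group to a single tape, minimising spread. The hint structure shown is a simplified stand-in, not the actual Archive Metadata specification.

from collections import defaultdict

# Files arriving for archival, each carrying a hypothetical collocation hint.
files = [
    {"name": "run1/f1.root", "collocation": {"0": "dataset-X"}},
    {"name": "run1/f2.root", "collocation": {"0": "dataset-X"}},
    {"name": "run2/g1.root", "collocation": {"0": "dataset-Y"}},
]

tape_of_dataset = {}           # dataset hint -> assigned tape
queues = defaultdict(list)     # tape -> files queued for that tape
free_tapes = iter(["tape-0001", "tape-0002", "tape-0003"])

for f in files:
    hint = f["collocation"]["0"]
    if hint not in tape_of_dataset:
        tape_of_dataset[hint] = next(free_tapes)   # first file of a dataset claims a tape
    queues[tape_of_dataset[hint]].append(f["name"])

print(dict(queues))
# {'tape-0001': ['run1/f1.root', 'run1/f2.root'], 'tape-0002': ['run2/g1.root']}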


A Tape RSE for Extremely Large Data Collection Backups

The authors describe the design and implementation of the TRSE and how it relates to current data management practices, and present performance characteristics that make backups of extremely large-scale data collections practical.

Paper written by Andrew Hanushevsky, Guangwei Che, Lance Nakata, and Wei Yang, SLAC National Accelerator Laboratory, 2575 Sand Hill Rd., Menlo Park, CA 94025, USA.

Abstract: The Vera Rubin Observatory is a very ambitious project. Using the world’s largest ground-based telescope, it will take two panoramic sweeps of the visible sky every three nights using a 3.2 Giga-pixel camera. The observation products will generate 15 PB of new data each year for 10 years. Accounting for reprocessing and related data products the total amount of critical data will reach several hundred PB. Because the camera consists of 4kx4k CCD panels, the majority of the data products will consist of relatively small files in the low megabyte range, impacting data transfer performance. Yet, all of this data needs to be backed up in offline storage and still be easily retrievable not only for groups of files but also for individual files. This paper describes how we are building a Rucio-centric specialized Tape Remote Storage Element (TRSE) that automatically creates a copy of a Rucio dataset as a single indexed file avoiding transferring many small files. This not only allows high-speed transfer of the data to tape for backup and dataset restoral, but also simple retrieval of individual dataset members in order to restore lost files. We describe the design and implementation of the TRSE and how it relates to current data management practices. We also present performance characteristics that make backups of extremely large-scale data collections practical.
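The core idea, packing many small files into one indexed container so the dataset moves to tape as a single large transfer while members remain individually retrievable, can be shown with a toy Python packer. The format below is invented for this sketch and is not the TRSE's actual on-tape layout.

import json, io

def pack(members: dict) -> bytes:
    """members: {name: payload bytes} -> one blob with a trailing JSON index."""
    body, index, offset = io.BytesIO(), {}, 0
    for name, data in members.items():
        index[name] = (offset, len(data))   # remember where each member lives
        body.write(data)
        offset += len(data)
    idx = json.dumps(index).encode()
    # The last 8 bytes record the index length so a reader can find it from the end.
    return body.getvalue() + idx + len(idx).to_bytes(8, "big")

def extract(blob: bytes, name: str) -> bytes:
    """Retrieve a single member without unpacking the whole container."""
    idx_len = int.from_bytes(blob[-8:], "big")
    index = json.loads(blob[-8 - idx_len:-8])
    off, size = index[name]
    return blob[off:off + size]

blob = pack({"ccd_00.fits": b"panel0", "ccd_01.fits": b"panel1"})
print(extract(blob, "ccd_01.fits"))  # b'panel1'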
