Emergency Backup for Scientific Applications

Aniello Esposito, Christopher Haine, Ali Mohammed

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A framework for efficient in-network data transfer between a parallel application and an independent storage server is proposed. The case of an unexpected and unrecoverable interruption of the application is considered, where the server takes the role of an emergency backup service, preventing the unnecessary loss of valuable information. Workload managers such as SLURM provide a time buffer between the interruption and the termination of an application, which the framework exploits using RDMA transport and data redistribution by means of the Maestro middleware. Alternatives include state-of-the-art checkpoint/restart mechanisms, which rely on a possibly shared storage hierarchy, suffer from variability, and do not scale in general, and in-memory checkpointing, which increases memory consumption considerably. Experiments are performed on an HPE/Cray EX system to construct a heuristic for the amount of data that can realistically be backed up within a given time buffer. The method proves faster than VELOC and plain MPI-IO already with a single server node, for up to one hundred user ranks, and promises better scalability in the long run due to the in-network approach as opposed to filesystem transport.
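
To illustrate how the time buffer mentioned in the abstract could be consumed, the following is a minimal sketch in C of an application reacting to a warning signal from the workload manager. It assumes SLURM delivers SIGUSR1 some seconds before termination (for example, requested through the --signal option); emergency_backup() is a hypothetical placeholder for the actual RDMA/Maestro transfer and is not part of the framework's or Maestro's API.

    /* Minimal sketch: react to a SLURM warning signal within the time buffer. */
    #define _POSIX_C_SOURCE 200809L
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static volatile sig_atomic_t backup_requested = 0;

    /* Signal handler: only set a flag; defer real work to the main loop. */
    static void handle_warning(int sig)
    {
        (void)sig;
        backup_requested = 1;
    }

    /* Hypothetical hook: push application state to the backup server. */
    static void emergency_backup(void)
    {
        fprintf(stderr, "time buffer started: transferring state to backup server\n");
        /* ... RDMA / Maestro transfer would go here ... */
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = handle_warning;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);

        for (;;) {                 /* application main loop */
            sleep(1);              /* stand-in for computation */
            if (backup_requested) {
                emergency_backup();
                break;             /* exit cleanly within the buffer */
            }
        }
        return EXIT_SUCCESS;
    }

A job could request such a warning with, for example, #SBATCH --signal=USR1@300, which asks SLURM to deliver SIGUSR1 to the job steps 300 seconds before the time limit; for other kinds of interruption, SLURM's grace period between SIGTERM and SIGKILL plays a similar role.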
Original language: English (US)
Title of host publication: 2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)
Publisher: IEEE
DOIs
State: Published - Jan 27, 2023
Externally published: Yes
