TY - GEN
T1 - Emergency Backup for Scientific Applications
AU - Esposito, Aniello
AU - Haine, Christopher
AU - Mohammed, Ali
N1 - KAUST Repository Item: Exported on 2023-01-31
Acknowledgements: This work is supported by the HPE/Cray/KAUST center of excellence collaboration. We want to thank Timothy Dykes and Utz-Uwe Haus from the EMEA research lab at Hewlett Packard Enterprise for insightful discussions and suggestions.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.
PY - 2023/1/27
Y1 - 2023/1/27
AB - A framework for efficient in-network data transfer between a parallel application and an independent storage server is proposed. The case of an unexpected and unrecoverable interruption of the application is considered, in which the server acts as an emergency backup service that prevents the unnecessary loss of valuable information. Workload managers such as SLURM provide a time buffer between the interruption and the termination of an application; the framework exploits this buffer by using RDMA transport and redistributing data by means of the Maestro middleware. Alternatives include a state-of-the-art checkpoint-restart mechanism relying on a possibly shared storage hierarchy, which suffers from variability and is not scalable in general, or in-memory checkpointing, which increases memory consumption considerably. Experiments are performed on an HPE/Cray EX system to construct a heuristic for the amount of data that can realistically be backed up during a given time buffer. With a single server node, the method is already faster than VELOC and plain MPI-IO for up to a hundred user ranks, and the in-network approach promises better long-run scalability than filesystem transport.
UR - http://hdl.handle.net/10754/687385
UR - https://ieeexplore.ieee.org/document/10025539/
U2 - 10.1109/supercheck56652.2022.00008
DO - 10.1109/supercheck56652.2022.00008
M3 - Conference contribution
BT - 2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)
PB - IEEE
ER -