Abstract
Based on distributed and uncoordinated checkpointing, the numerical methods presented in this chapter can reconstruct a consistent state in a parallel application even though the checkpoints of the various processes are stored at different time steps. The main purpose of these algorithms is to avoid the expensive rollback to the last consistent distributed checkpoint, which loses all subsequent work, and to avoid the significant overhead that coordinated checkpoints add for applications running on thousands of processors. The first method, the forward implicit scheme, requires the boundary variables of each time step to be stored along with the current solution for the reconstruction procedure; the second method, based on explicit space/time marching, requires checkpointing the solution of each process at every time step. To stabilize the scheme, a hyperbolic regularization, such as the telegraph equation, which is a perturbation of the heat equation, may be added. Performance results comparing both methods with respect to checkpointing overhead are presented. The checkpointing infrastructure implemented for the 3D heat equation uses two groups of processes: a solver group, composed of processes that solve the problem itself, and a spare group, whose main function is to store the local data from the solver processes. © 2007
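The two-group layout described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function name `split_groups`, the choice to place spares at the highest ranks, and the round-robin assignment of solvers to spares are all assumptions made for illustration, since the abstract does not state how solvers are matched to spare processes.

```python
def split_groups(n_procs, n_spares):
    """Partition ranks 0..n_procs-1 into a solver group and a spare group.

    Hypothetical layout: the last n_spares ranks act as spares.
    """
    if not 0 < n_spares < n_procs:
        raise ValueError("need at least one solver and one spare process")
    n_solvers = n_procs - n_spares
    solvers = list(range(n_solvers))
    spares = list(range(n_solvers, n_procs))
    # Assumed round-robin mapping: each solver sends its local
    # checkpoint data to exactly one spare process.
    spare_of = {s: spares[s % n_spares] for s in solvers}
    return solvers, spares, spare_of


solvers, spares, spare_of = split_groups(8, 2)
for rank in solvers:
    print(f"solver {rank} stores its checkpoints on spare {spare_of[rank]}")
```

In an MPI code the same partition would typically be expressed by splitting the global communicator into two sub-communicators, one per group.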
Original language | English (US) |
---|---|
Title of host publication | Parallel Computational Fluid Dynamics 2006 |
Publisher | Elsevier Ltd |
Pages | 123-130 |
Number of pages | 8 |
ISBN (Print) | 9780444530356 |
State | Published - 2007 |
Externally published | Yes |
ASJC Scopus subject areas
- General Chemical Engineering