Abstract
With the emergence of new massively parallel systems in high-performance computing, scientific simulations can run on thousands of processors, and the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these events, called process failures, is therefore becoming increasingly important. For a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures may be fairly straightforward for elliptic and linear hyperbolic problems. For parabolic problems, however, reversing the integration in time is the most challenging part because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to reconstruct numerically the lost data of the failed process(es), avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As a fault-tolerant communication library, we use the fault-tolerant message passing interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance: the three-dimensional parabolic benchmark code is able to recover and continue running after failures, adding only a very small penalty to the overall execution time.
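The ill-posedness of backward time integration mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a 1D heat equation u_t = u_xx on a periodic grid with an explicit Euler step, and the names `step`, `integrate`, `r`, and `eps` are hypothetical choices made for this illustration. A perturbation of size `eps` in the final state is damped when stepping forward but amplified by many orders of magnitude when stepping backward.

```python
import math

def step(u, r):
    """One explicit Euler step of u_t = u_xx on a periodic grid.

    r = dt / h**2; a negative r steps backward in time.
    """
    n = len(u)
    # u[i - 1] wraps around for i == 0, giving the periodic neighbour.
    return [u[i] + r * (u[(i + 1) % n] - 2.0 * u[i] + u[i - 1])
            for i in range(n)]

def integrate(u, r, steps):
    """Apply `steps` explicit Euler steps with ratio r."""
    for _ in range(steps):
        u = step(u, r)
    return u

def max_diff(a, b):
    """Largest pointwise difference between two grid functions."""
    return max(abs(x - y) for x, y in zip(a, b))

# Illustrative parameters (assumptions, not taken from the paper).
n, r, steps = 32, 0.2, 50
u0 = [math.sin(2.0 * math.pi * i / n) for i in range(n)]

# Forward in time: the explicit scheme is stable for r <= 0.5.
u_end = integrate(u0, r, steps)

# Perturb the final state with a tiny sawtooth (highest grid frequency).
eps = 1e-10
u_end_noisy = [v + eps * (-1) ** i for i, v in enumerate(u_end)]

# Continuing forward, the perturbation is damped ...
forward_error = max_diff(integrate(u_end_noisy, r, steps),
                         integrate(u_end, r, steps))

# ... but stepping backward (r -> -r) multiplies the sawtooth by
# (1 + 4 * r) at every step, so the perturbation grows geometrically.
backward_error = max_diff(integrate(u_end_noisy, -r, steps),
                          integrate(u_end, -r, steps))

print(f"forward error:  {forward_error:.3e}")
print(f"backward error: {backward_error:.3e}")
```

The growth of the highest-frequency mode in the backward direction is the discrete counterpart of the exponential growth of high-wavenumber components in the continuous backward heat equation, which is why lost data for a parabolic problem cannot simply be recovered by naively integrating backward from a later state.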
Original language | English (US) |
---|---|
Pages (from-to) | 663-677 |
Number of pages | 15 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 68 |
Issue number | 5 |
State | Published - May 2008 |
Externally published | Yes |
Keywords
- Parabolic problems
- Parallel numerical algorithms
- Process fault tolerance
ASJC Scopus subject areas
- Software
- Artificial Intelligence
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications