TY - JOUR
T1 - An error-resilient redundant subspace correction method
AU - Cui, Tao
AU - Xu, Jinchao
AU - Zhang, Chen Song
N1 - Generated from Scopus record by KAUST IRTS on 2023-02-15
PY - 2017/1/1
Y1 - 2017/1/1
N2 - Due to the increasing complexity of supercomputers, hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. To improve the reliability (increase the mean time to failure) of computing systems, considerable effort has been devoted to developing techniques to forecast, prevent, and recover from errors at different levels, including architecture, application, and algorithm. In this paper, we focus on algorithmic error-resilient iterative solvers and introduce a redundant subspace correction method. Using a general framework of redundant subspace corrections, we construct iterative methods that have the following properties: (1) maintain convergence when an error occurs, assuming it is detectable; (2) introduce low computational overhead when no error occurs; (3) require only a small amount of point-to-point communication compared to traditional methods and maintain good load balance; (4) improve the mean time to failure. Preliminary numerical experiments demonstrate the efficiency and effectiveness of the new subspace correction method. For simplicity, the main ideas of the proposed framework are demonstrated using Schwarz methods without a coarse space, which do not scale well in practice.
UR - http://link.springer.com/10.1007/s00791-016-0270-6
UR - http://www.scopus.com/inward/record.url?scp=85006894350&partnerID=8YFLogxK
U2 - 10.1007/s00791-016-0270-6
DO - 10.1007/s00791-016-0270-6
M3 - Article
SN - 1433-0369
VL - 18
SP - 65
EP - 77
JO - Computing and Visualization in Science
JF - Computing and Visualization in Science
IS - 2-3
ER -