TY - GEN
T1 - Performance scaling variability and energy analysis for a resilient ULFM-based PDE solver
AU - Morris, K.
AU - Rizzi, F.
AU - Cook, B.
AU - Mycek, P.
AU - LeMaitre, O.
AU - Knio, O. M.
AU - Sargsyan, K.
AU - Dahlgren, K.
AU - Debusschere, B. J.
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/30
Y1 - 2017/1/30
N2 - We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model where all state information is held by the servers, while clients only serve as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to clients failing. We present weak and strong scaling results on Edison, National Energy Research Scientific Computing Center (NERSC), for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.
AB - We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model where all state information is held by the servers, while clients only serve as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to clients failing. We present weak and strong scaling results on Edison, National Energy Research Scientific Computing Center (NERSC), for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.
KW - Client-server systems
KW - Dynamic voltage scaling
KW - Fault tolerance
KW - Partial differential equations
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=85015189347&partnerID=8YFLogxK
U2 - 10.1109/ScalA.2016.010
DO - 10.1109/ScalA.2016.010
M3 - Conference contribution
AN - SCOPUS:85015189347
T3 - Proceedings of ScalA 2016: 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 41
EP - 48
BT - Proceedings of ScalA 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2016
Y2 - 13 November 2016 through 18 November 2016
ER -