TY - JOUR
T1 - A Fault-Tolerant HPC Scheduler Extension for Large and Operational Ensemble Data Assimilation: Application to the Red Sea
AU - Toye, Habib
AU - Kortas, Samuel
AU - Zhan, Peng
AU - Hoteit, Ibrahim
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: The research reported in this manuscript was supported by King Abdullah University of Science and Technology (KAUST) and Saudi ARAMCO, and made use of the resources of the Supercomputing Core Laboratory of KAUST.
PY - 2018/4/26
Y1 - 2018/4/26
N2 - A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea, based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and on the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members collapsing. This can happen when the filter update step imposes large corrections, not fully consistent with the model physics, on one or more of the forecast ensemble members. Increasing the ensemble size is expected to improve the performance of the assimilation system, but it also increases the risk of member collapse. Hardware failures and slow numerical convergence for some members are also expected to occur more frequently. In this context, manually steering the whole process is a real challenge and makes the implementation of the ensemble assimilation procedure difficult and extremely time-consuming. This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. The system is built on top of Decimate, a scheduler extension developed to ease the submission, monitoring and dynamic steering of workflows of dependent jobs in a fault-tolerant environment. We describe the implementation of the assimilation system and discuss its coupling strategies in detail. 
Within Decimate, only a few additional lines of Python are needed to define flexible convergence criteria and to apply any necessary actions to the forecast ensemble members, for instance (i) restarting a faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, and (iv) replacing members on the fly to enrich the ensemble with new members. We demonstrate the efficiency of the system with numerical experiments assimilating real satellite sea surface height and temperature observations in the Red Sea.
UR - http://hdl.handle.net/10754/627684
UR - http://www.sciencedirect.com/science/article/pii/S1877750317312905
UR - http://www.scopus.com/inward/record.url?scp=85046831861&partnerID=8YFLogxK
U2 - 10.1016/j.jocs.2018.04.018
DO - 10.1016/j.jocs.2018.04.018
M3 - Article
SN - 1877-7503
VL - 27
SP - 46
EP - 56
JO - Journal of Computational Science
JF - Journal of Computational Science
ER -