TY - JOUR
T1 - Tile low-rank approximations of non-Gaussian space and space-time Tukey g-and-h random field likelihoods and predictions on large-scale systems
AU - Mondal, Sagnik
AU - Abdulah, Sameh
AU - Ltaief, Hatem
AU - Sun, Ying
AU - Genton, Marc G.
AU - Keyes, David E.
N1 - KAUST Repository Item: Exported on 2023-06-14
Acknowledgements: This work is funded and supported by King Abdullah University of Science and Technology (KAUST) through the Office of Sponsored Research (OSR). This research used the resources of the Extreme Computing Research Center (ECRC) and the KAUST Supercomputing Laboratory, including Cray XC40, Shaheen II supercomputer.
PY - 2023/6/2
Y1 - 2023/6/2
N2 - Large-scale statistical modeling has become necessary with the vast flood of geospace data coming from various sources. In space statistics, the Maximum Likelihood Estimation (MLE) is widely considered for modeling geospace data by estimating a set of statistical parameters related to a predefined covariance function. This covariance function describes the correlation between a set of geospace locations where the main goal is to model given data samples and impute missing data. Climate/weather modeling is a prevalent application for the MLE operation where data interpolation and forecasting are highly required. In the literature, the Gaussian random field is often used to describe geospace data as one of the most popular models for MLE. However, real-life datasets are often skewed and/or have extreme values, and non-Gaussian random field models are more appropriate for capturing such features. In this work, we provide an exact and approximate parallel implementation of the well-known Tukey g-and-h (TGH) non-Gaussian random field in the context of climate/weather applications. The proposed implementation alleviates the computation complexity of the log-likelihood function, which requires O(n^2) storage and O(n^3) operations, where N is the number of geospace locations, M is the number of time slots, and n = N × M. Based on tile low-rank (TLR) approximations, our implementation of the TGH model can tackle large-scale problems. Furthermore, we rely on task-based programming models and dynamic runtime systems to provide fast execution for the MLE operation in space and space-time cases. We assess the performance and accuracy of the proposed implementations using synthetic space and space-time datasets up to 800K. We also consider a 12-month precipitation dataset in Germany to demonstrate the advantage of using non-Gaussian over Gaussian random field models. We evaluate the prediction accuracy of the TGH model on the precipitation dataset using the Probability Integral Transformation (PIT) tool showing that the TGH model outperforms the Gaussian modeling in the real dataset. Moreover, our performance assessment indicates that TLR computations allow solving larger matrix sizes while preserving the required accuracy for prediction. The TLR-based approximation shows a speedup up to 7.29X and 2.96X over the exact solution.
AB - Large-scale statistical modeling has become necessary with the vast flood of geospace data coming from various sources. In space statistics, the Maximum Likelihood Estimation (MLE) is widely considered for modeling geospace data by estimating a set of statistical parameters related to a predefined covariance function. This covariance function describes the correlation between a set of geospace locations where the main goal is to model given data samples and impute missing data. Climate/weather modeling is a prevalent application for the MLE operation where data interpolation and forecasting are highly required. In the literature, the Gaussian random field is often used to describe geospace data as one of the most popular models for MLE. However, real-life datasets are often skewed and/or have extreme values, and non-Gaussian random field models are more appropriate for capturing such features. In this work, we provide an exact and approximate parallel implementation of the well-known Tukey g-and-h (TGH) non-Gaussian random field in the context of climate/weather applications. The proposed implementation alleviates the computation complexity of the log-likelihood function, which requires O(n^2) storage and O(n^3) operations, where N is the number of geospace locations, M is the number of time slots, and n = N × M. Based on tile low-rank (TLR) approximations, our implementation of the TGH model can tackle large-scale problems. Furthermore, we rely on task-based programming models and dynamic runtime systems to provide fast execution for the MLE operation in space and space-time cases. We assess the performance and accuracy of the proposed implementations using synthetic space and space-time datasets up to 800K. We also consider a 12-month precipitation dataset in Germany to demonstrate the advantage of using non-Gaussian over Gaussian random field models. We evaluate the prediction accuracy of the TGH model on the precipitation dataset using the Probability Integral Transformation (PIT) tool showing that the TGH model outperforms the Gaussian modeling in the real dataset. Moreover, our performance assessment indicates that TLR computations allow solving larger matrix sizes while preserving the required accuracy for prediction. The TLR-based approximation shows a speedup up to 7.29X and 2.96X over the exact solution.
UR - http://hdl.handle.net/10754/692593
UR - https://linkinghub.elsevier.com/retrieve/pii/S0743731523000850
UR - http://www.scopus.com/inward/record.url?scp=85160736553&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2023.104715
DO - 10.1016/j.jpdc.2023.104715
M3 - Article
SN - 0743-7315
VL - 180
SP - 104715
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
ER -