Large-scale environmental data science with ExaGeoStatR

Sameh Abdulah, Yuxiao Li, Jian Cao, Hatem Ltaief, David E. Keyes, Marc G. Genton*, Ying Sun

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Scopus citations


Parallel computing in exact Gaussian process (GP) calculations becomes necessary for avoiding the computational and memory restrictions associated with large-scale environmental data science applications. The exact evaluation of the Gaussian log-likelihood function requires $O(n^2)$ storage and $O(n^3)$ operations, where $n$ is the number of geographical locations. Thus, exactly computing the log-likelihood function for a large number of locations requires exploiting the power of existing parallel computing hardware, such as shared-memory systems, possibly equipped with GPUs, and distributed-memory systems, to cope with this computational cost. In this article, we present ExaGeoStatR, a package for exascale geostatistics in R that supports parallel computation of the exact maximum likelihood function on a wide variety of parallel architectures. Furthermore, the package allows scaling existing GP methods to large spatial/temporal domains. Exact solutions that were previously prohibitive for large geostatistical problems become possible with ExaGeoStatR. Parallelization in ExaGeoStatR depends on breaking down the numerical linear algebra operations in the log-likelihood function into a set of tasks and expressing them in a task-based programming model. The package can be used directly through the R environment on parallel systems without the user needing any C, CUDA, or MPI knowledge. Currently, ExaGeoStatR supports several maximum likelihood computation variants: exact, diagonal super tile and tile low-rank approximations, and mixed precision. ExaGeoStatR also provides a tool to simulate large-scale synthetic datasets. These datasets can help assess different implementations of the maximum log-likelihood approximation methods. Herein, we show the implementation details of ExaGeoStatR, analyze its performance on various parallel architectures, and assess its accuracy using synthetic datasets with up to 250K observations.
The experimental analysis covers the exact computation in ExaGeoStatR to demonstrate the parallel capabilities of the package. We provide a hands-on tutorial that analyzes a real sea surface temperature dataset. The performance evaluation involves comparisons with the popular packages GeoR, fields, and bigGP for exact Gaussian likelihood evaluation. The approximation methods in ExaGeoStatR are not considered in this article since they were analyzed in previous studies.
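
To make the cost of the exact log-likelihood concrete, the following is a minimal serial sketch in plain Python. It is not ExaGeoStatR code and does not use the package's API. All function names are hypothetical, and the exponential kernel (a Matérn covariance with smoothness 1/2) is an assumption chosen for brevity. The dense covariance matrix accounts for the quadratic storage, and the Cholesky factorization accounts for the cubic operation count. These are the linear algebra steps that, according to the abstract, the package breaks into tasks.

```python
import math

def exponential_covariance(locs, variance, length_scale):
    """Dense n x n covariance matrix (quadratic storage).

    Assumption: exponential kernel, i.e. Matern with smoothness 1/2.
    """
    n = len(locs)
    return [[variance * math.exp(-math.dist(locs[i], locs[j]) / length_scale)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular factor low with low * low^T = a (cubic operations)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s / low[j][j]
    return low

def forward_solve(low, b):
    """Solve low * x = b for lower-triangular low."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(low[i][k] * x[k] for k in range(i))) / low[i][i]
    return x

def gaussian_loglik(z, cov):
    """Exact zero-mean Gaussian log-likelihood of observations z."""
    n = len(z)
    low = cholesky(cov)
    w = forward_solve(low, z)                 # w = low^{-1} z
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in w)              # z^T cov^{-1} z
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

# Toy usage: four locations on the unit square with made-up observations.
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
z = [0.3, -0.1, 0.5, 0.2]
cov = exponential_covariance(locs, variance=1.0, length_scale=0.5)
loglik = gaussian_loglik(z, cov)
```

A maximum likelihood fit would call `gaussian_loglik` repeatedly inside an optimizer over the covariance parameters, which is why the per-evaluation cost dominates at large $n$.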

Original language: English (US)
Article number: e2770
Issue number: 1
State: Published - Feb 2023


Keywords

  • environmental application
  • Gaussian process
  • Matérn covariance function
  • maximum likelihood optimization
  • parameter estimation
  • prediction

ASJC Scopus subject areas

  • Statistics and Probability
  • Ecological Modeling

