Covariance matrices are ubiquitous in computational sciences, typically describing the correlation of elements of large multivariate spatial data sets. For example, covariance matrices are employed in climate/weather modeling for maximum likelihood estimation to improve prediction, as well as in computational ground-based astronomy to enhance the observed image quality by filtering out noise produced by the adaptive optics instruments and atmospheric turbulence. The structure of these covariance matrices is dense, symmetric, positive-definite, and often data-sparse, and therefore hierarchically of low rank. This thesis investigates the performance limit of dense matrix computations (e.g., Cholesky factorization) on covariance matrix problems as the number of unknowns grows, in the context of the aforementioned applications. We employ recursive formulations of some of the basic linear algebra subroutines (BLAS) to further accelerate the covariance matrix computation while reducing data traffic across the layers of the memory subsystem. However, dealing with large data sets (i.e., covariance matrices of billions in size) can rapidly become prohibitive in memory footprint and algorithmic complexity. Most importantly, this thesis investigates the tile low-rank (TLR) data format, a new compressed data structure and layout, which is valuable in exploiting data sparsity by approximating the operator. The TLR compressed data structure allows approximating the original problem up to a user-defined numerical accuracy. This comes at the expense of dealing with tasks with much lower arithmetic intensities than traditional dense computations. In fact, this thesis consolidates the two trends of dense and data-sparse linear algebra for HPC. Not only does the thesis leverage recursive formulations for dense Cholesky-based matrix algorithms, but it also implements a novel TLR-Cholesky factorization using batched linear algebra operations to increase hardware occupancy and reduce the overhead of the API. The reported performance of the dense and TLR-Cholesky factorizations shows manyfold speedups against state-of-the-art implementations on various systems equipped with GPUs. Additionally, the TLR implementation gives the user flexibility to select the desired accuracy. This trade-off between performance and accuracy is currently a well-established leading trend in the convergence of the third and fourth paradigms, i.e., HPC and Big Data, when moving forward with the exascale software roadmap.
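The recursive formulation of the dense Cholesky factorization mentioned in the abstract can be illustrated with a minimal sketch. It is written in pure Python for readability and is not the thesis implementation, which targets GPU BLAS kernels; the function names `recursive_cholesky` and `cholesky` are hypothetical. The matrix is split into a 2x2 block form, the leading block is factored recursively, the off-diagonal block is obtained by a triangular solve (the TRSM step), the trailing block is updated by a symmetric rank-k update (the SYRK step), and the recursion continues on the trailing block.

```python
import math


def recursive_cholesky(a, lo=0, hi=None):
    """Overwrite the lower triangle of a[lo:hi][lo:hi] with its Cholesky factor.

    `a` is a list of lists holding a symmetric positive-definite matrix.
    Only the lower triangle is read and written (illustrative sketch).
    """
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n == 1:
        # Base case: a 1x1 block factors as its square root.
        a[lo][lo] = math.sqrt(a[lo][lo])
        return a
    mid = lo + n // 2

    # Step 1: factor the leading block, A11 = L11 * L11^T.
    recursive_cholesky(a, lo, mid)

    # Step 2 (TRSM): solve L21 * L11^T = A21 for L21, column by column.
    for i in range(mid, hi):
        for j in range(lo, mid):
            s = a[i][j] - sum(a[i][k] * a[j][k] for k in range(lo, j))
            a[i][j] = s / a[j][j]

    # Step 3 (SYRK): A22 <- A22 - L21 * L21^T, lower triangle only.
    for i in range(mid, hi):
        for j in range(mid, i + 1):
            a[i][j] -= sum(a[i][k] * a[j][k] for k in range(lo, mid))

    # Step 4: factor the updated trailing block.
    recursive_cholesky(a, mid, hi)
    return a


def cholesky(matrix):
    """Return the lower-triangular factor L with matrix = L * L^T."""
    a = [list(map(float, row)) for row in matrix]  # work on a copy
    recursive_cholesky(a)
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = 0.0  # clear the untouched upper triangle
    return a


if __name__ == "__main__":
    # Small symmetric positive-definite example matrix.
    spd = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky(spd):
        print(row)
```

In an optimized setting, steps 2 and 3 would map to large matrix-matrix kernels, which is what makes the recursive splitting attractive for reducing data movement; the scalar loops here only stand in for those kernels.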
Date of Award  May 24 2018 

Original language  English (US) 

Awarding Institution   Computer, Electrical and Mathematical Sciences and Engineering


Supervisor  David Keyes (Supervisor) 

 data sparse
 Hierarchical
 covariance matrix
 GPU
 tile lowrank
 Dense Linear Algebra