TY - GEN
T1 - A scalable high performant Cholesky factorization for multicore with GPU accelerators
AU - Ltaief, Hatem
AU - Tomov, Stanimire
AU - Nath, Rajib
AU - Du, Peng
AU - Dongarra, Jack
PY - 2011
Y1 - 2011
N2 - We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable high performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs' compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures already developed for homogeneous multicores and for hybrid GPU-based computing. The result is a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using NVIDIA's Tesla S1070 (4 C1060 GPUs, each with 30 cores @ 1.44 GHz) connected to two dual-core AMD Opteron processors @ 1.8 GHz, we reach up to 1.163 TFlop/s in single precision and up to 275 GFlop/s in double precision arithmetic. Compared with the performance of the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs is involved, our algorithm still runs at 73% and 84% of that rate for single and double precision arithmetic, respectively.
UR - http://www.scopus.com/inward/record.url?scp=79952585280&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-19328-6_11
DO - 10.1007/978-3-642-19328-6_11
M3 - Conference contribution
AN - SCOPUS:79952585280
SN - 9783642193279
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 93
EP - 101
BT - High Performance Computing for Computational Science, VECPAR 2010 - 9th International Conference, Revised Selected Papers
T2 - 9th International Conference on High Performance Computing for Computational Science, VECPAR 2010
Y2 - 22 June 2010 through 25 June 2010
ER -