TY - JOUR
T1 - Accelerating turbulent reacting flow simulations on many-core/GPUs using matrix-based kinetics
AU - Uranakara, Harshavardhana Ashoka
AU - Barwey, Shivam
AU - Hernandez Perez, Francisco
AU - Vijayarangan, Vijayamanikandan
AU - Raman, Venkat
AU - Im, Hong G.
N1 - KAUST Repository Item: Exported on 2022-09-26
Acknowledgements: This work was sponsored by King Abdullah University of Science and Technology (KAUST) and used computational resources of the KAUST Supercomputing Laboratory (KSL). The authors thank Dr. Mohsin for support in using Ibex.
PY - 2022/9/22
Y1 - 2022/9/22
N2 - The present work assesses the impact, in terms of time to solution, throughput, and hardware scalability, of transferring computationally intensive tasks found in compressible reacting flow solvers to the GPU. Attention is focused on outlining the workflow and data transfer penalties associated with “plugging in” a recently developed GPU-based chemistry library into (a) a purely CPU-based solver and (b) a GPU-based solver in which, except for the chemistry, all other variables are computed on the GPU. This comparison allows quantification of host-to-device (and device-to-host) data transfer penalties on the overall solver speedup as a function of mesh and reaction mechanism size. To this end, the recently developed GPU-based chemistry library UMChemGPU is employed to treat the kinetics in the flow solver KARFS. UMChemGPU replaces conventional CPU-based Cantera routines with a matrix-based formulation. The impact of (i) data transfer times, (ii) chemistry acceleration, and (iii) the hardware architecture is studied in detail in the context of GPU saturation limits. Hydrogen and dimethyl ether (DME) reaction mechanisms are used to assess the impact of the number of species/reactions on the overall and chemistry-only speedup. It was found that offloading the source term computation to UMChemGPU results in up to a 7X reduction in the overall time to solution and four orders of magnitude faster source term computation compared to conventional CPU-based methods. Furthermore, the metrics for achieving maximum performance gain using GPU chemistry with an MPI + CUDA solver are explained using the Roofline model. Integrating UMChemGPU with an MPI + OpenMP solver does not improve the overall performance due to the associated data copy time between the device (GPU) and host (CPU) memory spaces. The performance portability was demonstrated using three different GPU architectures, and the findings are expected to translate to a wide variety of high-performance codes in the combustion community.
UR - http://hdl.handle.net/10754/681646
UR - https://linkinghub.elsevier.com/retrieve/pii/S1540748922001778
DO - 10.1016/j.proci.2022.07.144
M3 - Article
SN - 1540-7489
JO - Proceedings of the Combustion Institute
JF - Proceedings of the Combustion Institute
ER -