TY - GEN
T1 - Exploration of automatic optimization for CUDA programming
AU - Al-Mouhamed, Mayez
AU - Khan, Ayaz ul Hassan
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: Thanks to the ICS-KFUPM and KAUST for giving access to their GPU computers and workstations.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.
PY - 2012/12
Y1 - 2012/12
N2 - Graphics Processing Units (GPUs) are gaining ground in high-performance computing. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose GPU computation. However, the task of writing an optimized CUDA program is complex, even for experts. We present a method for restructuring loops into optimized CUDA kernels based on a three-step algorithm: loop tiling, coalesced memory access, and resource optimization. We establish the relationships between the influencing parameters and propose a method for finding the possible tiling solutions with coalesced memory access that best meet the identified constraints. We also present a simplified algorithm for restructuring loops and rewriting them as efficient CUDA kernels. The execution model of the synthesized kernel distributes the kernel threads uniformly to keep all cores busy while transferring a tailored data locality, accessed in a coalesced pattern, to amortize the long latency of secondary memory. In the evaluation, we implement several simple applications using the proposed restructuring strategy and evaluate their performance in terms of execution time and GPU throughput. © 2012 IEEE.
AB - Graphics Processing Units (GPUs) are gaining ground in high-performance computing. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose GPU computation. However, the task of writing an optimized CUDA program is complex, even for experts. We present a method for restructuring loops into optimized CUDA kernels based on a three-step algorithm: loop tiling, coalesced memory access, and resource optimization. We establish the relationships between the influencing parameters and propose a method for finding the possible tiling solutions with coalesced memory access that best meet the identified constraints. We also present a simplified algorithm for restructuring loops and rewriting them as efficient CUDA kernels. The execution model of the synthesized kernel distributes the kernel threads uniformly to keep all cores busy while transferring a tailored data locality, accessed in a coalesced pattern, to amortize the long latency of secondary memory. In the evaluation, we implement several simple applications using the proposed restructuring strategy and evaluate their performance in terms of execution time and GPU throughput. © 2012 IEEE.
UR - http://hdl.handle.net/10754/598291
UR - http://ieeexplore.ieee.org/document/6449791/
UR - http://www.scopus.com/inward/record.url?scp=84874413450&partnerID=8YFLogxK
U2 - 10.1109/PDGC.2012.6449791
DO - 10.1109/PDGC.2012.6449791
M3 - Conference contribution
SN - 9781467329255
SP - 55
EP - 60
BT - 2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing
PB - Institute of Electrical and Electronics Engineers (IEEE)
ER -
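
Editor's note: the abstract above describes restructuring loops into CUDA kernels via loop tiling and coalesced memory access. The sketch below is a minimal, generic illustration of those two techniques (a shared-memory tiled matrix multiply), not code generated by the cited method; the names tiledMatMul, TILE, and the matrix size N are illustrative assumptions.

// Minimal sketch: loop tiling with coalesced global-memory access in CUDA.
// Not the authors' generated kernel; all identifiers here are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16  // tile edge, also the thread-block edge (assumed value)

// C = A * B for square N x N matrices stored in row-major order.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N)
{
    // Shared-memory tiles hold the data locality staged per outer iteration.
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Outer (tiled) loop: each iteration loads one tile of A and one of B.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;  // consecutive threadIdx.x maps to
        int bRow = t * TILE + threadIdx.y;  // consecutive addresses -> coalesced loads

        sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Inner product over the staged tile, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}

int main()
{
    const int N = 256;                       // assumed problem size
    size_t bytes = (size_t)N * N * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);            // unified memory keeps the driver short
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}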