TY - GEN
T1 - AMA: Asynchronous management of accelerators for task-based programming models
AU - Planas, Judit
AU - Badia, Rosa M.
AU - Ayguade, Eduard
AU - Labarta, Jesus
N1 - KAUST Repository Item: Exported on 2022-06-24
Acknowledgements: European Commission (HiPEAC-3 Network of Excellence, FP7-ICT 287759), Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-1051). We thank KAUST IT Research Computing for granting access to their machines.
This publication acknowledges KAUST support but has no KAUST-affiliated authors.
PY - 2015/6/1
Y1 - 2015/6/1
N2 - Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but programming these devices remains a challenging task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way, so the time spent waiting for device operations can be used to perform other work. AMA is implemented on top of a task-based framework; experimental evaluation on a quad-GPU node shows that it matches the performance of a hand-tuned native CUDA code while fully hiding device management. In addition, in the best case we obtain more than a 2x speed-up over the original framework implementation.
AB - Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but programming these devices remains a challenging task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way, so the time spent waiting for device operations can be used to perform other work. AMA is implemented on top of a task-based framework; experimental evaluation on a quad-GPU node shows that it matches the performance of a hand-tuned native CUDA code while fully hiding device management. In addition, in the best case we obtain more than a 2x speed-up over the original framework implementation.
UR - http://hdl.handle.net/10754/679324
UR - https://linkinghub.elsevier.com/retrieve/pii/S1877050915010200
UR - http://www.scopus.com/inward/record.url?scp=84939143628&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2015.05.212
DO - 10.1016/j.procs.2015.05.212
M3 - Conference contribution
SP - 130
EP - 139
BT - Procedia Computer Science
PB - Elsevier BV
ER -