TY - GEN
T1 - Fork-join and data-driven execution models on multi-core architectures: Case study of the FMM
AU - Amer, Abdelhalim
AU - Maruyama, Naoya
AU - Pericàs, Miquel
AU - Taura, Kenjiro
AU - Yokota, Rio
AU - Matsuoka, Satoshi
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2013
Y1 - 2013
N2 - Extracting maximum performance of multi-core architectures is a difficult task primarily due to bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we use a highly optimized fork-join based implementation of the FMM and extend it to a data-driven implementation using a distributed task scheduling approach. This study exposes some limitations of the conventional fork-join implementation in terms of synchronization overheads. We find that these are not negligible and their elimination by the data-driven method, with a careful data locality strategy, was beneficial. Experimental evaluation of both methods on state-of-the-art multi-socket multi-core architectures showed up to 22% speed-ups of the data-driven approach compared to the original method. We demonstrate that a data-driven execution of FMM not only improves performance by avoiding global synchronization overheads but also reduces the memory-bandwidth pressure caused by memory-intensive computations. © 2013 Springer-Verlag.
UR - http://hdl.handle.net/10754/575764
UR - http://link.springer.com/10.1007/978-3-642-38750-0_19
UR - http://www.scopus.com/inward/record.url?scp=84884470402&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-38750-0_19
DO - 10.1007/978-3-642-38750-0_19
M3 - Conference contribution
SN - 9783642387494
SP - 255
EP - 266
BT - Lecture Notes in Computer Science
PB - Springer Nature
ER -