TY - JOUR
T1 - Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
AU - Hasanov, Khalid
AU - Quintin, Jean-Noël
AU - Lastovetsky, Alexey
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: The research in this paper was supported by IRCSET (Irish Research Council for Science, Engineering and Technology) and IBM, grant numbers EPSG/2011/188 and EPSPD/2011/207. Some of the experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr). Another part of the experiments was carried out using the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia. The authors would like to thank Ashley DeFlumere for her useful comments and corrections.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.
PY - 2014/3/4
Y1 - 2014/3/4
N2 - © 2014, Springer Science+Business Media New York. Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in the 1990s a system with a few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to the optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy and hence more parallelism into the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix–matrix multiplication, and demonstrate both theoretically and experimentally that the modified Hierarchical SUMMA significantly improves the communication cost and the overall performance on large-scale platforms.
AB - © 2014, Springer Science+Business Media New York. Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in the 1990s a system with a few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to the optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy and hence more parallelism into the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix–matrix multiplication, and demonstrate both theoretically and experimentally that the modified Hierarchical SUMMA significantly improves the communication cost and the overall performance on large-scale platforms.
UR - http://hdl.handle.net/10754/598455
UR - http://link.springer.com/10.1007/s11227-014-1133-x
UR - http://www.scopus.com/inward/record.url?scp=84945484132&partnerID=8YFLogxK
U2 - 10.1007/s11227-014-1133-x
DO - 10.1007/s11227-014-1133-x
M3 - Article
SN - 0920-8542
VL - 71
SP - 3991
EP - 4014
JO - The Journal of Supercomputing
JF - The Journal of Supercomputing
IS - 11
ER -