TY - JOUR
T1 - Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures
AU - Sukkari, Dalal E.
AU - Ltaief, Hatem
AU - Faverge, Mathieu
AU - Keyes, David E.
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: The authors would like to thank Samuel Thibault from Inria for his support with StarPU, Jack Poulson from Google Inc. for his help in tuning Elemental and the vendors Cray/IBM/Intel/NVIDIA for their hardware donations and/or systems’ remote accesses in the context of the Cray Center of Excellence, the Intel Parallel Computing Center and the NVIDIA GPU Research Center awarded to the Extreme Computing Research Center at KAUST.
PY - 2017/9/29
Y1 - 2017/9/29
N2 - This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations for the polar decomposition on latest shared-memory vendors' systems, while maintaining numerical accuracy.
AB - This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations for the polar decomposition on latest shared-memory vendors' systems, while maintaining numerical accuracy.
UR - http://hdl.handle.net/10754/625885
UR - http://ieeexplore.ieee.org/document/8053812/
UR - http://www.scopus.com/inward/record.url?scp=85030786640&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2017.2755655
DO - 10.1109/TPDS.2017.2755655
M3 - Article
SN - 1045-9219
VL - 29
SP - 312
EP - 323
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 2
ER -