TY - GEN
T1 - High Performance Polar Decomposition on Distributed Memory Systems
AU - Sukkari, Dalal E.
AU - Ltaief, Hatem
AU - Keyes, David E.
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: For computer time, this research used the resources from the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland.
PY - 2016/8/9
Y1 - 2016/8/9
AB - The polar decomposition of a dense matrix is an important operation in linear algebra. It can be calculated directly through the singular value decomposition (SVD) or iteratively using the QR-based dynamically weighted Halley (QDWH) algorithm. The former is difficult to parallelize because of the large number of memory-bound operations during the bidiagonal reduction. We investigate the latter approach, which performs more floating-point operations but exposes more parallelism and therefore runs closer to the theoretical peak performance of the system, thanks to more compute-bound matrix operations. Profiling results show the performance scalability of QDWH for calculating the polar decomposition using around 9200 MPI processes on well- and ill-conditioned matrices of size 100K×100K. We then study the performance impact of the QDWH-based polar decomposition as a pre-processing step toward calculating the SVD itself. The new distributed-memory implementation of the QDWH-SVD solver achieves up to a five-fold speedup over current state-of-the-art vendor SVD implementations. © Springer International Publishing Switzerland 2016.
UR - http://hdl.handle.net/10754/622144
UR - http://link.springer.com/chapter/10.1007%2F978-3-319-43659-3_44
UR - http://www.scopus.com/inward/record.url?scp=84984801505&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-43659-3_44
DO - 10.1007/978-3-319-43659-3_44
M3 - Conference contribution
SN - 9783319436586
SP - 605
EP - 616
BT - Euro-Par 2016: Parallel Processing
PB - Springer Nature
ER -