TY - GEN
T1 - Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures
AU - Luszczek, Piotr
AU - Ltaief, Hatem
AU - Dongarra, Jack
PY - 2011
Y1 - 2011
N2 - While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architecture, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has even further exacerbated the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout i.e., a high performance data representation, which substantially enhances data locality. Following a two-stage approach, the tile matrix is then transformed into band tridiagonal form using compute intensive kernels. The band form is further reduced to the required tridiagonal form using a left-looking bulge chasing technique to reduce memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The obtained tile tridiagonal reduction significantly outperforms the state-of-the-art numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial numerical software Intel MKL) from medium to large matrix sizes.
AB - While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architecture, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has even further exacerbated the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout i.e., a high performance data representation, which substantially enhances data locality. Following a two-stage approach, the tile matrix is then transformed into band tridiagonal form using compute intensive kernels. The band form is further reduced to the required tridiagonal form using a left-looking bulge chasing technique to reduce memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The obtained tile tridiagonal reduction significantly outperforms the state-of-the-art numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial numerical software Intel MKL) from medium to large matrix sizes.
KW - Bulge Chasing
KW - Scheduling
KW - Tile Algorithms
KW - Translation Layer
KW - Tridiagonal Reduction
UR - http://www.scopus.com/inward/record.url?scp=80053252490&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2011.91
DO - 10.1109/IPDPS.2011.91
M3 - Conference contribution
AN - SCOPUS:80053252490
SN - 9780769543857
T3 - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
SP - 944
EP - 955
BT - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
T2 - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Y2 - 16 May 2011 through 20 May 2011
ER -