TY - JOUR
T1 - Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations
AU - Malas, Tareq M.
AU - Hager, Georg
AU - Ltaief, Hatem
AU - Keyes, David E.
N1 - KAUST Repository Item: Exported on 2020-10-01. Acknowledgements: For computer time, this research used the resources of the Extreme Computing Research Center (ECRC) at KAUST. The authors thank the ECRC for supporting T. Malas.
PY - 2017/12/20
Y1 - 2017/12/20
N2 - Optimizing the performance of stencil algorithms has been the subject of intense research over the last two decades. Since many stencil schemes have low arithmetic intensity, most optimizations focus on increasing the temporal data access locality, thus reducing the data traffic through the main memory interface with the ultimate goal of decoupling from this bottleneck. There are, however, only a few approaches that explicitly leverage the shared cache feature of modern multicore chips. If every thread works on its private, separate cache block, the available cache space can become too small, and sufficient temporal locality may not be achieved. We propose a flexible multidimensional intratile parallelization method for stencil algorithms on multicore CPUs with a shared outer-level cache. This method leads to a significant reduction in the required cache space without adverse effects from hardware prefetching or TLB shortage. Our Girih framework includes an autotuner to select optimal parameter configurations on the target hardware. We conduct performance experiments on two contemporary Intel processors and compare with the state-of-the-art stencil frameworks Pluto and Pochoir, using four corner-case stencil schemes and a wide range of problem sizes. Girih shows substantial performance advantages and best arithmetic intensity at almost all problem sizes, especially on low-intensity stencils with variable coefficients. We study in detail the performance behavior at varying grid sizes using phenomenological performance modeling. Our analysis of energy consumption reveals that our method can save energy through reduced DRAM bandwidth usage even at a marginal performance gain. It is thus well suited for future architectures that will be strongly challenged by the cost of data movement, be it in terms of performance or energy consumption.
UR - http://hdl.handle.net/10754/631616
UR - https://dl.acm.org/citation.cfm?doid=3175004.3155290
UR - http://www.scopus.com/inward/record.url?scp=85053396616&partnerID=8YFLogxK
U2 - 10.1145/3155290
DO - 10.1145/3155290
M3 - Article
SN - 2329-4949
VL - 4
SP - 1
EP - 32
JO - ACM Transactions on Parallel Computing
JF - ACM Transactions on Parallel Computing
IS - 3
ER -