TY - GEN
T1 - Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model
AU - Stengel, Holger
AU - Treibig, Jan
AU - Hager, Georg
AU - Wellein, Gerhard
N1 - KAUST Repository Item: Exported on 2022-06-23
Acknowledgements: We thank Andrey Semin (Intel Germany) for useful discussions, Olaf Schenk (USI Lugano) for providing the “uxx” benchmark case, and Hatem Ltaief (KAUST) for providing the 3D long-range stencil case. Part of this work was supported by the DFG priority programme 1648 “SPPEXA” under the project “EXASTEEL.”
This publication acknowledges KAUST support, but has no KAUST affiliated authors.
PY - 2015/6/8
Y1 - 2015/6/8
N2 - Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of expected speedup. Understanding the performance properties and bottlenecks by performance modeling enables a clear view on promising optimization opportunities. In this work we refine the recently developed Execution-Cache-Memory (ECM) model and use it to quantify the performance bottlenecks of stencil algorithms on a contemporary Intel processor. This includes applying the model to arrive at single-core performance and scalability predictions for typical "corner case" stencil loop kernels. Guided by the ECM model we accurately quantify the significance of "layer conditions," which are required to estimate the data traffic through the memory hierarchy, and study the impact of typical optimization approaches such as spatial blocking, strength reduction, and temporal blocking for their expected benefits. We also compare the ECM model to the widely known Roofline model.
UR - http://hdl.handle.net/10754/679277
UR - https://dl.acm.org/doi/10.1145/2751205.2751240
UR - http://www.scopus.com/inward/record.url?scp=84940765158&partnerID=8YFLogxK
U2 - 10.1145/2751205.2751240
DO - 10.1145/2751205.2751240
M3 - Conference contribution
SN - 9781450335591
SP - 207
EP - 216
BT - Proceedings of the 29th ACM on International Conference on Supercomputing
PB - ACM
ER -