TY - JOUR
T1 - Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor
AU - Malas, Tareq Majed Yasin
AU - Ahmadia, Aron
AU - Brown, Jed
AU - Gunnels, John A.
AU - Keyes, David E.
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2012/5/21
Y1 - 2012/5/21
AB - Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer's PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results. © The Author(s) 2012.
UR - http://hdl.handle.net/10754/562189
UR - http://arxiv.org/abs/arXiv:1201.3496v1
UR - http://www.scopus.com/inward/record.url?scp=84877260365&partnerID=8YFLogxK
U2 - 10.1177/1094342012444795
DO - 10.1177/1094342012444795
M3 - Article
SN - 1094-3420
VL - 27
SP - 193
EP - 209
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 2
ER -