TY - GEN
T1 - Hybrid programming model for implicit PDE simulations on multicore architectures
AU - Kaushik, Dinesh
AU - Keyes, David E.
AU - Balay, Satish
AU - Smith, Barry F.
PY - 2011
Y1 - 2011
AB - The complexity of programming modern clusters based on multicore processors is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured-mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model also presents interesting algorithmic opportunities: the linear system solver converges faster than in the pure MPI case, since the parallel preconditioner is stronger when the hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP-based parallelism is easier to implement (within a subdomain assigned to one MPI process, for simplicity), achieving good performance requires attention to data partitioning issues similar to those in the message-passing case. © 2011 Springer-Verlag.
UR - http://hdl.handle.net/10754/564330
UR - http://link.springer.com/10.1007/978-3-642-21487-5_2
UR - http://www.scopus.com/inward/record.url?scp=79959198742&partnerID=8YFLogxK
DO - 10.1007/978-3-642-21487-5_2
M3 - Conference contribution
SN - 9783642214868
SP - 12
EP - 21
BT - OpenMP in the Petascale Era
T3 - Lecture Notes in Computer Science
PB - Springer-Verlag
ER -