TY - CONF
T1 - Efficient sparse collective communication and its application to accelerate distributed deep learning
AU - Fei, Jiawei
AU - Ho, Chen-Yu
AU - Sahu, Atal N.
AU - Canini, Marco
AU - Sapio, Amedeo
N1 - KAUST Repository Item: Exported on 2021-08-12
Acknowledged KAUST grant number(s): OSR-CRG2020-4382
Acknowledgements: We are grateful to Arvind Krishnamurthy, Jacob Nelson and Dan R. K. Ports for their helpful suggestions. We are thankful to Meituan for granting us access to a multi-GPU server testbed. We thank our shepherd, Kate Lin, and the anonymous reviewers for their helpful feedback. This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-CRG2020-4382. For computer time, this research used the resources of the Supercomputing Laboratory at KAUST. The work of Jiawei Fei at KAUST is supported by a sponsorship from China Scholarship Council (CSC). This work was partially supported by a gift in kind from Huawei.
PY - 2020/9/30
Y1 - 2020/9/30
N2 - Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models.
We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2x. Even at 100 Gbps, OmniReduce delivers 1.4--2.9x better performance for network-bottlenecked DNNs.
AB - Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models.
We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2x. Even at 100 Gbps, OmniReduce delivers 1.4–2.9x better performance for network-bottlenecked DNNs.
UR - http://hdl.handle.net/10754/665369
UR - https://dl.acm.org/doi/10.1145/3452296.3472904
U2 - 10.1145/3452296.3472904
DO - 10.1145/3452296.3472904
M3 - Conference contribution
SN - 9781450383837
BT - Proceedings of the 2021 ACM SIGCOMM Conference
PB - ACM
ER -