SLAMB: Accelerated Large Batch Training with Sparse Communication

Hang Xu*, Wenxuan Zhang, Jiawei Fei, Yuzhe Wu, Ting Wen Xie, Jun Huang, Yuchen Xie, Mohamed Elhoseiny, Panos Kalnis*

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review



Distributed training of large deep neural networks requires frequent exchange of massive amounts of data between machines, so communication efficiency is a major concern. Existing compressed-communication methods are either incompatible with large-batch optimization algorithms or do not provide sufficient speedup at large scale. In this paper, we combine sparsification-based gradient compression with LAMB, the layer-wise adaptive moments optimizer for large batch training. We propose SLAMB, a novel communication-efficient optimizer that supports large batch sizes and scales to thousands of GPUs. SLAMB employs momentum masking, local error compensation, and element-wise adaptive rescaling to achieve accurate layer-wise weight updates, which translates into fast convergence for very large batches. Our empirical results show that, compared to the state of the art, SLAMB transmits half the amount of data in large-batch BERT pretraining without sacrificing accuracy. Moreover, SLAMB scales excellently on large computing infrastructures. For instance, with 128 GPUs, SLAMB reduces the training time of Swin Transformer pretraining on ImageNet to 5.35 hours, which is 2 hours faster than the state of the art. At the extreme, we trained BERT-XL (2.8B parameters) on 1,024 NVIDIA A100 GPUs, where SLAMB achieved 90% scaling efficiency.
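The combination of gradient sparsification with local error compensation mentioned in the abstract can be illustrated with a generic error-feedback scheme: each worker transmits only the top-k entries of its (compensated) gradient and accumulates the untransmitted residual locally, so no gradient information is permanently lost. The sketch below is a minimal, illustrative version of that generic technique only — the function name is hypothetical and this is not the paper's SLAMB implementation, which additionally involves momentum masking and element-wise adaptive rescaling inside LAMB.

```python
import numpy as np

def topk_sparsify_with_error_feedback(grad, error, k):
    """Keep the k largest-magnitude entries of (grad + error) for
    transmission; accumulate everything else as the new local error.
    Illustrative sketch of sparsified communication with local error
    compensation, not the SLAMB optimizer itself."""
    compensated = grad + error                    # fold in residual from previous step
    idx = np.argsort(np.abs(compensated))[-k:]    # indices of the k largest magnitudes
    sparse = np.zeros_like(compensated)
    sparse[idx] = compensated[idx]                # entries that would be communicated
    new_error = compensated - sparse              # residual kept locally for next step
    return sparse, new_error

# Example: only 2 of 4 entries are transmitted; the rest stay in the error buffer.
g = np.array([0.1, -2.0, 0.05, 3.0])
e = np.zeros_like(g)
sparse_g, new_e = topk_sparsify_with_error_feedback(g, e, k=2)
```

Because `sparse + new_error` always equals the compensated gradient, the residual re-enters the update at the next step, which is what allows such schemes to converge despite transmitting only a fraction of the gradient.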

Original language: English (US)
Number of pages: 25
State: Published - 2023
Event: 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: Jul 23, 2023 - Jul 29, 2023


Conference: 40th International Conference on Machine Learning, ICML 2023
Country/Territory: United States

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability


