Global-QSGD: Allreduce-Compatible Quantization for Distributed Learning with Theoretical Guarantees

Jihao Xin, Marco Canini

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Distributed training enables large-scale deep learning, but suffers from high communication overhead, especially as models and datasets grow. Gradient compression, particularly quantization, is a promising approach to mitigate this bottleneck. However, existing quantization schemes are often incompatible with Allreduce, the dominant communication primitive in distributed deep learning, and many prior solutions rely on heuristics without theoretical guarantees. We introduce Global-QSGD, an Allreduce-compatible gradient quantization method that leverages global norm scaling to reduce communication overhead while preserving accuracy. Global-QSGD is backed by rigorous theoretical analysis, extending standard unbiased compressor frameworks to establish formal convergence guarantees. Additionally, we develop a performance model to evaluate its impact across different hardware configurations. Extensive experiments on NVLink, PCIe, and large-scale cloud environments show that Global-QSGD accelerates distributed training by up to 3.51× over baseline quantization methods, making it a practical and efficient solution for large-scale deep learning workloads.
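For intuition, the following is a minimal NumPy sketch of the idea the abstract describes: every worker quantizes against a single shared global norm, so the resulting integer levels from all workers can be summed directly by an integer Allreduce and dequantized once with the same scale. The function names, the level count, the int32 encoding, and the assumption that the global norm is precomputed (e.g., via a preliminary Allreduce of local norms) are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def global_qsgd_quantize(grad, global_norm, num_levels=255, rng=None):
    """Unbiased stochastic quantization against a *shared* global norm.

    Hypothetical encoding: because every worker scales by the same
    `global_norm` (assumed to bound every |coordinate|), the signed
    integer levels from all workers can be summed with a plain integer
    Allreduce and dequantized once on the result.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Map each |coordinate| to [0, num_levels] relative to the global norm.
    scaled = np.abs(grad) / global_norm * num_levels
    lower = np.floor(scaled)
    # Stochastic rounding: round up with probability equal to the
    # remainder, which makes the compressor unbiased (E[level] = scaled).
    levels = lower + (rng.random(grad.shape) < scaled - lower)
    return (np.sign(grad) * levels).astype(np.int32)

def global_qsgd_dequantize(summed_levels, global_norm, num_levels=255):
    # Dequantization is linear in the levels, so one call on the
    # Allreduce-summed integers recovers the summed gradient estimate.
    return summed_levels * (global_norm / num_levels)

# Hypothetical usage with n workers sharing one precomputed global norm:
#   q_i = global_qsgd_quantize(grad_i, g_norm)      # on each worker
#   s   = allreduce_sum(q_i)                        # integer Allreduce
#   avg = global_qsgd_dequantize(s, g_norm) / n     # identical everywhere
```

The linearity of dequantization is what per-worker norm scaling (as in standard QSGD) lacks: with per-worker scales, summed integers cannot be decoded with a single factor, which is why such schemes fall back to Allgather instead of Allreduce.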

Original language: English (US)
Title of host publication: EuroMLSys 2025 - Proceedings of the 2025 5th Workshop on Machine Learning and Systems
Publisher: Association for Computing Machinery, Inc
Pages: 216-229
Number of pages: 14
ISBN (Electronic): 9798400715389
DOIs
State: Published - Apr 1, 2025
Event: 5th Workshop on Machine Learning and Systems (EuroMLSys 2025), held in conjunction with ACM EuroSys 2025 and ASPLOS 2025 - Rotterdam, Netherlands
Duration: Mar 30, 2025 - Apr 3, 2025

Publication series

Name: EuroMLSys 2025 - Proceedings of the 2025 5th Workshop on Machine Learning and Systems

Conference

Conference: 5th Workshop on Machine Learning and Systems (EuroMLSys 2025), held in conjunction with ACM EuroSys 2025 and ASPLOS 2025
Country/Territory: Netherlands
City: Rotterdam
Period: 03/30/25 - 04/03/25

Keywords

  • collective communication
  • distributed training
  • gradient compression

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Human-Computer Interaction
  • Information Systems
  • Software
  • Control and Systems Engineering
