TY - GEN
T1 - Global-QSGD
T2 - 5th Workshop on Machine Learning and Systems, EuroMLSys 2025, held in conjunction with ACM EuroSys 2025 and ASPLOS 2025
AU - Xin, Jihao
AU - Canini, Marco
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/4/1
Y1 - 2025/4/1
N2 - Distributed training enables large-scale deep learning, but suffers from high communication overhead, especially as models and datasets grow. Gradient compression, particularly quantization, is a promising approach to mitigate this bottleneck. However, existing quantization schemes are often incompatible with Allreduce, the dominant communication primitive in distributed deep learning, and many prior solutions rely on heuristics without theoretical guarantees. We introduce Global-QSGD, an Allreduce-compatible gradient quantization method that leverages global norm scaling to reduce communication overhead while preserving accuracy. Global-QSGD is backed by rigorous theoretical analysis, extending standard unbiased compressor frameworks to establish formal convergence guarantees. Additionally, we develop a performance model to evaluate its impact across different hardware configurations. Extensive experiments on NVLink, PCIe, and large-scale cloud environments show that Global-QSGD accelerates distributed training by up to 3.51× over baseline quantization methods, making it a practical and efficient solution for large-scale deep learning workloads.
KW - collective communication
KW - distributed training
KW - gradient compression
UR - http://www.scopus.com/inward/record.url?scp=105003637914&partnerID=8YFLogxK
U2 - 10.1145/3721146.3721932
DO - 10.1145/3721146.3721932
M3 - Conference contribution
AN - SCOPUS:105003637914
T3 - EuroMLSys 2025 - Proceedings of the 2025 5th Workshop on Machine Learning and Systems
SP - 216
EP - 229
BT - EuroMLSys 2025 - Proceedings of the 2025 5th Workshop on Machine Learning and Systems
PB - Association for Computing Machinery, Inc
Y2 - 30 March 2025 through 3 April 2025
ER -