Distributed learning with compressed gradient differences*

K. Mishchenko*, E. Gorbunov, M. Takáč, P. Richtárik

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Training large machine learning models requires a distributed computing approach, with the communication of model updates being the bottleneck. For this reason, several methods based on compressing the updates via sparsification or quantization were recently proposed. However, none of these prior methods is able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new method, DIANA, which resolves this issue via compression of gradient differences. We provide theory in the strongly convex and nonconvex settings that shows improved convergence rates, and use it to obtain the first convergence rate for the previously proposed method TernGrad. Finally, we provide theory to support non-smooth regularizers.
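
The abstract only names the mechanism, so below is a minimal sketch, in Python, of the gradient-difference compression idea it describes: each worker keeps a shift vector, compresses the difference between its current gradient and that shift, and the server averages the reconstructed gradients for its step. The random-k compressor, the step sizes gamma and alpha, and the toy quadratic objectives are illustrative assumptions, not the paper's exact algorithm, analysis, or tuning.

import numpy as np

def rand_k_compress(v, k, rng):
    """Unbiased random-k sparsification: keep k random coordinates, rescale by d/k."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

def diana_sketch(grads, x0, gamma=0.05, alpha=0.1, k=2, steps=200, seed=0):
    """Gradient-difference compression with n workers (illustrative sketch only).

    grads: list of callables; grads[i](x) returns worker i's local gradient.
    Each worker keeps a shift h[i], compresses g_i - h[i], and the server
    averages the reconstructed gradients h[i] + delta_i before its step.
    """
    rng = np.random.default_rng(seed)
    n, d = len(grads), x0.size
    x = x0.astype(float).copy()
    h = np.zeros((n, d))                      # per-worker shifts, mirrored by the server
    for _ in range(steps):
        deltas = np.empty((n, d))
        for i in range(n):
            g_i = grads[i](x)
            deltas[i] = rand_k_compress(g_i - h[i], k, rng)  # only this is communicated
        g_hat = (h + deltas).mean(axis=0)     # server reconstructs the averaged gradient
        h += alpha * deltas                   # workers and server update shifts identically
        x -= gamma * g_hat                    # plain gradient step on the estimate
    return x

# Toy usage: two workers with quadratics f_i(x) = 0.5 * ||x - b_i||^2; the optimum is mean(b).
if __name__ == "__main__":
    b = [np.array([1.0, 0.0, 3.0, -2.0]), np.array([-1.0, 2.0, 1.0, 0.0])]
    grads = [lambda x, b_i=b_i: x - b_i for b_i in b]
    x_out = diana_sketch(grads, x0=np.zeros(4))
    print(x_out, "vs optimum", np.mean(b, axis=0))

Because the shifts h[i] adapt to the local gradients over time, the compressed messages shrink to zero near the optimum, which is what allows convergence to the true solution in the batch setting despite the lossy compression.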

Original language: English (US)
Journal: Optimization Methods and Software
State: Accepted/In press - 2024

Keywords

  • communication compression
  • convex optimization
  • distributed optimization
  • federated learning
  • non-convex optimization

ASJC Scopus subject areas

  • Software
  • Control and Optimization
  • Applied Mathematics
