TY - GEN
T1 - Smoothness Matrices Beat Smoothness Constants: Better Communication Compression Techniques for Distributed Optimization
AU - Safaryan, Mher
AU - Hanzely, Filip
AU - Richtárik, Peter
N1 - KAUST Repository Item: Exported on 2022-06-23
PY - 2021/1/1
Y1 - 2021/1/1
AB - Large-scale distributed optimization has become the default tool for training supervised machine learning models with a large number of parameters and large training datasets. Recent advancements in the field provide several mechanisms for speeding up training, including compressed communication, variance reduction, and acceleration. However, none of these methods is capable of exploiting the inherently rich, data-dependent smoothness structure of the local losses beyond standard smoothness constants. In this paper, we argue that when training supervised models, smoothness matrices, information-rich generalizations of the ubiquitous smoothness constants, can and should be exploited for further dramatic gains, both in theory and in practice. To further alleviate the communication burden inherent in distributed optimization, we propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with local losses. To showcase the power of this tool, we describe how our sparsification technique can be adapted to three distributed optimization algorithms: DCGD [Khirirat et al., 2018], DIANA [Mishchenko et al., 2019], and ADIANA [Li et al., 2020], yielding significant savings in communication complexity. The new methods always outperform the baselines, often dramatically so.
UR - http://hdl.handle.net/10754/667470
UR - https://arxiv.org/pdf/2102.07245
UR - http://www.scopus.com/inward/record.url?scp=85131932905&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9781713845393
SP - 25688
EP - 25702
BT - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
PB - Neural Information Processing Systems Foundation
ER -