DC2: Delay-aware Compression Control for Distributed Machine Learning

Ahmed M. Abdelmoniem, Marco Canini

Research output: Conference contribution (chapter in book/report/conference proceeding)


Abstract

Distributed training performs data-parallel training of DNN models, a necessity for increasingly complex models and large datasets. Recent works identify communication as a major bottleneck in distributed training and seek opportunities to speed up training in systems supporting distributed ML workloads. To reduce communication, compression techniques have been proposed to speed up the communication phase. However, compression comes at the cost of reduced model accuracy, especially when it is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control with network delays, applying compression adaptively. DC2 not only compensates for network variations but also strikes a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module for the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting both low and high delay variations. Our evaluation on popular CNN models and datasets shows that DC2 achieves training speed-ups of up to 41× and 5.3× over baselines with no compression and uniform compression, respectively.
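To make the idea concrete, the sketch below illustrates (in a hedged, simplified form) what a delay-aware compression controller could look like: measure the communication delay of each iteration and adapt a top-k sparsification ratio in response. This is not the authors' implementation; the class name `DelayAwareCompressor` and the parameters `target_delay`, `min_ratio`, `max_ratio`, and `step` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (NOT the DC2 implementation) of delay-aware compression
# control: adapt a top-k sparsification ratio to the observed network delay.
import numpy as np


class DelayAwareCompressor:
    def __init__(self, target_delay, min_ratio=0.001, max_ratio=1.0, step=0.5):
        self.target_delay = target_delay  # desired per-iteration comm. delay (s)
        self.min_ratio = min_ratio        # most aggressive compression allowed
        self.max_ratio = max_ratio        # ratio of 1.0 means no compression
        self.step = step                  # multiplicative adjustment factor
        self.ratio = max_ratio            # start without compression

    def update(self, observed_delay):
        """Adapt the sparsification ratio to the last observed network delay."""
        if observed_delay > self.target_delay:
            # Network is slow: compress more (keep fewer gradient elements).
            self.ratio = max(self.min_ratio, self.ratio * self.step)
        else:
            # Network is fast: relax compression to preserve accuracy.
            self.ratio = min(self.max_ratio, self.ratio / self.step)

    def compress(self, grad):
        """Top-k sparsification: keep the largest-magnitude fraction `ratio`."""
        flat = grad.ravel()
        k = max(1, int(self.ratio * flat.size))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        return idx, flat[idx], grad.shape

    @staticmethod
    def decompress(idx, values, shape):
        flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
        flat[idx] = values
        return flat.reshape(shape)


# Usage: wrap the gradient exchange step, time it, and feed the delay back.
if __name__ == "__main__":
    comp = DelayAwareCompressor(target_delay=0.05)
    grad = np.random.randn(1 << 20).astype(np.float32)
    idx, vals, shape = comp.compress(grad)
    # ... send (idx, vals) over the network and measure the elapsed time ...
    comp.update(observed_delay=0.08)   # pretend the network was slow
    print("new ratio:", comp.ratio)
```

In a real deployment this controller would sit inside the communication library (e.g., hooked into the all-reduce or parameter-server push path), which matches the paper's description of DC2 as a drop-in module; the control policy above is a deliberately simple stand-in for whatever policy DC2 actually uses.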
Original language: English (US)
Title of host publication: IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
Publisher: IEEE
ISBN (Print): 978-1-6654-3131-6
State: Published - 2021
