TY - CONF
T1 - DC2: Delay-aware Compression Control for Distributed Machine Learning
AU - Abdelmoniem, Ahmed M.
AU - Canini, Marco
PY - 2021
AB - Distributed training performs data-parallel training of DNN models, which is a necessity for increasingly complex models and large datasets. Recent works have identified major communication bottlenecks in distributed training and seek opportunities to speed up training in systems supporting distributed ML workloads. To reduce communication, compression techniques have been proposed to speed up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control with network delays to apply compression adaptively. DC2 not only compensates for network variations but also strikes a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module for the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation on popular CNN models and datasets shows that DC2 achieves training speed-ups of up to 41× and 5.3× over no-compression and uniform-compression baselines, respectively.
UR - http://hdl.handle.net/10754/670320
UR - https://ieeexplore.ieee.org/document/9488810/
DO - 10.1109/INFOCOM42981.2021.9488810
M3 - Conference contribution
SN - 978-1-6654-3131-6
BT - IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
PB - IEEE
ER -