TY - GEN
T1 - CANITA: Faster rates for distributed convex optimization with communication compression
T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
AU - Li, Zhize
AU - Richtárik, Peter
N1 - Publisher Copyright:
© 2021 Neural information processing systems foundation. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Moreover, the best theoretically and practically performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of communications (faster convergence), e.g., Nesterov’s accelerated gradient descent [31, 32] and Adam [14]. In order to combine the benefits of communication compression and convergence acceleration, we propose a compressed and accelerated gradient method based on ANITA [20] for distributed optimization, which we call CANITA. Our CANITA achieves the first accelerated rate O(√((1 + √(ω³/n)) L/ε) + ω(1/ε)^(1/3)), which improves upon the state-of-the-art non-accelerated rate O((1 + ω/n) L/ε + (ω² + ω)/(ω + n) · 1/ε) of DIANA [12] for distributed general convex problems, where ε is the target error, L is the smoothness parameter of the objective, n is the number of machines/devices, and ω is the compression parameter (larger ω means more compression can be applied, and no compression implies ω = 0). Our results show that as long as the number of devices n is large (often true in distributed/federated learning), or the compression ω is not very high, CANITA achieves the faster convergence rate O(√(L/ε)), i.e., the number of communication rounds is O(√(L/ε)) (vs. O(L/ε) achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (far fewer communication rounds).
AB - Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Moreover, the best theoretically and practically performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of communications (faster convergence), e.g., Nesterov’s accelerated gradient descent [31, 32] and Adam [14]. In order to combine the benefits of communication compression and convergence acceleration, we propose a compressed and accelerated gradient method based on ANITA [20] for distributed optimization, which we call CANITA. Our CANITA achieves the first accelerated rate O(√((1 + √(ω³/n)) L/ε) + ω(1/ε)^(1/3)), which improves upon the state-of-the-art non-accelerated rate O((1 + ω/n) L/ε + (ω² + ω)/(ω + n) · 1/ε) of DIANA [12] for distributed general convex problems, where ε is the target error, L is the smoothness parameter of the objective, n is the number of machines/devices, and ω is the compression parameter (larger ω means more compression can be applied, and no compression implies ω = 0). Our results show that as long as the number of devices n is large (often true in distributed/federated learning), or the compression ω is not very high, CANITA achieves the faster convergence rate O(√(L/ε)), i.e., the number of communication rounds is O(√(L/ε)) (vs. O(L/ε) achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (far fewer communication rounds).
UR - http://www.scopus.com/inward/record.url?scp=85122122510&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85122122510
T3 - Advances in Neural Information Processing Systems
SP - 13770
EP - 13781
BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
A2 - Ranzato, Marc'Aurelio
A2 - Beygelzimer, Alina
A2 - Dauphin, Yann
A2 - Liang, Percy S.
A2 - Wortman Vaughan, Jenn
PB - Neural information processing systems foundation
Y2 - 6 December 2021 through 14 December 2021
ER -