TY - JOUR

T1 - Differentially Quantized Gradient Methods

AU - Lin, Chung-Yi

AU - Kostina, Victoria

AU - Hassibi, Babak

N1 - KAUST Repository Item: Exported on 2022-09-12
Acknowledgements: This work was supported in part by the National Science Foundation (NSF) under Grant CCF-1751356, Grant CCF-1956386, Grant CNS-0932428, Grant CCF-1018927, Grant CCF-1423663, and Grant CCF-1409204; in part by a grant from Qualcomm Inc.; in part by the National Aeronautics and Space Administration’s (NASA) Jet Propulsion Laboratory through the President and Director’s Fund; and in part by the King Abdullah University of Science and Technology. The authors would like to thank Dr. Himanshu Tyagi for pointing out related works [19], [55]; Dr. Vincent Tan for bringing a known result on the worst-case contraction factor of unquantized GD [75] to their attention; Dr. Victor Kozyakin for a helpful discussion about joint spectral radius; and two anonymous reviewers for detailed comments.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.

PY - 2022/4/28

Y1 - 2022/4/28

N2 - Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients while a server decides when to stop iterative computation based on its target accuracy or delay constraints. The server receives all its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call differential quantization (DQ) that prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that differentially quantized gradient descent (DQ-GD) attains a linear contraction factor of $\max \{\sigma _{\mathrm {GD}}, \rho _{n} 2^{-R}\}$ , where $\sigma _{\mathrm {GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho _{n} \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$ . Thus at any $R\geq \log _{2} \rho _{n} /\sigma _{\mathrm {GD}}$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no algorithm within a certain class can converge faster than $\max \{\sigma _{\mathrm {GD}}, 2^{-R}\}$ . Since quantizers exist with $\rho _{n} \to 1$ as $n \to \infty $ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. In contrast, naively quantized GD where the worker directly quantizes the gradient barely attains $\sigma _{\mathrm {GD}} + \rho _{n}2^{-R}$ . The principle of differential quantization continues to apply to gradient methods with momentum such as Nesterov’s accelerated gradient descent, and Polyak’s heavy ball method. For these algorithms as well, if the rate is above a certain threshold, there is no loss in contraction factor obtained by the differentially quantized algorithm compared to its unquantized counterpart, and furthermore, the differentially quantized heavy ball method attains the optimal contraction achievable among all (even unquantized) gradient methods. Experimental results on least-squares problems validate our theoretical analysis.

AB - Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients while a server decides when to stop iterative computation based on its target accuracy or delay constraints. The server receives all its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call differential quantization (DQ) that prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that differentially quantized gradient descent (DQ-GD) attains a linear contraction factor of $\max \{\sigma _{\mathrm {GD}}, \rho _{n} 2^{-R}\}$ , where $\sigma _{\mathrm {GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho _{n} \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$ . Thus at any $R\geq \log _{2} \rho _{n} /\sigma _{\mathrm {GD}}$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no algorithm within a certain class can converge faster than $\max \{\sigma _{\mathrm {GD}}, 2^{-R}\}$ . Since quantizers exist with $\rho _{n} \to 1$ as $n \to \infty $ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. In contrast, naively quantized GD where the worker directly quantizes the gradient barely attains $\sigma _{\mathrm {GD}} + \rho _{n}2^{-R}$ . The principle of differential quantization continues to apply to gradient methods with momentum such as Nesterov’s accelerated gradient descent, and Polyak’s heavy ball method. For these algorithms as well, if the rate is above a certain threshold, there is no loss in contraction factor obtained by the differentially quantized algorithm compared to its unquantized counterpart, and furthermore, the differentially quantized heavy ball method attains the optimal contraction achievable among all (even unquantized) gradient methods. Experimental results on least-squares problems validate our theoretical analysis.

UR - http://hdl.handle.net/10754/681138

UR - https://ieeexplore.ieee.org/document/9764884/

UR - http://www.scopus.com/inward/record.url?scp=85129589744&partnerID=8YFLogxK

U2 - 10.1109/TIT.2022.3171173

DO - 10.1109/TIT.2022.3171173

M3 - Article

SN - 1557-9654

VL - 68

SP - 6078

EP - 6097

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

IS - 9

ER -