TY - GEN

T1 - PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

AU - Li, Zhize

AU - Bao, Hongyan

AU - Zhang, Xiangliang

AU - Richtárik, Peter

N1 - KAUST Repository Item: Exported on 2021-10-07
Acknowledged KAUST grant number(s): URF/1/3756-01-01
Acknowledgements: Zhize Li and Peter Richtarik were supported by KAUST Baseline Research Fund. Hongyan Bao and Xiangliang Zhang were supported by KAUST Competitive Research Grant URF/1/3756-01-01.

PY - 2021

Y1 - 2021

N2 - In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability p_t, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability 1 − p_t. We give a simple formula for the optimal choice of p_t. Moreover, we prove the first tight lower bound Ω(n + √n/ε²) for nonconvex finite-sum problems, which also leads to a tight lower bound Ω(b + √b/ε²) for nonconvex online problems, where b := min{σ²/ε², n}. Then, we show that PAGE obtains the optimal convergence results O(n + √n/ε²) (finite-sum) and O(b + √b/ε²) (online), matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate O(log(1/ε)). Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, showing that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.

AB - In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability p_t, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability 1 − p_t. We give a simple formula for the optimal choice of p_t. Moreover, we prove the first tight lower bound Ω(n + √n/ε²) for nonconvex finite-sum problems, which also leads to a tight lower bound Ω(b + √b/ε²) for nonconvex online problems, where b := min{σ²/ε², n}. Then, we show that PAGE obtains the optimal convergence results O(n + √n/ε²) (finite-sum) and O(b + √b/ε²) (online), matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate O(log(1/ε)). Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, showing that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.

UR - http://hdl.handle.net/10754/665111

UR - https://arxiv.org/pdf/2008.10898

M3 - Conference contribution

BT - International Conference on Machine Learning (ICML)

PB - arXiv

ER -