Momentum-based variance reduction in non-convex SGD

Ashok Cutkosky, Francesco Orabona

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

199 Scopus citations

Abstract

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large “mega-batches” in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] = O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the optimal rate and without requiring knowledge of $\sigma$.
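The abstract does not spell out the update rule, so the following is only a minimal sketch of a STORM-style step as described in the published paper: a momentum estimate that is corrected with the gradient of the same sample at the previous iterate, combined with a step size that shrinks with the cube root of the accumulated squared gradient norms. The toy loss F(x) = 0.5·x² (convex, chosen only to keep the example short), the noise model, the helper names, the constants `k`, `w`, `c`, and the clipping of the momentum weight to at most 1 are all assumptions made for illustration.

```python
import random


def noisy_grad(x, sample):
    """Stochastic gradient of the toy loss F(x) = 0.5 * x**2 (an assumption).

    sample = (scale, shift) with E[scale] = 1 and E[shift] = 0,
    so the estimate is unbiased for the true gradient x.
    """
    scale, shift = sample
    return scale * x + shift


def draw_sample(rng, sigma):
    """Draw one hypothetical data sample for the toy problem."""
    return (rng.uniform(0.5, 1.5), rng.gauss(0.0, sigma))


def storm(x0, steps, k=0.1, w=0.1, c=100.0, sigma=0.5, seed=0):
    """STORM-style loop with batch size 1; k, w, c are illustrative values."""
    rng = random.Random(seed)
    x = x0
    d = noisy_grad(x, draw_sample(rng, sigma))  # first estimate: plain gradient
    sum_sq = d * d                              # running sum of squared gradient norms
    for _ in range(steps):
        # Adaptive step size: shrinks like (sum of G_i^2)^(-1/3).
        eta = k / (w + sum_sq) ** (1.0 / 3.0)
        x_prev, x = x, x - eta * d
        # Momentum weight tied to the step size; clipping is an assumption.
        a = min(1.0, c * eta * eta)
        sample = draw_sample(rng, sigma)
        g_new = noisy_grad(x, sample)       # gradient at the new point
        g_old = noisy_grad(x_prev, sample)  # same sample, previous point
        # Momentum with a correction term that cancels shared noise.
        d = g_new + (1.0 - a) * (d - g_old)
        sum_sq += g_new * g_new
    return x


if __name__ == "__main__":
    print(abs(storm(x0=5.0, steps=2000)))
```

Evaluating both gradients on the same sample is what distinguishes this from ordinary momentum: the difference `d - g_old` removes noise common to consecutive iterates, so no large batch is needed to obtain a low-variance estimate.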
Original language: English (US)
Title of host publication: Advances in Neural Information Processing Systems
Publisher: Neural information processing systems foundation
State: Published - Jan 1 2019
Externally published: Yes
