Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large “mega-batches” in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] = O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the optimal rate without requiring knowledge of $\sigma$.
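To make the idea of batch-free, momentum-based variance reduction concrete, here is a minimal Python sketch of one such update on a scalar parameter. The abstract does not state the update rule, so the recursion for the estimate `d`, the cube-root step size, and the momentum weight `a` below are assumptions about how such a method might look and should be checked against the paper. The function name `storm_like_sketch`, the constants `k`, `w`, `c`, and the toy objective are hypothetical and chosen only for illustration.

```python
import math
import random


def storm_like_sketch(grad, x0, steps, sample, k=0.1, w=1.0, c=100.0):
    """Momentum-based variance-reduced SGD sketch on a scalar parameter.

    grad(x, xi) -> stochastic gradient at x for the sample xi
    sample()    -> draws one fresh sample xi
    k, w, c     -> hypothetical constants (assumed, not from the abstract)
    """
    x = x0
    g = grad(x, sample())
    d = g                # first estimate is a plain stochastic gradient
    sum_sq = g * g       # running sum of squared stochastic gradients
    for _ in range(steps):
        # Adaptive step size: shrinks as observed gradient energy accumulates.
        eta = k / (w + sum_sq) ** (1.0 / 3.0)
        x_prev, x = x, x - eta * d
        # Momentum weight tied to the step size, clipped to [0, 1].
        a = min(1.0, c * eta * eta)
        xi = sample()                 # a single sample, no batch
        g = grad(x, xi)
        g_prev = grad(x_prev, xi)     # same sample, evaluated at the old point
        # Momentum plus a correction term that removes stale-gradient error.
        d = g + (1.0 - a) * (d - g_prev)
        sum_sq += g * g
    return x


def true_grad(x):
    # Gradient of the toy non-convex loss F(x) = x^2 + 3 sin(x)^2.
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def noisy_grad(x, xi):
    # Toy stochastic gradient: true gradient plus additive noise xi.
    return true_grad(x) + xi


random.seed(0)
x_final = storm_like_sketch(
    noisy_grad, x0=3.0, steps=3000, sample=lambda: random.gauss(0.0, 1.0)
)
print(abs(true_grad(x_final)))  # gradient magnitude at the returned point
```

The key design choice in this sketch is that each iteration draws one sample and evaluates it at both the current and the previous iterate, so the difference `d - g_prev` carries the old estimation error forward while the fresh gradient `g` re-centres the estimate. With additive noise, as in this toy problem, the correction makes `d` behave like an exponential average of the noise around the true gradient.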
Title of host publication: Advances in Neural Information Processing Systems
Publisher: Neural information processing systems foundation
Published: Jan 1 2019