Better Performance of SGD Than We Expected

Updated: January 10, 2023

This post summarizes the review paper “On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks” by researchers including Umut Simsekli of Télécom Paris.

Current deep learning methods (apart from alternatives such as the Forward-Forward algorithm) train models by gradient descent on a loss function. Stochastic Gradient Descent (SGD) does not compute the gradient over the entire dataset at every step; instead, each update uses a randomly sampled subset (a minibatch) of the data. Using all the data at once yields a deterministic gradient direction, whereas the random sampling in SGD makes the trajectory of the model weights behave like a random walk.
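
The difference between the full gradient and a minibatch gradient can be made concrete with a small sketch. The toy linear-regression problem, the names `gradient`, `noise`, `eta` and `batch_size`, and all numeric values are assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem (illustrative only):
# loss(w) = mean((X @ w - y) ** 2) / 2
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def gradient(w, idx):
    """Gradient of the loss averaged over the rows listed in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

w = np.zeros(d)
full_grad = gradient(w, np.arange(n))      # deterministic: uses all the data

batch_size = 32
batch = rng.choice(n, size=batch_size, replace=False)
mini_grad = gradient(w, batch)             # random: depends on the sampled batch

# Gradient noise: the random part that turns the trajectory into a random walk.
noise = mini_grad - full_grad

# A short SGD run with learning rate eta.
eta = 0.05
for _ in range(500):
    batch = rng.choice(n, size=batch_size, replace=False)
    w = w - eta * gradient(w, batch)
```

Every SGD step follows the full gradient plus a fresh draw of `noise`. The question the paper asks is what distribution that noise follows.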

The paper gives a more rigorous analysis of how a model behaves when trained with SGD. Earlier studies typically assumed, without much justification, that this random walk is Brownian motion, that is, that the gradient noise is Gaussian. The assumption simplifies the stochastic differential equations that describe the motion of the weights, and the large body of existing work on Brownian particles made it easier to reason about training through that analogy.

However, if the model’s movement differs significantly from Brownian motion, most results built on this assumption become questionable. The paper argues that the trajectory under SGD is better described by a Lévy walk than by Brownian motion, and revises several analytical results accordingly.
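
In equations, the two modelling choices can be sketched as follows. The notation is an assumption made here for illustration rather than a quotation from the paper: w is the weight vector, f the full loss, \tilde f_k the minibatch loss at step k, \eta the learning rate, \sigma the noise scale, B_t Brownian motion, and L^\alpha_t a symmetric α-stable Lévy motion with tail index α in (0, 2].

```latex
% SGD step written as a full-gradient step plus gradient noise U_k
w_{k+1} = w_k - \eta \nabla \tilde f_k(w_k)
        = w_k - \eta \nabla f(w_k) - \eta\, U_k(w_k),
\qquad U_k(w) = \nabla \tilde f_k(w) - \nabla f(w)

% Gaussian noise assumption: continuous-time limit driven by Brownian motion
\mathrm{d}w_t = -\nabla f(w_t)\,\mathrm{d}t + \sqrt{\eta}\,\sigma\,\mathrm{d}B_t

% Heavy-tailed (symmetric alpha-stable) assumption: driven by Levy motion
\mathrm{d}w_t = -\nabla f(w_t)\,\mathrm{d}t
              + \eta^{(\alpha-1)/\alpha}\,\sigma\,\mathrm{d}L^{\alpha}_t
```

For α = 2 the stable motion is Gaussian and the second equation reduces to the first, up to a constant factor in the noise scale. For α < 2 the driving noise has infinite variance and the paths contain jumps.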

[Figure: Lévy walk vs. Brownian motion]
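
The qualitative difference between the two kinds of random walk can be reproduced with a small simulation. This is a sketch under assumptions: the helper `symmetric_stable` is a hypothetical name that draws symmetric α-stable increments with the Chambers-Mallows-Stuck method, the two coordinates are sampled independently, and α = 1.5 is an arbitrary heavy-tailed choice, not a value taken from the paper.

```python
import numpy as np

def symmetric_stable(alpha, size, rng):
    """Draw symmetric alpha-stable samples (Chambers-Mallows-Stuck method).

    alpha = 2 gives a Gaussian (with variance 2); alpha < 2 gives heavy tails.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # random angle
    e = rng.exponential(1.0, size)                 # unit exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos((1 - alpha) * u) / e) ** ((1 - alpha) / alpha))

rng = np.random.default_rng(0)
n_steps = 10_000

# Increments of a 2-D walk: Gaussian (alpha = 2) versus heavy-tailed (alpha = 1.5).
brownian_steps = symmetric_stable(2.0, (n_steps, 2), rng)
levy_steps = symmetric_stable(1.5, (n_steps, 2), rng)

# The walks themselves are the running sums of the increments.
brownian_path = np.cumsum(brownian_steps, axis=0)
levy_path = np.cumsum(levy_steps, axis=0)

largest_brownian_jump = np.abs(brownian_steps).max()
largest_levy_jump = np.abs(levy_steps).max()
print(f"largest Brownian step: {largest_brownian_jump:.1f}")
print(f"largest Levy step:     {largest_levy_jump:.1f}")
```

Plotting `brownian_path` and `levy_path` should show a dense, locally wandering cloud for the Gaussian case and long straight jumps between clusters for the heavy-tailed case.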

Results under the Brownian motion assumption:

- The ratio learning rate / batch size determines the width of the final minimum (wider minima correlate with better performance)
- Time to move between minima increases exponentially with dimension
  - Reaching a minimum takes polynomial time
  - Escaping from one minimum to another takes exponential time
- Escape time increases exponentially with the depth of the minimum
  - It increases only polynomially with the width of the minimum

Empirical results:

- The noise looks Gaussian at first, but large jumps are frequently observed over longer periods
- Training takes less time than the theory predicts
- Training often converges to wide minima

Proposed framework:

- The Brownian motion assumption stems from the classical central limit theorem (CLT), with the descent force understood as an average over data points
- The classical CLT requires finite variance, and this assumption frequently fails
- In that case the descent force has infinite variance
- The result is a heavy-tailed random walk, known as a Lévy walk
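
Whether a sample of noise is Gaussian or heavy-tailed can be checked by estimating its tail index α, where α = 2 corresponds to the Gaussian case and α < 2 to infinite variance. The sketch below is one simple estimator, based on block sums, that is valid for symmetric α-stable samples. The function name `estimate_tail_index`, the block size and the synthetic test data are assumptions for illustration, and this is not necessarily the procedure used in the paper.

```python
import numpy as np

def estimate_tail_index(x, block_size):
    """Estimate alpha for symmetric alpha-stable samples via block sums.

    A sum of m i.i.d. symmetric alpha-stable variables is distributed like
    m ** (1 / alpha) times a single one, so
    mean(log|block sum|) - mean(log|x|) is approximately log(m) / alpha.
    """
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // block_size
    x = x[: n_blocks * block_size]               # drop the incomplete last block
    block_sums = x.reshape(n_blocks, block_size).sum(axis=1)
    inv_alpha = (np.mean(np.log(np.abs(block_sums)))
                 - np.mean(np.log(np.abs(x)))) / np.log(block_size)
    return 1.0 / inv_alpha

rng = np.random.default_rng(0)
gaussian_noise = rng.normal(size=100_000)          # finite variance, alpha = 2
cauchy_noise = rng.standard_cauchy(size=100_000)   # infinite variance, alpha = 1

alpha_gaussian = estimate_tail_index(gaussian_noise, block_size=100)
alpha_cauchy = estimate_tail_index(cauchy_noise, block_size=100)
print(f"Gaussian sample: alpha ~ {alpha_gaussian:.2f}")
print(f"Cauchy sample:   alpha ~ {alpha_cauchy:.2f}")
```

Applied to real gradient noise (minibatch gradient minus full gradient, flattened into a vector), an estimate well below 2 would point to the heavy-tailed regime, with the caveat that the estimator assumes the noise is symmetric α-stable in the first place.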

Results under the Lévy walk assumption:

- The motion becomes discontinuous (it has jumps) rather than continuous
- Escape time no longer depends on the depth of the minimum
  - Instead, it depends on the width of the minimum
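
The contrast in escape times between the two noise models can be summarized by their scaling, up to constants. The symbols are assumptions introduced here for illustration: \tau is the time to leave the basin of a minimum, \Delta H its depth (barrier height), w its width, \varepsilon the noise scale, \alpha the tail index, and c, C positive constants.

```latex
% Brownian noise: escape time grows exponentially with the depth of the minimum
\mathbb{E}[\tau_{B}] \sim \exp\!\left( \frac{c\,\Delta H}{\varepsilon^{2}} \right)

% Levy (alpha-stable) noise: escape time grows polynomially with the width
% and does not depend on the depth
\mathbb{E}[\tau_{L}] \sim C\,\frac{w^{\alpha}}{\varepsilon^{\alpha}},
\qquad 0 < \alpha < 2
```

Under Brownian noise the process has to climb out of the well, so a deep minimum traps it for an exponentially long time. Under Lévy noise a single large jump can carry it out, so only the distance to the edge of the basin matters. This is consistent with narrow minima being left quickly and wide minima being preferred.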

Key findings:

- Gradient noise is non-Gaussian
- Batch size has less impact than previously thought
- The large gradient-noise jumps seen early in training die down rapidly, leading to sudden early improvements in accuracy


Myeongseon Choi
