SGD Performs Better Than We Expected
This post summarizes the review paper “On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks” by researchers at TELECOM-PARIS including Umut Simsekli.
In current deep learning practice (except when using the Forward-Forward algorithm), models are trained by gradient descent on a loss function. Stochastic Gradient Descent (SGD) does not compute the gradient over the full dataset at every step; instead, each update uses the gradient of a randomly sampled subset of the data (a mini-batch). With the full dataset the gradient direction is deterministic, but under SGD the random sampling adds noise to every step, so the trajectory of the model weights can be characterized as a “random walk”.
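To make the mini-batch idea concrete, here is a minimal sketch of SGD on a synthetic least-squares problem. The data, the helper names (`make_data`, `grad`, `sgd`), and all hyperparameters are assumptions chosen for illustration; none of them come from the paper. The last lines compute the "gradient noise" at a fixed point, i.e. the difference between one mini-batch gradient and the full-data gradient.

```python
import numpy as np

def make_data(n=1000, d=5, seed=0):
    """Synthetic linear-regression data (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = np.arange(1.0, d + 1.0)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y, w_true

def grad(w, X, y):
    """Gradient of the mean squared error (1/n) * ||Xw - y||^2 w.r.t. w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def sgd(X, y, lr=0.05, batch_size=32, n_steps=500, seed=1):
    """Plain SGD: each step uses the gradient of a random mini-batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        w -= lr * grad(w, X[idx], y[idx])
    return w

X, y, w_true = make_data()
w_sgd = sgd(X, y)

# Gradient noise at a fixed point: mini-batch gradient minus full gradient.
w0 = np.zeros(X.shape[1])
idx = np.random.default_rng(2).choice(len(y), size=32, replace=False)
noise = grad(w0, X[idx], y[idx]) - grad(w0, X, y)

print("SGD estimate:", np.round(w_sgd, 2))
print("gradient noise norm at w0:", np.linalg.norm(noise))
```

The statistical shape of `noise` across many mini-batches is exactly what the Brownian and heavy-tailed analyses disagree about.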
This paper presents a more rigorous analysis of how models behave when trained with SGD. Previous studies often assumed, without much justification, that the model’s random walk is “Brownian motion” (that is, a walk driven by Gaussian noise). This assumption simplifies the stochastic differential equations that describe the model’s motion. In addition, Brownian particles have been studied far more extensively than other kinds of motion, which made it easier to understand deep learning training through this analogy.
However, if the model’s movement differs significantly from Brownian motion, most results based on this assumption become questionable. The paper argues that the trajectory under SGD is better described by a Levy walk than by Brownian motion, and presents the correspondingly modified analytical results.
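The contrast between the two views can be written as a pair of stochastic differential equations. The notation is an assumption made for this summary and may not match the paper's scaling: $W_t$ are the weights, $f$ is the loss, $\sigma$ is a noise scale, $B_t$ is Brownian motion, and $L^{\alpha}_t$ is a symmetric $\alpha$-stable Levy motion.

```latex
% Brownian (Gaussian-noise) model of SGD
\mathrm{d}W_t = -\nabla f(W_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t

% Heavy-tailed model of SGD
\mathrm{d}W_t = -\nabla f(W_t)\,\mathrm{d}t + \sigma\,\mathrm{d}L^{\alpha}_t,
\qquad \alpha \in (0, 2]
```

For $\alpha = 2$ the process $L^{\alpha}_t$ reduces to a (scaled) Brownian motion, so the first equation is a special case of the second. For $\alpha < 2$ the increments have infinite variance and the paths contain jumps.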

Results under Brownian motion assumption:
- The ratio of learning rate to batch size determines the width of the final minima (wider minima correlate with better performance)
- Time to reach minima increases exponentially with dimension
- While reaching minima takes polynomial time
- Time to escape from one minimum to another increases exponentially with minima depth
- Escape time increases only polynomially with minima width
Empirical results:
- Gradient noise initially looks Gaussian, but large jumps are frequently observed over longer periods
- Training time is shorter than theoretical predictions
- Many results show training converging to wide minima
Proposed Framework:
- The Brownian motion assumption stems from the classical CLT (the descent force, i.e. the mini-batch gradient, is understood as an average across data points)
- The finite-variance assumption behind the classical CLT frequently fails
- In that case the descent force has infinite variance
- The result is a heavy-tailed random walk, known as a Levy walk
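The role of the finite-variance assumption can be seen in a small experiment. This sketch compares how the spread of a batch average changes with batch size for Gaussian samples (finite variance) and Cauchy samples (infinite variance, an $\alpha$-stable law with $\alpha = 1$). The helper `batch_mean_spread` and the choice of the interquartile range as the spread measure are assumptions for illustration; the samples are synthetic, not real gradients.

```python
import numpy as np

def batch_mean_spread(sampler, batch_size, n_trials=2000, seed=0):
    """Interquartile range of the batch mean over many independent batches."""
    rng = np.random.default_rng(seed)
    means = sampler(rng, (n_trials, batch_size)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

def gaussian(rng, shape):
    return rng.standard_normal(shape)   # finite variance

def cauchy(rng, shape):
    return rng.standard_cauchy(shape)   # infinite variance

for batch_size in (4, 64, 256):
    print(batch_size,
          "gaussian:", round(batch_mean_spread(gaussian, batch_size), 3),
          "cauchy:", round(batch_mean_spread(cauchy, batch_size), 3))
```

For the Gaussian sampler the spread shrinks roughly like one over the square root of the batch size, which is the classical CLT picture. For the Cauchy sampler the average of a batch is again Cauchy-distributed, so averaging does not concentrate it at all.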
Results under Levy walk assumption:
- Motion becomes discontinuous (it contains jumps) rather than continuous
- Escape time no longer depends on minima height; it depends on minima width instead
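The escape-time claims can be explored with a toy simulation. This is a sketch under stated assumptions, not a reproduction of the paper's analysis: a one-dimensional quadratic well `U(x) = depth * (x / width)**2` on the interval `[-width, width]`, an Euler discretization, and symmetric $\alpha$-stable noise drawn with the Chambers-Mallows-Stuck method. All function names and parameter values are hypothetical.

```python
import numpy as np

def symmetric_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for standard symmetric alpha-stable noise.

    alpha = 2 gives Gaussian noise (with variance 2); alpha < 2 is heavy-tailed.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

def mean_exit_time(alpha, depth, width=1.0, eps=0.7, dt=0.005,
                   t_max=50.0, n_paths=400, seed=0):
    """Mean time for paths started at 0 to leave [-width, width]."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_paths)
    exit_time = np.full(n_paths, t_max)      # paths still inside are censored at t_max
    alive = np.ones(n_paths, dtype=bool)
    for step in range(1, int(t_max / dt) + 1):
        n_alive = int(alive.sum())
        if n_alive == 0:
            break
        drift = -2.0 * depth * x[alive] / width ** 2   # -U'(x)
        noise = symmetric_stable(alpha, n_alive, rng)
        x[alive] += drift * dt + eps * dt ** (1.0 / alpha) * noise
        escaped = alive & (np.abs(x) > width)
        exit_time[escaped] = step * dt
        alive &= ~escaped
    return exit_time.mean()

for alpha in (2.0, 1.5):
    shallow = mean_exit_time(alpha, depth=0.5)
    deep = mean_exit_time(alpha, depth=2.0)
    print(f"alpha={alpha}: shallow={shallow:.2f}, deep={deep:.2f}")
```

With Gaussian noise (`alpha=2.0`) the mean exit time should grow sharply as `depth` increases. The paper's claim is that with heavy-tailed noise the escape is driven by single large jumps, so the width of the well matters rather than its depth. The noise level here is not small, so this toy only hints at that limit.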
Key findings:
- Gradient noise is non-Gaussian
- Batch size has less impact than previously thought
- Initial large gradient noise jumps decrease rapidly, leading to sudden early accuracy improvements
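One simple way to check whether a set of noise samples is heavy-tailed is to estimate its tail index. The sketch below uses the Hill estimator, a standard tool chosen here for brevity; it is not necessarily the estimator used in the paper, and the samples are synthetic stand-ins for gradient noise. The function name `hill_tail_index` and the cutoff `k` are assumptions.

```python
import numpy as np

def hill_tail_index(samples, k):
    """Hill estimator of the tail index from the k largest absolute values."""
    x = np.sort(np.abs(samples))[::-1]        # descending order
    log_excess = np.log(x[:k]) - np.log(x[k])
    return 1.0 / log_excess.mean()

rng = np.random.default_rng(0)
heavy = rng.pareto(1.5, size=100_000) + 1.0   # classical Pareto, tail index 1.5
light = rng.standard_normal(100_000)          # Gaussian, no power-law tail

print("heavy-tailed samples:", round(hill_tail_index(heavy, k=1000), 2))
print("gaussian samples:", round(hill_tail_index(light, k=1000), 2))
```

An estimate well below 2 points to infinite variance. A large estimate, as for the Gaussian samples, means the data show no power-law tail at that cutoff.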