Paper Review : Generative Modeling By Estimating Gradients Of The Data Distribution
Paper information
Yang Song and Stefano Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution”, NeurIPS 32 (2019)
Abstract
We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.
Introduction to Generative modeling of images
This paper describes Score-based generative modeling (SBGM) for image generation. Before diving into the main content of the paper, let’s briefly understand generative modeling.
What is an image and an image-like object?
First, what is an image? For a computer, an image is a 2D matrix with RGB values assigned to each pixel. But is every RGB matrix an image?
[Figure: a random-noise RGB matrix (left) next to a real image (right)]
Of the two images above, only the right one is what we would call an image. A matrix with randomly assigned RGB values and no correlation between pixels can hardly be considered an image. Because objects in photos typically span many pixels, the colors of neighboring pixels must vary continuously. Because of such constraints, only a tiny fraction of RGB matrices can be called photos, and the set they form has a much smaller dimension than the entire space.
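The lack of correlation between pixels can be checked numerically. The sketch below is only an illustration: it compares a uniform-noise RGB matrix with a synthetic smooth color gradient that stands in for a photo (no real photo is loaded), and the helper name `neighbor_correlation` is an assumption made for this example.

```python
import numpy as np

def neighbor_correlation(img):
    """Pearson correlation between each pixel value and its right-hand neighbor."""
    left = img[:, :-1, :].ravel()
    right = img[:, 1:, :].ravel()
    return float(np.corrcoef(left, right)[0, 1])

rng = np.random.default_rng(0)
n = 64  # image side length (hypothetical choice)

# Random noise: every RGB value is drawn independently from [0, 1].
noise = rng.uniform(0.0, 1.0, size=(n, n, 3))

# Stand-in for a photo: smooth color gradients, so neighbors are nearly equal.
yy, xx = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
smooth = np.stack([xx, yy, 0.5 * (xx + yy)], axis=-1)

noise_corr = neighbor_correlation(noise)
smooth_corr = neighbor_correlation(smooth)
print(f"noise:  {noise_corr:.3f}")
print(f"smooth: {smooth_corr:.3f}")
```

The noise matrix should show a neighbor correlation close to 0, while the smooth matrix should be close to 1, which is the color continuity that photos need.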
So how should we understand what makes a “photo” that we want to create? Humans recognize objects in photos by holistically processing color information from pixels through their eyes, and in that process identify it as a photo. However, getting computers to recognize objects is not an easy problem. So in machine learning, we show many photos to computers to help them learn what constitutes a “photo-like” object. Under the assumption that areas around given photo data are also likely to be photos, computers learn which regions in the entire space contain “photo-like” objects.
Generative modeling
To be more mathematically precise: a vector $x$ containing the RGB information of $N \times N$ pixels is an element of the space $[0,1]^{3N^2}$. ($[0,1]$ is the range of normalized RGB values, and $3N^2$ is the number of vector components.) The goal of generative modeling is to find the function $p(x)$ that represents the distribution of “photo-like” objects in this space.
A generative model learns an approximation of this data distribution $p_{\text{data}}(x)$ and creates random RGB matrices that follow it. So how does this paper create random RGB matrices?
Langevin dynamics and Random number generation
There are several methods for generating random numbers that follow an arbitrary distribution. Two examples are the Metropolis-Hastings algorithm, a Markov chain Monte Carlo method, and the Langevin dynamics method used here.

Langevin dynamics refers to the mechanics of particles subject to “random forces”. These random forces can come from various sources. For a particle in a fluid, the random force is the sum of random collisions with fluid molecules. When modeling changes in human decision-making, the various pieces of external information are viewed as random kicks. Langevin dynamics predicts particle motion under such random forces.
Just as Newtonian dynamics has Newton’s equation, Langevin dynamics has the Langevin equation. Its general form is as follows.
\begin{equation}
m \dv{v}{t} = - \gamma v + F + \xi(t)
\end{equation}
This equation says that the acceleration of an object is determined by three forces: the drag from the fluid $-\gamma v$, the external force $F$, and the random impacts from the fluid $\xi(t)$. It is also known as the underdamped Langevin equation.
Because of the drag from the fluid, the velocity of the particle tends to converge to a specific velocity. If the drag is very strong and the velocity converges very quickly, we call the system overdamped. Since the velocity converges in a very short time, we can treat it as being determined by the forces at every moment, with the acceleration dropping to 0 immediately ($\dv{v}{t} = 0$). Substituting this condition into the underdamped Langevin equation above gives the following equation, which is called the overdamped Langevin equation.
\begin{equation}
v = \frac{1}{\gamma} F + \frac{\xi(t)}{\gamma}
\end{equation}
If the external force $F$ is determined by some potential energy $V(x)$, we can use the relationship $F = - \dv{V}{x}$ to further change the equation.
\begin{equation}
v = - \frac{1}{\gamma} \dv{V}{x} + \frac{\xi(t)}{\gamma}
\end{equation}
The position of a particle moving in this way follows a distribution that changes over time (because of the randomness), and if the distribution reaches an equilibrium state after a sufficiently long time, it is the Boltzmann distribution determined by the potential energy $V(x)$.
\begin{equation}
P(x) \propto \exp(- \frac{V(x)}{k_B T})
\end{equation}
Here, $T$ is the temperature of the system and $k_B$ is the Boltzmann constant. According to the Einstein relation, $k_B T = \gamma D$, where $D$ is the diffusion coefficient that measures the strength of the random motion.
This means that as the temperature increases, diffusion becomes strong enough to act almost independently of the potential energy, and the distribution approaches a uniform one. As the temperature decreases, diffusion weakens, and the distribution concentrates around the minima of the potential energy.
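The relation between temperature and the width of the equilibrium distribution can be checked with a small simulation. The sketch below is a minimal illustration under stated assumptions: it uses a harmonic potential $V(x) = \frac{1}{2} k x^2$ chosen for this example, a simple Euler-Maruyama time discretization, and the standard noise strength $2D$ with $D = k_B T / \gamma$, which the text above does not spell out. For this potential the Boltzmann distribution is a Gaussian with variance $k_B T / k$. The function name `simulate_overdamped` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_overdamped(grad_V, kT, gamma=1.0, dt=1e-2,
                        n_steps=2000, n_particles=5000, seed=0):
    """Euler-Maruyama integration of the overdamped Langevin equation.

    Each step applies the drift -V'(x)/gamma * dt plus a Gaussian kick of
    variance 2*D*dt, where D = kT/gamma (Einstein relation).
    """
    rng = np.random.default_rng(seed)
    D = kT / gamma
    x = np.zeros(n_particles)  # all particles start at the origin
    for _ in range(n_steps):
        kick = np.sqrt(2.0 * D * dt) * rng.standard_normal(n_particles)
        x = x - grad_V(x) / gamma * dt + kick
    return x

k = 2.0  # spring constant of V(x) = 0.5 * k * x**2 (assumed for illustration)
cold = simulate_overdamped(lambda x: k * x, kT=0.5)
hot = simulate_overdamped(lambda x: k * x, kT=2.0)

print("cold variance:", cold.var(), "Boltzmann prediction:", 0.5 / k)
print("hot variance: ", hot.var(), "Boltzmann prediction:", 2.0 / k)
```

The measured variances should land near $k_B T / k$ in both cases, up to sampling noise and a small bias from the finite time step, with the hotter system spreading further from the potential minimum.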
So if we choose the temperature $T$ and the potential energy $V(x)$ appropriately, we can obtain any desired equilibrium distribution. If we set the potential energy to $V(x) = - k_B T \log p(x)$ for some probability distribution $p(x)$, the Boltzmann factor becomes $\exp(\log p(x)) = p(x)$, so the equilibrium distribution of the particles after a long time is exactly $p(x)$. This is how random numbers following the data distribution are generated in this paper, and why the score $\nabla_x \log p(x)$ plays the role of the force (up to the factor $k_B T$).
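This sampling rule can be sketched in a few lines. Discretizing the overdamped equation with step size $\epsilon$ gives the update $x \leftarrow x + \frac{\epsilon}{2} \nabla_x \log p(x) + \sqrt{\epsilon}\, z$ with $z$ standard normal, which is the Langevin update used in the paper. The sketch below is an illustration only: the target is a hypothetical 1D mixture of two Gaussians whose score is known in closed form (in the paper the score is estimated by a neural network instead), and the names `mixture_score` and `langevin_sample`, the step size, and the number of steps are assumptions chosen for this example.

```python
import numpy as np

def mixture_score(x, means, sigma, weights):
    """Score d/dx log p(x) of a 1D Gaussian mixture with shared std sigma."""
    diff = means[None, :] - x[:, None]                  # shape (n, K)
    log_r = np.log(weights)[None, :] - 0.5 * (diff / sigma) ** 2
    log_r -= log_r.max(axis=1, keepdims=True)           # numerical stability
    resp = np.exp(log_r)
    resp /= resp.sum(axis=1, keepdims=True)             # responsibilities
    return (resp * diff).sum(axis=1) / sigma**2

def langevin_sample(score_fn, x0, eps=0.01, n_steps=2000, seed=0):
    """Run x <- x + (eps/2) * score(x) + sqrt(eps) * z on many chains at once."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eps * score_fn(x) + np.sqrt(eps) * z
    return x

means = np.array([-2.0, 2.0])
weights = np.array([0.5, 0.5])
sigma = 1.0

# Start far from the target: the initial distribution is uniform on [-6, 6].
x0 = np.random.default_rng(1).uniform(-6.0, 6.0, size=5000)
samples = langevin_sample(lambda x: mixture_score(x, means, sigma, weights), x0)

# The target has mean 0 and variance sigma**2 + 2**2 = 5.
print("sample mean:", samples.mean(), "sample variance:", samples.var())
```

Only the score enters the update, never $p(x)$ itself or its normalization constant, which is why learning the score is enough for sampling.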
Characteristics of Score-based generative modeling
A generative model is a generator that outputs random numbers following a given data distribution.
Now let’s move on to the score-based modeling introduced in this paper.
Basically, this model learns the gradient of the log data density (the score) and generates images by applying Langevin dynamics with this gradient as the force.
Although this structure is very simple, it does not perform well on its own, so let’s look into the reasons why.
Main challenges in learning the data distribution
Learning the data distribution involves two main challenges that significantly affect performance.
Low-dimensionality of data
As mentioned earlier, a random arrangement of colors is not a photo. Only arrays that satisfy countless constraints are recognized as photos. This means that the region of color arrays recognized as photos is not merely a small fraction of the entire space but a lower-dimensional subset of it. Just as a plane or a line occupies zero volume in 3-dimensional space, the region recognized as photos occupies zero volume in the entire space. So at almost every point in the color array space the probability density of photos is 0, its gradient carries no information (and the score $\nabla_x \log p(x)$ is not even well defined there), and Langevin dynamics cannot be guided toward the photo-like region.
The solution is to add noise to the photos. Since noise is uncorrelated between pixels, it can move data points in any direction, regardless of the constraints for being a photo, and as a result the region of noise-perturbed photos has nonzero volume.
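A toy version of this argument can be verified numerically. The sketch below is a minimal illustration, assuming 2D points that lie exactly on a line as a stand-in for photos on a low-dimensional manifold. The direction of the line, the noise level `sigma`, and the variable names are all hypothetical choices for this example. The smallest eigenvalue of the covariance matrix measures how much the data spreads off the line.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Clean" data: points on a 1D line embedded in 2D space (zero area).
t = rng.uniform(-1.0, 1.0, size=10000)
direction = np.array([3.0, 4.0]) / 5.0          # unit vector along the line
clean = t[:, None] * direction[None, :]

# Perturbed data: add independent Gaussian noise to every coordinate.
sigma = 0.1
noisy = clean + sigma * rng.standard_normal(clean.shape)

# Eigenvalues of the covariance, sorted in ascending order.
clean_eigs = np.linalg.eigvalsh(np.cov(clean.T))
noisy_eigs = np.linalg.eigvalsh(np.cov(noisy.T))

print("clean, smallest eigenvalue:", clean_eigs[0])
print("noisy, smallest eigenvalue:", noisy_eigs[0], "sigma^2:", sigma**2)
```

For the clean data the smallest eigenvalue is zero up to floating-point error, so the data has no extent off the line. After perturbation it is close to `sigma**2`, so the data fills a region with nonzero area and the density is positive everywhere.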
Real Data

