Concept of the Diffusion Model
The diffusion model is a recent generative model that has advanced across many areas since the 2015 paper by J. Sohl-Dickstein and the 2019 paper by Y. Song. While those two papers generated images, the approach has since been extended to various data types such as 3D images[1], audio[2], video[3], proteins[4][5], molecules[6], and even functions[7]. This article aims to explain the general structure of this popular model.
Basic structure of the diffusion model
A diffusion model is a generative model that ‘computes the reverse diffusion of the data distribution and generates synthetic data from a random initialization’. Let’s look at each keyword in detail.
Data distribution and Generative model
If we want to create A, we need to clarify two things. First, we need to know what space A belongs to. For example, what space does an image belong to? In the digital world, an image is a vector containing the RGB information of each pixel. Therefore, an image of size $w \times h$ is an element of $\qty{0, 1, \cdots, 255}^{3wh}$. If we map each RGB value to a real number between 0 and 1 by dividing it by 255, the image can also be viewed as an element of $[0, 1]^{3wh}$. Audio is composed of sound signals $a_t$ at time steps $t = 1, \cdots, T$, so it is an element of $\mathbb{R}_+^{T}$. A molecule is composed of atoms and can be viewed as a network whose nodes are atoms and whose edges are bonds, so a molecule with $n$ atoms, each of one of $N$ atom types and each pair joined by a bond of order 0 (no bond) to 3, is an element of $\qty{ 1, \cdots, N }^n \times \qty{ 0, 1, 2, 3 }^{n(n - 1)/2}$.
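The image case can be made concrete with a short sketch. It uses only the Python standard library and assumes a hypothetical input format (a nested list of `(R, G, B)` byte triples); the function name `image_to_vector` is introduced here purely for illustration.

```python
def image_to_vector(pixels):
    """Flatten an h x w grid of (R, G, B) byte triples into 3*w*h floats in [0, 1]."""
    vector = []
    for row in pixels:
        for r, g, b in row:
            # Dividing by 255 maps {0, ..., 255} into the interval [0, 1].
            vector.extend((r / 255, g / 255, b / 255))
    return vector


# A tiny 2 x 2 image (w = 2, h = 2), so the vector has 3 * 2 * 2 = 12 entries.
tiny_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
x = image_to_vector(tiny_image)
print(len(x), min(x), max(x))
```

Every image of the same size becomes a point in the same $3wh$-dimensional cube, which is what lets us talk about where images sit inside that space.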
Then, can every vector in the space $[0, 1]^{3wh}$ be considered an image? Is every element of $\mathbb{R}_+^{T}$ audio? Is every element of $\qty{ 1, \cdots, N }^n \times \qty{ 0, 1, 2, 3 }^{n(n - 1)/2}$ a molecule? Of course not. For example, the left figure below would not be considered an image.
| Random noise | Real Image |
|---|---|
| ![Random noise]() | ![Real image]() |
This brings us to the second thing to clarify: what is A within that space? If there is a condition that A must satisfy, and every piece of data satisfying it can be called A, we can generate A by creating arbitrary data that satisfies the condition.
But what if the condition is hard to state? What condition must data satisfy to be a ‘picture’ or a ‘photo’? When the condition is hard to spell out, how can we find an appropriate one?
Machine learning lets the machine find the condition from the training data; we then say that the machine learns the data distribution.
In other words, it learns where in the entire space the data judged to be A is located, and a data point lying where the probability of being A is high is judged to be A.
Here, there is one important assumption: data that is similar to the training data for A is also A.
For example, if we change a few pixels or add very small noise to the data that we recognize as an image, we still recognize the result as an image.
Based on this, if we train the machine to create data around the area where the training data is located, that is, data close to the training data, we can create quite realistic fake data.
(If a small change easily turns the data into something else, each data cluster occupies a smaller region and there are more boundaries between clusters, so a model that learns those boundaries needs more training data.)
However, this method is strongly affected by the distribution of the training data, so it is important to build unbiased training data that represents the desired data distribution well.
From this point of view, a generative model is, broadly speaking, a random number generator.
It is a model that produces random values following the learned data distribution.
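The "random number generator" view can be illustrated with a deliberately simple toy. The sketch below is not a diffusion model: it assumes, for illustration only, that the data is one-dimensional and roughly Gaussian, so that "learning the distribution" reduces to estimating a mean and a standard deviation. The names `fit_gaussian` and `generate` and the training values are hypothetical.

```python
import random
import statistics


def fit_gaussian(training_data):
    """'Learn' a 1-D data distribution by estimating its mean and standard deviation."""
    return statistics.fmean(training_data), statistics.stdev(training_data)


def generate(model, n_samples, seed=0):
    """Draw new values that follow the learned distribution."""
    mean, std = model
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [rng.gauss(mean, std) for _ in range(n_samples)]


# Hypothetical training data clustered around 5.0.
training_data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
model = fit_gaussian(training_data)
samples = generate(model, n_samples=1000)
print(model, statistics.fmean(samples))
```

The generated values are new (none needs to appear in the training data) yet they land near the training data. A real generative model does the same thing for far more complicated distributions, where a mean and a standard deviation are nowhere near enough.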
Reverse diffusion
The core of how a diffusion model creates random data is reverse diffusion. What does it mean to reverse diffusion? First, let’s think about what diffusion is.
Physically, diffusion means the flow of particles from a high-density area to a low-density area. Individual particles do not move according to density; each one moves randomly. Some particles move from a high-density area to a low-density area, and others move the opposite way. However, since far more particles move from high density to low density, diffusion emerges as a macroscopic phenomenon.
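This argument can be checked with a small simulation. The sketch below is a toy under stated assumptions: particles live on a one-dimensional lattice, each one hops one site left or right with equal probability per step, and the left half starts out nine times denser than the right half. The function name `count_crossings` and all the numbers are chosen here for illustration.

```python
import random


def count_crossings(positions, n_steps, rng):
    """Move every particle one site left or right per step and count crossings
    of the boundary between site -1 (left side) and site 0 (right side)."""
    positions = list(positions)
    left_to_right = 0
    right_to_left = 0
    for _ in range(n_steps):
        for i, x in enumerate(positions):
            new_x = x + rng.choice((-1, 1))  # unbiased random hop
            if x < 0 <= new_x:
                left_to_right += 1
            elif new_x < 0 <= x:
                right_to_left += 1
            positions[i] = new_x
    return left_to_right, right_to_left


# Dense side: 90 particles on each of the sites -10..-1 (900 in total).
dense_side = [site for site in range(-10, 0) for _ in range(90)]
# Sparse side: 10 particles on each of the sites 0..9 (100 in total).
sparse_side = [site for site in range(0, 10) for _ in range(10)]

rng = random.Random(0)
left_to_right, right_to_left = count_crossings(dense_side + sparse_side, n_steps=10, rng=rng)
print(left_to_right, right_to_left)
```

No particle knows anything about density, and crossings happen in both directions, yet the dense side sends more particles across simply because it has more particles sitting next to the boundary.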
To carry this over to data, let’s imagine the ‘data space’ described earlier and assume that it contains countless data points. Then, let’s add a very small noise to each data point so that it moves slightly in the data space. If we repeat this, the data will gradually spread out and cover the entire space evenly.
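The repeated-noise procedure can be sketched in a few lines. The example assumes one-dimensional data and Gaussian noise for simplicity; the function name `diffuse`, the step count, and the noise scale are illustrative choices, not values taken from any particular paper.

```python
import random
import statistics


def diffuse(points, n_steps, noise_scale, rng):
    """Add small independent Gaussian noise to every point, n_steps times."""
    current = list(points)
    for _ in range(n_steps):
        current = [x + rng.gauss(0.0, noise_scale) for x in current]
    return current


rng = random.Random(0)
# Hypothetical "data": 500 points tightly clustered around 2.0.
data = [2.0 + rng.gauss(0.0, 0.05) for _ in range(500)]
diffused = diffuse(data, n_steps=100, noise_scale=0.1, rng=rng)

spread_before = statistics.pstdev(data)
spread_after = statistics.pstdev(diffused)
print(spread_before, spread_after)
```

Each individual step barely moves a point, but the spread of the whole cloud keeps growing with the number of steps, so the original cluster structure is gradually washed out. In this unbounded toy the cloud simply keeps widening; the sketch shows the spreading, not a final uniform state.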
Original Data

