Metropolis Hastings Algorithm
What is the simplest way to sample a one-dimensional random variable that follows a probability density function $p(x)$? The method involves using the cumulative distribution function $C(x) = P(X \leq x)$. By drawing a uniform random number $p$ from the interval $[0,1]$ and finding $x$ such that $C(x) = p$, this sample $x$ will follow the probability distribution $p(x)$.
However, what should we do when it’s difficult to calculate the probability density function $p(x)$ precisely, or when the dimension of the random variable is too large to compute $x$ where $C(x) = p$?
This is where the Markov Chain Monte Carlo (MCMC) methodology comes into play.
While we’ll cover the detailed definition of MCMC in another post, today we’ll focus on understanding how the Metropolis-Hastings (MH) algorithm works.
Purpose of algorithm
The MH algorithm can sample random variables following $p(x)$ in scenarios where we cannot determine $p(x)$ exactly, but we know that $p(x)$ is proportional to some function $f(x)$, or when it’s easier to calculate the probability ratio $p(x_1)/p(x_2)$ for any two domain values $x_1, x_2$.
Unlike the cumulative distribution $C(X)$ method, the MH algorithm doesn’t operate in a single step. Regardless of the initial prior probability, it iteratively generates samples while forcing the sampling distribution to converge towards $p(x)$. It’s called a Markov Chain Monte Carlo method because it generates the $(n+1)$th sample based solely on information from the $n$th sample.
Toy model: $\int_0^L e^{- e^x} \dd x$
Before diving into the MH algorithm, let’s examine basic Monte Carlo methods. The simplest example is using random sampling to compute integrals. When trying to evaluate the integral $\int_0^L e^{- e^x} \dd x$, the most basic Monte Carlo method can be implemented as follows:
N = 100000
x = np.random.uniform(0, L, N)
y = np.random.uniform(0, 1, N)
count = (y <= np.exp(- np.exp(x))).sum()
area = L * (count / N)
We sample N uniform random numbers $(x,y)$ from the region $[0,L] \times [0,1]$ and count how many points fall below the graph $y = e^{-e^x}$. The ratio of points below the curve to total points, multiplied by the total area $L$, gives us the integral value.
Mathematically, this can be expressed as:
\begin{align} \int_0^L e^{- e^x} \dd x &= \int_0^L \int_0^1 I\qty(y \leq e^{-e^{x}}) \dd y \dd x \newline &= L \;E\qty[I\qty(y \leq e^{-e^{x}})] \simeq L \frac{1}{N} \sum_{x_i, y_i} I\qty(y_i \leq e^{-e^{x_i}}) \end{align}
Here, $I$ is the indicator function that equals 1 when the given proposition is true and 0 when false. The second equality comes from interpreting $dx dy/L$ as the probability of being in the region $(x,x+dx) \times (y,y+dy)$ and expressing it as an expectation value. By performing multiple samplings and counting the number of samples below the curve each time, we can calculate the integral result.
A slightly improved method is to sample only $x$ instead of both $x$ and $y$. Mathematically, this utilizes:
\begin{equation} \int_0^L e^{-e^x} \dd x = L \times \int_0^L e^{-e^x} \frac{1}{L} \dd x = L \; E\qty[e^{-e^x}] \simeq L \frac{1}{N} \sum_{x_i} e^{-e^{x_i}} \end{equation}
Once $x$ is determined, the proportion of $y$ samples that would fall below the graph is $e^{-e^x}$. Therefore, it is more intuitive to directly calculate this probability rather than sampling it and increasing error. Naturally, this leads to faster convergence as the error is reduced.
N = 100000
x = np.random.uniform(0, L, N)
p = np.exp(- np.exp(x)).sum() / N
area = L * p
However, when using these methods, we can approximate the integral result with about 10% error at $N=10^4$, and we need around $N=10^6$ to achieve an error level of 0.1%. (The second method shows slightly better convergence than the first approach.) This slow convergence is due to the inefficient sampling of the naive Monte Carlo method. To elaborate, since the function we want to integrate decreases very rapidly, most samples in regions with large $x$ values fall above the graph and are meaningless for area calculation. The majority of the graph’s area is concentrated in regions with small $x$ values, so for more accurate calculations, we need more frequent sampling in these regions.
First improvement: Importance sampling
Importance sampling is a method that overcomes the limitations of naive Monte Carlo methods. It involves sampling more frequently from regions with smaller $x$ values and calculating the area based on this sampling.
As we’ve seen earlier, Monte Carlo Integration calculates integrals by converting them into expected values of random variables. By carefully adjusting the probability distribution and modifying the objective function accordingly, we can maintain the same expected value with different distributions. To state the conclusion, the expected value of $f(x)$ with respect to probability distribution $p(x)$ equals the expected value of $f(x)p(x)/q(x)$ with respect to probability distribution $q(x)$.
\begin{equation} E_{x \sim p} \qty[f(x)] = \int f(x) p(x) \dd x = \int f(x) \frac{p(x)}{q(x)} q(x) \dd x = E_{x \sim q} \qty[f(x) \frac{p(x)}{q(x)}] \end{equation}
In our case, we have $p(x) = 1/L, f(x) = e^{-e^x}$, and if we set $q(x) = e^{-x} / (1 - e^{-L})$, we can perform the same integration using the following code.
N = 100000
C = 1 / (1 - np.exp(-L))
q = np.random.uniform(0, 1, N)
x = - np.log(1 - q / C)
s = (np.exp(- np.exp(x)) / (L * C * np.exp(-x))).sum() / N
area = L * s
In this case, we can achieve more precise approximation with 0.01% error at $N=10^6$. By adjusting sampling to focus more on important regions, performance improves accordingly.
Metropolis algorithm
We’ve confirmed that importance sampling can efficiently calculate integrals. However, it’s difficult to calculate the probability distribution accurately and sample a random variable that follows it in complex systems. Therefore, the Metropolis algorithm was introduced, and the Metropolis-Hastings algorithm was developed from it. (The initial method called the Metropolis algorithm was actually a joint study by Metropolis and Hastings, but the name was later changed to include Hastings’ contribution as the method evolved.)
The idea is to start from any initial distribution $p_0(x)$ and ensure that the sampling distribution converges to the desired distribution $p(x)$ through Markov chain sampling. In this case, we need the function $w(x)$ that is proportional to the desired distribution. This method is useful when it’s difficult to calculate the normalization factor by integrating $w(x)$ over the entire space.
Let’s calculate the integral using the Metropolis algorithm when $w(x) = e^{-x}$ and we don’t know the normalization factor $C$. The function we want to calculate is the same as before.
\begin{equation} \int_0^L e^{-e^x} \dd x = \int_0^L \frac{e^{-e^x}}{C e^{-x}} C e^{-x} \dd x = \frac{1}{N} \sum_{i=1}^N \frac{e^{-e^{x_i}}}{C e^{-x_i}}, \quad X \sim C e^{-x} \end{equation}
The two unknowns in this process are how to sample $X$ from a distribution proportional to $w(x)$ and how to calculate the normalization factor $C$. The sampling method will be explained when introducing the Metropolis algorithm, so we’ll defer it for now, and $C$ can be calculated by approximating the integral.
\begin{align} 1 &= \int_0^L \frac{1}{L} \dd x = \int_0^L \frac{1}{L C e^{-x}} C e^{-x} \dd x = E\qty[\frac{1}{L C e^{-x}}] \newline C &= E \qty[\frac{1}{L e^{-x}}] = \frac{1}{NL} \sum_{i=1}^N \frac{1}{e^{-x_i}} \end{align}
Thus, the integral is summarized as follows, \begin{equation} \int_0^L e^{-e^x} \dd x = \frac{1}{N} \sum_{i=1}^N \frac{e^{-e^{x_i}}}{C e^{-x_i}} = L \frac{\sum_{i=1}^N e^{-e^{x_i}} / e^{-x_i}}{ \sum_{i = 1}^N 1 / e^{-x_i}} \end{equation} The algorithm to calculate this is as follows.
def w(x):
return np.exp(-x)
current = np.random.uniform(0, L)
samples = []
for _ in range(N + 100):
samples.append(current)
new = np.random.uniform(0, L) # 1
rate = w(new) / w(current) # 2
p = np.random.uniform(0, 1)
if p < rate: # 3
current = new
samples = np.array(samples[100:]) # 4
C = (1 / np.exp(-samples)).sum()
S = (np.exp(-np.exp(samples)) / np.exp(-samples)).sum()
return L * S / C
The core of the Metropolis algorithm is in lines 1 to 3.
We randomly select a new sample (the method can also be designed to depend on the current sample, but this method is much more accurate) and decide whether to accept it using the rate variable.
If rate is greater than 1, we always accept the new sample, and if it’s less than 1, we accept it with a probability of rate.
(In other words, the smaller rate is, the smaller the probability of renewal becomes.)
Regardless of whether renewal occurs, we include current in the sample set.
By repeating this process, the sample becomes a random variable that follows the distribution proportional to $w(x)$. However, due to the influence of the initial distribution, there may be errors in the early results, so we need to exclude the early results as shown in line 4.
Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is an extension of the Metropolis algorithm that generalizes the method of selecting a new sample candidate. The method of selecting a new sample candidate in the algorithm introduced earlier (or conceivable) is as follows:
new = np.random.uniform(0, L) # case 1
new = current + np.random.uniform(-dx, dx) # case 2
Both methods are symmetric, as can be seen from the transition probability $p(x \rightarrow x’)$. (When the probability of a candidate being selected as $x’$ is the same as the probability of a candidate being selected as $x$, it is called symmetric transition.) However, it is not always necessary to sample with symmetric transition probabilities, so the Metropolis-Hastings algorithm proposed by Hastings is an extension that allows for asymmetric transitions.
In the Hastings algorithm, the acceptance probability is changed as follows:
\begin{align} A(x’|x) &= min\qty(1, \frac{w(x’)}{w(x)}) \tag{Metropolis} \newline A(x’|x) &= min\qty(1, \frac{w(x’)p(x’\rightarrow x)}{w(x)p(x \rightarrow x’)}) \tag{Hastings} \end{align}
By adjusting the acceptance probability in this way, even if the transition probability is asymmetric, the sampling distribution follows the probability distribution proportional to $w(x)$.
Detailed balance
Why does the sample distribution converge to the desired distribution when repeating the Metropolis-Hastings algorithm? To understand this, we need to track how the sample distribution changes as we repeat the sampling. Let $p_n(x)$ be the sample distribution at the $n$-th step, and let’s find out how $p_{n+1}(x)$ is given.
Given a sample $x$ with probability $p_n(x) \dd x$ at the $n$-th step, the probability of the next sample being $x’$ is determined by the transition probability, so it is $p(x \rightarrow x’) \dd x’$. The probability that this sample is accepted as the next sample is $A(x’|x)$. Therefore, the probability of being in the neighborhood of $x$ at the $n$-th step, $p_n(x) \dd x$, becomes the probability of being in the neighborhood of $x’$ at the $(n+1)$-th step, $p(x \rightarrow x’) A(x’ | x) p_n(x) \dd x \dd x’$.
Conversely, there is also a probability of moving to $x$ from another position $x’$. The value is $p(x’ \rightarrow x) A(x | x’) p_n(x’) \dd x \dd x’$. Integrating the probability flow over all $x’$ allows us to determine the net change in $p_n(x)$, and we can derive the recurrence relation for $p_n(x)$.
\begin{equation} \hspace{-30pt} p_{n+1}(x) = p_n(x) + \int \dd x’ (p(x’ \rightarrow x) A(x|x’) p_n(x’) - p(x \rightarrow x’) A(x’|x) p_n(x)) \end{equation}
Solving this recurrence relation is not easy, but it is easy to see that the equilibrium condition is satisfied. \begin{equation} p(x’ \rightarrow x) A(x|x’) p(x’) = p(x \rightarrow x’) A(x’|x) p(x) \end{equation} Therefore, the probability distribution ratio between two points in equilibrium is determined as follows. \begin{equation} \frac{p(x’)}{p(x)} = \frac{p(x \rightarrow x’) A(x’|x)}{p(x’ \rightarrow x) A(x|x’)} \end{equation}
Now, we need to know the ratio of acceptance probabilities. If $w(x’)p(x’ \rightarrow x) > w(x) p(x \rightarrow x’)$, then $A(x’|x) = 1$ is given, so \begin{equation} \frac{A(x’ | x)}{A (x | x’)} = \frac{1}{A(x|x’)} = \frac{w(x’)p(x’ \rightarrow x)}{w(x)p(x \rightarrow x’)} \end{equation} The result is the same for the opposite case. Substituting this into the previous equation, we can see that the equilibrium distribution is given by the ratio of $w(x)$, and the sample following the equilibrium distribution naturally follows the desired distribution. \begin{equation} \frac{p(x’)}{p(x)} = \frac{p(x \rightarrow x’)}{p(x’ \rightarrow x)}\frac{w(x’)p(x’ \rightarrow x)}{w(x)p(x \rightarrow x’)} = \frac{w(x’)}{w(x)} \end{equation}
Conclusion
We’ve seen how the complex sampling method of the MH algorithm follows the desired distribution.
- The use of the MH algorithm: When we know the functional form of the probability distribution but it is very difficult to calculate the normalization factor
- The core of the MH algorithm: The definition of the acceptance probability of rejecting the new sample
- The working principle of the MH algorithm: The detailed balance that forces the equilibrium distribution to follow the desired probability distribution