Sherman Khoo

Sherman Khoo

Statistics Masters Student at the University of Warwick

Coding Posterior Simulations with (Metropolis-Hastings and Gradient Based) MCMC

Posted on March 20, 2022

Posterior Simulations

In Bayesian inference, the posterior probability distribution is the key object of interest. Unfortunately, we often are only able to obtain the posterior distribution up to a normalising constant (through Bayes’s rule), and the evaluation of said normalising constants are usually intractable. Hence, an entire class of algorithms called Markov Chain Monte Carlo (MCMC) has been developed to allow us to sample from the posterior distribution.

In this post, I’ll walk through the basics of two MCMC sampling algorithms using the PyTorch library on Python.


As we will be comparing two MCMC algorithms, we shall use a non-trivial model for our simulation. This model is taken from experiment 5.1 in the Stochastic Gradient Langevin Dynamics Paper by Welling and Teh.

Our observations will be generated from a gaussian mixture model, as detailed below.


And we will set gaussian priors on our two parameters.


def generate_samples(parameters=(0.,1.), n=100):
    #Hardcoded parameters
    sigma_x = torch.sqrt(torch.tensor(2.0))

    #Given parameters (theta_1, theta_2)
    theta_1 = parameters[0]
    theta_2 = parameters[1]

    location_1 = theta_1
    location_2 = theta_1 + theta_2
    samples = []

    for _ in range(n):
        if torch.bernoulli(torch.tensor(0.5)):
            samples.append(torch.normal(mean=location_1, std=sigma_x))
            samples.append(torch.normal(mean=location_2, std=sigma_x))

    return torch.tensor(samples)

Note above that to sample from a mixture model, we first sample from a categorical distribution with the correspoding weights and then sample from the individual gaussians. As we have two Gaussians equally weighted, we use a bernoulli distribution with probability of 0.5.

MAP Estimation using numerical optimisation

Before we start, let’s provide a sanity check by obtaining the Maximum a Posteriori (MAP) estimate. Note that optimising the posterior up to a normalising constant is also equivalent to optimising the posterior itself, hence there is no need to concern ourself with the normalising constant (or lack thereof). Since PyTorch provides us with a Adam optimiser out of the box, we shall use that.

theta_1 = torch.tensor(10.,requires_grad=True)
theta_2 = torch.tensor(10.,requires_grad=True)

opt = torch.optim.Adam([theta_1, theta_2], lr=0.01)

for i in range(50000):
    loss = - torch.sum(log_likelihood(x, [theta_1,theta_2])) - log_prior(theta=[theta_1, theta_2])

print(f"Loss: {loss.item():0.2f}, Theta 1: {theta_1.item():0.2f}, Theta 2: {theta_2.item():0.2f}")
Loss: 185.97, Theta 1: 0.68, Theta 2: -0.55

There is still some loss, but we now know that our mode is around this value. However, as we will soon see, we have a bimodal distribution, and so the MAP estimate doesn’t provide the full description of our posterior distribution.

Random Walk Metropolis-Hastings

Our first MCMC sampler is the standard Random Walk Metropolis-Hastings algorithm. An issue with such algorithms is determining the proposal, in particular, the variance of our multivariate gaussian. In my case, I picked the value 0.35 as it gave us the optimal acceptance rate (see reference section at bottom of page).

def rwmh(x_obs, log_likelihood, log_prior, theta_init, n_runs, n_dim):
    samples = torch.zeros(size=(n_runs+1, n_dim))
    rejection = 0
    samples[0] = theta_init

    for i in tqdm(range(n_runs)):
        # Proposal is random normal
        # A-R probability should be log LL(x_prop) + log PP(x_prop) - log LL(x_prev) - logPP(x_prev)
        eps = torch.distributions.MultivariateNormal(torch.zeros(n_dim), .35 * torch.eye(n_dim)).sample()
        theta_proposal  = samples[i] + eps
        u = torch.rand(1).item()

        prob = torch.sum(log_likelihood(x=x_obs, theta=theta_proposal)) + log_prior(theta=theta_proposal) \
                  - torch.sum(log_likelihood(x=x_obs, theta=samples[i])) - log_prior(theta=samples[i])

        # In log terms, use min(0, ...) as the AR probability
        if prob > 0:
            prob = torch.tensor(0.) 

        if u < torch.exp(prob):
            samples[i+1] = theta_proposal

            samples[i+1] = samples[i]
            rejection += 1

    return rejection / n_runs, samples
rr, ss = rwmh(x, log_likelihood, log_prior, torch.tensor([0.0,0.0]), 50000, 2)
100%|██████████| 50000/50000 [00:36<00:00, 1376.63it/s]
1 - rr

Around 23% is considered to be optimal.

We use the trace plot to determine convergence to the target distribution (our posterior distribution).




sns.kdeplot(x=t_1[5000:], y=t_2[5000:])


And thus we can estimate the posterior distribution from the samples we obtained. However, MCMC is a rather computationally expensive algorithm, and is not very efficient as we only accept a portion of the proposed values. Can we do better?

Langevin Dynamics (Gradient MCMC)

Alternatively, a stochastic differential equation called the Langevin Diffusion can be shown to have the posterior distribution as the stationary distribution. Hence, if we could sample from such a diffusion, we would be able to obtain samples from the posterior. The Langevin Diffusion incorporates the gradient of the posterior distribution (which we can easily obtain thanks to PyTorch), and does not require a accept-reject step as with Metropolis-Hasting algorithm.

samples = []
paramCurrent = torch.tensor([torch.tensor(10.), torch.tensor(10.)])
dt = torch.tensor(1e-3)
for i in tqdm(range(50000)):
    paramGradCurrent = get_grad(x, paramCurrent[0], paramCurrent[1])
    with torch.no_grad():
        paramCurrent = paramCurrent + dt*paramGradCurrent + \
100%|██████████| 50000/50000 [00:29<00:00, 1701.91it/s]
tt_1 = [s[0].item() for s in samples]
tt_2 = [s[1].item() for s in samples]
sns.kdeplot(x=tt_1[5000:], y=tt_2[5000:])


As we can see, we obtain a fairly similar posterior, but it seems slightly off. Unfortunately, the Lagevin Diffusion is a continuous time process, and what we in fact did was to sample from a Euler discretisation of the diffusion. What we did is known as the Unadjusted Langevin Algorithm. This discretisation means that we have a biased approximation of the posterior distribution.

Another extension to the ULA is in the Stochastic Gradient Langevin Dynamics Paper by Welling and Teh mentioned earlier. In this extension, they implemented stochastic gradients, whereby the exact gradients are replaced by a unbiased estimate of the gradients using a minibatch, and a way to adjust the step size for better sampling.