TP 2: Approximate Inference in Classification

You can open this practical session in Colab here: https://colab.research.google.com/drive/1ISeoFqmccBJJIuZRB5OzyGywyM60t7oo?usp=sharing

or download the notebook directly here.

In classification tasks, even for a mere Logistic Regression, we don’t have access to a closed form of the posterior \(p(\pmb{w} \vert \mathcal{D})\). Unlike in linear regression, the likelihood is no longer conjugate to the Gaussian prior. We will need to approximate this posterior.

During this session, we will explore and compare approximate inference approaches on 2D binary classification datasets. The studied approaches include the Laplace approximation, variational inference with the mean-field approximation, and Monte Carlo dropout.

Goal: Get hands-on experience with approximate inference methods and understand how they behave on linear and non-linear 2D datasets.

All Imports and Useful Functions

Here we install and import everything we will need for this tutorial.

Note: You can double-click the title of the collapsed cells (as the ones below) to expand them and read their content.

#@title Import libs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons
from IPython import display
%matplotlib inline

import torch
import torch.nn as nn
from torch.utils import data
import torch.nn.functional as F
from torch.autograd import grad
import torch.distributions as dist
#@title Useful plot function
def plot_decision_boundary(model, X, Y, epoch, accuracy, model_type='classic',
                           nsamples=100, posterior=None, tloc=(-4,-7),
                           nbh=2, cmap='RdBu'):
    """ Plot and show learning process in classification """
    h = 0.02*nbh
    x_min, x_max = X[:,0].min() - 10*h, X[:,0].max() + 10*h
    y_min, y_max = X[:,1].min() - 10*h, X[:,1].max() + 10*h
    xx, yy = np.meshgrid(np.arange(x_min*2, x_max*2, h),
                         np.arange(y_min*2, y_max*2, h))

    test_tensor = torch.from_numpy(np.c_[xx.ravel(), yy.ravel()]).type(torch.FloatTensor)
    model.eval()
    with torch.no_grad():
      if model_type=='classic':
          pred = model(test_tensor)
      elif model_type=='laplace':
          #Save original mean weight
          original_weight = model.state_dict()['fc.weight'].detach().clone()
          outputs = torch.zeros(nsamples, test_tensor.shape[0], 1)
          for i in range(nsamples):
              state_dict = model.state_dict()
              state_dict['fc.weight'] = torch.from_numpy(posterior[i].reshape(1,2))
              model.load_state_dict(state_dict)
              outputs[i] = model(test_tensor)
          pred = outputs.mean(0).squeeze()
          state_dict['fc.weight'] = original_weight
          model.load_state_dict(state_dict)
      elif model_type=='vi':
          outputs = torch.zeros(nsamples, test_tensor.shape[0], 1)
          for i in range(nsamples):
              outputs[i] = model(test_tensor)
          pred = outputs.mean(0).squeeze()
      elif model_type=='mcdropout':
          model.eval()
          model.training = True
          outputs = torch.zeros(nsamples, test_tensor.shape[0], 1)
          for i in range(nsamples):
              outputs[i] = model(test_tensor)
          pred = outputs.mean(0).squeeze()
    Z = pred.reshape(xx.shape).detach().numpy()

    plt.cla()
    ax.set_title('Classification Analysis')
    ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.25)
    ax.contour(xx, yy, Z, colors='k', linestyles=':', linewidths=0.7)
    ax.scatter(X[:,0], X[:,1], c=Y, cmap='Paired_r', edgecolors='k');
    ax.text(tloc[0], tloc[1], f'Epoch = {epoch+1}, Accuracy = {accuracy:.2%}', fontdict={'size': 12, 'fontweight': 'bold'})
    display.display(plt.gcf())
    display.clear_output(wait=True)

Part I: Bayesian Logistic Regression

In linear regression, model prediction is of the continuous form \(f(\pmb{x})=\pmb{w}^T\pmb{x}+b\).

For classification, we wish to assign a discrete class label \(\mathcal{C}_k\) to a sample \(\pmb{x}\). For simplicity, let’s consider here binary classification:

\[f(\pmb{x}) = \sigma(\pmb{w}^T\pmb{x} + b)\]

where \(\sigma(t)= \frac{1}{1+e^{-t}}\) is the sigmoid function.

As in linear regression, we define a Gaussian prior:

\[p(\pmb{w}) = \mathcal{N}(\pmb{w}; \pmb{\mu}_0, \pmb{\Sigma}_0^2)\]

Unfortunately, the posterior distribution isn’t tractable as the likelihood isn’t conjugate to the prior anymore.

We will explore in the following different methods to obtain an estimate of the posterior distribution and hence the predictive distribution.

I.0 Dataset

#@title Hyperparameters for model and approximate inference { form-width: "30%" }

WEIGHT_DECAY = 5e-2 #@param
NB_SAMPLES = 400 #@param
TEXT_LOCATION = (-5,-7)
# Load linear dataset
X, y = make_blobs(n_samples=NB_SAMPLES, centers=[(-2,-2),(2,2)], cluster_std=0.80, n_features=2)
X, y = torch.from_numpy(X), torch.from_numpy(y)
X, y = X.type(torch.float), y.type(torch.float)
torch_train_dataset = data.TensorDataset(X,y) # create your dataset
train_dataloader = data.DataLoader(torch_train_dataset, batch_size=len(torch_train_dataset))

# Visualize dataset
plt.scatter(X[:,0], X[:,1], c=y, cmap='Paired_r', edgecolors='k')
plt.show()

I.1 Maximum-A-Posteriori Estimate

In this “baseline”, we reduce our posterior distribution \(p(\pmb{w} | \mathcal{D})\) to a point estimate \(\pmb{w}_{MAP}\). For a new sample \(\pmb{x^*}\), the predictive distribution can then be approximated by

\[p(\mathbf{y} = 1|\pmb{x}^*,\mathcal{D}) = \int p(\mathbf{y} =1 |\pmb{x}^*,\pmb{w})p(\pmb{w} | \mathcal{D})d\pmb{w} \approx p(\mathbf{y} =1 |\pmb{x}^*,\pmb{w}_{\textrm{MAP}}).\]

This approximation is called the plug-in approximation.

The point estimate corresponds to the Maximum-A-Posteriori (MAP) estimate, given by:

\[\pmb{w}_{\textrm{MAP}} = \arg \max_{\pmb{w}} p(\pmb{w} \vert \mathcal{D}) = \arg \max_{\pmb{w}} p(\mathcal{D} \vert \pmb{w})p(\pmb{w}) = \arg \max_{\pmb{w}} \prod_{n=1}^N p(y_n \vert \pmb{x}_n, \pmb{w})p(\pmb{w})\]

Finding the maximum of the previous equation is equivalent to finding the minimum of \(- \log p(\pmb{w} \vert \mathcal{D})\). In the case of a Gaussian prior, this can further be written as:

\[\pmb{w}_{\textrm{MAP}} = \arg \min_{\pmb{w}} \sum_{n=1}^N \big ( -y_n \log \sigma(\pmb{w}^T \pmb{x}_n + b) - (1-y_n) \log (1 - \sigma(\pmb{w}^T \pmb{x}_n + b)) \big ) + \frac{1}{2 \Sigma_0^2} \vert \vert \pmb{w} \vert \vert_2^2\]

Note that:

  • This actually corresponds to the minimum given by the standard cross-entropy loss in classification with weight decay regularization (a small sketch of this correspondence is given right after this list).
  • Unlike in linear regression, \(\pmb{w}_{MAP}\) cannot be computed analytically.
  • But we can use optimization methods to compute it, e.g. stochastic gradient descent.
  • Nevertheless, we only obtain a point-wise estimate, and not a full distribution over the parameters \(\pmb{w}\).
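
The correspondence with the usual PyTorch training loss can be made explicit. Below is a minimal sketch (not part of the original notebook) where the prior variance \(\Sigma_0^2\) is passed as prior_var and where we assume the model's linear layer is named fc, as in the model defined below.

# Sketch: -log p(w|D) written explicitly as summed BCE + L2 penalty on the weights.
# With BCELoss(reduction='sum'), the factor 1/(2*prior_var) in front of ||w||^2
# plays the same role as the optimizer's weight_decay option used later
# (up to the loss reduction convention).
import torch.nn as nn

def neg_log_posterior(model, X, y, prior_var=10.0):
    """Negative log-posterior -log p(w|D), up to an additive constant."""
    nll = nn.BCELoss(reduction='sum')(model(X).squeeze(), y)   # -log p(D|w)
    l2 = (model.fc.weight ** 2).sum()                          # ||w||_2^2
    return nll + l2 / (2 * prior_var)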

Consequently, the objective is simply to implement and train a Logistic Regression model with Pytorch and then compute \(p(\mathbf{y} = 1|\pmb{x}^*,\mathcal{D})\) on a new sample \(\pmb{x}^*\) as in a deterministic model.

[CODING TASK] Implement Logistic Regression

class LogisticRegression(nn.Module):
  """ A Logistic Regression Model with sigmoid output in Pytorch"""
  def __init__(self, input_size):
    super().__init__()
    # ============ YOUR CODE HERE ============
    # cf. nn.Linear()
    # self.fc = nn.Linear(...)

  def forward(self, x):
    # ============ YOUR CODE HERE ============
    # Don't forget to apply the sigmoid function when returning the output
    # cf. torch.sigmoid()

    return None
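
For reference, one possible implementation sketch is given below (one solution among others): a single nn.Linear layer followed by a sigmoid.

class LogisticRegressionSketch(nn.Module):
  """ Possible solution sketch: one linear layer followed by a sigmoid output """
  def __init__(self, input_size):
    super().__init__()
    self.fc = nn.Linear(input_size, 1)   # a single logit for binary classification

  def forward(self, x):
    return torch.sigmoid(self.fc(x))     # p(y=1|x, w)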

[CODING TASK] Train a Logistic Regression model with stochastic gradient descent for 20 epochs.

net = LogisticRegression(input_size=X.shape[1])
net.train()
criterion = nn.BCELoss()

# L2 regularization is included in Pytorch's optimizer implementation
# as "weigth_decay" option
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=WEIGHT_DECAY)

fig, ax = plt.subplots(figsize=(7,7))

# ============ YOUR CODE HERE ============
# Train previously defined network for 20 epochs with SGD
# and plot result for each epoch by uncommenting function below

for epoch in range(20):  # loop over the dataset multiple times
    # ============ YOUR CODE HERE ============
    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    # output = net(X).squeeze() # where X is the input
    # loss = criterion(output,y) # where y is the label
    loss.backward()
    optimizer.step()

    # For plotting and showing learning process at each epoch
    # plot_decision_boundary(net, X, y, epoch, ((output>=0.5) == y).float().mean(),
    #                       model_type='classic', tloc=TEXT_LOCATION)
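
If you want to check your solution, one possible way to complete the loop body is sketched below (it reuses the net, criterion and optimizer defined in the cell above).

for epoch in range(20):
    optimizer.zero_grad()            # zero the parameter gradients
    output = net(X).squeeze()        # forward pass on the whole dataset
    loss = criterion(output, y)      # BCE; the L2 term comes from weight_decay
    loss.backward()                  # backward pass
    optimizer.step()                 # SGD update
    plot_decision_boundary(net, X, y, epoch, ((output >= 0.5) == y).float().mean(),
                           model_type='classic', tloc=TEXT_LOCATION)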

[Question 1.1]: Analyze the results provided by the previous plot. Looking at \(p(\mathbf{y}=1 \vert \pmb{x}, \pmb{w}_{\textrm{MAP}})\), what can you say about points far from the training distribution?

Part II: Monte Carlo Dropout

Training a neural network while randomly dropping some activations, e.g. with dropout layers, can actually be seen as an approximate variational inference method!

[Gal and Ghahramani, 2016] showed this can be fulfilled for:

  • \(p(\pmb{w}) = \prod_l p(\pmb{W}_l) = \prod_l \mathcal{N}(0, I/ l_i^2)\) \(\Rightarrow\) a multivariate Gaussian distribution factorized over layers
  • \(q(\pmb{w}) = \prod_l q(\pmb{W}_l) = \prod_l \textrm{diag}(\varepsilon_l)\odot\pmb{M}_l\) with \(\varepsilon_l \sim \textrm{Ber}(1-p_l)\)

We will now implement an MLP with dropout layers and perform Monte-Carlo sampling to obtain the predictive distribution \(p(\mathbf{y} \vert \pmb{x}^*, \mathcal{D})\) for a new sample \(\pmb{x}^*\).

#@title **[CODING TASK]** Implement a MLP with dropout (p=0.2)
# Code an MLP with 1 hidden layer and a dropout layer. Be careful: the dropout
# layer should also be active at test time.
# (Hint: you may want to look at F.dropout())

class MLP(nn.Module):
    """ Pytorch MLP for binary classification model with an added dropout layer"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
      # ============ YOUR CODE HERE ============

    def forward(self, x):
      # ============ YOUR CODE HERE ============
      # Don't forget to apply the sigmoid function when returning the output
      return None
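
For reference, one possible implementation sketch is given below, assuming a single hidden layer with ReLU and F.dropout applied with training=True so that dropout also remains active at test time.

class MLPSketch(nn.Module):
    """ Possible solution sketch: 1-hidden-layer MLP with always-on dropout """
    def __init__(self, input_size, hidden_size, p=0.2):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)
        self.p = p

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.dropout(x, p=self.p, training=True)   # dropout active at train AND test time
        return torch.sigmoid(self.fc2(x))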

We train our model as usual:

net = MLP(input_size=X.shape[1], hidden_size=50)
net.train()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
fig, ax = plt.subplots(figsize=(7,7))

for epoch in range(500):  # loop over the dataset multiple times

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    output = net(X).squeeze()
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

    # For plotting and showing the learning process every 50 epochs
    if (epoch+1)%50==0:
      plot_decision_boundary(net, X, y, epoch, ((output.squeeze()>=0.5) == y).float().mean(), tloc=TEXT_LOCATION, model_type='classic')

Now let’s look at the results given by MC Dropout:

fig, ax = plt.subplots(figsize=(7,7))
plot_decision_boundary(net, X, y, epoch, ((output.squeeze()>=0.5) == y).float().mean(), tloc=TEXT_LOCATION, model_type='mcdropout')
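
Under the hood, the mcdropout branch of plot_decision_boundary simply averages several stochastic forward passes. Here is a minimal sketch of the same idea applied to the training points (it assumes dropout stays active in net at test time):

T = 100
with torch.no_grad():
    samples = torch.stack([net(X).squeeze() for _ in range(T)])  # shape (T, N)
pred_mean = samples.mean(0)   # Monte-Carlo estimate of p(y=1|x, D)
pred_std  = samples.std(0)    # spread across samples, i.e. predictive uncertainty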

[Question 2.1]: Again, analyze the results shown on the plot. What is the benefit of MC Dropout variational inference over Bayesian Logistic Regression with variational inference?

Bonus: Laplace Approximation

In this part, we will implement the Laplace approximation, which is expected to be better than the MAP approximation but still leads to a rough estimate of the density.

We will use Laplace approximation to estimate the intractable posterior \(p(\pmb{w} | \mathcal{D})\).

Here, \(p(\pmb{w} | \mathcal{D})\) is approximated with a normal distribution \(\mathcal{N}(\pmb{w} ; \pmb{\mu}_{lap}, \pmb{\Sigma}_{lap}^2)\) where:

  • the mean of the normal distribution \(\pmb{\mu}_{lap}\) corresponds to the mode of \(p(\pmb{w} | \mathcal{D})\). In other words, it simply consists in taking the optimal weights from the Maximum-A-Posteriori estimation:

    \[\pmb{\mu}_{lap} = \pmb{w}_{\textrm{MAP}} = \arg \min_{\pmb{w}} -\log p(\pmb{w} | \mathcal{D})\]

  • the covariance matrix is obtained from the Hessian of the loss function \(-\log p(\pmb{w} \vert \mathcal{D})\) evaluated at \(\pmb{w}=\pmb{w}_{\textrm{MAP}}\):

    \[(\pmb{\Sigma}^2_{lap})^{-1} = \nabla\nabla_{\pmb{w}} \big[ -\log p(\pmb{w} \vert \mathcal{D}) \big]_{\pmb{w}=\pmb{w}_{\textrm{MAP}}}\]

[CODING TASK] Extract μ_lap from the previously trained model. NB: Select only the weight parameters. The weights of a given layer ``l_id`` of a model ``net`` are accessible as ``net.l_id.weight``

# ============ YOUR CODE HERE ============
w_map = None
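
One possible answer sketch, assuming net refers to the trained Logistic Regression model whose linear layer is named fc (consistent with the net.fc.weight accesses in the cell below):

w_map = net.fc.weight   # shape (1, 2): the MAP estimate used as the Laplace mean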

To compute the Hessian, we first compute the gradient at \(\pmb{w}_{\textrm{MAP}}\):

# Computing first derivative w.r.t to model's weights
optimizer.zero_grad()
output = net(X).squeeze()
loss = criterion(output, y) + net.fc.weight.norm()**2
gradf_weight = grad(loss, net.fc.weight, create_graph=True)[0]
# Apply the same grad function on each scalar element of the gradient to get
# each row of the Hessian. Stack the rows and compute the covariance
# by inverting the Hessian
# NB: to avoid accumulated gradient when debugging and running the cell
# multiple times, you should convert your grad results to numpy straight away
hess_weights = []
for i in range(gradf_weight.shape[1]):
    grad2 = grad(gradf_weight[0][i], net.fc.weight, retain_graph=True)[0].squeeze()
    hess_weights.append(grad2)
hess_weights = torch.stack(hess_weights)
Sigma_laplace = torch.inverse(hess_weights)
print(hess_weights)

We now build the approximate posterior \(\mathcal{N}(\pmb{w} ; \pmb{\mu}_{lap}, \pmb{\Sigma}_{lap}^2)\) with the parameters found.

Given this distribution, we can compute the predictive distribution with Monte-Carlo sampling and plot the results for the last epoch, corresponding to \(\pmb{w}_{\textrm{MAP}}\):

# Defining posterior distribution
laplace_posterior =  np.random.multivariate_normal(w_map.detach().numpy().reshape(2,), Sigma_laplace.detach().numpy(), NB_SAMPLES)

# Plotting results
fig, ax = plt.subplots(figsize=(7,7))
plot_decision_boundary(net, X, y, epoch, ((output.squeeze()>=0.5) == y).float().mean(), model_type='laplace',
                       tloc=TEXT_LOCATION, nsamples=NB_SAMPLES, posterior=laplace_posterior)

[Question 1.2]: Analyze the results provided by the previous plot. Compared to the previous MAP estimate, how does the predictive distribution behave?