[[Frequentist linear regression]] derives fixed model parameters and a fixed prediction for a given input. In contrast, Bayesian linear regression (BLR) derives a probability distribution over the model parameters, so the prediction is also a distribution.

## Objective

We want to find the probability distribution of the weights $\mathbf w$ given the data $\mathbf y$. We assume for now that it follows a Gaussian distribution with mean $\boldsymbol \mu_\mathbf w$ and covariance $\mathbf \Sigma_\mathbf w$.

$$p(\mathbf w \mid \mathbf y)=\mathcal N(\mathbf w\mid \boldsymbol \mu_\mathbf w, \mathbf \Sigma_\mathbf w)$$

Once we have obtained the distribution of the weights $\mathbf w$ (derived from $\mathbf y$), the output $\mathbf y_*$ for a new input $\mathbf X_*$ should also follow a [[Gaussian Distribution|Gaussian distribution]] with mean $\boldsymbol \mu_*$ and covariance $\mathbf \Sigma_*$.

$$p(\mathbf y_*\mid \mathbf {y})=\mathcal N(\mathbf y_*\mid \boldsymbol \mu_*,\mathbf \Sigma_* )$$

Our goal is to find $\boldsymbol \mu_\mathbf w$, $\mathbf \Sigma_\mathbf w$, $\boldsymbol \mu_*$ and $\mathbf \Sigma_*$. We will see that these values depend on two hyperparameters $\alpha$ and $\beta$, so we will conclude by showing how to find optimal values for them.

## Posterior distribution

The distribution $p(\mathbf w\mid \mathbf y)$ is called the **posterior distribution** because it is the weights' distribution *after* seeing the data. From [[Bayes' Theorem]], we have

$$\begin{align}p(\mathbf w\mid \mathbf y) &= p(\mathbf w) \times p(\mathbf y\mid \mathbf w) \times \frac{1}{p(\mathbf y)} \\ p(\mathbf w\mid \mathbf y) &\propto p(\mathbf w) \times p(\mathbf y\mid \mathbf w) \end{align}$$

$p(\mathbf w)$ is the **prior distribution**, the weights' distribution *before* seeing the data. Without data this is really just our belief: we assume it has a mean of $\mathbf 0$ and a precision of $\alpha$

$$p(\mathbf w) =\mathcal{N}(\mathbf w\mid \mathbf 0, \alpha^{-1}\mathbf I)$$

$p(\mathbf y\mid \mathbf w)$ is the **likelihood**, describing how well the model explains the observed data. We give it a precision of $\beta$

$$p(\mathbf y\mid \mathbf w)=\mathcal{N}(\mathbf y\mid \mathbf{\hat y} , \beta^{-1} \mathbf I)$$

where $\mathbf {\hat y}=\mathbf {Xw}$. The input matrix $\mathbf X$ is called the **design matrix** (or feature matrix).

Now we are ready to solve for the posterior distribution.

$$\begin{align} p(\mathbf w\mid \mathbf y) &\propto p(\mathbf w) \times p(\mathbf y\mid \mathbf w) \\&\propto \exp\big \{ -\frac{\alpha}{2}\mathbf w^\mathrm T \mathbf w \big\} \exp\big \{ -\frac{\beta}{2}(\mathbf y - \mathbf {\hat y})^\mathrm T (\mathbf y - \mathbf {\hat y}) \big \} \\&\propto \exp\big \{ -\frac{1}{2}\mathbf w^\mathrm T\big (\alpha\mathbf I + \beta\mathbf {X}^\mathrm T\mathbf {X}\big) \mathbf w + \mathbf w^\mathrm T\big(\beta\mathbf {X}^\mathrm T\mathbf y\big) \big \} \end{align}$$

Detailed proof [[Gaussian Distribution#Example posterior distribution in BLR|here]]. Hence

$$\begin{align} \boldsymbol \mu_\mathbf w &= \beta\mathbf \Sigma_\mathbf w \mathbf {X}^\mathrm T \mathbf {y}\\\mathbf \Sigma_\mathbf w &= \big(\alpha\mathbf I + \beta\mathbf {X}^\mathrm T\mathbf {X}\big)^{-1} \\ \end{align}$$
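With these closed-form expressions, the posterior parameters are straightforward to compute. Below is a minimal NumPy sketch (not the notebook code linked later; the design matrix `X`, targets `y`, and hyperparameters `alpha`, `beta` are assumed given) that translates the formulas directly. See the Production Usage section below for numerical caveats.

```python
import numpy as np

def posterior(X, y, alpha, beta):
    """Posterior N(w | mu_w, Sigma_w), directly mirroring the formulas above."""
    M = X.shape[1]
    # Sigma_w = (alpha*I + beta*X^T X)^(-1)
    Sigma_w = np.linalg.inv(alpha * np.identity(M) + beta * X.T @ X)
    # mu_w = beta * Sigma_w X^T y
    mu_w = beta * Sigma_w @ X.T @ y
    return mu_w, Sigma_w
```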
## Predictive distribution

Now we have the Gaussian distribution of the weights $\mathbf w$. Given a new input $\mathbf X_*$, we want the distribution of $\mathbf y_*$

$$p(\mathbf y_* \mid \mathbf{y}) = \int p(\mathbf y_{*} \mid \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{y}) \, d\mathbf{w}$$

Since we are integrating the product of two Gaussians, the result is known to be a Gaussian:

$$p(\mathbf y_{*} \mid \mathbf{y}) = \mathcal{N}(\mathbf y_{*} \mid \boldsymbol\mu_*, \mathbf \Sigma_*)$$

The **predictive mean** $\boldsymbol \mu_*$ can be found using the [[Law of Total Expectation]]

$$\begin{align} \boldsymbol \mu_*=\mathbb E[\mathbf y_*]&=\mathbb E_{\mathbf w}[\mathbb E[\mathbf y_*\mid \mathbf w]] \\&=\mathbf X_*\boldsymbol \mu_\mathbf w\end{align}$$

The **predictive covariance** $\mathbf \Sigma_*$ can be found using the [[Law of Total Variance]]

$$\begin{align} \mathbf \Sigma_*=\text{Var}[\mathbf y_*]&=\mathbb E_\mathbf w[\text{Var}[\mathbf y_*\mid \mathbf w]] + \text{Var}_\mathbf w[\mathbb E[\mathbf y_* \mid \mathbf w]]\\&= \beta^{-1}\mathbf I+ \mathbf X_* \mathbf \Sigma_\mathbf w \mathbf X_*^\mathrm T\end{align}$$

## Demo

Here we observe 20 data points that follow the ground truth $\sin(2\pi x)$ (green line) with added Gaussian noise. Using Bayesian linear regression ($\alpha=0.005,\beta=10$), we get the mean (red line) and standard deviation (pink shade) of the predictive distribution.

![[blr.png]]

View the code in [Google Colab](https://colab.research.google.com/drive/1QpNFAyNgmGRcv7cTBMBXXDAeVA4V45wK?usp=sharing)

## Evidence maximization

In the previous example, we assumed $\alpha=0.005,\beta=10$. We want a principled approach to find optimal values for these hyperparameters. This can be done via an iterative process. We begin with

$$\begin{align}\alpha_0&=1\\ \beta_0&=1\\\end{align}$$

and proceed with the updates

$$\begin{align}\alpha&\rightarrow \frac{M}{\boldsymbol\mu^\mathrm T_{\mathbf w}\boldsymbol \mu_\mathbf w +\mathrm{Tr}(\mathbf\Sigma_\mathbf w)}\\ \beta&\rightarrow \frac{N-\gamma}{\| \mathbf y-\mathbf X\boldsymbol\mu_\mathbf w \|^2}\end{align}$$

where $(N,M)$ is the shape of $\mathbf X$ (number of data points and number of features), and $\gamma=M-\alpha\cdot \mathrm{Tr}(\mathbf\Sigma_\mathbf w)$ is the effective number of parameters. We repeat until the relative change of the hyperparameters is smaller than some $\epsilon$.

## Demo

![[blr_ml.png]]

View the code in [Google Colab](https://colab.research.google.com/drive/1GUsJJw_3RcZgNv3EdcOf7ffwuuxqSsNa?usp=sharing)

We notice that the converged $\beta=1.31e+02$ is about 13x higher than the $\beta=10$ assumed in the first demo. This means **Evidence Maximization** concluded that the noise variance $\sigma^2=1/\beta$ is 13x smaller than initially assumed. This is why the predictive standard deviation (pink shade) is slightly tighter.

## Production Usage

While the code is mathematically correct, it is far from ready for production use. The first issue is numerical precision. For example, when computing the posterior mean, a more precise Python version solves the linear system instead of inverting the matrix explicitly:

```python
posterior_mean = np.linalg.solve(
    self._alpha * np.identity(X.shape[1]) + self._beta * X.T @ X,
    self._beta * X.T @ Y
)
```

The second issue is efficiency. The matrix $\mathbf{X}^{\mathrm{T}}\mathbf{X}$ is needed to calculate both the posterior covariance $\mathbf{\Sigma}_{\mathbf{w}}$ and the posterior mean $\boldsymbol{\mu}_{\mathbf{w}}$ in every iteration, making it the most expensive recurring computation. Pre-computing this matrix outside the iterative loop saves significant repeated matrix multiplication.
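Putting both fixes together, here is a sketch of the evidence-maximization loop (illustrative function names, not the notebook's actual code) that pre-computes $\mathbf{X}^{\mathrm{T}}\mathbf{X}$ once and uses `np.linalg.solve` for the posterior mean, plus a helper implementing the predictive formulas above:

```python
import numpy as np

def fit_evidence(X, y, alpha=1.0, beta=1.0, eps=1e-4, max_iter=1000):
    """Evidence maximization with the two production fixes:
    the Gram matrix X^T X is pre-computed once, and the posterior mean
    is obtained with np.linalg.solve instead of an explicit inverse."""
    N, M = X.shape
    gram = X.T @ X            # pre-computed once, reused every iteration
    Xty = X.T @ y
    for _ in range(max_iter):
        # Posterior for the current (alpha, beta)
        precision = alpha * np.identity(M) + beta * gram
        Sigma_w = np.linalg.inv(precision)   # full covariance needed for the trace
        mu_w = np.linalg.solve(precision, beta * Xty)
        # Hyperparameter updates
        gamma = M - alpha * np.trace(Sigma_w)
        new_alpha = M / (mu_w @ mu_w + np.trace(Sigma_w))
        new_beta = (N - gamma) / np.sum((y - X @ mu_w) ** 2)
        # Stop once the relative change of both hyperparameters is below eps
        converged = (abs(new_alpha - alpha) / alpha < eps
                     and abs(new_beta - beta) / beta < eps)
        alpha, beta = new_alpha, new_beta
        if converged:
            break
    return mu_w, Sigma_w, alpha, beta

def predict(X_star, mu_w, Sigma_w, beta):
    """Predictive distribution: mu_* = X_* mu_w, Sigma_* = I/beta + X_* Sigma_w X_*^T."""
    mean = X_star @ mu_w
    cov = np.identity(X_star.shape[0]) / beta + X_star @ Sigma_w @ X_star.T
    return mean, cov
```

The design matrix `X` here is whatever feature expansion of the inputs is used in the demos; the loop returns the posterior from the last iteration along with the converged hyperparameters.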