L2 Regularization from Probabilistic Perspective
Hi everyone! In the previous post we noted that least-squares regression is very prone to overfitting. Due to the assumptions used to derive it, the L2 loss function is sensitive to outliers, i.e. outliers can penalize the L2 loss heavily and throw off the model entirely. You may already be aware that adding a regularization term helps improve the robustness of the model. The regularized L2 loss is expressed as follows:
\[\begin{aligned} \mathcal{E}(w) = \frac{1}{2} \sum^n_{i=1} \Big( f_w(x_i) - y_i \Big)^2 \underbrace{+ \lambda w^T w}_{\text{regularization term}} \end{aligned}\]In this post we show how this regularization term is derived from the probabilistic perspective. Enjoy!
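To make this objective concrete, here is a minimal NumPy sketch of the regularized loss, assuming a linear model \(f_w(x) = w^T x\); the data and the value of `lam` are made up purely for illustration.

```python
import numpy as np

# A minimal sketch of the regularized L2 loss above, assuming a linear
# model f_w(x) = w^T x. The data and the value of lam are illustrative.
def regularized_l2_loss(w, X, y, lam):
    residuals = X @ w - y                     # f_w(x_i) - y_i for every data point
    data_term = 0.5 * np.sum(residuals ** 2)  # 1/2 * sum of squared errors
    reg_term = lam * w @ w                    # lambda * w^T w
    return data_term + reg_term

# Tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
print(regularized_l2_loss(w, X, y, lam=0.1))
```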
Assumptions
Recall these assumptions that we used to derive least-squares regression (a short simulation of this generative model follows the list):
- Each target value is generated according to this function: \(y_i = f_{\hat{w}}(x_i) + e_i\)
- The error on each data point is distributed independently and identically according to a gaussian distribution of mean 0 and some variance \(\sigma^2\), namely, \(e_i \sim \mathcal{N}\big(0, \sigma^2 \big)\)
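To make these assumptions concrete, here is a small simulation of the generative model (a sketch only; the "true" weights \(\hat{w}\), the noise level, and the linear form of \(f_{\hat{w}}\) are made-up choices for illustration):

```python
import numpy as np

# Simulate the assumed generative process y_i = f_w_hat(x_i) + e_i with
# e_i ~ N(0, sigma^2), assuming a linear f. All values below are made up.
rng = np.random.default_rng(0)
n, d = 100, 3
w_hat = np.array([1.5, -0.7, 2.0])   # the "true" (unknown) parameters
sigma = 0.5                          # standard deviation of the noise

X = rng.normal(size=(n, d))                   # descriptors x_i
e = rng.normal(loc=0.0, scale=sigma, size=n)  # i.i.d. gaussian errors e_i
y = X @ w_hat + e                             # targets y_i = f_w_hat(x_i) + e_i
```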
In overfitting, our model fits the noise or random error instead of the relationship between the descriptors and the target value. When this happens, the model parameters or weights are usually excessively complex: viewed as a vector in weight space, the weight vector has a very high norm. In other words, the value of each element is very large.
To prevent this from happening, we shall introduce another assumption. We add a prior assumption that the model parameters \(w\) are distributed according to a multivariate gaussian with mean 0 and covariance matrix \(\Sigma\). With this prior we can keep the values of \(w\) small, or sufficiently close to 0.
We can in fact take this assumption even further and suppose that the variables of this multivariate gaussian are independent of each other. The covariance matrix then decomposes into an identity matrix multiplied by a scalar \(\alpha\), namely \(\Sigma = \alpha I\). We can think of \(\alpha\) as a constant that controls the width of the gaussian curve.
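As a quick illustration (a sketch with arbitrary values of \(d\) and \(\alpha\)), sampling \(w \sim \mathcal{N}(0, \alpha I)\) shows that a smaller \(\alpha\) keeps the sampled weight vectors closer to 0:

```python
import numpy as np

# Sample weight vectors from the prior w ~ N(0, alpha * I).
# Smaller alpha concentrates the samples around 0; d and the alpha
# values below are arbitrary, chosen only for illustration.
rng = np.random.default_rng(0)
d = 5
for alpha in (0.1, 1.0, 10.0):
    w_samples = rng.multivariate_normal(np.zeros(d), alpha * np.eye(d), size=1000)
    mean_sq_norm = np.mean(np.sum(w_samples ** 2, axis=1))  # expected value is alpha * d
    print(f"alpha = {alpha:4.1f} -> average ||w||^2 over samples: {mean_sq_norm:.2f}")
```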
Probabilistic Point of View
We have previously shown that minimizing \(\mathcal{E}(w)\) is equivalent to maximizing the likelihood of \(w\), that is, maximizing the probability of generating the data given the model parameters, \(p(\mathcal{D}|w)\). By the conditional probability rule, we can express the following equalities:
\[\begin{aligned} p(w|\mathcal{D})*p(\mathcal{D}) & = p(\mathcal{D}|w)*p(w)\newline p(w|\mathcal{D}) & = \frac{\overbrace{p(\mathcal{D}|w)}^{\text{likelihood term}}*p(w)}{p(\mathcal{D})} \end{aligned}\]From the above derivation, we can see that the posterior term \(p(w|\mathcal{D})\) on the left-hand side is proportional to the likelihood term multiplied by the prior \(p(w)\). One might ask: why should we maximize the posterior instead of the likelihood?
Notice that by explicitly maximizing the posterior, we gain the advantage of incorporating the prior \(p(w)\) into our estimation. This way, our prior assumption that \(w \sim \mathcal{N}\big(0, \alpha I \big)\) can be put to use. Lastly, since the denominator \(p(\mathcal{D})\) acts only as a constant in this maximization, we can simply ignore it.
Now we recast our goal from finding the \(w\) that maximizes the likelihood to finding the \(w\) that maximizes the posterior. We refer to this goal as maximum a posteriori (MAP) estimation of \(w\), and denote the estimate \(w_{\text{MAP}}\).
With all this intuition in place, we can now start deriving our regularized L2 loss function! See how \(w_{\text{MAP}}\) can be decomposed as follows:
\[\begin{aligned} w_{\text{MAP}} & = \operatorname*{argmax}_{w} p(\mathcal{D}|w)*p(w) \newline & = \operatorname*{argmax}_{w} \Big[ \prod_i^n p(y_{i}|x_{i},w) \Big] * p(w \mid \mu = 0, \Sigma = \alpha I) \newline & \text{by the independence assumptions:}\newline & = \operatorname*{argmax}_{w} \Big[ \prod_i^n \frac{1}{\sigma \sqrt{2\pi}} \exp \big\{ -\frac{(f_w (x_i)-y_i)^2}{2\sigma^2} \big\} \Big] * \frac{1}{(2\pi)^{\frac{d}{2}} \mid\alpha I \mid^{\frac{1}{2}}} \exp\big\{ -\frac{1}{2}(w-\mu)^T (\alpha I)^{-1} (w-\mu) \big\}\newline & \text{simplified further by the fact that } \mu = 0\text{:}\newline & = \operatorname*{argmax}_{w} \Big[ \prod_i^n \frac{1}{\sigma \sqrt{2\pi}} \exp \big\{ -\frac{(f_w (x_i)-y_i)^2}{2\sigma^2} \big\} \Big] * \frac{1}{(2\pi)^{\frac{d}{2}} \mid\alpha I \mid^{\frac{1}{2}}} \exp\big\{ -\frac{1}{2}w^T (\alpha I)^{-1} w \big\}\newline & \textbf{note: }d\text{ is the dimension of the weight vector } w \end{aligned}\]At this point, we can use the following useful properties of the identity matrix:
- \(\det(cA) = c^n \det(A)\), where \(A\) is an \(n \times n\) matrix and \(c\) is a constant
- the determinant of an identity matrix is 1
- \((\alpha I)^{-1} = \frac{1}{\alpha} I\), where \(\alpha\) is a constant and \(I\) is the identity matrix

We can re-express our \(w_{\text{MAP}}\) equation using the above properties:
\[\begin{aligned} w_{\text{MAP}} & = \operatorname*{argmax}_{w} \Big[ \prod_i^n \frac{1}{\sigma \sqrt{2\pi}} \exp \big\{ -\frac{(f_w (x_i)-y_i)^2}{2\sigma^2} \big\} \Big] * \frac{1}{(2\pi\alpha)^{\frac{d}{2}}} \exp\big\{ -\frac{1}{2\alpha} w^T w \big\} \end{aligned}\]
As we already know, maximizing the posterior is equivalent to minimizing its negative log. Therefore, we can restate \(w_{\text{MAP}}\) as follows:
\[\begin{aligned} w_{\text{MAP}} & = \operatorname*{argmin}_{w} \sum_i^n \Big[ \log \big( \sigma \sqrt{2\pi} \big) + \frac{(f_w (x_i)-y_i)^2}{2\sigma^2} \Big] + \frac{d}{2} \log \big( 2\pi\alpha \big) + \frac{1}{2\alpha} w^T w \newline & \text{... the two log terms are constants that can be ignored:}\newline & = \operatorname*{argmin}_{w} \sum_i^n \frac{(f_w (x_i)-y_i)^2}{2\sigma^2} + \frac{1}{2\alpha} w^T w \end{aligned}\]Note: we cannot also drop the remaining denominators \(2\sigma^2\) and \(2\alpha\). Unlike an overall constant factor, they weight the two terms differently, so removing them would change which \(w\) minimizes the function.
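As a sanity check of this step, here is a sketch (synthetic data, using SciPy's gaussian densities) showing that the exact negative log of likelihood times prior differs from the simplified objective above only by an additive constant that does not depend on \(w\):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Check that -log[ likelihood * prior ] and the simplified objective
# sum_i (f_w(x_i)-y_i)^2 / (2 sigma^2) + w^T w / (2 alpha)
# differ only by a constant independent of w. Data are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
sigma, alpha = 0.5, 2.0

def neg_log_posterior(w):
    log_lik = norm.logpdf(y, loc=X @ w, scale=sigma).sum()
    log_prior = multivariate_normal.logpdf(w, mean=np.zeros(3), cov=alpha * np.eye(3))
    return -(log_lik + log_prior)

def simplified_objective(w):
    return np.sum((X @ w - y) ** 2) / (2 * sigma ** 2) + w @ w / (2 * alpha)

for _ in range(3):
    w = rng.normal(size=3)
    print(neg_log_posterior(w) - simplified_objective(w))  # same constant every time
```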
We can, however, multiply both terms by the constant \(\sigma^2\), which rescales the whole objective and therefore does not change the minimizer: \(\operatorname*{argmin}_{w} \sum_i^n \frac{1}{2} (f_w (x_i)-y_i)^2 + \frac{\sigma^2}{2\alpha} w^T w\).
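A quick numerical check of this step (a sketch with synthetic data, using scipy.optimize.minimize): the rescaled objective should have the same minimizer as the one before scaling.

```python
import numpy as np
from scipy.optimize import minimize

# Verify that scaling the objective by sigma^2 does not change the argmin:
# both objectives below should be minimized by (numerically) the same w.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
sigma, alpha = 0.3, 1.0

def original_objective(w):
    return np.sum((X @ w - y) ** 2) / (2 * sigma ** 2) + w @ w / (2 * alpha)

def scaled_objective(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + (sigma ** 2 / (2 * alpha)) * (w @ w)

w_a = minimize(original_objective, np.zeros(3)).x
w_b = minimize(scaled_objective, np.zeros(3)).x
print(np.allclose(w_a, w_b, atol=1e-4))  # True: same minimizer
```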
Finally, we introduce a constant \(\lambda\), where \(\lambda = \frac{\sigma^2}{2\alpha}\) (absorbing the factor \(\frac{1}{2}\) of the regularization term), and take the constant \(\frac{1}{2}\) outside of the summation. Voilà! We arrive at the equation shown at the beginning of this post:
\[\begin{aligned} \mathcal{E}(w) = \frac{1}{2} \sum^n_{i=1} \Big( f_w(x_i) - y_i \Big)^2 + \lambda w^T w \end{aligned}\]
Additional Notes
We have seen how the L2 regularization term is derived, and what the magic scalar \(\lambda\) is. We saw that \(\sigma^2\) controls how wide the error distribution is, while \(\alpha\) controls the width of the gaussian prior over the weights \(w\). Therefore, \(\lambda\) implicitly controls both quantities: as \(\lambda\) gets larger, we allow the model to tolerate more error (a wider error distribution), and at the same time we squeeze the distribution of the weights \(w\) closer to 0. This way we avoid weights that are too large, and we also prevent outliers from penalizing the loss function too heavily.
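To see this effect numerically, here is a small sketch (synthetic data, arbitrary \(\lambda\) values) that minimizes \(\mathcal{E}(w)\) for several values of \(\lambda\) and prints the norm of the fitted weights; the larger the \(\lambda\), the closer the weights are pulled toward 0:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize E(w) = 1/2 * sum (f_w(x_i) - y_i)^2 + lambda * w^T w for a few
# values of lambda and watch the norm of the fitted weights shrink.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

def regularized_loss(w, lam):
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * w @ w

for lam in (0.0, 1.0, 10.0, 100.0):
    w_fit = minimize(regularized_loss, np.zeros(3), args=(lam,)).x
    print(f"lambda = {lam:6.1f} -> ||w|| = {np.linalg.norm(w_fit):.3f}")
```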
That is all for now. I hope you enjoyed it, and see you in the next post.