Linear Regression from Maximum Likelihood View
Hi everyone! In this post, we will discuss a very simple subject, particularly for those who have been studying machine learning: linear regression. However, we will approach it from the maximum likelihood point of view. Hopefully this gives you some intuition of how the linear regression objective function can be derived.
You should be pretty familiar with the sum-of-squares (or mean-squared) error function that the linear regression setup tries to minimize. Namely: \[ \begin{aligned} \mathcal{E}(w) = \frac{1}{2} \sum^n_{i=1} \big( f_w(x_i) - y_i \big)^2 \end{aligned} \]
where \(y_i\) is the true target value and \(f_w (x_i)\) is a linear function parameterized by the weights \(w\) that we estimate from the data. Namely, \(f_w (x_i) = w^\top \phi (x_i)\), where \(\phi\) is the feature map \(\phi: \mathcal{R}^D \rightarrow \mathcal{R}^m\) (if you’re not familiar with feature maps, simply regard \(\phi (x)\) as the regular \(x\)).
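To make the notation concrete, here is a minimal Python/NumPy sketch of the model and the error function above. The polynomial feature map phi is just one hypothetical choice of \(\phi\); nothing in the derivation depends on it.

```python
import numpy as np

def phi(x, m=3):
    """Hypothetical polynomial feature map: phi(x) = (1, x, x^2, ..., x^(m-1))."""
    return np.array([x**k for k in range(m)])

def f(w, x):
    """Linear model f_w(x) = w^T phi(x)."""
    return w @ phi(x)

def sum_of_squares_error(w, X, y):
    """E(w) = 1/2 * sum_i (f_w(x_i) - y_i)^2."""
    return 0.5 * sum((f(w, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))
```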
How can one justify this particular error function?
We will start the argument with the assumption that our data \(\{(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)\}\) are generated by some unknown true target function \(f_{\hat{w}}\), where \(\hat{w} \in \mathcal{R}^m\), plus some random noise \(e_i\) for each data point \(x_i\). We can then write each target value \(y_i\) as:
\[ \begin{aligned} y_i & = f_{\hat{w}}(x_i) + e_i\newline & = \hat{w}^\top \phi (x_i) + e_i \end{aligned} \]
The second assumption is that each \(e_i\) is i.i.d. and distributed according to a normal (Gaussian) distribution with some variance, centered at 0:
\[ e_i \sim \mathcal{N}\big(0, \sigma^2 \big) \]
By this assumption, each true target value \(y_i\) is also normally distributed:
\[ y_i \sim \mathcal{N}\big(f_{\hat{w}}(x_i), \sigma^2 \big) \]
Note: \(\sigma \in (0, \infty)\), and this \(\sigma\) is shared across all \(i = 1, 2, \dots, n\).
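Under these two assumptions, generating a synthetic dataset looks like the sketch below, continuing the code above. The true weights w_hat, the noise level sigma, and the input range are made-up values for illustration.

```python
rng = np.random.default_rng(0)

n, sigma = 50, 0.3
w_hat = np.array([1.0, -2.0, 0.5])       # the unknown "true" weights (made up here)

X = rng.uniform(-1.0, 1.0, size=n)       # observed inputs x_1, ..., x_n
noise = rng.normal(0.0, sigma, size=n)   # e_i ~ N(0, sigma^2), i.i.d.
y = np.array([f(w_hat, x_i) for x_i in X]) + noise   # y_i = w_hat^T phi(x_i) + e_i
```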
One fundamental notion in statistics and machine learning that we will introduce here is the likelihood function and the maximum likelihood estimate. The likelihood function is a function of the parameters \(w\) that measures how likely the parameters are given the data \(D\). It turns out that the likelihood equals the probability that the data \(D\) is generated given the parameters \(w\). Formally, the likelihood function \(\mathcal{L}: \mathcal{R}^m \rightarrow \mathcal{R}\) maps a parameter vector to a real number. In our case of linear regression, the likelihood function of \(w\) is defined as follows:
\[ \begin{aligned} \mathcal{L}(w \mid D) & = P(D \mid w)\newline & = P(y_1,y_2,\dots,y_n \mid x_1,x_2,\dots,x_n, w) \end{aligned} \]
Note: in a regression problem, the \(x_i\) are observed variables, and therefore we condition on \(x\) in addition to \(w\).
The next important notion is the maximum likelihood estimate. It is simply defined as the set of parameters that gives the maximum likelihood given the data. We can denote the maximum likelihood estimate of \(w\) as follows:
\[\begin{aligned} \hat{w} & = \operatorname*{argmax}_{w \in \mathcal{R}^m} \mathcal{L}(w \mid D)\newline & = \operatorname*{argmax}_{w \in \mathcal{R}^m} P(D \mid w)\newline & = \operatorname*{argmax}_{w \in \mathcal{R}^m} P(y_1,y_2,\dots,y_n \mid x_1,x_2,\dots,x_n, w)\newline & = \operatorname*{argmax}_{w} \prod_{i=1}^n P(y_i \mid x_i, w) & \text{(by conditional independence of $y_i$)} \end{aligned}\]Now, recall our assumption that \(y_i \sim \mathcal{N}\big(f_{\hat{w}}(x_i), \sigma^2 \big)\). By this assumption, we can plug the Gaussian density formula into our maximum likelihood estimate equation. Namely:
\[\begin{aligned} \hat{w} & = \operatorname*{argmax}_{w} \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{( y_i - f_w (x_i))^2}{2 \sigma^2}} \end{aligned}\]Maximizing the above expression is equivalent to minimizing its negative logarithm. We do so because the log removes the exponential and turns the product into a sum, which is much more convenient to work with. We should now have:
\[\begin{aligned} \hat{w} & = \operatorname*{argmin}_{w} \sum_{i=1}^n \left[ \frac{( y_i - f_w (x_i))^2}{2 \sigma^2} - \log \frac{1}{\sqrt{2 \pi \sigma^2}} \right] \end{aligned}\]Because we are minimizing with respect to \(w\), the factor \(2\sigma^2\) and the log term on the right are constants that we can ignore. We also multiply the expression by the constant \(\frac{1}{2}\) to make taking its derivative more convenient.
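Here is the Gaussian likelihood from the product above written as a small Python function (continuing the running sketch), plus a quick numerical check of this step: the negative log-likelihood and the scaled error \(\mathcal{E}(w)/\sigma^2\) differ only by \(n \log \sqrt{2 \pi \sigma^2}\), which does not depend on \(w\), so both have the same minimizer.

```python
def likelihood(w, X, y, sigma):
    """L(w | D) = prod_i N(y_i | f_w(x_i), sigma^2)."""
    residuals = y - np.array([f(w, x_i) for x_i in X])
    return np.prod(np.exp(-residuals**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))

def neg_log_likelihood(w, X, y, sigma):
    """-log L(w | D) = sum_i (y_i - f_w(x_i))^2 / (2 sigma^2) + n * log(sqrt(2 pi sigma^2))."""
    residuals = y - np.array([f(w, x_i) for x_i in X])
    return np.sum(residuals**2 / (2 * sigma**2)) + len(X) * np.log(np.sqrt(2 * np.pi * sigma**2))

for w in [np.zeros(3), w_hat, rng.normal(size=3)]:
    # The gap is the same constant, n * log(sqrt(2 * pi * sigma^2)), for every w.
    print(neg_log_likelihood(w, X, y, sigma) - sum_of_squares_error(w, X, y) / sigma**2)
```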
Finally, we end up with the expression that we need to minimize:
\[\begin{aligned} \hat{w} & = \operatorname*{argmin}_{w} \sum_{i=1}^n \frac{1}{2} (y_i - f_w (x_i))^2 \end{aligned}\]Voila! There we have the loss function that we are all familiar with. And so, the take-home lesson from this derivation is that finding the maximum likelihood estimate of \(w\) is equivalent to minimizing the sum-of-squares error.
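To close the loop on the running sketch: minimizing this sum-of-squares error, here via the ordinary least-squares solution computed with np.linalg.lstsq on the feature matrix, recovers weights close to the made-up w_hat and attains at least as high a likelihood as the true weights on the training data.

```python
Phi = np.array([phi(x_i) for x_i in X])           # n x m matrix of features phi(x_i)
w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # minimizes sum_i (y_i - w^T phi(x_i))^2

print(w_mle)   # close to w_hat = [1.0, -2.0, 0.5]
print(likelihood(w_mle, X, y, sigma) >= likelihood(w_hat, X, y, sigma))   # True
```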
You should also notice that we used several assumptions in this derivation. In practice, these assumptions do not always hold. For example, the assumption of a Gaussian error distribution makes the model very sensitive to outliers. In the next post, we will show why that is the case, and how a regularization term can be justified to improve the model's robustness to outliers.
See you in the next post.