6

Linear regression minimizes the sum of squared residuals to find the best fit. Why? I fully understand that we do not want to use the raw residuals, since positive and negative values would cancel each other out. But then why don't we use absolute values instead? Sorry if this sounds like a duplicate question. I have seen many explanations but not an easy-to-understand answer. For example, some say that squares make the calculation easier. How so?

Your insight is highly appreciated!

  • 1
    If you haven't seen this one yet, please at least take a look at Least Squares: Minimum norm estimate that deals with a very important reason to use least "squares". Another related one, although possibly not as useful, is How does minimum squared error relate to a linear system?. – John Omielan Mar 13 '20 at 20:51
  • 2
    None of the answers so far mention this and the question is now closed, but the two could also give very different answers. Squaring heavily penalises a large error, so it could give a significantly different result which tries to be closer to outliers (which may or may not be desired). Consider that 10 errors of 1 would match 1 error of 10 using absolute values, compared to 1 error of only $\sqrt{10} = 3.16$ when using squares. – Bernhard Barker Mar 14 '20 at 15:31
  • Work through the proof of the Gauss-Markov theorem, which shows that OLS is BLUE (the best linear unbiased estimator). I can't see any other way of getting the intuition solid. The proof should make it obvious. – Hexatonic Jun 22 '23 at 14:07

6 Answers

12

As mentioned by others, the least-squares problem is much easier to solve. But there’s another important reason: assuming IID Gaussian noise, the least-squares solution is the Maximum-Likelihood estimate.
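
As a quick sketch of why this holds (assuming a simple linear model $y_k = ax_k + b + \varepsilon_k$ with IID noise $\varepsilon_k \sim \mathcal{N}(0,\sigma^2)$), the log-likelihood is
$$\log L(a,b) = \sum_{k=1}^n \log\!\left(\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(y_k-ax_k-b)^2}{2\sigma^2}}\right) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{k=1}^n (y_k-ax_k-b)^2,$$
so maximizing the likelihood over $(a,b)$ is exactly the same as minimizing the sum of squared residuals.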

  • 2
    Yes, this is the “proper reason”. With Gaussian noise, a faraway outlier is very unlikely to be merely the result of chance, so it makes sense to penalise data points with strong deviation from the model more than proportionally to closer ones. OTOH, this can break badly if the noise is not Gaussian – if some of the outliers really are measurement errors, then these will ruin a least-square fit much more than a “robust fit” such as L¹. – leftaroundabout Mar 14 '20 at 11:00
  • 1
    @leftaroundabout Very true. It's worth testing noise's Normality. Luckily, there are many reasons for errors to be Normal (although my linked discussion presents some important counterexamples). – J.G. Mar 14 '20 at 11:51
  • 1
    Right! It might be also worth pointing out that if we assume Laplace-distributed noise, the maximum likelihood estimate corresponds to the solution of the least absolute deviations. – Dan Oneață Mar 15 '20 at 19:39
8

$$\min_{a,b}\sum_{k=1}^n(ax_k+b-y_k)^2$$ has a simple analytical solution.

$$\min_{a,b}\sum_{k=1}^n|ax_k+b-y_k|$$ is difficult.

One reason is that the absolute value is not differentiable at zero.
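
To make the contrast concrete, here is a minimal sketch (assuming numpy and scipy are available; the data and variable names are made up for illustration): the least-squares coefficients come from solving a small linear system in closed form, while the absolute-value objective is handed to a generic iterative optimizer because no such formula exists.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

# Least squares: solve the 2x2 normal equations analytically.
X = np.column_stack([x, np.ones_like(x)])      # columns for a and b
a_ls, b_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Least absolute deviations: no closed form, so use an iterative solver.
def l1_loss(params):
    a, b = params
    return np.sum(np.abs(a * x + b - y))

a_l1, b_l1 = minimize(l1_loss, x0=[0.0, 0.0], method="Nelder-Mead").x

print(f"L2 fit: a={a_ls:.3f}, b={b_ls:.3f}")
print(f"L1 fit: a={a_l1:.3f}, b={b_l1:.3f}")
```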

  • 3
    Even worse, the solution for the second one does not have to be unique. E.g. if you pick the four points in a square: $(\pm 1,\pm 1)$, then any line entering that square on the left and leaving on the right is minimal with value 4. – mlk Mar 14 '20 at 11:24
  • @mlk I think that deserves putting in its own answer as it is often overlooked. – Andrew Stacey Mar 14 '20 at 11:50
  • @mlk while this is true and good to know, it's not actually very relevant in practice. For typical fitting of some function to the data, an L¹-minimum tends to work quite well, even if it's not unique: an arbitrary parameter-point within the set of argmins will do. (Your square is actually a nice example for why this can make sense: you have no real reason why the center point would be any more likely than some other point in the minimum. Why would all the measurements have the same error-magnitude?) – leftaroundabout Mar 14 '20 at 17:49
2

In addition to the previous answers, I want to highlight the differences in the solutions obtained when optimizing each of the two objective functions. In particular, if we look at the response variable $y$ conditioned on the explanatory variables $\mathbf{x}$, that is $y | \mathbf{x}$, the algorithm estimates

  • the mean of the response values, in the case of squared differences;
  • the median of the response values, in the case of absolute differences (a short sketch of why follows below).
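
To sketch why, consider the simplest case of predicting a single constant $c$ (the regression setting conditions on $\mathbf{x}$, but the idea is the same). For squared differences,
$$\frac{d}{dc}\sum_{i=1}^n (y_i - c)^2 = -2\sum_{i=1}^n (y_i - c) = 0 \;\Longrightarrow\; c = \frac{1}{n}\sum_{i=1}^n y_i,$$
the mean. For absolute differences, the (sub)derivative is $\sum_{i=1}^n \operatorname{sign}(c - y_i)$, which vanishes when as many $y_i$ lie above $c$ as below, i.e. at a median.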

By replacing the absolute value with a tilted absolute value loss function, we obtain quantile regression. The figures below illustrate the difference between the solutions of the two methods (the images are taken from this assignment, see §2):

[Figure: fitted regression lines minimizing squared differences (conditional mean) versus absolute differences (conditional median) on the same data.]
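
For reference, the tilted absolute value (often called the pinball or quantile loss) for a target quantile $\tau \in (0,1)$ is usually written as
$$\rho_\tau(u) = \begin{cases} \tau\, u, & u \ge 0,\\ (\tau - 1)\, u, & u < 0,\end{cases}$$
which reduces to half the absolute value when $\tau = \tfrac{1}{2}$, recovering the median case above.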

The same resource provides some motivating examples for using quantile regression:

  • A device manufacturer may wish to know what are the 10% and 90% quantiles for some feature of the production process, so as to tailor the process to cover 80% of the devices produced.
  • For risk management and regulatory reporting purposes, a bank may need to estimate a lower bound on the changes in the value of its portfolio which will hold with high probability.
1

In actuality, least absolute value regression is sometimes used, but there are a few reasons why least squares is more popular.

1) Regression is an optimization problem (we are minimizing error), and in calculus we solve such problems by taking the derivative and finding where it equals 0. Absolute value signs are a nightmare to differentiate: they turn the objective into a piecewise function with a kink at zero, whereas squares are far simpler to differentiate and their derivatives are linear in the parameters.

2) Least squares regression lines are more statistically efficient: they don't require as many samples to get a good estimate of the true regression line for the population.

But in all honesty, least squares is more common largely because it ended up that way. There are good arguments that least absolute value is better in many scenarios, including the fact that least squares regression is far more sensitive to outliers.

This sensitivity is illustrated in the linked Wolfram demonstration: https://demonstrations.wolfram.com/ComparingLeastSquaresFitAndLeastAbsoluteDeviationsFit/
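
The sketch below illustrates the same sensitivity (it is not a reproduction of the linked demonstration; the data and names are made up): a single large outlier pulls the least-squares line noticeably, while the least-absolute-deviations line barely moves.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[-1] += 30.0                                   # one large outlier

def fit(loss):
    """Fit y ~ a*x + b by minimizing the given residual loss."""
    obj = lambda p: np.sum(loss(p[0] * x + p[1] - y))
    return minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead").x

a_l2, _ = fit(np.square)   # least squares
a_l1, _ = fit(np.abs)      # least absolute deviations

print(f"L2 slope: {a_l2:.2f}  (pulled toward the outlier)")
print(f"L1 slope: {a_l1:.2f}  (stays near the true slope of 2)")
```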

ajax2112
  • Thanks for your answer. Most of the other answers here just baffle me; however, yours actually makes the most sense. – Simon East Nov 12 '23 at 05:00
1

One can think of a set of $n$ observations as being an $n$-dimensional vector. We then have the Euclidean norm $\sqrt {\sum (y_i-\hat y_i)^2}$. Since minimizing the square root of a value is the same as minimizing the value (for positive numbers), it's simpler to talk of finding the least squares, rather than finding least root mean square.

Using $\sum (y_i-\hat y_i)^2$ rather than $\sqrt {\sum (y_i-\hat y_i)^2}$ has further advantages, such as letting us split $\sum y_i^2$ into the "unexplained" (residual) part $\sum (y_i-\hat y_i)^2$ and the "explained" part $\sum y_i^2-\sum (y_i-\hat y_i)^2$.

Once we have the Euclidean norm, many questions can be answered by looking at the geometry of the space. For instance, the set of vectors of fitted values $\hat y = mx+b$, as $m$ and $b$ vary, forms a plane in that space. Finding the least-squares fit means finding the point on this plane closest to the observation vector, which is obtained by dropping a perpendicular from the observation vector onto the plane (its orthogonal projection), a simple linear algebra problem.
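
A minimal numerical illustration of this picture (a sketch; the data and variable names are made up): the plane is spanned by the vector of $x$-values and the all-ones vector, and the least-squares fit is the orthogonal projection of the observation vector onto it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=20)
y = 3.0 * x - 2.0 + rng.normal(0, 1, size=20)

# The plane of candidate fits m*x + b is spanned by x and the all-ones vector.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares fit = orthogonal projection of y onto that plane.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef                                  # the projected point

# The residual is perpendicular to the plane (orthogonal to both columns).
print(np.allclose(A.T @ (y - y_hat), 0.0))        # True (up to rounding)
print("slope, intercept:", coef)
```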

Acccumulation
0

It is easy to minimize the error when it is measured by least squares. Consider the following: we are given points $(x_k,y_k),\ k=1,\ldots,n$, and we want to find constants $a,b$ such that $y \approx ax+b$. What does $y\approx ax+b$ mean? For example, that $E(a,b):=\sum_{k=1}^n (y_k-ax_k-b)^2$ is minimal in $a,b$. Now \begin{align*} \frac{\partial}{\partial a} E(a,b) &= -2\sum_{k=1}^n (y_k-ax_k-b)x_k = 0\\ \frac{\partial}{\partial b} E(a,b) &= -2\sum_{k=1}^n (y_k-ax_k-b) = 0. \end{align*} The solution is given by the linear system $$ \begin{bmatrix}1 & \frac1n\sum_{k=1}^n x_k \\ \frac1n\sum_{k=1}^n x_k & \frac1n\sum_{k=1}^n x_k^2 \end{bmatrix}\begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} \frac1n\sum_{k=1}^n y_k \\ \frac1n\sum_{k=1}^n x_ky_k \end{bmatrix}, $$ and it can be shown that this is indeed a minimum by looking at the Hessian of $E(a,b)$.
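
As a quick numerical check of this $2\times 2$ system (a sketch; the data and names are made up), the sample averages can be assembled directly and the result compared against a standard least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=40)
y = 1.5 * x + 0.7 + rng.normal(0, 0.3, size=40)

# Assemble the 2x2 system from the sample averages in the derivation above.
M = np.array([[1.0,      x.mean()],
              [x.mean(), (x**2).mean()]])
rhs = np.array([y.mean(), (x * y).mean()])

b_hat, a_hat = np.linalg.solve(M, rhs)            # note the [b, a] ordering

# Cross-check against numpy's polynomial least-squares fit.
a_ref, b_ref = np.polyfit(x, y, deg=1)
print(np.allclose([a_hat, b_hat], [a_ref, b_ref]))   # True
```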

Mick