That's my question. I have been looking around online and people post a formula, but they don't explain it. Could anyone please give me a hand with that? Cheers
- From a geometric point of view, it is almost obvious. The residual vector $r$ is perpendicular to the linear space spanned by the column feature vectors, which, by convention, contains the intercept vector $1$. Therefore $1^Tr = 0$, and we are done. – Zhanxiong Jan 30 '19 at 03:28
4 Answers
If the OLS regression contains a constant term, i.e. if the regressor matrix contains a column of ones, then the sum of residuals is exactly equal to zero, as a matter of algebra.
For the simple regression,
specify the regression model
$$y_i = a +bx_i + u_i\,,\; i=1,...,n$$
Then the OLS estimator $(\hat a, \hat b)$ minimizes the sum of squared residuals, i.e.
$$(\hat a, \hat b) : \sum_{i=1}^n(y_i - \hat a - \hat bx_i)^2 = \min$$
For the OLS estimator to be the argmin of the objective function, it must be the case, as a necessary condition, that the first partial derivatives with respect to $a$ and $b$, evaluated at $(\hat a, \hat b)$, equal zero. For our result, we need only consider the partial derivative w.r.t. $a$:
$$\frac {\partial}{\partial a} \sum_{i=1}^n(y_i - a - bx_i)^2 \Big |_{(\hat a, \hat b)} = 0 \Rightarrow -2\sum_{i=1}^n(y_i - \hat a - \hat bx_i) = 0 $$
But $y_i - \hat a - \hat bx_i = \hat u_i$, i.e. is equal to the residual, so we have that
$$\sum_{i=1}^n(y_i - \hat a - \hat bx_i) = \sum_{i=1}^n\hat u_i = 0 $$
The above also implies that if the regression specification does not include a constant term, then the sum of residuals will not, in general, be zero.
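As a quick numerical illustration of both points (a sketch only; the coefficients 2 and 3, the sample size, and the random seed are arbitrary choices, not part of the argument):

```python
# Sketch: fit OLS with and without a constant and sum the residuals.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# With a constant: the design matrix has a column of ones.
X_const = np.column_stack([np.ones(n), x])
beta_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)
resid_const = y - X_const @ beta_const
print(resid_const.sum())      # ~0, up to floating-point error

# Without a constant: only the x column.
X_nocon = x.reshape(-1, 1)
beta_nocon, *_ = np.linalg.lstsq(X_nocon, y, rcond=None)
resid_nocon = y - X_nocon @ beta_nocon
print(resid_nocon.sum())      # in general not zero
```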
For the multiple regression,
let $\mathbf X$ be the $n \times k$ matrix containing the regressors, $\hat {\mathbf u}$ the residual vector and $\mathbf y$ the dependent variable vector. Let $\mathbf M = I_n-\mathbf X(\mathbf X'\mathbf X)^{-1}\mathbf X'$ be the "residual-maker" matrix, called thus because we have
$$\hat {\mathbf u} = \mathbf M\mathbf y$$
It is easily verified that $\mathbf M \mathbf X = \mathbf 0$. Also $\mathbf M$ is idempotent and symmetric.
Now, let $\mathbf i$ be a column vector of ones. Then the sum of residuals is
$$\sum_{i=1}^n \hat u_i = \mathbf i'\hat {\mathbf u} =\mathbf i'\mathbf M\mathbf y = \mathbf i'\mathbf M'\mathbf y = (\mathbf M\mathbf i)'\mathbf y = \mathbf 0' \mathbf y = \mathbf 0$$
So we need the regressor matrix to contain a series of ones, so that we get $\mathbf M\mathbf i = \mathbf 0$.
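If you want to check the matrix algebra numerically, here is a small NumPy sketch (the data are arbitrary random draws) that builds $\mathbf M$ explicitly and verifies $\mathbf M\mathbf X = \mathbf 0$ and $\mathbf i'\mathbf M\mathbf y = 0$:

```python
# Sketch: build the residual-maker matrix M and check its properties.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
# First column of ones, remaining columns arbitrary regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(M @ X, 0.0))              # True: M annihilates every column of X
u_hat = M @ y                               # residual vector
print(np.isclose(np.ones(n) @ u_hat, 0.0))  # True: i'My = 0 because Mi = 0
```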
- Thanks a lot, I didn't see your answer till now, really appreciate it :) – Maximilian1988 Sep 19 '13 at 04:13
- @StanShunpike Just $a$ (the unknown quantity); $\hat a$ is an estimator/estimate. In practice this means that we include as a regressor a series of ones. – Alecos Papadopoulos May 17 '16 at 02:20
- How does the regressor matrix having a series of ones imply $Mi=0$? Thanks for your answer! – manofbear Feb 10 '17 at 01:14
- @manofbear $\mathbf i$ is a series of ones. Part of the properties of $M$ is that if it premultiplies any column of $X$, we get the zero vector. So if a column of ones is not in the regressor matrix, we don't get $\mathbf M \mathbf i=0$. – Alecos Papadopoulos Feb 10 '17 at 01:51
- The OLS estimator should be the argmin over $a,b$ that minimizes the SSE. Not sure what your expression is. – qwr Oct 24 '18 at 07:38
- @qwr The expression is the sum of squared residuals. I cannot see what is ambiguous about it. – Alecos Papadopoulos Oct 24 '18 at 07:41
- You have a min on the RHS which doesn't correspond to anything. I think it should be argmin. – qwr Oct 24 '18 at 07:42
- @qwr The expression translates as "alpha and beta such that (:) the sum of squared residuals is (=) minimum". Certainly, one could alternatively write $$(\hat a, \hat b) = \arg\min_{a,b} \sum_{i=1}^n(y_i - a - bx_i)^2$$ – Alecos Papadopoulos Oct 24 '18 at 07:45
- OK, you should write that then, since the current expression does not make sense. – qwr Oct 24 '18 at 07:47
- @qwr I disagree. "min" means both "the minimum of" (an expression), but it can also be used as plain shorthand for the word "minimum". – Alecos Papadopoulos Oct 24 '18 at 07:50
The accepted solution by Alecos Papadopoulos has a mistake at the end. I can't comment so I will have to submit this correction as a solution, sorry.
It's true that a series of ones would do the job, but it's not true that we need it: the regressor matrix does not need to contain a column of ones in order for $Mi = 0$ to hold.
Theorem: If there exists a $p \times 1$ vector $v$ such that $$Xv = 1_n\,,$$
where $1_n$ is an $n \times 1$ vector of ones, then $$\sum_{i=1}^n e_i=0\,.$$ Proof: $\sum_{i=1}^n e_i = e^T 1_n = e^T X v = (e^T X) v = (X^T e)^T v = 0^T v = 0$.
Above I am using the fact that $X^T e = 0$. Having a column of ones in $X$ (i.e. an intercept) is just a special case of such a $v$: if the intercept is in the first column, then $v = [1, 0, 0, \dots, 0]^T$.
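Here is a small numerical sketch of this point (the data and coefficients are arbitrary): the design matrix below has no column of ones, but its two group dummies sum to one in every row, so $Xv = 1_n$ with $v = (1, 1, 0)^T$, and the residuals still sum to zero.

```python
# Sketch: no explicit intercept column, but Xv = 1_n still holds.
import numpy as np

rng = np.random.default_rng(2)
n = 80
group = rng.integers(0, 2, size=n)          # two exhaustive groups
x = rng.normal(size=n)
# Columns: dummy for group 0, dummy for group 1, a continuous regressor.
X = np.column_stack([(group == 0), (group == 1), x]).astype(float)
y = 1.0 + 0.5 * x + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
print(e.sum())    # ~0 even though X contains no column of ones
```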
I want to provide a more general answer from the statistical sense of the word "residual". I figured this out in my quest to understand degrees of freedom and Bessel's correction in statistics.
A residual in statistics means the difference between a variable's value and the sample mean (not the true, usually unknowable average).
So if $x_i, i \in \{1, ..., N\}$ represents a sample value:
$$ \sum_i r_i = \sum_i (x_i - \mu) = \sum_i \Big(x_i - \frac{1}{N}\sum_j x_j\Big) = \sum_i x_i - N \cdot \frac{1}{N}\sum_i x_i = 0 $$
This is the basis of the argument for Bessel's correction, the practice of dividing the sum of squared residuals by $N-1$ rather than $N$:
$$ \sigma^2 = \frac{1}{N - 1}\sum(x_i - \mu)^2$$
The idea is (according to Wikipedia) that the residuals are not independent because they sum to zero, so you subtract one. I do not actually understand this statement (how do we know that the residuals span an $(N-1)$-dimensional space and not, say, an $(N-2)$-dimensional one?). However, this nice proof explains the intuition behind the correction from a functional/applied point of view.
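A quick check of the same fact in code (a sketch; the distribution parameters are arbitrary), together with the $N-1$ divisor:

```python
# Sketch: deviations from the sample mean sum to zero, and the sample
# variance divides by N - 1 (Bessel's correction).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

r = x - x.mean()                      # residuals about the sample mean
print(r.sum())                        # ~0, up to floating-point error
print(r @ r / (len(x) - 1))           # sample variance with N - 1
print(np.var(x, ddof=1))              # NumPy's built-in equivalent
```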
Take the estimated values from the line of best fit, subtract them from the original $y$ values, and add up the differences. For an OLS line of best fit with an intercept the sum is exactly zero (up to rounding), whereas for a poorly fitting line it can be far below or above zero.