I'm reading the Wikipedia article Simple Linear Regression. In the article they write a function to be minimized by choosing $\alpha$ and $\beta$:

$$Q(\alpha,\beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

They show the equations for $\alpha$ and $\beta$ that minimize the function, but they don't show the derivation. Thus, I am trying to work through it, but I'm having difficulty.

To find $\alpha$ I'm letting $\frac{\partial Q}{\partial\alpha}=0$ and solving for $\alpha$. Similarly, to find $\beta$ I'm letting $\frac{\partial Q}{\partial\beta}=0$ and solving for $\beta$. However, I'm finding that in both cases I'm getting the equation $\hat \alpha = \bar y - \hat \beta \bar x$. This equation does relate $\alpha$ and $\beta$, but doesn't help you find $\beta$.

What approach could I take to find $\beta$? I wonder if I'm doing the derivative with respect to $\beta$ wrong, but I don't know.

p.s. What is the distinction between bars and hats in the article? I see both of them and think "average".

1 Answer

One defines $\bar y = (y_1+\cdots+y_n)/n$ and similarly with $x$. The bar means the average of the $n$ observations.

Notice that you don't have $n$ different observed values of $\alpha$ or $\beta$ --- in fact you don't have any observed values of those at all. The idea is that $\{(x_i,y_i):i=1,\ldots,n\}$ is a sample of size $n$ taken from a large population, and $\alpha$ and $\beta$ are in effect properties of that population, whereas $\hat\alpha$ and $\hat\beta$ are estimates of $\alpha$ and $\beta$ based on the observed sample of $n$ data points. If you toss the $n$ observed individuals back into the population and stir it up and take another random sample of size $n$, then the values of $\hat\alpha$ and $\hat\beta$ change, but the (unobservable) values of $\alpha$ and $\beta$ remain the same.

Another use of the hat notation is when one writes $$ \hat y_i = \hat\alpha+\hat\beta x_i. $$ Notice that $\hat y_i$ has a "hat" and $x_i$ does not. The number $\hat y_i$ is the $i$th "fitted value". It is an estimate of the average $y$-value among members of the population for which the $x$-value is $x_i$.

The difference $\hat\varepsilon_i=y_i-\hat y_i$ is the $i$th residual, also equal to $y_i-(\hat\alpha+\hat\beta x_i)$, not to be confused with the $i$th error $\varepsilon_i=y_i-(\alpha+\beta x_i)$. The residuals are observable; the errors are not. The residuals must satisfy the two constraints $\sum_{i=1}^n \hat\varepsilon_i=0$ and $\sum_{i=1}^n \hat\varepsilon_i x_i=0$; the errors are subject to no such constraints. That there are two such linear constraints is why one says there are $n-2$ degrees of freedom for error.
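
If you want to see those two constraints in action, here is a small Python sketch (my own illustration, not part of the derivation, assuming NumPy and using made-up data): it fits the least-squares line with the closed-form estimates derived below and checks that the residuals sum to zero and are orthogonal to the $x_i$.

```python
# My own illustration: check the two residual constraints on made-up data,
# using the closed-form least-squares estimates derived further down.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)      # made-up data

beta_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

y_hat = alpha_hat + beta_hat * x             # fitted values  y-hat_i
resid = y - y_hat                            # residuals      eps-hat_i

print(np.sum(resid))        # ~ 0, up to floating-point error
print(np.sum(resid * x))    # ~ 0, up to floating-point error
```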

I don't use derivatives to find $\hat\beta$ and $\hat\alpha$; instead I talk about matrices and orthogonal projections. But let's see if I can do it using derivatives:

$$ \begin{align} \frac{\partial Q}{\partial\alpha} & = -2\sum_{i=1}^{n} (y_i - \alpha - \beta x_i) =0 \tag1 \\[10pt] \frac{\partial Q}{\partial\beta} & = -2\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)x_i = 0 \tag2 \end{align} $$
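
As a sanity check on those two equations, here is a small symbolic sketch (my own illustration, assuming SymPy and a toy data set I made up): it builds $Q$, takes the two partial derivatives, and solves the resulting system.

```python
# My own illustration: the same two derivatives on a toy data set, via SymPy.
import sympy as sp

alpha, beta = sp.symbols('alpha beta')
xs = [1, 2, 3, 4]                 # toy data, chosen arbitrarily
ys = [2, 3, 5, 4]

Q = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

eq1 = sp.Eq(sp.diff(Q, alpha), 0)   # equation (1)
eq2 = sp.Eq(sp.diff(Q, beta), 0)    # equation (2)

print(sp.solve([eq1, eq2], [alpha, beta]))   # {alpha: 3/2, beta: 4/5} for this toy data
```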

From $(1)$, dividing the sum by $n$ gives $\bar y - \alpha - \beta\bar x = 0$, i.e. $$ \alpha = \bar y - \beta \bar x. \tag 3 $$ Thus the least-squares line must pass through the "point of averages" $(\bar x,\bar y)$, which is just the average of the observed data points $(x_i,y_i),\ i=1,\ldots,n$.

Now substitute $(3)$ for $\alpha$ in $(2)$: $$ \sum_{i=1}^n (y_i - (\bar y - \beta \bar x) -\beta x_i)x_i = 0, $$ or $$ \sum_{i=1}^n ((y_i-\bar y) -\beta(x_i-\bar x))x_i=0, $$ so $$ \beta = \frac{\sum_{i=1}^n (y_i-\bar y)x_i }{\sum_{i=1}^n (x_i-\bar x)x_i}. $$
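
A quick numerical cross-check (again my own illustration, assuming NumPy and made-up data): the ratio formula for $\beta$, together with $(3)$, agrees with a generic degree-1 least-squares fit, and the resulting line does pass through $(\bar x,\bar y)$.

```python
# My own illustration: the ratio formula for beta and equation (3), checked
# against numpy.polyfit on made-up data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=30)
y = 1.0 - 0.5 * x + rng.normal(size=30)      # made-up data

beta_hat = np.sum((y - y.mean()) * x) / np.sum((x - x.mean()) * x)
alpha_hat = y.mean() - beta_hat * x.mean()   # equation (3)

slope, intercept = np.polyfit(x, y, 1)       # generic degree-1 least squares
print(beta_hat, slope)                       # agree to rounding error
print(alpha_hat, intercept)                  # agree to rounding error
print(alpha_hat + beta_hat * x.mean() - y.mean())   # ~ 0: line passes through (x-bar, y-bar)
```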

That this is the same as $$ \beta = \frac{\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x) }{\sum_{i=1}^n (x_i-\bar x)(x_i-\bar x)} $$ I leave as an exercise in algebra for the moment, but I'll post more on it if you have questions about that.
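
If you want a numerical hint before doing the algebra, here is one more small check (my own illustration, assuming NumPy and random data) that replacing $x_i$ by $x_i-\bar x$ in the second factor changes neither the numerator nor the denominator.

```python
# My own illustration: numerical hint for the algebra exercise. Replacing x_i
# by (x_i - x-bar) in the second factor changes neither sum, because
# sum(y_i - y-bar) = 0 and sum(x_i - x-bar) = 0.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = rng.normal(size=20)

print(np.sum((y - y.mean()) * x), np.sum((y - y.mean()) * (x - x.mean())))   # equal
print(np.sum((x - x.mean()) * x), np.sum((x - x.mean()) ** 2))               # equal
```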