Define the data set to be a sequence $\{(Y_i, X_i)\}_{i=1}^n$ of pairs of real numbers. Fitting a line to this sequence amounts to finding $\theta = (\alpha, \beta) \in \mathbb R^2$ such that the equation $Y_i = \alpha X_i + \beta$ "captures" the relationship between $Y_i$ and $X_i$ for each $i$ in some sense. But in what sense? One way to make this question meaningful is to assume a statistical model. Namely, assume that, for each $i$, $$Y_i = \alpha X_i + \beta + e_i,$$ where each $e_i$ is a random variable with known distribution. In other words, assume that each $X_i$ is a deterministic variable and that $Y_i$ is related to $X_i$ in a linear but "noisy" way. Under this assumption, finding $\theta$ can be cast as finding the $\theta$ that maximizes the "likelihood" of having observed $\{(Y_i, X_i)\}_{i=1}^n$ given that $\theta$ was the parameter of the above statistical model.
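To make the model concrete, here is a minimal sketch in Python that simulates data from it. The particular values of $\alpha$, $\beta$, $\sigma$, and $n$ are made up purely for illustration and are not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters of the statistical model.
alpha, beta, sigma = 2.0, 1.0, 0.5
n = 100

# Deterministic X_i, and noisy Y_i = alpha * X_i + beta + e_i
# with e_i ~ N(0, sigma^2) independent across i.
X = np.linspace(0.0, 10.0, n)
e = rng.normal(loc=0.0, scale=sigma, size=n)
Y = alpha * X + beta + e
```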
One way to define the likelihood is by use of a conditional probability density function (p.d.f.), which we denote by $f(Y_i \; | \; X_i, \theta)$ for each $i$. (I use this notation to emphasize that $X_i$ is deterministic, so it admits no useful notion of probability density. Each $X_i$ will instead play a role in parametrizing the distribution of $Y_i$.) For example, if each $e_i$ is a normal random variable with zero mean and variance $\sigma^2$, and the $e_i$ are all mutually independent, then the $Y_i$ are also independent normal variables with variance $\sigma^2$, but each $Y_i$ has mean $\alpha X_i + \beta$; that is,
$$f(Y_i \; | \; X_i, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(\frac{-(Y_i - \alpha X_i - \beta)^2}{2\sigma^2}\right).$$
Hence, we may define the likelihood of the data to be the following joint conditional p.d.f.:
\begin{align*}
\ell((Y,X); \theta) &= \prod_{i=1}^n f(Y_i \; | \; X_i, \theta) \\
&= \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(\frac{-1}{2\sigma^2}\left(\sum_{i=1}^n(Y_i - \alpha X_i - \beta)^2\right)\right).
\end{align*}
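As a sanity check, this joint p.d.f. can be evaluated numerically. The sketch below reuses the simulated `X`, `Y`, `sigma`, `alpha`, and `beta` from the previous snippet and compares the closed-form expression (on the log scale) with the sum of per-observation normal log-densities from `scipy.stats`:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, X, Y, sigma):
    """Log of the joint conditional p.d.f. ell((Y, X); theta)."""
    a, b = theta
    resid = Y - a * X - b
    n = len(Y)
    return (-0.5 * n * np.log(2 * np.pi)
            - n * np.log(sigma)
            - np.sum(resid**2) / (2 * sigma**2))

# Equivalent computation: sum over i of log f(Y_i | X_i, theta).
theta = (alpha, beta)
direct = norm.logpdf(Y, loc=alpha * X + beta, scale=sigma).sum()
assert np.isclose(log_likelihood(theta, X, Y, sigma), direct)
```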
Note that the mutual independence of the $e_i$ was essential for writing this joint p.d.f.; otherwise it would not factorize in the first step above. Since the logarithm is increasing and the constant factor in front of the exponential does not depend on $\theta$, maximizing this likelihood function over $\theta$ is equivalent to minimizing
$$\sum_{i=1}^n(Y_i - \alpha X_i - \beta)^2$$
over $\theta$. (Again, keep in mind that $\theta = (\alpha, \beta)$.) This last expression is the squared-error criterion used in least-squares fitting; the qualifier "least" indicates that we are minimizing it (or, equivalently, maximizing the likelihood above). This equivalence between minimizing squared error and maximizing likelihood holds precisely because the $e_i$ in the statistical model are i.i.d. normally distributed with zero mean and common variance $\sigma^2$. Hence, the normality assumption is essential in motivating the use of least-squares fitting from a statistical standpoint. However, as mentioned by Pantelis Sopasakis in the comments, it is not essential to the actual computation of $\theta$, which is the focus of the derivations in the link you posted.
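To illustrate the equivalence numerically, here is a sketch that again reuses the variables from the snippets above. It maximizes the likelihood with a generic optimizer and, separately, minimizes the squared error via ordinary least squares (`np.polyfit` is just one convenient way to do the latter); both recover essentially the same $(\alpha, \beta)$:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the likelihood by minimizing the negative log-likelihood.
# The value of sigma does not affect the argmax over (alpha, beta).
def neg_log_lik(theta):
    return -log_likelihood(theta, X, Y, sigma)

mle = minimize(neg_log_lik, x0=np.array([0.0, 0.0])).x

# Minimize the squared-error formula directly (ordinary least squares).
ols = np.polyfit(X, Y, deg=1)  # returns [slope, intercept]

print(mle, ols)  # the two estimates of (alpha, beta) agree
```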