
Basic Question

Is there an intuitive explanation of standard deviation in terms of Euclidean distance in $n$ dimensional space?

Longer Version of Question

To begin a more detailed sketch of my question, for simplicity let's just focus on the simple case of a discrete random variable that is uniformly distributed. In this case, the variance is given by the following formula, which I've abducted straight from Wikipedia:

$$ \frac1{n}\sum_{i =1}^n (x_i - \mu)^2$$

where $\mu$ is the mean. The standard deviation is then the square root of this. Now, I can't help noticing that the square root of the sum alone, $\sqrt{\sum_{i=1}^n (x_i - \mu)^2}$, is the Euclidean distance from the vector $X = (x_1, x_2, \dots, x_n)$ to the vector $\vec \mu = (\mu, \mu, \dots, \mu)$. That is, the standard deviation can be expressed as:

$$ \frac1{\sqrt{n}}|X - \vec \mu |$$

So I wonder, is there any significant conceptual relationship between this distance $|X - \vec \mu |$ and standard deviation, or is this just a coincidence?
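Just as a sanity check on the identity itself, here is a minimal numerical sketch (assuming NumPy; the sample values are arbitrary) confirming that the population standard deviation equals the scaled Euclidean distance:

```python
import numpy as np

x = np.array([2.0, 5.0, 7.0, 11.0, 13.0])    # arbitrary sample
n = len(x)
mu = x.mean()

sd = np.std(x)                               # population standard deviation (divides by n)
dist = np.linalg.norm(x - mu) / np.sqrt(n)   # (1/sqrt(n)) * |X - mu_vec|

print(sd, dist)                              # both print the same number
assert np.isclose(sd, dist)
```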

Even More Details...

I have looked up many explanations of standard deviation and its cousin variance. Here are some that I've seen already, each sort of following from the previous one:

  • We square the values before summing to get rid of the sign, which is obviously not important. This explanation is often criticised by hardcore statisticians and I can sort of see why: it doesn't explain why squaring beats taking the absolute value.
  • We square the values so that we pay a greater price for greater deviations. This explains why squaring beats taking absolute values. But why not raise to the power of $4$, or $6$, or any other even power before summing? What is so special about $2$?
  • The thing that is so special about $2$ is that the variance is the second moment of the distribution, the analogue of the moment of inertia in mechanics, whereas the mean is the first moment, so mechanically it makes sense. I don't follow this. My intuition is totally OK with the mean: the point where, if I put my finger, the weights on either side will balance. But the second moment is harder for me to imagine physically like this (I try to spell the analogy out just after this list).
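For my own benefit, here is that mechanical analogy spelled out as I understand it: put a point mass of $1/n$ at each value $x_i$ on a line. The balance condition for the mean says the first moment about $\mu$ vanishes, and the variance is then literally the moment of inertia of this mass distribution about $\mu$:

$$ \sum_{i=1}^n \frac1{n}(x_i - \mu) = 0, \qquad I_\mu = \sum_{i=1}^n \frac1{n}(x_i - \mu)^2 = \frac1{n}\sum_{i=1}^n (x_i - \mu)^2 $$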

Note, this is a question about intuition. I "understand" the mathematical formula at a shallow level: what all its terms mean, how to calculate it given a dataset. But I am not comfortable with my grasp on why this formula is "the best" one to use in so many applications e.g. the least squares method to fit data. I'm particularly confused as to why squaring is chosen as opposed to raising to some other even power e.g. $9234324$.

And this is where my intuition steps in and tries to provide an explanation that goes right back to the fundamental theorem of Pythagoras: euclidean distance. Here is my thought process: "The number $2$ is special. It's the unique power that makes Euclidean distance work. So maybe it's also the unique number that makes variance work." But then why the multiplying factor of $\frac1{\sqrt{n}}$? Is it just simply a case of: swallow it up and accept the definition, or can this intuition be resolved somehow?

Colm Bhandal
  • Related: https://math.stackexchange.com/questions/1860579/why-work-with-squares-of-error-in-regression-analysis, https://math.stackexchange.com/questions/1621759/neural-network-cost-function-why-squared-error, https://math.stackexchange.com/questions/1872055/is-minimizing-the-squared-errors-optimal – Colm Bhandal May 11 '20 at 08:59

2 Answers


There is certainly a very clear "conceptual relationship" between the standard deviation and Euclidean distance: If we treat the whole available sample (the $x_i$'s) as a vector, then the Euclidean distance is a measure of how much this vector deviates from the vector containing the mean value, which is "the center" of the population.

But the standard deviation attempts to measure how much a single observation, not the whole sample, deviates "on average" from the mean value. Ah, then, why are we dividing by $\sqrt {n}$ and not by $n$?

Well, this becomes clear if we consider a vector $\mathbf x = (x,x,x,\dots,x)$. Then the Euclidean distance becomes

$$\sqrt {\sum_{i =1}^n (x - \mu)^2}=\sqrt n \,|x-\mu|$$

So, due to the square root, Euclidean distance is not linearly additive as we move from one dimension to $n$ dimensions: it does not increase by a factor of $n$, but only by a factor of $\sqrt n$. So to recover the "individual distance on average" we have to divide by $\sqrt n$.
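Here is a minimal numerical sketch of that scaling (assuming NumPy; the deviation of $3$ and the dimensions are arbitrary): repeating the same deviation across $n$ coordinates multiplies the Euclidean norm by $\sqrt n$, so dividing by $\sqrt n$ recovers the individual deviation.

```python
import numpy as np

dev = 3.0                                # a single observation deviating from the mean by 3
for n in (1, 4, 9, 100):
    v = np.full(n, dev)                  # the same deviation repeated in n coordinates
    dist = np.linalg.norm(v)             # Euclidean distance to the vector (mu, ..., mu)
    print(n, dist, dist / np.sqrt(n))    # dist = 3 * sqrt(n); dividing by sqrt(n) gives 3 back
```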

  • Great explanation. I do still worry a small bit about the "on average" bit. We're not exactly measuring "how much a single observation... deviates 'on average' from the mean value". Otherwise we'd be back at just taking the absolute value of the deviations, summing, and dividing by $n$ i.e. $\frac1{n}\sum_{i =1}^n |x_i - \mu|$. Still, this is a good intuitive explanation, thanks for sharing it. – Colm Bhandal Sep 04 '15 at 11:10
  • ...it seems to me like the standard deviation wants to measure how much the whole population vector deviates from its average, and then normalise this for the number of elements so that the special case of $(x, x, x, \dots, x)$ gives you back exactly the average deviation... – Colm Bhandal Sep 04 '15 at 11:18
  • @ColmBhandal We also do that, i.e. taking the average of the absolute deviations. It is called, ahem, the "average absolute deviation". The use of second moments instead of absolute deviations has many arguments behind it, from mathematical convenience to properties of statistical distributions to the connection with Euclidean distance... your second comment has promise. – Alecos Papadopoulos Sep 04 '15 at 12:34
  • Yes I feel a bit more settled in my soul about the whole thing now, especially since I recently saw a result (not the proof though) of Gauss relating least squares to the normal curve. Making the second comment a bit more rigorous: We want to normalise the euclidean distance by multiplying it by a function of $n$ so that the result gives exactly the average absolute deviation for a "hyper-cubic" vector i.e. $(x, x, x, \dots, x)$. With this condition, by your above argument, the only solution is to multiply by $\frac1{\sqrt{n}}$. Yes I think I'm satisfied. For now. – Colm Bhandal Sep 04 '15 at 13:45

First, consider moving from the origin $(0,0,0)$ of a Euclidean space and ending up at the point with coordinates $(x,y,z)$.

Your distance from the origin is $\sqrt{x^2+y^2+z^2}$ (see this post for a generalization of the Pythagorean theorem to three dimensions).

The idea of the standard deviation is to study how far you deviate from the mean. It therefore makes sense to consider something like: \begin{align*} \sqrt{\sum_{i=1}^d (x_i-\mu)^2}, \end{align*} at least based on the distances of our physical world. But other norms and distances exist in topology and statistics, and such a choice remains to be justified from an information perspective.

Different norms, such as the $1$-norm, are often used in optimization, for example where squared errors would give too much weight to outliers; compare this post and this one (the median minimizes differences in the $1$-norm, whereas the average minimizes differences in the $2$-norm).
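To illustrate that parenthesis, here is a minimal sketch (assuming NumPy; the data, including the outlier $100$, are made up) that minimizes both losses over a grid of candidate centers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # small sample with one outlier
c = np.linspace(-10, 110, 100001)              # candidate "centers"

l1 = np.abs(data[:, None] - c).sum(axis=0)     # sum of absolute deviations
l2 = ((data[:, None] - c) ** 2).sum(axis=0)    # sum of squared deviations

print(c[l1.argmin()], np.median(data))         # ~3.0  : the 1-norm picks the median
print(c[l2.argmin()], data.mean())             # ~22.0 : the 2-norm picks the mean, pulled by the outlier
```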

I think the best approach is actually to start from scalar products. The similarity between two vectors of numerical data can be quantified by a scalar product: two random variables are correlated if they point in the same direction, and this is essentially the definition of the covariance. Indeed, every scalar product on $\mathbb{R}^d$, i.e. every symmetric positive definite bilinear form, can be written as $x^\top S y$ for some symmetric $S\succ 0$.
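Concretely (a minimal sketch assuming NumPy; the two samples are arbitrary), the population covariance is exactly a scaled scalar product of the centered data vectors:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0])
y = np.array([2.0, 6.0, 1.0, 7.0])

cov_dot = np.dot(x - x.mean(), y - y.mean()) / len(x)   # scalar product of centered vectors, over n
cov_np = np.cov(x, y, bias=True)[0, 1]                  # population covariance

print(cov_dot, cov_np)                                  # identical
assert np.isclose(cov_dot, cov_np)
```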

Such a consideration is more than subjective: the square is the power that guarantees additivity of the variance for independent variables, a property which you may not have with other powers.
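A quick simulation sketch of that additivity claim (assuming NumPy; the two distributions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1_000_000)       # independent samples
y = rng.exponential(3.0, size=1_000_000)

# variance is additive for independent variables
print(np.var(x + y), np.var(x) + np.var(y))    # nearly equal (about 13 = 4 + 9)

# the analogous check with a 4th central moment is not additive
def m4(z):
    return np.mean((z - z.mean()) ** 4)

print(m4(x + y), m4(x) + m4(y))                # clearly different
```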

To go into more detail, the conditional expectation $\mathbb{E}[X|\mathcal{G}]$, which can be rigorously and meaningfully defined from measure-theoretic considerations (see here for example) as the unique integrable and $\mathcal{G}$-measurable random variable $Y$ such that \begin{align*} \mathbb{E}[Y1_A] = \mathbb{E}[X1_A] \quad \forall A\in\mathcal{G}, \end{align*} also corresponds - when $X$ is square integrable - to the orthogonal projection of $X$ onto the space of all square integrable and $\mathcal{G}$-measurable random variables (see here for example). As far as I'm concerned, this definitely legitimizes scalar product considerations (along with the induced norm for the standard deviation).
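Here is a minimal discrete sketch of that projection picture (assuming NumPy; the sample space, the partition and the values are made up): on a finite space with equally likely outcomes, the $\mathcal{G}$-measurable variables are those constant on the blocks of a partition, and the block-wise mean is the one closest to $X$ in the $L^2$ sense.

```python
import numpy as np

# 6 equally likely outcomes; G is generated by the partition {0,1,2} | {3,4,5},
# so G-measurable random variables are constant on each block.
x = np.array([1.0, 4.0, 7.0, 2.0, 3.0, 10.0])
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# conditional expectation: replace x by its mean on each block
cond_exp = np.empty_like(x)
for b in blocks:
    cond_exp[b] = x[b].mean()

# among G-measurable candidates (a, a, a, b, b, b), the block means minimize E[(X - Y)^2]
def l2_error(a, b):
    y = np.array([a, a, a, b, b, b])
    return np.mean((x - y) ** 2)

grid = np.linspace(0.0, 10.0, 201)
best = min((l2_error(a, b), a, b) for a in grid for b in grid)
print(cond_exp)            # [4. 4. 4. 5. 5. 5.]
print(best[1], best[2])    # 4.0 5.0 -- the grid minimizer matches the block means
```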

The central limit theorem also scales the centered sum by a factor of $1/\sqrt{n}$, which is the same factor that appears in the standard deviation (with its squares and square roots).
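For instance (a small simulation sketch, assuming NumPy), the spread of the sample mean shrinks at exactly that $1/\sqrt{n}$ rate:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
for n in (10, 100, 1000):
    means = rng.normal(0.0, sigma, size=(10_000, n)).mean(axis=1)   # 10,000 sample means of size n
    print(n, means.std(), sigma / np.sqrt(n))                       # empirical spread vs sigma / sqrt(n)
```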

(By the way, this kind of "natural rate" is similar to that of differentiation, which imposes a linear rate of approximation for most functions, at least those we naturally consider, such as polynomials, the exponential, sine and cosine, and, almost everywhere, Lipschitz continuous functions by Rademacher's theorem...)

reded