
From Wikipedia

In probability theory, the total variation distance between two probability measures $P$ and $Q$ on a sigma-algebra $F$ is $$ \sup\left\{\,\left|P(A)-Q(A)\right| : A\in F\,\right\}. $$ Informally, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.

For a finite alphabet we can write $$ \delta(P,Q) = \frac 1 2 \sum_x \left| P(x) - Q(x) \right|\;. $$ Sometimes the statistical distance between two probability distributions is also defined without the division by two.
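Not part of the quoted article, but the two expressions can be compared by brute force on a made-up three-symbol alphabet (the distributions below are illustrative): half the $\ell_1$ distance between the mass functions agrees with the sup of $|P(A)-Q(A)|$ over all $2^3$ events.

```python
from itertools import chain, combinations

# Two probability distributions on a small finite alphabet (illustrative values).
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.2, "b": 0.4, "c": 0.4}

# Half the L1 distance between the mass functions.
delta_sum = 0.5 * sum(abs(P[x] - Q[x]) for x in P)

# Supremum of |P(A) - Q(A)| over all events A (all subsets of the alphabet).
events = chain.from_iterable(combinations(P, r) for r in range(len(P) + 1))
delta_sup = max(abs(sum(P[x] for x in A) - sum(Q[x] for x in A)) for A in events)

print(delta_sum, delta_sup)  # both equal 0.3 up to float rounding
```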

I was wondering whether there is some particular consideration behind the $\frac 1 2$ in the finite case, while it is absent in the general case? My understanding of this total variation distance/metric is that it is induced from the upper variation over the whole set (which is a norm, if I am correct). From there, I can't see the need to divide by 2.

Also, in the finite case, why not define it similarly in terms of a $\sup$ over $A \in F$?

Thanks and regards!

Tim
  • I don't understand what your question is. I suspect you've misinterpreted what that Wikipedia page (I strongly hesitate calling it an article) is saying. (Maybe the confusion arises from the horrendous disconnect in the notation on that page.) Plus, the last result generalizes easily. – cardinal Jan 07 '12 at 20:18
  • Also, see the question and comments here: http://math.stackexchange.com/questions/69166/understanding-the-relationship-of-the-l1-norm-to-the-total-variation-distance – cardinal Jan 07 '12 at 20:21
  • @cardinal: Thanks! My question is why there is an inconsistency between the definition for the general case and the one for discrete probability measures. One point is whether to divide by 2, and the other is whether to take the sup over the sigma-algebra. – Tim Jan 07 '12 at 20:26
  • In the finite case the two expressions are equivalent, though the notation on the Wikipedia page doesn't make that obvious. In other words, in the discrete case, $$\delta(P,Q) = \sup\left\{|P(A) - Q(A)| : A \in F\right\} = \frac{1}{2} \sum_x |P(x)-Q(x)|. $$ Can you supply a simple proof? See the hint in my comment to the above-linked question and define an appropriately analogous set $A$. – cardinal Jan 07 '12 at 20:27
  • @cardinal: The sup is achieved at either $A=\{x: P(x) \geq Q(x)\}$ or $A^c$, and $P(A)-Q(A) + P(A^c)-Q(A^c) = 0$. So the sup is achieved at both $A=\{x: P(x) \geq Q(x)\}$ and $A^c$. So ... Am I right? – Tim Jan 07 '12 at 20:46
  • The result is correct, yes. Your proof seems slightly muddled, though. In particular, you need an argument showing the sup is, indeed, achieved at $A$. (Consider what happens if you either add to, or remove from, $A$ any element $a$ such that $P(a) \neq Q(a)$.) – cardinal Jan 07 '12 at 21:44
  • Very relevant: https://math.stackexchange.com/questions/1481101/confusion-about-definition-of-the-total-variation-distance?rq=1 – D.R. Nov 13 '19 at 04:58
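The claim in the comment thread — that the sup is attained at $A=\{x: P(x) \geq Q(x)\}$ — can be checked by brute force on a small example. The four-letter alphabet below is made up for illustration; exact rational arithmetic avoids any floating-point ambiguity.

```python
from fractions import Fraction as F
from itertools import chain, combinations

# Illustrative distributions on a four-letter alphabet, with exact arithmetic.
P = {"a": F(2, 5), "b": F(1, 10), "c": F(3, 10), "d": F(1, 5)}
Q = {"a": F(1, 10), "b": F(3, 10), "c": F(1, 5), "d": F(2, 5)}

# Candidate maximizer from the comment thread: the set where P dominates Q.
A_star = {x for x in P if P[x] >= Q[x]}  # {"a", "c"}

def gap(A):
    return abs(sum(P[x] for x in A) - sum(Q[x] for x in A))

# Exhaust all events and compare against the candidate.
subsets = chain.from_iterable(combinations(P, r) for r in range(len(P) + 1))
best = max(gap(A) for A in subsets)

print(gap(A_star) == best)  # True: the sup is attained at A_star (and at its complement)
```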

2 Answers


It is not a matter of adding a factor of $\frac{1}{2}$ in the finite case. The second expression is a sum over all elements of the underlying set, while the first expression is not a sum, but a sup over all events in the space. The reason for the $\frac{1}{2}$ in the second expression is that it can be proved that in the finite case, the two quantities are equal. See for example Proposition 4.2 on page 48 of Markov chains and mixing times by Levin, Peres, and Wilmer. I do not know the full extent of analogies to the second expression for cases when the underlying set is infinite, but the sum would have to become an integral. See cardinal's comments for more information.

Jonas Meyer

Here is a proof showing that we can define it similarly in the non-finite case. I'm $75\%$ citing my own question/answer:

$$\|\mu-\nu\|=|\mu-\nu|(X)=2 \sup \{|\mu(A)-\nu(A)|: A \in \Sigma\}$$

where the total variation norm $\|\cdot\|$ and the total variation of a measure $|\cdot|$ are defined in the article.

First we need to show that $(\mu-\nu)(\cdot)$ is a (finite) signed measure; since $\mu$ and $\nu$ are finite measures, their difference is countably additive and finite, so this is immediate. Then, let $X = D^+\sqcup D^-$ be a Hahn decomposition of $X$ under $\mu-\nu$, so that $\mu-\nu$ is nonnegative on $D^+$ and nonpositive on $D^-$, and set $B = D^+$. It is then easy to prove that for all $A \in \Sigma$, $\mu(A)-\nu(A) \le \mu(B)-\nu(B)$ and $\nu(A)-\mu(A) \le \nu(B^c)-\mu(B^c) = \mu(B)-\nu(B)$. See ($\star$) below for the proof. Then, you go on to say that

$$\sup_{A\in\Sigma} |\mu(A)-\nu(A)| \le \mu(B)-\nu(B)$$

and, since $B\in\Sigma$,

$$\sup_{A\in\Sigma} |\mu(A)-\nu(A)| \ge \mu(B)-\nu(B).$$

From this, it is possible to conclude

$$\begin{aligned}\sup_{A\in\Sigma} |\mu(A)-\nu(A)| &= \mu(B)-\nu(B) \\ &=\frac{1}{2}\left(\mu(B)-\nu(B) + \nu(B^c)-\mu(B^c)\right) \\ &=\frac{1}{2}\left((\mu-\nu)(D^+) - (\mu-\nu)(D^-)\right) \\ &=\frac{1}{2}\,|\mu-\nu|(X),\end{aligned}$$

which is the claimed identity.

($\star$) Here is the proof of the above-mentioned inequalities:

\begin{align} \mu(A) - \nu(A) & = \mu(A\cap B) + \mu(A\cap B^c) - \nu(A\cap B) - \nu(A\cap B^c) \\ & \le \mu(A\cap B) - \nu(A\cap B) \tag{1} \\ & \le \mu(B) - \nu(B). \tag{2} \end{align}

(1) follows because $\mu(A\cap B^c)-\nu (A\cap B^c)\le 0$ (since $A\cap B^c \subseteq D^-$), while (2) follows because $\mu(B\setminus A) - \nu(B\setminus A) \geq 0$ (since $B\setminus A \subseteq D^+$).

In a similar way, using that $\mu$ and $\nu$ are probability measures: $\nu(A) - \mu(A) \le \nu(B^c) - \mu(B^c) = 1- \nu(B) - (1-\mu(B)) = \mu(B) - \nu(B)$.
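As a sanity check (not part of the original answer), the identity $|\mu-\nu|(X) = 2\sup_{A}|\mu(A)-\nu(A)|$ can be verified numerically on a finite ground set, where the Hahn decomposition is simply the set of points with $\mu(x) \ge \nu(x)$. The measures below are made up for illustration.

```python
from fractions import Fraction as F
from itertools import chain, combinations

# A finite ground set stands in for X; the measures are illustrative.
mu = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}
nu = {1: F(1, 8), 2: F(1, 2), 3: F(3, 8)}

# Hahn decomposition of X under mu - nu: D_plus is where the difference is >= 0.
D_plus = {x for x in mu if mu[x] >= nu[x]}

# Total variation of the signed measure mu - nu over X, i.e. |mu - nu|(X).
tv = sum(mu[x] - nu[x] for x in D_plus) + sum(nu[x] - mu[x] for x in mu if x not in D_plus)

# Supremum of |mu(A) - nu(A)| over all events A.
subsets = chain.from_iterable(combinations(mu, r) for r in range(len(mu) + 1))
sup = max(abs(sum(mu[x] for x in A) - sum(nu[x] for x in A)) for A in subsets)

print(tv == 2 * sup)  # True: |mu - nu|(X) = 2 sup_A |mu(A) - nu(A)|
```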