
Introduction. I found the following definitions of the total variation distance $d_{TV}$ between two probability distributions (also called probability measures) $P$ and $Q$ on $\mathcal{A}$ (please note that I tried to use a consistent notation in the following definitions!):

\begin{align} &{\color{blue}{\textbf{"Definition 2.4" on page 84, in Tsybakov (2009)}}} \\&\hspace{10ex} d_{TV}(P,Q) = \sup_{A \in \mathcal{A}} \left| P(A)-Q(A) \right| = \sup_{A \in \mathcal{A}} \left| \int_{A} (p-q)d\nu \,\right| \\ \\ &{\color{blue}{\textbf{"2.1 Definition" on page 5, in Strasser (1985)}}} \\&\hspace{10ex}d_{TV}(P,Q) = \left\Vert P-Q\right\Vert = \sup \{\left| P(A)-Q(A) \right| : {A \in \mathcal{A}} \} \\ \\ &{\color{blue}{\textbf{"4.1. Total Variation Distance" on page 47, in Levin \& Peres (2017)}}} \\&\hspace{10ex}d_{TV}(P,Q) = \left\Vert P-Q\right\Vert = \max_{A \subseteq \mathcal{A}} \left| P(A)-Q(A) \right| \\ \\ &{\color{blue}{\textbf{On page 22, in Villani (2008)}}}\\&\hspace{10ex}d_{TV}(P,Q) = \left\Vert P-Q\right\Vert = 2 \inf \left\{ \mathbb{E} [\mathbf{1}_{X \neq Y}]; \,\text{law}(X)=P, \text{law}(Y)=Q \right\} \\ \end{align}

Question. Since the variety of definitions of the total variation distance confuses me, could you please show, prove, or derive one or more (or all!) of the following equalities, or suggest some references that prove them?

\begin{align} &{\color{red}{\textbf{First Equality:}}} \qquad &&\sup_{A \in \mathcal{A}} \left| P(A)-Q(A) \right| \stackrel{\bf{{\color{red}?}}}{=} \sup_{A \in \mathcal{A}} \left| \int_{A} (p-q)d\nu \,\right| \\ \\ &{\color{red}{\textbf{Second Equality:}}} \qquad &&\left\Vert P-Q\right\Vert \stackrel{\bf{{\color{red}?}}}{=} \sup \{\left| P(A)-Q(A) \right| : {A \in \mathcal{A}} \} \\ \\ &{\color{red}{\textbf{Third Equality:}}} \qquad &&\left\Vert P-Q\right\Vert \stackrel{\bf{{\color{red}?}}}{=} \max_{A \subseteq \mathcal{A}} \left| P(A)-Q(A) \right| \\ \\ &{\color{red}{\textbf{Fourth Equality:}}} \qquad &&\left\Vert P-Q\right\Vert \stackrel{\bf{{\color{red}?}}}{=} 2 \inf \left\{ \mathbb{E} [\mathbf{1}_{X \neq Y}]; \,\text{law}(X)=P, \text{law}(Y)=Q \right\} \\ \end{align}

References.

  1. Tsybakov (2009)
  2. Strasser (1985)
  3. Levin & Peres (2017)
  4. Villani (2008)
Graham Kemp
Ommo

2 Answers


I think it is clear that the second and third definitions are equivalent. In complete generality, the third definition should use a supremum instead of a maximum, but it's possible they were only considering finite probability spaces, in which case the maximum is sufficient.

The first definition only makes sense if both $P$ and $Q$ have densities with respect to a measure $\nu$. Since $\int_A (p-q) \, d\nu = P(A) - Q(A)$, it is clear that this definition is equivalent to the second and third (when the densities exist).
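As a small sanity check, here is a sketch for a finite space, assuming $\nu$ is the counting measure so that the densities $p$ and $q$ are just probability mass functions and $\int_A (p-q)\,d\nu = \sum_{x \in A} (p(x)-q(x))$. The function names `tv_sup` and `tv_half_l1` and the example distributions are made up for illustration. It brute-forces the supremum over all subsets and compares it with half the $L^1$ distance between the mass functions.

```python
from fractions import Fraction
from itertools import chain, combinations


def tv_sup(p, q):
    """Brute-force max over all subsets A of |P(A) - Q(A)| on a finite space."""
    points = range(len(p))
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(p) + 1)
    )
    # With counting measure, P(A) - Q(A) is the sum of p - q over the points of A.
    return max(abs(sum(p[i] - q[i] for i in A)) for A in subsets)


def tv_half_l1(p, q):
    """Half the L1 distance between the two mass functions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / 2


# Illustrative distributions on a three-point space (exact rational arithmetic).
p = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]
q = [Fraction(2, 10), Fraction(3, 10), Fraction(5, 10)]

print(tv_sup(p, q), tv_half_l1(p, q))  # both print 3/10
```

The brute force is exponential in the number of points, so it is only meant to illustrate the definitions on tiny examples.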


For the fourth definition, note that the infimum is over all couplings (joint distributions) $\mathbb{P}$ of $(X, Y)$ whose marginal distributions are $P$ and $Q$.

Fix a coupling $\mathbb{P}$.

\begin{align} P(A) - Q(A) &= \mathbb{P}(X \in A) - \mathbb{P}(Y \in A) \\ &= \mathbb{P}(X \in A, X = Y) + \mathbb{P}(X \in A, X \ne Y) - \mathbb{P}(Y \in A, X = Y) - \mathbb{P}(Y \in A, X \ne Y) \\ &= \mathbb{P}(X \in A, X \ne Y) - \mathbb{P}(Y \in A, X \ne Y) \\ &\le \mathbb{P}(X \ne Y) = \mathbb{E}_{\mathbb{P}}[1_{X \ne Y}] \end{align}

The third line uses the fact that the events $\{X \in A, X = Y\}$ and $\{Y \in A, X = Y\}$ coincide. A similar argument shows $Q(A) - P(A) \le \mathbb{P}(X \ne Y)$. Thus, for any coupling $\mathbb{P}$, we have $$\sup_{A \in \mathcal{A}} |P(A) - Q(A)| \le \mathbb{E}_{\mathbb{P}}[1_{X \ne Y}].$$

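It remains to show that the infimum of the right-hand side is equal to the left-hand side. This can be done by constructing an "optimal" coupling. For finite probability spaces, see Lemma 4.1.13 here, Lemma 1(b) here, or Lemma 2.2 here. For more general spaces, see Theorem 2.12 here. Note that this gives $\sup_{A \in \mathcal{A}} |P(A) - Q(A)| = \inf_{\mathbb{P}} \mathbb{P}(X \ne Y)$ with no factor of 2, so the fourth definition, which carries a factor of 2, uses a normalization that is twice that of the other three.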
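For the finite case, the idea of an "optimal" coupling can be sketched as follows. This is one standard construction, not necessarily the one used in the linked lemmas: put as much mass as possible on the diagonal ($\min(p(x), q(x))$ at each point $x$) and spread the leftover mass of $P$ and $Q$ off the diagonal. The function name `maximal_coupling` and the example distributions are made up for illustration.

```python
from fractions import Fraction


def maximal_coupling(p, q):
    """Joint pmf {(x, y): mass} with marginals p and q that maximizes P(X == Y)."""
    n = len(p)
    overlap = [min(pi, qi) for pi, qi in zip(p, q)]
    tv = 1 - sum(overlap)  # equals sup_A |P(A) - Q(A)| on a finite space
    # Diagonal part: X = Y = i with probability min(p_i, q_i).
    joint = {(i, i): overlap[i] for i in range(n) if overlap[i] > 0}
    if tv > 0:
        excess_p = [pi - m for pi, m in zip(p, overlap)]  # nonzero only where p > q
        excess_q = [qi - m for qi, m in zip(q, overlap)]  # nonzero only where q > p
        # Off-diagonal part: leftover masses coupled independently, rescaled by tv.
        for i in range(n):
            for j in range(n):
                mass = excess_p[i] * excess_q[j] / tv
                if mass > 0:
                    joint[(i, j)] = joint.get((i, j), 0) + mass
    return joint


# Illustrative distributions on a three-point space (exact rational arithmetic).
p = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]
q = [Fraction(2, 10), Fraction(3, 10), Fraction(5, 10)]

joint = maximal_coupling(p, q)
mismatch = sum(m for (i, j), m in joint.items() if i != j)
row_marginal = [sum(m for (i, j), m in joint.items() if i == x) for x in range(len(p))]
col_marginal = [sum(m for (i, j), m in joint.items() if j == y) for y in range(len(q))]

print(mismatch)  # 3/10, which equals half the L1 distance between p and q
```

In this example the marginals of `joint` are `p` and `q`, and the mismatch probability $\mathbb{P}(X \ne Y)$ equals $\sup_A |P(A) - Q(A)|$, so the inequality above is attained.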


Response to comments:

Comment 1: I'm not an expert on the history of this, but yes, it seems this definition is the most general and requires the fewest assumptions on the probability space. The others seem to correspond to special cases.

Comment 2: There is a minor issue with how you standardized the notation. I think $\mathcal{A}$ should be a $\sigma$-algebra on a sample space $\Omega$. Then $A \in \mathcal{A}$ makes sense for definitions 1 and 2. But for definition 3, it should be either $A \subseteq \Omega$ or $A \in \mathcal{A}$. I am not sure what context Peres is using, but I think this definition only makes sense for finite spaces $\Omega$ (with the power set as the $\sigma$-algebra), since if the $\sigma$-algebra is infinite, there may not be a maximum. So in short, Definition 2 is the more general definition, and for finite spaces with the power set as the $\sigma$-algebra, the supremum over measurable sets can be written as a maximum over all subsets.

Comment 3: Yes, this is also mentioned on Wikipedia.

Comment 4:

  • If $\Omega$ is finite, then any $\sigma$-algebra $\mathcal{A}$ is finite, since the power set is finite.
  • If $\mathcal{A}$ is finite, then $\max_{A \in \mathcal{A}}$ exists and is equivalent to $\sup_{A \in \mathcal{A}}$.
  • $\max_{A \subseteq \mathcal{A}}$ does not make sense. If $\mathcal{A}$ is the power set, then $\max_{A \in \mathcal{A}}$ is equivalent to $\max_{A \subseteq \Omega}$.
angryavian
  • Many thanks @angryavian! Sorry for my late reply, but, through your and John Dawkins's fantastic answers, I tried to study and understand a bit more the entire topic of the "total variation distance"... And I would have several comments (probably I could open a new thread later)... I try to summarise the most impelling ideas here anyway... – Ommo Sep 14 '23 at 14:55
  • First comment. To me, it looks like that the "supremum formula", i.e. \begin{equation} d_{TV}(P,Q) := \sup_{A \in \mathcal{A}} \left| P(A)-Q(A) \right| \end{equation} is the very general definition of total variation distance, and all the other "equivalences" can be derived through Propositions and Lemmas from it. Right or wrong? (I do not know who was the creator of the total variation distance, and in which paper it was published first) – Ommo Sep 14 '23 at 15:07
  • Second comment. Yes, "the second and third definitions are equivalent". But, how to pass from the "supremum formula" to the "maximum formula"? \begin{equation} \sup_{A \in \mathcal{A}} \left| P(A)-Q(A) \right| = \max_{A \subseteq \mathcal{A}} \left| P(A)-Q(A) \right| \end{equation} Should I just say that "$\mathcal{A}$ is a countable or finite set" to justify the change from $\sup_{A \in \mathcal{A}}$ to $\max_{A \subseteq \mathcal{A}}$? – Ommo Sep 14 '23 at 15:13
  • Third comment. Does $\left\Vert P-Q\right\Vert$ correspond to the "$p$-norm" (also called the "$L^p$-norm", https://en.wikipedia.org/wiki/Lp_space), which in this case would take the "$L^1$-norm" form? \begin{equation} \left\Vert P-Q\right\Vert_{1} = \sum_{i=1} \left|P_i-Q_i\right| \end{equation} Checking around (Def.1, https://people.csail.mit.edu/costis/6896sp11/lec3s.pdf), I saw that \begin{equation} \left\Vert P-Q\right\Vert_{1} = \frac{1}{2}\sum_{x} \left|P(x)-Q(x)\right| \end{equation} So, it looks like $\left\Vert P-Q\right\Vert$ is similar to half of the "$L^1$-norm". Right? – Ommo Sep 14 '23 at 16:02
  • @limone Updated my answer with some responses. – angryavian Sep 14 '23 at 16:08
  • Thanks a lot, very kind! So, just for my understanding: Fourth comment (part 1). Let $\mathcal{A}$ be a $\sigma$-algebra on a probability space $\Omega$. We can therefore write this concisely as $(\Omega,\mathcal{A})$. If the probability space $\Omega$ is finite, and consequently the $\sigma$-algebra $\mathcal{A}$ is finite, we can replace $\sup_{A \in \mathcal{A}}$ with $\max_{A \subseteq \mathcal{A}}$... Right or wrong? – Ommo Sep 14 '23 at 16:46
  • Fourth comment (part 2). Or replace $\sup_{A \in \mathcal{A}}$ with $\max_{A \in \mathcal{A}}$, in order to get "the supremum over measurable sets can be written as a maximum over all subsets"? – Ommo Sep 14 '23 at 16:48
  • @limone Updated my answer. – angryavian Sep 15 '23 at 01:00
  • Extremely thankful! Amazing :-) – Ommo Sep 15 '23 at 08:34

One can take the measure $\nu$ in the first definition to be $P+Q$. Then $P\ll\nu$, so (Radon-Nikodym) there is a density $p$ such that $P(A)=\int_Ap\,d\nu$ for all $A\in\mathcal A$. Likewise, there is a density $q$ such that $Q(A)=\int_A q\,d\nu$.

There is another (equivalent) definition based on these densities: $\|P-Q\|={1\over 2}\int|p-q|\,d\nu$.

Using this you can show that the supremum in definition 2 is attained at $A=\{x: p(x) > q(x)\}$. This shows that definitions 2 and 3 are equivalent.
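One way to fill in this step is sketched below, assuming $p$ and $q$ are the densities with respect to $\nu = P+Q$ from above, and writing $B = \{x : p(x) > q(x)\}$ (a label introduced here for convenience). Since $p - q > 0$ on $B$ and $p - q \le 0$ off $B$, for any $A \in \mathcal{A}$,

\begin{align} P(A) - Q(A) &= \int_A (p-q)\,d\nu \le \int_{A \cap B} (p-q)\,d\nu \le \int_{B} (p-q)\,d\nu = P(B) - Q(B). \end{align}

Applying the same bound to $A^c$ gives $Q(A) - P(A) = P(A^c) - Q(A^c) \le P(B) - Q(B)$, so $|P(A) - Q(A)| \le P(B) - Q(B)$ for every $A \in \mathcal{A}$, with equality at $A = B$. Finally, $\int (p-q)\,d\nu = 0$ implies $\int_B (p-q)\,d\nu = \int_{B^c} (q-p)\,d\nu$, and therefore

\begin{align} \frac{1}{2}\int |p-q|\,d\nu &= \frac{1}{2}\left( \int_B (p-q)\,d\nu + \int_{B^c} (q-p)\,d\nu \right) = P(B) - Q(B) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|. \end{align}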

John Dawkins
  • Many many thanks @John Dawkins!!! I have several comments, I try to put all of them, in the same spot.. :-) – Ommo Sep 14 '23 at 14:56
  • In https://math.stackexchange.com/questions/1481101/definition-of-the-total-variation-distance-vp-q-frac12-int-p-qd-n, I found the proof of this equality (using both the set $B=\{p\geq q\}$ and a general set $A$): \begin{equation} \frac{1}{2} \int \left| p-q \right| d\nu = \sup_{A \in \mathcal{A}} \left| \int_{A} (p-q) d\nu \right| \end{equation} However, how can one obtain the following (to show that definitions 2 and 3 are equivalent)? \begin{equation} \sup_{A \in \mathcal{A}} \left| \int_{A} (p-q) d\nu \right| = \max_{A \in \mathcal{A}} \left| \int_{A} (p-q) d\nu \right| \end{equation} – Ommo Sep 15 '23 at 09:12
  • I posed this question here as well: https://math.stackexchange.com/questions/4769436/equivalence-between-the-supremum-formula-and-the-maximum-formula-of-the-tot – Ommo Sep 15 '23 at 13:07