
Consider a domain $X$, the label set $Y=\{0,1\}$, and the zero-one loss.

Given any probability distribution $D$ over $X\times \{0,1\}$, we've defined the Bayes classifier $f_D$ to be

$$ f_{D}(x)= \begin{cases} 1 & \text{if }\mathbb{P}[y=1|x]\geq\tfrac{1}{2}\\ 0 & \text{otherwise.} \end{cases} $$

I wish to prove that, for any classifier $g\colon X\rightarrow\{0,1\}$, $L_D(f_D)\leq L_D(g)$, i.e. that $f_D$ is optimal.

$L_D(h)$ is defined to be the "true error" of the classifier $h$. That is, $L_D(h)=D(\{(x,y)\mid h(x)\neq y\})$.
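
For a concrete sanity check (made-up numbers, just to fix the definitions): if $X=\{a,b\}$ with $\mathbb{P}[x=a]=\mathbb{P}[x=b]=\tfrac{1}{2}$, $\mathbb{P}[y=1\mid a]=0.8$ and $\mathbb{P}[y=1\mid b]=0.3$, then $f_D(a)=1$, $f_D(b)=0$ and $L_D(f_D)=\tfrac{1}{2}\cdot 0.2+\tfrac{1}{2}\cdot 0.3=0.25$, whereas the constant classifier $g\equiv 1$ has $L_D(g)=\tfrac{1}{2}\cdot 0.2+\tfrac{1}{2}\cdot 0.7=0.45$.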

I'm having a hard time proving this from the definitions above; any hints or intuition would be appreciated.

David Richerby
Alex Goft

1 Answer


The true error of a classifier $h$ is $$ \begin{align*} L_D(h) &= \Pr_{(x,y) \sim D}[h(x) \neq y] \\ &= \mathop{\mathbb{E}}_{x \sim D} \begin{cases} \Pr[y \neq 0\mid x] & \text{if } h(x) = 0, \\ \Pr[y \neq 1\mid x] & \text{if } h(x) = 1. \end{cases} \end{align*} $$ (All probabilities are with respect to $D$.)
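
To see the decomposition concretely, here is a small Python sketch on a made-up finite distribution, checking that the joint-probability definition of $L_D(h)$ and the conditional decomposition above agree:

```python
# Made-up finite distribution over X = {0, 1, 2}, Y = {0, 1} (assumed numbers).
p_x  = {0: 0.5, 1: 0.3, 2: 0.2}   # marginal Pr[x]
p_y1 = {0: 0.8, 1: 0.3, 2: 0.5}   # Pr[y = 1 | x]

def loss_joint(h):
    """L_D(h) = sum of Pr[x, y] over all pairs (x, y) with h(x) != y."""
    return sum(p_x[x] * (p_y1[x] if y == 1 else 1 - p_y1[x])
               for x in p_x for y in (0, 1) if h(x) != y)

def loss_pointwise(h):
    """E_x[Pr[y != h(x) | x]], the case expression above."""
    return sum(p_x[x] * (p_y1[x] if h(x) == 0 else 1 - p_y1[x]) for x in p_x)

g = lambda x: 1 if x == 0 else 0   # an arbitrary classifier
assert abs(loss_joint(g) - loss_pointwise(g)) < 1e-12
```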

Write $\phi_h(x)$ for the conditional risk $$ \phi_h(x) = \begin{cases} \Pr[y \neq 0\mid x] & \text{if } h(x) = 0, \\ \Pr[y \neq 1\mid x] & \text{if } h(x) = 1, \end{cases} $$ so that $L_D(h) = \mathop{\mathbb{E}}_x[\phi_h(x)]$. A classifier that minimizes $\phi_h(x)$ at every $x$ therefore minimizes $L_D(h)$: no other classifier can do better at any point, hence none can do better in expectation. We can rewrite the conditional risk as $$ \phi_h(x) = \begin{cases} \Pr[y = 1\mid x] & \text{if } h(x) = 0, \\ 1-\Pr[y = 1\mid x] & \text{if } h(x) = 1, \end{cases} $$ so if $\Pr[y=1\mid x] < 1-\Pr[y=1\mid x]$ we should choose $h(x) = 0$, if $\Pr[y=1\mid x] > 1-\Pr[y=1\mid x]$ we should choose $h(x) = 1$, and if $\Pr[y=1\mid x] = 1-\Pr[y=1\mid x]$ the choice doesn't matter.
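
In code, picking the cheaper of the two cases at each $x$ gives exactly the Bayes rule (again a sketch on a made-up three-point domain; the only freedom is at points where $\Pr[y=1\mid x]$ is exactly $\tfrac{1}{2}$):

```python
# Same made-up conditionals as before: Pr[y = 1 | x] for x in {0, 1, 2}.
p_y1 = {0: 0.8, 1: 0.3, 2: 0.5}

# Choosing h(x) = 0 costs Pr[y=1|x]; choosing h(x) = 1 costs 1 - Pr[y=1|x].
# Take the cheaper option at every x (ties broken towards 0 here).
pointwise_min = {x: min((p, 0), (1 - p, 1))[1] for x, p in p_y1.items()}

bayes = {x: int(p >= 0.5) for x, p in p_y1.items()}   # the rule "1 iff Pr >= 1/2"

# The two can only disagree where Pr[y=1|x] = 1/2, where both choices cost the same.
assert all(pointwise_min[x] == bayes[x] or p_y1[x] == 0.5 for x in p_y1)
```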

Finally, $\Pr[y=1\mid x] < 1-\Pr[y=1\mid x]$ is the same as $\Pr[y=1\mid x] < 1/2$, which explains the formula for $f_D(x)$: in the boundary case $\Pr[y=1\mid x] = 1/2$ both choices are equally good, so predicting $1$ there (the "$\geq$" in the definition) is also optimal.
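
As a final brute-force check on the same made-up distribution, enumerating all $2^{|X|}$ classifiers on the three-point domain confirms that $f_D$ attains the minimum true error:

```python
from itertools import product

p_x  = {0: 0.5, 1: 0.3, 2: 0.2}   # marginal Pr[x]   (assumed numbers)
p_y1 = {0: 0.8, 1: 0.3, 2: 0.5}   # Pr[y = 1 | x]    (assumed numbers)

def true_error(h):
    """L_D(h) = E_x[Pr[y != h(x) | x]], with h given as a dict x -> label."""
    return sum(p_x[x] * (p_y1[x] if h[x] == 0 else 1 - p_y1[x]) for x in p_x)

f_D = {x: int(p_y1[x] >= 0.5) for x in p_x}   # the Bayes classifier

# Enumerate every g : {0, 1, 2} -> {0, 1} and compare against f_D.
best = min(true_error(dict(zip(p_x, labels)))
           for labels in product((0, 1), repeat=len(p_x)))
assert abs(true_error(f_D) - best) < 1e-12    # f_D achieves the minimum
```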

Yuval Filmus