
Let $f$ be a function on a domain $X$ with binary output, $f: X\to \{0,1\}$. Fix an arbitrary distribution $\mathcal{Q}$ over $X$, and let $\mathbf{Q}^n$ denote the empirical distribution of $n$ samples drawn from $\mathcal{Q}$.

By the Glivenko–Cantelli theorem, the expected average value of $f$ on the empirical distribution converges to the probability that $f(x)=1$:

\begin{equation} \mathbb{E}_{\mathbf{Q}^n}\left[\frac{1}{n}\sum_{x\in \mathbf{Q}^n} f(x)\right] \longrightarrow \mathbb{P}_{x\sim \mathcal{Q}}[f(x)=1] \end{equation}
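As a quick sanity check, here is a minimal numerical sketch of Equation 1; the particular choices $\mathcal{Q}=\mathrm{Uniform}[0,1]$ and $f(x)=\mathbf{1}[x<0.3]$ are mine, purely for illustration, so that the right-hand side equals $0.3$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not part of the question): Q = Uniform[0, 1] and
# f(x) = 1 if x < 0.3, so that P[f(x) = 1] = 0.3.
def f(x):
    return (x < 0.3).astype(float)

for n in [10, 100, 1_000, 10_000, 100_000]:
    # average the empirical mean (1/n) * sum f(x) over 200 independent samples
    means = [f(rng.uniform(size=n)).mean() for _ in range(200)]
    print(n, np.mean(means))  # stays around 0.3 for every n, as expected
```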

Now, to make things slightly more complicated, let $F$ be some class of binary functions. Consider a simple game: draw $n$ samples from $\mathcal{Q}$ (i.e. sample an empirical distribution $\mathbf{Q}^n$), then find the function $f\in F$ that minimizes the sum $\sum_{x\in \mathbf{Q}^n} f(x)$. I would like to show that the expected value of this minimized mean converges (at least weakly):

\begin{equation} \mathbb{E}_{\mathbf{Q}^n}\left[\min_{f\in F}\frac{1}{n}\sum_{x\in \mathbf{Q}^n} f(x)\right] \longrightarrow \min_{f\in F} \mathbb{P}_{x\sim \mathcal{Q}}[f(x)=1] \end{equation}

This seems intuitively true to me, but the problem is that the values $f(x)$ are not i.i.d. across $x\in \mathbf{Q}^n$ once $f$ is chosen using the sample. (A simple example: let $F$ be the class of functions that output $1$ only on some fixed-radius $L_p$ ball; if you observe $f(x_i)=1$, then you know $f(x_j)=0$ for any $x_j$ that cannot lie in the same ball as $x_i$.) I'm a bit stuck on how to reason about this dependence structure, but I'm still convinced that Equation 2 should converge, at least under some minimal set of assumptions on the behaviour of $F$.
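Here is a small simulation that matches my intuition; the class is my own illustrative choice, not part of the problem. Take $\mathcal{Q}$ uniform on $[0,1]$ and $F$ the indicators of intervals $[a, a+0.2]\subseteq[0,1]$, so every $f\in F$ has $\mathbb{P}[f(x)=1]=0.2$ and the right-hand side of Equation 2 is $0.2$. The simulated left-hand side appears to approach $0.2$ from below as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative class (my choice): F = indicators of intervals [a, a + 0.2] in [0, 1],
# with Q = Uniform[0, 1]. Every f in F has P[f(x) = 1] = 0.2.
starts = np.linspace(0.0, 0.8, 801)  # grid of candidate left endpoints a

def min_empirical_mean(sample):
    # fraction of sample points falling in [a, a + 0.2], minimized over the grid of a
    s = np.sort(sample)
    lo = np.searchsorted(s, starts, side="left")
    hi = np.searchsorted(s, starts + 0.2, side="right")
    return (hi - lo).min() / len(s)

for n in [10, 100, 1_000, 10_000]:
    vals = [min_empirical_mean(rng.uniform(size=n)) for _ in range(100)]
    print(n, np.mean(vals))  # creeps up toward 0.2 as n grows
```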

Thanks

1 Answer


This is not a complete answer, but an example of why you should expect strong conditions on $F$ (or $\mathcal{Q}$, or $X$) to be required.

Take $\mathcal{Q}$ to be the uniform distribution on $[0,1]$. Take $F$ to be the class of all boolean functions on $[0,1]$ whose support has measure $0.5$ (indicators of measure-one-half sets).

Then for any $n$ and any sample of $n$ points, some $f \in F$ outputs $0$ on all of them, so the minimum inside the expectation is $0$ and the left-hand side of (2) is always $0$. On the other hand, $\mathbb{P}_{x\sim \mathcal{Q}}[f(x)=1]=0.5$ for every $f\in F$ by definition, so the right-hand side is $0.5$.
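To make this concrete, here is a sketch of the construction (my own code, purely for illustration): given any finite sample from $[0,1]$, delete tiny intervals around the sample points and keep gaps totalling measure exactly $0.5$; the resulting indicator lies in $F$, yet its empirical mean on the sample is $0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure_half_set_avoiding(points, eps=1e-9):
    """Intervals of total length 0.5 in [0, 1] that contain none of `points`."""
    edges = np.concatenate(([0.0], np.sort(points), [1.0]))
    # open gaps between consecutive sample points, shrunk by eps at each end
    gaps = [(edges[i] + eps, edges[i + 1] - eps)
            for i in range(len(edges) - 1) if edges[i + 1] - edges[i] > 2 * eps]
    chosen, total = [], 0.0
    for a, b in gaps:  # greedily collect gaps until the total length is exactly 0.5
        if total + (b - a) >= 0.5:
            chosen.append((a, a + 0.5 - total))
            return chosen
        chosen.append((a, b))
        total += b - a
    return chosen

sample = rng.uniform(size=1_000)
S = measure_half_set_avoiding(sample)
print(sum(b - a for a, b in S))                           # 0.5, so the indicator of S is in F
print(sum(any(a < x < b for a, b in S) for x in sample))  # 0: empirical mean of that f is 0
```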

You can fiddle with $\mathcal{Q}$ and $F$ to get an arbitrarily large gap, and the problem should persist whenever $F$ is big enough to distinguish 'most' sets of $n$ points.

On that note, perhaps a complete answer could pull in some ideas from VC dimension - though it's not clear to me that this will be enough.
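For what it's worth, the usual route would be a uniform convergence bound: if one is willing to assume $\mathrm{VC}(F)=d<\infty$, then a standard VC/symmetrization argument (constants vary by source) gives, with probability at least $1-\delta$,

\begin{equation} \sup_{f\in F}\left|\frac{1}{n}\sum_{x\in \mathbf{Q}^n} f(x) - \mathbb{P}_{x\sim \mathcal{Q}}[f(x)=1]\right| \le c\,\sqrt{\frac{d+\log(1/\delta)}{n}}, \end{equation}

and since $|\min_{f} a_f - \min_{f} b_f| \le \sup_{f} |a_f - b_f|$, such a bound would carry Equation 1 over to Equation 2. Whether anything weaker than finite VC dimension suffices is the part I'm unsure about.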

Artimis Fowl
  • Thank you, this is very helpful! I'm thinking that if we require $\mathrm{VC}(F) < C$ (a reasonable assumption in ML) then we won't be stuck with this particular problem. Although proving the convergence is still not easy. – user2757771 May 12 '22 at 18:43