
I have a bunch of random variables $X_1,\ldots,X_n$ that assume the value 1 with probability $p$, and 0 with probability $1 - p$. They are not necessarily independent.

I obtain a new random variable $\overline X_n$ as the average of the $X_i$; it has expectation $p$ and standard deviation $\sigma_n$, say. Can we say in some generality how this standard deviation changes as $n$ changes, if necessary under some assumptions on the $X_i$?

doetoe

2 Answers


We always have \begin{align} \sigma_n^2&=Var(\bar X_n)=\frac1{n^2}\left(\sum_{i=1}^n Var(X_i)+\sum_{1\le i<j\le n} 2Cov(X_i,X_j)\right) \\ &=\frac1{n^2}\left(np(1-p)+\sum_{1\le i<j\le n} 2Cov(X_i,X_j)\right) \end{align} By the Cauchy–Schwarz inequality, $|Cov(X_i,X_j)|\le \sqrt{Var(X_i)Var(X_j)}=p(1-p)$, so $\sigma_n^2\le \frac1{n^2}\left(np(1-p)+n(n-1)p(1-p)\right)=p(1-p)$.

The worst-case scenario is when they are all equal: then $Var(\bar X_n)=Var(X_1)=p(1-p)$, so this upper bound is achieved.
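(A quick numerical sanity check, not part of the argument itself: the snippet below uses numpy with arbitrarily chosen $n$ and $p$ to compare the two extremes, the independent case with variance $p(1-p)/n$ and the fully correlated worst case with variance $p(1-p)$.)

```python
import numpy as np

# Sanity check of the two extreme cases (arbitrarily chosen n and p).
rng = np.random.default_rng(0)
n, p, trials = 10, 0.3, 200_000

# Independent case: Var(Xbar_n) should be about p(1-p)/n.
X = rng.random((trials, n)) < p             # trials x n independent Bernoulli(p)
print(X.mean(axis=1).var(), p * (1 - p) / n)

# Worst case X_1 = ... = X_n: then Xbar_n = X_1 and Var(Xbar_n) = p(1-p).
B = (rng.random(trials) < p).astype(float)  # one Bernoulli(p) per trial, copied n times
print(B.var(), p * (1 - p))
```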

If $n$ is even, we can try to "cancel" the $X_i$ in pairs; for example, if $p=1/2$ we can take $X_{2k}=1-X_{2k-1}$ for $k=1,2,\dots,n/2$, and then $\sigma_n=0$.

Finally, if $p<1/2$ (the case $p>1/2$ is symmetric) and $n$ is even, we can create $n/2$ independent Uniform$[0,1]$ random variables $U_1,\dots,U_{n/2}$ and set $$ X_{2k-1}=1_{0<U_k<p},\qquad X_{2k}=1_{1-p<U_k<1},\qquad k=1,\dots,n/2. $$ Since $p<1/2$, the two events are disjoint, so each pair sum $X_{2k-1}+X_{2k}$ is Bernoulli$(2p)$ with variance $2p(1-2p)$, and thus $$ Var(\bar X_n)=\frac{2p(1-2p)\times n/2}{n^2}=\frac{p(1-2p)}{n}<\frac{p(1-p)}n, $$ which is smaller than the variance in the independent case.
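(Here is a small simulation sketch of this pairing construction, with illustrative values of $n$ and $p$; it is only meant to check the formula $Var(\bar X_n)=p(1-2p)/n$ numerically.)

```python
import numpy as np

# Pairing construction for p < 1/2 (illustrative n and p; n must be even).
rng = np.random.default_rng(1)
n, p, trials = 10, 0.3, 200_000
U = rng.random((trials, n // 2))           # one Uniform[0,1] per pair

X_odd  = (U < p).astype(float)             # X_{2k-1} = 1_{0 < U_k < p}
X_even = (U > 1 - p).astype(float)         # X_{2k}   = 1_{1-p < U_k < 1}
Xbar = (X_odd + X_even).sum(axis=1) / n    # average of all n variables

print(Xbar.mean(), p)                      # mean should be close to p
print(Xbar.var(), p * (1 - 2 * p) / n)     # variance close to p(1-2p)/n
```

Setting $p=1/2$ in the same construction gives $X_{2k}=1-X_{2k-1}$ almost surely, recovering the cancelling pairs above with variance $0$.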

(Perhaps an even smaller variance is achievable when $p\ne 1/2$.)

van der Wolf

In fact, there is a better lower bound, which shows that $Var(S_n)=n^2 \sigma_n^2$, where $S_n=X_1+\dots+X_n$, may remain bounded for all $n$.

Theorem. Suppose that $X_i$, $i=1,\dots, n$ are Bernoulli($p$) random variables and denote $S_n=X_1+\dots+X_n$. Let $m=\lfloor np\rfloor$, and let $a=np-m$ be the fractional part of $np$ (which can be $0$). Then $Var(S_n)\ge a(1-a)$, and this lower bound can be achieved.

Remark. This lower bound does not grow with $n$, and the lowest achievable value of $\sigma_n^2$ is thus $$ \sigma_n^2=\frac{a(1-a)}{n^2}. $$
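For instance (purely illustrative numbers): with $n=10$ and $p=0.27$ we get $np=2.7$, hence $m=2$ and $a=0.7$, so the smallest achievable value is $\sigma_n^2=a(1-a)/n^2=0.21/100=0.0021$, compared with $p(1-p)/n\approx 0.0197$ in the independent case.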

Proof. (a) First, let us show that this is indeed the lowest possible value. For simplicity, assume that $0<a<1$ (the case when $np$ is an integer can be handled similarly).

We claim that for any integer-valued random variable $S$, with $\mathbb E(S)=np\in(m,m+1)$, the variance cannot be less than $a(1-a)$, and moreover this minimum is achieved when $\mathbb P(S\in\{m,m+1\})=1$ (and then the variance must be $a(1-a)$).

Suppose, to the contrary, that the minimum is achieved for some $S$ with $\mathbb P(S\not\in\{m,m+1\})>0$. Then there exists either $i\le m-1$ such that $\mathbb P(S=i)>0$, or $j\ge m+2$ such that $\mathbb P(S=j)>0$. By symmetry, it suffices to study the first case.

Assume that there is an $i\le m-1$ such that $\mathbb P(S=i)>0$. We'll show that the variance of $S$ can be made smaller. Since $\mathbb E S>m$, there must be an integer $j\ge m+1$ such that $\mathbb P(S=j)>0$. Let $\epsilon=\min(\mathbb P(S=i),\mathbb P(S=j))>0$. Note also that $\mathbb P(S=m)<1$.

Take a small $\delta>0$ and create a new variable $S'$ by changing the probabilities only at the three points $i,m,j$. Namely, set \begin{align} \mathbb P(S'=i)&=\mathbb P(S=i)-x,\\ \mathbb P(S'=m)&=\mathbb P(S=m)+\delta,\\ \mathbb P(S'=j)&=\mathbb P(S=j)-y. \end{align} To ensure that the probabilities still sum to $1$ and that $\mathbb E S'=\mathbb E S$, we need \begin{align} -x+\delta-y&=0,\\ -ix+m\delta-jy&=0, \end{align} from which we get $$ x=\delta\frac{j-m}{j-i}>0,\qquad y=\delta\frac{m-i}{j-i}>0, $$ which is possible (i.e., $x\le\mathbb P(S=i)$ and $y\le\mathbb P(S=j)$) as long as $\delta$ is small enough. However, with this change the variance decreases: since the means are equal, \begin{align} Var(S')-Var(S)&=\mathbb E(S'^2)-\mathbb E(S^2) = - i^2 x +m^2 \delta-j^2 y\\ &=-(j - m)(m - i)\delta<0, \end{align} so $Var(S')<Var(S)$, contradicting the assumption that $S$ has the smallest variance.
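(A small numerical check of this step, not part of the original proof: with a toy integer-valued pmf and arbitrary points $i<m<j$, the modification keeps the total mass and the mean fixed and lowers the variance by exactly $\delta(j-m)(m-i)$.)

```python
import numpy as np

# Toy check of the mass-shifting step with an arbitrary integer-valued pmf.
i, m, j = 1, 3, 5                           # i <= m-1 and j >= m+1
support = np.arange(8)
pmf = np.array([0.05, 0.20, 0.10, 0.25, 0.10, 0.20, 0.05, 0.05])

delta = 0.01
x = delta * (j - m) / (j - i)               # mass removed at i
y = delta * (m - i) / (j - i)               # mass removed at j

new = pmf.copy()
new[i] -= x; new[m] += delta; new[j] -= y   # still a probability vector

def mean_var(q):
    mu = (support * q).sum()
    return mu, (support**2 * q).sum() - mu**2

mu0, v0 = mean_var(pmf)
mu1, v1 = mean_var(new)
print(mu0, mu1)                             # the mean is unchanged
print(v0 - v1, delta * (j - m) * (m - i))   # variance drops by delta*(j-m)*(m-i)
```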

By the same argument, we can show that if $\mathbb P(S=j)>0$ for some $j\ge m+2$, then $S$ does not have the smallest variance.

(b) Now let us show how this minimum can be achieved. Let $U$ be a uniform $[0,1]$ random variable, and consider the intervals $$ I_1=[0,p],\ I_2=[p,2p],\ I_3=[2p,3p],\ \dots, \ I_n=[(n-1)p,np]. $$ Let $X_k=1$ if $I_k\cap B\ne\emptyset$ where $$B:=\{U+i,\ i\in\mathbb{Z}\},$$ and $X_k=0$ otherwise. It is easy to check that $\mathbb{P}(X_k=1)=p$ for all $k=1,2,\dots,n$. On the other hand, each of the unit intervals $$ [0,1],\ [1,2], \ \dots, \ [m-1,m] $$ contains exactly one point of $B$, each such point lies in exactly one of the $I_k$, and each $I_k$ (having length $p<1$) contains at most one point of $B$.

Hence $S_n\ge m$, and $S_n=m+1$ exactly when the additional point $m+U$ of $B$ lies in $[m,np]$, i.e., when $U\le a$; otherwise $S_n=m$. As a result, $S_n-m$ is a Bernoulli($a$) random variable, which gives the statement of the theorem.
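(A Monte Carlo sketch of this construction, with arbitrary $n$ and $p$; it is only a sanity check that each $X_k$ is Bernoulli($p$) and that $Var(S_n)$ is close to $a(1-a)$.)

```python
import numpy as np

# Monte Carlo check of the interval construction in part (b) (arbitrary n and p).
rng = np.random.default_rng(2)
n, p, trials = 7, 0.4, 200_000
m = int(np.floor(n * p))
a = n * p - m

U = rng.random(trials)
k = np.arange(1, n + 1)
left, right = (k - 1) * p, k * p
# A point of B = {U + integer} lies in I_k = [left, right] iff some integer
# lies in [left - U, right - U]; almost surely that count equals
# floor(right - U) - floor(left - U), and it is 0 or 1 because p < 1.
counts = np.floor(right[None, :] - U[:, None]) - np.floor(left[None, :] - U[:, None])
Xk = counts > 0
S = Xk.sum(axis=1)

print(Xk.mean(axis=0))                      # each entry should be close to p
print(S.var(), a * (1 - a))                 # Var(S_n) close to a(1-a)
```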

Q.E.D.

van der Wolf