
I wish you a happy new year.

I am reading this paper.

I am struggling to understand a small part of Section 5.1 (Independence testing for multinomials) on page 17. Specifically, I am having difficulty understanding the sentence:

> In this case, the expectation of the U-statistic is $4\left\|p_{Y Z}-p_Yp_Z\right\|_2^2$.

Let me summarize the relevant notations and concepts for context.

Independence testing: Let $P_{YZ}$ be a joint distribution of $Y$ and $Z$ that belongs to a certain family of distributions $\mathcal{P}$, and let $P_YP_Z$ denote the product of its marginal distributions. Suppose that we observe $\mathcal{X}_n:=\left(\left(Y_1, Z_1\right), \ldots,\left(Y_n, Z_n\right)\right) \stackrel{\text { i.i.d. }}{\sim} P_{Y Z}$. Given the samples, the hypotheses for testing independence are $H_0: P_{YZ} = P_YP_Z$ versus $H_1: \delta(P_{YZ},P_YP_Z)\ge \epsilon_n$.

U-statistics: Let us consider two bivariate functions $g_Y(y_1,y_2)$ and $g_Z(z_1,z_2)$, which are symmetric in their arguments. They define a product kernel in the following way:

$$ h_{\text{in}} \left\{ \left( y_1, z_1 \right), \left( y_2, z_2 \right), \left( y_3, z_3 \right), \left( y_4, z_4 \right) \right\} := \left[ g_Y \left( y_1, y_2 \right) + g_Y \left( y_3, y_4 \right) - g_Y \left( y_1, y_3 \right) - g_Y \left( y_2, y_4 \right) \right] \cdot \left[ g_Z \left( z_1, z_2 \right) + g_Z \left( z_3, z_4 \right) - g_Z \left( z_1, z_3 \right) - g_Z \left( z_2, z_4 \right) \right] $$

For simplicity, we may also write $h_{\mathrm{in}}\left\{\left(y_1, z_1\right),\left(y_2, z_2\right),\left(y_3, z_3\right),\left(y_4, z_4\right)\right\}$ as $h_{\mathrm{in}}\left(x_1, x_2, x_3, x_4\right)$. Given this fourth-order kernel, consider a $U$-statistic defined by

$$ U_n:=\frac{1}{n_{(4)}} \sum_{\left(i_1, i_2, i_3, i_4\right) \in \mathbf{i}_4^n} h_{\mathrm{in}}\left(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}\right) $$

where $\mathbf{i}_4^n$ denotes the set of all $4$-tuples of distinct indices drawn from $\{1, \ldots, n\}$ and $n_{(4)} := n(n-1)(n-2)(n-3)$.

Independence testing for multinomials: Let $p_{Y Z}$ denote a multinomial distribution on a product domain $\mathbb{S}_{d_1, d_2}:=\left\{1, \ldots, d_1\right\} \times\left\{1, \ldots, d_2\right\}$, and let $p_Y$ and $p_Z$ be its marginal distributions. Let us recall the kernel $h_{\mathrm{in}}\left(x_1, x_2, x_3, x_4\right)$ defined above and instantiate it with the following bivariate functions:

$$ \begin{aligned} & g_{\mathrm{Multi}, Y}\left(y_1, y_2\right):=\sum_{k=1}^{d_1} \mathbb{1}\left(y_1=k\right) \mathbb{1}\left(y_2=k\right) \\ & g_{\mathrm{Multi}, Z}\left(z_1, z_2\right):=\sum_{k=1}^{d_2} \mathbb{1}\left(z_1=k\right) \mathbb{1}\left(z_2=k\right) \end{aligned} $$

Then, the authors claim that

> In this case, the expectation of the U-statistic is $4\left\|p_{Y Z}-p_Yp_Z\right\|_2^2$.
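As a sanity check on this identity, here is a small Python sketch that computes $\mathbb{E}_P[h_{\mathrm{in}}]$ exactly by enumerating all four-tuples on a tiny $2\times 2$ domain and compares it with $4\left\|p_{YZ}-p_Yp_Z\right\|_2^2$. The pmf values are made up purely for illustration:

```python
import itertools
import numpy as np

# Hypothetical joint pmf p_YZ on {0,1} x {0,1}, chosen so that Y and Z
# are NOT independent (p_YZ != p_Y p_Z).
p_yz = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_y = p_yz.sum(axis=1)  # marginal pmf of Y
p_z = p_yz.sum(axis=0)  # marginal pmf of Z

def g(a, b):
    """g_Multi(a, b) = 1{a == b}: the multinomial bivariate kernel."""
    return float(a == b)

def h_in(x1, x2, x3, x4):
    """The fourth-order product kernel from the question."""
    (y1, z1), (y2, z2), (y3, z3), (y4, z4) = x1, x2, x3, x4
    return ((g(y1, y2) + g(y3, y4) - g(y1, y3) - g(y2, y4))
            * (g(z1, z2) + g(z3, z4) - g(z1, z3) - g(z2, z4)))

# Exact expectation: enumerate all 4-tuples of i.i.d. draws from p_YZ.
support = list(itertools.product(range(2), range(2)))
expectation = sum(
    h_in(x1, x2, x3, x4) * p_yz[x1] * p_yz[x2] * p_yz[x3] * p_yz[x4]
    for x1, x2, x3, x4 in itertools.product(support, repeat=4)
)

claim = 4 * np.sum((p_yz - np.outer(p_y, p_z)) ** 2)
print(expectation, claim)  # the two values coincide
```

On this example both numbers come out equal (to $0.16$), consistent with the authors' claim.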

My attempt:

Since the summands of $U_n$ are identically distributed, we have $$\mathbb{E}_P[U_n] = \mathbb{E}_P\big[h_{\text{in}}((Y_1, Z_1), (Y_2, Z_2), (Y_3, Z_3), (Y_4, Z_4))\big].$$

On the other hand, we also have $$\begin{aligned} \mathbb{E}_P\big[g_{\text{Multi},Y}(Y_1, Y_2) \cdot g_{\text{Multi},Z}(Z_1, Z_2)\big] & = \sum_{k=1}^{d_1} \sum_{l=1}^{d_2} \mathbb{P}(Y_1 = k, Y_2 = k, Z_1 = l, Z_2 = l)\\ & = \sum_{k=1}^{d_1} \sum_{l=1}^{d_2} \mathbb{P}(Y_1 = k, Z_1 = l) \cdot \mathbb{P}(Y_2 = k, Z_2 = l) \quad \text{since the pairs $(Y_i,Z_i)$ are i.i.d.} \\ & = \sum_{k=1}^{d_1} \sum_{l=1}^{d_2} p^2_{YZ}(k,l). \end{aligned}$$
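This first identity can be confirmed by direct enumeration on a small example (the $2\times 2$ pmf below is hypothetical, chosen just for checking):

```python
import itertools
import numpy as np

# Hypothetical joint pmf on {0,1} x {0,1}.
p_yz = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

support = list(itertools.product(range(2), range(2)))

# E[ g_Y(Y1,Y2) g_Z(Z1,Z2) ] with (Y1,Z1), (Y2,Z2) i.i.d. ~ p_YZ:
lhs = sum(float(y1 == y2) * float(z1 == z2) * p_yz[y1, z1] * p_yz[y2, z2]
          for (y1, z1), (y2, z2) in itertools.product(support, repeat=2))

rhs = np.sum(p_yz ** 2)  # sum_{k,l} p_YZ(k,l)^2
print(lhs, rhs)  # both equal sum of squared joint probabilities
```

Both sides evaluate to $\sum_{k,l} p_{YZ}^2(k,l) = 0.30$ here.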

Similarly, I can calculate terms such as $\mathbb{E}_P\big[g_{\text{Multi},Y}(Y_3, Y_4) \cdot g_{\text{Multi},Z}(Z_3, Z_4)\big]$, $\mathbb{E}_P\big[g_{\text{Multi},Y}(Y_1, Y_3) \cdot g_{\text{Multi},Z}(Z_1, Z_3)\big]$, and $\mathbb{E}_P\big[g_{\text{Multi},Y}(Y_2, Y_4) \cdot g_{\text{Multi},Z}(Z_2, Z_4)\big]$.

However, I run into difficulties with the remaining terms, for example $\mathbb{E}_P\big[g_{\text{Multi},Y}(Y_1, Y_2) \cdot g_{\text{Multi},Z}(Z_3, Z_4)\big]$. To be more specific, I obtain

$$\begin{aligned} \mathbb{E}_P\big[g_{\text{Multi},Y}(Y_1, Y_2) \cdot g_{\text{Multi},Z}(Z_3, Z_4)\big] & = \sum_{k=1}^{d_1} \sum_{l=1}^{d_2} \mathbb{P}(Y_1 = k, Y_2 = k, Z_3 = l, Z_4 = l). \end{aligned}$$

Given that we only know that the pairs $(Y_i,Z_i)$ are i.i.d., I am unsure how to proceed. Intuitively, I expect $$\begin{aligned} \mathbb{E}_P\big[g_{\text{Multi},Y}(Y_1, Y_2) \cdot g_{\text{Multi},Z}(Z_3, Z_4)\big] & = \sum_{k=1}^{d_1} \sum_{l=1}^{d_2} p^2_Y(k)\,p^2_Z(l), \end{aligned}$$
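For what it is worth, brute-force enumeration on a small example (same hypothetical $2\times 2$ pmf) agrees with this guess:

```python
import itertools
import numpy as np

# Hypothetical joint pmf on {0,1} x {0,1}.
p_yz = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_y = p_yz.sum(axis=1)
p_z = p_yz.sum(axis=0)

support = list(itertools.product(range(2), range(2)))

# E[ g_Y(Y1,Y2) g_Z(Z3,Z4) ]: the four samples are distinct, so (Y1,Y2)
# is independent of (Z3,Z4) and the expectation factorises.
lhs = sum(float(y1 == y2) * float(z3 == z4)
          * p_yz[y1, z1] * p_yz[y2, z2] * p_yz[y3, z3] * p_yz[y4, z4]
          for (y1, z1), (y2, z2), (y3, z3), (y4, z4)
          in itertools.product(support, repeat=4))

rhs = np.sum(p_y ** 2) * np.sum(p_z ** 2)  # (sum_k p_Y^2)(sum_l p_Z^2)
print(lhs, rhs)
```

Both sides evaluate to $\big(\sum_k p_Y^2(k)\big)\big(\sum_l p_Z^2(l)\big) = 0.26$ on this example.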

but I cannot formally justify this step, since I have no independence assumption relating $Y_i$ to $Z_j$. Did I miss any important information?

Any help in resolving this would be greatly appreciated!

Pipnap
  • 507

1 Answer


Sorry for bothering you. I recently realized that since the pairs $(Y_i, Z_i)$ are i.i.d., coordinates taken from different samples are automatically independent; in particular, $Y_i$ and $Z_j$ are independent whenever $i \ne j$. Hence $\mathbb{P}(Y_1 = k, Y_2 = k, Z_3 = l, Z_4 = l) = p_Y^2(k)\,p_Z^2(l)$, because the four coordinates come from four distinct samples, and my problem is solved.
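A quick Monte Carlo sketch illustrates the point (the pmf is hypothetical: within a pair, $Y$ and $Z$ are dependent, but across pairs they are not):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf on {0,1} x {0,1}; Y and Z are dependent in a pair.
p_yz = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
flat = p_yz.ravel()  # categories 0..3 encode (Y,Z) = (c // 2, c % 2)

n = 200_000
idx = rng.choice(4, size=(n, 2), p=flat)  # two i.i.d. pairs per row
y1, z1 = idx[:, 0] // 2, idx[:, 0] % 2    # first pair  (Y1, Z1)
y2, z2 = idx[:, 1] // 2, idx[:, 1] % 2    # second pair (Y2, Z2)

# Within a pair:  P(Y1=0, Z1=0) = 0.30  !=  P(Y1=0) P(Z1=0) = 0.20.
# Across pairs:   P(Y1=0, Z2=0) = P(Y1=0) P(Z2=0) = 0.4 * 0.5 = 0.20.
within = np.mean((y1 == 0) & (z1 == 0))
across = np.mean((y1 == 0) & (z2 == 0))
print(within)  # close to 0.30 (dependence within a pair)
print(across)  # close to 0.20 (independence across pairs)
```

The cross-pair frequency matches the product of the marginals, which is exactly what the factorisation in the answer uses.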
