10

According to Casella and Berger (2002), the definition of a p-value (Def. 8.3.26, §8.3.4, p. 397) is:

A p-value $p(X)$ is a test statistic satisfying $0 \le p(x) \le 1$ for every sample point $x$. Small values of $p(X)$ give evidence that $H_1$ is true. A p-value is valid if, for every $\theta \in \Theta_0$ and every $0 \le \alpha \le 1$, $P_{\theta}(p(X) \le \alpha) \le \alpha$.

However, other books such as Rohatgi (2001) define it as:

The probability of observing under $H_0$ a sample outcome at least as extreme as the one observed is called the P-value. The smaller the P-value, the more extreme the outcome and the stronger the evidence against $H_0$.

I feel this definition is similar in spirit to the one by Schervish (2012):

p-value. In general, the p-value is the smallest level $\alpha_0$ such that we would reject the null hypothesis at level $\alpha_0$ with the observed data.

How are these definitions equivalent?

RobPratt
  • 50,938
user1868607
  • 6,243
  • 3
The definition of Casella-Berger is precise, abstract, and works in both frequentist and Bayesian frameworks. The other two are not so much formal definitions as procedural descriptions. I would add to both of them the idea that 'extreme' means in the direction (or directions) of the alternative: 'extremely small' if the alternative is left-sided, 'extremely large' if right-sided, 'extremely far from the null value' if two-sided. The C-B definition hints at the important property that (for continuous test statistics) a P-value is a random variable uniformly distributed on $(0,1)$ if $H_0$ is true. – BruceET Dec 27 '17 at 18:03
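
To illustrate the uniformity property mentioned in the comment above, here is a minimal simulation sketch. The setup is an illustrative choice, not anything from the quoted texts: a one-sided Z-test with $X \sim N(\theta, 1)$, $H_0: \theta = 0$ vs. $H_1: \theta > 0$, and $p(X) = 1 - \Phi(X)$.

```python
# Under H0 and a continuous test statistic, the p-value is Uniform(0,1),
# so P(p(X) <= alpha) = alpha exactly; validity only requires "<=".
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)  # draws of X under H0: theta = 0
p = 1 - norm.cdf(x)                     # one-sided p-values p(X) = 1 - Phi(X)

for alpha in (0.01, 0.05, 0.10):
    print(alpha, np.mean(p <= alpha))   # empirical P(p(X) <= alpha), approx alpha
```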

2 Answers

7

The difference between these definitions is that, while the first one presents a property that a p-value must satisfy, the latter ones define the p-value as a function of a collection of hypothesis tests. It is possible to show that these definitions are related, as I discuss below.

Note that the latter two definitions refer to varying levels of significance. In order to make these definitions operational, you must consider a collection of hypothesis tests. Formally, a hypothesis test $\phi: \mathcal{X} \rightarrow \{0,1\}$ is a function from the sample space that assumes the value $1$ if $H_0$ is rejected and the value $0$ otherwise. Let $(\phi_{\alpha})_{\alpha \in (0,1)}$ be a collection of hypothesis tests such that each $\phi_{\alpha}$ has size $\alpha$ and that satisfies monotonicity, that is, if $\alpha_1 \leq \alpha_2$, then for every $x$, $\phi_{\alpha_1}(x) \leq \phi_{\alpha_2}(x)$. The latter two definitions say that the p-value is a function $p: \mathcal{X} \rightarrow (0,1)$ such that $p(x)=\inf \{\alpha: \phi_{\alpha}(x)=1\}$. Observe that, for every $\theta \in \Theta_0$,

\begin{align*} \mathbb{P}_\theta(p(X) \leq \alpha^*) &= \mathbb{P}_\theta(\inf \{\alpha: \phi_{\alpha}(X)=1\} \leq \alpha^*) \\ &= \mathbb{P}_\theta(\phi_{\alpha^*}(X)=1) & \text{monotonicity} \\ &\leq \alpha^* & \phi_{\alpha^*} \text{ has size } \alpha^* \end{align*}

This shows that the p-value as defined by Rohatgi and Schervish satisfies the property presented in Casella and Berger.

Next, consider that $p: \mathcal{X} \rightarrow (0,1)$ is a function such that, for every $\theta \in \Theta_0$, $P_{\theta}(p(X) \leq \alpha^*) \leq \alpha^*$. In this case, you can define a collection of hypothesis tests by $\phi_{\alpha}(x)=\mathbb{I}(p(x) \leq \alpha)$. It follows from the initial definition that each $\phi_{\alpha}$ has level $\alpha$. Also, it follows from construction that, if $\alpha_1 \leq \alpha_2$, then for every $x$, $\phi_{\alpha_1}(x) \leq \phi_{\alpha_2}(x)$. Finally, note that $p(x) = \inf\{\alpha: \phi_{\alpha}(x)=1\}$. That is, you can construct a collection of hypothesis tests based on Casella and Berger's definition, and if you apply Rohatgi's or Schervish's definition to this class, you recover $p(x)$.
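
To make the correspondence concrete, here is a small numerical sketch under an assumed toy model (my own illustrative choice, not from any of the quoted books): $X \sim N(\theta,1)$, $H_0: \theta = 0$, and the monotone family $\phi_\alpha(x) = \mathbb{I}(x \geq z_{1-\alpha})$, each of size $\alpha$. Taking the infimum of $\{\alpha : \phi_\alpha(x) = 1\}$ over a grid reproduces $p(x) = 1 - \Phi(x)$.

```python
# Check that inf{alpha : phi_alpha(x) = 1} recovers the usual p-value.
import numpy as np
from scipy.stats import norm

def phi(alpha, x):
    """Size-alpha one-sided test: reject H0 iff x >= z_{1-alpha}."""
    return x >= norm.ppf(1 - alpha)

x_obs = 1.7                                    # an observed sample point
alphas = np.linspace(1e-4, 1 - 1e-4, 100_000)  # grid over (0, 1)
inf_alpha = alphas[phi(alphas, x_obs)].min()   # inf of {alpha : phi_alpha rejects}
print(inf_alpha, 1 - norm.cdf(x_obs))          # both approximately 0.0446
```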

madprob
  • 2,935
  • This is nice! It shows how to derive a p-value from such a class of monotone hypothesis tests and vice versa. But it's a stretch to say that Rohatgi's and Schervish's "definitions" are the p-value that you defined via the tests $\phi_\alpha$, $\alpha \in (0,1)$. That requires a good deal of interpretation. I'd say that Rohatgi and Schervish give intuitive explanations of p-values, but don't rigorously define them. (At least not in the quotes provided.) – zxmkn Mar 01 '24 at 10:55
3

This is an interesting question and I would like to add a few more comments.

  1. As mentioned by @BruceET, the definition in Casella-Berger is very general and is not directly tied to any particular testing scheme for a given hypothesis $H_0:\theta\in\Theta_0$. In fact, any valid $p$-value statistic $p(X)$ can be used to provide a testing scheme at any level of significance $\alpha\in(0,1)$, namely, reject the hypothesis $H_0$ when $p(X)\leq \alpha$. Indeed, $$\sup_{\theta\in\Theta_0}P_\theta(p(X)\leq\alpha)\leq\alpha.$$

  2. One can define a valid $p$-value statistic from any statistic $W(X)$ and a rejection scheme of the form: reject $H_0$ if $W(X)\geq c$ for some value $c$. Letting \begin{align} p(x):=\sup_{\theta\in\Theta_0}P_\theta(W(X)\geq W(x)),\tag{1}\label{one} \end{align} we have that $p(X)$ is a valid $p$-value. Indeed, for $\theta\in \Theta_0$, let $F_\theta$ be the distribution function of $-W(X)$ under $P_\theta$. Then $$p_\theta(x):=P_{\theta}(W(X)\geq W(x))=P_{\theta}(-W(X)\leq -W(x))=F_\theta(-W(x)).$$ Hence $p_\theta(X)=F_\theta(-W(X))$, and so $p_\theta(X)$ stochastically dominates a uniform distribution on $(0,1)$, that is, $$P_\theta(p_\theta(X)\leq u)\leq u, \qquad 0<u<1.$$ Since $p_\theta\leq p$ for all $\theta\in\Theta_0$, we conclude that $$P_\theta(p(X)\leq u)\leq P_\theta(p_\theta(X)\leq u)\leq u.$$ (A numerical sketch of this construction follows below.)
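
Here is the promised sketch of construction \eqref{one}, with an assumed toy setup of my own: $W \sim \text{Binomial}(10, 1/2)$ under a simple null, rejecting for large $W$, so $p(w)=P(W\geq w)$. Because $W$ is discrete, the dominance $P_\theta(p(W)\leq u)\leq u$ is strict for most $u$: the $p$-value is valid but conservative.

```python
# Construction (1) with a discrete statistic: p(w) = P(W >= w) under H0.
# Discreteness makes P(p(W) <= u) typically strictly less than u.
import numpy as np
from scipy.stats import binom

n, q = 10, 0.5
w = np.arange(n + 1)
p_vals = binom.sf(w - 1, n, q)   # p(w) = P(W >= w) = sf(w - 1)
pmf = binom.pmf(w, n, q)

for u in (0.01, 0.05, 0.10, 0.25):
    # exact P(p(W) <= u) under H0: sum the pmf over {w : p(w) <= u}
    print(u, pmf[p_vals <= u].sum())
```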

Notice that in points 1 and 2 we have made no reference to the level of any particular testing scheme. In practice, there are two settings in which valid $p$-values are used:

  3. Suppose that for any $\alpha\in(0,1)$ there are sets $S_\alpha$ in the sample space $\mathcal{X}$ such that

    • $S_\alpha\subset S_{\alpha'}$ whenever $\alpha<\alpha'$
    • $\sup_{\theta\in\Theta_0}P_\theta(X\in S_\alpha)\leq \alpha$

For each $\alpha$, define the testing scheme: reject $H_0$ if $X\in S_\alpha$. This provides a test of level $\alpha$ for $H_0$. It follows that if \begin{align} p(x):=\inf\{\alpha: x\in S_\alpha\}\tag{2}\label{two} \end{align} then $p(X)$ is a valid $p$-value. This is the same scheme discussed by @madprob in the other answer. Expression \eqref{two} is perhaps the most familiar notion of $p$-value used in basic courses in Statistics.

  4. Along the lines of point 3, suppose that each set $S_\alpha$ is of the form $S_\alpha=\{x\in\mathcal{X}: W(x)\geq k_\alpha\}$, where $W(X)$ is a statistic and $$\sup_{\theta\in\Theta_0}P_\theta(W(X)\geq k_\alpha)=\alpha.$$ In this case, expressions \eqref{one} and \eqref{two} coincide. Indeed, let $p(x)$ be given by \eqref{one}. If $\alpha<p(x)$, then $$\sup_{\theta\in\Theta_0}P_\theta(W(X)\geq k_\alpha)=\alpha<p(x)=\sup_{\theta\in\Theta_0}P_\theta(W(X)\geq W(x)).$$ This implies that $W(x)<k_\alpha$ and thus, $H_0$ is not rejected.
    Now if $p(x)\leq\alpha$, then $$\sup_{\theta\in\Theta_0}P_\theta(W(X)\geq k_\alpha)=\alpha\geq p(x)=\sup_{\theta\in\Theta_0}P_\theta(W(X)\geq W(x)).$$ This leads to two possible cases: (a) $W(x)\geq k_\alpha$ and thus, $H_0$ is rejected, or (b) $W(x)<k_\alpha$ and $\alpha=p(x)$. By setting $k'_\alpha=W(x)$, we have that the set $\{y\in\mathcal{X}: W(y)\geq k'_\alpha\}$ also provides a rejection set of level $\alpha$ for $H_0$, and thus we can reject $H_0$. Putting things together, we obtain that $p(x)=\inf\{\alpha\in(0,1): x\in S_\alpha\}$; that is, $p(x)$ is the minimum level at which $H_0$ is rejected when the realization $X=x$ is observed.
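
As a quick numerical sanity check of this coincidence, under an assumed setup of my own choosing ($W(X)\sim\chi^2_4$ under a simple null, $k_\alpha$ the upper $\alpha$ quantile), computing the $p$-value via \eqref{one} and via \eqref{two} gives the same number:

```python
# Compare (1) p(x) = P(W >= w_obs) with (2) p(x) = inf{alpha : w_obs >= k_alpha}.
import numpy as np
from scipy.stats import chi2

df, w_obs = 4, 11.3
p_one = chi2.sf(w_obs, df)                     # (1): upper-tail probability
alphas = np.linspace(1e-5, 1 - 1e-5, 200_000)  # grid over (0, 1)
in_S = w_obs >= chi2.isf(alphas, df)           # k_alpha = upper alpha quantile
p_two = alphas[in_S].min()                     # (2): inf over rejecting levels
print(p_one, p_two)                            # both approximately 0.023
```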
Mittens
  • 46,352