General test based on group invariance
Contents
General test based on group invariance¶
References:
- Romano, J.P., 1988. A bootstrap revival of some nonparametric distance tests, J. Amer. Stat. Assoc., 83, 698–708 
- Romano, J.P., 1989. Bootstrap and randomization tests of some nonparametric hypotheses, Ann. Stat., 17, 141–159 
Data \(X \sim P\) takes values in \(\mathcal X\).
\(\mathcal G\) is a finite group of transformations from \(\mathcal X\) to \(\mathcal X\). \(\# \mathcal G = G\).
Want to test null hypothesis \(H_0: P \in \Omega_0\).
Suppose \(H_0\) implies that \(P\) is invariant under \(\mathcal G\):
The orbit of \(x\) (under \(\mathcal G\)) is \(\{ gx : g \in \mathcal G \}\). (Does the orbit of \(x\) always contain \(G\) points?)
Test statistic \(T\)¶
Let \(T: \mathcal{X} \rightarrow \mathbb{R}\) be a test statistic.
We want to test \(H_0\) at significance level \(\alpha\).
For each fixed \(x\), let \(T^{(k)}(x)\) be the \(k\)th smallest element of the multiset
These are the \(G\) (not necessarily distinct) values \(T\) takes on the orbit of \(x\).
Finding the rejection region¶
Let
Define
and
Let
Define
To test the hypothesis, generate \(U \sim U[0, 1]\) independent of \(X\).
Reject \(H_0\) if \(\phi(x) \ge U\). (Randomized test.)
Test has level \(\alpha\) unconditionally¶
For each \(x \in \mathcal{X}\),
So if \(X \sim gX \sim P\) for all \(g \in \mathcal G\),
The unconditional chance of a Type I error is exactly \(\alpha\).
Tests for the mean of a symmetric distribution¶
Data \(X = (X_j )_{j=1}^N \in \mathcal{X} = \mathbb{R}^n\).
\(\{ X_j \}\) iid \(P\); \(\mathbb{E} X_j = \mu\).
Suppose \(P\) is symmetric. Examples?
Reflection group¶
Let \(\mathcal{G}_\mu\) be the group of reflections of coordinates about \(\mu\).
Let \(x \in \mathbb{R}^n\). Each \(g \in \mathcal{G}_\mu\) is of the form
for some \(\gamma = (\gamma_j)_{j=1}^n \in \{0, 1\}^n\).
Is \(\mathcal{G}_\mu\) really a group?
What’s the identity element?
What’s the inverse of \(g\)?
What \(\gamma\) corresponds to \(g_1g_2\)?
What is \(G\), the number of elements of \(\mathcal{G}_\mu\)?
What is the orbit of a point \(x\) under \(\mathcal{G}_\mu\)?
Are there always \(2^n\) distinct elements of the orbit?
Test statistic:
If \(\mathbb{E} X_j = \mu_0\), this is expected to be small—but how large a value would be surprising?
If the expected value of \(X_j\) is \(\mu\) and \(P\) is symmetric (i.e., if \(H_0\) is true), the \(2^n\) potential data
in the orbit of \(X\) under \(\mathcal{G}\) are equally likely.
Hence, the values in the multiset
are equally likely, conditional on the event that \(X\) is in the orbit of \(x\).
How to test \(H_0\): \(\mu = \mu_0\)?¶
We observe \(X = x\).
If fewer than \(\alpha G\) values in \([T(gx) : g \in \mathcal{G}_{\mu_0}]\) are greater than or equal to \(T(x)\), reject. If more than \(\alpha G\) values are greater than \(T(x)\), don’t reject. Otherwise, randomize.
If \(n\) is big …¶
How can we sample at random from the orbit?
Toss fair coin \(n\) times in sequence, independently. Take \(\gamma_j = 1\) if the \(j\)th toss gives heads; \(\gamma_j = 0\) if tails.
Amounts to sampling with replacement from the orbit of \(x\).
Other test statistics?¶
Could use \(t\)-statistic, but calibrate critical value using the permutation distribution.
Could use a measure of dispersion around the hypothesized mean (the true mean minimizes expected RMS difference, assuming variance is finite).
What about counting the number of values that are above \(\mu_0\)?
Define
Sign test for the median¶
We are assuming \(P\) is symmetric, so the expected value and median are equal.
To avoid unhelpful complexity, suppose \(P\) is continuous.
Then
and
This leads to the sign test: Reject the hypothesis that the median is \(\mu_0\) if \(T(X)\) is too large or too small. Thresholds set to get level \(\alpha\) test, using the fact that \(T(X)\sim \mbox{Binomial}(n, 1/2)\).
Is the sign test equivalent to the permutation test for the same test statistic?¶
Suppose we are using the test statistic \(T(x) = \sum_{j=1}^n 1_{x \ge \mu_0}\). Do the permutation test and the sign test reject for the same values of \(x \in \mathcal{X}\)?
Suppose no component of \(x\) is equal to \(\mu_0\). According to the sign test, the chance that \(T(x) = k\) is \({{n}\choose{k}} 2^{-n}\) if the null is true.
What’s the chance that \(T(x) = k\) according to the permutation test? There are \(G = 2^n\) points in the orbit of \(x\) under \(\mathcal G\). If the null is true, all have equal probability \(2^{-n}\). Of these points, \({{n}\choose{k}}\) have \(k\) components with positive deviations from \(\mu_0\). Hence, for the permutation test, the chance that \(T(x) = k\) is also \({{n} \choose {k}} 2^{-n}\): The two tests are equivalent.
Abstract setting using VC classes¶
The distribution \(P \in \Omega\), where \(\Omega\) is a known collection of distributions on \(\mathcal{X}\). The null hypothesis is that \(P \in \Omega_0 \subset \Omega\). We assume that \(\Omega_0\) can be characterized as a set of distributions that are invariant under a transformation on \(\Omega\): let \(\tau: \Omega \rightarrow \Omega_0\); we assume that \(\tau(P) = P\) for all \(P \in \Omega_0\).
Let \(\mathcal{V}\) be a collection of subsets of a set \(\mathcal{X}\). For a finite set \(D \subset \mathcal{X}\), let \(\Delta^\mathcal{V}(D)\) be the number of distinct sets \(\{ V \cap D: V \in \mathcal{V} \}\). For positive integers \(n\), let
Let
If \(c(\mathcal{V}) < \infty\), \(\mathcal{V}\) is a Vapnik-Cervonenkis (V-C) class. That is, \(\mathcal{V}\) is a V-C class if the maximum number of distinct intersections of sets in \(\mathcal{V}\) with sets containing \(n\) points grows sub-exponentially with \(n\). Intersections, finite unions, and Cartesian products of V-C classes are V-C classes. In \(\mathbb{R}^n\), the set of all ellipsoids, the set of all half-spaces, the set of all lower-left quadrants, and the set of all convex sets with at most \(p\) extreme points are all V-C classes.
An alternative, equivalent definition of a V-C class is based on the following definition:
Suppose \(\mathcal{V}\) is a collection of subsets of a set \(\mathcal{X}\), and that \(D\) is a finite subset of \(\mathcal{X}\). We say \(D\) is shattered by \(\mathcal{V}\) if every subset \(d \subset D\) can be written \(d = V \cap D\) for some \(V \in \mathcal{V}\).
Suppose \(D\) has \(n\) elements. Because there are \(2^n\) subsets of a set with \(n\) elements, this is equivalent to saying that there are \(2^n\) different subsets of the form \(D \cap V\) as \(V\) ranges over \(\mathcal{V}\).
A collection \(\mathcal{V}\) is a V-C class if for some finite integer \(n\), there exists a set \(D \subset \mathcal{X}\) with \(n\) elements that is not shattered by \(\mathcal{V}\).
Example. Half lines on \(\mathbb{R}\). Consider a set \(D = \{x_j\}_{j=1}^n\) of points on the real line. Let \(\mathcal{V} = \{ (-\infty, y] : y \in \mathbb{R} \}\). How many sets are there of the form \(V \cap D\), for \(V \in \mathcal{V}\)? Just \(n+1\). Suppose the points are in increasing order, so that \(x_1 < x_2 < \cdots < x_n\). Then the possibilities for \(V \cap D\) are \(\{ \}\), \(\{ x_1 \}\), \(\{ x_1, x_2 \}\), \(\ldots\), \(\{x_j \}_{j=1}^n\). Thus \(m^{\mathcal{V}}(n) = n+1\), and \(c(\mathcal{V}) \equiv \inf \{ n : m^\mathcal{V}(n) < 2^n \} = 2\) (for \(n=0\), we have \(0+1 = 2^0\), and for \(n=1\), we have \(1+1 = 2^1\), but for \(n=2\), we have \(2+1 < 2^2\)).
Example. Closed intervals \(\{[y, z]\): \(y < z\}\) on \(\mathbb{R}\). For finite sets \(D\) as discussed above, the possibilities for \(V \cap D\) include all sets of adjacent values, such as \(\{x_1\}\), \(\{x_2\}\), \(\{x_3\}\), \(\{x_1, x_2\}\), \(\{x_2, x_3\}\), and \(\{x_1, x_2, x_3\}\), but not, for example, \(\{ x_1, x_3 \}\). Clearly, \(m^{\mathcal{V}}(2) = 4\) but \(m^{\mathcal{V}}(3) = 7\), so \(c(\mathcal{V}) = 3\). (The general rule is \(m^{\mathcal{V}}(n) = 1 + n + {n}\choose{2}\). Why?)
Suppose that \(\mathcal{V}\) and \(\mathcal{W}\) are V-C classes on a common set \(\mathcal{X}\). Then \(\mathcal{V} \cup \mathcal{W}\) is also a V-C class, as is \(\mathcal{V} \cap \mathcal{W}\).
Let \(\mathcal{V}\) be a VC class of subsets of \(\mathcal{S}\). Define the pseudo-metric
This is a generalization of the Kolmogorov-Smirnov distance for distributions on the line. In that case, the sets in \(\mathcal{V}\) are the half-lines \(\{(-\infty, y] : y \in \mathbb{R}\}\) (which comprise a V-C class).
Assume that \(\mathcal{V}\) and \(\tau\) have been selected such that \(\delta(P, \tau P) = 0\) iff \(P \in \Omega_0\). Romano proposes using the test statistic
where \(\hat{P}_n\) is the empirical measure of \(\{X_j \}_{j=1}^n\). One rejects the hypothesis when \(\tau\hat{P}_n\) is far from \(\hat{P}_n\); i.e., when \(T_n\) is sufficiently large.
But how large? One way to obtain a critical value for the test is with the bootstrap: resample from \(\tau(\hat{P}_n)\), tabulate the distribution of the distance between the empirical distribution of the bootstrap samples and \(\tau\) applied to them, use the \(1-\alpha\) quantile of that distribution as the critical value for an approximate level \(\alpha\) test. (We have to resample from \(\tau(\hat{P}_n)\) rather than \(\hat{P}_n\) because the significance level is computed under the assumption that the null hypothesis is true. The null hypothesis is true for \(\tau(\hat{P}_n)\) but not necessarily for \(\hat{P}_n\).)
Suppose that there is a (known) group \(\mathcal{G}_n\) of transformations of the sample space \(\mathcal{S}_n\) such that under the null hypothesis, \(P\) is invariant under \(\mathcal{G}_n\). Then we can also construct a randomization test of the hypothesis \(H_0\). For simplicity, suppose that \(\mathcal{G}_n\) is finite, with \(M_n\) elements \(\{ g_{nj} \}_{j=1}^{M_n}\). Under the null hypothesis, conditional on \(X = x\), the values \(\{ g_{nj}x \}_{j=1}^{M_n}\) are equally likely. (The orbit of an point \(x\) in a space \(S\) acted on by a group \(\mathcal{G}\) is the set of all elements of \(S\) that can be obtained by applying elements of \(\mathcal{G}\) to \(x\). That is, it is the set \(\{ g(x): g \in \mathcal{G}\}\). For example, consider points in the plane and the group of rotations about the origin. Then the orbit of a point \(x\) is the circle with radius \(\|x\|\).)
Compute the test statistic for each \( g_{nj}x\) in the orbit of \(x\). Reject the null hypothesis if the statistic for \(x\) exceeds the \(1-\alpha\) quantile of the test statistic for the set of values obtained from the orbit; do not reject if it is less; reject with a given probability if the statistic equals the \(1-\alpha\) quantile, in such a way as to get a level \(\alpha\) test. This is a randomization test. Because the level of the randomization test is \(\alpha\), conditional on the data, integrating over the distribution of the data shows that it is \(\alpha\) unconditionally.
Examples of hypotheses and functions \(\tau\)¶
Examples Romano gives include testing for independence of the components of each \(X_j\), testing for exchangeability of the components of each \(X_j\), testing for spherical symmetry of the distribution of \(X_j\), testing for homogeneity among the \(X_j\), and testing for a change point.
In the example of testing for independence, the mapping \(\tau\) takes the marginal distributions of the joint distribution, then constructs a joint distribution that is the product of the marginals. For distributions with independent components, this is the identity; otherwise, it maps a distribution into one with the same marginals, but whose components are independent. For testing for spherical symmetry, \(\tau\) maps a distribution into one with the same mass at every distance from the origin, but that is uniform on spherical shells. For testing for exchangability, Romano proposes looking at the largest difference between \(P\) and a permutation of the coordinates of \(P\), over all permutations of the coordinates. See his paper for more details.
Romano shows that these tests are consistent against all alternatives, and that the critical values given by the bootstrap and by randomization are asymptotically equal with probability one. Because the randomization tests are exact level \(\alpha\) tests, they might be preferred. Romano also briefly discusses how to implement the tests computationally.
Let’s consider the implementation in more detail, for two hypotheses: independence of the components of a \(k\)-variate distribution, and rotational invariance of a bivariate distribution.
Independence¶
We observe \(\{X_j\}_{j=1}^n\) iid \(P\), where each \(X_j = (X_{ij})_{i=1}^k\) takes values in \(\mathbb{R}^k\). Under the null hypothesis, \(P\) is invariant under the mapping \(\tau\) that takes the \(k\) marginal distributions of \(P\) and multiplies them together to give a probability on \(\mathbb{R}^k\) with independent components. Let \(\hat{P}_n\) be the empirical measure; let the V-C class \(\mathcal{V}\) be the set of lower left quadrants \(\{ Q(x) : x \in \mathbb{R}^k \}\) where
Then
and
The maximum difference in the probability of a lower left quadrant \(Q(x)\) occurs when \(x\) is one of the points of support of \(\tau \hat{P}_n\):
To test the null hypothesis of independence, we would compute
from the data \(X\), then repeatedly draw iid samples \(X^*\) of size \(n\) from \(\tau \hat{P}_n\), computing
for each. We would reject the null hypothesis that the components of \(P\) are independent (at approximate significance level \(\alpha\)) if \(T(X)\) exceeds the \(1-\alpha\) quantile of the simulated distribution of \(T(X^*)\).
Rotational invariance in \(\mathbb{R}^2\)¶
We observe \(\{X_j\}_{j=1}^n\) iid \(P\), where each \(X_j = (X_{1j}, X_{2j})\) takes values in \(\mathbb{R}^2\).
For \(y \in \mathbb{R}^2\), define \(|y| \equiv \sqrt{y_1^2 + y_2^2}\) to be the distance from \(y\) to the origin.
Except at the origin, the mapping from Cartesian coordinates \((x_1, x_2)\) to polar coordinates \((r, \theta)\) is one-to-one; identify the origin with the polar coordinates \((0, 0)\). Under the null hypothesis, \(P\) is invariant under the mapping \(\tau\) that produces a distribution with the same marginal distribution of \(|X|\) but that is uniform on \(\theta\) for each possible value of \(|X|\).
As before, let \(\hat{P}_n\) be the empirical measure; let the V-C class \(\mathcal{V}\) be the set of lower left quadrants \(\{ Q(x) : x \in \mathbb{R}^2 \}\) where
Then
To proceed, we need to find the probability of lower left quadrants \(Q(x)\) for the distribution \(\tau \hat{P}_n\). Consider the contribution from each \(X_j\) separately. Let \(R_j = |X_j| = \sqrt{X_{1j}^2 + X_{2j}^2}\). The contribution of \(X_j\) to \(\tau \hat{P}_n (Q(x))\) is \(1/n\) times the fraction of the circle \(\{ y \in \mathbb{R}^2 : |y| = R_j\}\) that is in the quadrant \(Q(x)\). There are eight cases to consider:
- \(x_1^2 + x_2^2 > R_j^2\), \(x_1, \; x_2 < 0\) or \(x_1 < -R_j\) or \(x_2 < -R_j\). 
 The contribution is 0: the quadrant does not intersect the circle \(|y| = R_j\).
- \(x_1, x_2 > R_j\). 
 The contribution is \(1/n\): the quadrant contains the entire circle \(|y| = R_j\).
- \(x_1^2 + x_2^2 \le R_j^2\). 
 The quadrant includes an arc that is at most half the circle.
 Let the points at which the quadrant boundary intersects the circle be \((x_1', x_2)\) and \((x_1, x_2')\). Then \(x_1'\) is the negative root of \(x_1^{'2} = R_j^2 - x_2^2\) and \(x_2'\) is the negative root of \(x_2^{'2} = R_j^2 - x_1^2\). The fraction of the circle included in \(Q(x)\) is
- \(x_1^2 + x_2^2 > R_j^2\), \(-R_j < x_1 \le 0\), \(x_2 \ge 0\). 
 The fraction of the circle within \(Q(x)\) is
- \(x_1^2 + x_2^2 > R_j^2\), \(0 \le x_1 < R_j\), \(x_2 \ge R_j\). 
 The fraction of the circle within \(Q(x)\) is \(1 - q(x_1)\).
- \(x_1^2 + x_2^2 > R_j^2\), \(x_1 \ge 0\), \(-R_j < x_2 < 0\). 
 The fraction of the circle within \(Q(x)\) is \(q(x_2)\).
- \(x_1^2 + x_2^2 > R_j^2\), \(x_1 \ge R_j\), \(0 \le x_2 < R_j\). 
 The fraction of the circle within \(Q(x)\) is \(1-q(x_2)\).
- \(x_1^2 + x_2^2 > R_j^2\), \(0 \le x_1 < R_j\), \(0 \le x_2 < R_j\). 
 The fraction of the circle within \(Q(x)\) is \(1 - q(x_1) - q(x_2)\).
At which points \(x\) should we evaluate the discrepancy \( D(x) = |\hat{P}_n(Q(x)) - \tau \hat{P}_n (Q(x))|\)?
Let \(R = \max_j R_j\). Then for \(x_1, x_2 > R\), \(D(x) = 0\). Similarly, for \(x_1, x_2 < -R\), \(D(x) = 0\).
We might take \(x\) on a fine grid in the square \([-R, R] \times [-R, R]\), but this is wasteful.
Some thought shows that the maximum discrepancy occurs when some datum is just included in \(Q(x)\), which makes \(\hat{P}_n\) relatively large compared with \(\tau \hat{P}_n\), or when some datum is just excluded from \(Q(x)\), which makes \(\tau \hat{P}_n\) relatively large compared with \(\hat{P}_n\).
The possible points of maximum discrepancy are \(x\) of the form \((X_{1j}-s\epsilon, X_{2k}-s\epsilon)\) with \(1 \le j, k \le n\), \(s \in \{0, 1\}\), and \(\epsilon\) small, together with the points \((X_{1j}-s\epsilon, R)\) and \((R, X_{2j}-s\epsilon)\).
This is a large (\(2n^2 + 4n\)) but finite number of points. Denote this set by \(\mathcal{X} (\{X_j\}, \epsilon)\).
To draw an iid sample of size \(n\) from \(\tau \hat{P}_n\), we draw \(n\) values iid uniform on \(\{r_j\}_{j=1}^n\) and draw \(n\) iid \(U[0, 2\pi]\) random variables, and treat these as the polar coordinates \((r, \theta)\) of \(n\) points in \(\mathbb{R}^2\).
To test the null hypothesis of rotational invariance, we would compute
from the data \(X\), then repeatedly draw iid samples \(\{X_j^*\}\) of size \(n\) from \(\tau \hat{P}_n\), computing
for each.
We would reject the null hypothesis that \(P\) is rotationally invariant (at approximate significance level \(\alpha\)) if \(T(X)\) exceeds the \(1-\alpha\) quantile of the simulated distribution of \(T(X^*)\).
Under the null hypothesis, the distribution of the data is invariant under the action of the rotation group.
This is not a finite group, so we cannot exhaust the set of transformations on a computer.
However, we might consider the subgroup of rotations by multiples of \(2\pi/M\) for some large integer \(M\). We could get an alternative approximate level \(\alpha\) test of the hypothesis of rotational invariance by comparing \(T(X)\) with the \(1-\alpha\) quantile of \(T\) over all such rotations of the data—the orbit of the data under this finite subgroup.
Application: testing whether seismicity follows a spatially heterogeneous, temporally homogeneous Poisson Process¶
Reference: Luen and Stark, 2012. http://onlinelibrary.wiley.com/doi/10.1111/j.1365-246X.2012.05400.x/pdf
Poisson process: condition on number of events. Given the number of events, temporal distribution uniform, times exchangeable.
What’s the “projection” operation?
