Tests and Confidence Sets when there are Nuisance Parameters

Work in Progress!

Many statistical problems involve nuisance parameters, parameters that are not of scientific interest but that affect the probability distribution of the data.

Testing hypotheses and constructing confidence sets for some parameters when there are additional nuisance parameters can be complicated.

This chapter addresses two statistical inference problems involving nuisance parameters: tests and confidence bounds for a population mean from a stratified random sample, and tests and confidence bounds for the difference between the values of \(p\) for two independent binomial distributions.

One strategy that is widely applicable to testing hypotheses about some parameter when there are nuisance parameters is to find the maximum \(P\)-value over all possible values of the nuisance parameter, keeping the parameters of interest fixed.

Difference between two Binomial \(p\)s.

Suppose we observe \(X_1 \sim \mbox{Binom}(n_1, p_1)\) and \(X_2 \sim \mbox{Binom}(n_2, p_2)\) where \(X_1\) and \(X_2\) are independent. How can we make inferences about \(p_1-p_2\)?

We will develop several approaches.

Fisher’s exact test: a conditional test

Suppose we are only interested in the hypothesis \(p_1 = p_2\). One way to test that hypothesis is to condition on the value of \(X = X_1 + X_2\). If \(p_1 = p_2 = p\), then \(X_1 + X_2 \sim \mbox{Binom}(n_1+n_2, p)\). We can think of the data as arising from \(n_1+n_2\) iid Bernoulli\((p)\) trials. If that were the case, then, given the total, \(X\), any \(X\) of the \(n_1+n_2\) trials are equally likely to be the trials that resulted in “1” rather than “0.” (The trials are exchangeable: the joint distribution of the \(n_1+n_2\) trials is invariant under permutations of the labels.)

That is, conditional on \(X=x\), the distribution of \(X_1\) is hypergeometric with parameters \(N=n_1+n_2\), \(G=x\), and \(n=n_1\). We can use that to construct a conditional test of the hypothesis \(p_1=p_2\). As noted in the chapter on tests, if we always test conditionally at significance level not greater than \(\alpha\), whatever \(X\) happens to be, that yields a test that has unconditional level not greater than \(\alpha\).

Fisher’s exact test avoids the problem of estimating nuisance parameters by conditioning on \(X_1+X_2\). If \(p_1=p_2\) is big, \(X\) will tend to be large; if \(p_1=p_2\) is small, \(X\) will tend to be small. Regardless, the distribution of \(X_1\) given \(X_1+X_2\) is hypergeometric with known parameters.

While Fisher’s exact test provides a straightforward way to test the hypothesis \(p_1 = p_2\), it does not allow us to test the hypothesis \(p_1-p_2 = c \ne 0\) without additional assumptions, and thus does not allow us to make a confidence set for \(p_1-p_2\).
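To make the conditional test concrete, here is a minimal sketch using scipy’s hypergeometric distribution; the counts are made up for illustration, and the test is one-sided against the alternative \(p_1 > p_2\).

from scipy.stats import hypergeom

# one-sided Fisher exact test of p_1 = p_2 against p_1 > p_2 (illustrative counts)
n_1, n_2 = 30, 40      # numbers of Bernoulli trials
x_1, x_2 = 12, 8       # observed numbers of ones
x = x_1 + x_2          # conditioning statistic

# given X = x, X_1 is hypergeometric with N = n_1 + n_2, G = x, sample size n_1
p_value = hypergeom.sf(x_1 - 1, n_1 + n_2, x, n_1)   # Pr{X_1 >= x_1}
p_value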

Bonferroni and Šidák simultaneous tests and confidence intervals

One easy but inefficient way to test hypotheses or make confidence intervals for \(p_1-p_2\) is to make simultaneous inferences about \(p_1\) and \(p_2\). For instance, suppose that \(\mathcal{I}_1(\cdot)\) is a \(1-\alpha_1\) confidence interval procedure for \(p_1\) and \(\mathcal{I}_2(\cdot)\) is a \(1-\alpha_2\) confidence interval procedure for \(p_2\).

By Bonferroni’s inequality,

\[\begin{equation*} \mathbb{P}_{p_1, p_2} \{ (\mathcal{I}_1(X_1) \ni p_1) \cap (\mathcal{I}_2(X_2) \ni p_2) \} \ge 1 - \alpha_1 - \alpha_2. \end{equation*}\]

Because \(X_1\) and \(X_2\) are independent, we can get a sharper result using Šidák’s adjustment:

\[\begin{equation*} \mathbb{P}_{p_1, p_2} \{ (\mathcal{I}_1(X_1) \ni p_1) \cap (\mathcal{I}_2(X_2) \ni p_2) \} \ge (1 - \alpha_1)(1 - \alpha_2). \end{equation*}\]

To see that Šidák’s inequality is sharper (for independent intervals), calculate

\[\begin{equation*} (1 - \alpha_1)(1 - \alpha_2) = 1 - \alpha_1 - \alpha_2 + \alpha_1\alpha_2 \ge 1 - \alpha_1 - \alpha_2. \end{equation*}\]

Whenever \(\mathcal{I}_1(X_1) \ni p_1\) and \(\mathcal{I}_2(X_2) \ni p_2\),

\[\begin{equation*} p_1-p_2 \in \mathcal{I}_1(X_1) - \mathcal{I}_2(X_2) := \{x-y: x \in \mathcal{I}_1(X_1), y \in \mathcal{I}_2(X_2) \}, \end{equation*}\]

so

\[\begin{equation*} \mathbb{P}_{p_1, p_2} \{ \mathcal{I}_1(X_1) - \mathcal{I}_2(X_2) \ni p_1 - p_2 \} \ge (1 - \alpha_1)(1 - \alpha_2). \end{equation*}\]

That is, \(\mathcal{I}_1(X_1) - \mathcal{I}_2(X_2)\) is a \((1 - \alpha_1)(1 - \alpha_2)\) confidence interval for \(p_1-p_2\). Because of the duality between testing hypotheses and confidence intervals, we can reject the hypothesis that \(p_1-p_2 = c\) at significance level \(1- (1 - \alpha_1)(1 - \alpha_2)\) if \(c \notin \mathcal{I}_1(X_1) - \mathcal{I}_2(X_2)\). In particular, we can reject the hypothesis \(p_1=p_2\) at level \(1- (1 - \alpha_1)(1 - \alpha_2)\) if \(\mathcal{I}_1(X_1) - \mathcal{I}_2(X_2)\) does not contain 0.

As you can see, this method involves estimating \(p_1\) and \(p_2\) separately, even though we only care about \(p_1-p_2\). We can do a bit better than this.
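Here is a minimal sketch of the Šidák approach using binom_conf_interval from the permute package (imported in the code later in this chapter), with made-up data; each interval is two-sided at level \(\sqrt{0.95}\), so the joint level is \(0.95\).

from permute.utils import binom_conf_interval

n_1, x_1 = 30, 12       # illustrative data
n_2, x_2 = 40, 8
cl_each = 0.95 ** 0.5   # (1-alpha_1) = (1-alpha_2) = sqrt(0.95)
lo_1, up_1 = binom_conf_interval(n_1, x_1, cl=cl_each, alternative="two-sided")
lo_2, up_2 = binom_conf_interval(n_2, x_2, cl=cl_each, alternative="two-sided")
# interval difference I_1 - I_2: all values x - y with x in I_1, y in I_2
(lo_1 - up_2, up_1 - lo_2)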

Unconditional tests

We can also base tests on the full joint distribution of \(X_1\) and \(X_2\). [TBD]

The Santner-Snell Test

Santner and Snell (1980) give a procedure for finding an exact confidence interval for \(p_c-p_t\) and \(p_c/p_t\).

Alternative approach using potential outcomes: We have \(\{U_j\}_{j=1}^N\) iid \(U[0,1]\), independent of the \(\{T_j\}_{j=1}^N\). Let \(z_{cj} = 1_{U_j \le p_c}\) and \(z_{tj} = 1_{U_j \le p_t}\). Condition on \(\{U_j\}\).

[MORE TO COME]

Individual probabilities

A more reasonable model is that individual \(j\) has a “personal” probability \(p_{tj}\) of becoming infected if vaccinated and \(p_{cj}\) if unvaccinated.

[MORE TO COME]

Confidence bound for the mean of a binary population from a stratified sample from the population.

This problem occurs in opinion surveys, financial auditing, and election auditing. It will let us see some of the issues that arise when there are nuisance parameters.

We will use four powerful techniques:

  • the duality between hypothesis tests and confidence sets

  • maximizing \(P\)-values over nuisance parameters

  • union-intersection tests

  • combining independent \(P\)-values

It is also an example of a combinatorial optimization problem that can be solved quickly using a greedy algorithm, which is rare.

The problem

A population of \(N\) items of which \(G\) are labeled “1” and \(N-G\) are labeled “0” is partitioned into \(S\) strata. Stratum \(s\) contains \(N_s\) items, of which \(G_s\) are labeled “1.” Thus \(N = \sum_{s=1}^S N_s\) and \(G \equiv \sum_{s=1}^S G_s\). We draw a simple random sample of size \(n_s\) from stratum \(s\), \(s = 1, \ldots, S\), independently across strata. (I.e., from stratum \(s\) we draw a sample of size \(n_s\) in such a way that every subset of \(n_s\) distinct items of the \(N_s\) items is equally likely; and the \(S\) samples are drawn independently.)

Let \(x_{sj}\) denote the value of the \(j\)th item in stratum \(s\), \(s=1, \ldots, S\), \(j=1, \ldots, N_s\). Then \(G_s = \sum_{j=1}^{N_s} x_{sj}\) and

\[\begin{equation*} G = \sum_{s=1}^S \sum_{j=1}^{N_s} x_{sj}. \end{equation*}\]

Let \(Y_s\) denote the sum of the values in the sample from stratum \(s\). The variables \(\{Y_s \}_{s=1}^S\) are independent. The observed value of \(Y_s\) is \(y_s\).

We seek hypothesis tests and confidence bounds for \(G\). We first consider one-sided tests of the hypothesis \(G = g\) against the alternative \(G > g\), and corresponding lower confidence bounds for \(G\). Reversing the roles of “0” and “1” gives upper confidence bounds, mutatis mutandis.

The general strategy for testing the hypothesis \(G=g\) is to find the largest \(P\)-value among all ways of allocating \(g\) items labeled “1” among the \(S\) strata (honoring the stratum sizes \(\{N_s\}\)). That is a \(P\)-value for the composite hypothesis \(G=g\). The maximum can be found by examining all such allocations and calculating the \(P\)-value for each.

Maximizing the \(P\)-value over all allocations of \(G\) ones across \(S\) strata is combinatorially complex: Feller’s “bars and stars” argument shows that there are \(\binom{G+S-1}{S-1}\) ways to allocate \(G\) objects among \(S\) strata. (Some of those can be ruled out, for instance any allocation that assigns a stratum more ones than the stratum’s size.) For \(S=10\) strata of size \(N_s = 400\) and \(G = 300\), there are roughly \(6.3 \times 10^{16}\) allocations: impractical by any standard.

Aside: Feller’s Bars and Stars Argument

How many distinguishable ways are there to divide \(G\) indistinguishable objects into \(S\) groups? Feller (1950) argues as follows.

Consider \(G\) “stars” lined up in a row:

\[\begin{equation*} ********* \cdots ********* \end{equation*}\]

To divide them into \(S\) groups, we place \(S-1\) “bars” somewhere among them. For instance,

\[\begin{equation*} **|****|*** \cdots ******|*** \end{equation*}\]

corresponds to having 2 items in the first group, 4 in the second, …, and 3 in the \(S\)th group. Similarly,

\[\begin{equation*} |||||********* \cdots ********* \end{equation*}\]

corresponds to having all the items in the \(S\)th group. The number of ways of partitioning the \(G\) items into \(S\) groups is thus the number of ways of choosing \(S-1\) places to put bars among a total of \(G+S-1\) places (the bars and stars pooled together), i.e., \(\binom{G+S-1}{S-1}\).
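The count in the example above is easy to check; scipy’s comb with exact=True avoids floating-point overflow:

from scipy.special import comb

S, G = 10, 300                       # strata and ones in the example above
comb(G + S - 1, S - 1, exact=True)   # roughly 6.3e16 allocations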

We now examine some approaches to making confidence intervals for the population total \(G\) (equivalently, for the population mean \(\mu = G/N\)) from a stratified sample.

Wright’s Method: Sum of Šidák Intervals

One easy way to get a lower confidence bound for the sum is to take the sum of simultaneous lower confidence bounds for each stratum. Because the samples from different strata are independent, Šidák’s adjustment works. Wright (1991) suggests this approach.

A confidence bound for \(G_s\) can be constructed from \(Y_s\) by inverting hypergeometric tests.

To attain joint confidence level \(1-\alpha\), make each stratum’s confidence interval at confidence level \((1-\alpha)^{1/S}\).

The chance that all of the \(S\) confidence bounds cover their corresponding parameters is

\[\begin{equation*} \prod_{s=1}^S (1-\alpha)^{1/S} = 1-\alpha \end{equation*}\]

because the intervals are all based on independent data. This approach to making simultaneous confidence bounds for a set of parameters using independent data for each parameter is called Šidák’s method.

Wright’s method is an example of a much more general approach: make a joint \(1-\alpha\) confidence set for all the parameters \(\{G_s\}_{s=1}^S\), then find a lower bound on a functional of interest (here, their sum) over the joint set. Whenever the joint confidence set covers the parameter, the lower bound does not exceed the true value of the functional of the parameter. But because we do not care about the individual stratum sums, only the overall population sum, this approach is unnecessarily conservative: we “waste” effort constraining the individual stratum sums separately.

The Wendell-Schmee Test

Wendell and Schmee (1996, https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476950) proposed making inferences about the population total by maximizing a \(P\)-value over a set of nuisance parameters, the individual stratum totals. They find the \(P\)-value by ordering possible outcomes based on the estimated population proportion: the test statistic is the “tail probability” of \(\hat{p}\), the unbiased estimator of the population proportion from the stratified sample. They construct confidence bounds by inverting hypothesis tests.

The test statistic for the Wendell-Schmee test is the unbiased estimate of the population proportion, \(G/N\):

\[ \hat{p} = \frac{1}{N} \sum_{s=1}^S N_s y_s/n_s. \]

The \(P\)-value of the hypothesis \(G_s=g_s\), \(s=1, \ldots, S\), is the “lower tail probability” of \(\hat{p}\): the chance that \(\hat{p}\) would be no larger than its observed value if that allocation were the truth.

Wendell and Schmee consider maximizing this lower tail probability over all allocations of \(g\) ones across strata, either by exhaustive search, or by numerical optimization using a descent method from some number of random starting points. (They show graphically that the tail probability is not convex in the original parametrization.)

In practice, their method is computationally intractable when there are more than \(S=3\) strata.
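To make the test statistic concrete, here is a minimal sketch of the Wendell-Schmee lower tail probability for two small strata and one hypothesized allocation; the data are made up, and the general implementation (strat_p_ws) appears below.

import itertools
import numpy as np
from scipy.stats import hypergeom

N_s = [10, 20]     # stratum sizes (illustrative)
n_s = [4, 5]       # sample sizes
y_s = [1, 2]       # ones found in the samples
g_s = [3, 8]       # one hypothesized allocation of ones

p_hat = lambda y: sum(N*y_i/n for N, y_i, n in zip(N_s, y, n_s))/sum(N_s)
p_hat_0 = p_hat(y_s)    # observed value of the test statistic

# chance of an outcome with p_hat no larger than the observed p_hat
sum(np.prod([hypergeom.pmf(t[s], N_s[s], g_s[s], n_s[s]) for s in range(2)])
    for t in itertools.product(range(n_s[0]+1), range(n_s[1]+1))
    if p_hat(t) <= p_hat_0)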

Constructive maximization

For some test statistics, there is a much more efficient approach.

Define

\[ p_s(g_s) \equiv \Pr \{ Y_s \ge y_s \,\|\, G_s = g_s \} = \sum_{y = y_s}^{g_s} \frac{\binom{g_s}{y} \binom{N_s-g_s}{n_s - y}}{\binom{N_s}{n_s}}, \]

where \(\binom{a}{b} \equiv 0\) if \(b < 0\) or \(b > a\). (The double vertical bars denote “computed on the assumption that.”) This is the \(P\)-value of the hypothesis \(G_s = g_s\) tested against the alternative \(G_s > g_s\).

A test of the conjunction hypothesis \(G_s = g_s\), \(s=1, \ldots, S\), can be constructed using Fisher’s combining function: if all \(S\) hypotheses are true, the distribution of

\[ X^2(\vec{g}) \equiv -2 \sum_{s=1}^S \log p_s(g_s) \]

is _dominated_ by the chi-square distribution with \(2S\) degrees of freedom. (That is, the chance that \(X^2 \ge t\) is less than or equal to the chance that a random variable with a chi-square distribution with \(2S\) degrees of freedom is \(\ge t\), for all \(t\).)

Let \(\chi_d(z)\) denote the survival function of the chi-square distribution with \(d\) degrees of freedom, i.e., the chance that a random variable with the chi-square distribution with \(d\) degrees of freedom is greater than or equal to \(z\). Then a conservative \(P\)-value for the allocation \(\vec{g}\) is

\[ P(\vec{g}) = \chi_{2S}(X^2(\vec{g})). \]

The allocation \(\vec{g}\) of \(g\) ones across strata that maximizes the \(P\)-value is the allocation that minimizes \(X^2(\vec{g})\) subject to \(\sum_s g_s = g\). Equivalently, it is the allocation that maximizes \(\sum_{s=1}^S \log p_s(g_s)\).
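As a sketch, the stratum \(P\)-values and the combined \(P\)-value for a single allocation can be computed directly (illustrative data):

from scipy.stats import hypergeom, chi2

N_s = [10, 20, 30]   # stratum sizes (illustrative)
n_s = [4, 5, 6]      # sample sizes
y_s = [1, 2, 2]      # ones found in the samples
g_s = [3, 8, 10]     # one allocation of g = 21 hypothesized ones

# log p_s(g_s) = log Pr{Y_s >= y_s || G_s = g_s}
log_p = sum(hypergeom.logsf(y - 1, N, g, n)
            for N, n, y, g in zip(N_s, n_s, y_s, g_s))
chi2.sf(-2*log_p, df=2*len(N_s))   # conservative P-value for this allocation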

Let

\[ a_s(j) \equiv \left \{ \begin{array}{ll} \log p_s(y_s), & j = y_s \\ \log \left ( p_s(j)/p_s(j-1) \right ), & j = y_s+1, \ldots, N_s-(n_s-y_s). \end{array} \right . \]

Then \(\log p_s(g_s) = \sum_{j=y_s}^{g_s} a_s(j)\) if \(y_s \le g_s \le N_s-(n_s-y_s)\), and \(\log p_s(g_s) = -\infty\) otherwise. Moreover,

\[ X^2(\vec{g}) = -2\sum_{s=1}^S a_s(y_s) - 2\sum_{s=1}^S \sum_{j=y_s+1}^{g_s} a_s(j), \]

provided \(y_s \le g_s \le N_s-(n_s-y_s)\), \(s=1, \ldots, S\); otherwise, it is infinite.

An allocation of \(g\) ones across strata is inconsistent with the data unless \(g_s \ge y_s\), \(s=1, \ldots, S\). Thus, in considering how to allocate \(g\) ones to maximize the \(P\)-value, the first sum above, accounting for \(\sum_s y_s\) ones, is “mandatory,” or the \(P\)-value will be zero. The question is how to allocate the remaining \(g - \sum_s y_s\) ones to maximize the \(P\)-value (equivalently, to minimize \(X^2(\vec{g})\)).

Let \(b_k\) denote the \(k\)th largest element of the set

\[ \{a_s(j): j=y_s+1, \ldots, N_s-(n_s-y_s), \;\; s=1, \ldots, S \}, \]

with ties broken arbitrarily. Define \(\tilde{g}_y \equiv g - \sum_{s=1}^S y_s\).

Proposition. For every \(\vec{g}\) with \(\sum_s g_s = g\),

\[ X^2(\vec{g}) \ge X_*^2(g) \equiv \left \{ \begin{array}{ll} -2 \left ( \sum_{s=1}^S a_s(y_s) + \sum_{k=1}^{\tilde{g}_y} b_k \right ), & \sum_s y_s \le g \le N - \sum_s (n_s-y_s) \\ \infty, & \mbox{ otherwise. } \end{array} \right . \]

Proof. For any \(\vec{g}\) with \(X^2(\vec{g})\) finite, the double sum in \(X^2(\vec{g})\) comprises \(\tilde{g}_y\) elements of \(\{b_k\}\), and their sum is at most the sum of the \(\tilde{g}_y\) largest elements of \(\{b_k\}\); since \(X^2(\vec{g})\) is \(-2\) times the total, the bound follows. \(\Box\)

Moreover, if the \(\tilde{g}_y\) largest elements of \(\{b_k \}\) correspond to an allocation of \(\tilde{g}_y\) ones across the strata, the bound is sharp. That turns out to be true (the proof involves log-convexity of the distributions). The second sum then corresponds to a particular allocation \(\vec{g}\) of \(g\) ones across the \(S\) strata, with \(y_s \le g_s \le N_s-(n_s-y_s)\). Among all allocations of \(g\) items labeled “1,” that allocation has the largest tail probability, because it is the exponentiation of the largest sum of logs.

Proposition: For \(j \in \{y_s+1, \ldots, N_s-(n_s-y_s)\}\), \(a_s(j)\) is monotone decreasing in \(j\).
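The proposition can be checked numerically for a single stratum; up to floating-point rounding (which is why the implementation below guards against tiny reversals), the increments \(a_s(j)\) decrease:

import numpy as np
from scipy.stats import hypergeom

N_s, n_s, y_s = 50, 10, 3                        # illustrative stratum
js = np.arange(y_s, N_s - (n_s - y_s) + 1)       # y_s, ..., N_s - (n_s - y_s)
log_p = hypergeom.logsf(y_s - 1, N_s, js, n_s)   # log p_s(j) for each j
a = np.diff(log_p)                               # a_s(j), j = y_s + 1, ...
np.all(np.diff(a) <= 1e-12)                      # decreasing, up to rounding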

Let \(P(g) \equiv \max_{\vec{g}: \sum_s g_s = g} P(\vec{g})\).

Theorem: If \(\sum_s y_s \le g \le N - \sum_s (n_s-y_s)\),

\[ P(g) \le \chi_{2S}(X_*^2(g)). \]

Proof: Immediate from the proposition, since \(\chi_{2S}(\cdot)\) is decreasing and \(X^2(\vec{g}) \ge X_*^2(g)\) for every allocation \(\vec{g}\) with \(\sum_s g_s = g\).

The theorem shows that a “greedy” approach finds a conservative \(P\)-value:

  • Construct the values \(a_s(j)\) and the set \(\{b_k\}\).

  • Add the \(S\) values \(\{a_s(y_s)\}_{s=1}^S\) to the \(\tilde{g}_y\) largest elements of \(\{b_k\}\) and multiply the sum by \(-2\).

  • The upper tail probability of the chi-square distribution with \(2S\) degrees of freedom, evaluated at that value, is a conservative \(P\)-value for the hypothesis \(G=g\).

A conservative lower \(1-\alpha\) confidence bound for \(G\) can be found by inverting the test: it is the smallest \(g\) for which \(P(g) \ge \alpha\).

Changing the direction of the test

The test of the hypothesis \(G=g\) given above is a one-sided test against the alternative \(G > g\): it rejects if the chance of observing “so many” good objects is small.

To test against the alternative \(G < g\) (i.e., to reject if the chance of observing “so few” good objects is small), exchange the roles of “good” and “bad.” The hypothesis \(G < g\) is equivalent to the hypothesis \((N-G) > (N-g)\).

The resulting null hypothesis is that the number of “bad” items is \(N-g\), and the data are \(n_s - Y_s\).
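The identity behind this complementation is that finding at most \(y_s\) ones is the same event as finding at least \(n_s - y_s\) zeros; a quick numerical check with illustrative values:

from scipy.stats import hypergeom

N, g, n, y = 100, 37, 20, 5   # illustrative values
hypergeom.cdf(y, N, g, n), hypergeom.sf(n - y - 1, N, N - g, n)  # equal up to rounding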

We now code the combinatorial optimization and the greedy optimization.

# Install permute and cryptorandom in the current kernel, if not already installed
import sys
!{sys.executable} -m pip install permute --user
!{sys.executable} -m pip install cryptorandom --user
import itertools
import math
import warnings
from functools import lru_cache   # cache results of repeated calls

import numpy as np
import scipy as sp
import scipy.special
import scipy.stats
from scipy.stats import binom, hypergeom, chi2

from permute.utils import binom_conf_interval, hypergeom_conf_interval
@lru_cache(maxsize=None)  # cache the results of calls to the function;
                          # arguments must be hashable (tuples, not lists)
def strat_test_brute(strata, sams, found, good, alternative='upper'):
    """
    Find the P-value of the hypothesis that the number G of "good" objects in a 
    stratified population is less than or equal to good (alternative='lower') or 
    greater than or equal to good (alternative='upper'), using a stratified
    random sample.
    
    Assumes that a simple random sample of size sams[s] was drawn from stratum s, 
    which contains strata[s] objects in all.
    
    The P-value is the maximum Fisher combined P-value across strata
    over all allocations of good "good" objects among the strata. The allocations are
    enumerated using Feller's "bars and stars" construction, constrained to honor the
    stratum sizes (each stratum can contain no more "good" items than it has items in all,
    nor fewer "good" items than the sample contains).
    
    The number of allocations grows combinatorially: there can be as many as
    [(#good items + #strata - 1) choose (#strata - 1)] allocations, making the brute-force
    approach computationally infeasible when the number of strata and/or the number of
    good items is large.
    
    The test is a union-intersection test: the null hypothesis is the union over allocations
    of the intersection across strata of the hypotheses that the number of good items
    in the stratum is less than or equal to a constant.
    
    Parameters:
    -----------
    strata : list of ints
        sizes of the strata. One int per stratum.
    sams : list of ints
        the sample sizes from each stratum
    found : list of ints
        the numbers of "good" items found in the samples from the strata
    good : int
        the hypothesized total number of "good" objects in the population
    alternative : string {'lower', 'upper'}
        'lower' tests against the alternative that the true value is greater than 
        good and inverts to a lower confidence bound; 'upper' tests against the 
        alternative that the true value is less than good and inverts to an 
        upper confidence bound
    
    Returns:
    --------
    p : float
        maximum combined p-value over all ways of allocating good "good" objects
        among the strata, honoring the stratum sizes.        
    best_part : list
        the partition that attained the maximum p-value
    """
    if alternative == 'upper':   # use the lower tail probability Pr{Y_s <= y_s}
        p_func = lambda f, s, p, x: sp.stats.hypergeom.logcdf(f, s, p, x)
    elif alternative == 'lower': # use the upper tail probability Pr{Y_s >= y_s}
        p_func = lambda f, s, p, x: sp.stats.hypergeom.logsf(f-1, s, p, x)
    else:
        raise NotImplementedError("alternative {} not implemented".format(alternative))
    best_part = found # start with what you see
    strata = np.array(strata, dtype=int)
    sams = np.array(sams, dtype=int)
    found = np.array(found, dtype=int)
    good = int(good)
    if good < np.sum(found):     
        p = 0 if alternative == 'lower' else 1 
    elif good > np.sum(strata) - np.sum(sams) + np.sum(found):
        p = 1 if alternative == 'lower' else 0
    else:  # use Feller's "bars and stars" enumeration of combinations, constrained
        log_p = np.NINF   # initial value for the max
        n_strata = len(strata)
        parts = sp.special.binom(good+n_strata-1, n_strata-1)
        if parts >= 10**7:
            print("warning--large number of partitions: {}".format(parts))
        barsNstars = good + n_strata
        bars = [0]*n_strata + [barsNstars]
        partition = ([bars[j+1] - bars[j] - 1 for j in range(n_strata)] \
            for bars[1:-1] in itertools.combinations(range(1, barsNstars), n_strata-1) \
            if all(((bars[j+1] - bars[j] - 1 <= strata[j]) and \
                   (bars[j+1] - bars[j] >= found[j])) for j in range(n_strata)))
        for part in partition:
            log_p_new = 0 
            for s in range(n_strata): # should be possible to vectorize this
                log_p_new += p_func(found[s], strata[s], part[s], sams[s])
            if log_p_new > log_p:
                best_part = part
                log_p = log_p_new
        p = sp.stats.chi2.sf(-2*log_p, df = 2*len(strata))
    return p, best_part

@lru_cache(maxsize=None)  # cache the results of calls to the function;
                          # arguments must be hashable (tuples, not lists)
def strat_ci_bisect(strata, sams, found, alternative='lower', cl=0.95,
                  p_value=strat_test_brute):
    """
    Confidence bound on the number of ones in a stratified population,
    based on a stratified random sample (without replacement) from
    the population.
    
    
    If alternative=='lower', finds a lower confidence bound.
    If alternative=='upper', finds an upper confidence bound.

    Uses an integer bisection search to find an exact confidence bound.
    The starting upper endpoint for the search is the unbiased estimate
    of the number of ones in the population. That could be refined in various
    ways to improve efficiency.
    
    The lower endpoint for the search is the sum of the Šidák joint lower 
    confidence bounds, which should be smaller than the exact lower bound.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    alternative : string {'lower', 'upper'}
        if alternative=='lower', finds a lower confidence bound.
        if alternative=='upper', finds an upper confidence bound.
        The name gives the direction of the confidence bound, not the direction
        of the alternative hypothesis in the tests that are inverted.
    cl : float
        confidence level. Assumed to be at least 50%.
    p_value : callable
        method for computing the p-value
    
    Returns:
    --------
    b : int
        confidence bound
    best_part : list of ints
        partition that attains the confidence bound
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'upper':  # interchange good and bad
        compl = tuple(np.array(sams)-np.array(found))  # bad items found; tuple is hashable
        cb, best_part = strat_ci_bisect(strata, sams, compl, alternative='lower', 
                                       cl=cl, p_value=p_value)
        b = np.sum(strata) - cb    # good from bad
        best_part_b = np.array(strata, dtype=int)-np.array(best_part, dtype=int)
    else:
        cl_sidak = math.pow(cl, 1/len(strata))  # Šidák adjustment
        tail = 1-cl
        a = sum((hypergeom_conf_interval( \
                sams[s], found[s], strata[s], cl=cl_sidak, alternative="lower")[0] \
                for s in range(len(strata)))) # Šidák should give a lower bound
        b = int(np.sum(np.array(strata)*np.array(found)/np.array(sams)))-1 # expected good
        p_a, best_part_a = p_value(strata, sams, found, a, alternative=alternative)
        p_b, best_part_b = p_value(strata, sams, found, b, alternative=alternative)
        tot_found = np.sum(found)
        while p_a > tail and a > tot_found:
            a = math.floor(a/2)
            p_a, best_part_a = p_value(strata, sams, found, a, alternative=alternative)
        if p_a > tail:
            b = a
            best_part_b = best_part_a
        else:
            while b-a > 1:
                c = int((a+b)/2)
                p_c, best_part_c = p_value(strata, sams, found, 
                                           c, alternative=alternative)
                if p_c > tail:
                    b, p_b, best_part_b = c, p_c, best_part_c
                elif p_c < tail:
                    a, p_a, best_part_a = c, p_c, best_part_c
                elif p_c == tail:
                    b, p_b, best_part_b = c, p_c, best_part_c
                    break
    return b, list(best_part_b)
    
@lru_cache(maxsize=None)  # cache the results of calls to the function;
                          # arguments must be hashable (tuples, not lists)
def strat_test(strata, sams, found, good, alternative='lower'):
    """
    P-value for the hypothesis that the number of ones in a stratified population 
    is not greater than (alternative='lower') or not less than (alternative='upper') 
    a hypothesized value, based on a stratified random sample without replacement 
    from the population.
    
    Uses a fast algorithm to find (an upper bound on) the P-value constructively.
    
    Uses Fisher's combining function to combine stratum-level P-values.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    good : int
        hypothesized number of ones in the population
    alternative : string {'lower', 'upper'}
        'lower' tests against the alternative that the true number of "good" items 
        is greater than the hypothesized number, good, and inverts to a lower 
        confidence bound; 'upper' tests against the alternative that it is less 
        than good, and inverts to an upper confidence bound
    
    Returns:
    --------
    p : float
        P-value
    best_part : list
        the partition that attained the maximum p-value
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'upper':                 # exchange roles of "good" and "bad"
        compl = tuple(np.array(sams) - np.array(found)) # bad items found; tuple is hashable
        bad = int(np.sum(strata) - good)         # total bad items hypothesized
        res = strat_test(strata, sams, compl, bad, alternative='lower')
        return res[0], list(np.array(strata, dtype=int)-np.array(res[1], dtype=int))
    else:  
        good = int(good)
        best_part = []               # best partition
        if good < np.sum(found):     # impossible
            p = 0  
            best_part = found
        elif good >= np.sum(strata) - np.sum(sams) + np.sum(found): # guaranteed
            p = 1
            best_part = list(np.array(strata, dtype=int) - \
                             np.array(sams, dtype=int) + \
                             np.array(found, dtype=int))       
        elif good >= np.sum(found):  # outcome is possible under the null  
            log_p = 0                # log of joint probability
            contrib = []             # contributions to the log joint probability
            base = np.sum(found)     # must have at least this many good items
            for s in range(len(strata)):
                log_p_j = sp.stats.hypergeom.logsf(found[s]-1, strata[s], found[s], sams[s])
                                     # baseline p for minimum number of good items in stratum
                log_p += log_p_j     # log of the product of stratum-wise P-values
                for j in range(found[s]+1, strata[s]-(sams[s]-found[s])+1):
                    log_p_j1 = sp.stats.hypergeom.logsf(found[s]-1, strata[s], j, sams[s])
                                     # tail probability for j good in stratum
                    contrib.append([log_p_j1 - log_p_j, s])  
                                     # relative increase in P from new item
                    log_p_j = log_p_j1
            sorted_contrib = sorted(contrib, key = lambda x: x[0], reverse=True)
            best_part = np.array(found)
            for i in range(good-base):
                log_p += sorted_contrib[i][0]
                best_part[int(sorted_contrib[i][1])] += 1
            p = sp.stats.chi2.sf(-2*log_p, df = 2*len(strata))
    return p, list(best_part)

def strat_ci_search(strata, sams, found, alternative='lower', cl=0.95):
    """
    Confidence bound on the number of ones in a stratified population,
    based on a stratified random sample (without replacement) from
    the population.
        
    If alternative=='lower', finds a lower confidence bound.
    If alternative=='upper', finds an upper confidence bound.
    
    Searches for the confidence bound by decreasing the hypothesized number of 
    ones from the unbiased estimate until the P-value falls below 1-cl, then 
    increasing the number by one.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    alternative : string {'lower', 'upper'}
        if alternative=='lower', finds a lower confidence bound.
        if alternative=='upper', finds an upper confidence bound.
        The name gives the direction of the confidence bound, not the direction
        of the alternative hypothesis in the tests that are inverted.
    cl : float
        confidence level. Assumed to be at least 50%.
    
    Returns:
    --------
    cb : int
        confidence bound
    best_part : list of ints
        partition that attains the confidence bound (give or take one item)
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'upper':  # interchange good and bad
        compl = np.array(sams)-np.array(found)  # bad items found
        cb, best_part = strat_ci(strata, sams, compl, alternative='lower', cl=cl)
        cb = np.sum(strata) - cb    # good from bad
        best_part = np.array(strata, dtype=int)-np.array(best_part, dtype=int)
    else:
        cb = int(np.sum(np.array(strata)*np.array(found)/np.array(sams)))-1 # expected good
        p_attained, best_part = strat_test(strata, sams, found, cb, 
                                                 alternative=alternative)
        while p_attained >= 1-cl:
            cb -= 1
            p_attained, best_part = strat_test(strata, sams, found, cb, 
                                                     alternative=alternative)
        cb += 1
        p_attained, best_part = strat_test(strata, sams, found, cb, 
                                                 alternative=alternative)
    return cb, list(best_part)

def strat_ci(strata, sams, found, alternative='lower', cl=0.95):
    """
    Conservative confidence bound on the number of ones in a population,
    based on a stratified random sample (without replacement) from
    the population.
    
    If alternative=='lower', finds a lower confidence bound.
    If alternative=='upper', finds an upper confidence bound.
    
    Constructs the confidence bound directly, by finding the allocation of 
    the smallest number of ones that would not be rejected at (conservative) 
    significance level 1-cl.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    alternative : string {'lower', 'upper'}
        if alternative=='lower', finds a lower confidence bound.
        if alternative=='upper', finds an upper confidence bound.
        The name gives the direction of the confidence bound, not the direction
        of the alternative hypothesis in the tests that are inverted.
    cl : float
        confidence level
        
    Returns:
    --------
    cb : int
        confidence bound
    best_part : list of ints
        partition that attains the confidence bound (give or take one item)
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'upper':  # interchange role of good and bad
        compl = np.array(sams)-np.array(found)  # bad found
        cb, best_part = strat_ci(strata, sams, compl, alternative='lower', cl=cl)
        best_part = np.array(strata, dtype=int)-np.array(best_part, dtype=int)
        cb = np.sum(strata) - cb   # good to bad
    else:                
        threshold = -sp.stats.chi2.ppf(cl, df=2*len(strata))/2
        # g is in the set if 
        #   chi2.sf(-2*log(p), df=2*len(strata)) >= 1-cl
        #  i.e.,   -2*log(p) <=  chi2.ppf(cl, df)
        #  i.e.,   log(p) >= -chi2.ppf(cl, df)/2
        log_p = 0                # log of joint probability
        contrib = []             # contributions to the log joint probability
        base = np.sum(found)     # must have at least this many good items
        for s in range(len(strata)):
            log_p_j = sp.stats.hypergeom.logsf(found[s]-1, strata[s], found[s], sams[s])
                                 # baseline p for minimum number of good items in stratum
            log_p += log_p_j     # log of the product of stratum-wise P-values
            small = np.PINF      # for monotonicity check
            for j in range(found[s]+1, strata[s]-(sams[s]-found[s])+1):
                log_p_j1 = sp.stats.hypergeom.logsf(found[s]-1, strata[s], j, sams[s])
                log_p_j1 = log_p_j if log_p_j1 < log_p_j else log_p_j1 # true difference is nonnegative
                contrib.append([log_p_j1 - log_p_j, s]) 
                log_p_j = log_p_j1
                if contrib[-1][0] > small:
                    print("reversal in stratum {} for {} good; old: {} new:{}".format(s, 
                                    j, small, contrib[-1][0]))
                small = contrib[-1][0]
        sorted_contrib = sorted(contrib, key = lambda x: x[0], reverse=True)
        best_part = np.array(found)
        added = 0
        while log_p < threshold: 
            log_p += sorted_contrib[added][0]
            best_part[int(sorted_contrib[added][1])] += 1
            added += 1
        cb = base + added 
    return cb, list(best_part)

def strat_p(strata, sams, found, hypo, alternative='lower'):
    """
    Finds tail probability for the hypothesized population counts <hypo> for 
    simple random samples of sizes <sams> from strata of sizes <strata> if 
    <found> ones are found in the strata.
    
    Uses Fisher's combining function across strata.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes from the strata
    found : list of ints
        number of ones found in each stratum in each sample
    hypo : list of ints
        hypothesized number of ones in the strata
    alternative : string {'lower', 'upper'}
        'lower' uses the upper tail probabilities Pr{Y_s >= y_s}; 'upper' uses 
        the lower tail probabilities Pr{Y_s <= y_s}
    
    Returns:
    --------
    p : float
        appropriate tail probability
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'lower':
        p_func = lambda x, N, G, n: sp.stats.hypergeom.sf(x-1, N, G, n)
    else:
        p_func = lambda x, N, G, n: sp.stats.hypergeom.cdf(x, N, G, n)
    p = 1    
    for s in range(len(strata)):
        p *= p_func(found[s], strata[s], hypo[s], sams[s])
    return sp.stats.chi2.sf(-2*math.log(p), df=2*len(strata))

def strat_p_ws(strata, sams, found, hypo, alternative='upper'):
    """
    Finds Wendell-Schmee P-value for the hypothesized population counts <hypo> for 
    simple random samples of sizes <sams> from strata of sizes <strata> if 
    <found> ones are found in the strata.
        
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes from the strata
    found : list of ints
        number of ones found in each stratum in each sample
    hypo : list of ints
        hypothesized number of ones in the strata
    alternative : string {'lower', 'upper'}
        'upper' computes the lower tail probability of the estimate p-hat; 
        'lower' exchanges the roles of ones and zeros
    
    Returns:
    --------
    p : float
        tail probability
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'lower':                     # exchange roles of "good" and "bad"
        compl = np.array(sams) - np.array(found)   # bad items found 
        hypo_c = np.array(strata) - np.array(hypo) # total bad items hypothesized
        p = strat_p_ws(strata, sams, compl, hypo_c, alternative='upper')
    else:    
        p_hat = lambda f, st=strata, sa=sams: np.sum(np.array(st)*np.array(f)/np.array(sa))/np.sum(st) # pooled estimate
        p_hat_0 = p_hat(found)
        per_strat = np.array(strata)/np.array(sams)/np.sum(strata)
        strat_max = np.floor(p_hat_0/per_strat)
        lo_t = (t for t in itertools.product(*[range(int(s+1)) for s in strat_max]) \
                    if p_hat(t) <= p_hat_0)
        p = sum(np.prod(sp.stats.hypergeom.pmf(t, strata, hypo, sams)) \
                    for t in lo_t)
    return p

@lru_cache(maxsize=None)  # cache the results of calls to the function;
                          # arguments must be hashable (tuples, not lists)
def strat_test_ws(strata, sams, found, good, alternative='lower'):
    """
    Find the P-value of the hypothesis that the number G of "good" objects in a 
    stratified population is less than or equal to good (alternative='lower') or 
    greater than or equal to good (alternative='upper'), using a stratified
    random sample.
    
    Assumes that a simple random sample of size sams[s] was drawn from stratum s, 
    which contains strata[s] objects in all.
    
    The P-value is the maximum Wendell-Schmee P-value over all allocations of 
    good objects among the strata. The allocations are enumerated using Feller's 
    "bars and stars" construction, constrained to honor the stratum sizes (each 
    stratum can contain no more "good" items than it has items in all, nor fewer 
    "good" items than the sample contains).
    
    The number of allocations grows combinatorially: there can be as many as
    [(#good items + #strata - 1) choose (#strata - 1)] allocations, making the brute-force
    approach computationally infeasible when the number of strata and/or the number of
    good items is large.
    
    Parameters:
    -----------
    strata : list of ints
        sizes of the strata. One int per stratum.
    sams : list of ints
        the sample sizes from each stratum
    found : list of ints
        the numbers of "good" items found in the samples from the strata
    good : int
        the hypothesized total number of "good" objects in the population
    alternative : string {'lower', 'upper'}
        'lower' tests against the alternative that the true value is greater than 
        good and inverts to a lower confidence bound; 'upper' tests against the 
        alternative that the true value is less than good and inverts to an 
        upper confidence bound
    
    Returns:
    --------
    p : float
        maximum combined p-value over all ways of allocating good "good" objects
        among the strata, honoring the stratum sizes.        
    best_part : list
        the partition that attained the maximum p-value
    """
    assert alternative in ['lower', 'upper']
    if alternative == 'lower':                   # exchange roles of "good" and "bad"
        compl = tuple(np.array(sams) - np.array(found)) # bad items found; tuple is hashable
        bad = int(np.sum(strata) - good)         # total bad items hypothesized
        res = strat_test_ws(strata, sams, compl, bad, alternative='upper')
        return res[0], list(np.array(strata, dtype=int)-np.array(res[1], dtype=int))        
    best_part = found # start with what you see
    good = int(good)
    if good < np.sum(found):     
        p = 0 if alternative == 'lower' else 1 
    elif good > np.sum(strata) - np.sum(sams) + np.sum(found):
        p = 1 if alternative == 'lower' else 0
    else:  # use Feller's "bars and stars" enumeration of combinations, constrained
        p_hat = lambda f, st=strata, sa=sams: np.sum(np.array(st)*np.array(f)/np.array(sa))/np.sum(st) # pooled estimate
        p_hat_0 = p_hat(found)
        per_strat = np.array(strata)/np.array(sams)/np.sum(strata)
        strat_max = np.floor(p_hat_0/per_strat)
        p = 0   # initial value for the max
        n_strata = len(strata)
        parts = sp.special.binom(good+n_strata-1, n_strata-1)
        if parts >= 10**7:
            print("warning--large number of partitions: {}".format(parts))
        barsNstars = good + n_strata
        bars = [0]*n_strata + [barsNstars]
        partition = ([bars[j+1] - bars[j] - 1 for j in range(n_strata)] \
            for bars[1:-1] in itertools.combinations(range(1, barsNstars), n_strata-1) \
            if all(((bars[j+1] - bars[j] - 1 <= strata[j]) and \
                   (bars[j+1] - bars[j] >= found[j])) for j in range(n_strata)))
        for part in partition:
            lo_t = (t for t in itertools.product(*[range(int(s+1)) for s in strat_max]) \
                    if p_hat(t) <= p_hat_0)
            p_new = 0
            for t in lo_t:
                p_temp = 1
                for s in range(len(strata)):
                    p_temp *= sp.stats.hypergeom.pmf(t[s], strata[s], part[s], sams[s])
                p_new += p_temp
            if p_new > p:
                best_part = part
                p = p_new
    return p, list(best_part)

def strat_ci_wright(strata, sams, found, alternative='lower', cl=0.95):
    """
    Confidence bound on the number of ones in a stratified population,
    based on a stratified random sample (without replacement) from
    the population.
    
    If alternative=='lower', finds a lower confidence bound.
    If alternative=='upper', finds an upper confidence bound.
    
    Constructs the confidence bound by finding Šidák multiplicity-adjusted
    joint lower confidence bounds for the number of ones in each stratum.
    
    This approach is mentioned in Wright, 1991.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    alternative : string {'lower', 'upper'}
        if alternative=='lower', finds a lower confidence bound.
        if alternative=='upper', finds an upper confidence bound.
        The name gives the direction of the confidence bound, not the direction
        of the alternative hypothesis in the tests that are inverted.
    cl : float
        confidence level
        
    Returns:
    --------
    cb : int
        confidence bound
    """
    assert alternative in ['lower', 'upper']
    inx = 0 if alternative == 'lower' else 1
    cl_sidak = math.pow(cl, 1/len(strata))  # Šidák-adjusted confidence level per stratum
    cb = sum((hypergeom_conf_interval(
                sams[s], found[s], strata[s], cl=cl_sidak, alternative=alternative)[inx] 
                for s in range(len(strata))))
    return cb

Let’s check the empirical coverage of the greedy method.

# test empirical coverage
reps = int(10**3)
cl = 0.95
alternative = 'upper'
strata = [5000, 3000, 2000]
sams = [75,50,25]
good = [100, 100, 500]
g = np.sum(good)
cover = 0
verb = False
for i in range(reps):
    found = sp.stats.hypergeom.rvs(strata, good, sams)
    ub = strat_ci(strata, sams, found, alternative=alternative, cl=cl)
    if verb: 
        print("f: {} ub: {} best: {}".format(found, ub[0], ub[1]))
    cover = cover+1 if ub[0] >= g else cover
    if i % 100 == 0:
        print(i+1, cover, cover/(i+1))
cover/reps
1 1 1.0
101 101 1.0
201 200 0.9950248756218906
301 300 0.9966777408637874
401 400 0.9975062344139651
501 500 0.998003992015968
601 600 0.9983361064891847
701 700 0.9985734664764622
801 800 0.9987515605493134
901 899 0.9977802441731409
0.998

Let’s write some numerical tests to confirm that the three ways of computing bounds agree.

def test_strat_test(verbose = False):
    strata = [(10, 20, 30, 40), (20, 20, 20)]  # tuples: the cached functions need hashable arguments
    sams = [(2, 3, 4, 5), (5, 5, 10)]
    found = [(1, 2, 3, 4), (0, 3, 2)]
    for i in range(len(strata)):
        compl = tuple(np.array(sams[i]) - np.array(found[i]))
        for j in list(range(4, 20)) + [60]:
            for alternative in ['lower', 'upper']:
                if verbose:
                    print("i: {} j: {} alternative: {}".format(i, j, alternative))
                p_exact = strat_test_brute(strata[i], sams[i], found[i], j, alternative=alternative)
                p_exact_c = strat_test_brute(strata[i], sams[i], compl, j, alternative=alternative)
                p_lkp = strat_test(strata[i], sams[i], found[i], j, alternative=alternative)
                p_lkp_c = strat_test(strata[i], sams[i], compl, j, alternative=alternative)
                np.testing.assert_allclose(p_exact[0], p_lkp[0])
                np.testing.assert_allclose(p_exact_c[0], p_lkp_c[0])

def test_strat_ci(verbose = False):
    strata = [(10, 20), (10, 20, 30, 40), (10, 20, 20, 30)]  # tuples: hashable for caching
    sams = [(5, 5), (2, 3, 4, 5), (2, 4, 5, 6)]
    found = [(2, 2), (1, 2, 3, 4), (0, 1, 2, 3)]
    for i in range(len(strata)):
        for alternative in ['lower','upper']:
            brute = strat_ci_bisect(strata[i], sams[i], found[i], alternative=alternative)
            lkp = strat_ci(strata[i], sams[i], found[i], alternative=alternative)
            lkp_s = strat_ci_search(strata[i], sams[i], found[i], alternative=alternative)
            if verbose:
                print("{}-{}".format(i, alternative))
                print("i: {} brute: {} lkp: {} lkp_s: {} \nbest_brute: {} best_lkp: {} best_lkp_s: {}".format(
                   i, brute[0], lkp[0], lkp_s[0], brute[1], lkp[1], lkp_s[1]))
            np.testing.assert_allclose(brute[0], lkp[0])
            np.testing.assert_allclose(brute[0], lkp_s[0])
            np.testing.assert_allclose(brute[1], lkp[1])
            np.testing.assert_allclose(brute[1], lkp_s[1])
test_strat_test()
test_strat_ci()