Confidence Sets

This is a rough work in progress!

Types of Parameters

Many different kinds of things are called “parameters.” Here are several categories.

Population parameters

Any property of a population may be called a parameter. Examples include the population mean, percentiles, number of modes, etc.

If the population has more than one “value” per item, a parameter could involve more than one of them. E.g., if the population is a group of people each of whom has a height and a weight, then the population correlation between height and weight is a parameter.

Similarly, consider a group of individuals and the values of some quantity (the “response”) for each of those individuals without and with some intervention (notionally, a “treatment”). The difference between the average response without the intervention and the average response with the intervention is a parameter (the average treatment effect).

If we are sampling at random from a population, the probability distribution of the sample depends on the values in the population, and thus, in general, on the values of population parameters. (It will also depend on the sampling design.)

Functional parameters of probability distributions

Suppose \(X \sim \mathbb{P}\), where \(\mathbb{P}\) is a probability distribution on some space \(\mathcal{X}\) of possible outcomes. We assume that \(\mathbb{P} \in \mathcal{P}\), some known set of possible distributions.

A functional parameter \(\theta(\mathbb{P})\) is a function of \(\mathbb{P}\). For instance, the (population) mean is a functional parameter:

\[\begin{equation} \theta(\mathbb{P}) = \mathbb{E}X \equiv \int_\mathcal{X} x \, d\mathbb{P}(x). \end{equation}\]

So are other moments of the probability distribution:

\[\begin{equation*} \theta(\mathbb{P}) = \mathbb{E}X^n \equiv \int_\mathcal{X} x^n d\mathbb{P}(x), \;\; n=1, 2, \ldots . \end{equation*}\]

Other properties of \(\mathbb{P}\), such as percentiles of a univariate distribution, are also functional parameters. For instance, if \(X\) is a real-valued random variable, then the \(\alpha\) percentile of \(\mathbb{P}\),

\[\begin{equation} \theta(\mathbb{P}) = \inf \left \{x: \int_{-\infty}^x d\mathbb{P}(y) \ge \alpha \right \}, \end{equation}\]

is a functional parameter.

For multivariate distributions, correlations among the components of \(X\) are functional parameters.
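As a concrete illustration, here is a minimal sketch (assuming numpy; the data are made up) that treats the empirical distribution of a small sample as \(\mathbb{P}\) and evaluates several functional parameters of it:

import numpy as np

# made-up data: each "item" has a height (cm) and a weight (kg)
heights = np.array([160., 172., 181., 168., 175.])
weights = np.array([55., 70., 82., 61., 74.])

mean_height = np.mean(heights)              # theta(P) = E X, the mean
second_moment = np.mean(heights**2)         # theta(P) = E X^2, a higher moment
median_height = np.quantile(heights, 0.5)   # a percentile of P
corr = np.corrcoef(heights, weights)[0, 1]  # correlation between components of X
print(mean_height, second_moment, median_height, corr)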

In general, there can be distinct distributions \(\mathbb{P} \ne \mathbb{Q}\) for which \(\theta(\mathbb{P}) = \theta(\mathbb{Q})\). For instance, there are infinitely many normal distributions with the same mean (but different variances).

Parameters as indices for sets of distributions

Another use of the term “parameter” is as an abstract index that points to a particular distribution in a family of distributions. For instance, we might have a multiset of distributions \(\mathcal{P} = \{\mathbb{P}_\eta\}_{\eta \in \Theta}\). In that case, \(\eta\) is an index parameter. An index parameter is said to be identifiable if \(\mathbb{P}_\eta \ne \mathbb{P}_\nu\) whenever \(\eta, \nu \in \Theta\) and \(\eta \ne \nu\). That is, \(X\) contains enough information to identify the value of the parameter with arbitrarily high accuracy, given enough observations. Otherwise, the parameter is non-identifiable or unidentifiable: the data do not contain enough information to distinguish among different values of the parameter, no matter how many observations are made.

Special case: location-scale families

Many indexed families of distributions are related through the value of their parameter in a particular way. For instance, suppose that the outcome space \(\mathcal{X}\) is a real vector space, so it makes sense to add elements of \(\mathcal{X}\) and to multiply them by scalars.

If \(X \sim \mathbb{P}\), then for any \(x \in \mathcal{X}\) and \(a \in \Re \backslash \{0\}\), we could define \(\mathbb{P}_{x,a}\) to be the distribution of \(aX+x\). Then \(\{\mathbb{P}_{x, a} \}_{x \in \mathcal{X}, a \in \Re \backslash \{0\}}\) is a location-scale family with parameter \(\theta = (x, a)\). As \(x\) varies, the probability distribution “shifts” its location. As \(a\) varies, the probability distribution is “stretched” or re-scaled. The family of univariate normal distributions is a location-scale family with the two-dimensional parameter \(\theta = (\mu, \sigma)\), \(\mu \in \Re\), \(\sigma \in \Re \backslash \{0\}\).
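As a quick sketch (the seed and parameter values are arbitrary): if \(Z \sim N(0,1)\), shifting and scaling draws of \(Z\) generates draws from any member of the normal location-scale family.

import numpy as np

prng = np.random.default_rng(20230401)  # arbitrary seed
z = prng.standard_normal(size=10**5)    # draws from the "base" distribution

mu, sigma = 3.0, 2.0                    # the location x and scale a above
shifted = sigma*z + mu                  # draws from P_{x,a}, here N(mu, sigma^2)
print(shifted.mean(), shifted.std())    # approximately mu and |sigma|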

Notation for index parameters

To keep the notation for index parameters parallel with the notation for functional parameters, we will define \(\theta(\mathbb{P}) \equiv \{ \eta: \mathbb{P} = \mathbb{P}_\eta\}\). If \(\theta\) is identifiable, \(\theta(\mathbb{P})\) is a singleton set; otherwise, it may contain more than one element.

Parametric families of distributions

A parametric family of distributions is an indexed collection of probability distributions that depends on the index parameter (which might be multidimensional) in a fixed functional way. (We can think of things like the mean and standard deviation of a normal distribution as either a multidimensional parameter or as a collection of parameters.)

Most distributions that have names are parametric families, e.g., Bernoulli (the parameter \(p\)), Binomial (the two-dimensional parameter \((n, p)\)), Geometric (\(p\)), Hypergeometric (the three-dimensional parameter \((N, G, n)\)), Negative Binomial \((p, k)\), Normal \((\mu, \sigma)\), Student’s \(T\) \((\mu, \sigma, \nu)\), continuous uniform (the endpoints of the interval of support, the two-dimensional parameter \((a, b)\)), and so on.
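In scipy.stats, for instance, each of these named families is represented by a distribution object; fixing the index parameter picks out a single member of the family. A minimal sketch (the variable names are arbitrary):

from scipy import stats

bern = stats.bernoulli(p=0.3)           # one member of the Bernoulli family
binom_np = stats.binom(n=10, p=0.3)     # one member of the Binomial family
norm_ms = stats.norm(loc=0, scale=1)    # one member of the Normal family
print(bern.mean(), binom_np.pmf(3), norm_ms.cdf(1.96))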

Nuisance parameters

When the probability distribution of the data depends on a multi-dimensional parameter but only some components of that parameter are of interest, the other components are called nuisance parameters. For instance, in estimating the mean of a normal distribution, the variance of the distribution is a nuisance parameter: we don’t care what it is, but it affects the probability distribution of the data.

Similarly, in estimating a population mean from a stratified sample, the means within the different strata are nuisance parameters.

Abstract parameters

For most of the theory in this chapter, \(\theta\) will be an abstract parameter: the development applies to functional parameters, index parameters, parameters of parametric families, etc.

Confidence sets

What can we learn about the value of \(\theta(\mathbb{P})\) from observations? The chapter on testing discusses testing hypotheses, including hypotheses about parameters. Here we explore a different approach to quantifying what a sample tells us about \(\theta\): confidence sets. The treatment will be abstract but informal. (For instance, we shall ignore measurability issues.)

In an abuse of notation, we will let \(\theta\) denote both the value of a parameter, and the mapping from a distribution to the value of the parameter for that distribution, as if \(\theta\) were a functional parameter even if it is an index parameter (or some other kind of parameter). Thus, \(\theta: \mathcal{P} \rightarrow \Theta\), \(\mathbb{P} \mapsto \theta(\mathbb{P})\). If \(\mathbb{P} = \mathbb{P}_\eta\), then \(\theta(\mathbb{P}) = \eta\). The set \(\Theta\) will denote the possible values of \(\theta\). Lowercase Greek letters such as \(\eta\) will denote generic elements of \(\Theta\).

We shall observe \(X \sim \mathbb{P}\), where \(X\) takes values in the outcome space \(\mathcal{X}\). We do not know \(\mathbb{P}\), but we know that \(\mathbb{P} \in \mathcal{P}\), a known set of distributions. Let \(\mathcal{I}(\cdot)\) be a set-valued function that assigns a subset of \(\Theta\) to each possible observation \(x \in \mathcal{X}\). For instance, we might observe \(X \sim N(\theta, 1)\), and \(\mathcal{I}(x)\) might be \([x-c, x+c]\).

Fix \(\alpha \in (0, 1)\). Suppose that for all \(\eta \in \Theta\), if \(\theta(\mathbb{P}) = \eta\) then

\[\begin{equation} \mathbb{P} \{\mathcal{I}(X) \ni \eta \} \ge 1-\alpha. \end{equation}\]

Then \(\mathcal{I}(\cdot)\) is a \(1-\alpha\) confidence set procedure for \(\theta(\mathbb{P})\). It maps outcomes to sets in such a way that the chance is at least \(1-\alpha\) that the resulting set will contain the true value of the parameter \(\theta(\mathbb{P})\).

If we observe \(X=x\), \(\mathcal{I}(x)\) is a \(1-\alpha\) confidence set for \(\theta\). The confidence level of the set is \(1-\alpha\).

When \(\mathcal{I}(x) \ni \theta\), we say that the confidence set covers \(\theta\). The coverage probability of the confidence set procedure \(\mathcal{I}\) is

\[\begin{equation*} \inf_{\eta \in \Theta} \inf_{\mathbb{P} \in \mathcal{P} : \theta(\mathbb{P}) = \eta} \mathbb{P} \{\mathcal{I}(X) \ni \eta \}. \end{equation*}\]

Before the data \(X\) are observed, the chance that \(\mathcal{I}(X)\) will contain \(\theta\) is at least the coverage probability, which is at least \(1-\alpha\). After the data \(X=x\) are observed, the set \(\mathcal{I}(x)\) either does or does not contain \(\theta\): there is nothing random anymore.

A confidence interval is a special case of a confidence set, when the set is an interval of real numbers. One-sided confidence intervals are the special case that the confidence set is a semi-infinite interval, i.e., a set of real numbers of the form \((-\infty, c]\) or \([c, \infty)\).

Example: confidence interval for a Normal mean

Suppose \(X \sim N(\theta, 1)\): \(\theta\) is an index parameter for the (parametric) family of unit variance normal distributions (\(\mathcal{P} \equiv \{N(\eta, 1)\}_{\eta \in \Re}\)) and also a functional parameter, since \(\mathbb{E} X = \theta\).

Define \(\mathcal{I}(x) \equiv [x - z_{1-\alpha/2}, x + z_{1-\alpha/2}]\), where \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) percentile of the standard normal distribution. Then

\[\begin{equation} \mathbb{P}_\theta \{ \mathcal{I}(X) \ni \theta \} = 1-\alpha \end{equation}\]

whatever be \(\theta \in \Re\). Thus \([x - z_{1-\alpha/2}, x + z_{1-\alpha/2}]\) is a \(1-\alpha\) confidence interval for \(\theta\).

Why is the coverage probability of \([X - z_{1-\alpha/2}, X + z_{1-\alpha/2}]\) equal to \(1-\alpha\)?

The distribution of \(X-\theta\) is a standard normal (\(N(0,1)\)), so

\[\begin{equation*} \mathbb{P}_\theta \{|X-\theta| \le z_{1-\alpha/2} \} = 1-\alpha. \end{equation*}\]

But whenever \(|X-\theta| \le z_{1-\alpha/2}\), the interval \([X - z_{1-\alpha/2}, X + z_{1-\alpha/2}]\) contains \(\theta\).
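A quick simulation (a sketch; the seed and the value of \(\theta\) are arbitrary) confirms the coverage empirically:

import numpy as np
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha/2)                # z_{1-alpha/2}
theta = 7.0                              # true mean (arbitrary)
prng = np.random.default_rng(12345)      # arbitrary seed
xs = prng.normal(loc=theta, scale=1, size=10**5)
covered = np.mean((xs - z <= theta) & (theta <= xs + z))
print(covered)                           # should be close to 1-alpha = 0.95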

Duality between hypothesis tests and confidence sets

One of the most versatile ways of constructing confidence sets is to invert hypothesis tests.

Suppose we have a (possibly randomized) family of significance level \(\alpha\) hypothesis tests for all possible values of a parameter \(\theta \in \Theta\). That is, for each \(\eta \in \Theta\), we have a test function (aka critical function) \(\phi_\eta : \mathcal{X} \rightarrow [0, 1]\) such that if \(\theta(\mathbb{P}) = \eta\),

\[\begin{equation*} \mathbb{E}_{\mathbb{P}} \phi_\eta(X) \ge 1-\alpha. \end{equation*}\]

The test function \(\phi_\eta(x)\) is the probability of not rejecting the hypothesis \(\theta(\mathbb{P}) = \eta\) when \(X=x\). When \(\phi_\eta(X) = 0\), we certainly reject the null; when \(\phi_\eta(X)=1\), we certainly do not reject the null; values between 0 and 1 correspond to rejecting the null hypothesis with probability \(1-\phi_\eta(X)\). The test involves both \(X\) and a uniformly distributed random variable \(U \sim U[0,1]\) independent of \(X\): the test rejects if \(U \ge \phi_\eta(X)\). See the chapter on hypothesis tests.

Consider the set

\[\begin{equation*} \mathcal{I}(X,U) \equiv \{ \eta \in \Theta : \phi_\eta(X) > U \}. \end{equation*}\]

That is, \(\mathcal{I}\) is the set of possible parameters \(\eta \in \Theta\) for which the corresponding test \(\phi_\eta\) does not reject the hypothesis that \(\theta(\mathbb{P}) = \eta\), for the observed values of \(X\) and \(U\). (If \(\phi\) can only take the values 0 and 1, i.e., if the test is not randomized, then \(\mathcal{I}\) does not depend on \(U\).)

Claim: \(\mathcal{I}(X,U)\) is a \(1-\alpha\) confidence procedure. That is, whatever the true value of \(\theta(\mathbb{P})\) happens to be,

\[\begin{equation} \mathbb{P} \{ \mathcal{I}(X,U) \ni \theta(\mathbb{P}) \} \ge 1-\alpha. \end{equation}\]

Proof: The set \(\mathcal{I}(X,U)\) contains \(\theta(\mathbb{P})\) whenever the test of the null hypothesis that \(\theta(\mathbb{P}) = \theta\) does not reject, that is, whenever \(\phi_\theta(X) > U\). But when \(\theta(\mathbb{P}) = \theta\),

\[\begin{equation*} \mathbb{P} \{\phi_\theta(X) > U\} = \mathbb{E}_{\mathbb{P}} \phi_\theta(X) \ge 1-\alpha. \end{equation*}\]
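For a one-dimensional parameter, the inversion can be carried out numerically by evaluating the tests on a grid. A minimal sketch (the helper invert_tests, the grid, and the example test are illustrative assumptions, not part of the formal development):

import numpy as np
from scipy.stats import norm

def invert_tests(x, u, phi, grid):
    '''
    confidence set for theta obtained by inverting a family of tests

    phi(eta, x) is the probability of NOT rejecting the hypothesis
    theta(P) = eta when X = x; the set collects every eta in the grid
    whose test does not reject for the observed x and u
    '''
    return np.array([eta for eta in grid if phi(eta, x) > u])

# example: invert non-randomized two-sided z-tests for a normal mean
alpha = 0.05
phi = lambda eta, x: 1.0 if abs(x - eta) <= norm.ppf(1 - alpha/2) else 0.0
conf_set = invert_tests(x=1.2, u=0.0, phi=phi, grid=np.linspace(-5, 5, 2001))
print(conf_set.min(), conf_set.max())  # endpoints near 1.2 - 1.96 and 1.2 + 1.96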

In general, there are many ways to construct a set of tests \(\{\phi_\eta\}_{\eta \in \Theta}\) with significance level \(\alpha\). Inverting different tests will lead to confidence sets with different properties. As a simple example, inverting 1-sided tests for a real parameter will give a 1-sided confidence interval for the parameter, while inverting 2-sided tests will yield a 2-sided confidence interval.

It is often possible to design confidence sets that have desirable characteristics, such as avoiding including zero to the extent possible (so that they determine the sign of the parameter) or being on average as small as possible, by inverting suitably chosen tests. See, e.g., Benjamini, Hochberg, and Stark (1996); Benjamini and Stark (1996); Evans, Hansen, and Stark (2005); and Benjamini, Hechtlinger, and Stark (2019).

Example: confidence interval for Binomial \(p\) (known \(n\))

Suppose we will observe \(X \sim \mbox{Binom}(n, p)\), where \(n\) is known but \(p\) is not. We seek a one-sided lower confidence interval for \(p\), that is, a set of the form \([f(X,U), \infty)\) such that for all \(q \in [0, 1]\), if \(X \sim \mbox{Binom}(n, q)\) and \(U\) is an independent uniform random variable, then

\[\begin{equation*} \mathbb{P} \{ [f(X,U), \infty) \ni q \} \ge 1-\alpha. \end{equation*}\]

Since it is certain that \(p \le 1\), the upper endpoint of the interval can be reduced from \(\infty\) to 1 without sacrificing coverage probability. That is, the same \(f\) will satisfy

\[\begin{equation*} \mathbb{P} \{ [f(X,U), 1] \ni q \} \ge 1-\alpha. \end{equation*}\]

To make such a lower confidence bound, we can invert one-sided hypothesis tests that reject when \(X\) is “too big.” That is, we want a family of tests of the hypotheses \(p = q\) for all \(q \in [0, 1]\) that reject for large values of \(X\). Such tests give evidence that \(p\) is at least a given size.

To keep things simple, we will use conservative non-randomized tests rather than exact randomized tests. Because we are basing the confidence intervals on conservative tests, we expect the coverage probability to be greater than \(1-\alpha\). We reject the hypothesis \(p = q\) if, on the assumption that \(p=q\), the chance that \(X\) would be greater than or equal to its observed value is not greater than \(\alpha\). That is,

\[\begin{equation*} \phi_q(x) = \left \{ \begin{array}{ll} 1, & \sum_{k=x}^n \binom{n}{k}q^k(1-q)^{n-k} \ge \alpha \\ 0, & \mbox{otherwise.} \end{array} \right . \end{equation*}\]

The lower endpoint of the one-sided confidence interval is the smallest value of \(q\) for which the corresponding test does not reject:

\[\begin{equation*} f(x) \equiv \min \left \{q \in [0, 1]: \sum_{k=x}^n \binom{n}{k}q^k(1-q)^{n-k} \ge \alpha \right \}. \end{equation*}\]

Note that the upper tail probability, \(\sum_{k=x}^n \binom{n}{k}q^k(1-q)^{n-k}\), increases continuously and monotonically as \(q\) increases, so finding where it crosses \(\alpha\) is a straightforward root-finding problem.

Let’s code this up in Python.

import numpy as np
from scipy.optimize import brentq  # Brent's root-finding algorithm
from scipy.stats import binom      # the Binomial distribution
def binom_lower_ci(n, x, cl=0.95):
    '''
    lower confidence bound for a binomial p
    
    Assumes x is a draw from a binomial distribution with parameters
    n (known) and p (unknown). Finds a lower confidence bound for p 
    at confidence level cl.
    
    Parameters
    ----------
    n : int
        number of trials, nonnegative integer
    x : int
        observed number of successes, nonnegative integer not larger than n
    cl : float
        confidence level, between 0 and 1
        
    Returns
    -------
    lb : float
        lower confidence bound
    '''
    assert 0 <= x <= n, 'impossible arguments'
    assert 0 < cl < 1, 'silly confidence level'
    lb = 0
    if x > 0:
        lb = brentq(lambda q: binom.sf(x-1,n,q)-(1-cl), 0, 1)
    return lb
# some simulations to test the confidence intervals
reps = int(10**4)
for n in [10, 50, 100]:
    print(f'\nn: {n}')
    for p in [0.01, 0.05, 0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 0.95, 0.99]:
        cover = 0
        xs = binom.rvs(n, p, size=reps)
        for x in xs:
            cover += (1 if binom_lower_ci(n,x) <= p
                 else 0)
        print(f'p: {p:.2f} covered: {100*cover/reps : .2f}%')   
n: 10
p: 0.01 covered:  99.60%
p: 0.05 covered:  98.91%
p: 0.10 covered:  98.69%
p: 0.20 covered:  97.02%
p: 0.40 covered:  98.81%
p: 0.50 covered:  99.18%
p: 0.70 covered:  97.18%
p: 0.90 covered:  100.00%
p: 0.95 covered:  100.00%
p: 0.99 covered:  100.00%

n: 50
p: 0.01 covered:  98.54%
p: 0.05 covered:  96.44%
p: 0.10 covered:  97.55%
p: 0.20 covered:  97.01%
p: 0.40 covered:  96.78%
p: 0.50 covered:  96.55%
p: 0.70 covered:  95.63%
p: 0.90 covered:  96.69%
p: 0.95 covered:  100.00%
p: 0.99 covered:  100.00%

n: 100
p: 0.01 covered:  98.27%
p: 0.05 covered:  97.24%
p: 0.10 covered:  96.04%
p: 0.20 covered:  96.50%
p: 0.40 covered:  95.58%
p: 0.50 covered:  95.93%
p: 0.70 covered:  95.42%
p: 0.90 covered:  97.57%
p: 0.95 covered:  96.35%
p: 0.99 covered:  100.00%

Think about how you might make a 2-sided confidence interval for \(p\) instead of a lower 1-sided confidence interval. There are countless ways of constructing acceptance regions for the underlying tests, as mentioned in the testing chapter. For instance, we could trim the same probability \(\alpha/2\) from both tails, or use the acceptance region that contains the fewest outcomes. The latter choice in general will lead to shorter confidence intervals. This approach is due to Sterne (1954).

Let’s implement Sterne’s approach now. The development is analogous to the randomized hypergeometric test in the chapter on testing.

from functools import lru_cache

@lru_cache(maxsize=None)  # decorate the function to cache the results 
                          # of calls to the function
def binom_accept(n, p, alpha=0.05, randomized=False):
    '''
    Acceptance region for a randomized binomial test
    
    If randomized==True, find the acceptance region for a randomized, exact 
    level-alpha test of the null hypothesis X~Binomial(n,p). 
    The acceptance region is the smallest possible. (And not, for instance, symmetric.)

    If randomized==False, find the smallest conservative acceptance region.

    Parameters
    ----------
    n : integer
        number of independent trials
    p : float
        probability of success in each trial
    alpha : float
        desired significance level  
    randomized : Boolean
        return randomized exact test or conservative non-randomized test?
  
    Returns
    --------
    If randomized:
    I : list
        values for which the test never rejects
    J : list 
        values for which the test sometimes rejects
    gamma : float
        probability the test does not reject when the value is in J
    
    If not randomized:
    I : list
        values for which the test does not reject
    
    '''
    assert 0 < alpha < 1, "bad significance level"
    x = np.arange(0, n+1)
    I = list(x)                    # start with all possible outcomes (then remove some)
    pmf = binom.pmf(x,n,p)         # binomial pmf at each possible outcome
    bottom = 0                     # smallest outcome still in I
    top = n                        # largest outcome still in I
    J = []                         # outcomes for which the test is randomized
    p_J = 0                        # probability of outcomes for which test is randomized
    p_tail = 0                     # probability of outcomes excluded from I
    while p_tail < alpha:          # need to remove outcomes from the acceptance region
        pb = pmf[bottom]
        pt = pmf[top]
        if pb < pt:                # the smaller possibility has smaller probability
            J = [bottom]
            p_J = pb
            bottom += 1
        elif pb > pt:              # the larger possibility has smaller probability
            J = [top]
            p_J = pt
            top -= 1
        else:                      
            if bottom < top:       # the two possibilities have equal probability
                J = [bottom, top]
                p_J = pb+pt
                bottom += 1
                top -= 1
            else:                  # there is only one possibility left
                J = [bottom]
                p_J = pb
                bottom +=1
        p_tail += p_J
        for j in J:                # remove outcomes from acceptance region
            I.remove(j)
    return_val = None
    if randomized:
        gamma = (p_tail-alpha)/p_J     # probability of accepting H_0 when X in J 
                                       # to get exact level alpha
        return_val = I, J, gamma
    else:
        while p_tail > alpha:
            j = J.pop()            # move the outcome into the acceptance region
            p_tail -= pmf[j]
            I.append(j)
        return_val = I
    return return_val 


def binom_ci(n, x, cl=0.95, eps=10**-3):
    '''
    two-sided confidence bound for a binomial p
    
    Assumes x is a draw from a binomial distribution with parameters
    n (known) and p (unknown). Finds a confidence interval for p 
    at confidence level cl by inverting conservative tests
    
    Parameters
    ----------
    n : int
        number of trials, nonnegative integer
    x : int
        observed number of successes, nonnegative integer not larger than n
    cl : float
        confidence level, between 0 and 1
    eps : float in (0, 1)
        resolution of the grid search
        
    Returns
    -------
    lb : float
        lower confidence bound
    ub : float
        upper confidence bound
    '''
    assert 0 <= x <= n, 'impossible arguments'
    assert 0 < cl < 1, 'silly confidence level'
    lb = 0
    ub = 1
    alpha = 1-cl
    if x > 0:
        while x not in binom_accept(n, lb, alpha, randomized=False):
            lb += eps
        lb -= eps
    if x < n:
        while x not in binom_accept(n, ub, alpha, randomized=False):
            ub -= eps
        ub += eps
    return lb, ub
cl = 0.95
eps = 10**-4
for n in [10, 50, 100]:
    print(f'\nn: {n}')
    for p in [0.01, 0.05, 0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 0.95, 0.99]:
        cover = 0
        mean_width = 0
        xs = binom.rvs(n, p, size=reps)
        for x in xs:
            bounds = binom_ci(n, x, cl=cl, eps=eps)
            cover += (1 if (bounds[0] <= p <= bounds[1])             
                      else 0)
            mean_width += bounds[1]-bounds[0]
        mean_width /= reps
        print(f'p: {p:.2f} covered: {100*cover/reps:.2f}% mean width: {mean_width:.2f}')
n: 10
p: 0.01 covered: 99.65% mean width: 0.31
p: 0.05 covered: 98.84% mean width: 0.36
p: 0.10 covered: 98.86% mean width: 0.41
p: 0.20 covered: 96.80% mean width: 0.48
p: 0.40 covered: 98.23% mean width: 0.54
p: 0.50 covered: 97.95% mean width: 0.55
p: 0.70 covered: 96.03% mean width: 0.52
p: 0.90 covered: 98.82% mean width: 0.41
p: 0.95 covered: 98.94% mean width: 0.36
p: 0.99 covered: 99.55% mean width: 0.31

n: 50
p: 0.01 covered: 98.69% mean width: 0.09
p: 0.05 covered: 95.98% mean width: 0.13
p: 0.10 covered: 96.95% mean width: 0.17
p: 0.20 covered: 94.79% mean width: 0.22
p: 0.40 covered: 95.42% mean width: 0.27
p: 0.50 covered: 96.89% mean width: 0.27
p: 0.70 covered: 96.00% mean width: 0.25
p: 0.90 covered: 97.19% mean width: 0.17
p: 0.95 covered: 96.08% mean width: 0.14
p: 0.99 covered: 98.73% mean width: 0.09

n: 100
p: 0.01 covered: 98.07% mean width: 0.05
p: 0.05 covered: 96.73% mean width: 0.09
p: 0.10 covered: 95.53% mean width: 0.12
p: 0.20 covered: 95.56% mean width: 0.16
p: 0.40 covered: 96.06% mean width: 0.19
p: 0.50 covered: 96.62% mean width: 0.20
p: 0.70 covered: 95.43% mean width: 0.18
p: 0.90 covered: 95.51% mean width: 0.12
p: 0.95 covered: 96.53% mean width: 0.09
p: 0.99 covered: 98.22% mean width: 0.05

Bootstrap approximate confidence intervals

The bootstrap approximates sampling from a population by sampling with replacement from a sample from the population. That is, it approximates the population distribution by the empirical distribution of the observed sample, which is the nonparametric maximum likelihood estimate of the population distribution.

The bootstrap is commonly used to estimate the variability of an estimator of a functional parameter. It generally does a good job of estimating things like the variance of an estimator.

It can also be used to construct approximate confidence intervals in a variety of ways, including the percentile method, which approximates percentiles of an estimator from percentiles of the resampling distribution of the estimator. However, this generally is not a good approximation.

Let’s code it up and see.

def boot_ci(data, estimator, cl=0.95, reps=int(10**4), random_state=None):
    """
    Bootstrap percentile approximate confidence interval for a functional parameter
    
    Parameters
    ----------
    data : array-like
        original sample
    estimator : callable
        function defined on the empirical distribution of a sample.
        Applying it to the population distribution would give the
        true value of the parameter of interest. Applying it to a 
        sample yields an estimator of the parameter of interest
    cl : float in (0,1)
        confidence level
    reps : int, nonnegative
        number of bootstrap samples to use
    random_state : np.random.RandomState or None
        source of pseudo-randomness; if None, a fresh RandomState is created
        
    Returns
    -------
    lb : float
        estimated lower confidence bound
    ub : float
        estimated upper confidence bound
    """
    if random_state is None:
        prng = np.random.RandomState()
    else:
        prng = random_state
    n = len(data)
    estimates = []
    for j in range(reps):
        estimates.append(estimator(prng.choice(data, size=n, replace=True)))
    estimates = np.array(estimates)
    tail = (1-cl)/2
    lower = np.quantile(estimates, tail)
    upper = np.quantile(estimates, 1-tail)
    return (lower, upper)
reps = int(10**3)
reps_boot = int(10**3)
seed = 12345678
prng = np.random.RandomState(seed)
for n in [10, 50, 100]:
    print(f'\nn: {n}')
    for p in [0.01, 0.05, 0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 0.95, 0.99]:
        cover = 0
        mean_width = 0
        xs = binom.rvs(n, p, size=reps)
        for x in xs:
            sample = [1]*x + [0]*(n-x)
            bounds = boot_ci(sample, np.mean, reps=reps_boot, random_state=prng)
            cover += (1 if (bounds[0] <= p <= bounds[1])             
                      else 0)
            mean_width += bounds[1]-bounds[0]
        mean_width /= reps
        print(f'p: {p:.2f} covered: {100*cover/reps : .2f}% mean width: {mean_width:.2f}')
n: 10
p: 0.01 covered:  8.50% mean width: 0.03
p: 0.05 covered:  39.10% mean width: 0.13
p: 0.10 covered:  65.20% mean width: 0.25
p: 0.20 covered:  88.20% mean width: 0.41
p: 0.40 covered:  91.70% mean width: 0.56
p: 0.50 covered:  97.30% mean width: 0.58
p: 0.70 covered:  96.50% mean width: 0.52
p: 0.90 covered:  64.90% mean width: 0.25
p: 0.95 covered:  38.00% mean width: 0.13
p: 0.99 covered:  8.80% mean width: 0.03

n: 50
p: 0.01 covered:  38.20% mean width: 0.03
p: 0.05 covered:  91.00% mean width: 0.10
p: 0.10 covered:  96.60% mean width: 0.16
p: 0.20 covered:  94.70% mean width: 0.21
p: 0.40 covered:  95.10% mean width: 0.27
p: 0.50 covered:  96.30% mean width: 0.27
p: 0.70 covered:  96.90% mean width: 0.25
p: 0.90 covered:  96.30% mean width: 0.16
p: 0.95 covered:  89.90% mean width: 0.10
p: 0.99 covered:  38.40% mean width: 0.03

n: 100
p: 0.01 covered:  62.90% mean width: 0.03
p: 0.05 covered:  96.80% mean width: 0.08
p: 0.10 covered:  94.90% mean width: 0.11
p: 0.20 covered:  95.30% mean width: 0.15
p: 0.40 covered:  95.10% mean width: 0.19
p: 0.50 covered:  94.90% mean width: 0.19
p: 0.70 covered:  95.40% mean width: 0.18
p: 0.90 covered:  95.00% mean width: 0.11
p: 0.95 covered:  95.30% mean width: 0.08
p: 0.99 covered:  63.80% mean width: 0.03

As you can see, the coverage probability of bootstrap percentile confidence intervals can be much lower than their nominal level (here, as low as about 9% when they should be 95%). Where their coverage is about right, their average width is comparable to the average width of the conservative intervals derived by inverting two-sided tests.

This example is a fairly simple situation: estimating the mean of a population of zeros and ones from a random sample with replacement. In more complicated examples, bootstrap confidence intervals generally do not attain their nominal level. There is a substantial body of work on how to improve the coverage of bootstrap CIs. Pre-pivoting can help substantially. See Beran (1987).

Other approximate confidence intervals for Binomial(\(p\))

There are many approximate methods for binomial confidence intervals, for instance, based on the normal approximation or on the normal approximation to a transformation of the data. They generally do not attain their nominal confidence level, even for modest sample sizes, and their coverage probability varies erratically as \(n\) and \(p\) vary. See Brown, Cai, and DasGupta (2001).
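For instance, the usual Wald interval, \(\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\) with \(\hat{p} = X/n\), is easy to check by simulation. A sketch (wald_ci is an illustrative helper, not a library function):

import numpy as np
from scipy.stats import binom, norm

def wald_ci(n, x, cl=0.95):
    # normal-approximation ("Wald") interval for a binomial p
    p_hat = x/n
    se = np.sqrt(p_hat*(1-p_hat)/n)
    z = norm.ppf(1 - (1-cl)/2)
    return p_hat - z*se, p_hat + z*se

reps = int(10**4)
for n in [10, 50, 100]:
    for p in [0.05, 0.5, 0.95]:
        xs = binom.rvs(n, p, size=reps)
        cover = sum(1 for x in xs if wald_ci(n, x)[0] <= p <= wald_ci(n, x)[1])
        print(f'n: {n} p: {p:.2f} covered: {100*cover/reps:.2f}%')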

Example 2: confidence interval for Hypergeometric \(G\) (known \(N\), \(n\))

Suppose we will observe \(X \sim \mbox{Hyper}(N, G, n)\), where \(N\) and \(n\) are known but \(G\) is not. We seek a one-sided lower confidence interval for \(G\), that is, a set of the form \([f(X,U), \infty)\) such that for all \(G \in \{0, 1, \ldots, N\}\), if \(X \sim \mbox{Hyper}(N, G, n)\) and \(U\) is an independent uniform random variable, then

\[\begin{equation*} \mathbb{P} \{ [f(X,U), \infty) \ni G \} \ge 1-\alpha. \end{equation*}\]

Since it is certain that \(G \le N\), the upper endpoint of the interval can be reduced from \(\infty\) to \(N\) without sacrificing coverage probability. That is, the same \(f\) will satisfy

\[\begin{equation*} \mathbb{P} \{ [f(X,U), N] \ni G \} \ge 1-\alpha. \end{equation*}\]

To make such a lower confidence bound, we can invert one-sided hypothesis tests that reject when \(X\) is “too big.” That is, we want a family of tests of the hypotheses \(G = H\) for all \(H \in \{0, 1, \ldots, N\}\) that reject for large values of \(X\). Such tests give evidence that \(G\) is at least a given size.

To keep things simple, we will use conservative non-randomized tests rather than exact randomized tests. Because we are basing the confidence intervals on conservative tests, we expect the coverage probability to be greater than \(1-\alpha\). We reject the hypothesis \(G = H\) if, on the assumption that \(G=H\), the chance that \(X\) would be greater than or equal to its observed value is not greater than \(\alpha\). That is,

\[\begin{equation*} \phi_H(x) = \left \{ \begin{array}{ll} 1, & \sum_{k=x}^n \frac{\binom{H}{k}\binom{N-H}{n-k}}{\binom{N}{n}} \ge \alpha \\ 0, & \mbox{otherwise.} \end{array} \right . \end{equation*}\]

(Of course, we know \(G \ge X\).) The lower endpoint of the one-sided confidence interval is the smallest value of \(H\) for which the corresponding test does not reject:

\[\begin{equation*} f(x) \equiv \min \left \{H \in \{x, x+1, \ldots, N\} : \sum_{k=x}^n \frac{\binom{H}{k}\binom{N-H}{n-k}}{\binom{N}{n}} \ge \alpha \right \}. \end{equation*}\]

Note that the upper tail probability, \(\sum_{k=x}^n \binom{H}{k}\binom{N-H}{n-k}/\binom{N}{n}\), increases monotonically as \(H\) increases, but \(H\) takes only integer values, so the problem is discrete rather than continuous: standard root-finding methods for continuous functions do not apply. Instead, we will use a search.

Let’s code this up in Python.

from scipy.stats import hypergeom
def hypergeom_lower_ci(N, n, x, cl=0.95):
    '''
    lower confidence bound for a hypergeometric G
    
    Assumes x is a draw from a hypergeometric distribution with parameters
    N (known), n (known), and G (unknown). Finds a lower confidence bound for G 
    at confidence level cl.
    
    Parameters
    ----------
    N : int
        population size, nonnegative integer
    n : int
        number of trials, nonnegative integer <= N
    x : int
        observed number of successes, nonnegative integer <= n
    cl : float
        confidence level, between 0 and 1
        
    Returns
    -------
    lb : int
        lower confidence bound
    '''
    assert 0 <= x <= n, 'impossible arguments'
    assert n <= N, 'impossible sample size'
    assert 0 < cl < 1, 'silly confidence level'
    lb = x
    tail = hypergeom.sf(x-1, N, lb, n)
    while tail < (1-cl):
        lb += 1
        tail = hypergeom.sf(x-1, N, lb, n)
    return lb
NN = [20, 50, 100]
nn = [10, 20, 30]
GG = [10, 30, 25]
reps = int(10**4)

for j in range(len(NN)):
    cover = 0
    xs = hypergeom.rvs(NN[j], GG[j], nn[j], size=reps)
    for x in xs:
        cover += (1 if hypergeom_lower_ci(NN[j], nn[j], x, cl=0.95) <= GG[j]
                 else 0)
    print(f'N={NN[j]}, G={GG[j]}, n={nn[j]}: covered {100*cover/reps : .2f}%')
N=20, G=10, n=10: covered  98.83%
N=50, G=30, n=20: covered  97.92%
N=100, G=25, n=30: covered  97.78%

We can make two-sided confidence bounds for \(G\) by inverting two-sided tests, such as those given in the chapter on testing. Recall that we discussed two different constructions of two-sided tests there, one based on trimming the same probability from each tail, and one based on minimizing the number of outcomes in the acceptance region.
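As a sketch of the first construction: trimming \(\alpha/2\) from each tail is the same as intersecting two one-sided bounds, each at level \(\alpha/2\). (hypergeom_two_sided_ci is an illustrative helper, re-using hypergeom imported above.)

def hypergeom_two_sided_ci(N, n, x, cl=0.95):
    '''
    conservative two-sided confidence interval for hypergeometric G,
    found by inverting two one-sided tests, each at level (1-cl)/2
    '''
    alpha = (1-cl)/2
    lb = x                                    # G is at least the successes observed
    while hypergeom.sf(x-1, N, lb, n) < alpha:
        lb += 1
    ub = N - (n-x)                            # G is at most N minus the failures observed
    while hypergeom.cdf(x, N, ub, n) < alpha:
        ub -= 1
    return lb, ub

print(hypergeom_two_sided_ci(50, 20, 8))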

Confidence intervals from permutation tests

The two-sample problem

Recall from the chapter on hypothesis tests that the two-sample problem asks whether two groups plausibly resulted from allocating their union randomly into two groups of their observed sizes.

That is, we have two groups of data, \(\{x_j\}_{j=1}^n\) and \(\{y_j\}_{j=1}^m\), and hypothesize that they arose by taking the multiset of \(n+m\) values \(\{x_1, \ldots, x_n, y_1, \ldots, y_m\}\) and randomly selecting \(n\) items to comprise the first group, with the remaining \(m\) comprising the second group.

This problem arises in many contexts, including randomized controlled trials: We have a group of \(n+m\) subjects of whom \(n\) are selected at random to be the control/placebo group and the other \(m\) receive the active treatment. The strong null hypothesis of no treatment effect is that each subject would have had the same response, no matter which treatment the subject was assigned. It is as if the response were determined before the assignment occurred. The assignment just reveals the response corresponding to the assigned treatment.

Thus, testing the strong null hypothesis is an instance of the two-sample problem: did the two sets of values plausibly arise from dividing a single set of values at random into two groups by simple random sampling?
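Here is a sketch of a two-sample permutation test of the strong null, using the absolute difference in group means as the test statistic (the statistic, the simulated p-value, and the helper name are illustrative choices; inverting such tests over a hypothesized treatment effect is one route to the confidence intervals this section is after):

import numpy as np

def two_sample_permutation_p(x, y, reps=10**4, seed=None):
    '''
    simulated p-value for the two-sample problem, using random
    re-allocations of the pooled data into groups of sizes n and m
    '''
    prng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    observed = abs(np.mean(x) - np.mean(y))
    hits = 0
    for _ in range(reps):
        prng.shuffle(pooled)                 # a random re-allocation
        diff = abs(np.mean(pooled[:n]) - np.mean(pooled[n:]))
        hits += (diff >= observed)
    return (hits + 1)/(reps + 1)             # count the observed allocation too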