Confidence bounds and tests for the population mean, Part I:

The Normal Approximation and Bootstrap Percentile Intervals

This notebook examines confidence bounds and tests for a population mean from a random sample, using the normal approximation or the bootstrap.

There are many situations in which an exact or conservative (not approximate) confidence interval or test is required, and there are many populations for which asymptotic/approximate tests and confidence intervals are very inaccurate, especially for modest sample sizes.

The context for what follows is “nonstandard” problems where the population distribution may be quite skewed, and sample sizes may be small.

For instance, in applications to financial audits, populations of overstatement errors are often mostly zero or small (i.e., the account is accurately reported), with possible occasional large values. The same is true for election audits based on comparing manual counts of votes to machine counts of votes.

This notebook shows that standard approximate methods can fail dramatically in those situations. Later notebooks develop exact and conservative confidence sets and tests that work in those situations.

Confidence intervals based on the normal approximation

One of the first things most statisticians would consider is to use confidence based on the normal approximation, especially if it were possible to draw a moderately large sample (e.g., 100 or more units).

The procedure would be:

  • choose the desired coverage probability \(1-\alpha\) and the sample size \(n\)

  • draw a simple random sample or a random sample with replacement of size \(n\) from the population

  • compute the sample mean \(\bar{X}\) and sample standard deviation \(s\)

  • find \(\hat{SE} = f \times s/\sqrt{n}\), the estimated standard error of the sample mean, where \(f=1\) for sampling with replacement, and \(f = \sqrt{\frac{N-n}{N-1}}\) for sampling without replacement (\(N\) is the population size)

  • find \(t\), the appropriate critical value of Student’s t distribution with \(n-1\) degrees of freedom

  • the approximate 2-sided confidence interval is \([\bar{X} - t \hat{SE}, \bar{X} + t \hat{SE}]\).

Formally, this gives an approximate confidence interval, but there’s no guarantee that the actual coverage probability is close to the nominal coverage probability

  • Normal approximation is common, but can be grossly inaccurate, especially with:

    • small sample sizes

    • skewed populations

In auditing situations, typical for the distributions to be very skewed: point mass at zero, mixed with something else.

  • Examine scenarios by simulation.

  • First scenario is a population of zeros and ones, in various proportions

  • Second scenario is “nonstandard” mixture of pointmass at 0 and a Uniform

We are estimating the coverage by simulation.

General considerations for simulations

  • It’s important to know which pseudo-random number generator (PRNG) you are using (Python uses the Mersenne Twister by default)

  • It’s important to assess whether that PRNG is adequate (the Mersenne twister is pretty good for statistical simulations, but not good enough for cryptographic applications). See, e.g., Stark and Ottoboni, 2018

  • It’s important to remember that your estimates will have sampling error

  • If the result matters, you should quantify the sampling error, for instance, by a confidence interval for the estimated result.

Here, we will use 1,000 or 10,000 iterations in the estimates. For estimating probabilities, the standard error of the result is no larger than \(1/(2\sqrt{n})\): 1.58% for \(n=1000\) and 0.5% for \(n=10,000\).

For estimating things like expected length, guarantees of accuracy are harder, but the kinds of techniques we will see can be applied to that problem too.

Bootstrap Percentile Approximate Confidence Intervals

The bootstrap approximates sampling from a population by sampling with replacement from a sample from the population. That is, it approximates the population distribution by the empirical distribution of the observed sample, which is the nonparametric maximum likelihood estimate of the population distribution.

The bootstrap is commonly used to estimate the variability of an estimator of a functional parameter. It generally does a good job of estimating things like the variance of an estimator.

It can also be used to construct approximate confidence intervals in a variety of ways, including the percentile method, which approximates percentiles of an estimator from percentiles of the resampling distribution of the estimator. However, this approximation generally is not accurate..

Let’s code it up and see.

# First code cell: set up the Python environment
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
# For interactive widgets
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
def boot_percentile_ci(data : list, estimator : callable, cl : float=0.95, reps : int=int(10**4), \
                       random_state=None):
    """
    Bootstrap percentile approximate confidence interval for a functional parameter
    
    Parameters
    ----------
    data : array-like
        original sample
    estimator : callable
        function defined on the empirical distribution of a sample.
        Applying it to the population distribution would give the
        true value of the parameter of interest. Applying it to a 
        sample yields an estimator of the parameter of interest
    cl : float in (0,1)
        confidence level
    reps : int, nonnegative
        number of bootstrap samples to use
        
    Returns
    -------
    lb : float
        estimated lower confidence bound
    ub : float
        estimated upper confidence bound
    """
    if random_state is None:
        prng = np.random.randomstate()
    else:
        prng = random_state
    n = len(data)
    estimates = []
    for j in range(reps):
        estimates.append(estimator(prng.choice(data, size=n, replace=True)))
    estimates = np.array(estimates)
    tail = (1-cl)/2
    lower = np.quantile(estimates, tail)
    upper = np.quantile(estimates, 1-tail)
    return (lower, upper)
# Population of two values, {0, 1}, in various proportions.  Amounts to Binomial random variable
seed = 12345678
prng = np.random.RandomState(seed)

ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([.001, .01, 0.1])    # mixture fractions, proportion of 1s in the population
alpha = 0.05  # 1- (confidence level)
reps = int(10**4) # just for demonstration
reps_boot = int(10**3)
vals = [0, 1]

simTable = pd.DataFrame(columns=('fraction of 1s', 'sample size', 'Student-t cov', 'Student-t len', \
                                 'Bootstrap cov', 'Bootstrap len'))
for p in ps:
    popMean = p
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        sam = sp.stats.binom.rvs(n, p, size=reps)
        cover_boot = 0
        aveLen_boot = 0
        for x in sam:
            sample = [1]*x + [0]*(n-x)
            bounds = boot_percentile_ci(sample, np.mean, cl=1-alpha, reps=reps_boot, random_state=prng)
            cover_boot += (1 if (bounds[0] <= p <= bounds[1])             
                           else 0)
            aveLen_boot += bounds[1]-bounds[0]
        cover_boot = cover_boot/reps
        aveLen_boot = aveLen_boot/reps
        samMean = sam/n
        samSD = np.sqrt(samMean*(1-samMean)/(n-1))
        cover = ( np.fabs(samMean-popMean) < tCrit*samSD).mean()
        aveLen = 2*(tCrit*samSD).mean()
        simTable.loc[len(simTable)] =  p, n, f'{100*cover :0.2f}%', f'{aveLen :0.4f}', \
                                       f'{100*cover_boot :0.2f}%', f'{aveLen_boot :0.4f}'
#
ansStr =  '<h3>Simulated coverage probability and expected length of Student-t and bootstrap percentile confidence intervals for a {0, 1} population</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>.<br /><strong>Estimated from ' + str(reps) + ' replications.</strong>'

display(HTML(ansStr))
display(simTable)

Simulated coverage probability and expected length of Student-t and bootstrap percentile confidence intervals for a {0, 1} population

Nominal coverage probability 95.0%.
Estimated from 10000 replications.
fraction of 1s sample size Student-t cov Student-t len Bootstrap cov Bootstrap len
0 0.001 25 2.50% 0.0042 2.50% 0.0031
1 0.001 50 4.60% 0.0037 4.60% 0.0028
2 0.001 100 9.72% 0.0039 9.72% 0.0031
3 0.001 400 32.05% 0.0034 31.96% 0.0028
4 0.010 25 22.13% 0.0385 22.11% 0.0292
5 0.010 50 39.74% 0.0352 39.62% 0.0282
6 0.010 100 63.13% 0.0303 63.13% 0.0258
7 0.010 400 90.56% 0.0188 90.90% 0.0180
8 0.100 25 92.83% 0.2337 92.05% 0.2058
9 0.100 50 87.60% 0.1668 95.57% 0.1573
10 0.100 100 93.20% 0.1180 95.68% 0.1145
11 0.100 400 95.01% 0.0588 95.38% 0.0582

Why/When is the coverage bad?

  • Because of the high population mass at 0, the sample will often contain only 0s. Then the SD will be zero, the length of the Student-t interval will be 0, and the interval won’t cover the mean—which is bigger than zero. Similarly, the bootstrap will be resampling from a sample that consists entirely of zeros.

  • In general, coverage will be bad when sampling distribution of the sample mean is strongly skewed

E.g., Binomial with very small or very large \(p\).

def plotBinomialPmf(n, p):
    '''
       Plots the sampling distribution of the sample mean of a sample of size n
       drawn with replacement from a {0, 1} population; i.e., a scaled binomial.
       Plots a vertical red line at the expected value.
    '''
    fig, ax = plt.subplots(nrows=1, ncols=1)
    x = np.arange(n+1)
    xFrac = x.astype(float)/ x.max()
    width = 1.0/n
    ax.bar(xFrac, binom.pmf(x, n, p), width, align="center")
    plt.xlim([-width,1+width])
    plt.axvline(x=p, color='r')
    plt.show()

interact(plotBinomialPmf, n=widgets.IntSlider(min=5, max=300, step=5, value=30),\
         p=widgets.FloatSlider(min=0.001, max=1, step=0.001, value=.001))
<function __main__.plotBinomialPmf(n, p)>

Nonstandard mixtures of distributions

  • Common in statistical study of auditing to consider “nonstandard mixtures” of a pointmass and continuous distribution

  • E.g., NRC Monograph and 1989 Statistical Science article

  • Let’s look at a mixture of \(U[0,1]\) and a pointmass at zero

# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e4) # just for demonstration

simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov', 'Student-t len'))
for p in ps:
    popMean = (1-p)*0.5  #  p*0 + (1-p)*.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        cover = 0
        totLen = 0.0
        samMean = np.zeros(reps)
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            sam[ptMass < p] = 0.0
            samMean[rep] = np.mean(sam)
            samSD = np.std(sam, ddof=1)
            cover += ( math.fabs(samMean[rep]-popMean) < tCrit*samSD )
            totLen += 2*tCrit*samSD
        simTable.loc[len(simTable)] =  p, n, str(100*float(cover)/float(reps)) + '%', str(round(totLen/reps,4))
#
ansStr =  '<h3>Simulated coverage probability and expected length of Student-t confidence intervals for mixture of U[0,1] and ' +\
          'pointmass at 0.</h3>' +\
          '<h4>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%.</h4> Estimated from ' + str(reps) + ' replications.'

display(HTML(ansStr))
display(simTable)

Simulated coverage probability and expected length of Student-t confidence intervals for mixture of U[0,1] and pointmass at 0.

Nominal coverage probability 95.0%.

Estimated from 10000 replications.
mass at 0 sample size Student-t cov Student-t len
0 0.900 25 90.31% 0.6428
1 0.900 50 98.95% 0.6719
2 0.900 100 99.96% 0.68
3 0.900 400 100.0% 0.6869
4 0.990 25 22.61% 0.1011
5 0.990 50 39.54% 0.1297
6 0.990 100 62.67% 0.1627
7 0.990 400 98.05% 0.2098
8 0.999 25 2.26% 0.0092
9 0.999 50 4.95% 0.0145
10 0.999 100 9.43% 0.0191
11 0.999 400 32.97% 0.036

Again the distribution of the sample mean is skewed, which hurts the coverage of Student-t intervals.

When the mass at 0 gets large enough, there is again a large chance that the sample will contain only 0s, in which case the sample SD will be zero, the length of the interval will be zero, and the interval won’t cover.

Summary

  • The actual coverage probability of confidence intervals based on the normal approximation or the bootstrap percentile method can be much less than the nominal coverage probability, even for large-ish sample sizes, for skewed population distributions.

  • In many applications, the population is very skewed: “Nonstandard mixture distributions”

  • For example, accounting overstatements might be mostly 0, and occasionally large. Then:

    • Can be very likely that the sample variance is zero

    • Leads to apparent certainty that the result is correct.

What’s next?

We will consider some conservative methods for constructing confidence bounds by inverting hypothesis tests.