Confidence bounds for the mean of a bounded population: Binomial and Hypergeometric

We will use basic combinatorics to find upper and lower confidence bounds for the mean of a bounded population.

  • Elementary derivation for {0, 1} populations

  • Obvious extension to 2-valued populations

  • Thresholding for general bounded populations

  • Trinomial and Multinomial bounds

Comparison with the normal approximation

Let’s start by constructing two-sided confidence intervals for the same {0, 1} cases in which the Normal approximation did so badly.

To find two-sided intervals, we find lower and upper confidence bounds at half the value of \(\alpha\). This is equivalent to constructing acceptance regions that are “balanced” in the sense that, under the null hypothesis, the chance of rejecting the hypothesis because \(X\) is too big is equal to the chance of rejecting the hypothesis because \(X\) is too small.

The code tells the story:

# This is the first cell with code: set up the Python environment
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
import scipy.optimize  # needed so that sp.optimize.brentq is available below
from scipy.stats import binom
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
def binoLowerCL(n, x, cl=0.975, inc=0.000001, p=None):
    "Lower confidence bound at confidence level cl for Binomial p, given x successes in n trials"
    if p is None:
        p = float(x)/float(n)
    lo = 0.0
    if x > 0:
        f = lambda q: cl - scipy.stats.binom.cdf(x-1, n, q)
        lo = sp.optimize.brentq(f, 0.0, p, xtol=inc)
    return lo

def binoUpperCL(n, x, cl=0.975, inc=0.000001, p=None):
    "Upper confidence bound at confidence level cl for Binomial p, given x successes in n trials"
    if p is None:
        p = float(x)/float(n)
    hi = 1.0
    if x < n:
        f = lambda q: scipy.stats.binom.cdf(x, n, q) - (1-cl)
        hi = sp.optimize.brentq(f, p, 1.0, xtol=inc)
    return hi
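
# Quick sanity check of the two functions: the endpoints of a two-sided 95%
# confidence interval for p, given 30 successes in 50 trials, are the two
# one-sided bounds at level 0.975 (an added illustration; values approximate).
print(binoLowerCL(50, 30), binoUpperCL(50, 30))  # roughly 0.45 and 0.73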
# Population of two values, {0, 1}, in various proportions.  Amounts to Binomial random variable
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([.001, .01, 0.1])    # mixture fractions, proportion of 1s in the population
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e3)   # just for demonstration
vals = [0, 1]

simTable = pd.DataFrame(columns=('fraction of 1s', 'sample size', 'Student-t cov',\
                                 'Binom cov', 'Student-t len', 'Binom len'))
for p in ps:
    popMean = p
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1.0-alpha/2, df=n-1)
        sam = sp.stats.binom.rvs(n, p, size=reps)
        samMean = sam/float(n)
        samSD = np.sqrt(samMean*(1-samMean)/(n-1))
        coverT = (np.fabs(samMean-popMean) < tCrit*samSD).sum()
        aveLenT = 2*(tCrit*samSD).mean()
        coverB = 0
        totLenB = 0.0
        for r in range(int(reps)):  
            lo = binoLowerCL(n, sam[r], cl=1.0-alpha/2)
            hi = binoUpperCL(n, sam[r], cl=1.0-alpha/2)
            coverB += ( p >= lo) & (p <= hi)
            totLenB += hi-lo
        simTable.loc[len(simTable)] =  p, n, str(100*float(coverT)/float(reps)) + '%', \
            str(100*float(coverB)/float(reps)) + '%',\
            str(round(aveLenT,4)),\
            str(round(totLenB/float(reps),4))
#
ansStr =  '<h3>Simulated coverage probability and expected length of Student-t and Binomial confidence intervals for a {0, 1} population</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>.<br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'
display(HTML(ansStr))
display(simTable)

Simulated coverage probability and expected length of Student-t and Binomial confidence intervals for a {0, 1} population

Nominal coverage probability 95.0%.
Estimated from 1000 replications.
    fraction of 1s  sample size  Student-t cov  Binom cov  Student-t len  Binom len
0            0.001           25           3.0%      97.0%          0.005     0.1391
1            0.001           50           5.0%     100.0%          0.004     0.0729
2            0.001          100          10.5%      99.3%         0.0043     0.0382
3            0.001          400          31.0%      99.3%         0.0033     0.0109
4            0.010           25          20.8%      99.9%          0.036     0.1521
5            0.010           50          38.2%      97.8%         0.0341     0.0873
6            0.010          100          65.3%      97.7%         0.0313     0.0526
7            0.010          400          90.5%      96.1%         0.0187      0.022
8            0.100           25          94.1%      99.2%          0.238     0.2628
9            0.100           50          89.0%      98.0%         0.1676     0.1811
10           0.100          100          94.1%      95.8%         0.1169     0.1246
11           0.100          400          93.5%      94.3%          0.059     0.0612

The empirical coverage rates are generally much higher than 95% when \(n\) is small, because of the discreteness of the Binomial distribution: to ensure that a test rejects at most 5% of the time might require rejecting far less than 5% of the time. That is, the tests are conservative rather than exact.
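
For example, with \(n = 25\) and \(p = 0.01\), the attainable significance levels of tests that reject when \(X\) is large jump from about 22% (reject when \(X \ge 1\)) to about 2.6% (reject when \(X \ge 2\)), so no non-randomized test has level exactly 5%. A quick check using the binom object imported above (an added illustration; the probabilities are approximate):

print(1 - binom.cdf(0, 25, 0.01))  # P(X >= 1) is about 0.222: rejects too often
print(1 - binom.cdf(1, 25, 0.01))  # P(X >= 2) is about 0.026: rejects far less than 5% of the time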

We could construct randomized tests that have exactly level \(\alpha\), but we won’t go there…

Hypergeometric confidence bounds

Consider \(n\) uniform draws without replacement from a \(\{0, 1\}\) population of known size \(N\) that contains an unknown number \(G\) of 1s—i.e., a simple random sample of size \(n\). Let \(X\) denote the number of 1s in the sample. Then \(X\) has a Hypergeometric distribution with parameters \(N\), \(G\), and \(n\).

In particular,

\[\begin{equation*} \mathbb P_{N, G, n} (X = k) = \frac{{G \choose k}{N-G \choose n-k}}{{N \choose n}} \end{equation*}\]

for \(\max(0, n − (N−G)) \le k \le \min(n, G)\).
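
As a sanity check on conventions, the snippet below compares this formula with scipy's Hypergeometric pmf. Note that scipy.stats.hypergeom takes its arguments in the order \((M, n, N)\) = (population size, number of 1s, sample size); the particular values of \(N\), \(G\), \(n\), and \(k\) here are arbitrary, chosen only for illustration:

from scipy.special import comb
from scipy.stats import hypergeom

N, G, n, k = 100, 30, 10, 4
print(hypergeom.pmf(k, N, G, n))             # scipy's parametrization: pmf(k, M=N, n=G, N=n)
print(comb(G, k)*comb(N-G, n-k)/comb(N, n))  # the formula above; the two values agree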

We can get sharper confidence bounds in this case by inverting hypergeometric tests instead of Binomial tests.

The strategy is the same: construct a family of acceptance regions for each possible value of the parameter, then let the confidence set be the parameter values for which, given the observed data, the test would not reject.

To find a one-sided lower confidence bound for \(G\) in a Hypergeometric\((N, G, n)\) distribution, with \(N\) and \(n\) known, we would invert one-sided upper tests, that is, tests that reject the hypothesis \(G = g\) when \(X\) is large. (The corresponding confidence bound for the population mean is the confidence bound for \(G\), divided by \(N\).)

The form of the acceptance region for the test is:

\[\begin{equation*} A_G := \ [0, a_G], \end{equation*}\]

where

\[\begin{equation*} a_G := \min \left \{ k: \sum_{i=k+1}^n \frac{{G \choose i}{N-G \choose n-i}}{{N \choose n}} \le \alpha \right \}. \end{equation*}\]
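
To make the inversion concrete, here is a minimal sketch of a lower confidence bound for \(G\) (the function hyperLowerCL is an added illustration, analogous to binoLowerCL above, not part of the original code). It scans upward from the smallest feasible \(g\) and returns the first value that the upper-tail test does not reject, using the fact that \(\mathbb P_{N, g, n}(X \ge x)\) is nondecreasing in \(g\):

from scipy.stats import hypergeom

def hyperLowerCL(N, n, x, cl=0.95):
    "Lower confidence bound at confidence level cl for G, given x ones in a simple random sample of size n from a population of size N"
    alpha = 1.0 - cl
    for g in range(x, N - (n - x) + 1):           # the sample forces x <= G <= N - (n - x)
        if hypergeom.sf(x - 1, N, g, n) > alpha:  # P(X >= x | g) > alpha: g is not rejected
            return g
    return N - (n - x)

Dividing the result by \(N\) gives the corresponding lower confidence bound for the population mean. For large \(N\), a bisection search over \(g\) would be faster than this linear scan.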

Two-valued populations

Clearly, if the population is known to contain only the values \(\{a, b\}\) instead of the values \(\{0, 1\}\), it’s trivial to re-scale binomial or hypergeometric confidence bounds.

If we observe the sum \(Y\) of \(n\) iid uniform draws from the population,

  • let \(X := (Y - na)/(b-a)\)

  • find the Binomial confidence bound for \(p\) based on \(X\) with \(n\) trials

  • transform each endpoint \(q\) of that interval to \((1-q)a + qb\) to get the corresponding endpoint of the interval for the mean of the original population.

For example, suppose the population was known to contain only the values 1 and 10, but in unknown proportions.

We draw an iid uniform sample of 50 items; the sample sum is 320 (corresponding to drawing “1” twenty times and “10” thirty times).

Then a 95% upper confidence bound for the fraction of 10s in the population is binoUpperCL(50, 30, cl=0.95) = 0.7169, so a 95% upper confidence bound for the population mean is

\[\begin{equation*} (1-0.7169)\times 1 + 0.7169\times 10 = 7.45182. \end{equation*}\]
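
The same calculation in code, using the binoUpperCL function defined earlier (the variable names are illustrative):

a, b = 1, 10                    # the two known population values
n, Y = 50, 320                  # sample size and sample sum
x = (Y - n*a)//(b - a)          # number of 10s in the sample: (320 - 50)/9 = 30
q = binoUpperCL(n, x, cl=0.95)  # 95% upper confidence bound for the fraction of 10s
print((1 - q)*a + q*b)          # 95% upper confidence bound for the mean, about 7.45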

Thresholding for bounded populations

The applications that motivate this inquiry generally call for one-sided confidence bounds rather than confidence intervals. For instance:

  • A plaintiff may want to show that, with high confidence, the damages are at least \(x\)

  • A prosecutor may want to show that, with high confidence, the accused committed fraud worth at least \(x\)

  • A financial auditor might want to certify that, with high confidence, a company’s assets are overstated by at most \(x\) and its liabilities are understated by at most \(y\).

  • A local election official might want to determine whether, with high confidence, the error in tallying the votes is smaller than the margin

  • An online marketer might want to show that, with high confidence, its technology increases sales by at least \(x\)%.

  • A pharmaceutical company might want to show that, with high confidence, its headache remedy decreases the average duration of headaches by at least \(x\) hours, or that its Ebola vaccine increases the rate of survival by at least \(x\)%.

In general, some constraint is needed to get even a one-sided (conservative or exact) confidence bound.

In these problems, there are natural constraints, for instance:

  • Damages cannot be negative (the plaintiff doesn’t owe the defendant—unless the defendant sues the plaintiff, becoming the plaintiff in a new action)

  • Fraudulent billing is non-negative; the fraudulent portion of a bill does not exceed the total bill

Consider the second example in the Normal Approximation notebook, a mixture of a uniform and a point mass at zero. Suppose for the moment that we are interested in an upper confidence bound, e.g., an upper bound on the overstatement of a collection of accounts, for the purpose of establishing that a company's books are fairly represented.

We have an unknown population $\{ x_j \}_{j=1}^N$. We want an upper confidence bound for $\mu_x := \frac{1}{N} \sum_{j=1}^N x_j$.

Suppose we have an a priori upper bound $u_j$ on the value $x_j$ of item $j$, $\forall j$. For simplicity, let's take $u_j = 1$. Pick a threshold $t < 1$. Imagine a new population $\{ y_j \}_{j=1}^N$, where

\[\begin{equation*} y_j := \left \{ \begin{array}{ll} t, & x_j \le t \cr 1, & x_j > t. \end{array} \right . \end{equation*}\]

This new population $\{y_j\}$ "majorizes" the original population $\{x_j\}$. Clearly $\mu_y := \frac{1}{N} \sum_{j=1}^N y_j \ge \mu_x$, so an upper confidence bound for $\mu_y$ is also an upper confidence bound for $\mu_x$. But we can find an upper confidence bound for $\mu_y$ using a random sample with replacement and the general two-value transformation of the Binomial, as sketched above.

  • draw a random sample with replacement of size \(n\) from the population

  • let \(X\) denote the number of items in the sample with value greater than \(t\). Then \(X\) has a Binomial distribution with parameters \(n\) and \(p\), where \(p\) is the population fraction of items with value greater than \(t\)

  • find an upper \(1-\alpha\) confidence bound \(p^+\) for \(p\) by inverting Binomial tests

  • an upper \(1-\alpha\) confidence bound for \(\mu_y\), and hence for \(\mu_x\), is \((1-p^+)t + p^+\).
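
Concretely, the steps above look like this for a single sample (a sketch with an arbitrary threshold and a simulated sample; binoUpperCL is the function defined earlier):

t = 0.1                              # threshold, a tuning parameter
sam = np.random.uniform(size=50)     # hypothetical sample from a population bounded by 1
x = (sam > t).sum()                  # binomial "successes": sampled items above the threshold
pPlus = binoUpperCL(50, x, cl=0.95)  # upper 95% confidence bound for p
print((1.0 - pPlus)*t + pPlus)       # upper 95% confidence bound for the population mean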

Let’s see how this does compared to a one-sided Student-t interval. We will estimate the coverage for a variety of thresholds \(t\). We will also estimate the average length of the intervals, to get an idea of the tradeoff between coverage and precision.

It is perhaps discomfiting to have a tuning parameter \(t\) in the method, but regardless of the choice of \(t\), the intervals are guaranteed to be conservative (true coverage probability at least \(1-\alpha\)). However, the expected length of the interval depends on \(t\) and on the underlying population of values.

# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass
thresh = [0.2, 0.1, 0.01, .001]
alpha = 0.05  # 1- (confidence level)
reps = 1.0e3   # just for demonstration

cols = ['mass at 0', 'sample size', 'Student-t cov']
for i in range(len(thresh)):
    cols.append('Bin t=' + str(thresh[i]) + ' cov')
cols.append('Student-t len')
for i in range(len(thresh)):
    cols.append('Bin t=' + str(thresh[i]) + ' len')


simTable = pd.DataFrame(columns=cols)

for p in ps:
    popMean = (1-p)*0.5  #  p*0 + (1-p)*.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha, df=n-1)
        coverT = 0    # coverage of t intervals
        tUp = 0       # mean upper bound of t intervals
        coverB = np.zeros(len(thresh))  # coverage of binomial threshold intervals
        bUp = np.zeros(len(thresh))     # mean upper bound of binomial threshold intervals
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            sam[ptMass < p] = 0.0
            samMean = np.mean(sam)
            samSE = np.std(sam, ddof=1)/np.sqrt(n)  # standard error of the sample mean
            tlim = samMean + tCrit*samSE
            coverT += (popMean <= tlim)  # one-sided Student-t
            tUp += tlim
            for i in range(len(thresh)):
                x = (sam > thresh[i]).sum()  # number of binomial "successes"
                pPlus = binoUpperCL(n, x, cl=1-alpha)
                blim = thresh[i]*(1.0-pPlus) + pPlus
                coverB[i] += (popMean <= blim)
                bUp[i] += blim
        theRow = [p, n, str(100*float(coverT)/float(reps)) + '%']
        for i in range(len(thresh)):
            theRow.append(str(100*float(coverB[i])/float(reps)) + '%')
        theRow.append(str(round(tUp/float(reps), 3)))
        for i in range(len(thresh)):
            theRow.append(str(round(bUp[i]/float(reps), 3)))
        simTable.loc[len(simTable)] =  theRow
#
ansStr =  '<h3>Simulated coverage probability and expected lengths of one-sided Student-t confidence intervals and threshold ' +\
          'Binomial intervals for mixture of U[0,1] and pointmass at 0</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>. <br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'

display(HTML(ansStr))
display(simTable)

Simulated coverage probability and expected lengths of one-sided Student-t confidence intervals and threshold Binomial intervals for mixture of U[0,1] and pointmass at 0

Nominal coverage probability 95.0%.
Estimated from 1000 replications.
    mass at 0  sample size  Student-t cov  Bin t=0.2 cov  Bin t=0.1 cov  Bin t=0.01 cov  Bin t=0.001 cov  Student-t len  Bin t=0.2 len  Bin t=0.1 len  Bin t=0.01 len  Bin t=0.001 len
0       0.900           25          91.4%         100.0%         100.0%          100.0%           100.0%          0.323          0.384           0.32           0.262            0.257
1       0.900           50          98.4%         100.0%         100.0%          100.0%           100.0%           0.33          0.337          0.265           0.202            0.196
2       0.900          100         100.0%         100.0%         100.0%          100.0%           100.0%          0.333          0.311          0.235           0.169            0.163
3       0.900          400         100.0%         100.0%         100.0%          100.0%           100.0%          0.336          0.284          0.205           0.135            0.128
4       0.990           25          22.2%         100.0%         100.0%          100.0%           100.0%          0.046            0.3          0.215           0.137             0.13
5       0.990           50          39.3%         100.0%         100.0%          100.0%           100.0%          0.057          0.257          0.166           0.083            0.075
6       0.990          100          60.0%         100.0%         100.0%          100.0%           100.0%          0.069          0.234          0.139           0.054            0.046
7       0.990          400          98.1%         100.0%         100.0%          100.0%           100.0%          0.093          0.216          0.119           0.032            0.023
8       0.999           25           2.6%         100.0%         100.0%          100.0%           100.0%          0.005          0.291          0.203           0.123            0.115
9       0.999           50           4.8%         100.0%         100.0%          100.0%           100.0%          0.006          0.248          0.154           0.069            0.061
10      0.999          100           9.6%         100.0%         100.0%          100.0%           100.0%          0.008          0.225          0.128           0.041            0.032
11      0.999          400          32.9%         100.0%         100.0%          100.0%           100.0%          0.016          0.207          0.108           0.019             0.01

Notice that in many cases the expected length of the Student-t interval is greater than the length of the binomial interval, even though the coverage probability of the Student-t interval is smaller.

Trinomial Bounds, Multinomial Bounds, and the Stringer Bound

Instead of thresholding at one value (yielding a binomial variable: “success” means “above threshold”), we could discretize the support of the population into three or more ranges and make inferences based on the trinomial or multinomial distribution that this induces.

Specifying the acceptance regions in a way that allows the tests to be inverted easily is itself not simple (and there are issues with published methods).

The Stringer Bound is well known in financial auditing. It amounts to combining confidence limits for data-dependent ranges (adaptive thresholds) into an overall confidence limit. Empirically, it is conservative, and Bickel (1992) has shown that it is asymptotically conservative, but as far as I know, there is no proof that it is conservative for finite samples.

Moreover, computational evidence suggests that methods we will explore in later lectures perform better than multinomial methods and the Stringer bound in practice, so we will not dwell on them.

What’s next?

Now we will consider some conservative methods for constructing lower confidence bounds for the mean of nonnegative populations.