Confidence bounds via the Chebychev and Hoeffding Inequalities¶
Chebychev’s inequality¶
Theorem
Suppose \(X\) has finite expected value \(\mu\) and variance \(\sigma^2\). Then, for any \(a > 0\), \begin{equation*} \mathbb P ( |X - \mu| \ge a ) \le \frac{\sigma^2}{a^2}. \end{equation*}
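As a quick numerical sanity check (a minimal sketch; the Uniform[0, 1] example and the values of \(a\) are our own choices, not part of the original notebook), we can compare exact tail probabilities with the Chebychev bound:

import numpy as np
from scipy import stats

# X ~ Uniform[0, 1] has mu = 1/2 and sigma^2 = 1/12
mu, sigma2 = 0.5, 1/12
for a in [0.2, 0.3, 0.4]:
    exact = stats.uniform.cdf(mu - a) + stats.uniform.sf(mu + a)  # P(|X - mu| >= a)
    bound = sigma2 / a**2                                         # Chebychev bound
    print(f"a={a}: exact={exact:.3f}, bound={bound:.3f}")

Note that for small \(a\) the bound exceeds 1 and is trivially true; it only becomes informative once \(a\) is large enough, a point that recurs below.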
Using Chebychev’s inequality for confidence intervals¶
Suppose that \(\{X_j\}_{j=1}^n\) are independent random variables and \(\{\ell_j\}_{j=1}^n\) and \(\{u_j\}_{j=1}^n\) are real numbers such that \begin{equation*} \mathbb P ( X_j \in [\ell_j, u_j] ) = 1, \quad j = 1, \ldots, n. \end{equation*}
Let \(\bar{X} \equiv \frac{1}{n} \sum_{j=1}^n X_j\), \(\mu = \mathbb E \bar{X}\), and \(\tau^2 \equiv \sum_{j=1}^n (u_j - \ell_j)^2\).
Let’s bound \(\mbox{Var } X_j = \mathbb E (X_j - \mathbb E X_j)^2\). Intuitively, the variance is maximized by “spreading out” the distribution as much as possible, which would mean putting mass 1/2 at \(\ell_j\) and mass 1/2 at \(u_j\). (This is directly analogous to the fact that the variance of a Bernoulli distribution is maximal when \(p = 1/2\).) The mean of the corresponding distribution is \((\ell_j + u_j)/2\), and the variance is \begin{equation*} \frac{1}{2} \left ( \ell_j - \frac{\ell_j + u_j}{2} \right )^2 + \frac{1}{2} \left ( u_j - \frac{\ell_j + u_j}{2} \right )^2 = \frac{(u_j - \ell_j)^2}{4}. \end{equation*}
(As a sanity check, this variable is a shifted and scaled Bernoulli with \(p=1/2\), which has variance \(1/4\). The shift doesn’t affect the variance, and the scale factor is \(u_j - \ell_j\), so the new variance should be \(1/4 \times (u_j - \ell_j)^2\).)
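A two-line numerical check (a sketch; the endpoints below are arbitrary choices for illustration) confirms the calculation:

import numpy as np

ell, u = 2.0, 7.0                        # arbitrary endpoints for illustration
vals = np.array([ell, u])                # mass 1/2 at each endpoint
var = np.mean((vals - vals.mean())**2)   # E(X - EX)^2 with equal weights
print(var, (u - ell)**2 / 4)             # both print 6.25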
Hence, by independence, \begin{equation*} \mbox{Var } \bar{X} = \frac{1}{n^2} \sum_{j=1}^n \mbox{Var } X_j \le \frac{1}{n^2} \sum_{j=1}^n \frac{(u_j - \ell_j)^2}{4} = \frac{\tau^2}{4n^2}. \end{equation*} Since the standard deviation \(\sigma\) of \(\bar{X}\) satisfies \(\sigma \le \tau/(2n)\), Chebychev’s inequality applied to \(\bar{X}\) gives \begin{equation*} \mathbb P_\mu ( |\bar{X} - \mu| \ge a ) \le \frac{\sigma^2}{a^2}, \end{equation*} and thus \begin{equation*} \mathbb P_\mu ( |\bar{X} - \mu| \ge a ) \le \frac{\tau^2}{4 n^2 a^2}. \end{equation*}
This bound is trivial (because it is greater than 1) until \begin{equation*} \frac{\tau^2}{4 n^2 a^2} < 1, \end{equation*} i.e., until \begin{equation*} a > \frac{\tau}{2n}. \end{equation*}
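Setting the bound \(\tau^2/(4n^2a^2)\) equal to \(\alpha\) and solving for \(a\) gives the half-width \(a = \tau/(2n\sqrt{\alpha})\) of a two-sided \(1-\alpha\) confidence interval for \(\mu\). A minimal sketch (the function name and example data are illustrative, not part of the original notebook):

import numpy as np

def chebychev_ci(sample, lower, upper, alpha=0.05):
    """Two-sided 1-alpha Chebychev confidence interval for the mean of
    n independent variables with P(lower_j <= X_j <= upper_j) = 1."""
    n = len(sample)
    tau = np.sqrt(np.sum((np.asarray(upper) - np.asarray(lower))**2))
    a = tau / (2 * n * np.sqrt(alpha))   # solves tau^2/(4 n^2 a^2) = alpha
    return np.mean(sample) - a, np.mean(sample) + a

rng = np.random.default_rng(12345)
print(chebychev_ci(rng.uniform(size=100), np.zeros(100), np.ones(100)))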
Hoeffding’s Inequality¶
Theorem

Suppose \(\{X_j\}_{j=1}^n\) are independent with \(\mathbb P ( X_j \in [\ell_j, u_j] ) = 1\) for \(j = 1, \ldots, n\), and define \(\bar{X} \equiv \frac{1}{n} \sum_j X_j\), \(\mu \equiv \mathbb E \bar{X}\), and \(\tau^2 \equiv \sum_{j=1}^n (u_j - \ell_j)^2\).
Then \begin{equation*} \mathbb P_\mu ( \bar{X} - \mu \ge a ) \le \exp \left (-\frac{2n^2 a^2}{\tau^2} \right ) \end{equation*} and \begin{equation*} \mathbb P_\mu ( |\bar{X} - \mu | \ge a ) \le 2 \exp \left(-\frac{2n^2 a^2}{\tau^2} \right ). \end{equation*}
This bound is useful once \begin{equation*} 2 \exp \left ( -\frac{2 n^2 a^2}{\tau^2} \right ) < 1, \end{equation*} i.e., once \begin{equation*} a > \frac{\tau}{2n} \sqrt{-2 \ln \frac{1}{2}}. \end{equation*}
Since \(\sqrt{-2 \ln 1/2} \approx 1.18\), Hoeffding’s inequality requires \(a\) to be about 18% larger than Chebychev’s inequality does before the bound becomes nontrivial, but the bound improves exponentially as \(n\) grows.
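The arithmetic behind the 18% figure:

import math
# Chebychev is nontrivial for a > tau/(2n); Hoeffding for a larger by this factor:
print(math.sqrt(-2 * math.log(0.5)))   # 1.1774...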
Truncation¶
If we know a priori that \(\mu \in [\ell, u]\), we can take the intersection of any confidence interval \(\mathcal I\) with \([\ell, u]\) without losing any coverage probability: if \(\mathcal I \ni \mu\), then \(\mathcal I \cap [\ell, u] \ni \mu\). Hence, we can truncate any confidence interval by taking its intersection with \([\ell, u]\) without reducing the confidence level.
For instance, suppose \(\mathbb P ( X_j \in [\ell_j, u_j] ) = 1\) for all \(j = 1, \ldots, n\). Define \(\ell \equiv \frac{1}{n} \sum_{j=1}^n \ell_j\) and \(u \equiv \frac{1}{n} \sum_{j=1}^n u_j\). Then \begin{equation*} \mathbb P ( \bar{X} \in [\ell, u] ) = 1, \end{equation*}
and thus \(\mu \equiv \mathbb E \bar{X} \in [\ell, u]\). Since we know with certainty that \(\mu \in [\ell, u]\), any portion of a confidence interval \(\mathcal I\) that lies outside \([\ell, u]\) cannot help it cover \(\mu\), and we might as well use \(\mathcal I \cap [\ell, u]\) if it is shorter than \(\mathcal I\): we gain precision without losing coverage.
Basing a confidence interval on Chebychev’s inequality or Hoeffding’s inequality sometimes produces an interval that can be improved by truncation, decreasing the length (and expected length) without sacrificing coverage.
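A one-line helper makes the truncation explicit (a sketch; the name truncate_ci is illustrative, not from the notebook):

def truncate_ci(lo, hi, lower=0.0, upper=1.0):
    """Intersect the confidence interval [lo, hi] with the a priori bounds
    [lower, upper]; coverage is unaffected."""
    return max(lo, lower), min(hi, upper)

print(truncate_ci(-0.08, 0.35))   # (0.0, 0.35): shorter, same coverage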
Comparing Hoeffding’s Inequality and Chebychev’s Inequality for the sample mean from finite bounded populations¶
Both inequalities bound \(\mathbb P_\mu (|\bar{X} - \mu | \ge a)\) in terms of \(\frac{\sum_{j=1}^n (u_j - \ell_j)^2}{n^2 a^2}\), but Hoeffding’s bound is exponentially better.
Here is a comparison, normalized by taking \(\tau^2 \equiv \sum_{j=1}^n (u_j - \ell_j)^2 = n\) so that \(\tau^2/n = 1\), as would be the case if \(\ell_j = 0\) and \(u_j = 1\), \(\forall j\):
# This is the first cell with code: set up the Python environment
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
from ipywidgets.widgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML

def chebychev(n, ssByn, a):
    # Chebychev bound on P(|Xbar - mu| >= a) when tau^2/n = ssByn
    return ssByn/(4.0 * n * np.square(a))

def hoeffding(n, ssByn, a):
    # Hoeffding bound on P(|Xbar - mu| >= a) when tau^2/n = ssByn
    return 2*np.exp(-2.0 * n * np.square(a) / ssByn)
def chebychevHoeffdingBoundPlot(n, ssByn=1.0, aMax=1.0, pts=1000):
    r'''
    Plots the bounds on P( |\bar{X} - \mu | \ge a) for Chebychev's inequality and
    Hoeffding's inequality.
    \bar{X} is the sample mean of n independent bounded random variables.
    ssByn is the mean sum of squares of the differences between the upper and
    lower bounds on each of those n variables.
    Plots the curves at pts+1 points.
    '''
    fig, ax = plt.subplots(nrows=1, ncols=1)
    aMin = math.sqrt(-math.log(0.5)*ssByn/(2.0*n))  # smallest a for which the Hoeffding bound is nontrivial
    x = np.linspace(aMin, aMax, pts+1)
    cheb, = plt.plot(x, np.fmin(1, chebychev(n, ssByn, x)), color='b', label='Chebychev')
    hoef, = plt.plot(x, hoeffding(n, ssByn, x), color='r', label='Hoeffding')
    plt.title(r'Bound on $\mathbf{P} (|\bar{X} - \mu| > a)$')
    plt.xlabel('a')
    plt.ylabel('Probability')
    plt.legend(loc='best')

interact(chebychevHoeffdingBoundPlot, n=widgets.IntSlider(min=5, max=300, step=1, value=30),
         ssByn=widgets.FloatSlider(min=0.5, max=10, step=0.1, value=1.0),
         aMax=widgets.fixed(1), pts=widgets.fixed(1000))
Using Hoeffding’s inequality for confidence intervals¶
Suppose that \(\{X_j \}_{j=1}^n\) are iid with support \([0, 1]\). Then \(\tau = \sqrt{n}\).
It follows that \begin{equation*} A_\alpha(\theta) = \left [0, \theta + \sqrt{\frac{-\ln \alpha}{2n}} \right ] \end{equation*} is the (one-sided) acceptance region for a level-$\alpha$ test of the hypothesis $\mu = \theta$; \begin{equation*} \mathbb P_\mu \left (\bar{X} - \sqrt{\frac{-\ln \alpha}{2n}} \ge \mu \right ) \le \alpha; \end{equation*} and \begin{equation*} \left [\bar{X} - \sqrt{\frac{-\ln \alpha}{2n}}, 1 \right ]\end{equation*} is a one-sided confidence interval for $\mu$, with confidence level $1-\alpha$.
Similarly, \begin{equation*} A_\alpha(\theta) = \left [\theta - \sqrt{\frac{-\ln (\alpha/2)}{2n}}, \theta + \sqrt{\frac{-\ln (\alpha/2)}{2n}} \right ] \end{equation*} is the two-sided acceptance region for a level-$\alpha$ test of the hypothesis $\mu = \theta$; \begin{equation*} \mathbb P_\mu \left ( \left | \bar{X} - \mu \right | \ge \sqrt{\frac{-\ln (\alpha/2)}{2n}} \right ) \le \alpha; \end{equation*} and \begin{equation*} \left [\bar{X} - \sqrt{\frac{-\ln (\alpha/2)}{2n}}, \bar{X} + \sqrt{\frac{-\ln (\alpha/2)}{2n}} \right ] \end{equation*} is a two-sided confidence interval for $\mu$, with confidence level $1-\alpha$.
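These formulas translate directly into code. Here is a minimal sketch for variables taking values in $[0, 1]$ (so $\tau = \sqrt{n}$), with the interval truncated to $[0, 1]$ as in the previous section; the function name is illustrative, not from the notebook:

import numpy as np

def hoeffding_ci(sample, alpha=0.05):
    """Two-sided 1-alpha Hoeffding confidence interval for the mean of n iid
    variables taking values in [0, 1], truncated to [0, 1]."""
    n = len(sample)
    a = np.sqrt(-np.log(alpha/2) / (2*n))   # half-width from the Hoeffding bound
    xbar = np.mean(sample)
    return max(xbar - a, 0.0), min(xbar + a, 1.0)

rng = np.random.default_rng(12345)
print(hoeffding_ci(rng.uniform(size=100)))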
Let's compare Hoeffding confidence intervals and intervals based on the normal approximation by simulation, in the cases we saw previously in [Confidence intervals based on the normal approximation](normApprox.ipynb). We will use two-sided bounds, even though in applications one-sided bounds are often more interesting.
Of course, if we know a priori that $\mu \in [\ell, u]$, the intersection of the Hoeffding confidence interval with $[\ell, u]$ is still a $1-\alpha$ confidence interval.
# Population of two values, {0, 1}, in various proportions: the sample sum is a Binomial random variable
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([.001, .01, 0.1])    # mixture fractions, proportion of 1s in the population
alpha = 0.05                       # 1 - (confidence level)
reps = int(1.0e4)                  # just for demonstration
vals = [0, 1]

simTable = pd.DataFrame(columns=('fraction of 1s', 'sample size', 'Student-t cov',
                                 'Hoeffding cov', 'Student-t len', 'Trunc Hoeffding len'))

for p in ps:
    popMean = p
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        hCrit = np.sqrt(-math.log(alpha/2)/(2*n))   # Hoeffding half-width for [0, 1] data
        sam = sp.stats.binom.rvs(n, p, size=reps)   # sample sums for all replications
        samMean = sam/float(n)
        samSD = np.sqrt(samMean*(1-samMean)/(n-1))  # estimated SE of the sample mean
        coverT = (np.fabs(samMean-popMean) < tCrit*samSD).sum()
        lenT = 2*(tCrit*samSD).mean()
        lenH = (np.minimum(samMean+hCrit, 1.0) - np.maximum(samMean-hCrit, 0.0)).mean()  # truncated to [0, 1]
        coverH = (np.fabs(samMean-popMean) < hCrit).sum()
        simTable.loc[len(simTable)] = p, n, str(100*float(coverT)/float(reps)) + '%',\
                                      str(100*float(coverH)/float(reps)) + '%',\
                                      str(round(lenT, 4)),\
                                      str(round(lenH, 4))
ansStr = '<h3>Simulated coverage probability for Student-t and truncated Hoeffding confidence intervals ' +\
         'for a {0, 1} population</h3>' +\
         '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
         '%</strong>.<br /><strong>Estimated from ' + str(reps) + ' replications.</strong>'
display(HTML(ansStr))
display(simTable)
Simulated coverage probability for Student-t and truncated Hoeffding confidence intervals for a {0, 1} population

Nominal coverage probability 95.0%. Estimated from 10000 replications.

|   | fraction of 1s | sample size | Student-t cov | Hoeffding cov | Student-t len | Trunc Hoeffding len |
|---|---|---|---|---|---|---|
| 0 | 0.001 | 25 | 2.23% | 100.0% | 0.0037 | 0.2725 |
| 1 | 0.001 | 50 | 4.88% | 100.0% | 0.004 | 0.1931 |
| 2 | 0.001 | 100 | 10.0% | 100.0% | 0.0041 | 0.1369 |
| 3 | 0.001 | 400 | 33.22% | 100.0% | 0.0035 | 0.0689 |
| 4 | 0.010 | 25 | 22.26% | 100.0% | 0.0384 | 0.2816 |
| 5 | 0.010 | 50 | 38.6% | 100.0% | 0.0343 | 0.2019 |
| 6 | 0.010 | 100 | 63.6% | 100.0% | 0.0308 | 0.1459 |
| 7 | 0.010 | 400 | 90.66% | 100.0% | 0.0187 | 0.0778 |
| 8 | 0.100 | 25 | 92.35% | 99.98% | 0.2314 | 0.3709 |
| 9 | 0.100 | 50 | 87.91% | 100.0% | 0.1669 | 0.2918 |
| 10 | 0.100 | 100 | 93.67% | 99.99% | 0.1181 | 0.234 |
| 11 | 0.100 | 400 | 94.91% | 100.0% | 0.0589 | 0.1358 |
# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])   # sample sizes
ps = np.array([0.9, 0.99, 0.999])   # mixture fraction, weight of pointmass
alpha = 0.05                        # 1 - (confidence level)
reps = int(1.0e4)                   # just for demonstration

simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov',
                                 'Hoeffding cov', 'Student-t len', 'Trunc Hoeffding len'))

for p in ps:
    popMean = (1-p)*0.5   # p*0 + (1-p)*0.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        hCrit = np.sqrt(-math.log(alpha/2)/(2*n))   # Hoeffding half-width for [0, 1] data
        coverT = 0
        coverH = 0
        totT = 0.0
        totH = 0.0
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            sam[ptMass < p] = 0.0                      # set a fraction p of the sample to 0
            samMean = np.mean(sam)
            samSD = np.std(sam, ddof=1)/math.sqrt(n)   # estimated SE of the sample mean
            coverT += ( math.fabs(samMean-popMean) < tCrit*samSD )
            totT += 2*tCrit*samSD
            coverH += ( math.fabs(samMean-popMean) < hCrit )
            totH += min(samMean+hCrit, 1.0) - max(samMean-hCrit, 0.0)   # truncated length
        simTable.loc[len(simTable)] = p, n,\
            str(100*float(coverT)/float(reps)) + '%',\
            str(100*float(coverH)/float(reps)) + '%',\
            str(round(totT/float(reps), 4)),\
            str(round(totH/float(reps), 4))
ansStr = '<h3>Simulated coverage probability of Student-t and truncated Hoeffding confidence intervals for ' +\
         'mixture of U[0,1] and pointmass at 0 population</h3>' +\
         '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
         '%</strong>.<br /><strong>Estimated from ' + str(reps) + ' replications.</strong>'
display(HTML(ansStr))
display(simTable)
Simulated coverage probability of Student-t and truncated Hoeffding confidence intervals for mixture of U[0,1] and pointmass at 0 population

Nominal coverage probability 95.0%. Estimated from 10000 replications.

|   | mass at 0 | sample size | Student-t cov | Hoeffding cov | Student-t len | Trunc Hoeffding len |
|---|---|---|---|---|---|---|
| 0 | 0.900 | 25 | 90.56% | 100.0% | 0.6459 | 0.3219 |
| 1 | 0.900 | 50 | 98.75% | 100.0% | 0.6707 | 0.2421 |
| 2 | 0.900 | 100 | 99.97% | 100.0% | 0.6816 | 0.1858 |
| 3 | 0.900 | 400 | 100.0% | 100.0% | 0.6868 | 0.1178 |
| 4 | 0.990 | 25 | 22.2% | 100.0% | 0.0988 | 0.2767 |
| 5 | 0.990 | 50 | 38.95% | 100.0% | 0.1268 | 0.1971 |
| 6 | 0.990 | 100 | 62.96% | 100.0% | 0.1616 | 0.1408 |
| 7 | 0.990 | 400 | 97.81% | 100.0% | 0.2101 | 0.0729 |
| 8 | 0.999 | 25 | 2.66% | 100.0% | 0.0112 | 0.2722 |
| 9 | 0.999 | 50 | 4.87% | 100.0% | 0.0139 | 0.1926 |
| 10 | 0.999 | 100 | 9.54% | 100.0% | 0.019 | 0.1363 |
| 11 | 0.999 | 400 | 32.32% | 100.0% | 0.0352 | 0.0684 |
Remarkably, in these simulations the truncated Hoeffding intervals are often shorter on average than the Student-t intervals, yet have far better coverage.
More in this direction…¶
We could also apply other inequalities, such as Serfling’s Inequality (a version of Hoeffding’s inequality for sampling without replacement) or Feige’s Inequality.
For sampling without replacement, the “Empirical Bernstein-Serfling Inequality” of Bardenet and Maillard (2013) seems particularly worth exploring.
What’s next?¶
Now we will consider some conservative methods for constructing lower confidence intervals for the mean of nonnegative populations.