Penny Sampling and Conditional Expectation

Penny Sampling
“Penny sampling” is a sampling and estimation approach closely related to dollar-unit sampling (DUS), but one that allows a particularly simple analysis. It was introduced by Edwards et al. (2013).
We shall derive a continuous version of Penny Sampling that turns the problem of making confidence bounds for the mean of a bounded random variable into the problem of making confidence bounds for a Binomial \(p\), by introducing auxiliary independent randomization.
Basic Penny Sampling
We shall derive basic Penny Sampling in the context of financial auditing.
Again, we have a population of \(N\) items; item \(j\) has unknown value \(x_j\), but the value is known to be nonnegative and no larger than \(u_j\).
We divide the upper bound on each item into “pennies.” Item \(j\) contains \(u_j\) pennies, of which \(x_j\) are “good” and \((u_j - x_j)\) are “bad.”
Let \(u \equiv \sum_{j=1}^N u_j\) and \(x \equiv \sum_{j=1}^N x_j\).
We want to estimate the population fraction of “good” pennies,

\[\mu \equiv \frac{x}{u} = \frac{\sum_{j=1}^N x_j}{\sum_{j=1}^N u_j}.\]
We then sample “pennies” at random, without replacement, from the entire population.
If we draw a penny associated with item \(j\), we inspect that item to determine the value \(x_j\).
If the index of the penny within item \(j\) is less than or equal to \(x_j\), we consider the penny that was drawn to be “good.” Otherwise the penny is “bad.”
The number of “good” pennies in the sample has a hypergeometric distribution with parameters \(u\), \(x = u \mu\), and \(n\); the expected fraction of “good” pennies in the sample is \(\mu = x/u\).
We invert hypergeometric tests to find confidence bounds for \(x\), which we can translate into confidence bounds for \(\mu\) by dividing by \(u\).
Again, we are imagining situations in which the sample is very small compared to the number of items in the population—much less the number of pennies in those items—so we will act as if we are sampling with replacement, which gives a conservative result in any case.
Thus we treat the distribution of the number of “good” pennies in the sample as Binomial with parameters \(n\) and \(p = \mu = x/u\), rather than hypergeometric.
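To make the test inversion concrete, here is a minimal sketch (a hypothetical helper, not part of the implementation below) of a lower confidence bound for \(x\) found by inverting one-sided hypergeometric tests, using scipy.stats.hypergeom:

import scipy.stats

def hyperLowerCL(u, n, s, cl=0.95):
    '''
    Sketch: smallest number of "good" pennies x (of u total) for which observing s or
    more good pennies in a simple random sample of size n is not surprising at level
    1-cl, found by inverting one-sided hypergeometric tests. Dividing the result by u
    gives a lower confidence bound for mu.
    '''
    x = 0
    while x <= u and scipy.stats.hypergeom.sf(s-1, u, x, n) < 1 - cl:
        x += 1  # reject this x: seeing s or more good pennies would be too surprising
    return x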
Relation between penny sampling and PPS/DUS sampling
While the chance of selecting each “penny” is equal, the chance of selecting item \(j\) is proportional to \(u_j\), just as in dollar-unit sampling. The difference is in how the data are analyzed: DUS uses \(X_i\), the value of \(x_j\) for the item selected on the \(i\)th draw, while penny sampling uses only whether the single “penny” drawn is good or bad.
Continuous Penny Sampling and the Conditional Expectation
The Two-Step
One can think of penny sampling as taking place in two steps: first, select an item \(J\) with PPS sampling, then select a penny uniformly at random from that item. If the penny’s index is less than or equal to \(X_J\), the penny is good. In other words, pennies are selected conditionally uniformly, given the item they are selected from, and compared to the true value of the item.
In the limit as pennies shrink in value, this amounts to comparing \(X_J\) to a random variable \(Z_J\) that is (continuously) conditionally uniformly distributed on \([0, U_J]\), where \(U_J\) is the upper bound on the item selected. Let \(W\) be the event that \(Z_J \le X_J\). Then

\[\mathbb P W = \sum_{j=1}^N \mathbb P \{J = j\} \, \mathbb P \{Z_J \le X_J \mid J = j\} = \sum_{j=1}^N \frac{u_j}{u} \cdot \frac{x_j}{u_j} = \frac{x}{u} = \mu.\]
In \(n\) iid draws, the distribution of the sum of the corresponding values of \(W\) is Binomial with parameters \(n\) and \(p=\mu = x/u\).
Alternatively, we can think of the process as creating a new set of \(n\) random variables \(\{W_i\}\), where \(W_i \sim \mbox{Bernoulli}(X_i/U_i)\). It is clear that unconditionally, \(\{W_i\}\) are iid Bernoulli(\(x/u\)), so \(\sum_{i=1}^n W_i \sim \mbox{Binomial}(n, \mu)\), where \(\mu \equiv x/u\).
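A quick simulation can check the two-step construction. Here is a sketch with a made-up population (the item values, bounds, and seed are all arbitrary):

import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed

xs = np.array([3.0, 0.0, 7.5, 2.0])   # hypothetical item values
us = np.array([4.0, 1.0, 10.0, 5.0])  # corresponding upper bounds
mu = xs.sum()/us.sum()                # population fraction of "good" pennies

n = 10000
items = rng.choice(len(xs), size=n, p=us/us.sum())  # PPS draws with replacement
z = rng.uniform(0, us[items])  # Z_J, conditionally uniform on [0, U_J]
w = (z <= xs[items])           # W = 1 iff the "penny" is good

print(w.mean(), mu)  # the empirical frequency of W should be close to mu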
Sequences of iid variables
This kind of “continuous penny sampling” lets us make confidence bounds for iid samples from an arbitrary bounded distribution, such as the mixture of a uniform and a point-mass at 0. The lower and upper bounds on each draw are 0 and 1. For each \(X_i\) in the sample, we generate a uniformly distributed \(Z_i\), with all \(\{X_i\}\) and \(\{Z_i\}\) independent. Let \(W_i = 1_{Z_i \le X_i}\). Then \(\{W_i \}_{i=1}^n\) are iid Bernoulli random variables with

\[\mathbb P \{W_i = 1\} = \mathbb P \{Z_i \le X_i\} = \int_0^1 \mathbb P \{X_i \ge z\} \, dz = \mathbb E X_i = \mu,\]

by the tail-integral formula for expectation.
Implementation
Let’s implement continuous penny sampling in Python.
# This is the first cell with code: set up the Python environment
%matplotlib inline
from __future__ import division, print_function
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
import scipy.optimize  # provides brentq, used in the confidence-bound functions below
from scipy.stats import binom
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
def binoLowerCL(n, x, cl=0.975, inc=0.000001, p=None):
    "Lower one-sided confidence bound at confidence level cl for Binomial p, given x successes in n trials"
    if p is None:
        p = float(x)/float(n)
    lo = 0.0
    if x > 0:
        f = lambda q: cl - scipy.stats.binom.cdf(x-1, n, q)
        lo = sp.optimize.brentq(f, 0.0, p, xtol=inc)
    return lo
def binoUpperCL(n, x, cl=0.975, inc=0.000001, p=None):
    "Upper one-sided confidence bound at confidence level cl for Binomial p, given x successes in n trials"
    if p is None:
        p = float(x)/float(n)
    hi = 1.0
    if x < n:
        f = lambda q: scipy.stats.binom.cdf(x, n, q) - (1-cl)
        hi = sp.optimize.brentq(f, p, 1.0, xtol=inc)
    return hi
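For example (illustrative only), one-sided 97.5% bounds for 20 successes in 50 trials should bracket the sample proportion, 0.4:

print(binoLowerCL(50, 20))  # lower bound, somewhat below 0.4
print(binoUpperCL(50, 20))  # upper bound, somewhat above 0.4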
def pennySampleReplacement(weights, n):
    '''
    Weighted random sample of size n drawn with replacement.
    Returns the indices of the selected items, the "remainder pennies" (the conditionally
    uniform auxiliary randomization within items), and the raw uniform values used to
    select the sample.
    '''
    if any(weights < 0):
        print('negative weight in pennySampleReplacement')
        return float('NaN')
    else:
        totWt = np.sum(weights, dtype=float)
        wc = np.cumsum(weights, dtype=float)/totWt  # ensure weights sum to 1
        theSam = np.random.random_sample((n,))
        inx = np.array(wc).searchsorted(theSam)  # indices of the items selected, PPS
        penny = [(wc[inx[i]]-theSam[i])*totWt for i in range(n)]  # position of each draw within its item
        return inx, penny, theSam
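As a quick illustration (with hypothetical bounds), draw five “pennies” from four items:

u = np.array([4.0, 1.0, 10.0, 5.0])  # hypothetical upper bounds
inx, penny, unif = pennySampleReplacement(u, 5)
print(inx)    # indices of the items selected
print(penny)  # where each selected "penny" falls within [0, u_j)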
def pennyBinomialLowerBound(x, inx, pennies, cl=0.95):
    '''
    Penny sampling lower (one-sided) confidence bound at confidence level cl on the mean,
    for sampling with replacement.
    x is the vector of observed values
    inx is the vector of indices of the items selected
    pennies is the vector of _which_ "penny" in each sampled item is to be adjudicated as "good" or "bad"
    The first x_j pennies in item j are deemed "good," the remaining (u_j - x_j) are "bad."
    Returns the lower bound and the number of "good" pennies in the sample.
    '''
    s = sum([pennies[i] <= x[inx[i]] for i in range(len(pennies))])  # number of "good" pennies
    n = len(inx)
    return binoLowerCL(n, s, cl=cl), s
def pennyBinomialBounds(x, inx, pennies, cl=0.95):
    '''
    Penny sampling two-sided confidence interval at confidence level cl for the mean,
    for sampling with replacement.
    x is the vector of observed values
    inx is the vector of indices of the items selected
    pennies is the vector of _which_ "penny" in each sampled item is to be adjudicated as "good" or "bad"
    The first x_j pennies in item j are deemed "good," the remaining (u_j - x_j) are "bad."
    Returns the lower bound, the upper bound, and the number of "good" pennies in the sample.
    '''
    s = sum([pennies[i] <= x[inx[i]] for i in range(len(pennies))])  # number of "good" pennies
    n = len(inx)
    return binoLowerCL(n, s, cl=1-(1-cl)/2), binoUpperCL(n, s, cl=1-(1-cl)/2), s
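Putting the pieces together on a synthetic population (the values, bounds, and seed are all arbitrary), the two-sided interval should usually cover the true mean fraction:

np.random.seed(1234567890)  # arbitrary seed
us = np.random.uniform(1, 10, size=1000)  # known upper bounds
xs = us*np.random.uniform(size=1000)      # true values, 0 <= x_j <= u_j
inx, pennies, _ = pennySampleReplacement(us, 100)
lo, hi, s = pennyBinomialBounds(xs, inx, pennies, cl=0.95)
print(lo, xs.sum()/us.sum(), hi)  # lower bound, true mu, upper bound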
Illustration
Let’s test this on iid draws from a distribution on \([0, 1]\): a nonstandard mixture of a point mass at 0 and a uniform distribution. Each draw is bounded between 0 and 1, so \(u_i = 1\) for every \(i\). We compare the coverage and average length of Student-t intervals and continuous penny sampling intervals:
# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])  # mixture fraction, weight of pointmass
alpha = 0.05  # 1 - (confidence level)
reps = int(1.0e4)  # just for demonstration
simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov',
                                 'Penny cov', 'Student-t len', 'Penny len'))

for p in ps:
    popMean = (1-p)*0.5  # p*0 + (1-p)*.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        coverT = 0
        coverP = 0
        totT = 0.0
        totP = 0.0
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            pennies = np.random.uniform(size=n)  # auxiliary uniform randomization within items
            sam[ptMass < p] = 0.0  # point mass at zero
            samMean = np.mean(sam)
            samSD = np.std(sam, ddof=1)
            coverT += ( math.fabs(samMean-popMean) < tCrit*samSD )  # did the interval cover popMean?
            totT += 2*tCrit*samSD  # accumulate interval lengths
            pLo, pHi, s = pennyBinomialBounds(sam, np.r_[0:n], pennies, cl=1-alpha)
            coverP += ( pLo <= popMean ) & ( popMean <= pHi )
            totP += pHi - pLo
        simTable.loc[len(simTable)] = p, n,\
            str(100*float(coverT)/float(reps)) + '%',\
            str(100*float(coverP)/float(reps)) + '%',\
            str(round(totT/float(reps), 4)),\
            str(round(totP/float(reps), 4))
#
ansStr = '<h3>Simulated coverage probability of Student-t and Continuous Penny Sampling confidence intervals for ' +\
         'mixture of U[0,1] and pointmass at 0 population</h3>' +\
         '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
         '%</strong>.<br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'
display(HTML(ansStr))
display(simTable)
Simulated coverage probability of Student-t and Continuous Penny Sampling confidence intervals for mixture of U[0,1] and pointmass at 0 population
Nominal coverage probability 95.0%. Estimated from 10000 replications.
|    | mass at 0 | sample size | Student-t cov | Penny cov | Student-t len | Penny len |
|----|-----------|-------------|---------------|-----------|---------------|-----------|
| 0  | 0.900 | 25.0  | 90.37% | 99.35% | 0.6419 | 0.207  |
| 1  | 0.900 | 50.0  | 98.89% | 98.7%  | 0.6724 | 0.1382 |
| 2  | 0.900 | 100.0 | 99.99% | 98.41% | 0.6817 | 0.0942 |
| 3  | 0.900 | 400.0 | 100.0% | 95.9%  | 0.6868 | 0.0451 |
| 4  | 0.990 | 25.0  | 21.96% | 99.32% | 0.0975 | 0.1451 |
| 5  | 0.990 | 50.0  | 38.61% | 99.82% | 0.1253 | 0.0795 |
| 6  | 0.990 | 100.0 | 62.16% | 98.49% | 0.1604 | 0.0447 |
| 7  | 0.990 | 400.0 | 97.98% | 98.31% | 0.2098 | 0.0167 |
| 8  | 0.999 | 25.0  | 2.57%  | 98.67% | 0.0106 | 0.1381 |
| 9  | 0.999 | 50.0  | 4.44%  | 97.87% | 0.0127 | 0.0719 |
| 10 | 0.999 | 100.0 | 8.94%  | 99.93% | 0.0178 | 0.0371 |
| 11 | 0.999 | 400.0 | 32.85% | 98.16% | 0.0357 | 0.0101 |
As with the other nonparametric methods, the Continuous Penny Sampling intervals attain at least the nominal coverage probability in every scenario here, and in this simulation they are also shorter on average than the Student-t intervals, whose coverage collapses as the point mass at 0 grows: when the mass at 0 is 0.999, the Student-t intervals cover the true mean in fewer than 33% of replications even for samples of size 400.