Penny Sampling and Conditional Expectation

Penny Sampling

“Penny sampling” is a sampling and estimation approach closely related to DUS, but it allows a particularly simple analysis. It was introduced by Edwards et al. (2013).

We shall derive a continuous version of Penny Sampling that turns the problem of making confidence bounds for the mean of a bounded random variable into the problem of making confidence bounds for a Binomial \(p\), by introducing auxiliary independent randomization.

Basic Penny Sampling

We shall derive basic Penny Sampling in the context of financial auditing.

Again, we have a population of \(N\) items; item \(j\) has unknown value \(x_j\), but the value is known to be nonnegative and less than \(u_j\).

We divide the upper bound on each item into “pennies.” Item \(j\) contains \(u_j\) pennies, of which \(x_j\) are “good” and \((u_j - x_j)\) are “bad.”

Let \(u \equiv \sum_{j=1}^N u_j\) and \(x \equiv \sum_{j=1}^N x_j\).

We want to estimate the population fraction of “good” pennies:

\[\begin{equation*} \mu \equiv \frac{x}{u}.\end{equation*}\]

We then sample “pennies” at random, without replacement, from the entire population.

If we draw a penny associated with item \(j\), we inspect that item to determine the value \(x_j\).

If the index of the penny within item \(j\) is less than or equal to \(x_j\), we consider the penny that was drawn to be “good.” Otherwise the penny is “bad.”

The number of “good” pennies in the sample has a hypergeometric distribution with parameters \(u\), \(x = u \mu\), and \(n\); the expected fraction of “good” pennies in the sample is \(\mu = x/u\).
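Explicitly, if \(S\) is the number of “good” pennies in a simple random sample of \(n\) pennies drawn without replacement,

\[\begin{equation*} \mathbb P \{ S = s \} = \frac{ \binom{x}{s} \binom{u - x}{n - s} }{ \binom{u}{n} }. \end{equation*}\]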

We invert hypergeometric tests to find confidence bounds for \(x\), which we can translate into confidence bounds for \(\mu\) by dividing by \(u\).

Again, we are imagining situations in which the sample is very small compared to the number of items in the population—much less the number of pennies in those items—so we will act as if we are sampling with replacement, which gives a conservative result in any case.

Thus we treat the distribution of the number of “good” pennies in the sample as Binomial with parameters \(n\) and \(p = \mu = x/u\), rather than hypergeometric.
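Concretely, suppose we observe \(s\) “good” pennies among the \(n\) draws. For \(s > 0\), the lower endpoint \(p_{\mathrm{lo}}\) of a one-sided level \(1-\alpha\) lower confidence bound is the value of \(p\) at which observing \(s\) or more “good” pennies has probability exactly \(\alpha\):

\[\begin{equation*} \mathbb P_{p_{\mathrm{lo}}} \{ S \ge s \} = \sum_{k=s}^n \binom{n}{k} p_{\mathrm{lo}}^k (1 - p_{\mathrm{lo}})^{n-k} = \alpha; \end{equation*}\]

we take \(p_{\mathrm{lo}} = 0\) when \(s = 0\). An upper confidence bound solves the analogous equation for the lower tail, \(\mathbb P_p \{ S \le s \} = \alpha\). This is the usual Clopper-Pearson construction; the functions binoLowerCL and binoUpperCL in the implementation below find these roots numerically.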

Relation between penny sampling and PPS/DUS sampling

While the chance of selecting each “penny” is equal, the chance of selecting item \(j\) is proportional to \(u_j\), just as in dollar-unit sampling. The difference is in how the data are analyzed: DUS uses \(X_i\), the value of \(x_j\) for the item selected on the \(i\)th draw, while penny sampling uses only whether the single “penny” that was drawn is good or bad.

Continuous Penny Sampling and the Conditional Expectation

The Two-Step

One can think of penny sampling as taking place in two steps: first, select an item \(J\) with PPS sampling, then select a penny uniformly at random from that item. If the penny’s index is less than or equal to \(X_J\), the penny is good. In other words, pennies are selected conditionally uniformly, given the item they are selected from, and compared to the true value of the item.

In the limit as pennies shrink in value, this amounts to comparing \(X_J\) to a random variable \(Z_J\) that is (continuously) conditionally uniformly distributed on \([0, U_J]\), where \(U_J\) is the upper bound on the item selected. Let \(W\) be the indicator of the event \(Z_J \le X_J\). Then

\[\begin{align*} \mathbb P \{ W = 1 \} & = \mathbb P \{ Z_J \le X_J \} \\ &= \sum_{j=1}^N \mathbb P\{Z_J \le X_J \mid J = j\} \, \mathbb P \{J=j\} \\ &= \sum_{j=1}^N \frac{u_j}{u} \, \mathbb P\{Z_j \le x_j\} \\ &= \sum_{j=1}^N \frac{u_j}{u} \cdot \frac{x_j}{u_j} \\ &= \frac{x}{u}. \end{align*}\]

In \(n\) iid draws, the distribution of the sum of the corresponding values of \(W\) is Binomial with parameters \(n\) and \(p=\mu = x/u\).

Alternatively, we can think of the process as creating a new set of \(n\) random variables \(\{W_i\}\) where \(W_i \sim \mbox{Bernoulli}(X_i/U_i)\). It is clear that unconditionally, \(\{W_i\}\) are iid Bernoulli(\(x/u\)), so \(\sum_{i=1}^n W_i \sim \mbox{Binomial}(n, \mu)\), where \(\mu \equiv x/u\).
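As a quick numerical sanity check, here is a minimal Monte Carlo sketch of the two-step construction on a small hypothetical population (the values of \(u_j\) and \(x_j\) are made up for illustration):

import numpy as np

u = np.array([10., 5., 25., 60.])   # hypothetical upper bounds u_j; u = 100
x = np.array([ 9., 0., 20., 45.])   # hypothetical true values x_j; x = 74, so mu = 0.74

n = 100000
J = np.random.choice(len(u), size=n, p=u/np.sum(u))  # step 1: PPS draw of items
Z = np.random.uniform(0.0, u[J])                     # step 2: Z_J conditionally uniform on [0, U_J]
W = (Z <= x[J])                                      # W_i ~ Bernoulli(x_J/u_J)

print(np.mean(W))   # should be close to x/u = 0.74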

Sequences of iid variables

This kind of “continuous penny sampling” lets us make confidence bounds for the mean of an arbitrary bounded distribution from iid samples, such as the mixture of a uniform and a point mass at 0. Suppose the lower and upper bounds on each draw are 0 and 1. For each \(X_i\) in the sample, we generate a uniformly distributed \(Z_i\), with all \(\{X_i\}\) and \(\{Z_i\}\) independent. Let \(W_i = 1_{Z_i \le X_i}\). Then \(\{W_i \}_{i=1}^n\) are iid Bernoulli random variables with \(\mathbb P \{W_i = 1\} = \mathbb E X_i = \mu\):

\[\begin{align*} \mathbb P \{ W_i = 1 \} & = \mathbb P \{ Z_i \le X_i \} \\ &= \int_0^1 \mathbb P \{ Z_i \le X_i \mid Z_i = z\} \, dz \\ &= \int_0^1 \mathbb P \{ X_i \ge z \} \, dz \\ &= \mathbb E X_i, \end{align*}\]

by the tail-integral formula for expectation.
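This identity is easy to check by simulation. A minimal sketch, using a Beta(2, 5) distribution as an arbitrary (hypothetical) example of a distribution on \([0, 1]\), for which \(\mathbb E X_i = 2/7\):

import numpy as np

n = 1000000
X = np.random.beta(2., 5., size=n)   # any distribution supported on [0, 1] will do
Z = np.random.uniform(size=n)        # auxiliary randomization, independent of X
W = (Z <= X)                         # iid Bernoulli(E X)

print(np.mean(W), 2./7.)             # both approximately 0.2857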

Implementation

Let’s implement continuous penny sampling in Python.

# This is the first cell with code: set up the Python environment
%matplotlib inline
from __future__ import division, print_function
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
import scipy.optimize  # needed for sp.optimize.brentq below
from scipy.stats import binom
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
def binoLowerCL(n, x, cl=0.975, inc=0.000001, p=None):
    "Lower confidence bound at confidence level cl for Binomial p, when x successes are observed in n trials"
    if p is None:
        p = float(x)/float(n)
    lo = 0.0
    if x > 0:
        f = lambda q: cl - scipy.stats.binom.cdf(x-1, n, q)
        lo = sp.optimize.brentq(f, 0.0, p, xtol=inc)
    return lo

def binoUpperCL(n, x, cl=0.975, inc=0.000001, p=None):
    "Upper confidence bound at confidence level cl for Binomial p, when x successes are observed in n trials"
    if p is None:
        p = float(x)/float(n)
    hi = 1.0
    if x < n:
        f = lambda q: scipy.stats.binom.cdf(x, n, q) - (1-cl)
        hi = sp.optimize.brentq(f, p, 1.0, xtol=inc)
    return hi
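# Example (hypothetical numbers): with 80 "good" pennies in 100 draws, these
# one-sided bounds at cl=0.975 each combine into a two-sided 95% (Clopper-Pearson)
# interval of roughly (0.708, 0.876):
#    binoLowerCL(100, 80, cl=0.975), binoUpperCL(100, 80, cl=0.975)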

def pennySampleReplacement(weights, n):
    '''
       Weighted random sample of size n drawn with replacement.
       Returns the indices of the selected items, the "remainder pennies" (the conditionally
       uniform auxiliary randomization within items), and the raw uniform values used to
       select the sample.
    '''
    if np.any(weights < 0):
        print('negative weight in pennySampleReplacement')
        return float('NaN')
    else:
        totWt = np.sum(weights, dtype=float)
        wc = np.cumsum(weights, dtype=float)/totWt   # normalized cumulative weights
        theSam = np.random.random_sample((n,))
        inx = wc.searchsorted(theSam)                # item whose weight interval contains each uniform
        penny = (wc[inx] - theSam)*totWt             # conditionally uniform on (0, u_j], given item j
        return inx, penny, theSam

def pennyBinomialLowerBound(x, inx, pennies, cl=0.95):
    '''
       Penny sampling lower (one-sided) confidence bound at confidence level cl on the mean,
       for sampling with replacement.
       x is the vector of observed values
       inx is the vector of indices of the items selected, in order of the draws
       pennies is the vector of _which_ "penny" in each sampled item is to be adjudicated as "good" or "bad"
       The first x_j pennies in item j are deemed "good," the remaining (u_j - x_j) are "bad."
       Returns the lower bound and the number of "good" pennies in the sample.
    '''
    s = sum(pennies[i] <= x[inx[i]] for i in range(len(pennies)))
    n = len(inx)
    return binoLowerCL(n, s, cl=cl), s

def pennyBinomialBounds(x, inx, pennies, cl=0.95):
    '''
       Penny sampling two-sided confidence interval at confidence level cl for the mean,
       for sampling with replacement.
       x is the vector of observed values
       inx is the vector of indices of the items selected, in order of the draws
       pennies is the vector of _which_ "penny" in each sampled item is to be adjudicated as "good" or "bad"
       The first x_j pennies in item j are deemed "good," the remaining (u_j - x_j) are "bad."
       Returns the lower bound, the upper bound, and the number of "good" pennies in the sample.
    '''
    s = sum(pennies[i] <= x[inx[i]] for i in range(len(pennies)))
    n = len(inx)
    return binoLowerCL(n, s, cl=1-(1-cl)/2), binoUpperCL(n, s, cl=1-(1-cl)/2), s
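As a sketch of how these pieces fit together, here is an end-to-end run on the small hypothetical four-item population used earlier (all numbers illustrative):

u = np.array([10., 5., 25., 60.])    # hypothetical upper bounds; x/u = 0.74
x = np.array([ 9., 0., 20., 45.])    # hypothetical true values, "audited" as items are drawn

inx, pennies, _ = pennySampleReplacement(u, 200)              # draw 200 pennies
lo, s = pennyBinomialLowerBound(x, inx, pennies, cl=0.95)     # one-sided lower bound
lo2, hi2, s2 = pennyBinomialBounds(x, inx, pennies, cl=0.95)  # two-sided interval
print(s, lo, (lo2, hi2))   # s is near 148 on average; the interval should cover x/u = 0.74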

Illustration

Let’s test this on iid draws bounded between 0 and 1, from a nonstandard mixture: a point mass at 0 with weight \(p\), plus a uniform distribution on \([0,1]\) with weight \(1-p\). Since each draw has upper bound 1, every draw is its own “item,” and we can compare the coverage and average length of the continuous penny sampling intervals with those of Student-t intervals:

# Nonstandard mixture: a pointmass at zero and a uniform[0,1]
ns = np.array([25, 50, 100, 400])  # sample sizes
ps = np.array([0.9, 0.99, 0.999])    # mixture fraction, weight of pointmass
alpha = 0.05  # 1- (confidence level)
reps = int(1.0e4) # just for demonstration

simTable = pd.DataFrame(columns=('mass at 0', 'sample size', 'Student-t cov',\
                                 'Penny cov', 'Student-t len', 'Penny len')
                       )

for p in ps:
    popMean = (1-p)*0.5  #  p*0 + (1-p)*.5
    for n in ns:
        tCrit = sp.stats.t.ppf(q=1-alpha/2, df=n-1)
        coverT = 0
        coverP = 0
        totT = 0.0
        totP = 0.0
        for rep in range(int(reps)):
            sam = np.random.uniform(size=n)
            ptMass = np.random.uniform(size=n)
            pennies = np.random.uniform(size=n)  # auxiliary uniform randomization within items
            sam[ptMass < p] = 0.0                # point mass at zero
            samMean = np.mean(sam)
            samSD = np.std(sam, ddof=1)
            coverT += ( math.fabs(samMean-popMean) < tCrit*samSD )
            totT += 2*tCrit*samSD   
            pLo, pHi, s = pennyBinomialBounds(sam, np.r_[0:n], pennies, cl=1-alpha)  # each draw is its own "item" with upper bound 1
            coverP += ( pLo <= popMean ) & (popMean <= pHi)
            totP += pHi - pLo
        simTable.loc[len(simTable)] =  p, n,\
            str(100*float(coverT)/float(reps)) + '%',\
            str(100*float(coverP)/float(reps)) + '%',\
            str(round(totT/float(reps),4)),\
            str(round(totP/float(reps), 4))
#
ansStr =  '<h3>Simulated coverage probability of Student-t and Continuous Penny Sampling confidence intervals for ' +\
          'mixture of U[0,1] and pointmass at 0 population</h3>' +\
          '<strong>Nominal coverage probability ' + str(100*(1-alpha)) +\
          '%</strong>.<br /><strong>Estimated from ' + str(int(reps)) + ' replications.</strong>'

display(HTML(ansStr))
display(simTable)

Simulated coverage probability of Student-t and Continuous Penny Sampling confidence intervals for mixture of U[0,1] and pointmass at 0 population

Nominal coverage probability 95.0%.
Estimated from 10000 replications.
   mass at 0  sample size  Student-t cov  Penny cov  Student-t len  Penny len
0      0.900           25         90.37%     99.35%         0.6419     0.2070
1      0.900           50         98.89%     98.70%         0.6724     0.1382
2      0.900          100         99.99%     98.41%         0.6817     0.0942
3      0.900          400        100.00%     95.90%         0.6868     0.0451
4      0.990           25         21.96%     99.32%         0.0975     0.1451
5      0.990           50         38.61%     99.82%         0.1253     0.0795
6      0.990          100         62.16%     98.49%         0.1604     0.0447
7      0.990          400         97.98%     98.31%         0.2098     0.0167
8      0.999           25          2.57%     98.67%         0.0106     0.1381
9      0.999           50          4.44%     97.87%         0.0127     0.0719
10     0.999          100          8.94%     99.93%         0.0178     0.0371
11     0.999          400         32.85%     98.16%         0.0357     0.0101

As is the case with the other nonparametric methods, the continuous penny sampling intervals are in some cases shorter on average than the Student-t intervals, yet their simulated coverage never falls below the nominal 95%, while the Student-t coverage can be far below nominal.