Finite-population SPRT for Population Percentage
Contents
Finite-population SPRT for Population Percentage¶
Example of the SPRT without replacement¶
This notebook develops a sequential probability ratio test for the fraction of items labeled “1” in a population of \(N\) items of which \(Np\) are labeled \(1\) and \(N(1-p)\) are labeled “0.”
This is a special case of the result derived in the notebook Wald’s Sequential Probability Ratio Test.
There is a population of \(N\) items. Item \(j\) has “value” \(a_j \in \{0, 1\}\).
Define \(p = \frac{1}{N}\sum_j a_j\) to be the population percentage.
We want to test the hypothesis \(H_0\) that \(p = p_0\) against the alternative hypothesis \(H_1\) that \(p = p_1 \), for some fixed \(p_1 > p_0\).
We will draw items sequentially, without replacement, such that the chance that item \(i\) is selected in draw \(j\), assuming it has not been selected already, is \(1/(N-j+1)\). Let \({\mathcal B_{j-1}}\) be the indices of the items selected up to and including the \(j-1\)st draw, and \({\mathcal B_0} \equiv \emptyset\).
Let \(\mathbb B_j\) denote the index of the item selected at random in the \(j\)th draw.
The chance that the first draw \({\mathbb B_1}\) gives an item with value 1, i.e., \(\Pr \{a_{\mathbb B_1} = 1\}\), is \(\frac{1}{N}\sum_b a_b\). Under \(H_0\), this chance is \(p_{01} = p_0\); under \(H_1\), this chance is \(p_{11} = p_1\).
Given the values of \(\{a_{\mathbb B_k}\}_{k=1}^i\), the conditional probability that the \(i\)th draw gives an item with value 1 is
Under \(H_0\), this chance is
Under \(H_1\), this chance is
Let \(X_i\) be the indicator of the event that the \(i\)th draw gives an item with value \(1\), i.e., the indicator of the event \(a_{\mathbb B_i} = 1\). The likelihood ratio for a given sequence \(\{X_k\}_{k=1}^i\) is
This can be simplified: \(p_{0k}\) and \(p_{1k}\) have the same denominator, \(N - k + 1\), and their numerators share a term. Define \(A(k) \equiv \sum_{b \in {\mathcal B_{k-1}}}\). Then
where the products are defined to be infinite if any denominator vanishes (or is negative).
If \(H_0\) is true, the chance that \(\mbox{LR}\) is ever greater than \(1/\alpha\) is at most \(\alpha\).
# This is the first cell with code: set up the Python environment
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import numpy.random
import scipy as sp
import scipy.stats
# For interactive widgets
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
np.random.seed(1234567890) # set seed for reproducibility
def LRFromTrials(trials, N, p0, p1):
'''
Finds the sequence of likelihood ratios for the hypothesis that the population
percentage is p1 to the hypothesis that it is p0, for sampling without replacement
from a population of size N.
'''
A = np.cumsum(np.insert(trials, 0, 0)) # so that cumsum does the right thing
terms = np.ones(N)
for k in range(len(trials)):
if trials[k] == 1.0:
if (N*p0 - A[k]) > 0:
terms[k] = np.max([N*p1 - A[k], 0])/(N*p0 - A[k])
else:
terms[k] = np.inf
else:
if (N*(1-p0) - k + 1 + A[k]) > 0:
terms[k] = np.max([(N*(1-p1) - k + 1 + A[k]), 0])/(N*(1-p0) - k + 1 + A[k])
else:
terms[k] = np.inf
return(np.cumprod(terms))
def plotBernoulliSPRT(N, p, p0, p1, alpha):
'''
Plots the progress of a one-sided SPRT for N dependent Bernoulli trials
in sampling without replacement from a population of size N with a
fraction p of items labeled "1," for testing the hypothesis that p=p0
against the hypothesis p=p1 at significance level alpha
'''
plt.clf()
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
trials = np.zeros(N)
nOnes = int(math.floor(N*p))
trials[0:nOnes] = np.ones(nOnes)
np.random.shuffle(trials) # items are in random order
LRs = np.nan_to_num(LRFromTrials(trials, N, p0, p1))
logLRs = np.nan_to_num(np.log(LRs))
LRs[LRs > 10**6] = 10**6 # avoid plot overflow
logLRs[logLRs > 10**6] = 10**6 # avoid plot overflow
#
ax[0].plot(range(N),LRs, color='b')
ax[0].axhline(y=1/alpha, xmin=0, xmax=N, color='g', label=r'$1/\alpha$')
ax[0].set_title('Simulation of Wald\'s SPRT for population percentage, w/o replacement')
ax[0].set_ylabel('LR')
ax[0].legend(loc='best')
#
ax[1].plot(range(N),logLRs, color='b', linestyle='--')
ax[1].axhline(y=math.log(1/alpha), xmin=0, xmax=N, color='g', label=r'$log(1/\alpha)$')
ax[1].set_ylabel('log(LR)')
ax[1].set_xlabel('trials')
ax[1].legend(loc='best')
plt.show()
interact(plotBernoulliSPRT,\
N=widgets.IntSlider(min=500, max=50000, step=500, value=5000),\
p=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.51),\
p0=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.5),\
p1=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.51),\
alpha=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.05)
)
<function __main__.plotBernoulliSPRT(N, p, p0, p1, alpha)>
Simulate the distribution of sample sizes needed to reject¶
alpha = 0.05 # significance level
reps = int(10**4) # number of replications
tot = 150000
req = 125000
alt = 140000
actual = 130000
p, p0, p1 = [actual/tot, req/tot, alt/tot] # need p > p0 or might never reject
N = 150000 # population size
dist = np.zeros(reps) # allocate space for the results
trials = np.zeros(N)
nOnes = int(math.floor(N*p))
trials[0:nOnes] = np.ones(nOnes) # trials now contains math.floor(n*p) ones
for i in np.arange(reps):
np.random.shuffle(trials) # items are in random order
LRs = np.array(LRFromTrials(trials, N, p0, p1)) # likelihood ratios for this realization
dist[i] = np.argmax(LRs >= 1/alpha) # trials at which threshold is crossed
# report mean, median, and 90th percentile
print(np.mean(dist), np.percentile(dist, [50, 90]))
36.6912 [ 0. 117.]