Wald’s Sequential Probability Ratio Test

Sequential test: make a decision after each draw from the population

Sequence of data \(X_1, X_2, \ldots\).

Two hypotheses, \(H_0\) and \(H_1\), each of which completely specifies the joint distribution of the data.

Assume that the joint distributions under \(H_0\) and \(H_1\) have densities with respect to the same dominating measure.

Let \(f_{0m}\) be the likelihood of \(H_0\) for data \((X_j)_{j=1}^m\) and let \(f_{1m}\) be the likelihood of \(H_1\) for data \((X_j)_{j=1}^m\).

The likelihood ratio of \(H_1\) to \(H_0\) is \(f_{1m}/f_{0m}\); this is (speaking informally) the probability of observing \(X_1, \ldots, X_m\) if \(H_1\) is true, divided by the probability of observing \(X_1, \ldots, X_m\) if \(H_0\) is true.

The probability of observing the data actually observed will tend to be higher for whichever hypothesis is in fact true, so this likelihood ratio will tend to be greater than \(1\) if \(H_1\) is true, and will tend to be less than \(1\) if \(H_0\) is true. The more observations we make, the more probable it is that the resulting likelihood ratio will be small if \(H_0\) is true. Wald (1945) showed that if \(H_0\) is true, then the probability is at most \(\alpha\) that the likelihood ratio is ever greater than \(1/\alpha\), no matter how many observations are made. More generally, we have:

Wald's SPRT

Theorem

For any \(\alpha \in (0, 1)\) and \(\beta \in [0, 1)\), the following sequential algorithm tests the hypothesis \(H_0\) at level no larger than \(\alpha\) and with power at least \(1-\beta\) against the alternative \(H_1\).

Set \(m=0\).

  1. Increment \(m\)

  2. If \(\frac{f_{1m}}{f_{0m}} \ge \frac{1-\beta}{\alpha}\), stop and reject \(H_0\).

  3. If \(\frac{f_{1m}}{f_{0m}} \le \frac{\beta}{1-\alpha}\), stop and do not reject \(H_0\).

  4. If \(\frac{\beta}{1-\alpha} < \frac{f_{1m}}{f_{0m}} < \frac{1-\beta}{\alpha}\), go to step 1.
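The loop is short enough to express directly in code. Below is a minimal sketch (not part of the original algorithm statement); `draw_next` and `log_lr_term` are hypothetical callables supplying the next observation and the increment it contributes to the log likelihood ratio, given the data so far (so non-IID models also fit).

import math

def sprt(draw_next, log_lr_term, alpha, beta, max_draws=10**6):
    '''
    Sketch of Wald's SPRT: reject H_0 when LR >= (1-beta)/alpha,
    accept H_0 when LR <= beta/(1-alpha), otherwise keep drawing.
    draw_next() returns the next observation; log_lr_term(x, data) returns the
    increment to log(f_1m/f_0m) from observation x, given all data drawn so far.
    Returns (decision, number of draws, final log LR); decision is None if max_draws is reached.
    '''
    log_upper = math.log((1 - beta)/alpha)
    log_lower = math.log(beta/(1 - alpha)) if beta > 0 else -math.inf
    log_lr = 0.0
    data = []
    for m in range(1, max_draws + 1):
        x = draw_next()
        data.append(x)
        log_lr += log_lr_term(x, data)
        if log_lr >= log_upper:
            return 'reject H_0', m, log_lr
        if log_lr <= log_lower:
            return 'do not reject H_0', m, log_lr
    return None, max_draws, log_lr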

SPRT miracle

Don’t need to know the distribution of the likelihood ratio \(\mbox{LR}=\frac{f_{1m}}{f_{0m}}\) under the null hypothesis to find the critical values for the test.

Connection to Gambler’s ruin

For IID data, the likelihood ratio is a product of terms. On a log scale, it’s a sum. Each “trial” produces another term in the sum, positive or negative. But—unlike the classical Gambler’s Ruin problem in which the game is fair—the terms are not equal in magnitude: the steps are not all the same size.

Likelihood ratio for two values of \(p\) in IID Bernoulli trials

Suppose \(X_1, X_2, \ldots,\) are IID \(\mbox{Bernoulli}(p)\) random variables and let \(p_1 > p_0\).

Set \(\mbox{LR} \leftarrow 1\) and \(j \leftarrow 0\).

  • Increment \(j\)

  • If \(X_j = 1\), \(\mbox{LR} \leftarrow \mbox{LR} \times p_1/p_0\).

  • If \(X_j = 0\), \(\mbox{LR} \leftarrow \mbox{LR} \times (1-p_1)/(1- p_0)\).

What’s \(\mbox{LR}\) at stage \(m\)? Let \(T_m \equiv \sum_{j=1}^m X_j\).

\[\begin{equation*} \frac{p_{1m}}{p_{0m}} \equiv \frac{p_1^{T_m}(1-p_1)^{m-T_m}}{p_0^{T_m}(1-p_0)^{m-T_m}}. \end{equation*}\]

This is the ratio of binomial probability when \(p = p_1\) to binomial probability when \(p = p_0\) (the binomial coefficients in the numerator and denominator cancel). It simplifies further to

\[\begin{equation*} \frac{p_{1m}}{p_{0m}} = (p_1/p_0)^{T_m} ((1-p_1)/(1-p_0))^{m-T_m}. \end{equation*}\]
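As a quick sanity check (illustrative code, not part of the original notes; the function name `bernoulli_LR` is ours), the running product computed by the update rule above agrees with this closed form:

import math

def bernoulli_LR(x, p0, p1):
    '''
    Likelihood ratio of p=p1 to p=p0 for Bernoulli data x (a sequence of 0s and 1s),
    from the closed form (p1/p0)^T_m ((1-p1)/(1-p0))^(m-T_m).
    '''
    Tm = sum(x)
    m = len(x)
    return (p1/p0)**Tm * ((1 - p1)/(1 - p0))**(m - Tm)

# compare with the iterative update for a small example
x = [1, 0, 1, 1, 0]
LR = 1.0
for xj in x:
    LR *= (0.6/0.5) if xj == 1 else (0.4/0.5)
assert math.isclose(LR, bernoulli_LR(x, p0=0.5, p1=0.6))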

Wald's SPRT for \(p\) in iid Bernoulli trials

Conclude \(p > p_0\) if

\[\begin{equation*} \frac{p_{1m}}{p_{0m}} \ge \frac{1-\beta}{\alpha}. \end{equation*}\]

Conclude \(p \le p_0\) if

\[\begin{equation*} \frac{p_{1m}}{p_{0m}} \le \frac{\beta}{1-\alpha}. \end{equation*}\]

Otherwise, draw again.

The SPRT approximately minimizes the expected sample size when \(p \le p_0\) or \(p > p_1\). For values in \((p_0, p_1)\), it can have larger sample sizes than fixed-sample-size tests.
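Putting the pieces together, here is a minimal sketch of the resulting sequential test (illustrative; the function name and return values are ours):

def sprt_bernoulli(x, p0, p1, alpha, beta):
    '''
    Wald's SPRT for iid Bernoulli trials: test p=p0 against p=p1 > p0,
    processing the data x (a sequence of 0s and 1s) one trial at a time.
    Returns ('p > p0', m) or ('p <= p0', m) if a boundary is crossed at trial m,
    and (None, len(x)) if the data run out first.
    '''
    upper = (1 - beta)/alpha
    lower = beta/(1 - alpha)
    LR = 1.0
    for m, xj in enumerate(x, start=1):
        LR *= p1/p0 if xj == 1 else (1 - p1)/(1 - p0)
        if LR >= upper:
            return 'p > p0', m
        if LR <= lower:
            return 'p <= p0', m
    return None, len(x)

# e.g., sprt_bernoulli([1, 1, 0, 1, 1, 1, 1, 0, 1, 1], p0=0.5, p1=0.6, alpha=0.05, beta=0.05)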

Simulation of SPRT for Binomial

Let’s watch the SPRT in action for iid Bernoulli trials.

# This is the first cell with code: set up the Python environment
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
# For interactive widgets
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
def plotBinomialSPRT(n, p, p0, p1, alpha, beta):
    '''
       Plots the progress of the SPRT for n iid Bernoulli trials with probability p
       of success, for testing the hypothesis that p=p0 against the hypothesis p=p1
       with significance level alpha and power 1-beta
    '''
    fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
    trials = sp.stats.binom.rvs(1, p, size=n+1)  # leave room to start at 1
    terms = np.ones(n+1)
    sfac = p1/p0
    ffac = (1.0-p1)/(1.0-p0)
    terms[trials == 1.0] = sfac
    terms[trials == 0.0] = ffac
    terms[0] = 1.0
    logterms = np.log(terms)
    #
    ax[0].plot(range(n+1),np.cumprod(terms), color='b')
    ax[0].axhline(y=(1-beta)/alpha, color='g', label=r'$(1-\beta)/\alpha$')
    ax[0].axhline(y=beta/(1-alpha), color='r', label=r'$\beta/(1-\alpha)$')
    ax[0].set_title('Simulation of Wald\'s SPRT')
    ax[0].set_ylabel('LR')
    ax[0].legend(loc='best')
    #
    ax[1].plot(range(n+1),np.cumsum(logterms), color='b', linestyle='--')
    ax[1].axhline(y=math.log((1-beta)/alpha), color='g', label=r'$\log((1-\beta)/\alpha)$')
    ax[1].axhline(y=math.log(beta/(1-alpha)), color='r', label=r'$\log(\beta/(1-\alpha))$')
    ax[1].set_ylabel('log(LR)')
    ax[1].set_xlabel('trials')
    ax[1].legend(loc='best')
    plt.show()


interact(plotBinomialSPRT, n=widgets.IntSlider(min=5, max=300, step=5, value=100),\
         p=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.45),\
         p0=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.5),\
         p1=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.6),\
         alpha=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.05),\
         beta=widgets.FloatSlider(min=0.001, max=1, step=0.01, value=.05)
         )

Behavior

For \(p_0 < p_1\),

  • when \(p \ge p_1\), the likelihood ratio is likely to cross the upper (green) line eventually and unlikely to cross the lower (red) line.

  • when \(p \le p_0\), the likelihood ratio is likely to cross the lower (red) line eventually and unlikely to cross the upper (green) line.

  • the SPRT approximately minimizes the expected number of trials before rejecting one or the other hypothesis, provided \(p \notin (p_0, p_1)\).

For \(p_1 < p_0\), the directions are reversed.

It works for \(\beta = 0\) too: \(P\)-values

The inequalities hold when \(\beta = 0\) also. Then the likelihood ratio has chance at most \(\alpha\) of ever being greater than \(1/\alpha\), if in fact \(p \le p_0\).

Hence, \(\min(1, 1/\mbox{LR})\) is a conservative \(P\)-value for the hypothesis \(p \le p_0\). This can be used to construct one-sided confidence bounds for \(p\) and other parameters. The next chapter does exactly that, to find a lower confidence bound for the mean of a nonnegative population.
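In code (an illustrative sketch; the function name is ours), the \(P\)-value is just the truncated reciprocal of the largest likelihood ratio seen so far:

def sprt_p_value(x, p0, p1):
    '''
    Conservative sequential P-value for the hypothesis p <= p0 from iid Bernoulli data x,
    using the simple alternative p = p1 > p0. Because the LR exceeds 1/alpha with
    probability at most alpha under the null, min(1, 1/max_m LR_m) is a valid P-value.
    '''
    LR = 1.0
    max_LR = 1.0
    for xj in x:
        LR *= p1/p0 if xj == 1 else (1 - p1)/(1 - p0)
        max_LR = max(max_LR, LR)
    return min(1.0, 1.0/max_LR)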

Dependent observations

As an illustration of the SPRT for observations that are not iid, consider the following problem.

There is a population of \(B\) items. Item \(b\) has “weight” \(N_b\) and “value” \(a_b \in \{0, 1\}\). Let \(N = \sum_{b=1}^B N_b\). (You might think of items as manufacturing lots, where \(a_b = 1\) means the lot is acceptable and \(a_b=0\) means the lot is defective.)

We want to test the hypothesis \(H_0\) that \(\frac{1}{N}\sum_b N_b a_b = 1/2\) against the alternative hypothesis \(H_1\) that \(\frac{1}{N}\sum_b N_b a_b = \gamma \), for some fixed \(\gamma > 1/2\).

We will draw items sequentially, without replacement, such that the chance that item \(b\) is selected in draw \(i\), assuming it has not been selected already, is \(N_b/\sum_{j \notin {\mathcal B_{i-1}}} N_j\), where \({\mathcal B_{i-1}}\) is the sample up to and including the \(i-1\)st draw, and \({\mathcal B_0} \equiv \emptyset\). That is, the chance of selecting an item is in proportion to its weight among the items that have not yet been selected.

Let \(\mathbb B_i\) denote the item selected at random in such a manner in the \(i\)th draw.

The chance that the first draw \({\mathbb B_1}\) gives an item with value 1, i.e., \(\Pr \{a_{\mathbb B_1} = 1\}\), is \(\frac{1}{N}\sum_b N_b a_b\). Under \(H_0\), this chance is \(p_{01} = 1/2\); under \(H_1\), this chance is \(p_{11} = \gamma\).

Given the values of \(\{a_{\mathbb B_k}\}_{k=1}^{i-1}\), the conditional probability that the \(i\)th draw gives an item with value 1 is

\[\begin{equation*} \Pr \{a_{\mathbb B_i} = 1 | {\mathcal B_{i-1}} \} = \frac{ \sum_{b \notin {\mathcal B_{i-1}}} N_b a_b}{\sum_{b \notin {\mathcal B_{i-1}}} N_b}. \end{equation*}\]

Under \(H_0\), this chance is

\[\begin{equation*} p_{0i} = \frac{N/2 - \sum_{b \in {\mathcal B_{i-1}}} N_b a_b}{N - \sum_{b \in {\mathcal B_{i-1}}} N_b}. \end{equation*}\]

Under \(H_1\), this chance is

\[\begin{equation*} p_{1i} = \frac{N \gamma - \sum_{b \in {\mathcal B_{i-1}}} N_b a_b}{N - \sum_{b \in {\mathcal B_{i-1}}} N_b}. \end{equation*}\]

Let \(X_k\) be the indicator of the event that the \(k\)th draw gives an item with value \(1\), i.e., the indicator of the event \(a_{\mathbb B_k} = 1\). The likelihood ratio for a given sequence \(\{X_k\}_{k=1}^i\) is

\[\begin{equation*} \mbox{LR} = \frac{\prod_{k=1}^i p_{1k}^{X_k}(1-p_{1k})^{1-X_k}} {\prod_{k=1}^i p_{0k}^{X_k}(1-p_{0k})^{1-X_k}}. \end{equation*}\]

This can be simplified: \(p_{0k}\) and \(p_{1k}\) have the same denominator, \(N - \sum_{b \in {\mathcal B_{k-1}}} N_b\), and their numerators share a term. Define \(N(k) \equiv \sum_{j = 1}^{k-1} N_{\mathbb B_j}\) and \(A(k) \equiv \sum_{j = 1}^{k-1} N_{\mathbb B_j} a_{\mathbb B_j}\). Then

\[\begin{equation*} \mbox{LR} = \prod_{k=1}^i \left ( \frac{N\gamma - A(k)}{N/2 - A(k)} \right )^{X_k} \left ( \frac{N-N\gamma - (N(k) - A(k))}{N-N/2 - (N(k)-A(k))} \right )^{1-X_k} \end{equation*}\]
\[\begin{equation*} = \prod_{k=1}^i \left ( \frac{N\gamma - A(k)}{N/2 - A(k)} \right )^{X_k} \left ( \frac{N(1-\gamma) - N(k) + A(k)}{N/2 - N(k) + A(k)} \right )^{1-X_k}, \end{equation*}\]

where the product is defined to be infinite if any denominator vanishes.

If \(H_0\) is true, the chance that \(\mbox{LR}\) is ever greater than \(1/\alpha\) is at most \(\alpha\).
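A sketch of the likelihood-ratio computation for this sampling scheme (illustrative; the function and variable names are ours). The arguments `weights` and `values` give \(N_b\) and \(a_b\) for the items in the order drawn:

import math

def lr_weighted_without_replacement(weights, values, N, gamma):
    '''
    Likelihood ratio of H_1: (1/N) sum_b N_b a_b = gamma to H_0: (1/N) sum_b N_b a_b = 1/2,
    for items drawn sequentially without replacement with probability proportional to weight.
    weights[k-1], values[k-1] are N_b and a_b for the item selected on draw k.
    Returns math.inf if a null conditional probability vanishes (the product is defined
    to be infinite in that case); does not guard against data impossible under H_1.
    '''
    LR = 1.0
    N_k = 0.0   # N(k): total weight of the items drawn before draw k
    A_k = 0.0   # A(k): total weight of the value-1 items drawn before draw k
    for Nb, ab in zip(weights, values):
        remaining = N - N_k                # total weight of items not yet drawn
        p0 = (N/2 - A_k)/remaining         # chance of a value-1 item on this draw under H_0
        p1 = (N*gamma - A_k)/remaining     # chance under H_1
        if ab == 1:
            if p0 <= 0:
                return math.inf
            LR *= p1/p0
        else:
            if p0 >= 1:
                return math.inf
            LR *= (1 - p1)/(1 - p0)
        N_k += Nb
        A_k += Nb*ab
    return LR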

Nuisance parameters

This example arises in constructing risk-limiting audits using “ballot polling,” which involves selecting individual cast ballots and reading voter intent from them manually.

Consider a population of \(N\) items, each of which has one of three possible labels, \(\{w, \ell, u\}\). Let \(N_w\) denote the number of items labeled \(w\), \(N_\ell\) denote the number labeled \(\ell\), and \(N_u\) denote the number labeled \(u\). We want to test the compound hypothesis that \(N_\ell \ge N_w\) against the alternative that \(N_w = V_w\), \(N_\ell = V_\ell\), and \(N_u = V_u\), with \(V_w > V_\ell\).

In election auditing, the items are ballots and the types are votes: \(w\) means the ballot shows a vote for a particular reported winner \(w\) and not for a particular reported loser \(\ell\); \(\ell\) means a ballot with a vote for \(\ell\) but not for \(w\); and \(u\) means the ballot shows an undervote, an invalid vote, or a vote for both \(w\) and \(\ell\). The values \(V_w\), \(V_\ell\), and \(V_u\) are the reported results (or related values); the values \(N_w\), \(N_\ell\), and \(N_u\) are what a full manual tally of the paper ballots would show. Reported winner \(w\) really beat reported loser \(\ell\) if \(N_w > N_\ell\). The null hypothesis \(N_\ell \ge N_w\) is that the reported outcome is wrong: the outcome was a tie or \(\ell\) beat \(w\).

In this problem, \(N_u\) (equivalently, \(N_w + N_\ell\)) is a nuisance parameter: we care about \(N_\ell - N_w\).

Suppose we sample sequentially without replacement from the \(N\) items in such a way that in every draw, each of the unselected items is equally likely to be selected. Let \(X_k\) be the label of the item selected on the \(k\)th draw; let \(W_n \equiv \sum_{k=1}^n 1_{X_k = w}\); and define \(L_n\) and \(U_n\) analogously.

The probability of a given data sequence \(X_1, \ldots, X_n\) under the alternative hypothesis is

\[\begin{equation*} \frac{V_w(V_w - 1) \cdots (V_w - W_n+1) \times V_\ell(V_\ell - 1) \cdots (V_\ell - L_n+1) \times V_u(V_u - 1) \cdots (V_u - U_n+1)}{N(N-1) \cdots (N-n+1)}. \end{equation*}\]

If \(L_n \ge W_n\), the data obviously do not provide evidence against the null, so we suppose that \(L_n < W_n\), in which case, the element of the null that will maximize the probability of the observed data has a tie: \(N_w = N_\ell\). Under the null hypothesis, the probability of \(X_1, \ldots, X_n\) is

\[\begin{equation*} \frac{N_w(N_w - 1) \cdots (N_w - W_n+1) \times N_w(N_w - 1) \cdots (N_w - L_n+1) \times N_u(N_u - 1) \cdots (N_u - U_n+1)}{N(N-1) \cdots (N-n+1)}, \end{equation*}\]

for some value \(N_w\) and the corresponding \(N_u = N - 2N_w\).

How large can that probability be under the null? Certainly no larger than the likelihood of the maximum likelihood estimate of \(N_w\) when \(N_w\) is constrained to equal \(N_\ell\). We thus have a one-dimensional optimization problem. The probability under the null is maximized by any integer value \(x \in \{ \max(W_n, L_n), \ldots, \lfloor (N-U_n)/2 \rfloor \}\) that maximizes

\[\begin{equation*} x(x - 1) \cdots (x - W_n+1) \times x(x - 1) \cdots (x - L_n+1) \times (N-2x)(N-2x - 1) \cdots (N-2x - U_n+1).\end{equation*}\]

The logarithm is monotonic, so any maximizer, \(x^*\), also maximizes

\[\begin{equation*} f(x) = \sum_{i=0}^{W_n-1} \ln (x-i) + \sum_{i=0}^{L_n-1} \ln (x-i) + \sum_{i=0}^{U_n-1} \ln (N-2x-i).\end{equation*}\]

If we treat \(x\) as a continuous variable, we can differentiate \(f\) with respect to \(x\). The second derivative is negative, so the function is concave, and a local maximum is a global maximum. The optimal integer value of \(x\) is then the floor or the ceiling of the real stationary point (or an endpoint of the feasible range), whichever gives the larger value of \(f\).

A conservative \(P\)-value for the null hypothesis after \(n\) items have been drawn is thus

\[\begin{equation*} \frac{\prod_{i=0}^{W_n-1} (x^*-i) \; \prod_{i=0}^{L_n-1} (x^*-i) \; \prod_{i=0}^{U_n-1} (N-2x^*-i)}{\prod_{i=0}^{W_n-1}(V_w-i) \; \prod_{i=0}^{L_n-1} (V_\ell-i) \; \prod_{i=0}^{U_n-1} (V_u-i)}. \end{equation*}\]

An alternative approach that avoids the need to maximize over a nuisance parameter is given in SHANGRLA: Sets of Half-Average Nulls Generate Risk-Limiting Audits (Stark, 2020). In this example, it would encode a vote for \(w\) as \(1\), a \(u\) ballot as \(1/2\), and a vote for \(\ell\) as \(0\). That yields a finite list of values between \(0\) and \(1\); if the mean of that list is greater than \(1/2\), \(w\) really got more votes than \(\ell\). This transformation allows any method for testing whether the mean of a finite, bounded population is less than or equal to \(1/2\) to be used. See also the Alpha notebook.
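For instance, a sketch of that encoding (the names are ours):

# encode each ballot as a number in [0, 1]: w -> 1, u -> 1/2, ell -> 0
encode = {'w': 1.0, 'u': 0.5, 'l': 0.0}
ballots = ['w', 'l', 'u', 'w', 'w']                  # a hypothetical population of ballots
encoded = [encode[b] for b in ballots]
# w really beat ell if and only if the mean of the encoded population exceeds 1/2
w_really_won = sum(encoded)/len(encoded) > 0.5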

## TO DO: implement stationary point calculation

def f_plus(x, Wn, Ln):
    '''Terms of f(x) that increase with x: sum of log(x-i) for the w and ell draws.'''
    return sum([math.log(x-i) for i in range(0, Wn)]) + sum([math.log(x-i) for i in range(0, Ln)])

def f_minus(x, Un, N):
    '''Terms of f(x) that decrease with x: sum of log(N-2x-i) for the u draws.'''
    return sum([math.log(N-2*x-i) for i in range(0, Un)])

def f_all(x, Wn, Ln, Un, N):
    '''Log of the (unnormalized) null probability of the data, as a function of Nw = Nl = x.'''
    return f_plus(x, Wn, Ln) + f_minus(x, Un, N)

def feasible_range(Wn, Ln, Un, N):
    '''
    Returns the smallest and largest integer values of Nw consistent with the data,
    under the constraint Nw = Nl.
    '''
    return max(Wn, Ln), (N - Un)//2
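One way to complete the TO DO above, as a sketch: rather than solving for the stationary point in closed form, this maximizes the continuous relaxation numerically with scipy.optimize.minimize_scalar and then checks the neighboring integers. The function names max_null_Nw and conservative_p_value are ours.

import math
from scipy.optimize import minimize_scalar

def max_null_Nw(Wn, Ln, Un, N):
    '''
    Integer value of Nw (= Nl) in the feasible range that maximizes f_all, i.e.,
    the element of the null that maximizes the probability of the observed data.
    '''
    lo, hi = feasible_range(Wn, Ln, Un, N)
    if hi <= lo:
        return lo                           # degenerate case: at most one feasible value
    # f_all is concave in x, so maximize the continuous relaxation, then round
    res = minimize_scalar(lambda x: -f_all(x, Wn, Ln, Un, N), bounds=(lo, hi), method='bounded')
    candidates = {lo, hi, math.floor(res.x), math.ceil(res.x)}
    candidates = [x for x in candidates if lo <= x <= hi]
    return max(candidates, key=lambda x: f_all(x, Wn, Ln, Un, N))

def conservative_p_value(Wn, Ln, Un, N, Vw, Vl, Vu):
    '''
    Conservative P-value for the null N_l >= N_w against the alternative (Vw, Vl, Vu):
    the maximized null probability of the data divided by its probability under the alternative.
    '''
    x_star = max_null_Nw(Wn, Ln, Un, N)
    log_null = f_all(x_star, Wn, Ln, Un, N)
    log_alt = (sum(math.log(Vw - i) for i in range(Wn)) +
               sum(math.log(Vl - i) for i in range(Ln)) +
               sum(math.log(Vu - i) for i in range(Un)))
    return min(1.0, math.exp(log_null - log_alt))

# e.g., conservative_p_value(Wn=60, Ln=40, Un=10, N=10000, Vw=5500, Vl=4000, Vu=500)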