Introduction to Risk-Limiting Election Audits

Reading

  1. Lindeman, M. and P.B. Stark, 2012. A Gentle Introduction to Risk-Limiting Audits. IEEE Security and Privacy, 10, 42–49. Preprint: https://www.stat.berkeley.edu/~stark/Preprints/gentle12.pdf

  2. Stark, P.B., and D.A. Wagner, 2012. Evidence-Based Elections. IEEE Security and Privacy, 10, 33–41. Preprint: https://www.stat.berkeley.edu/~stark/Preprints/evidenceVote12.pdf

  3. Rivest, R.L. and E. Shen, 2012. A Bayesian Method for Auditing Elections, EVT/WOTE, https://www.usenix.org/system/files/conference/evtwote12/evtwote12-final30.pdf

Relevant code

  1. https://github.com/ron-rivest/audit-lab

  2. https://github.com/pbstark/auditTools

    1. https://www.stat.berkeley.edu/~stark/Java/Html/auditTools.htm

    2. https://www.stat.berkeley.edu/~stark/Java/Html/ballotPollTools.htm

  3. https://github.com/pbstark/DKDHondt14

“Super-simple simultaneous” comparison audits for plurality contests:

The Kaplan-Markov Inequality, Wald’s Sequential Probability Ratio Test, and the Kaplan-Wald Inequality

This section follows https://www.usenix.org/legacy/event/evtwote10/tech/full_papers/Stark.pdf closely, but derives something slightly more general.

Notation

  • \(C\) contests under audit

  • \(N\) ballots were cast in all. (There might not be any contest that appears on all \(N\) ballots)

  • Contest \(c\) appears on \(N_c\) of the \(N\) cast ballots. (\(N\) and \(\{N_c\}_{c=1}^C\) are known.)

  • \(\mathcal{W}_c\): the set of reported winners of contest \(c\).

  • \(\mathcal{L}_c\): the set of reported losers of contest \(c\).

  • \(v_{pi} \in \{0, 1\}\): the reported vote for candidate \(i\) on ballot \(p\)

  • \(a_{pi} \in \{0, 1\}\): the actual vote for candidate \(i\) on ballot \(p\). If contest \(c\) does not appear on ballot \(p\) then \(a_{pi} = 0\).

  • \(V_{w\ell} \equiv \sum_{p=1}^N (v_{pw} - v_{p\ell}) > 0\): Reported margin of reported winner \(w \in \mathcal{W}_c\) over reported loser \(\ell \in \mathcal{L}_c\) in contest \(c\).

  • \(V\): smallest reported margin among all \(C\) contests: \(V \equiv \min_c \min_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} V_{w \ell}\)

  • \(\mu = V/N\): the “diluted margin,” the margin in votes divided by the total number of ballots

  • \(A_{w\ell} \equiv \sum_{p=1}^N (a_{pw} - a_{p\ell})\): actual margin of reported winner \(w \in \mathcal{W}_c\) over reported loser \(\ell \in \mathcal{L}_c\) in contest \(c\)

The reported winners of all \(C\) contests are the actual winners of those contests if

\[\begin{equation*} \min_c \min_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} A_{w\ell} > 0. \end{equation*}\]

Otherwise, at least one reported electoral outcome is wrong. (We aren’t particularly worried about whether the votes are counted perfectly—they almost never are—just whether the errors were large enough to make a loser appear to be a winner, or vice versa.)
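To make the notation concrete, here is a small running example in Python (entirely made-up reported and actual votes, not drawn from any real election) that computes the pairwise reported margins \(V_{w\ell}\), the smallest margin \(V\), the diluted margin \(\mu\), and the actual margins \(A_{w\ell}\) for a single plurality contest; later sketches continue this example.

import numpy as np

# A small running example (made-up data): one plurality contest, 3 candidates, 10 ballots.
# Row p is ballot p; column i is candidate i; entries are 0/1 votes.
v = np.array([[1, 0, 0]] * 5 + [[0, 1, 0]] * 3 + [[0, 0, 1]] + [[0, 0, 0]])  # reported (CVR)
a = v.copy()
a[9] = [0, 1, 0]    # the audit finds the undervoted ballot was really a vote for candidate 1

N = v.shape[0]
winners, losers = [0], [1, 2]         # candidate 0 is the reported winner
V_wl = {(w, l): int((v[:, w] - v[:, l]).sum()) for w in winners for l in losers}
A_wl = {(w, l): int((a[:, w] - a[:, l]).sum()) for w in winners for l in losers}
V = min(V_wl.values())                # smallest reported pairwise margin: 2 votes
mu = V / N                            # diluted margin: 0.2
outcome_correct = min(A_wl.values()) > 0   # True: the error does not change the winner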

We won’t test that inequality directly. Instead, we will test a condition that is sufficient but not necessary for the inequality to hold, to get a computationally simple test that is still conservative (the simultaneous risk is below its nominal value).

One such reduction relies on the maximum across-contest relative overstatement (MACRO), which we will now develop.

The reported winners of contest \(c\) are correct if for every pair \(w \in \mathcal{W}_c\), \(\ell \in \mathcal{L}_c\),

\[\begin{equation*} \sum_{p=1}^N (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell} < 1. \end{equation*}\]

This condition says that the error in \(V_{w\ell}\) is less than \(V_{w\ell}\). The reported winners of all the contests are correct if for every contest \(c\), the previous relation holds, i.e., if

\[\begin{equation*} \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \sum_{p=1}^N (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell} < 1. \end{equation*}\]

Now the maximum (over \(c\) and, for each contest \(c\), over all winner, loser pairs in that contest) of sums is not larger than the sum of maxima; that is,

\[\begin{equation*} \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \sum_{p=1}^N (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell} \le \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell}. \end{equation*}\]

Hence, if

\[\begin{equation*} \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell} < 1, \end{equation*}\]

all the reported outcomes must be correct. Define

\[\begin{equation*} e_p \equiv \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell}. \end{equation*}\]

Then the reported outcomes of all the contests must be correct if

\[\begin{equation*} E \equiv \sum_{p=1}^N e_p < 1. \end{equation*}\]

To see this a different way, suppose that the outcome of one or more contests is wrong. Then there is some contest \(c\) and some reported (winner, loser) pair \(w \in \mathcal{W}_c, \ell \in \mathcal{L}_c\) for which

\[\begin{equation*} 0 \ge A_{w\ell} = V_{w\ell} - (V_{w\ell} - A_{w\ell}) = V_{w\ell} - \sum_{p=1}^N (v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}), \end{equation*}\]

i.e.,

\[\begin{equation*} \sum_{p=1}^N (v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}) \ge V_{w\ell}. \end{equation*}\]

Dividing both sides by \(V_{w\ell}\) gives

\[\begin{equation*} \sum_{p=1}^N \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \ge 1. \end{equation*}\]

But

\[\begin{equation*} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \end{equation*}\]
\[\begin{equation*} \le \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} = e_p, \end{equation*}\]

so if the outcome is wrong, \(E = \sum_p e_p \ge 1\). Thus a risk-limiting audit can rely on testing whether \(E \ge 1\). If the hypothesis \(E \ge 1\) can be rejected at significance level \(\alpha\), we can conclude that all the reported outcomes are correct.
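Continuing the running example above (an illustrative sketch, not part of the derivation), the per-ballot maximum relative overstatements \(e_p\) and their total \(E\) can be computed directly when both the reported and the actual votes are known:

# Continuing the running example: per-ballot maximum relative overstatement e_p, and E.
e = np.array([max((v[p, w] - a[p, w] - v[p, l] + a[p, l]) / V_wl[(w, l)]
                  for w in winners for l in losers)
              for p in range(N)])
E = e.sum()   # here E = 0.5 < 1, consistent with all reported outcomes being correct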

Testing whether \(E \ge 1\) would require a very large sample if we knew nothing at all about \(e_p\) without auditing ballot \(p\): a single large value of \(e_p\) could make \(E\) arbitrarily large. Fortunately, there is an a priori upper bound for \(e_p\). Whatever the reported votes \(v_{pi}\) are on ballot \(p\), we can find the potential values of the actual votes \(a_{pi}\) that would make the error \(e_p\) largest, because \(a_{pi}\) can only be zero or one:

\[\begin{equation*} \frac{v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le \frac{v_{pw}- 0 - v_{p\ell} + 1}{V_{w\ell}}. \end{equation*}\]

Hence,

\[\begin{equation*} e_p \le \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - v_{p\ell} + 1}{V_{w\ell}} \equiv \tilde{u}_p. \end{equation*}\]

Knowing that \(e_p \le \tilde{u}_p\) might let us conclude reliably that \(E < 1\) by examining only a small fraction of the ballots–depending on the values \(\{ \tilde{u}_p\}_{p=1}^N\) and on the values of \(\{e_p\}\) for the audited ballots.

To make inferences about \(E\), it is helpful to work with the taint \(t_p \equiv \frac{e_p}{\tilde{u}_p} \le 1\). Define \(\tilde{U} \equiv \sum_{p=1}^N \tilde{u}_p\). Suppose we draw ballots at random with replacement, with probability \(\tilde{u}_p/\tilde{U}\) of drawing ballot \(p\) in each draw, \(p = 1, \ldots, N\). (Since \(\tilde{u}_p \ge 0\), these are all positive numbers, and they sum to 1, so they define a probability distribution on the \(N\) ballots.)

Let \(T_j\) be the value of \(t_p\) for the ballot \(p\) selected in the \(j\)th draw. Then \(\{T_j\}_{j=1}^n\) are IID, \(\mathbb{P} \{T_j \le 1\} = 1\), and

\[\begin{equation*} \mathbb{E} T_1 = \sum_{p=1}^N \tilde{u}_p/\tilde{U} t_p = \frac{1}{\tilde{U}}\sum_{p=1}^N \tilde{u}_p \frac{e_p}{\tilde{u}_p} = \frac{1}{\tilde{U}} \sum_{p=1}^N e_p = E/\tilde{U}. \end{equation*}\]

Thus \(E = \tilde{U} \mathbb{E} T_1\).

So, if we have strong evidence that \(\mathbb{E} T_1 < 1/\tilde{U}\), we have strong evidence that \(E < 1\).
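As a sketch, again continuing the running example, the bounds \(\tilde{u}_p\), the sampling probabilities \(\tilde{u}_p/\tilde{U}\), and the taints can be computed as follows; the sample mean of the observed taints, scaled by \(\tilde{U}\), is an unbiased estimate of \(E\). (The seed and sample size are arbitrary choices for illustration.)

# Continuing the running example: a priori bounds, sampling weights, taints, and an estimate of E.
u_tilde = np.array([max((v[p, w] - v[p, l] + 1) / V_wl[(w, l)]
                        for w in winners for l in losers)
                    for p in range(N)])
U_tilde = u_tilde.sum()
taints = e / u_tilde                          # t_p = e_p / u_tilde_p, never more than 1

prng = np.random.RandomState(20180927)        # arbitrary seed, for reproducibility
sample = prng.choice(N, size=20, replace=True, p=u_tilde / U_tilde)
E_estimate = U_tilde * taints[sample].mean()  # unbiased estimate of E = U_tilde * E[T_1]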

This approach can be simplified even further by noting that \(\tilde{u}_p\) has a simple upper bound that does not depend on any \(v_{pi}\). At worst, the cast-vote record (CVR) for ballot \(p\) shows a vote for the “least-winning” apparent winner of the contest with the smallest margin, but a hand interpretation shows a vote for the runner-up in that contest. Since \(V_{w\ell} \ge V\) and \(0 \le v_{pi} \le 1\),

\[\begin{equation*} \tilde{u}_p = \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - v_{p\ell} + 1}{V_{w\ell}} \le \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{1 - 0 + 1}{V_{w\ell}} \le \frac{2}{V}. \end{equation*}\]

Thus, if we define \(u_p \equiv 2/V\) and sample ballots at random with probability proportional to \(u_p\), in fact we will sample ballots with equal probability. Define

\[\begin{equation*} U \equiv \sum_{p=1}^N \frac{2}{V} = \frac{2N}{V} = 2/\mu \end{equation*}\]

and re-define \(t_p \equiv e_p/u_p\) (rather than \(e_p/\tilde{u}_p\)); let \(T_j\) be the value of \(t_p\) for the ballot selected at random in the \(j\)th draw, as before. Then still \(\{T_j\}_{j=1}^n\) are IID, \(\mathbb{P} \{T_j \le 1\} = 1\), and

\[\begin{equation*} \mathbb{E} T_1 = \sum_{p=1}^N \frac{u_p}{U} t_p = \frac{1}{U}\sum_{p=1}^N u_p \frac{e_p}{u_p} = \frac{1}{U} \sum_{p=1}^N e_p = E/U = \frac{\mu}{2} E, \end{equation*}\]

i.e.,

\[\begin{equation*} E = \frac{2}{\mu}\mathbb{E} T_1. \end{equation*}\]

So, if we have evidence that \(\mathbb{E} T_1 < \mu/2 = 1/U\), we have evidence that \(E < 1\).

\(P\)-values

We digress to define \(P\) values a little more carefully than elementary books typically do. See also Stark, P.B., 2016. The value of P-values, The American Statistician, 70, DOI:10.1080/00031305.2016.1154108, http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108

There is a hypothesis about the world. A (possibly multivariate) datum \(Y\) will be observed; \(Y\) is a random variable that takes values in a measurable space \(\mathcal{Y}\). If the hypothesis is true, the probability distribution \(G\) of \(Y\) is in some known set \(\mathcal{G}\) of probability distributions.

The null hypothesis is the assertion that \(G \in \mathcal{G}\); if the null hypothesis is false, \(G \notin \mathcal{G}\). (For simplicity, we suppose that all the distributions in \(\mathcal{G}\) are defined on the same \(\sigma\)-algebra, denoted \(\Sigma\).)

One particularly general and flexible definition of a \(P\)-value is that it is any random variable \(P\) for which \(\Pr_G \{ P \le t \} \le t\) for all \(t \in [0, 1]\) and all \(G \in \mathcal{G}\). That is, if the null hypothesis is true, \(P\) is stochastically no smaller than a random variable distributed uniformly on \([0, 1]\).

Another definition of \(P\)-values is in terms of rejection regions for a family of tests. For every value \(\alpha \in (0, 1)\), we have a measurable set \(R_\alpha \in \Sigma\) such that

\[\begin{equation*} \sup_{G \in \mathcal{G}} \Pr_G \{ Y \in R_\alpha \} \le \alpha. \end{equation*}\]

If the hypothesis \(G \in \mathcal{G}\) is true, the chance that the datum \(Y\) will be in the set \(R_\alpha\) is no greater than \(\alpha\). The set \(R_\alpha\) is the rejection region of a significance-level \(\alpha\) test of the null hypothesis. The hypothesis is rejected at significance level \(\alpha\) if \(Y \in R_\alpha\). If we conclude that the hypothesis is false whenever \(Y \in R_\alpha\), the chance of erroneously concluding that the hypothesis is false if it is in fact true is at most \(\alpha\).

To define \(P\)-values, we shall also insist that rejection regions for different significance levels nest: If \(0 < \beta < \gamma < 1\), then

\[\begin{equation*} R_\beta \subset R_\gamma. \end{equation*}\]

This ensures that if the family of tests rejects the hypothesis at significance level \(\alpha\), it also rejects the hypothesis at every significance level greater than \(\alpha\). We also define \(R_1 \equiv \mathcal{Y}\).

Given this structure, the \(P\)-value of the null hypothesis given the observation \(Y = y\) is

\[\begin{equation*} P \equiv \inf \{ \alpha \in (0, 1] : y \in R_\alpha \}. \end{equation*}\]

That is, the \(P\)-value is the smallest significance level at which the family of tests rejects the hypothesis. Because of the nesting, the family rejects the hypothesis for all significance levels greater than the \(P\)-value.

The \(P\)-value depends not only on the hypothesis \(\mathcal{G}\) and the observation that \(Y = y\), but also on the family of hypothesis tests (the sets \(R_\alpha\), \(\alpha \in (0, 1]\)). Different tests in general give different \(P\)-values for the same hypothesis given the same data.

One test is better (more powerful) than another if

  • no matter what value \(y\) the datum \(Y\) takes, the first test assigns a \(P\)-value no larger than the second test does, and

  • there is some \(y \in \mathcal{Y}\) such that the first test assigns a smaller \(P\)-value than the second when \(Y = y\).

Many tests are defined by a test statistic–a single-number summary \(\phi(Y)\) of the vector \(Y\), where \(\phi\) is a \(\Sigma\)-measurable function from \(\mathcal{Y}\) to \(\Re\). The rejection region is then expressed in terms of \(\phi(Y)\): The null hypothesis is rejected if \(\phi(Y) \le f_\alpha\), where \(f_\alpha\) is chosen to satisfy

\[\begin{equation*} \sup_{G \in \mathcal{G}} \mathbb{P}_G \{ \phi(Y) \le f_\alpha \} \le \alpha. \end{equation*}\]

The nesting condition requires that for \(0 < \beta < \gamma < 1\),

\[\begin{equation*} \{ y \in \mathcal{Y} : \phi(y) \le f_\beta \} \subset \{ y \in \mathcal{Y} : \phi(y) \le f_\gamma \}; \end{equation*}\]

i.e., \(f_\alpha\) should increase monotonically with \(\alpha\). The \(P\)-value of the hypothesis for the observation \(Y = y\) is then

\[\begin{equation*} P = \inf \{ \alpha : \phi(y) \le f_\alpha \}. \end{equation*}\]
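As a small illustration of this framework (unrelated to auditing), consider testing the hypothesis that a coin is fair or biased toward heads, \(\mathcal{G} = \{\mathrm{Binomial}(n, q) : q \ge 1/2\}\), using the number of heads as the test statistic \(\phi(Y) = Y\) and rejecting when \(\phi(Y)\) is small. The supremum over the null is attained at \(q = 1/2\), the rejection regions \(\{y : y \le f_\alpha\}\) nest, and the \(P\)-value \(\inf\{\alpha : y \le f_\alpha\}\) reduces to the familiar one-sided binomial tail probability. The sketch below (the helper names are ours) approximates the infimum on a grid of values of \(\alpha\):

from math import comb

def binom_cdf(k, n, q=0.5):
    """P(Y <= k) for Y ~ Binomial(n, q)."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k + 1))

def binom_p_value(y, n, alphas):
    """P-value as the smallest alpha on a grid whose rejection region {Y <= f_alpha}
       contains the observed y; the sup over the null {q >= 1/2} is attained at q = 1/2."""
    def f(alpha):  # largest k with P_{1/2}(Y <= k) <= alpha, or -1 if there is none
        return max([k for k in range(n + 1) if binom_cdf(k, n) <= alpha], default=-1)
    return min([a for a in alphas if y <= f(a)], default=1.0)

alphas = [j/1000 for j in range(1, 1001)]
binom_p_value(3, 20, alphas)   # 0.002 on this grid; the exact infimum is binom_cdf(3, 20), about 0.0013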

The Kaplan-Markov Inequality

Suppose \(\{X_j\}_{j=1}^n\) are IID with \(\mathbb{P}\{X_j \ge 0\} = 1\). Form the nonnegative martingale:

\[\begin{equation*} \left ( X_1/\mathbb{E} X_1, \; (X_1/\mathbb{E} X_1)\cdot(X_2/\mathbb{E} X_1), \; \ldots, \; \prod_{j=1}^n (X_j/\mathbb{E} X_1) \right ). \end{equation*}\]

The expected value of each term is 1. Kaplan (1987) uses Ville’s (1939) inequality for an optionally stopped nonnegative martingale \(Z_1, Z_2, \ldots\): For any \(z > 0\), \(\mathbb{P} \{ \max_{j=1}^n Z_j > z \} \le \mathbb{E} Z_n/z\). In the case at hand, that gives:

\[\begin{equation*} \mathbb{P} \left \{ \max_{j=1}^n \prod_{i=1}^j X_i/\mathbb{E} X_1 > 1/\alpha \right \} \le \alpha. \end{equation*}\]

This is the Kaplan-Markov bound (see also Stark, 2009).

We can reject the hypothesis that \(\mathbb{E} X_1 \le \mu\) at significance level \(\alpha\) if

\[\begin{equation*} \max_{j=1}^n \prod_{i=1}^j X_i/\mu > 1/\alpha. \end{equation*}\]

If we observe \((X_j = x_j)_{j=1}^n\), the \(P\)-value of the hypothesis that \(\mathbb{E} X_1 \le \mu\) is

\[\begin{equation*} \left ( \max_{j=1}^n \prod_{i=1}^j x_i/\mu \right )^{-1}. \end{equation*}\]

This derivation gave a \(P\)-value for the hypothesis that \(\mathbb{E} X_1 \le \mu\) starting with the assumptions that \(\{X_j\}_{j=1}^n\) are IID and \(\mathbb{P}\{X_1 \ge 0\} = 1\). But in election auditing, we want a \(P\)-value for the hypothesis \(\mathbb{E} T_1 \ge 1/U\) and we know that \(\{T_j\}_{j=1}^n\) are IID and \(\mathbb{P}\{T_1 \le 1\} = 1\). The transformation \(X_j = 1 - T_j\) transforms one problem into the other.

The \(P\)-value for the hypothesis \(E \ge 1\) if we observe \((T_j = t_j)_{j=1}^n\) is

\[\begin{equation*} P_{KM} \equiv \min_{j=1}^n \prod_{i=1}^j \frac{1 - 1/U}{1 - t_i}. \end{equation*}\]

The \(P\)-value \(P_{KM}\) can depend on the order in which the data are collected, which is discomfiting unless the data are in fact collected sequentially.
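A minimal sketch of the computation, for a hypothetical sequence of observed taints and \(U\) as defined above (the helper name is ours):

def km_p_value_from_taints(taints, U):
    """Kaplan-Markov P-value for the hypothesis E >= 1, from the taints in the order drawn.
       Assumes every taint is strictly less than 1 (see the padding discussion below)."""
    p, prod = 1.0, 1.0
    for t in taints:
        prod *= (1 - 1/U) / (1 - t)   # term contributed by this draw
        p = min(p, prod)              # running minimum over j
    return min(1.0, p)

# e.g., 100 draws with no error and U = 40 (illustrative numbers only):
km_p_value_from_taints([0.0] * 100, 40)   # (1 - 1/40)^100, about 0.08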

Padding the Kaplan-Markov Inequality

The basic Kaplan-Markov inequality isn’t helpful if any observed \(t_p\) is equal to 1, that is, if the corresponding ballot was reported to have a vote for the winner with the fewest votes (in the tightest contest), but an audit shows it to have a vote for the loser with the most votes. If equipment is functioning properly, such an error should not occur frequently: it would be much more common for a scanner to miss a mark, to treat a “hesitation mark” as an overvote, etc. However, if the auditors mistakenly retrieve the wrong ballot, such an error might occur more frequently, in turn requiring a full hand count even when the reported outcome is correct.

To hedge against the possibility that the ratio of \(e_p\) to its upper bound is equal to 1 for any ballot in the sample, we can increase the upper bound so that it cannot be attained, for instance, by inflating it by a small percentage. The risk remains strictly controlled.

To that end, we take the error bound for each ballot to be \( u_p \equiv 2\gamma/V > \tilde{u}_p \) where the “inflator” \(\gamma > 1\). That ensures that \(e_p/u_p\) cannot be larger than \(1/\gamma < 1\). The cost of inflating the upper bound in this way is that a larger sample will be needed than if \(\{\tilde{u}_p\}\) were used as the bounds and the sample did not happen to include any ballots with \(e_p\) equal to \(\tilde{u}_p\). On the other hand, inflating the error bounds can help avoid a full count when that full count would merely confirm that all the apparent outcomes are correct; that is, using \(\gamma > 1\) can increase the power of the test against some alternatives. The larger the value of \(\gamma\), the larger the initial sample needs to be to allow the audit to stop if at most a given number of ballots overstated one or more margins by one vote, but the less the sample will need to be expanded if ballots in the sample overstate any margin by two votes–unless a full hand count is required.

With \(u_p\) defined as above, the total error bound across all \(N\) ballots becomes \( U \equiv 2\gamma N/V = 2\gamma/\mu, \) where \(\mu\) is the diluted margin \(V/N\). The diluted margin plays an important role in determining the sample size: The initial sample size is \(1/\mu\) multiplied by a constant that depends on the desired simultaneous risk limit, the number of errors of each kind we would like to be able to tolerate without expanding the audit, and the inflator \(\gamma\).

Suppose that \(n\) of the \(N\) ballots are drawn with replacement with equal probability. Let \(e_j\) be the value of the error \(e_p\) for the \(j\)th randomly selected ballot. The Kaplan-Markov MACRO \(P\)-value is

\[\begin{equation*} P_{KM} \le \prod_{j=1}^n \frac{1 - 1/U}{1 - \frac{e_j}{u_j}} = ( 1 - 1/U )^n \prod_{j=1}^n \left ( 1 - e_j\frac{V}{2\gamma} \right )^{-1}. \end{equation*}\]

An audit with simultaneous risk limit \(\alpha\) can be conducted by continuing to hand count the votes on ballots selected at random until \(P_{KM} \le \alpha\) or until the votes on all the ballots have been counted by hand (and the correct outcome is therefore known).
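Here is a sketch of such an audit loop under simplifying assumptions: ballots are drawn with replacement with equal probability, and a hypothetical function audit_error(p) stands in for retrieving ballot \(p\), interpreting it by hand, and returning its maximum relative overstatement \(e_p\). The function name and interface are ours, for illustration only.

import numpy as np

def sequential_comparison_audit(N, V, audit_error, alpha=0.05, gamma=1.03905, seed=1234567):
    """Sketch of a ballot-level comparison audit using the padded Kaplan-Markov bound.
       audit_error(p) is a hypothetical stand-in for auditing ballot p by hand."""
    U = 2 * gamma * N / V              # total of the padded error bounds u_p = 2*gamma/V
    prng = np.random.RandomState(seed)
    p_km, n = 1.0, 0
    while p_km > alpha and n < N:      # in practice, switch to a full hand count well before n == N
        ballot = prng.randint(N)       # draw with replacement, equal probability
        e = audit_error(ballot)        # e_p for the ballot just audited
        p_km *= (1 - 1/U) / (1 - e * V / (2 * gamma))   # multiply in the taint term for this draw
        n += 1
    return n, p_km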

A simple approximation

The Kaplan-Markov \(P\)-value depends on which margins in which contests are affected by error, and the nature of those errors.
We will develop a simplification that reduces bookkeeping without compromising the risk limit, but at the expense of potentially requiring the audit to inspect more ballots.

The first step is to give up some of the advantages of normalizing errors by the margins they affect, so that errors that affect wide margins will count the same as errors that affect narrow margins.

Recall that a sufficient condition for all the reported outcomes to be correct is

\[\begin{equation*} \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} < 1. \end{equation*}\]

Since

\[\begin{equation*} V \equiv \min_c \min_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} V_{w\ell}, \end{equation*}\]

an even more restrictive sufficient condition is

\[\begin{equation*} \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell}}{V} < 1. \end{equation*}\]

This condition insists that the sum over ballots of the largest error in any margin in any contest on the ballot be less than the smallest margin between any winner and any loser in any contest. Clearly, this gives up a lot, since a 1-vote error in a wide margin matters less than a 1-vote error in a narrow margin.

This condition is the same as the condition \(E < 1\) if we replace all the individual pairwise margins \(V_{w\ell}\) by \(V\), the smallest pairwise margin.

An overstatement is an error that caused the margin between some reported winner and some reported loser to be larger than it really is. That is, an overstatement is an error that, if corrected, would reduce the margin. For instance, if the ballot was recorded as an undervote, but manual inspection shows that it was a vote for a reported loser, that is an overstatement error. Correcting it would reduce at least one reported margin.

Conversely, an understatement is an error that caused the margin between some reported winner and some reported loser to be smaller than it really is. That is, an understatement is an error that, if corrected, would increase the margin. For instance, if a ballot was recorded as a vote for a reported loser, but manual inspection of the ballot shows that it was a vote for a reported winner, that is an understatement error. Correcting it would increase at least one reported margin.

Discovering an overstatement reduces confidence that the reported outcomes are correct; discovering an understatement increases confidence in the outcomes—but not by as much.

To have algebraically simple rules, we can make a conservative approximation to the Kaplan-Markov \(P\)-value \(P_{KM}\), such that the approximate value is always larger than the true value, so the approximation will still control the true risk of failing to correct an incorrect reported outcome.

Here is a recap of where we are in the notation, the question, and the strategy:

  • \(e_p \equiv \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V\) is the maximum overstatement of pairwise margins on ballot \(p\); it is the value of \(e_p\) we used before, but normalized by \(V\) instead of \(V_{w\ell}\)

  • \(E \equiv \sum_{p=1}^N e_p\) is an upper bound on the total maximum relative overstatement of any pairwise margin; if \(E < 1\) then all contest outcomes are correct

  • \(u_p \equiv 2\gamma/V\) is an a priori upper bound on \(e_p\) for every ballot \(p\)

  • \(U \equiv \sum_p u_p = 2\gamma N/V = 2\gamma/\mu\) is the total of those upper bounds

  • \(t_p \equiv e_p/u_p\) is the taint of ballot \(p\)

  • we are testing the hypothesis \(E \ge 1\)

  • the audit will draw ballots IID with probability \(1/N\) of selecting each ballot in each draw

We will develop a bound that depends on the data only through the number of ballots among the \(n\) ballots in the sample in each of four categories:

  • \(o_1\): #ballots that overstate any margin by one vote but no margin by two votes. For such a ballot, \(e_p = 1/V\) and \(t_p = 1/(2\gamma)\).

  • \(u_1\): #ballots that understate every margin by at least one vote and some margin by exactly one vote. For such a ballot, \(e_p = -1/V\) and \(t_p = -1/(2\gamma)\).

  • \(o_2\): #ballots that overstate any margin by two votes. For such a ballot, \(e_p = 2/V\) and \(t_p = 1/\gamma\).

  • \(u_2\): #ballots that understate every margin by two votes (in general, if a contest has more than one reported loser, there cannot be any such ballots). For such a ballot, \(e_p = -2/V\) and \(t_p = -1/\gamma\).

Then, once we have drawn \(n\) ballots at random,

\[\begin{equation*} P_{KM} \le P(n, o_1, u_1, o_2, u_2; U, \gamma) \equiv \left (1 - 1/U \right )^n \times \prod_{j=1}^n ( 1 - t_j )^{-1} \end{equation*}\]
\[\begin{equation*} = ( 1 - 1/U )^n \left ( 1 - \frac{1}{2\gamma}\right )^{-o_1} \left ( 1 - \frac{1}{\gamma}\right )^{-o_2} \left ( 1 + \frac{1}{2\gamma}\right )^{-u_1} \left ( 1 + \frac{1}{\gamma}\right )^{-u_2}. \end{equation*}\]

If the audit stops only when \(P(n, o_1, u_1, o_2, u_2; U, \gamma) \le \alpha\) (or when all ballots have been tallied manually and the result is known), the risk of certifying an incorrect outcome is limited to \(\alpha\).

In practice, it’s easier to work with logarithms, so that the only operations required are addition and subtraction. The termination condition becomes:

\[\begin{equation*} n \log \left ( 1 - \frac{\mu}{2\gamma} \right ) - o_1 \log \left ( 1 - \frac{1}{2\gamma}\right ) - o_2 \log \left ( 1 - \frac{1}{\gamma}\right ) - u_1 \log \left ( 1 + \frac{1}{2\gamma}\right ) - u_2 \log \left ( 1 + \frac{1}{\gamma}\right ) < \log \alpha. \end{equation*}\]

Estimating the required sample size

Suppose we expect to see errors of the various kinds, \((o_1, o_2, u_1, u_2)\), at rates \((\omega_1, \omega_2, \upsilon_1, \upsilon_2)\), respectively. How large should we expect \(n\) will need to be before the audit can stop?

We expect \(n \omega_1\) 1-vote overstatements in a sample of size \(n\), etc. The expected termination condition is thus:

\[\begin{equation*} n \left [ \log \left ( 1 - \frac{\mu}{2\gamma} \right ) - \omega_1 \log \left ( 1 - \frac{1}{2\gamma}\right ) - \omega_2 \log \left ( 1 - \frac{1}{\gamma}\right ) - \upsilon_1 \log \left ( 1 + \frac{1}{2\gamma}\right ) - \upsilon_2 \log \left ( 1 + \frac{1}{\gamma}\right ) \right ] < \log \alpha. \end{equation*}\]

Solving for \(n\) gives

\[\begin{equation*} n \ge \frac{\log \alpha}{\log \left ( 1 - \frac{\mu}{2\gamma} \right ) - \omega_1 \log \left ( 1 - \frac{1}{2\gamma}\right ) - \omega_2 \log \left ( 1 - \frac{1}{\gamma}\right ) - \upsilon_1 \log \left ( 1 + \frac{1}{2\gamma}\right ) - \upsilon_2 \log \left ( 1 + \frac{1}{\gamma}\right )}, \end{equation*}\]

provided the denominator of the right hand side is strictly negative—which is a necessary condition for the audit to terminate without a full hand count (in expectation).

It should be clear that the audit ought not terminate if the net rate of error is greater than or equal to the diluted margin, that is, if \(\omega_1 + 2\omega_2 - \upsilon_1 - 2\upsilon_2 \ge \mu\): if the net error could account for the margin, the audit should go to a full hand count. The restriction on the denominator is slightly stronger (more restrictive), because of the asymmetry between overstatements and understatements in the Kaplan-Markov \(P\)-value: for \(0 < x < 1\),

\[\begin{equation*} \log ( 1 - x ) < - \log (1 + x), \end{equation*}\]

which one can see from the two Taylor series:

\[\begin{equation*} \log (1+x) = x - x^2/2 + x^3/3 - \cdots \end{equation*}\]

versus

\[\begin{equation*} \log (1-x) = -x - x^2/2 - x^3/3 - \cdots. \end{equation*}\]
import numpy as np

def KM_P_value(n, o1, o2, u1, u2, gamma, mu):
    """Kaplan-Markov P-value for a sample of n ballots containing o1 one-vote and o2 two-vote
       overstatements and u1 one-vote and u2 two-vote understatements, with inflator gamma
       and diluted margin mu."""
    return((1 - mu/(2*gamma))**n *\
           (1 - 1/(2*gamma))**(-o1) *\
           (1 - 1/gamma)**(-o2) *\
           (1 + 1/(2*gamma))**(-u1) *\
           (1 + 1/gamma)**(-u2))

def KM_Expected_sample_size(or1, or2, ur1, ur2, mu, gamma, alpha):
    """Expected sample size to confirm the outcomes at risk limit alpha if one- and two-vote
       overstatements and understatements occur at rates or1, or2, ur1, ur2; returns nan if
       the expected errors are too large for the audit to terminate short of a full count."""
    n = np.nan
    denom = np.log( 1 - mu/(2*gamma) ) -\
            or1*np.log(1 - 1/(2*gamma)) -\
            or2*np.log(1 - 1/gamma) -\
            ur1*np.log(1 + 1/(2*gamma)) -\
            ur2*np.log(1 + 1/gamma)
    if (denom < 0):
        n = np.ceil(np.log(alpha)/denom)
    return(n)
alpha = 0.05
gamma = 1.03905

mu = (354040 - 337589)/(354040 + 337589 + 33234)               # New Hampshire 2016
KM_P_value(200, 1, 0, 0, 0, gamma, mu)                         # 0.21438135077031842
KM_Expected_sample_size(.001, 0, .001, 0, 0.05, gamma, alpha)  # 125.0
KM_Expected_sample_size(.05, 0, 0, 0, 0.05, gamma, alpha)      # nan

Wald’s Sequential Probability Ratio Test

See SPRT.

The Kaplan-Wald Inequality

See Kaplan-Wald.

Auditing Super-Majority Contests

In many states, some contests (e.g., ballot measures) require a super-majority to pass. For example, some contests require that more than 2/3 of the votes be “yes” votes. Super-majority contests were considered in the first paper on risk-limiting audits (Stark, 2008. Conservative Statistical Post-Election Audits, Ann. Appl. Stat., 2, 550–581 DOI: 10.1214/08-AOAS161), but ballot-level comparison audits have not been developed explicitly for such social choice functions.

Here is a derivation.

Let \(\tau\) be the threshold fraction of votes required for the measure to pass, e.g., \(\tau = 2/3\) for a contest that requires a 2/3 majority.

Suppose that the number of “yes” votes is \(A_{\mathrm{yes}}\), the number of “no” votes is \(A_{\mathrm{no}}\), and that \(N \ge A_{\mathrm{yes}} + A_{\mathrm{no}}\) ballots were cast in all, including undervotes and overvotes.

The “margin” in favor of “yes” in votes is \(A_{\mathrm{yes}}-\tau(A_{\mathrm{yes}}+A_{\mathrm{no}})\). If this margin is positive, “yes” wins. Otherwise, “no” wins. The reported margin is \(v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})\) (if “yes” is the reported winner) or \(\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})-v_{\mathrm{yes}}\) (if “no” is the reported winner), where \(v_{\mathrm{yes}}\) and \(v_{\mathrm{no}}\) denote the total numbers of reported “yes” and “no” votes.

How much can error in the interpretation of a single ballot affect the margin?

First, suppose that “yes” is the reported winner. Then the effects of errors are:

| reported as | actual value | signed error (in votes) | type |
|-------------|--------------|-------------------------|------|
| yes | invalid | \(1-\tau\) | overstatement |
| yes | no | \(1\) | overstatement |
| no | invalid | \(-\tau\) | understatement |
| no | yes | \(-1\) | understatement |
| invalid | yes | \(-(1-\tau)\) | understatement |
| invalid | no | \(\tau\) | overstatement |

If “no” is the reported winner, then the errors are:

| reported as | actual value | signed error (in votes) | type |
|-------------|--------------|-------------------------|------|
| yes | invalid | \(-(1-\tau)\) | understatement |
| yes | no | \(-1\) | understatement |
| no | invalid | \(\tau\) | overstatement |
| no | yes | \(1\) | overstatement |
| invalid | yes | \(1-\tau\) | overstatement |
| invalid | no | \(-\tau\) | understatement |

Note the symmetry: if “yes” won, a given error is the negative of what it would have been had “no” won. The largest possible error is \(1\) vote, regardless of the reported vote (in contrast, for plurality elections the largest possible error is \(2\) votes). As a fraction of the reported margin, that is \(u_p \equiv 1/|v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})|\).
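The two tables can be condensed into a few lines of code. This sketch (the function name and arguments are ours) returns the signed error, in votes, for a single ballot given its reported and actual interpretations:

def supermajority_signed_error(reported, actual, tau, yes_is_reported_winner):
    """Signed overstatement, in votes, of the reported winner's margin for one ballot.
       reported and actual are each "yes", "no", or "invalid"."""
    def margin_contribution(interpretation):
        yes = 1 if interpretation == "yes" else 0
        no = 1 if interpretation == "no" else 0
        return yes - tau * (yes + no)    # (1 - tau) for "yes", -tau for "no", 0 for "invalid"
    error = margin_contribution(reported) - margin_contribution(actual)
    return error if yes_is_reported_winner else -error

supermajority_signed_error("yes", "no", 2/3, True)        # 1 (overstatement)
supermajority_signed_error("invalid", "yes", 2/3, True)   # -(1 - tau), about -0.333 (understatement)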

If we want to base the audit on the Kaplan-Markov inequality, we might want to “pad” the bounds as described above, defining \(u_p \equiv \gamma/|v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})|\); we shall incorporate that padding here and below. Define \(m \equiv v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})\).

There are three possible magnitudes for an error (\(1\), \(1-\tau\), and \(\tau\)), each with two possible signs.

The total upper bound on the error as a fraction of the margin is

\[\begin{equation*} U \equiv \sum_{p=1}^N u_p = N\gamma/|m|. \end{equation*}\]

Let \(v_{py}\) denote the number of reported votes for “yes” on ballot \(p\) and let \(v_{pn}\) be the number of reported votes for “no” on ballot \(p\), so \(0 \le v_{py} + v_{pn} \le 1\).

Define \(e_p\) to be the signed overstatement error in the margin in votes resulting from how the voting system interpreted ballot \(p\), normalized by the margin.

For instance, suppose “yes” was reported to have received a supermajority. If \(v_{py} = 1\) and \(v_{pn}=0\) (the vote was recorded as a “yes”), but \(a_{py}=0\) and \(a_{pn}=0\) (the vote was actually invalid), then \(e_p = \frac{1-\tau}{|m|}\).

The possible values for \(e_p\) are

  • \(0\)

  • \(\pm 1/|m|\)

  • \(\pm \tau/|m|\), and

  • \(\pm (1-\tau)/|m|\).

Let \(E = \sum_p e_p\). The reported outcome is correct if \(E < 1\).

As before, define \(t_p \equiv e_p/u_p = e_p/\gamma\), draw ballots independently with replacement with probability \(u_p/U\), and let \(T_j\) be the value of \(t_p\) for the ballot selected on the \(j\)th draw. Then

\[\begin{equation*} \mathbb{E} T_1 = \sum_{p=1}^N \frac{u_p}{U} t_p = \frac{1}{U} \sum_{p=1}^N u_p e_p/u_p = \frac{E}{U}, \end{equation*}\]

so \(E = U \mathbb{E} T_1\), and evidence that \(\mathbb{E}T_1 < 1/U\) is evidence that \(E < 1\).

A risk-limiting audit with risk limit \(\alpha\) can stop if and when

\[\begin{equation*} P = ( 1 - 1/U )^n \prod_{j=1}^n \left ( 1 - e_j/u_j \right )^{-1} = ( 1 - 1/U )^n \prod_{j=1}^n \left ( 1 - e_j|m|/\gamma \right )^{-1}< \alpha. \end{equation*}\]

Taking logs yields the condition

\[\begin{equation*} n \log ( 1 - 1/U ) - \sum_{j=1}^n \log ( 1 - e_j|m|/\gamma ) < \log \alpha. \end{equation*}\]

Choosing \(\gamma\)

Suppose we want the “cost” (in terms of the increase to the sample size) of mistaking a “no” vote for a “yes” vote to be \(k\) times larger than the “cost” of mistaking an invalid vote for a “yes” vote.

[TO BE COMPLETED]

def KM_Super_P_value(n, N, m, yes_none, yes_no, no_none, no_yes, none_yes, none_no, tau=2/3, gamma=1.03905):
    """
    Kaplan-Markov p-value for a super-majority contest, "yes" versus "no"
    
    Input
    ------
        n:        sample size
        N:        number of ballots cast
        m:        reported margin of victory: votes_for_yes - tau*(votes_for_yes + votes_for_no);
                  negative if "yes" lost
        yes_none: number of ballots in the audit sample that were reported as "yes" but 
                  revealed by the audit to be undervotes, overvotes, or otherwise invalid 
        yes_no:   number of ballots reported as "yes" that were revealed by the audit to 
                  be "no"
        no_none:  # ballots reported as "no" revealed to be invalid or non-votes
        no_yes:   # ballots reported as "no" revealed to be "yes"
        none_yes: # ballots reported as invalid/non-votes revealed by the audit to be "yes"
        none_no:  # ballots reported as invalid/non-votes revealed by the audit to be "no"
        ...
        tau:      threshold for supermajority win
        gamma:    padding in Kaplan Markov. Choose gamma >1.
        
    Returns
    -------
        P-value : float
    """
        
    U = gamma*N/abs(m)  # bound on total error
    sgn = 1 if m >=0 else -1  # 1 if yes is the reported winner; -1 if not
    return((1 - 1/U)**n *\
           (1 - sgn*(1-tau)/gamma)**(-yes_none) *\
           (1 - sgn/gamma)**(-yes_no) *\
           (1 + sgn*tau/gamma)**(-no_none) *\
           (1 + sgn/gamma)**(-no_yes) *\
           (1 + sgn*(1-tau)/gamma)**(-none_yes) *\
           (1 - sgn*tau/gamma)**(-none_no))
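
As an illustration with entirely made-up totals: suppose 60,000 ballots were reported as “yes,” 28,000 as “no,” and 2,000 as invalid, with \(\tau = 2/3\), so “yes” is reported to have passed by roughly 1,333 votes. A sample of 400 ballots in which one ballot reported as “yes” turns out to be a vote for “no” gives a \(P\)-value of roughly 0.085, not yet small enough to stop at a 5% risk limit:

N = 90000                                        # made-up totals, for illustration only
m = 60000 - (2/3)*(60000 + 28000)                # reported margin, about 1333.3 votes
KM_Super_P_value(400, N, m, 0, 1, 0, 0, 0, 0)    # roughly 0.085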