Introduction to Risk-Limiting Election Audits¶
Reading¶
Lindeman, M. and P.B. Stark, 2012. A Gentle Introduction to Risk-Limiting Audits. IEEE Security and Privacy, 10, 42–49. Preprint: https://www.stat.berkeley.edu/~stark/Preprints/gentle12.pdf
Stark, P.B., and D.A. Wagner, 2012. Evidence-Based Elections. IEEE Security and Privacy, 10, 33–41. Preprint: https://www.stat.berkeley.edu/~stark/Preprints/evidenceVote12.pdf
Rivest, R.L. and E. Shen, 2012. A Bayesian Method for Auditing Elections, EVT/WOTE, https://www.usenix.org/system/files/conference/evtwote12/evtwote12-final30.pdf
Relevant code¶
https://github.com/ron-rivest/audit-lab
https://github.com/pbstark/auditTools
https://www.stat.berkeley.edu/~stark/Java/Html/auditTools.htm
https://www.stat.berkeley.edu/~stark/Java/Html/ballotPollTools.htm
https://github.com/pbstark/DKDHondt14
Relevant popular press¶
http://www.huffingtonpost.com/american-statistical-association/leave-election-integrity-_b_3580649.html
https://www.nytimes.com/2017/09/01/us/politics/russia-election-hacking.html
https://www.nytimes.com/2017/09/01/insider/in-election-interference-its-what-reporters-didnt-find-that-matters.html
https://www.usatoday.com/story/opinion/2016/11/18/election-audit-paper-machines-column/93803752/
https://newrepublic.com/article/140254/inside-story-trump-clinton-stein-presidential-election-recount
http://fortune.com/2017/07/31/defcon-hackers-us-voting-machines/
https://www.theregister.co.uk/2017/07/29/us_voting_machines_hacking/
https://www.wired.com/story/voting-machine-hacks-defcon/
https://www.washingtonpost.com/local/virginia-politics/virginia-scraps-touch-screen-voting-machines-as-election-for-governor-looms/2017/09/08/e266ead6-94fe-11e7-89fa-bb822a46da5b_story.html
Any way of tabulating votes can and will make mistakes.
How can we tell whether those mistakes are material, that is, whether they caused the outcome to be wrong?
If there is a trustworthy audit trail of paper ballots, we can check the electronic tally against a manual inspection of the paper. In principle we could recount all the votes by hand, but that’s unnecessarily expensive. Statistical methods can be more efficient.
What would we like an election audit to do?
Historically, statutory election audits didn’t really have much of a motivating principle. They were of the form “audit X% of the precincts and report what you see.” That doesn’t give much of a guarantee of anything. In particular, depending on the margin, X% might not even be enough to have a reasonable chance of finding any errors whatsoever even if the reported outcome was wrong. Proponents of that kind of audit often claim that the purpose of the audit is to check whether the machines are working properly. But “properly” isn’t a precise thing: because there is always some rate of error, especially in machine interpretation of manually marked ballots, the question is whether the machines worked “well enough.” They never work perfectly.
Historically, the next approach to audits involved a “detection” paradigm, which is sensitive to the margin (it involves drawing larger samples to check smaller margins). In the detection approach, the goal is to draw a large enough sample that, if the reported outcome was wrong, there is a large chance that the sample will reveal at least one machine error.
The detection paradigm is an improvement over the more arbitrary “audit X%” style of audit, but in my opinion, it doesn’t answer the right question. In part, that’s because machines inevitably misinterpret some fraction of hand-marked ballots, so an audit of a reasonable number of ballots is bound to find at least one machine error. What then? Such errors do not necessarily cast doubt on the outcome: the frequency of errors matters.
I think a better question is, “in light of the errors the audit found in the sample, how confident should we be that the reported outcome is correct?”
Even that doesn’t seem like enough to ask: ideally, we would like the audit to be a quality control measure that corrects errors if the errors “matter.” (A reasonable standard for materiality, that is, whether the errors matter, is whether they changed who or what appeared to win.)
Risk-limiting audits provide a more efficient approach by formulating auditing as a statistical hypothesis test. The null hypothesis is that the reported outcome is wrong. To reject that hypothesis is to conclude that the outcome is correct. Absent strong evidence that the outcome is correct, keep inspecting ballots by hand, possibly proceeding to a full hand tally.
There are two basic approaches to risk-limiting election audits:
Ballot-polling audits. These draw a random sample of ballots but do not rely on the tabulation system at all.
Comparison audits. These require the voting system to export subtotals (ideally a subtotal for every individual ballot) in such a way that the physical ballots corresponding to the subtotal can be retrieved and the subtotals can be checked manually. They also require the local election official (LEO) to “commit” to the subtotals before the audit starts.
Both approaches require a ballot manifest that describes in detail how the physical ballots are organized and stored (e.g., the number of batches, an identifier for each batch, and the number of ballots in each batch), so that a random sample of ballots can be drawn.
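As a purely illustrative sketch (the manifest format, function name, and numbers below are assumptions, not part of any particular audit law or tool), a manifest can be represented as a list of batches with ballot counts, and a seeded pseudo-random number generator can translate random draws into specific physical ballots to retrieve:

import random

def draw_ballots(manifest, n, seed):
    # Draw n ballots uniformly at random with replacement from a ballot manifest.
    # manifest: list of (batch_id, ballots_in_batch) pairs -- an assumed format.
    # Returns (batch_id, position_within_batch) pairs identifying ballots to retrieve.
    rng = random.Random(seed)
    total = sum(count for _, count in manifest)
    draws = []
    for _ in range(n):
        k = rng.randint(1, total)           # pick a ballot number between 1 and the total
        for batch_id, count in manifest:    # walk the manifest to find which batch holds it
            if k <= count:
                draws.append((batch_id, k))
                break
            k -= count
    return draws

# Hypothetical manifest: three batches with 200, 150, and 75 ballots.
manifest = [("precinct 1", 200), ("precinct 2", 150), ("precinct 3", 75)]
print(draw_ballots(manifest, 5, seed=2017))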
Both approaches involve turning the hypothesis “the reported outcome is wrong” into a quantitative assertion about a scalar that gives a necessary condition for the outcome to be wrong. For instance, in a plurality contest, the true winner is the candidate (or position) that received the most votes. For the reported winner not to be the true winner, there must be some candidate who received at least as many votes as the reported winner. If we have strong statistical evidence that no candidate received as many votes or more votes than the reported winner, the audit can stop. Otherwise, it continues, potentially to a full hand count.
The evidence can be direct (evidence about the fraction of votes for a given candidate), or indirect (evidence that errors in the count are not large enough to alter the outcome). Ballot-polling audits use the first kind of evidence; comparison audits use the second.
“Super-simple simultaneous” comparison audits for plurality contests:¶
The Kaplan-Markov Inequality, Wald’s Sequential Probability Ratio Test, and the Kaplan-Wald Inequality¶
This section follows https://www.usenix.org/legacy/event/evtwote10/tech/full_papers/Stark.pdf closely, but derives something slightly more general.
Notation¶
\(C\) contests under audit
\(N\) ballots were cast in all. (There might not be any contest that appears on all \(N\) ballots)
Contest \(c\) appears on \(N_c\) of the \(N\) cast ballots. (\(N\) and \(\{N_c\}_{c=1}^C\) are known.)
\(\mathcal{W}_c\): the set of reported winners of contest \(c\).
\(\mathcal{L}_c\): the set of reported losers of contest \(c\).
\(v_{pi} \in \{0, 1\}\): the reported votes for candidate \(i\) on ballot \(p\)
\(a_{pi} \in \{0, 1\}\): actual votes for candidate \(i\) on ballot \(p\). If contest \(c\) does not appear on ballot \(p\) then \(a_{pi} = 0\).
\(V_{w\ell} \equiv \sum_{p=1}^N (v_{pw} - v_{p\ell}) > 0\): Reported margin of reported winner \(w \in \mathcal{W}_c\) over reported loser \(\ell \in \mathcal{L}_c\) in contest \(c\).
\(V\): smallest reported margin among all \(C\) contests: \(V \equiv \min_c \min_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} V_{w \ell}\)
\(\mu = V/N\): the “diluted margin,” the margin in votes divided by the total number of ballots
\(A_{w\ell} \equiv \sum_{p=1}^N (a_{pw} - a_{p\ell})\): actual margin of reported winner \(w \in \mathcal{W}_c\) over reported loser \(\ell \in \mathcal{L}_c\) in contest \(c\)
The reported winners of all \(C\) contests are the actual winners of those contests if

\[ \min_c \min_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} A_{w\ell} > 0. \]

Otherwise, at least one reported electoral outcome is wrong. (We aren’t particularly worried about whether the votes are counted perfectly; they almost never are. We care whether the errors were large enough to make a loser appear to be a winner, or vice versa.)
We won’t test that inequality directly. Instead, we will test a condition that is sufficient but not necessary for the inequality to hold, to get a computationally simple test that is still conservative (the simultaneous risk is below its nominal value).
One such reduction relies on the maximum across-contest relative overstatement (MACRO), which we will now develop.
The reported winners of contest \(c\) are correct if for every pair \(w \in \mathcal{W}_c\), \(\ell \in \mathcal{L}_c\),

\[ \sum_{p=1}^N (v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}) < V_{w\ell}. \]

This condition says that the error in \(V_{w\ell}\) is less than \(V_{w\ell}\). The reported winners of all the contests are correct if for every contest \(c\), the previous relation holds, i.e., if

\[ \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \sum_{p=1}^N \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} < 1. \]

Now the maximum (over \(c\) and, for each contest \(c\), over all winner, loser pairs in that contest) of sums is not larger than the sum of maxima; that is,

\[ \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \sum_{p=1}^N \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}}. \]

Hence, if

\[ \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} < 1, \]

all the reported outcomes must be correct. Define

\[ e_p \equiv \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}}. \]

Then the reported outcomes of all the contests must be correct if

\[ E \equiv \sum_{p=1}^N e_p < 1. \]

To see that a different way, suppose that the outcome of one or more contests is wrong. Then there is some contest \(c\) and some reported (winner, loser) pair \(w \in \mathcal{W}_c, \ell \in \mathcal{L}_c\) for which

\[ A_{w\ell} = \sum_{p=1}^N (a_{pw} - a_{p\ell}) \le 0, \]

i.e.,

\[ \sum_{p=1}^N (v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}) \ge \sum_{p=1}^N (v_{pw} - v_{p\ell}) = V_{w\ell}. \]

Dividing both sides by \(V_{w\ell}\) gives

\[ \sum_{p=1}^N \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \ge 1. \]

But

\[ e_p \ge \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \quad \mbox{for every } p, \]
so if the outcome is wrong, \(E = \sum_p e_p \ge 1\). Thus a risk-limiting audit can rely on testing whether \(E \ge 1\). If the hypothesis \(E \ge 1\) can be rejected at significance level \(\alpha\), we can conclude that all the reported outcomes are correct.
Testing whether \(E \ge 1\) would require a very large sample if we knew nothing at all about \(e_p\) without auditing ballot \(p\): a single large value of \(e_p\) could make \(E\) arbitrarily large. Fortunately, there is an a priori upper bound for \(e_p\). Whatever the reported votes \(v_{pi}\) are on ballot \(p\), we can find the potential values of the actual votes \(a_{pi}\) that would make the error \(e_p\) largest, because \(a_{pi}\) can only be zero or one:

\[ \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le \frac{v_{pw} - 0 - v_{p\ell} + 1}{V_{w\ell}} = \frac{v_{pw} - v_{p\ell} + 1}{V_{w\ell}}. \]

Hence,

\[ e_p \le \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - v_{p\ell} + 1}{V_{w\ell}} \equiv \tilde{u}_p. \]
Knowing that \(e_p \le \tilde{u}_p\) might let us conclude reliably that \(E < 1\) by examining only a small fraction of the ballots–depending on the values \(\{ \tilde{u}_p\}_{p=1}^N\) and on the values of \(\{e_p\}\) for the audited ballots.
To make inferences about \(E\), it is helpful to work with the taint \(t_p \equiv \frac{e_p}{\tilde{u}_p} \le 1\). Define \(\tilde{U} \equiv \sum_{p=1}^N \tilde{u}_p\). Suppose we draw ballots at random with replacement, with probability \(\tilde{u}_p/\tilde{U}\) of drawing ballot \(p\) in each draw, \(p = 1, \ldots, N\). (Since \(\tilde{u}_p \ge 0\), these are nonnegative numbers that sum to 1, so they define a probability distribution on the \(N\) ballots.)
Let \(T_j\) be the value of \(t_p\) for the ballot \(p\) selected in the \(j\)th draw. Then \(\{T_j\}_{j=1}^n\) are IID, \(\mathbb{P} \{T_j \le 1\} = 1\), and

\[ \mathbb{E} T_1 = \sum_{p=1}^N \frac{\tilde{u}_p}{\tilde{U}} t_p = \frac{1}{\tilde{U}} \sum_{p=1}^N \tilde{u}_p \frac{e_p}{\tilde{u}_p} = \frac{1}{\tilde{U}} \sum_{p=1}^N e_p = \frac{E}{\tilde{U}}. \]
Thus \(E = \tilde{U} \mathbb{E} T_1\).
So, if we have strong evidence that \(\mathbb{E} T_1 < 1/\tilde{U}\), we have strong evidence that \(E < 1\).
This approach can be simplified even further by noting that \(\tilde{u}_p\) has a simple upper bound that does not depend on any \(v_{pi}\). At worst, the CVR for ballot \(p\) shows a vote for the “least-winning” apparent winner of the contest with the smallest margin, but a hand interpretation shows a vote for the runner-up in that contest. Since \(V_{w\ell} \ge V\) and \(0 \le v_{pi} \le 1\),

\[ \tilde{u}_p = \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - v_{p\ell} + 1}{V_{w\ell}} \le \frac{1 - 0 + 1}{V} = \frac{2}{V}. \]

Thus, if we define \(u_p \equiv 2/V\) and sample ballots at random with probability proportional to \(u_p\), in fact we will sample ballots with equal probability. Define

\[ U \equiv \sum_{p=1}^N u_p = \frac{2N}{V} = \frac{2}{\mu}, \]

and re-define \(t_p \equiv e_p/u_p\) (rather than \(e_p/\tilde{u}_p\)); let \(T_j\) be the value of \(t_p\) for the ballot selected at random in the \(j\)th draw, as before. Then still \(\{T_j\}_{j=1}^n\) are IID, \(\mathbb{P} \{T_j \le 1\} = 1\), and

\[ \mathbb{E} T_1 = \sum_{p=1}^N \frac{1}{N} \frac{e_p}{u_p} = \frac{V}{2N} \sum_{p=1}^N e_p = \frac{\mu}{2} E, \]

i.e.,

\[ E = \frac{2}{\mu} \mathbb{E} T_1 = U \, \mathbb{E} T_1. \]
So, if we have evidence that \(\mathbb{E} T_1 < \mu/2 = 1/U\), we have evidence that \(E < 1\).
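To make the definitions concrete, here is a toy computation (made-up reported and actual votes for a single two-candidate plurality contest, so \(V_{w\ell} = V\)) of the errors \(e_p\), the bound \(u_p = 2/V\), the total \(U = 2/\mu\), and the taints:

# Toy example: one plurality contest, two candidates, five ballots (all numbers invented).
# Each entry is (vote for the reported winner, vote for the reported loser).
reported = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 0)]   # reported: winner 3, loser 1
actual   = [(1, 0), (0, 0), (1, 0), (0, 1), (0, 0)]   # one "winner" vote was really an undervote

N = len(reported)
V = sum(vw - vl for vw, vl in reported)    # reported margin in votes: 2
mu = V / N                                 # diluted margin: 0.4

# e_p: overstatement of the (only) margin on ballot p, as a fraction of that margin
e = [(vw - aw - vl + al) / V for (vw, vl), (aw, al) in zip(reported, actual)]

u = 2 / V          # a priori upper bound on each e_p
U = N * u          # total of the upper bounds, equal to 2/mu
taints = [e_p / u for e_p in e]

print(e)           # [0.0, 0.5, 0.0, 0.0, 0.0]: the misread ballot overstates the margin by 1 vote
print(U, taints)   # 5.0 [0.0, 0.5, 0.0, 0.0, 0.0]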
\(P\)-values¶
We digress to define \(P\) values a little more carefully than elementary books typically do. See also Stark, P.B., 2016. The value of P-values, The American Statistician, 70, DOI:10.1080/00031305.2016.1154108, http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108
There is a hypothesis about the world. A (possibly multivariate) datum \(Y\) will be observed; \(Y\) is a random variable that takes values in a measurable space \(\mathcal{Y}\). If the hypothesis is true, the probability distribution \(G\) of \(Y\) is in some known set \(\mathcal{G}\) of probability distributions.
The null hypothesis is the assertion that \(G \in \mathcal{G}\); if the null hypothesis is false, \(G \notin \mathcal{G}\). (For simplicity, we suppose that all the distributions in \(\mathcal{G}\) are defined on the same \(\sigma\)-algebra, denoted \(\Sigma\).)
One particularly general and flexible definition of a \(P\)-value is that it is any random variable \(P\) for which \(\Pr_G \{ P \le t \} \le t\) for every \(G \in \mathcal{G}\) and all \(t \in [0, 1]\). That is, it is stochastically smaller than a uniform random variable if the null hypothesis is true.

Another definition of \(P\)-values is in terms of rejection regions for a family of tests. For every value \(\alpha \in (0, 1)\), we have a measurable set \(R_\alpha \in \Sigma\) such that

\[ \sup_{G \in \mathcal{G}} \Pr_G \{ Y \in R_\alpha \} \le \alpha. \]
If the hypothesis \(G \in \mathcal{G}\) is true, the chance that the datum \(Y\) will be in the set \(R_\alpha\) is no greater than \(\alpha\). The set \(R_\alpha\) is the rejection region of a significance-level \(\alpha\) test of the null hypothesis. The hypothesis is rejected at significance level \(\alpha\) if \(Y \in R_\alpha\). If we conclude that the hypothesis is false whenever \(Y \in R_\alpha\), the chance of erroneously concluding that the hypothesis is false if it is in fact true is at most \(\alpha\).
To define \(P\)-values, we shall also insist that rejection regions for different significance levels nest: If \(0 < \beta < \gamma < 1\), then

\[ R_\beta \subset R_\gamma. \]
This ensures that if the family of tests rejects the hypothesis at significance level \(\alpha\), it also rejects the hypothesis at every significance level greater than \(\alpha\). We also define \(R_1 \equiv \mathcal{Y}\).
Given this structure, the \(P\)-value of the null hypothesis given the observation \(Y = y\) is

\[ P(y) \equiv \inf \{ \alpha \in (0, 1] : y \in R_\alpha \}. \]
That is, the \(P\)-value is the smallest significance level at which the family of tests rejects the hypothesis. Because of the nesting, the family rejects the hypothesis for all significance levels greater than the \(P\)-value.
The \(P\)-value depends not only on the hypothesis \(\mathcal{G}\) and the observation that \(Y = y\), but also on the family of hypothesis tests (the sets \(R_\alpha\), \(\alpha \in (0, 1]\)). Different tests in general give different \(P\)-values for the same hypothesis given the same data.
One test is better (more powerful) than another if
no matter what value \(y\) the datum \(Y\) takes, the first test assigns a \(P\)-value no larger than the second test does, and
there is some \(y \in \mathcal{Y}\) such that the first test assigns a smaller \(P\)-value than the second when \(Y = y\).
Many tests are defined by a test statistic–a single-number summary \(\phi(Y)\) of the vector \(Y\), where \(\phi\) is a \(\Sigma\)-measurable function from \(\mathcal{Y}\) to \(\Re\). The rejection region is then expressed in terms of \(\phi(Y)\): The null hypothesis is rejected if \(\phi(Y) \le f_\alpha\), where \(f_\alpha\) is chosen to satisfy

\[ \sup_{G \in \mathcal{G}} \Pr_G \{ \phi(Y) \le f_\alpha \} \le \alpha. \]

The nesting condition requires that for \(0 < \beta < \gamma < 1\),

\[ f_\beta \le f_\gamma, \]

i.e., \(f_\alpha\) should increase monotonically with \(\alpha\). The \(P\)-value of the hypothesis for the observation \(Y = y\) is then

\[ P(y) = \inf \{ \alpha \in (0, 1] : \phi(y) \le f_\alpha \}. \]
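To illustrate these definitions outside the auditing context, consider a toy example (mine, not from the readings): test whether a coin is fair using \(Y\), the number of heads in 10 tosses, with test statistic \(\phi(Y) = -Y\), so the test rejects when there are many heads. The \(P\)-value for the observation \(Y = y\) is then the null probability of a result at least as extreme as \(y\):

from math import comb

def binom_upper_tail(k, n, p=0.5):
    # P{Y >= k} when Y ~ Binomial(n, p)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Null hypothesis: the coin is fair (p = 1/2). Datum: Y = number of heads in 10 tosses.
# With phi(Y) = -Y, the nested rejection regions are {Y >= k} for thresholds k, and the
# P-value for Y = y is the smallest significance level at which {Y >= y} is a valid
# level-alpha rejection region, namely P{Y >= y} computed under the null.
y = 9
print(binom_upper_tail(y, 10))   # about 0.0107: reject at level 0.05 but not at level 0.01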
The Kaplan-Markov Inequality¶
Suppose \(\{X_j\}_{j=1}^n\) are IID with \(\mathbb{P}\{X_j \ge 0\} = 1\). Form the nonnegative martingale

\[ Z_k \equiv \prod_{j=1}^k \frac{X_j}{\mu}, \quad k = 1, \ldots, n. \]

If \(\mathbb{E} X_j = \mu\), the expected value of each term \(X_j/\mu\) is 1 (and if \(\mathbb{E} X_j \le \mu\), each term has expected value at most 1, so \(\{Z_k\}\) is a nonnegative supermartingale). Kaplan (1987) uses Ville’s (1939) inequality for an optionally stopped nonnegative martingale \(Z_1, Z_2, \ldots\): for any \(z > 0\), \(\mathbb{P} \{ \max_{1 \le k \le n} Z_k \ge z \} \le \mathbb{E} Z_n/z\). In the case at hand, that gives:

\[ \mathbb{P} \left \{ \max_{1 \le k \le n} \prod_{j=1}^k \frac{X_j}{\mu} \ge z \right \} \le \frac{1}{z}. \]
This is the Kaplan-Markov bound (see also Stark, 2009).
We can reject the hypothesis that \(\mathbb{E} X_j \le \mu\) at significance level \(\alpha\) if

\[ \prod_{j=1}^n \frac{X_j}{\mu} \ge \frac{1}{\alpha}. \]

If we observe \((X_j = x_j)_{j=1}^n\), the \(P\)-value of the hypothesis that \(\mathbb{E} X_1 \le \mu\) is

\[ P = \min \left ( 1, \; \prod_{j=1}^n \frac{\mu}{x_j} \right ). \]
This derivation gave a \(P\)-value for the hypothesis that \(\mathbb{E} X_1 \le \mu\) starting with the assumptions that \(\{X_j\}_{j=1}^n\) are IID and \(\mathbb{P}\{X_1 \ge 0\} = 1\). But in election auditing, we want a \(P\)-value for the hypothesis \(\mathbb{E} T_1 \ge 1/U\), and we know that \(\{T_j\}_{j=1}^n\) are IID and \(\mathbb{P}\{T_1 \le 1\} = 1\). The transformation \(X_j = 1 - T_j\) transforms one problem into the other, with \(\mu = 1 - 1/U\).

The \(P\)-value for the hypothesis \(E \ge 1\) if we observe \((T_j = t_j)_{j=1}^n\) is

\[ P_{KM} = \min \left( 1, \; \prod_{j=1}^n \frac{1 - 1/U}{1 - t_j} \right). \]
The \(P\)-value \(P_{KM}\) can depend on the order in which the data are collected, which is discomfiting unless the data are in fact collected sequentially.
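A minimal implementation of \(P_{KM}\) from a list of observed taints might look like the following (the function name, the cap at 1, and the example numbers are my own choices; the category-count version actually used in practice appears later in these notes):

def km_p_value_from_taints(taints, U):
    # Kaplan-Markov P-value for the hypothesis E >= 1, given the taints t_j observed in
    # draws made with probability proportional to the error bounds u_p.
    # Returns 1 (no evidence) if any observed taint equals 1, since the product then blows up.
    p = 1.0
    for t in taints:
        if t >= 1:
            return 1.0
        p *= (1 - 1/U) / (1 - t)
    return min(1.0, p)

# Assumed values: total error bound U = 40.
print(km_p_value_from_taints([0.0] * 120, U=40))           # no errors in 120 draws: about 0.048
print(km_p_value_from_taints([0.5] + [0.0] * 119, U=40))   # one taint of 1/2: about 0.096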
Padding the Kaplan-Markov Inequality¶
The basic Kaplan-Markov inequality isn’t helpful if any observed \(t_p\) is equal to 1, that is, if the corresponding ballot was reported to have a vote for the winner with the fewest votes (in the tightest contest), but an audit shows it to have a vote for the loser with the most votes. If equipment is functioning properly, such an error should not occur frequently: it would be much more common for a scanner to miss a mark, to treat a “hesitation mark” as an overvote, etc. However, if the auditors mistakenly retrieve the wrong ballot, such an error might occur more frequently, in turn requiring a full hand count even when the reported outcome is correct.
To hedge against the possibility that the ratio of \(e_p\) to its upper bound is equal to 1 for any ballot in the sample, we can increase the upper bound so that it cannot be attained, for instance, by inflating it by a small percentage. The risk remains strictly controlled.
To that end, we take the error bound for each ballot to be \( u_p \equiv 2\gamma/V > \tilde{u}_p \), where the “inflator” \(\gamma > 1\). That ensures that \(e_p/u_p\) cannot be larger than \(1/\gamma < 1\). The cost of inflating the upper bound in this way is that a larger sample will be needed than if \(\{\tilde{u}_p\}\) were used as the bounds and the sample did not happen to include any ballots with \(e_p\) equal to \(\tilde{u}_p\). On the other hand, inflating the error bounds can help avoid a full count when that full count would merely confirm that all the apparent outcomes are correct; that is, using \(\gamma > 1\) can increase the power of the test against some alternatives. The larger the value of \(\gamma\), the larger the initial sample needs to be to allow the audit to stop if at most a given number of ballots overstated one or more margins by one vote, but the less the sample will need to be expanded if ballots in the sample overstate any margin by two votes–unless a full hand count is required.
With \(u_p\) defined as above, the total error bound across all \(N\) ballots becomes \( U \equiv 2\gamma N/V = 2\gamma/\mu, \) where \(\mu\) is the diluted margin \(V/N\). The diluted margin plays an important role in determining the sample size: The initial sample size is \(1/\mu\) multiplied by a constant that depends on the desired simultaneous risk limit, the number of errors of each kind we would like to be able to tolerate without expanding the audit, and the inflator \(\gamma\).
Suppose that \(n\) of the \(N\) ballots are drawn with replacement with equal probability. Let \(e_j\) be the value of the error \(e_p\) for the \(j\)th randomly selected ballot. The Kaplan-Markov MACRO \(P\)-value is

\[ P_{KM} = \prod_{j=1}^n \frac{1 - 1/U}{1 - \frac{e_j}{2\gamma/V}}. \]
An audit with simultaneous risk limit \(\alpha\) can be conducted by continuing to hand count the votes on ballots selected at random until \(P_{KM} \le \alpha\) or until the votes on all the ballots have been counted by hand (and the correct outcome is therefore known).
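Here is a sketch of that sequential procedure as a simulation (the contest size, margin, and error behavior are entirely synthetic; the stopping rule is the padded Kaplan-Markov \(P\)-value just described):

import random

def simulated_km_audit(N, V, gamma=1.03905, alpha=0.05, error_rate=0.002, seed=12345, max_draws=10_000):
    # Simulate a ballot-level comparison audit that stops when the padded Kaplan-Markov
    # P-value drops to the risk limit alpha. Each sampled ballot is a one-vote overstatement
    # (taint 1/(2*gamma)) with probability error_rate and error-free otherwise. Synthetic data only.
    rng = random.Random(seed)
    U = 2 * gamma * N / V              # total error bound with the inflator gamma
    p = 1.0
    for n in range(1, max_draws + 1):
        t = 1 / (2 * gamma) if rng.random() < error_rate else 0.0
        p *= (1 - 1 / U) / (1 - t)     # update the Kaplan-Markov product after each draw
        if p <= alpha:
            return n, p                # the audit can stop after n draws
    return max_draws, p                # never reached the risk limit within max_draws

# Hypothetical contest: 100,000 ballots cast, a 5,000-vote margin (diluted margin 5%).
print(simulated_km_audit(N=100_000, V=5_000))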
A simple approximation¶
The Kaplan-Markov \(P\)-value depends on which margins in which contests are affected by error, and on the nature of those errors. We will develop a simplification that reduces bookkeeping without compromising the risk limit, but at the expense of potentially requiring the audit to inspect more ballots.
The first step is to give up some of the advantages of normalizing errors by the margins they affect, so that errors that affect wide margins will count the same as errors that affect narrow margins.
Recall that a sufficient condition for all the reported outcomes to be correct is

\[ \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} < 1. \]

Since

\[ V_{w\ell} \ge V \quad \mbox{for every contest } c \mbox{ and every } w \in \mathcal{W}_c, \, \ell \in \mathcal{L}_c, \]

an even more restrictive sufficient condition is

\[ \sum_{p=1}^N \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} \frac{v_{pw} - a_{pw} - v_{p\ell} + a_{p\ell}}{V} < 1. \]
This condition insists that the sum over ballots of the largest error in any margin in any contest on the ballot be less than the smallest margin between any winner and any loser in any contest. Clearly, this gives up a lot, since a 1-vote error in a wide margin matters less than a 1-vote error in a narrow margin.
This condition is the same as the condition \(E < 1\) if we replace all the individual pairwise margins \(V_{w\ell}\) by \(V\), the smallest pairwise margin.
An overstatement is an error that caused the margin between some reported winner and some reported loser to be larger than it really is. That is, an overstatement is an error that, if corrected, would reduce the margin. For instance, if the ballot was recorded as an undervote, but manual inspection shows that it was a vote for a reported loser, that is an overstatement error. Correcting it would reduce at least one reported margin.
Conversely, an understatement is an error that caused the margin between some reported winner and some reported loser to be smaller than it really is. That is, an understatement is an error that, if corrected, would increase the margin. For instance, if a ballot was recorded as a vote for a reported loser, but manual inspection of the ballot shows that it was a vote for a reported winner, that is an understatement error. Correcting it would increase at least one reported margin.
Discovering an overstatement reduces confidence that the reported outcomes are correct; discovering an understatement increases confidence in the outcomes—but not by as much.
To have algebraically simple rules, we can make a conservative approximation to the Kaplan-Markov \(P\)-value \(P_{KM}\), such that the approximate value is always larger than the true value, so the approximation will still control the true risk of failing to correct an incorrect reported outcome.
Here is a recap of where we are in the notation, the question, and the strategy:

\(e_p \equiv \max_c \max_{w \in \mathcal{W}_c, \ell \in \mathcal{L}_c} (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V\) is the maximum overstatement of pairwise margins on ballot \(p\); it is the value of \(e_p\) we used before, but normalized by \(V\) instead of \(V_{w\ell}\)
\(E \equiv \sum_{p=1}^N e_p\) is an upper bound on the total maximum relative overstatement of any pairwise margin; if \(E < 1\) then all contest outcomes are correct
\(u_p \equiv 2\gamma/V\) is an a priori upper bound on \(e_p\) for every ballot \(p\)
\(U \equiv \sum_p u_p = 2\gamma N/V = 2\gamma/\mu\) is the total of those upper bounds
\(t_p \equiv e_p/u_p\) is the taint of ballot \(p\)
we are testing the hypothesis \(E \ge 1\)
the audit will draw ballots IID with probability \(1/N\) of selecting each ballot in each draw
We will develop a bound that depends on the data only through the number of ballots among the \(n\) ballots in the sample in each of four categories:
\(o_1\): #ballots that overstate any margin by one vote but no margin by two votes. For such a ballot, \(e_p = 1/V\) and \(t_p = 1/(2\gamma)\).
\(u_1\): #ballots that understate any margin by exactly one vote, and every margin by at least one vote. For such a ballot, \(e_p = -1/V\) and \(t_p = -1/(2\gamma)\).
\(o_2\): #ballots that overstate any margin by two votes. For such a ballot, \(e_p = 2/V\) and \(t_p = 1/\gamma\).
\(u_2\): #ballots that understate every margin by two votes (in general, if a contest has more than one reported loser, there cannot be any such ballots). For such a ballot, \(e_p = -2/V\) and \(t_p = -1/\gamma\).
Then, once we have drawn \(n\) ballots at random,

\[ P_{KM} \le \left(1 - \frac{1}{U}\right)^{n} \left(1 - \frac{1}{2\gamma}\right)^{-o_1} \left(1 - \frac{1}{\gamma}\right)^{-o_2} \left(1 + \frac{1}{2\gamma}\right)^{-u_1} \left(1 + \frac{1}{\gamma}\right)^{-u_2} \equiv P(n, o_1, u_1, o_2, u_2; U, \gamma). \]
If the audit stops only when \(P(n, o_1, u_1, o_2, u_2; U, \gamma) \le \alpha\) (or when all ballots have been tallied manually and the result is known), the risk of certifying an incorrect outcome is limited to \(\alpha\).
In practice, it’s easier to work with logarithms, so that the only operations required are addition and subtraction. The termination condition becomes:

\[ n \log\left(1 - \frac{1}{U}\right) - o_1 \log\left(1 - \frac{1}{2\gamma}\right) - o_2 \log\left(1 - \frac{1}{\gamma}\right) - u_1 \log\left(1 + \frac{1}{2\gamma}\right) - u_2 \log\left(1 + \frac{1}{\gamma}\right) \le \log \alpha. \]
Estimating the required sample size¶
Suppose we expect to see errors of the various kinds, \((o_1, o_2, u_1, u_2)\), at rates \((\omega_1, \omega_2, \upsilon_1, \upsilon_2)\), respectively. How large should we expect \(n\) will need to be before the audit can stop?
We expect \(n \omega_1\) 1-vote overstatements in a sample of size \(n\), etc. The expected termination condition is thus:

\[ n \left[ \log\left(1 - \frac{1}{U}\right) - \omega_1 \log\left(1 - \frac{1}{2\gamma}\right) - \omega_2 \log\left(1 - \frac{1}{\gamma}\right) - \upsilon_1 \log\left(1 + \frac{1}{2\gamma}\right) - \upsilon_2 \log\left(1 + \frac{1}{\gamma}\right) \right] \le \log \alpha. \]

Solving for \(n\) gives

\[ n \ge \frac{\log \alpha}{\log\left(1 - \frac{1}{U}\right) - \omega_1 \log\left(1 - \frac{1}{2\gamma}\right) - \omega_2 \log\left(1 - \frac{1}{\gamma}\right) - \upsilon_1 \log\left(1 + \frac{1}{2\gamma}\right) - \upsilon_2 \log\left(1 + \frac{1}{\gamma}\right)}, \]
provided the denominator of the right hand side is strictly negative—which is a necessary condition for the audit to terminate without a full hand count (in expectation).
It should be clear that the audit ought not terminate if the net rate of error is greater than or equal to the diluted margin, that is, if \(\omega_1 + 2\omega_2 - \upsilon_1 - 2\upsilon_2 \ge \mu\): if the net error could account for the margin, the audit should go to a full hand count. The restriction on the denominator is slightly stronger (more restrictive), because of the asymmetry between overstatements and understatements in the Kaplan-Markov \(P\)-value: for \(0 < x < 1\),

\[ -\log(1-x) > \log(1+x), \]

which one can see from the two Taylor series:

\[ -\log(1-x) = x + \frac{x^2}{2} + \frac{x^3}{3} + \cdots \]

versus

\[ \log(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots. \]
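A quick numerical check of that asymmetry (illustrative values of \(x\) only):

import numpy as np

x = np.array([0.1, 0.25, 0.5])
print(-np.log(1 - x))   # roughly [0.105 0.288 0.693]
print(np.log(1 + x))    # roughly [0.095 0.223 0.405]
# -log(1-x) exceeds log(1+x): an overstatement moves the log of the P-value
# more than an understatement of the same size moves it back.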
import math
import numpy as np

def KM_P_value(n, o1, o2, u1, u2, gamma, mu):
    # Kaplan-Markov P-value for the "super-simple" ballot-level comparison audit:
    # n draws with replacement; o1, o2 one- and two-vote overstatements; u1, u2 one- and
    # two-vote understatements; inflator gamma; diluted margin mu.
    # Note that (1 - mu/(2*gamma))**n = (1 - 1/U)**n, since U = 2*gamma/mu.
    return((1 - mu/(2*gamma))**n *
           (1 - 1/(2*gamma))**(-o1) *
           (1 - 1/gamma)**(-o2) *
           (1 + 1/(2*gamma))**(-u1) *
           (1 + 1/gamma)**(-u2))

def KM_Expected_sample_size(or1, or2, ur1, ur2, mu, gamma, alpha):
    # Expected sample size for the audit to attain risk limit alpha when 1- and 2-vote
    # overstatements and understatements occur at rates or1, or2, ur1, ur2.
    # Returns nan if the expected errors are too large for the audit to terminate
    # short of a full hand count.
    n = np.nan
    denom = (np.log(1 - mu/(2*gamma)) -
             or1*np.log(1 - 1/(2*gamma)) -
             or2*np.log(1 - 1/gamma) -
             ur1*np.log(1 + 1/(2*gamma)) -
             ur2*np.log(1 + 1/gamma))
    if denom < 0:
        n = np.ceil(np.log(alpha)/denom)
    return(n)
alpha = 0.05
gamma = 1.03905
mu = (354040 - 337589)/(354040+337589+33234) # New Hampshire 2016
KM_P_value(200, 1, 0, 0, 0, gamma, mu)
0.21438135077031842
KM_Expected_sample_size(.001,0,.001,0,0.05, gamma, alpha)
125.0
KM_Expected_sample_size(.05,0,0,0,0.05, gamma, alpha)
nan
The Kaplan-Wald Inequality¶
See Kaplan-Wald.
Auditing Super-Majority Contests¶
In many states, some contests (e.g., ballot measures) require a super-majority to pass. For example, some contests require that more than 2/3 of the votes be “yes” votes. Super-majority contests were considered in the first paper on risk-limiting audits (Stark, 2008. Conservative Statistical Post-Election Audits, Ann. Appl. Stat., 2, 550–581 DOI: 10.1214/08-AOAS161), but ballot-level comparison audits have not been developed explicitly for such social choice functions.
Here is a derivation.
Let \(\tau\) be the threshold fraction of votes required for the measure to pass, e.g., \(\tau = 2/3\) for a contest that requires a 2/3 majority.
Suppose that the number of “yes” votes is \(A_{\mathrm{yes}}\), the number of “no” votes is \(A_{\mathrm{no}}\), and that \(N \ge A_{\mathrm{yes}} + A_{\mathrm{no}}\) ballots were cast in all, including undervotes and overvotes.
The “margin” in favor of “yes” in votes is \(A_{\mathrm{yes}}-\tau(A_{\mathrm{yes}}+A_{\mathrm{no}})\). If this margin is positive, “yes” wins. Otherwise, “no” wins. The reported margin is \(v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})\) (if “yes” is the reported winner) or \(\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})-v_{\mathrm{yes}}\) (if “no” is the reported winner).
How much can error in the interpretation of a single ballot affect the margin?
First, suppose that “yes” is the reported winner. Then the effects of errors are:
reported as | actual value | signed error | type
---|---|---|---
yes | invalid | \(1-\tau\) | overstatement
yes | no | \(1\) | overstatement
no | invalid | \(-\tau\) | understatement
no | yes | \(-1\) | understatement
invalid | yes | \(-(1-\tau)\) | understatement
invalid | no | \(\tau\) | overstatement
If “no” is the reported winner, then the errors are:
reported as | actual value | signed error | type
---|---|---|---
yes | invalid | \(-(1-\tau)\) | understatement
yes | no | \(-1\) | understatement
no | invalid | \(\tau\) | overstatement
no | yes | \(1\) | overstatement
invalid | yes | \(1-\tau\) | overstatement
invalid | no | \(-\tau\) | understatement
Note the symmetry: if “yes” won, a given error is the negative of what it would have been had “no” won. The largest possible error is \(1\) vote, regardless of the reported vote (in contrast, for plurality elections the largest possible error is \(2\) votes). As a fraction of the reported margin, that is \(u_p \equiv 1/|v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})|\).
If we want to base the audit on the Kaplan-Markov inequality, we might want to “pad” the bounds as described above, defining \(u_p \equiv \gamma/|v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})|\); we shall incorporate that padding here and below. Define \(m \equiv v_{\mathrm{yes}}-\tau(v_{\mathrm{yes}}+v_{\mathrm{no}})\).
There are three possible magnitudes for an error (\(1\), \(1-\tau\), and \(\tau\)), each with two possible signs.
The total upper bound on the error as a fraction of the margin is

\[ U \equiv \sum_{p=1}^N u_p = \frac{\gamma N}{|m|}. \]
Let \(v_{py}\) denote the number of reported votes for “yes” on ballot \(p\) and let \(v_{pn}\) be the number of reported votes for “no” on ballot \(p\), so \(0 \le v_{py} + v_{pn} \le 1\).
Define \(e_p\) to be the signed overstatement error in the margin in votes resulting from how the voting system interpreted ballot \(p\), normalized by the margin.
For instance, suppose “yes” was reported to have received a supermajority. If \(v_{py} = 1\) and \(v_{pn}=0\) (the vote was recorded as a “yes”), but \(a_{py}=0\) and \(a_{pn}=0\) (the vote was actually invalid), then \(e_p = \frac{1-\tau}{|m|}\).
The possible values for \(e_p\) are
\(0\)
\(\pm 1/|m|\)
\(\pm \tau/|m|\), and
\(\pm (1-\tau)/|m|\).
Let \(E = \sum_p e_p\). The reported outcome is correct if \(E < 1\).
As before, define \(t_p \equiv e_p/u_p = |m|\,e_p/\gamma\), draw ballots independently with replacement with probability \(u_p/U = 1/N\) each, and let \(T_j\) be the value of \(t_p\) for the ballot selected on the \(j\)th draw. Then

\[ \mathbb{E} T_1 = \sum_{p=1}^N \frac{1}{N} \frac{e_p}{u_p} = \frac{|m|}{\gamma N} \sum_{p=1}^N e_p = \frac{E}{U}, \]
so \(E = U \mathbb{E} T_1\), and evidence that \(\mathbb{E}T_1 < 1/U\) is evidence that \(E < 1\).
A risk-limiting audit with risk limit \(\alpha\) can stop if and when

\[ \prod_{j=1}^n \frac{1 - 1/U}{1 - T_j} \le \alpha. \]

Taking logs yields the condition

\[ n \log \left( 1 - \frac{1}{U} \right) - \sum_{j=1}^n \log ( 1 - T_j ) \le \log \alpha. \]
Choosing \(\gamma\)¶
Suppose we want the “cost” (in terms of the increase to the sample size) of mistaking a “no” vote for a “yes” vote to be \(k\) times larger than the “cost” of mistaking an invalid vote for a “yes” vote.
[TO BE COMPLETED]
def KM_Super_P_value(n, N, m, yes_none, yes_no, no_none, no_yes, none_yes, none_no, tau=2/3, gamma=1.03905):
    """
    Kaplan-Markov P-value for a super-majority contest, "yes" versus "no"

    Input
    ------
    n: sample size
    N: number of ballots cast
    m: reported margin of victory: number of "yes" votes minus number required to win;
       negative if "yes" lost: votes_for_yes - tau*(votes_for_yes + votes_for_no)
    yes_none: number of ballots in the audit sample that were reported as "yes" but
        revealed by the audit to be undervotes, overvotes, or otherwise invalid
    yes_no: number of ballots reported as "yes" that were revealed by the audit to
        be "no"
    no_none: # ballots reported as "no" revealed to be invalid or non-votes
    no_yes: # ballots reported as "no" revealed to be "yes"
    none_yes: # ballots reported as invalid/non-votes revealed by the audit to be "yes"
    none_no: # ballots reported as invalid/non-votes revealed by the audit to be "no"
    ...
    tau: threshold for supermajority win
    gamma: padding in Kaplan-Markov. Choose gamma > 1.

    Returns
    -------
    P-value : float
    """
    U = gamma*N/abs(m)           # bound on total error, as a multiple of the margin
    sgn = 1 if m >= 0 else -1    # 1 if "yes" is the reported winner; -1 if not
    return((1 - 1/U)**n *
           (1 - sgn*(1-tau)/gamma)**(-yes_none) *
           (1 - sgn/gamma)**(-yes_no) *
           (1 + sgn*tau/gamma)**(-no_none) *
           (1 + sgn/gamma)**(-no_yes) *
           (1 + sgn*(1-tau)/gamma)**(-none_yes) *
           (1 - sgn*tau/gamma)**(-none_no))
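A hypothetical call, with invented vote totals and audit findings, might look like this:

# Hypothetical example: a 2/3-supermajority measure reported to pass with 70,000 "yes",
# 30,000 "no", and 5,000 invalid ballots; a 300-ballot sample finds one ballot reported
# "yes" that the audit reads as invalid, and one reported invalid that it reads as "yes".
votes_yes, votes_no, N_ballots = 70_000, 30_000, 105_000
m = votes_yes - (2/3)*(votes_yes + votes_no)    # reported margin relative to the 2/3 threshold
print(KM_Super_P_value(n=300, N=N_ballots, m=m, yes_none=1, yes_no=0,
                       no_none=0, no_yes=0, none_yes=1, none_no=0))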