The Neyman “potential outcomes” model for causal inference
There are \(N\) subjects and \(T\) possible treatments. Each subject is represented by a ticket. Ticket \(j\) lists \(T\) numbers, \((x_{j0}, \ldots, x_{j,T-1})\). The value \(x_{jt}\) is the response subject \(j\) will have if assigned to treatment \(t\). (Treatment \(0\) is commonly control or placebo.)
This mathematical set up embodies the non-interference assumption, which means that subject \(j\)’s response depends only on which treatment subject \(j\) receives, and not on the treatment other subjects receive. (That is not a good assumption in situations like vaccine trials, where whether one subject becomes infected may depend on which other subjects are vaccinated, if subjects may come in contact with each other.)
This model is also called the potential outcomes model, because it starts with the potential outcomes each subject will have to each treatment. Assigning a subject to a treatment just reveals the potential outcome that corresponds to that treatment, for that subject. This model was introduced by Jerzy Neyman, the founder of the U.C. Berkeley Department of Statistics, in a 1923 paper in Polish translated into English in 1990. It was popularized by Donald Rubin in the 1970s and 1980s.
There are generalizations of this model, including one in which the “potential outcomes” are random, rather than deterministic, but their distributions are fixed before assignment to treatment: if subject \(j\) is assigned treatment \(t\), a draw from the distribution \(\mathbb{P}_{jt}\) is observed. Draws for different subjects are independent. We shall see an example of this when we discuss nuisance parameters.
Null hypotheses for the Neyman model
The strong null hypothesis is that, subject by subject, the effect of all \(T\) treatments is the same. That is,
\(x_{j0} = x_{j1} = \cdots = x_{j,T-1}, \quad j = 1, \ldots, N.\)
Different subjects may have different responses (\(x_{jt}\) might not equal \(x_{kt}\) if \(j \ne k\)), but each subject’s response is the same regardless of the treatment assigned to that subject. This is the null hypothesis Fisher considered in The Design of Experiments and which he generally considered the “correct” null in practice.
Suppose \(T=2\): we are comparing two treatments. Suppose we assign \(n\) subjects at random to treatment 0 and the other \(m = N-n\) to treatment 1. Let \(\{z_j\}_{j=1}^n\) be the responses of the subjects assigned treatment 0 and \(\{y_j\}_{j=1}^m\) be the responses of the subjects assigned treatment 1. (That is, \(z_1 = x_{k0}\) if \(k\) is the first subject assigned treatment \(0\), and \(y_1 = x_{k1}\) if \(k\) is the first subject assigned treatment \(1\).) Then testing the strong null hypothesis is identical to the two-sample problem: under the strong null, each subject’s response would have been the same, regardless of treatment, so allocating subjects to treatments and observing their responses is just randomly partitioning a fixed set of \(N\) numbers into a group of size \(n\) and a group of size \(m\).
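As a concrete illustration, here is a minimal simulated permutation test of the strong null for two treatments, using the difference in group means as the test statistic. The choice of statistic, the function name two_sample_permutation_p, and the toy data are illustrative assumptions, not part of the development above.

import numpy as np

def two_sample_permutation_p(z, y, reps=10000, seed=1):
    '''
    simulated one-sided permutation P-value for the strong null with two treatments,
    using the difference in group means as the test statistic

    z : responses of the subjects assigned treatment 0
    y : responses of the subjects assigned treatment 1
    '''
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([z, y])   # under the strong null, the pooled responses are fixed
    n = len(z)
    observed = np.mean(y) - np.mean(z)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)           # a random partition into groups of sizes n and m
        hits += (np.mean(pooled[n:]) - np.mean(pooled[:n])) >= observed
    return (hits + 1)/(reps + 1)      # conservative simulated P-value

# toy example: 5 responses per group
two_sample_permutation_p(np.array([0., 1, 0, 1, 0]), np.array([1., 1, 1, 0, 1]))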
The weak null hypothesis is that on average across subjects, all treatments have the same effect. That is,
\(\frac{1}{N} \sum_{j=1}^N x_{j0} = \frac{1}{N} \sum_{j=1}^N x_{j1} = \cdots = \frac{1}{N} \sum_{j=1}^N x_{j,T-1}.\)
Much of Neyman’s work on experiments involves this null hypothesis. The statistical theory is more complex for the weak null hypothesis than for the strong null.
The strong null is indeed a stronger hypothesis than the weak null, because if the strong null is true, the weak null must also be true: if \(T\) lists are equal, element by element, then their means are equal. But the converse is not true: the weak null can be true even if the strong null is false. For example, for \(T=2\) and \(N=2\), we might have potential responses \((0, 1)\) for subject 1 and \((1,0)\) for subject 2. The effect of treatment is to increase subject 1’s response from 0 to 1 and to decrease subject 2’s response from 1 to 0. The treatment affects both subjects, but the average effect of treatment is the same: the average response across subjects is 1/2, with or without treatment.
If we can test the weak null, we can also make inferences about the average treatment effect
\(\tau := \frac{1}{N} \sum_{j=1}^N (x_{j1} - x_{j0})\)
(when treatment 1 is being compared to control).
If we can only test the strong null, in general we have to make assumptions about how treatment affects responses in order to make inferences about the average treatment effect.
Alternative hypotheses in the Neyman model
Constant shift
For example, if we assume that the effect of treatment is to shift every subject’s response by the same amount, then we can use a test of the strong null to make inferences about that constant effect. In symbols, this alternative states that there is some number \(\Delta\) such that \(x_{j1} = x_{j0}+\Delta\) for all subjects \(j\).
Again, once the original data are observed, this hypothesis completely specifies the probability distribution of the data: we know what subject \(j\)’s response would have been had the subject been assigned the other treatment. If the subject was assigned treatment 0, the response would have been larger by \(\Delta\) if the subject had been assigned treatment 1 instead. If the subject was assigned treatment 1, the response would have been smaller by \(\Delta\) if the subject had been assigned treatment 0 instead.
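Here is a sketch of how inverting such tests yields a confidence set for the shift: to test the hypothesis that the shift is \(\Delta\), subtract \(\Delta\) from the treated responses and apply a permutation test of the strong null to the adjusted data, then collect the values of \(\Delta\) that are not rejected. The grid of candidate shifts, the two-sided statistic, and the function name below are choices made for illustration.

import numpy as np

def shift_confidence_set(z, y, deltas, alpha=0.05, reps=2000, seed=2):
    '''
    confidence set for a constant additive treatment effect, found by inverting
    permutation tests of the strong null applied to (z, y - Delta)

    deltas : grid of candidate shifts to test (an illustrative, not exhaustive, search)
    '''
    rng = np.random.default_rng(seed)
    n = len(z)
    keep = []
    for d in deltas:
        adjusted = np.concatenate([z, y - d])  # exchangeable if the true shift is d
        obs = np.mean(y - d) - np.mean(z)
        dist = np.empty(reps)
        for r in range(reps):
            rng.shuffle(adjusted)
            dist[r] = np.mean(adjusted[n:]) - np.mean(adjusted[:n])
        p = (np.sum(np.abs(dist) >= abs(obs)) + 1)/(reps + 1)  # two-sided simulated P-value
        if p > alpha:
            keep.append(d)
    return keep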
Other tractable alternative hypotheses
A more general alternative is that \(x_{j1} = f(x_{j0})\) for some strictly monotonic (and thus invertible) function \(f\). A simple example is that treatment multiplies the response by a constant.
In some contexts, it can be reasonable to assume that treatment can only help, that is that \(x_{j1} \ge x_{j0}\), without specifying a functional relationship between them.
Testing the strong null hypothesis
Under the strong null that the treatment makes no difference whatsoever (as if the response had been predetermined before assignment to treatment or control), the null distribution of any test statistic is completely determined once the data have been observed: we know what the data would have been for any other random assignment, namely, the same. And we know the chance that each of those possible datasets would have resulted from the experiment, since we know how subjects were assigned at random to treatments.
For alternatives that allow us to find \(x_{j0}\) from \(x_{j1}\) and vice versa, the alternative also completely determines the probability distribution of any test statistic, once the data have been observed.
Two treatments, binary responses.
Imagine testing whether a vaccine prevents a disease. We assign a random sample of \(n\) of the \(N\) subjects to receive treatment 1; the other \(N-n\) receive a placebo, treatment 0. Let \(W_j\) denote the treatment assigned to subject \(j\), so \(\sum_{j=1}^N W_j = n\). After some time has passed, we observe
\(X_j := x_{jW_j} = (1-W_j)\, x_{j0} + W_j\, x_{j1}, \quad j = 1, \ldots, N.\)
These are random variables, but (in this model) the only source of randomness is \(\{W_j\}\), the treatment assignment variables.
The total number of infections among the vaccinated is
\(X_1^* := \sum_{j=1}^N W_j X_j,\)
and the total among the unvaccinated is
\(X_0^* := \sum_{j=1}^N (1-W_j) X_j.\)
Under the strong null that the vaccine makes no difference whatsoever (as if whether a subject would become ill was predetermined before assignment to treatment or control), the number of infections among the vaccinated, \(X_1^*\), has a hypergeometric distribution with parameters \(N\), \(G=X_0^*+X_1^*\), and sample size \(n\). Testing the strong null using this hypergeometric distribution yields Fisher’s Exact Test.
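For instance, the one-sided \(P\)-value can be computed from the hypergeometric distribution with scipy. Treating small values of \(X_1^*\) as evidence that the vaccine helps is a choice of direction made here for illustration, as are the function name and toy counts.

from scipy.stats import hypergeom

def fisher_exact_p(x1, n, x0, N):
    '''
    one-sided Fisher exact P-value for the strong null that vaccination makes no difference,
    with small values of X_1^* (infections among the vaccinated) taken as evidence against the null

    x1 : observed infections among the n vaccinated
    x0 : observed infections among the N-n unvaccinated
    '''
    G = x0 + x1                        # total infections, fixed under the strong null
    return hypergeom.cdf(x1, N, G, n)  # P(X_1^* <= x1) under the hypergeometric null

# toy example: 2 of 10 vaccinated and 8 of 10 unvaccinated infected
fisher_exact_p(2, 10, 8, 20)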
This model can be generalized by considering the total number of infections \(X_0^*+X_1^*\) to be random, then conditioning on the observed number of infections to get a conditional test.
Inference about the size of the treatment effect
Recall the “weak” null hypothesis: that the average response is the same, regardless of the assigned treatment (but individuals might have different responses to treatment). Define
\(\bar{X}_0 := \frac{X_0^*}{N-n}\)
and
\(\bar{X}_1 := \frac{X_1^*}{n},\)
the observed mean responses for the control group and the treatment group, respectively. These are unbiased estimates of the corresponding population parameters,
\(\bar{x}_0 := \frac{1}{N} \sum_{j=1}^N x_{j0}\)
and
\(\bar{x}_1 := \frac{1}{N} \sum_{j=1}^N x_{j1}.\)
The average treatment effect for the study population is
\(\tau = \bar{x}_1 - \bar{x}_0 = \frac{1}{N} \sum_{j=1}^N \tau_j,\)
where \(\tau_j := x_{j1}-x_{j0}\). An unbiased estimate of \(\tau\) is \(\hat{\tau} := \bar{X}_1 - \bar{X}_0\).
Since each \(\tau_j\) is either \(-1\), \(0\), or \(1\), the only possible values of \(\tau\) are multiples of \(1/N\). The largest and smallest possible values are \(\tau=-1\) (which occurs if \(x_{j1}=0\) and \(x_{j0}=1\) for all \(j\)) and \(\tau=1\) (which occurs if \(x_{j1}=1\) and \(x_{j0}=0\) for all \(j\)), a range of 2.
The study population can be summarized by four numbers, \(N_{00}\), \(N_{01}\), \(N_{10}\), and \(N_{11}\), where \(N_{ik}\) is the number of subjects \(j\) for whom \(x_{j0} = i\) and \(x_{j1} = k\), for \(i, k \in \{0, 1\}\). That is, \(N_{00}\) is the number of subjects whose response is “0” whether they are assigned to treatment or to control, while \(N_{01}\) is the number of subjects whose response would be “0” if assigned to control and “1” if assigned to treatment, etc. Of course, \(N = N_{00} + N_{01} + N_{10} + N_{11}\). (Note: this notation has the indices in the opposite order from that in Li and Ding (2016).)
Define \(N_{1+} := N_{10} + N_{11}\) and \(N_{+1} := N_{01} + N_{11}\). Then \(N_{1+}\) is the number of subjects whose response would be “1” if assigned to control and \(N_{+1}\) is the number of subjects whose response would be “1” if assigned to treatment. (Note: this notation also has the indices in the opposite order from that in Li and Ding (2016).)
Now \(\sum_{j=1}^N x_{j0} = N_{1+}\) and \(\sum_{j=1}^N x_{j1} = N_{+1}\), so the average treatment effect can be written
\(\tau = \frac{N_{+1} - N_{1+}}{N}.\)
Once the data have been observed, we know that \(N_{1+}\) is at least \(X_0^*\) (we saw that many ones in the control group) and at most \(X_0^* + n\) (if every unobserved control response were one). Similarly, we know that \(N_{+1}\) is at least \(X_1^*\) (we saw that many ones in the treatment group) and at most \(X_1^* + (N-n)\) (if every unobserved treatment response were one). Their difference is thus between \(X_1^* - (X_0^* + n)\) and \((X_1^* + (N-n)) - X_0^*\), so
\(\frac{X_1^* - X_0^* - n}{N} \le \tau \le \frac{X_1^* - X_0^* + (N-n)}{N},\)
which has range \(1\).
First approach to confidence bounds for \(\tau\): Bonferroni simultaneous confidence intervals
A collection of confidence set procedures \(\{\mathcal{I}_i \}_{i=1}^m\) for a corresponding set of parameters \(\{ \theta_i \}_{i=1}^m\) has simultaneous coverage probability \(1-\alpha\) if
\(\mathbb{P} \left ( \bigcap_{i=1}^m \{ \theta_i \in \mathcal{I}_i \} \right ) \ge 1-\alpha,\)
whatever the true values \(\{\theta_i\}\) are.
Bonferroni’s inequality says that for any collection of events \(\{A_i\}_{i=1}^m\),
\(\mathbb{P} \left ( \bigcap_{i=1}^m A_i \right ) \ge 1 - \sum_{i=1}^m \mathbb{P}(A_i^c).\)
It follows that if \(\mathcal{I}_i\) is a confidence interval procedure for \(\theta_i\) with coverage probability \(1-\alpha_i\), \(i=1, \ldots, m\), then the collection of \(m\) confidence set procedures \(\{\mathcal{I}_i \}_{i=1}^m\) has simultaneous coverage probability not smaller than \(1-\sum_{i=1}^m \alpha_i\).
In the current situation, if we had simultaneous confidence sets for both \(N_{1+}\) and \(N_{+1}\), we could find a confidence set for \(\tau\), because \(\tau = (N_{+1} - N_{1+})/N\).
Now, \(X_1^*\) is the number of “1”s in a random sample of size \(n\) from a population of size \(N\) of which \(N_{+1}\) are labeled “1” and the rest are labeled “0.” Similarly, \(X_0^*\) is the number of “1”s in a random sample of size \(N-n\) from a population of size \(N\) of which \(N_{1+}\) are labeled “1” and the rest are labeled “0.” That is, the probability distribution of \(X_1^*\) is hypergeometric with parameters \(N\), \(G=N_{+1}\), and sample size \(n\), and the probability distribution of \(X_0^*\) is hypergeometric with parameters \(N\), \(G=N_{1+}\), and sample size \(N-n\). The variables \(X_1^*\) and \(X_0^*\) are dependent.
To construct an upper 1-sided \(1-\alpha\) confidence bound for \(\tau\), we can find an upper 1-sided \(1-\alpha/2\) confidence bound for \(N_{+1}\), subtract a lower 1-sided \(1-\alpha/2\) confidence bound for \(N_{1+}\), and divide the result by \(N\).
To construct a lower 1-sided \(1-\alpha\) confidence bound for \(\tau\), we can find a lower 1-sided \(1-\alpha/2\) confidence bound for \(N_{+1}\), subtract an upper 1-sided \(1-\alpha/2\) confidence bound for \(N_{1+}\), and divide the result by \(N\).
To construct a 2-sided confidence interval for \(\tau\), we can find a 2-sided \(1-\alpha/2\) confidence bound for \(N_{+1}\) and a 2-sided \(1-\alpha/2\) confidence bound for \(N_{1+}\). The lower endpoint of the \(1-\alpha\) confidence interval for \(\tau\) is the lower endpoint of the 2-sided interval for \(N_{+1}\) minus the upper endpoint of the 2-sided interval for \(N_{1+}\), divided by \(N\). The upper endpoint of the \(1-\alpha\) confidence interval for \(\tau\) is the upper endpoint of the 2-sided interval for \(N_{+1}\) minus the lower endpoint of the 2-sided interval for \(N_{1+}\), divided by \(N\).
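Here is a minimal sketch of this construction. The hypergeometric confidence bounds for \(G\) are obtained by inverting tail tests, one standard conservative construction; the function names, the simple linear search, and the toy counts are illustrative choices, not part of the text above.

from scipy.stats import hypergeom

def hypergeom_G_bounds(x, samp, N, alpha):
    '''
    conservative one-sided (1-alpha) lower and upper confidence bounds for G, the number
    of 1s in a population of size N, from x 1s in a simple random sample of size samp,
    found by inverting hypergeometric tail tests
    '''
    lower = x                                   # G is at least the number of 1s observed
    for G in range(x, N - (samp - x) + 1):
        if hypergeom.sf(x - 1, N, G, samp) > alpha:   # P(X >= x | G) > alpha: cannot reject G
            lower = G
            break
    upper = N - (samp - x)                      # G is at most N minus the number of 0s observed
    for G in range(N - (samp - x), x - 1, -1):
        if hypergeom.cdf(x, N, G, samp) > alpha:      # P(X <= x | G) > alpha: cannot reject G
            upper = G
            break
    return lower, upper

def bonferroni_tau_ci(x1, n, x0, N, alpha=0.05):
    '''
    conservative two-sided (1-alpha) Bonferroni confidence interval for tau = (N_{+1} - N_{1+})/N,
    from x1 ones among the n treated and x0 ones among the N-n controls; each one-sided
    hypergeometric bound is at level 1-alpha/4, so each parameter gets a two-sided 1-alpha/2
    interval and Bonferroni gives simultaneous coverage of at least 1-alpha
    '''
    lo_p1, up_p1 = hypergeom_G_bounds(x1, n, N, alpha/4)        # bounds for N_{+1}
    lo_1p, up_1p = hypergeom_G_bounds(x0, N - n, N, alpha/4)    # bounds for N_{1+}
    return (lo_p1 - up_1p)/N, (up_p1 - lo_1p)/N

# toy example: 4 of 10 treated and 8 of 10 controls respond "1"
bonferroni_tau_ci(4, 10, 8, 20)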
This approach has a number of advantages: it is conceptually simple, conservative, and only requires the ability to compute confidence intervals for \(G\) for hypergeometric distributions. But the intervals will in general be quite conservative, i.e., unnecessarily wide. (See Table 1 of Li and Ding (2016).) We now examine a sharper approach.
Second approach: testing all tables of potential outcomes
After the randomization, for each subject \(j\), we observe either \(x_{j0}\) or \(x_{j1}\). At that point, we know \(N\) of the \(2N\) numbers \(\{x_{jk} : j = 1, \ldots, N; \; k = 0, 1\}\). The other \(N\) numbers, the responses that were not observed, can be any combination of 0s and 1s: there are \(2^N\) possibilities in all. But not all of those yield distinguishable tables of results.
At the end of the day, the average treatment effect \(\tau\) is determined by the four numbers, \(N_{00}\), \(N_{01}\), \(N_{10}\), and \(N_{11}\), which sum to \(N\). How many possible values are there for those four numbers? The number of ways of partitioning \(N\) items into 4 groups can be found by Feller’s “bars and stars” argument (see the notes on nuisance parameters); the answer is \(\binom{N+3}{3} = (N+3)(N+2)(N+1)/6\). This is \(O(N^3)\) tables (see Rigdon and Hudgens (2015)).
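As a quick check of that count, here is a one-line computation; the value of \(N\) used is arbitrary.

import math

N = 10
math.comb(N + 3, 3)   # (N+3)(N+2)(N+1)/6 = 286 tables of four nonnegative counts summing to N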
But many of those tables are incompatible with the observed data. For instance, we know that \( N_{1+} \ge X_0^*\) and \(N_{+1} \ge X_1^*\). Li and Ding (2016) show that taking into account the observed data reduces the number of tables that must be considered to \(O(N^2)\), greatly speeding the computation.
A permutation approach
Together, \(N_{00}\), \(N_{01}\), \(N_{10}\), and \(N_{11}\) determine the sampling distribution of any statistic, through the random allocation of subjects to treatments. Let \(\mathbf{N} := (N_{00}, N_{01}, N_{10}, N_{11})\) be the 2x2 summary table of counts of subjects with each combination of potential outcomes. To test the null hypothesis \(H_0: \mathbf{N} = \mathbf{N}_0\), we can define some function \(T\) of \(X_0^*\) and \(X_1^*\) and reject \(H_0\) if the observed value of \(T\) is in the tail of the probability distribution corresponding to \(H_0\).
What should we use for \(T\)? Since we are interested in \(\tau\), we might base the test on
\(\hat{\tau} := \frac{X_1^*}{n} - \frac{X_0^*}{N-n}.\)
This is an unbiased estimator of \(\tau\): \(X_1^*/n\) is the sample mean of a simple random sample of size \(n\) from \(\{x_{j1}\}_{j=1}^N\), so it is unbiased for \(N_{+1}/N\), and \(X_0^*/(N-n)\) is the sample mean of a simple random sample of size \(N-n\) from \(\{x_{j0}\}_{j=1}^N\), so it is unbiased for \(N_{1+}/N\). The difference is thus unbiased for \(N_{+1}/N - N_{1+}/N = \tau\).
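A quick simulation can illustrate the unbiasedness, averaging \(\hat{\tau}\) over many random assignments for a made-up table of potential outcomes. The table, the group size, and the number of replications below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(3)

# made-up potential outcomes for N = 8 subjects: column 0 is x_{j0}, column 1 is x_{j1}
po = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 1], [0, 0], [1, 1]])
N, n = len(po), 4
tau = np.mean(po[:, 1] - po[:, 0])    # true average treatment effect for this population

tau_hats = np.empty(20000)
for r in range(20000):
    treat = rng.choice(N, size=n, replace=False)    # random assignment of n subjects to treatment
    mask = np.zeros(N, dtype=bool)
    mask[treat] = True
    tau_hats[r] = np.mean(po[mask, 1]) - np.mean(po[~mask, 0])   # X_1^*/n - X_0^*/(N-n)

tau, np.mean(tau_hats)                # the average of tau_hat should be close to tau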
Rigdon and Hudgens (2015) and Li and Ding (2016) take the test statistic to be \(T=|\hat{\tau} - \tau|\) (they look at other things, too).
Recall that \(N_{1+} := N_{10} + N_{11}\), \(N_{+1} := N_{01} + N_{11}\), and
\(\tau = \frac{N_{+1} - N_{1+}}{N} = \frac{N_{01} - N_{10}}{N}.\)
For any 2x2 summary table \(\mathbf{M}\) of counts of potential outcomes, define
\(\tau(\mathbf{M}) := \frac{M_{01} - M_{10}}{N}.\)
If we do not reject the simple hypothesis \(\mathbf{N} = \mathbf{N}_0\), then we cannot reject the hypothesis \(\tau = \tau(\mathbf{N}_0) := \tau_0\). We can reject the composite hypothesis \(\tau = \tau_0\) if we can reject the simple hypothesis \(\mathbf{N} = \mathbf{M}\) for every summary table \(\mathbf{M}\) for which \(\tau(\mathbf{M}) = \tau_0\).
We can calibrate the test of the hypothesis \(\mathbf{N} = \mathbf{N}_0\) by finding the probability distribution of \(T\) under \(H_0\): given the summary table \(\mathbf{N}_0\) of counts of subjects with each possible combination of potential outcomes, we can construct a table of potential outcomes that is consistent with that table. For each of the \(\binom{N}{n}\) equally likely possible treatment assignments, we can find the corresponding value of \(T\) by allocating the subjects accordingly, finding \(\hat{\tau}\), subtracting \(\tau(\mathbf{N}_0)\), and taking the absolute value.
When \(N\) is large and \(n\) is not close to \(0\) or \(N\), it is impractical to construct all \(\binom{N}{n}\) possible treatment assignments. Then, we might approximate the probability distribution by simulation: actually drawing pseudo-random samples of size \(n\) from the subjects, considering them to be the treatment group, calculating \(\hat{\tau}\), etc. As discussed in the chapter on testing, that can be viewed as an approximation to a theoretical \(P\)-value or as the exact \(P\)-value for a randomized test.
Enumerating all 2x2 tables consistent with the data
The table \(\mathbf{N}\) summarizing potential outcomes is constrained by the data. For instance, \(N_{10} + N_{11} \ge X_0^*\) and \(N_{01} + N_{11} \ge X_1^*\). There are other constraints, too, as we shall see.
Define
\(n_{wk} := \#\{1 \le j \le N : W_j = w \mbox{ and } X_j = k\},\)
for \(w \in \{0, 1\}\) and \(k \in \{0, 1\}\). That is, \(n_{00} = (N-n)-X_0^*\), \(n_{01} = X_0^*\), \(n_{10} = n-X_1^*\), and \(n_{11} = X_1^*\). Clearly, \(\sum_{w,k} n_{wk} = N\): these numbers count the elements in a partition of the subjects. Using this notation, \(\hat{\tau} = n_{11}/n - n_{01}/(N-n)\).
Rigdon and Hudgens (2015) show that it suffices to examine \(n_{RH} := (n_{11} + 1)(n_{10} + 1)(n_{01} + 1)(n_{00} + 1)\) 2x2 tables to find confidence bounds for \(\tau\) using \(\hat{\tau}\) as the test statistic.
Their argument is as follows: Consider the \(n_{11}\) subjects who were assigned to treatment \(w=1\) and whose response was \(x_{j1}=1\). Fix the unobserved values of the remaining \(N-n_{11}\) subjects. As the unobserved responses of those \(n_{11}\) subjects vary, the value of \(\tau\) and the probability distribution of \(T\) depend only on how many of them have \(x_{j0}=0\) and how many have \(x_{j0}=1\). The number \(m\) for which \(x_{j0}=0\) could be 0, 1, \(\ldots\), or \(n_{11}\); the other \(n_{11}-m\) have \(x_{j0}=1\). There are thus only \(n_{11}+1\) tables that need to be examined, given the unobserved values in the other three groups. By the same argument, as the unobserved values of the \(n_{01}\) subjects vary, holding constant the unobserved values for the rest, there are at most \(n_{01}+1\) distinct values of \(\tau\) and distinct probability distributions for \(T\). The same goes for the \(n_{10}\) and \(n_{00}\) groups. By the fundamental rule of counting, the total number of tables that give rise to distinguishable probability distributions for \(T\) is thus at most \(n_{RH}\).
Li and Ding (2016) improve on that result. They prove that a table \(\mathbf{N} = (N_{00}, N_{01}, N_{10}, N_{11})\) is consistent with the observed counts \((n_{wk})\) if and only if a small set of inequalities relating \(\mathbf{N}\) to \((n_{wk})\) holds (their Theorem 1, implemented in filterTable() below).
The algorithm
The overall approach is as follows:
1. The values \(N\), \((n_{wk})\), and \(\alpha\) are given. Set
\(\tau^* := \frac{n_{11}}{n} - \frac{n_{01}}{N-n},\)
the observed value of the unbiased estimate of \(\tau\).
2. For each table \(\mathbf{N}\) that is algebraically consistent with the observed values \((n_{wk})\):
- find \(\tau(\mathbf{N}) = \frac{N_{01}-N_{10}}{N}\)
- calculate \(t = |\tau^* - \tau(\mathbf{N})|\)
- create a “full” table of potential outcomes \((x_{jk})\) consistent with the summary table \(\mathbf{N}\)
- find or simulate the sampling distribution of \(|\hat{\tau} - \tau(\mathbf{N})|\) using \((x_{jk})\)
- if \(t\) is above the \(1-\alpha\) percentile of that sampling distribution, reject the table; otherwise, include \(\tau(\mathbf{N})\) in the confidence set
3. Report the smallest and largest values of \(\tau(\mathbf{N})\) in the confidence set as the endpoints of the confidence interval.
Below, the main step, generating tables that are algebraically consistent with the data, is implemented in Python (N_generator()), as is the step of creating a table of potential outcomes consistent with \(\mathbf{N}\) (potential_outcomes()).
import math
import numpy as np
import scipy as sp
from itertools import filterfalse
def N_generator(N, n00, n01, n10, n11):
'''
generate tables algebraically consistent with data from an experiment with binary outcomes
Parameters
----------
N : int
number of subjects
n00 : int
number of subjects assigned to treatment 0 who had outcome 0
n01 : int
number of subjects assigned to treatment 0 who had outcome 1
n10 : int
number of subjects assigned to treatment 1 who had outcome 0
n11 : int
number of subjects assigned to treatment 1 who had outcome 1
Yields
------
Nt : list of 4 ints
N00, subjects with potential outcome 0 under treatments 0 and 1
N01, subjects with potential outcome 0 under treatment 0 and 1 under treatment 1
N10, subjects with potential outcome 1 under treatment 0 and 0 under treatment 1
N11, subjects with potential outcome 1 under treatments 0 and 1
'''
for i in range(min(N-n00, N-n10)+1): # allocate space for the observed 0 outcomes, n00 and n10
N11 = i
for j in range(max(0, n01-N11), N-n00-N11): # N11+N10 >= n01; N11+N10+n00 <= N
N10 = j
for k in range(max(0, n11-N11), min(N-n10-N11, N-N11-N10)):
# N11+N01 >= n11; N11+N01+n10 <= N; no more than N subjects
N01 = k
N00 = N-N11-N10-N01
if filterTable([N00, N01, N10, N11], n00, n01, n10, n11):
yield [N00, N01, N10, N11]
else:
pass
def filterTable(Nt, n00, n01, n10, n11):
'''
check whether summary table Nt of binary outcomes is consistent with observed counts
implements the test in Theorem 1 of Li and Ding (2016)
Parameters:
----------
Nt : list of four ints
the table of counts of subjects with each combination of potential outcomes, in the order
N_00, N_01, N_10, N_11
n00 : int
number of subjects assigned to control whose observed response was 0
n01 : int
number of subjects assigned to control whose observed response was 1
n10 : int
number of subjects assigned to treatment whose observed response was 0
n11 : int
number of subjects assigned to treatment whose observed response was 1
Returns:
--------
ok : boolean
True if table is consistent with the data
'''
N = np.sum(Nt) # total subjects
return max(0,n11-Nt[2], Nt[3]-n01, Nt[2]+Nt[3]-n10-n01) <= min(Nt[3], n11, Nt[2]+Nt[3]-n01, N-Nt[2]-n01-n10)
def potential_outcomes(Nt):
'''
make an Nx2 table of potential outcomes from the 2x2 summary table Nt
Parameters
----------
Nt : list of 4 ints
N00, N01, N10, N11
Returns
-------
po : Nx2 table of potential outcomes consistent with Nt
'''
return np.reshape(np.array([0,0]*Nt[0]+[0,1]*Nt[1]+[1,0]*Nt[2]+[1,1]*Nt[3]), [-1,2])
# test
Nt = [5, 4, 3, 2]
potential_outcomes(Nt)
array([[0, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 1],
[0, 1],
[0, 1],
[0, 1],
[1, 0],
[1, 0],
[1, 0],
[1, 1],
[1, 1]])
# Hypothetical experiment
N = 10
n = 5
n00 = 3
n01 = 2
n10 = 1
n11 = 4
[Nt for Nt in N_generator(N, n00, n01, n10, n11)]
[[3, 3, 3, 1],
[2, 4, 3, 1],
[1, 5, 3, 1],
[4, 2, 2, 2],
[3, 3, 2, 2],
[2, 4, 2, 2],
[1, 5, 2, 2],
[3, 2, 3, 2],
[2, 3, 3, 2],
[1, 4, 3, 2],
[4, 1, 2, 3],
[3, 2, 2, 3],
[2, 3, 2, 3],
[1, 4, 2, 3],
[3, 1, 3, 3],
[2, 2, 3, 3],
[1, 3, 3, 3],
[5, 0, 1, 4],
[4, 1, 1, 4],
[3, 2, 1, 4],
[2, 3, 1, 4],
[1, 4, 1, 4],
[4, 0, 2, 4],
[3, 1, 2, 4],
[2, 2, 2, 4],
[1, 3, 2, 4],
[4, 0, 1, 5],
[3, 1, 1, 5],
[2, 2, 1, 5],
[1, 3, 1, 5],
[4, 0, 0, 6],
[3, 1, 0, 6],
[2, 2, 0, 6]]
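The pieces above stop short of the last steps of the algorithm: simulating the randomization distribution of \(|\hat{\tau} - \tau(\mathbf{N})|\) for each consistent table and collecting the values of \(\tau(\mathbf{N})\) that are not rejected. Here is one minimal sketch of that assembly, reusing N_generator() and potential_outcomes(); the function name tau_ci, the simulation size, and the use of simulated (rather than exhaustive) assignments are choices made here for illustration.

import numpy as np

def tau_ci(N, n, n00, n01, n10, n11, alpha=0.05, reps=1000, seed=5):
    '''
    simulated confidence interval for the average treatment effect, inverting randomization
    tests based on |tau_hat - tau(N)| over the summary tables yielded by N_generator()
    '''
    rng = np.random.default_rng(seed)
    tau_star = n11/n - n01/(N - n)                 # observed unbiased estimate of tau
    keep = []
    for Nt in N_generator(N, n00, n01, n10, n11):
        tau_N = (Nt[1] - Nt[2])/N                  # tau for this table: (N01 - N10)/N
        t = abs(tau_star - tau_N)
        po = potential_outcomes(Nt)
        dist = np.empty(reps)
        for r in range(reps):
            treat = rng.choice(N, size=n, replace=False)
            mask = np.zeros(N, dtype=bool)
            mask[treat] = True
            dist[r] = abs(np.mean(po[mask, 1]) - np.mean(po[~mask, 0]) - tau_N)
        if t <= np.quantile(dist, 1 - alpha):      # cannot reject this table
            keep.append(tau_N)
    return min(keep), max(keep)

# tau_ci(10, 5, 3, 2, 1, 4)   # confidence interval for the hypothetical experiment above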
# Regeneron data from
# https://investor.regeneron.com/news-releases/news-release-details/phase-3-prevention-trial-showed-81-reduced-risk-symptomatic-sars
n=753
m=752
N=n+m
n01 = 59
n11 = 11
n00 = m-n01
n10 = n-n11
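Using the counts just entered, the observed value of the unbiased estimate \(\hat{\tau}\) is easy to compute; the full confidence-interval calculation is omitted here.

tau_hat = n11/n - n01/m   # observed unbiased estimate of the average treatment effect
tau_hat                   # roughly -0.064: about 6.4 fewer infections per 100 subjects under treatment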
Other measures of the effect of treatment
The average treatment effect is one of many ways of quantifying the effect of treatment. For vaccines, a common measure of effect size is vaccine efficacy, defined as
\(\mbox{VE} := 1 - \frac{\mbox{risk among the vaccinated}}{\mbox{risk among the unvaccinated}}.\)
This is the naive estimate of the fraction of infections that vaccinating everyone would prevent.
Here, risk is the fraction of subjects who become infected. In the notation we’ve been using,
\(\mbox{VE} = 1 - \frac{N_{+1}/N}{N_{1+}/N} = 1 - \frac{N_{+1}}{N_{1+}},\)
if \(N_{1+} > 0\).
We can make confidence intervals for vaccine efficacy in the same way as for the average treatment effect:
Choose a reasonable test statistic \(T\). Here, we might use the plug-in estimator of VE, which is biased, but still a sensible choice:
\(\widehat{\mbox{VE}} := 1 - \frac{X_1^*/n}{X_0^*/(N-n)}.\)
For each summary table of potential outcomes, \(\mathbf{N}\), test the hypothesis that the observations came from that table using the randomization distribution of \(T\): reject if the observed value of \(T\) exceeds the \(1-\alpha\) percentile of that distribution.
If the hypothesis is not rejected, include \(\mbox{VE}(\mathbf{N})\) in the confidence set.
A similar approach works with any measure of effect size in a randomized experiment with binary outcomes.
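For instance, the plug-in estimate of VE is simple to compute from the Regeneron counts entered above; a confidence set for VE would follow the steps just described and is omitted here.

VE_hat = 1 - (n11/n)/(n01/m)   # plug-in estimate of vaccine efficacy for the Regeneron counts above
VE_hat                         # roughly 0.81, consistent with the "81% reduced risk" in the press release title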
What’s special about binary outcomes?
The approach above uses the assumption that the potential outcomes can be only 0 or 1. If the potential outcomes take values in a known finite set, a similar approach works, but if there are more than 2 possible values, constructing all tables algebraically consistent with the data is in general harder, and the number of such tables will be larger.
If the potential outcomes are bounded with known bounds (but not necessarily discrete with known support), the approach can be generalized using nonparametric methods for estimating the mean of bounded, finite populations (see other chapters in the notes for such methods, including Confidence bounds via the Chebychev and Hoeffding Inequalities, Lower confidence bounds for the mean via Markov’s Inequality and methods based on the Empirical Distribution, Penny sampling, Wald’s Sequential Probability Ratio Test (SPRT), Kaplan-Wald Confidence Bound for a Nonnegative Mean, and Method shootout).
If the outcomes can be arbitrary (in particular, unbounded), there is in general no nontrivial way to make confidence bounds for the average treatment effect.