Statistical Inference¶
One of the most fundamental problems statistics tries to address is causal inference, determining what effect some treatment has.
One of the most serious difficulties in making causal inferences is confounding: differences due to something other than the treatment masquerading as the effect of the treatment.
The Method of Comparison¶
The basic method to mitigate confounding is to compare (at least) two groups, one that receives treatment and a control group that does not (or that gets a different treatment). To minimize bias, the treatment group and the control group should be as similar as possible but for the fact that one gets treatment and the other does not.
If subjects self-select for treatment, that generally results in bias. So does allowing the experimenter flexibility to select the groups. The best way to minimize bias, and to be able to quantify the uncertainty in the resulting inferences, is to assign subjects to treatment or control randomly.
For human subjects, the mere fact of receiving treatment—even a treatment with no real effect—can produce changes in response. This is called the placebo effect. For that reason, it is important that human subjects be blind to whether they are treated or not, for instance, by giving subjects in the control group a placebo. That makes the treatment and control groups more similar. Both groups receive something: the difference is in what they receive, rather than whether they receive anything.
Also, subjective elements can deliberately or inadvertently enter the assessment of subjects’ responses to treatment, making it important for the people assessing the responses to be blind to which subjects received treatment. When neither the subjects nor the assessors know who was treated, the experiment is double blind.
See SticiGui: Does Treatment Have an Effect? for more discussion.
The 2-sample problem¶
Suppose we have a group of \(N\) individuals who are randomized into two groups, a treatment group of size \(N_t\) and a control group of size \(N_c = N - N_t\). Label the individuals from \(1\) to \(N\). Let \({\mathcal T}\) denote the labels of individuals assigned to treatment and \({\mathcal C}\) denote the labels of those assigned to control.
For each of the \(N\) individuals, we measure a quantitative (real-valued) response. Each individual \(i\) has two potential responses: the response \(c_i\) the individual would have if assigned to the control group, and the response \(t_i\) the individual would have if assigned to the treatment group.
We assume that individual \(i\)’s response depends only on that individual’s assignment, and not on anyone else’s assignment. This is the assumption of non-interference. In some cases, this assumption is reasonable; in others, it is not.
For instance, imagine testing a vaccine for a communicable disease. If you and I have contact, whether you get the disease might depend on whether I am vaccinated—and vice versa—since if the vaccine protects me from illness, I won’t infect you. Similarly, suppose we are testing the effectiveness of an advertisement for a product. If you and I are connected and you buy the product, I might be more likely to buy it, even if I don’t see the advertisement.
Conversely, suppose that “treatment” is exposure to a carcinogen, and the response is whether the subject contracts cancer. On the assumption that cancer is not communicable, my exposure and your disease status have no connection.
The strong null hypothesis is that individual by individual, treatment makes no difference whatsoever: \(c_i = t_i\) for all \(i\).
If so, any differences between statistics computed for the treatment and control groups are entirely due to the luck of the draw: which individuals happened to be assigned to treatment and which to control.
We can find the null distribution of any statistic computed from the responses of the two groups: if the strong null hypothesis is true, we know what individual \(i\)’s response would have been whether assigned to treatment or to control—namely, the same.
For instance, suppose we suspect that treatment tends to increase response: in general, \(t_i \ge c_i\). Then we might expect \(\bar{c} = \frac{1}{N_c} \sum_{i \in {\mathcal C}} c_i\) to be less than \(\bar{t} = \frac{1}{N_t} \sum_{i \in {\mathcal T}} t_i\). How large a difference between \(\bar{c}\) and \(\bar{t}\) would be evidence that treatment increases the response, beyond what might happen by chance through the luck of the draw?
This amounts to asking whether the observed difference in means between the two groups is a high percentile of the distribution of that difference in means, calculated on the assumption that the null hypothesis is true.
Because of how subjects are assigned to treatment or to control, all allocations of \(N_t\) subjects to treatment are equally likely.
One way to partition the \(N\) subjects randomly into a group of size \(N_c\) and a group of size \(N_t\) is to permute the \(N\) subjects at random, then take the first \(N_c\) in the permuted list to be the control group, and the remaining \(N_t\) to be the treatment group.
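For concreteness, here is a minimal sketch (our own function name and toy data, purely for illustration) of simulating that null distribution for the difference in means: permute the responses, take the first \(N_c\) as the control group and the rest as the treatment group, and tabulate \(\bar{t} - \bar{c}\) over many random assignments.
import numpy as np

def null_mean_diffs(responses, n_c, reps=10000, prng=np.random):
    """Simulate the null distribution of (treatment mean - control mean) under the
    strong null hypothesis by randomly permuting the observed responses."""
    responses = np.asarray(responses, dtype=float)
    diffs = np.empty(reps)
    for r in range(reps):
        perm = prng.permutation(responses)    # random permutation of all N responses
        ctrl, treat = perm[:n_c], perm[n_c:]  # first N_c to control, the remaining N_t to treatment
        diffs[r] = treat.mean() - ctrl.mean()
    return diffs

# toy example: estimated chance, under the strong null, of a difference at least as large as observed
obs = np.array([3.1, 2.5, 4.0, 3.3, 2.8, 3.9, 4.2, 2.2])  # hypothetical responses of N = 8 subjects
n_c = 4                                                    # hypothetical control-group size
observed = obs[n_c:].mean() - obs[:n_c].mean()             # pretend the last 4 subjects were treated
diffs = null_mean_diffs(obs, n_c)
print((diffs >= observed).mean())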
Aside: Random number generation and simulation¶
Most computers cannot generate true random numbers (there are rare exceptions that have hardware random number generators). Instead, they generate pseudo-random numbers using algorithms called pseudo-random number generators (PRNGs). PRNGs are deterministic, but the sequences of numbers they generate can behave like random numbers for some purposes, depending on the PRNG and the purpose.
Older languages, operating systems, and non-statistical software (including Excel) tend to use low-quality PRNGs that are not suitable for statistical inference.
A linear congruential generator makes sequences of the form
\(X_{n+1} = (a X_n + c) \mod m,\)
with \(m > 0\), \(0 \le c < m\), \(0 < a < m\), and “seed” \(X_0\). This generator has the benefit of simplicity, but the quality is not adequate for statistics. Moreover, for some choices of \(a\), \(c\), and \(m\), this can have especially poor behavior. One particularly bad generator, RANDU, was used on IBM mainframes in the 1960s. It has \(m = 2^{31}\), \(c = 0\), and \(a = 65539\). It generates sequences with very strong correlation: if you plot consecutive triples of the numbers it generates as points in \(\Re^3\), the points fall on just 15 planes. Despite the fact that problems with this generator were already known in the early 1960s, it was implemented in the IBM/360 in the 1970s and widely promulgated. (See Marsaglia, G., 1968. Random numbers fall mainly in the planes. PNAS, 61, 25–28. http://www.pnas.org/content/61/1/25.full.pdf) Many scientific “results” are now in question because they relied on RANDU.
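For illustration, here is a minimal sketch of a linear congruential generator with the RANDU constants (the function name lcg and the plotting suggestion are our own choices); plotting consecutive triples of its output as points in the unit cube shows them falling on a small number of planes.
import numpy as np

def lcg(n, a=65539, c=0, m=2**31, seed=1):
    """Generate n values from the linear congruential generator
    X_{k+1} = (a * X_k + c) mod m, scaled to [0, 1)."""
    x = np.empty(n)
    state = seed
    for k in range(n):
        state = (a * state + c) % m
        x[k] = state / float(m)
    return x

u = lcg(3000)                  # RANDU output, scaled to [0, 1)
triples = u.reshape(-1, 3)     # consecutive non-overlapping triples
print(triples[:3])             # viewed from the right angle, these 1000 points lie on only 15 planes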
The Wichmann-Hill generator (published in 1982) was popular for a while; it combines three linear congruential generators. Excel purported to implement it, but at least through 2008, its implementation had bugs. (For instance, despite the fact that the Wichmann-Hill generator gives numbers between 0 and 1, the Excel PRNG occasionally gave negative numbers. See McCullough, B.D., 2008. Microsoft’s ‘Not the Wichmann-Hill’ random number generator. Computational Statistics and Data Analysis, 52, 4587–4593.)
Current high-end statistics packages and programming languages (e.g., R, Python, SAS, MATLAB) use the Mersenne Twister PRNG. The Mersenne Twister has a very long period (\(2^{19937}-1\)) and passed the DIEHARD tests for equidistribution, etc. However, it is not adequate for cryptography: its internal state, and hence all of its future output, can be reconstructed from a relatively small number of observed values. Linear congruential generators are generally not adequate for statistics. In particular, beware the algorithms in Numerical Recipes and the Excel PRNG.
Random sampling¶
One standard way to draw a (pseudo-)random sample from a list is to assign a pseudo-random number to each element of the list, then sort the list by those numbers. Thus, the first element of the sample is the element of the list that was assigned the smallest number, the second element of the sample is the list element assigned the second-smallest number, etc. This generates a random sample in the same way that shuffling cards well generates random “hands” of cards.
Formally, this is fine, but PRNGs have limitations. Even the Mersenne Twister runs into trouble generating random permutations of long vectors using the naive approach (assign a random number to each element of the vector, then sort by those numbers). The period of the Mersenne Twister is about \(4.3 \times 10^{6001}\). That’s less than the number of permutations of 2081 objects.
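A quick sanity check of that comparison, using the log-gamma function to compute the base-2 logarithm of a factorial without overflow (a minimal sketch; log2_factorial is our own helper name):
import math

def log2_factorial(n):
    """log base 2 of n!, via the log-gamma function: lgamma(n+1) = ln(n!)."""
    return math.lgamma(n + 1) / math.log(2)

print(log2_factorial(2080))  # about 19933: 2080! is less than the Mersenne Twister period 2^19937 - 1
print(log2_factorial(2081))  # about 19944: 2081! exceeds the period, so not every permutation can occur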
The Knuth Shuffle¶
Constructing a random permutation as described above is inefficient, in part because it requires sorting. The Knuth Shuffle is more efficient and can generate a random permutation in place. The Knuth shuffle goes through the list, randomly swapping the current element with elements above it in the list (or leaving it in place). That is, item \(i\) is swapped with item \(j \le i\) with probability \(1/i\) (swapping item \(i\) with item \(i\) leaves it in place, of course).
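Here is a minimal sketch of the Knuth shuffle (the function name knuth_shuffle and the use of numpy.random.randint to pick the swap index are our own choices; the computational assignment below asks for your own implementation using numpy.random.random()).
import numpy as np

def knuth_shuffle(a, prng=np.random):
    """Return a randomly permuted copy of a, using the Knuth (Fisher-Yates) shuffle."""
    a = np.array(a)                     # work on a copy
    for i in range(len(a) - 1, 0, -1):  # walk backwards through the list
        j = prng.randint(0, i + 1)      # j uniform on {0, 1, ..., i}
        a[i], a[j] = a[j], a[i]         # swap items i and j (j == i leaves item i in place)
    return a

print(knuth_shuffle(np.arange(1, 21)))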
# boilerplate
%matplotlib inline
from __future__ import division
import math
import numpy as np
import scipy as sp
from scipy import stats # distributions
from scipy import special # special functions
from scipy import random # random variables, distributions, etc.
import matplotlib.pyplot as plt
from ipywidgets import widgets
# from ipywidgets import interact, interactive, FloatSliderWidget, fixed # interactive stuff
# Random sampling using random permutations
n = 20
d = range(1,n+1) # dummy data for illustration
np.random.seed(20150824) # deliberately set the seed for reproducible results
x = np.random.random(n) # to use scipy instead of numpy, we could set x = sp.random.uniform(low=0, high=1, size=n)
i = np.argsort(x)
sam = [d[j] for j in i]
print 'x:', x
print 'i:', i
print 'shuffled data:', sam
x: [ 0.38376584 0.56143969 0.93319055 0.31239854 0.32030197 0.79957872
0.79578209 0.54639143 0.66867049 0.39427116 0.61276661 0.07677212
0.37835371 0.11060475 0.16613878 0.77981969 0.42478138 0.69276422
0.89906575 0.17081513]
i: [11 13 14 19 3 4 12 0 9 16 7 1 10 8 17 15 6 5 18 2]
shuffled data: [12, 14, 15, 20, 4, 5, 13, 1, 10, 17, 8, 2, 11, 9, 18, 16, 7, 6, 19, 3]
Computational Assignment¶
Implement Knuth’s algorithm in Python from scratch, using numpy.random.random().
Compare the CPU time taken to generate \(10^4\) random permutations of arrays of length \(10^k\), for \(k = 2, 3, 4, 5\), for:
your implementation of Knuth’s algorithm
the method above generating \(10^k\) uniform random numbers and sorting the results
numpy.random.shuffle(), which shuffles the array in place
numpy.random.permutation(), which returns a new permuted array
IPython has a “magic” command %%timeit that will help you do this. For an example, see the cell below:
%%timeit n = 10**3
x = range(n)
for i in range(n):
    np.random.shuffle(x)
1 loops, best of 3: 134 ms per loop
%%timeit n = 10**3
x = range(n)
for i in range(n):
    np.random.permutation(x)
1 loops, best of 3: 837 ms per loop
%%timeit n = 10**3
x = range(n)
for i in range(n):
    y = np.random.random(n)  # to use scipy instead of numpy, we could set y = sp.random.uniform(low=0, high=1, size=n)
    idx = np.argsort(y)
    sam = [x[j] for j in idx]
1 loops, best of 3: 766 ms per loop
Significance level and power¶
Significance level and power are the chance of the same event, rejecting the null hypothesis, computed under different assumptions: the significance level is computed assuming the null hypothesis is true, while the power is computed assuming a particular alternative is true.
Significance testing is currently vilified, partly because it is misused.
[To do.]
\(P\)-values¶
[To do.] Start with family of hypothesis tests for any desired significance level…
Confidence Sets¶
[To do.]
Confidence bounds for Binomial \(p\) and Hypergeometric \(G\)¶
from scipy.optimize import brentq
from scipy.stats import (binom, hypergeom)
def binom_conf_interval(n, x, cl=0.975, alternative="two-sided", p=None,
                        **kwargs):
    """
    Compute a confidence interval for a binomial p, the probability of success in each trial.

    Parameters
    ----------
    n : int
        The number of Bernoulli trials.
    x : int
        The number of successes.
    cl : float in (0, 1)
        The desired confidence level.
    alternative : {"two-sided", "lower", "upper"}
        Indicates the alternative hypothesis.
    p : float in (0, 1)
        Starting point in search for confidence bounds for probability of success in each trial.
    kwargs : dict
        Keyword arguments passed to scipy.optimize.brentq.

    Returns
    -------
    tuple
        lower and upper confidence bounds with coverage (approximately) cl.

    Notes
    -----
    xtol : float
        Tolerance
    rtol : float
        Tolerance
    maxiter : int
        Maximum number of iterations.
    """
    assert alternative in ("two-sided", "lower", "upper")
    if p is None:
        p = x / n
    ci_low = 0.0
    ci_upp = 1.0
    if alternative == 'two-sided':
        cl = 1 - (1 - cl) / 2
    if alternative != "upper" and x > 0:
        f = lambda q: cl - binom.cdf(x - 1, n, q)
        ci_low = brentq(f, 0.0, p, **kwargs)
    if alternative != "lower" and x < n:
        f = lambda q: binom.cdf(x, n, q) - (1 - cl)
        ci_upp = brentq(f, p, 1.0, **kwargs)
    return ci_low, ci_upp
def hypergeom_conf_interval(n, x, N, cl=0.975, alternative="two-sided", G=None,
                            **kwargs):
    """
    Confidence interval for a hypergeometric distribution parameter G, the number of good
    objects in a population of size N, based on the number x of good objects in a simple
    random sample of size n.

    Parameters
    ----------
    n : int
        The number of draws without replacement.
    x : int
        The number of "good" objects in the sample.
    N : int
        The number of objects in the population.
    cl : float in (0, 1)
        The desired confidence level.
    alternative : {"two-sided", "lower", "upper"}
        Indicates the alternative hypothesis.
    G : int in [0, N]
        Starting point in search for confidence bounds for the hypergeometric parameter G.
    kwargs : dict
        Keyword arguments passed to scipy.optimize.brentq.

    Returns
    -------
    tuple
        lower and upper confidence bounds with coverage (at least) cl.

    Notes
    -----
    xtol : float
        Tolerance
    rtol : float
        Tolerance
    maxiter : int
        Maximum number of iterations.
    """
    assert alternative in ("two-sided", "lower", "upper")
    if G is None:
        G = (x / n) * N
    ci_low = 0
    ci_upp = N
    if alternative == 'two-sided':
        cl = 1 - (1 - cl) / 2
    if alternative != "upper" and x > 0:
        f = lambda q: cl - hypergeom.cdf(x - 1, N, q, n)
        ci_low = math.ceil(brentq(f, 0.0, G, **kwargs))
    if alternative != "lower" and x < n:
        f = lambda q: hypergeom.cdf(x, N, q, n) - (1 - cl)
        ci_upp = math.floor(brentq(f, G, N, **kwargs))
    return ci_low, ci_upp
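For example, a hypothetical call (the numbers are illustrative, not from these notes) to find a two-sided confidence interval for the number \(G\) of good objects in a population of 1000, having seen 3 good objects in a simple random sample of 50:
hypergeom_conf_interval(50, 3, 1000, cl=0.95, alternative="two-sided")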
binom_conf_interval(1000, 3, cl=0.99, alternative="lower")
(0.00043638656412672464, 1.0)
141/339159
0.000415734213156661
Permutation tests¶
Basic setup: null hypothesis under which the probability distribution of the data is invariant under the action of some group. That is, if the null hypothesis is true, then there are many sets of data that would be (known to be) just as likely as the data actually observed.
Examples: symmetry about a point, rotational invariance, … [To do.]
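As a small illustration of this setup (our own choice of test statistic and function name, not an example developed later in these notes): if the null hypothesis is that the data are symmetric about 0, their joint distribution is invariant under flipping the signs of any subset of observations, so the null distribution of the absolute sample mean can be approximated by random sign flips.
import numpy as np

def sign_flip_p(z, reps=10000, prng=np.random):
    """Estimated P-value for the hypothesis that the data are symmetric about 0,
    using the absolute sample mean as the test statistic and random sign flips."""
    z = np.asarray(z, dtype=float)
    observed = abs(z.mean())
    count = 0
    for _ in range(reps):
        signs = prng.choice([-1.0, 1.0], size=len(z))  # each sign flipped independently with chance 1/2
        if abs((signs * z).mean()) >= observed:
            count += 1
    return count / float(reps)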
Neyman model for causal inference¶
The ticket model. Each “subject” is represented by a ticket with two numbers on it, the response that the subject would have if assigned to control, and the response if assigned to treatment. These are assumed to be pre-ordained, fixed numbers. (Note that non-interference is built into this framework.)
We assign subjects at random according to the experimental design.
Non-interference: when is it realistic? Strong null hypothesis.
Generalizations¶
Two probability distributions on each ticket instead of two numbers.
Gender Bias in Teaching Evaluations¶
MacNell, Driscoll, and Hunt (2014. What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching, Innovative Higher Education) conducted a controlled, randomized experiment on the effect of students’ perception of instructors’ gender on teaching evaluations in an online course. Students in the class did not know the instructors’ true genders.
MacNell et al. randomized 43 students in an online course into four groups: two taught by a female instructor and two by a male instructor. One of the groups taught by each instructor was led to believe the instructor was male; the other was led to believe the instructor was female. Comparable instructor biographies were given to all students. Instructors treated the groups identically, including returning assignments at the same time.
When students thought the instructor was female, students rated the instructor lower, on average, in every regard.
Characteristic | F - M |
---|---|
Caring | -0.52 |
Consistent | -0.47 |
Enthusiastic | -0.57 |
Fair | -0.76 |
Feedback | -0.47 |
Helpful | -0.46 |
Knowledgeable | -0.35 |
Praise | -0.67 |
Professional | -0.61 |
Prompt | -0.80 |
Respectful | -0.61 |
Responsive | -0.22 |
Those results are for a 5-point scale, so a difference of 0.8 is 16% of the entire scale.
MacNell et al. graciously shared their data. The evaluation data are coded as follows:
Group
3 (12 students) - TA identified as male, true TA gender female
4 (11 students) - TA identified as male, true TA gender male
5 (8 students) - TA identified as female, true TA gender female
6 (12 students) - TA identified as female, true TA gender male
tagender - 1 if TA is actually male, 0 if actually female
taidgender - 1 if TA is identified as male, 0 if identified as female
gender - 1 if student is male, 0 if student is female
There are grades for 47 students but evaluations for only 43 (4 did not respond: three in group 5 and one in group 6). The grades are not linked to the evaluations, per the IRB protocol.
Interlude: partitioning sets into more than two subsets¶
Recall that there are \(n \choose k\) ways of picking a subset of size \(k\) from \(n\) items; hence there are \(n \choose k\) ways of dividing a set into a subset of size \(k\) and one of size \(n-k\) (once you select those that belong to the subset of size \(k\), the rest must be in the complementary subset of size \(n-k\)).
In this problem, we are partitioning 43 things into 4 subsets, one of size 8, one of size 11, and two of size 12. How many ways are there of doing that?
Recall the Fundamental Rule of Counting: If a set of choices, \(T_1, T_2, \ldots, T_k\), could result, respectively, in \(n_1, n_2, \ldots, n_k\) possible outcomes, the entire set of \(k\) choices has \(\prod_{i=1}^k n_i\) possible outcomes.
We can think of the allocation of students to the four groups as choosing 8 of the 43 students for the first group, then 11 of the remaining 35 for the second, then 12 of the remaining 24 for the third. The fourth group must contain the remaining 12.
The number of ways of doing that is
\({43 \choose 8}{35 \choose 11}{24 \choose 12} = \frac{43!}{8! \, 35!} \cdot \frac{35!}{11! \, 24!} \cdot \frac{24!}{12! \, 12!} = \frac{43!}{8! \, 11! \, 12! \, 12!}.\)
Does the number depend on the order in which we made the choices? Suppose we made the choices in a different order: first 12 students for one group, then 8 for the second, then 12 for the third (the fourth gets the remaining 11 students). The number of ways of doing that is
\({43 \choose 12}{31 \choose 8}{23 \choose 12} = \frac{43!}{12! \, 31!} \cdot \frac{31!}{8! \, 23!} \cdot \frac{23!}{12! \, 11!} = \frac{43!}{12! \, 8! \, 12! \, 11!}.\)
The number does not depend on the order in which we make the choices.
By the same reasoning, the number of ways of dividing a set of \(n\) objects into \(m\) subsets of sizes \(k_1, \ldots, k_m\) (with \(\sum_{i=1}^m k_i = n\)) is given by the multinomial coefficient
\({n \choose k_1, k_2, \ldots, k_m} = \frac{n!}{k_1! \, k_2! \cdots k_m!}.\)
We will check how surprising it would be for the means to be so much lower when the TA is identified as female, if in fact there is “no real difference” in how they were rated, and the apparent difference is just due to the luck of the draw: which students happened to end up in which section.
In the actual randomization, all \({43 \choose 8,\, 11,\, 12,\, 12}\) allocations were equally likely. But there might be real differences between the two instructors. Hence, we’d like to use each of them as his or her own “control.”
Each student’s potential responses are represented by a ticket with 4 numbers:
the rating that the student would assign to instructor A if instructor A is identified as male
the rating that the student would assign to instructor A if instructor A is identified as female
the rating that the student would assign to instructor B if instructor B is identified as male
the rating that the student would assign to instructor B if instructor B is identified as female
The null hypothesis is that the first two numbers are equal and the last two numbers are equal, although the first two might differ from the last two. This corresponds to the hypothesis that students assigned to a given TA would rate him or her the same, whether that TA seemed to be male or female.
For all students assigned to instructor A, if the null hypothesis is true we know both of the first two numbers (they equal the rating the student actually gave), but we know neither of the last two numbers. Similarly, if the null hypothesis is true, we know both of the last two numbers for all students assigned to instructor B, but we know neither of the first two numbers for those students.
Because of how the randomization was performed, all allocations of students to sections that keep the number of students in each section the same are equally likely. In particular, given which students are assigned to each actual instructor, all ways of splitting those students between that instructor’s two sections are equally likely.
Hence, all \({20 \choose 8}\) ways of splitting the 20 students assigned to the female instructor into two groups, one with 8 students and one with 12, are equally likely. Similarly, all \({23 \choose 12}\) ways of splitting the 23 students assigned to the male instructor into two groups, one with 12 students and one with 11, are equally likely. We can thus imagine shuffling the female TA’s students between her sections, and the male TA’s students between his sections, and examine the distribution of the difference between the mean score for the sections where the TA was identified as male and the mean score for the sections where the TA was identified as female.
If the difference is rarely as large as the observed mean difference, the observed mean difference gives evidence that being identified as female really does lower the scores.
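A minimal sketch of that stratified permutation scheme (the function and variable names are our own; an actual analysis would use the ratings loaded from the MacNell data below): within each actual instructor, shuffle the identified-gender labels among that instructor’s students, then recompute the difference between the mean rating when identified as male and the mean rating when identified as female.
import numpy as np

def stratified_mean_diffs(ratings, id_male, instructor, reps=10000, prng=np.random):
    """Simulate the null distribution of
    mean(rating | identified male) - mean(rating | identified female),
    shuffling the identified-gender labels separately within each actual instructor."""
    ratings = np.asarray(ratings, dtype=float)
    labels = np.asarray(id_male).copy()
    instructor = np.asarray(instructor)
    observed = ratings[labels == 1].mean() - ratings[labels == 0].mean()
    diffs = np.empty(reps)
    for r in range(reps):
        for ta in np.unique(instructor):
            idx = np.where(instructor == ta)[0]
            labels[idx] = prng.permutation(labels[idx])  # shuffle labels within this instructor only
        diffs[r] = ratings[labels == 1].mean() - ratings[labels == 0].mean()
    return observed, diffs

# hypothetical usage with the column names described below (not run here):
# obs, diffs = stratified_mean_diffs(ratings['overall'], ratings['taidgender'], ratings['tagender'])
# print((diffs >= obs).mean())  # estimated P-value for the one-sided alternative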
# Number of allocations to 8, 11, 12, 12
print sp.special.binom(43,8)*sp.special.binom(35,11)*sp.special.binom(24,12) # big number!
# the data are in a .csv file called "Macnell-RatingsData.csv" in the directory Data
import csv # library for dealing with .csv files
ratings = np.genfromtxt ('./Data/Macnell-RatingsData.csv', delimiter=",", names=True)
print ratings
print ratings['overall']
Computational assignment¶
Rosenzweig, M.R., E.L. Bennett, and M.C. Diamond, 1972. Brain changes in response to experience, Scientific American, 226, 22–29, report an experiment in which 11 triples of male rats, each triple from the same litter, were assigned at random to three different environments: “enriched” (E), standard, and “impoverished.”
After 65 days, they euthanized the rats and measured their cortical masses (mg). Here are the results: [To do: check that this is impoverished, not standard.]
enriched | 689 | 656 | 668 | 660 | 679 | 663 | 664 | 647 | 694 | 633 | 653 |
impoverished | 657 | 623 | 652 | 654 | 658 | 646 | 600 | 640 | 605 | 635 | 642 |
difference | 32 | 33 | 16 | 6 | 21 | 17 | 64 | 7 | 89 | -2 | 11 |
Code, from scratch, an appropriate 2-sample permutation test in Python using the sum of the 11 differences as the test statistic. The permutation scheme should match the actual randomization in the experiment. You may either construct a test that generates permutations at random, or write code that enumerates all permutations (since there are only 2048 possible equally-likely data in this case). If you choose to enumerate all permutations, you can use the itertools library.
Explain whether a 1-sided or 2-sided test is appropriate.
Under the Neyman model, find the \(P\)-value of the strong null hypothesis that the treatment makes no difference whatsoever. Use \(10^4\) random assignments or enumerate all possible assignments. Your code should work in the general case (different data, different number of litters), not only the particular set of 22 numbers given. That said, for reference, the \(P\)-value for a 2-sided test should be \(1/2^9\) for these data, so you can check your answer.
Find an upper confidence bound for the \(P\)-value by inverting Binomial tests. You may re-use code from the class to do this.
For your convenience, the data are below in a numpy array.
brain_wts = np.array([ [689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653],\
[657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642] ])
%run talkTools.py