Permutation Tests

Basic idea:

If, under the null hypothesis, the probability distribution of the data is invariant under the action of some group \(\mathcal{G}\), then once we observe the actual data, we know other possible data sets that are equally likely.

In particular, every element of the orbit of the observed data under \(\mathcal{G}\) was just as probable as the observed data.

Test conditionally on the orbit in which the observed data lie.

Example: Effect of treatment in a randomized controlled experiment

11 pairs of rats, each pair from the same litter.

Randomly—by coin tosses—put one of each pair into the “enriched” environment; the other sib gets the “impoverished” environment.

After 65 days, measure cortical mass (mg).

enriched 689 656 668 660 679 663 664 647 694 633 653
impoverished 657 623 652 654 658 646 600 640 605 635 642
difference 32 33 16 6 21 17 64 7 89 -2 11

How should we analyze the data?

This is a cartoon of the experiment reported by Rosenzweig, M.R., E.L. Bennett, and M.C. Diamond, 1972. Brain changes in response to experience, Scientific American, 226, 22–29, in which 11 triples of male rats, each triple from the same litter, were assigned at random to three different environments: “enriched” (E), standard, and “impoverished.” See also Bennett et al., 1969.

Informal Hypotheses

Null hypothesis: treatment has “no effect.”

Alternative hypothesis: treatment increases cortical mass.

Suggests 1-sided test for an increase.

Test contenders

  • 2-sample Student \(t\)-test:

\[\begin{equation*} \frac{\mbox{mean(treatment) - mean(control)}} {\mbox{pooled estimate of SD of difference of means}} \end{equation*}\]
  • 1-sample Student \(t\)-test on the differences:

\[\begin{equation*} \frac{\mbox{mean(differences)}}{\mbox{SD(differences)}/\sqrt{11}} \end{equation*}\]

Better, since littermates are presumably more homogeneous.

  • Permutation test using \(t\)-statistic of differences: same statistic, different way to calculate \(P\)-value. Even better?

Strong null hypothesis

Treatment has no effect whatsoever—as if cortical mass were assigned to each rat before the randomization.

Then equally likely that the rat with the heavier cortex will be assigned to treatment or to control, independently across littermate pairs.

Gives \(2^{11} = 2,048\) equally likely possibilities:

difference ± 32 ± 33 ± 16 ± 6 ± 21 ± 17 ± 64 ± 7 ± 89 ± 2 ± 11

For example, just as likely to observe original differences as

difference -32 -33 -16 -6 -21 -17 -64 -7 -89 -2 -11

Weak null hypothesis

On average across pairs, treatment makes no difference.

Alternatives

  1. Individual’s response depends only on that individual’s assignment

    • Special cases: shift, scale, etc.

  2. Interactions/Interference: my response could depend on whether you are assigned to treatment or control.

Assumptions of the tests

  1. 2-sample \(t\)-test:

    • masses are iid sample from normal distribution, same unknown variance, same unknown mean.

    • Tests weak null hypothesis (plus normality, independence, non-interference, etc.).

  2. 1-sample \(t\)-test on the differences:

    • mass differences are iid sample from normal distribution, unknown variance, zero mean.

    • Tests weak null hypothesis (plus normality, independence, non-interference, etc.)

  3. Permutation test:

    • Randomization fair, independent across pairs.

    • Tests strong null hypothesis.

Assumptions of the permutation test are true by design: That’s how treatment was assigned.

What’s the group?

The data are elements of \(\Re^{11}\). The group operations are reflections about the coordinate hyperplanes: they flip the signs of any subset of the coordinates of the data vector.

# boilerplate
%matplotlib inline
import math
import numpy as np
import scipy as sp
from scipy import stats  # distributions
from scipy import special # special functions
from scipy.optimize import brentq
from scipy.stats import (binom, hypergeom)
import matplotlib.pyplot as plt
from ipywidgets import widgets

treat = np.array([689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653])
control = np.array([657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642])
diffr = treat - control

sp.stats.ttest_1samp(diffr, 0.0)  # two-sided one-sample t-test
Ttest_1sampResult(statistic=3.243721403780522, pvalue=0.008813693207259936)

Student \(t\)-test calculations

Mean of differences: 26.73 mg
Sample SD of differences: 27.33 mg
\(t\)-statistic: \(26.73/(27.33/\sqrt{11}) = 3.244\).

\(P\)-value for 2-sided \(t\)-test: 0.0088

  • Why do cortical weights have normal distribution?

  • Why is variance of the difference between treatment and control the same for different litters?

  • Treatment and control are dependent because assigning a rat to treatment excludes it from the control group, and vice versa.

  • \(P\)-value depends on assuming differences are iid sample from a normal distribution.

  • If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?

Permutation \(t\)-test calculations

Could enumerate all \(2^{11} = 2,048\) equally likely possibilities. Calculate \(t\)-statistic for each.

\(P\)-value is

\[\begin{equation*} P = \frac{\mbox{number of possibilities with $t \ge 3.244$}}{\mbox{2,048}} \end{equation*}\]

(Using the mean of the differences as the test statistic instead of the \(t\)-statistic would yield the same \(P\)-value: \(\sum_i d_i^2\) is invariant under sign flips, so the \(t\)-statistic is a monotone increasing function of the mean. Both give \(P = 2/2048 \approx 0.00098\).)

Let’s enumerate the 2048 possibilities and the corresponding \(t\) statistics.

from itertools import product
# t-statistic for the differences: mean / (sample SD / sqrt(n))
tsf = lambda x: np.mean(x)/(np.std(x, ddof=1)/math.sqrt(len(x)))
ts = tsf(diffr)
# enumerate all 2^11 assignments of signs to the differences
hits = 0
for signs in product([-1, 1], repeat=len(diffr)):
    hits += 1 if tsf(np.array(signs)*diffr) >= ts else 0
print(f'P-value {hits}/{2**len(diffr)} = {hits/2**len(diffr)}')
P-value 2/2048 = 0.0009765625
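
As a check on the parenthetical claim above, the same enumeration with the mean of the differences as the test statistic gives the same \(P\)-value:

# same enumeration, using the mean of the signed differences as the statistic
mean_obs = np.mean(diffr)
hits = sum(np.mean(np.array(signs)*diffr) >= mean_obs
           for signs in product([-1, 1], repeat=len(diffr)))
print(f'P-value {hits}/{2**len(diffr)} = {hits/2**len(diffr)}')  # 2/2048, as for t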

If there were many more pairs, it would be impractical to enumerate all possible sign combinations, but we can still approximate the \(P\)-value by simulation:

  • Assign a random sign to each difference, independently across differences, with chance 1/2 of assigning +

  • Compute \(t\)-statistic

  • Repeat \(N\) times

\[\begin{equation*} P \approx \frac{\mbox{number of simulations with $t \ge 3.244$}}{N} \end{equation*}\]

Finally, we can invert Binomial tests to make an exact confidence interval for \(P\) from the simulation results (or treat the result as an exact \(P\)-value for a randomized test; see Hypothesis Testing, Combinations of Tests, and Stratified Tests).

np.random.seed(20151013)  # set seed for reproducibility

iter = 100000  # N
def simPermuTP(z, iter):
    # P.B. Stark, statistics.berkeley.edu/~stark
    # Simulated P-value for a 1-sided, 1-sample t-test under the
    # randomization model: flip the sign of each difference at random.
    tsf = lambda x: np.mean(x)/(np.std(x, ddof=1)/math.sqrt(len(x)))
    ts = tsf(z)
    return np.mean([tsf(z*(2*np.random.randint(0, 2, size=len(z))-1)) >= ts
                    for i in range(iter)])
    
estP = simPermuTP(diffr, iter)
print(estP, estP*iter)
0.00097 97.0
# confidence interval for the p-value
# This function is in the _permute_ library; see http://statlab.github.io/permute/

def binom_conf_interval(n, x, cl=0.975, alternative="two-sided", p=None,
                        **kwargs):
    """
    Compute a confidence interval for a binomial p, the probability of success in each trial.

    Parameters
    ----------
    n : int
        The number of Bernoulli trials.
    x : int
        The number of successes.
    cl : float in (0, 1)
        The desired confidence level.
    alternative : {"two-sided", "lower", "upper"}
        Indicates the alternative hypothesis.
    p : float in (0, 1)
        Starting point in search for confidence bounds for probability of success in each trial.
    kwargs : dict
        Key word arguments

    Returns
    -------
    tuple
        lower and upper confidence bounds with coverage (at least)
        the confidence level cl.

    Notes
    -----
    xtol : float
        Tolerance
    rtol : float
        Tolerance
    maxiter : int
        Maximum number of iterations.
    """
    assert alternative in ("two-sided", "lower", "upper")

    if p is None:
        p = x / n
    ci_low = 0.0
    ci_upp = 1.0

    if alternative == 'two-sided':
        cl = 1 - (1-cl)/2

    if alternative != "upper" and x > 0:
        f = lambda q: cl - binom.cdf(x - 1, n, q)
        ci_low = brentq(f, 0.0, p, **kwargs)
    if alternative != "lower" and x < n:
        f = lambda q: binom.cdf(x, n, q) - (1 - cl)
        ci_upp = brentq(f, 1.0, p, **kwargs)

    return ci_low, ci_upp

binom_conf_interval(iter, math.ceil(estP*iter), cl=0.95, alternative="two-sided", p=estP)
(0.0007866728734903729, 0.0011831915072138114)
binom_conf_interval(10**5, 907, cl=0.99, alternative="upper")
(0.0, 0.009792108717906225)

Other tests: sign test, Wilcoxon signed-rank test

Sign test:

Count pairs where treated rat has heavier cortex, i.e., where difference is positive.

Under strong null, distribution of the number of positive differences is Binomial(11, 1/2). Like number of heads in 11 independent tosses of a fair coin. (Assumes no ties w/i pairs.)

\(P\)-value is chance of 10 or more heads in 11 tosses of a fair coin: 0.0059.

Only uses signs of differences, not information that only the smallest absolute difference was negative.
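
A quick sketch of that calculation, using the binom distribution imported earlier and the differences diffr:

# sign test: the number of positive differences is Binomial(11, 1/2)
# under the strong null (assuming no ties within pairs)
n_pos = np.sum(diffr > 0)                           # 10 of the 11 differences are positive
print(n_pos, binom.sf(n_pos - 1, len(diffr), 0.5))  # P(X >= 10) = 12/2048 ≈ 0.0059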

Wilcoxon signed-rank test

uses information about the ordering of the differences: rank the absolute values of the differences, then give them the observed signs and sum them. Null distribution: assign signs at random and sum.
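
Here is a sketch of the signed-rank test in the same sign-flipping framework. (The sum of the signed ranks, \(T = 2W_+ - n(n+1)/2\), is an increasing function of the sum of positive ranks \(W_+\), so it yields the same permutation \(P\)-value.)

# Wilcoxon signed-rank test by enumerating all 2^11 sign assignments
from scipy.stats import rankdata
ranks = rankdata(np.abs(diffr))           # rank the absolute differences (no ties here)
w_obs = np.sum(np.sign(diffr) * ranks)    # observed signed-rank sum
hits = sum(np.sum(np.array(signs) * ranks) >= w_obs
           for signs in product([-1, 1], repeat=len(diffr)))
print(f'P-value {hits}/{2**len(diffr)} = {hits/2**len(diffr)}')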

Still more tests, for other alternatives

All the tests we’ve seen here are sensitive to shifts: the alternative hypothesis is that treatment increases the response (cortical mass).

There are also nonparametric tests that are sensitive to other treatment effects, e.g., treatment increases the variability of the response.

And there are tests for whether treatment has any effect at all on the distribution of the responses.

You can design a test statistic to be sensitive to any change that interests you, then use the permutation distribution to get a \(P\)-value (and simulation to approximate that \(P\)-value).
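
For instance, here is a minimal generic sketch (the function name sign_flip_p is ours, not from any library) that takes an arbitrary statistic of the signed differences and simulates its null distribution under the pair randomization:

# simulated permutation P-value for any statistic of the signed differences
def sign_flip_p(z, statistic, reps=10**4, prng=np.random):
    obs = statistic(z)
    signs = 2*prng.randint(0, 2, size=(reps, len(z))) - 1  # reps rows of random ±1s
    return np.mean([statistic(row*z) >= obs for row in signs])

sign_flip_p(diffr, tsf)  # recovers the t-based P-value, near 2/2048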

Silliness

Treat ordinal data (e.g., Likert scale) as if measured on a linear scale; use Student \(t\)-test.

Maybe not so silly for large samples\(\ldots\)

\(t\)-test asymptotically distribution-free.

How big is big?

Back to Rosenzweig et al.

Actually had 3 treatments: enriched, standard, deprived.

Randomized 3 rats per litter into the 3 treatments, independently across \(n\) litters.

How should we analyze these data?

Test contenders

\(n\) litters, \(s\) treatments (sibs per litter); \(N = ns\) rats in all.

  • ANOVA–the \(F\)-test:

\[\begin{equation*} F = \frac{\mbox{BSS}/(s-1)}{\mbox{WSS}/(N-s)} \end{equation*}\]
  • Permutation \(F\)-test: use permutation distribution instead of \(F\) distribution to get \(P\)-value.

  • Friedman test: Rank within litters. Mean rank for treatment \(i\) is \(\bar{R}_i\).

\[\begin{equation*} Q = \frac{12n}{s(s+1)} \sum_{i=1}^s \left ( \bar{R}_i - \frac{s+1}{2} \right )^2. \end{equation*}\]

\(P\)-value from permutation distribution.
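
A sketch of the Friedman statistic and its permutation \(P\)-value, assuming the triple data were available as an \(n \times s\) array masses, one row per litter and one column per treatment (the array and function names here are ours, not from any library):

# Friedman statistic Q and its permutation null for a hypothetical
# n x s array `masses`: one row per litter, one column per treatment.
from scipy.stats import rankdata

def friedman_q(masses):
    n, s = masses.shape
    ranks = np.apply_along_axis(rankdata, 1, masses)  # rank within each litter
    rbar = ranks.mean(axis=0)                         # mean rank per treatment
    return 12*n/(s*(s+1)) * np.sum((rbar - (s+1)/2)**2)

def friedman_perm_p(masses, reps=10**4, prng=np.random):
    # shuffle treatments independently within each litter: the 6^n assignments
    q_obs = friedman_q(masses)
    shuffled = lambda: np.apply_along_axis(prng.permutation, 1, masses)
    return np.mean([friedman_q(shuffled()) >= q_obs for _ in range(reps)])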

Strong null hypothesis

Treatment has no effect whatsoever—as if cortical mass were assigned to each rat before the randomization.

Then it’s equally likely that each littermate is assigned to each treatment, independently across litters.

There are \(3! = 6\) assignments of each triple to treatments. Thus, there are \(6^n\) equally likely assignments across all litters.

For 11 litters, that’s 362,797,056 possibilities.

(The group under which the distribution of the data is invariant is a subgroup of the permutation group: if you think of the data as an element of \(\Re^{33}\) (33 rats), the group consists of all permutations of the 33 elements that shuffle only adjacent sets of 3 components: components 1–3, components 4–6, … components 31–33.)

Weak null hypothesis

The average cortical weights for the three treatment groups are equal: on average across triples, treatment makes no difference.

Assumptions of the tests

  • \(F\)-test: masses are iid sample from normal distribution, same unknown variance, same unknown mean for all litters and treatments.
    Tests weak null hypothesis.

  • Permutation \(F\)-test: Randomization was as advertised: fair, independent across triples.
    Tests strong null hypothesis.

  • Friedman test: Ditto.

Assumptions of the permutation test and Friedman test are true by design: that’s how treatment was assigned.

Friedman test statistic has \(\chi^2\) distribution asymptotically. Ties are a complication.

\(F\)-test assumptions—reasonable?

  • Why do cortical weights have normal distribution for each litter and for each treatment?

  • Why is the variance of cortical weights the same for different litters?

  • Why is the variance of cortical weights the same for different treatments?

Is \(F\) a good statistic for this alternative?

\(F\)-test (and Friedman statistic) sensitive to differences among the mean responses for each treatment group, no matter what pattern the differences have.

But the treatments and the responses can be ordered: we hypothesize that more stimulation produces greater cortical mass.

deprived \(\Longrightarrow\) normal \(\Longrightarrow\) enriched
low mass \(\Longrightarrow\) medium mass \(\Longrightarrow\) high mass

Can we use that to make a more sensitive test?

A test against an ordered alternative

Within each litter triple, count pairs of responses that are “in order.” Sum across litters.

E.g., if one triple had cortical masses

deprived 640
normal 660
enriched 650

that would contribute 2 to the sum: \(660 \ge 640\) and \(650 \ge 640\), but \(650 < 660\).

Each litter triple contributes between 0 and 3 to the overall sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent assignments across litters.

A different test against an ordered alternative

Within each litter triple, add differences that are “in order.” Sum across litters.

E.g., if one triple had cortical masses

deprived 640
normal 660
enriched 650

that would contribute 30 to the sum: \(660 - 640 = 20\) and \(650 - 640 = 10\), but \(650 < 660\), so that pair contributes zero.

Each litter triple contributes between 0 and \(2 \times \mbox{range}\) (twice the range of its three masses) to the sum.

Null distribution for the test based on the permutation distribution: 6 equally likely assignments per litter, independent across litters.
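
Both ordered-alternative statistics are straightforward to compute; here is a sketch, again assuming a hypothetical \(n \times 3\) array masses with columns ordered deprived, normal, enriched. The same within-litter shuffling used for the Friedman test sketch above would give their null distributions.

# Ordered-alternative statistics for a hypothetical n x 3 array `masses`
# with columns ordered deprived, normal, enriched.
def count_in_order(masses):
    d, m, e = masses[:, 0], masses[:, 1], masses[:, 2]
    # number of pairs "in order" in each litter, summed across litters
    return np.sum((m >= d).astype(int) + (e >= d).astype(int) + (e >= m).astype(int))

def sum_in_order_diffs(masses):
    d, m, e = masses[:, 0], masses[:, 1], masses[:, 2]
    # differences of pairs "in order"; out-of-order pairs contribute zero
    return np.sum(np.maximum(m - d, 0) + np.maximum(e - d, 0) + np.maximum(e - m, 0))

# the example triple contributes 2 and 30, respectively:
print(count_in_order(np.array([[640, 660, 650]])),
      sum_in_order_diffs(np.array([[640, 660, 650]])))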

The 2-sample problem

Suppose we have a group of \(N\) individuals who are randomized into two groups, a treatment group of size \(N_t\) and a control group of size \(N_c = N - N_t\). Label the individuals from \(1\) to \(N\). Let \({\mathcal T}\) denote the labels of individuals assigned to treatment and \({\mathcal C}\) denote the labels of those assigned to control.

For each of the \(N\) individuals, we measure a quantitative (real-valued) response. Each individual \(i\) has two potential responses: the response \(c_i\) the individual would have if assigned to the control group, and the response \(t_i\) the individual would have if assigned to the treatment group.

We assume that individual \(i\)’s response depends only on that individual’s assignment, and not on anyone else’s assignment. This is the assumption of non-interference. In some cases, this assumption is reasonable; in others, it is not.

For instance, imagine testing a vaccine for a communicable disease. If you and I have contact, whether you get the disease might depend on whether I am vaccinated—and vice versa—since if the vaccine protects me from illness, I won’t infect you. Similarly, suppose we are testing the effectiveness of an advertisement for a product. If you and I are connected and you buy the product, I might be more likely to buy it, even if I don’t see the advertisement.

Conversely, suppose that “treatment” is exposure to a carcinogen, and the response whether the subject contracts cancer. On the assumption that cancer is not communicable, my exposure and your disease status have no connection.

The strong null hypothesis is that individual by individual, treatment makes no difference whatsoever: \(c_i = t_i\) for all \(i\).

If so, any differences between statistics computed for the treatment and control groups are entirely due to the luck of the draw: which individuals happened to be assigned to treatment and which to control.

We can find the null distribution of any statistic computed from the responses of the two groups: if the strong null hypothesis is true, we know what individual \(i\)’s response would have been whether assigned to treatment or to control—namely, the same.

For instance, suppose we suspect that treatment tends to increase response: in general, \(t_i \ge c_i\). Then we might expect \(\bar{c} = \frac{1}{N_c} \sum_{i \in {\mathcal C}} c_i\) to be less than \(\bar{t} = \frac{1}{N_t} \sum_{i \in {\mathcal T}} t_i\). How large a difference between \(\bar{c}\) and \(\bar{t}\) would be evidence that treatment increases the response, beyond what might happen by chance through the luck of the draw?

This amounts to asking whether the observed difference in means between the two groups is a high percentile of the distribution of that difference in means, calculated on the assumption that the null hypothesis is true.

Because of how subjects are assigned to treatment or to control, all allocations of \(N_t\) subjects to treatment are equally likely.

One way to partition the \(N\) subjects randomly into a group of size \(N_c\) and a group of size \(N_t\) is to permute the \(N\) subjects at random, then take the first \(N_c\) in the permuted list to be the control group, and the remaining \(N_t\) to be the treatment group.
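
A minimal sketch of that procedure for the difference in means (the function name is ours):

# Two-sample permutation test by shuffling the pooled responses:
# after each shuffle, the first N_c entries form the control group.
def two_sample_perm_p(t_vals, c_vals, reps=10**4, prng=np.random):
    z = np.concatenate([c_vals, t_vals])      # pooled responses
    n_c = len(c_vals)
    obs = np.mean(t_vals) - np.mean(c_vals)   # observed difference in means
    hits = 0
    for _ in range(reps):
        zp = prng.permutation(z)
        hits += (np.mean(zp[n_c:]) - np.mean(zp[:n_c])) >= obs
    return hits/reps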

Gender Bias in Teaching Evaluations

MacNell, Driscoll, and Hunt (2014. What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching, Innovative Higher Education) conducted a controlled, randomized experiment on the effect of students’ perception of instructors’ gender on teaching evaluations in an online course. Students in the class did not know the instructors’ true genders.

MacNell et al. randomized 47 students in an online course into four groups: two taught by a female instructor and two by a male instructor. One of the groups taught by each instructor was led to believe the instructor was male; the other was led to believe the instructor was female. Comparable instructor biographies were given to all students. Instructors treated the groups identically, including returning assignments at the same time.

When students thought the instructor was female, the responding students rated the instructor lower, on average, in every regard.

Characteristic   Mean rating difference (F - M)
Caring -0.52
Consistent -0.47
Enthusiastic -0.57
Fair -0.76
Feedback -0.47
Helpful -0.46
Knowledgeable -0.35
Praise -0.67
Professional -0.61
Prompt -0.80
Respectful -0.61
Responsive -0.22

Those results are for a 5-point scale, so a difference of 0.8 (Promptness) is 16% of the entire scale.

MacNell et al. graciously shared their data. The evaluation data are coded as follows:

Group
      3 (12 students) - TA identified as male, true TA gender female 
      4 (11 students) - TA identified as male, true TA gender male
      5 (11 students, 8 responders) - TA identified as female, true TA gender female
      6 (13 students, 12 responders) - TA identified as female, true TA gender male
tagender - 1 if TA is actually male, 0 if actually female 
taidgender - 1 if TA is identified as male, 0 if identified as female 
gender - 1 if student is male, 0 if student is female

There are grades for 47 students but evaluations for only 43 (4 did not respond: three in group 5 and one in group 6). The grades are not linked to the evaluations, per the IRB protocol.

Our null hypothesis will include the assumption that whether a student responds does not depend on the identified gender of the TA: the 4 nonresponders would have remained nonresponders had they been in that instructor’s other section, and no additional students would become nonresponders if they had been assigned to the same instructor’s other section.

Interlude: partitioning sets into more than two subsets

Recall that there are \(n \choose k\) ways of picking a subset of size \(k\) from \(n\) items; hence there are \(n \choose k\) ways of dividing a set into a subset of size \(k\) and one of size \(n-k\) (once you select those that belong to the subset of size \(k\), the rest must be in the complementary subset of size \(n-k\)).

In this problem, we are partitioning 47 things into 4 subsets, two of size 11, one of size 12, and one of size 13. How many ways are there of doing that?

Recall the Fundamental Rule of Counting: If a set of choices, \(T_1, T_2, \ldots, T_k\), could result, respectively, in \(n_1, n_2, \ldots, n_k\) possible outcomes, the entire set of \(k\) choices has \(\prod_{i=1}^k n_i\) possible outcomes.

We can think of the allocation of students to the four groups as choosing 11 of the 47 students for the first group, then 11 of the remaining 36 for the second, then 12 of the remaining 25 for the third. The fourth group must contain the remaining 13.

The number of ways of doing that is

\[\begin{equation*} {47 \choose 11}{36 \choose 11}{25 \choose 12} = \frac{47!}{11! 36!} \frac{36!}{11! 25!} \frac{25!}{12! 13!} = \frac{47!}{11! 11! 12! 13!}. \end{equation*}\]

Does the number depend on the order in which we made the choices? Suppose we made the choices in a different order: first 12 students for one group, then 11 for the second, then 13 for the third (the fourth gets the remaining 11 students). The number of ways of doing that is

\[\begin{equation*} {47 \choose 12}{35 \choose 11}{24 \choose 13} = \frac{47!}{12! 35!} \frac{35!}{11! 24!} \frac{24!}{13! 11!} = \frac{47!}{12! 11! 13! 11!}. \end{equation*}\]

The number does not depend on the order in which we make the choices.

By the same reasoning, the number of ways of dividing a set of \(n\) objects into \(m\) subsets of sizes \(k_1, \ldots k_m\) is given by the multinomial coefficient

\[\begin{equation*} {n \choose k_1, k_2, \ldots, k_m} = {n \choose k_1}{n-k_1 \choose k_2} {n-k_1-k_2 \choose k_3} \cdots {n - \sum_{i=1}^{m-1} k_i \choose k_m} \end{equation*}\]
\[\begin{equation*} = \frac{n! \, (n-k_1)! \, (n-k_1-k_2)! \cdots (n-k_1-\cdots-k_{m-1})!}{k_1! \, (n-k_1)! \, k_2! \, (n-k_1-k_2)! \cdots k_m!} = \frac{n!}{\prod_{i=1}^m k_i!}. \end{equation*}\]
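
A quick numerical check (using sp.special.binom, as elsewhere in these notes) that the count does not depend on the order of the choices:

# verify numerically that the two orders of choosing give the same count
a = sp.special.binom(47, 11)*sp.special.binom(36, 11)*sp.special.binom(25, 12)  # 11, 11, 12, then 13
b = sp.special.binom(47, 12)*sp.special.binom(35, 11)*sp.special.binom(24, 13)  # 12, 11, 13, then 11
print(a, b, np.isclose(a, b))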

We will check how surprising it would be for the means to be so much lower when the TA is identified as female, if in fact there is “no real difference” in how they were rated, and the apparent difference is just due to the luck of the draw: which students happened to end up in which section.

In the actual randomization, all \(47 \choose 11, 11, 12, 13\) allocations were equally likely. But there might be real differences between the two instructors. Hence, we’d like to use each of them as his or her own “control.”

Each student’s potential responses are represented by a ticket with 4 numbers:

  • the rating that the student would assign to instructor A if instructor A is identified as male

  • the rating that the student would assign to instructor A if instructor A is identified as female

  • the rating that the student would assign to instructor B if instructor B is identified as male

  • the rating that the student would assign to instructor B if instructor B is identified as female

The null hypothesis is that the first two numbers are equal and the second two numbers are equal, but the first two numbers might be different from the second two numbers. This corresponds to the hypothesis that students assigned to a given TA would rate him or her the same, whether that TA seemed to be male or female. For all students assigned instructor A, we know both of the first two numbers if the null hypothesis is true, but we know neither of the second two numbers. Similarly, if the null hypothesis is true, we know both of the second two numbers for all students assigned to instructor B, but we know neither of the first two numbers for those students.

Because of how the randomization was performed, all allocations of students to sections that keep the number of students in each section the same are equally likely, so in particular all allocations that keep the same students assigned to each actual instructor the same are equally likely. This is a subgroup of the full permutation group.

Hence, all \({23 \choose 12}\) ways of splitting the 23 students assigned to the female instructor into two groups, one with 11 students and one with 12, are equally likely. Similarly, all \({24 \choose 11}\) ways of splitting the 24 students assigned to the male instructor into two groups, one with 13 students and one with 11, are equally likely. We can thus imagine shuffling the female TA’s students between her sections and the male TA’s students between his sections, and examine the distribution of the difference between the mean score for the sections where the TA was identified as male and the mean score for the sections where the TA was identified as female. (The 4 nonresponders are also allocated at random between sections of the same instructor; they are not included in the averages.)

If the difference is rarely as large as the observed mean difference, the observed mean difference gives evidence that being identified as female really does lower the scores.

# Number of allocations to 11, 11, 12, 13
print(sp.special.binom(47,11)*sp.special.binom(36,11)*sp.special.binom(25,12))  # big number!
5.441753110664829e+25
(sp.special.binom(21,10)/sp.special.binom(23,12)) * \
    sp.special.binom(23,10)/sp.special.binom(24,11)
0.11956521739130435
# the data are in a .csv file called "Macnell-RatingsData.csv" in the directory Data
import pandas as pd

ratings = pd.read_csv("./Data/Macnell-RatingsData.csv")
categories = ratings.columns.values.tolist()[1:15]
print(ratings.head())
print(ratings.describe())
   group  professional  respect  caring  enthusiastic  communicate  helpful  \
0      3           5.0      5.0     4.0           4.0          4.0      3.0   
1      3           4.0      4.0     4.0           4.0          5.0      5.0   
2      3           5.0      5.0     5.0           5.0          5.0      5.0   
3      3           5.0      5.0     5.0           5.0          5.0      3.0   
4      3           5.0      5.0     5.0           5.0          5.0      5.0   

   feedback  prompt  consistent  fair  responsive  praised  knowledgeable  \
0       4.0     4.0         4.0   4.0         4.0      4.0            3.0   
1       5.0     5.0         3.0   4.0         5.0      5.0            5.0   
2       5.0     5.0         5.0   5.0         5.0      5.0            5.0   
3       5.0     5.0         5.0   5.0         3.0      5.0            5.0   
4       5.0     3.0         4.0   5.0         5.0      5.0            5.0   

   clear  overall  gender     age  tagender  taidgender  
0    5.0      4.0     2.0  1990.0         0           1  
1    5.0      4.0     1.0  1992.0         0           1  
2    5.0      5.0     2.0  1991.0         0           1  
3    5.0      5.0     2.0  1991.0         0           1  
4    5.0      5.0     2.0  1992.0         0           1  
           group  professional    respect     caring  enthusiastic  \
count  47.000000     43.000000  43.000000  43.000000     43.000000   
mean    4.531915      4.325581   4.325581   3.930233      3.906977   
std     1.158167      1.062811   1.062811   1.055492      1.019196   
min     3.000000      1.000000   1.000000   1.000000      1.000000   
25%     3.500000      4.000000   4.000000   3.500000      4.000000   
50%     5.000000      5.000000   5.000000   4.000000      4.000000   
75%     6.000000      5.000000   5.000000   5.000000      4.500000   
max     6.000000      5.000000   5.000000   5.000000      5.000000   

       communicate    helpful   feedback     prompt  consistent       fair  \
count    43.000000  43.000000  43.000000  43.000000   43.000000  43.000000   
mean      3.953488   3.744186   3.953488   3.976744    3.744186   3.906977   
std       1.068009   1.071115   1.132916   1.079867    1.156617   0.995560   
min       1.000000   1.000000   1.000000   1.000000    1.000000   1.000000   
25%       4.000000   3.000000   4.000000   4.000000    3.000000   3.500000   
50%       4.000000   4.000000   4.000000   4.000000    4.000000   4.000000   
75%       5.000000   5.000000   5.000000   5.000000    5.000000   5.000000   
max       5.000000   5.000000   5.000000   5.000000    5.000000   5.000000   

       responsive    praised  knowledgeable      clear    overall     gender  \
count   43.000000  43.000000      43.000000  43.000000  43.000000  43.000000   
mean     3.767442   4.209302       4.139535   3.720930   3.953488   1.534884   
std      0.996116   0.940064       0.989983   1.259735   1.022450   0.504685   
min      1.000000   1.000000       1.000000   1.000000   1.000000   1.000000   
25%      3.000000   4.000000       4.000000   3.000000   4.000000   1.000000   
50%      4.000000   4.000000       4.000000   4.000000   4.000000   2.000000   
75%      4.500000   5.000000       5.000000   5.000000   5.000000   2.000000   
max      5.000000   5.000000       5.000000   5.000000   5.000000   2.000000   

               age   tagender  taidgender  
count    43.000000  47.000000   47.000000  
mean   1990.325581   0.510638    0.489362  
std       4.115713   0.505291    0.505291  
min    1982.000000   0.000000    0.000000  
25%    1990.000000   0.000000    0.000000  
50%    1990.000000   1.000000    0.000000  
75%    1991.000000   1.000000    1.000000  
max    2012.000000   1.000000    1.000000  
from permute.core import corr, two_sample
from permute.utils import get_prng, permute_within_groups
from scipy.stats import ttest_ind  # needed for the 't' statistic option below
rs = np.random.RandomState(seed=123456789)  # set the seed of the PRNG, for reproducibility
def stratified_two_sample(x, y, group_x, group_y, stat='mean', alternative="greater", reps=10**4, 
                          keep_dist=False, seed=None):
    """
    One-sided or two-sided, two-sample permutation test for equality of
    two means, with p-value estimated by simulated random sampling with
    reps replications.

    Tests the hypothesis that x and y are a random partition of x,y
    against the alternative that x comes from a population with mean

    (a) greater than that of the population from which y comes,
        if alternative = 'greater'
    (b) less than that of the population from which y comes,
        if alternative = 'less'
    (c) different from that of the population from which y comes,
        if alternative = 'two-sided'

    Permutations are carried out within the given groups.  Under the null hypothesis,
    observations within each group are exchangeable.

    If ``keep_dist``, return the simulated distribution of the test statistic
    in addition to the p-value and test statistic; otherwise, return only the
    p-value and test statistic.

    Parameters
    ----------
    x : array-like
        Sample 1
    y : array-like
        Sample 2
    group_x : array-like
        Group assignments for sample 1
    group_y : array-like
        Group assignments for sample 2
    stat : {'mean', 't'}
        The test statistic.

        (a) If stat == 'mean', the test statistic is (mean(x) - mean(y))
            (equivalently, sum(x), since those are monotonically related), omitting
            NaNs, which therefore can be used to code non-responders
        (b) If stat == 't', the test statistic is the two-sample t-statistic--
            but the p-value is still estimated by the randomization,
            approximating the permutation distribution.
            The t-statistic is computed using scipy.stats.ttest_ind
        (c) If stat is a function (a callable object), the test statistic is
            that function.  The function should take a permutation of the pooled
            data and compute the test function from it. For instance, if the
            test statistic is the Kolmogorov-Smirnov distance between the
            empirical distributions of the two samples, max_t |F_x(t) - F_y(t)|,
            the test statistic could be written:

            f = lambda u: np.max( \
                [abs(sum(u[:len(x)]<=v)/len(x)-sum(u[len(x):]<=v)/len(y)) for v in u]\
                )        
    alternative : {'greater', 'less', 'two-sided'}
        The alternative hypothesis to test
    reps : int
        Number of permutations
    keep_dist : bool
        flag for whether to store and return the array of values
        of the test statistic
    seed : RandomState instance or {None, int, RandomState instance}
        If None, the pseudorandom number generator is the RandomState
        instance used by `np.random`;
        If int, seed is the seed used by the random number generator;
        If RandomState instance, seed is the pseudorandom number generator.

    Returns
    -------
    p : float
        the estimated p-value
    ts : float
        the test statistic
    dist : list
        The distribution of test statistics.
        These values are only returned if `keep_dist` == True        
    """
    
    prng = get_prng(seed)
    
    z = np.concatenate([x, y])   # pooled responses
    groups = np.concatenate([group_x, group_y])   # pooled group assignments
    
    # If stat is callable, use it as the test function. Otherwise, look in the dictionary
    stats = {
        'mean': lambda u: np.nanmean(u[:len(x)]) - np.nanmean(u[len(x):]),
        't': lambda u: ttest_ind(
            u[:len(x)][~np.isnan(u[:len(x)])], 
            u[len(x):][~np.isnan(u[len(x):])], 
            equal_var=True)[0]
    }
    if callable(stat):
        tst_fun = stat
    else:
        tst_fun = stats[stat]

    theStat = {
        'greater': tst_fun,
        'less': lambda u: -tst_fun(u),
        'two-sided': lambda u: math.fabs(tst_fun(u))
    }
    tst = theStat[alternative](z)
    
    if keep_dist:
        dist = np.empty(reps)
        for i in range(reps):
            dist[i] = theStat[alternative](permute_within_groups(z, groups, seed=prng))
        hits = np.sum(dist >= tst)
        return hits/reps, tst, dist
    else:
        hits = np.sum([(theStat[alternative](permute_within_groups(z, groups, seed=prng)) >= tst)
                       for i in range(reps)])
        return hits/reps, tst
(p, t) = stratified_two_sample(ratings['overall'][ratings.taidgender==1], ratings['overall'][ratings.taidgender==0], 
                               ratings['tagender'][ratings.taidgender == 1], 
                               ratings['tagender'][ratings.taidgender == 0],
                               alternative = "two-sided", seed = rs)
print('Overall rating:')
print('Difference in means:', t)
print('P-value (two-sided):', np.round(p, 5), "\n")

print ('\n\n{0:24} {1:8} {2:8}'.format('Category', 'Diff means', 'P-value'))
for col in categories:
    (p, t) = stratified_two_sample(ratings[col][ratings.taidgender==1], ratings[col][ratings.taidgender==0], 
                               ratings['tagender'][ratings.taidgender == 1],
                               ratings['tagender'][ratings.taidgender == 0],
                               alternative = "two-sided", seed = rs)
    print ('{0:20} {1:8.2f} {2:8.2f}'.format(col, t, p))
Overall rating:
Difference in means: 0.4739130434782606
P-value (two-sided): 0.1201 



Category                 Diff means P-value 
professional             0.61     0.07
respect                  0.61     0.07
caring                   0.52     0.10
enthusiastic             0.57     0.06
communicate              0.57     0.07
helpful                  0.46     0.18
feedback                 0.47     0.16
prompt                   0.80     0.01
consistent               0.46     0.21
fair                     0.76     0.01
responsive               0.22     0.48
praised                  0.67     0.02
knowledgeable            0.35     0.29
clear                    0.41     0.29

Next: General permutation tests