Algorithms for Pseudo-Random Permutations and Samples¶

Random Sampling¶

Random sampling is one of the most fundamental tools in Statistics.

Used for:

Surveys and extrapolation
- Opinion surveys and polls
- Census (for some purposes) and Current Population Survey
- Environmental statistics
- Litigation, including class actions, discrimination, …
Experiments
- Medicine / clinical trials
- Agriculture
- Marketing
- Product development
Quality control and auditing
- Process control
- Financial auditing
- Healthcare auditing
- Election auditing
Sampling and resampling methods
- Bootstrap
- Permutation tests
- MCMC

and on and on.

Simple random sampling¶

Draw \(k \le n\) items from a population of \(n\) items, in such a way that each of the \(n \choose k\) subsets of size \(k\) is equally likely.

Many standard statistical methods assume the sample is drawn in this way, or allocated between treatment and control in this way (e.g., \(k\) of \(n\) subjects are assigned to treatment, and the remaining \(n-k\) to control).

Random permutations¶

\(n!\) possible permutations of \(n\) distinct objects.

\(n!\) grows quickly with \(n\):

\[\begin{equation*} n! \sim \sqrt {2\pi n}\left({\frac {n}{e}}\right)^n (\mbox{Stirling's approximation})\end{equation*}\]

Pigeon-hole principle¶

If you put \(N>n\) pigeons in \(n\) pigeonholes, then at least one pigeonhole must contain more than one pigeon.

Corollary¶

At most \(n\) pigeons can be put in \(n\) pigeonholes if at most one pigeon is put in each hole.

Some pigeon coops & flocks¶

Expression	full	scientific notation
\(2^{32}\)	4,294,967,296	4.29e9
\(2^{64}\)	18,446,744,073,709,551,616	1.84e19
\(2^{128}\)		3.40e38
\(2^{32 \times 624}\)		9.27e6010
\(13!\)	6,227,020,800	6.23e9
\(21!\)	51,090,942,171,709,440,000	5.11e19
\(35!\)		1.03e40
\(2084!\)		3.73e6013
\({50 \choose 10}\)	10,272,278,170	1.03e10
\({100 \choose 10}\)	17,310,309,456,440	1.73e13
\({500 \choose 10}\)		2.4581e20
\(\frac{2^{32}}{{50 \choose 10}}\)	0.418
\(\frac{2^{64}}{{500 \choose 10}}\)	0.075

We will come back to these numbers.

Random sampling¶

Given a good source of randomness, many ways to draw a simple random sample.

One basic approach is like shuffling a deck of \(n\) cards, then dealing the top \(k\): permute the population at random, then take the first \(k\) elements of the permutation to be the sample.

There are a number of standard ways to generate a random permutation—i.e., to shuffle the deck.

Algorithm `PIKK` (permute indices and keep \(k\))¶

For instance, if we had a way to generate independent, identically distributed (iid) \(U[0,1]\) random numbers, we could do it as follows:

Algorithm PIKK

assign iid \(U[0,1]\) numbers to the \(n\) elements of the population
the sample consists of the \(k\) items assigned the smallest random numbers (break ties randomly)
amounts to generating a random permutation of the population, then taking first \(k\).
if the numbers really are iid, every permutation is equally likely, and it follows that the first \(k\) are an SRS
requires \(n\) random numbers (and sorting)

%matplotlib inline
import math
import numpy as np
import scipy as sp
from scipy.special import comb, factorial
from scipy.optimize import brentq
from scipy.stats import chisquare, norm
import scipy.integrate
from random import Random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def PIKK(n,k, gen=np.random):
    try: 
        x = gen.random(n)
    except TypeError:
        x = np.array([gen.random() for i in np.arange(n)])
    return set(np.argsort(x)[0:k])

There are more efficient ways to generate a random permutation than assigning a number to each element and sorting, for instance the “Fisher-Yates shuffle” or “Knuth shuffle” (Knuth attributes it to Durstenfeld).

Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (backwards version)

for i=1, ..., n-1:
    J <- random integer uniformly distributed on {i, ..., n}
    (a[J], a[i]) <- (a[i], a[J])

This requires the ability to generate independent random integers on various ranges. Doesn’t require sorting.

There’s a version suitable for streaming, i.e., generating a random permutation of a list that has an (initially) unknown number \(n\) of elements:

Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (streaming version)

i <- 0
a = []
while there are records left:
    i <- i+1
    J <- random integer uniformly distributed on {1, ..., i}
    if J < i:
        a[i] <- a[J]
        a[J] <- next record
    else:
        a[i] <- next record

Again, need to be able to generate independent uniformly distributed random integers.

def fykd(a, gen=np.random):
    for i in range(len(a)-2):
        J = gen.randint(i,len(a))
        a[i], a[J] = a[J], a[i]
    return(a)

fykd(np.arange(10))

array([8, 3, 4, 7, 0, 9, 5, 1, 2, 6])

Proof that the streaming Fisher-Yates-Knuth-Durstenfeld algorithm works¶

Induction:

For \(i=1\), obvious.

At stage \(i\), suppose all \(i!\) permutations are equally likely. For each such permutation, there are \(i+1\) distinct, equally likely permutations at stage \(i+1\), resulting from swapping the \(i+1\)st item with one of the first \(i\), or putting it in position \(i+1\). These possibilities are mutually exclusive of the permutations attainable from a different permutation of the first \(i\) items.

Thus, this enumerates all \((i+1)i! = (i+1)!\) permutations of \((i+1)\) items, and they are equally likely. ■

A good way to get a bad shuffle¶

Sort using a “random” comparison function, e.g.,

def RandomSort(a,b):
    return (0.5 - np.random.random())

But this fails the basic properties of an ordering, e.g., transitivity, reflexiveness, etc. Doesn’t produce random permutation. Output also depends on sorting algorithm.

The right way, the wrong way, and the Microsoft way.¶

Notoriously used by Microsoft to offer a selection of browsers in the EU. Resulted in IE being 5th of 5 about half the time in IE and about 26% of the time in Chrome, but only 6% of the time in Safari (4th about 40% of the time).

See, e.g., http://www.computerworld.com/article/2520190/web-apps/microsoft-s-eu-ballot-fails-to-randomize-browser-order.html

Cormen et al. (2009) Algorithm `Random_Sample`¶

recursive algorithm
requires only \(k\) random numbers (integers)
does not require sorting

def Random_Sample(n, k, gen=np.random):  # from Cormen et al.
    if k==0:
        return set()
    else:
        S = Random_Sample(n-1, k-1)
        i = gen.randint(1,n+1) 
        if i in S:
            S = S.union([n])
        else:
            S = S.union([i])
    return S

Random_Sample(100,5)

{22, 44, 79, 82, 86}

Proof. Recursion.

Reservoir algorithms¶

The previous algorithms require \(n\) to be known. There are reservoir algorithms that do not. Moreover, the algorithms are suitable for streaming (aka online) use: items are examined sequentially and either enter into the reservoir, or, if not, are never revisited.

Algorithm `R`, Waterman (per Knuth, 1997)¶

Put first \(k\) items into the reservoir
when item \(k+j\) is examined, either skip it (with probability \(j/(k+j)\)) or swap for a uniformly selected item in the reservoir (with probability \(k/(k+j)\))
naive version requires generating at most \(n-k\) random numbers
closely related to FYKD shuffle

def R(n,k):  # Waterman's algorithm R
    S = np.arange(1, k+1)  # fill the reservoir
    for t in range(k+1,n+1):
        i = np.random.randint(1,t+1)
        if i <= k:
            S[i-1] = t
    return set(S)

R(100,5)

{15, 16, 20, 42, 55}

Like Random-Sample, the proof that algorithm R generates an SRS uses the ability to generate independent random integers, uniformly distributed on \(\{1, \ldots, m \}\).

Algorithm Z, Vitter (1985)¶

Much more efficient than R, using random skips. Works in time essentially linear in \(k\).

Note: Vitter proposes using the (inaccurate) \(J = \lfloor mU \rfloor\) to generate a random integer between \(0\) and \(m\) in both algorithm R and algorithm Z. The issue is pervasive!

PRNGs in Common packages¶

Package/Lang	default	other	SRS algorithm
SAS 9.2	MT	32-bit LCG
SPSS 20.0	32-bit LCG	MT1997ar
SPSS ≤ 12.0	32-bit LCG
STATA 13	KISS 32
STATA 14	MT
R	MT		trunc
python	MT		mask
MATLAB	MT

Key. MT = Mersenne Twister. LCG = linear congruential generator. ANS = “assign a number to each of the \(n\) items and sort.” The KISS generator combines 4 generators of three types: two multiply-with-carry generators, the 3-shift register SHR3 and the congruential generator CONG.

Prior to April 2011, in Stata 10 exhibited predictable behavior: 95.1% of the \(2^{31}\) possible seed values resulted in the first and second draws from rnormal() having the same sign.

Best Practices¶

Use a source of real randomness with a substantial amount of entropy to set the seed, e.g., 20 rolls of 10-sided dice.
Record the seed so your analysis is reproducible.
Use a PRNG at least as good as the Mersenne Twister, and preferably a cryptographic quality PRNG. Consider the PCG family. Avoid standard linear congruential generators and the Wichmann-Hill generator.
Use open-source software, and record the version of the software you use.
Use a sampling algorithm that does not “waste randomness.” Avoid permuting the entire population.
Be aware of discretization issues in the sampling algorithm; many methods assume the PRNG produces \(U[0,1]\) or \(U[0,1)\) random numbers, rather than (an approximation to) numbers that are uniform on \(w\)-bit binary integers.
Consider the size of the problem: are your PRNG and sampling algorithm adequate?
Avoid “tests of representativeness” and procedures that reject some samples. They alter the distribution of the sample.

References¶

Argyros, G. and A. Kiayias, 2012. PRNG: Pwning Random Number Generators. https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_WP.pdf
Bernstein, D.J., T. Lange, and R. Niederhagen, 2016. Dual EC: A Standardized Back Door, in The New Codebreakers, Essays Dedicated to David Kahn on the Occasion of his 85th Birthday, Ryan, P.Y.A., D. Naccache, and J-J Quisquater, eds., Springer, Berlin.
Cormen, T.H., C.E. Leiserson, R.L. Rivest and C. Stein, 2009. Introduction to Algorithms, 3rd edition, MIT Press.
Fishman, G.S., and L.R. Moore, 1981. In Search of Correlation in Multiplicative Congruential Generators with Modulus 2**31-1, Computer Science and Statistics: Proceedings of the 13 Symposium on the Interface, William F. Eddy, ed., Springer Verlag, New York.
Knuth, D., 1997 The Art of Computer Programming, V.II: Seminumerical methods, 3rd edition, Addison-Wesley, Boston.
L’Ecuyer, P. and R. Simard, 2007. TestU01: A C Library for Empirical Testing of Random Number Generators, ACM Trans. Math. Softw., 33, http://doi.acm.org/10.1145/1268776.1268777
Marsaglia, G., 1968. Random Numbers Fall Mainly in the Planes, PNAS, 61, 25–28.
Marsaglia, G., 2003. Xorshift RNGs. Journal of Statistical Software, 8, 1–6.
Matsumoto, M., and T. Nishimura, 1998. 8). Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Transactions on Modeling and Computer Simulation, 8, 3–30. doi:10.1145/272991.272995
McCullough, B.D., 2008. Microsoft’s ‘Not the Wichmann-Hill’ random number generator. Computational Statistics and Data Analysis, 52 (10), 4587–4593. http://dx.doi.org/10.1016/j.csda.2008.03.006
NIST Computer Security Division, Random Number Generation http://csrc.nist.gov/groups/ST/toolkit/rng/
O’Neill, M.E., 2015. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation, submitted to ACM Transactions on Mathematical Software. http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf
http://www.pcg-random.org/
Shannon, C.E., 1948. A Mathematical Theory of Communication, Bell System Technical Journal, 27, 379–423, 623–656.
Vitter, J.S., 1985. Random Sampling with a Reservoir, ACM Transactions on Mathematical Software, 11, 37–57.
Wikipedia articles, including https://en.wikipedia.org/wiki/Mersenne_Twister, https://en.wikipedia.org/wiki/Linear_congruential_generator, https://en.wikipedia.org/wiki/Comparison_of_hardware_random_number_generators, https://en.wikipedia.org/wiki/Pseudorandom_number_generator, https://en.wikipedia.org/wiki/List_of_random_number_generators, https://en.wikipedia.org/wiki/Random_number_generator_attack

Collaborative and Reproducible Data Science

Algorithms for Pseudo-Random Permutations and Samples

Contents

Algorithms for Pseudo-Random Permutations and Samples¶

Random Sampling¶

Simple random sampling¶

Random permutations¶

Pigeon-hole principle¶

Corollary¶

Some pigeon coops & flocks¶

Random sampling¶

Algorithm `PIKK` (permute indices and keep \(k\))¶

Proof that the streaming Fisher-Yates-Knuth-Durstenfeld algorithm works¶

A good way to get a bad shuffle¶

The right way, the wrong way, and the Microsoft way.¶

Cormen et al. (2009) Algorithm `Random_Sample`¶

Reservoir algorithms¶

Algorithm `R`, Waterman (per Knuth, 1997)¶

Algorithm Z, Vitter (1985)¶

PRNGs in Common packages¶

Best Practices¶

References¶

Collaborative and Reproducible Data Science

Algorithms for Pseudo-Random Permutations and Samples

Contents

Algorithms for Pseudo-Random Permutations and Samples¶

Random Sampling¶

Simple random sampling¶

Random permutations¶

Pigeon-hole principle¶

Corollary¶

Some pigeon coops & flocks¶

Random sampling¶

Algorithm PIKK (permute indices and keep \(k\))¶

Proof that the streaming Fisher-Yates-Knuth-Durstenfeld algorithm works¶

A good way to get a bad shuffle¶

The right way, the wrong way, and the Microsoft way.¶

Cormen et al. (2009) Algorithm Random_Sample¶

Reservoir algorithms¶

Algorithm R, Waterman (per Knuth, 1997)¶

Algorithm Z, Vitter (1985)¶

PRNGs in Common packages¶

Best Practices¶

References¶

Algorithm `PIKK` (permute indices and keep \(k\))¶

Cormen et al. (2009) Algorithm `Random_Sample`¶

Algorithm `R`, Waterman (per Knuth, 1997)¶