Goodness of Fit Tests

This notebook explores a few techniques for checking whether a statistical model is “adequate,” i.e., whether it fits the data so poorly that one can reject the hypothesis that the model generated the data.

We will look at two kinds of models: models with no fitted parameters and models with free parameters fitted to the data. One should expect that a model that is adjusted to fit the data by picking the values of free parameters will fit better than a model in which the values of those parameters are pre-specified without looking at the data.

It’s very common to fit models to data (e.g., normal, log-normal, linear regression, logistic regression, Poisson regression) without worrying about where the model comes from or even whether it is plausible.

In general, if you test whether a model fits, and then report the fitted model, \(P\)-values, confidence intervals, etc., only if it does fit, inferences will be “selective inferences.” The same thing goes for procedures that select which variables to include in a model, select transformations of the variables, select the functional form of the model, etc.: The \(P\)-values, confidence intervals, and so on will be misleading if you do not adjust appropriately for the selection process.

For examples, see:

  • Benjamini, Y. and D. Yekutieli, 2005. False Discovery Rate-Adjusted Multiple Confidence Intervals for Selected Parameters, Journal of the American Statistical Association, Theory and Methods, 100(469), DOI 10.1198/016214504000001907

  • Berk, R., L. Brown, and L. Zhao, 2009. Statistical Inference After Model Selection, J. Quantitative Criminology, DOI: 10.1007/s10940-009-9077-7

  • Berk, R., L. Brown, A. Buja, K. Zhang, and L. Zhao, 2013. Valid Post-Selection Inference, Annals of Statistics, 41, 802-837

  • Olshen, R., 1973. The Conditional Level of the \(F\) test, Journal of the American Statistical Association, Theory and Methods, 68(343), 692-698.

Models with no Free Parameters

Kolmogorov-Smirnov Test

One particularly useful test for models for real-valued data is the Kolmogorov-Smirnov test, one of many goodness-of-fit tests based on the cumulative distribution function.

Suppose \(\{X_j \}_{j=1}^n\) are IID with a continuous distribution. Define \(1_A\) = {1, if \(A\); 0, otherwise}. The theoretical cdf of \(X_i\) is

\[\begin{equation*} F(x) \equiv \mathbb P \{ X_i \le x \} . \end{equation*}\]

Define the empirical cumulative distribution function

\[\begin{equation*} \hat{F}_n (x) \equiv \frac{1}{n} \sum_{i=1}^n 1_{x \ge X_i} \end{equation*}\]

Consider the one-sided Kolmogorov-Smirnov statistics

\[\begin{equation*} D_n^+ \equiv \sup_x (\hat{F}_n(x) - F(x) ) \end{equation*}\]

and

\[\begin{equation*} D_n^- \equiv \sup_x (F(x) - \hat{F}_n(x)), \end{equation*}\]

and the two-sided Kolmogorov-Smirnov statistic

\[\begin{equation*} D_n \equiv \sup_x |\hat{F}_n(x) - F(x)|. \end{equation*}\]

Kolmogorov (1933) and Smirnov (1944) seem to have been the first to study these statistics. \(D_n^+\) and \(D_n^-\) have the same distribution, and if \(F\) is continuous, the distributions of all three statistics do not depend on \(F\).
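
For continuous \(F\), these suprema can be computed from the order statistics. Here is a minimal sketch (the name ks_statistics and its interface are ours, not a library routine), assuming the hypothesized cdf is passed as a callable:

import numpy as np

def ks_statistics(x, cdf):
    '''one- and two-sided KS statistics for the data x against a continuous cdf (a callable)'''
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    F = cdf(x)                            # theoretical cdf at the sorted observations
    i = np.arange(1, n + 1)
    d_plus = np.max(i/n - F)              # sup_x (F_hat(x) - F(x)), attained at an observation
    d_minus = np.max(F - (i - 1)/n)       # sup_x (F(x) - F_hat(x)), attained just below an observation
    return d_plus, d_minus, max(d_plus, d_minus)

# illustrative usage with the U[0, 1] cdf
print(ks_statistics(np.random.uniform(size=100), lambda u: u))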

Below is a simulation: we take a random sample of size \(n\) from a Uniform distribution and plot the ecdf and the true cdf. The ecdf is the stair-step function. In places it is above the true cdf, and in places it is below. \(D_n^-\) measures how far \(F\) ever gets above \(\hat{F}_n\). Note that as \(n\) increases, the maximum distance between \(F\) and \(\hat{F}_n\) tends to be smaller.

%matplotlib inline
from __future__ import division, print_function
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML
def ecdf(x):
    '''
       calculates the empirical cdf of data x
       returns the unique values of x in ascending order and the cumulative probability at those values
       NOTE: This is not an efficient algorithm: it is O(n^2), where n is the length of x.
       A faster implementation could use np.unique(x, return_counts=True) and a cumulative sum,
       which works in O(n log n)
    '''
    theVals = sorted(np.unique(x))
    theProbs = np.array([sum(x <= v) for v in theVals])/float(len(x))
    if (theVals[0] > 0.0):
        theVals = np.append(0., theVals)
        theProbs = np.append(0., theProbs)
    return theVals, theProbs


def plotUniformKolmogorov(n):
    '''plot the ecdf of a U[0, 1] sample of size n against the true cdf and mark D_n^-'''
    sam = np.random.uniform(size=n)
    v, pr = ecdf(sam)
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot([0, 1], [0, 1], linestyle='--', color='r', label='true cdf')
    ax.step(v, pr, color='b', where='post', label='ecdf')
    ax.legend(loc='best')
    # D_n^- = sup_x (F(x) - F_hat(x)): the ecdf equals pr[i] on [v[i], v[i+1]), so the gap
    # below the true cdf F(x) = x approaches v[i+1] - pr[i] as x increases to v[i+1]
    gaps = v[1:] - pr[:-1]
    dLoc = np.argmax(gaps)
    dMinus = gaps[dLoc]
    # mark the gap with a vertical green segment where the supremum is (nearly) attained
    ax.plot([v[dLoc+1], v[dLoc+1]], [pr[dLoc], v[dLoc+1]], color='g', linewidth=3)
    ax.text(0.5, 0.1, r'$D_n^-=$' + str(round(dMinus, 3)), color='g', weight='heavy')
    plt.show()

interact(plotUniformKolmogorov, n=widgets.IntSlider(min=3, max=300, step=1, value=10))

The Massart-Dvoretzky-Kiefer-Wolfowitz (MDKW) Inequality

Dvoretzky, Kiefer, and Wolfowitz (1956, DKW) showed that

\[\begin{equation*} \mathbb P \{D_n^- > t\} \le C \exp(- 2 n t^2) \end{equation*}\]

for some constant \(C\).

Massart (1990) showed that \(C = 1\) is the sharp constant in the DKW inequality, provided \(\exp(-2 n t^2) \le \frac{1}{2}\). The distribution of \(D_n^-\) is stochastically larger when \(F\) is continuous than when \(F\) has jumps (Massart 1990); thus the inequality is conservative for IID sampling from discrete or “mixed” distributions.

Moreover, \(D_n^-\) is stochastically larger for sampling with replacement than for sampling without replacement, so the inequality is conservative for sampling from a finite population without replacement as well.

Let’s call the inequality with \(C=1\) the Massart-Dvoretzky-Kiefer-Wolfowitz (MDKW) inequality. It follows from the MDKW inequality that

\[\begin{equation*} \mathbb P \left \{\sup_x (F(x) - \hat{F}_n(x)) > \sqrt{-\frac{\ln \alpha}{2n}} \right \} \le \alpha \end{equation*}\]

and (by symmetry) that

\[\begin{equation*} \mathbb P \left \{\sup_x |F(x) - \hat{F}_n(x)| > 2\sqrt{-\frac{\ln \alpha}{2n}} \right \} \le \alpha. \end{equation*}\]

This lets us construct goodness-of-fit tests for models for real-valued data (in contrast, for instance, to models for categorical data or models for multidimensional data, although there are related methods for such models).
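
These bounds are easy to compute. Here is a minimal sketch (the function names are ours, not from any library) that evaluates the critical values in the two displays above:

import math

def mdkw_bound(n, alpha):
    '''one-sided MDKW critical value: P{ sup_x (F(x) - F_n(x)) > mdkw_bound(n, alpha) } <= alpha'''
    return math.sqrt(-math.log(alpha) / (2 * n))

def mdkw_band(n, alpha):
    '''two-sided critical value from the display above: P{ sup_x |F(x) - F_n(x)| > mdkw_band(n, alpha) } <= alpha'''
    return 2 * mdkw_bound(n, alpha)

# e.g., for n = 100 and alpha = 0.05, the one-sided critical value is about 0.122
print(mdkw_bound(100, 0.05), mdkw_band(100, 0.05))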

Example: Testing whether data come from a uniform distribution on \([a, b]\)

To find a \(P\)-value for the hypothesis that \(\{X_j \}_{j=1}^n\) are IID \(U[a,b]\), calculate

\[\begin{equation*} D_n = \sup_{x \in [a, b]} \left | \frac{x-a}{b-a} - \hat{F}_n(x) \right | .\end{equation*}\]

(Note that the maximum will be attained at or infinitesimally before some \(X_j\).)

The \(P\)-value can be found by solving

\[\begin{equation*} D_n = 2\sqrt{-\frac{\ln P}{2n}} \end{equation*}\]
\[\begin{equation*} -2n \left ( D_n/2 \right )^2 = \ln P \end{equation*}\]
\[\begin{equation*} P = \exp(-2n \left ( D_n/2 \right )^2) \end{equation*}\]

(Of course, if \(\min_j X_j < a\) or \(\max_j X_j > b\), we know with certainty that \(\{X_j\}\) are not IID \(U[a,b]\).)
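
The calculation can be wrapped up as follows. This is a sketch (the name uniform_ks_p is ours), using the conservative MDKW \(P\)-value just derived:

import numpy as np

def uniform_ks_p(x, a, b):
    '''conservative MDKW P-value for the hypothesis that x is an IID sample from U[a, b]'''
    x = np.asarray(x, dtype=float)
    n = len(x)
    if x.min() < a or x.max() > b:
        return 0.0                       # impossible under the null, as noted above
    u = np.sort((x - a) / (b - a))       # under the null, these are IID U[0, 1]
    i = np.arange(1, n + 1)
    # the sup of |F - F_hat| is attained at or just before an observation
    d_n = np.max(np.maximum(np.abs(i/n - u), np.abs((i - 1)/n - u)))
    return min(1.0, np.exp(-2 * n * (d_n / 2) ** 2))

# illustrative usage: data that really are uniform on [2, 5]
print(uniform_ks_p(np.random.uniform(2, 5, size=100), 2, 5))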

Example: Testing whether data come from a Poisson process

This model has a parameter, but that will turn out not to matter, because our inference will be conditional on the observed number of events.

We observe \(X\) on the interval \([0, T]\). If \(X\) is a Poisson process with rate \(\lambda\), then, conditional on the number of events that occur between \(0\) and \(T\), the times of the events have the same distribution as the order statistics of that many IID uniform random variables on \([0, T]\). That is, if \(X(T) = n\), then the times of the \(n\) events are like an IID sample of size \(n\) from a \(U[0, T]\) distribution. Thus, to test whether \(X\) is indeed a Poisson process, we can use the previous test, taking \(a=0\) and \(b=T\).
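
Here is a sketch of the conditional test, reusing the uniform_ks_p function from the previous example (again, both names are ours):

def poisson_process_p(times, T):
    '''conditional P-value for the hypothesis that the event times on [0, T] come from a
       Poisson process: given the number of events, the times should be IID U[0, T]'''
    return uniform_ks_p(np.asarray(times, dtype=float), 0.0, T)

# illustrative usage: simulate a Poisson process with rate 3 on [0, 10]
T, lam = 10.0, 3.0
times = np.sort(np.random.uniform(0, T, size=np.random.poisson(lam * T)))
print(poisson_process_p(times, T))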

Example: Testing whether data come from a Poisson distribution with known rate \(\lambda\)

The CDF of the Poisson distribution with mean \(\lambda\) is

\[\begin{equation*} F(x) = e^{-\lambda} \sum_{j=0}^{\lfloor x \rfloor} \lambda^j/j!, \;\; x \ge 0, \end{equation*}\]

and \(F(x) = 0\) for \(x < 0\).

This can be used with the KS test to test whether data come from a Poisson distribution with mean \(\lambda\).
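
Here is a rough sketch (poisson_ks_p is our own name). Because the Poisson distribution is discrete, the MDKW bound is conservative here, as noted above; both \(F\) and \(\hat{F}_n\) are step functions that jump only at integers, so it suffices to compare them on an integer grid:

import numpy as np
from scipy.stats import poisson

def poisson_ks_p(x, lam):
    '''conservative MDKW P-value for the hypothesis that x is an IID sample from Poisson(lam)'''
    x = np.asarray(x)
    n = len(x)
    grid = np.arange(0, x.max() + 1)       # integers at which either cdf can jump
    F = poisson.cdf(grid, lam)             # theoretical cdf on the grid
    F_hat = np.searchsorted(np.sort(x), grid, side='right') / n   # ecdf on the grid
    d_n = np.max(np.abs(F_hat - F))
    return min(1.0, np.exp(-2 * n * (d_n / 2) ** 2))

# illustrative usage
print(poisson_ks_p(np.random.poisson(4.0, size=200), 4.0))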

Multinomial Models and the chi-square Goodness-of-Fit Test

A very common test for goodness-of-fit is based on the probability that the data fall in different categories.

The most natural setting is multinomial models, but (typically by binning the data) the test can be used in many situations, although in its usual form it is an approximate test, while the KS test is conservative.

Recall that if we make \(n\) IID draws, each of which falls in category \(j\) with probability \(\pi_j\), \(j = 1, \ldots, k\) (with \(k \ge 2\)), then the joint distribution of the numbers of draws that fall in the \(k\) categories is multinomial. A random vector \((X_1, \ldots, X_k)\) has a multinomial joint distribution with parameters \(n\) and \(\{\pi_j\}_{j=1}^k\) iff

\[\begin{equation*} \Pr \{X_1 = x_1, \ldots, X_k = x_k \} = \frac{n!}{x_1! x_2! \cdots x_k!} \prod_{j=1}^k \pi_j^{x_j}, \;\; x_j \ge 0,\;\; \sum_{j=1}^k x_j = n.\end{equation*}\]

The chi-square statistic for categorical data is

\[\begin{equation*} \chi^2 \equiv \sum_{j=1}^k \frac{(X_j - E_j)^2}{E_j}, \end{equation*}\]

where \(E_j = n \pi_j\) is the expected value of \(X_j\).

If \((X_1, \ldots, X_k)\) are jointly multinomial with category probabilities \(\{ \pi_j \}_{j=1}^k\), the probability distribution of \(\chi^2\) is asymptotically the chi-square distribution with \(k-1\) degrees of freedom as the expected counts \(\{E_j\}\) all grow; the approximation improves as the \(E_j\) get larger. See SticiGui: The Multinomial Distribution and the Chi-Squared Test for Goodness of Fit.

When \(\{E_j\}\) are not all large (and even when they are), the null distribution of \(\chi^2\) can be approximated by simulation; the theoretical approximation can be pretty bad.
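
The following sketch (the function names are ours) estimates the null distribution of the chi-square statistic by simulating multinomial counts with the hypothesized category probabilities:

import numpy as np
from scipy.stats import chi2

def chi_square_stat(counts, probs):
    '''chi-square statistic for observed category counts against expected counts n * probs'''
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() * np.asarray(probs, dtype=float)
    return np.sum((counts - expected) ** 2 / expected)

def simulated_chi_square_p(counts, probs, reps=10000):
    '''estimate the P-value of the chi-square statistic by simulation under the null;
       the +1 terms keep the estimated P-value away from zero'''
    n = int(np.sum(counts))
    observed = chi_square_stat(counts, probs)
    sims = np.array([chi_square_stat(np.random.multinomial(n, probs), probs)
                     for _ in range(reps)])
    return (np.sum(sims >= observed) + 1) / (reps + 1)

# illustrative usage, with the theoretical chi-square approximation for comparison
counts, probs = [18, 55, 22, 5], [0.1, 0.6, 0.25, 0.05]
print(simulated_chi_square_p(counts, probs),
      chi2.sf(chi_square_stat(counts, probs), df=len(counts) - 1))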

The chi-square test is often applied to artificial categories created by binning continuous data. That makes the test results depend on the (arbitrary) choice of bins. The \(P\)-values will generally be misleading if the bins are chosen after looking at the data: they should be specified in advance.

Example: Testing whether data come from a Poisson distribution with known rate \(\lambda\)

A different approach to testing whether data come from a Poisson distribution is to bin the possible outcomes \(\{0, 1, 2, \ldots\}\) into a finite number \(k\) of categories, then rely on the fact that the joint distribution of the number of observations that fall into each of the bins has a multinomial distribution.

For instance, one might define four bins, with the first bin containing the possible outcomes \(\{0, 1, 2\}\), the second containing \(\{3, 4, 5, 6\}\), the third containing \(\{7, 8, 9\}\), and the fourth containing \(\{10, 11, \ldots \}\). Then \(\pi_1 = e^{-\lambda} \sum_{\ell = 0}^2 \lambda^\ell/\ell!\), \(\pi_2 = e^{-\lambda} \sum_{\ell = 3}^6 \lambda^\ell/\ell!\), \(\pi_3 = e^{-\lambda} \sum_{\ell = 7}^9 \lambda^\ell/\ell!\), and \(\pi_4 = 1 - e^{-\lambda} \sum_{\ell = 0}^9 \lambda^\ell/\ell!\).

If you are doing this and intend to use the chi-square approximation to the null distribution of the chi-square statistic, you should define the bins in a way that ensures the expected count in every bin is large. If the sample size and/or some \(\pi_j\) is too small to have \(E_j\) large for all \(j\), don’t use the chi-square approximation.
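
To make this concrete, here is a sketch using the four bins above; the rate, the sample size, and the simulated data are purely illustrative:

import numpy as np
from scipy.stats import poisson, chi2

lam = 5.0                                  # hypothesized (known) rate; illustrative
x = np.random.poisson(lam, size=200)       # simulated data standing in for observations

# bin probabilities for {0,1,2}, {3,...,6}, {7,8,9}, and {10,11,...} under Poisson(lam)
cuts = np.array([2, 6, 9])
F = poisson.cdf(cuts, lam)
probs = np.array([F[0], F[1] - F[0], F[2] - F[1], 1.0 - F[2]])

# observed counts in the same bins
counts = np.array([np.sum(x <= 2), np.sum((x >= 3) & (x <= 6)),
                   np.sum((x >= 7) & (x <= 9)), np.sum(x >= 10)])

expected = len(x) * probs                  # check that these are all reasonably large
chi_sq = np.sum((counts - expected) ** 2 / expected)
p_approx = chi2.sf(chi_sq, df=len(counts) - 1)   # chi-square approximation with k - 1 = 3 df
print(expected, chi_sq, p_approx)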

Chi-square GOF when there are estimated parameters

If a model has \(p\) free parameters fitted to the data, the distribution of the chi-square statistic is still asymptotically the chi-square distribution, but with \(k-p-1\) degrees of freedom instead of \(k-1\).
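
Continuing the sketch from the previous example (it reuses x, cuts, and counts), suppose \(\lambda\) is not known and is estimated from the data; then \(p = 1\) and the reference distribution has \(k - 2\) degrees of freedom. (Strictly, the \(k-p-1\) result assumes the parameters are estimated efficiently from the binned counts; estimating \(\lambda\) by the raw-data mean, as below, is a common shortcut.)

# estimate the rate, recompute the bin probabilities, and use k - p - 1 degrees of freedom
lam_hat = x.mean()                         # maximum-likelihood estimate of the Poisson rate
F_hat = poisson.cdf(cuts, lam_hat)
probs_hat = np.array([F_hat[0], F_hat[1] - F_hat[0], F_hat[2] - F_hat[1], 1.0 - F_hat[2]])
expected_hat = len(x) * probs_hat
chi_sq_hat = np.sum((counts - expected_hat) ** 2 / expected_hat)
p_hat = chi2.sf(chi_sq_hat, df=len(counts) - 1 - 1)   # k - p - 1 = 2 degrees of freedom
print(expected_hat, chi_sq_hat, p_hat)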

Example: Poisson Regression

TBD