Models, response schedules, and estimators

This notebook summarizes some probability distributions and models related to them, and draws a distinction between a model and a response schedule.

Some common probability distributions

Discrete

  • Bernoulli: distribution of a single trial that can result in “success” (1) or “failure” (0). A random variable \(X\) has the Bernoulli(\(p\)) distribution iff

\[\begin{equation*} \Pr \{ X=1 \} = p, \;\; \Pr \{X=0\} = 1-p.\end{equation*}\]
  • Binomial: distribution of the number of successes in \(n\) independent Bernoulli(\(p\)) trials. Special case: Bernoulli (\(n=1\)). A random variable \(X\) has a Binomial(\(n,p\)) distribution iff

\[\begin{equation*} \Pr \{X=j\} = {{n}\choose{j}}p^j(1-p)^{n-j}, \; j=0, 1, \ldots, n.\end{equation*}\]
  • Geometric: distribution of the number of trials until the 1st success in independent Bernoulli(\(p\)) trials. A random variable \(X\) has a Geometric(\(p\)) distribution iff

\[\begin{equation*} \Pr \{X=j\} = (1-p)^{j-1}p, \;\; j=1, 2, \ldots .\end{equation*}\]
  • Negative binomial: distribution of the number of trials until the \(k\)th success in independent Bernoulli(\(p\)) trials. Special case: geometric (\(k=1\)). A random variable \(X\) has a negative binomial distribution with parameters \(p\) and \(k\) iff

\[\begin{equation*} \Pr \{X=j\} = {{j-1}\choose{k-1}}(1-p)^{j-k}p^k, \;\; j=k, k+1, \ldots .\end{equation*}\]
  • Poisson: limit of Binomial as \(n \rightarrow \infty\) and \(p \rightarrow 0\), with \(np= \lambda\). A random variable \(X\) has a Poisson(\(\lambda\)) distribution iff

\[\begin{equation*} \Pr \{X=j\} = e^{-\lambda} \frac{\lambda^j}{j!}, \;\;j=0, 1, \ldots .\end{equation*}\]
  • Hypergeometric: number of “good” items in a simple random sample of size \(n\) from a population of \(N\) items of which \(G\) are good. A random variable \(X\) has a hypergeometric distribution with parameters \(N\), \(G\), and \(n\) iff

\[\begin{equation*} \Pr \{X = j\} = \frac{{{G}\choose{j}}{{N-G}\choose{n-j}}}{{N}\choose{n}}, \;\; j = \max(0,n-(N-G)), \ldots, \min(n, G).\end{equation*}\]
  • Multinomial: joint distribution of the number of values in each of \(k \ge 2\) categories for \(n\) IID draws with probability \(\pi_j\) of selecting value \(j\) in each draw. Special cases: uniform distribution on \(k\) outcomes (\(n=1\), \(\pi_j = 1/k\)), binomial (\(k=2\)). A random vector \((X_1, \ldots, X_k)\) has a multinomial joint distribution with parameters \(n\) and \(\{\pi_j\}_{j=1}^k\) iff

\[\begin{equation*} \Pr \{X_j = x_j,\; j = 1, \ldots, k \} = \frac{n!}{x_1! x_2! \cdots x_k!} \prod_{j=1}^k \pi_j^{x_j}, \;\; x_j \ge 0,\;\; \sum_{j=1}^k x_j = n.\end{equation*}\]
  • Multi-hypergeometric: joint distribution of the number of values in each of \(k \ge 2\) categories for \(n\) draws without replacement from a finite population of \(N\) items of which \(N_j\) are in category \(j\). Special case: hypergeometric (\(k = 2\)). A random vector \((X_1, \ldots, X_k)\) has a multi-hypergeometric joint distribution with parameters \(n\) and \(\{N_j\}_{j=1}^k\) iff

\[\begin{equation*} \Pr \{ X_j = x_j,\; j = 1, \ldots, k \} = \frac{{{N_1}\choose{x_1}} \cdots {{N_k}\choose{x_k}}}{{{N}\choose{n}}}, \;\; x_j \ge 0;\;\; \sum_j x_j = n; \;\; \sum_j N_j = N.\end{equation*}\]
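
As a quick illustration of how these pmfs can be evaluated numerically, here is a minimal sketch using scipy.stats (assuming scipy is installed). It checks two of the relationships noted above: Bernoulli as Binomial with \(n=1\), and the Poisson limit of the Binomial.

```python
# A minimal sketch (assuming scipy is installed) evaluating a few of the pmfs
# above with scipy.stats, and checking two of the relationships noted:
# Bernoulli = Binomial(n=1), and Poisson as the limit of Binomial(n, lambda/n).
import numpy as np
from scipy import stats

p, lam = 0.3, 2.0

# Bernoulli(p) is Binomial(1, p)
print(stats.binom.pmf([0, 1], 1, p))        # same as the Bernoulli pmf
print(stats.bernoulli.pmf([0, 1], p))

# Binomial(n, lam/n) pmf approaches the Poisson(lam) pmf as n grows
k = np.arange(8)
for n in (10, 100, 10_000):
    print(n, np.max(np.abs(stats.binom.pmf(k, n, lam / n) - stats.poisson.pmf(k, lam))))

# Geometric(p): number of trials to the first success, as in the formula above
print(stats.geom.pmf([1, 2, 3], p))         # p, (1-p)p, (1-p)^2 p
```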

Continuous

  • Uniform on a domain \(\mathbf{S}\). A random variable \(X\) has a Uniform distribution on \(\mathbf{S}\) iff

\[\begin{equation*} \Pr \{X \in A\} = \frac{\int_{A \cap \mathbf{S}} dx}{\int_{\mathbf{S}} dx}.\end{equation*}\]

(Here and below, \(A\) needs to be a Lebesgue-measurable set; we will not worry about measurability.)

  • Normal. A random variable \(X\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\) iff

\[\begin{equation*} \Pr \{ X \in A \} = \int_A \frac{1}{\sqrt{2 \pi} \sigma} e^{-(x-\mu)^2/(2\sigma^2)} dx.\end{equation*}\]
  • Distributions derived from the normal: Student’s t, F, chi-square

  • Exponential. A random variable \(X\) has an exponential distribution with rate \(\lambda\) (mean \(\lambda^{-1}\)) iff

\[\begin{equation*} \Pr \{ X \in A \} = \int_{A \cap [0, \infty)} \lambda e^{-\lambda x} dx.\end{equation*}\]
  • Gamma. A random variable \(X\) has a Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\) iff

\[\begin{equation*} \Pr \{ X \in A \} = \int_{A \cap [0, \infty)}\frac{\beta ^{\alpha }}{\Gamma (\alpha )}x^{\alpha -1}e^{-\beta x} dx.\end{equation*}\]
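
For the continuous distributions, \(\Pr \{X \in A\}\) can be computed from the cdf when \(A\) is an interval. Here is a minimal sketch using scipy.stats with hypothetical parameter values; note that scipy parametrizes the exponential and Gamma distributions by a scale parameter, the reciprocal of the rate used above.

```python
# A minimal sketch (hypothetical parameter values) computing Pr{X in A} for an
# interval A = [a, b] from the cdf. Note: scipy.stats parametrizes the
# exponential and Gamma distributions by a scale parameter, the reciprocal of
# the rate used above.
from scipy import stats

a, b = 0.5, 2.0

mu, sigma = 1.0, 2.0            # normal mean and standard deviation
print(stats.norm.cdf(b, loc=mu, scale=sigma) - stats.norm.cdf(a, loc=mu, scale=sigma))

lam = 1.5                       # exponential rate; scipy uses scale = 1/lam
print(stats.expon.cdf(b, scale=1/lam) - stats.expon.cdf(a, scale=1/lam))

alpha, beta = 2.0, 3.0          # Gamma shape and rate; scipy uses scale = 1/beta
print(stats.gamma.cdf(b, alpha, scale=1/beta) - stats.gamma.cdf(a, alpha, scale=1/beta))
```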

What’s a model?

An expression for the probability distribution of data \(X\), usually “indexed” by a (possibly abstract, possibly infinite-dimensional) parameter, often relating some observables (independent variables, covariates, explanatory variables, predictors) to others (dependent variables, response variables, data, …).

\[\begin{equation*} X \sim \mathbb{P}_\theta, \;\; \theta \in \Theta. \end{equation*}\]

Examples

  • coins and 0-1 boxes

    • number of heads in 1 toss

    • number of heads in \(n\) tosses

    • number of tosses to the first head

    • number of tosses to the \(k\)th head

  • draws without replacement

    • boxes of numbers

    • boxes of categories

  • radioactive decay

  • Hooke’s Law, Ohm’s Law, Boyle’s Law

  • Conjoint analysis

  • avian-turbine interactions

Some models

  • Linear regression

  • Linear probability model

  • Logit

  • Probit

  • Multinomial logit

  • Poisson regression

Response schedules and causal inference

A response schedule is an assertion about how Nature generated the data: it says how one variable would respond if you intervened and changed the value of other variables.

Regression is about conditional expectation: the expected value of the response variable for cases selected on the basis of the values of the predictor variables.

Causal inference is about intervention: what would happen if the values of the predictor variables were exogenously set to some values.

Response schedules connect selection to intervention.

For conditioning to give the same result as intervention, the model has to be a response schedule, and the response schedule has to be correct.

  • Linear: a model for real-valued outcomes. \(Y_X = X\beta + \epsilon\). Nature picks \(X\), multiplies by \(\beta\), adds \(\epsilon\). \(X\) and \(\epsilon\) are independent.

    • Good examples (for suitable ranges of \(X\) and suitable instrumental error): Hooke’s law, Ohm’s law, Boyle’s law

    • Bad examples: most (if not all) applications in social science, including econometrics.

  • Linear probability model: a model for binary outcomes. \(Y_j = X_j\beta + \epsilon_j\), where the \(\{\epsilon_j\}\) are IID with mean zero and independent of \(X\), so \(\Pr \{Y_j = 1 | X \} = X_j\beta\). Not guaranteed to give probabilities between 0 and 1 when fitted to data.

  • Logit: a model for binary outcomes. The logistic distribution function is \(\Lambda(x) = e^x/(1+e^x)\). The logit function is \(\mathrm{logit}\, p \equiv \log [p/(1-p)]\), also called the log odds. The logit model is that \(\{Y_j\}\) are independent with \(\Pr \{Y_j = 1 | X \} = \Lambda(X_j \beta)\). Equivalently, \(\mathrm{logit} \Pr(Y_j=1 | X) = X_j \beta\). Also equivalently, the latent variable formulation

\[\begin{equation*} Y_j = \begin{cases} 1, & X_j\beta + U_j \ge 0\\ 0, & \mathrm{otherwise,} \end{cases}\end{equation*}\]

where \(\{U_j \}\) are IID random variables with the logistic distribution, and are independent of \(X\). (A simulation sketch of this latent-variable formulation appears after this list of models.)

  • Probit: a model for binary outcomes. Let \(\Phi\) denote the standard normal cdf. The probit model is that \(\{Y_j\}\) are independent with \(\Pr \{Y_j = 1 | X \} = \Phi(X_j \beta)\). Equivalently, the latent variable formulation

\[\begin{equation*} Y_j = \begin{cases} 1, & X_j\beta + U_j \ge 0\\ 0, &\mathrm{otherwise,} \end{cases}\end{equation*}\]

where \(\{U_j \}\) are IID random variables with the standard normal distribution, and are independent of \(X\).

  • Multinomial logit: a model for categorical outcomes. Suppose there are \(K\) categories. The multinomial logit model is that \(\{Y_j\}\) are independent with

\[\begin{equation*} \Pr \{Y_j = k | X \} = \begin{cases} \frac{e^{X_j \beta_k}}{1 + \sum_{\ell=1}^{K-1}e^{X_j \beta_\ell}}, & k=1, \ldots, K-1 \\ \frac{1}{1 + \sum_{\ell=1}^{K-1}e^{X_j \beta_\ell}}, & k=K. \end{cases} \end{equation*}\]
  • Poisson regression: a model for non-negative counts. The model is that \(\{Y_j\}\) are independent Poisson random variables with corresponding rates \(\{\lambda_j\}\), and that, given \(X\),

\[\begin{equation*} \log \lambda_j = X_j \beta.\end{equation*}\]
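
The latent-variable formulations above can be simulated directly. Below is a minimal sketch (with hypothetical values for \(X_j\) and \(\beta\)) that generates data from the logit latent-variable formulation and checks that the simulated frequency of \(Y_j = 1\) matches \(\Lambda(X_j \beta)\); replacing the logistic draws with standard normal draws gives the probit version.

```python
# A minimal sketch (hypothetical X_j and beta) simulating the latent-variable
# formulation of the logit model and checking that the simulated frequency of
# Y_j = 1 is close to Lambda(X_j beta). Replacing the logistic draws with
# standard normal draws gives the probit version.
import numpy as np

rng = np.random.default_rng(0)

def Lambda(x):
    """Logistic distribution function."""
    return 1 / (1 + np.exp(-x))

beta = np.array([0.5, -1.0])     # hypothetical coefficient vector
x_j = np.array([1.0, 0.3])       # hypothetical covariate row X_j
n_sim = 200_000

u = rng.logistic(size=n_sim)     # U_j IID logistic, independent of X
y = (x_j @ beta + u >= 0).astype(int)

print(y.mean(), Lambda(x_j @ beta))   # the two should be close
```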

Poisson regression

Poisson regression fits a Poisson model with parametrically related rates \(\{\lambda_j\}\) to count data \(\{Y_j\}\) and covariates \(\{X_j\}\) (each \(X_j\) is a \(p\)-vector).

According to the model, \(\{Y_j\}\) are independent, and

\[\begin{equation*} Y_j \sim \mbox{Poisson}(\lambda_j), \end{equation*}\]

where, given \(X_j\), \(\ln \lambda_j = X_j \beta\), for some \(\beta \in \Re^p\).

Equivalently, \(\{Y_j\}\) are independent and \(\log \mathbb{E}(Y_j | X_j) = X_j \beta\).

Poisson regression estimates \(\beta\) from observations of \(Y\) and \(X\).
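
As a concrete (hypothetical) illustration of the model, the sketch below generates counts \(\{Y_j\}\) from the Poisson regression model for a made-up \(\beta\) and simulated covariates.

```python
# A minimal sketch (made-up beta, simulated covariates) of data generated
# according to the Poisson regression model: given X_j, Y_j is Poisson with
# rate lambda_j = exp(X_j beta).
import numpy as np

rng = np.random.default_rng(0)
n = 500
beta = np.array([0.5, -1.0])                              # hypothetical beta
X = np.column_stack([np.ones(n), rng.normal(size=n)])     # each X_j is a 2-vector
lam = np.exp(X @ beta)                                    # lambda_j = exp(X_j beta)
Y = rng.poisson(lam)                                      # independent Poisson counts

print(Y[:10])
print(lam[:10].round(2))
```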

Maximum likelihood

The most common estimator for this model is the maximum-likelihood estimator (MLE), which we shall derive.

Given \(X_j=x_j\), the pmf of \(Y_j\) is

\[\begin{equation*} \Pr \{Y_j = y | X_j = x_j; \beta \} = e^{-\lambda_j} \lambda_j^y/y! = e^{-e^{x_j \beta}} (e^{x_j\beta })^y/y! = e^{y x_j\beta - e^{x_j\beta }}/y!, \;\; y = 0, 1, \ldots \end{equation*}\]
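
As a quick numerical sanity check (with hypothetical values of \(x_j\), \(\beta\), and \(y\)), the algebra above agrees with the Poisson pmf evaluated at rate \(e^{x_j \beta}\):

```python
# A quick numerical check (hypothetical x_j, beta, and y) that the algebra above
# agrees with the Poisson pmf evaluated at rate lambda_j = exp(x_j beta).
import numpy as np
from math import factorial
from scipy import stats

x_j = np.array([1.0, 0.4])
beta = np.array([0.2, -0.5])
y = 3
eta = x_j @ beta                                   # x_j beta

print(np.exp(y * eta - np.exp(eta)) / factorial(y))
print(stats.poisson.pmf(y, np.exp(eta)))           # same value
```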

Since \(\{Y_j\}\) are independent, their joint pmf is the product of their marginal pmfs. Thus the likelihood as a function of \(\gamma\), given \(X = (x_1, \ldots, x_n)\) and \(Y = (y_1, \ldots, y_n)\), is

\[\begin{equation*} L(\gamma) \equiv \prod_{j=1}^n e^{y_j x_j\gamma - e^{x_j\gamma}}/y_j! \end{equation*}\]

A value of \(\gamma \in \Re^p\) that maximizes this is the maximum-likelihood estimator (MLE), \(\hat{\beta}_{\mathrm{MLE}}\). Since the logarithm is a monotonic function, \(\hat{\beta}_{\mathrm{MLE}}\) is also the value of \(\gamma\) that maximizes

\[\begin{equation*} \ln L(\gamma) = \sum_{j=1}^n (y_j x_j\gamma - e^{x_j\gamma} - \ln y_j!) \end{equation*}\]

The last term in the summand, \(\ln y_j!\), does not depend on \(\gamma\), so \(\hat{\beta}_{\mathrm{MLE}}\) is also the value of \(\gamma \in \Re^p\) that maximizes

\[\begin{equation*} \ell(\gamma) \equiv \sum_{j=1}^n (y_j x_j\gamma - e^{x_j\gamma}). \end{equation*}\]
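
A minimal sketch (with hypothetical data) verifying that \(\ln L(\gamma)\) and \(\ell(\gamma)\) differ by a constant that does not depend on \(\gamma\):

```python
# A minimal sketch (hypothetical data) checking that ln L(gamma) and ell(gamma)
# differ only by the constant sum_j ln(y_j!), so they have the same maximizer.
import numpy as np
from scipy.special import gammaln     # gammaln(y + 1) = ln(y!)

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # hypothetical covariates
y = np.array([0, 2, 5])                                # hypothetical counts

def log_L(gamma):
    eta = X @ gamma
    return np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

def ell(gamma):
    eta = X @ gamma
    return np.sum(y * eta - np.exp(eta))

for gamma in (np.array([0.0, 0.0]), np.array([0.3, -0.2])):
    print(log_L(gamma) - ell(gamma))   # the same constant for every gamma
print(-np.sum(gammaln(y + 1)))         # namely -sum_j ln(y_j!)
```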

How can we find a maximizer? As a function of \(\gamma\), \(\ell\) is twice differentiable. Since there are no constraints on \(\gamma \in \Re^p\), the maximum (if it is attained) occurs at a stationary point.

The first derivative of \(\ell\) with respect to \(\gamma\) is

\[\begin{equation*} \ell'(\gamma) = \sum_{j=1}^n (y_j x_j - e^{x_j\gamma} x_j). \end{equation*}\]

The second derivative is

\[\begin{equation*} \ell''(\gamma) = - \sum_{j=1}^n e^{x_j\gamma} x_j^T x_j. \end{equation*}\]

This is the negative of a sum of positive semidefinite matrices, so it is negative semidefinite. Hence \(\ell\) is concave, and any stationary point is a global maximum. If the covariate matrix has full column rank, \(\ell''(\gamma)\) is negative definite, \(\ell\) is strictly concave, and the maximizer (if it exists) is unique.

The MLE is thus the value of \(\gamma \in \Re^p\) for which

\[\begin{equation*} \sum_{j=1}^n (y_j x_j - e^{x_j\gamma} x_j) = \sum_{j=1}^n (y_j - e^{x_j\gamma}) x_j = 0_p. \end{equation*}\]

This is a nonlinear system of equations. Since \(-\ell(\gamma)\) is a convex function, numerical methods for convex optimization can be used to minimize \(-\ell(\gamma)\) instead; that may be more stable in practice.

scipy.optimize has a number of algorithmic options for optimization and for vector root-finding. Be aware that numerical optimization is delicate, even for convex problems. In this problem we have analytic expressions for the derivative (gradient) and the Hessian matrix, so it can help to use methods that can incorporate that information. (Many methods use numerical approximations to the derivative and the Hessian; whether supplying the analytic versions saves time depends on how costly they are to evaluate.)
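
For instance, here is a minimal sketch (on simulated data with a hypothetical "true" \(\beta\)) that minimizes \(-\ell(\gamma)\) with scipy.optimize.minimize, supplying the analytic gradient and Hessian derived above; this is an illustration of one reasonable approach, not the only one.

```python
# A minimal sketch (simulated data, hypothetical "true" beta) that finds the MLE
# by minimizing -ell(gamma) with scipy.optimize.minimize, supplying the analytic
# gradient and Hessian derived above.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(12345)
n, p = 1_000, 3
beta_true = np.array([0.5, -0.25, 1.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.poisson(np.exp(X @ beta_true))

def neg_ell(gamma):
    eta = X @ gamma
    return -np.sum(y * eta - np.exp(eta))

def neg_grad(gamma):
    # -ell'(gamma) = -sum_j (y_j - e^{x_j gamma}) x_j
    return -X.T @ (y - np.exp(X @ gamma))

def neg_hess(gamma):
    # -ell''(gamma) = sum_j e^{x_j gamma} x_j^T x_j
    w = np.exp(X @ gamma)
    return X.T @ (w[:, None] * X)

res = minimize(neg_ell, x0=np.zeros(p), jac=neg_grad, hess=neg_hess,
               method="Newton-CG")
print(res.x)                              # should be close to beta_true
print(np.max(np.abs(neg_grad(res.x))))    # score equations approximately solved
```

Other methods are reasonable alternatives, e.g., 'trust-ncg' (which also uses the Hessian) or 'BFGS' (which approximates curvature from gradients).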