Models, response schedules, and estimators¶
This notebook summarizes some probability distributions and models related to them, and draws a distinction between a model and a response schedule.
Some common probability distributions¶
Discrete¶
Bernoulli: distribution of a single trial that can result in “success” (1) or “failure” (0). A random variable \(X\) has the Bernoulli(\(p\)) distribution iff
\[ \Pr \{X = x\} = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}. \]
Binomial: distribution of the number of successes in \(n\) independent Bernoulli(\(p\)) trials. Special case: Bernoulli (\(n=1\)). A random variable \(X\) has a Binomial(\(n,p\)) distribution iff
\[ \Pr \{X = x\} = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n. \]
Geometric: distribution of the number of trials until the 1st success in independent Bernoulli(\(p\)) trials. A random variable \(X\) has a Geometric(\(p\)) distribution iff
\[ \Pr \{X = x\} = p (1-p)^{x-1}, \quad x = 1, 2, \ldots. \]
Negative binomial: distribution of the number of trials until the \(k\)th success in independent Bernoulli(\(p\)) trials. Special case: geometric (\(k=1\)). A random variable \(X\) has a negative binomial distribution with parameters \(p\) and \(k\) iff
\[ \Pr \{X = x\} = \binom{x-1}{k-1} p^k (1-p)^{x-k}, \quad x = k, k+1, \ldots. \]
Poisson: limit of Binomial as \(n \rightarrow \infty\) and \(p \rightarrow 0\), with \(np= \lambda\). A random variable \(X\) has a Poisson(\(\lambda\)) distribution iff
\[ \Pr \{X = x\} = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots. \]
Hypergeometric: number of “good” items in a simple random sample of size \(n\) from a population of \(N\) items of which \(G\) are good. A random variable \(X\) has a hypergeometric distribution with parameters \(N\), \(G\), and \(n\) iff
\[ \Pr \{X = x\} = \frac{\binom{G}{x} \binom{N-G}{n-x}}{\binom{N}{n}}, \quad \max(0,\, n-(N-G)) \le x \le \min(n, G). \]
Multinomial: joint distribution of the number of values in each of \(k \ge 2\) categories for \(n\) IID draws with probability \(\pi_j\) of selecting value \(j\) in each draw. Special cases: uniform distribution on \(k\) outcomes (\(n=1\), \(\pi_j = 1/k\)), binomial (\(k=2\)). A random vector \((X_1, \ldots, X_k)\) has a multinomial joint distribution with parameters \(n\) and \(\{\pi_j\}_{j=1}^k\) iff
\[ \Pr \{X_1 = x_1, \ldots, X_k = x_k\} = \frac{n!}{x_1! \cdots x_k!} \prod_{j=1}^k \pi_j^{x_j}, \quad x_j \ge 0, \;\; \sum_{j=1}^k x_j = n. \]
Multi-hypergeometric: joint distribution of the number of values in each of \(k \ge 2\) categories for \(n\) draws without replacement from a finite population of \(N\) items of which \(N_j\) are in category \(j\). Special case: hypergeometric (\(k = 2\)). A random vector \((X_1, \ldots, X_k)\) has a multi-hypergeometric joint distribution with parameters \(n\) and \(\{N_j\}_{j=1}^k\) iff
\[ \Pr \{X_1 = x_1, \ldots, X_k = x_k\} = \frac{\prod_{j=1}^k \binom{N_j}{x_j}}{\binom{N}{n}}, \quad x_j \ge 0, \;\; \sum_{j=1}^k x_j = n, \]
where \(N = \sum_{j=1}^k N_j\).
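Two of the special cases above are easy to check numerically. The following is a minimal sketch (not from the text; scipy.stats is an assumed tool here): Binomial(\(1, p\)) agrees with Bernoulli(\(p\)), and Binomial(\(n, \lambda/n\)) approaches Poisson(\(\lambda\)) for large \(n\).

```python
# A sketch using scipy.stats (assumed; not part of the original text).
from scipy import stats

p = 0.3
print(stats.binom.pmf([0, 1], n=1, p=p))    # Binomial(1, p): [0.7, 0.3]
print(stats.bernoulli.pmf([0, 1], p=p))     # Bernoulli(p): the same

lam, n = 2.0, 10_000
print(stats.binom.pmf(3, n=n, p=lam / n))   # close to ...
print(stats.poisson.pmf(3, mu=lam))         # ... the Poisson(2) pmf at 3
```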
Continuous¶
Uniform on a domain \(\mathbf{S}\). A random variable \(X\) has a Uniform distribution on \(\mathbf{S}\) iff
\[ \Pr \{X \in A\} = \frac{|A \cap \mathbf{S}|}{|\mathbf{S}|}, \]
where \(|\cdot|\) denotes Lebesgue measure.
(Here and below, \(A\) needs to be a Lebesgue-measurable set; we will not worry about measurability.)
Normal. A random variable \(X\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\) iff
\[ \Pr \{X \in A\} = \int_A \frac{1}{\sigma \sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)} \, dx. \]
Distributions derived from the normal: Student’s t, F, chi-square
Exponential. A random variable \(X\) has an exponential distribution with rate \(\lambda\) (mean \(\lambda^{-1}\)) iff
\[ \Pr \{X \in A\} = \int_{A \cap [0, \infty)} \lambda e^{-\lambda x} \, dx. \]
Gamma. A random variable \(X\) has a Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\) iff
\[ \Pr \{X \in A\} = \int_{A \cap [0, \infty)} \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \, dx. \]
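As a quick numerical check (a sketch, not from the text; note that scipy parameterizes these families by scale = 1/rate), the Gamma density with shape \(\alpha = 1\) and rate \(\beta = \lambda\) coincides with the Exponential(\(\lambda\)) density:

```python
# A numerical check (illustrative; not from the original text).
import math
from scipy import stats

lam, x = 2.0, 0.7
print(stats.gamma.pdf(x, a=1, scale=1 / lam))  # Gamma(shape 1, rate lam)
print(stats.expon.pdf(x, scale=1 / lam))       # Exponential(lam): the same
print(lam * math.exp(-lam * x))                # the density evaluated directly
```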
What’s a model?¶
An expression for the probability distribution of data \(X\), usually “indexed” by a (possibly abstract, possibly infinite-dimensional) parameter, often relating some observables (independent variables, covariates, explanatory variables, predictors) to others (dependent variables, response variables, data, …).
Examples¶
coins and 0-1 boxes
number of heads in 1 toss
number of heads in \(n\) tosses
number of tosses to the first head
number of tosses to the \(k\)th head
draws without replacement
boxes of numbers
boxes of categories
radioactive decay
Hooke’s Law, Ohm’s Law, Boyle’s Law
Conjoint analysis
avian-turbine interactions
Some models¶
Linear regression
Linear probability model
Logit
Probit
Multinomial logit
Poisson regression
Response schedules and causal inference¶
A response schedule is an assertion about how Nature generated the data: it says how one variable would respond if you intervened and changed the value of other variables.
Regression is about conditional expectation: the expected value of the response variable for cases selected on the basis of the values of the predictor variables.
Causal inference is about intervention: what would happen if the values of the predictor variables were exogenously set to some values.
Response schedules connect selection to intervention.
For conditioning to give the same result as intervention, the model has to be a response schedule, and the response schedule has to be correct.
Linear: a model for real-valued outcomes. \(Y_X = X\beta + \epsilon\). Nature picks \(X\), multiplies by \(\beta\), adds \(\epsilon\). \(X\) and \(\epsilon\) are independent.
Good examples (for suitable ranges of \(X\) and suitable instrumental error): Hooke’s law, Ohm’s law, Boyle’s law
Bad examples: most (if not all) applications in social science, including econometrics.
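The following simulation is a minimal sketch (the data-generating choices and names are illustrative, not from the text). When \(X\) and \(\epsilon\) are independent, the regression slope recovers the causal coefficient \(\beta\); when \(\epsilon\) depends on \(X\), conditioning and intervention part ways.

```python
# Illustrative simulation (not from the original text).
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0
x = rng.normal(size=n)

# Response schedule holds: eps independent of x.
eps = rng.normal(size=n)
y = beta * x + eps
print((x @ y) / (x @ x))    # ~ 2.0: OLS slope recovers beta

# Confounded: eps depends on x, so x is not exogenous.
eps_c = x + rng.normal(size=n)
y_c = beta * x + eps_c
print((x @ y_c) / (x @ x))  # ~ 3.0: slope is beta + 1, not beta
```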
Linear probability model: a model for binary outcomes. \(Y_j = X_j \beta + \epsilon_j\), where the components of \(\epsilon\) are IID with mean zero, so \(\Pr \{Y_j = 1 | X \} = X_j\beta\). Not guaranteed to give probabilities between 0 and 1 when fitted to data.
Logit: a model for binary outcomes. The logistic distribution function is \(\Lambda(x) = e^x/(1+e^x)\). The logit function is \(\mathrm{logit}\, p \equiv \log [p/(1-p)]\), also called the log odds. The logit model is that \(\{Y_j\}\) are independent with \(\Pr \{Y_j = 1 | X \} = \Lambda(X_j \beta)\). Equivalently, \(\mathrm{logit}\, \Pr \{Y_j=1 | X\} = X_j \beta\). Also equivalently, the latent variable formulation
\[ Y_j = \begin{cases} 1, & X_j \beta + U_j > 0 \\ 0, & \text{otherwise,} \end{cases} \]
where \(\{U_j \}\) are IID random variables with the logistic distribution, and are independent of \(X\).
Probit: a model for binary outcomes. Let \(\Phi\) denote the standard normal cdf. The probit model is that \(\{Y_j\}\) are independent with \(\Pr \{Y_j = 1 | X \} = \Phi(X_j \beta)\). Equivalently, the latent variable formulation
\[ Y_j = \begin{cases} 1, & X_j \beta + U_j > 0 \\ 0, & \text{otherwise,} \end{cases} \]
where \(\{U_j \}\) are IID random variables with the standard normal distribution, and are independent of \(X\).
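A quick simulation (a sketch with illustrative names, not from the text) confirms that the latent-variable formulations reproduce the stated response probabilities: the empirical frequency of \(X_j \beta + U_j > 0\) matches \(\Lambda(X_j \beta)\) for logistic errors and \(\Phi(X_j \beta)\) for standard normal errors.

```python
# Illustrative simulation (not from the original text).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, xb = 1_000_000, 0.8               # xb plays the role of X_j beta

u_logit = rng.logistic(size=n)       # logistic latent errors
u_probit = rng.normal(size=n)        # standard normal latent errors

print((xb + u_logit > 0).mean(), stats.logistic.cdf(xb))  # both ~ 0.690
print((xb + u_probit > 0).mean(), stats.norm.cdf(xb))     # both ~ 0.788
```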
Multinomial logit: a model for categorical outcomes. Suppose there are \(K\) categories. The multinomial logit model is that \(\{Y_j\}\) are independent with
\[ \Pr \{Y_j = k | X\} = \frac{e^{X_j \beta_k}}{\sum_{m=1}^K e^{X_j \beta_m}}, \quad k = 1, \ldots, K \]
(a normalization such as \(\beta_K = 0\) is needed to make the parameters identifiable).
Poisson regression: a model for non-negative counts. The model is that \(\{Y_j\}\) are independent Poisson random variables with corresponding rates \(\{\lambda_j\}\) and that
\[ \ln \lambda_j = X_j \beta. \]
Poisson regression¶
Poisson regression fits Poisson models with parametrically related rates \(\lambda_j\) to count data \(\{Y_j\}_{j=1}^n\) and covariates \(\{X_j\}_{j=1}^n\) (each \(X_j\) is a \(p\)-vector).
According to the model, \(\{Y_j\}\) are independent, and
\[ \Pr \{Y_j = y | X_j\} = e^{-\lambda_j} \frac{\lambda_j^y}{y!}, \quad y = 0, 1, 2, \ldots, \]
where, given \(X_j\), \(\ln \lambda_j = X_j \beta\), for some \(\beta \in \Re^p\).
Equivalently, \(\{Y_j\}\) are independent and \(\ln \mathbb{E}(Y_j | X_j) = X_j \beta\).
Poisson regression estimates \(\beta\) from observations of \(Y\) and \(X\).
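For orientation, here is a sketch of fitting the model to simulated data with an off-the-shelf routine (statsmodels is an assumption; the text does not prescribe a library). The derivation below constructs the estimator such a routine computes.

```python
# A sketch (assumed library, simulated data; not from the original text).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1_000
X = sm.add_constant(rng.normal(size=(n, 2)))  # n x 3 design with intercept
beta = np.array([0.5, 1.0, -0.5])
y = rng.poisson(np.exp(X @ beta))             # counts with ln rate = X beta

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                             # should be close to beta
```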
Maximum likelihood¶
The most common estimator for this model is the maximum-likelihood estimator (MLE), which we shall derive.
Given \(X_j=x_j\), the pmf of \(Y_j\) is
\[ \Pr \{Y_j = y | X_j = x_j\} = e^{-e^{x_j \beta}} \frac{e^{y x_j \beta}}{y!}, \quad y = 0, 1, 2, \ldots. \]
Since \(\{Y_j\}\) are independent, their joint pmf is the product of their marginal pmfs. Thus, the likelihood as a function of \(\gamma\) given \(X = (x_1, \ldots, x_n)\) and \(Y = (y_1, \ldots, y_n)\) is
\[ L(\gamma) = \prod_{j=1}^n e^{-e^{x_j \gamma}} \frac{e^{y_j x_j \gamma}}{y_j!}. \]
A value of \(\gamma \in \Re^p\) that maximizes this is the maximum-likelihood estimator (MLE), \(\hat{\beta}_{\mathrm{MLE}}\). Since the logarithm is a monotonic function, \(\hat{\beta}_{\mathrm{MLE}}\) is also the value of \(\gamma\) that maximizes
\[ \ln L(\gamma) = \sum_{j=1}^n \left( y_j x_j \gamma - e^{x_j \gamma} - \ln y_j! \right). \]
The last term in the summand, \(\ln y_j!\), does not depend on \(\gamma\), so \(\hat{\beta}_{\mathrm{MLE}}\) is also the value of \(\gamma \in \Re^p\) that maximizes
\[ \ell(\gamma) \equiv \sum_{j=1}^n \left( y_j x_j \gamma - e^{x_j \gamma} \right). \]
How can we find a maximizer? As a function of \(\gamma\), \(\ell\) is twice differentiable. Since there are no constraints on \(\gamma \in \Re^p\), the maximum (if it is finite) occurs at a stationary point.
The first derivative of \(\ell\) with respect to \(\gamma\) is
\[ \frac{\partial \ell}{\partial \gamma} = \sum_{j=1}^n \left( y_j - e^{x_j \gamma} \right) x_j^T \]
(treating each \(x_j\) as a \(1 \times p\) row vector). The second derivative is
\[ \frac{\partial^2 \ell}{\partial \gamma \, \partial \gamma^T} = -\sum_{j=1}^n e^{x_j \gamma} \, x_j^T x_j. \]
This is the negative of a sum of positive semidefinite matrices (\(e^{x_j \gamma} > 0\) and each \(x_j^T x_j\) is positive semidefinite), so it is negative semidefinite, and \(\ell(\gamma)\) is concave. If the covariate vectors \(\{x_j\}\) span \(\Re^p\), the Hessian is negative definite, \(\ell\) is strictly concave, and the maximum (if attained) is unique and occurs at a stationary point.
The MLE is thus the value of \(\gamma \in \Re^p\) for which
\[ \sum_{j=1}^n \left( y_j - e^{x_j \gamma} \right) x_j^T = 0. \]
This is a nonlinear system of equations. Since \(-\ell(\gamma)\) is a convex function, numerical methods for convex optimization can be used to minimize \(-\ell(\gamma)\) instead; that may be more stable in practice.
scipy.optimize has a number of algorithms for optimization and for vector root-finding.
Be aware that numerical optimization is delicate, even for convex problems.
In this problem, we have analytic expressions for the gradient and the Hessian, so it can help to use methods that can exploit that information. (Many methods instead approximate the derivative and the Hessian numerically; depending on the cost of evaluating the analytic expressions, those approximations can save time.)
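To make this concrete, here is a minimal sketch (simulated data; the variable names are illustrative) that minimizes \(-\ell\) with scipy.optimize.minimize, supplying the analytic gradient and Hessian derived above to a Newton-type method.

```python
# A sketch with simulated data (not from the original text).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # n x p design
beta = np.array([0.5, 1.0, -0.5])
y = rng.poisson(np.exp(X @ beta))

def neg_ell(g):
    """-ell(gamma) = -sum_j (y_j x_j gamma - exp(x_j gamma))."""
    eta = X @ g
    return np.exp(eta).sum() - y @ eta

def neg_grad(g):
    """Negative gradient: -sum_j (y_j - exp(x_j gamma)) x_j^T."""
    return -X.T @ (y - np.exp(X @ g))

def neg_hess(g):
    """Negative Hessian: sum_j exp(x_j gamma) x_j^T x_j (positive semidefinite)."""
    return X.T @ (np.exp(X @ g)[:, None] * X)

res = optimize.minimize(neg_ell, np.zeros(p), jac=neg_grad,
                        hess=neg_hess, method="Newton-CG")
print(res.x)   # the MLE: should be close to beta
```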