Random Variables, Expectation, Random Vectors, and Stochastic Processes

Random Variables

A real-valued random variable is a mapping from outcome space \(\mathcal{S}\) to the real line \(\Re\). A real-valued random variable \(X\) can be characterized by its probability distribution, which specifies (for a suitable collection of subsets of the real line \(\Re\) that comprises a sigma-algebra), the chance that the value of \(X\) will be in each such subset. There are technical requirements regarding measurability, which generally we will ignore. Perhaps the most natural mathematical setting for probability theory involves Lebesgue integration; we will largely ignore the difference between a Riemann integral and a Lebesgue integral.

Let \(P_X\) denote the probability distribution of the random variable \(X\). Then if \(A \subset \Re\), \(P_X(A) = {\mathbb P} \{ X \in A \}\). We write \(X \sim P_X\), pronounced “\(X\) is distributed as \(P_X\)” or “\(X\) has distribution \(P_X\).”

If two random variables \(X\) and \(Y\) have the same distribution, we write \(X \sim Y\) and we say that \(X\) and \(Y\) are identically distributed.

Real-valued random variables can be continuous, discrete, or mixed (general).

Continuous random variables have probability density functions with respect to Lebesgue measure. If \(X\) is a continuous random variable, there is some nonnegative function \(f(x)\), the probability density of \(X\), such that for any (suitable) set \(A \subset \Re\),

\[ {\mathbb P} \{ X \in A \} = \int_A f(x) dx. \]

Since \({\mathbb P} \{ X \in \Re \} = 1\), it follows that \(\int_{-\infty}^\infty f(x) dx = 1\).

Example. Let \(f(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\), with \(\lambda > 0\) fixed, and \(f(x) = 0\) otherwise. Clearly \(f(x) \ge 0\).

\[ \int_{-\infty}^\infty f(x) dx = \int_0^\infty \lambda e^{-\lambda x} dx = - e^{-\lambda x} \Big |_0^\infty = - 0 + 1 = 1. \]

Hence, \(\lambda e^{-\lambda x}\) can be the probability density of a continuous random variable. A random variable with this density is said to be exponentially distributed. Exponentially distributed random variables are used to model radioactive decay and the failure of items that do not “fatigue.” For instance, the lifetime of a semiconductor after an initial “burn-in” period is often modeled as an exponentially distributed random variable. It is also a common model for the occurrence of earthquakes (although it does not fit the data well).
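As a numerical sanity check, the integral of this density can be evaluated with `scipy.integrate.quad`; the rates tried below are arbitrary choices:

```python
import numpy as np
from scipy import integrate

def expon_density(x, lam):
    """Exponential density: lam * exp(-lam * x) for x >= 0, and 0 otherwise."""
    return np.where(x >= 0, lam * np.exp(-lam * x), 0.0)

# the density should integrate to 1 for every rate lam > 0
for lam in (0.5, 1.0, 4.0):
    total, _ = integrate.quad(expon_density, 0, np.inf, args=(lam,))
    print(lam, total)   # total is 1 up to quadrature error
```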

Example. Let \(a\) and \(b\) be real numbers with \(a < b\), and let \(f(x) = \frac{1}{b-a}\), \(x \in [a, b]\) and \(f(x)=0\), otherwise. Then \(f(x) \ge 0\) and \(\int_{-\infty}^\infty f(x) dx = \int_a^b \frac{1}{b-a} dx = 1\), so \(f(x)\) can be the probability density function of a continuous random variable. A random variable with this density is said to be uniformly distributed on the interval \([a, b]\).

Discrete random variables assign all their probability to some countable set of points \(\{x_i\}_{i=1}^n\), where \(n\) might be infinite. Discrete random variables have probability mass functions. If \(X\) is a discrete random variable, there is a nonnegative function \(p\), the probability mass function of \(X\), such that for any set \(A \subset \Re\),

\[ {\mathbb P} \{X \in A \} = \sum_{i: x_i \in A} p(x_i). \]

The value \(p(x_i) = {\mathbb P} \{X = x_i\}\), and \(\sum_{i=1}^\infty p(x_i) = 1\).

Example. Fix \(\lambda > 0\). Let \(x_i = i-1\) for \(i=1, 2, \ldots\), and let \(p(x_i) = e^{-\lambda} \lambda^{x_i}/x_i!\). Then \(p(x_i) > 0\) and

\[ \sum_{i=1}^\infty p(x_i) = e^{-\lambda} \sum_{j=0}^\infty \lambda^j/j! = e^{-\lambda} e^{\lambda} = 1. \]

Hence, \(p(x)\) is the probability mass function of a discrete random variable. A random variable with this probability mass function is said to be _Poisson distributed with parameter \(\lambda\)_. Poisson-distributed random variables are often used to model rare events.
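The sum can be checked numerically; the sketch below truncates the infinite series at 100 terms (plenty for modest \(\lambda\); the choice \(\lambda = 3\) is arbitrary) and compares with `scipy.stats.poisson`:

```python
import numpy as np
from scipy import stats, special

lam = 3.0
k = np.arange(0, 101)                          # truncate the infinite sum; the tail is negligible
pmf = np.exp(-lam) * lam**k / special.factorial(k)

print(pmf.sum())                                     # very nearly 1
print(np.allclose(pmf, stats.poisson.pmf(k, lam)))   # agrees with scipy's Poisson pmf
```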

Example. Let \(x_i = i\) for \(i=1, \ldots, n\), and let \(p(x_i) = 1/n\) and \(p(x) = 0\), otherwise. Then \(p(x) \ge 0\) and \(\sum_{x_i} p(x_i) = 1\). Hence, \(p(x)\) can be the probability mass function of a discrete random variable. A random variable with this probability mass function is said to be uniformly distributed on \(1, \ldots, n\).

Example. Let \(x_i = i-1\) for \(i=1, \ldots, n+1\), and let \(p(x_i) = {n \choose x_i} p^{x_i} (1-p)^{n-x_i}\), and \(p(x) = 0\) otherwise. Then \(p(x) \ge 0\) and

\[ \sum_{x_i} p(x_i) = \sum_{j=0}^n {n \choose j} p^j (1-p)^{n-j} = 1, \]

by the binomial theorem. Hence \(p(x)\) is the probability mass function of a discrete random variable. A random variable with this probability mass function is said to be _binomially distributed with parameters \(n\) and \(p\)_. The number of successes in \(n\) independent trials that each have the same probability \(p\) of success has a binomial distribution with parameters \(n\) and \(p\). For instance, the number of times a fair die lands with 3 spots showing in 10 independent rolls has a binomial distribution with parameters \(n=10\) and \(p = 1/6\).
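The binomial theorem guarantees the sum is exactly 1; a quick numerical check (with \(n = 10\) and a few arbitrary values of \(p\)) using `scipy.special.comb`:

```python
import numpy as np
from scipy import special

n = 10
j = np.arange(n + 1)
for p in (0.1, 0.5, 0.9):
    pmf = special.comb(n, j) * p**j * (1 - p)**(n - j)
    print(p, pmf.sum())    # each sum is 1, up to roundoff, by the binomial theorem
```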

For general random variables, the chance that \(X\) is in some subset of \(\Re\) cannot be written as a sum or as a Riemann integral; it is more naturally represented as a Lebesgue integral (with respect to a measure other than Lebesgue measure). For example, imagine a random variable \(X\) that has probability \(\alpha\) of being equal to zero; and if \(X\) is not zero, it has a uniform distribution on the interval \([0, 1]\). Such a random variable is neither continuous nor discrete.
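One way to get a feel for such a mixed distribution is to simulate it; in this sketch the atom probability \(\alpha = 0.3\) and the sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(12345)   # seeded so the run is reproducible
alpha = 0.3                          # probability of the atom at zero (arbitrary choice)
n = 100_000

# with probability alpha the value is exactly 0; otherwise it is drawn from U[0, 1]
is_atom = rng.random(n) < alpha
x = np.where(is_atom, 0.0, rng.random(n))

print(np.mean(x == 0.0))    # ≈ alpha: a discrete atom at 0
print(np.mean(x <= 0.5))    # ≈ alpha + (1 - alpha)/2: the atom plus half the continuous mass
```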

Most of the random variables in this class are either discrete or continuous.

If \(X\) is a random variable such that, for some constant \(x_1 \in \Re\), \({\mathbb P}(X = x_1) = 1\), \(X\) is called a constant random variable.


### Exercises
  1. Show analytically that \(\sum_{x_i} p(x_i) = \sum_{j=0}^n {n \choose j} p^j (1-p)^{n-j} = 1\).

  2. Write a Python program that verifies that equation numerically for \(n=10\): for 1000 values of \(p\) equispaced on the interval \((0, 1)\), find the maximum absolute value of the difference between the sum and 1.

  3. Let \(p \in (0, 1]\); let \(x_i = i\) for \(i = 1, 2, \ldots\); and define \(p(x_i) = (1-p)^{x_i-1}p\), and \(p(x) = 0\) otherwise. Show analytically that \(p(x)\) is the probability mass function of a discrete random variable. (A random variable with this probability mass function is said to be geometrically distributed with parameter \(p\).)


Cumulative Distribution Functions

The cumulative distribution function or cdf of a real-valued random variable is the chance that the variable is less than or equal to \(x\), as a function of \(x\). Cumulative distribution functions are often denoted with capital Roman letters (\(F\) is especially common notation):

\[F_X(x) \equiv \mathbb{P}(X \le x).\]

Clearly:

  • \(0 \le F_X(x) \le 1\)

  • \(F_X(x)\) increases monotonically with \(x\) (i.e., \(F_X(a) \le F_X(b)\) if \(a \le b\)).

  • \(\lim_{x \rightarrow -\infty} F_X(x) = 0\)

  • \(\lim_{x \rightarrow \infty} F_X(x) = 1\)

The cdf of a continuous real-valued random variable is a continuous function. The cdf of a discrete real-valued random variable is piecewise constant, with jumps at the possible values of the random variable. If the cdf of a real-valued random variable has jumps and also regions where it is not constant, the random variable is neither continuous nor discrete.
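These properties can be verified numerically for a particular discrete random variable; the sketch below uses the Binomial(4, 1/2) cdf from `scipy.stats` (an arbitrary example):

```python
import numpy as np
from scipy import stats

F = stats.binom(4, 0.5).cdf          # cdf of a Binomial(4, 1/2) random variable
x = np.linspace(-2, 6, 801)
Fx = F(x)

print(Fx.min(), Fx.max())            # stays between 0 and 1
print(np.all(np.diff(Fx) >= 0))      # monotone nondecreasing: True
print(F(1) - F(1 - 1e-9))            # a jump at the atom x = 1
print(F(1.5) - F(1.0))               # flat between atoms: 0
```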

Examples

[To Do]

# boilerplate
%matplotlib inline
import math
import numpy as np
import scipy as sp
from scipy import stats  # distributions
from scipy import special # special functions
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, FloatRangeSlider, fixed # interactive stuff
# Examples of densities and cdfs

# U[0,1]
def pltUnif(a,b):
    ffac = 0.1
    s = b-a
    fudge = ffac*s
    x = np.arange(a-fudge, b+fudge, s/200)
    y = np.ones(len(x))/s
    y[x<a] = np.zeros(np.sum(x < a))   # zero for x < a
    y[x>b] = np.zeros(np.sum(x > b))   # zero for x > b
    Y = (x-a)/s   # uniform CDF is linear
    Y[x<a] = np.zeros(np.sum(x < a))
    Y[x >= b] = np.ones(np.sum(x >= b))
    plt.plot(x,y,'b-',x,Y,'r-',linewidth=2)
    plt.plot((a-fudge, b+fudge), (0.5, 0.5), 'g--')  # horizontal green dashed line at 0.5
    plt.plot((a-fudge, b+fudge), (0, 0), 'k-')  # horizontal black line at 0
    plt.xlabel('$x$')  # axis labels. Can use LaTeX math markup
    plt.ylabel(r'$f(x) = 1_{[a,b]}(x)/(b-a)$')
    plt.axis([a-fudge,b+fudge,-0.1,(1+ffac)*max(1, 1/s)])  # axis limits
    plt.title('The $U[$' + str(a) + ',' + str(b) + '$]$ density and cdf')
    plt.show()

interactive(lambda ab: pltUnif(*ab), \
            ab = FloatRangeSlider(min = -5, max = 5, step = 0.05, value = [-1, 1]))
# Exponential(lambda)

def plotExp(lam):
    ffac = 0.05
    x = np.arange(0, 5/lam, step=(5/lam)/200)
    y = sp.stats.expon.pdf(x, scale = 1/lam)
    Y = sp.stats.expon.cdf(x, scale = 1/lam)
    plt.plot(x,y,'b-',x,Y,'r-',linewidth=2)
    plt.plot((-.1, (1+ffac)*np.max(x)), (0.5, 0.5), 'g--')  # horizontal line at 0.5
    plt.plot((-.1, (1+ffac)*np.max(x)), (1, 1), 'k:')  # horizontal line at 1
    plt.xlabel('$x$')  # axis labels. Can use LaTeX math markup
    plt.ylabel(r'$f(x) = \lambda e^{-\lambda x}; F(x) = 1-e^{-\lambda x}$')
    plt.title(r'The exponential density and cdf for $\lambda=$' + str(lam))
    plt.axis([-.1,(1+ffac)*np.max(x),-0.1,(1+ffac)*max(1, lam)])  # axis limits
    plt.show()
    
interact(plotExp, lam=(0.1, 10, 0.1))  # keep lam > 0 to avoid division by zero

Jointly Distributed Random Variables

Often we work with more than one random variable at a time. Indeed, much of this course concerns random vectors, the components of which are individual real-valued random variables.

The joint probability distribution of a collection of random variables \(\{X_i\}_{i=1}^n\) gives the probability that the variables simultaneously fall in subsets of their possible values. That is, for every (suitable) subset \(A \subset \Re^n\), the joint probability distribution of \(\{X_i\}_{i=1}^n\) gives \({\mathbb P} \{ (X_1, \ldots, X_n) \in A \}\).

An event determined by the random variable \(X\) is an event of the form \(X \in A\), where \(A \subset \Re\).

An event determined by the random variables \(\{X_j\}_{j \in J}\) is an event of the form \((X_j)_{j \in J} \in A\), where \(A \subset \Re^{\#J}\).

Two random variables \(X_1\) and \(X_2\) are independent if every event determined by \(X_1\) is independent of every event determined by \(X_2\). If two random variables are not independent, they are dependent.

A collection of random variables \(\{X_i\}_{i=1}^n\) is independent if every event determined by every subset of those variables is independent of every event determined by any disjoint subset of those variables. If a collection of random variables is not independent, it is dependent.

Loosely speaking, a collection of random variables is independent if learning the values of some of them tells you nothing about the values of the rest of them. If learning the values of some of them tells you anything about the values of the rest of them, the collection is dependent.

For instance, imagine tossing a fair coin twice and rolling a fair die. Let \(X_1\) be the number of times the coin lands heads, and \(X_2\) be the number of spots that show on the die. Then \(X_1\) and \(X_2\) are independent: learning how many times the coin lands heads tells you nothing about what the die did.

On the other hand, let \(X_1\) be the number of times the coin lands heads, and let \(X_2\) be the sum of the number of heads and the number of spots that show on the die. Then \(X_1\) and \(X_2\) are dependent. For instance, if you know the coin landed heads twice, you know that the sum of the number of heads and the number of spots must be at least 3.
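Because both experiments have finitely many equally likely outcomes, independence and dependence here can be checked exactly by enumeration; a sketch using exact rational arithmetic:

```python
from itertools import product
from fractions import Fraction

# the 24 equally likely outcomes: (first toss, second toss, die), with 1 = heads
outcomes = list(product([0, 1], [0, 1], range(1, 7)))
prob = Fraction(1, len(outcomes))

def P(event):
    """Exact probability of an event, given as a predicate on outcomes."""
    return sum((prob for o in outcomes if event(o)), Fraction(0))

X1 = lambda o: o[0] + o[1]           # number of heads
X2 = lambda o: o[2]                  # number of spots on the die
S = lambda o: X1(o) + X2(o)          # heads plus spots

# independent: the product rule holds exactly
print(P(lambda o: X1(o) == 2 and X2(o) == 6),
      P(lambda o: X1(o) == 2) * P(lambda o: X2(o) == 6))
# dependent: knowing X1 = 2 rules out S = 2, so the product rule fails
print(P(lambda o: X1(o) == 2 and S(o) == 2),
      P(lambda o: X1(o) == 2) * P(lambda o: S(o) == 2))
```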

Expectation

See SticiGui: The Long Run and the Expected Value for an elementary introduction to expectation.

The expectation or expected value of a random variable \(X\), denoted \({\mathbb E}X\), is a probability-weighted average of its possible values. From a frequentist perspective, it is the long-run limit (in probability) of the average of its values in repeated experiments. The expected value of a real-valued random variable (when it exists) is a fixed number, not a random value. The expected value depends on the probability distribution of \(X\) but not on any realized value of \(X\). If two random variables have the same probability distribution, they have the same expected value.


### Properties of Expectation
  • For any real \(\alpha \in \Re\), if \({\mathbb P} \{X = \alpha\} = 1\), then \({\mathbb E}X = \alpha\): the expected value of a constant random variable is that constant.

  • For any real \(\alpha \in \Re\), \({\mathbb E}(\alpha X) = \alpha {\mathbb E}X\): scalar homogeneity.

  • If \(X\) and \(Y\) are random variables, \({\mathbb E}(X+Y) = {\mathbb E}X + {\mathbb E}Y\): additivity.


Calculating Expectation

If \(X\) is a continuous real-valued random variable with density \(f(x)\), then the expected value of \(X\) is

\[ {\mathbb E}X = \int_{-\infty}^\infty x f(x) dx, \]

provided the integral exists.

If \(X\) is a discrete real-valued random variable with probability mass function \(p\), then the expected value of \(X\) is

\[ {\mathbb E}X = \sum_{i=1}^\infty x_i p(x_i), \]

where \(\{x_i\} = \{ x \in \Re: p(x) > 0\}\), provided the sum exists.

Examples

Uniform

Suppose \(X\) has density \(f(x) = \frac{1}{b-a}\) for \(a \le x \le b\) and \(0\) otherwise. Then

\[ \mathbb{E}X = \int_{-\infty}^\infty x f(x) dx = \frac{1}{b-a} \int_a^b x dx = \frac{b^2-a^2}{2(b-a)} = \frac{a+b}{2}.\]

Poisson

Suppose \(X\) has a Poisson distribution with parameter \(\lambda\). Then

\[\mathbb{E}X = e^{-\lambda} \sum_{j=0}^\infty j \lambda^j/j! = \lambda e^{-\lambda} \sum_{j=1}^\infty \lambda^{j-1}/(j-1)! = \lambda e^{-\lambda} e^{\lambda} = \lambda.\]
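The series can be checked numerically by truncating it (the value \(\lambda = 2.5\) below is an arbitrary choice):

```python
import numpy as np
from scipy import special

lam = 2.5
j = np.arange(0, 101)                                  # truncation; the tail is negligible
pmf = np.exp(-lam) * lam**j / special.factorial(j)

print((j * pmf).sum())    # ≈ lam
```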

Examples related to Bernoulli Trials

Bernoulli

Suppose \(X\) can take only two values, 0 and 1, and the probability that \(X= 1\) is \(p\). Then

\[\mathbb{E} X = 1 \times p + 0 \times (1-p) = p.\]

Binomial

[To do.] Derive the Binomial distribution as the number of successes in \(n\) iid Bernoulli trials.

The number of successes \(X\) in \(n\) trials is equivalent to the sum of indicators for the success in each trial. That is,

\[ X = \sum_{i=1}^n X_i,\]

where \(X_i = 1\) if the \(i\)th trial results in success, and \(X_i = 0\) otherwise. By the additive property of expectation,

\[ \mathbb{E}X = \mathbb{E} \sum_{i=1}^n X_i = \sum_{i=1}^n \mathbb{E}X_i = \sum_{i=1}^n p = np.\]
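A quick numerical check of \(\mathbb{E}X = np\), using the binomial pmf from `scipy.stats` with the parameters from the die example:

```python
import numpy as np
from scipy import stats

n, p = 10, 1/6                       # parameters from the die example
k = np.arange(n + 1)
ev = (k * stats.binom.pmf(k, n, p)).sum()

print(ev, n * p)    # both ≈ 1.6667
```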

Geometric

The number of trials to the first success in iid Bernoulli(\(p\)) trials has a geometric distribution with parameter \(p\).

[To do.] Derive the geometric and calculate expectation.

Negative Binomial

The number of trials to the \(k\)th success in iid Bernoulli(\(p\)) trials has a negative binomial distribution with parameters \(p\) and \(k\).

[To do.] Derive the negative binomial.

The number of trials \(X\) until the \(k\)th success in iid Bernoulli trials can be written as the number of trials until the first success, plus the number of further trials until the second success, plus \(\ldots\), plus the number of further trials until the \(k\)th success. Each of those \(k\) “waiting times” \(X_i\) has a geometric distribution with parameter \(p\). Hence

\[ \mathbb{E}X = \mathbb{E} \sum_{i=1}^k X_i = \sum_{i=1}^k \mathbb{E}X_i = \sum_{i=1}^k 1/p = k/p.\]
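This agrees with `scipy.stats.nbinom`; note that scipy parametrizes the negative binomial as the number of *failures* before the \(k\)th success, so the number of trials is \(k\) plus that count (the parameters below are arbitrary):

```python
from scipy import stats

k, p = 3, 0.25
# scipy's nbinom counts failures before the k-th success, so the number of
# trials is k plus that count; the mean number of trials should then equal k / p
failures_mean = stats.nbinom.mean(k, p)
trials_mean = k + failures_mean

print(trials_mean, k / p)    # both 12.0
```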

Hypergeometric

[To do.] Derive hypergeometric.

A population of \(N\) numbers, of which \(G\) equal 1 and \(N-G\) equal 0. \(X\) is the number of 1s in a sample of size \(n\) drawn without replacement.

\[ \mathbb{P} \{X = x\} = \frac{ {{G} \choose {x}}{{N-G} \choose {n-x}}}{{N}\choose{n}}.\]

[To do.] Calculate expected value. Use random permutations of “tickets” to show that expected value in each position is \(G/N\).
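In the meantime, the value \(\mathbb{E}X = nG/N\) can be checked against `scipy.stats.hypergeom` (small arbitrary parameters below); note scipy's argument convention:

```python
from scipy import stats

N, G, n = 20, 8, 5           # population size, number of 1s, sample size (arbitrary)
# scipy's hypergeom(M, n, N) takes M = population size, n = 1s in population, N = draws
ev = stats.hypergeom.mean(N, G, n)

print(ev, n * G / N)    # both 2.0
```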

Variance, Standard Error, and Covariance

See SticiGui: Standard Error for an elementary introduction to variance and standard error.

The variance of a random variable \(X\) is \(\mbox{Var }X = {\mathbb E}(X - {\mathbb E}X)^2\).

Algebraically, the following identity holds:

\[ \mbox{Var } X = {\mathbb E}(X - {\mathbb E}X)^2 = {\mathbb E}X^2 - 2({\mathbb E}X)^2 + ({\mathbb E}X)^2 = {\mathbb E}X^2 - ({\mathbb E}X)^2. \]

However, this is generally not a good way to calculate \(\mbox{Var }X\) numerically, because of roundoff: it sacrifices precision unnecessarily.
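The loss of precision is easy to demonstrate: shift uniform values far from zero (the shift \(10^9\) below is an arbitrary choice) and compare the two formulas. The true variance is \(1/12\) regardless of the shift.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1e9 + rng.random(10**6)     # U[0,1] values shifted far from 0; Var X = 1/12

naive = np.mean(x**2) - np.mean(x)**2       # EX^2 - (EX)^2: catastrophic cancellation
stable = np.mean((x - np.mean(x))**2)       # E(X - EX)^2: centers the data first

print(naive, stable, 1/12)     # the naive value is wildly off; the stable one is not
```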

The standard error of a random variable \(X\) is \(\mbox{SE }X = \sqrt{\mbox{Var } X}\).

If \(\{X_i\}_{i=1}^n\) are independent, then \(\mbox{Var} \sum_{i=1}^n X_i = \sum_{i=1}^n \mbox{Var }X_i\).

If \(X\) and \(Y\) have a joint distribution, then \(\mbox{cov} (X,Y) = {\mathbb E} (X - {\mathbb E}X)(Y - {\mathbb E}Y)\). It follows from this definition (and the commutativity of multiplication) that \(\mbox{cov}(X,Y) = \mbox{cov}(Y,X)\). Also,

\[ \mbox{Var }(X+Y) = \mbox{Var }X + \mbox{Var }Y + 2\mbox{cov}(X,Y). \]

If \(X\) and \(Y\) are independent, \(\mbox{cov }(X,Y) = 0\). However, the converse is not necessarily true: \(\mbox{cov}(X,Y) = 0\) does not in general imply that \(X\) and \(Y\) are independent.
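A standard counterexample: if \(X\) is uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\), then \(Y\) is a function of \(X\) (so they are certainly dependent), yet their covariance is zero. The exact computation, using `fractions`:

```python
from fractions import Fraction

support = [-1, 0, 1]
p = Fraction(1, 3)               # X is uniform on the support; Y = X**2

EX = sum(p * x for x in support)                       # 0
EY = sum(p * x**2 for x in support)                    # 2/3
cov = sum(p * (x - EX) * (x**2 - EY) for x in support)

print(cov)    # 0, even though Y is completely determined by X
```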

Examples

Variance of a Bernoulli random variable

Variance of a Binomial random variable

Variance of a Geometric and Negative Binomial random variable

Variance of the sample sum and sample mean

Random Vectors

Suppose \(\{X_i\}_{i=1}^n\) are jointly distributed random variables, and let

\[ X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}. \]

Then \(X\) is a random vector, an \(n\) by \(1\) vector of real-valued random variables.

The expected value of \(X\) is

\[ {\mathbb E} X \equiv \begin{pmatrix} {\mathbb E} X_1 \\ \vdots \\ {\mathbb E} X_n \end{pmatrix}. \]

The covariance matrix of \(X\) is

\[ \mbox{cov } X \equiv {\mathbb E} \left ( \begin{pmatrix} X_1 - {\mathbb E} X_1 \\ \vdots \\ X_n - {\mathbb E} X_n \end{pmatrix} \begin{pmatrix} X_1 - {\mathbb E} X_1 & \cdots & X_n - {\mathbb E} X_n \end{pmatrix} \right ). \]