9. Some Probability Distributions#

This lecture is a supplement to the lecture on statistics with matrices.

It describes some popular distributions and uses Python to sample from them.

It also describes a way to sample from an arbitrary probability distribution that you make up by transforming a sample from a uniform probability distribution.

In addition to what’s in Anaconda, this lecture will need the following libraries:

!pip install prettytable


Requirement already satisfied: prettytable in /usr/local/lib/python3.11/dist-packages (3.16.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.11/dist-packages (from prettytable) (0.2.13)

As usual, we’ll start with some imports

import numpy as np
import matplotlib.pyplot as plt
import prettytable as pt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')

9.1. Some Discrete Probability Distributions#

Let’s write some Python code to compute means and variances of some univariate random variables.

We’ll use our code to

  • compute population means and variances from the probability distribution

  • generate a sample of N independently and identically distributed draws and compute sample means and variances

  • compare population and sample means and variances

9.2. Geometric distribution#

A discrete geometric distribution has probability mass function

$$
\textrm{Prob}(X=k) = (1-p)^{k-1} p, \quad k = 1, 2, \ldots, \quad p \in (0,1)
$$

where $k = 1, 2, \ldots$ is the number of trials needed to obtain the first success.

The mean and variance of this one-parameter probability distribution are

$$
E(X) = \frac{1}{p}, \qquad \mathrm{Var}(X) = \frac{1-p}{p^2}
$$

Let’s use Python to draw observations from the distribution and compare the sample mean and variance with the theoretical results.

# specify parameters
p, n = 0.3, 1_000_000

# draw observations from the distribution
x = np.random.geometric(p, n)

# compute sample mean and variance
μ_hat = np.mean(x)
σ2_hat = np.var(x)

print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)

# compare with theoretical results
print("\nThe population mean is: ", 1/p)
print("The population variance is: ", (1-p)/(p**2))
The sample mean is:  3.334742 
The sample variance is:  7.763557793435998

The population mean is:  3.3333333333333335
The population variance is:  7.777777777777778
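As a sanity check, the population moments can also be recovered directly from the pmf by truncating the infinite sum at a large cutoff; the choice $K = 200$ below is arbitrary but makes the truncation error negligible for $p = 0.3$:

```python
import numpy as np

p, K = 0.3, 200                       # K truncates the infinite sum; error is O((1-p)^K)
k = np.arange(1, K + 1)
pmf = (1 - p) ** (k - 1) * p          # Prob(X = k)

mean = k @ pmf                        # ≈ 1/p
var = ((k - mean) ** 2) @ pmf         # ≈ (1-p)/p**2

print(mean, 1 / p)
print(var, (1 - p) / p ** 2)
```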

9.3. Pascal (negative binomial) distribution#

Consider a sequence of independent Bernoulli trials.

Let p be the probability of success.

Let X be a random variable that represents the number of failures before we get r successes.

Its distribution is

$$
X \sim NB(r, p), \qquad \textrm{Prob}(X = k; r, p) = \binom{k + r - 1}{r - 1} p^r (1 - p)^k
$$

Here, we choose from among $k + r - 1$ possible outcomes because the last draw is by definition a success.

We compute the mean and variance to be

$$
E(X) = \frac{r(1-p)}{p}, \qquad V(X) = \frac{r(1-p)}{p^2}
$$

# specify parameters
r, p, n = 10, 0.3, 1_000_000

# draw observations from the distribution
x = np.random.negative_binomial(r, p, n)

# compute sample mean and variance
μ_hat = np.mean(x)
σ2_hat = np.var(x)

print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
print("\nThe population mean is: ", r*(1-p)/p)
print("The population variance is: ", r*(1-p)/p**2)
The sample mean is:  23.325444 
The sample variance is:  77.83040620286394

The population mean is:  23.333333333333336
The population variance is:  77.77777777777779
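As with the geometric case, these formulas can be checked directly from the pmf, truncating the sum at an arbitrary large cutoff $K$; the binomial coefficient comes from Python’s standard-library `math.comb`:

```python
import numpy as np
from math import comb

r, p, K = 10, 0.3, 2000
k = np.arange(K)
# Prob(X = k; r, p) = C(k+r-1, r-1) p^r (1-p)^k
pmf = np.array([comb(kk + r - 1, r - 1) * p**r * (1 - p)**kk for kk in k])

print(pmf.sum())                    # ≈ 1
print(k @ pmf, r * (1 - p) / p)     # mean ≈ r(1-p)/p
```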

9.4. Newcomb–Benford distribution#

The Newcomb–Benford law fits many data sets, e.g., reports of incomes to tax authorities, in which the leading digit is more likely to be small than large.

See https://en.wikipedia.org/wiki/Benford%27s_law

A Benford probability distribution is

$$
\textrm{Prob}\{X = d\} = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)
$$

where $d \in \{1, 2, \ldots, 9\}$ can be thought of as the first digit in a sequence of digits.

This is a well defined discrete distribution since we can verify that probabilities are nonnegative and sum to 1.

$$
\log_{10}\left(1 + \frac{1}{d}\right) \geq 0, \qquad \sum_{d=1}^{9} \log_{10}\left(1 + \frac{1}{d}\right) = 1
$$

The mean and variance of a Benford distribution are

$$
E[X] = \sum_{d=1}^{9} d \log_{10}\left(1 + \frac{1}{d}\right) \approx 3.4402, \qquad
V[X] = \sum_{d=1}^{9} \left(d - E[X]\right)^2 \log_{10}\left(1 + \frac{1}{d}\right) \approx 6.0565
$$

We verify the above and compute the mean and variance using numpy.

Benford_pmf = np.array([np.log10(1+1/d) for d in range(1,10)])
k = np.arange(1, 10)

# mean
mean = k @ Benford_pmf

# variance
var = ((k - mean) ** 2) @ Benford_pmf

# verify sum to 1
print(np.sum(Benford_pmf))
print(mean)
print(var)
0.9999999999999999
3.440236967123206
6.056512631375666
# plot distribution
plt.plot(range(1,10), Benford_pmf, 'o')
plt.title('Benford\'s distribution')
plt.show()
(Figure: Benford’s distribution, pmf plotted for d = 1, …, 9)
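Benford draws are not built into numpy, but we can sample from the pmf above with `np.random.default_rng().choice` and compare the sample moments with the population values; a minimal sketch (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

digits = np.arange(1, 10)
pmf = np.log10(1 + 1 / digits)          # Benford pmf on d = 1, ..., 9

rng = np.random.default_rng(0)          # arbitrary seed
sample = rng.choice(digits, size=1_000_000, p=pmf)

print(sample.mean())                    # ≈ 3.4402
print(sample.var())                     # ≈ 6.0565
```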

Now let’s turn to some continuous random variables.

9.5. Univariate Gaussian distribution#

We write

$$
X \sim N(\mu, \sigma^2)
$$

to indicate the probability distribution

$$
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$

In the example below, we set $\mu = 0, \sigma = 0.1$.

# specify parameters
μ, σ = 0, 0.1

# specify number of draws
n = 1_000_000

# draw observations from the distribution
x = np.random.normal(μ, σ, n)

# compute sample mean and variance
μ_hat = np.mean(x)
σ_hat = np.std(x)

print("The sample mean is: ", μ_hat)
print("The sample standard deviation is: ", σ_hat)
The sample mean is:  -2.8548065661503377e-05
The sample standard deviation is:  0.09990243468371045
# compare
print(abs(μ-μ_hat) < 1e-3)
print(abs(σ-σ_hat) < 1e-3)
True
True
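Another quick sanity check on the draws uses the familiar fact that a Gaussian random variable falls within one standard deviation of its mean with probability about 0.683:

```python
import numpy as np

μ, σ, n = 0, 0.1, 1_000_000
x = np.random.normal(μ, σ, n)

# fraction of draws within one standard deviation of the mean
frac = np.mean(np.abs(x - μ) < σ)
print(frac)    # ≈ 0.6827
```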

9.6. Uniform Distribution#

$$
X \sim U[a, b], \qquad
f(x) = \begin{cases} \frac{1}{b-a}, & a \leq x \leq b \\ 0, & \text{otherwise} \end{cases}
$$

The population mean and variance are

$$
E(X) = \frac{a+b}{2}, \qquad V(X) = \frac{(b-a)^2}{12}
$$

# specify parameters
a, b = 10, 20

# specify number of draws
n = 1_000_000

# draw observations from the distribution
x = a + (b-a)*np.random.rand(n)

# compute sample mean and variance
μ_hat = np.mean(x)
σ2_hat = np.var(x)

print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
print("\nThe population mean is: ", (a+b)/2)
print("The population variance is: ", (b-a)**2/12)
The sample mean is:  15.00250459918252 
The sample variance is:  8.342970489200585

The population mean is:  15.0
The population variance is:  8.333333333333334
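The population formulas can also be verified by approximating the integrals $\int x f(x)\, dx$ and $\int (x - \mu)^2 f(x)\, dx$ numerically; below is a minimal Riemann-sum sketch (the grid size is an arbitrary choice):

```python
import numpy as np

a, b = 10, 20
grid = np.linspace(a, b, 1_000_001)
f = np.full_like(grid, 1 / (b - a))      # uniform density on [a, b]
dx = grid[1] - grid[0]

# Riemann sums for the first two moments
mean = np.sum(grid * f) * dx
var = np.sum((grid - mean) ** 2 * f) * dx

print(mean, (a + b) / 2)
print(var, (b - a) ** 2 / 12)
```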

9.7. A Mixed Discrete-Continuous Distribution#

We’ll motivate this example with a little story.

Suppose that to apply for a job you take an interview and either pass or fail it.

You have a 5% chance of passing the interview, and if you pass, your daily salary is uniformly distributed on the interval $[300, 400]$; otherwise it is 0.

We can describe your daily salary as a discrete-continuous variable with the following probabilities:

$$
P(X = 0) = 0.95
$$
$$
P(300 \leq X \leq 400) = \int_{300}^{400} f(x)\, dx = 0.05
$$
$$
f(x) = 0.0005, \quad 300 \leq x \leq 400
$$

Let’s start by generating a random sample and computing sample moments.

x = np.random.rand(1_000_000)
x[x > 0.95] = 100*np.random.rand(len(x[x > 0.95]))+300
x[x <= 0.95] = 0

μ_hat = np.mean(x)
σ2_hat = np.var(x)

print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
The sample mean is:  17.536613589301137 
The sample variance is:  5874.314525073605

The analytical mean and variance can be computed:

$$
\mu = \int_{300}^{400} x f(x)\, dx = 0.0005 \int_{300}^{400} x\, dx = 0.0005 \times \left. \frac{1}{2} x^2 \right|_{300}^{400} = 17.5
$$
$$
\sigma^2 = 0.95 \times (0 - 17.5)^2 + \int_{300}^{400} (x - 17.5)^2 f(x)\, dx
= 0.95 \times 17.5^2 + 0.0005 \times \left. \frac{1}{3} (x - 17.5)^3 \right|_{300}^{400}
$$
mean = 0.0005*0.5*(400**2 - 300**2)
var = 0.95*17.5**2+0.0005/3*((400-17.5)**3-(300-17.5)**3)
print("mean: ", mean)
print("variance: ", var)
mean:  17.5
variance:  5860.416666666666
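An equivalent cross-check uses the identity $\mathrm{Var}(X) = E[X^2] - (E[X])^2$; the sketch below evaluates both closed-form moments (the atom at 0 contributes nothing to either):

```python
# closed-form moments of the mixed distribution
f = 0.0005                                  # density on [300, 400]
mean = f * 0.5 * (400**2 - 300**2)          # E[X]  ≈ 17.5
EX2 = f * (400**3 - 300**3) / 3             # E[X^2]
var = EX2 - mean**2                         # ≈ 5860.42

print(mean)
print(var)
```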

9.8. Drawing a Random Number from a Particular Distribution#

Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability distribution

$$
\textrm{Prob}\{\tilde{X} = i\} = \frac{1}{I}, \quad i = 0, \ldots, I-1
$$

How can we transform $\tilde{X}$ to get a random variable $X$ for which $\textrm{Prob}\{X = i\} = f_i$, $i = 0, \ldots, I-1$, where $f_i$ is an arbitrary discrete probability distribution on $i = 0, 1, \ldots, I-1$?

The key tool is the inverse of a cumulative distribution function (CDF).

Observe that the CDF of a distribution is non-decreasing, taking values between 0 and 1.

We can draw a sample of a random variable X with a known CDF as follows:

  • draw a random variable u from a uniform distribution on [0,1]

  • pass the sample value of u into the “inverse” target CDF for X

  • X has the target CDF

Thus, knowing the “inverse” CDF of a distribution is enough to simulate from this distribution.

Note

The “inverse” CDF needs to exist for this method to work.

The inverse CDF is

$$
F^{-1}(u) \equiv \inf \{x \in \mathbb{R} : F(x) \geq u\}, \quad 0 < u < 1
$$

Here we use infimum because a CDF is a non-decreasing and right-continuous function.

Thus, suppose that

  • U is a uniform random variable U[0,1]

  • We want to sample a random variable X whose CDF is F.

It turns out that if we draw uniform random numbers $U$ and then compute $X$ from

$$
X = F^{-1}(U),
$$

then $X$ is a random variable with CDF $F_X(x) = F(x) = \textrm{Prob}\{X \leq x\}$.

We’ll verify this in the special case in which F is continuous and bijective so that its inverse function exists and can be denoted by F1.

Note that

$$
F_X(x) = \textrm{Prob}\{X \leq x\}
= \textrm{Prob}\{F^{-1}(U) \leq x\}
= \textrm{Prob}\{U \leq F(x)\}
= F(x)
$$

where the last equality occurs because U is distributed uniformly on [0,1] while F(x) is a constant given x that also lies on [0,1].
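For the discrete question posed at the start of this section, the same idea can be sketched with `np.cumsum` and `np.searchsorted`; the pmf `f` below is an arbitrary example, and the seed is likewise arbitrary:

```python
import numpy as np

f = np.array([0.1, 0.2, 0.4, 0.3])          # an arbitrary pmf on i = 0, 1, 2, 3
cdf = np.cumsum(f)

rng = np.random.default_rng(0)              # arbitrary seed
u = rng.random(1_000_000)

# x is the smallest i with cdf[i] > u, i.e. the inverse-CDF transform
x = np.searchsorted(cdf, u, side='right')

freq = np.bincount(x, minlength=len(f)) / len(x)
print(freq)                                 # empirical frequencies ≈ f
```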

Let’s use numpy to compute some examples.

Example: A continuous geometric (exponential) distribution

Let $X$ follow a continuous geometric (i.e., exponential) distribution with parameter $\lambda > 0$.

Its density function is

f(x)=λeλx

Its CDF is

$$
F(x) = \int_0^x \lambda e^{-\lambda t}\, dt = 1 - e^{-\lambda x}
$$

Let U follow a uniform distribution on [0,1].

X is a random variable such that U=F(X).

The distribution of $X$ can be deduced from

$$
\begin{aligned}
U &= F(X) = 1 - e^{-\lambda X} \\
1 - U &= e^{-\lambda X} \\
\log(1 - U) &= -\lambda X \\
X &= -\frac{\log(1 - U)}{\lambda}
\end{aligned}
$$

Let’s draw $u$ from $U[0,1]$ and calculate $x = -\frac{\log(1-u)}{\lambda}$.

We’ll check whether X seems to follow a continuous geometric (exponential) distribution.

Let’s check with numpy.

n, λ = 1_000_000, 0.3

# draw uniform numbers
u = np.random.rand(n)

# transform
x = -np.log(1-u)/λ

# draw geometric distributions
x_g = np.random.exponential(1 / λ, n)

# plot and compare
plt.hist(x, bins=100, density=True)
plt.show()
(Figure: histogram of the inverse-CDF draws x)
plt.hist(x_g, bins=100, density=True, alpha=0.6)
plt.show()
(Figure: histogram of numpy’s exponential draws x_g)
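Beyond eyeballing the two histograms, a quick numerical sketch compares both sample means with the exponential population mean $1/\lambda$:

```python
import numpy as np

n, λ = 1_000_000, 0.3
u = np.random.rand(n)
x = -np.log(1 - u) / λ                   # inverse-CDF transform
x_e = np.random.exponential(1 / λ, n)    # numpy's own exponential draws

print(x.mean(), x_e.mean(), 1 / λ)       # all ≈ 3.333
```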

Example: A discrete geometric distribution

Let $X$ be distributed geometrically, that is,

$$
\textrm{Prob}(X = i) = (1 - \lambda) \lambda^i, \quad \lambda \in (0, 1), \quad i = 0, 1, \ldots
$$
$$
\sum_{i=0}^{\infty} \textrm{Prob}(X = i) = (1 - \lambda) \sum_{i=0}^{\infty} \lambda^i = \frac{1 - \lambda}{1 - \lambda} = 1
$$

Its CDF is given by

$$
\textrm{Prob}(X \leq i) = (1 - \lambda) \sum_{j=0}^{i} \lambda^j
= (1 - \lambda) \left[ \frac{1 - \lambda^{i+1}}{1 - \lambda} \right]
= 1 - \lambda^{i+1} = F(i) \equiv F_i
$$

Again, let $\tilde{U}$ follow a uniform distribution on $[0, 1]$, and suppose we want to find $X$ such that $F(X) = \tilde{U}$.

Let’s deduce the distribution of X from

$$
\begin{aligned}
\tilde{U} &= F(X) = 1 - \lambda^{x+1} \\
1 - \tilde{U} &= \lambda^{x+1} \\
\log(1 - \tilde{U}) &= (x + 1) \log \lambda \\
\frac{\log(1 - \tilde{U})}{\log \lambda} &= x + 1 \\
\frac{\log(1 - \tilde{U})}{\log \lambda} - 1 &= x
\end{aligned}
$$

However, $x = F^{-1}(\tilde{U}) = \frac{\log(1 - \tilde{U})}{\log \lambda} - 1$ will generally not be an integer.

So let

$$
x = \left\lceil \frac{\log(1 - \tilde{U})}{\log \lambda} - 1 \right\rceil
$$

where $\lceil \cdot \rceil$ is the ceiling function.

Thus x is the smallest integer such that the discrete geometric CDF is greater than or equal to U~.

We can verify that x is indeed geometrically distributed by the following numpy program.

Note

The exponential distribution is the continuous analog of geometric distribution.

n, λ = 1_000_000, 0.8

# draw uniform numbers
u = np.random.rand(n)

# transform
x = np.ceil(np.log(1-u)/np.log(λ) - 1)

# draw geometric distributions
x_g = np.random.geometric(1-λ, n) - 1  # shift support from {1, 2, ...} to {0, 1, ...} to match x

# plot and compare
plt.hist(x, bins=150, density=True)
plt.show()
(Figure: histogram of the inverse-CDF geometric draws x)
plt.hist(x_g, bins=150, density=True, alpha=0.6)
plt.show()
(Figure: histogram of numpy’s geometric draws x_g)
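A quick numerical check compares both sample means with the population mean $\frac{\lambda}{1-\lambda} = 4$; note that `np.random.geometric` counts the trial on which the first success occurs (support $\{1, 2, \ldots\}$), so we subtract 1 to match the support of $x$:

```python
import numpy as np

n, λ = 1_000_000, 0.8
u = np.random.rand(n)
x = np.ceil(np.log(1 - u) / np.log(λ) - 1)   # inverse-CDF draws on {0, 1, ...}
x_g = np.random.geometric(1 - λ, n) - 1      # shift numpy draws to the same support

print(x.mean(), x_g.mean(), λ / (1 - λ))     # all ≈ 4
```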