[1] -0.01700378
Lecture 3 — Joint Distributions and Statistical Inference
5 Mar 2026
In real systems, random variables rarely act alone.
Examples:
Incorrectly assuming independence can lead to:
Simulation provides flexibility to incorporate complex dependence structures — but only if we understand the joint distribution.
Figure 1: Joint probability distribution samples (black) and marginal densities (blue and red)
A joint distribution is a multivariate distribution. It is the probability structure that describes how several random variables behave together.
For two variables \(X\) and \(Y\), the joint distribution \(f_{X,Y}(x,y)\) is a bivariate distribution.
For \(k\) variables, \(X_1, \dots, X_k\), the joint distribution \(f_{X_1,\dots ,X_k}(x_1,\dots ,x_k)\) is a multivariate distribution.
| Joint PMF | \(p(x,y) = P(X=x, Y=y), \qquad p(x,y) \ge 0, \qquad \sum_x \sum_y p(x,y) = 1.\) | ||
| Joint CDF | \(F_{XY}(x,y) = P(X\leq x, Y\leq y) = \sum_{x_i \leq x}\sum_{y_j \leq y} p(x_i, y_j).\) | ||
| Marginals | \(p_X(x) = \sum_y p(x,y), \qquad p_Y(y) = \sum_x p(x,y).\) |
| Joint PDF | \(f(x,y), \qquad f(x,y) \ge 0, \qquad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1.\) | ||
| Joint CDF | \(F(x,y) = P(X \le x, Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u,v)\, dv\, du.\) | ||
| Marginals | \(f_X(x) = \int f(x,y)\,dy, \qquad f_Y(y) = \int f(x,y)\,dx.\) |
| Joint distribution | \(f_{X,Y}(x,y) = P(X=x)\, f_{Y\mid X}(y \mid x), \qquad x=1,\dots,k.\) | ||
| Joint CDF | \(F_{X,Y}(x,y) = P(X \le x,\, Y \le y) = \sum_{x_i \le x} P(X=x_i)\, F_{Y\mid X}(y \mid x_i).\) | ||
| Marginals | \(P(X=x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy \qquad f_Y(y) = \sum_{x=1}^k f_{X,Y}(x,y).\) |
The case in which the variables do not influence each other at all.
\(X\) and \(Y\) are independent \(\forall x,y\) if
\[ \begin{aligned} p(x,y) &= p_X(x) \, p_Y(y) \qquad \text{(discrete)}\\ f(x,y) &= f_X(x) \, f_Y(y) \qquad \text{(continuous)} \end{aligned} \]
Discrete Case Example
Let \(X\) and \(Y\) be two discrete independent random variables
Continuous Case Example
Let X and Y be continuous independent random variables \(X\sim \mathcal{N}(0,1)\) and \(Y\sim \mathcal{N}(5,4)\)
Because of independence:
\[f_{X,Y}(x,y)=f_X(x)\, f_Y(y)=\frac{1}{\sqrt{2\pi }}e^{-x^2/2}\times \frac{1}{\sqrt{8\pi }}e^{-(y-5)^2/8}\]
But most real systems are not independent. In practice, we often observe one variable first and then ask:
How does this information affect the behaviour of the other variable?
Dependence is formalised through conditional distributions.
\[ \begin{aligned} P(X=x \mid Y=y) &= \frac{p(x,y)}{p_Y(y)}, \quad p_Y(y)>0 \qquad \text{(discrete)} \\ f_{X|Y}(x \mid y) &= \frac{f(x,y)}{f_Y(y)}, \quad f_Y(y)>0\qquad \text{(continuous)} \end{aligned} \]
Recall that independence was defined by
\[ f(x,y) = f_X(x) f_Y(y). \]
Rewriting this gives
\[ f_{Y|X}(y \mid x) = \frac{f(x,y)}{f_X(x)}. \]
If independence holds, then
\[ f_{Y|X}(y \mid x)=f_Y(y). \]
Independence is exactly the situation in which conditioning has no effect. Conditional probability becomes marginal.
Discrete Case Example
Suppose a small coffee shop with two types of customers:
Let \(X\) be customer type, where \(X=0\) and \(X=1\) denote customer Type A and Type B, respectively.
Let \(Y\) be number of ordered completed in the next hour.
We model:
\[ \begin{aligned} Y\mid X=0\sim \mathrm{Binomial}(n=10,p=0.7), \\ Y\mid X=1\sim \mathrm{Binomial}(n=10,p=0.4) \end{aligned} \]
This creates dependence between \(X\) and \(Y\).
Even if we know the marginal distribution of \(X\) and the marginal distribution of \(Y\), we cannot simulate the system correctly without the conditional structure.
[1] -0.7032966
y <- 0:10
pmf_A <- dbinom(y, size = 10, prob = 0.7) # X = 0 (Type A)
pmf_B <- dbinom(y, size = 10, prob = 0.4) # X = 1 (Type B)
# Set up side-by-side plotting area
par(mfrow = c(2, 1), mar = c(4, 4, 0, 1))
# PMF for X = 0
barplot(
pmf_A,
names.arg = y,
col = "steelblue",
main = "",
xlab = "Y",
ylab = "P(Y | X = 0)",
ylim = c(0, max(pmf_A, pmf_B))
)
# PMF for X = 1
barplot(
pmf_B,
names.arg = y,
col = "firebrick",
main = "",
xlab = "Y",
ylab = "P(Y | X = 1)",
ylim = c(0, max(pmf_A, pmf_B))
)Continuous Case Example
Suppose interarrival time \(X\) of a coffee shop influences service time \(Y\), such that
\[ Y = 2 + 0.5X + \varepsilon, \]
\[ X \sim \text{Exp}(1), \quad \varepsilon \sim N(0, 0.2^2), \]
and \(X\) and \(\varepsilon\) are independent.
To quantify linear dependence, we define the covariance:
\[ \mathrm{Cov}(X,Y) = \mathbb{E}\left[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\right]. \]
An equivalent expression is
\[ \mathrm{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]. \]
If \(X\) and \(Y\) are independent, then
\[\mathrm{Cov}(X,Y) = 0.\]
The correlation coefficient is
\[\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y},\]
which lies in \([-1,1]\).
Correlation measures linear association but does not fully describe dependence.
Two variables may be uncorrelated yet dependent.
Example: Let \(X \sim \text{Uniform}(-1,1)\) and \(Y = X^2\).
Here, \(Y\) is completely determined by \(X\). So, knowing \(X\) tells you exatly what \(Y\) is (strong dependency).
\[ \text{Cov}(X,X^2)=E[X^3]−E[X]E[X^2]=0 \]
✅ Dependent
✅ Uncorrelated
Covariance is useful when you care about the direction of the relationship.
Covariance answers:
“Do X and Y increase together or move in opposite directions?”
But the magnitude is not interpretable, because it depends on the units of X and Y.
Correlation is the standardised version of covariance. It removes units and rescales the relationship to the familiar range [-1,1].
Correlation answers:
“How strong is the linear relationship between X and Y, on a universal scale?”
Because it’s standardised, you can compare height vs weight, income vs education, temperature vs electricity use, etc, even though all are in different units.
set.seed(123)
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
### 1. Negative covariance
x1 <- rnorm(500)
y1 <- -0.8 * x1 + rnorm(500, sd = 0.5)
plot(x1, y1,
pch = 16, cex = 0.5, col = "steelblue",
main = paste("Cov < 0\n", round(cov(x1, y1), 2)),
xlab = "X", ylab = "Y")
### 2. Approximately zero covariance
x2 <- rnorm(500)
y2 <- rnorm(500)
plot(x2, y2,
pch = 16, cex = 0.5, col = "darkgray",
main = paste("Cov ≈ 0\n", round(cov(x2, y2), 2)),
xlab = "X", ylab = "Y")
### 3. Positive covariance
x3 <- rnorm(500)
y3 <- 0.8 * x3 + rnorm(500, sd = 0.5)
plot(x3, y3,
pch = 16, cex = 0.5, col = "firebrick",
main = paste("Cov > 0\n", round(cov(x3, y3), 2)),
xlab = "X", ylab = "Y")set.seed(123)
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
### 1. Negative correlation
x1 <- rnorm(500)
y1 <- -0.8 * x1 + rnorm(500, sd = 0.5)
plot(x1, y1,
pch = 16, cex = 0.5, col = "steelblue",
main = paste("Cor < 0\n", round(cor(x1, y1), 2)),
xlab = "X", ylab = "Y")
### 2. Approximately zero correlation
x2 <- rnorm(500)
y2 <- rnorm(500)
plot(x2, y2,
pch = 16, cex = 0.5, col = "darkgray",
main = paste("Cor ≈ 0\n", round(cor(x2, y2), 2)),
xlab = "X", ylab = "Y")
### 3. Positive correlation
x3 <- rnorm(500)
y3 <- 0.8 * x3 + rnorm(500, sd = 0.5)
plot(x3, y3,
pch = 16, cex = 0.5, col = "firebrick",
main = paste("Cor > 0\n", round(cor(x3, y3), 2)),
xlab = "X", ylab = "Y")\(\rho_{XY} < 0\) indicates a clear downward linear trend: as \(X\) increases, \(Y\) tends to decrease.
\(\rho_{XY} \approx 0\) produces a round, structureless cloud of points with no visible upward or downward tilt — the classic “no linear association” look.
\(\rho_{XY} > 0\) indicates a clear upward linear trend: as \(X\) increases, \(Y\) tends to increase.
A bivariate normal distribution is a two‑dimensional version of the normal distribution. It describes the joint behaviour of two continuous random variables, usually written as:
\[(X,Y)\sim \mathrm{BVN}(\mu _X,\mu _Y,\sigma _X^2,\sigma _Y^2,\rho )\]
or
\[ \begin{pmatrix} X \\ Y \end{pmatrix} \sim N \left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \right) \]
It is fully determined by five parameters:
Here, the parameter \(\rho\) controls linear dependence.
Each marginal is normal \(\quad X\sim N(\mu _X,\sigma _X^2),\qquad Y\sim N(\mu _Y,\sigma _Y^2)\)
The joint density forms a 3D bell surface The height of the surface at point \((x,y)\) is the joint density \(f_{X,Y}(x,y).\) The shape of this surface depends heavily on the correlation \(\rho.\)
Contours are ellipses. If you slice the 3D surface horizontally, you get ellipses. The orientation of the ellipse tells you the sign of the correlation.
Independence happens only when \(\rho =0\). This is not true for most distributions. It’s a unique property of the multivariate normal family.
library(mvtnorm)
# Grid for evaluation
x <- seq(-3, 3, length = 50)
y <- seq(-3, 3, length = 50)
grid <- expand.grid(x = x, y = y)
# Function to compute BVN density matrix
bvn_matrix <- function(rho) {
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
z <- dmvnorm(grid, mean = c(0, 0), sigma = Sigma)
matrix(z, nrow = length(x), ncol = length(y))
}
z_neg <- bvn_matrix(-0.8)
z_zero <- bvn_matrix(0)
z_pos <- bvn_matrix(0.8)
par(mfrow = c(1, 3), mar = c(2, 2, 3, 1))
# 1. Negative correlation
persp(x, y, z_neg,
theta = 30, phi = 25,
col = "lightblue",
main = "ρ = -0.8 (tilts downward)",
xlab = "X", ylab = "Y", zlab = "f(x,y)")
# 2. Zero correlation
persp(x, y, z_zero,
theta = 30, phi = 25,
col = "lightgray",
main = "ρ = 0 (symmetric hill)",
xlab = "X", ylab = "Y", zlab = "f(x,y)")
# 3. Positive correlation
persp(x, y, z_pos,
theta = 30, phi = 25,
col = "salmon",
main = "ρ = 0.8 (tilts upward)",
xlab = "X", ylab = "Y", zlab = "f(x,y)")
We can simulate correlated normals using a linear transformation:
\[ X = Z_1, \qquad Y = \rho Z_1 + \sqrt{1-\rho^2} Z_2 \]
[1] 0.7963105
# Define parameters
rho <- 0.7
# Create grid
x <- seq(-3, 3, length = 100)
y <- seq(-3, 3, length = 100)
grid <- expand.grid(x = x, y = y)
# Bivariate normal density (mean=0, var=1)
f <- function(x, y, rho) {
1/(2*pi*sqrt(1 - rho^2)) *
exp(-(x^2 - 2*rho*x*y + y^2) / (2*(1 - rho^2)))
}
# Compute density values
z <- matrix(f(grid$x, grid$y, rho), nrow = 100)
par(mar = c(0, 0, 0, 0))
# 3D perspective plot
persp(x, y, z,
theta = 30, phi = 30,
expand = 0.5,
col = "lightblue",
xlab = "X",
ylab = "Y",
zlab = "Density",
ticktype = "detailed")Up to this point, we have studied probability distributions and how to simulate random variables from them. We have generated data from known models and explored their behaviour through repeated sampling.
Probability asks: If the model is known, what kind of data will we see?
Inference asks: Given the data we observed, what can we say about the unknown model?
There are two major approaches to statistical inference:
| Frequentist Inference | Bayesian Inference |
|---|---|
| Treats probability as long-run frequency | Treats probability as a measure of uncertainty or belief |
| • Parameters are fixed but unknown. • Data are random. • Uncertainty is described using sampling distributions. • Inference is based on long-run repeated sampling behaviour. |
• Parameters are treated as random variables. • Prior beliefs are updated using observed data. • Uncertainty is described using the posterior distribution. |
| Example: • Maximum Likelihood Estimation (MLE) • Confidence Intervals • Hypothesis testing |
Example: • Bayes’ theorem provides the updating mechanism: \(\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\) • Credible interval for interval estimation |
Drawing conclusions from data based on the idea that probability describes long-run frequency.
Terminology
| Concept | What it Refers To | Meaning | Example |
|---|---|---|---|
| Parameter | A true numerical characteristic of a population | Fixed but usually unknown value | Population mean \(\mu\), population variance \(\sigma^2\) |
| Statistic | A numerical summary calculated from a sample | Random variable as it depends on sampled data | Sample mean \(\bar{x}\), sample variance \(s^2\) |
| Estimator | A rule, formula, or method used to estimate a parameter | Function of sample data used to compute a statistic | \(\hat{\mu} = \bar{X} = \frac{1}{n}\sum X_i\) |
| Estimate | The numeric value obtained from applying an estimator to a sample | Realised value of the estimator | If \(\bar{x} = 5.2\), then 5.2 is the estimate of \(\mu\) |
| Bias of an Estimator | Measures how far the expected value of the estimator is from the true parameter | \(\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta\) Unbiased: \(E[\hat{\theta}] = \theta\) |
Sample mean \(\bar{X}\) is unbiased for \(\mu\) |
| Variance of an Estimator | Measures how much the estimator varies across different samples | \(\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]\) | \(\text{Var}(\bar{X}) = \sigma^2/n\) |
We now consider a systematic way to estimate unknown parameters using observed data. One of the most widely used approaches is maximum likelihood estimation. But first, we have to know the likelihood function.
Suppose we observe data \(x_1, x_2, \dots, x_n\) from a probability distribution that depends on an unknown parameter \(\theta\).
The likelihood function describes how plausible different values of \(\theta\) are, given the observed data.
For independent continuous observations with probability density function, the likelihood function is
\[ \begin{aligned} L(x_1, x_2, \dots, x_n \mid \theta) &= \prod_{i=1}^{n} f(x_i \mid \theta) \\ &= f(x_1\mid\theta)\,f(x_2\mid\theta)\dots f(x_n\mid\theta). \end{aligned} \]
Likewise, for independent discrete observations with probability mass function, the likelihood function is
\[ \begin{aligned} L(x_1, x_2, \dots, x_n \mid \theta) &= \prod_{i=1}^{n} P(x_i \mid \theta) \\ &= P(x_1\mid\theta)\,P(x_2\mid\theta)\dots P(x_n\mid\theta). \end{aligned} \]
The likelihood function measures how well different parameter values explain the observed data.
In other words, the likelihood answers the question:
If the parameter were \(\theta\), how likely is it that we would observe the data we actually obtained?
As a method to estimate unknown parameters using observed data, maximum likelihood estimation is based on the idea of choosing parameter values that make the observed data most plausible under the assumed statistical model.
The maximum likelihood estimator (MLE) is the value of \(\theta\) that maximises the likelihood function. Formally,
\[ \hat{\theta} = \arg\max_{\theta} L(\theta). \]
In practice, it is often easier to work with the log-likelihood function
\[ \ell(\theta) = \log L(\theta), \]
because logarithms convert products into sums, which simplifies calculations. Maximising the likelihood or the log-likelihood produces the same estimator.
Maximum likelihood estimators have several useful properties.
Under suitable conditions, the MLE tends to:
These properties make maximum likelihood estimation a standard method for parameter estimation in statistics.
Example: Consider a Bernoulli experiment where \(X_i \sim \text{Bernoulli}(p),\) and \(p\) is the probability of success. Suppose we observe \(x_1, x_2, \dots, x_n.\) The likelihood function is
\[ L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}. \]
This can be written as
\[ L(p) = p^{\sum x_i} (1-p)^{n - \sum x_i}. \]
The log-likelihood is
\[ \ell(p) = \sum x_i \log p + (n-\sum x_i) \log (1-p). \]
Using the first derivative test:
\[ \frac{\partial \ell}{\partial p} = \frac{\sum x_i}{p} - \frac{n-\sum x_i}{1-p}=0 \]
Rearrange, and then the estimator is
\[ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i. \]
You can also use the second derivative test to confirm that this is the maximum solution.
Thus, the MLE for the Bernoulli parameter \(p\) is simply the sample proportion.
[1] 18
[1] 0.45
Interpretation:
loglik <- function(p, s, n) {
s * log(p) + (n - s) * log(1 - p)
}
# Grid of p values
p_grid <- seq(0.001, 0.999, length.out = 500)
# Compute log-likelihood
ll_values <- loglik(p_grid, s, n)
# Plot
plot(p_grid, ll_values,
type = "l",
xlab = "p",
ylab = "Log-Likelihood",
main = "Log-Likelihood for Bernoulli(p)")
# Add MLE
abline(v = p_hat, lty = 2, col="red")
# Add true parameter (optional)
abline(v = p_true, lty = 3, col="blue")
# Legend
legend("topright",
legend = c("MLE (p_hat)", "True p"),
col = c("red", "blue"),
lty = c(2,3))So far, we have focused on a single numerical estimate of an unknown parameter. Because samples vary, estimates do too, so we often report an interval instead of a single value.
Suppose we observe independent observations \(X_1, X_2, \dots, X_n\) from a population with mean \(\mu\) and variance \(\sigma^2\).When the sample size is sufficiently large, the CLT implies that:
\[ \bar{X} \approx N\left(\mu,\frac{\sigma^2}{n}\right). \]
This result allows us to construct a 95% confidence interval for \(\mu\):
\[\bar{X} \pm z_{0.975}\frac{\sigma}{\sqrt{n}},\]
where \(z_{0.975}\) is the 97.5th percentile of the standard normal distribution.
In practice, the population variance \(\sigma^2\) is usually unknown and is replaced by the sample standard deviation \(s\), leading to the interval
\[\bar{X} \pm z_{0.975}\frac{s}{\sqrt{n}}.\]
Interpretation: If the interval procedure is correct, approximately 95% of the confidence intervals should contain the true parameter.
Suppose \(\hat{\theta}\) is the maximum likelihood estimator of a parameter \(\theta\). Under fairly general conditions, when the sample size is large, the MLE is approximately normally distributed:
\[ \hat{\theta} \approx N\left(\theta, \; \text{Var}(\hat{\theta})\right). \]
Then, a 95% confidence interval can be written as \(\displaystyle\hat{\theta} \pm z_{0.975}\sqrt{\text{Var}(\hat{\theta})}.\)
In practice, the variance of the estimator is usually unknown and must be estimated from the data.
Example (cont.) For a Bernoulli model \(X_i \sim \text{Bernoulli}(p),\) the approximate variance of this estimator is
\[ \text{Var}(\hat{p}) = \frac{p(1-p)}{n}. \]
Replacing \(p\) with the estimate \(\hat{p}\) gives the standard error:
\[ \text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]
Thus, a 95% confidence interval for \(p\) is
\[ \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]
Drawing conclusions from data by treating probability as a measure of uncertainty or belief.
If \(A\) and \(B\) are events and \(P(B) \neq 0,\) Bayes’ theorem is stated as:
\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \]
If events \(A_i\) are mutually exclusive and exhaustive, i.e., one of them is certain to occur but no two can occur together, then the Law of Total Probability states that \[ P(B)=\sum_{i=1}^k P(B \mid A_i)\,P(A_i). \]
Substituting this expression for \(P(B)\) into the denominator of Bayes’ theorem gives:
\[ P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{\sum_j P(A_i)P(B \mid A_i)}. \]
Example: Suppose a mail server knows the following from past data:
flowchart LR
A[Email] --> B{Spam?}
B -->|Yes 10%| C[Spam]
B -->|No 90%| D[Not Spam]
C -->|WIN 40%| E[Spam + WIN]
C -->|No WIN 60%| F[Spam + No WIN]
D -->|WIN 5%| G[Not Spam + WIN]
D -->|No WIN 95%| H[Not Spam + No WIN]
linkStyle default stroke:#000,stroke-width:2px
Let \(W\) be the event that the email contains the word “WIN” and \(S\) be the event that the email is spam.
First, we identify the probabilities: \(P(S) = 0.10, P(\neg S) = 0.90, P(W \mid S) = 0.40, P(W \mid \neg S) = 0.05\).
Suppose that you receive a new email and notice it contains “WIN”. Bayes’ rule lets you compute the probability that it is actually spam.
The probability that it is actually spam is
\[ P(S \mid W) = \frac{P(W \mid S) \, P(S)} {P(W \mid S)P(S) + P(W \mid \neg S)P(\neg S)}. \]
Compute the numerator:
\[ P(W \mid S)P(S) = 0.40 \times 0.10 = 0.04 \]
Compute the denominator:
\[ P(W \mid S)P(S) + P(W \mid \neg S)P(\neg S) = 0.40(0.10) + 0.05(0.90) = 0.085 \]
Compute the posterior probability:
\[ P(S \mid W) = \frac{0.04}{0.085} \approx 0.4706 \]
Therefore, if an email contains the word “WIN”, the probability that it is spam is about 47.1%.
Exercise: Calculate \(P(\neg S | W)\)
From Bayes’ Rule:
\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \]
In Bayesian inference, \(P(B)\) is fixed and Bayes’ theorem shows that the posterior probabilities are proportional to the numerator, thus
\[ P(A \mid B) \propto P(B \mid A) \cdot P(A). \]
In words,
\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior}. \]
The posterior distribution represents our updated knowledge about the parameter
Coffee shop example from lecture 2: Suppose that during the morning rush, a barista records 20 customers. Among these, 9 customers chose oat milk and the rest chose dairy.
Let \(X\) be the number of oat-milk orders out of 20 and \(p\) be the true (but unknown) probability a customer chooses oat milk. This can be modelled as:
\[ p \sim \text{Beta}(\alpha=3, \beta=7) \qquad \text{(Prior belief)} \]
\[ X \,|\, p \sim \text{Binomial}(n=20, p) \qquad \text{(Likelihood)} \]
\[ \text{Posterior} \propto p^{X+\alpha-1}(1-p)^{(20-X)+\beta-1} \]
This expression is exactly the kernel of a Beta distribution, and we observe \(x=9\). Then, the posterior must be
\[ p \,|\, X \sim \text{Beta}(\alpha+X, \beta+(20-X)) \qquad \rightarrow \qquad p \,|\, X \sim \text{Beta}(12, 18). \]
[1] 0.3056364
[1] 0.397195
The posterior mean is 0.4, so our belief shifts from 30% to 40%.
Some combinations of prior distributions and likelihood functions lead to particularly convenient forms for the posterior distribution. These are called conjugate priors.
The example from the previous slide is one of the example of conjugate prior, Beta-Binomial.
In more complicated models, conjugacy may not hold, and simulation-based methods such as Markov Chain Monte Carlo (MCMC) are required to approximate the posterior distribution.
The followings are common conjugate prior pairs:
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Exponential | Gamma | Gamma |
| Normal | Normal | Normal |
Example: Calculate Beta–Binomial Posterior Mean
Recall the model:
\[ X \mid p \sim \text{Binomial}(n=20, p), \qquad p \sim \text{Beta}(\alpha=3,\beta=7), \]
and we observed \(x=9\) oat-milk orders out of \(n=20\).
By Beta–Binomial conjugacy, and substituting \(\alpha=3, \beta=7, n=20, x=9\) gives
\[ p \mid X=9 \sim \text{Beta}(12,18). \]
If \(p \sim \text{Beta}(a,b)\), then
\[ \mathbb{E}[p] = \frac{a}{a+b}. \]
Therefore, for the posterior \(\text{Beta}(12,18)\),
\[ \mathbb{E}[p \mid X=9] = \frac{12}{12+18} = \frac{12}{30} = 0.4. \]
In Bayesian inference, uncertainty about a parameter is summarised using the posterior distribution. From this distribution, we can construct credible intervals.
A credible interval is an interval that contains a specified proportion of the posterior probability.
For example, a 95% credible interval for parameter \(\theta\) satisfies
\[ P(a \leq \theta \leq b \mid y) = 0.95 \]
Unlike frequentist confidence intervals, credible intervals have a direct probabilistic interpretation:
Given the observed data and the prior distribution, there is a 95% probability that the parameter lies within the interval.
Example: Beta–Binomial Credible Interval
A central 95% Bayesian credible interval is typically defined using posterior quantiles:
\[ \left[q_{0.025},\; q_{0.975}\right], \]
where \(q_\gamma\) is the \(\gamma\)-quantile of the posterior distribution. That is,
\[ P(p \le q_\gamma \mid X=9) = \gamma. \]
So the 95% credible interval is
\[ \left[ F^{-1}(0.025;\, a=12,b=18),\; F^{-1}(0.975;\, a=12,b=18) \right], \]
where \(F^{-1}(\cdot; a,b)\) is the inverse CDF (quantile function) of a Beta(a,b) distribution.
[1] 0.2352402 0.5773954
[1] 0.4002564
2.5% 97.5%
0.2352525 0.5797690
The posterior mean 0.4 is our updated estimate of the oat-milk probability after observing 9 out of 20 orders. The 95% credible interval gives a range \([L,U]\) such that
\[ P(L \leq p \leq U \mid X= 9) = 0.95 \]
Unlike a confidence interval, this statement is a direct probability statement about the parameter \(p\) (given the prior and the observed data).
