11 Frequentist Inference

Frequentist inference is a way of drawing conclusions from data based on the idea that probability describes long-run frequency. In this framework, the parameter we want to estimate is treated as fixed but unknown, while the data are considered random because different samples would give different results. Inference focuses on how an estimator behaves if we were to repeat the study many times. Methods such as confidence intervals and hypothesis tests are built on this idea of repeated sampling. The goal is to ensure that the procedure performs well in the long run.

11.1 Frequentist Foundation

In statistical inference, our goal is to learn about unknown characteristics of a population using observed data. These characteristics are called parameters. Examples include the population mean \(\mu\), variance \(\sigma^2\), or the probability of success \(p\) in a Bernoulli experiment.

To understand frequentist inference, there are a few key ideas you need to know. We need to understand the difference between parameters (the true but unknown values for a population) and statistics (numbers we calculate from a sample). We also need to know the difference between an estimator (a method or formula) and an estimate (the actual number we get from data). Another important idea is the sampling distribution, which describes how a statistic would vary if we repeated the experiment many times. This can be explored using real experiments, observed data, or simulations. Finally, we study the bias and variance of an estimator to see how accurate and how variable it is in the long run.

Parameters and Statistics

A parameter is a numerical quantity that describes a population or probability distribution. It is fixed but usually unknown.

Examples:

The true mean height of all students at a university.
The probability that a machine component fails within one year.
The average waiting time in a queue.

A statistic is a quantity computed from sample data. Because it depends on random data, it is itself a random variable.

Examples:

Sample mean \(\bar{X}\)
Sample variance \(S^2\)
Sample proportion \(\hat{p}\)

Statistics are used to estimate the unknown parameters.

Estimators and Estimates

An estimator is a rule, formula, or method used to estimate a parameter based on sample data.

Example: the sample mean

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \]

is an estimator of the population mean \(\mu\).

Once the sample data are observed, the estimator produces a numerical value called an estimate.

An estimate is the numerical value you get after applying the estimator to a particular dataset. Once the data are observed, the randomness is gone and you obtain a single number.

Example: if you compute the sample mean from the data and obtain

\[ \bar{x} = 5.24, \]

then 5.24 is the estimate of \(\mu\).

Sampling Distribution vis Simulation

In practice, sampling distributions can sometimes be derived mathematically. However, simulation provides a simple and powerful way to study them. The basic idea is:

Assume a data-generating model.
Simulate many samples from that model.
Compute the statistic for each sample.
Examine the distribution of those values.

For example, to study the sampling distribution of the sample mean:

Generate a sample of size n from a distribution.
Compute \(\bar{X}\).
Repeat the process many times.
Plot a histogram of the resulting sample means.

As the number of repetitions increases, the histogram approximates the sampling distribution of the estimator.

This approach illustrates how simulation can be used to understand statistical inference without relying solely on mathematical derivations.

Bias and Variance of an Estimator

An estimator is considered unbiased if its expected value equals the true parameter.

Formally, an estimator \(\hat{\theta}\) of parameter \(\theta\) is unbiased if

\[ E(\hat{\theta}) = \theta. \]

If the expected value differs from the true parameter, the estimator is said to be biased.

Bias of an estimator measures whether an estimator systematically overestimates or underestimates the parameter.

For example, the sample mean \(\bar{X}\) is an unbiased estimator of the population mean \(\mu\).

Even if an estimator is unbiased, it may still vary substantially from sample to sample.

The variance of an estimator measures how much the estimator fluctuates across different samples.

An estimator with smaller variance is generally preferred because it produces more stable estimates.

In practice, both bias and variance are important when evaluating the quality of an estimator.

11.2 Likelihood and Maximum Likelihood Estimation

In the previous section, we introduced estimators and sampling distributions. We now consider a systematic way to estimate unknown parameters using observed data. One of the most widely used approaches is maximum likelihood estimation.

Maximum likelihood estimation is based on the idea of choosing parameter values that make the observed data most plausible under the assumed statistical model.

Likelihood function

Suppose we observe data \(x_1, x_2, \dots, x_n\) from a probability distribution that depends on an unknown parameter \(\theta\).

The likelihood function describes how plausible different values of \(\theta\) are, given the observed data.

For independent continuous observations with probability density function, the likelihood function is

\[ \begin{aligned} L(x_1, x_2, \dots, x_n \mid \theta) &= \prod_{i=1}^{n} f(x_i \mid \theta) \\ &= f(x_1\mid\theta)\,f(x_2\mid\theta)\dots f(x_n\mid\theta). \end{aligned} \]

Likewise, for independent discrete observations with probability mass function, the likelihood function is

\[ \begin{aligned} L(x_1, x_2, \dots, x_n \mid \theta) &= \prod_{i=1}^{n} P(x_i \mid \theta) \\ &= P(x_1\mid\theta)\,P(x_2\mid\theta)\dots P(x_n\mid\theta). \end{aligned} \]

The likelihood function is not a probability distribution for \(\theta\). Instead, it is a function that measures how well different parameter values explain the observed data.

In other words, the likelihood answers the question:

If the parameter were \(\theta\), how likely is it that we would observe the data we actually obtained?

Maximum likelihood estimator (MLE)

The maximum likelihood estimator (MLE) is the value of \(\theta\) that maximises the likelihood function. Formally,

\[ \hat{\theta} = \arg\max_{\theta} L(\theta). \]

In practice, it is often easier to work with the log-likelihood function

\[ \ell(\theta) = \log L(\theta), \]

because logarithms convert products into sums, which simplifies calculations. Maximising the likelihood or the log-likelihood produces the same estimator.

Maximum likelihood estimators have several useful properties.

Under suitable conditions, the MLE tends to:

be consistent, meaning it approaches the true parameter as the sample size increases;
have small variance among reasonable estimators;
be approximately normally distributed when the sample size is large.

These properties make maximum likelihood estimation a standard method for parameter estimation in statistics.

Bernoulli Example: Consider a Bernoulli experiment where \(X_i \sim \text{Bernoulli}(p),\) and \(p\) is the probability of success.

Suppose we observe \(x_1, x_2, \dots, x_n.\) The likelihood function is

\[ L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}. \]

This can be written as

\[ L(p) = p^{\sum x_i} (1-p)^{n - \sum x_i}. \]

The log-likelihood is

\[ \ell(p) = \sum x_i \log p + (n-\sum x_i) \log (1-p). \]

Using the first derivative test:

\[ \frac{\partial \ell}{\partial p} = \frac{\sum x_i}{p} - \frac{n-\sum x_i}{1-p}=0 \]

Rearrange, and then the estimator is

\[ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i. \]

You can also use the second derivative test to confirm that this is the maximum solution.

Thus, the MLE for the Bernoulli parameter \(p\) is simply the sample proportion.

set.seed(123)

# True parameter (only for simulation)
p_true <- 0.35
n <- 40 # try larger n, what do you see?

# Data: Xi ~ Bernoulli(p_true)
x <- rbinom(n, size = 1, prob = p_true)

# Sufficient statistic and MLE
s <- sum(x)
p_hat <- mean(x) # s / n

s

[1] 18

p_hat

[1] 0.45

Based on the observed data, the probability of success that best explains seeing 18 successes out of 40 trials is \(\hat{p} = 0.45\). Because the sample size is small, the MLE \(\hat{p}\) has high variability and may differ noticeably from the true parameter \(p\). As the sample size increases, the Law of Large Numbers implies that the sample proportion converges to \(p\), so \(\hat{p}\) becomes more stable and typically closer to the true value.

Poisson Example: Let \(x_1, x_2, \dots, x_n\) be a random sample from \(X_i \sim \text{Poisson}(\lambda)\). Then, the likelihood function is

\[ L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}=\frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod_{i=1}^n x_i!}. \]

The log-likelihood is

\[ \ell(\lambda) = -n\lambda + \sum x_i \log\lambda + \text{const.} \]

Taking derivatives and solving

\[ \hat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} \]

So for Poisson data, the MLE of \(\lambda\) is the sample mean — exactly analogous to the Bernoulli case.

set.seed(123)

# True parameter (for simulation only)
lambda_true <- 2.5
n <- 30

# Generate Poisson data
x <- rpois(n, lambda = lambda_true)

# MLE for lambda
lambda_hat <- mean(x)

x

 [1] 2 4 2 4 5 0 2 5 3 2 5 2 3 3 1 5 1 0 2 5 4 3 3 7 3 3 3 3 2 1

lambda_hat

[1] 2.933333

# Likelihood function
L <- function(lambda, x) {
  n <- length(x)
  s <- sum(x)
  exp(-n * lambda) * lambda^s / prod(factorial(x))
}

# Log-likelihood function
ll <- function(lambda, x) {
  n <- length(x)
  s <- sum(x)
  -n * lambda + s * log(lambda)
} # we drop constants since they don't affect the maximiser

# Starting grid at 0.001 because log(0) is undefined.
lambda_grid <- seq(0.001, max(x) * 2, length.out = 2000)
# For each value of lambda in lambda_grid, compute ll(lambda, x) and store the result.
ll_vals <- sapply(lambda_grid, ll, x = x) 

plot(lambda_grid, ll_vals, type = "l", lwd = 2,
     xlab = expression(lambda),
     ylab = "log-likelihood",
     main = "Poisson log-likelihood and its MLE")

abline(v = lambda_hat, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = sprintf("lambda-hat = mean(x) = %.3f", lambda_hat),
       col = "red", lty = 2, lwd = 2, bty = "n")

opt <- optimize(ll,
                interval = c(1e-6, max(x) * 3), # small positive numbers to avoid log(0)
                x = x,
                maximum = TRUE)

opt$maximum

[1] 2.933329

lambda_hat

[1] 2.933333

These two values will agree up to numberical precision, showing that:

\[ \hat{\lambda} = \arg\max_{\lambda} \ell(\lambda) = \bar{x} \]

11.3 Confidence Intervals

Point estimators, such as the sample mean or the maximum likelihood estimator, provide a single numerical estimate of an unknown parameter. However, because the data are random, different samples will produce different estimates. As a result, a point estimate alone does not convey how uncertain the estimate might be.

To account for this uncertainty, we often report an interval estimate rather than a single value.

A confidence interval is a range of plausible values for the unknown parameter based on the observed data.

Example: Suppose we estimate the mean waiting time in a queue using the sample mean. If the sample mean is \(\bar{x} = 5.2\) minutes, we might ask:

How close is this estimate to the true mean?
Would a different sample give a very different value?

A confidence interval addresses these questions by providing a range of values that are consistent with the observed data. For example, \((4.7,\; 5.7)\) might be reported as a 95% confidence interval for the mean waiting time.

A confidence interval is constructed from sample data in such a way that, if the sampling procedure were repeated many times, a specified proportion of the intervals would contain the true parameter.

It is important to note that the parameter itself is fixed; it is the interval that varies from sample to sample.

Confidence Intervals for the Mean

Suppose we observe independent observations \(X_1, X_2, \dots, X_n\) from a population with mean \(\mu\) and variance \(\sigma^2\).

When the sample size is sufficiently large, the Central Limit Theorem implies that the sample mean is approximately normally distributed:

\[ \bar{X} \approx \mathcal{N}\left(\mu,\frac{\sigma^2}{n}\right). \]

This result allows us to construct a confidence interval for \(\mu\):

\[\bar{X} \pm z_{0.975}\frac{\sigma}{\sqrt{n}},\]

where \(z_{0.975}\) is the 97.5th percentile of the standard normal distribution.

In practice, the population variance \(\sigma^2\) is usually unknown and is replaced by the sample standard deviation \(s\), leading to the interval

\[\bar{X} \pm z_{0.975}\frac{s}{\sqrt{n}}.\]

This interval provides an estimate of the range in which the true population mean is likely to lie.

Interpretation: If the interval procedure is correct, approximately 95% of the confidence intervals should contain the true parameter.

Confidence Intervals for an MLE

Suppose \(\hat{\theta}\) is the maximum likelihood estimator of a parameter \(\theta\).

Under fairly general conditions, when the sample size is large, the MLE is approximately normally distributed:

\[ \hat{\theta} \approx \mathcal{N}\left(\theta, \; \text{Var}(\hat{\theta})\right). \]

This result allows us to construct an approximate confidence interval for \(\theta\).

A 95% confidence interval can be written as

\[ \hat{\theta} \pm z_{0.975}\sqrt{\text{Var}(\hat{\theta})}. \]

In practice, the variance of the estimator is usually unknown and must be estimated from the data.

Bernoulli Example (cont): For a Bernoulli model \(X_i \sim \text{Bernoulli}(p),\) the MLE for \(p\) is \(\hat{p} = \frac{1}{n}\sum_{i=1}^n X_i.\)

The approximate variance of this estimator is

\[ \text{Var}(\hat{p}) = \frac{p(1-p)}{n}. \]

Replacing \(p\) with the estimate \(\hat{p}\) gives the standard error:

\[ \text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]

Thus, a 95% confidence interval for \(p\) is

\[ \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]

set.seed(123)
p_true <- 0.35
n <- 40
x <- rbinom(n, size = 1, prob = p_true)
s <- sum(x)
p_hat <- mean(x)

alpha <- 0.05 # 5% for 95% CI
z <- qnorm(1 - alpha / 2) # return ~1.96
se_hat <- sqrt(p_hat * (1 - p_hat) / n)

CI_wald <- c(
  lower = p_hat - z * se_hat,
  upper = p_hat + z * se_hat
)

CI_wald

    lower     upper 
0.2958279 0.6041721

A 95% confidence interval of \((0.2958,0.6042)\) means that, over repeated sampling, approximately 95% of intervals constructed in this way would contain the true probability of success \(p\).

Poisson Example (cont): The variance of a Poisson random variable is

\[ \text{Var}(X) = \lambda. \]

For a sample of size \(n\), the variance of the sample mean is

\[ \text{Var}(\hat{\lambda}) = \frac{\lambda}{n}. \]

Replacing \(\lambda\) with the estimate \(\hat{\lambda}\), the standard error is

\[ \text{SE}(\hat{\lambda}) = \sqrt{\frac{\hat{\lambda}}{n}}. \]

An approximate 95% confidence interval for \(\lambda\) is therefore

\[ \hat{\lambda} \pm 1.96\sqrt{\frac{\hat{\lambda}}{n}}. \]

set.seed(123)
lambda_true <- 2.5
n <- 30
x <- rpois(n, lambda = lambda_true)

alpha <- 0.05 # 5% for 95% CI
z <- qnorm(1 - alpha / 2) # returns ~1.96

se_hat <- sqrt(lambda_hat / n)

CI_wald <- c(
  lower = lambda_hat - z * se_hat,
  upper = lambda_hat + z * se_hat
)

CI_wald

   lower    upper 
2.320464 3.546203

A 95% confidence interval of \((2.3205, 3.5462)\) means that, over repeated sampling, approximately 95% of intervals constructed in this way would contain the true parameter \(\lambda\).