42 Monte Carlo Integration

In many scientific and statistical problems, we are interested in computing quantities that can be written as integrals. Examples include:

probabilities;
expected values;
variances;
areas and volumes;
likelihoods and posterior means in Bayesian statistics.

For simple models, these integrals can sometimes be evaluated analytically using calculus. However, in realistic applications, exact solutions are often unavailable or extremely difficult to derive.

Monte Carlo integration provides a computational alternative. Instead of solving the integral analytically, we estimate it using random sampling.

The central idea is remarkably simple:

averages computed from random samples can approximate complicated integrals.

This transforms integration from a calculus problem into a simulation problem.

Monte Carlo integration is one of the foundational ideas behind modern computational statistics, Bayesian inference, quantitative finance, machine learning, and uncertainty quantification.

Integration as Expectation

Monte Carlo integration is a method for estimating the value of an integral using random sampling. The basic idea is to sample points from a probability distribution and use these samples to approximate the integral. For example, if we want to estimate the integral of a function \(f(x)\) over a domain \(D\), we can sample points \(x_i\) from a distribution that covers the domain and compute the average value of \(f(x_i)\) to get an estimate of the integral.

Suppose \(X\) is a continuous random variable with probability density function \(f(x)\), and let \(h(x)\) be a function of interest.

Recall from probability theory that the expectation of \(h(X)\) is

\[ \mathbb{E}[h(X)] = \int h(x)f(x)\,dx. \]

This observation is fundamental.

Many difficult integrals can be rewritten as expectations of random variables.

Once written in this form, the integral can be approximated using simulation.

Example 1: Estimating an Expected Value

Suppose

\[ X \sim \text{Uniform}(0,1) \]

and we wish to compute

\[ I = \int_0^1 x^2\,dx. \]

Analytically,

\[ I = \left[ \frac{x^3}{3} \right]_0^1 = \frac{1}{3}. \]

But observe that if \(X \sim \text{Uniform}(0,1)\), then

\[ f(x)=1, \quad 0<x<1, \]

\[ I = \int_0^1 x^2 f(x)\,dx = \mathbb{E}[X^2]. \]

Therefore, instead of evaluating the integral directly, we can:

generate many random samples from Uniform(0, 1);
square each sample;
compute the average.

library(Ryacas)
x <- ysym("x")
f <- x^2
F <- integrate(f, x); F

y: x^3/3

set.seed(1234)
n <- 10000
x <- runif(n)
x_squared <- x^2

I <- integrate(function(x) x^2, lower=0, upper=1)$value
cat("Analytical value:", I, "\n")

Analytical value: 0.3333333

estimate <- mean(x_squared)
cat("Monte Carlo estimate:", estimate, "\n")

Monte Carlo estimate: 0.3327025

Python Version

import numpy as np
import sympy as sp
from scipy.integrate import quad

x = sp.symbols('x')
f = x**2
F = sp.integrate(f, x)
# print(sp.latex(F)) # copiable LaTeX
pprint(F)

np.random.seed(1234)
n = 10000
x = np.random.uniform(0, 1, n)
x_squared = x**2

I, _ = quad(lambda x: x**2, 0, 1)
print("Analytical value:", round(I, 4))
estimate = np.mean(x_squared)
print("Monte Carlo estimate:", round(estimate, 4))

Monte Carlo Estimator

Suppose \(X_1,X_2,\dots,X_n\) are independent and identically distributed (i.i.d.) samples from density \(f(x)\).

The Monte Carlo estimator of an integral

\[ I = \int h(x) f(x)\,dx = \mathbb{E}[h(X)]. \tag{42.1}\]

\[ \hat I_n = \frac{1}{n}\sum_{i=1}^n h(X_i). \tag{42.2}\]

This estimator is simply the sample mean of the simulated values.

Why This Works

Monte Carlo integration is justified by the Law of Large Numbers (LLN).

Recall from Week 1 that if \(X_1,X_2,\dots,X_n\) are i.i.d. with finite mean \(\mu\), then

\[ \bar x \to \mu \quad \text{as } n\to\infty. \]

Since the Monte Carlo estimator \(\hat I_n\) is simply a sample average, it converges to the true expectation \(\mathbb{E}[h(X)]\) as the number of samples increases. Thus, \[ \hat I_n \to I \quad \text{as } n\to\infty. \]

Example 2: Estimating an Area

One of the most famous examples of Monte Carlo integration is estimating the value of \(\pi\).

The key idea is that Monte Carlo methods can approximate geometric areas using random sampling, which otherwise calculus would require difficult integration.

Consider the square \([-1,1]\times[-1,1]\) with area 4¹.

Inside the square is a circle of radius 1 centred at the origin. The equation of the circle is

\[ x^2 + y^2 = 1. \tag{42.3}\]

Since the radius of the circle is 1, the area of the circle is

\[ \pi r^2 = \pi (1)^2 = \pi. \]

How can we use Monte Carlo integration to estimate the area of the unit circle \(\pi\)?

Method 1: Using Curve-Height

The first method is to express the area as a single integral and then apply the Monte Carlo estimator.

Step 1: Expressing Area as an Integral

From Equation 42.3, rewriting it in terms of \(y\) gives

\[ y = \pm\sqrt{1-x^2}. \]

Then, from the geometric interpretation of the definite integral², the area of the circle can be written as

\[ \begin{aligned} A &= \int_{-1}^1 \left( \sqrt{1-x^2} - (-\sqrt{1-x^2}) \right) dx \\ A &= \int_{-1}^1 2\sqrt{1-x^2}\,dx. \end{aligned} \]

Step 2: Rewriting as an Expectation

Since \(X \sim \text{Uniform}(-1,1)\), its density is

\[ f(x) = \frac12, \quad -1 < x < 1. \]

Rewriting the area integral as in Equation 42.1, we can express the area as an expectation

\[ A = \int_{-1}^1 4\sqrt{1-x^2} \left(\frac12\right)\,dx = \mathbb{E}[4\sqrt{1-X^2}]. \tag{42.4}\]

Step 3: Apply Monte Carlo Estimator

From Equation 42.2, the Monte Carlo estimator of the area is given by

\[ \hat A_n = \frac1n\sum_{i=1}^n 4\sqrt{1-X_i^2}, \]

Since \(A = \pi\), the Monte Carlo estimate of \(\pi\) is

\[ \boxed{ \hat \pi_n = \frac1n\sum_{i=1}^n 4\sqrt{1-X_i^2},} \]

where \(X \sim \text{Uniform}(-1,1)\).

set.seed(1234)
n <- 5000
x <- runif(n, -1, 1)
pi_hat <- mean(4*sqrt(1-x^2))
pi <- integrate(function(x) 2*sqrt(1-x^2), lower=-1, upper=1)$value
cat("Analytical value: ", round(pi, 5), "\n",
    "Monte Carlo estimate: ", round(pi_hat, 5), sep="")

Analytical value: 3.14159
Monte Carlo estimate: 3.14051

Python Version

import numpy as np
from scipy.integrate import quad

np.random.seed(1234)
n = 5000
x = np.random.uniform(-1, 1, n)
pi_hat = np.mean(4 * np.sqrt(1 - x**2))
pi, _ = quad(lambda x: 2 * np.sqrt(1 - x**2), -1, 1)
print("Analytical value: ", round(pi, 5), 
      "\nMonte Carlo estimate: ", round(pi_hat, 5), sep="")

Method 2: Using Dartboard

The second method is to express the area as a double integral and then apply the Monte Carlo estimator. This method is more intuitive and closely related to the original idea of Monte Carlo methods, which is to use random sampling to approximate areas and volumes.

Step 1: Expressing Area as an Integral

The area of the circle can be written as a double integral,

\[ A = \iint_{[-1, 1]^2} \mathbf{1}(x^2 + y^2 \le 1) \, dx\,dy, \]

where

\[ \mathbf{1}(x^2 + y^2 \le 1) = \begin{cases} 1 & \text{if } x^2 + y^2 \le 1, \\ 0 & \text{otherwise}. \end{cases} \]

This indicator function simply checks whether a point belongs to the circle. The integral therefore “adds up” all points inside the circle, producing the total area.

Step 2: Rewriting as an Expectation

Since \((X,Y)\) is uniformly distributed over the square \([-1,1]\times[-1,1]\), its joint density is

\[ f(x,y) = \frac{1}{4}, \quad -1 < x < 1,\; -1 < y < 1. \]

Joint Density of Uniform Distribution

For a defined rectangular region \([a, b] \times [c, d]\), if \(X\) and \(Y\) are independent Uniform(\(a, b\)) and Uniform(\(c, d\)) random variables, then their joint density is the product of their marginal densities:

\[ f(x,y) = f_X(x) \cdot f_Y(y) = \frac{1}{(b-a)(d-c)}. \]

Rewriting the area integral as in Equation 42.1, we can express the area as an expectation of the indicator function

\[ A = \iint_{[-1, 1]^2} 4\cdot\mathbf{1}(x^2 + y^2 \le 1) \left(\frac{1}{4}\right) dx\,dy = 4\cdot\mathbb{E}[\mathbf{1}(X^2 + Y^2 \le 1)]. \]

Step 3: Apply Monte Carlo Estimator

From Equation 42.2, the Monte Carlo estimator of the area is given by

\[ \hat A_n = 4\cdot\frac1n\sum_{i=1}^n \mathbf{1}(X_i^2 + Y_i^2 \le 1), \]

Since \(A = \pi\), the Monte Carlo estimate of \(\pi\) is

\[ \boxed{ \hat \pi_n = 4\cdot \frac1n\sum_{i=1}^n \mathbf{1}(X_i^2 + Y_i^2 \le 1),} \]

where \((X,Y) \sim \text{Uniform}([-1,1]\times[-1,1])\).

Step 4: From expectation to probability

If a point \((X,Y)\) is generated uniformly inside the square, then the probability that a point falls inside the circle is

\[ P(x^2 + y^2 \le 1) = \frac{\text{Area of the circle}}{\text{Area of the square}} = \frac{\pi}{4}. \tag{42.5}\]

Rearranging gives

\[ \pi = 4 \times P(x^2 + y^2 \le 1). \]

Suppose we generate \(n\) random points uniformly inside the square and let

\[ k = \text{number of points that satisfy } x^2 + y^2 \le 1. \]

Then, the proportion of points that fall inside the circle is

\[ \frac{k}{n}. \]

which is an estimate of the probability \(P(x^2 + y^2 \le 1)\).

Therefore, the Monte Carlo estimate of \(\pi\) can be simplified to

\[ \boxed{ \hat \pi \approx 4\times \frac{k}{n}} \]

As the number of simulated points increases, the estimate becomes more accurate by the Law of Large Numbers.

set.seed(1234)
par(mar=c(3,3,1,1))
plot(NA, NA, xlim=c(-1,1), ylim=c(-1,1), asp=1, ylab="", xlab="")
points(runif(1000, -1, 1), runif(1000, -1, 1), pch=16, cex=0.5, col=rgb(0,0,1,0.5))
symbols(0, 0, circles=1, add=TRUE, lwd=3, fg="red")

Python Version

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1234)
x = np.random.uniform(-1, 1, 1000)
y = np.random.uniform(-1, 1, 1000)

fig, ax = plt.subplots()
ax.set_aspect('equal')
ax.set_xlabel(""); ax.set_ylabel("")
ax.scatter(x, y, s=10, alpha=0.5)
circle = plt.Circle((0, 0), 1, fill=False, color='red', linewidth=3)
ax.add_patch(circle)
plt.show()

set.seed(1234)
n <- 1000
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside <- (x^2 + y^2 <= 1)
pi_hat <- 4 * mean(inside)
pi_hat

[1] 3.188

Python Version

np.random.seed(1234)
n = 1000
x = np.random.uniform(-1, 1, n)
y = np.random.uniform(-1, 1, n)
inside = (x**2 + y**2 <= 1)
pi_hat = 4 * np.mean(inside)
pi_hat

which is close to the true value: \(\pi \approx\) 3.14159.

Comparison between Monte Carlo Estimators

The curve-height method is better for showing Monte Carlo integration, while the dartboard method is better for showing simulation-based probability/area estimation.

Question: Which estimator is better in practice?

Both Monte Carlo estimators converge to the true value of \(\pi\) as the number of simulations increases. However, they do not converge at the same speed.

Recall the two estimators:

Method	Random Variable	Monte Carlo Estimator \(\hat \pi_n\)
Curve-Height	\(X \sim U(-1,1)\)	\(\hat \pi_1 = \frac1n\sum_{i=1}^n 4\sqrt{1-X_i^2}\)
Dartboard	\((X,Y) \sim U([-1,1]^2)\)	\(\hat \pi_2 = \frac4n\sum_{i=1}^n \mathbf{1}(X_i^2 + Y_i^2 \le 1)\)

Both estimators are unbiased:

\[ \mathbb E[\hat\pi_1] = \mathbb E[\hat\pi_2] = \pi. \]

However, their variances are different.

Recall that variance can be computed using the formula

\[ \text{Var}(X)=\mathbb{E}[X^2]-\mathbb{E}[X]^2. \tag{42.6}\]

Estimator 1: Curve-Height

Variance of the curve-height estimator is \[ \begin{aligned} \text{Var}(\hat\pi_1) &= \frac1n \text{Var}(4\sqrt{1-X^2}) \\ &= \frac1n \left(\mathbb{E}[16(1-X^2)] - \mathbb{E}[4\sqrt{1-X^2}]^2\right). \end{aligned} \]

\(\mathbb{E}[X^2] = \frac{1}{3}\)³ and according to Equation 42.4, \(\mathbb{E}[4\sqrt{1-X^2}] = \pi\), so

\[ \begin{aligned} \text{Var}(\hat\pi_1) &= \frac1n \left(\mathbb{E}[16(1-X^2)] - \pi^2\right) \\ &= \frac1n \left[16(1-\mathbb{E}[X^2])-\pi^2\right] \\ &= \frac1n \left[16\left(1-\frac{1}{3}\right)-\pi^2\right] \\ &= \frac1n \left(\frac{32}{3}-\pi^2\right) \\ \text{Var}(\hat\pi_1) &\approx \frac{0.797}{n}. \end{aligned} \]

set.seed(1234)
n <- 10000
x <- runif(n, -1, 1)
var_mc <- var(4*sqrt(1-x^2)) / n
var_mc

[1] 7.86232e-05

Python Version

import numpy as np
np.random.seed(1234)
n = 10000
x = np.random.uniform(-1, 1, n)
var_mc = np.var(4 * np.sqrt(1 - x**2), ddof=1) / n
var_mc

Estimator 2: Dartboard

Variance of the dartboard estimator is

\[ \begin{aligned} \text{Var}(\hat\pi_2) &= \frac{16}{n} \text{Var}(\mathbf{1}(X^2 + Y^2 \le 1)) \\ &= \frac{16}{n} \left(\mathbb{E}[\mathbf{1}(X^2 + Y^2 \le 1)] - \mathbb{E}[\mathbf{1}(X^2 + Y^2 \le 1)]^2\right). \end{aligned} \]

According to Equation 42.5, \(\mathbb{E}[\mathbf{1}(X^2 + Y^2 \le 1)] = P(x^2 + y^2 \le 1) = \pi/4\), so

\[ \begin{aligned} \text{Var}(\hat\pi_2) &= \frac{16}{n} \left(\frac{\pi}{4} - \left(\frac{\pi}{4}\right)^2\right) \\ &= \frac1n (4\pi-\pi^2) \\ &= \frac1n (4\pi-\pi^2) \\ \text{Var}(\hat\pi_2) &\approx \frac{2.697}{n}. \end{aligned} \]

set.seed(1234)
n <- 10000
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside <- (x^2 + y^2 <= 1)
var_mc <- var(4*inside) / n
var_mc

[1] 0.0002589865

Python Version

import numpy as np
np.random.seed(1234)
n = 10000
x = np.random.uniform(-1, 1, n)
y = np.random.uniform(-1, 1, n)
inside = (x**2 + y**2 <= 1)
var_mc = np.var(4 * inside, ddof=1) / n
var_mc

Since \(\text{Var}(\hat\pi_1)<\text{Var}(\hat\pi_2),\) the curve-height estimator is more efficient.

In fact, the dartboard estimator requires approximately

\[ \frac{\text{Var}(\hat \pi_2)}{\text{Var}(\hat \pi_1)} \approx \frac{2.697}{0.797} \approx 3.4 \]

times as many simulations to achieve the same variance.

This comparison highlights an important idea:

Monte Carlo estimators are random, and different estimators can produce very different levels of variability.

Even when two estimators target the same quantity, one may fluctuate much more than the other.

This naturally leads to the question:

How does Monte Carlo estimation error behave?

Monte Carlo Error

Monte Carlo estimates are random because they depend on random samples. If we repeat the simulation, we obtain slightly different answers each time. The accuracy of the estimator is described by the Central Limit Theorem (CLT).

Central Limit Theorem for Monte Carlo Estimators

If the Monte Carlo estimator is defined as

\[ \hat I_n = \frac1n\sum_{i=1}^n h(X_i), \]

then CLT implies that for large \(n\),

\[ \sqrt n(\hat I_n-I) \to N(0,\sigma^2), \tag{42.7}\]

where the Monte Carlo error is \(\hat I_n - I\) and the variance is \(\sigma^2 = \text{Var}(h(X))\).

The Central Limit Theorem tells us how the Monte Carlo estimation error behaves.⁴

Dividing Equation 42.7 both sides by \(\sqrt n\) gives

\[ \boxed{ \hat I_n - I \approx N\left(0, \frac{\sigma^2}{n}\right) } \]

This tells us that the Monte Carlo error has approximately a normal distribution with mean zero and standard deviation \(\sigma/\sqrt n\).⁵

Since the typical size of random fluctuations is measured by the standard deviation, the magnitude of the Monte Carlo error shrinks proportionally to

\[ \boxed{ \frac{1}{\sqrt{n}} } \]

So,

100 simulations gives error roughly proportional to 1/10,
10,000 simulations gives error roughly proportional to 1/100.

This explains why Monte Carlo converges slowly, which is one of the main computational limitations of Monte Carlo methods.

Example: Monte Carlo Error in Estimating \(\pi\)

We estimate \(\pi\) using Monte Carlo integration with different numbers of simulated points, \(n = 100, 1000, 10000\). If the CLT is useful here, we should observe two things:

the estimates cluster around the true value of \(\pi\);
the spread of the estimates becomes smaller as \(n\) increases.

set.seed(1234)

estimate_pi <- function(n) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  
  inside <- x^2 + y^2 <= 1
  
  pi_hat <- 4 * mean(inside)
  return(pi_hat)
}

# Number of repeated experiments
B <- 1000

# Different simulation sizes
n_values <- c(100, 1000, 10000)

# Repeat the Monte Carlo experiment for each n
results <- lapply(n_values, function(n) {
  replicate(B, estimate_pi(n))
})

names(results) <- paste0("n = ", n_values)

# Summary statistics
sapply(results, function(x) {
  c(
    mean = mean(x),
    sd = sd(x),
    bias = mean(x) - pi
  )
})

          n = 100     n = 1000     n = 10000
mean  3.136320000  3.138332000  3.1409696000
sd    0.160557268  0.051433911  0.0166427178
bias -0.005272654 -0.003260654 -0.0006230536

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))

for (i in seq_along(n_values)) {
  hist(
    results[[i]],
    breaks = 30,
    main = paste("n =", n_values[i]),
    xlab = expression(hat(pi)),
    xlim = c(2.8, 3.5)
  )
  
  abline(v = pi, lwd = 2)
}

For small \(n\), the estimates vary widely because each simulation uses relatively few random points. As \(n\) increases, the estimates become more concentrated around the true value of \(\pi\). This shows the Monte Carlo estimator becoming more stable.

How to Reduce Monte Carlo Error by Half?

Suppose the current simulation size is \(n\), giving error approximately proportional to \(1/\sqrt n\). To reduce the error by half, we need to solve \[\frac{1}{\sqrt{N}} = \frac{1}{2} \times \frac{1}{\sqrt n},\]

where \(N\) is the new sample size, then \[\sqrt{N}=2\sqrt{n} \implies N = 4n.\]

To reduce the error by half, we need about four times as many simulated points.

Pros and Cons of Monte Carlo Integration

Monte Carlo integration has several important strengths.

Simple to implement (simulate \(\rightarrow\) evaluate \(\rightarrow\) average)
Works in high dimensions, where traditional numerical integration methods fail due to the curse of dimensionality.
Flexible as it can be applied to a wide variety of problems, including those with complex geometries, non-standard distributions, and intractable likelihoods.

Despite its flexibility, Monte Carlo integration also has limitations.

Results are approximate, not exact.
Slow convergence as Monte Carlo error decreases only at rate \(1/\sqrt n\)
Not suitable for rare-event problems as naive Monte Carlo may waste most simulations.

This motivates variance reduction methods and importance sampling, discussed later in the chapter.

Transition Toward MCMC

Monte Carlo integration works well when we can generate independent samples directly from the target distribution.

However, in many realistic problems, direct sampling is impossible.

For example:

Bayesian posterior distributions may have unknown normalising constants;
high-dimensional models may be analytically intractable;
complicated joint distributions may not have standard sampling algorithms.

This leads to a fundamental question:

What can we do if we cannot sample directly from the target distribution?

Markov chain Monte Carlo (MCMC) methods provide the answer.

Instead of generating independent samples directly, MCMC constructs a stochastic process whose long-run behaviour approximates the desired distribution.

\(\text{Area of the square} = 2 \times 2 = 4\)↩︎
The area between two curves \(f(x)\) and \(g(x)\) over the interval \([a,b]\) is given by \(\int_a^b (f(x)-g(x))\,dx\)↩︎
\(E[X^2]=\int_{-1}^{1} x^2\,f(x)\,dx = \int_{-1}^1 x^2\frac{1}{2} dx = \frac{1}{3}\)↩︎
After multiplying the error by \(\sqrt n\), the distribution of the error converges to a normal distribution with mean zero and variance \(\sigma^2\).↩︎
This is the same formula for the standard error of the sample mean from Week 1!↩︎