20  R Workshop Solutions

This workshop consists of two parts:

Part 1: Inverse Transform Sampling

Exercise 1: Precision Method

A mobile game awards players with different types of rewards when a treasure chest is opened. The reward type follows the probability distribution below.

Reward Type \(x\) Coins Gems Energy Rare Item
\(P(X=x)\) 0.50 0.25 0.15 0.10
  1. Construct the cumulative distribution function (CDF) for this random variable and present it in a table.
# Define outcomes and probabilities
x <- c("Coins", "Gems", "Energy", "Rare Item")
p <- c(0.50, 0.25, 0.15, 0.10)

# Check probabilities sum to 1
sum(p)
[1] 1
# Construct cumulative probabilities
F <- cumsum(p)

# Create table for PMF and CDF
data.frame(
  outcome = x,
  p = p,
  F = F
)
    outcome    p    F
1     Coins 0.50 0.50
2      Gems 0.25 0.75
3    Energy 0.15 0.90
4 Rare Item 0.10 1.00
  1. Intervals for the precision method

From the cumulative probabilities,

  • Coins: \(0 < U \le 0.50\)
  • Gems: \(0.50 < U \le 0.75\)
  • Energy: 0\(.75 < U \le 0.90\)
  • Rare Item: \(0.90 < U \le 1.00\)

So the rule is

\[ X = \begin{cases} \text{Coins} & \text{if } 0 < U \le 0.50 \\ \text{Gems} & \text{if } 0.50 < U \le 0.75 \\ \text{Energy} & \text{if } 0.75 < U \le 0.90 \\ \text{Rare Item} & \text{if } 0.90 < U \le 1.00 \end{cases} \]

# Show intervals in a table
lower <- c(0, F[-length(F)]) # F[-length(F)] excludes the last item in F
upper <- F

data.frame(
  outcome = x,
  lower = lower,
  upper = upper
)
    outcome lower upper
1     Coins  0.00  0.50
2      Gems  0.50  0.75
3    Energy  0.75  0.90
4 Rare Item  0.90  1.00
  1. Classify the given random numbers
u_given <- c(0.12, 0.63, 0.44, 0.91, 0.27)

# Find the first cumulative probability >= u
result_given <- x[sapply(u_given, function(u) which(u <= F)[1])]
# sapply(u_given, function(u) ...) applies a function to every element of u_given
# which(u <= F) checks which CDF are >= u and returns the positions of TRUE
# (u <= F)[1] takes the first index.

data.frame(
  u = u_given,
  reward = result_given
)
     u    reward
1 0.12     Coins
2 0.63      Gems
3 0.44     Coins
4 0.91 Rare Item
5 0.27     Coins

Equivalently:

find_outcome <- function(u){
  i <- which(u <= F)[1]
  return(x[i])
}

result_given <- sapply(u_given, find_outcome)

data.frame(
  u = u_given,
  reward = result_given
)
     u    reward
1 0.12     Coins
2 0.63      Gems
3 0.44     Coins
4 0.91 Rare Item
5 0.27     Coins
  1. R function to simulate one reward
simulate_reward <- function() {
  u <- runif(1)
  x[which(u <= F)[1]]
}

# Example: simulate 20 rewards
set.seed(123)
replicate(20, simulate_reward())
 [1] "Coins"     "Energy"    "Coins"     "Energy"    "Rare Item" "Coins"    
 [7] "Gems"      "Energy"    "Gems"      "Coins"     "Rare Item" "Coins"    
[13] "Gems"      "Gems"      "Coins"     "Energy"    "Coins"     "Coins"    
[19] "Coins"     "Rare Item"
  1. Simulate 10,000 rewards and compare empirical probabilities
set.seed(123)

samples <- replicate(10000, simulate_reward())

# Empirical probabilities
empirical <- prop.table(table(samples))
empirical
samples
    Coins    Energy      Gems Rare Item 
   0.5057    0.1490    0.2508    0.0945 
# Compare theoretical and empirical probabilities
comparison <- data.frame(
  outcome = x,
  theoretical = p,
  empirical = as.numeric(empirical[x])
)

comparison
    outcome theoretical empirical
1     Coins        0.50    0.5057
2      Gems        0.25    0.2508
3    Energy        0.15    0.1490
4 Rare Item        0.10    0.0945

You can also look at the differences:

comparison$difference <- comparison$empirical - comparison$theoretical
comparison
    outcome theoretical empirical difference
1     Coins        0.50    0.5057     0.0057
2      Gems        0.25    0.2508     0.0008
3    Energy        0.15    0.1490    -0.0010
4 Rare Item        0.10    0.0945    -0.0055

The empirical probabilities should be close to the theoretical probabilities, though not exactly equal, because the simulation uses a finite number of draws. With 10,000 replications, the discrepancies are usually small.

  1. Why can the precision method be inefficient for large k?

A simple implementation checks the cumulative probabilities one by one until it finds the correct interval. When the number of outcomes k is large, this search can become slow, especially if repeated many times. More efficient methods or built-in algorithms are often preferred for large discrete distributions.

Optional: more compact R version using findInterval()

simulate_reward2 <- function(n = 1) {
  u <- runif(n)
  idx <- findInterval(u, vec = F) + 1
  x[idx]
}

set.seed(123)
head(simulate_reward2(10))
[1] "Coins"     "Energy"    "Coins"     "Energy"    "Rare Item" "Coins"    
# 10,000 simulated rewards
samples2 <- simulate_reward2(10000)
prop.table(table(samples2))
samples2
    Coins    Energy      Gems Rare Item 
   0.5059    0.1490    0.2506    0.0945 

One small note: because findInterval() uses interval boundaries differently, the indexing needs care. The earlier which(u <= F)[1] version is the clearest way to the precision method.

Note

The precision method is essentially the discrete version of the inverse CDF method.

Exercise 2: Load on a bridge (CD 2014)

Consider the pdf for the random variable \(X\) which is the magnitude (in newtons) of a dynamic load on a bridge, given as

\[ f(x)= \left(\frac{1}{8} + \frac{3}{8} x \right), \quad 0 \le x \le 2 \]

\[ f(x)=0, \quad \text{otherwise.} \]

  1. Calculate the exact mean and variance.

\[ \boxed{E(X) = \int_x x f(x)\,dx, \quad \mathrm{Var}(X)= E(X^2) - [E(X)]^2} \]

First, check the pdf:

\[ \int_0^2 f(x)\, dx = \int_0^2 \left(\frac{1}{8} + \frac{3}{8} x \right)\,dx = 1 \rightarrow \text{ valid PDF.} \]

\[ \begin{aligned} E(X) &= \int_0^2 x f(x)\,dx \\ &= \int_0^2 x \left(\frac{1}{8} + \frac{3}{8} x \right)\,dx \\ &= \left[\frac{1}{16} x^2 + \frac{1}{8} x^3\right]_0^2 \\ \therefore E(X) &= \frac{5}{4} = 1.25 \\ E(X^2) &= \int_0^2 x^2 f(x)\,dx \\ &= \int_0^2 x^2 \left(\frac{1}{8} + \frac{3}{8} x \right)\,dx \\ &= \left[\frac{1}{24} x^3 + \frac{3}{32} x^4\right]_0^2 \\ E(X^2) & = \frac{11}{6} \approx 1.8333 \\ \mathrm{Var}(X) &= E(X^2) - [E(X)]^2 \\ &= \frac{11}{6} - \left(\frac{5}{4}\right)^2 \\ \therefore \mathrm{Var}(X) &= \frac{13}{48} \approx 0.2708 \end{aligned} \]

Alternatively,

# pdf
f <- function(x) {
  ifelse(x >= 0 & x <= 2, 1/8 + 3*x/8, 0)
}

# E(X)
EX <- integrate(function(x) x * f(x), lower = 0, upper = 2)$value

# E(X^2)
EX2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = 2)$value

# Var(X)
VarX <- EX2 - EX^2

EX
[1] 1.25
EX2
[1] 1.833333
VarX
[1] 0.2708333
  1. Write a program to simulate values from this distribution using the inverse cdf method. Calculate the mean and variance and compare with the exact results.
  • Step 1. Firstly, we need to find the cdf \(F(x).\) \[ \begin{aligned} F(x) &= \int_0^x f(t)\, dt \\ &= \int_0^x \left(\frac{1}{8} + \frac{3}{8} t \right)\,dt \\ F(x) &= \frac{x}{8} + \frac{3x^2}{16} \end{aligned} \]
library(Ryacas)

# Define variable
x <- ysym("x")

# Define pdf
f <- 1/8 + 3*x/8

# Integrate
F <- integrate(f, x)
F
y: (3*x^2)/16+0.125*x
  • Step 2. Set \(u = F(x)\) and solve for \(x\) \[ \begin{aligned} u &= \frac{x}{8} + \frac{3x^2}{16} \\ 16u &= 3x^2 + 2x \\ 0 &= 3x^2 + 2x - 16u \\ x &= \frac{-2 \pm \sqrt{2^2 - 4(3)(-16u)}}{2(3)} \\ &= \frac{-1 \pm \sqrt{1+48u} }{3} \\ \therefore x &= \frac{-1 + \sqrt{1+48u} }{3}, \quad 0 \leq x \leq 2. \end{aligned} \]
u <- ysym("u")
solve(3*x^2 + 2*x - 16*u, "x")
{x==(Sqrt(4-(-192)*u)-2)/6, x==(-(Sqrt(4-(-192)*u)+2))/6} 
  • Step 3. Simulate \(f(x)\) from \(U \sim \text{Uniform}(0,1)\)
set.seed(1234)
u <- runif(10000)
x <- ( -1 + sqrt(1 + 48 * u) ) / 3

hist(x, prob = TRUE, col = "skyblue",
     main = "Simulated Distribution")

curve(1/8 + 3 * x / 8, add = TRUE, col = "red", lwd = 2)

cat("Theoretical mean =", EX, ", Empirical mean =", mean(x))
Theoretical mean = 1.25 , Empirical mean = 1.251618
cat("Theoretical variance =", VarX, ", Empirical variance =", var(x))
Theoretical variance = 0.2708333 , Empirical variance = 0.2673982

The simulated distribution closely matches the theoretical probability density function, as shown by the histogram and overlaid curve Additionally, the empirical mean and variance from the simulated data are very close to the exact values \(E(X)=\frac{5}{4}\) and \(\mathrm{Var}(X)=\frac{13}{48}\), confirming that the simulation is accurate.

Exercise 3: Simulating the distribution of waiting time - Transportation

The article “A Model of Pedestrians’ Waiting Times for Street Crossings at Signalised Intersections” (Transportation Research, 2013: 17–28) suggested that under some circumstances the distribution of waiting time \(X\) could be modeled with the following pdf:

\[ f(x)= \frac{\theta}{\tau}\left( 1- \frac{x}{\tau}\right)^{\theta - 1}, \quad 0 \le x < \tau \]

\[ f(x)=0, \quad \text{otherwise.} \]

  1. Write a function to simulate values from this distribution, implementing the inverse cdf method. Your function should have three inputs: the desired number of simulated values \(n\) and values for the two parameters for \(\theta\) and \(\tau.\)
  • Step 1. Determine the cdf:

    \[ F(x) = \int_0^x f(t) dt = \int_0^x \frac{\theta}{\tau}\left( 1- \frac{t}{\tau}\right)^{\theta - 1} dt \]

    Using integration by substitution method, let \[ w=\left( 1- \frac{t}{\tau}\right), \quad dw=-\frac{1}{\tau}\,dt \]

    \[ \begin{aligned} F(x) &= \int_0^x \left( 1- \frac{t}{\tau}\right)^{\theta - 1}\left(\frac{\theta}{\tau}\right) dt \\ &= \int_1^{(1-x/\tau)} w^{\theta-1}(-\theta)\,dw \\ &= \left[-w^\theta\right]_1^{(1-x/\tau)} \\ \therefore F(x) &= 1 - (1-x/\tau)^{\theta} \end{aligned} \]

  • Step 2. Set \(u=F(x)\) and solve for x

    \[ u= 1 - (1-x/\tau)^{\theta} \quad \rightarrow \quad x= \tau(1 - (1- u)^{1/\theta}) \]

  • Step 3. Write a function that accepts 3 inputs: \(n,\theta,\tau\)

waittime <- function(n, theta, tau){
u <- runif(n)
x <- tau*(1-(1-u)^(1/theta))
}
  1. Use your function in part (a) to simulate 10,000 values from this wait time distribution with \(\theta=4\) and \(\tau=80.\) Estimate \(E(X)\) under these parameter settings.
set.seed(1234)
x <- waittime(10000,4,80)
mean(x)
[1] 15.95958
var(x)
[1] 167.5203
  1. How close is your estimate to the correct value of the exact \(E(X)=16\) and \(\mathrm{Var}(X)=170.67\)?

The empirical mean and variance from the simulated data are very close to the exact mean and variance, confirming that the simulation is accurate.


Part 2: Acceptance-Rejection Sampling

Exercise 4: Simulating Beta(2,2) distribution using Uniform distribution (Templ 2016)

  1. Describe the steps. Hints: The pdf of \(Beta(\alpha,\beta)\) is \[ f(x)= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 \le x \le 1 \]

    The aim is to simulate \(n\) values from a Beta(2,2) distribution with pdf such that by substituting \(\alpha=2, \beta=2\) in the above, we have \[ f(x)= 6x(1-x), \quad 0 \le x \le 1. \]

    Let \(g(x)\) be Uniform(0,1) such that \[ g(x)= 1, \quad 0 \le x \le 1. \]

    and the ratio \(f/g\) is bounded, i.e., there exists a constant \(M\) such that \[ \frac{f(x)}{g(x)} = 6x(1-x) \le M, \quad \forall x. \]

    \[ \frac{d}{dx}6x(1-x) = 6 - 12x = 0 \]

    In this case, the function is maximised at \(x=0.5, f(x)=1.5\) so that we choose \(M = 1.5.\)

    Therefore, the general rule is

    \[ U \leq \frac{f(Y)}{Mg(Y)} = \frac{6y(1-y)}{1.5(1)} = 4y(1-y) \]

    Proceed as follows:

  1. Generate \(Y \sim g \sim \text{Uniform}(0,1).\)

  2. Generate a standard uniform variate, \(U \sim \text{Uniform}(0,1).\)

  3. If \(u\le 4y(1-y),\) then assign \(x=y\) (i.e., “accept” the candidate). Otherwise, discard or reject \(y\) and return to step 1.

  • These steps are repeated until \(n\) candidate values have been accepted.
  • The resulting accepted values \(x_1, . . ., x_n\) form a simulation of a random variable with the original pdf, \(f(x).\)
  1. Check visually the plots of \(f(x)\) and \(Mg(x).\)
curve(dbeta(x, shape1 = 2, shape2 = 2), 
      from = 0, to = 1, xlab = "x", ylab = "", lwd = 2)
abline(h = 1.5, lty = 2, col = "blue", lwd = 2)
legend("topright", legend = c("f(x)", "Mg(x) = 1.5"),
       lwd = c(2, 2), lty = c(1, 2), bty = "n")

  1. Perform Accept-Reject method to simulate Beta(2,2) distribution from Uniform distribution. Calculate the acceptance rate. Plot the simulating values of a Beta(2,2) using the accep rejection approach.
set.seed(1234)
n <- 10000
x <- numeric(0) # store accepted samples

while(length(x) < n){ # keep sampling until we have n accepted values
  m <- n - length(x)   # only generate what you still need
  y <- runif(m)                      # proposal sample Y ~ g(x) ~ Uniform(0, 1)
  u <- runif(m)                      # U ~ Uniform(0,1) for acceptance test
  accept <- u <= 4 * y * (1 - y)     # acceptance rule (returns TRUE/FALSE)
  x <- c(x, y[accept])               # keep accepted y
}

x <- x[1:n] # trim to exactly n samples

hist(x, prob = TRUE, breaks = 40,
     col = "lightgray", border = "white",
     main = "Acceptance-Rejection Sampling",
     xlab = "x")

curve(6*x*(1-x), from = -1, to = 2,
      add = TRUE, col = "red", lwd = 2)

Exercise 5: Simulating the half-normal distribution using Exponential

The half-normal distribution is defined as:

\[f(x) = \sqrt{\frac{2}{\pi}} \exp (-\frac{x^2}{2}), \quad x \ge 0.\]

Consider simulating values from this distribution using the accept–reject method with a candidate distribution of an Exponential random variable with parameter \(\lambda = 1\)

\[g(x)= e^{-x}, \quad x \ge 0.\]

  1. Find the inverse cdf corresponding to \(g(x).\)

    If \(X \sim Exp(\lambda=1)\) then the pdf and cdf are \[ g(x)= e^{-x}, \quad G(x)= 1- e^{-x}. \]

    Then set \(u=G(x)\) and then solve for \(x\):

    \[ \begin{aligned} u = G(x) &= 1- e^{-x} \\ e^{-x} &= 1-u \\ -x &= \ln(1-u) \\ x &= G^{-1}(u) = -\ln(1-u). \end{aligned} \]

  2. Find the smallest majorisation constant \(M\) so that \(f(x)/g(x)\le M\) for all \(x \ge 0.\)

    The ratio \(f(x)/g(x)\) is equivalent to

    \[ \frac{f(x)}{g(x)}= \frac{\sqrt{\frac{2}{\pi}} \exp (-\frac{x^2}{2})}{\exp(-x)}=\sqrt{\frac{2}{\pi}} \exp (-\frac{x^2}{2}+x). \]

    Use the first derivative test,

    \[ \frac{d}{dx} \left(\exp (-\frac{x^2}{2}+x) \right)=\exp (-\frac{x^2}{2}+x)(1-x)=0. \]

    The absolute maximum on the interval \([0, \infty)\) occurs at \(x = 1\).

    \[ \frac{f(x)}{g(x)} \le \sqrt{\frac{2}{\pi}} \exp (-\frac{1^2}{2}+1)=\sqrt{\frac{2}{\pi}} \exp(1/2)=\sqrt{\frac{2e}{\pi}}=M \]

  3. On the average, how many candidate values will be required to generate 10,000 “accepted” values?

M = sqrt(2*exp(1)/pi); M
[1] 1.315489

The expected number of candidate values required to generate a single accepted value is \(M,\) so the expected number required to generate 10,000 x-values is \(10,000M \approx 13,155.\)

  1. Write a program to construct 10,000 values from a half-normal distribution.
set.seed(1234)
n <- 10000
M <- sqrt(2*exp(1)/pi)
x <- numeric(0) # store accepted samples

while(length(x) < n){ # keep sampling until we have n accepted values
  m <- n - length(x)   # only generate what you still need
  y <- rexp(m, rate=1)          # proposal sample Y ~ g(x)
  u <- runif(m)                 # U ~ Uniform(0,1) for acceptance test
  fg_ratio <- sqrt(2 / pi) * exp(-y^2/2 + y)                    
  accept <- u <= fg_ratio / M   # acceptance rule (returns TRUE/FALSE)
  x <- c(x, y[accept])          # keep accepted y
}

x <- x[1:n] # trim to exactly n samples

hist(x, prob = TRUE, breaks = 40,
     col = "lightgray", border = "white",
     main = "Acceptance-Rejection Sampling",
     xlab = "x")

curve(sqrt(2 / pi) * exp(-x**2/2), from = 0, to = 4,
      add = TRUE, col = "red", lwd = 2)