library(ggplot2)
library(sandwich)
library(lmtest)
library(estimatr)
library(car)
library(carData)
options(digits = 4)
data(Salaries)
9 Hypothesis Testing
The F-test, the test trinity, multiple testing, and power
Chapter 8 gave us sandwich standard errors and bootstrap confidence intervals. This chapter puts them to work: we test hypotheses, build confidence regions, guard against false discoveries, and ask whether our sample is large enough to detect effects we care about.
The emphasis is on doing — running tests in R, interpreting the output correctly, and avoiding common pitfalls. We organize around applied questions:
- How do I test whether a coefficient is zero? (t-tests, robust Wald)
- How do I test multiple restrictions at once? (F-test, joint Wald)
- When do different tests disagree? (Wald, Score, F)
- How do I get confidence sets for nonlinear parameters? (test inversion, Fieller)
- What if I test many hypotheses? (Bonferroni, Holm)
- How large a sample do I need? (power analysis)
We use the Salaries dataset from carData throughout: salaries of 397 U.S. professors, with rank, discipline, years since PhD, years of service, and sex.
1 Testing a single coefficient
We start with a wage equation and ask: controlling for rank, discipline, and experience, does sex predict salary?
mod <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex,
data = Salaries)
summary(mod)
Call:
lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service +
sex, data = Salaries)
Residuals:
Min 1Q Median 3Q Max
-65248 -13211 -1775 10384 99592
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65955 4589 14.37 < 2e-16 ***
rankAssocProf 12908 4145 3.11 0.002 **
rankProf 45066 4238 10.63 < 2e-16 ***
disciplineB 14418 2343 6.15 1.9e-09 ***
yrs.since.phd 535 241 2.22 0.027 *
yrs.service -490 212 -2.31 0.021 *
sexMale 4784 3859 1.24 0.216
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22500 on 390 degrees of freedom
Multiple R-squared: 0.455, Adjusted R-squared: 0.446
F-statistic: 54.2 on 6 and 390 DF, p-value: <2e-16
The summary() output gives t-statistics and p-values, but these assume homoskedasticity. The robust version uses sandwich standard errors:
coeftest(mod, vcov. = vcovHC(mod, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65955 2896 22.78 < 2e-16 ***
rankAssocProf 12908 2206 5.85 1.0e-08 ***
rankProf 45066 3285 13.72 < 2e-16 ***
disciplineB 14418 2316 6.22 1.2e-09 ***
yrs.since.phd 535 312 1.71 0.088 .
yrs.service -490 307 -1.60 0.111
sexMale 4784 2396 2.00 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These are the same tests, built from the same Wald statistic \(W = (\hat{\beta}_j / \widehat{\text{se}}_j)^2\). The only difference is the denominator. When the two disagree, trust the robust version — it doesn’t assume \(\text{Var}(e_i \mid X_i)\) is constant.
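To see exactly what goes into the robust denominator, the HC1 sandwich can be assembled from scratch in base R. This is a sketch on simulated heteroskedastic data (not the Salaries model); the sandwich package additionally offers leverage-based variants (HC2, HC3) that this toy version omits:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 1 + 0.5 * abs(x))  # heteroskedastic noise

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- resid(fit)
k <- ncol(X)

## Sandwich: (X'X)^{-1} [ sum_i e_i^2 x_i x_i' ] (X'X)^{-1}, HC1-scaled by n/(n-k)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # row i of X scaled by e_i, so this is sum_i e_i^2 x_i x_i'
V_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

t_classical <- coef(fit)["x"] / sqrt(vcov(fit)["x", "x"])
t_robust    <- coef(fit)["x"] / sqrt(V_hc1["x", "x"])
```

Only the denominator changes between the two t-statistics; the numerator is the same OLS estimate.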
You can also use linearHypothesis() for the same test, which makes the Wald structure explicit:
linearHypothesis(mod, "sexMale = 0", vcov. = vcovHC(mod, type = "HC1"))
Linear hypothesis test:
sexMale = 0
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 391
2 390 1 3.99 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This tests \(H_0: \beta_{\text{Male}} = 0\) against \(H_1: \beta_{\text{Male}} \neq 0\) using the robust covariance matrix. The F column is the Wald statistic divided by the number of restrictions; with one restriction it is simply the squared robust t-statistic (2.00\(^2\) ≈ 3.99, up to rounding).
1.1 Testing against a non-zero value
Suppose theory predicts a gender gap of $5,000. The null is \(H_0: \beta_{\text{Male}} = 5000\):
linearHypothesis(mod, "sexMale = 5000", vcov. = vcovHC(mod, type = "HC1"))
Linear hypothesis test:
sexMale = 5000
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 391
2 390 1 0.01 0.93
The test is far from significant, and it has to be: the robust 95% confidence interval from Chapter 8 contains $5,000. A level-\(\alpha\) test fails to reject exactly when the hypothesized value lies inside the \((1-\alpha)\) confidence interval. Test and CI carry exactly the same information — they are duals.
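The duality is easy to check numerically. In this sketch on simulated data, any hypothesized value inside the 95% interval yields \(p > 0.05\), any value outside yields \(p < 0.05\), and the interval endpoints sit exactly at \(p = 0.05\) (normal critical values throughout):

```r
set.seed(8)
n <- 200
x <- rnorm(n)
y <- 1 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

b  <- coef(fit)["x"]
se <- sqrt(vcov(fit)["x", "x"])
ci <- b + c(-1, 1) * qnorm(0.975) * se

## p-value of H0: beta = b0 (normal approximation)
p_of <- function(b0) 2 * pnorm(-abs((b - b0) / se))

c(p_inside  = p_of(mean(ci)),    # a value inside the CI: p > 0.05
  p_outside = p_of(ci[2] + 1))   # a value outside the CI: p < 0.05
```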
2 Testing multiple restrictions: the F-test
Individual t-tests tell you about one coefficient at a time. But sometimes you need a joint test. Does rank matter at all? That means testing two coefficients simultaneously (since rank is a three-level factor with two dummies):
\[H_0: \beta_{\text{AssocProf}} = \beta_{\text{Prof}} = 0\]
2.1 By hand: comparing two regressions
The F-test compares the fit of an unrestricted model to a restricted model that imposes \(H_0\):
\[F = \frac{(\text{SSE}_R - \text{SSE}_U) / q}{\text{SSE}_U / (n - k)}\]
where \(q\) is the number of restrictions.
## Unrestricted model (full)
mod_U <- mod
## Restricted model: drop rank
mod_R <- lm(salary ~ discipline + yrs.since.phd + yrs.service + sex,
data = Salaries)
## Manual F-stat
SSE_U <- sum(resid(mod_U)^2)
SSE_R <- sum(resid(mod_R)^2)
q <- 2 # two restrictions
n <- nobs(mod_U)
k <- length(coef(mod_U))
F_stat <- ((SSE_R - SSE_U) / q) / (SSE_U / (n - k))
p_val <- 1 - pf(F_stat, q, n - k)
cat("F =", round(F_stat, 2), " df1 =", q, " df2 =", n - k,
    " p =", format.pval(p_val), "\n")
F = 68.41  df1 = 2  df2 = 390  p = <2e-16
2.2 The easy way: anova() and linearHypothesis()
## Nested model comparison
anova(mod_R, mod_U)
Analysis of Variance Table
Model 1: salary ~ discipline + yrs.since.phd + yrs.service + sex
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 392 2.68e+11
2 390 1.98e+11 2 6.95e+10 68.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Same test via linearHypothesis (with robust SEs)
linearHypothesis(mod, c("rankAssocProf = 0", "rankProf = 0"),
vcov. = vcovHC(mod, type = "HC1"))
Linear hypothesis test:
rankAssocProf = 0
rankProf = 0
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 392
2 390 2 109 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that anova() uses the classical (homoskedastic) F-test, while linearHypothesis() with a robust vcov. gives the heteroskedasticity-robust Wald test. In this example they agree; with strongly heteroskedastic data they might not.
2.3 Equality constraints
Does the return to years since PhD equal the return to years of service? This is \(H_0: \beta_{\text{yrs.phd}} = \beta_{\text{yrs.service}}\):
linearHypothesis(mod, "yrs.since.phd = yrs.service",
vcov. = vcovHC(mod, type = "HC1"))
Linear hypothesis test:
yrs.since.phd - yrs.service = 0
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 391
2 390 1 2.92 0.088 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This works because linearHypothesis() can test any linear restriction \(\mathbf{R}\boldsymbol{\beta} = \mathbf{r}\). Under the hood it constructs the restriction matrix \(\mathbf{R}\) and computes: \[W = (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})' [\mathbf{R}\hat{\mathbf{V}}\mathbf{R}']^{-1} (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}) \;\xrightarrow{d}\; \chi^2_q \tag{1}\]
Theorem 1 (Wald Statistic) For testing \(H_0: R\beta = r\), the Wald statistic is \(W = (R\hat\beta - r)'[R\hat{V}R']^{-1}(R\hat\beta - r) \xrightarrow{d} \chi^2_q\) under \(H_0\), where \(q\) is the number of restrictions. With robust \(\hat{V}\), the test is valid under heteroskedasticity.
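Theorem 1 translates directly into a few lines of code. The sketch below (with a hypothetical helper wald_test(); linearHypothesis() adds formula parsing and F-scaling on top of the same algebra) verifies that for a single coefficient the Wald statistic reduces to the squared t-statistic:

```r
## Wald statistic for H0: R beta = r, with covariance estimate Vhat
wald_test <- function(beta, Vhat, R, r) {
  R <- matrix(R, ncol = length(beta))
  d <- R %*% beta - r
  W <- drop(t(d) %*% solve(R %*% Vhat %*% t(R)) %*% d)
  q <- nrow(R)
  c(W = W, df = q, p = pchisq(W, df = q, lower.tail = FALSE))
}

## Check: one restriction on the slope equals the squared t-statistic
set.seed(3)
x <- rnorm(100); y <- 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
w <- wald_test(coef(fit), vcov(fit), R = c(0, 1), r = 0)
t_sq <- (coef(fit)["x"] / sqrt(vcov(fit)["x", "x"]))^2
```

Passing a robust covariance matrix as Vhat gives the heteroskedasticity-robust version of the same test.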
3 The test trinity: Wald, Score, and F
Three classical approaches test the same null \(H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}\):
| Test | Estimates from | Measures |
|---|---|---|
| Wald | Unrestricted model | How far \(\hat{\boldsymbol{\beta}}\) is from \(H_0\) |
| Score (LM) | Restricted model | Whether the objective function wants to move away from \(H_0\) |
| F / LR | Both models | How much the fit worsens when we impose \(H_0\) |
For linear restrictions in a normal linear model, they are monotone transformations of each other and always give the same accept/reject decision. Let’s verify:
## Test: rank doesn't matter (q = 2 restrictions)
## Wald (homoskedastic, for comparability)
W <- linearHypothesis(mod_U, c("rankAssocProf = 0", "rankProf = 0"))
W_stat <- W$F[2] * q # linearHypothesis reports F; multiply by q for chi-sq scale
## F from anova
Ftest <- anova(mod_R, mod_U)
## Score test (LM): use restricted residuals
## S = n * (1 - SSE_U / SSE_R) under homoskedasticity
S <- n * (1 - SSE_U / SSE_R)
cat("Wald statistic (chi-sq scale):", round(W_stat, 3), "\n")
Wald statistic (chi-sq scale): 136.8
cat("F statistic (x q):           ", round(Ftest$F[2] * q, 3), "\n")
F statistic (x q):            136.8
cat("Score statistic:             ", round(S, 3), "\n")
Score statistic:              103.1
The Wald and F-based statistics coincide exactly here — both use the unrestricted variance estimate \(s^2\), so linearHypothesis() with the default covariance reproduces the anova() F-test. The Score statistic is smaller because it is built from the restricted variance estimate \(\tilde{\sigma}^2\). Asymptotically all three converge. Under normality the F statistic has an exact \(F_{q, n-k}\) distribution, whose critical values exceed \(\chi^2_q / q\) in finite samples — the F-test is slightly more conservative, which is why anova() uses \(F_{q, n-k}\) rather than chi-square critical values.
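The conservativeness is visible in the critical values themselves: \(q \cdot F_{q, \text{df}_2}(0.95)\) exceeds \(\chi^2_q(0.95)\) for any finite denominator degrees of freedom and converges to it as \(\text{df}_2\) grows. A quick check in base R:

```r
q <- 2
chisq_crit <- qchisq(0.95, df = q)

## chi-square-scale critical value of the F test with denominator df2
f_crit <- function(df2) q * qf(0.95, df1 = q, df2 = df2)

round(c(chisq   = chisq_crit,
        F_df30  = f_crit(30),     # small sample: noticeably larger
        F_df390 = f_crit(390),    # our salary regression
        F_df1e5 = f_crit(1e5)), 3)
```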
3.1 When do they diverge?
For non-normal models, the trinity can disagree because the three test statistics are no longer monotone transformations of each other. Here is an example using a joint test in the probit model from Chapter 7:
data(Mroz)
probit <- glm(lfp ~ k5 + k618 + age + wc + lwg + inc,
family = binomial(link = "probit"), data = Mroz)
## Test H0: k5 = 0, k618 = 0 (joint, q = 2)
probit_r <- glm(lfp ~ age + wc + lwg + inc,
family = binomial(link = "probit"), data = Mroz)
## Wald
W_probit <- linearHypothesis(probit, c("k5 = 0", "k618 = 0"))
## LR
LR <- -2 * as.numeric(logLik(probit_r) - logLik(probit))
## Score (Rao)
S_probit <- anova(probit_r, probit, test = "Rao")
cat("Wald chi-sq: ", round(W_probit$Chisq[2], 2), "\n")
Wald chi-sq:  58.45
cat("LR chi-sq:   ", round(LR, 2), "\n")
LR chi-sq:    65.77
cat("Score chi-sq:", round(S_probit$Rao[2], 2), "\n")
Score chi-sq: 65.03
The three statistics differ noticeably: the Wald statistic (58.5) is substantially smaller than the LR (65.8) and Score (65.0). All three reject decisively here, but in borderline cases they could lead to different conclusions. The LR test tends to be the most reliable in nonlinear models because it directly measures the change in the log-likelihood.
4 Confidence regions and test inversion
A 95% confidence interval is the set of values \(\theta_0\) that would not be rejected at the 5% level:
\[\hat{C} = \{\theta_0 : |T(\theta_0)| \leq 1.96\}\]
This is test inversion: a 95% confidence set is exactly the set of parameter values not rejected at the 5% level. For a single linear coefficient, test inversion reproduces the familiar \(\hat{\beta} \pm 1.96 \cdot \widehat{\text{se}}\). For nonlinear functions of parameters, inverting the test over a grid (Fieller’s method, treated below) is more reliable than the delta method.
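Test inversion can be carried out literally: scan a grid of null values and keep those the t-test fails to reject. For a linear coefficient this recovers \(\hat\beta \pm 1.96 \cdot \widehat{\text{se}}\), as the following sketch on simulated data confirms:

```r
set.seed(21)
n <- 150
x <- rnorm(n)
y <- 2 * x + rnorm(n)
fit <- lm(y ~ x)
b  <- coef(fit)["x"]
se <- sqrt(vcov(fit)["x", "x"])

## Keep every theta0 with |t(theta0)| <= 1.96 (normal critical value)
grid <- seq(b - 5 * se, b + 5 * se, length.out = 4001)
keep <- abs((b - grid) / se) <= qnorm(0.975)
inverted_ci <- range(grid[keep])

analytic_ci <- b + c(-1, 1) * qnorm(0.975) * se
```

The grid-based interval matches the analytic one up to the grid spacing; the same loop works unchanged when no closed form exists.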
4.1 Confidence ellipses for two coefficients
For two parameters tested jointly, the confidence region is an ellipse — the set of \((\beta_1, \beta_2)\) values consistent with the Wald test:
## Joint confidence region for the two experience variables
confidenceEllipse(mod, which.coef = c("yrs.since.phd", "yrs.service"),
vcov. = vcovHC(mod, type = "HC1"),
main = "95% joint confidence region",
xlab = "Coefficient on yrs.since.phd",
ylab = "Coefficient on yrs.service",
col = "steelblue", lwd = 2)
abline(h = 0, lty = 2, col = "gray50")
abline(v = 0, lty = 2, col = "gray50")
## Add the 45-degree line for equality
abline(a = 0, b = 1, lty = 3, col = "coral", lwd = 1.5)
legend("topright", legend = c("95% joint region", "equality line"),
col = c("steelblue", "coral"), lty = c(1, 3), lwd = c(2, 1.5))
The dashed lines at zero show the individual null values. The coral line shows where the two coefficients are equal. If the ellipse crosses a line, the corresponding null cannot be rejected.
4.2 Test inversion for a nonlinear parameter: Fieller’s method
Suppose we want a confidence interval for the ratio \(\theta = \beta_{\text{Prof}} / \beta_{\text{AssocProf}}\) — how many times larger is the full professor salary premium than the associate professor premium? The delta method (Chapter 8) gives a quick CI, but it can be unreliable when the denominator is imprecisely estimated.
Fieller’s trick: rewrite \(\theta = \beta_1 / \beta_2\) as the linear restriction \(\beta_1 - \theta \beta_2 = 0\), then invert the Wald test over a grid of \(\theta\) values:
b <- coef(mod)
V <- vcovHC(mod, type = "HC1")
## Delta method CI for comparison
dm <- deltaMethod(mod, "rankProf / rankAssocProf",
vcov. = vcovHC(mod, type = "HC1"))
dm_ci <- c(dm$Estimate - 1.96 * dm$SE, dm$Estimate + 1.96 * dm$SE)
## Grid search: invert the Wald test
theta_grid <- seq(0, 8, by = 0.01)
idx_prof <- which(names(b) == "rankProf")
idx_assoc <- which(names(b) == "rankAssocProf")
in_CI <- sapply(theta_grid, function(th) {
r_vec <- numeric(length(b))
r_vec[idx_prof] <- 1
r_vec[idx_assoc] <- -th
num <- (sum(r_vec * b))^2
den <- t(r_vec) %*% V %*% r_vec
num / den <= qchisq(0.95, 1)
})
fieller_ci <- range(theta_grid[in_CI])
cat("Point estimate:  ", round(dm$Estimate, 3), "\n")
Point estimate:   3.491
cat("Delta method CI: ", round(dm_ci, 3), "\n")
Delta method CI:  2.602 4.381
cat("Fieller CI:      ", round(fieller_ci, 3), "\n")
Fieller CI:       2.81 4.79
Here the two CIs are similar because the denominator (\(\beta_{\text{AssocProf}}\)) is precisely estimated. The ratio tells us that the full professor premium is about 3.5 times the associate professor premium.
## Plot the Wald statistic over theta
wald_vals <- sapply(theta_grid, function(th) {
r_vec <- numeric(length(b))
r_vec[idx_prof] <- 1
r_vec[idx_assoc] <- -th
num <- (sum(r_vec * b))^2
den <- c(t(r_vec) %*% V %*% r_vec)
num / den
})
df_fieller <- data.frame(theta = theta_grid, W = wald_vals)
ggplot(df_fieller, aes(theta, W)) +
geom_line(color = "steelblue", linewidth = 0.8) +
geom_hline(yintercept = qchisq(0.95, 1), linetype = "dashed", color = "coral") +
annotate("text", x = 6.5, y = qchisq(0.95, 1) + 0.5, label = "95% critical value",
color = "coral") +
coord_cartesian(ylim = c(0, 15)) +
labs(title = "Test inversion: Wald statistic as a function of θ₀",
subtitle = "The Fieller CI is the set of θ₀ values below the dashed line",
x = expression(theta[0] == beta[Prof] / beta[AssocProf]),
y = "Wald statistic")
4.3 When Fieller and the delta method disagree
The real value of Fieller’s method appears when the denominator is imprecise. Consider \(\theta = \beta_{\text{yrs.phd}} / \beta_{\text{yrs.service}}\) — the ratio of the two experience coefficients. Both are individually insignificant:
## Experience coefficients are imprecise
cat("yrs.since.phd: estimate =", round(b["yrs.since.phd"]),
    " SE =", round(sqrt(V["yrs.since.phd", "yrs.since.phd"])), "\n")
yrs.since.phd: estimate = 535  SE = 312
cat("yrs.service:   estimate =", round(b["yrs.service"]),
    " SE =", round(sqrt(V["yrs.service", "yrs.service"])), "\n")
yrs.service:   estimate = -490  SE = 307
## Delta method gives a finite CI
dm2 <- deltaMethod(mod, "yrs.since.phd / yrs.service",
vcov. = vcovHC(mod, type = "HC1"))
cat("\nDelta method CI:", round(c(dm2$Estimate - 1.96 * dm2$SE,
dm2$Estimate + 1.96 * dm2$SE), 2), "\n")
Delta method CI: -1.75 -0.43
## Fieller: check a wide grid
theta_wide <- seq(-50, 50, by = 0.1)
idx_phd <- which(names(b) == "yrs.since.phd")
idx_svc <- which(names(b) == "yrs.service")
in_CI2 <- sapply(theta_wide, function(th) {
r_vec <- numeric(length(b))
r_vec[idx_phd] <- 1
r_vec[idx_svc] <- -th
num <- (sum(r_vec * b))^2
den <- c(t(r_vec) %*% V %*% r_vec)
num / den <= qchisq(0.95, 1)
})
## Check if CI covers entire grid (unbounded)
if (all(in_CI2)) {
cat("Fieller CI: (-∞, +∞) [unbounded!]\n")
} else if (any(in_CI2)) {
cat("Fieller CI:", range(theta_wide[in_CI2]), "\n")
}
Fieller CI: (-∞, +∞) [unbounded!]
The delta method reports a finite CI of width ~1.3, but the Fieller CI is unbounded — the entire real line. This is the correct answer: since \(\beta_{\text{yrs.service}}\) is not significantly different from zero, the ratio could be anything. The delta method’s finite interval is overconfident.
5 Multiple testing
When you look at many coefficients, some will be “significant” by chance.
5.1 Simulation: false discoveries under the null
To make this concrete: generate 20 pure-noise regressors, test each at 5%, and count how many reject:
set.seed(303)
n <- 200
k_noise <- 20
Y <- rnorm(n)
X_noise <- matrix(rnorm(n * k_noise), nrow = n)
colnames(X_noise) <- paste0("X", 1:k_noise)
dat_noise <- data.frame(Y, X_noise)
mod_noise <- lm(Y ~ ., data = dat_noise)
pvals <- summary(mod_noise)$coefficients[-1, 4] # drop intercept
cat("Number of regressors with p < 0.05:", sum(pvals < 0.05), "out of", k_noise, "\n")
Number of regressors with p < 0.05: 1 out of 20
cat("Significant variables:", paste(names(pvals[pvals < 0.05]), collapse = ", "), "\n")
Significant variables: X12
Under the global null, we expect \(20 \times 0.05 = 1\) false rejection on average. Repeat this many times to see the problem clearly:
set.seed(42)
B <- 5000
n <- 200
k_noise <- 20
false_rejections <- replicate(B, {
Y <- rnorm(n)
X <- matrix(rnorm(n * k_noise), nrow = n)
pv <- summary(lm(Y ~ X))$coefficients[-1, 4]
sum(pv < 0.05)
})
cat("P(at least one false rejection):", mean(false_rejections >= 1), "\n")
P(at least one false rejection): 0.6204
cat("Expected:", round(1 - (1 - 0.05)^k_noise, 3), "\n")
Expected: 0.642
ggplot(data.frame(rejections = false_rejections), aes(rejections)) +
geom_bar(fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = 1, linetype = "dashed", color = "coral") +
labs(title = "False rejections when all 20 nulls are true",
subtitle = paste0("P(≥1 false rejection) = ",
round(mean(false_rejections >= 1), 3)),
x = "Number of rejections at α = 0.05",
y = "Frequency")
With 20 tests at 5%, you have about a 64% chance of at least one false rejection. This is the familywise error rate (FWER) problem.
With \(k\) independent tests at level \(\alpha\), the probability of at least one false rejection is \(1 - (1 - \alpha)^k\). With 20 tests at 5%, this exceeds 60%. Use Holm’s correction (uniformly more powerful than Bonferroni) to control the FWER.
5.2 Corrections: Bonferroni and Holm
The Bonferroni correction rejects the \(j\)th hypothesis only if \(p_j < \alpha / k\). It controls the FWER for any dependence structure:
## Back to our salary model
pvals_salary <- summary(mod)$coefficients[-1, 4]
corrections <- data.frame(
raw = round(pvals_salary, 4),
bonferroni = round(p.adjust(pvals_salary, method = "bonferroni"), 4),
holm = round(p.adjust(pvals_salary, method = "holm"), 4)
)
corrections
              raw bonferroni   holm
rankAssocProf 0.0020 0.0119 0.0079
rankProf 0.0000 0.0000 0.0000
disciplineB 0.0000 0.0000 0.0000
yrs.since.phd 0.0270 0.1619 0.0643
yrs.service 0.0214 0.1286 0.0643
sexMale 0.2158 1.0000 0.2158
Holm’s method is uniformly more powerful than Bonferroni while still controlling the FWER. It works by ordering the p-values from smallest to largest and comparing the \(j\)th smallest to \(\alpha / (k - j + 1)\). Use Holm whenever you would use Bonferroni.
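Holm’s step-down rule is short enough to implement by hand. The sketch below (with a hypothetical helper holm_adjust()) reproduces p.adjust(..., method = "holm"): sort the p-values, multiply the \(j\)th smallest by \(k - j + 1\), and enforce monotonicity with a running maximum:

```r
holm_adjust <- function(p) {
  k <- length(p)
  o <- order(p)
  ## step-down multipliers k, k-1, ..., 1; cummax keeps adjusted p's monotone
  adj <- pmin(1, cummax((k - seq_len(k) + 1) * p[o]))
  out <- numeric(k)
  out[o] <- adj       # put adjusted values back in the original order
  out
}

p <- c(0.001, 0.011, 0.02, 0.04, 0.3)
cbind(raw = p, by_hand = holm_adjust(p),
      p_adjust = p.adjust(p, method = "holm"))
```

Comparing the adjusted \(p\)-value to \(\alpha\) is equivalent to comparing the raw \(p\)-value to \(\alpha/(k - j + 1)\), so this matches the description above.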
5.3 When to worry about multiple testing
Multiple testing corrections matter when you:
- Examine many coefficients and report the “most significant”
- Try many specifications and present the one that “works”
- Test across subgroups (by age, gender, region, …)
- Use stepwise selection procedures
They matter less when you have a small number of pre-specified hypotheses motivated by theory.
6 Power: can you detect the effect?
A test’s power is the probability it rejects when \(H_0\) is false:
\[\text{Power}(\theta) = P[\text{Reject } H_0 \mid \theta \neq \theta_0]\]
Power depends on four things: the true effect size, the standard error, the significance level, and the sample size. In OLS, for a two-sided test of \(H_0: \beta = 0\):
\[\delta = \frac{|\beta|}{\text{se}(\hat{\beta})} \approx \frac{|\beta| \cdot \text{sd}(X) \cdot \sqrt{n}}{\sigma_e}\]
Power \(\approx \Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2})\), where \(\delta\) is the signal-to-noise ratio.
Theorem 2 (Power of a Two-Sided Test) For testing \(H_0: \beta = 0\) at level \(\alpha\), the power against alternative \(\beta \neq 0\) is approximately \(\Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2})\), where \(\delta = |\beta|/\text{SE}(\hat\beta)\) is the signal-to-noise ratio.
## Power for a two-sided t-test in OLS
power_ols <- function(effect, sigma_e, sd_x, n, alpha = 0.05) {
se <- sigma_e / (sd_x * sqrt(n))
delta <- abs(effect) / se
z <- qnorm(1 - alpha / 2)
pnorm(delta - z) + pnorm(-delta - z)
}
6.1 Power curves
How does power vary with sample size and effect size?
n_seq <- seq(20, 500, by = 5)
effects <- c(0.05, 0.10, 0.20, 0.40)
df_power <- do.call(rbind, lapply(effects, function(eff) {
data.frame(
n = n_seq,
power = sapply(n_seq, function(nn) power_ols(eff, sigma_e = 1, sd_x = 1, nn)),
effect = paste0("β = ", eff)
)
}))
ggplot(df_power, aes(n, power, color = effect)) +
geom_line(linewidth = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray50") +
annotate("text", x = 480, y = 0.82, label = "80% power", color = "gray50") +
geom_hline(yintercept = 0.05, linetype = "dotted", color = "gray70") +
annotate("text", x = 480, y = 0.07, label = "α = 0.05", color = "gray70") +
scale_color_brewer(palette = "Set1") +
labs(title = "Power curves for a two-sided OLS t-test",
subtitle = "σ_e = 1, sd(X) = 1",
x = "Sample size (n)", y = "Power", color = "True effect") +
coord_cartesian(ylim = c(0, 1))
Small effects require enormous samples. With \(\beta = 0.10\) and \(\sigma_e / \text{sd}(X) = 1\), you need roughly \(n = 800\) for 80% power.
A study should have at least 80% power to detect the smallest effect of scientific interest. For a two-sided 5% test this requires roughly \(n \approx 8\sigma_e^2 / (\beta^2 \cdot \text{Var}(X))\): the needed signal-to-noise ratio is \(z_{0.975} + z_{0.80} = 1.96 + 0.84 \approx 2.8\), and \(2.8^2 \approx 8\). Adding relevant controls reduces \(\sigma_e\) and boosts power for free.
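Solving the power formula for \(n\) gives a one-line sample-size calculator. This sketch uses the normal approximation and drops the second term \(\Phi(-\delta - z_{\alpha/2})\), which is negligible at conventional power levels:

```r
## Smallest n giving the target power for a two-sided test of H0: beta = 0
n_for_power <- function(effect, sigma_e, sd_x, power = 0.80, alpha = 0.05) {
  delta_needed <- qnorm(1 - alpha / 2) + qnorm(power)   # z_{alpha/2} + z_{power}
  ceiling((delta_needed * sigma_e / (abs(effect) * sd_x))^2)
}

n_for_power(effect = 0.10, sigma_e = 1, sd_x = 1)  # roughly 800, as in the text
```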
6.2 Simulation: observing power
Let’s verify the formula with a Monte Carlo simulation:
set.seed(12)
B <- 2000
true_beta <- 0.3
sigma_e <- 1
n_vals <- c(20, 50, 100, 200)
sim_power <- function(n, B) {
rejections <- replicate(B, {
x <- rnorm(n)
y <- true_beta * x + rnorm(n, sd = sigma_e)
pval <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
pval < 0.05
})
mean(rejections)
}
results <- data.frame(
n = n_vals,
simulated = sapply(n_vals, sim_power, B = B),
formula = sapply(n_vals, function(nn) power_ols(true_beta, sigma_e, 1, nn))
)
results
    n simulated formula
1 20 0.2290 0.2687
2 50 0.5575 0.5641
3 100 0.8250 0.8508
4 200 0.9880 0.9888
The simulated power matches the formula closely.
6.3 Controls boost power
Adding a relevant control variable reduces \(\sigma_e\) without reducing \(\text{sd}(X)\), giving a free power boost:
set.seed(99)
B <- 2000
n <- 100
beta_x <- 0.2
beta_z <- 1.0 # strong control
## Without control
power_no_ctrl <- mean(replicate(B, {
x <- rnorm(n); z <- rnorm(n)
y <- beta_x * x + beta_z * z + rnorm(n)
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
}))
## With control
power_with_ctrl <- mean(replicate(B, {
x <- rnorm(n); z <- rnorm(n)
y <- beta_x * x + beta_z * z + rnorm(n)
summary(lm(y ~ x + z))$coefficients["x", "Pr(>|t|)"] < 0.05
}))
cat("Power without control:", round(power_no_ctrl, 3), "\n")
Power without control: 0.294
cat("Power with control:   ", round(power_with_ctrl, 3), "\n")
Power with control:    0.5
cat("Residual SD without z:", round(sqrt(beta_z^2 + 1), 2), "\n")
Residual SD without z: 1.41
cat("Residual SD with z:   ", round(1, 2), "\n")
Residual SD with z:    1
The control cuts the residual variance in half (the SD falls from 1.41 to 1, a factor of \(\sqrt{2}\)), raising power from 0.29 to 0.50. This is why pre-treatment covariates help in experiments even though they are unnecessary for unbiasedness.
6.4 Joint tests have less power
Testing more restrictions simultaneously dilutes power. Fix the total non-centrality parameter and compare:
lambda_seq <- seq(0, 15, by = 0.1)
df_joint <- do.call(rbind, lapply(c(1, 2, 5, 10), function(q) {
data.frame(
lambda = lambda_seq,
power = 1 - pchisq(qchisq(0.95, q), q, ncp = lambda_seq),
q = paste0("q = ", q)
)
}))
ggplot(df_joint, aes(lambda, power, color = q)) +
geom_line(linewidth = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray50") +
scale_color_brewer(palette = "Set1") +
labs(title = "Power vs. non-centrality for joint tests",
subtitle = "More restrictions require a larger signal to achieve the same power",
x = expression(paste("Non-centrality parameter ", lambda)),
y = "Power", color = "Restrictions")
A single-coefficient t-test is more powerful than a joint F-test that lumps in other restrictions. Test exactly what you care about.
7 Statistical vs. economic significance
With a large enough sample, any nonzero coefficient becomes statistically significant. This is test consistency: \(P[\text{Reject}] \to 1\) whenever \(\beta \neq 0\).
set.seed(77)
true_beta <- 0.01 # tiny effect
n_vals <- c(50, 200, 1000, 5000, 20000)
B <- 1000
rejection_rate <- sapply(n_vals, function(nn) {
mean(replicate(B, {
x <- rnorm(nn)
y <- true_beta * x + rnorm(nn)
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
}))
})
df_consist <- data.frame(n = n_vals, rejection_rate = rejection_rate)
ggplot(df_consist, aes(n, rejection_rate)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_hline(yintercept = 0.05, linetype = "dashed", color = "coral") +
scale_x_log10() +
labs(title = "Test consistency: even β = 0.01 gets rejected eventually",
subtitle = "With enough data, statistical significance is guaranteed for any nonzero effect",
x = "Sample size (log scale)", y = "Rejection rate at α = 0.05")
A \(p\)-value tells you whether the data are inconsistent with \(H_0\). It tells you nothing about whether the effect is large enough to matter. Always report the magnitude alongside the significance.
8 Applied workflow: the salary example
Let’s bring everything together with a thorough analysis of the professor salary data.
8.1 Step 1: Fit and inspect
mod_full <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex,
data = Salaries)
coeftest(mod_full, vcov. = vcovHC(mod_full, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65955 2896 22.78 < 2e-16 ***
rankAssocProf 12908 2206 5.85 1.0e-08 ***
rankProf 45066 3285 13.72 < 2e-16 ***
disciplineB 14418 2316 6.22 1.2e-09 ***
yrs.since.phd 535 312 1.71 0.088 .
yrs.service -490 307 -1.60 0.111
sexMale 4784 2396 2.00 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8.2 Step 2: Key hypotheses
Does rank matter? (joint test, \(q = 2\))
linearHypothesis(mod_full, c("rankAssocProf = 0", "rankProf = 0"),
vcov. = vcovHC(mod_full, type = "HC1"))
Linear hypothesis test:
rankAssocProf = 0
rankProf = 0
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 392
2 390 2 109 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Does discipline matter? (single restriction)
linearHypothesis(mod_full, "disciplineB = 0",
vcov. = vcovHC(mod_full, type = "HC1"))
Linear hypothesis test:
disciplineB = 0
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 391
2 390 1 38.7 1.2e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Are the two experience measures interchangeable? (\(\beta_{\text{phd}} = \beta_{\text{svc}}\))
linearHypothesis(mod_full, "yrs.since.phd = yrs.service",
vcov. = vcovHC(mod_full, type = "HC1"))
Linear hypothesis test:
yrs.since.phd - yrs.service = 0
Model 1: restricted model
Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 391
2 390 1 2.92 0.088 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8.3 Step 3: Confidence intervals
## Robust CIs
ci_robust <- coefci(mod_full, vcov. = vcovHC(mod_full, type = "HC1"))
ci_robust
                  2.5 %  97.5 %
(Intercept) 60261.83 71648.6
rankAssocProf 8571.09 17244.1
rankProf 38607.31 51524.7
disciplineB 9863.61 18971.6
yrs.since.phd -79.26 1149.4
yrs.service -1092.74 113.7
sexMale 72.96 9494.0
8.4 Step 4: Multiple testing adjustment
If we’re testing all six regressors simultaneously:
pvals_all <- coeftest(mod_full, vcov. = vcovHC(mod_full, type = "HC1"))[-1, 4]
data.frame(
raw_p = round(pvals_all, 4),
holm_p = round(p.adjust(pvals_all, method = "holm"), 4),
significant_raw = ifelse(pvals_all < 0.05, "*", ""),
significant_holm = ifelse(p.adjust(pvals_all, method = "holm") < 0.05, "*", "")
)
              raw_p holm_p significant_raw significant_holm
rankAssocProf 0.0000 0.0000 * *
rankProf 0.0000 0.0000 * *
disciplineB 0.0000 0.0000 * *
yrs.since.phd 0.0876 0.1752
yrs.service 0.1114 0.1752
sexMale 0.0466 0.1397 *
8.5 Step 5: Power assessment for the gender gap
Suppose the “true” gender gap is $5,000 (about 5% of mean salary). Could we detect it?
sigma_e <- sigma(mod_full)
sd_sex <- sd(as.numeric(Salaries$sex == "Male"))
n <- nrow(Salaries)
power_sex <- power_ols(effect = 5000, sigma_e = sigma_e,
sd_x = sd_sex, n = n, alpha = 0.05)
cat("Estimated power to detect a $5,000 gender gap:", round(power_sex, 3), "\n")
Estimated power to detect a $5,000 gender gap: 0.261
## What sample size for 80% power?
n_needed <- seq(100, 5000, by = 10)
pow_curve <- sapply(n_needed, function(nn)
power_ols(5000, sigma_e, sd_sex, nn, 0.05))
n80 <- n_needed[which(pow_curve >= 0.80)[1]]
cat("Sample size for 80% power:", n80, "\n")
Sample size for 80% power: 1800
9 Practical advice (after Hansen)
A checklist for reporting regression results:
Report standard errors, not just t-ratios. Standard errors convey precision; t-statistics reduce everything to a binary yes/no.
Report p-values, not asterisks. The difference between \(p = 0.049\) and \(p = 0.051\) is not meaningful. Stars obscure this.
Focus on substantive hypotheses. Don’t mechanically test every coefficient against zero. Test things that answer a scientific question.
“Fail to reject” \(\neq\) “accept.” Low power can explain non-rejection. Always ask: could we have detected a meaningful effect?
Statistical significance \(\neq\) economic significance. A tiny but significant coefficient may be policy-irrelevant. A large but insignificant coefficient may reflect low power.
Use robust inference by default. There is rarely a good reason to report classical SEs when \(n\) is not tiny. Use HC1/HC2 or the bootstrap.
Account for multiple testing. If you report the “most significant” result from many tests, apply Holm or Bonferroni corrections.
10 Connection to GMM
The Wald test has a natural GMM interpretation. In GMM, testing \(H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}\) is equivalent to testing whether the restricted moment conditions \(E[\mathbf{Z}'(Y - \mathbf{X}\boldsymbol{\beta})] = \mathbf{0}\) with \(\mathbf{R}\boldsymbol{\beta} = \mathbf{r}\) are compatible with the data. The J-test of overidentifying restrictions, which we’ll meet in Chapter 11, generalizes the F-test to the setting where there are more moment conditions than parameters.
| OLS test | GMM counterpart |
|---|---|
| F-test (restricted vs. unrestricted) | Distance test (restricted vs. unrestricted GMM) |
| Wald test | Wald test (same formula, different \(\hat{\mathbf{V}}\)) |
| Score / LM test | Score test from restricted GMM |
| Overall F-statistic | J-test of overidentifying restrictions |