Chapter 15 Model Comparison and Selection in Bayesian Analysis
15.1 Introduction
Bayesian statistical modeling offers a flexible and principled approach to quantifying uncertainty and incorporating prior knowledge. As readers gain familiarity with building a variety of models—ranging from simple linear regressions to complex hierarchical and non-parametric models—a critical question emerges: how do we choose among competing models? Bayesian model comparison and selection are fundamental steps in the modeling process, ensuring that inferences are both robust and interpretable.
Unlike frequentist paradigms, Bayesian methods provide a coherent framework for comparing models by evaluating the plausibility of the observed data under each model, known as the model evidence or marginal likelihood. However, this task is often complicated by practical challenges in estimating the model evidence and in balancing predictive accuracy against model complexity. This chapter explores the key tools available for Bayesian model comparison, including posterior predictive checks, information criteria such as WAIC and LOO-CV, and Bayes factors. Each method offers unique insights, and their strengths and limitations make them suitable for different analytical contexts.
While selecting a single model is a common goal, Bayesian modeling also allows for model averaging and stacking, approaches that combine models to account for model uncertainty and improve predictive robustness. These alternatives are especially useful when no model clearly dominates or when combining perspectives from multiple models provides more comprehensive insights.
15.2 Posterior Predictive Checks
Posterior predictive checks are a fundamental diagnostic tool used to assess the fit of a Bayesian model to the observed data. The idea is to simulate data from the posterior predictive distribution and compare it to the actual observed data. Discrepancies between the simulated and observed data can indicate model misfit or structural inadequacies.
Formally, the posterior predictive distribution is given by:
\(p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) p(\theta \mid y) d\theta\)
where \(\tilde{y}\) represents future or replicated data, \(y\) is the observed data, and \(\theta\) denotes the parameters.
Visualization tools such as histograms, density plots, and test statistics (e.g., discrepancy measures) are commonly used to perform these checks. In practice, R packages like bayesplot make it easy to implement these diagnostics. Although posterior predictive checks are excellent for identifying model fit issues, they do not provide a direct basis for comparing multiple models.
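For instance, a graphical check can be run with posterior_predict() and bayesplot. The sketch below uses hypothetical toy data and a simple stan_glm fit purely to illustrate the mechanics; any rstanarm or brms fit would work the same way.

```r
# Minimal sketch of a graphical posterior predictive check (toy data; hypothetical fit)
library(rstanarm)
library(bayesplot)

set.seed(1)
d   <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- stan_glm(y ~ x, data = d, refresh = 0)

y_rep <- posterior_predict(fit)        # draws from p(y_tilde | y)

ppc_dens_overlay(d$y, y_rep[1:50, ])   # overlay replicated and observed densities
ppc_stat(d$y, y_rep, stat = "sd")      # discrepancy check on a test statistic
```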
While posterior predictive checks help diagnose model adequacy, they are less informative for ranking competing models. For this, we turn to information-theoretic approaches like WAIC and LOO-CV.
15.3 Information Criteria: WAIC and LOO-CV
Two widely used Bayesian model comparison tools are the Watanabe-Akaike Information Criterion (WAIC) and Leave-One-Out Cross-Validation (LOO-CV). Both aim to estimate a model’s out-of-sample predictive performance.
15.3.1 WAIC (Watanabe-Akaike Information Criterion)
WAIC is a fully Bayesian criterion that estimates out-of-sample predictive accuracy while penalizing for model complexity. It is computed as:
\(\text{WAIC} = -2(\text{lppd} - p_{\text{WAIC}})\)
where lppd is the log pointwise predictive density and \(p_{\text{WAIC}}\) is the effective number of parameters, which quantifies the flexibility of the posterior distribution in fitting the data. WAIC is asymptotically equivalent to Bayesian cross-validation and is computable directly from posterior samples.
Unlike classical AIC or BIC, which penalize complexity by parameter count, WAIC adjusts based on how flexible the posterior is in fitting the data. This leads to more nuanced complexity penalties, especially in hierarchical models where the number of effective parameters can be non-integer and context-sensitive.
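To make the formula concrete, the following sketch computes lppd and \(p_{\text{WAIC}}\) directly from a pointwise log-likelihood matrix (rows are posterior draws, columns are observations). The toy fit is hypothetical and serves only to produce such a matrix; loo::waic() performs the same computation and adds standard errors.

```r
library(rstanarm)
library(loo)

# Toy fit (hypothetical) used only to obtain a pointwise log-likelihood matrix
set.seed(1)
d   <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- stan_glm(y ~ x, data = d, refresh = 0)

log_lik_mat <- log_lik(fit)                     # draws x observations

lppd   <- sum(log(colMeans(exp(log_lik_mat))))  # log pointwise predictive density
p_waic <- sum(apply(log_lik_mat, 2, var))       # effective number of parameters
-2 * (lppd - p_waic)                            # WAIC on the deviance scale

waic(log_lik_mat)                               # same quantities, with standard errors
```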
15.3.2 LOO-CV (Leave-One-Out Cross-Validation)
LOO-CV involves fitting the model repeatedly, leaving out one observation at a time and evaluating predictive performance on the omitted data. To avoid the computational burden of refitting the model \(n\) times, Pareto-smoothed importance sampling (PSIS) offers an efficient approximation, implemented in the loo R package.
The loo package also provides diagnostic tools, such as Pareto \(k\) values, to assess the reliability of the importance sampling approximation. Typically, values below 0.7 suggest stable estimates; higher values may indicate influential observations that require model reassessment.
Additionally, LOO-CV provides a natural framework for identifying influential data points. Observations with high \(k\) values may point to model misspecification, data entry errors, or unaccounted-for heterogeneity. Thus, beyond model ranking, LOO-CV offers an avenue for deepening model diagnostics.
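A short sketch of these diagnostics is shown below; the toy fit with heavy-tailed noise is hypothetical and is used only to illustrate how the Pareto \(k\) values are inspected.

```r
library(rstanarm)
library(loo)

# Toy fit with heavy-tailed noise (hypothetical), to illustrate the diagnostics
set.seed(1)
d   <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rt(100, df = 3)
fit <- stan_glm(y ~ x, data = d, refresh = 0)

loo_fit <- loo(fit)                      # PSIS-LOO estimate of predictive accuracy
print(loo_fit)                           # summary includes a Pareto k table

which(pareto_k_values(loo_fit) > 0.7)    # observations with potentially unreliable estimates
plot(loo_fit)                            # Pareto k by observation index
```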
Both WAIC and LOO-CV yield estimates of expected out-of-sample deviance. Lower values indicate better predictive accuracy.
While WAIC and LOO-CV emphasize predictive performance, Bayes factors offer an alternative grounded in model plausibility and marginal likelihood.
15.4 Bayes Factors vs. Information Criteria
Bayes factors provide a coherent Bayesian approach to model comparison by evaluating how well each model explains the observed data, integrating over all parameter values:
\(BF_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}\)
A Bayes factor greater than 1 indicates support for model \(M_1\) over \(M_2\), and vice versa. Interpretation guidelines (Kass & Raftery, 1995) suggest:
- 1 to 3: Anecdotal evidence
- 3 to 10: Moderate evidence
- Greater than 10: Strong evidence
Bayes factors incorporate prior model probabilities and are grounded in decision theory. However, they are often sensitive to the choice of priors and can be computationally intensive due to the need to estimate marginal likelihoods.
Despite their theoretical appeal, Bayes factors suffer from two key limitations. First, the sensitivity to prior distributions — especially priors on nuisance parameters — can lead to unstable or unintuitive conclusions. Second, the marginal likelihood integrates over the entire parameter space, potentially over-penalizing complex models when priors are diffuse. This makes Bayes factors ill-suited to models with weak or uninformative priors unless priors are carefully calibrated.
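This sensitivity can be seen even in a toy conjugate calculation. The sketch below, with hypothetical numbers, compares \(M_0: \theta = 0\) against \(M_1: \theta \sim N(0, \tau^2)\) for a normal mean with known variance; under \(M_1\) the sample mean is marginally \(N(0, \sigma^2/n + \tau^2)\). Widening the prior drives the Bayes factor toward the simpler model even though the data are unchanged.

```r
# Toy illustration (hypothetical numbers) of Bayes factor sensitivity to the prior
# scale tau in a normal-mean model with known sigma:
#   M0: theta = 0   vs   M1: theta ~ Normal(0, tau^2)
# Under M1 the sample mean ybar is marginally Normal(0, sigma^2/n + tau^2).
n     <- 50
sigma <- 1
ybar  <- 0.3

bf01 <- function(tau) {
  dnorm(ybar, 0, sqrt(sigma^2 / n)) /
    dnorm(ybar, 0, sqrt(sigma^2 / n + tau^2))
}

# Widening the prior shifts the evidence toward M0, with the data held fixed
sapply(c(0.1, 1, 10, 100), bf01)
```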
A useful computational shortcut in nested models is the Savage-Dickey density ratio, which allows Bayes factor computation via prior and posterior densities of a parameter of interest.
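A minimal sketch of the Savage-Dickey ratio follows, using simulated draws as a stand-in for MCMC output and assuming a Normal(0, 1) prior on the parameter of interest; both the draws and the prior are assumptions for illustration.

```r
# Savage-Dickey density ratio for a nested comparison (sketch with stand-in draws):
#   M0: delta = 0   nested in   M1: delta ~ Normal(0, 1)   (assumed prior)
set.seed(1)
post_draws <- rnorm(4000, mean = 0.4, sd = 0.15)  # stand-in for posterior draws of delta

# Estimate the posterior density of delta at zero
post_dens    <- density(post_draws)
post_at_zero <- approx(post_dens$x, post_dens$y, xout = 0)$y

# Prior density at zero under the assumed Normal(0, 1) prior
prior_at_zero <- dnorm(0, mean = 0, sd = 1)

bf01 <- post_at_zero / prior_at_zero   # evidence for delta = 0 over the full model
bf10 <- 1 / bf01                       # evidence for the full model
```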
In contrast, WAIC and LOO-CV focus on predictive accuracy and are generally more robust to prior specification, making them preferable in prediction-oriented tasks.
15.5 Summary of Bayesian Model Comparison Methods
| Method | Goal | Sensitive to Priors | Penalizes Complexity | Emphasis | Tools (R Packages) | Use Case |
|---|---|---|---|---|---|---|
| Posterior predictive checks | Assess model fit | No | No | Model adequacy | bayesplot, rstanarm | Diagnosing misfit or model structure |
| WAIC | Estimate predictive accuracy | Mildly | Yes | Out-of-sample prediction | loo, rstanarm | Model ranking for prediction |
| LOO-CV (PSIS) | Estimate predictive accuracy | Mildly | Yes | Out-of-sample prediction | loo, rstanarm | Robust prediction comparison |
| Bayes factors | Marginal likelihood comparison | Yes | Implicit | Model evidence | bridgesampling, BayesFactor | Hypothesis testing, model plausibility |
15.6 Practical Example: Comparing Models in R
To illustrate these concepts, consider the following example using a synthetic dataset. We compare two models: a simple linear regression and a polynomial regression.
```r
library(rstanarm)
library(loo)
library(bayesplot)
# Simulate data
set.seed(123)
x <- rnorm(100)
y <- 2 * x + x^2 + rnorm(100)
data <- data.frame(x = x, y = y)
# Fit linear and quadratic models
fit_linear <- stan_glm(y ~ x, data = data, refresh = 0)
fit_quad <- stan_glm(y ~ x + I(x^2), data = data, refresh = 0)
# Posterior predictive checks
pp_check(fit_linear)
pp_check(fit_quad)
# WAIC comparison
waic_linear <- waic(fit_linear)
waic_quad <- waic(fit_quad)
print(waic_linear)
print(waic_quad)
# LOO-CV comparison
loo_linear <- loo(fit_linear)
loo_quad <- loo(fit_quad)
print(loo_compare(loo_linear, loo_quad))
# Bayes factor using bridgesampling
library(bridgesampling)
# Refit the models with a diagnostic_file so that bridge_sampler() can access
# the information it needs from the rstanarm fits
fit_linear <- stan_glm(
y ~ x,
data = data,
refresh = 0,
diagnostic_file = file.path(tempdir(), "linear_diag.csv")
)
fit_quad <- stan_glm(
y ~ x + I(x^2),
data = data,
refresh = 0,
diagnostic_file = file.path(tempdir(), "quad_diag.csv")
)
# Run bridge sampling
bridge_linear <- bridge_sampler(fit_linear)
bridge_quad <- bridge_sampler(fit_quad)
# Compute the Bayes factor of the quadratic model over the linear model
bf_result <- bf(bridge_quad, bridge_linear)
print(bf_result)
```
In situations where neither model clearly dominates, or when the two offer complementary insights, Bayesian stacking or model averaging can combine predictive strengths rather than forcing the choice of a single model. The loo package also supports computing stacking weights, enabling ensemble predictive performance evaluation; a minimal sketch follows.
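The sketch below reuses the PSIS-LOO objects computed in the example above; loo_model_weights() returns stacking weights by default, and pseudo-BMA weights are shown as an alternative.

```r
# Stacking weights from the PSIS-LOO objects computed above
wts <- loo_model_weights(
  list(linear = loo_linear, quadratic = loo_quad),
  method = "stacking"
)
print(wts)

# Pseudo-BMA weights as an alternative weighting scheme
loo_model_weights(list(loo_linear, loo_quad), method = "pseudobma")
```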
15.7 Conclusion
No single method is universally superior; the choice depends on the modeling context, computational resources, and analytic goals. Posterior predictive checks serve as intuitive diagnostics for model adequacy, while information criteria like WAIC and LOO-CV offer principled approaches for evaluating predictive performance. Bayes factors provide an alternative rooted in model evidence, though they require careful prior specification and substantial computation. By incorporating these tools into their workflow, practitioners can enhance the credibility, interpretability, and predictive utility of their Bayesian analyses.
A reasonable workflow might begin with posterior predictive checks to assess absolute model fit, followed by WAIC or LOO-CV to evaluate predictive utility. Bayes factors can be reserved for settings where priors are carefully calibrated and the goal is hypothesis testing. When model choice remains ambiguous, averaging via stacking provides a principled compromise. Ultimately, careful model comparison is as much about understanding the modeling context as it is about applying formulas.