Chapter 5 Bayesian Regularized Regression
5.1 Introduction
In this chapter, we explore Bayesian analogues of regularized linear regression models such as the LASSO and ridge regression, which extend traditional linear regression to handle challenging modeling scenarios. These methods are particularly valuable for improving prediction accuracy, estimating regression models with many predictors, and providing alternatives to traditional model selection procedures.
The foundation of our discussion rests on the standard linear regression model \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\mathbf{y} \in \mathbb{R}^n\) represents the response vector, \(\mathbf{X} \in \mathbb{R}^{n \times p}\) is the design matrix, \(\boldsymbol{\beta} \in \mathbb{R}^p\) contains the coefficients of interest, and \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\) captures the random error. Ordinary least squares estimates each coefficient without pooling information across parameters, which yields unbiased estimates. However, the absence of any shrinkage can produce high-variance estimates and poor predictive performance in settings with limited data, high-dimensional predictor spaces, or substantial noise.
Regularized linear regression models deliberately introduce bias into coefficient estimates by pooling information across parameters, causing them to shrink toward a common value, typically zero. This shrinkage represents a fundamental trade-off captured by the bias-variance decomposition: \(\text{MSE}(\hat{\boldsymbol{\beta}}) = \text{Bias}^2(\hat{\boldsymbol{\beta}}) + \text{Var}(\hat{\boldsymbol{\beta}})\). While regularization increases bias, it often reduces variance substantially enough that the overall mean squared error decreases, particularly in high-dimensional or noisy settings where traditional methods tend to overfit.
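To make this trade-off concrete, the short simulation below repeatedly refits ordinary least squares and a ridge estimator to fresh noise and decomposes their estimation error into squared bias and variance. It is a minimal sketch: the design size, true coefficients, noise level, and penalty are arbitrary illustrative choices, unrelated to the example developed later in the chapter.
# Minimal sketch of the bias-variance trade-off: OLS vs. a ridge estimator.
# All settings here (dimensions, noise level, penalty) are illustrative.
set.seed(1)
p_sim <- 10; n_sim <- 20; lambda_sim <- 5; n_rep <- 500
beta_sim <- rep(1, p_sim)
X_sim <- matrix(rnorm(n_sim * p_sim), n_sim, p_sim)
est_ols <- est_ridge <- matrix(NA, n_rep, p_sim)
for (r in 1:n_rep) {
  y_sim <- X_sim %*% beta_sim + rnorm(n_sim, sd = 3)
  est_ols[r, ] <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))
  est_ridge[r, ] <- solve(crossprod(X_sim) + lambda_sim * diag(p_sim),
                          crossprod(X_sim, y_sim))
}
decompose <- function(est) {
  bias2 <- sum((colMeans(est) - beta_sim)^2)   # total squared bias
  variance <- sum(apply(est, 2, var))          # total variance
  c(bias2 = bias2, variance = variance, mse = bias2 + variance)
}
rbind(OLS = decompose(est_ols), Ridge = decompose(est_ridge))
With a small, noisy design like this one, the ridge row typically shows a modest increase in squared bias and a much larger reduction in variance, so its total mean squared error is lower.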
The Bayesian perspective reframes regularization as a problem of specifying appropriate prior distributions for model parameters. Rather than viewing penalty terms as arbitrary constraints on optimization, Bayesian regularization emerges naturally from the choice of prior distribution over coefficients. This connection provides both theoretical insight into why regularization works and practical advantages through full posterior distributions for all parameters.
5.2 Ridge Regression and Gaussian Priors
Ridge regression modifies the ordinary least squares objective by adding a penalty proportional to the sum of squared coefficients. In the frequentist framework, ridge regression minimizes \(\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2\), yielding the closed-form solution \(\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\). The penalty parameter \(\lambda\) controls the strength of regularization, with larger values producing more shrinkage toward zero.
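Because the solution is available in closed form, the effect of \(\lambda\) is easy to trace directly. The sketch below (illustrative data and an arbitrary penalty grid) evaluates \(\hat{\boldsymbol{\beta}}_{\text{ridge}}\) for several penalty values and reports the L2 norm of the estimate, which shrinks toward zero as \(\lambda\) grows; \(\lambda = 0\) recovers ordinary least squares.
# Minimal sketch: the closed-form ridge solution over a grid of penalties.
set.seed(2)
n0 <- 50; p0 <- 4
X0 <- matrix(rnorm(n0 * p0), n0, p0)
y0 <- X0 %*% c(3, -2, 1, 0) + rnorm(n0)
ridge_solution <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
sapply(c(0, 1, 10, 100, 1000), function(l) {
  b <- ridge_solution(X0, y0, l)
  c(lambda = l, l2_norm = sqrt(sum(b^2)))   # coefficient norm shrinks as lambda grows
})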
The Bayesian equivalent emerges by placing independent Gaussian priors on each coefficient: \(\beta_j \sim \mathcal{N}(0, \tau^2)\) for \(j = 1, \ldots, p\). Under this prior specification, the posterior distribution becomes \(p(\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}, \sigma^2, \tau^2) \propto \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{2\tau^2}\|\boldsymbol{\beta}\|_2^2\right)\). The mathematical equivalence becomes clear when we recognize that \(\lambda = \sigma^2/\tau^2\), establishing a direct correspondence between the frequentist penalty parameter and the ratio of error variance to prior variance.
This Bayesian formulation offers several advantages over its frequentist counterpart. Most importantly, it provides full posterior distributions for all parameters rather than just point estimates, enabling principled uncertainty quantification without requiring bootstrap procedures. The prior variance \(\tau^2\) also admits a natural hierarchical extension by treating it as an unknown parameter with its own prior distribution, allowing the data to inform the appropriate level of shrinkage.
5.3 LASSO Regression and Laplace Priors
LASSO regression replaces the ridge penalty with an L1 penalty on the coefficient magnitudes, minimizing \(\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1\). This seemingly minor modification produces dramatically different behavior: while ridge regression shrinks coefficients toward zero, LASSO can set them exactly to zero, performing automatic variable selection. The geometric intuition is that the L1 penalty corresponds to a diamond-shaped constraint region whose corners lie on the coordinate axes, so the constrained solution frequently lands on a corner or edge where one or more coefficients are exactly zero.
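A one-dimensional calculation shows where the exact zeros come from. For a single standardized predictor with \(\mathbf{x}^T\mathbf{x} = 1\) and OLS estimate \(z = \mathbf{x}^T\mathbf{y}\), the LASSO objective reduces (up to a constant) to \((\beta - z)^2 + \lambda|\beta|\), whose minimizer is the soft-thresholding solution \(\hat{\beta}_{\text{lasso}} = \text{sign}(z)\,\max(|z| - \lambda/2,\, 0)\); it is exactly zero whenever \(|z| \le \lambda/2\). The ridge solution in the same setting, \(\hat{\beta}_{\text{ridge}} = z/(1 + \lambda)\), shrinks the estimate but never reaches zero.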
The Bayesian LASSO emerges by specifying independent Laplace priors for each coefficient: \(\beta_j \sim \text{Laplace}(0, b)\) for \(j = 1, \ldots, p\). The Laplace density \(p(\beta_j|b) = \frac{1}{2b}\exp\left(-\frac{|\beta_j|}{b}\right)\) concentrates far more probability mass near zero than a Gaussian distribution with the same variance, while retaining heavier tails, naturally encouraging sparsity. Matching the posterior mode to the frequentist objective above gives the correspondence \(\lambda = 2\sigma^2/b\).
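A quick numerical comparison illustrates the point. This is a minimal sketch in which the scale \(b = 1\) is an arbitrary choice and the Gaussian is matched to the Laplace variance of \(2b^2\).
# Minimal sketch: Laplace(0, b) vs. a Gaussian with the same variance (2 * b^2).
b <- 1
sd_match <- b * sqrt(2)                      # matched Gaussian standard deviation
dlaplace <- function(x, b) exp(-abs(x) / b) / (2 * b)
dlaplace(0, b)                               # density at zero: 0.5
dnorm(0, sd = sd_match)                      # density at zero: ~0.28
dlaplace(6, b) / dnorm(6, sd = sd_match)     # tail ratio: the Laplace tail is far heavier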
While the Bayesian LASSO provides the same point estimates as its frequentist counterpart through the posterior mode, its real advantage lies in proper uncertainty quantification for the selected variables. The full posterior distribution accounts for both parameter uncertainty and model uncertainty, providing more honest assessments of prediction intervals and coefficient significance than frequentist methods that condition on the selected model.
5.4 Hierarchical Shrinkage and Adaptive Priors
Both ridge and LASSO impose uniform shrinkage across all coefficients, but real data often contains a mixture of large, moderate, and negligible effects that would benefit from adaptive shrinkage. Hierarchical shrinkage priors address this limitation by allowing different coefficients to experience different amounts of shrinkage based on their apparent signal strength in the data.
The general framework specifies \(\beta_j \sim \mathcal{N}(0, \tau^2 \lambda_j^2)\) for \(j = 1, \ldots, p\), where \(\tau\) represents a global shrinkage parameter affecting all coefficients and \(\lambda_j\) represents local shrinkage parameters that can vary across coefficients. This hierarchical structure enables the model to shrink small coefficients aggressively while leaving large coefficients relatively unshrunk.
The horseshoe prior represents one particularly successful implementation of this idea, specifying \(\lambda_j \sim \text{Cauchy}^+(0, 1)\) and \(\tau \sim \text{Cauchy}^+(0, 1)\). The Cauchy distribution's heavy tails ensure that large coefficients can escape shrinkage while small coefficients are still pulled toward zero. The prior gets its name from the distribution it implies on each coefficient's shrinkage weight: with \(\tau = \sigma = 1\), that weight follows a \(\text{Beta}(1/2, 1/2)\) distribution, whose U-shaped density resembles a horseshoe.
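Sampling from the prior makes this behavior visible. The sketch below fixes \(\tau = 1\) purely for illustration and draws coefficients from the horseshoe prior; the draws pile up near zero relative to a standard normal, while the extreme quantiles are enormous.
# Minimal sketch: draws from the horseshoe prior with tau fixed at 1.
set.seed(3)
n_draw <- 1e5
lambda_hs <- abs(rcauchy(n_draw))                   # half-Cauchy local scales
beta_hs <- rnorm(n_draw, mean = 0, sd = lambda_hs)  # beta_j | lambda_j ~ N(0, lambda_j^2)
mean(abs(beta_hs) < 0.01)                           # share of draws very close to zero
2 * pnorm(0.01) - 1                                 # same probability for a standard normal
quantile(abs(beta_hs), c(0.5, 0.9, 0.99, 0.999))    # heavy tails: extreme quantiles are huge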
5.5 Implementation
library(tidymodels)
library(rstan)
library(tidyverse)
# Data simulation
set.seed(123)
n <- 1000
n_train <- 700
n_test <- 300
alpha_true <- 40 # intercept
beta_true <- c(-2, 3, 4, 1, 0.25) # slopes
sigma_true <- 5 # residual standard deviation
p <- length(beta_true)
# Generate design matrix and response
X <- matrix(rnorm(n * p), n, p)
epsilon <- rnorm(n, mean = 0, sd = sigma_true)
y <- alpha_true + X %*% beta_true + epsilon
# Create dataset
data <- data.frame(y = y, X)
colnames(data) <- c("y", paste0("X", 1:p))
# Train-test split
set.seed(42)
data_split <- initial_split(data, prop = 0.7)
train_data <- training(data_split)
test_data <- testing(data_split)
# Prepare data for Stan
X_train <- as.matrix(train_data[, -1])
X_test <- as.matrix(test_data[, -1])
y_train <- train_data$y
y_test <- test_data$y
5.5.1 Bayesian Ridge Regression
# Stan model for Bayesian Ridge Regression
stan_ridge <- "
data {
int<lower=0> N_train;
int<lower=0> N_test;
int<lower=0> p;
matrix[N_train, p] X_train;
matrix[N_test, p] X_test;
vector[N_train] y_train;
}
parameters {
real alpha;
vector[p] beta;
real<lower=0> sigma;
real<lower=0> tau; // prior standard deviation for beta
}
model {
// Priors
alpha ~ normal(0, 10);
tau ~ cauchy(0, 1);
sigma ~ cauchy(0, 5);
beta ~ normal(0, tau); // Ridge prior
// Likelihood
y_train ~ normal(alpha + X_train * beta, sigma);
}
generated quantities {
vector[N_test] y_pred;
for (i in 1:N_test) {
y_pred[i] = normal_rng(alpha + X_test[i] * beta, sigma);
}
}
"
# Prepare data
stan_data_ridge <- list(
N_train = n_train,
N_test = n_test,
p = p,
X_train = X_train,
X_test = X_test,
y_train = y_train
)
# Fit model
fit_ridge <- stan(model_code = stan_ridge,
data = stan_data_ridge,
chains = 4, iter = 2000,
cores = parallel::detectCores())
print(fit_ridge, pars = c("alpha", "beta", "sigma", "tau"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 40.32 0.00 0.18 39.96 40.19 40.31 40.44 40.67 7148 1
## beta[1] -1.86 0.00 0.19 -2.24 -1.99 -1.86 -1.73 -1.48 6679 1
## beta[2] 2.64 0.00 0.18 2.28 2.52 2.64 2.76 2.99 5238 1
## beta[3] 3.91 0.00 0.20 3.53 3.77 3.91 4.04 4.28 6158 1
## beta[4] 1.09 0.00 0.19 0.70 0.97 1.10 1.22 1.47 6568 1
## beta[5] 0.32 0.00 0.19 -0.04 0.20 0.32 0.45 0.70 6110 1
## sigma 4.97 0.00 0.13 4.72 4.88 4.97 5.06 5.24 7300 1
## tau 2.53 0.02 0.94 1.42 1.92 2.32 2.88 4.87 2591 1
##
## Samples were drawn using NUTS(diag_e) at Wed Oct 1 09:37:58 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
5.5.2 Bayesian LASSO Regression
# Stan model for Bayesian LASSO
stan_lasso <- "
data {
int<lower=0> N_train;
int<lower=0> N_test;
int<lower=0> p;
matrix[N_train, p] X_train;
matrix[N_test, p] X_test;
vector[N_train] y_train;
}
parameters {
real alpha;
vector[p] beta;
real<lower=0> sigma;
real<lower=0> b; // Laplace scale parameter
}
model {
// Priors
alpha ~ normal(0, 10);
b ~ cauchy(0, 1);
sigma ~ cauchy(0, 5);
beta ~ double_exponential(0, b); // LASSO prior
// Likelihood
y_train ~ normal(alpha + X_train * beta, sigma);
}
generated quantities {
vector[N_test] y_pred;
for (i in 1:N_test) {
y_pred[i] = normal_rng(alpha + X_test[i] * beta, sigma);
}
}
"
# Prepare data (same as Ridge)
stan_data_lasso <- stan_data_ridge
# Fit model
fit_lasso <- stan(model_code = stan_lasso,
data = stan_data_lasso,
chains = 4, iter = 2000,
cores = parallel::detectCores())
print(fit_lasso, pars = c("alpha", "beta", "sigma", "b"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 40.32 0.00 0.19 39.95 40.19 40.32 40.44 40.68 7136 1
## beta[1] -1.85 0.00 0.19 -2.21 -1.98 -1.85 -1.72 -1.48 7288 1
## beta[2] 2.64 0.00 0.19 2.27 2.52 2.64 2.77 3.01 7301 1
## beta[3] 3.92 0.00 0.20 3.53 3.78 3.92 4.06 4.29 7501 1
## beta[4] 1.08 0.00 0.19 0.71 0.95 1.08 1.22 1.46 7458 1
## beta[5] 0.31 0.00 0.18 -0.06 0.19 0.31 0.43 0.67 6465 1
## sigma 4.97 0.00 0.13 4.72 4.88 4.97 5.06 5.24 6770 1
## b 2.12 0.02 1.04 0.90 1.41 1.88 2.53 4.78 4190 1
##
## Samples were drawn using NUTS(diag_e) at Wed Oct 1 09:39:13 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
5.5.3 Hierarchical Shrinkage (Horseshoe Prior)
# Stan model for Horseshoe prior
stan_horseshoe <- "
data {
int<lower=0> N_train;
int<lower=0> N_test;
int<lower=0> p;
matrix[N_train, p] X_train;
matrix[N_test, p] X_test;
vector[N_train] y_train;
}
parameters {
real alpha;
vector[p] beta_raw;
real<lower=0> sigma;
real<lower=0> tau; // global shrinkage
vector<lower=0>[p] lambda; // local shrinkage
}
transformed parameters {
vector[p] beta;
beta = beta_raw .* lambda * tau; // hierarchical shrinkage
}
model {
// Priors
alpha ~ normal(0, 10);
tau ~ cauchy(0, 1); // global shrinkage
lambda ~ cauchy(0, 1); // local shrinkage
beta_raw ~ normal(0, 1);
sigma ~ cauchy(0, 5);
// Likelihood
y_train ~ normal(alpha + X_train * beta, sigma);
}
generated quantities {
vector[N_test] y_pred;
for (i in 1:N_test) {
y_pred[i] = normal_rng(alpha + X_test[i] * beta, sigma);
}
}
"
# Prepare data (same as before)
stan_data_horseshoe <- stan_data_ridge
# Fit model
fit_horseshoe <- stan(model_code = stan_horseshoe,
data = stan_data_horseshoe,
chains = 4, iter = 2000,
cores = parallel::detectCores())
## Warning: There were 106 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
print(fit_horseshoe, pars = c("alpha", "beta", "sigma", "tau"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 40.32 0.00 0.19 39.95 40.19 40.32 40.45 40.70 3660 1
## beta[1] -1.84 0.00 0.20 -2.23 -1.98 -1.84 -1.71 -1.46 3905 1
## beta[2] 2.65 0.00 0.18 2.30 2.52 2.65 2.77 3.01 3678 1
## beta[3] 3.93 0.00 0.20 3.54 3.80 3.93 4.06 4.31 3803 1
## beta[4] 1.07 0.00 0.19 0.69 0.94 1.07 1.20 1.44 4014 1
## beta[5] 0.26 0.00 0.18 -0.05 0.13 0.26 0.39 0.64 2376 1
## sigma 4.98 0.00 0.14 4.72 4.88 4.98 5.07 5.24 3966 1
## tau 2.21 0.04 1.50 0.56 1.21 1.83 2.75 6.06 1429 1
##
## Samples were drawn using NUTS(diag_e) at Wed Oct 1 09:40:36 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
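The divergent-transition warnings above are common when half-Cauchy local and global scales are sampled directly, because the posterior contains funnel-shaped regions that the sampler struggles to traverse. A low-effort first remedy, sketched below with an illustrative refit (the object name fit_horseshoe_refit and the control values are our own choices), is to tighten the step-size adaptation; more thorough fixes include non-centered parameterizations of the scale parameters or the regularized horseshoe discussed by Piironen and Vehtari (2017).
# Possible remedy for the divergences: smaller step sizes via a higher adapt_delta
# (at the cost of slower sampling). The control values here are illustrative.
fit_horseshoe_refit <- stan(model_code = stan_horseshoe,
                            data = stan_data_horseshoe,
                            chains = 4, iter = 2000,
                            cores = parallel::detectCores(),
                            control = list(adapt_delta = 0.99, max_treedepth = 12))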
5.6 Empirical Comparison and Model Evaluation
Understanding the theoretical properties of different regularization approaches provides essential intuition, but empirical evaluation reveals how these methods perform in practice. The code below demonstrates how to compare ridge regression, LASSO, and horseshoe priors using both predictive accuracy and coefficient recovery metrics.
# Extract predictions and compute metrics
extract_metrics <- function(fit, true_values) {
y_pred <- extract(fit)$y_pred
y_pred_mean <- colMeans(y_pred)
# RMSE
rmse <- sqrt(mean((y_pred_mean - true_values)^2))
# Coverage of 95% posterior predictive intervals
y_pred_lower <- apply(y_pred, 2, quantile, 0.025)
y_pred_upper <- apply(y_pred, 2, quantile, 0.975)
coverage <- mean(true_values >= y_pred_lower & true_values <= y_pred_upper)
return(list(rmse = rmse, coverage = coverage))
}
# Noise-free test-set means, E[y | X_test] (computed in R, not passed to Stan);
# the observed y_test could be used instead to assess predictive coverage
y_test_true <- alpha_true + X_test %*% beta_true
# Compare models
metrics_ridge <- extract_metrics(fit_ridge, y_test_true)
metrics_lasso <- extract_metrics(fit_lasso, y_test_true)
metrics_horseshoe <- extract_metrics(fit_horseshoe, y_test_true)
# Print comparison
cat("Model Comparison (Test Set):\n")
cat("Ridge - RMSE:", round(metrics_ridge$rmse, 3),
"Coverage:", round(metrics_ridge$coverage, 3), "\n")
cat("LASSO - RMSE:", round(metrics_lasso$rmse, 3),
"Coverage:", round(metrics_lasso$coverage, 3), "\n")
cat("Horseshoe - RMSE:", round(metrics_horseshoe$rmse, 3),
"Coverage:", round(metrics_horseshoe$coverage, 3), "\n")
# Plot coefficient estimates
library(bayesplot)
library(ggplot2)
# Extract posterior samples
beta_ridge <- extract(fit_ridge)$beta
beta_lasso <- extract(fit_lasso)$beta
beta_horseshoe <- extract(fit_horseshoe)$beta
# Create comparison plot
beta_df <- data.frame(
coefficient = rep(paste0("beta", 1:p), 3),
method = rep(c("Ridge", "LASSO", "Horseshoe"), each = p),
mean = c(colMeans(beta_ridge), colMeans(beta_lasso), colMeans(beta_horseshoe)),
lower = c(apply(beta_ridge, 2, quantile, 0.025),
apply(beta_lasso, 2, quantile, 0.025),
apply(beta_horseshoe, 2, quantile, 0.025)),
upper = c(apply(beta_ridge, 2, quantile, 0.975),
apply(beta_lasso, 2, quantile, 0.975),
apply(beta_horseshoe, 2, quantile, 0.975)),
true_value = rep(beta_true, 3)
)
ggplot(beta_df, aes(x = coefficient, y = mean, color = method)) +
geom_point(position = position_dodge(0.3)) +
geom_errorbar(aes(ymin = lower, ymax = upper),
position = position_dodge(0.3), width = 0.2) +
geom_point(aes(y = true_value), color = "black", shape = 4, size = 3) +
labs(title = "Coefficient Estimates Comparison",
subtitle = "Black crosses show true values",
x = "Coefficient", y = "Estimate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The empirical comparison typically reveals distinct patterns in how each method handles different types of signals. Ridge regression tends to shrink all coefficients proportionally, maintaining their relative magnitudes while reducing the overall scale. This behavior works well when most predictors contribute meaningfully to the response, even if their individual effects are modest. LASSO behaves more aggressively: the frequentist solution (and the posterior mode of the Bayesian LASSO) can set smaller coefficients exactly to zero, while the posterior mean shrinks them strongly toward zero without producing exact zeros. Either way, the result is a sparse or near-sparse solution that can be easier to interpret but may sacrifice some predictive accuracy when the true model is dense.
Hierarchical shrinkage methods like the horseshoe prior attempt to get the best of both worlds by adapting the amount of shrinkage to each coefficient individually. Large coefficients experience minimal shrinkage, allowing them to maintain their full predictive power, while small coefficients get shrunk aggressively toward zero, reducing noise in the model. This adaptivity often leads to superior predictive performance, especially in settings where the true coefficient vector contains a mixture of large, moderate, and negligible effects.
5.7 Understanding Posterior Behavior
The mathematical structure of each prior determines not just point estimates but the entire posterior distribution. Ridge regression produces a posterior mean with the closed form \(E[\boldsymbol{\beta}|\mathbf{y}] = \left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\tau^2}\mathbf{I}\right)^{-1}\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{y}\), which clearly shows how the prior variance \(\tau^2\) balances against the data-driven term \(\mathbf{X}^T\mathbf{X}\). When \(\tau^2\) is small relative to \(\sigma^2\), the prior dominates and coefficients shrink strongly toward zero. When \(\tau^2\) is large, the data term dominates and estimates approach ordinary least squares.
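This closed form can be checked against the MCMC output from Section 5.5.1. The sketch below is an approximate check rather than an exact one: it plugs the posterior means of \(\alpha\), \(\sigma\), and \(\tau\) into the formula and ignores their uncertainty, so the two columns should be close but not identical.
# Approximate check of the ridge posterior-mean formula against the MCMC fit.
post_ridge <- rstan::extract(fit_ridge)
sigma_hat <- mean(post_ridge$sigma)
tau_hat <- mean(post_ridge$tau)
y_centered <- y_train - mean(post_ridge$alpha)   # remove the fitted intercept
A <- crossprod(X_train) / sigma_hat^2 + diag(p) / tau_hat^2
beta_closed <- solve(A, crossprod(X_train, y_centered) / sigma_hat^2)
cbind(closed_form = as.vector(beta_closed),
      mcmc_mean = colMeans(post_ridge$beta))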
LASSO regression does not admit a closed-form posterior mean due to the non-conjugate Laplace prior, but the posterior mode corresponds exactly to the frequentist LASSO solution. The full posterior distribution, obtained through MCMC sampling, provides uncertainty intervals that properly account for both the sparsity-inducing prior and the inherent variability in coefficient estimates. This represents a substantial improvement over frequentist LASSO inference, which requires complex procedures to obtain valid confidence intervals for selected variables.
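One way to see the mode-versus-mean distinction in practice is through rstan's optimizer, which returns the joint posterior mode. The sketch below is illustrative only: it optimizes the full model, in which \(b\), \(\sigma\), and \(\alpha\) are free parameters rather than fixed, so the mode it finds corresponds to the frequentist LASSO solution only approximately.
# Minimal sketch: joint posterior mode vs. posterior mean for the Bayesian LASSO.
lasso_model <- stan_model(model_code = stan_lasso)
map_lasso <- optimizing(lasso_model, data = stan_data_lasso, seed = 123)
post_lasso <- rstan::extract(fit_lasso)
cbind(posterior_mode = map_lasso$par[paste0("beta[", 1:p, "]")],
      posterior_mean = colMeans(post_lasso$beta))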
The horseshoe prior creates particularly interesting posterior behavior through its adaptive shrinkage mechanism. Each coefficient experiences shrinkage according to \(E[\beta_j|\text{data}] \approx (1 - \kappa_j) \hat{\beta}_j^{\text{OLS}}\), where \(\kappa_j \in (0,1)\) represents a data-dependent shrinkage factor. Coefficients with strong signals in the data have shrinkage factors close to zero, leaving them essentially unshrunk, while coefficients with weak signals have shrinkage factors approaching one, causing aggressive shrinkage toward zero.
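To see where \(\kappa_j\) comes from in the simplest setting, consider the stylized normal-means case with an orthonormal design and \(\sigma^2 = 1\) (not the model fitted above): conditional on \(\lambda_j\) and \(\tau\), the posterior mean of \(\beta_j\) is \(\frac{\tau^2\lambda_j^2}{1 + \tau^2\lambda_j^2}\,\hat{\beta}_j^{\text{OLS}}\), so that \(\kappa_j = \frac{1}{1 + \tau^2\lambda_j^2}\). A large local scale \(\lambda_j\) drives \(\kappa_j\) toward zero and leaves the coefficient essentially untouched, while \(\lambda_j\) near zero drives \(\kappa_j\) toward one and shrinks the coefficient almost entirely away.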
5.8 Choosing Among Regularization Methods
The choice between ridge regression, LASSO, and hierarchical shrinkage depends critically on the structure expected in the true coefficient vector and the goals of the analysis. Ridge regression works best when most predictors contribute meaningfully to the response, even if some effects are small. This commonly occurs in settings where the predictors represent different aspects of the same underlying phenomenon or when domain knowledge suggests that excluding variables entirely would be inappropriate.
LASSO regression excels when the true model is sparse, meaning that only a subset of predictors have non-zero effects. The automatic variable selection property makes LASSO particularly attractive for exploratory analyses where identifying the most important predictors is as important as achieving good predictive performance. However, LASSO can struggle when several predictors are highly correlated, as it tends to select one arbitrarily and zero out the others, potentially missing important relationships.
Hierarchical shrinkage methods like the horseshoe prior offer the most flexibility by allowing the data to determine the appropriate level of sparsity. These methods work well across a wide range of scenarios, from dense models where most predictors matter to sparse models where only a few predictors have substantial effects. The primary cost of this flexibility is computational complexity, as hierarchical models typically require more sophisticated MCMC sampling schemes and longer chains to achieve convergence.
5.9 Conclusion
Bayesian regularized regression represents a natural and principled approach to handling the challenges that arise in modern statistical modeling. By reformulating penalty-based methods in terms of prior distributions, the Bayesian framework provides not only point estimates equivalent to their frequentist counterparts but also full posterior distributions that enable proper uncertainty quantification and model comparison. The connection between prior specifications and regularization behavior offers valuable insight into why different methods work well in different contexts and how to choose appropriately among them.
The three approaches examined here span a useful range of assumptions about coefficient behavior. Ridge regression assumes all coefficients are a priori similar and should be shrunk proportionally, LASSO assumes that many coefficients are exactly zero and should be eliminated entirely, and hierarchical shrinkage assumes that coefficients vary in importance and should be shrunk adaptively. Modern computing makes all three approaches feasible for most practical applications, shifting the primary challenge from computational limitations to thoughtful model selection based on domain knowledge and data characteristics.
These Bayesian methods also integrate naturally into larger modeling workflows that require uncertainty quantification, model averaging, or decision-making under uncertainty. The posterior distributions provide the building blocks for more complex analyses while maintaining the computational efficiency and theoretical elegance that make regularized regression so appealing in the first place.
5.10 References
- Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.
- Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
- Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.