Introduction
In this post, we will explore Bayesian analogues of regularized/penalized linear regression models (e.g., LASSO, ridge regression), which are an extension of traditional linear regression models of the form $y = \alpha + X\beta + \varepsilon$. First we will discuss shrinkage and regularization in regression problems. These methods are useful for improving prediction, for estimating regression models with many variables, and as an alternative to model selection methods. Our goals are to understand shrinkage/regularization of coefficients as an alternative or complement to variable selection, to use three specific approaches to regularization (ridge, LASSO, and hierarchical shrinkage) that regularize through different priors with different consequences for sparsity, and to recognize the connections between Bayesian regularized regression and regularized regression in a machine learning/penalized maximum likelihood estimation (MLE) framework.
Traditional linear regression models assume that the weights share no group-level information (i.e., they are independent), which leads to so-called unbiased estimates. Unlike traditional linear regression, regularized linear regression models produce biased estimates of the weights. Specifically, Bayesian regularized regression models pool information across weights, so each estimate is pulled toward a common mean. When that common mean is centered at 0, the pooling produces more conservative estimates for each weight: they are biased toward 0. Shrinkage estimation deliberately increases the bias of the model in order to reduce its variance and improve overall performance, often at the cost of individual estimates. In other words, by adding bias, shrinkage estimators give us a way to adjust the bias-variance tradeoff and achieve lower generalization error. Bayesian models with non-informative priors produce estimates similar to the MLE; modeling the parameters as exchangeable and giving them a proper prior induces some amount of shrinkage, and stronger priors can produce estimates that differ substantially from the MLE.
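To make the "penalized" part concrete, the frequentist versions of these models minimize a loss of the form

$$
\hat{\beta} = \underset{\alpha,\,\beta}{\arg\min} \; \sum_{i=1}^{N}\left(y_i - \alpha - x_i^\top \beta\right)^2 + \lambda \, P(\beta),
$$

where $P(\beta)=\sum_k \beta_k^2$ for ridge regression, $P(\beta)=\sum_k |\beta_k|$ for the LASSO, and $\lambda \ge 0$ controls how strongly the weights are pulled toward 0. The Bayesian versions below replace the penalty term with a prior distribution on the weights.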
Regularized Regression
As described above, regularized linear regression models aim to estimate more conservative values for the weights in a model, and this is true for both frequentist and Bayesian versions of regularization. While there are many methods that can be used to regularize an estimation procedure, we will focus specifically on two popular forms, ridge and LASSO regression. We start below by describing each penalty generally, and then implement the Bayesian versions in Stan (a brief frequentist sketch with glmnet appears after the data simulation, for comparison).
Bayesian Ridge Regression
Ridge regression modifies the loss function to include a penalty term for model complexity, where model complexity is operationalized as the sum of squared weights. Bayesian ridge regression differs from the frequentist variant in only one way: how we think about the penalty term. In the frequentist setting, the penalty term $\lambda$ effectively tells our model how much it is allowed to learn from the data. Bayesian models instead view estimation as a problem of integrating prior information with information gained from the data, which we formalize using probability distributions. Bayesian estimation captures this in the form of a prior distribution over our weights: the choice of prior distribution on $\beta$ is what determines how much information we learn from the data, analogous to the penalty term used for MLE regularization.
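To see the correspondence explicitly, suppose the errors are Gaussian with variance $\sigma^2$ and we place independent $\text{Normal}(0, \sigma_B^2)$ priors on the weights. The negative log posterior is then, up to constants,

$$
\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \alpha - x_i^\top\beta\right)^2 + \frac{1}{2\sigma_B^2}\sum_{k=1}^{K}\beta_k^2,
$$

so the posterior mode coincides with the ridge solution with penalty $\lambda = \sigma^2/\sigma_B^2$: a smaller prior SD $\sigma_B$ acts like a larger penalty and shrinks the weights more. With that correspondence in mind, we first simulate some data and split it into training and test sets.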
library(tidymodels)
set.seed(123)
n <- 1000
ntr <- 700 # training observations
nts <- 300 # test observations
a <- 40 #intercept
b <- c(-2, 3, 4, 1 , 0.25) #slopes
sigma2 <- 25 #residual variance (sd=5)
x <- matrix(rnorm(n * 5), n, 5) #predictor matrix
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2)) #residuals
y <- a +x%*%b+ eps #response variable
data <- data.frame(y, x) #dataset
head(data)
set.seed(42)
data_split <- initial_split(data, prop = 0.7)
train_data <- training(data_split )
test_data <- testing(data_split )
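For comparison with the Bayesian fits below, here is a minimal sketch of the frequentist versions using the glmnet package (glmnet is an assumption of this sketch and is not used elsewhere in the post); alpha = 0 gives the ridge penalty and alpha = 1 gives the LASSO.
library(glmnet)
# glmnet expects a numeric predictor matrix and a response vector
X_tr <- as.matrix(train_data[, names(train_data) != "y"])
y_tr <- train_data$y
# Cross-validate the penalty strength lambda for each penalty type
ridge_cv <- cv.glmnet(X_tr, y_tr, alpha = 0) # ridge: sum of squared weights
lasso_cv <- cv.glmnet(X_tr, y_tr, alpha = 1) # LASSO: sum of absolute weights
# Shrunken coefficients at the cross-validated lambda
coef(ridge_cv, s = "lambda.min")
coef(lasso_cv, s = "lambda.min") # LASSO can set some coefficients exactly to 0
For the Bayesian models, we start from a standard (unregularized) linear regression program in Stan and then change only the prior on the weights.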
stan_mod = "data {
int<lower=0> N; // number of observations
int<lower=0> K; // number of predictors
matrix[N, K] X; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(alpha + X * beta, sigma); // target density
}"
writeLines(stan_mod, con = "stan_mod.stan")
cat(stan_mod)
stan_mod_ridge = "data{
int N_train; // training observations
int N_test; // test observations
int N_pred; // predictor variables
vector[N_train] y_train; // training outcomes
matrix[N_train, N_pred] X_train; // training data
matrix[N_test, N_pred] X_test; // testing data
}
parameters{
real alpha; // intercept
real<lower=0> sigma; // error SD
real<lower=0> sigma_B; // hierarchical SD across betas
vector[N_pred] beta; // regression beta weights
}
model{
// group-level (hierarchical) SD across betas
sigma_B ~ cauchy(0, 1);
// model error SD
sigma ~ normal(0, 1);
// beta prior
beta ~ normal(0, sigma_B);
// model likelihood
y_train ~ normal(alpha+X_train*beta, sigma);
}
generated quantities{
real y_test[N_test]; // test data predictions
for(i in 1:N_test){
y_test[i] = normal_rng(alpha+ X_test[i,] * beta, sigma);
}
}"
writeLines(stan_mod_ridge, con = "stan_mod_ridge.stan")
cat(stan_mod_ridge)
library(tidyverse)
X_trt <- train_data %>%
select(-y) %>%
as.matrix() # Stan expects a numeric matrix for X_train/X_test
X_tst <- test_data %>%
select(-y) %>%
as.matrix()
Y_trt <- train_data$y # Stan expects a numeric vector for y_train, not a data frame
stan_data <- list(
N_train=ntr,
N_test=nts,
N_pred=5,
y_train=Y_trt,
X_train=X_trt,
X_test=X_tst
)
fit_rstan <- rstan::stan(
file = "stan_mod_ridge.stan",
data = stan_data
)
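Unlike the LASSO and hierarchical models below, we have not yet looked at the ridge fit; printing only the relevant parameters keeps the output manageable (this uses rstan's standard print method for stanfit objects).
print(fit_rstan, pars = c("alpha", "beta", "sigma", "sigma_B"))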
Bayesian LASSO Regression
LASSO regression involves only a minor change to the loss function compared to ridge regression: instead of penalizing the model by the sum of squared weights, it penalizes the model by the sum of the absolute values of the weights. Just as Bayesian ridge regression only required specifying a normal prior on the weights we want to regularize, the only difference for Bayesian LASSO regression is the form of that prior: placing a double-exponential (Laplace) prior on the weights corresponds, at the posterior mode, to the frequentist LASSO penalty. The Laplace distribution places much more probability mass near 0, which encourages the aggressive shrinkage of small coefficients that gives LASSO its variable-selection behavior.
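The correspondence with the frequentist penalty mirrors the ridge case: a $\text{DoubleExponential}(0, \sigma_B)$ prior has negative log density $|\beta_k|/\sigma_B$ plus a constant, so the negative log posterior is, up to constants,

$$
\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \alpha - x_i^\top\beta\right)^2 + \frac{1}{\sigma_B}\sum_{k=1}^{K}|\beta_k|,
$$

and the posterior mode matches the frequentist LASSO solution with $\lambda = 2\sigma^2/\sigma_B$ (the exact constant depends on how the squared-error loss is scaled). In the Stan program, the only change from the ridge model is the prior on `beta`: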
stan_mod_lasso = "data{
int N_train; // training observations
int N_test; // test observations
int N_pred; // predictor variables
vector[N_train] y_train; // training outcomes
matrix[N_train, N_pred] X_train; // training data
matrix[N_test, N_pred] X_test; // testing data
}
parameters{
real alpha; // intercept
real<lower=0> sigma; // error SD
real<lower=0> sigma_B; // hierarchical SD across betas
vector[N_pred] beta; // regression beta weights
}
model{
// group-level (hierarchical) SD across betas
sigma_B ~ cauchy(0, 1);
// model error SD
sigma ~ normal(0, 1);
// beta prior
beta ~ double_exponential(0, sigma_B);
// model likelihood
y_train ~ normal(alpha + X_train*beta, sigma);
}
generated quantities{
real y_test[N_test]; // test data predictions
for(i in 1:N_test){
y_test[i] = normal_rng(alpha + X_test[i,] * beta, sigma);
}
}"
writeLines(stan_mod_lasso, con = "stan_mod_lasso.stan")
cat(stan_mod_lasso)
library(tidyverse)
X_trt <- train_data %>%
select(-y) %>%
as.matrix()
X_tst <- test_data %>%
select(-y) %>%
as.matrix()
Y_trt <- train_data$y
stan_data <- list(
N_train=ntr,
N_test=nts,
N_pred=5,
y_train=Y_trt,
X_train=X_trt,
X_test=X_tst
)
fit_rstan2 <- rstan::stan(
file = "stan_mod_lasso.stan",
data = stan_data
)
fit_rstan2
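One way to see the different amounts of shrinkage is to compare each model's posterior means for `beta` against the true slopes used in the simulation; a quick sketch (assuming both fits above completed without problems):
# Posterior mean coefficients from the ridge and LASSO fits
beta_ridge <- summary(fit_rstan, pars = "beta")$summary[, "mean"]
beta_lasso <- summary(fit_rstan2, pars = "beta")$summary[, "mean"]
# Side-by-side comparison with the data-generating slopes
data.frame(true = b, ridge = beta_ridge, lasso = beta_lasso)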
Hierarchical shrinkage
If the Bayesian LASSO does not give us the sparsity we want (a few relatively large coefficients and many coefficients very close to zero), we can use other global-local scale mixtures of normal distributions as priors to encourage more sparsity. These priors combine a global scale shared by all coefficient priors with a local scale specific to each coefficient.
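The Stan program below implements one such global-local prior (a horseshoe-style prior): each coefficient gets its own local scale $\lambda_k$, multiplied by a global scale $\tau$ that is passed in as data,

$$
\beta_k = \tau\,\lambda_k\,\tilde{\beta}_k, \qquad \tilde{\beta}_k \sim \text{Normal}(0, 1), \qquad \lambda_k \sim \text{Cauchy}^{+}(0, 1),
$$

which is equivalent to $\beta_k \sim \text{Normal}(0, \tau\lambda_k)$. The heavy-tailed half-Cauchy lets individual $\lambda_k$ grow large so that genuine signals escape shrinkage, while the shared global scale $\tau$ pulls the bulk of the coefficients toward 0.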
stan_mod_hr = "data {
int N; // observations
vector[N] y; // outcome
int K; // number of columns in the design matrix X
matrix [N, K] X; // design matrix X
real<lower = 0> tau; // global scale of the coefficient priors
}
transformed data {
real<lower = 0> y_sd;
real a_pr_scale;
real sigma_pr_scale;
y_sd = sd(y);
sigma_pr_scale = y_sd * 5;
a_pr_scale = 10;
}
parameters {
real a; // intercept
vector[K] b_raw; // raw (unscaled) regression coefficients
real<lower = 0> sigma; // scale of the regression errors
vector<lower = 0>[K] lambda; // local scales of the coefficients
}
transformed parameters {
vector[N] mu; // mu is the observation fitted/predicted value
vector[K] b; // b is the transformed beta
b = b_raw * tau .* lambda;
mu = a + X * b;
}
model {
lambda ~ cauchy(0, 1); // priors
a ~ normal(0, a_pr_scale);
b_raw ~ normal(0, 1);
sigma ~ cauchy(0, sigma_pr_scale);
y ~ normal(mu, sigma); // likelihood
}
generated quantities {
vector[N] y_rep; // simulate data from the posterior
vector[N] log_lik; // log-likelihood posterior
for (n in 1:N) {
y_rep[n] = normal_rng(mu[n], sigma);
log_lik[n] = normal_lpdf(y[n] | mu[n], sigma);
}
}"
writeLines(stan_mod_hr, con = "stan_mod_hr.stan")
cat(stan_mod_hr)
stan_data <- list(
N=ntr,
y=Y_trt,
K=5,
X=X_trt,
tau=1
)
fit_rstan3 <- rstan::stan(
file = "stan_mod_hr.stan",
data = stan_data
)
fit_rstan3
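Because the hierarchical model also returns `y_rep` and `log_lik` in its generated quantities block, we can go a step further and check it with approximate leave-one-out cross-validation and a posterior predictive check. A minimal sketch using the loo and bayesplot packages (both are additions here, not part of the models above):
library(loo)
library(bayesplot)
# Approximate leave-one-out cross-validation from the pointwise log-likelihood
log_lik_mat <- extract_log_lik(fit_rstan3, parameter_name = "log_lik")
loo_hs <- loo(log_lik_mat)
print(loo_hs)
# Posterior predictive check: replicated outcomes vs. observed training outcomes
y_rep <- rstan::extract(fit_rstan3, pars = "y_rep")$y_rep
ppc_dens_overlay(y = Y_trt, yrep = y_rep[1:50, ])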
Conclusion
In this post, we learned about the benefits of using regularized/penalized regression models over traditional regression. We saw that in small and/or noisy data settings, the so-called unbiased estimates given by non-regularized regression can actually lead to worse model performance. Importantly, this occurs because an unbiased model is allowed to learn a great deal from the data, including patterns of noise. We then saw that biased methods such as ridge and LASSO regression restrict how much the model learns from the data, which leads to better estimates in small and/or noisy data settings. Finally, hierarchical Bayesian models can place a prior distribution across the weights that yields a solution equivalent (at the posterior mode) to frequentist ridge or LASSO, while providing a full posterior distribution for each parameter, thus circumventing the need for bootstrapping to obtain confidence intervals under frequentist regularization.