Dirichlet Distribution in R and Stan
The Dirichlet distribution is a family of continuous multivariate probability distributions parameterized by a vector of positive numbers. It is often used as a prior distribution in Bayesian statistics, particularly in the context of categorical distributions. Named after Johann Peter Gustav Lejeune Dirichlet, this distribution is used in different fields.
Understanding the Dirichlet Distribution
The Dirichlet distribution is defined over a probability simplex, meaning that it deals with vectors of probabilities that sum to 1. If we have $K$ categories, then a Dirichlet distribution is defined over a vector $\mathbf{p} = (p_1, p_2, \ldots, p_K)$, where each $p_i \geq 0$ and $\sum_{i=1}^{K} p_i = 1$. The distribution is governed by a parameter vector $\mathbf{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_K)$, where each $\alpha_i > 0$.
The density function of the Dirichlet distribution for a $K$-dimensional vector $\mathbf{p}$ is given by:
\[f(\mathbf{p} \mid \mathbf{\alpha}) = \frac{1}{B(\mathbf{\alpha})} \prod_{i=1}^{K} p_i^{\alpha_i - 1}\]where $B(\mathbf{\alpha})$ is the multinomial beta function, acting as a normalization constant to ensure the distribution integrates to 1.
Properties and Characteristics
- Support: The Dirichlet distribution is defined over the simplex, meaning that the components of a random vector drawn from a Dirichlet distribution are positive and sum to one.
- Parameters: The parameter vector \(\alpha\) influences the shape and concentration of the distribution. Each component of \(\alpha\) can be thought of as a prior count or concentration parameter for the corresponding component of the probability vector.
- Mean and Variance: The mean of each component of a Dirichlet-distributed vector is given by the ratio of the corresponding \(\alpha\) parameter to the sum of all \(\alpha\) parameters. The variance and covariance of the components are also functions of the \(\alpha\) parameters, reflecting the distribution’s flexibility in modeling different levels of uncertainty and correlation among components.
Applications of the Dirichlet Distribution
The Dirichlet distribution is widely used in Bayesian inference, topic modeling, and clustering. In Bayesian inference, it acts as a conjugate prior for the multinomial distribution, meaning that updates based on new data produce another Dirichlet distribution. This property makes calculations simpler and is especially useful when working with categorical data. In natural language processing, the Dirichlet distribution is foundational for Latent Dirichlet Allocation (LDA), a method used to uncover hidden topics within a collection of texts. By modeling the distribution of topics in documents and the distribution of words within topics, LDA helps reveal the underlying thematic structure of large datasets. The Dirichlet distribution also plays a crucial role in clustering, particularly in Dirichlet Process Mixture Models (DPMMs). These models are powerful for cases where the number of clusters isn’t known in advance, allowing for adaptive and flexible grouping of data. Whether simplifying complex computations or uncovering patterns in diverse datasets, the Dirichlet distribution proves itself a cornerstone of modern statistical and machine learning methods.
Dirichlet Distribution in R
Dirichlet distribution can be implemented in R using the gtools
package and the LaplacesDemon
package:
Generating Samples from a Dirichlet Distribution
library(gtools)
alpha <- c(2, 3, 5)
n=1000
samples <- rdirichlet(n, alpha)
print(samples)
colMeans(samples)
print(posterior_mean)
Code generates 10 samples from a Dirichlet distribution with parameters ( $\alpha$ = (2, 3, 5) ). Each sample is a probability vector of three elements, and the rows of the output matrix represent different samples.
Using the Dirichlet Distribution as a Prior
observed_data <- c(10, 15, 25)
prior_alpha <- c(2, 3, 5)
posterior_alpha <- prior_alpha + observed_data
posterior_samples <- rdirichlet(1000, posterior_alpha)
posterior_mean <- colMeans(posterior_samples)
print(posterior_mean)
Dirichlet distribution can be used as a prior in a Bayesian framework with the prior parameters are updated with observed data, and samples are drawn from the posterior distribution to estimate the mean probabilities of each category.
Visualizing the Dirichlet Distribution
You can visualize the Dirichlet distribution using ggtern
in R. The code below generates 10,000 random samples from a Dirichlet distribution with parameters ($\alpha$ = (2, 3, 5)), then plots the samples in a ternary diagram with proportions across three variables in a compositional space.
alpha <- c(2, 3, 5)
samples <- rdirichlet(10000, alpha)
library(ggtern)
data <- as.data.frame(samples)
colnames(data) <- c("X1", "X2", "X3")
ggtern(data = data, aes(x = X1, y = X2, z = X3)) +
geom_point(alpha = 0.1) +
labs(title = "Dirichlet Distribution Visualization")
Dirichlet Distribution in Stan
To implement the Dirichlet distribution using Stan and RStan first, create a Stan model string or text file:
stan_model <-'data {
int<lower=1> K; // Number of categories
vector<lower=0>[K] alpha; // Dirichlet parameters
}
parameters {
simplex[K] theta; // Probability vector
}
model {
theta ~ dirichlet(alpha); // Dirichlet prior
}'
Then it is possbile RStan to compile and run the Stan model:
library(rstan)
K <- 3
alpha <- c(2, 3, 5) # Dirichlet parameters
data_list <- list(K = K, alpha = alpha)
fit <- stan(model_code=stan_model, data = data_list, iter = 2000, chains = 4)
print(fit)
- Data Block: The data block in the Stan model specifies the number of categories
K
and the Dirichlet parametersalpha
. These are provided as inputs to the model. - Parameters Block: The parameters block defines
theta
as a simplex, which ensures that the elements oftheta
are non-negative and sum to one, making it suitable for modeling probabilities. - Model Block: The model block specifies that
theta
follows a Dirichlet distribution with parametersalpha
. - RStan: The R code uses the
rstan
package to compile the Stan model and perform sampling. Thesampling
function runs the Markov Chain Monte Carlo (MCMC) to generate samples from the posterior distribution oftheta
. - Output: The
print
function outputs the summary of the posterior samples, including estimates for the components oftheta
.
Bayesian Model with Dirichlet Prior in Stan
library(rstan)
library(ggplot2)
dirichlet_model_code <- "
data {
int<lower=1> K; // Number of categories
int<lower=0> N; // Number of observations
int<lower=1, upper=K> y[N]; // Observations (categorical data)
vector<lower=0>[K] alpha; // Dirichlet prior parameters
}
parameters {
simplex[K] theta; // Probabilities for each category
}
model {
theta ~ dirichlet(alpha); // Dirichlet prior
y ~ categorical(theta); // Likelihood
}
generated quantities {
vector[K] p_hat; // Posterior mean of theta
p_hat = theta;
}
"
- Data Block: Specifies the number of categories
K
, the number of observationsN
, the observed datay
, and the Dirichlet prior parametersalpha
. - Parameters Block: Defines
theta
, a vector representing the probabilities for each category, which is constrained to be a simplex (i.e., sums to 1). - Model Block: Applies the Dirichlet prior to
theta
and models the observed datay
using a categorical distribution. - Generated Quantities Block: Computes the posterior mean of
theta
.
Now lets generate some data to fit the model:
# Number of categories
K <- 3
# Observed data (e.g., 1, 2, 3 corresponds to categories)
y <- c(rep(1,20),rep(2,30),rep(3,50))
# Number of observations
N <- length(y)
# Dirichlet prior parameters (e.g., prior beliefs about category probabilities)
alpha <- c(2, 2, 2)
# Data list to pass to Stan
data_list <- list(K = K, N = N, y = y, alpha = alpha)
Compile and run the Stan model using the stan
function:
dirichlet_model <- stan_model(model_code = dirichlet_model_code)
fit <- sampling(dirichlet_model, data = data_list, iter = 2000, chains = 4, seed = 123)
print(fit)
Extract the posterior distributions:
# Extract the posterior samples
posterior_samples <- extract(fit)
posterior_mean <- apply(posterior_samples$p_hat, 2, mean)
print(posterior_mean)
theta_samples <- as.data.frame(posterior_samples$theta)
colnames(theta_samples) <- paste0("theta_", 1:K)
In this post we discussed Dirichlet distribution that is used in Bayesian statistics and machine learning for modeling probabilities over multiple categories. More specifically, it has utility as a conjugate prior for multinomial distributions in applications such as topic modeling, clustering, and Bayesian inference. It is possbile to implement and explore Dirichlet models, both for simulation and for real-world data analysis in R and Stan . The examples in this blog demonstrate how to generate samples from the Dirichlet distribution, use it as a prior in Bayesian modeling, and apply it using RStan for more simple models. In a following post we will go through clustering and multinomial regression using the Dirichlet distribution.
References
- The multinomial model in stan - fit dirichlet distribution parameters
- Dirichlet distribution example stan
- Guide to understanding the intuition behind the Dirichlet distribution
- Introduction to the Dirichlet Distribution and Related Processes
- LDA Model: Simulated Data Generation in R & Parameter Recovery Study in RStan
- Dirichlet distribution in Stan
- Dirichlet Distribution R LaplacesDemon
- The Dirichlet Distribution: What Is It and Why Is It Useful?
- Dirichlet Distribution Example
- Bayesian Regression for a Dirichlet Distributed Response using Stan