Chapter 7 Beyond Linear Models: Q-Learning with Random Forest Function Approximation in R
7.1 Introduction
While linear function approximation provides a solid foundation for scaling reinforcement learning beyond tabular methods, it assumes a linear relationship between features and Q-values. Real-world problems often exhibit complex, non-linear patterns that linear models cannot capture effectively. This chapter extends our previous exploration by implementing Q-Learning with Random Forest function approximation, demonstrating how ensemble methods can learn intricate state-action value relationships while maintaining interpretability and robust generalization.
Random Forests offer several advantages over linear approximation: they handle non-linear relationships naturally, provide built-in feature importance measures, resist overfitting through ensemble averaging, and require minimal hyperparameter tuning. We’ll implement this approach using the same 10-state, 2-action environment, comparing the learned policies and examining the unique characteristics of tree-based function approximation.
Random Forest function approximation replaces the linear parameterization with an ensemble of decision trees. Instead of:
\[ Q(s, a; \theta) = \phi(s, a)^T \theta \]
we now approximate the action-value function as:
\[ Q(s, a) = \frac{1}{B} \sum_{b=1}^{B} T_b(\phi(s, a)) \]
where \(T_b\) represents the \(b\)-th tree in the ensemble, \(B\) is the number of trees, and \(\phi(s, a)\) is our feature representation. Each tree \(T_b\) is trained on a bootstrap sample of the data with random feature subsets at each split, providing natural regularization and variance reduction.
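To see the ensemble average in this formula concretely, the randomForest package can return per-tree predictions via predict(..., predict.all = TRUE); averaging them reproduces the aggregated prediction. A minimal sketch on synthetic data (the data and variable names here are purely illustrative and independent of the implementation below):
library(randomForest)
set.seed(1)
X <- data.frame(x1 = runif(200), x2 = runif(200))
y <- sin(2 * pi * X$x1) + 0.5 * X$x2 + rnorm(200, sd = 0.1)
fit <- randomForest(x = X, y = y, ntree = 25)
pred <- predict(fit, X[1:3, ], predict.all = TRUE)
# pred$individual is a 3 x 25 matrix of per-tree predictions T_b(phi);
# their row means equal the aggregated (B-tree average) prediction
all.equal(unname(rowMeans(pred$individual)), unname(pred$aggregate))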
7.1.1 Q-Learning with Random Forest Approximation
The Q-Learning update process with Random Forest approximation involves:
- Experience Collection: Gather state-action-reward-next state tuples \((s, a, r, s')\)
- Target Computation: Calculate TD targets \(y = r + \gamma \max_{a'} Q(s', a')\)
- Model Training: Fit Random Forest regressor to predict \(Q(s, a)\) from features \(\phi(s, a)\)
- Policy Update: Use updated model for epsilon-greedy action selection
Unlike linear methods with continuous parameter updates, Random Forest approximation requires periodic model retraining on accumulated experience. This batch-like approach trades computational efficiency for modeling flexibility.
For our implementation, we use a simple concatenation of one-hot encoded state and action vectors:
\[ \phi(s, a) = [e_s^{(state)} \; || \; e_a^{(action)}] \]
where \(e_s^{(state)}\) is a one-hot vector for state \(s\) and \(e_a^{(action)}\) is a one-hot vector for action \(a\). This encoding allows trees to learn complex interactions between states and actions while maintaining interpretability.
7.1.2 Comparison with Previous Methods
| Aspect | Tabular Q-Learning | Linear Function Approximation | Random Forest Function Approximation |
|---|---|---|---|
| Model Complexity | None; direct storage | Linear combinations | Non-linear ensemble |
| Feature Interactions | Implicit | None (unless engineered) | Automatic discovery |
| Interpretability | Full | Moderate (weights) | Moderate (feature importance, tree paths) |
| Training | Online updates | Gradient descent | Batch retraining |
| Overfitting Risk | None | Low | Low (ensemble averaging) |
| Computational Cost | \(O(1)\) lookup | \(O(d)\) linear algebra | \(O(B \cdot \log n)\) prediction |
7.2 R Implementation
Our implementation builds upon the previous environment while introducing Random Forest-based Q-value approximation. The key innovation lies in accumulating training examples and periodically retraining the forest to incorporate new experience.
# Load required libraries
library(randomForest)
library(ggplot2)
library(gridExtra)
# Set global seed for reproducibility
set.seed(42)
# Environment setup
n_states <- 10
n_actions <- 2
gamma <- 0.9
terminal_state <- n_states
# Build valid transition and reward models
transition_model <- array(0, dim = c(n_states, n_actions, n_states))
reward_model <- array(0, dim = c(n_states, n_actions, n_states))
for (s in 1:(n_states - 1)) {
for (a in 1:n_actions) {
# Ensure valid, normalized probability distributions
# Action 1: mostly forward, with small random transition
if (a == 1) {
next_state <- s + 1 # Move forward; from state 9 this reaches the terminal state
random_state <- sample(setdiff(1:(n_states-1), next_state), 1) # Random slip never jumps directly to terminal
transition_model[s, 1, next_state] <- 0.9
transition_model[s, 1, random_state] <- 0.1
} else {
# Action 2: two random non-terminal states
random_states <- sample(1:(n_states-1), 2, replace = FALSE)
transition_model[s, 2, random_states[1]] <- 0.8
transition_model[s, 2, random_states[2]] <- 0.2
}
# Assign rewards
for (s_prime in 1:n_states) {
if (transition_model[s, a, s_prime] > 0) {
reward_model[s, a, s_prime] <- ifelse(
s_prime == n_states,
ifelse(a == 1, 1.0, 0.5), # Terminal reward
runif(1, 0, ifelse(a == 1, 0.1, 0.05)) # Step reward
)
}
}
}
}
# Terminal state: no transitions
transition_model[n_states, , ] <- 0
reward_model[n_states, , ] <- 0
# Verify transition probabilities sum to 1
for (s in 1:(n_states-1)) {
for (a in 1:n_actions) {
prob_sum <- sum(transition_model[s, a, ])
if (abs(prob_sum - 1.0) > 1e-6) {
warning(sprintf("State %d, Action %d: probabilities sum to %.3f", s, a, prob_sum))
}
}
}
# Sampling function
sample_env <- function(s, a) {
probs <- transition_model[s, a, ]
s_prime <- sample(1:n_states, 1, prob = probs)
reward <- reward_model[s, a, s_prime]
list(s_prime = s_prime, reward = reward)
}
# Feature encoding
encode_features <- function(s, a, n_states, n_actions) {
state_vec <- rep(0, n_states)
action_vec <- rep(0, n_actions)
state_vec[s] <- 1
action_vec[a] <- 1
c(state_vec, action_vec)
}
n_features <- n_states + n_actions
# Experience replay buffer
ExperienceBuffer <- setRefClass(
"ExperienceBuffer",
fields = list(
capacity = "numeric",
buffer = "list",
position = "numeric"
),
methods = list(
initialize = function(cap = 5000) {
capacity <<- cap
buffer <<- vector("list", capacity)
position <<- 0
},
add = function(experience) {
idx <- (position %% capacity) + 1
buffer[[idx]] <<- experience
position <<- position + 1
},
sample = function(batch_size) {
valid_size <- min(position, capacity)
if (valid_size < batch_size) {
indices <- 1:valid_size
} else {
indices <- base::sample(1:valid_size, batch_size, replace = FALSE) # base:: avoids clashing with this method's own name
}
buffer[indices]
},
size = function() {
min(position, capacity)
}
)
)
# Tabular Q-Learning (baseline)
q_learning_tabular <- function(episodes = 1000, alpha = 0.1, epsilon = 0.1) {
Q <- matrix(0, nrow = n_states, ncol = n_actions)
rewards_history <- numeric(episodes)
for (ep in 1:episodes) {
s <- sample(1:(n_states - 1), 1)
episode_reward <- 0
while (s != terminal_state) {
# Epsilon-greedy action selection
a <- if (runif(1) < epsilon) {
sample(1:n_actions, 1)
} else {
which.max(Q[s, ])
}
# Take action
out <- sample_env(s, a)
s_prime <- out$s_prime
r <- out$reward
episode_reward <- episode_reward + r
# Q-learning update
max_q_next <- ifelse(s_prime == terminal_state, 0, max(Q[s_prime, ]))
td_target <- r + gamma * max_q_next
Q[s, a] <- Q[s, a] + alpha * (td_target - Q[s, a])
s <- s_prime
}
rewards_history[ep] <- episode_reward
}
# Extract policy
policy <- apply(Q[1:(n_states-1), ], 1, which.max)
list(Q = Q, policy = c(policy, NA), rewards = rewards_history)
}
# Q-Learning with Random Forest function approximation
q_learning_rf <- function(episodes = 1000, epsilon = 0.1, buffer_capacity = 5000,
batch_size = 128, retrain_freq = 50, update_target_freq = 100) {
# Initialize experience buffer
replay_buffer <- ExperienceBuffer$new(buffer_capacity)
# Initialize models (current and target)
current_model <- NULL
target_model <- NULL
rewards_history <- numeric(episodes)
td_errors <- numeric(episodes)
model_updates <- 0
for (ep in 1:episodes) {
s <- sample(1:(n_states - 1), 1)
episode_reward <- 0
episode_td_error <- 0
steps <- 0
while (s != terminal_state) {
# Predict Q-values using current model
q_values <- sapply(1:n_actions, function(a) {
x <- encode_features(s, a, n_states, n_actions)
if (!is.null(current_model)) {
predict(current_model, as.data.frame(t(x)))
} else {
0 # Initialize to zero
}
})
# Epsilon-greedy action selection
a <- if (runif(1) < epsilon) {
sample(1:n_actions, 1)
} else {
which.max(q_values)
}
# Take action
out <- sample_env(s, a)
s_prime <- out$s_prime
r <- out$reward
episode_reward <- episode_reward + r
steps <- steps + 1
# Store experience
replay_buffer$add(list(s = s, a = a, r = r, s_prime = s_prime))
s <- s_prime
}
rewards_history[ep] <- episode_reward
# Train model if enough samples
if (replay_buffer$size() >= batch_size && ep %% retrain_freq == 0) {
# Sample batch from replay buffer
batch <- replay_buffer$sample(batch_size)
# Prepare training data
train_x <- matrix(0, nrow = batch_size, ncol = n_features)
train_y <- numeric(batch_size)
for (i in 1:batch_size) {
exp <- batch[[i]]
# Use target model for computing Q-values (DQN-style)
model_to_use <- if (!is.null(target_model)) target_model else current_model
# Compute TD target
if (exp$s_prime == terminal_state) {
target_q <- exp$r
} else {
# Get max Q-value from target network
q_next <- sapply(1:n_actions, function(a_next) {
x_next <- encode_features(exp$s_prime, a_next, n_states, n_actions)
if (!is.null(model_to_use)) {
predict(model_to_use, as.data.frame(t(x_next)))
} else {
0
}
})
target_q <- exp$r + gamma * max(q_next)
}
train_x[i, ] <- encode_features(exp$s, exp$a, n_states, n_actions)
train_y[i] <- target_q
}
# Train Random Forest
current_model <- randomForest(
x = as.data.frame(train_x),
y = train_y,
ntree = 50, # Modest ensemble size for fast retraining
nodesize = 10, # Larger leaf nodes limit overfitting on small batches
mtry = max(1, floor(n_features / 3)),
keep.forest = TRUE
)
model_updates <- model_updates + 1
# Update target model periodically
if (model_updates %% (update_target_freq / retrain_freq) == 0) {
target_model <- current_model
}
# Track TD error
predictions <- predict(current_model, as.data.frame(train_x))
td_errors[ep] <- mean(abs(predictions - train_y))
}
}
# Extract final policy
policy <- sapply(1:(n_states-1), function(s) {
if (!is.null(current_model)) {
q_vals <- sapply(1:n_actions, function(a) {
x <- encode_features(s, a, n_states, n_actions)
predict(current_model, as.data.frame(t(x)))
})
which.max(q_vals)
} else {
1
}
})
list(
model = current_model,
policy = c(policy, NA),
rewards = rewards_history,
td_errors = td_errors,
buffer = replay_buffer
)
}
# Run both algorithms
cat("Running Tabular Q-Learning...\n")
tabular_result <- q_learning_tabular(episodes = 1000, alpha = 0.1, epsilon = 0.1)
cat("Running Q-Learning with Random Forest...\n")
rf_result <- q_learning_rf(
episodes = 1000,
epsilon = 0.1,
buffer_capacity = 5000,
batch_size = 128,
retrain_freq = 50,
update_target_freq = 100
)
# Compute trailing moving average for smoothing learning curves
moving_average <- function(x, window = 50) {
n <- length(x)
result <- numeric(n)
for (i in 1:n) {
start_idx <- max(1, i - window + 1)
result[i] <- mean(x[start_idx:i])
}
result
}
tabular_smooth <- moving_average(tabular_result$rewards, 50)
rf_smooth <- moving_average(rf_result$rewards, 50)
# Create comparison dataframe
reward_df <- data.frame(
Episode = rep(1:1000, 2),
Reward = c(tabular_smooth, rf_smooth),
Algorithm = rep(c("Tabular Q-Learning", "Q-Learning + RF"), each = 1000)
)
# Plot 1: Learning curves comparison
reward_plot <- ggplot(reward_df, aes(x = Episode, y = Reward, color = Algorithm)) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("Tabular Q-Learning" = "blue",
"Q-Learning + RF" = "forestgreen")) +
theme_minimal() +
labs(
title = "Learning Curves Comparison (50-episode moving average)",
x = "Episode",
y = "Average Reward"
) +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "bottom"
)
# Plot 2: Policy comparison
policy_df <- data.frame(
State = rep(1:(n_states-1), 2),
Action = c(tabular_result$policy[1:(n_states-1)],
rf_result$policy[1:(n_states-1)]),
Algorithm = rep(c("Tabular Q-Learning", "Q-Learning + RF"),
each = n_states-1)
)
policy_plot <- ggplot(policy_df, aes(x = State, y = Action, color = Algorithm, shape = Algorithm)) +
geom_point(size = 4) +
geom_line(linewidth = 1, alpha = 0.6) +
scale_color_manual(values = c("Tabular Q-Learning" = "blue",
"Q-Learning + RF" = "forestgreen")) +
scale_shape_manual(values = c(16, 17)) +
scale_y_continuous(breaks = 1:2, labels = c("Action 1", "Action 2"),
limits = c(0.5, 2.5)) +
scale_x_continuous(breaks = 1:(n_states-1)) +
theme_minimal() +
labs(
title = "Learned Policies Comparison",
x = "State",
y = "Action"
) +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "bottom"
)
# Display plots
print(reward_plot)
print(policy_plot)
# Feature importance (if model exists)
if (!is.null(rf_result$model)) {
importance_df <- data.frame(
Feature = c(paste0("S", 1:n_states), paste0("A", 1:n_actions)),
Importance = importance(rf_result$model)[, 1]
)
importance_plot <- ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "forestgreen", alpha = 0.7) +
coord_flip() +
theme_minimal() +
labs(
title = "Feature Importance in Random Forest Q-Function",
x = "Feature",
y = "Importance (IncNodePurity)"
) +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12)
)
print(importance_plot)
}
# Summary statistics
cat("\n=== Summary Statistics ===\n")
cat("\nTabular Q-Learning:\n")
cat(" Final 100-episode avg reward:", mean(tail(tabular_result$rewards, 100)), "\n")
cat(" Best policy reward:", max(tabular_smooth), "\n")
cat("\nQ-Learning with Random Forest:\n")
cat(" Final 100-episode avg reward:", mean(tail(rf_result$rewards, 100)), "\n")
cat(" Best policy reward:", max(rf_smooth), "\n")
cat(" Experience buffer size:", rf_result$buffer$size(), "\n")
if (!is.null(rf_result$model)) {
cat(" RF model trees:", rf_result$model$ntree, "\n")
cat(" RF OOB MSE:", mean(rf_result$model$mse), "\n")
}
# Policy agreement
policy_agreement <- sum(tabular_result$policy[1:(n_states-1)] ==
rf_result$policy[1:(n_states-1)]) / (n_states-1)
cat("\nPolicy agreement:", round(policy_agreement * 100, 1), "%\n")Environment Setup
This chunk defines the basic parameters of the reinforcement learning problem.
- n_states <- 10: The “world” or “game” has 10 distinct states.
- n_actions <- 2: In any state, the agent can choose between 2 possible actions.
- gamma <- 0.9: The discount factor. It determines how much the agent values future rewards; a value of 0.9 means a reward received in the next step is worth 90% of a reward received now.
- terminal_state <- n_states: State 10 is designated as the “end” state. When the agent reaches this state, the “episode” (one attempt) is over.
Environment Model Definition
transition_model <- array(0, dim = c(n_states, n_actions, n_states))
reward_model <- array(0, dim = c(n_states, n_actions, n_states))
for (s in 1:(n_states - 1)) {
for (a in 1:n_actions) {
if (a == 1) {
next_state <- s + 1
random_state <- sample(setdiff(1:(n_states-1), next_state), 1)
transition_model[s, 1, next_state] <- 0.9
transition_model[s, 1, random_state] <- 0.1
} else {
random_states <- sample(1:(n_states-1), 2, replace = FALSE)
transition_model[s, 2, random_states[1]] <- 0.8
transition_model[s, 2, random_states[2]] <- 0.2
}
# ... reward assignment ...
}
}
This chunk builds the “rules” of the environment:
- Valid Probability Distributions:
  - Action 1: 90% chance to move to the next state, 10% to a different random non-terminal state (setdiff prevents the two outcomes from overwriting each other)
  - Action 2: 80% to one random non-terminal state, 20% to another (replace = FALSE ensures the two states differ)
  - Probabilities always sum to exactly 1.0 for each (state, action) pair
- No Invalid Terminal Transitions: Random states are sampled from 1:(n_states-1), so the terminal state is reached only through the deterministic forward move from state 9
- Verification: The code checks each (state, action) pair and prints a warning if its probabilities don’t sum to 1.0
Sampling Function
sample_env <- function(s, a) {
probs <- transition_model[s, a, ]
s_prime <- sample(1:n_states, 1, prob = probs)
reward <- reward_model[s, a, s_prime]
list(s_prime = s_prime, reward = reward)
}
This function acts as the simulator. The Q-learning agent doesn’t get to see the full transition_model. Instead, it gives a state (s) and action (a) to this function, which:
- Looks up the probabilities for all possible next states
- Samples one next state based on those probabilities
- Returns the next state and corresponding reward
Feature Encoding
encode_features <- function(s, a, n_states, n_actions) {
state_vec <- rep(0, n_states)
action_vec <- rep(0, n_actions)
state_vec[s] <- 1
action_vec[a] <- 1
c(state_vec, action_vec)
}
This performs one-hot encoding to convert states and actions into numerical features:
- Creates a vector of 10 zeros and puts a 1 at position s (e.g., state 3 → [0,0,1,0,0,0,0,0,0,0])
- Creates a vector of 2 zeros and puts a 1 at position a (e.g., action 2 → [0,1])
- Combines them into a 12-dimensional feature vector for the Random Forest (a concrete call is shown below)
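For example, once the function and environment constants above are in scope, encoding state 3 with action 2 reproduces the vector described here:
encode_features(3, 2, n_states, n_actions)
# [1] 0 0 1 0 0 0 0 0 0 0 0 1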
Experience Replay Buffer
ExperienceBuffer <- setRefClass(
"ExperienceBuffer",
fields = list(capacity = "numeric", buffer = "list", position = "numeric"),
methods = list(
add = function(experience) { ... },
sample = function(batch_size) { ... },
size = function() { ... }
)
)
- Circular Buffer: Stores up to 5000 experiences (state, action, reward, next_state)
- Prevents Catastrophic Forgetting: By sampling randomly from past experiences rather than learning only from the most recent ones
- Breaks Temporal Correlation: Random sampling decorrelates sequential experiences, improving learning stability
- Methods:
  - add(): Stores a new experience, overwriting the oldest one when the buffer is full
  - sample(): Returns a random batch of experiences for training
  - size(): Returns the current number of stored experiences
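A quick interactive check of these methods, using a tiny buffer and made-up experiences:
buf <- ExperienceBuffer$new(cap = 3)
buf$add(list(s = 1, a = 1, r = 0.05, s_prime = 2))
buf$add(list(s = 2, a = 2, r = 0.02, s_prime = 5))
buf$size()                                           # 2
buf$add(list(s = 5, a = 1, r = 0.07, s_prime = 6))
buf$add(list(s = 6, a = 1, r = 0.09, s_prime = 7))   # overwrites the oldest entry
buf$size()                                           # capped at the capacity of 3
length(buf$sample(2))                                # a random batch of 2 experiences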
Tabular Q-Learning Baseline
q_learning_tabular <- function(episodes = 1000, alpha = 0.1, epsilon = 0.1) {
Q <- matrix(0, nrow = n_states, ncol = n_actions)
# ... standard Q-learning update ...
Q[s, a] <- Q[s, a] + alpha * (td_target - Q[s, a])
# ...
}
This implements standard tabular Q-learning as a baseline:
- Maintains a 10×2 table of Q-values
- Uses the classical update rule \(Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]\) (a worked numeric example follows this list)
- Provides a gold standard to compare against the Random Forest approximation
- Should converge to the optimal policy on this small problem
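To make the update concrete, here is one hypothetical step with made-up numbers, using the same alpha = 0.1 and gamma = 0.9 as above:
# Hypothetical values for a single tabular update, purely illustrative
q_sa <- 0.20       # current estimate Q(s, a)
r_step <- 0.05     # observed reward
max_q_next <- 0.50 # max over a' of Q(s', a')
td_target <- r_step + 0.9 * max_q_next  # 0.05 + 0.45 = 0.50
q_sa + 0.1 * (td_target - q_sa)         # 0.20 + 0.1 * 0.30 = 0.23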
Q-Learning with Random Forest
q_learning_rf <- function(episodes = 1000, epsilon = 0.1,
buffer_capacity = 5000, batch_size = 128,
retrain_freq = 50, update_target_freq = 100) {
The main function exposes the key hyperparameters:
- buffer_capacity = 5000: Size of the experience replay buffer
- batch_size = 128: Number of experiences sampled for each training update
- retrain_freq = 50: Retrain the model every 50 episodes (less frequent retraining is more stable but slower to adapt)
- update_target_freq = 100: Update the target model every 100 episodes
Initialization
replay_buffer <- ExperienceBuffer$new(buffer_capacity)
current_model <- NULL
target_model <- NULL
rewards_history <- numeric(episodes)
td_errors <- numeric(episodes)
- replay_buffer: Stores experiences for replay
- current_model: The Q-function being actively trained
- target_model: A frozen copy used for computing TD targets (DQN-style)
- td_errors: Tracks learning progress by monitoring prediction errors
Episode Loop
for (ep in 1:episodes) {
s <- sample(1:(n_states - 1), 1)
episode_reward <- 0
while (s != terminal_state) {
# ... one step ...
replay_buffer$add(list(s = s, a = a, r = r, s_prime = s_prime))
s <- s_prime
}
For each episode:
- Start in a random non-terminal state
- Take steps until reaching the terminal state
- Store each experience in the replay buffer (training happens later from sampled batches, not online at every step)
- Accumulate episode reward
Action Selection
q_values <- sapply(1:n_actions, function(a) {
x <- encode_features(s, a, n_states, n_actions)
if (!is.null(current_model)) {
predict(current_model, as.data.frame(t(x)))
} else {
0
}
})
a <- if (runif(1) < epsilon) {
sample(1:n_actions, 1)
} else {
which.max(q_values)
}
Epsilon-greedy action selection:
- Predict Q-values for all actions using the current model
- With probability ε (10%), explore by choosing randomly
- With probability 1-ε (90%), exploit by choosing the action with highest Q-value
Training with Experience Replay
if (replay_buffer$size() >= batch_size && ep %% retrain_freq == 0) {
batch <- replay_buffer$sample(batch_size)
train_x <- matrix(0, nrow = batch_size, ncol = n_features)
train_y <- numeric(batch_size)
for (i in 1:batch_size) {
exp <- batch[[i]]
model_to_use <- if (!is.null(target_model)) target_model else current_model
if (exp$s_prime == terminal_state) {
target_q <- exp$r
} else {
q_next <- sapply(1:n_actions, function(a_next) {
x_next <- encode_features(exp$s_prime, a_next, n_states, n_actions)
if (!is.null(model_to_use)) {
predict(model_to_use, as.data.frame(t(x_next)))
} else {
0
}
})
target_q <- exp$r + gamma * max(q_next)
}
train_x[i, ] <- encode_features(exp$s, exp$a, n_states, n_actions)
train_y[i] <- target_q
}
current_model <- randomForest(...)
}
Random Forest Training
current_model <- randomForest(
x = as.data.frame(train_x),
y = train_y,
ntree = 50, # Modest ensemble size for fast retraining
nodesize = 10, # Larger leaf nodes limit overfitting on small batches
mtry = max(1, floor(n_features / 3)),
keep.forest = TRUE
)
Hyperparameter choices:
- ntree = 50: A modest ensemble trains quickly and remains accurate on this small problem (see the sketch after this list)
- nodesize = 10: Larger leaf nodes reduce overfitting on small training batches
- Together, these settings balance accuracy with computational efficiency
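One way to judge whether 50 trees are enough is the forest's own out-of-bag error trace: calling plot() on a randomForest regression object draws OOB MSE against the number of trees, and the mse component holds the same values numerically. A short check on the returned model, assuming rf_result$model is non-NULL:
# Inspect the out-of-bag error of the final fitted forest
if (!is.null(rf_result$model)) {
  plot(rf_result$model, main = "OOB MSE vs. number of trees") # curve should flatten well before 50 trees
  tail(rf_result$model$mse, 5)                                # OOB MSE after trees 46-50
}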
Policy Extraction
policy <- sapply(1:(n_states-1), function(s) {
if (!is.null(current_model)) {
q_vals <- sapply(1:n_actions, function(a) {
x <- encode_features(s, a, n_states, n_actions)
predict(current_model, as.data.frame(t(x)))
})
which.max(q_vals)
} else {
1
}
})
After all episodes, extract the final learned policy:
- For each state, predict Q-values for both actions
- Select the action with highest Q-value
- This creates a lookup table: “in state s, take action a”
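The same predictions can also be arranged as a full approximate Q-table, which is handy for eyeballing against the tabular Q matrix. A small sketch (assumes rf_result$model is non-NULL; q_rf is just an illustrative name):
# Approximate Q-table implied by the Random Forest (rows = states, columns = actions)
q_rf <- matrix(NA_real_, nrow = n_states - 1, ncol = n_actions,
               dimnames = list(paste0("S", 1:(n_states - 1)), paste0("A", 1:n_actions)))
for (s in 1:(n_states - 1)) {
  for (a in 1:n_actions) {
    x <- encode_features(s, a, n_states, n_actions)
    q_rf[s, a] <- predict(rf_result$model, as.data.frame(t(x)))
  }
}
round(q_rf, 3)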
Visualization and Comparison
moving_average <- function(x, window = 50) {
n <- length(x)
result <- numeric(n)
for (i in 1:n) {
start_idx <- max(1, i - window + 1)
result[i] <- mean(x[start_idx:i])
}
result
}
This implements a trailing moving average:
- For episode i, averages rewards from episodes [i-49, i]
- Early episodes (i < 50) use expanding windows
- This smooths noisy reward data to show learning trends
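For reference, base R's stats::filter computes the same trailing mean in one call, though it yields NA for the first window - 1 episodes instead of an expanding window:
# One-line trailing mean; NA-padded at the start rather than expanding
rf_smooth_alt <- as.numeric(stats::filter(rf_result$rewards, rep(1 / 50, 50), sides = 1))
# Matches moving_average() from episode 50 onward
all.equal(rf_smooth_alt[50:1000], moving_average(rf_result$rewards, 50)[50:1000])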
Learning Curves Comparison
reward_df <- data.frame(
Episode = rep(1:1000, 2),
Reward = c(tabular_smooth, rf_smooth),
Algorithm = rep(c("Tabular Q-Learning", "Q-Learning + RF"), each = 1000)
)
reward_plot <- ggplot(reward_df, aes(x = Episode, y = Reward, color = Algorithm)) +
geom_line(linewidth = 1) + ...
Creates an overlaid comparison plot:
- Blue line: Tabular Q-learning (baseline)
- Green line: Random Forest Q-learning
- Shows which algorithm learns faster and reaches higher rewards
- Allows visual assessment of whether function approximation helps or hurts
Policy Comparison
policy_plot <- ggplot(policy_df, aes(x = State, y = Action,
color = Algorithm, shape = Algorithm)) +
geom_point(size = 4) +
geom_line(linewidth = 1, alpha = 0.6) + ...
Visualizes the final policies learned by both algorithms:
- Shows which action (1 or 2) each algorithm chooses for each state
- Points and lines make it easy to see agreements and disagreements
- Ideal: both algorithms should converge to similar policies
Feature Importance
importance_df <- data.frame(
Feature = c(paste0("S", 1:n_states), paste0("A", 1:n_actions)),
Importance = importance(rf_result$model)[, 1]
)
importance_plot <- ggplot(importance_df,
aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "forestgreen", alpha = 0.7) +
coord_flip() + ...
Shows which features the Random Forest found most important:
- Higher bars = more important for predicting Q-values
- Typically, state indicators near the goal (e.g., S9) rank highly, while S10 stays near zero because the terminal state never appears as an input state in the training data
- Action features show which action provides more informative signals
- Provides interpretability: “what does the model care about?”
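For a quick textual view of the same ranking, the importance scores can be sorted directly (again assuming the model exists):
# Features ranked by IncNodePurity, largest first
imp <- importance(rf_result$model)[, 1]
names(imp) <- c(paste0("S", 1:n_states), paste0("A", 1:n_actions))
round(sort(imp, decreasing = TRUE), 3)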
Summary Statistics
cat("\n=== Summary Statistics ===\n")
cat("\nTabular Q-Learning:\n")
cat(" Final 100-episode avg reward:", mean(tail(tabular_result$rewards, 100)), "\n")
cat("\nQ-Learning with Random Forest:\n")
cat(" Final 100-episode avg reward:", mean(tail(rf_result$rewards, 100)), "\n")
cat(" Experience buffer size:", rf_result$buffer$size(), "\n")
cat(" RF OOB MSE:", mean(rf_result$model$mse), "\n")
policy_agreement <- sum(tabular_result$policy[1:(n_states-1)] ==
rf_result$policy[1:(n_states-1)]) / (n_states-1)
cat("\nPolicy agreement:", round(policy_agreement * 100, 1), "%\n")Provides quantitative comparison:
- Final avg reward: Performance of each algorithm (higher is better)
- Buffer size: How many experiences were collected
- OOB MSE: Random Forest’s internal error estimate (lower is better)
- Policy agreement: Percentage of states where both algorithms chose the same action
- 100% = perfect agreement (both found the same optimal policy)
- Lower values suggest one algorithm failed to converge or the problem has multiple optimal solutions
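A per-state table makes any disagreement explicit, using the policies returned by both runs:
# Side-by-side policy table over the non-terminal states
policy_compare <- data.frame(
  State   = 1:(n_states - 1),
  Tabular = tabular_result$policy[1:(n_states - 1)],
  RF      = rf_result$policy[1:(n_states - 1)]
)
policy_compare$Agree <- policy_compare$Tabular == policy_compare$RF
print(policy_compare)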
7.3 Analysis and Insights
Random Forest function approximation exhibits several characteristics that distinguish it from linear methods. Trees can capture non-linear decision boundaries, enabling the model to learn state-action relationships that linear approaches cannot represent. The random feature sampling at each split performs automatic feature selection, focusing computational resources on the most informative variables. Ensemble averaging across multiple trees reduces overfitting and provides stable predictions across different training samples. Individual trees maintain interpretable decision paths that show how Q-values are estimated for specific state-action pairs.
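Those decision paths can be inspected directly: randomForest::getTree returns the split table of a single tree, showing which one-hot features it tests on the way to a Q-value estimate. A brief sketch on the model trained above (if it exists):
# First ten rows of the first tree's split table; labelVar = TRUE shows feature names instead of indices
if (!is.null(rf_result$model)) {
  head(getTree(rf_result$model, k = 1, labelVar = TRUE), 10)
}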
The batch retraining approach creates distinct computational trade-offs that affect implementation decisions. Training frequency must balance responsiveness against computational cost, as more frequent updates improve adaptation but require additional processing time. Trees need sufficient data to learn meaningful patterns, which can slow initial learning compared to methods that update continuously. Memory requirements increase over time as training examples accumulate, requiring careful management of historical data.
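The cost of one retraining step is easy to measure directly, which can guide the choice of retrain_freq; a rough sketch on a synthetic batch of the same size used above (the dummy data is purely illustrative):
# Rough timing of a single Random Forest refit on a 128 x 12 batch
dummy_x <- as.data.frame(matrix(rbinom(128 * n_features, 1, 0.1), nrow = 128))
dummy_y <- runif(128)
system.time(
  randomForest(x = dummy_x, y = dummy_y, ntree = 50, nodesize = 10,
               mtry = max(1, floor(n_features / 3)))
)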
Random Forest methods naturally generate feature importance measures that reveal which states and actions most influence Q-value predictions. This interpretability provides diagnostic capabilities for understanding learning issues and analyzing policy decisions. The feature ranking can guide state representation choices and help identify redundant or irrelevant variables in the problem formulation.
Random Forest function approximation occupies a position between simple linear models and neural networks in terms of complexity and capability. The method handles larger state spaces more effectively than tabular approaches while remaining computationally tractable. It captures non-linear patterns without requiring extensive feature engineering or domain expertise. The approach shows less sensitivity to hyperparameter choices compared to neural networks while maintaining stability across different problem instances. The inherent interpretability provides insights into the decision-making process that can be valuable for debugging and analysis.
Random Forest methods demonstrate several advantages and trade-offs when compared to linear function approximation. The tree-based approach excels at pattern recognition, learning state-action relationships that linear models cannot capture due to their representational limitations. However, initial learning proceeds more slowly as trees require sufficient data to construct meaningful decision boundaries. Computational costs are higher due to periodic retraining requirements, contrasting with the continuous gradient updates used in linear methods. Generalization performance tends to be superior, as ensemble averaging provides natural regularization that reduces overfitting tendencies.
7.4 Conclusion
Random Forest function approximation extends linear methods by offering enhanced modeling flexibility while preserving interpretability characteristics. The approach performs particularly well in environments with non-linear state-action relationships and provides regularization through ensemble averaging.
Several key observations emerge from this analysis. Non-linear function approximation can capture patterns that linear models miss, enabling better policy learning in complex environments. Batch learning approaches require careful consideration of training frequency and sample requirements to balance performance with computational efficiency. Feature importance analysis provides insights into learned policies that can guide problem formulation and debugging efforts. Tree-based methods offer an interpretable alternative to neural network approaches while maintaining theoretical foundations.
This exploration demonstrates how ensemble methods can enhance reinforcement learning without abandoning the established principles of Q-Learning. Future work could investigate online tree learning algorithms that avoid batch retraining requirements, adaptive schedules that optimize training frequency based on performance metrics, or hybrid approaches that combine strengths from different function approximation methods.