Chapter 7 Beyond Linear Models: Q-Learning with Random Forest Function Approximation in R

7.1 Introduction

While linear function approximation provides a solid foundation for scaling reinforcement learning beyond tabular methods, it assumes a linear relationship between features and Q-values. Real-world problems often exhibit complex, non-linear patterns that linear models cannot capture effectively. This chapter extends our previous exploration by implementing Q-Learning with Random Forest function approximation, demonstrating how ensemble methods can learn intricate state-action value relationships while maintaining interpretability and robust generalization.

Random Forests offer several advantages over linear approximation: they handle non-linear relationships naturally, provide built-in feature importance measures, resist overfitting through ensemble averaging, and require minimal hyperparameter tuning. We’ll implement this approach using the same 10-state, 2-action environment, comparing the learned policies and examining the unique characteristics of tree-based function approximation.

Random Forest function approximation replaces the linear parameterization with an ensemble of decision trees. Instead of:

\[ Q(s, a; \theta) = \phi(s, a)^T \theta \]

we now approximate the action-value function as:

\[ Q(s, a) = \frac{1}{B} \sum_{b=1}^{B} T_b(\phi(s, a)) \]

where \(T_b\) represents the \(b\)-th tree in the ensemble, \(B\) is the number of trees, and \(\phi(s, a)\) is our feature representation. Each tree \(T_b\) is trained on a bootstrap sample of the data with random feature subsets at each split, providing natural regularization and variance reduction.
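
To see this averaging concretely, the short sketch below fits a throwaway randomForest regression on random data (purely illustrative; X_demo, y_demo, and rf_demo are not part of the chapter's environment) and confirms that the package's aggregate prediction equals the mean of the individual tree predictions returned by predict() with predict.all = TRUE.

library(randomForest)

# Illustrative only: fit a small forest on random data to show that the
# aggregate prediction is the per-tree average, matching the formula above.
set.seed(123)
X_demo <- as.data.frame(matrix(rnorm(200 * 5), ncol = 5))
y_demo <- rnorm(200)
rf_demo <- randomForest(x = X_demo, y = y_demo, ntree = 25)

pred <- predict(rf_demo, X_demo[1, , drop = FALSE], predict.all = TRUE)
mean(pred$individual[1, ])   # (1/B) * sum_b T_b(x): manual average over the 25 trees
as.numeric(pred$aggregate)   # the forest's prediction -- identical value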

7.1.1 Q-Learning with Random Forest Approximation

The Q-Learning update process with Random Forest approximation involves:

  1. Experience Collection: Gather state-action-reward-next state tuples \((s, a, r, s')\)
  2. Target Computation: Calculate TD targets \(y = r + \gamma \max_{a'} Q(s', a')\)
  3. Model Training: Fit Random Forest regressor to predict \(Q(s, a)\) from features \(\phi(s, a)\)
  4. Policy Update: Use updated model for epsilon-greedy action selection

Unlike linear methods, which update their parameters incrementally after every step, Random Forest approximation requires periodic retraining on accumulated experience. This batch-style approach trades computational efficiency for modeling flexibility.
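
As a minimal sketch of step 2, the TD target for one stored transition can be written as below. This assumes a hypothetical helper q_hat(model, s, a) that returns the current forest's prediction for a single state-action pair; the full version appears in Section 7.2.

# Sketch only: q_hat(model, s, a) is a hypothetical helper returning one Q estimate.
compute_td_target <- function(model, r, s_prime, gamma, n_actions, terminal_state) {
  if (s_prime == terminal_state) {
    return(r)  # terminal transition: no bootstrapped future value
  }
  q_next <- sapply(1:n_actions, function(a_next) q_hat(model, s_prime, a_next))
  r + gamma * max(q_next)  # y = r + gamma * max_a' Q(s', a')
}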

For our implementation, we use a simple concatenation of one-hot encoded state and action vectors:

\[ \phi(s, a) = [e_s^{(state)} \; || \; e_a^{(action)}] \]

where \(e_s^{(state)}\) is a one-hot vector for state \(s\) and \(e_a^{(action)}\) is a one-hot vector for action \(a\). This encoding allows trees to learn complex interactions between states and actions while maintaining interpretability.

7.1.2 Comparison with Previous Methods

| Aspect | Tabular Q-Learning | Linear Function Approximation | Random Forest Function Approximation |
|---|---|---|---|
| Model Complexity | None; direct storage | Linear combinations | Non-linear ensemble |
| Feature Interactions | Implicit | None (unless engineered) | Automatic discovery |
| Interpretability | Full | Moderate (weights) | High (tree structures) |
| Training | Online updates | Gradient descent | Batch retraining |
| Overfitting Risk | None | Low | Low (ensemble averaging) |
| Computational Cost | \(O(1)\) lookup | \(O(d)\) linear algebra | \(O(B \cdot \log n)\) prediction |

7.2 R Implementation

Our implementation builds upon the previous environment while introducing Random Forest-based Q-value approximation. The key innovation lies in accumulating training examples and periodically retraining the forest to incorporate new experience.

# Load required libraries
library(randomForest)
library(ggplot2)
library(gridExtra)

# Set global seed for reproducibility
set.seed(42)

# Environment setup
n_states <- 10
n_actions <- 2
gamma <- 0.9
terminal_state <- n_states

# Build valid transition and reward models
transition_model <- array(0, dim = c(n_states, n_actions, n_states))
reward_model <- array(0, dim = c(n_states, n_actions, n_states))

for (s in 1:(n_states - 1)) {
  for (a in 1:n_actions) {
    # Ensure valid, normalized probability distributions
    # Action 1: mostly forward, with small random transition
    if (a == 1) {
      next_state <- s + 1  # Forward move; reaches the terminal state from state 9
      random_state <- sample(setdiff(1:(n_states - 1), next_state), 1)  # Random slip stays non-terminal
      transition_model[s, 1, next_state] <- 0.9
      transition_model[s, 1, random_state] <- 0.1
    } else {
      # Action 2: two random non-terminal states
      random_states <- sample(1:(n_states-1), 2, replace = FALSE)
      transition_model[s, 2, random_states[1]] <- 0.8
      transition_model[s, 2, random_states[2]] <- 0.2
    }
    
    # Assign rewards
    for (s_prime in 1:n_states) {
      if (transition_model[s, a, s_prime] > 0) {
        reward_model[s, a, s_prime] <- ifelse(
          s_prime == n_states, 
          ifelse(a == 1, 1.0, 0.5),  # Terminal reward
          runif(1, 0, ifelse(a == 1, 0.1, 0.05))  # Step reward
        )
      }
    }
  }
}

# Terminal state: no transitions
transition_model[n_states, , ] <- 0
reward_model[n_states, , ] <- 0

# Verify transition probabilities sum to 1
for (s in 1:(n_states-1)) {
  for (a in 1:n_actions) {
    prob_sum <- sum(transition_model[s, a, ])
    if (abs(prob_sum - 1.0) > 1e-6) {
      warning(sprintf("State %d, Action %d: probabilities sum to %.3f", s, a, prob_sum))
    }
  }
}

# Sampling function
sample_env <- function(s, a) {
  probs <- transition_model[s, a, ]
  s_prime <- sample(1:n_states, 1, prob = probs)
  reward <- reward_model[s, a, s_prime]
  list(s_prime = s_prime, reward = reward)
}

# Feature encoding
encode_features <- function(s, a, n_states, n_actions) {
  state_vec <- rep(0, n_states)
  action_vec <- rep(0, n_actions)
  state_vec[s] <- 1
  action_vec[a] <- 1
  c(state_vec, action_vec)
}

n_features <- n_states + n_actions

# Experience replay buffer
ExperienceBuffer <- setRefClass(
  "ExperienceBuffer",
  fields = list(
    capacity = "numeric",
    buffer = "list",
    position = "numeric"
  ),
  methods = list(
    initialize = function(cap = 5000) {
      capacity <<- cap
      buffer <<- vector("list", capacity)
      position <<- 0
    },
    add = function(experience) {
      idx <- (position %% capacity) + 1
      buffer[[idx]] <<- experience
      position <<- position + 1
    },
    sample = function(batch_size) {
      valid_size <- min(position, capacity)
      if (valid_size < batch_size) {
        indices <- 1:valid_size
      } else {
        indices <- base::sample(1:valid_size, batch_size, replace = FALSE)  # base:: avoids shadowing by this method's own name
      }
      buffer[indices]
    },
    size = function() {
      min(position, capacity)
    }
  )
)

# Tabular Q-Learning (baseline)
q_learning_tabular <- function(episodes = 1000, alpha = 0.1, epsilon = 0.1) {
  Q <- matrix(0, nrow = n_states, ncol = n_actions)
  rewards_history <- numeric(episodes)
  
  for (ep in 1:episodes) {
    s <- sample(1:(n_states - 1), 1)
    episode_reward <- 0
    
    while (s != terminal_state) {
      # Epsilon-greedy action selection
      a <- if (runif(1) < epsilon) {
        sample(1:n_actions, 1)
      } else {
        which.max(Q[s, ])
      }
      
      # Take action
      out <- sample_env(s, a)
      s_prime <- out$s_prime
      r <- out$reward
      episode_reward <- episode_reward + r
      
      # Q-learning update
      max_q_next <- ifelse(s_prime == terminal_state, 0, max(Q[s_prime, ]))
      td_target <- r + gamma * max_q_next
      Q[s, a] <- Q[s, a] + alpha * (td_target - Q[s, a])
      
      s <- s_prime
    }
    
    rewards_history[ep] <- episode_reward
  }
  
  # Extract policy
  policy <- apply(Q[1:(n_states-1), ], 1, which.max)
  list(Q = Q, policy = c(policy, NA), rewards = rewards_history)
}

# Q-Learning with Random Forest function approximation
q_learning_rf <- function(episodes = 1000, epsilon = 0.1, buffer_capacity = 5000,
                         batch_size = 128, retrain_freq = 50, update_target_freq = 100) {
  
  # Initialize experience buffer
  replay_buffer <- ExperienceBuffer$new(buffer_capacity)
  
  # Initialize models (current and target)
  current_model <- NULL
  target_model <- NULL
  
  rewards_history <- numeric(episodes)
  td_errors <- numeric(episodes)
  model_updates <- 0
  
  for (ep in 1:episodes) {
    s <- sample(1:(n_states - 1), 1)
    episode_reward <- 0
    episode_td_error <- 0
    steps <- 0
    
    while (s != terminal_state) {
      # Predict Q-values using current model
      q_values <- sapply(1:n_actions, function(a) {
        x <- encode_features(s, a, n_states, n_actions)
        if (!is.null(current_model)) {
          predict(current_model, as.data.frame(t(x)))
        } else {
          0  # Initialize to zero
        }
      })
      
      # Epsilon-greedy action selection
      a <- if (runif(1) < epsilon) {
        sample(1:n_actions, 1)
      } else {
        which.max(q_values)
      }
      
      # Take action
      out <- sample_env(s, a)
      s_prime <- out$s_prime
      r <- out$reward
      episode_reward <- episode_reward + r
      steps <- steps + 1
      
      # Store experience
      replay_buffer$add(list(s = s, a = a, r = r, s_prime = s_prime))
      
      s <- s_prime
    }
    
    rewards_history[ep] <- episode_reward
    
    # Train model if enough samples
    if (replay_buffer$size() >= batch_size && ep %% retrain_freq == 0) {
      # Sample batch from replay buffer
      batch <- replay_buffer$sample(batch_size)
      
      # Prepare training data
      train_x <- matrix(0, nrow = batch_size, ncol = n_features)
      train_y <- numeric(batch_size)
      
      for (i in 1:batch_size) {
        exp <- batch[[i]]
        
        # Use target model for computing Q-values (DQN-style)
        model_to_use <- if (!is.null(target_model)) target_model else current_model
        
        # Compute TD target
        if (exp$s_prime == terminal_state) {
          target_q <- exp$r
        } else {
          # Get max Q-value from target network
          q_next <- sapply(1:n_actions, function(a_next) {
            x_next <- encode_features(exp$s_prime, a_next, n_states, n_actions)
            if (!is.null(model_to_use)) {
              predict(model_to_use, as.data.frame(t(x_next)))
            } else {
              0
            }
          })
          target_q <- exp$r + gamma * max(q_next)
        }
        
        train_x[i, ] <- encode_features(exp$s, exp$a, n_states, n_actions)
        train_y[i] <- target_q
      }
      
      # Train Random Forest
      current_model <- randomForest(
        x = as.data.frame(train_x),
        y = train_y,
        ntree = 50,  # Reduced for efficiency
        nodesize = 10,
        mtry = max(1, floor(n_features / 3)),
        keep.forest = TRUE
      )
      
      model_updates <- model_updates + 1
      
      # Update target model periodically
      if (model_updates %% (update_target_freq / retrain_freq) == 0) {
        target_model <- current_model
      }
      
      # Track TD error
      predictions <- predict(current_model, as.data.frame(train_x))
      td_errors[ep] <- mean(abs(predictions - train_y))
    }
  }
  
  # Extract final policy
  policy <- sapply(1:(n_states-1), function(s) {
    if (!is.null(current_model)) {
      q_vals <- sapply(1:n_actions, function(a) {
        x <- encode_features(s, a, n_states, n_actions)
        predict(current_model, as.data.frame(t(x)))
      })
      which.max(q_vals)
    } else {
      1
    }
  })
  
  list(
    model = current_model,
    policy = c(policy, NA),
    rewards = rewards_history,
    td_errors = td_errors,
    buffer = replay_buffer
  )
}

# Run both algorithms
cat("Running Tabular Q-Learning...\n")
tabular_result <- q_learning_tabular(episodes = 1000, alpha = 0.1, epsilon = 0.1)

cat("Running Q-Learning with Random Forest...\n")
rf_result <- q_learning_rf(
  episodes = 1000, 
  epsilon = 0.1, 
  buffer_capacity = 5000,
  batch_size = 128,
  retrain_freq = 50,
  update_target_freq = 100
)

# Compute moving average (properly)
moving_average <- function(x, window = 50) {
  n <- length(x)
  result <- numeric(n)
  for (i in 1:n) {
    start_idx <- max(1, i - window + 1)
    result[i] <- mean(x[start_idx:i])
  }
  result
}

tabular_smooth <- moving_average(tabular_result$rewards, 50)
rf_smooth <- moving_average(rf_result$rewards, 50)

# Create comparison dataframe
reward_df <- data.frame(
  Episode = rep(1:1000, 2),
  Reward = c(tabular_smooth, rf_smooth),
  Algorithm = rep(c("Tabular Q-Learning", "Q-Learning + RF"), each = 1000)
)

# Plot 1: Learning curves comparison
reward_plot <- ggplot(reward_df, aes(x = Episode, y = Reward, color = Algorithm)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("Tabular Q-Learning" = "blue", 
                                 "Q-Learning + RF" = "forestgreen")) +
  theme_minimal() +
  labs(
    title = "Learning Curves Comparison (50-episode moving average)",
    x = "Episode",
    y = "Average Reward"
  ) +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    legend.position = "bottom"
  )

# Plot 2: Policy comparison
policy_df <- data.frame(
  State = rep(1:(n_states-1), 2),
  Action = c(tabular_result$policy[1:(n_states-1)], 
             rf_result$policy[1:(n_states-1)]),
  Algorithm = rep(c("Tabular Q-Learning", "Q-Learning + RF"), 
                  each = n_states-1)
)

policy_plot <- ggplot(policy_df, aes(x = State, y = Action, color = Algorithm, shape = Algorithm)) +
  geom_point(size = 4) +
  geom_line(linewidth = 1, alpha = 0.6) +
  scale_color_manual(values = c("Tabular Q-Learning" = "blue", 
                                 "Q-Learning + RF" = "forestgreen")) +
  scale_shape_manual(values = c(16, 17)) +
  scale_y_continuous(breaks = 1:2, labels = c("Action 1", "Action 2"), 
                     limits = c(0.5, 2.5)) +
  scale_x_continuous(breaks = 1:(n_states-1)) +
  theme_minimal() +
  labs(
    title = "Learned Policies Comparison",
    x = "State",
    y = "Action"
  ) +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    legend.position = "bottom"
  )

# Display plots
print(reward_plot)
print(policy_plot)

# Feature importance (if model exists)
if (!is.null(rf_result$model)) {
  importance_df <- data.frame(
    Feature = c(paste0("S", 1:n_states), paste0("A", 1:n_actions)),
    Importance = importance(rf_result$model)[, 1]
  )
  
  importance_plot <- ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
    geom_col(fill = "forestgreen", alpha = 0.7) +
    coord_flip() +
    theme_minimal() +
    labs(
      title = "Feature Importance in Random Forest Q-Function",
      x = "Feature",
      y = "Importance (IncNodePurity)"
    ) +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      axis.title = element_text(size = 12)
    )
  
  print(importance_plot)
}

# Summary statistics
cat("\n=== Summary Statistics ===\n")
cat("\nTabular Q-Learning:\n")
cat("  Final 100-episode avg reward:", mean(tail(tabular_result$rewards, 100)), "\n")
cat("  Best policy reward:", max(tabular_smooth), "\n")

cat("\nQ-Learning with Random Forest:\n")
cat("  Final 100-episode avg reward:", mean(tail(rf_result$rewards, 100)), "\n")
cat("  Best policy reward:", max(rf_smooth), "\n")
cat("  Experience buffer size:", rf_result$buffer$size(), "\n")
if (!is.null(rf_result$model)) {
  cat("  RF model trees:", rf_result$model$ntree, "\n")
  cat("  RF OOB MSE:", mean(rf_result$model$mse), "\n")
}

# Policy agreement
policy_agreement <- sum(tabular_result$policy[1:(n_states-1)] == 
                        rf_result$policy[1:(n_states-1)]) / (n_states-1)
cat("\nPolicy agreement:", round(policy_agreement * 100, 1), "%\n")

Environment Setup

n_states <- 10
n_actions <- 2
gamma <- 0.9
terminal_state <- n_states

This chunk defines the basic parameters of the reinforcement learning problem.

  • n_states <- 10: The “world” or “game” has 10 distinct states.
  • n_actions <- 2: In any state, the agent can choose between 2 possible actions.
  • gamma <- 0.9: This is the discount factor. It determines how much the agent values future rewards. A value of 0.9 means a reward received in the next step is worth 90% of a reward received now.
  • terminal_state <- n_states: State 10 is designated as the “end” state. When the agent reaches this state, the “episode” (one attempt) is over.

Environment Model Definition

transition_model <- array(0, dim = c(n_states, n_actions, n_states))
reward_model <- array(0, dim = c(n_states, n_actions, n_states))

for (s in 1:(n_states - 1)) {
  for (a in 1:n_actions) {
    if (a == 1) {
      next_state <- s + 1
      random_state <- sample(setdiff(1:(n_states - 1), next_state), 1)
      transition_model[s, 1, next_state] <- 0.9
      transition_model[s, 1, random_state] <- 0.1
    } else {
      random_states <- sample(1:(n_states-1), 2, replace = FALSE)
      transition_model[s, 2, random_states[1]] <- 0.8
      transition_model[s, 2, random_states[2]] <- 0.2
    }
    # ... reward assignment ...
  }
}

This chunk builds the “rules” of the environment with critical fixes:

  • Fixed Probability Distribution:
    • Action 1: a 90% chance to move forward to the next state (reaching the terminal state from state 9) and a 10% chance to slip to a different random non-terminal state (setdiff prevents both probabilities from landing on the same state)
    • Action 2: 80% to one random non-terminal state, 20% to a second, different one (using replace = FALSE)
    • This ensures probabilities always sum to exactly 1.0 for each (state, action) pair
  • No Invalid Terminal Transitions: random jumps are restricted to 1:(n_states-1), so the terminal state is reached only via the deterministic forward move under Action 1 from state 9
  • Verification: the code includes a check that prints warnings if any probability distribution does not sum to 1.0

Sampling Function

sample_env <- function(s, a) {
  probs <- transition_model[s, a, ]
  s_prime <- sample(1:n_states, 1, prob = probs)
  reward <- reward_model[s, a, s_prime]
  list(s_prime = s_prime, reward = reward)
}

This function acts as the simulator. The Q-learning agent doesn’t get to see the full transition_model. Instead, it gives a state (s) and action (a) to this function, which:

  1. Looks up the probabilities for all possible next states
  2. Samples one next state based on those probabilities
  3. Returns the next state and corresponding reward
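
For example, once the environment arrays above have been built, a single sampled step from state 3 under action 1 looks like this (the output is stochastic, so exact values vary between runs):

out <- sample_env(3, 1)
out$s_prime   # usually 4 (probability 0.9), occasionally a random non-terminal state
out$reward    # the reward stored in reward_model for the sampled transition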

Feature Encoding

encode_features <- function(s, a, n_states, n_actions) {
  state_vec <- rep(0, n_states)
  action_vec <- rep(0, n_actions)
  state_vec[s] <- 1
  action_vec[a] <- 1
  c(state_vec, action_vec)
}

This performs one-hot encoding to convert states and actions into numerical features:

  • Creates a vector of 10 zeros, puts a 1 at position s (e.g., state 3 → [0,0,1,0,0,0,0,0,0,0])
  • Creates a vector of 2 zeros, puts a 1 at position a (e.g., action 2 → [0,1])
  • Combines them into a 12-dimensional feature vector for the Random Forest
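
For example, state 3 with action 2 encodes as follows:

encode_features(3, 2, n_states, n_actions)
#  [1] 0 0 1 0 0 0 0 0 0 0 0 1   (positions 1-10: state one-hot; positions 11-12: action one-hot)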

Experience Replay Buffer

ExperienceBuffer <- setRefClass(
  "ExperienceBuffer",
  fields = list(capacity = "numeric", buffer = "list", position = "numeric"),
  methods = list(
    add = function(experience) { ... },
    sample = function(batch_size) { ... },
    size = function() { ... }
  )
)
  • Circular Buffer: Stores up to 5000 experiences (state, action, reward, next_state)
  • Prevents Catastrophic Forgetting: By sampling randomly from past experiences rather than learning only from the most recent ones
  • Breaks Temporal Correlation: Random sampling decorrelates sequential experiences, improving learning stability
  • Methods:
    • add(): Stores a new experience, overwriting old ones when full
    • sample(): Returns a random batch of experiences for training
    • size(): Returns current number of stored experiences
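
A quick usage sketch of the buffer defined above, with a tiny capacity purely for illustration:

buf <- ExperienceBuffer$new(cap = 3)
buf$add(list(s = 1, a = 1, r = 0.05, s_prime = 2))
buf$add(list(s = 2, a = 2, r = 0.02, s_prime = 5))
buf$size()                 # 2
length(buf$sample(2))      # 2 experiences drawn at random
buf$add(list(s = 5, a = 1, r = 0.07, s_prime = 6))
buf$add(list(s = 6, a = 1, r = 0.03, s_prime = 7))   # wraps around, overwriting the oldest entry
buf$size()                 # 3 (capped at capacity)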

Tabular Q-Learning Baseline

q_learning_tabular <- function(episodes = 1000, alpha = 0.1, epsilon = 0.1) {
  Q <- matrix(0, nrow = n_states, ncol = n_actions)
  # ... standard Q-learning update ...
  Q[s, a] <- Q[s, a] + alpha * (td_target - Q[s, a])
  # ...
}

This implements standard tabular Q-learning as a baseline:

  • Maintains a 10×2 table of Q-values
  • Uses the classical update rule: \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]\)
  • Provides a gold standard to compare against the Random Forest approximation
  • Should converge to the optimal policy on this small problem
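
A worked example of a single update with illustrative numbers (alpha = 0.1, gamma = 0.9):

# Suppose Q(s, a) = 0.20, the step reward is 0.05, and max_a' Q(s', a') = 0.40.
Q_sa <- 0.20; r <- 0.05; max_q_next <- 0.40
td_target <- r + 0.9 * max_q_next    # 0.41
Q_sa + 0.1 * (td_target - Q_sa)      # 0.221: the Q-value moves 10% of the way toward the target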

Q-Learning with Random Forest

q_learning_rf <- function(episodes = 1000, epsilon = 0.1, 
                         buffer_capacity = 5000, batch_size = 128,
                         retrain_freq = 50, update_target_freq = 100) {

The main function now has improved hyperparameters:

  • buffer_capacity = 5000: Size of experience replay buffer
  • batch_size = 128: Number of experiences to sample for each training update
  • retrain_freq = 50: Retrain model every 50 episodes (less frequent, more stable)
  • update_target_freq = 100: Update target network every 100 episodes

Initialization

  replay_buffer <- ExperienceBuffer$new(buffer_capacity)
  current_model <- NULL
  target_model <- NULL
  rewards_history <- numeric(episodes)
  td_errors <- numeric(episodes)
  • replay_buffer: Stores experiences for replay
  • current_model: The Q-function being actively trained
  • target_model: A frozen copy used for computing TD targets (DQN-style)
  • td_errors: Tracks learning progress by monitoring prediction errors

Episode Loop

for (ep in 1:episodes) {
  s <- sample(1:(n_states - 1), 1)
  episode_reward <- 0
  
  while (s != terminal_state) {
    # ... one step ...
    replay_buffer$add(list(s = s, a = a, r = r, s_prime = s_prime))
    s <- s_prime
  }

For each episode:

  1. Start in a random non-terminal state
  2. Take steps until reaching the terminal state
  3. Store each experience in the replay buffer (experiences are not used for training at this point; training happens later on sampled batches)
  4. Accumulate episode reward

Action Selection

  q_values <- sapply(1:n_actions, function(a) {
    x <- encode_features(s, a, n_states, n_actions)
    if (!is.null(current_model)) {
      predict(current_model, as.data.frame(t(x)))
    } else {
      0
    }
  })
  
  a <- if (runif(1) < epsilon) {
    sample(1:n_actions, 1)
  } else {
    which.max(q_values)
  }

Epsilon-greedy action selection:

  • Predict Q-values for all actions using the current model
  • With probability ε (10%), explore by choosing randomly
  • With probability 1-ε (90%), exploit by choosing the action with highest Q-value

Training with Experience Replay

if (replay_buffer$size() >= batch_size && ep %% retrain_freq == 0) {
  batch <- replay_buffer$sample(batch_size)
  
  train_x <- matrix(0, nrow = batch_size, ncol = n_features)
  train_y <- numeric(batch_size)
  
  for (i in 1:batch_size) {
    exp <- batch[[i]]
    model_to_use <- if (!is.null(target_model)) target_model else current_model
    
    if (exp$s_prime == terminal_state) {
      target_q <- exp$r
    } else {
      q_next <- sapply(1:n_actions, function(a_next) {
        x_next <- encode_features(exp$s_prime, a_next, n_states, n_actions)
        if (!is.null(model_to_use)) {
          predict(model_to_use, as.data.frame(t(x_next)))
        } else {
          0
        }
      })
      target_q <- exp$r + gamma * max(q_next)
    }
    
    train_x[i, ] <- encode_features(exp$s, exp$a, n_states, n_actions)
    train_y[i] <- target_q
  }
  
  current_model <- randomForest(...)
}

Random Forest Training

current_model <- randomForest(
  x = as.data.frame(train_x),
  y = train_y,
  ntree = 50,          # Reduced from 100 for efficiency
  nodesize = 10,       # Increased from 5 to prevent overfitting
  mtry = max(1, floor(n_features / 3)),
  keep.forest = TRUE
)

Improved hyperparameters:

  • ntree = 50: Fewer trees (faster training, still accurate)
  • nodesize = 10: Larger leaf nodes (less overfitting on small batches)
  • These settings balance accuracy with computational efficiency

Policy Extraction

policy <- sapply(1:(n_states-1), function(s) {
  if (!is.null(current_model)) {
    q_vals <- sapply(1:n_actions, function(a) {
      x <- encode_features(s, a, n_states, n_actions)
      predict(current_model, as.data.frame(t(x)))
    })
    which.max(q_vals)
  } else {
    1
  }
})

After all episodes, extract the final learned policy:

  • For each state, predict Q-values for both actions
  • Select the action with highest Q-value
  • This creates a lookup table: “in state s, take action a”
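
A convenient way to inspect both extracted policies side by side (using the objects produced by the code above):

data.frame(
  State   = 1:(n_states - 1),
  Tabular = tabular_result$policy[1:(n_states - 1)],
  RF      = rf_result$policy[1:(n_states - 1)]
)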

Visualization and Comparison

moving_average <- function(x, window = 50) {
  n <- length(x)
  result <- numeric(n)
  for (i in 1:n) {
    start_idx <- max(1, i - window + 1)
    result[i] <- mean(x[start_idx:i])
  }
  result
}

Properly implements a moving average:

  • For episode i, averages rewards from episodes [i-49, i]
  • Early episodes (i < 50) use expanding windows
  • This smooths noisy reward data to show learning trends
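
The same expanding-then-rolling average can also be computed without an explicit loop; the version below is a sketch using cumulative sums and produces identical results for the reward vectors used here:

moving_average_vec <- function(x, window = 50) {
  n  <- length(x)
  cs <- cumsum(x)
  # Cumulative sum lagged by `window` (zero-padded), so cs - lag_cs is the windowed sum
  lag_cs <- if (n > window) c(rep(0, window), cs[seq_len(n - window)]) else rep(0, n)
  (cs - lag_cs) / pmin(seq_len(n), window)   # expanding denominator for the first `window` episodes
}

# all.equal(moving_average_vec(rf_result$rewards), moving_average(rf_result$rewards))  # TRUE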

Learning Curves Comparison

reward_df <- data.frame(
  Episode = rep(1:1000, 2),
  Reward = c(tabular_smooth, rf_smooth),
  Algorithm = rep(c("Tabular Q-Learning", "Q-Learning + RF"), each = 1000)
)

reward_plot <- ggplot(reward_df, aes(x = Episode, y = Reward, color = Algorithm)) +
  geom_line(linewidth = 1) + ...

Creates a side-by-side comparison plot:

  • Blue line: Tabular Q-learning (baseline)
  • Green line: Random Forest Q-learning
  • Shows which algorithm learns faster and reaches higher rewards
  • Allows visual assessment of whether function approximation helps or hurts

Policy Comparison

policy_plot <- ggplot(policy_df, aes(x = State, y = Action, 
                                     color = Algorithm, shape = Algorithm)) +
  geom_point(size = 4) +
  geom_line(linewidth = 1, alpha = 0.6) + ...

Visualizes the final policies learned by both algorithms:

  • Shows which action (1 or 2) each algorithm chooses for each state
  • Points and lines make it easy to see agreements and disagreements
  • Ideal: both algorithms should converge to similar policies

Feature Importance

importance_df <- data.frame(
  Feature = c(paste0("S", 1:n_states), paste0("A", 1:n_actions)),
  Importance = importance(rf_result$model)[, 1]
)

importance_plot <- ggplot(importance_df, 
                         aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "forestgreen", alpha = 0.7) +
  coord_flip() + ...

Shows which features the Random Forest found most important:

  • Higher bars = more important for predicting Q-values
  • Typically, state indicators near the goal (e.g., State 9) will be most important; the terminal-state indicator S10 never appears as an input in the training rows, so its importance stays near zero
  • Action features show which action provides more informative signals
  • Provides interpretability: “what does the model care about?”
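
The randomForest package also provides a built-in plot of the same information via varImpPlot(); note that it labels features with the training data frame's default column names (V1-V12) rather than the S/A labels used above:

if (!is.null(rf_result$model)) {
  varImpPlot(rf_result$model, main = "Random Forest Q-Function: Variable Importance")
}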

Summary Statistics

cat("\n=== Summary Statistics ===\n")
cat("\nTabular Q-Learning:\n")
cat("  Final 100-episode avg reward:", mean(tail(tabular_result$rewards, 100)), "\n")

cat("\nQ-Learning with Random Forest:\n")
cat("  Final 100-episode avg reward:", mean(tail(rf_result$rewards, 100)), "\n")
cat("  Experience buffer size:", rf_result$buffer$size(), "\n")
cat("  RF OOB MSE:", mean(rf_result$model$mse), "\n")

policy_agreement <- sum(tabular_result$policy[1:(n_states-1)] == 
                        rf_result$policy[1:(n_states-1)]) / (n_states-1)
cat("\nPolicy agreement:", round(policy_agreement * 100, 1), "%\n")

Provides quantitative comparison:

  • Final avg reward: Performance of each algorithm (higher is better)
  • Buffer size: How many experiences were collected
  • OOB MSE: Random Forest’s internal error estimate (lower is better)
  • Policy agreement: Percentage of states where both algorithms chose the same action
    • 100% = perfect agreement (both found the same optimal policy)
    • Lower values suggest one algorithm failed to converge or the problem has multiple optimal solutions

7.3 Analysis and Insights

Random Forest function approximation exhibits several characteristics that distinguish it from linear methods. Trees can capture non-linear decision boundaries, enabling the model to learn state-action relationships that linear approaches cannot represent. The random feature sampling at each split performs automatic feature selection, focusing computational resources on the most informative variables. Ensemble averaging across multiple trees reduces overfitting and provides stable predictions across different training samples. Individual trees maintain interpretable decision paths that show how Q-values are estimated for specific state-action pairs.
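
For instance, the structure of any single tree in the fitted ensemble can be inspected with getTree() from the randomForest package (shown here as a sketch using the model trained in Section 7.2):

# Each row is a node: daughter nodes, split variable, split point, and terminal prediction.
if (!is.null(rf_result$model)) {
  head(getTree(rf_result$model, k = 1, labelVar = TRUE))
}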

The batch retraining approach creates distinct computational trade-offs that affect implementation decisions. Training frequency must balance responsiveness against computational cost, as more frequent updates improve adaptation but require additional processing time. Trees need sufficient data to learn meaningful patterns, which can slow initial learning compared to methods that update continuously. Memory requirements increase over time as training examples accumulate, requiring careful management of historical data.

Random Forest methods naturally generate feature importance measures that reveal which states and actions most influence Q-value predictions. This interpretability provides diagnostic capabilities for understanding learning issues and analyzing policy decisions. The feature ranking can guide state representation choices and help identify redundant or irrelevant variables in the problem formulation.

Random Forest function approximation occupies a position between simple linear models and neural networks in terms of complexity and capability. The method handles larger state spaces more effectively than tabular approaches while remaining computationally tractable. It captures non-linear patterns without requiring extensive feature engineering or domain expertise. The approach shows less sensitivity to hyperparameter choices compared to neural networks while maintaining stability across different problem instances. The inherent interpretability provides insights into the decision-making process that can be valuable for debugging and analysis.

Random Forest methods demonstrate several advantages and trade-offs when compared to linear function approximation. The tree-based approach excels at pattern recognition, learning state-action relationships that linear models cannot capture due to their representational limitations. However, initial learning proceeds more slowly as trees require sufficient data to construct meaningful decision boundaries. Computational costs are higher due to periodic retraining requirements, contrasting with the continuous gradient updates used in linear methods. Generalization performance tends to be superior, as ensemble averaging provides natural regularization that reduces overfitting tendencies.

7.4 Conclusion

Random Forest function approximation extends linear methods by offering enhanced modeling flexibility while preserving interpretability characteristics. The approach performs particularly well in environments with non-linear state-action relationships and provides regularization through ensemble averaging.

Several key observations emerge from this analysis. Non-linear function approximation can capture patterns that linear models miss, enabling better policy learning in complex environments. Batch learning approaches require careful consideration of training frequency and sample requirements to balance performance with computational efficiency. Feature importance analysis provides insights into learned policies that can guide problem formulation and debugging efforts. Tree-based methods offer an interpretable alternative to neural network approaches while maintaining theoretical foundations.

This exploration demonstrates how ensemble methods can enhance reinforcement learning without abandoning the established principles of Q-Learning. Future work could investigate online tree learning algorithms that avoid batch retraining requirements, adaptive schedules that optimize training frequency based on performance metrics, or hybrid approaches that combine strengths from different function approximation methods.