Chapter 5 On-Policy vs Off-Policy Reinforcement Learning: SARSA, Q-Learning, and Monte Carlo in R

5.1 Introduction

In model-free reinforcement learning (RL), agents learn optimal policies directly from experience without a model of the environment’s dynamics. Two key approaches are on-policy and off-policy methods, exemplified by SARSA (State-Action-Reward-State-Action) and Q-Learning, respectively. Additionally, off-policy Monte Carlo methods leverage importance sampling to learn optimal policies from exploratory data. This post explores these methods, focusing on their theoretical foundations, practical implications, and implementation in R. We use a 10-state, 2-action environment to compare how SARSA, Q-Learning, and off-policy Monte Carlo learn policies and adapt to environmental changes, such as outcome devaluation. Mathematical formulations and R code are provided to illustrate the concepts. SARSA, Q-Learning, and off-policy Monte Carlo aim to estimate the action-value function \(Q^\pi(s,a)\), the expected discounted return for taking action \(a\) in state \(s\) and following policy \(\pi\):

\[ Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R_{t+1} \mid S_0 = s, A_0 = a \right] \]

where \(\gamma \in [0,1]\) is the discount factor, and \(R_{t+1}\) is the reward at time \(t+1\).

5.2 SARSA (On-Policy)

SARSA is an on-policy method, meaning it learns the value of the policy being followed, including exploration. The update rule is:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma Q(s', a') - Q(s,a) \right) \]

where \(\alpha\) is the learning rate, \(r\) is the reward, \(s'\) is the next state, and \(a'\) is the action actually taken in \(s'\) according to the current policy (e.g., \(\epsilon\)-greedy). SARSA updates \(Q\) based on the next state-action pair \((s', a')\), making it sensitive to the exploration policy. In the 10-state environment, SARSA learns a policy that accounts for exploratory actions, potentially avoiding risky moves that lead to lower rewards.

5.3 Q-Learning (Off-Policy)

Q-Learning is an off-policy method, meaning it learns the optimal policy \(\pi^*\) regardless of the exploration policy (e.g., \(\epsilon\)-greedy). The update rule is:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right) \]

where \(\max_{a'} Q(s', a')\) estimates the value of the next state assuming the optimal action. This bootstrapping makes Q-Learning converge to the optimal action-value function \(Q^*(s,a)\). In the 10-state environment, Q-Learning favors actions that maximize future rewards (e.g., action 1 in state 9, yielding a 1.0 reward at the terminal state), ignoring exploration effects.

5.4 Off-Policy Monte Carlo with Importance Sampling

Off-policy Monte Carlo uses importance sampling to learn the value of a target policy (e.g., greedy) from episodes generated by a behavior policy (e.g., random). The return \(G_t\) (cumulative discounted reward from time \(t\) onward) is weighted by the importance sampling ratio:

\[ \rho_t = \prod_{k=t}^T \frac{\pi(a_k|s_k)}{\mu(a_k|s_k)} \]

where \(\pi\) is the target policy, \(\mu\) is the behavior policy, and \(T\) is the episode length. The Q-value update is:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left( \rho_t G_t - Q(s,a) \right) \]

Importance Sampling Mechanics: In the 10-state environment, suppose the behavior policy is random (0.5 probability for actions 1 and 2), but the target policy is greedy (choosing the action with the highest Q-value). If the agent in state 9 takes action 2 (reward 0.5 at the terminal state), but the greedy policy prefers action 1 (reward 1.0), the importance sampling ratio \(\rho_t\) is low (e.g., 0 if the greedy policy assigns zero probability to action 2), reducing the update’s impact. This allows learning the optimal policy from exploratory trajectories, but high variance can occur if the policies diverge significantly. Weighted importance sampling (as in the code below) normalizes weights to reduce variance, and early termination (stopping when \(\rho_t = 0\)) improves efficiency.

5.5 Key Differences

Aspect	SARSA (On-Policy)	Q-Learning (Off-Policy)	Off-Policy Monte Carlo
Update Rule	Uses \(Q(s', a')\), where \(a'\) is sampled from the current policy.	Uses \(\max_{a'} Q(s', a')\), assuming the optimal action.	Uses \(\rho_t G_t\), where \(\rho_t\) reweights returns based on policy likelihoods.
Policy Learning	Learns the value of the policy being followed (including exploration).	Learns the optimal policy, independent of exploration.	Learns the optimal policy using importance sampling from exploratory trajectories.
Exploration Impact	Exploration affects learned Q-values.	Exploration does not affect learned Q-values.	Exploration affects returns, reweighted by importance sampling.
Convergence	Converges to the policy’s value if exploration decreases (e.g., \(\epsilon \to 0\)).	Converges to the optimal policy even with fixed exploration.	Converges to the optimal policy, but variance depends on policy similarity.
Behavior	More conservative, accounts for exploration risks.	More aggressive, assumes optimal future actions.	Aggressive, but variance can lead to unstable learning if policies differ significantly.

# Common settings
n_states <- 10
n_actions <- 2
gamma <- 0.9
terminal_state <- n_states

# Environment: transition and reward models
set.seed(42)
transition_model <- array(0, dim = c(n_states, n_actions, n_states))
reward_model <- array(0, dim = c(n_states, n_actions, n_states))

for (s in 1:(n_states - 1)) {
  transition_model[s, 1, s + 1] <- 0.9
  transition_model[s, 1, sample(1:n_states, 1)] <- 0.1
  transition_model[s, 2, sample(1:n_states, 1)] <- 0.8
  transition_model[s, 2, sample(1:n_states, 1)] <- 0.2
  for (s_prime in 1:n_states) {
    reward_model[s, 1, s_prime] <- ifelse(s_prime == n_states, 1.0, 0.1 * runif(1))
    reward_model[s, 2, s_prime] <- ifelse(s_prime == n_states, 0.5, 0.05 * runif(1))
  }
}

transition_model[n_states, , ] <- 0
reward_model[n_states, , ] <- 0

# Helper function: Epsilon-greedy policy
epsilon_greedy <- function(Q, state, epsilon) {
  if (runif(1) < epsilon) {
    sample(1:n_actions, 1)
  } else {
    which.max(Q[state, ])
  }
}

# Helper function: Simulate environment
simulate_step <- function(state, action) {
  probs <- transition_model[state, action, ]
  next_state <- sample(1:n_states, 1, prob = probs)
  reward <- reward_model[state, action, next_state]
  list(next_state = next_state, reward = reward)
}

# SARSA
sarsa <- function(n_episodes = 1000, alpha = 0.1, epsilon = 0.1) {
  Q <- matrix(0, n_states, n_actions)
  policy <- rep(0, n_states)
  rewards <- numeric(n_episodes)
  
  for (episode in 1:n_episodes) {
    state <- sample(1:(n_states - 1), 1)
    action <- epsilon_greedy(Q, state, epsilon)
    episode_reward <- 0
    
    while (state != terminal_state) {
      step <- simulate_step(state, action)
      next_state <- step$next_state
      reward <- step$reward
      next_action <- epsilon_greedy(Q, next_state, epsilon)
      
      Q[state, action] <- Q[state, action] + alpha * (
        reward + gamma * Q[next_state, next_action] - Q[state, action]
      )
      
      state <- next_state
      action <- next_action
      episode_reward <- episode_reward + reward
    }
    rewards[episode] <- episode_reward
  }
  
  policy[1:(n_states - 1)] <- apply(Q[1:(n_states - 1), ], 1, which.max)
  list(Q = Q, policy = policy, rewards = rewards)
}

# Q-Learning
q_learning <- function(n_episodes = 1000, alpha = 0.1, epsilon = 0.1) {
  Q <- matrix(0, n_states, n_actions)
  policy <- rep(0, n_states)
  rewards <- numeric(n_episodes)
  
  for (episode in 1:n_episodes) {
    state <- sample(1:(n_states - 1), 1)
    episode_reward <- 0
    
    while (state != terminal_state) {
      action <- epsilon_greedy(Q, state, epsilon)
      step <- simulate_step(state, action)
      next_state <- step$next_state
      reward <- step$reward
      
      Q[state, action] <- Q[state, action] + alpha * (
        reward + gamma * max(Q[next_state, ]) - Q[state, action]
      )
      
      state <- next_state
      episode_reward <- episode_reward + reward
    }
    rewards[episode] <- episode_reward
  }
  
  policy[1:(n_states - 1)] <- apply(Q[1:(n_states - 1), ], 1, which.max)
  list(Q = Q, policy = policy, rewards = rewards)
}

# Off-Policy Monte Carlo
off_policy_mc <- function(n_episodes = 1000, epsilon = 0.1) {
  Q <- matrix(0, n_states, n_actions)
  C <- matrix(0, n_states, n_actions)  # Cumulative weights
  policy <- rep(0, n_states)
  rewards <- numeric(n_episodes)
  
  for (episode in 1:n_episodes) {
    # Generate episode using behavior policy (epsilon-greedy)
    states <- numeric(0)
    actions <- numeric(0)
    rewards_ep <- numeric(0)
    state <- sample(1:(n_states - 1), 1)
    
    while (state != terminal_state) {
      action <- sample(1:n_actions, 1)  # Behavior policy: random
      step <- simulate_step(state, action)
      next_state <- step$next_state
      reward <- step$reward
      
      states <- c(states, state)
      actions <- c(actions, action)
      rewards_ep <- c(rewards_ep, reward)
      state <- next_state
    }
    rewards[episode] <- sum(rewards_ep)
    
    # Update Q using importance sampling
    G <- 0
    W <- 1
    for (t in length(states):1) {
      state <- states[t]
      action <- actions[t]
      reward <- rewards_ep[t]
      
      G <- gamma * G + reward
      C[state, action] <- C[state, action] + W
      Q[state, action] <- Q[state, action] + (W / C[state, action]) * (G - Q[state, action])
      
      pi_action <- which.max(Q[state, ])
      if (action != pi_action) break
      W <- W / (1 / n_actions)  # Importance sampling ratio
    }
  }
  
  policy[1:(n_states - 1)] <- apply(Q[1:(n_states - 1), ], 1, which.max)
  list(Q = Q, policy = policy, rewards = rewards)
}

# Value Iteration (from DP)
value_iteration <- function(transition_model, reward_model, gamma, epsilon = 1e-6, max_iter = 1000) {
  V <- rep(0, n_states)
  policy <- rep(0, n_states)
  delta <- Inf
  iter <- 0
  
  while (delta > epsilon && iter < max_iter) {
    delta <- 0
    V_old <- V
    
    for (s in 1:(n_states - 1)) {
      Q <- numeric(n_actions)
      for (a in 1:n_actions) {
        Q[a] <- sum(transition_model[s, a, ] * (reward_model[s, a, ] + gamma * V))
      }
      V[s] <- max(Q)
      policy[s] <- which.max(Q)
      delta <- max(delta, abs(V[s] - V_old[s]))
    }
    iter <- iter + 1
  }
  
  # Evaluate DP policy
  rewards <- numeric(1000)
  for (episode in 1:1000) {
    state <- sample(1:(n_states - 1), 1)
    episode_reward <- 0
    while (state != terminal_state) {
      action <- policy[state]
      step <- simulate_step(state, action)
      episode_reward <- episode_reward + step$reward
      state <- step$next_state
    }
    rewards[episode] <- episode_reward
  }
  
  list(V = V, policy = policy, rewards = rewards)
}

# Run algorithms
set.seed(42)
dp_result <- value_iteration(transition_model, reward_model, gamma)
sarsa_result <- sarsa(n_episodes = 1000, alpha = 0.1, epsilon = 0.1)
qlearn_result <- q_learning(n_episodes = 1000, alpha = 0.1, epsilon = 0.1)
mc_result <- off_policy_mc(n_episodes = 1000, epsilon = 0.1)

# Visualization
library(ggplot2)
library(gridExtra)

# Policy comparison
policy_df <- data.frame(
  State = rep(1:n_states, 4),
  Policy = c(dp_result$policy, sarsa_result$policy, qlearn_result$policy, mc_result$policy),
  Algorithm = rep(c("DP", "SARSA", "Q-Learning", "Off-Policy MC"), each = n_states)
)
policy_df$Policy[n_states * 0:3 + n_states] <- NA  # Terminal state

policy_plot <- ggplot(policy_df, aes(x = State, y = Policy, color = Algorithm)) +
  geom_point(size = 3) +
  geom_line(aes(group = Algorithm), na.rm = TRUE) +
  theme_minimal() +
  labs(title = "Optimal Policies by Algorithm", x = "State", y = "Action") +
  scale_x_continuous(breaks = 1:n_states) +
  scale_y_continuous(breaks = 1:n_actions, labels = c("Action 1", "Action 2")) +
  theme(legend.position = "bottom")

# Reward comparison
reward_df <- data.frame(
  Episode = rep(1:1000, 4),
  Reward = c(
    cumsum(dp_result$rewards),
    cumsum(sarsa_result$rewards),
    cumsum(qlearn_result$rewards),
    cumsum(mc_result$rewards)
  ),
  Algorithm = rep(c("DP", "SARSA", "Q-Learning", "Off-Policy MC"), each = 1000)
)

reward_plot <- ggplot(reward_df, aes(x = Episode, y = Reward, color = Algorithm)) +
  geom_line() +
  theme_minimal() +
  labs(title = "Cumulative Reward Comparison", x = "Episode", y = "Cumulative Reward") +
  theme(legend.position = "bottom")

# Display plots
grid.arrange(policy_plot, reward_plot, ncol = 1)

# Print performance metrics
cat("Average Cumulative Reward per Episode:\n")
cat("DP:", mean(dp_result$rewards), "\n")
cat("SARSA:", mean(sarsa_result$rewards), "\n")
cat("Q-Learning:", mean(qlearn_result$rewards), "\n")
cat("Off-Policy MC:", mean(mc_result$rewards), "\n")

5.6 Interpretation and Discussion

5.6.0.1 Policy Differences

SARSA: As an on-policy method, it learns the value of the \(\epsilon\)-greedy policy, which includes exploratory actions. In the 10-state environment, SARSA may balance between actions 1 and 2, reflecting the impact of random exploration, leading to a more conservative policy.
Q-Learning: As an off-policy method, it learns the optimal policy, favoring action 1 in state 9 (higher terminal reward of 1.0) due to its greedy updates. Its policy is less sensitive to exploration noise, as it assumes optimal future actions.
Off-Policy Monte Carlo: Also off-policy, it learns the optimal policy using importance sampling to reweight returns from a random behavior policy. It may align closely with Q-Learning’s policy but can exhibit variability due to high variance in importance sampling ratios, especially if the random policy frequently selects action 2 (lower reward).

5.6.0.2 Devaluation

All methods exhibit habitual behavior without retraining, retaining their original policies after the terminal reward is removed. This highlights a limitation of model-free methods compared to model-based approaches (e.g., dynamic programming), which adapt instantly to reward changes.

5.6.0.3 Practical Implications

SARSA: Better suited for environments where the exploration policy must be accounted for, such as safety-critical systems (e.g., robotics), where risky exploratory actions could lead to poor outcomes.
Q-Learning: Ideal for scenarios where the optimal policy is desired regardless of exploration, such as games or simulations where exploration does not incur real-world costs.
Off-Policy Monte Carlo: Suitable for offline learning from logged data (e.g., recommendation systems), but high variance can make it less stable than Q-Learning in dynamic environments.

5.6.0.4 Experimental Observations

Before devaluation, Q-Learning and off-policy Monte Carlo likely favor action 1 in state 9 due to its higher terminal reward, while SARSA’s policy may show more variability due to exploration.
After devaluation, all policies remain unchanged without retraining, illustrating their reliance on cached Q-values.
Off-policy Monte Carlo’s performance depends on the similarity between the random behavior policy and the greedy target policy, with high variance potentially leading to less consistent policies compared to Q-Learning.

5.7 Conclusion

SARSA, Q-Learning, and off-policy Monte Carlo represent distinct paradigms in model-free RL. SARSA’s on-policy updates reflect the exploration policy, making it conservative. Q-Learning’s off-policy updates target the optimal policy, ignoring exploration effects. Off-policy Monte Carlo uses importance sampling to learn from diverse trajectories, enabling offline learning but introducing variance. The R implementations demonstrate these differences in a 10-state environment, and the devaluation experiment underscores their habitual nature. Future posts could explore advanced topics, such as SARSA(\(\lambda\)), deep RL extensions, or variance reduction in off-policy Monte Carlo.

5.8 Comparison Table

Aspect	SARSA (On-Policy)	Q-Learning (Off-Policy)	Off-Policy Monte Carlo
Learning Approach	Learns incrementally, updates based on action taken by behavior policy.	Learns incrementally, updates based on best action in next state.	Learns from complete episodes, using importance sampling.
Update Rule	\(Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma Q(s', a') - Q(s,a) \right)\)	\(Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right)\)	\(Q(s,a) \leftarrow Q(s,a) + \alpha \left( \rho_t G_t - Q(s,a) \right)\)
Episode Requirement	Updates online, no episode completion needed.	Updates online, no episode completion needed.	Requires complete episodes for returns and importance weights.
Bias and Variance	Biased due to bootstrapping, moderate variance.	Biased due to bootstrapping, lower variance.	Unbiased but high variance due to importance sampling.
Policy Type	On-policy; learns value of behavior policy.	Off-policy; learns optimal policy via max Q-value.	Off-policy; learns greedy policy using importance sampling.
Exploration Impact	Exploration affects learned Q-values.	Exploration does not affect learned Q-values.	Exploration affects returns, reweighted by importance sampling.
Convergence	Converges to policy’s value if \(\epsilon \to 0\).	Converges to optimal policy even with fixed \(\epsilon\).	Converges to optimal policy, but variance depends on policy similarity.
Behavior	Conservative, accounts for exploration risks.	Aggressive, assumes optimal future actions.	Aggressive, but variance can lead to instability.
Example in Environment	Balances actions 1 and 2, sensitive to exploration.	Favors action 1 (higher reward) in state 9.	Favors action 1, but variance may cause variability.