Chapter 11 Appendix
11.1 Comprehensive Reinforcement Learning Concepts Guide
Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
---|---|---|---|---|
Agent | Autonomous learner/decision-maker that perceives environment and takes actions to maximize cumulative reward | Agent: (π, V, Q) | Autonomous vehicles, trading bots, game AI (AlphaGo), robotic controllers, chatbots | Must balance exploration vs exploitation, learns from trial and error |
Environment | External system that responds to agent actions with state transitions and rewards | Markov Decision Process (MDP): ⟨S, A, P, R, γ⟩ | OpenAI Gym environments, real-world robotics, financial markets, video games | Can be deterministic or stochastic, fully or partially observable |
State (s) | Complete information needed to make optimal decisions at current time | s ∈ S, where S is state space | Chess position (64 squares), robot pose (x,y,θ), market conditions, pixel arrays | Markov property: future depends only on current state, not history |
Action (a) | Discrete or continuous choices available to agent in given state | a ∈ A(s), where A(s) ⊆ A | Discrete: {up, down, left, right}, Continuous: steering angle [-30°, 30°] | Action space can be finite, infinite, or parameterized |
Reward (r) | Immediate scalar feedback signal indicating desirability of action | r ∈ ℝ, often bounded: r ∈ [-R_max, R_max] | Sparse: +1 goal, 0 elsewhere; Dense: distance-based; Shaped: intermediate milestones | Defines objective, can be sparse, dense, or carefully shaped |
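To make the agent-environment loop in the table above concrete, here is a minimal R sketch (not from the main text) of a one-dimensional gridworld; `gridworld_step` and `n_states` are illustrative names, and the sparse goal reward is made up for the example.
# --- Minimal agent-environment interface (illustrative sketch) ---
# States are 1..n_states; reaching n_states yields reward +1 and ends the episode.
n_states <- 5
actions <- c(-1, 1) # move left or move right

gridworld_step <- function(state, action) {
  next_state <- min(max(state + action, 1), n_states) # clip to valid states
  reward <- if (next_state == n_states) 1 else 0      # sparse goal reward
  done <- next_state == n_states
  list(next_state = next_state, reward = reward, done = done)
}

# One interaction: the agent in state 3 takes the "move right" action
gridworld_step(3, 1)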
11.2 Learning Mechanisms
Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
---|---|---|---|---|
Return (G) | Total discounted future reward from current time step | G_t = Σ_{k=0}^∞ γ^k r_{t+k+1} | Episode return in games, lifetime value in finance, trajectory rewards | Discount factor γ ∈ [0,1] balances immediate vs future rewards |
Discount Factor (γ) | Weighs importance of future vs immediate rewards | γ ∈ [0,1], where γ=0 is myopic, γ=1 values all future equally | γ=0.9 in games, γ=0.99 in continuous control, γ≈1 in finance | Controls learning horizon and convergence properties |
Policy (π) | Strategy mapping states to action probabilities or deterministic actions | Stochastic: π(a|s) ∈ [0,1], Deterministic: π(s) → a | ε-greedy, Boltzmann/softmax, Gaussian policies, neural networks | Can be deterministic, stochastic, parameterized, or tabular |
Value Function (V) | Expected return starting from state under policy π | V^π(s) = 𝔼_π[G_t | S_t = s] | State evaluation in chess, expected lifetime value | Higher values indicate more promising states |
Q-Function (Q) | Expected return from taking action a in state s, then following policy π | Q^π(s,a) = 𝔼_π[G_t | S_t = s, A_t = a] | Q-tables in tabular RL, neural Q-networks in DQN | Enables action selection without environment model |
Advantage (A) | How much better action a is compared to average in state s | A^π(s,a) = Q^π(s,a) - V^π(s) | Policy gradient variance reduction, actor-critic methods | Positive advantage → above average action, zero → average |
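As a quick numeric check of the return and discount-factor definitions above, the short sketch below computes G_t = Σ_k γ^k r_{t+k+1} for a made-up reward sequence; `discounted_return` is an illustrative helper, not a library function.
# --- Discounted return (illustrative sketch) ---
discounted_return <- function(rewards, gamma = 0.9) {
  # G_t = sum_k gamma^k * r_{t+k+1}
  sum(gamma^(seq_along(rewards) - 1) * rewards)
}

rewards <- c(0, 0, 0, 1)        # sparse reward received on the fourth step
discounted_return(rewards, 0.9) # 0.9^3 = 0.729
discounted_return(rewards, 0)   # myopic agent sees only the immediate reward: 0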
11.3 Environment Properties
Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
---|---|---|---|---|
Model | Agent’s internal representation of environment dynamics | P(s',r|s,a) or T(s,a,s') and R(s,a) | Forward models in planning, world models in Dyna-Q | Enables planning and simulation of future trajectories
Markov Property | Future state depends only on current state and action, not history | P(S_{t+1}|S_t, A_t, S_{t-1}, …) = P(S_{t+1}|S_t, A_t) | Most RL algorithms assume this, POMDP when violated | Simplifies learning by eliminating need to track full history |
Stationarity | Environment dynamics don’t change over time | P_t(s',r|s,a) = P(s',r|s,a) ∀t | Stationary games vs non-stationary financial markets | Non-stationary environments require adaptation mechanisms
Observability | Extent to which agent can perceive true environment state | Fully: O_t = S_t, Partially: O_t = f(S_t) + noise | Full: chess, Partial: poker (hidden cards), autonomous driving | POMDPs require memory or state estimation techniques |
11.4 Learning Paradigms
Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
---|---|---|---|---|
On-Policy | Learning from data generated by current policy being improved | π_{behavior} = π_{target}, update π using its own experience | SARSA, Policy Gradient, A2C, PPO, TRPO | More stable but less sample efficient, safer updates |
Off-Policy | Learning from data generated by different (often older) policy | π_{behavior} ≠ π_{target}, importance sampling: ρ = π/μ | Q-Learning, DQN, DDPG, SAC, experience replay | More sample efficient but requires correction for distribution shift |
Model-Free | Learn directly from experience without learning environment model | Direct V/Q updates from (s,a,r,s') tuples | Q-Learning, Policy Gradient, Actor-Critic methods | Simpler but may require more samples, works with unknown dynamics
Model-Based | Learn environment model first, then use for planning/learning | Learn P(s'|s,a), then plan using model | Dyna-Q, MCTS, MuZero, world models | Sample efficient but model errors can compound
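To make the off-policy correction in the table concrete, the sketch below computes a per-step importance ratio ρ = π(a|s)/μ(a|s) for a made-up two-action example; all probabilities here are illustrative.
# --- Importance sampling ratio (illustrative sketch) ---
pi_probs <- c(a1 = 0.9, a2 = 0.1) # target policy being evaluated/improved
mu_probs <- c(a1 = 0.5, a2 = 0.5) # behavior policy that generated the data

taken_action <- "a2"
rho <- pi_probs[taken_action] / mu_probs[taken_action] # 0.1 / 0.5 = 0.2
rho # down-weights this sample because the target policy rarely takes a2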
11.5 Exploration Strategies
Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
---|---|---|---|---|
Exploration | Trying suboptimal actions to discover potentially better policies | Various strategies with different theoretical guarantees | ε-greedy, UCB, Thompson sampling, curiosity-driven | Essential for discovering optimal policies, balances with exploitation |
Exploitation | Using current knowledge to maximize immediate reward | π_{greedy}(s) = argmax_a Q(s,a) | Greedy action selection after learning | Must be balanced with exploration to avoid local optima |
ε-Greedy | Choose random action with probability ε, best known action otherwise | π(a|s) = ε/|A| + (1-ε)·1[a = a*], where a* = argmax_{a'} Q(s,a') | Simple exploration in tabular RL, decaying ε schedules | Simple but may explore inefficiently in large state spaces
Upper Confidence Bound | Choose actions with highest optimistic estimate | a_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))] | Multi-armed bandits, MCTS (UCT), confidence-based exploration | Theoretically principled, provides exploration guarantees |
Thompson Sampling | Sample actions according to probability they are optimal | Sample θ ~ posterior, choose a = argmax_a Q_θ(s,a) | Bayesian bandits, Bayesian RL | Naturally balances exploration/exploitation via uncertainty |
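The sketch below implements the ε-greedy and UCB rules from the table for a small set of action-value estimates; `epsilon_greedy` and `ucb_select` are illustrative names and the Q-values are made up.
# --- epsilon-greedy and UCB action selection (illustrative sketch) ---
epsilon_greedy <- function(q_values, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(seq_along(q_values), 1) # explore: uniform random action
  } else {
    which.max(q_values)            # exploit: current greedy action
  }
}

ucb_select <- function(q_values, counts, t, c = 2) {
  bonus <- c * sqrt(log(t) / pmax(counts, 1))
  bonus[counts == 0] <- Inf        # untried actions are chosen first
  which.max(q_values + bonus)
}

q <- c(0.2, 0.5, 0.1)
epsilon_greedy(q, epsilon = 0.1)
ucb_select(q, counts = c(3, 10, 0), t = 13) # picks the untried third action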
11.6 Key Algorithms & Methods
Algorithm | Type | Key Innovation | Mathematical Update | Applications |
---|---|---|---|---|
Q-Learning | Off-policy, Model-free | Learns optimal Q-function without following optimal policy | Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)] | Tabular problems, foundation for DQN
SARSA | On-policy, Model-free | Updates based on actual next action taken | Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)] | Safe exploration, tabular RL
Policy Gradient | On-policy, Model-free | Direct policy optimization using gradient ascent | ∇θ J(θ) = 𝔼[∇θ log π_θ(a|s) A^π(s,a)] | Continuous action spaces, stochastic policies |
Actor-Critic | On-policy, Model-free | Combines value estimation with policy optimization | Actor: ∇θ π, Critic: learns V^π or Q^π | Reduces variance in policy gradients |
DQN | Off-policy, Model-free | Neural networks + experience replay + target networks | Q-learning with neural network approximation | High-dimensional state spaces, Atari games |
PPO | On-policy, Model-free | Clipped surrogate objective for stable policy updates | L^{CLIP}(θ) = 𝔼[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)] | Stable policy optimization, continuous control
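The Q-learning and SARSA updates in the table differ only in their bootstrap target, as the following tabular sketch shows; `q_learning_update` and `sarsa_update` are illustrative helpers operating on a plain Q matrix.
# --- Tabular Q-learning vs. SARSA updates (illustrative sketch) ---
q_learning_update <- function(Q, s, a, r, s_next, alpha = 0.1, gamma = 0.99) {
  target <- r + gamma * max(Q[s_next, ])  # bootstrap with the greedy next action
  Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
  Q
}

sarsa_update <- function(Q, s, a, r, s_next, a_next, alpha = 0.1, gamma = 0.99) {
  target <- r + gamma * Q[s_next, a_next] # bootstrap with the action actually taken
  Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
  Q
}

Q <- matrix(0, nrow = 5, ncol = 2) # 5 states, 2 actions
Q <- q_learning_update(Q, s = 3, a = 2, r = 1, s_next = 4)
Q[3, 2] # 0.1 after a single reward of 1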
11.7 Advanced Concepts
Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
---|---|---|---|---|
Function Approximation | Using parameterized functions to represent V, Q, or π | V_θ(s), Q_θ(s,a), π_θ(a|s) where θ are parameters | Neural networks, linear functions, tile coding | Enables scaling to large/continuous state spaces |
Experience Replay | Storing and reusing past transitions to improve sample efficiency | Sample (s,a,r,s') from replay buffer D uniformly | DQN, DDPG, Rainbow improvements | Breaks correlation in sequential data, enables off-policy learning
Target Networks | Using separate, slowly updated networks for stable learning | θ^- ← τθ + (1-τ)θ^- where τ << 1 | DQN, DDPG for reducing moving target problem | Stabilizes learning by reducing correlation between Q-values and targets |
Eligibility Traces | Credit assignment mechanism that bridges temporal difference and Monte Carlo | e_t(s) = γλe_{t-1}(s) + 1_{S_t=s} | TD(λ), SARSA(λ) for faster learning | Allows faster propagation of rewards to earlier states |
Multi-Agent RL | Multiple agents learning simultaneously in shared environment | Nash equilibria, correlated equilibria, social welfare | Game theory, autonomous vehicle coordination, economics | Requires handling non-stationarity from other learning agents |
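Two of the stabilizers in the table, experience replay and soft target updates, are simple to express in code. Below is a minimal sketch with illustrative helper names (`store_transition`, `sample_minibatch`, `soft_update`); it is not tied to any particular deep RL library.
# --- Replay buffer and soft target update (illustrative sketch) ---
store_transition <- function(buffer, s, a, r, s_next) {
  buffer[[length(buffer) + 1]] <- list(s = s, a = a, r = r, s_next = s_next)
  buffer
}

sample_minibatch <- function(buffer, batch_size) {
  buffer[sample(length(buffer), min(batch_size, length(buffer)))]
}

# theta_minus <- tau * theta + (1 - tau) * theta_minus, with tau << 1
soft_update <- function(theta, theta_target, tau = 0.005) {
  tau * theta + (1 - tau) * theta_target
}

buffer <- list()
buffer <- store_transition(buffer, s = 1, a = 2, r = 0, s_next = 2)
buffer <- store_transition(buffer, s = 2, a = 1, r = 1, s_next = 3)
length(sample_minibatch(buffer, 2))             # 2 transitions drawn for an update
soft_update(theta = c(1, 1), theta_target = c(0, 0)) # targets move slowly toward theta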
11.8 Fundamental Equations
11.8.1 Bellman Equations
- State Value: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')]
- Action Value: Q^π(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ Σ_{a'} π(a'|s')Q^π(s',a')]
- Optimality (State): V*(s) = max_a Σ_{s',r} p(s',r|s,a)[r + γV*(s')]
- Optimality (Action): Q*(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ max_{a'} Q*(s',a')]
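The Bellman optimality equation for V* translates directly into value iteration. The sketch below runs it on a made-up two-state, two-action MDP with known transition probabilities `P` and rewards `Rew`; all numbers are illustrative.
# --- Value iteration on a tiny MDP (illustrative sketch) ---
P <- array(0, dim = c(2, 2, 2)) # P[s, a, s']: made-up transition probabilities
P[1, 1, ] <- c(0.9, 0.1); P[1, 2, ] <- c(0.2, 0.8)
P[2, 1, ] <- c(0.0, 1.0); P[2, 2, ] <- c(0.5, 0.5)
Rew <- matrix(c(0, 1,
                0, 2), nrow = 2, byrow = TRUE) # Rew[s, a]: made-up rewards
gamma <- 0.9

V <- c(0, 0)
for (iter in 1:500) {
  # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
  Q_sa <- sapply(1:2, function(a) Rew[, a] + gamma * P[, a, ] %*% V)
  V_new <- apply(Q_sa, 1, max)
  if (max(abs(V_new - V)) < 1e-8) break
  V <- V_new
}
V # optimal state values under the made-up dynamics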
11.9 Common Challenges & Solutions
Challenge | Problem | Solutions | Examples |
---|---|---|---|
Sample Efficiency | RL often requires many environment interactions | Model-based methods, experience replay, transfer learning | Dyna-Q, DQN replay buffer, pre-training |
Exploration | Discovering good policies in large state spaces | Curiosity-driven exploration, count-based bonuses, UCB | ICM, pseudo-counts, optimism under uncertainty |
Stability | Function approximation can cause instability | Target networks, experience replay, regularization | DQN improvements, PPO clipping |
Sparse Rewards | Learning with infrequent feedback signals | Reward shaping, hierarchical RL, curriculum learning | HER, Options framework, progressive tasks |
Partial Observability | Agent cannot fully observe environment state | Recurrent policies, belief state estimation, memory | LSTM policies, particle filters, attention mechanisms |
11.10 Function Approximation Fundamentals in Reinforcement Learning
Reinforcement Learning (RL) has made remarkable progress, largely by moving from tabular methods, which store values for every state, to deep learning approaches that can handle vast, continuous environments. This leap, however, is built upon a critical intermediate step: classical function approximation. Understanding linear function approximation, basis functions, and feature engineering is essential for grasping why deep RL works, diagnosing its failures, and designing effective agent representations.
Tabular RL is impractical for large or continuous state spaces due to the curse of dimensionality. Function approximation overcomes this by learning a parameterized function that generalizes knowledge across similar states. This generalization is both a strength and a challenge. It allows agents to learn in complex worlds but introduces a fundamental tradeoff between discrimination (distinguishing important states) and generalization (sharing knowledge) that shapes all modern RL algorithms.
This appendix bridges the gap between tabular methods and deep RL. We begin with feature engineering, then cover the mathematical principles of linear function approximation, and finally explore classical basis functions like coarse coding, tile coding, and radial basis functions (RBFs).
11.11 Feature Engineering and State Representation
The success of any RL algorithm hinges on the quality of its state representation. Raw environmental observations are rarely optimal for learning. Effective feature engineering transforms high-dimensional, noisy data into a compact, informative format that accelerates learning and improves generalization.
11.11.1 The Discrimination vs. Generalization Tradeoff
At the core of feature design is a fundamental tension:
- Discrimination is the ability to distinguish between states that require different actions or have different values. For example, a robot navigating near a cliff must have features that are highly sensitive to small changes in its position.
- Generalization is the ability to treat similar states similarly, allowing experiences in one state to inform decisions in others. In the same navigation task, large open areas should be represented similarly so the agent doesn’t have to re-learn how to move in each one.
Mathematically, we approximate a value function \(V^*(s)\) with a parameterized function \(V_{\boldsymbol{\theta}}(s)\). Using a linear model, this is expressed as:
\[V_{\boldsymbol{\theta}}(s) = \boldsymbol{\phi}(s)^T \boldsymbol{\theta} = \sum_{i=1}^d \phi_i(s) \theta_i\]
Here, \(\boldsymbol{\phi}(s)\) is the feature vector for state \(s\), and \(\boldsymbol{\theta}\) is the weight vector we learn. The quality of the approximation depends on how well the features \(\boldsymbol{\phi}(s)\) can represent the true value function \(V^*(s)\). The unavoidable error of the best linear approximation is \(\|V^* - \Pi V^*\|^2\), where \(\Pi\) is the projection operator onto the space spanned by the features. A good feature set minimizes this error.
11.11.2 Principles of Feature Extraction
Effective feature design in RL often involves:
- Encoding Temporal Dependencies: Features should capture dynamics, such as position and velocity in a physics problem or recent price trends in a trading algorithm.
- Focusing on Action-Relevant Information: Features should highlight aspects of the state that are critical for making a decision. The shape of a chess position is more important than the specific pieces if the underlying tactical pattern is the same.
- Capturing Hierarchical Structure: Good features can represent abstract concepts, like “in the kitchen” for a home robot or “opening phase” in a board game, which allows for hierarchical planning and learning.
11.12 Mathematical Foundations of Linear Function Approximation
Linear function approximation is the cornerstone of parametric value function methods. Its simplicity provides theoretical guarantees and computational efficiency, making it an essential starting point.
11.12.1 Linear Value Functions
We represent the state-value function \(V(s)\) or action-value function \(Q(s, a)\) as a linear combination of features: \[V_{\boldsymbol{\theta}}(s) = \boldsymbol{\phi}(s)^T \boldsymbol{\theta}\] \[Q_{\boldsymbol{\theta}}(s, a) = \boldsymbol{\phi}(s, a)^T \boldsymbol{\theta}\] A key advantage is that the gradient of the value function with respect to the parameters is simply the feature vector itself: \[\nabla_{\boldsymbol{\theta}} V_{\boldsymbol{\theta}}(s) = \boldsymbol{\phi}(s)\] This property makes learning algorithms based on gradient descent straightforward and computationally cheap.
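A quick numeric illustration of these two identities, with a made-up three-dimensional feature vector and weight vector:
# --- Linear value estimate and its gradient (illustrative sketch) ---
phi_s <- c(1.0, 0.5, -2.0)  # made-up feature vector for some state s
theta <- c(0.3, -0.1, 0.05) # current weights

V_s <- sum(phi_s * theta) # phi(s)^T theta = 0.3 - 0.05 - 0.1 = 0.15
grad_V <- phi_s           # gradient w.r.t. theta is just the feature vector
V_s
grad_V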
11.12.2 Temporal Difference (TD) Learning with Function Approximation
With linear function approximation, the TD(0) update rule for learning \(V_{\boldsymbol{\theta}}(s)\) becomes: \[\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t + \alpha [R_{t+1} + \gamma V_{\boldsymbol{\theta}}(S_{t+1}) - V_{\boldsymbol{\theta}}(S_t)] \boldsymbol{\phi}(S_t)\] This update adjusts the weights \(\boldsymbol{\theta}\) to reduce the TD error for the observed transition. Since the value of any state depends on the shared weights \(\boldsymbol{\theta}\), this single update influences the value estimates across the entire state space.
Similarly, the update for Q-learning is: \[\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t + \alpha [R_{t+1} + \gamma \max_{a'} Q_{\boldsymbol{\theta}}(S_{t+1}, a') - Q_{\boldsymbol{\theta}}(S_t, A_t)] \boldsymbol{\phi}(S_t, A_t)\]
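Written as code, the semi-gradient TD(0) update above is a one-liner on the weight vector; the sketch below uses illustrative binary features, in the spirit of the coarse and tile coders later in this appendix.
# --- Semi-gradient TD(0) with linear features (illustrative sketch) ---
td0_update <- function(theta, phi_s, phi_s_next, reward, alpha = 0.1, gamma = 0.99) {
  td_error <- reward + gamma * sum(phi_s_next * theta) - sum(phi_s * theta)
  theta + alpha * td_error * phi_s # only weights of active features change
}

theta <- rep(0, 3)
theta <- td0_update(theta, phi_s = c(1, 0, 1), phi_s_next = c(0, 1, 0), reward = 1)
theta # 0.1 on the two features active in s, 0 elsewhere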
11.12.3 Convergence and the “Deadly Triad”
On-policy TD learning with linear function approximation is guaranteed to converge, and its fixed point has error bounded relative to the best linear approximation; this guarantee does not extend to all settings. The combination of off-policy learning, function approximation, and bootstrapping (updating estimates from other estimates) is known as the “deadly triad” and can lead to divergence, with value estimates growing without bound. This instability is a fundamental challenge that motivated much of the research behind modern deep RL algorithms such as DQN, which employs techniques like experience replay and target networks to mitigate it.
11.13 Classical Basis Function Methods
Basis functions are systematic methods for constructing features. They provide a bridge from simple linear models to the powerful, automatically learned features of neural networks.
11.13.1 Coarse Coding
Coarse coding uses overlapping binary features, where each feature corresponds to a “receptive field” or region in the state space. A feature is active (value 1) if the state falls within its region and inactive (0) otherwise. The overlap between regions enables generalization: nearby states activate a similar set of features, so they receive similar value estimates.
The degree of overlap controls the smoothness of the learned function. More overlap leads to broader generalization.
# Load necessary libraries for all visualizations
library(ggplot2)
library(ggforce) # For drawing circles
# --- Coarse Coding Implementation ---
create_coarse_coder <- function(state_dims, num_features, overlap_factor = 2) {
# Generate random centers for the receptive fields
centers <- data.frame(
x = runif(num_features, 0, state_dims[1]),
y = runif(num_features, 0, state_dims[2])
)
# Calculate radius based on average spacing to ensure overlap
avg_spacing <- sqrt(prod(state_dims) / num_features)
radius <- overlap_factor * avg_spacing / 2
# Function to compute the binary feature vector for a given state
compute_features <- function(state) {
state_vec <- as.numeric(state)
distances <- sqrt((centers$x - state_vec[1])^2 + (centers$y - state_vec[2])^2)
features <- as.numeric(distances <= radius)
return(features)
}
return(list(
compute_features = compute_features,
centers = centers,
radius = radius,
state_dims = state_dims
))
}
# --- Visualization Function for Coarse Coding ---
visualize_coarse_coding <- function(coarse_coder, state = NULL) {
active_features <- if (!is.null(state)) coarse_coder$compute_features(state) else NULL
active_indices <- if (!is.null(active_features)) which(active_features == 1) else integer(0)
p <- ggplot() +
# Draw all receptive fields (inactive)
geom_circle(
data = coarse_coder$centers,
aes(x0 = x, y0 = y, r = coarse_coder$radius),
color = "grey70", fill = NA, linetype = "dashed"
) +
# Highlight active receptive fields
geom_circle(
data = coarse_coder$centers[active_indices, ],
aes(x0 = x, y0 = y, r = coarse_coder$radius),
color = "#e41a1c", fill = "#e41a1c", alpha = 0.2, size = 1
) +
# Add center points
geom_point(data = coarse_coder$centers, aes(x = x, y = y), color = "grey50", size = 1)
# Add the query state point
if (!is.null(state)) {
p <- p + geom_point(aes(x = state[1], y = state[2]), color = "#377eb8", size = 5, shape = 17)
}
p +
coord_fixed(xlim = c(0, coarse_coder$state_dims[1]), ylim = c(0, coarse_coder$state_dims[2])) +
labs(
title = "Coarse Coding Receptive Fields",
subtitle = paste(length(active_indices), "of", nrow(coarse_coder$centers), "features active"),
x = "State Dimension 1",
y = "State Dimension 2"
) +
theme_minimal(base_size = 14)
}
# Example Usage
set.seed(123)
coarse_coder_2d <- create_coarse_coder(state_dims = c(10, 10), num_features = 50, overlap_factor = 1.5)
visualize_coarse_coding(coarse_coder_2d, state = c(3.2, 7.8))
11.13.2 Tile Coding
Tile coding, also known as CMAC (Cerebellar Model Articulation Controller), improves upon coarse coding by providing more structured and uniform coverage. It partitions the state space using multiple overlapping grids, called tilings. Each tiling is offset from the others. For any given state, exactly one tile in each tiling will be active. This ensures a constant number of active features, which is computationally efficient.
The offsets provide fine-grained discrimination, while the tile size controls generalization. Unlike coarse coding with circular fields, tile coding’s rectangular tiles create asymmetric generalization that aligns with the coordinate axes, which is often desirable in control problems where dimensions represent different physical properties (e.g., position and velocity).
# --- Tile Coding Implementation ---
create_tile_coder <- function(state_low, state_high, num_tilings, tiles_per_dim) {
state_dims <- length(state_low)
state_range <- state_high - state_low
tile_widths <- state_range / tiles_per_dim
# Generate offsets for each tiling
offsets <- sapply(1:state_dims, function(i) {
(1:num_tilings - 1) * tile_widths[i] / num_tilings
})
# Function to get the indices of active tiles for a given state
get_active_tiles <- function(state) {
active_tiles <- list()
for (t in 1:num_tilings) {
tile_indices <- floor((state - state_low + offsets[t, ]) / tile_widths)
active_tiles[[t]] <- tile_indices
}
return(active_tiles)
}
return(list(
get_active_tiles = get_active_tiles,
state_low = state_low,
state_high = state_high,
num_tilings = num_tilings,
tiles_per_dim = tiles_per_dim,
tile_widths = tile_widths,
offsets = offsets
))
}
# --- Visualization Function for Tile Coding ---
visualize_tile_coding <- function(tile_coder, state = NULL) {
if (length(tile_coder$state_low) != 2) stop("Visualization is for 2D state spaces only.")
# Create a data frame of grid lines for each tiling
grids <- list()
for (t in 1:tile_coder$num_tilings) {
# Vertical lines
x_lines <- seq(
tile_coder$state_low[1] - tile_coder$offsets[t, 1],
tile_coder$state_high[1],
by = tile_coder$tile_widths[1]
)
# Horizontal lines
y_lines <- seq(
tile_coder$state_low[2] - tile_coder$offsets[t, 2],
tile_coder$state_high[2],
by = tile_coder$tile_widths[2]
)
grids[[t]] <- list(x = x_lines, y = y_lines)
}
# Base plot
p <- ggplot() +
coord_fixed(
  xlim = c(tile_coder$state_low[1], tile_coder$state_high[1]),
  ylim = c(tile_coder$state_low[2], tile_coder$state_high[2]),
  expand = FALSE
)
# Add grid lines for each tiling with a unique color
colors <- RColorBrewer::brewer.pal(tile_coder$num_tilings, "Set1")
for (t in 1:tile_coder$num_tilings) {
p <- p +
geom_vline(xintercept = grids[[t]]$x, color = colors[t], linetype = "dashed", alpha = 0.8) +
geom_hline(yintercept = grids[[t]]$y, color = colors[t], linetype = "dashed", alpha = 0.8)
}
# Highlight the active tile for each tiling
if (!is.null(state)) {
active_tiles_indices <- tile_coder$get_active_tiles(state)
for (t in 1:tile_coder$num_tilings) {
tile_xmin <- tile_coder$state_low[1] + active_tiles_indices[[t]][1] * tile_coder$tile_widths[1] - tile_coder$offsets[t, 1]
tile_ymin <- tile_coder$state_low[2] + active_tiles_indices[[t]][2] * tile_coder$tile_widths[2] - tile_coder$offsets[t, 2]
p <- p + annotate(
"rect",
xmin = tile_xmin,
xmax = tile_xmin + tile_coder$tile_widths[1],
ymin = tile_ymin,
ymax = tile_ymin + tile_coder$tile_widths[2],
fill = colors[t], alpha = 0.3, color = colors[t], size = 1.2
)
}
p <- p + geom_point(aes(x = state[1], y = state[2]), color = "black", size = 5, shape = 17)
}
p + labs(
title = "Tile Coding",
subtitle = paste(tile_coder$num_tilings, "Overlapping Tilings"),
x = "State Dimension 1",
y = "State Dimension 2"
) +
theme_minimal(base_size = 14)
}
# Example Usage
tile_coder_2d <- create_tile_coder(
state_low = c(0, 0),
state_high = c(10, 10),
num_tilings = 4,
tiles_per_dim = c(5, 5)
)
visualize_tile_coding(tile_coder_2d, state = c(3.2, 7.8))
11.13.3 Radial Basis Functions (RBFs)
Radial Basis Functions (RBFs) are smooth, continuous basis functions that produce a graded response based on the distance to a center point. The most common form is the Gaussian RBF:
\[\phi_i(s) = \exp\left(-\frac{\|s - c_i\|^2}{2\sigma_i^2}\right)\]
Here, \(c_i\) is the center of the \(i\)-th RBF and \(\sigma_i\) is its width (or bandwidth). Unlike binary features, RBFs provide a continuous measure of similarity. The width \(\sigma\) controls the locality of influence: a small \(\sigma\) creates a “spiky” function with local generalization, while a large \(\sigma\) creates a smooth function with broad generalization.
RBF networks are universal approximators: given enough basis functions, they can approximate any continuous function on a compact domain to arbitrary accuracy. The placement of centers and the choice of widths are critical. Centers can be placed on a fixed grid, randomly, or adapted during learning, for example by running k-means clustering on experienced states.
# --- Radial Basis Function (RBF) Implementation ---
create_rbf_network <- function(centers, sigmas) {
# Recycle a single width to all centers; per-center widths are kept as given
if (length(sigmas) == 1) {
  sigmas <- rep(sigmas, nrow(centers))
}
compute_features <- function(state) {
state_vec <- as.numeric(state)
distances_sq <- colSums((t(centers) - state_vec)^2)
features <- exp(-distances_sq / (2 * sigmas^2))
return(features)
}
return(list(
compute_features = compute_features,
centers = centers,
sigmas = sigmas
))
}
# --- Visualization Function for RBF Network Activation ---
visualize_rbf_activation <- function(rbf_network, state_dims) {
# Create a grid of points to evaluate the RBF network
grid_df <- expand.grid(
x = seq(0, state_dims[1], length.out = 100),
y = seq(0, state_dims[2], length.out = 100)
)
# Calculate the sum of all RBF activations at each grid point
activations <- apply(grid_df, 1, function(point) {
sum(rbf_network$compute_features(point))
})
grid_df$activation <- activations
ggplot(grid_df, aes(x = x, y = y, fill = activation)) +
geom_raster(interpolate = TRUE) +
geom_point(data = as.data.frame(rbf_network$centers), aes(x = V1, y = V2),
color = "white", shape = 3, size = 2, alpha = 0.8) +
scale_fill_viridis_c(name = "Total\nActivation") +
coord_fixed(expand = FALSE) +
labs(
title = "Radial Basis Function Network Activation",
subtitle = "White crosses indicate RBF centers",
x = "State Dimension 1",
y = "State Dimension 2"
) +
theme_minimal(base_size = 14)
}
# Example Usage
set.seed(42)
rbf_centers <- matrix(runif(20, 0, 10), nrow = 10, ncol = 2)
rbf_sigmas <- rep(2.0, 10) # Uniform width for all RBFs
rbf_net <- create_rbf_network(centers = rbf_centers, sigmas = rbf_sigmas)
visualize_rbf_activation(rbf_net, state_dims = c(10, 10))
The choice of basis function depends on the problem, computational budget, and desired generalization properties.
- Computational Efficiency: Tile coding is often the fastest, as it involves simple arithmetic and guarantees a fixed, small number of active features. Coarse coding can be slower if it requires checking against many overlapping regions. RBFs are typically the most expensive, as they require distance calculations to all centers.
- Memory Requirements: A tile coder with \(N\) tilings and \(T\) tiles per tiling has \(N \times T\) features. RBFs require storing all centers and widths. The memory footprint depends on the number of basis functions needed for the desired accuracy.
- Generalization Control: RBFs offer the most fine-grained control over generalization through the placement of centers and tuning of widths. Tile coding provides structured, axis-aligned generalization controlled by the number and shape of tiles. Coarse coding is the most straightforward but offers the least precise control.
- Ease of Use: Tile coding is often a strong default choice for low-to-medium dimensional control problems due to its computational efficiency and robust performance with minimal tuning.
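As a small usage example, and assuming the `coarse_coder_2d`, `tile_coder_2d`, and `rbf_net` objects created in the examples above are still in scope, the following sketch compares how many features each representation activates for a single state.
# --- Comparing active features across the three coders (illustrative sketch) ---
state <- c(3.2, 7.8)

sum(coarse_coder_2d$compute_features(state)) # count of active binary receptive fields
tile_coder_2d$num_tilings                    # tile coding: exactly one active tile per tiling
sum(rbf_net$compute_features(state) > 0.01)  # RBFs with non-negligible activation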
Linear function approximation and classical basis functions are not merely historical footnotes; they are the conceptual foundation upon which deep RL is built. They provide a clear and analyzable framework for understanding the core challenge of RL: balancing discrimination and generalization. The principles of feature design, the mathematics of TD updates, and the properties of different basis functions offer crucial insights that remain relevant when designing, training, and debugging complex neural network architectures.
Mastering these fundamentals provides the intuition needed to navigate the complexities of modern RL. The journey from tabular methods to deep learning necessarily passes through this foundational territory, making it essential knowledge for any serious RL practitioner.