Chapter 18 Appendix

18.1 Comprehensive Reinforcement Learning Concepts Guide

Agent
  • Description: Autonomous learner/decision-maker that perceives the environment and takes actions to maximize cumulative reward
  • Mathematical Formalism: Agent: (π, V, Q)
  • Applications/Examples: Autonomous vehicles, trading bots, game AI (AlphaGo), robotic controllers, chatbots
  • Key Properties: Must balance exploration vs exploitation, learns from trial and error

Environment
  • Description: External system that responds to agent actions with state transitions and rewards
  • Mathematical Formalism: Markov Decision Process (MDP): ⟨S, A, P, R, γ⟩
  • Applications/Examples: OpenAI Gym environments, real-world robotics, financial markets, video games
  • Key Properties: Can be deterministic or stochastic, fully or partially observable

State (s)
  • Description: Complete information needed to make optimal decisions at the current time
  • Mathematical Formalism: s ∈ S, where S is the state space
  • Applications/Examples: Chess position (64 squares), robot pose (x, y, θ), market conditions, pixel arrays
  • Key Properties: Markov property: the future depends only on the current state, not the history

Action (a)
  • Description: Discrete or continuous choices available to the agent in a given state
  • Mathematical Formalism: a ∈ A(s), where A(s) ⊆ A
  • Applications/Examples: Discrete: {up, down, left, right}; Continuous: steering angle in [-30°, 30°]
  • Key Properties: Action space can be finite, infinite, or parameterized

Reward (r)
  • Description: Immediate scalar feedback signal indicating the desirability of an action
  • Mathematical Formalism: r ∈ ℝ, often bounded: r ∈ [-R_max, R_max]
  • Applications/Examples: Sparse: +1 at goal, 0 elsewhere; Dense: distance-based; Shaped: intermediate milestones
  • Key Properties: Defines the objective; can be sparse, dense, or carefully shaped
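
To make the agent-environment interface concrete, the sketch below runs a random policy for one episode. It assumes the Gymnasium package and the CartPole-v1 task, neither of which is prescribed by this guide; any environment exposing reset() and step() would serve the same purpose.

```python
# Minimal agent-environment interaction loop (sketch, assuming the gymnasium package).
import gymnasium as gym

env = gym.make("CartPole-v1")           # environment: responds with observations and rewards
obs, info = env.reset(seed=0)           # initial observation of the state
total_reward = 0.0

for t in range(200):
    action = env.action_space.sample()  # placeholder "policy": uniform random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # accumulate the (undiscounted) episode return
    if terminated or truncated:         # episode ends on failure or time limit
        break

print(f"Episode return: {total_reward}")
env.close()
```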

18.2 Learning Mechanisms

Return (G)
  • Description: Total discounted future reward from the current time step
  • Mathematical Formalism: G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
  • Applications/Examples: Episode return in games, lifetime value in finance, trajectory rewards
  • Key Properties: Discount factor γ ∈ [0,1] balances immediate vs future rewards

Discount Factor (γ)
  • Description: Weighs importance of future vs immediate rewards
  • Mathematical Formalism: γ ∈ [0,1], where γ=0 is myopic and γ=1 values all future rewards equally
  • Applications/Examples: γ=0.9 in games, γ=0.99 in continuous control, γ≈1 in finance
  • Key Properties: Controls learning horizon and convergence properties

Policy (π)
  • Description: Strategy mapping states to action probabilities or deterministic actions
  • Mathematical Formalism: Stochastic: π(a|s) ∈ [0,1]; Deterministic: π(s) → a
  • Applications/Examples: ε-greedy, Boltzmann/softmax, Gaussian policies, neural networks
  • Key Properties: Can be deterministic, stochastic, parameterized, or tabular

Value Function (V)
  • Description: Expected return starting from state s under policy π
  • Mathematical Formalism: V^π(s) = 𝔼_π[G_t | S_t = s]
  • Applications/Examples: State evaluation in chess, expected lifetime value
  • Key Properties: Higher values indicate more promising states

Q-Function (Q)
  • Description: Expected return from taking action a in state s, then following policy π
  • Mathematical Formalism: Q^π(s,a) = 𝔼_π[G_t | S_t = s, A_t = a]
  • Applications/Examples: Q-tables in tabular RL, neural Q-networks in DQN
  • Key Properties: Enables action selection without an environment model

Advantage (A)
  • Description: How much better action a is compared to the average action in state s
  • Mathematical Formalism: A^π(s,a) = Q^π(s,a) - V^π(s)
  • Applications/Examples: Policy gradient variance reduction, actor-critic methods
  • Key Properties: Positive advantage → above-average action, zero → average
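
The return and discount factor are easy to illustrate in code. The sketch below computes G_t for every step of a finite episode using the recursion G_t = r_{t+1} + γ G_{t+1}; the reward sequence and the value of γ are illustrative choices, not part of the original guide.

```python
# Sketch: computing discounted returns G_t for a finite episode (pure Python).
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_{t+1} + gamma*r_{t+2} + ... for every time step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g    # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

rewards = [0.0, 0.0, 1.0]             # sparse reward: +1 only at the final step
print(discounted_returns(rewards))    # [0.9801, 0.99, 1.0] with gamma = 0.99
```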

18.3 Environment Properties

Model
  • Description: Agent's internal representation of environment dynamics
  • Mathematical Formalism: P(s',r|s,a) or T(s,a,s') and R(s,a)
  • Applications/Examples: Forward models in planning, world models in Dyna-Q
  • Key Properties: Enables planning and simulation of future trajectories

Markov Property
  • Description: Future state depends only on current state and action, not history
  • Mathematical Formalism: P(S_{t+1}|S_t, A_t, S_{t-1}, …) = P(S_{t+1}|S_t, A_t)
  • Applications/Examples: Most RL algorithms assume this; the problem becomes a POMDP when it is violated
  • Key Properties: Simplifies learning by eliminating the need to track the full history

Stationarity
  • Description: Environment dynamics don't change over time
  • Mathematical Formalism: P_t(s',r|s,a) = P(s',r|s,a) ∀t
  • Applications/Examples: Stationary games vs non-stationary financial markets
  • Key Properties: Non-stationary environments require adaptation mechanisms

Observability
  • Description: Extent to which the agent can perceive the true environment state
  • Mathematical Formalism: Fully observable: O_t = S_t; Partially observable: O_t = f(S_t) + noise
  • Applications/Examples: Full: chess; Partial: poker (hidden cards), autonomous driving
  • Key Properties: POMDPs require memory or state estimation techniques
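
One way to picture an environment model is as a lookup table over (s, a) pairs. The sketch below stores P(s', r | s, a) for a toy two-state MDP (invented here purely for illustration) and samples imagined transitions from it, which is the basic operation behind planning methods such as Dyna-Q.

```python
# Sketch: a tabular environment model P(s', r | s, a) stored as a dictionary
# (the toy two-state MDP is invented for illustration, not taken from the guide).
import random

# model[(s, a)] -> list of (probability, next_state, reward)
model = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # stochastic transition
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def sample_transition(s, a):
    """Sample (s', r) from the model, as a planner or Dyna-style learner would."""
    outcomes = model[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r

print(sample_transition("s0", "go"))
```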

18.4 Learning Paradigms

On-Policy
  • Description: Learning from data generated by the current policy being improved
  • Mathematical Formalism: π_{behavior} = π_{target}; update π using its own experience
  • Applications/Examples: SARSA, Policy Gradient, A2C, PPO, TRPO
  • Key Properties: More stable but less sample efficient, safer updates

Off-Policy
  • Description: Learning from data generated by a different (often older) policy
  • Mathematical Formalism: π_{behavior} ≠ π_{target}; importance sampling ratio: ρ = π/μ
  • Applications/Examples: Q-Learning, DQN, DDPG, SAC, experience replay
  • Key Properties: More sample efficient but requires correction for distribution shift

Model-Free
  • Description: Learn directly from experience without learning an environment model
  • Mathematical Formalism: Direct V/Q updates from (s,a,r,s') tuples
  • Applications/Examples: Q-Learning, Policy Gradient, Actor-Critic methods
  • Key Properties: Simpler but may require more samples, works with unknown dynamics

Model-Based
  • Description: Learn an environment model first, then use it for planning/learning
  • Mathematical Formalism: Learn P(s'|s,a), then plan using the model
  • Applications/Examples: Dyna-Q, MCTS, MuZero, world models
  • Key Properties: Sample efficient but model errors can compound
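
The off-policy correction ρ = π/μ mentioned above can be shown with a minimal importance-sampling computation; the probabilities below are made-up numbers for illustration, not taken from any particular algorithm.

```python
# Sketch: per-step importance weight rho = pi(a|s) / mu(a|s) used to re-weight
# off-policy data (toy probabilities chosen for illustration only).
def importance_weight(pi_prob, mu_prob):
    """Ratio of target-policy to behavior-policy probability for the action taken."""
    return pi_prob / mu_prob

# Behavior policy mu chose action 'a' with prob 0.25, target policy pi would
# have chosen it with prob 0.5, so this transition is up-weighted:
print(importance_weight(pi_prob=0.5, mu_prob=0.25))   # 2.0

# A whole trajectory is re-weighted by the product of per-step ratios:
per_step = [0.5 / 0.25, 0.9 / 0.8, 0.1 / 0.3]
trajectory_weight = 1.0
for rho in per_step:
    trajectory_weight *= rho
print(trajectory_weight)
```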

18.5 Exploration Strategies

Exploration
  • Description: Trying suboptimal actions to discover potentially better policies
  • Mathematical Formalism: Various strategies with different theoretical guarantees
  • Applications/Examples: ε-greedy, UCB, Thompson sampling, curiosity-driven exploration
  • Key Properties: Essential for discovering optimal policies, balances with exploitation

Exploitation
  • Description: Using current knowledge to maximize immediate reward
  • Mathematical Formalism: π_{greedy}(s) = argmax_a Q(s,a)
  • Applications/Examples: Greedy action selection after learning
  • Key Properties: Must be balanced with exploration to avoid local optima

ε-Greedy
  • Description: Choose a random action with probability ε, the best known action otherwise
  • Mathematical Formalism: π(a|s) = ε/|A| + (1-ε)·𝟙[a = a*], where a* = argmax_{a'} Q(s,a')
  • Applications/Examples: Simple exploration in tabular RL, decaying ε schedules
  • Key Properties: Simple but may explore inefficiently in large state spaces

Upper Confidence Bound (UCB)
  • Description: Choose actions with the highest optimistic estimate
  • Mathematical Formalism: a_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))]
  • Applications/Examples: Multi-armed bandits, MCTS (UCT), confidence-based exploration
  • Key Properties: Theoretically principled, provides exploration guarantees

Thompson Sampling
  • Description: Sample actions according to the probability that they are optimal
  • Mathematical Formalism: Sample θ ~ posterior, choose a = argmax_a Q_θ(s,a)
  • Applications/Examples: Bayesian bandits, Bayesian RL
  • Key Properties: Naturally balances exploration/exploitation via posterior uncertainty
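
The sketch below implements ε-greedy and UCB action selection over a small set of Q-estimates; the Q-values, visit counts, and exploration constants are illustrative, not prescribed.

```python
# Sketch: epsilon-greedy and UCB action selection (pure Python, toy values).
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def ucb(q_values, counts, t, c=2.0):
    """Pick the action with the highest optimistic estimate Q(a) + c*sqrt(ln t / N(a))."""
    def score(a):
        if counts[a] == 0:
            return float("inf")               # untried actions are tried first
        return q_values[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(q_values)), key=score)

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q, epsilon=0.1))
print(ucb(q, counts=[3, 5, 0], t=8))          # returns 2: the untried arm
```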

18.6 Key Algorithms & Methods

Q-Learning
  • Type: Off-policy, model-free
  • Key Innovation: Learns the optimal Q-function without following the optimal policy
  • Mathematical Update: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]
  • Applications: Tabular problems, foundation for DQN

SARSA
  • Type: On-policy, model-free
  • Key Innovation: Updates based on the actual next action taken
  • Mathematical Update: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]
  • Applications: Safe exploration, tabular RL

Policy Gradient
  • Type: On-policy, model-free
  • Key Innovation: Direct policy optimization using gradient ascent
  • Mathematical Update: ∇_θ J(θ) = 𝔼[∇_θ log π_θ(a|s) A^π(s,a)]
  • Applications: Continuous action spaces, stochastic policies

Actor-Critic
  • Type: On-policy, model-free
  • Key Innovation: Combines value estimation with policy optimization
  • Mathematical Update: Actor: updates π_θ along ∇_θ J(θ); Critic: learns V^π or Q^π
  • Applications: Reduces variance in policy gradients

DQN
  • Type: Off-policy, model-free
  • Key Innovation: Neural networks + experience replay + target networks
  • Mathematical Update: Q-learning with neural network function approximation
  • Applications: High-dimensional state spaces, Atari games

PPO
  • Type: On-policy, model-free
  • Key Innovation: Clipped surrogate objective for stable policy updates
  • Mathematical Update: L^{CLIP}(θ) = 𝔼[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)], where r_t(θ) = π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t)
  • Applications: Stable policy optimization, continuous control
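
As a concrete instance of the Q-Learning entry above, the sketch below performs one tabular Q-learning update on a single (s, a, r, s') transition; the state and action names and the hyperparameters are invented for illustration.

```python
# Sketch: one tabular Q-learning update, matching the update rule listed above.
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> estimated action value
alpha, gamma = 0.1, 0.99
actions = ["left", "right"]

def q_learning_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Single observed transition (s, a, r, s'):
q_learning_update("s0", "right", 1.0, "s1")
print(Q[("s0", "right")])       # 0.1 after one update from a zero-initialized table
```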

18.7 Advanced Concepts

Function Approximation
  • Description: Using parameterized functions to represent V, Q, or π
  • Mathematical Formalism: V_θ(s), Q_θ(s,a), π_θ(a|s), where θ are the parameters
  • Applications/Examples: Neural networks, linear functions, tile coding
  • Key Properties: Enables scaling to large/continuous state spaces

Experience Replay
  • Description: Storing and reusing past transitions to improve sample efficiency
  • Mathematical Formalism: Sample (s,a,r,s') from replay buffer D uniformly
  • Applications/Examples: DQN, DDPG, Rainbow improvements
  • Key Properties: Breaks correlation in sequential data, enables off-policy learning

Target Networks
  • Description: Using separate, slowly updated networks for stable learning
  • Mathematical Formalism: θ^- ← τθ + (1-τ)θ^-, where τ << 1
  • Applications/Examples: DQN, DDPG for reducing the moving-target problem
  • Key Properties: Stabilizes learning by reducing correlation between Q-values and targets

Eligibility Traces
  • Description: Credit assignment mechanism that bridges temporal difference and Monte Carlo methods
  • Mathematical Formalism: e_t(s) = γλ e_{t-1}(s) + 1_{S_t=s}
  • Applications/Examples: TD(λ), SARSA(λ) for faster learning
  • Key Properties: Allows faster propagation of rewards to earlier states

Multi-Agent RL
  • Description: Multiple agents learning simultaneously in a shared environment
  • Mathematical Formalism: Nash equilibria, correlated equilibria, social welfare
  • Applications/Examples: Game theory, autonomous vehicle coordination, economics
  • Key Properties: Requires handling the non-stationarity introduced by other learning agents
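
Experience replay and soft target-network updates can be sketched without any deep-learning framework. The buffer below stores transitions in a deque and samples uniformly, and soft_update applies θ^- ← τθ + (1-τ)θ^- to plain lists of parameters; the capacity and τ values are illustrative defaults rather than prescribed settings.

```python
# Sketch: a minimal experience-replay buffer and a soft target-network update.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # old transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

def soft_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, elementwise."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_params, target_params)]

buf = ReplayBuffer()
buf.push("s0", 0, 1.0, "s1", False)
print(buf.sample(1))
print(soft_update(target_params=[0.0, 0.0], online_params=[1.0, 2.0]))
```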

18.8 Fundamental Equations

18.8.1 Bellman Equations

  • State Value: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')]
  • Action Value: Q^π(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ Σ_{a'} π(a'|s')Q^π(s',a')]
  • Optimality (State): V*(s) = max_a Σ_{s',r} p(s',r|s,a)[r + γV*(s')]
  • Optimality (Action): Q*(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ max_{a'} Q*(s',a')]
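
The optimality equations above are the basis of value iteration. The sketch below repeatedly applies the Bellman optimality backup V(s) ← max_a Σ_{s',r} p(s',r|s,a)[r + γV(s')] to a toy two-state MDP; the MDP itself is invented here for illustration.

```python
# Sketch: value iteration using the Bellman optimality backup on a toy MDP.
gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]
# p[(s, a)] -> list of (probability, next_state, reward)
p = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(1.0, "s1", 1.0)],
    ("s1", "a0"): [(1.0, "s1", 0.0)],
    ("s1", "a1"): [(1.0, "s0", 2.0)],
}

V = {s: 0.0 for s in states}
for _ in range(200):   # synchronous backups until (approximate) convergence
    V = {
        s: max(
            sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p[(s, a)])
            for a in actions
        )
        for s in states
    }
print(V)   # approximates V*(s) for this toy MDP
```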

18.8.2 Policy Gradient Theorem

∇_θ J(θ) = ∇_θ V^{π_θ}(s_0) = Σ_s μ^{π_θ}(s) Σ_a ∇_θ π_θ(a|s) Q^{π_θ}(s,a)

where μ^{π_θ}(s) is the stationary distribution under policy π_θ.
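
For a softmax policy over linear scores, the score function ∇_θ log π_θ(a|s) has a simple closed form, and it is exactly the term the theorem weights by Q^{π_θ}(s,a). The sketch below computes a one-sample gradient estimate; NumPy, the feature vector, and the stand-in Q value are assumptions made for illustration.

```python
# Sketch: one-sample policy-gradient estimate for a linear-softmax policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_actions, n_features = 3, 4
theta = np.zeros((n_actions, n_features))      # policy parameters
phi = np.array([1.0, 0.5, -0.2, 0.3])          # feature vector for state s (illustrative)

probs = softmax(theta @ phi)                   # pi_theta(.|s)
a = 1                                          # sampled action

# For softmax over linear scores, grad_theta log pi(a|s) has row a' equal to
# ( [a' == a] - pi(a'|s) ) * phi.
grad_log_pi = (np.eye(n_actions)[a] - probs)[:, None] * phi[None, :]

q_estimate = 2.0                               # stand-in for Q^pi(s, a)
grad_J_sample = q_estimate * grad_log_pi       # one-sample estimate of the gradient
print(grad_J_sample.shape)                     # (3, 4), same shape as theta
```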

18.8.3 Temporal Difference Error

δ_t = r_{t+1} + γV(s_{t+1}) - V(s_t)

This error drives learning in most RL algorithms: it measures the gap between the bootstrapped target r_{t+1} + γV(s_{t+1}) and the current estimate V(s_t), i.e., how much better or worse the observed transition was than expected.
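
A single TD(0) update makes the role of δ_t explicit; the states, reward, and step size below are illustrative.

```python
# Sketch: one TD(0) update driven by the TD error delta_t defined above.
V = {"s0": 0.0, "s1": 0.5}
alpha, gamma = 0.1, 0.99

def td0_update(s, r, s_next):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: surprise relative to the estimate
    V[s] += alpha * delta
    return delta

print(td0_update("s0", r=1.0, s_next="s1"))  # delta = 1.0 + 0.99*0.5 - 0.0 = 1.495
print(V["s0"])                               # 0.1495 after the update
```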

18.9 Common Challenges & Solutions

Sample Efficiency
  • Problem: RL often requires many environment interactions
  • Solutions: Model-based methods, experience replay, transfer learning
  • Examples: Dyna-Q, DQN replay buffer, pre-training

Exploration
  • Problem: Discovering good policies in large state spaces
  • Solutions: Curiosity-driven exploration, count-based bonuses, UCB
  • Examples: ICM, pseudo-counts, optimism under uncertainty

Stability
  • Problem: Function approximation can cause instability
  • Solutions: Target networks, experience replay, regularization
  • Examples: DQN improvements, PPO clipping

Sparse Rewards
  • Problem: Learning with infrequent feedback signals
  • Solutions: Reward shaping, hierarchical RL, curriculum learning
  • Examples: HER, Options framework, progressive tasks

Partial Observability
  • Problem: Agent cannot fully observe environment state
  • Solutions: Recurrent policies, belief state estimation, memory
  • Examples: LSTM policies, particle filters, attention mechanisms
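
As one example of the reward-shaping remedy listed under Sparse Rewards, the sketch below applies potential-based shaping, F(s, s') = γΦ(s') - Φ(s), with a distance-based potential; the grid-world goal and potential function are invented for illustration.

```python
# Sketch: potential-based reward shaping with a distance-based potential
# (toy grid-world goal chosen for illustration).
gamma = 0.99
goal = (5, 5)

def potential(state):
    """Negative Manhattan distance to the goal: closer states get higher potential."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(r_env, s, s_next):
    """Add the shaping term F = gamma*Phi(s') - Phi(s); potential-based shaping
    densifies feedback without changing the optimal policy."""
    return r_env + gamma * potential(s_next) - potential(s)

# A sparse environment reward (0 away from the goal) becomes informative:
print(shaped_reward(0.0, s=(0, 0), s_next=(1, 0)))  # > 0: the agent moved closer
```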