Chapter 18 Appendix
18.1 Comprehensive Reinforcement Learning Concepts Guide
| Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
|---|---|---|---|---|
| Agent | Autonomous learner/decision-maker that perceives environment and takes actions to maximize cumulative reward | Agent: (π, V, Q) | Autonomous vehicles, trading bots, game AI (AlphaGo), robotic controllers, chatbots | Must balance exploration vs exploitation, learns from trial and error |
| Environment | External system that responds to agent actions with state transitions and rewards | Markov Decision Process (MDP): ⟨S, A, P, R, γ⟩ | OpenAI Gym environments, real-world robotics, financial markets, video games | Can be deterministic or stochastic, fully or partially observable |
| State (s) | Complete information needed to make optimal decisions at current time | s ∈ S, where S is state space | Chess position (64 squares), robot pose (x,y,θ), market conditions, pixel arrays | Markov property: future depends only on current state, not history |
| Action (a) | Discrete or continuous choices available to agent in given state | a ∈ A(s), where A(s) ⊆ A | Discrete: {up, down, left, right}, Continuous: steering angle [-30°, 30°] | Action space can be finite, infinite, or parameterized |
| Reward (r) | Immediate scalar feedback signal indicating desirability of action | r ∈ ℝ, often bounded: r ∈ [-R_max, R_max] | Sparse: +1 goal, 0 elsewhere; Dense: distance-based; Shaped: intermediate milestones | Defines objective, can be sparse, dense, or carefully shaped |
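The agent-environment interaction summarized above reduces to a simple loop: observe a state, pick an action, receive a reward and the next state. The sketch below shows that loop with a random placeholder policy; it assumes the Gymnasium package and its CartPole-v1 environment purely for illustration.

```python
# Minimal agent-environment interaction loop (sketch; assumes Gymnasium is installed).
import gymnasium as gym

env = gym.make("CartPole-v1")                    # environment: produces states and rewards
obs, info = env.reset(seed=0)                    # initial state s_0

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # placeholder "agent": a uniform random policy
    obs, reward, terminated, truncated, info = env.step(action)  # environment transition
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```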
18.2 Learning Mechanisms
| Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
|---|---|---|---|---|
| Return (G) | Total discounted future reward from current time step | G_t = Σ_{k=0}^∞ γ^k r_{t+k+1} | Episode return in games, lifetime value in finance, trajectory rewards | Discount factor γ ∈ [0,1] balances immediate vs future rewards |
| Discount Factor (γ) | Weighs importance of future vs immediate rewards | γ ∈ [0,1], where γ=0 is myopic, γ=1 values all future equally | γ=0.9 in games, γ=0.99 in continuous control, γ≈1 in finance | Controls learning horizon and convergence properties |
| Policy (π) | Strategy mapping states to action probabilities or deterministic actions | Stochastic: π(a\|s) ∈ [0,1], Deterministic: π(s) → a | ε-greedy, Boltzmann/softmax, Gaussian policies, neural networks | Can be deterministic, stochastic, parameterized, or tabular |
| Value Function (V) | Expected return starting from state s under policy π | V^π(s) = 𝔼_π[G_t \| S_t = s] | State evaluation in chess, expected lifetime value | Higher values indicate more promising states |
| Q-Function (Q) | Expected return from taking action a in state s, then following policy π | Q^π(s,a) = 𝔼_π[G_t \| S_t = s, A_t = a] | Q-tables in tabular RL, neural Q-networks in DQN | Enables action selection without environment model |
| Advantage (A) | How much better action a is compared to average in state s | A^π(s,a) = Q^π(s,a) - V^π(s) | Policy gradient variance reduction, actor-critic methods | Positive advantage → above average action, zero → average |
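As a concrete illustration of the return and advantage definitions above, the sketch below computes discounted returns backwards over an episode and an advantage from hypothetical Q and V estimates; all numbers are illustrative.

```python
# Discounted return G_t = sum_k gamma^k r_{t+k+1}, computed backwards over one episode.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

rewards = [0.0, 0.0, 1.0]                  # sparse reward: +1 only at the goal step
print(discounted_returns(rewards))         # [0.9801, 0.99, 1.0] for gamma = 0.99

# Advantage A(s,a) = Q(s,a) - V(s): positive means better than the policy's average action.
q_sa, v_s = 1.3, 1.0
advantage = q_sa - v_s                     # 0.3 -> above-average action
```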
18.3 Environment Properties
| Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
|---|---|---|---|---|
| Model | Agent’s internal representation of environment dynamics | P(s',r\|s,a) or T(s,a,s') and R(s,a) | Forward models in planning, world models in Dyna-Q | Enables planning and simulation of future trajectories |
| Markov Property | Future state depends only on current state and action, not history | P(S_{t+1}\|S_t, A_t, S_{t-1}, …) = P(S_{t+1}\|S_t, A_t) | Most RL algorithms assume this; POMDP when violated | Simplifies learning by eliminating need to track full history |
| Stationarity | Environment dynamics don’t change over time | P_t(s',r\|s,a) = P(s',r\|s,a) ∀t | Stationary games vs non-stationary financial markets | Non-stationary environments require adaptation mechanisms |
| Observability | Extent to which agent can perceive true environment state | Fully: O_t = S_t, Partially: O_t = f(S_t) + noise | Full: chess, Partial: poker (hidden cards), autonomous driving | POMDPs require memory or state estimation techniques |
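One way to make the model and Markov-property rows concrete is a tabular transition model P(s'|s,a): the next state is sampled from a distribution that depends only on the current state and action, never on earlier history. The toy MDP below is purely illustrative; its state names, probabilities, and rewards are made up.

```python
# Tabular transition model P(s' | s, a) for a toy 3-state MDP (illustrative numbers).
import random

P = {
    ("s0", "right"): [("s1", 0.9), ("s0", 0.1)],   # list of (next state, probability)
    ("s1", "right"): [("s2", 0.8), ("s1", 0.2)],
    ("s2", "stay"):  [("s2", 1.0)],
}
R = {("s0", "right"): 0.0, ("s1", "right"): 1.0, ("s2", "stay"): 0.0}

def step(state, action):
    """Markov property: the sampled next state depends only on (state, action)."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[(state, action)]

print(step("s0", "right"))
```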
18.4 Learning Paradigms
| Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
|---|---|---|---|---|
| On-Policy | Learning from data generated by current policy being improved | π_{behavior} = π_{target}, update π using its own experience | SARSA, Policy Gradient, A2C, PPO, TRPO | More stable but less sample efficient, safer updates |
| Off-Policy | Learning from data generated by different (often older) policy | π_{behavior} ≠ π_{target}, importance sampling: ρ = π/μ | Q-Learning, DQN, DDPG, SAC, experience replay | More sample efficient but requires correction for distribution shift |
| Model-Free | Learn directly from experience without learning an environment model | Direct V/Q updates from (s,a,r,s') tuples | Q-Learning, Policy Gradient, Actor-Critic methods | Simpler but may require more samples; works with unknown dynamics |
| Model-Based | Learn an environment model first, then use it for planning/learning | Learn P(s'\|s,a), then plan using the model | Dyna-Q, MCTS, MuZero, world models | Sample efficient but model errors can compound |
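The model-free/model-based split is easy to see in a Dyna-Q-style sketch: the same Q-learning update is applied both to real transitions (direct, model-free learning) and to transitions replayed from a learned model (planning). The tabular version below assumes deterministic dynamics and uses placeholder states, actions, and hyperparameters.

```python
# Dyna-Q-style sketch: model-free Q-update plus n planning updates from a learned model.
import random
from collections import defaultdict

Q = defaultdict(float)              # Q[(s, a)]
model = {}                          # model[(s, a)] = (r, s')  -- deterministic model assumption
alpha, gamma, n_planning = 0.1, 0.95, 10
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s_next):
    q_update(s, a, r, s_next)                   # direct RL: learn from the real transition
    model[(s, a)] = (r, s_next)                 # model learning
    for _ in range(n_planning):                 # planning: replay simulated transitions
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)

dyna_q_step("s0", "right", 0.0, "s1")
```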
18.5 Exploration Strategies
| Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
|---|---|---|---|---|
| Exploration | Trying suboptimal actions to discover potentially better policies | Various strategies with different theoretical guarantees | ε-greedy, UCB, Thompson sampling, curiosity-driven | Essential for discovering optimal policies, balances with exploitation |
| Exploitation | Using current knowledge to maximize immediate reward | π_{greedy}(s) = argmax_a Q(s,a) | Greedy action selection after learning | Must be balanced with exploration to avoid local optima |
| ε-Greedy | Choose random action with probability ε, best known action otherwise | π(a\|s) = ε/\|A\| + (1-ε)δ_{a,a*}, where a* = argmax_a Q(s,a) | Simple exploration in tabular RL, decaying ε schedules | Simple but may explore inefficiently in large state spaces |
| Upper Confidence Bound | Choose actions with highest optimistic estimate | a_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))] | Multi-armed bandits, MCTS (UCT), confidence-based exploration | Theoretically principled, provides exploration guarantees |
| Thompson Sampling | Sample actions according to probability they are optimal | Sample θ ~ posterior, choose a = argmax_a Q_θ(s,a) | Bayesian bandits, Bayesian RL | Naturally balances exploration/exploitation via uncertainty |
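The two most common exploration rules above, ε-greedy and UCB, each fit in a few lines. The sketch below uses illustrative Q-value estimates and visit counts for a small discrete action set.

```python
# ε-greedy and UCB action selection for a small discrete action set (illustrative values).
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability ε pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def ucb(q_values, counts, t, c=2.0):
    """Pick the action with the highest optimistic estimate Q(a) + c*sqrt(ln t / N(a))."""
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / (counts[a] + 1e-8)),
    )

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q), ucb(q, counts=[3, 10, 1], t=14))
```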
18.6 Key Algorithms & Methods
| Algorithm | Type | Key Innovation | Mathematical Update | Applications |
|---|---|---|---|---|
| Q-Learning | Off-policy, Model-free | Learns optimal Q-function without following optimal policy | Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)] | Tabular problems, foundation for DQN |
| SARSA | On-policy, Model-free | Updates based on actual next action taken | Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)] | Safe exploration, tabular RL |
| Policy Gradient | On-policy, Model-free | Direct policy optimization using gradient ascent | ∇_θ J(θ) = 𝔼[∇_θ log π_θ(a\|s) A^π(s,a)] | Continuous action spaces, stochastic policies |
| Actor-Critic | On-policy, Model-free | Combines value estimation with policy optimization | Actor: updates π_θ via ∇_θ J(θ); Critic: learns V^π or Q^π | Reduces variance in policy gradients |
| DQN | Off-policy, Model-free | Neural networks + experience replay + target networks | Q-learning with neural network approximation | High-dimensional state spaces, Atari games |
| PPO | On-policy, Model-free | Clipped surrogate objective for stable policy updates | L^{CLIP}(θ) = 𝔼[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)], where r_t(θ) = π_θ(a_t\|s_t)/π_{θ_old}(a_t\|s_t) | Stable policy optimization, continuous control |
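The Q-Learning and SARSA rows differ only in the bootstrap target, which is easiest to see side by side. The tabular sketch below uses placeholder states, actions, and learning-rate/discount values.

```python
# Q-learning vs SARSA: the only difference is the bootstrap target (tabular, illustrative).
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.99
actions = [0, 1]

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap with the best next action, regardless of what is actually taken.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap with the next action the current policy actually took.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```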
18.7 Advanced Concepts
| Concept | Description | Mathematical Formalism | Applications/Examples | Key Properties |
|---|---|---|---|---|
| Function Approximation | Using parameterized functions to represent V, Q, or π | V_θ(s), Q_θ(s,a), π_θ(a\|s), where θ are parameters | Neural networks, linear functions, tile coding | Enables scaling to large/continuous state spaces |
| Experience Replay | Storing and reusing past transitions to improve sample efficiency | Sample (s,a,r,s') from replay buffer D uniformly | DQN, DDPG, Rainbow improvements | Breaks correlation in sequential data, enables off-policy learning |
| Target Networks | Using separate, slowly updated networks for stable learning | θ^- ← τθ + (1-τ)θ^- where τ << 1 | DQN, DDPG for reducing moving target problem | Stabilizes learning by reducing correlation between Q-values and targets |
| Eligibility Traces | Credit assignment mechanism that bridges temporal difference and Monte Carlo | e_t(s) = γλe_{t-1}(s) + 1_{S_t=s} | TD(λ), SARSA(λ) for faster learning | Allows faster propagation of rewards to earlier states |
| Multi-Agent RL | Multiple agents learning simultaneously in shared environment | Nash equilibria, correlated equilibria, social welfare | Game theory, autonomous vehicle coordination, economics | Requires handling non-stationarity from other learning agents |
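Experience replay and target networks are simple to state in code even though they matter mostly for deep RL. The sketch below shows a uniform replay buffer and a soft (Polyak) target update; capacity, batch size, and τ are illustrative, and the "parameters" are plain numbers standing in for network weights.

```python
# Uniform replay buffer and soft target-network update (hyperparameters are illustrative).
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation of sequential experience.
        return random.sample(self.buffer, batch_size)

def soft_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, with tau << 1."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]
```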
18.8 Fundamental Equations
18.8.1 Bellman Equations
- State Value: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')]
- Action Value: Q^π(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ Σ_{a'} π(a'|s')Q^π(s',a')]
- Optimality (State): V*(s) = max_a Σ_{s',r} p(s',r|s,a)[r + γV*(s')]
- Optimality (Action): Q*(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ max_{a'} Q*(s',a')]
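The Bellman optimality equation for V* leads directly to value iteration: repeatedly replace V(s) with the maximum over actions of the one-step backup. The sketch below runs this on a two-state toy MDP whose dynamics are made up for illustration.

```python
# Value iteration: repeated Bellman optimality backups on a small tabular MDP.
# States, actions, and dynamics below are illustrative placeholders.
P = {  # P[(s, a)] = list of (probability, next_state, reward)
    ("s0", "a"): [(1.0, "s1", 0.0)],
    ("s0", "b"): [(1.0, "s0", 0.1)],
    ("s1", "a"): [(1.0, "s1", 1.0)],
    ("s1", "b"): [(1.0, "s0", 0.0)],
}
states, actions, gamma = ["s0", "s1"], ["a", "b"], 0.9

V = {s: 0.0 for s in states}
for _ in range(100):
    V = {
        s: max(
            sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[(s, a)])
            for a in actions
        )
        for s in states
    }
print(V)  # converges toward V*(s1) = 1/(1-γ) = 10 and V*(s0) = γ·V*(s1) = 9
```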
18.9 Common Challenges & Solutions
| Challenge | Problem | Solutions | Examples |
|---|---|---|---|
| Sample Efficiency | RL often requires many environment interactions | Model-based methods, experience replay, transfer learning | Dyna-Q, DQN replay buffer, pre-training |
| Exploration | Discovering good policies in large state spaces | Curiosity-driven exploration, count-based bonuses, UCB | ICM, pseudo-counts, optimism under uncertainty |
| Stability | Function approximation can cause instability | Target networks, experience replay, regularization | DQN improvements, PPO clipping |
| Sparse Rewards | Learning with infrequent feedback signals | Reward shaping, hierarchical RL, curriculum learning | HER, Options framework, progressive tasks |
| Partial Observability | Agent cannot fully observe environment state | Recurrent policies, belief state estimation, memory | LSTM policies, particle filters, attention mechanisms |