Chapter 18 Appendix

18.1 Comprehensive Reinforcement Learning Concepts Guide

Agent
  • Description: Autonomous learner/decision-maker that perceives the environment and takes actions to maximize cumulative reward
  • Mathematical Formalism: Agent: (π, V, Q)
  • Applications/Examples: Autonomous vehicles, trading bots, game AI (AlphaGo), robotic controllers, chatbots
  • Key Properties: Must balance exploration vs exploitation, learns from trial and error

Environment
  • Description: External system that responds to agent actions with state transitions and rewards
  • Mathematical Formalism: Markov Decision Process (MDP): ⟨S, A, P, R, γ⟩
  • Applications/Examples: OpenAI Gym environments, real-world robotics, financial markets, video games
  • Key Properties: Can be deterministic or stochastic, fully or partially observable

State (s)
  • Description: Complete information needed to make optimal decisions at the current time
  • Mathematical Formalism: s ∈ S, where S is the state space
  • Applications/Examples: Chess position (64 squares), robot pose (x, y, θ), market conditions, pixel arrays
  • Key Properties: Markov property: the future depends only on the current state, not the history

Action (a)
  • Description: Discrete or continuous choices available to the agent in a given state
  • Mathematical Formalism: a ∈ A(s), where A(s) ⊆ A
  • Applications/Examples: Discrete: {up, down, left, right}; Continuous: steering angle in [-30°, 30°]
  • Key Properties: Action space can be finite, infinite, or parameterized

Reward (r)
  • Description: Immediate scalar feedback signal indicating the desirability of an action
  • Mathematical Formalism: r ∈ ℝ, often bounded: r ∈ [-R_max, R_max]
  • Applications/Examples: Sparse: +1 at goal, 0 elsewhere; Dense: distance-based; Shaped: intermediate milestones
  • Key Properties: Defines the objective; can be sparse, dense, or carefully shaped
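
To make the agent-environment interface concrete, the sketch below runs a random policy for one episode. It assumes the Gymnasium package and the CartPole-v1 task, neither of which is prescribed by this guide; any environment exposing reset() and step() would serve the same purpose.

```python
# Minimal agent-environment interaction loop (sketch, assuming the gymnasium package).
import gymnasium as gym

env = gym.make("CartPole-v1")           # environment: responds with observations and rewards
obs, info = env.reset(seed=0)           # initial observation of the state
total_reward = 0.0

for t in range(200):
    action = env.action_space.sample()  # placeholder "policy": uniform random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # accumulate the (undiscounted) episode return
    if terminated or truncated:         # episode ends on failure or time limit
        break

print(f"Episode return: {total_reward}")
env.close()
```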

18.2 Learning Mechanisms

Return (G)
  • Description: Total discounted future reward from the current time step
  • Mathematical Formalism: G_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
  • Applications/Examples: Episode return in games, lifetime value in finance, trajectory rewards
  • Key Properties: Discount factor γ ∈ [0,1] balances immediate vs future rewards

Discount Factor (γ)
  • Description: Weighs importance of future vs immediate rewards
  • Mathematical Formalism: γ ∈ [0,1], where γ=0 is myopic and γ=1 values all future rewards equally
  • Applications/Examples: γ=0.9 in games, γ=0.99 in continuous control, γ≈1 in finance
  • Key Properties: Controls learning horizon and convergence properties

Policy (π)
  • Description: Strategy mapping states to action probabilities or deterministic actions
  • Mathematical Formalism: Stochastic: π(a|s) ∈ [0,1]; Deterministic: π(s) → a
  • Applications/Examples: ε-greedy, Boltzmann/softmax, Gaussian policies, neural networks
  • Key Properties: Can be deterministic, stochastic, parameterized, or tabular

Value Function (V)
  • Description: Expected return starting from state s under policy π
  • Mathematical Formalism: V^π(s) = 𝔼_π[G_t | S_t = s]
  • Applications/Examples: State evaluation in chess, expected lifetime value
  • Key Properties: Higher values indicate more promising states

Q-Function (Q)
  • Description: Expected return from taking action a in state s, then following policy π
  • Mathematical Formalism: Q^π(s,a) = 𝔼_π[G_t | S_t = s, A_t = a]
  • Applications/Examples: Q-tables in tabular RL, neural Q-networks in DQN
  • Key Properties: Enables action selection without an environment model

Advantage (A)
  • Description: How much better action a is compared to the average action in state s
  • Mathematical Formalism: A^π(s,a) = Q^π(s,a) - V^π(s)
  • Applications/Examples: Policy gradient variance reduction, actor-critic methods
  • Key Properties: Positive advantage → above-average action, zero → average
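
The return and discount factor are easy to illustrate in code. The sketch below computes G_t for every step of a finite episode using the recursion G_t = r_{t+1} + γ G_{t+1}; the reward sequence and the value of γ are illustrative choices, not part of the original guide.

```python
# Sketch: computing discounted returns G_t for a finite episode (pure Python).
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_{t+1} + gamma*r_{t+2} + ... for every time step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g    # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

rewards = [0.0, 0.0, 1.0]             # sparse reward: +1 only at the final step
print(discounted_returns(rewards))    # [0.9801, 0.99, 1.0] with gamma = 0.99
```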

18.3 Environment Properties

Model
  • Description: Agent's internal representation of environment dynamics
  • Mathematical Formalism: P(s',r|s,a) or T(s,a,s') and R(s,a)
  • Applications/Examples: Forward models in planning, world models in Dyna-Q
  • Key Properties: Enables planning and simulation of future trajectories

Markov Property
  • Description: Future state depends only on current state and action, not history
  • Mathematical Formalism: P(S_{t+1}|S_t, A_t, S_{t-1}, …) = P(S_{t+1}|S_t, A_t)
  • Applications/Examples: Most RL algorithms assume this; the problem becomes a POMDP when it is violated
  • Key Properties: Simplifies learning by eliminating the need to track the full history

Stationarity
  • Description: Environment dynamics don't change over time
  • Mathematical Formalism: P_t(s',r|s,a) = P(s',r|s,a) ∀t
  • Applications/Examples: Stationary games vs non-stationary financial markets
  • Key Properties: Non-stationary environments require adaptation mechanisms

Observability
  • Description: Extent to which the agent can perceive the true environment state
  • Mathematical Formalism: Fully observable: O_t = S_t; Partially observable: O_t = f(S_t) + noise
  • Applications/Examples: Full: chess; Partial: poker (hidden cards), autonomous driving
  • Key Properties: POMDPs require memory or state estimation techniques
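
One way to picture an environment model is as a lookup table over (s, a) pairs. The sketch below stores P(s', r | s, a) for a toy two-state MDP (invented here purely for illustration) and samples imagined transitions from it, which is the basic operation behind planning methods such as Dyna-Q.

```python
# Sketch: a tabular environment model P(s', r | s, a) stored as a dictionary
# (the toy two-state MDP is invented for illustration, not taken from the guide).
import random

# model[(s, a)] -> list of (probability, next_state, reward)
model = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # stochastic transition
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def sample_transition(s, a):
    """Sample (s', r) from the model, as a planner or Dyna-style learner would."""
    outcomes = model[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r

print(sample_transition("s0", "go"))
```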

18.4 Learning Paradigms

On-Policy
  • Description: Learning from data generated by the current policy being improved
  • Mathematical Formalism: π_{behavior} = π_{target}; update π using its own experience
  • Applications/Examples: SARSA, Policy Gradient, A2C, PPO, TRPO
  • Key Properties: More stable but less sample efficient, safer updates

Off-Policy
  • Description: Learning from data generated by a different (often older) policy
  • Mathematical Formalism: π_{behavior} ≠ π_{target}; importance sampling ratio: ρ = π/μ
  • Applications/Examples: Q-Learning, DQN, DDPG, SAC, experience replay
  • Key Properties: More sample efficient but requires correction for distribution shift

Model-Free
  • Description: Learn directly from experience without learning an environment model
  • Mathematical Formalism: Direct V/Q updates from (s,a,r,s') tuples
  • Applications/Examples: Q-Learning, Policy Gradient, Actor-Critic methods
  • Key Properties: Simpler but may require more samples, works with unknown dynamics

Model-Based
  • Description: Learn an environment model first, then use it for planning/learning
  • Mathematical Formalism: Learn P(s'|s,a), then plan using the model
  • Applications/Examples: Dyna-Q, MCTS, MuZero, world models
  • Key Properties: Sample efficient but model errors can compound
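
The off-policy correction ρ = π/μ mentioned above can be shown with a minimal importance-sampling computation; the probabilities below are made-up numbers for illustration, not taken from any particular algorithm.

```python
# Sketch: per-step importance weight rho = pi(a|s) / mu(a|s) used to re-weight
# off-policy data (toy probabilities chosen for illustration only).
def importance_weight(pi_prob, mu_prob):
    """Ratio of target-policy to behavior-policy probability for the action taken."""
    return pi_prob / mu_prob

# Behavior policy mu chose action 'a' with prob 0.25, target policy pi would
# have chosen it with prob 0.5, so this transition is up-weighted:
print(importance_weight(pi_prob=0.5, mu_prob=0.25))   # 2.0

# A whole trajectory is re-weighted by the product of per-step ratios:
per_step = [0.5 / 0.25, 0.9 / 0.8, 0.1 / 0.3]
trajectory_weight = 1.0
for rho in per_step:
    trajectory_weight *= rho
print(trajectory_weight)
```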

18.5 Exploration Strategies

Exploration
  • Description: Trying suboptimal actions to discover potentially better policies
  • Mathematical Formalism: Various strategies with different theoretical guarantees
  • Applications/Examples: ε-greedy, UCB, Thompson sampling, curiosity-driven exploration
  • Key Properties: Essential for discovering optimal policies, balances with exploitation

Exploitation
  • Description: Using current knowledge to maximize immediate reward
  • Mathematical Formalism: π_{greedy}(s) = argmax_a Q(s,a)
  • Applications/Examples: Greedy action selection after learning
  • Key Properties: Must be balanced with exploration to avoid local optima

ε-Greedy
  • Description: Choose a random action with probability ε, the best known action otherwise
  • Mathematical Formalism: π(a|s) = ε/|A| + (1-ε)·𝟙[a = a*], where a* = argmax_{a'} Q(s,a')
  • Applications/Examples: Simple exploration in tabular RL, decaying ε schedules
  • Key Properties: Simple but may explore inefficiently in large state spaces

Upper Confidence Bound (UCB)
  • Description: Choose actions with the highest optimistic estimate
  • Mathematical Formalism: a_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))]
  • Applications/Examples: Multi-armed bandits, MCTS (UCT), confidence-based exploration
  • Key Properties: Theoretically principled, provides exploration guarantees

Thompson Sampling
  • Description: Sample actions according to the probability that they are optimal
  • Mathematical Formalism: Sample θ ~ posterior, choose a = argmax_a Q_θ(s,a)
  • Applications/Examples: Bayesian bandits, Bayesian RL
  • Key Properties: Naturally balances exploration/exploitation via posterior uncertainty
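
The sketch below implements ε-greedy and UCB action selection over a small set of Q-estimates; the Q-values, visit counts, and exploration constants are illustrative, not prescribed.

```python
# Sketch: epsilon-greedy and UCB action selection (pure Python, toy values).
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def ucb(q_values, counts, t, c=2.0):
    """Pick the action with the highest optimistic estimate Q(a) + c*sqrt(ln t / N(a))."""
    def score(a):
        if counts[a] == 0:
            return float("inf")               # untried actions are tried first
        return q_values[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(q_values)), key=score)

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q, epsilon=0.1))
print(ucb(q, counts=[3, 5, 0], t=8))          # returns 2: the untried arm
```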

18.6 Key Algorithms & Methods

Q-Learning
  • Type: Off-policy, model-free
  • Key Innovation: Learns the optimal Q-function without following the optimal policy
  • Mathematical Update: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]
  • Applications: Tabular problems, foundation for DQN

SARSA
  • Type: On-policy, model-free
  • Key Innovation: Updates based on the actual next action taken
  • Mathematical Update: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]
  • Applications: Safe exploration, tabular RL

Policy Gradient
  • Type: On-policy, model-free
  • Key Innovation: Direct policy optimization using gradient ascent
  • Mathematical Update: ∇_θ J(θ) = 𝔼[∇_θ log π_θ(a|s) A^π(s,a)]
  • Applications: Continuous action spaces, stochastic policies

Actor-Critic
  • Type: On-policy, model-free
  • Key Innovation: Combines value estimation with policy optimization
  • Mathematical Update: Actor: updates π_θ along ∇_θ J(θ); Critic: learns V^π or Q^π
  • Applications: Reduces variance in policy gradients

DQN
  • Type: Off-policy, model-free
  • Key Innovation: Neural networks + experience replay + target networks
  • Mathematical Update: Q-learning with neural network function approximation
  • Applications: High-dimensional state spaces, Atari games

PPO
  • Type: On-policy, model-free
  • Key Innovation: Clipped surrogate objective for stable policy updates
  • Mathematical Update: L^{CLIP}(θ) = 𝔼[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)], where r_t(θ) = π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t)
  • Applications: Stable policy optimization, continuous control
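
As a concrete instance of the Q-Learning entry above, the sketch below performs one tabular Q-learning update on a single (s, a, r, s') transition; the state and action names and the hyperparameters are invented for illustration.

```python
# Sketch: one tabular Q-learning update, matching the update rule listed above.
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> estimated action value
alpha, gamma = 0.1, 0.99
actions = ["left", "right"]

def q_learning_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Single observed transition (s, a, r, s'):
q_learning_update("s0", "right", 1.0, "s1")
print(Q[("s0", "right")])       # 0.1 after one update from a zero-initialized table
```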

18.7 Advanced Concepts

Function Approximation
  • Description: Using parameterized functions to represent V, Q, or π
  • Mathematical Formalism: V_θ(s), Q_θ(s,a), π_θ(a|s), where θ are the parameters
  • Applications/Examples: Neural networks, linear functions, tile coding
  • Key Properties: Enables scaling to large/continuous state spaces

Experience Replay
  • Description: Storing and reusing past transitions to improve sample efficiency
  • Mathematical Formalism: Sample (s,a,r,s') from replay buffer D uniformly
  • Applications/Examples: DQN, DDPG, Rainbow improvements
  • Key Properties: Breaks correlation in sequential data, enables off-policy learning

Target Networks
  • Description: Using separate, slowly updated networks for stable learning
  • Mathematical Formalism: θ^- ← τθ + (1-τ)θ^-, where τ << 1
  • Applications/Examples: DQN, DDPG for reducing the moving-target problem
  • Key Properties: Stabilizes learning by reducing correlation between Q-values and targets

Eligibility Traces
  • Description: Credit assignment mechanism that bridges temporal difference and Monte Carlo methods
  • Mathematical Formalism: e_t(s) = γλ e_{t-1}(s) + 1_{S_t=s}
  • Applications/Examples: TD(λ), SARSA(λ) for faster learning
  • Key Properties: Allows faster propagation of rewards to earlier states

Multi-Agent RL
  • Description: Multiple agents learning simultaneously in a shared environment
  • Mathematical Formalism: Nash equilibria, correlated equilibria, social welfare
  • Applications/Examples: Game theory, autonomous vehicle coordination, economics
  • Key Properties: Requires handling the non-stationarity introduced by other learning agents
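
Experience replay and soft target-network updates can be sketched without any deep-learning framework. The buffer below stores transitions in a deque and samples uniformly, and soft_update applies θ^- ← τθ + (1-τ)θ^- to plain lists of parameters; the capacity and τ values are illustrative defaults rather than prescribed settings.

```python
# Sketch: a minimal experience-replay buffer and a soft target-network update.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # old transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

def soft_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, elementwise."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_params, target_params)]

buf = ReplayBuffer()
buf.push("s0", 0, 1.0, "s1", False)
print(buf.sample(1))
print(soft_update(target_params=[0.0, 0.0], online_params=[1.0, 2.0]))
```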

18.8 Fundamental Equations

18.8.1 Bellman Equations

  • State Value: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')]
  • Action Value: Q^π(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ Σ_{a'} π(a'|s')Q^π(s',a')]
  • Optimality (State): V*(s) = max_a Σ_{s',r} p(s',r|s,a)[r + γV*(s')]
  • Optimality (Action): Q*(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ max_{a'} Q*(s',a')]
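
The optimality equations above are the basis of value iteration. The sketch below repeatedly applies the Bellman optimality backup V(s) ← max_a Σ_{s',r} p(s',r|s,a)[r + γV(s')] to a toy two-state MDP; the MDP itself is invented here for illustration.

```python
# Sketch: value iteration using the Bellman optimality backup on a toy MDP.
gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]
# p[(s, a)] -> list of (probability, next_state, reward)
p = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(1.0, "s1", 1.0)],
    ("s1", "a0"): [(1.0, "s1", 0.0)],
    ("s1", "a1"): [(1.0, "s0", 2.0)],
}

V = {s: 0.0 for s in states}
for _ in range(200):   # synchronous backups until (approximate) convergence
    V = {
        s: max(
            sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p[(s, a)])
            for a in actions
        )
        for s in states
    }
print(V)   # approximates V*(s) for this toy MDP
```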

18.8.2 Policy Gradient Theorem

∇_θ J(θ) = ∇_θ V^{π_θ}(s_0) = Σ_s μ^{π_θ}(s) Σ_a ∇_θ π_θ(a|s) Q^{π_θ}(s,a)

where μ^{π_θ}(s) is the stationary distribution under policy π_θ.
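
For a softmax policy over linear scores, the score function ∇_θ log π_θ(a|s) has a simple closed form, and it is exactly the term the theorem weights by Q^{π_θ}(s,a). The sketch below computes a one-sample gradient estimate; NumPy, the feature vector, and the stand-in Q value are assumptions made for illustration.

```python
# Sketch: one-sample policy-gradient estimate for a linear-softmax policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_actions, n_features = 3, 4
theta = np.zeros((n_actions, n_features))      # policy parameters
phi = np.array([1.0, 0.5, -0.2, 0.3])          # feature vector for state s (illustrative)

probs = softmax(theta @ phi)                   # pi_theta(.|s)
a = 1                                          # sampled action

# For softmax over linear scores, grad_theta log pi(a|s) has row a' equal to
# ( [a' == a] - pi(a'|s) ) * phi.
grad_log_pi = (np.eye(n_actions)[a] - probs)[:, None] * phi[None, :]

q_estimate = 2.0                               # stand-in for Q^pi(s, a)
grad_J_sample = q_estimate * grad_log_pi       # one-sample estimate of the gradient
print(grad_J_sample.shape)                     # (3, 4), same shape as theta
```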

18.8.3 Temporal Difference Error

δ_t = r_{t+1} + γV(s_{t+1}) - V(s_t)

This error drives learning in most RL algorithms: it measures the gap between the bootstrapped target r_{t+1} + γV(s_{t+1}) and the current estimate V(s_t), i.e., how much better or worse the observed transition was than expected.
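
A single TD(0) update makes the role of δ_t explicit; the states, reward, and step size below are illustrative.

```python
# Sketch: one TD(0) update driven by the TD error delta_t defined above.
V = {"s0": 0.0, "s1": 0.5}
alpha, gamma = 0.1, 0.99

def td0_update(s, r, s_next):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: surprise relative to the estimate
    V[s] += alpha * delta
    return delta

print(td0_update("s0", r=1.0, s_next="s1"))  # delta = 1.0 + 0.99*0.5 - 0.0 = 1.495
print(V["s0"])                               # 0.1495 after the update
```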

18.9 Common Challenges & Solutions

Sample Efficiency
  • Problem: RL often requires many environment interactions
  • Solutions: Model-based methods, experience replay, transfer learning
  • Examples: Dyna-Q, DQN replay buffer, pre-training

Exploration
  • Problem: Discovering good policies in large state spaces
  • Solutions: Curiosity-driven exploration, count-based bonuses, UCB
  • Examples: ICM, pseudo-counts, optimism under uncertainty

Stability
  • Problem: Function approximation can cause instability
  • Solutions: Target networks, experience replay, regularization
  • Examples: DQN improvements, PPO clipping

Sparse Rewards
  • Problem: Learning with infrequent feedback signals
  • Solutions: Reward shaping, hierarchical RL, curriculum learning
  • Examples: HER, Options framework, progressive tasks

Partial Observability
  • Problem: Agent cannot fully observe environment state
  • Solutions: Recurrent policies, belief state estimation, memory
  • Examples: LSTM policies, particle filters, attention mechanisms
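
As one example of the reward-shaping remedy listed under Sparse Rewards, the sketch below applies potential-based shaping, F(s, s') = γΦ(s') - Φ(s), with a distance-based potential; the grid-world goal and potential function are invented for illustration.

```python
# Sketch: potential-based reward shaping with a distance-based potential
# (toy grid-world goal chosen for illustration).
gamma = 0.99
goal = (5, 5)

def potential(state):
    """Negative Manhattan distance to the goal: closer states get higher potential."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(r_env, s, s_next):
    """Add the shaping term F = gamma*Phi(s') - Phi(s); potential-based shaping
    densifies feedback without changing the optimal policy."""
    return r_env + gamma * potential(s_next) - potential(s)

# A sparse environment reward (0 away from the goal) becomes informative:
print(shaped_reward(0.0, s=(0, 0), s_next=(1, 0)))  # > 0: the agent moved closer
```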