Chapter 1 Understanding Reinforcement Learning: From Bandits to Policy Optimization

1.1 Introduction to Reinforcement Learning

Reinforcement Learning (RL) is a subfield of artificial intelligence concerned with how agents ought to take actions in an environment to maximize cumulative reward. Inspired by behavioral psychology and decision theory, it is a learning paradigm in which feedback from the environment is neither a supervised label (as in classification tasks) nor entirely absent (as in unsupervised learning), but arrives in the form of scalar rewards. The foundational concepts of RL can be understood by progressing from simple problems, such as the multi-armed bandit, to more sophisticated frameworks like Markov Decision Processes (MDPs) and their many solution methods.

1.2 The Multi-Armed Bandit: The Simplest Case

The journey begins with the multi-armed bandit problem, a classic formulation that captures the essence of the exploration-exploitation dilemma. In this setting, an agent must choose among several actions (arms), each yielding stochastic rewards from an unknown distribution. There are no state transitions or temporal dependencies—just the immediate outcome of the chosen action. The objective is to maximize the expected reward over time.

Despite its simplicity, the bandit framework introduces crucial ideas such as reward estimation, uncertainty, and exploration strategies. ε-greedy methods inject random exploration, Upper Confidence Bound (UCB) techniques add an optimism bonus that grows with an action's uncertainty, and Thompson Sampling applies Bayesian reasoning to balance exploration and exploitation. Though limited in scope, these strategies establish foundational principles that generalize to more complex environments. In bandits, action selection strategies serve a role similar to policies in full RL problems, but without dependence on state transitions.
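As a concrete illustration, the following is a minimal sketch of an ε-greedy bandit agent. The Gaussian reward model, the step count, and ε = 0.1 are assumptions made for the example, not prescribed by the text.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy agent for a stochastic multi-armed bandit.

    Assumes Gaussian rewards with unit variance around each arm's true mean
    (an illustrative choice); estimates each arm's value incrementally.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q_est = np.zeros(k)      # running estimate of each arm's mean reward
    counts = np.zeros(k)     # number of pulls per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:          # explore: pick a random arm
            a = int(rng.integers(k))
        else:                               # exploit: pick the current best estimate
            a = int(np.argmax(q_est))
        r = rng.normal(true_means[a], 1.0)  # sample a stochastic reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean update
        total_reward += r
    return q_est, total_reward

estimates, reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, reward)
```

UCB and Thompson Sampling differ only in the action-selection line: UCB adds an exploration bonus to each estimate, while Thompson Sampling samples from a posterior over each arm's mean.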

1.3 Transition to Markov Decision Processes

The limitations of bandit models become evident when we consider sequential decision-making problems where actions influence future states and rewards. This leads to the formalism of Markov Decision Processes (MDPs), which model environments through states \(S\), actions \(A\), transition probabilities \(P(s'|s, a)\), rewards \(R(s, a)\), and a discount factor \(\gamma \in [0, 1]\). The Markov property assumes that the future is independent of the past given the present state, simplifying the dynamics and enabling tractable analysis.

The agent’s goal is to learn an optimal policy \(\pi^*(s)\) that maximizes the expected return. The return from time \(t\) is the discounted sum of future rewards,

\(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},\)

and the objective is to maximize \(\mathbb{E}_\pi\left[ G_t \right]\).

This process relies on value functions such as:

\(V^\pi(s) = \mathbb{E}_\pi \left[ G_t | S_t = s \right], \quad Q^\pi(s,a) = \mathbb{E}_\pi \left[ G_t | S_t = s, A_t = a \right]\)

Solving MDPs involves either learning these functions or directly learning \(\pi\).
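Before turning to solution methods, it helps to ground the notation with a concrete computation of the return. The short sketch below evaluates \(G_t\) for a finite reward sequence; the rewards and the value of \(\gamma\) are arbitrary illustrative choices.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Illustrative values: rewards observed after time t in a short episode.
# G_t = 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))
```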

1.4 Comparing Reinforcement Learning Methods

To frame this spectrum of approaches clearly, the table below summarizes the main RL method families, indicating whether each relies on an explicit model of the environment, whether it learns a policy, a value function, or both, and how the families compare in convergence behavior and sample efficiency.

| Method Type | Uses Model? | Learns Policy? | Learns Value? | Sample Efficient? | Converges to Optimal? | Suitable For | Example Algorithms |
|---|---|---|---|---|---|---|---|
| Multi-Armed Bandits | No | Action rule | No | Yes | Yes (in expectation) | Stateless problems | ε-Greedy, UCB, Thompson Sampling |
| Dynamic Programming | Yes | Yes | Yes | No | Yes | Known model | Value Iteration, Policy Iteration |
| Monte Carlo | No | Yes | Yes | No | Yes (episodic) | Episodic tasks | MC Prediction, MC Control |
| TD Learning | No | Yes | Yes | Moderate | Yes | Ongoing tasks | SARSA, Q-Learning |
| Q-Learning + Function Approx. | No | Yes | Yes | Moderate | Not always (non-linear) | High-dimensional spaces | DQN |
| Policy Gradient | No | Yes | Maybe | Low | Local optimum | Continuous action spaces | REINFORCE, PPO, A2C |
| Actor-Critic | No | Yes | Yes | Moderate | Local optimum | Most RL settings | A2C, DDPG, SAC |
| Model-Based RL | Yes (learned) | Yes | Yes | High | Not guaranteed | Data-limited tasks | Dreamer, MBPO, MuZero |

This classification illustrates the diversity of RL approaches and underscores the flexibility of the RL paradigm. Some methods assume access to a perfect model, while others learn entirely from data. Some directly optimize policies, while others estimate values to guide policy improvement.

1.4.1 Dynamic Programming: Model-Based Learning

Dynamic Programming (DP) methods such as Value Iteration and Policy Iteration assume complete knowledge of the environment’s dynamics. These algorithms exploit the Bellman equations to iteratively compute optimal value functions and policies. The Bellman optimality equation is given by:

\(V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a) + \gamma V^*(s')]\)

Value Iteration applies this update directly, while Policy Iteration alternates between evaluating the current policy and improving it by acting greedily with respect to the value function.
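The following is a minimal tabular Value Iteration sketch, assuming the transition tensor \(P(s'|s,a)\) and reward table \(R(s,a)\) are known; the tiny two-state MDP at the end is purely illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration for a known MDP.

    P: array of shape (S, A, S) with transition probabilities P(s'|s,a).
    R: array of shape (S, A) with expected immediate rewards R(s,a).
    Returns the optimal state values and a greedy deterministic policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Illustrative two-state, two-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

Policy Iteration reorganizes the same computation: it repeatedly evaluates a fixed policy to convergence and then acts greedily with respect to the resulting values.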

Although DP methods guarantee convergence to the optimal policy, they are rarely applicable to real-world problems due to their assumption of known transitions and the computational infeasibility of operating over large or continuous state spaces.

1.4.2 Model-Free Approaches: Monte Carlo and TD Learning

When the model is unknown, we turn to model-free methods that learn directly from sampled experience. Monte Carlo methods estimate the value of policies by averaging the total return over multiple episodes. These methods are simple and intuitive, suitable for episodic environments, but suffer from high variance and are not efficient in online learning scenarios.

Temporal Difference (TD) learning bridges the gap between Monte Carlo and DP by updating value estimates based on partial returns. Algorithms like SARSA and Q-learning fall into this category. SARSA is on-policy, updating values based on the actual trajectory taken, while Q-learning is off-policy, learning about the greedy policy regardless of the agent’s current behavior. These methods do not require waiting until the end of an episode and are thus applicable in ongoing tasks. They offer a tradeoff between bias and variance in value estimation.
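The on-policy/off-policy distinction comes down to which next-step value appears in the TD target. The sketch below shows only the two tabular update rules; the table shape, step size, and omission of terminal-state handling are simplifying assumptions for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the TD target uses the action the agent actually takes next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the TD target uses the greedy action in the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Illustrative usage on a 5-state, 2-action table.
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
sarsa_update(Q, s=3, a=0, r=0.0, s_next=4, a_next=1)
print(Q)
```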

1.4.3 Q-Learning and Function Approximation

Q-learning is one of the most widely used RL algorithms due to its simplicity and theoretical guarantees. However, when dealing with large state-action spaces, tabular Q-learning becomes infeasible. Function approximation, particularly using neural networks, allows Q-learning to scale to high-dimensional problems. This gave rise to Deep Q-Networks (DQNs), where a neural network is trained to approximate the Q-function.

DQNs introduced mechanisms like experience replay—storing and reusing past interactions to reduce correlation between updates—and target networks—fixed Q-value targets updated slowly to stabilize learning. These enhancements enabled RL to tackle complex environments like Atari games directly from pixels. Nonetheless, deep RL methods often suffer from sample inefficiency and training instability, especially when generalizing to new environments.
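To make these two stabilizers concrete, here is a condensed sketch in PyTorch of a replay buffer and a single DQN update against a target network. The network architecture, buffer capacity, and hyperparameters are illustrative assumptions, and environment interaction is omitted entirely.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores past transitions so updates can be decorrelated by random sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.tensor(s, dtype=torch.float32),
                torch.tensor(a, dtype=torch.int64),
                torch.tensor(r, dtype=torch.float32),
                torch.tensor(s_next, dtype=torch.float32),
                torch.tensor(done, dtype=torch.float32))

def make_q_net(obs_dim, n_actions):
    # A small fully connected Q-network; size chosen arbitrarily for the sketch.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step toward the target r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is held fixed during the update
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically, the target network is synchronized:
# target_net.load_state_dict(q_net.state_dict())
```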

1.4.4 Policy Gradient and Actor-Critic Methods

While value-based methods derive policies from value functions, policy-based methods directly parameterize and optimize the policy itself. Policy Gradient methods compute the gradient of expected return with respect to policy parameters and perform gradient ascent. The REINFORCE algorithm is the archetype of this approach, but it often suffers from high variance in gradient estimates. The Policy Gradient Theorem provides the theoretical foundation:

\(\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s,a) \right]\)

To address variance, Actor-Critic methods introduce a second component: the critic, which estimates value functions to inform and stabilize the updates of the actor (policy). Algorithms like Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC) build on this architecture, each adding unique elements to improve performance and stability.
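The sketch below applies this gradient for a tabular softmax policy after one episode, optionally subtracting a state-value baseline as an actor-critic would. The tabular logit parameterization and the learning rate are assumptions made for brevity, not a prescription from the text.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode_update(theta, episode, gamma=0.99, lr=0.01, baseline=None):
    """One REINFORCE update for a tabular softmax policy.

    theta: (n_states, n_actions) table of action preferences (logits).
    episode: list of (state, action, reward) tuples from one rollout.
    baseline: optional (n_states,) array of state-value estimates (the 'critic');
              subtracting it reduces variance without biasing the gradient.
    """
    G = 0.0
    # Iterate backwards so the return G_t accumulates correctly at each step.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        advantage = G - (baseline[s] if baseline is not None else 0.0)
        probs = softmax(theta[s])
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0              # d log pi(a|s) / d theta[s, :]
        theta[s] += lr * advantage * grad_log_pi
    return theta

# Illustrative usage: 3 states, 2 actions, one hand-made episode.
theta = np.zeros((3, 2))
episode = [(0, 1, 0.0), (1, 0, 1.0), (2, 1, 2.0)]
theta = reinforce_episode_update(theta, episode)
print(theta)
```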

1.4.5 Advanced Policy Optimization Techniques

More recent advances in policy optimization have focused on improving training stability and sample efficiency. Trust Region Policy Optimization (TRPO) constrains policy updates to stay within a trust region defined by the KL divergence, ensuring small, safe steps in parameter space. Proximal Policy Optimization (PPO) simplifies TRPO with a clipped objective function, striking a balance between ease of implementation and empirical performance.
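The heart of PPO is the clipped surrogate objective. The sketch below evaluates it for a batch of probability ratios and advantage estimates; ε = 0.2 is a commonly used value but is assumed here rather than taken from the text.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-clip surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled (s, a) pair.
    advantage: the corresponding advantage estimates.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Illustrative values: once the ratio drifts outside [1-eps, 1+eps],
# a positive advantage earns no additional credit, keeping updates small.
ratio = np.array([0.8, 1.0, 1.5, 2.0])
advantage = np.array([1.0, -0.5, 2.0, 1.0])
print(ppo_clipped_objective(ratio, advantage))
```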

Soft Actor-Critic (SAC), on the other hand, incorporates an entropy maximization objective, encouraging exploration by maintaining a degree of randomness in the policy. This leads to better performance in environments with sparse or deceptive rewards and is particularly effective in continuous control tasks.

Model-based approaches, such as MuZero or Dreamer, offer a complementary strategy: by learning a model of the environment dynamics and reward structure, they can generate synthetic experiences to improve sample efficiency. However, model inaccuracies can lead to cascading errors and suboptimal policies.
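To show how synthetic experience is used, here is a minimal planning step in the spirit of tabular Dyna-Q: value updates are replayed from a learned model rather than from the real environment. This is far simpler than the latent-state models of Dreamer, MBPO, or MuZero; the deterministic dictionary model and all numbers are illustrative assumptions.

```python
import random
import numpy as np

def dyna_planning_updates(Q, model, n_updates=10, alpha=0.1, gamma=0.99):
    """Dyna-style planning: apply Q-learning updates to model-generated transitions.

    model: dict mapping (state, action) -> (reward, next_state), a deterministic
           tabular model fit from real experience.
    """
    for _ in range(n_updates):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Illustrative usage with a hand-specified "learned" model.
Q = np.zeros((3, 2))
model = {(0, 1): (1.0, 2), (2, 0): (0.0, 1)}
Q = dyna_planning_updates(Q, model)
print(Q)
```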

1.5 Conclusion and Further Directions

Reinforcement Learning has evolved into a mature and diverse field, offering a rich set of tools for decision-making under uncertainty. From simple bandit strategies to deep policy optimization and model-based reasoning, RL provides a versatile framework for learning from interaction. However, key challenges remain: training instability, sample inefficiency, reward mis-specification, and generalization across tasks. These remain active areas of research.

A solid understanding of the main categories—such as those outlined in the comparative table—is essential for navigating the RL landscape. Whether one is interested in theoretical foundations, algorithm development, or practical deployment, the key ideas of exploration, value estimation, policy optimization, and model learning form the backbone of modern RL. For further reading, Reinforcement Learning: An Introduction by Sutton and Barto remains the canonical text. Online resources like OpenAI’s Spinning Up, DeepMind’s technical blog, and repositories of papers on arXiv are excellent for staying current with the latest advancements. As artificial agents continue to tackle more complex and dynamic environments, reinforcement learning stands at the forefront of AI research and application.

1.6 References

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html

  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (pp. 1889–1897). https://proceedings.mlr.press/v37/schulman15.html

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning. https://arxiv.org/abs/1801.01290

  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270

  • Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404

  • OpenAI. (2018). Spinning Up in Deep RL. https://spinningup.openai.com

  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. https://arxiv.org/abs/1509.02971

  • Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256. https://doi.org/10.1023/A:1013689704352

  • Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294. https://doi.org/10.2307/2332286