Chapter 1 Understanding Reinforcement Learning: From Bandits to Policy Optimization

1.1 Introduction to Reinforcement Learning

Reinforcement Learning (RL) is a subfield of artificial intelligence concerned with how agents ought to take actions in an environment to maximize cumulative reward. Inspired by behavioral psychology and decision theory, it is a learning paradigm in which feedback from the environment is neither a supervised label (as in classification tasks) nor entirely absent (as in unsupervised learning), but arrives in the form of scalar rewards. The foundational concepts of RL can be understood by progressing from simple problems, such as the multi-armed bandit, to more sophisticated frameworks like Markov Decision Processes (MDPs) and their many solution methods.

1.2 The Multi-Armed Bandit: The Simplest Case

The journey begins with the multi-armed bandit problem, a classic formulation that captures the essence of the exploration-exploitation dilemma. In this setting, an agent must choose among several actions (arms), each yielding stochastic rewards from an unknown distribution. There are no state transitions or temporal dependencies—just the immediate outcome of the chosen action. The objective is to maximize the expected reward over time.

Despite its simplicity, the bandit framework introduces crucial ideas such as reward estimation, uncertainty, and exploration strategies. ε-greedy methods inject random exploration, Upper Confidence Bound (UCB) techniques add an optimism bonus that grows with an action's uncertainty, and Thompson Sampling applies Bayesian reasoning to balance exploration and exploitation. Though limited in scope, these strategies establish foundational principles that generalize to more complex environments. In bandits, action selection strategies serve a role similar to policies in full RL problems, but without dependence on state transitions.
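As a concrete illustration, the following is a minimal sketch of an ε-greedy bandit agent. The Gaussian reward model, the step count, and ε = 0.1 are assumptions made for the example, not prescribed by the text.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy agent for a stochastic multi-armed bandit.

    Assumes Gaussian rewards with unit variance around each arm's true mean
    (an illustrative choice); estimates each arm's value incrementally.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q_est = np.zeros(k)      # running estimate of each arm's mean reward
    counts = np.zeros(k)     # number of pulls per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:          # explore: pick a random arm
            a = int(rng.integers(k))
        else:                               # exploit: pick the current best estimate
            a = int(np.argmax(q_est))
        r = rng.normal(true_means[a], 1.0)  # sample a stochastic reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean update
        total_reward += r
    return q_est, total_reward

estimates, reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, reward)
```

UCB and Thompson Sampling differ only in the action-selection line: UCB adds an exploration bonus to each estimate, while Thompson Sampling samples from a posterior over each arm's mean.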

1.3 Transition to Markov Decision Processes

The limitations of bandit models become evident when we consider sequential decision-making problems where actions influence future states and rewards. This leads to the formalism of Markov Decision Processes (MDPs), which model environments through states \(S\), actions \(A\), transition probabilities \(P(s'|s, a)\), rewards \(R(s, a)\), and a discount factor \(\gamma \in [0, 1]\). The Markov property assumes that the future is independent of the past given the present state, simplifying the dynamics and enabling tractable analysis.

The agent’s goal is to learn an optimal policy \(\pi^*(s)\) that maximizes the expected return. The return from time \(t\) is the discounted sum of future rewards,

\(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},\)

and the objective is to maximize \(\mathbb{E}_\pi\left[ G_t \right]\).

This process relies on value functions such as:

\(V^\pi(s) = \mathbb{E}_\pi \left[ G_t | S_t = s \right], \quad Q^\pi(s,a) = \mathbb{E}_\pi \left[ G_t | S_t = s, A_t = a \right]\)

Solving MDPs involves either learning these functions or directly learning \(\pi\).
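Before turning to solution methods, it helps to ground the notation with a concrete computation of the return. The short sketch below evaluates \(G_t\) for a finite reward sequence; the rewards and the value of \(\gamma\) are arbitrary illustrative choices.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Illustrative values: rewards observed after time t in a short episode.
# G_t = 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))
```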

1.4 Comparing Reinforcement Learning Methods

To frame this spectrum of approaches clearly, the table below summarizes the main RL method families, indicating whether each relies on an explicit model of the environment, whether it learns a policy, a value function, or both, and how the families compare in convergence behavior and sample efficiency.

| Method Type | Uses Model? | Learns Policy? | Learns Value? | Sample Efficient? | Converges to Optimal? | Suitable For | Example Algorithms |
|---|---|---|---|---|---|---|---|
| Multi-Armed Bandits | No | Action rule | No | Yes | Yes (in expectation) | Stateless problems | ε-Greedy, UCB, Thompson Sampling |
| Dynamic Programming | Yes | Yes | Yes | No | Yes | Known model | Value Iteration, Policy Iteration |
| Monte Carlo | No | Yes | Yes | No | Yes (episodic) | Episodic tasks | MC Prediction, MC Control |
| TD Learning | No | Yes | Yes | Moderate | Yes | Ongoing tasks | SARSA, Q-Learning |
| Q-Learning + Function Approx. | No | Yes | Yes | Moderate | Not always (non-linear) | High-dimensional spaces | DQN |
| Policy Gradient | No | Yes | Maybe | Low | Local optimum | Continuous action spaces | REINFORCE, PPO, A2C |
| Actor-Critic | No | Yes | Yes | Moderate | Local optimum | Most RL settings | A2C, DDPG, SAC |
| Model-Based RL | Yes (learned) | Yes | Yes | High | Not guaranteed | Data-limited tasks | Dreamer, MBPO, MuZero |

This classification illustrates the diversity of RL approaches and underscores the flexibility of the RL paradigm. Some methods assume access to a perfect model, while others learn entirely from data. Some directly optimize policies, while others estimate values to guide policy improvement.

1.4.1 Dynamic Programming: Model-Based Learning

Dynamic Programming (DP) methods such as Value Iteration and Policy Iteration assume complete knowledge of the environment’s dynamics. These algorithms exploit the Bellman equations to iteratively compute optimal value functions and policies. The Bellman optimality equation is given by:

\(V^*(s) = \max_a \sum_{s'} P(s'|s,a) [R(s,a) + \gamma V^*(s')]\)

Value Iteration applies this update directly, while Policy Iteration alternates between evaluating the current policy and improving it by acting greedily with respect to the value function.
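The following is a minimal tabular Value Iteration sketch, assuming the transition tensor \(P(s'|s,a)\) and reward table \(R(s,a)\) are known; the tiny two-state MDP at the end is purely illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration for a known MDP.

    P: array of shape (S, A, S) with transition probabilities P(s'|s,a).
    R: array of shape (S, A) with expected immediate rewards R(s,a).
    Returns the optimal state values and a greedy deterministic policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Illustrative two-state, two-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

Policy Iteration reorganizes the same computation: it repeatedly evaluates a fixed policy to convergence and then acts greedily with respect to the resulting values.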

Although DP methods guarantee convergence to the optimal policy, they are rarely applicable to real-world problems due to their assumption of known transitions and the computational infeasibility of operating over large or continuous state spaces.

1.4.2 Model-Free Approaches: Monte Carlo and TD Learning

When the model is unknown, we turn to model-free methods that learn directly from sampled experience. Monte Carlo methods estimate the value of policies by averaging the total return over multiple episodes. These methods are simple and intuitive, suitable for episodic environments, but suffer from high variance and are not efficient in online learning scenarios.

Temporal Difference (TD) learning bridges the gap between Monte Carlo and DP by updating value estimates based on partial returns. Algorithms like SARSA and Q-learning fall into this category. SARSA is on-policy, updating values based on the actual trajectory taken, while Q-learning is off-policy, learning about the greedy policy regardless of the agent’s current behavior. These methods do not require waiting until the end of an episode and are thus applicable in ongoing tasks. They offer a tradeoff between bias and variance in value estimation.
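The on-policy/off-policy distinction comes down to which next-step value appears in the TD target. The sketch below shows only the two tabular update rules; the table shape, step size, and omission of terminal-state handling are simplifying assumptions for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the TD target uses the action the agent actually takes next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the TD target uses the greedy action in the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Illustrative usage on a 5-state, 2-action table.
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
sarsa_update(Q, s=3, a=0, r=0.0, s_next=4, a_next=1)
print(Q)
```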

1.4.3 Q-Learning and Function Approximation

Q-learning is one of the most widely used RL algorithms due to its simplicity and theoretical guarantees. However, when dealing with large state-action spaces, tabular Q-learning becomes infeasible. Function approximation, particularly using neural networks, allows Q-learning to scale to high-dimensional problems. This gave rise to Deep Q-Networks (DQNs), where a neural network is trained to approximate the Q-function.

DQNs introduced mechanisms like experience replay—storing and reusing past interactions to reduce correlation between updates—and target networks—fixed Q-value targets updated slowly to stabilize learning. These enhancements enabled RL to tackle complex environments like Atari games directly from pixels. Nonetheless, deep RL methods often suffer from sample inefficiency and training instability, especially when generalizing to new environments.
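To make these two stabilizers concrete, here is a condensed sketch in PyTorch of a replay buffer and a single DQN update against a target network. The network architecture, buffer capacity, and hyperparameters are illustrative assumptions, and environment interaction is omitted entirely.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores past transitions so updates can be decorrelated by random sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.tensor(s, dtype=torch.float32),
                torch.tensor(a, dtype=torch.int64),
                torch.tensor(r, dtype=torch.float32),
                torch.tensor(s_next, dtype=torch.float32),
                torch.tensor(done, dtype=torch.float32))

def make_q_net(obs_dim, n_actions):
    # A small fully connected Q-network; size chosen arbitrarily for the sketch.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step toward the target r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is held fixed during the update
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically, the target network is synchronized:
# target_net.load_state_dict(q_net.state_dict())
```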

1.4.4 Policy Gradient and Actor-Critic Methods

While value-based methods derive policies from value functions, policy-based methods directly parameterize and optimize the policy itself. Policy Gradient methods compute the gradient of expected return with respect to policy parameters and perform gradient ascent. The REINFORCE algorithm is the archetype of this approach, but it often suffers from high variance in gradient estimates. The Policy Gradient Theorem provides the theoretical foundation:

\(\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s,a) \right]\)

To address variance, Actor-Critic methods introduce a second component: the critic, which estimates value functions to inform and stabilize the updates of the actor (policy). Algorithms like Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC) build on this architecture, each adding unique elements to improve performance and stability.
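The sketch below applies this gradient for a tabular softmax policy after one episode, optionally subtracting a state-value baseline as an actor-critic would. The tabular logit parameterization and the learning rate are assumptions made for brevity, not a prescription from the text.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode_update(theta, episode, gamma=0.99, lr=0.01, baseline=None):
    """One REINFORCE update for a tabular softmax policy.

    theta: (n_states, n_actions) table of action preferences (logits).
    episode: list of (state, action, reward) tuples from one rollout.
    baseline: optional (n_states,) array of state-value estimates (the 'critic');
              subtracting it reduces variance without biasing the gradient.
    """
    G = 0.0
    # Iterate backwards so the return G_t accumulates correctly at each step.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        advantage = G - (baseline[s] if baseline is not None else 0.0)
        probs = softmax(theta[s])
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0              # d log pi(a|s) / d theta[s, :]
        theta[s] += lr * advantage * grad_log_pi
    return theta

# Illustrative usage: 3 states, 2 actions, one hand-made episode.
theta = np.zeros((3, 2))
episode = [(0, 1, 0.0), (1, 0, 1.0), (2, 1, 2.0)]
theta = reinforce_episode_update(theta, episode)
print(theta)
```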

1.4.5 Advanced Policy Optimization Techniques

More recent advances in policy optimization have focused on improving training stability and sample efficiency. Trust Region Policy Optimization (TRPO) constrains policy updates to stay within a trust region defined by the KL divergence, ensuring small, safe steps in parameter space. Proximal Policy Optimization (PPO) simplifies TRPO with a clipped objective function, striking a balance between ease of implementation and empirical performance.
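The heart of PPO is the clipped surrogate objective. The sketch below evaluates it for a batch of probability ratios and advantage estimates; ε = 0.2 is a commonly used value but is assumed here rather than taken from the text.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-clip surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled (s, a) pair.
    advantage: the corresponding advantage estimates.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Illustrative values: once the ratio drifts outside [1-eps, 1+eps],
# a positive advantage earns no additional credit, keeping updates small.
ratio = np.array([0.8, 1.0, 1.5, 2.0])
advantage = np.array([1.0, -0.5, 2.0, 1.0])
print(ppo_clipped_objective(ratio, advantage))
```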

Soft Actor-Critic (SAC), on the other hand, incorporates an entropy maximization objective, encouraging exploration by maintaining a degree of randomness in the policy. This leads to better performance in environments with sparse or deceptive rewards and is particularly effective in continuous control tasks.

Model-based approaches, such as MuZero or Dreamer, offer a complementary strategy: by learning a model of the environment dynamics and reward structure, they can generate synthetic experiences to improve sample efficiency. However, model inaccuracies can lead to cascading errors and suboptimal policies.
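To show how synthetic experience is used, here is a minimal planning step in the spirit of tabular Dyna-Q: value updates are replayed from a learned model rather than from the real environment. This is far simpler than the latent-state models of Dreamer, MBPO, or MuZero; the deterministic dictionary model and all numbers are illustrative assumptions.

```python
import random
import numpy as np

def dyna_planning_updates(Q, model, n_updates=10, alpha=0.1, gamma=0.99):
    """Dyna-style planning: apply Q-learning updates to model-generated transitions.

    model: dict mapping (state, action) -> (reward, next_state), a deterministic
           tabular model fit from real experience.
    """
    for _ in range(n_updates):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Illustrative usage with a hand-specified "learned" model.
Q = np.zeros((3, 2))
model = {(0, 1): (1.0, 2), (2, 0): (0.0, 1)}
Q = dyna_planning_updates(Q, model)
print(Q)
```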

1.5 Conclusion and Further Directions

Reinforcement Learning has evolved into a mature and diverse field, offering a rich set of tools for decision-making under uncertainty. From simple bandit strategies to deep policy optimization and model-based reasoning, RL provides a versatile framework for learning from interaction. However, key challenges remain: training instability, sample inefficiency, reward mis-specification, and generalization across tasks. These remain active areas of research.

A solid understanding of the main categories—such as those outlined in the comparative table—is essential for navigating the RL landscape. Whether one is interested in theoretical foundations, algorithm development, or practical deployment, the key ideas of exploration, value estimation, policy optimization, and model learning form the backbone of modern RL. For further reading, Reinforcement Learning: An Introduction by Sutton and Barto remains the canonical text. Online resources like OpenAI’s Spinning Up, DeepMind’s technical blog, and repositories of papers on arXiv are excellent for staying current with the latest advancements. As artificial agents continue to tackle more complex and dynamic environments, reinforcement learning stands at the forefront of AI research and application.

1.6 References

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html

  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (pp. 1889–1897). https://proceedings.mlr.press/v37/schulman15.html

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning. https://arxiv.org/abs/1801.01290

  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270

  • Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404

  • OpenAI. (2018). Spinning Up in Deep RL. https://spinningup.openai.com

  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. https://arxiv.org/abs/1509.02971

  • Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256. https://doi.org/10.1023/A:1013689704352

  • Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294. https://doi.org/10.2307/2332286