Chapter 3 Markov Decision Processes and Dynamic Programming
3.1 Introduction
Markov Decision Processes (MDPs) provide a mathematical framework for modeling sequential decision-making under uncertainty. Applied across operations research, control theory, economics, and artificial intelligence, MDPs capture environments where outcomes depend on both stochastic transitions and agent actions. This chapter explores MDPs from a computational perspective, emphasizing Dynamic Programming (DP) methods—particularly Value Iteration—for solving them when the model is fully known. We define the mathematical components of MDPs, implement them in R, and demonstrate policy derivation using value iteration.
An MDP is formally defined by the tuple \((S, A, P, R, \gamma)\), where:
- \(S\): a finite set of states
- \(A\): a finite set of actions
- \(P(s'|s, a)\): the transition probability of moving to state \(s'\) given current state \(s\) and action \(a\)
- \(R(s, a, s')\): the reward received after transitioning from \(s\) to \(s'\) via action \(a\)
- \(\gamma \in [0,1]\): the discount factor, representing preference for immediate versus delayed rewards
The agent’s objective is to learn a policy \(\pi: S \to A\) that maximizes expected cumulative discounted reward:
\[ V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t, S_{t+1}) \,\bigg|\, S_0 = s \right] \]
The optimal value function \(V^*\) satisfies the Bellman optimality equation:
\[ V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right] \]
Once \(V^*\) is computed, the optimal policy \(\pi^*\) follows:
\[ \pi^*(s) = \arg\max_{a \in A} \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right] \]
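The policy-extraction step above can be sketched directly in R. This is a minimal illustration, not the chapter's implementation: `greedy_policy` is a hypothetical helper assuming the transition and reward models are stored as \(S \times A \times S\) arrays, as they will be in the next section.

```r
# Sketch: extract the greedy policy from a known value function V.
# P and R are S x A x S arrays; V is a length-S numeric vector.
greedy_policy <- function(P, R, V, gamma) {
  S <- dim(P)[1]
  A <- dim(P)[2]
  policy <- integer(S)
  for (s in 1:S) {
    q <- numeric(A)
    for (a in 1:A) {
      # Expected one-step reward plus discounted next-state value
      q[a] <- sum(P[s, a, ] * (R[s, a, ] + gamma * V))
    }
    policy[s] <- which.max(q)
  }
  policy
}
```

Note that the inner summation over \(s'\) is vectorized with `sum(...)`, which is idiomatic R and avoids an explicit third loop.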
3.2 Constructing the MDP in R
We implement an environment with 10 states and 2 actions per state, with stochastic transitions and rewards. The final state is absorbing.
n_states <- 10
n_actions <- 2
gamma <- 0.9
terminal_state <- n_states
set.seed(42)
transition_model <- array(0, dim = c(n_states, n_actions, n_states))
reward_model <- array(0, dim = c(n_states, n_actions, n_states))
for (s in 1:(n_states - 1)) {
  # Action 1: 90% chance of moving forward, 10% to a random other state
  transition_model[s, 1, s + 1] <- 0.9
  random_state <- sample(setdiff(1:n_states, s + 1), 1)
  transition_model[s, 1, random_state] <- 0.1

  # Action 2: probability split between two random states
  random_states <- sample(1:n_states, 2, replace = FALSE)
  transition_model[s, 2, random_states[1]] <- 0.8
  transition_model[s, 2, random_states[2]] <- 0.2

  # Rewards: large payoff on reaching the terminal state, small noise otherwise
  for (s_prime in 1:n_states) {
    reward_model[s, 1, s_prime] <- ifelse(s_prime == n_states, 1.0, 0.1 * runif(1))
    reward_model[s, 2, s_prime] <- ifelse(s_prime == n_states, 0.5, 0.05 * runif(1))
  }
}
# Terminal state: absorbing with no rewards
transition_model[n_states, , n_states] <- 1
reward_model[n_states, , ] <- 0
Explanation: This defines the MDP with 10 states and 2 actions. Transition probabilities are stored in a 3D array indexed by current state, action, and next state. Action 1 strongly favors moving forward (90% to next state), while Action 2 distributes probability between random states. Rewards are stochastic, with higher values (1.0 for Action 1, 0.5 for Action 2) upon reaching the terminal state. The terminal state transitions to itself with probability 1 and yields no further rewards.
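Since each slice `transition_model[s, a, ]` must be a probability distribution, a quick sanity check (illustrative, assuming `transition_model` as constructed above) can confirm the construction is valid:

```r
# Every (state, action) row of the transition model should sum to 1.
row_sums <- apply(transition_model, c(1, 2), sum)
stopifnot(all(abs(row_sums - 1) < 1e-12))
```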
We define a function to sample from the environment:
sample_env <- function(s, a) {
  probs <- transition_model[s, a, ]
  s_prime <- sample(1:n_states, 1, prob = probs)
  reward <- reward_model[s, a, s_prime]
  list(s_prime = s_prime, reward = reward)
}
Explanation:
This function simulates the MDP dynamics. Given state s and action a, it samples the next state according to transition probabilities and retrieves the corresponding reward, mimicking agent-environment interaction.
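As a quick illustration of this interaction loop, the snippet below (assuming `sample_env`, `transition_model`, and `terminal_state` as defined above) rolls out a short trajectory that always takes Action 1:

```r
# Example rollout from state 1 under the fixed choice of Action 1.
set.seed(1)
s <- 1
for (t in 1:5) {
  step <- sample_env(s, 1)
  cat(sprintf("t=%d: s=%d -> s'=%d, r=%.3f\n", t, s, step$s_prime, step$reward))
  if (step$s_prime == terminal_state) break
  s <- step$s_prime
}
```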
3.3 Value Iteration Algorithm
Value Iteration is a fundamental DP method that computes the optimal value function through successive approximation of the Bellman optimality equation.
value_iteration <- function() {
  V <- rep(0, n_states)
  policy <- rep(1, n_states)
  theta <- 1e-4  # convergence threshold
  repeat {
    delta <- 0
    for (s in 1:(n_states - 1)) {
      v <- V[s]
      q_values <- numeric(n_actions)
      for (a in 1:n_actions) {
        for (s_prime in 1:n_states) {
          q_values[a] <- q_values[a] +
            transition_model[s, a, s_prime] *
            (reward_model[s, a, s_prime] + gamma * V[s_prime])
        }
      }
      V[s] <- max(q_values)
      policy[s] <- which.max(q_values)
      delta <- max(delta, abs(v - V[s]))
    }
    if (delta < theta) break
  }
  list(V = V, policy = policy)
}
Explanation:
This implements value iteration. Starting with zero-initialized values, the algorithm iteratively updates each state’s value by computing expected returns for all actions and selecting the maximum. The associated action becomes the policy for that state. Iteration continues until the maximum value change falls below threshold theta.
Apply the algorithm:
dp_result <- value_iteration()
dp_value <- dp_result$V
dp_policy <- dp_result$policy
print(dp_policy)
## [1] 2 2 1 1 2 1 1 1 1 1
Explanation:
Running value iteration yields the optimal state values (dp_value) and policy (dp_policy), providing a complete mapping from states to optimal actions.
3.4 Evaluation and Interpretation
The derived policy reflects optimal behavior in this environment. Action 1 typically yields higher long-term returns due to its tendency to reach the high-reward terminal state efficiently. Action 2, introducing more randomness and lower rewards, is optimal only where it offers better expected value. This exemplifies the Bellman principle: every sub-policy of an optimal policy must itself be optimal.
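One way to corroborate these values is a Monte Carlo sanity check: simulate many episodes under the derived policy and compare the average discounted return with the value-iteration estimate. This is an illustrative sketch assuming `sample_env`, `gamma`, `terminal_state`, `dp_policy`, and `dp_value` from the sections above; the agreement is approximate due to sampling noise.

```r
# Estimate V(s0) under a given policy by averaging discounted returns.
estimate_return <- function(policy, s0, n_episodes = 2000, max_steps = 200) {
  returns <- numeric(n_episodes)
  for (i in 1:n_episodes) {
    s <- s0; G <- 0; discount <- 1
    for (t in 1:max_steps) {
      step <- sample_env(s, policy[s])
      G <- G + discount * step$reward
      discount <- discount * gamma
      s <- step$s_prime
      if (s == terminal_state) break
    }
    returns[i] <- G
  }
  mean(returns)
}

# Compare the simulated estimate with the DP result for the start state.
c(simulated = estimate_return(dp_policy, 1), dp = dp_value[1])
```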
Visualize the value function:
plot(dp_value, type = "b", col = "blue", pch = 19,
     xlab = "State", ylab = "Value",
     main = "Optimal State Values via Value Iteration")
grid()
Explanation: This plot displays the value function across states. Higher values indicate greater expected long-term returns. The pattern reveals how proximity to the terminal state and the effectiveness of Action 1 drive expected values.
3.5 Theoretical Properties
Value Iteration provides strong guarantees:
- Convergence: Guaranteed convergence to \(V^*\) for any finite MDP with bounded rewards and \(0 \leq \gamma < 1\)
- Optimality: The greedy policy derived from \(V^*\) is optimal
- Complexity: Each iteration requires \(O(S^2 A)\) operations
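The convergence guarantee follows from the fact that the Bellman optimality operator \(T\), defined by \((TV)(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V(s')]\), is a \(\gamma\)-contraction in the sup-norm. A brief sketch:

\[
\|TV - TU\|_\infty \leq \gamma \|V - U\|_\infty
\quad\Longrightarrow\quad
\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty,
\]

so the error shrinks geometrically, and reaching accuracy \(\epsilon\) requires on the order of \(\log\big(\|V_0 - V^*\|_\infty / \epsilon\big) / \log(1/\gamma)\) iterations. For example, with \(\gamma = 0.9\), each sweep removes at least 10% of the remaining error.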
Practical limitations include:
- Requirement for complete knowledge of \(P\) and \(R\)
- Infeasibility in large or continuous state spaces (curse of dimensionality)
These constraints motivate approximation methods, sampling-based approaches, and model-free reinforcement learning.
3.6 Summary Table
| Property | Value Iteration (DP) |
|---|---|
| Model Required | Yes (transition probabilities and rewards) |
| State Representation | Tabular (explicit state-value storage) |
| Action Selection | Greedy w.r.t. value estimates |
| Convergence Guarantee | Yes (under finite \(S, A\), \(\gamma < 1\)) |
| Sample Efficiency | High (uses full model, no sampling error) |
| Scalability | Poor in large or continuous state spaces |
| Output | Optimal value function and policy |
| Computational Complexity | \(O(S^2 A)\) per iteration |
3.7 Conclusion
Dynamic Programming offers theoretically grounded algorithms for solving MDPs with complete environment models. Value Iteration leverages the recursive Bellman optimality equation to iteratively compute optimal value functions and policies.
While limited by scalability and model requirements, DP forms the foundation for understanding decision processes. These principles extend naturally to model-free reinforcement learning, function approximation, and policy gradient methods.
Future chapters will explore Temporal Difference learning, Monte Carlo methods, and policy optimization—approaches that relax model requirements and enable learning from interaction, opening pathways to real-world applications.