Introduction to Reinforcement Learning
The goal of reinforcement learning is to find the policy parameters $\theta$ that maximize the expected total reward:
$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
The probability of a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ is given by:
$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
This defines a Markov chain over the augmented state-action space $(s_t, a_t)$.
(We use linearity of expectation to rewrite the objective as a sum of per-time-step expectations: $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] = \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$.)
Imagine at time $t$ we have some probability distribution over all possible states $s_t$ and the actions $a_t$ we choose in those states. Call this distribution $p_\theta(s_t, a_t)$.
The transition operator $\mathcal{T}$ explains how to evolve from $p_\theta(s_t, a_t)$ to $p_\theta(s_{t+1}, a_{t+1})$. In particular:
$$p\big((s_{t+1}, a_{t+1}) \mid (s_t, a_t)\big) = p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_{t+1} \mid s_{t+1})$$
So we'll sometimes write this succinctly as $p_\theta(s_{t+1}, a_{t+1}) = \mathcal{T}\, p_\theta(s_t, a_t)$.
What if $T \to \infty$? Does the state-action distribution settle to a stationary distribution eventually? If the MDP is ergodic (i.e. every state is reachable from every other state) and aperiodic (i.e. the period of every state is 1), then the augmented Markov chain is positive recurrent and has a unique stationary distribution.
This implies that the expectation as $T \to \infty$ is well defined.
The sum $\sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)]$ will have infinitely many terms whose marginals are close to the stationary distribution. There will be some finite number of terms that are not close to the stationary distribution, but they are dominated by the terms that are, so we're interested in this expectation. Let's define $\mu(s, a)$ as the stationary distribution.
In the limit of $T \to \infty$,
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right] \;\longrightarrow\; \mathbb{E}_{(s, a) \sim \mu}\left[r(s, a)\right],$$
which is the expected reward under the stationary distribution.
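As a concrete illustration, here is a minimal numpy sketch (with a made-up 3-state transition matrix, not from the lecture) showing that repeatedly applying the transition operator drives the state distribution to the stationary distribution $\mu$:

```python
import numpy as np

# A minimal sketch: for an ergodic, aperiodic chain, repeatedly applying the
# transition operator T to any initial distribution converges to the stationary
# distribution mu. The 3-state transition matrix below is made up.
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])   # T[i, j] = p(next state j | current state i)

mu = np.array([1.0, 0.0, 0.0])    # start in state 0 with probability 1
for _ in range(1000):             # mu_{t+1} = mu_t @ T
    mu = mu @ T

print("stationary distribution:", mu)
print("check mu @ T == mu:", np.allclose(mu @ T, mu))
```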
Why expectations matter
In RL we care about expected values of rewards, and the objective we optimize is an expectation. An expectation can be continuous in the parameters of the corresponding distribution even when the function we're taking the expectation of is highly discontinuous. This is important for understanding why RL algorithms can use smooth optimization methods like gradient descent on objectives that appear non-differentiable.
Imagine driving down a road on a hill where the reward is $+1$ if the car stays on the road and $-1$ if it falls off. The reward function is discontinuous, so if you optimize it directly with respect to the position of the car you can't use gradient descent.
But if you consider a stochastic policy where the probability of falling off is $\theta$, i.e. $\pi_\theta(\text{fall}) = \theta$, then the expected reward $\mathbb{E}_{\pi_\theta}[r] = \theta \cdot (-1) + (1 - \theta) \cdot (+1) = 1 - 2\theta$ is continuous (indeed linear) in $\theta$.
Expected values of non-smooth and non-differentiable functions under smooth and differentiable probability distributions are themselves smooth and differentiable.
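A quick numeric sketch of the driving example above (the $\pm 1$ reward and the fall-off probability $\theta$ follow the example; the Monte Carlo sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(fell_off):
    # Discontinuous (binary) reward: +1 if the car stays on the road, -1 if it falls off.
    return -1.0 if fell_off else 1.0

def expected_reward(theta, n=100_000):
    # theta is the probability, under the policy, of falling off the road.
    falls = rng.random(n) < theta
    return np.mean([reward(f) for f in falls])

# The Monte Carlo estimate tracks the closed form E[r] = 1 - 2 * theta,
# which is smooth (linear) in theta even though r itself is a step function.
for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(theta, expected_reward(theta), 1 - 2 * theta)
```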
General Process of RL
- Generate samples (i.e. run the policy in the environment)
- Fit a model / estimate the return from those samples, i.e. evaluate $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$ (or fit a value function or dynamics model)
- Improve the policy: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate and $\theta$ are the policy parameters
- Repeat
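A toy instantiation of this loop, with a made-up one-parameter Gaussian policy and a quadratic reward, just to show where each step fits (the environment and the crude gradient estimator are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                      # scalar policy parameter (toy problem)
alpha = 0.1                      # learning rate

def generate_samples(theta, n=64):
    # Run the "policy": actions are drawn from a Gaussian centered at theta.
    actions = theta + rng.normal(scale=0.5, size=n)
    rewards = -(actions - 2.0) ** 2          # made-up reward, maximized at action = 2
    return actions, rewards

for iteration in range(100):
    # 1. Generate samples (run the policy in the environment).
    actions, rewards = generate_samples(theta)
    # 2. Estimate the return J(theta) from the samples.
    J = rewards.mean()
    # 3. Improve the policy: a crude score-function gradient estimate for a
    #    Gaussian policy with variance 0.25, with the mean reward as baseline.
    grad = np.mean((rewards - rewards.mean()) * (actions - theta)) / 0.25
    theta = theta + alpha * grad
    # 4. Repeat.

print("learned theta (should approach 2):", theta)
```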
Below is a detailed “story” for the model-based RL setup shown in the first figure (the one with the separate neural net that predicts next states). We’ll assume we also have (or learn) a reward model $\hat{r}_\phi(s_t, a_t)$. This approach differs from model-free RL in that we explicitly learn an environment model for both state transitions and rewards.
Model-Based RL
Step 1: Generate Samples (Collect Real Trajectories)
- Use the current policy to act in the real environment.
- Record state transitions and rewards for several episodes (trajectories).
- Collect this dataset of transitions $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}$.
Key point: We gather real data so we can train the environment model to mimic the transitions and rewards observed in the environment.
Step 2: Fit a Model (Dynamics and Reward)
- Dynamics model: learn $f_\phi(s_t, a_t) \approx s_{t+1}$ (or a distribution $p_\phi(s_{t+1} \mid s_t, a_t)$)
- Reward model (optional in some tasks): learn $\hat{r}_\phi(s_t, a_t) \approx r(s_t, a_t)$
In practice, you might train both models jointly or separately. The point is to have a differentiable approximation of the environment’s transition dynamics and (if needed) reward function.
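A minimal PyTorch sketch of this model-fitting step, using synthetic transition data and hypothetical network sizes; in practice the $(s, a, s', r)$ tuples would come from the real rollouts of Step 1:

```python
import torch
import torch.nn as nn

# A minimal sketch (synthetic data): fit a dynamics model f(s, a) -> s' and a
# reward model r_hat(s, a) -> r by plain supervised regression on (s, a, s', r) tuples.
state_dim, action_dim, N = 4, 2, 1024

states  = torch.randn(N, state_dim)
actions = torch.randn(N, action_dim)
# Pretend "true" environment: simple dynamics plus noise, quadratic reward.
next_states = states + 0.1 * actions.sum(dim=1, keepdim=True) + 0.01 * torch.randn(N, state_dim)
rewards = -(states ** 2).sum(dim=1, keepdim=True)

dynamics     = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
reward_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(dynamics.parameters()) + list(reward_model.parameters()), lr=1e-3)

for epoch in range(500):
    sa = torch.cat([states, actions], dim=1)
    loss = nn.functional.mse_loss(dynamics(sa), next_states) \
         + nn.functional.mse_loss(reward_model(sa), rewards)
    opt.zero_grad(); loss.backward(); opt.step()

print("final model-fitting loss:", loss.item())
```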
Step 3: Policy Improvement via Backprop
Once you have a differentiable model of transitions and rewards, you can:
- Simulate the policy $\pi_\theta$ inside your learned model $f_\phi$:
- Start from a state $s_1$ (real or imagined), sample $a_t \sim \pi_\theta(a_t \mid \hat{s}_t)$, and predict $\hat{s}_{t+1} = f_\phi(\hat{s}_t, a_t)$
- Predict the reward $\hat{r}_t = \hat{r}_\phi(\hat{s}_t, a_t)$
- Then move on to the next step, and so on.
- Accumulate the predicted return: $\hat{J}(\theta) = \sum_t \hat{r}_\phi(\hat{s}_t, a_t)$
- Backpropagate the gradient of this predicted return w.r.t. the policy parameters $\theta$: $\nabla_\theta \hat{J}(\theta)$
Because the entire simulation is a differentiable computational graph ($f_\phi$ and $\hat{r}_\phi$ are neural nets, and $\pi_\theta$ is also a neural net), we can compute this gradient directly, under the assumption that $f_\phi$ and $\hat{r}_\phi$ accurately model the real environment.
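A minimal PyTorch sketch of this backprop-through-the-model idea; the network architectures and horizon are made up, and the models are untrained placeholders standing in for the fitted $f_\phi$ and $\hat{r}_\phi$:

```python
import torch
import torch.nn as nn

# Roll the policy out inside the learned dynamics/reward networks and
# differentiate the predicted return w.r.t. the policy parameters only.
state_dim, action_dim, horizon = 4, 2, 15

dynamics     = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
reward_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy       = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
policy_opt   = torch.optim.Adam(policy.parameters(), lr=1e-3)

def predicted_return(s0):
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)                              # a_t = pi_theta(s_t)
        sa = torch.cat([s, a], dim=-1)
        total = total + reward_model(sa).sum()     # accumulate r_hat(s_t, a_t)
        s = dynamics(sa)                           # s_{t+1} = f_phi(s_t, a_t)
    return total

s0 = torch.randn(8, state_dim)                     # batch of (real or imagined) start states
J_hat = predicted_return(s0)
policy_opt.zero_grad()
(-J_hat).backward()                                # ascend the predicted return
policy_opt.step()                                  # only the policy parameters are updated
print("predicted return:", J_hat.item())
```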
Step 4: Repeat
- After updating $\theta$, you can either:
- Re-run the real environment with the new policy to collect more real transitions and update again.
- Or keep re-simulating in the model for several “imagined” rollouts before refreshing the data from the real environment (if you suspect the model remains reasonably accurate).
- Continue iterating until convergence or until the policy performs well enough.
Value functions
What if we knew this part? Writing the RL objective as nested expectations, the inner quantity $r(s_1, a_1) + \mathbb{E}[\text{future rewards} \mid s_1, a_1]$ is exactly the Q-function $Q^\pi(s_1, a_1)$.
It is easy to modify the policy if $Q^\pi(s_1, a_1)$ is known!
Example: set $\pi'(a_1 \mid s_1) = 1$ if $a_1 = \arg\max_{a_1} Q^\pi(s_1, a_1)$ (and $0$ otherwise).
Q-function
$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$: the total expected reward from taking $a_t$ in $s_t$
Value function
$V^\pi(s_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t\right] = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$: **the total expected reward from $s_t$**
Basic optimization ideas
Using Q-functions and value functions
Idea 1: if we have policy $\pi$, and we know $Q^\pi(s, a)$, then we can improve $\pi$: set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$ (and $0$ otherwise).
This policy is at least as good as $\pi$ (and probably better)! And it doesn't matter what $\pi$ is.
Idea 2: compute the gradient to increase the probability of good actions $a$:
if $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average
(recall that $V^\pi(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\left[Q^\pi(s, a)\right]$),
so modify $\pi(a \mid s)$ to increase the probability of $a$ whenever $Q^\pi(s, a) > V^\pi(s)$.
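A tiny tabular sketch of both ideas, with made-up numbers for $Q^\pi$ and $\pi$:

```python
import numpy as np

# Idea 1: given Q^pi and the current (stochastic) policy, the greedy policy
# pi'(a|s) = 1[a = argmax_a Q^pi(s, a)] is at least as good as pi.
Q = np.array([[1.0, 2.0, 0.5],     # Q[s, a] for 2 states, 3 actions (made-up values)
              [0.3, 0.1, 0.9]])
pi = np.array([[0.5, 0.3, 0.2],    # current policy pi(a|s)
               [0.4, 0.4, 0.2]])

V = (pi * Q).sum(axis=1)           # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V[:, None]                 # advantage: A(s, a) > 0  =>  a is better than average

greedy = np.zeros_like(pi)
greedy[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0   # Idea 1: deterministic argmax policy

print("V^pi:", V)
print("advantages:\n", A)          # Idea 2 would push pi(a|s) up where A(s, a) > 0
print("greedy policy:\n", greedy)
```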
Types of Algorithms
Our Objective:
$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right]$$
The lecture covers different categories of Reinforcement Learning (RL) algorithms and their typical structures. The main goal in RL is to maximize expected reward (or return). Formally, if an agent follows a policy $\pi_\theta$ parameterized by $\theta$, the RL objective can often be written as:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right]$$
where
- $\tau$ denotes a trajectory (or "rollout") $(s_1, a_1, s_2, a_2, \ldots)$,
- $r(s_t, a_t)$ is the reward at time $t$,
- $\gamma$ is the discount factor.
The lecture introduces four main categories of RL algorithms:
- Policy Gradient Methods
- Value-based Methods
- Actor-Critic Methods
- Model-based Methods
Policy Gradient Methods
- Core Idea: Directly optimize $J(\theta)$ by estimating its gradient $\nabla_\theta J(\theta)$.
- Process:
- Collect trajectories (rollouts) by sampling actions from $\pi_\theta(a_t \mid s_t)$.
- Compute the (discounted) returns or reward sums for each trajectory.
- Use those returns to estimate the gradient $\nabla_\theta J(\theta)$.
- Update parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
- Notes:
- No explicit value function is required (though baseline value functions can help).
- The "green box" (in the lecture's diagram) is very simple: just sum rewards.
- The "blue box" is the gradient update w.r.t. $\theta$.
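A minimal REINFORCE-style sketch in PyTorch on a made-up contextual bandit (the environment, network size, and baseline choice are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Score-function (REINFORCE) gradient estimate:
# grad J(theta) ~= (1/N) sum_i grad log pi_theta(a_i | s_i) * (R_i - baseline).
state_dim, n_actions = 4, 3
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def env_reward(states, actions):
    # Fictitious reward: the rewarded action is 0 when the first state feature
    # is non-negative and 1 otherwise.
    best = (states[:, 0] < 0).long()
    return (actions == best).float()

for it in range(200):
    states = torch.randn(64, state_dim)
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    returns = env_reward(states, actions)
    baseline = returns.mean()
    # Maximize E[log pi(a|s) * (R - baseline)] by minimizing its negative.
    loss = -(dist.log_prob(actions) * (returns - baseline)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Typically rises well above the ~0.33 return of a uniformly random policy.
print("final average return:", returns.mean().item())
```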
Value-based Methods
- Core Idea: Learn a value function $V(s)$ or action-value function $Q(s, a)$ that estimates how good a state or state-action pair is.
- Examples:
- Q-learning: approximate $Q^\star(s, a)$ for the optimal policy.
- SARSA: on-policy method for estimating $Q^\pi(s, a)$.
- Implementation:
- Green Box: Fit a neural network to approximate $V(s)$ or $Q(s, a)$.
- Blue Box: The policy is derived implicitly by taking the argmax of $Q$: $\pi(s) = \arg\max_a Q(s, a)$.
- Thus, the policy is not parameterized separately; it's just $\arg\max_a Q(s, a)$.
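A minimal tabular Q-learning sketch on a made-up 5-state chain; because Q-learning is off-policy, it can learn the greedy policy even from purely random behavior:

```python
import numpy as np

# Tiny chain MDP (made up): action 1 moves right, action 0 moves left;
# taking "right" in the last state gives reward 1 and resets to state 0.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    if s == n_states - 1 and a == 1:
        return 0, 1.0                                   # goal reward, then reset
    return (min(s + 1, n_states - 1), 0.0) if a == 1 else (max(s - 1, 0), 0.0)

s = 0
for _ in range(50_000):
    a = rng.integers(n_actions)                         # completely random behavior policy
    s_next, r = step(s, a)
    # Q-learning update: bootstrap from the greedy (max) next action.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy (argmax_a Q):", Q.argmax(axis=1))  # should be all 1s ("right")
```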
Actor-Critic Methods
- Core Idea: Hybrid of policy gradient ("actor") and value-based ("critic").
- Process:
- Green Box: Learn a value function $V^\pi(s)$ or a $Q^\pi(s, a)$ function (the "critic").
- Blue Box: Use this learned value/Q-function to improve the policy parameters $\theta$ (the "actor") via gradient steps.
- The learned value function acts as a baseline or a more efficient estimator for the policy gradient. Instead of just summing up rewards, the actor can use the critic’s estimates for lower-variance gradient updates.
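A minimal one-batch actor-critic update sketch in PyTorch, using synthetic transitions and a bootstrapped advantage estimate (all shapes and data are made up):

```python
import torch
import torch.nn as nn

# The critic V(s) is fit to bootstrapped targets, and the actor is pushed
# toward actions with positive advantage A = r + gamma * V(s') - V(s).
state_dim, n_actions, gamma = 4, 3, 0.99
actor  = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# One fictitious batch of transitions (s, a, r, s').
s  = torch.randn(64, state_dim)
a  = torch.randint(n_actions, (64,))
r  = torch.randn(64)
s2 = torch.randn(64, state_dim)

v, v2 = critic(s).squeeze(-1), critic(s2).squeeze(-1)
target = r + gamma * v2.detach()                  # bootstrapped value target
advantage = (target - v).detach()                 # lower-variance signal for the actor

dist = torch.distributions.Categorical(logits=actor(s))
actor_loss  = -(dist.log_prob(a) * advantage).mean()
critic_loss = nn.functional.mse_loss(v, target)

opt.zero_grad(); (actor_loss + critic_loss).backward(); opt.step()
print("actor loss:", actor_loss.item(), "critic loss:", critic_loss.item())
```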
Model-based Methods
- Core Idea: Learn a model of the environment's transition dynamics, $p_\phi(s_{t+1} \mid s_t, a_t)$ (or a deterministic $f_\phi(s_t, a_t) \approx s_{t+1}$).
This model can be used for planning or for generating synthetic data.
- Model-based Approaches:
- Planning: Use the learned model to search for good policies (e.g., Monte Carlo Tree Search in discrete domains like chess, or trajectory optimization in continuous robotics).
- Backpropagation Through Dynamics: Differentiate the expected return w.r.t. policy parameters through the learned model. This may require advanced optimizers (e.g., second-order methods) for stability.
- Learning Value/Q Functions via simulated data: Use the model to perform dynamic programming or to generate "imaginary" rollouts and train a value-based or policy-based method.
- Dyna-style Methods: Use the model to generate additional (state, action, next state, reward) samples for a model-free RL algorithm, effectively augmenting real data.
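A minimal Dyna-style sketch (tabular, reusing the chain environment from the Q-learning sketch above): each real transition also updates a table model, which is then replayed for a few extra "imagined" updates:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
model = {}                                  # learned table model: (s, a) -> (s', r)
alpha, gamma, n_planning = 0.1, 0.9, 10
rng = np.random.default_rng(0)

def real_step(s, a):                        # hidden "true" environment (chain to a goal)
    if s == n_states - 1 and a == 1:
        return 0, 1.0
    return (min(s + 1, n_states - 1), 0.0) if a == 1 else (max(s - 1, 0), 0.0)

s = 0
for _ in range(2_000):
    a = rng.integers(n_actions)
    s2, r = real_step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])        # learn from real data
    model[(s, a)] = (s2, r)                                       # update the model
    for _ in range(n_planning):                                   # planning: imagined updates
        ps, pa = list(model.keys())[rng.integers(len(model))]
        ps2, pr = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2

print("greedy policy:", Q.argmax(axis=1))   # should again prefer "right" everywhere
```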
Why So Many RL Algorithms?
There is no single “best” RL algorithm because each comes with its own trade-offs around:
- Sample Efficiency (how many environment interactions are needed).
- Stability and Ease of Use (how likely the method is to converge or require tuning).
- Computational Efficiency (wall-clock time vs. data collection costs).
- Assumptions (e.g., full observability, continuity, discrete vs. continuous action spaces, etc.).
Different domains place different demands on these trade-offs, so multiple RL algorithms exist to accommodate the variety of possible problem settings.
Sample Efficiency
- Definition: The number of samples (environment interactions) needed to achieve a good policy.
- Key Distinction: On-policy vs. Off-policy.
- On-policy algorithms (e.g., vanilla policy gradient):
- Must collect new data every time the policy changes (cannot reuse old data under a different policy).
- Often less sample-efficient.
- Off-policy algorithms (e.g., Q-learning):
- Can learn from data generated by any policy (or even random policies).
- More sample-efficient in principle.
Hence, if sample collection is expensive (e.g., real-world robotics), off-policy methods can be very appealing. Conversely, if simulation is extremely cheap (e.g., a fast game simulator), even less sample-efficient methods might be acceptable if they have other advantages (like simpler implementation or better stability).
Stability and Ease of Use
- Convergence: Does the algorithm converge to a stable solution?
- Many RL algorithms only converge under special conditions (e.g., tabular Q-learning is convergent, but Q-learning with neural networks may not always converge in theory).
- Practical Considerations:
- Hyperparameter tuning can be challenging (learning rate, exploration, discount factor, etc.).
- Policy Gradient methods:
- Directly optimize the true RL objective
- Can be stable but require a lot of samples (on-policy).
- Value-based methods:
- Involve fixed-point iteration for the value function
- Might diverge if function approximation is used incorrectly.
- Model-based methods:
- The model-fitting step converges (it is essentially supervised learning of the environment's dynamics)
- But no guarantee a better model yields a better final policy.
Other Common Assumptions
- Full Observability (Markov Property)
- Many algorithms assume the state $s_t$ is fully observed, or at least Markov.
- If real-world observations are partial (e.g., partially observed states), we need extra machinery (recurrent networks, belief states, etc.).
- Episodic Tasks
- Policy gradient approaches often assume the ability to reset the environment and collect episodic rollouts.
- Makes it easier to compute returns over a finite episode, e.g. $R(\tau) = \sum_{t=1}^{T} \gamma^{t-1} r_t$ (see the sketch after this list).
- Continuity / Smoothness
- Common in model-based RL (optimal control style) and some continuous value function methods.
- Helps with certain planning/optimization routines that rely on derivatives.
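A small sketch of the episodic-return computation referenced above (the reward list and discount factor are arbitrary):

```python
import numpy as np

# With episodic rollouts, the (discounted) return at each time step is just a
# finite backward sum over the recorded episode rewards: G_t = r_t + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))   # [0.81, 0.9, 1.0]
```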
Putting It All Together
- Sample Efficiency: On-policy methods like policy gradient are often the least sample-efficient, while off-policy or model-based methods can be more sample-efficient.
- Stability: Pure policy gradient optimizes the actual RL objective but can have high variance and might need many samples to stabilize. Value-based methods and actor-critic can be trickier to converge but might learn faster in practice under certain conditions.
- Computation vs. Data Cost: If simulation is cheap (e.g., video games), purely on-policy, gradient-based methods can still be fine. If data collection is expensive (real robotics), off-policy or model-based approaches can shine.
- Algorithmic Assumptions:
- Full observability vs. partial observability.
- Discrete vs. continuous actions.
- Episodic vs. infinite-horizon tasks.
- Smooth dynamics vs. complicated, stochastic transitions.
Depending on the task at hand, no single method will be universally optimal. The choice of algorithm depends heavily on:
- How expensive data collection is.
- Whether the task is episodic or continuing.
- The level of noise or partial observability.
- The desired convergence properties or theoretical guarantees.
Summary
- Model-based RL: Focuses on learning and leveraging the transition model.
- Value-based RL: Focuses on learning $V(s)$ or $Q(s, a)$ and deriving the policy via $\arg\max_a Q(s, a)$.
- Policy Gradients: Direct optimization of a parameterized policy.
- Actor-Critic: Combines value function learning (critic) with direct policy updates (actor).
All these methods share the common objective of maximizing cumulative reward, but they differ in how they use data, how they represent policies, and whether or not they learn a transition model.