Introduction to Reinforcement Learning
The goal of reinforcement learning is to find the policy parameters $\theta$ that maximize the expected total reward:
$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
The probability of a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ is given by:
$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
This defines a Markov chain over the augmented state-action space $(s_t, a_t)$.
(We use linearity of expectation to rewrite the objective as a sum of per-time-step expectations: $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] = \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$.)
Imagine at time $t$ we have some probability distribution over all possible states $s_t$ and the actions $a_t$ we choose in those states. Call this distribution $p_\theta(s_t, a_t)$.
The transition operator $\mathcal{T}$ explains how to evolve from $p_\theta(s_t, a_t)$ to $p_\theta(s_{t+1}, a_{t+1})$. In particular:
$$p\big((s_{t+1}, a_{t+1}) \mid (s_t, a_t)\big) = p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_{t+1} \mid s_{t+1})$$
So we'll sometimes write this succinctly as $p_\theta(s_{t+1}, a_{t+1}) = \mathcal{T}\, p_\theta(s_t, a_t)$.
What if $T \to \infty$? Does the state-action distribution settle to a stationary distribution eventually? If the MDP is ergodic (i.e. every state is reachable from every other state) and aperiodic (i.e. the period of every state is 1), then the augmented Markov chain is positive recurrent and has a unique stationary distribution.
This implies that the expectation as $T \to \infty$ is well defined.
The sum $\sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)]$ will have infinitely many terms whose marginals are close to the stationary distribution. There will be some finite number of terms that are not close to the stationary distribution, but they are dominated by the terms that are, so we're interested in this expectation. Let's define $\mu(s, a)$ as the stationary distribution.
In the limit of $T \to \infty$,
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right] \;\longrightarrow\; \mathbb{E}_{(s, a) \sim \mu}\left[r(s, a)\right],$$
which is the expected reward under the stationary distribution.
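As a concrete illustration, here is a minimal numpy sketch (with a made-up 3-state transition matrix, not from the lecture) showing that repeatedly applying the transition operator drives the state distribution to the stationary distribution $\mu$:

```python
import numpy as np

# A minimal sketch: for an ergodic, aperiodic chain, repeatedly applying the
# transition operator T to any initial distribution converges to the stationary
# distribution mu. The 3-state transition matrix below is made up.
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])   # T[i, j] = p(next state j | current state i)

mu = np.array([1.0, 0.0, 0.0])    # start in state 0 with probability 1
for _ in range(1000):             # mu_{t+1} = mu_t @ T
    mu = mu @ T

print("stationary distribution:", mu)
print("check mu @ T == mu:", np.allclose(mu @ T, mu))
```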
Why expectations matter
In RL we care about expected values of rewards, and the objective we optimize is an expectation. An expectation can be continuous in the parameters of the corresponding distribution even when the function we're taking the expectation of is highly discontinuous. This is important for understanding why RL algorithms can use smooth optimization methods like gradient descent on objectives that appear non-differentiable.
Imagine driving down a road on a hill where the reward is $+1$ if the car stays on the road and $-1$ if it falls off. The reward function is discontinuous, so if you optimize it directly with respect to the position of the car you can't use gradient descent.
But if you consider a stochastic policy where the probability of falling off is $\theta$, i.e. $\pi_\theta(\text{fall}) = \theta$, then the expected reward $\mathbb{E}_{\pi_\theta}[r] = \theta \cdot (-1) + (1 - \theta) \cdot (+1) = 1 - 2\theta$ is continuous (indeed linear) in $\theta$.
Expected values of non-smooth and non-differentiable functions under smooth and differentiable probability distributions are themselves smooth and differentiable.
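A quick numeric sketch of the driving example above (the $\pm 1$ reward and the fall-off probability $\theta$ follow the example; the Monte Carlo sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(fell_off):
    # Discontinuous (binary) reward: +1 if the car stays on the road, -1 if it falls off.
    return -1.0 if fell_off else 1.0

def expected_reward(theta, n=100_000):
    # theta is the probability, under the policy, of falling off the road.
    falls = rng.random(n) < theta
    return np.mean([reward(f) for f in falls])

# The Monte Carlo estimate tracks the closed form E[r] = 1 - 2 * theta,
# which is smooth (linear) in theta even though r itself is a step function.
for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(theta, expected_reward(theta), 1 - 2 * theta)
```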
General Process of RL
- Generate samples (i.e. run the policy in the environment)
- Fit a model / estimate the return from those samples, i.e. evaluate $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$ (or fit a value function or dynamics model)
- Improve the policy: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate and $\theta$ are the policy parameters
- Repeat
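A toy instantiation of this loop, with a made-up one-parameter Gaussian policy and a quadratic reward, just to show where each step fits (the environment and the crude gradient estimator are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                      # scalar policy parameter (toy problem)
alpha = 0.1                      # learning rate

def generate_samples(theta, n=64):
    # Run the "policy": actions are drawn from a Gaussian centered at theta.
    actions = theta + rng.normal(scale=0.5, size=n)
    rewards = -(actions - 2.0) ** 2          # made-up reward, maximized at action = 2
    return actions, rewards

for iteration in range(100):
    # 1. Generate samples (run the policy in the environment).
    actions, rewards = generate_samples(theta)
    # 2. Estimate the return J(theta) from the samples.
    J = rewards.mean()
    # 3. Improve the policy: a crude score-function gradient estimate for a
    #    Gaussian policy with variance 0.25, with the mean reward as baseline.
    grad = np.mean((rewards - rewards.mean()) * (actions - theta)) / 0.25
    theta = theta + alpha * grad
    # 4. Repeat.

print("learned theta (should approach 2):", theta)
```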
Below is a detailed “story” for the model-based RL setup shown in the first figure (the one with the separate neural net that predicts next states). We’ll assume we also have (or learn) a reward model $\hat{r}_\phi(s_t, a_t)$. This approach differs from model-free RL in that we explicitly learn an environment model for both state transitions and rewards.
Model-Based RL
Step 1: Generate Samples (Collect Real Trajectories)
- Use the current policy to act in the real environment.
- Record state transitions and rewards for several episodes (trajectories).
- Collect this dataset of transitions $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}$.
Key point: We gather real data so we can train the environment model to mimic the transitions and rewards observed in the environment.
Step 2: Fit a Model (Dynamics and Reward)
- Dynamics model: learn $f_\phi(s_t, a_t) \approx s_{t+1}$ (or a distribution $p_\phi(s_{t+1} \mid s_t, a_t)$)
- Reward model (optional in some tasks): learn $\hat{r}_\phi(s_t, a_t) \approx r(s_t, a_t)$
In practice, you might train both models jointly or separately. The point is to have a differentiable approximation of the environment’s transition dynamics and (if needed) reward function.
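A minimal PyTorch sketch of this model-fitting step, using synthetic transition data and hypothetical network sizes; in practice the $(s, a, s', r)$ tuples would come from the real rollouts of Step 1:

```python
import torch
import torch.nn as nn

# A minimal sketch (synthetic data): fit a dynamics model f(s, a) -> s' and a
# reward model r_hat(s, a) -> r by plain supervised regression on (s, a, s', r) tuples.
state_dim, action_dim, N = 4, 2, 1024

states  = torch.randn(N, state_dim)
actions = torch.randn(N, action_dim)
# Pretend "true" environment: simple dynamics plus noise, quadratic reward.
next_states = states + 0.1 * actions.sum(dim=1, keepdim=True) + 0.01 * torch.randn(N, state_dim)
rewards = -(states ** 2).sum(dim=1, keepdim=True)

dynamics     = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
reward_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(dynamics.parameters()) + list(reward_model.parameters()), lr=1e-3)

for epoch in range(500):
    sa = torch.cat([states, actions], dim=1)
    loss = nn.functional.mse_loss(dynamics(sa), next_states) \
         + nn.functional.mse_loss(reward_model(sa), rewards)
    opt.zero_grad(); loss.backward(); opt.step()

print("final model-fitting loss:", loss.item())
```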
Step 3: Policy Improvement via Backprop
Once you have a differentiable model of transitions and rewards, you can:
- Simulate the policy $\pi_\theta$ inside your learned model $f_\phi$:
- Start from a state $s_1$ (real or imagined), sample $a_t \sim \pi_\theta(a_t \mid \hat{s}_t)$, and predict $\hat{s}_{t+1} = f_\phi(\hat{s}_t, a_t)$
- Predict the reward $\hat{r}_t = \hat{r}_\phi(\hat{s}_t, a_t)$
- Then move on to the next step, and so on.
- Accumulate the predicted return: $\hat{J}(\theta) = \sum_t \hat{r}_\phi(\hat{s}_t, a_t)$
- Backpropagate the gradient of this predicted return w.r.t. the policy parameters $\theta$: $\nabla_\theta \hat{J}(\theta)$
Because the entire simulation is a differentiable computational graph ($f_\phi$ and $\hat{r}_\phi$ are neural nets, and $\pi_\theta$ is also a neural net), we can compute this gradient directly, under the assumption that $f_\phi$ and $\hat{r}_\phi$ accurately model the real environment.
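A minimal PyTorch sketch of this backprop-through-the-model idea; the network architectures and horizon are made up, and the models are untrained placeholders standing in for the fitted $f_\phi$ and $\hat{r}_\phi$:

```python
import torch
import torch.nn as nn

# Roll the policy out inside the learned dynamics/reward networks and
# differentiate the predicted return w.r.t. the policy parameters only.
state_dim, action_dim, horizon = 4, 2, 15

dynamics     = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
reward_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy       = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
policy_opt   = torch.optim.Adam(policy.parameters(), lr=1e-3)

def predicted_return(s0):
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)                              # a_t = pi_theta(s_t)
        sa = torch.cat([s, a], dim=-1)
        total = total + reward_model(sa).sum()     # accumulate r_hat(s_t, a_t)
        s = dynamics(sa)                           # s_{t+1} = f_phi(s_t, a_t)
    return total

s0 = torch.randn(8, state_dim)                     # batch of (real or imagined) start states
J_hat = predicted_return(s0)
policy_opt.zero_grad()
(-J_hat).backward()                                # ascend the predicted return
policy_opt.step()                                  # only the policy parameters are updated
print("predicted return:", J_hat.item())
```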
Step 4: Repeat
- After updating $\theta$, you can either:
- Re-run the real environment with the new policy to collect more real transitions and update again.
- Or keep re-simulating in the model for several “imagined” rollouts before refreshing the data from the real environment (if you suspect the model remains reasonably accurate).
- Continue iterating until convergence or until the policy performs well enough.
Value functions
What if we knew this part? Writing the RL objective as nested expectations, the inner quantity $r(s_1, a_1) + \mathbb{E}[\text{future rewards} \mid s_1, a_1]$ is exactly the Q-function $Q^\pi(s_1, a_1)$.
It is easy to modify the policy if $Q^\pi(s_1, a_1)$ is known!
Example: set $\pi'(a_1 \mid s_1) = 1$ if $a_1 = \arg\max_{a_1} Q^\pi(s_1, a_1)$ (and $0$ otherwise).
Q-function
$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$: the total expected reward from taking $a_t$ in $s_t$
Value function
$V^\pi(s_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t\right] = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$: **the total expected reward from $s_t$**
Basic optimization ideas
Using Q-functions and value functions
Idea 1: if we have policy $\pi$, and we know $Q^\pi(s, a)$, then we can improve $\pi$: set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$ (and $0$ otherwise).
This policy is at least as good as $\pi$ (and probably better)! And it doesn't matter what $\pi$ is.
Idea 2: compute the gradient to increase the probability of good actions $a$:
if $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average
(recall that $V^\pi(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\left[Q^\pi(s, a)\right]$),
so modify $\pi(a \mid s)$ to increase the probability of $a$ whenever $Q^\pi(s, a) > V^\pi(s)$.
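A tiny tabular sketch of both ideas, with made-up numbers for $Q^\pi$ and $\pi$:

```python
import numpy as np

# Idea 1: given Q^pi and the current (stochastic) policy, the greedy policy
# pi'(a|s) = 1[a = argmax_a Q^pi(s, a)] is at least as good as pi.
Q = np.array([[1.0, 2.0, 0.5],     # Q[s, a] for 2 states, 3 actions (made-up values)
              [0.3, 0.1, 0.9]])
pi = np.array([[0.5, 0.3, 0.2],    # current policy pi(a|s)
               [0.4, 0.4, 0.2]])

V = (pi * Q).sum(axis=1)           # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V[:, None]                 # advantage: A(s, a) > 0  =>  a is better than average

greedy = np.zeros_like(pi)
greedy[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0   # Idea 1: deterministic argmax policy

print("V^pi:", V)
print("advantages:\n", A)          # Idea 2 would push pi(a|s) up where A(s, a) > 0
print("greedy policy:\n", greedy)
```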
Types of Algorithms
Our Objective:
$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right]$$
The lecture covers different categories of Reinforcement Learning (RL) algorithms and their typical structures. The main goal in RL is to maximize expected reward (or return). Formally, if an agent follows a policy $\pi_\theta$ parameterized by $\theta$, the RL objective can often be written as:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right]$$
where
- $\tau$ denotes a trajectory (or "rollout") $(s_1, a_1, s_2, a_2, \ldots)$,
- $r(s_t, a_t)$ is the reward at time $t$,
- $\gamma$ is the discount factor.
The lecture introduces four main categories of RL algorithms:
- Policy Gradient Methods
- Value-based Methods
- Actor-Critic Methods
- Model-based Methods
Policy Gradient Methods
- Core Idea: Directly optimize $J(\theta)$ by estimating its gradient $\nabla_\theta J(\theta)$.
- Process:
- Collect trajectories (rollouts) by sampling actions from $\pi_\theta(a_t \mid s_t)$.
- Compute the (discounted) returns or reward sums for each trajectory.
- Use those returns to estimate the gradient $\nabla_\theta J(\theta)$.
- Update parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
- Notes:
- No explicit value function is required (though baseline value functions can help).
- The "green box" (in the lecture's diagram) is very simple: just sum rewards.
- The "blue box" is the gradient update w.r.t. $\theta$.
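A minimal REINFORCE-style sketch in PyTorch on a made-up contextual bandit (the environment, network size, and baseline choice are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Score-function (REINFORCE) gradient estimate:
# grad J(theta) ~= (1/N) sum_i grad log pi_theta(a_i | s_i) * (R_i - baseline).
state_dim, n_actions = 4, 3
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def env_reward(states, actions):
    # Fictitious reward: the rewarded action is 0 when the first state feature
    # is non-negative and 1 otherwise.
    best = (states[:, 0] < 0).long()
    return (actions == best).float()

for it in range(200):
    states = torch.randn(64, state_dim)
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    returns = env_reward(states, actions)
    baseline = returns.mean()
    # Maximize E[log pi(a|s) * (R - baseline)] by minimizing its negative.
    loss = -(dist.log_prob(actions) * (returns - baseline)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Typically rises well above the ~0.33 return of a uniformly random policy.
print("final average return:", returns.mean().item())
```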
Value-based Methods
- Core Idea: Learn a value function $V(s)$ or action-value function $Q(s, a)$ that estimates how good a state or state-action pair is.
- Examples:
- Q-learning: approximate $Q^\star(s, a)$ for the optimal policy.
- SARSA: on-policy method for estimating $Q^\pi(s, a)$.
- Implementation:
- Green Box: Fit a neural network to approximate $V(s)$ or $Q(s, a)$.
- Blue Box: The policy is derived implicitly by taking the argmax of $Q$: $\pi(s) = \arg\max_a Q(s, a)$.
- Thus, the policy is not parameterized separately; it's just $\arg\max_a Q(s, a)$.
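A minimal tabular Q-learning sketch on a made-up 5-state chain; because Q-learning is off-policy, it can learn the greedy policy even from purely random behavior:

```python
import numpy as np

# Tiny chain MDP (made up): action 1 moves right, action 0 moves left;
# taking "right" in the last state gives reward 1 and resets to state 0.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    if s == n_states - 1 and a == 1:
        return 0, 1.0                                   # goal reward, then reset
    return (min(s + 1, n_states - 1), 0.0) if a == 1 else (max(s - 1, 0), 0.0)

s = 0
for _ in range(50_000):
    a = rng.integers(n_actions)                         # completely random behavior policy
    s_next, r = step(s, a)
    # Q-learning update: bootstrap from the greedy (max) next action.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy (argmax_a Q):", Q.argmax(axis=1))  # should be all 1s ("right")
```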
Actor-Critic Methods
- Core Idea: Hybrid of policy gradient ("actor") and value-based ("critic").
- Process:
- Green Box: Learn a value function $V^\pi(s)$ or a $Q^\pi(s, a)$ function (the "critic").
- Blue Box: Use this learned value/Q-function to improve the policy parameters $\theta$ (the "actor") via gradient steps.
- The learned value function acts as a baseline or a more efficient estimator for the policy gradient. Instead of just summing up rewards, the actor can use the critic’s estimates for lower-variance gradient updates.
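A minimal one-batch actor-critic update sketch in PyTorch, using synthetic transitions and a bootstrapped advantage estimate (all shapes and data are made up):

```python
import torch
import torch.nn as nn

# The critic V(s) is fit to bootstrapped targets, and the actor is pushed
# toward actions with positive advantage A = r + gamma * V(s') - V(s).
state_dim, n_actions, gamma = 4, 3, 0.99
actor  = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# One fictitious batch of transitions (s, a, r, s').
s  = torch.randn(64, state_dim)
a  = torch.randint(n_actions, (64,))
r  = torch.randn(64)
s2 = torch.randn(64, state_dim)

v, v2 = critic(s).squeeze(-1), critic(s2).squeeze(-1)
target = r + gamma * v2.detach()                  # bootstrapped value target
advantage = (target - v).detach()                 # lower-variance signal for the actor

dist = torch.distributions.Categorical(logits=actor(s))
actor_loss  = -(dist.log_prob(a) * advantage).mean()
critic_loss = nn.functional.mse_loss(v, target)

opt.zero_grad(); (actor_loss + critic_loss).backward(); opt.step()
print("actor loss:", actor_loss.item(), "critic loss:", critic_loss.item())
```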
Model-based Methods
- Core Idea: Learn a model of the environment's transition dynamics, $p_\phi(s_{t+1} \mid s_t, a_t)$ (or a deterministic $f_\phi(s_t, a_t) \approx s_{t+1}$).
This model can be used for planning or for generating synthetic data.
- Model-based Approaches:
- Planning: Use the learned model to search for good policies (e.g., Monte Carlo Tree Search in discrete domains like chess, or trajectory optimization in continuous robotics).
- Backpropagation Through Dynamics: Differentiate the expected return w.r.t. policy parameters through the learned model. This may require advanced optimizers (e.g., second-order methods) for stability.
- Learning Value/Q Functions via simulated data: Use the model to perform dynamic programming or to generate "imaginary" rollouts and train a value-based or policy-based method.
- Dyna-style Methods: Use the model to generate additional (state, action, next state, reward) samples for a model-free RL algorithm, effectively augmenting real data.
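A minimal Dyna-style sketch (tabular, reusing the chain environment from the Q-learning sketch above): each real transition also updates a table model, which is then replayed for a few extra "imagined" updates:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
model = {}                                  # learned table model: (s, a) -> (s', r)
alpha, gamma, n_planning = 0.1, 0.9, 10
rng = np.random.default_rng(0)

def real_step(s, a):                        # hidden "true" environment (chain to a goal)
    if s == n_states - 1 and a == 1:
        return 0, 1.0
    return (min(s + 1, n_states - 1), 0.0) if a == 1 else (max(s - 1, 0), 0.0)

s = 0
for _ in range(2_000):
    a = rng.integers(n_actions)
    s2, r = real_step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])        # learn from real data
    model[(s, a)] = (s2, r)                                       # update the model
    for _ in range(n_planning):                                   # planning: imagined updates
        ps, pa = list(model.keys())[rng.integers(len(model))]
        ps2, pr = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2

print("greedy policy:", Q.argmax(axis=1))   # should again prefer "right" everywhere
```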
Why So Many RL Algorithms?
There is no single “best” RL algorithm because each comes with its own trade-offs around:
- Sample Efficiency (how many environment interactions are needed).
- Stability and Ease of Use (how likely the method is to converge or require tuning).
- Computational Efficiency (wall-clock time vs. data collection costs).
- Assumptions (e.g., full observability, continuity, discrete vs. continuous action spaces, etc.).
Different domains place different demands on these trade-offs, so multiple RL algorithms exist to accommodate the variety of possible problem settings.
Sample Efficiency
- Definition: The number of samples (environment interactions) needed to achieve a good policy.
- Key Distinction: On-policy vs. Off-policy.
- On-policy algorithms (e.g., vanilla policy gradient):
- Must collect new data every time the policy changes (cannot reuse old data under a different policy).
- Often less sample-efficient.
- Off-policy algorithms (e.g., Q-learning):
- Can learn from data generated by any policy (or even random policies).
- More sample-efficient in principle.
Hence, if sample collection is expensive (e.g., real-world robotics), off-policy methods can be very appealing. Conversely, if simulation is extremely cheap (e.g., a fast game simulator), even less sample-efficient methods might be acceptable if they have other advantages (like simpler implementation or better stability).
Stability and Ease of Use
- Convergence: Does the algorithm converge to a stable solution?
- Many RL algorithms only converge under special conditions (e.g., tabular Q-learning is convergent, but Q-learning with neural networks may not always converge in theory).
- Practical Considerations:
- Hyperparameter tuning can be challenging (learning rate, exploration, discount factor, etc.).
- Policy Gradient methods:
- Directly optimize the true RL objective
- Can be stable but require a lot of samples (on-policy).
- Value-based methods:
- Involve fixed-point iteration for the value function
- Might diverge if function approximation is used incorrectly.
- Model-based methods:
- The model-fitting step converges (it is essentially supervised learning of the environment's dynamics)
- But no guarantee a better model yields a better final policy.
Other Common Assumptions
- Full Observability (Markov Property)
- Many algorithms assume the state $s_t$ is fully observed, or at least Markov.
- If real-world observations are partial (e.g., partially observed states), we need extra machinery (recurrent networks, belief states, etc.).
- Episodic Tasks
- Policy gradient approaches often assume the ability to reset the environment and collect episodic rollouts.
- Makes it easier to compute returns over a finite episode, e.g. $R(\tau) = \sum_{t=1}^{T} \gamma^{t-1} r_t$ (see the sketch after this list).
- Continuity / Smoothness
- Common in model-based RL (optimal control style) and some continuous value function methods.
- Helps with certain planning/optimization routines that rely on derivatives.
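A small sketch of the episodic-return computation referenced above (the reward list and discount factor are arbitrary):

```python
import numpy as np

# With episodic rollouts, the (discounted) return at each time step is just a
# finite backward sum over the recorded episode rewards: G_t = r_t + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))   # [0.81, 0.9, 1.0]
```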
Putting It All Together
- Sample Efficiency: On-policy methods like policy gradient are often the least sample-efficient, while off-policy or model-based methods can be more sample-efficient.
- Stability: Pure policy gradient optimizes the actual RL objective but can have high variance and might need many samples to stabilize. Value-based methods and actor-critic can be trickier to converge but might learn faster in practice under certain conditions.
- Computation vs. Data Cost: If simulation is cheap (e.g., video games), purely on-policy, gradient-based methods can still be fine. If data collection is expensive (real robotics), off-policy or model-based approaches can shine.
- Algorithmic Assumptions:
- Full observability vs. partial observability.
- Discrete vs. continuous actions.
- Episodic vs. infinite-horizon tasks.
- Smooth dynamics vs. complicated, stochastic transitions.
Depending on the task at hand, no single method will be universally optimal. The choice of algorithm depends heavily on:
- How expensive data collection is.
- Whether the task is episodic or continuing.
- The level of noise or partial observability.
- The desired convergence properties or theoretical guarantees.
Summary
- Model-based RL: Focuses on learning and leveraging the transition model.
- Value-based RL: Focuses on learning $V(s)$ or $Q(s, a)$ and deriving the policy via $\arg\max_a Q(s, a)$.
- Policy Gradients: Direct optimization of a parameterized policy.
- Actor-Critic: Combines value function learning (critic) with direct policy updates (actor).
All these methods share the common objective of maximizing cumulative reward, but they differ in how they use data, how they represent policies, and whether or not they learn a transition model.