
Bhavit Sharma

Imitation Learning

Introduction

In imitation learning, we have a dataset of expert demonstrations and we want to learn a policy that mimics the expert's behavior, i.e. $\pi_{\theta}(a_t \mid s_t) \approx \pi_{*}(a_t \mid s_t)$.

The Distribution Shift Problem

We train an RL policy under $p_{data}(o_t)$, where $o_t$ is the "observation" of a state $s_t$ (in some cases they will be the same, but not always). For imitation learning we are interested in maximizing the likelihood $\mathcal{L}_{\theta} = \prod_{t} \pi_{\theta}(a_t \mid o_t)$, which is equivalent to solving $\max_{\theta} \mathbb{E}_{o_t \sim p_{data}(o_t)} \left[ \log \pi_{\theta}(a_t \mid o_t) \right]$.
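To make this objective concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture): it assumes a discrete action space, with `logits` produced by some policy network $\pi_\theta$ and `expert_actions` taken from the demonstration dataset; both names are hypothetical.

```python
import numpy as np

def behavioral_cloning_nll(logits, expert_actions):
    """Average negative log-likelihood  -(1/N) * sum_t log pi_theta(a_t | o_t).

    logits:         (N, A) unnormalized action scores from a policy network pi_theta
    expert_actions: (N,)   integer expert actions a_t, one per observation o_t
    """
    # numerically stable log-softmax over the action dimension
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # pick out log pi_theta(a_t | o_t) for the expert's action at each step
    ll = log_probs[np.arange(len(expert_actions)), expert_actions]
    return -ll.mean()

# toy usage: 5 observations, 3 actions
rng = np.random.default_rng(0)
print(behavioral_cloning_nll(rng.normal(size=(5, 3)), rng.integers(0, 3, size=5)))
```

Minimizing this quantity with respect to the network parameters is exactly the maximum-likelihood (behavioral cloning) objective above.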

We can simplify this and write a cost function that we have to minimize. Let $c(s_t, a_t)$ be

$$c(s_t, a_t) = \begin{cases} 0 & \text{if } a_t = \pi^{*}(s_t),\\ 1 & \text{otherwise.} \end{cases}$$

Then, we need to minimize $\mathbb{E}_{s_t \sim p_{\pi_{\theta}}(s_t)} \left[ c(s_t, a_t) \right]$, i.e. minimize the number of mistakes the policy makes under its own state distribution.
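Note that this expectation is taken over states visited by the learned policy itself, not over the expert's states. A hedged Monte Carlo sketch of estimating it (the trajectory format and the `expert_policy` callable are assumptions for illustration):

```python
def expected_mistake_rate(rollouts, expert_policy):
    """Monte Carlo estimate of E_{s_t ~ p_{pi_theta}(s_t)}[ c(s_t, a_t) ].

    rollouts:      list of trajectories, each a list of (s_t, a_t) pairs
                   collected by running the *learned* policy pi_theta
    expert_policy: callable s -> a, the deterministic expert pi*
    """
    costs = [
        float(a_t != expert_policy(s_t))   # c(s_t, a_t) = 1 iff a_t differs from pi*(s_t)
        for trajectory in rollouts
        for (s_t, a_t) in trajectory
    ]
    return sum(costs) / len(costs)

# toy usage: one 3-step rollout over integer states, with a dummy expert
print(expected_mistake_rate([[(0, 0), (1, 1), (2, 1)]], expert_policy=lambda s: s % 2))
```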

Some Analysis

Assume: $\pi_{\theta}(a \neq \pi_{*}(s) \mid s) \leq \epsilon \quad \forall s \in \mathcal{D}_{train}$. Let's say $T$ is the maximum length of the trajectory. Then the expected total cost is bounded as:

$$\mathbb{E}\!\Bigl[\sum_{t} c\bigl(s_{t}, a_{t}\bigr)\Bigr] \;\le\; \epsilon T \;+\; \bigl(1 - \epsilon\bigr)\Bigl(\epsilon\bigl(T - 1\bigr) \;+\; \bigl(1 - \epsilon\bigr)(\dots)\Bigr).$$

We're taking the expectation of the total cost. Since we're using an imitation policy, if we make a mistake at time $t$ then on average we'll keep making mistakes for the rest of the trajectory. So a mistake at the first timestep contributes $\epsilon T$ mistakes on average; a mistake at the second timestep contributes $(1 - \epsilon)\,\epsilon\,(T-1)$, and so on.

There are $T$ terms in the sum and each is $O(\epsilon T)$, so the total cost is $O(\epsilon T^2)$.
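This scaling is easy to check numerically under the pessimistic model used in the argument, where the first mistake happens at a geometric random time and every step after it is also a mistake. The sketch below is only an illustration; the value of `eps`, the horizons, and the geometric model are arbitrary choices for this toy check.

```python
import numpy as np

def expected_mistakes(eps, T, n_episodes=200_000, seed=0):
    """Pessimistic model: the first mistake happens at a geometric time,
    and every step from then on is also a mistake."""
    rng = np.random.default_rng(seed)
    first_err = rng.geometric(eps, size=n_episodes) - 1   # 0-based step of the first mistake
    mistakes = np.clip(T - first_err, 0, None)            # mistakes made within the horizon
    return mistakes.mean()

eps = 0.01
for T in (10, 20, 40, 80):
    print(f"T={T:3d}  E[mistakes]={expected_mistakes(eps, T):6.2f}  eps*T^2={eps * T * T:5.2f}")
# The empirical count grows roughly quadratically in T, matching the O(eps*T^2) bound.
```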

More general case

We'll say $s \sim p_{train}(s)$, and in fact it is enough to assume $\mathbb{E}_{p_{train}(s)}\left[ \pi_{\theta}(a \neq \pi_{*}(s) \mid s) \right] \leq \epsilon$, i.e.

$$\mathbb{E}_{p_{\pi^*}(s)}\Bigl[\pi_{\theta}\bigl(a \neq \pi^*(s)\mid s\bigr)\Bigr] \;=\; \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{p_{\pi^*}(s_{t})}\Bigl[\pi_{\theta}\bigl(a_{t} \neq \pi^*(s_{t}) \mid s_{t}\bigr)\Bigr] \;\le\; \epsilon.$$

If $p_{train}(s) \neq p_{\theta}(s)$, we can write:

$$p_{\theta}(s_t) \;=\; (1-\epsilon)^{t}\,p_{\text{train}}(s_t) \;+\; \bigl(1 - (1-\epsilon)^{t}\bigr)\,p_{\text{mistake}}(s_t).$$

$$\bigl|\,p_{\theta}(s_t) - p_{\text{train}}(s_t)\bigr| \;=\; \bigl(1 - (1-\epsilon)^{t}\bigr)\,\bigl|\,p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t)\bigr| \;\le\; 2\,\bigl(1 - (1-\epsilon)^{t}\bigr) \;\le\; 2\,\epsilon\,t.$$

The absolute-value term measures the difference between the two distributions; summed over states, it is the total variation distance.

A useful identity: $(1-\epsilon)^{t} \;\ge\; 1 - \epsilon\,t$ for $\epsilon \in [0,1]$. Using it, the sum of expected costs can be bounded:

$$\sum_{t} \mathbb{E}_{p_{\theta}(s_t)}\!\bigl[c_{t}\bigr] \;=\; \sum_{t}\sum_{s_t} p_{\theta}(s_t)\,c_{t}(s_t) \;\le\; \sum_{t}\sum_{s_t} p_{\text{train}}(s_t)\,c_{t}(s_t) \;+\; \sum_{t} \bigl|\,p_{\theta}(s_t) - p_{\text{train}}(s_t)\bigr|\,c_{\max}$$

$$\le\; \sum_{t}\sum_{s_t} p_{\text{train}}(s_t)\,c_{t}(s_t) \;+\; \sum_{t} 2\,\bigl(1-(1-\epsilon)^{t}\bigr)\,c_{\max} \;\le\; \epsilon\,T \;+\; \sum_{t} 2\,\epsilon\,t\,c_{\max} \;\le\; \epsilon\,T \;+\; 2\,\epsilon\,T^{2} \;=\; O\bigl(\epsilon\,T^{2}\bigr),$$

where the first term is at most $\epsilon T$ by the training-error assumption and $c_{\max} \le 1$ for the 0-1 cost.
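The two elementary inequalities doing the work here, $(1-\epsilon)^{t} \ge 1 - \epsilon t$ and hence $2\bigl(1-(1-\epsilon)^{t}\bigr) \le 2\epsilon t$, can be sanity-checked numerically; the grid below is arbitrary and just a quick verification sketch.

```python
import numpy as np

eps_grid = np.linspace(0.0, 1.0, 101)   # epsilon in [0, 1]
t_grid = np.arange(0, 51)               # integer time steps t

E, Tt = np.meshgrid(eps_grid, t_grid)
lhs = (1.0 - E) ** Tt                   # (1 - eps)^t

# useful identity: (1 - eps)^t >= 1 - eps * t  (Bernoulli's inequality)
assert np.all(lhs >= 1.0 - E * Tt - 1e-12)

# hence the distribution-shift bound 2 * (1 - (1 - eps)^t) <= 2 * eps * t
assert np.all(2.0 * (1.0 - lhs) <= 2.0 * E * Tt + 1e-12)
print("both inequalities hold on the grid")
```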

Alternative Distribution Shift Analysis

This is taken directly (along with my notes) from the papers [@ross2011reduction] and [@ross2010efficient], including their supplementary material.

Below is a structured explanation of the paragraph and its notation in the sequential decision-making context (based on Puterman 1994).


1. Expert’s policy

  • Expert’s policy: Denoted by $\pi^*$.
    • What it is: This is the policy we want to mimic (or imitate).
    • Key property: It is assumed to be deterministic, so for any state $s$, the expert’s action is $\pi^*(s)$.

2. Policy under consideration

  • Generic policy: Denoted by $\pi$ (it could be stochastic).
    • Distribution over actions in state $s$: Written as $\pi_s$ or sometimes $\pi(a \mid s)$.
      • This means that in state $s$, the policy $\pi$ picks actions according to the probability distribution $\pi_s$.

3. Task horizon

  • Horizon: Denoted by $T$.
    • Meaning: This represents the length of the task or the number of time steps we consider in the sequential decision-making problem.

4. Cost function

  • Immediate cost: Denoted by $C(s,a)$.

    • Range: $C(s,a) \in [0,1]$ (i.e., it is bounded between 0 and 1).
    • Meaning: The immediate cost (or penalty) incurred for taking action $a$ in state $s$.
  • Expected immediate cost under policy $\pi$:

    $C_{\pi}(s) \;=\; \mathbb{E}_{a\sim\pi_s}\bigl[C(s,a)\bigr].$
    • Explanation: For a potentially stochastic policy $\pi$ at state $s$, we first draw an action $a$ according to $\pi_s$, then measure the cost $C(s,a)$, and finally take the expectation over that randomness.

5. 0-1 loss for imitation learning

  • Indicator-based 0-1 loss:

    $e(s,a) \;=\; \mathbf{I}\bigl(a \neq \pi^*(s)\bigr),$

    where $\mathbf{I}(\cdot)$ is 1 if the condition is true, and 0 otherwise.

    • Interpretation: If in state $s$ you take an action $a$ that differs from the expert’s action $\pi^*(s)$, you incur a 0-1 loss of 1; otherwise 0.
  • Expected 0-1 loss under policy $\pi$:

    $e_{\pi}(s) \;=\; \mathbb{E}_{a\sim\pi_s}\bigl[e(s,a)\bigr].$
    • Meaning: The probability (under $\pi_s$) that $\pi$ picks a different action than $\pi^*$ in state $s$.

6. State distributions

  • State distribution at time $i$ (following policy $\pi$): Denoted by

    $d_{\pi}^i.$
    • Explanation: If you start at time step 1 (with some initial distribution over states) and follow policy $\pi$ for $i - 1$ steps, you end up with a distribution of states at time step $i$, which is $d_{\pi}^i$.
  • Average state distribution under $\pi$ over $T$ time steps:

    $d_{\pi} \;=\; \frac{1}{T}\sum_{i=1}^T d_{\pi}^i.$
    • Meaning: This describes how often you visit each state on average (frequency) when following policy $\pi$ for $T$ steps.

7. Total (T-step) cost

  • Definition: $J(\pi) \;=\; T\;\mathbb{E}_{s \sim d_{\pi}}\bigl[C_{\pi}(s)\bigr]$ (a small numerical sketch of $d_{\pi}$ and $J(\pi)$ appears after the summary at the end of this list).
    • Interpretation: Over $T$ time steps, the total cost of following policy $\pi$ is computed by:
      1. Taking the average distribution $d_{\pi}$ of states visited by $\pi$,
      2. Measuring the expected immediate cost $C_{\pi}(s)$ at those states,
      3. Multiplying by $T$ to get the total cost over the whole horizon.

8. Regret with respect to a policy class

  • Policy class: Denoted by $\Pi$ (a set of possible policies).

  • Regret of a policy $\pi$ w.r.t. the best policy in $\Pi$:

    $R_{\Pi}(\pi) \;=\; J(\pi) \;-\; \min_{\pi' \in \Pi} J(\pi').$
    • Meaning: This measures how much “worse” $\pi$ is, in total cost, compared to the best policy in the class $\Pi$.
  • Assumption (often): $\pi^* \in \Pi$ and $R_{\Pi}(\pi^*)$ is $O(1)$ for large $T$.

    • Interpretation: The expert’s policy is assumed to be in the set $\Pi$, and it has negligible regret for large horizon $T$.

9. Supervised learning approach to imitation

  • Traditional approach:

    • Goal: Minimize the 0-1 loss under the expert’s state distribution $d_{\pi^*}$, i.e. $\hat{\pi} \;=\; \arg\min_{\pi \in \Pi}\;\mathbb{E}_{s\sim d_{\pi^*}}\bigl[e_{\pi}(s)\bigr].$
      • This trains a classifier (policy) so it mimics $\pi^*$ on the states that $\pi^*$ visits.
  • Error assumption: If the learned policy $\hat{\pi}$ makes a mistake with probability $\epsilon$ under $d_{\pi^*}$, i.e.

    $\mathbb{E}_{s\sim d_{\pi^*}}\bigl[e_{\hat{\pi}}(s)\bigr] \;=\; \epsilon,$

    then the paper provides a generalization bound or guarantee on how this translates to overall performance.


In summary, the paper sets up notation for cost functions, the 0-1 loss for imitation, and the concept of distribution shift, highlighting how training under $d_{\pi^*}$ can differ from deployment under $d_{\hat{\pi}}$, which is one of the central challenges in imitation learning.
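To make the notation concrete, here is a small self-contained sketch that computes $d_{\pi}^i$, $d_{\pi}$, $C_{\pi}(s)$, and $J(\pi)$ exactly in a toy 3-state, 2-action MDP. All of the numbers (transitions, costs, policy, horizon) are made up for illustration; only the formulas mirror the definitions above.

```python
import numpy as np

# Toy MDP (all numbers arbitrary): 3 states, 2 actions.
# P[a, s, s'] = transition probability; C[s, a] = immediate cost in [0, 1].
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # dynamics under action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # dynamics under action 1
])
C = np.array([[0.0, 0.2], [0.5, 0.1], [1.0, 0.3]])
pi = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])        # stochastic policy pi(a | s)
d1 = np.array([1.0, 0.0, 0.0])                             # initial state distribution
T = 10

# Markov chain induced by pi: P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
P_pi = np.einsum("sa,ast->st", pi, P)

# d_pi^i for i = 1..T (distribution after following pi for i-1 steps), then the average d_pi
d_i = [d1]
for _ in range(T - 1):
    d_i.append(d_i[-1] @ P_pi)
d_pi = np.mean(d_i, axis=0)

C_pi = (pi * C).sum(axis=1)        # C_pi(s) = E_{a ~ pi_s}[ C(s, a) ]
J = T * float(d_pi @ C_pi)         # J(pi) = T * E_{s ~ d_pi}[ C_pi(s) ]
print("d_pi =", np.round(d_pi, 3), "  J(pi) =", round(J, 3))
```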

Below is an explanation of Theorem 2.1 and its proof, with detailed notation and intuitive interpretation (in bullet points).


  • Traditional Imitation Learning Objective

    • We have an expert’s policy denoted by $\pi^*$ (assume it is deterministic).
    • We collect states by following $\pi^*$ and obtain a state distribution $d_{\pi^*}$.
    • We train a classifier (or policy) $\hat{\pi}$ to minimize the 0-1 loss under $d_{\pi^*}$: $\hat{\pi} \;=\; \arg\min_{\pi \in \Pi}\;\mathbb{E}_{s \sim d_{\pi^*}}\bigl[e_{\pi}(s)\bigr],$ where $e_{\pi}(s) \;=\; \mathbb{P}_{a \sim \pi_s}[\,a \neq \pi^*(s)\,].$
  • Definition of $\varepsilon$ (error under the expert’s distribution)

    • Suppose the learned policy $\hat{\pi}$ makes a mistake (differs from $\pi^*$) with probability $\varepsilon$ under $d_{\pi^*}$: $\mathbb{E}_{s \sim d_{\pi^*}}\bigl[e_{\hat{\pi}}(s)\bigr] \;=\; \varepsilon.$
    • Intuitively, $\varepsilon$ is how often $\hat{\pi}$ disagrees with the expert on states the expert visits.
  • Cost Function and Total Cost

    • We denote the immediate cost by $C(s,a)$ and define $J(\pi) \;=\; T\;\mathbb{E}_{s \sim d_\pi}\bigl[C_\pi(s)\bigr],$ where $C_\pi(s) \;=\; \mathbb{E}_{a \sim \pi_s}[\,C(s,a)\,]$ and $d_{\pi} \;=\; \frac{1}{T}\sum_{i=1}^T d_{\pi}^i.$
    • In this theorem, the cost function $C$ is chosen (or related) so that “making a mistake relative to $\pi^*$” is the main concern (though the proof outlines a bounding argument in terms of 0-1 mistakes).
  • Theorem 2.1

    • Statement:

      If $\hat{\pi}$ satisfies $\mathbb{E}_{s \sim d_{\pi^*}}\bigl[e_{\hat{\pi}}(s)\bigr] \le \varepsilon$, then

      $J(\hat{\pi}) \;\le\; J(\pi^*) \;+\; T^2\,\varepsilon.$
    • Interpretation: If the learned policy $\hat{\pi}$ has a small error probability $\varepsilon$ under $d_{\pi^*}$, then its total cost $J(\hat{\pi})$ is not much worse than $J(\pi^*)$. The “extra cost” is at most $T^2\,\varepsilon$.
  • Outline of the Proof

    1. Set up per-time-step error:

      • Let $\varepsilon_i \;=\; \mathbb{E}_{s \sim d_{\pi^*}^i}\bigl[e_{\hat{\pi}}(s)\bigr]$ be the probability of a mistake by $\hat{\pi}$ at time step $i$ under the distribution $d_{\pi^*}^i$ (i.e., the states visited by the expert at time $i$).
      • The overall $\varepsilon$ is the average of these per-step mistake probabilities: $\varepsilon \;=\; \frac{1}{T}\sum_{i=1}^T \varepsilon_i.$
    2. Track $p_t$ = probability of no mistake so far:

      • Define $p_t$ as the probability that $\hat{\pi}$ has not made any mistake (w.r.t. $\pi^*$) in the first $t$ steps.
      • The proof then considers two distributions for time $t$:
        • $d_t$: the distribution of states conditional on $\hat{\pi}$ having made no mistakes up to time $t$.
        • $d'_t$: the distribution of states conditional on $\hat{\pi}$ having made at least one mistake in the first $t-1$ steps.
      • They combine these to relate the cost of $\hat{\pi}$ to that of $\pi^*$.
    3. Relate the cost under $d_t$ to the cost under $\pi^*$:

      • If $\hat{\pi}$ hasn’t made a mistake yet, the cost remains similar to that of $\pi^*$ (since $\hat{\pi}$ and $\pi^*$ have been taking the same actions so far).
      • If $\hat{\pi}$ has made a mistake, the proof bounds the immediate cost by at most 1 (or relates it to a small “one-time” penalty).
    4. Bounding total cost:

      • They use the chain of inequalities: $J(\hat{\pi}) \;\le\; \sum_{t=1}^T \bigl[\, p_{t-1}\,\mathbb{E}_{s \sim d_t}[C_{\hat{\pi}}(s)] \;+\; (1 - p_{t-1}) \,\bigr].$
        • The term $p_{t-1}\,\mathbb{E}_{s \sim d_t}[C_{\hat{\pi}}(s)]$ is the cost if no mistakes have been made so far.
        • The term $(1 - p_{t-1})$ is a conservative bound (at most 1) if a mistake has occurred.
      • They connect $p_{t-1}$ to the mistake probabilities $\varepsilon_i$ and sum over all time steps.
    5. Key identity $p_{t-1}\,e_t + (1 - p_{t-1})\,e'_t \;\le\; \varepsilon_t$:

      • This expresses that the overall mistake probability at time $t$, $\varepsilon_t$, can be decomposed by whether or not a mistake has happened in earlier steps.
      • They use this to show $p_t \ge p_{t-1} - \varepsilon_t$ and eventually sum the bounds to get the final result.
    6. Conclusion:

      • After carefully bounding the total cost $J(\hat{\pi})$, they find $J(\hat{\pi}) \;\le\; J(\pi^*) \;+\; T^2\,\varepsilon,$ which completes the proof.
  • High-Level Intuition

    • If $\hat{\pi}$ rarely disagrees with $\pi^*$ (i.e., has a small $\varepsilon$ under $d_{\pi^*}$), then the only way it can incur a large total cost is if those rare mistakes cause large deviations into costly states.
    • The factor $T^2$ arises because one mistake might shift the state distribution away from the expert’s, and that shift can persist. In the worst case, each of those mistakes can cost up to $T$ in subsequent steps, hence the $T^2\,\varepsilon$ bound (a small simulation of such a worst case follows below).
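To see why the $T^2$ factor is essentially unavoidable, consider a worst-case setup in the spirit of the tightrope example (this is my own simplified construction, not the paper's exact one): the expert never incurs cost, while the learner disagrees with the expert with probability $\varepsilon$ on expert-like states and, after its first disagreement, is stuck in cost-1 states for the rest of the horizon. The sketch below compares the resulting $J(\hat{\pi})$ with the $J(\pi^*) + T^2\varepsilon$ bound.

```python
import numpy as np

def j_hat(eps, T):
    """Exact total cost of the learner in this worst case: the learner pays
    cost 1 at step t iff it has disagreed with the expert at least once by t."""
    t = np.arange(1, T + 1)
    return float(np.sum(1.0 - (1.0 - eps) ** t))

eps = 0.01
for T in (10, 50, 100):
    print(f"T={T:4d}  J(pi_hat)={j_hat(eps, T):7.2f}  J(pi*) + T^2*eps = {T * T * eps:7.2f}")
# J(pi*) = 0 here, and J(pi_hat) grows on the order of eps*T^2 (about eps*T^2/2 for
# small eps*T), so the theorem's T^2*eps term has the right order of growth.
```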
