
Bhavit Sharma

Imitation Learning

Introduction

In imitation learning, we have a dataset of expert demonstrations and we want to learn a policy that mimics the expert's behavior, i.e. $\pi_{\theta}(a_t \mid s_t) \approx \pi_{*}(a_t \mid s_t)$.

The Distribution Shift Problem

We train an RL policy under $p_{data}(o_t)$, where $o_t$ is the "observation" of a state $s_t$ (in some cases they will be the same, but not always). For imitation learning we are interested in maximizing the likelihood $\mathcal{L}_{\theta} = \prod_{t} \pi_{\theta}(a_t \mid o_t)$, which is equivalent to solving $\max_{\theta} \mathbb{E}_{o_t \sim p_{data}(o_t)} \left[ \log \pi_{\theta}(a_t \mid o_t) \right]$.
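To make this objective concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture): it assumes a discrete action space, with `logits` produced by some policy network $\pi_\theta$ and `expert_actions` taken from the demonstration dataset; both names are hypothetical.

```python
import numpy as np

def behavioral_cloning_nll(logits, expert_actions):
    """Average negative log-likelihood  -(1/N) * sum_t log pi_theta(a_t | o_t).

    logits:         (N, A) unnormalized action scores from a policy network pi_theta
    expert_actions: (N,)   integer expert actions a_t, one per observation o_t
    """
    # numerically stable log-softmax over the action dimension
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # pick out log pi_theta(a_t | o_t) for the expert's action at each step
    ll = log_probs[np.arange(len(expert_actions)), expert_actions]
    return -ll.mean()

# toy usage: 5 observations, 3 actions
rng = np.random.default_rng(0)
print(behavioral_cloning_nll(rng.normal(size=(5, 3)), rng.integers(0, 3, size=5)))
```

Minimizing this quantity with respect to the network parameters is exactly the maximum-likelihood (behavioral cloning) objective above.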

We can simplify this and write a cost function that we have to minimize. Let $c(s_t, a_t)$ be

$$c(s_t, a_t) = \begin{cases} 0 & \text{if } a_t = \pi^{*}(s_t),\\ 1 & \text{otherwise.} \end{cases}$$

Then, we need to minimize $\mathbb{E}_{s_t \sim p_{\pi_{\theta}}(s_t)} \left[ c(s_t, a_t) \right]$, i.e. minimize the number of mistakes the policy makes under its own state distribution.
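Note that this expectation is taken over states visited by the learned policy itself, not over the expert's states. A hedged Monte Carlo sketch of estimating it (the trajectory format and the `expert_policy` callable are assumptions for illustration):

```python
def expected_mistake_rate(rollouts, expert_policy):
    """Monte Carlo estimate of E_{s_t ~ p_{pi_theta}(s_t)}[ c(s_t, a_t) ].

    rollouts:      list of trajectories, each a list of (s_t, a_t) pairs
                   collected by running the *learned* policy pi_theta
    expert_policy: callable s -> a, the deterministic expert pi*
    """
    costs = [
        float(a_t != expert_policy(s_t))   # c(s_t, a_t) = 1 iff a_t differs from pi*(s_t)
        for trajectory in rollouts
        for (s_t, a_t) in trajectory
    ]
    return sum(costs) / len(costs)

# toy usage: one 3-step rollout over integer states, with a dummy expert
print(expected_mistake_rate([[(0, 0), (1, 1), (2, 1)]], expert_policy=lambda s: s % 2))
```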

Some Analysis

Assume: $\pi_{\theta}(a \neq \pi_{*}(s) \mid s) \leq \epsilon \quad \forall s \in \mathcal{D}_{train}$. Let's say $T$ is the maximum length of the trajectory. Then the expected total cost is bounded as:

$$\mathbb{E}\!\Bigl[\sum_{t} c\bigl(s_{t}, a_{t}\bigr)\Bigr] \;\le\; \epsilon T \;+\; \bigl(1 - \epsilon\bigr)\Bigl(\epsilon\bigl(T - 1\bigr) \;+\; \bigl(1 - \epsilon\bigr)(\dots)\Bigr).$$

We're taking the expectation of the total cost. Since we're using an imitation policy, if we make a mistake at time $t$ then on average we'll keep making mistakes for the rest of the trajectory. So a mistake at the first timestep contributes $\epsilon T$ mistakes on average; a mistake at the second timestep contributes $(1 - \epsilon)\,\epsilon\,(T-1)$, and so on.

There are $T$ terms in the sum and each is $O(\epsilon T)$, so the total cost is $O(\epsilon T^2)$.
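This scaling is easy to check numerically under the pessimistic model used in the argument, where the first mistake happens at a geometric random time and every step after it is also a mistake. The sketch below is only an illustration; the value of `eps`, the horizons, and the geometric model are arbitrary choices for this toy check.

```python
import numpy as np

def expected_mistakes(eps, T, n_episodes=200_000, seed=0):
    """Pessimistic model: the first mistake happens at a geometric time,
    and every step from then on is also a mistake."""
    rng = np.random.default_rng(seed)
    first_err = rng.geometric(eps, size=n_episodes) - 1   # 0-based step of the first mistake
    mistakes = np.clip(T - first_err, 0, None)            # mistakes made within the horizon
    return mistakes.mean()

eps = 0.01
for T in (10, 20, 40, 80):
    print(f"T={T:3d}  E[mistakes]={expected_mistakes(eps, T):6.2f}  eps*T^2={eps * T * T:5.2f}")
# The empirical count grows roughly quadratically in T, matching the O(eps*T^2) bound.
```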

More general case

We'll say $s \sim p_{train}(s)$, and in fact it is enough to assume $\mathbb{E}_{p_{train}(s)}\left[ \pi_{\theta}(a \neq \pi_{*}(s) \mid s) \right] \leq \epsilon$, i.e.

$$\mathbb{E}_{p_{\pi^*}(s)}\Bigl[\pi_{\theta}\bigl(a \neq \pi^*(s)\mid s\bigr)\Bigr] \;=\; \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{p_{\pi^*}(s_{t})}\Bigl[\pi_{\theta}\bigl(a_{t} \neq \pi^*(s_{t}) \mid s_{t}\bigr)\Bigr] \;\le\; \epsilon.$$

If $p_{train}(s) \neq p_{\theta}(s)$, we can write:

$$p_{\theta}(s_t) \;=\; (1-\epsilon)^{t}\,p_{\text{train}}(s_t) \;+\; \bigl(1 - (1-\epsilon)^{t}\bigr)\,p_{\text{mistake}}(s_t).$$

$$\bigl|\,p_{\theta}(s_t) - p_{\text{train}}(s_t)\bigr| \;=\; \bigl(1 - (1-\epsilon)^{t}\bigr)\,\bigl|\,p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t)\bigr| \;\le\; 2\,\bigl(1 - (1-\epsilon)^{t}\bigr) \;\le\; 2\,\epsilon\,t.$$

The absolute-value term measures the difference between the two distributions; summed over states, it is the total variation distance.

A useful identity: $(1-\epsilon)^{t} \;\ge\; 1 - \epsilon\,t$ for $\epsilon \in [0,1]$. Using it, the sum of expected costs can be bounded:

$$\sum_{t} \mathbb{E}_{p_{\theta}(s_t)}\!\bigl[c_{t}\bigr] \;=\; \sum_{t}\sum_{s_t} p_{\theta}(s_t)\,c_{t}(s_t) \;\le\; \sum_{t}\sum_{s_t} p_{\text{train}}(s_t)\,c_{t}(s_t) \;+\; \sum_{t} \bigl|\,p_{\theta}(s_t) - p_{\text{train}}(s_t)\bigr|\,c_{\max}$$

$$\le\; \sum_{t}\sum_{s_t} p_{\text{train}}(s_t)\,c_{t}(s_t) \;+\; \sum_{t} 2\,\bigl(1-(1-\epsilon)^{t}\bigr)\,c_{\max} \;\le\; \epsilon\,T \;+\; \sum_{t} 2\,\epsilon\,t\,c_{\max} \;\le\; \epsilon\,T \;+\; 2\,\epsilon\,T^{2} \;=\; O\bigl(\epsilon\,T^{2}\bigr),$$

where the first term is at most $\epsilon T$ by the training-error assumption and $c_{\max} \le 1$ for the 0-1 cost.
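The two elementary inequalities doing the work here, $(1-\epsilon)^{t} \ge 1 - \epsilon t$ and hence $2\bigl(1-(1-\epsilon)^{t}\bigr) \le 2\epsilon t$, can be sanity-checked numerically; the grid below is arbitrary and just a quick verification sketch.

```python
import numpy as np

eps_grid = np.linspace(0.0, 1.0, 101)   # epsilon in [0, 1]
t_grid = np.arange(0, 51)               # integer time steps t

E, Tt = np.meshgrid(eps_grid, t_grid)
lhs = (1.0 - E) ** Tt                   # (1 - eps)^t

# useful identity: (1 - eps)^t >= 1 - eps * t  (Bernoulli's inequality)
assert np.all(lhs >= 1.0 - E * Tt - 1e-12)

# hence the distribution-shift bound 2 * (1 - (1 - eps)^t) <= 2 * eps * t
assert np.all(2.0 * (1.0 - lhs) <= 2.0 * E * Tt + 1e-12)
print("both inequalities hold on the grid")
```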

Alternative Distribution Shift Analysis

This is taken directly (along with my notes) from the papers [@ross2011reduction] and [@ross2010efficient], including their supplementary material.

Below is a structured explanation of the paragraph and its notation in the sequential decision-making context (based on Puterman 1994).


1. Expert’s policy

  • Expert’s policy: Denoted by $\pi^*$.
    • What it is: This is the policy we want to mimic (or imitate).
    • Key property: It is assumed to be deterministic, so for any state $s$, the expert’s action is $\pi^*(s)$.

2. Policy under consideration

  • Generic policy: Denoted by $\pi$ (it could be stochastic).
    • Distribution over actions in state $s$: Written as $\pi_s$ or sometimes $\pi(a \mid s)$.
      • This means that in state $s$, the policy $\pi$ picks actions according to the probability distribution $\pi_s$.

3. Task horizon

  • Horizon: Denoted by $T$.
    • Meaning: This represents the length of the task or the number of time steps we consider in the sequential decision-making problem.

4. Cost function

  • Immediate cost: Denoted by $C(s,a)$.

    • Range: $C(s,a) \in [0,1]$ (i.e., it is bounded between 0 and 1).
    • Meaning: The immediate cost (or penalty) incurred for taking action $a$ in state $s$.
  • Expected immediate cost under policy $\pi$:

    $C_{\pi}(s) \;=\; \mathbb{E}_{a\sim\pi_s}\bigl[C(s,a)\bigr].$
    • Explanation: For a potentially stochastic policy $\pi$ at state $s$, we first draw an action $a$ according to $\pi_s$, then measure the cost $C(s,a)$, and finally take the expectation over that randomness.

5. 0-1 loss for imitation learning

  • Indicator-based 0-1 loss:

    $e(s,a) \;=\; \mathbf{I}\bigl(a \neq \pi^*(s)\bigr),$

    where $\mathbf{I}(\cdot)$ is 1 if the condition is true, and 0 otherwise.

    • Interpretation: If in state $s$ you take an action $a$ that differs from the expert’s action $\pi^*(s)$, you incur a 0-1 loss of 1; otherwise 0.
  • Expected 0-1 loss under policy $\pi$:

    $e_{\pi}(s) \;=\; \mathbb{E}_{a\sim\pi_s}\bigl[e(s,a)\bigr].$
    • Meaning: The probability (under $\pi_s$) that $\pi$ picks a different action than $\pi^*$ in state $s$.

6. State distributions

  • State distribution at time $i$ (following policy $\pi$): Denoted by

    $d_{\pi}^i.$
    • Explanation: If you start at time step 1 (with some initial distribution over states) and follow policy $\pi$ for $i - 1$ steps, you end up with a distribution of states at time step $i$, which is $d_{\pi}^i$.
  • Average state distribution under $\pi$ over $T$ time steps:

    $d_{\pi} \;=\; \frac{1}{T}\sum_{i=1}^T d_{\pi}^i.$
    • Meaning: This describes how often you visit each state on average (frequency) when following policy $\pi$ for $T$ steps.

7. Total (T-step) cost

  • Definition: $J(\pi) \;=\; T\;\mathbb{E}_{s \sim d_{\pi}}\bigl[C_{\pi}(s)\bigr]$ (a small numerical sketch of $d_{\pi}$ and $J(\pi)$ appears after the summary at the end of this list).
    • Interpretation: Over $T$ time steps, the total cost of following policy $\pi$ is computed by:
      1. Taking the average distribution $d_{\pi}$ of states visited by $\pi$,
      2. Measuring the expected immediate cost $C_{\pi}(s)$ at those states,
      3. Multiplying by $T$ to get the total cost over the whole horizon.

8. Regret with respect to a policy class

  • Policy class: Denoted by $\Pi$ (a set of possible policies).

  • Regret of a policy $\pi$ w.r.t. the best policy in $\Pi$:

    $R_{\Pi}(\pi) \;=\; J(\pi) \;-\; \min_{\pi' \in \Pi} J(\pi').$
    • Meaning: This measures how much “worse” $\pi$ is, in total cost, compared to the best policy in the class $\Pi$.
  • Assumption (often): $\pi^* \in \Pi$ and $R_{\Pi}(\pi^*)$ is $O(1)$ for large $T$.

    • Interpretation: The expert’s policy is assumed to be in the set $\Pi$, and it has negligible regret for large horizon $T$.

9. Supervised learning approach to imitation

  • Traditional approach:

    • Goal: Minimize the 0-1 loss under the expert’s state distribution $d_{\pi^*}$, i.e. $\hat{\pi} \;=\; \arg\min_{\pi \in \Pi}\;\mathbb{E}_{s\sim d_{\pi^*}}\bigl[e_{\pi}(s)\bigr].$
      • This trains a classifier (policy) so it mimics $\pi^*$ on the states that $\pi^*$ visits.
  • Error assumption: If the learned policy $\hat{\pi}$ makes a mistake with probability $\epsilon$ under $d_{\pi^*}$, i.e.

    $\mathbb{E}_{s\sim d_{\pi^*}}\bigl[e_{\hat{\pi}}(s)\bigr] \;=\; \epsilon,$

    then the paper provides a generalization bound or guarantee on how this translates to overall performance.


In summary, the paper sets up notation for cost functions, the 0-1 loss for imitation, and the concept of distribution shift, highlighting how training under $d_{\pi^*}$ can differ from deployment under $d_{\hat{\pi}}$, which is one of the central challenges in imitation learning.
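To make the notation concrete, here is a small self-contained sketch that computes $d_{\pi}^i$, $d_{\pi}$, $C_{\pi}(s)$, and $J(\pi)$ exactly in a toy 3-state, 2-action MDP. All of the numbers (transitions, costs, policy, horizon) are made up for illustration; only the formulas mirror the definitions above.

```python
import numpy as np

# Toy MDP (all numbers arbitrary): 3 states, 2 actions.
# P[a, s, s'] = transition probability; C[s, a] = immediate cost in [0, 1].
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # dynamics under action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # dynamics under action 1
])
C = np.array([[0.0, 0.2], [0.5, 0.1], [1.0, 0.3]])
pi = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])        # stochastic policy pi(a | s)
d1 = np.array([1.0, 0.0, 0.0])                             # initial state distribution
T = 10

# Markov chain induced by pi: P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
P_pi = np.einsum("sa,ast->st", pi, P)

# d_pi^i for i = 1..T (distribution after following pi for i-1 steps), then the average d_pi
d_i = [d1]
for _ in range(T - 1):
    d_i.append(d_i[-1] @ P_pi)
d_pi = np.mean(d_i, axis=0)

C_pi = (pi * C).sum(axis=1)        # C_pi(s) = E_{a ~ pi_s}[ C(s, a) ]
J = T * float(d_pi @ C_pi)         # J(pi) = T * E_{s ~ d_pi}[ C_pi(s) ]
print("d_pi =", np.round(d_pi, 3), "  J(pi) =", round(J, 3))
```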

Below is an explanation of Theorem 2.1 and its proof, with detailed notation and intuitive interpretation (in bullet points).


  • Traditional Imitation Learning Objective

    • We have an expert’s policy denoted by $\pi^*$ (assume it is deterministic).
    • We collect states by following $\pi^*$ and obtain a state distribution $d_{\pi^*}$.
    • We train a classifier (or policy) $\hat{\pi}$ to minimize the 0-1 loss under $d_{\pi^*}$: $\hat{\pi} \;=\; \arg\min_{\pi \in \Pi}\;\mathbb{E}_{s \sim d_{\pi^*}}\bigl[e_{\pi}(s)\bigr],$ where $e_{\pi}(s) \;=\; \mathbb{P}_{a \sim \pi_s}[\,a \neq \pi^*(s)\,].$
  • Definition of $\varepsilon$ (error under the expert’s distribution)

    • Suppose the learned policy $\hat{\pi}$ makes a mistake (differs from $\pi^*$) with probability $\varepsilon$ under $d_{\pi^*}$: $\mathbb{E}_{s \sim d_{\pi^*}}\bigl[e_{\hat{\pi}}(s)\bigr] \;=\; \varepsilon.$
    • Intuitively, $\varepsilon$ is how often $\hat{\pi}$ disagrees with the expert on states the expert visits.
  • Cost Function and Total Cost

    • We denote the immediate cost by $C(s,a)$ and define $J(\pi) \;=\; T\;\mathbb{E}_{s \sim d_\pi}\bigl[C_\pi(s)\bigr],$ where $C_\pi(s) \;=\; \mathbb{E}_{a \sim \pi_s}[\,C(s,a)\,]$ and $d_{\pi} \;=\; \frac{1}{T}\sum_{i=1}^T d_{\pi}^i.$
    • In this theorem, the cost function $C$ is chosen (or related) so that “making a mistake relative to $\pi^*$” is the main concern (though the proof outlines a bounding argument in terms of 0-1 mistakes).
  • Theorem 2.1

    • Statement:

      If $\hat{\pi}$ satisfies $\mathbb{E}_{s \sim d_{\pi^*}}\bigl[e_{\hat{\pi}}(s)\bigr] \le \varepsilon$, then

      $J(\hat{\pi}) \;\le\; J(\pi^*) \;+\; T^2\,\varepsilon.$
    • Interpretation: If the learned policy $\hat{\pi}$ has a small error probability $\varepsilon$ under $d_{\pi^*}$, then its total cost $J(\hat{\pi})$ is not much worse than $J(\pi^*)$. The “extra cost” is at most $T^2\,\varepsilon$.
  • Outline of the Proof

    1. Set up per-time-step error:

      • Let $\varepsilon_i \;=\; \mathbb{E}_{s \sim d_{\pi^*}^i}\bigl[e_{\hat{\pi}}(s)\bigr]$ be the probability of a mistake by $\hat{\pi}$ at time step $i$ under the distribution $d_{\pi^*}^i$ (i.e., the states visited by the expert at time $i$).
      • The overall $\varepsilon$ is the average of these per-step mistake probabilities: $\varepsilon \;=\; \frac{1}{T}\sum_{i=1}^T \varepsilon_i.$
    2. Track $p_t$ = probability of no mistake so far:

      • Define $p_t$ as the probability that $\hat{\pi}$ has not made any mistake (w.r.t. $\pi^*$) in the first $t$ steps.
      • The proof then considers two distributions for time $t$:
        • $d_t$: the distribution of states conditional on $\hat{\pi}$ having made no mistakes up to time $t$.
        • $d'_t$: the distribution of states conditional on $\hat{\pi}$ having made at least one mistake in the first $t-1$ steps.
      • They combine these to relate the cost of $\hat{\pi}$ to that of $\pi^*$.
    3. Relate the cost under $d_t$ to the cost under $\pi^*$:

      • If $\hat{\pi}$ hasn’t made a mistake yet, the cost remains similar to that of $\pi^*$ (since $\hat{\pi}$ and $\pi^*$ have been taking the same actions so far).
      • If $\hat{\pi}$ has made a mistake, the proof bounds the immediate cost by at most 1 (or relates it to a small “one-time” penalty).
    4. Bounding total cost:

      • They use the chain of inequalities: $J(\hat{\pi}) \;\le\; \sum_{t=1}^T \bigl[\, p_{t-1}\,\mathbb{E}_{s \sim d_t}[C_{\hat{\pi}}(s)] \;+\; (1 - p_{t-1}) \,\bigr].$
        • The term $p_{t-1}\,\mathbb{E}_{s \sim d_t}[C_{\hat{\pi}}(s)]$ is the cost if no mistakes have been made so far.
        • The term $(1 - p_{t-1})$ is a conservative bound (at most 1) if a mistake has occurred.
      • They connect $p_{t-1}$ to the mistake probabilities $\varepsilon_i$ and sum over all time steps.
    5. Key identity $p_{t-1}\,e_t + (1 - p_{t-1})\,e'_t \;\le\; \varepsilon_t$:

      • This expresses that the overall mistake probability at time $t$, $\varepsilon_t$, can be decomposed by whether or not a mistake has happened in earlier steps.
      • They use this to show $p_t \ge p_{t-1} - \varepsilon_t$ and eventually sum the bounds to get the final result.
    6. Conclusion:

      • After carefully bounding the total cost $J(\hat{\pi})$, they find $J(\hat{\pi}) \;\le\; J(\pi^*) \;+\; T^2\,\varepsilon,$ which completes the proof.
  • High-Level Intuition

    • If $\hat{\pi}$ rarely disagrees with $\pi^*$ (i.e., has a small $\varepsilon$ under $d_{\pi^*}$), then the only way it can incur a large total cost is if those rare mistakes cause large deviations into costly states.
    • The factor $T^2$ arises because one mistake might shift the state distribution away from the expert’s, and that shift can persist. In the worst case, each of those mistakes can cost up to $T$ in subsequent steps, hence the $T^2\,\varepsilon$ bound (a small simulation of such a worst case follows below).
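To see why the $T^2$ factor is essentially unavoidable, consider a worst-case setup in the spirit of the tightrope example (this is my own simplified construction, not the paper's exact one): the expert never incurs cost, while the learner disagrees with the expert with probability $\varepsilon$ on expert-like states and, after its first disagreement, is stuck in cost-1 states for the rest of the horizon. The sketch below compares the resulting $J(\hat{\pi})$ with the $J(\pi^*) + T^2\varepsilon$ bound.

```python
import numpy as np

def j_hat(eps, T):
    """Exact total cost of the learner in this worst case: the learner pays
    cost 1 at step t iff it has disagreed with the expert at least once by t."""
    t = np.arange(1, T + 1)
    return float(np.sum(1.0 - (1.0 - eps) ** t))

eps = 0.01
for T in (10, 50, 100):
    print(f"T={T:4d}  J(pi_hat)={j_hat(eps, T):7.2f}  J(pi*) + T^2*eps = {T * T * eps:7.2f}")
# J(pi*) = 0 here, and J(pi_hat) grows on the order of eps*T^2 (about eps*T^2/2 for
# small eps*T), so the theorem's T^2*eps term has the right order of growth.
```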
