Imitation Learning
Introduction
In imitation learning, we have a dataset of expert demonstrations (observation-action pairs collected from an expert policy $\pi^*$) and we want to learn a policy $\pi_\theta(a \mid o)$ that mimics the expert's behavior, i.e. $\pi_\theta(a \mid o) \approx \pi^*(a \mid o)$.
The Distribution Shift Problem
We train a policy $\pi_\theta(a_t \mid o_t)$ under $p_{\mathrm{data}}(o_t)$, where $o_t$ is the "observation" of a state $s_t$ (in some cases they will be the same, but not always). At test time, however, the policy acts under its own distribution $p_{\pi_\theta}(o_t)$, which in general differs from $p_{\mathrm{data}}(o_t)$; this mismatch is the distribution shift problem. For imitation learning we're interested in maximizing the likelihood of the expert's actions under the training distribution:
$$\max_\theta \; \mathbb{E}_{o_t \sim p_{\mathrm{data}}(o_t)}\big[\log \pi_\theta(a_t \mid o_t)\big].$$
We can simplify this and write a cost function that we have to minimize. Let
$$c(s_t, a_t) = \begin{cases} 0 & \text{if } a_t = \pi^*(s_t) \\ 1 & \text{otherwise.} \end{cases}$$
Then, we need to minimize $\mathbb{E}_{s_t \sim p_{\pi_\theta}(s_t)}\big[c(s_t, a_t)\big]$, i.e. _minimize the number of mistakes the policy makes_ under its own state distribution.
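To make the train/test mismatch concrete, here is a minimal behavior cloning sketch (assuming a discrete action space and PyTorch; the network, dimensions, and data tensors are placeholders rather than anything from these notes):

```python
# Minimal behavior cloning sketch (assumed setup: discrete actions, PyTorch).
# `expert_obs` (float tensor, batch x obs_dim) and `expert_acts` (long tensor of
# action indices) stand in for a batch sampled from the expert dataset p_data.
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4  # assumed dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(expert_obs: torch.Tensor, expert_acts: torch.Tensor) -> float:
    """One gradient step on max_theta E_{o ~ p_data}[log pi_theta(a_expert | o)]."""
    logits = policy(expert_obs)                               # (batch, n_actions)
    loss = nn.functional.cross_entropy(logits, expert_acts)   # = negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Training only ever sees observations drawn from p_data (the expert's distribution).
# At test time the policy is rolled out, so the cost E_{s ~ p_pi_theta}[c(s, a)] is
# evaluated under the policy's *own* state distribution -- that mismatch is the
# distribution shift problem discussed above.
```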
Some Analysis
Assume: $\pi_\theta(a \neq \pi^*(s) \mid s) \leq \epsilon$ for all $s \in \mathcal{D}_{\mathrm{train}}$. Let's say $T$ is the max length of the trajectory. Then, our expected total cost is $\mathbb{E}\big[\sum_{t=1}^{T} c(s_t, a_t)\big]$.
We're taking the expectation of the total cost. Since we're using an imitation policy, if we make a mistake at time $t$ then on average we'll continue making mistakes for the rest of the trajectory (the states we drift into were never seen during training). So if we make the mistake at the first timestep, we incur about $\epsilon T$ mistakes on average; if we make the mistake at the second timestep, about $\epsilon (T-1)$ mistakes on average, and so on.
There are $T$ such terms in the sum and each is $O(\epsilon T)$. So, the total cost is $O(\epsilon T^2)$.
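Writing the sum out explicitly (each term is the expected number of mistakes contributed by an error at that timestep):
$$\mathbb{E}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big] \;\leq\; \epsilon T + \epsilon (T-1) + \dots + \epsilon \;=\; \epsilon \sum_{k=1}^{T} k \;=\; \epsilon\,\frac{T(T+1)}{2} \;=\; O(\epsilon T^2).$$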
More general case
We'll say $\pi_\theta(a \neq \pi^*(s) \mid s) \leq \epsilon$ for $s \sim p_{\mathrm{train}}(s)$, and it is actually enough to say: $\mathbb{E}_{s \sim p_{\mathrm{train}}(s)}\big[\pi_\theta(a \neq \pi^*(s) \mid s)\big] \leq \epsilon$, i.e. the mistake probability only has to be small in expectation over the training distribution.
If $p_{\mathrm{train}}(s) \neq p_\theta(s)$:
$$p_\theta(s_t) = (1-\epsilon)^t\, p_{\mathrm{train}}(s_t) + \big(1 - (1-\epsilon)^t\big)\, p_{\mathrm{mistake}}(s_t),$$
where the first term is the probability of having made no mistake in the first $t$ steps (so we are still on the training distribution) and $p_{\mathrm{mistake}}$ is some other, unknown distribution over the states reached after at least one mistake. Then
$$\big|p_\theta(s_t) - p_{\mathrm{train}}(s_t)\big| = \big(1 - (1-\epsilon)^t\big)\,\big|p_{\mathrm{mistake}}(s_t) - p_{\mathrm{train}}(s_t)\big| \;\leq\; 2\big(1 - (1-\epsilon)^t\big) \;\leq\; 2\epsilon t,$$
using $(1-\epsilon)^t \geq 1 - \epsilon t$ for $\epsilon \in [0,1]$. The absolute-value notation bounds the difference between the two distributions; summed over states it is the total variation distance.
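Completing the argument (a standard step, using $c_{\max} = 1$ for the 0-1 cost):
$$\sum_{t=1}^{T} \mathbb{E}_{p_\theta(s_t)}\big[c(s_t, a_t)\big] \;\leq\; \sum_{t=1}^{T} \Big( \mathbb{E}_{p_{\mathrm{train}}(s_t)}\big[c(s_t, a_t)\big] + c_{\max} \sum_{s_t} \big|p_\theta(s_t) - p_{\mathrm{train}}(s_t)\big| \Big) \;\leq\; \sum_{t=1}^{T} \big(\epsilon + 2\epsilon t\big) \;=\; O(\epsilon T^2).$$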
Alternative Distribution shift analysis
This is taken directly (along with my notes) from the papers [@ross2011reduction] and [@ross2010efficient] and their supplementary material.
Below is a structured explanation of that paragraph and its notation in the sequential decision-making setting (based on Puterman, 1994).
1. Expert’s policy
- Expert's policy: denoted by $\pi^*$.
- What it is: This is the policy we want to mimic (or imitate).
- Key property: It is assumed to be deterministic, so for any state $s$, the expert's action is $\pi^*(s)$.
2. Policy under consideration
- Generic policy: denoted by $\pi$ (it could be stochastic).
- Distribution over actions in state $s$: written as $\pi_s$ or sometimes $\pi(\cdot \mid s)$.
- This means that in state $s$, the policy picks actions according to the probability distribution $\pi_s$.
3. Task horizon
- Horizon: denoted by $T$.
- Meaning: This represents the length of the task or the number of time steps we consider in the sequential decision-making problem.
4. Cost function
- Immediate cost: denoted by $C(s, a)$.
- Range: $C(s, a) \in [0, 1]$ (i.e., it is bounded between 0 and 1).
- Meaning: The immediate cost (or penalty) incurred for taking action $a$ in state $s$.
- Expected immediate cost under policy $\pi$: $C_\pi(s) = \mathbb{E}_{a \sim \pi_s}\big[C(s, a)\big]$.
- Explanation: For a potentially stochastic policy $\pi$ at state $s$, we first draw an action $a$ according to $\pi_s$, then measure the cost $C(s, a)$, and finally take the expectation over that randomness.
5. 0-1 loss for imitation learning
- Indicator-based 0-1 loss: $e(s, a) = \mathbb{I}\big[a \neq \pi^*(s)\big]$, where $\mathbb{I}[\cdot]$ is 1 if the condition is true, and 0 otherwise.
- Interpretation: If in state $s$ you take an action $a$ that differs from the expert's action $\pi^*(s)$, you incur a 0-1 loss of 1; otherwise 0.
- Expected 0-1 loss under policy $\pi$: $e_\pi(s) = \mathbb{E}_{a \sim \pi_s}\big[e(s, a)\big]$.
- Meaning: the probability (under $\pi_s$) that $\pi$ picks a different action than $\pi^*$ in state $s$.
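As a quick numerical illustration of $C_\pi(s)$ and $e_\pi(s)$ for a stochastic policy in a single state (all numbers below are invented toy values, not from the paper):

```python
import numpy as np

n_actions = 3
pi_s = np.array([0.7, 0.2, 0.1])   # pi_s: the policy's action distribution in state s
cost = np.array([0.0, 0.5, 1.0])   # C(s, a) for each action a, values in [0, 1]
expert_action = 0                  # pi*(s): the expert's (deterministic) action in s

C_pi_s = pi_s @ cost                                            # C_pi(s) = E_{a~pi_s}[C(s,a)]
e_sa = (np.arange(n_actions) != expert_action).astype(float)    # e(s,a) = I[a != pi*(s)]
e_pi_s = pi_s @ e_sa                                            # e_pi(s) = P_{a~pi_s}(a != pi*(s))

print(C_pi_s)  # 0.7*0.0 + 0.2*0.5 + 0.1*1.0 = 0.2
print(e_pi_s)  # 0.2 + 0.1 = 0.3
```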
6. State distributions
- State distribution at time $t$ (following policy $\pi$): denoted by $d_\pi^t$.
- Explanation: If you start at time step 1 (with some initial distribution over states) and follow policy $\pi$ for $t-1$ steps, you end up with a distribution over states at time step $t$, which is $d_\pi^t$.
- Average state distribution under $\pi$ over $T$ time steps: $d_\pi = \frac{1}{T} \sum_{t=1}^{T} d_\pi^t$.
- Meaning: This describes how often you visit each state on average (frequency) when following policy $\pi$ for $T$ steps.
7. Total (T-step) cost
- Definition: $J(\pi) = T \, \mathbb{E}_{s \sim d_\pi}\big[C_\pi(s)\big]$.
- Interpretation: Over $T$ time steps, the total cost of following policy $\pi$ is computed by:
- Taking the average distribution $d_\pi$ of states visited by $\pi$,
- Measuring the expected immediate cost $C_\pi(s)$ at those states,
- Multiplying by $T$ to get the total cost over the whole horizon.
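A minimal tabular sketch of $d_\pi^t$, $d_\pi$, and $J(\pi)$, assuming we already have the Markov chain induced by $\pi$ (the transition matrix `P`, initial distribution `d1`, and per-state costs `C_pi` below are invented toy numbers):

```python
import numpy as np

T = 10
d1 = np.array([1.0, 0.0, 0.0])        # initial state distribution (time step 1)
# P[s, s'] = probability of moving to s' from s when following pi
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
C_pi = np.array([0.0, 0.3, 1.0])      # C_pi(s) = E_{a~pi_s}[C(s, a)] for each state

d_t = d1.copy()
d_ts = []
for t in range(T):                    # compute d_pi^t for t = 1..T
    d_ts.append(d_t)
    d_t = d_t @ P                     # roll the state distribution forward one step

d_pi = np.mean(d_ts, axis=0)          # d_pi = (1/T) * sum_t d_pi^t
J_pi = T * (d_pi @ C_pi)              # J(pi) = T * E_{s ~ d_pi}[C_pi(s)]
# equivalently: J_pi == sum(d @ C_pi for d in d_ts)
print(d_pi, J_pi)
```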
8. Regret with respect to a policy class
- Policy class: denoted by $\Pi$ (a set of possible policies).
- Regret of a policy $\pi$ w.r.t. the best policy in $\Pi$: $J(\pi) - \min_{\pi' \in \Pi} J(\pi')$.
- Meaning: This measures how much "worse" $\pi$ is, in total cost, compared to the best policy in the class $\Pi$.
- Assumption (often): $\pi^* \in \Pi$ and its regret is negligible for large $T$.
- Interpretation: The expert's policy is assumed to be in the set $\Pi$, and it has negligible regret for large horizon $T$.
9. Supervised learning approach to imitation
- Traditional approach:
- Goal: Minimize the 0-1 loss under the expert's state distribution $d_{\pi^*}$, i.e. $\hat{\pi} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}\big[e_\pi(s)\big]$.
- This trains a classifier (policy) so that it mimics $\pi^*$ on the states that $\pi^*$ visits.
- Error assumption: If the learned policy $\hat{\pi}$ makes a mistake with probability $\epsilon$ under $d_{\pi^*}$, i.e. $\mathbb{E}_{s \sim d_{\pi^*}}\big[e_{\hat{\pi}}(s)\big] = \epsilon$, then the paper provides a bound (guarantee) on how this error translates to the overall performance $J(\hat{\pi})$.
In summary, the paper sets up notation for cost functions, the 0-1 loss for imitation, and the concept of distribution shift, highlighting how training under $d_{\pi^*}$ can differ from deployment under $d_\pi$, which is one of the central challenges in imitation learning.
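For instance, $\epsilon = \mathbb{E}_{s \sim d_{\pi^*}}[e_{\hat{\pi}}(s)]$ can be estimated empirically from held-out expert trajectories. A rough sketch (the `policy_action_probs` interface is an assumption made for illustration):

```python
import numpy as np

def estimate_epsilon(policy_action_probs, expert_states, expert_actions):
    """Monte-Carlo estimate of eps = E_{s ~ d_pi*}[e_pi(s)].

    policy_action_probs(s) -> length-A array with the learned policy's pi(. | s)
    expert_states, expert_actions -> states visited by pi* and the actions it took
    """
    mistake_probs = []
    for s, a_star in zip(expert_states, expert_actions):
        p = policy_action_probs(s)
        mistake_probs.append(1.0 - p[a_star])  # e_pi(s) = P_{a~pi_s}(a != pi*(s))
    return float(np.mean(mistake_probs))
```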
Below is an explanation of Theorem 2.1 and its proof, with detailed notation and intuitive interpretation (in bullet points).
- Traditional Imitation Learning Objective
- We have an expert's policy denoted by $\pi^*$ (assume it is deterministic).
- We collect states by following $\pi^*$ and obtain a state distribution $d_{\pi^*}$.
- We train a classifier (or policy) $\pi$ to minimize the 0-1 loss under $d_{\pi^*}$: $\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}\big[e_\pi(s)\big]$, where $e_\pi(s) = \mathbb{E}_{a \sim \pi_s}\big[\mathbb{I}(a \neq \pi^*(s))\big]$ is the expected 0-1 loss defined above.
- Definition of $\epsilon$ (error under the expert's distribution)
- Suppose the learned policy $\pi$ makes a mistake (differs from $\pi^*$) with probability $\epsilon$ under $d_{\pi^*}$: $\mathbb{E}_{s \sim d_{\pi^*}}\big[e_\pi(s)\big] = \epsilon$.
- Intuitively, $\epsilon$ is how often $\pi$ disagrees with the expert on states the expert visits.
- Cost Function and Total Cost
- We denote the immediate cost by $C(s, a) \in [0, 1]$ and define $J(\pi) = T\,\mathbb{E}_{s \sim d_\pi}\big[C_\pi(s)\big]$, where $C_\pi(s) = \mathbb{E}_{a \sim \pi_s}\big[C(s, a)\big]$.
- In this theorem, the cost function is chosen (or related) so that "making a mistake relative to $\pi^*$" is the main concern (though the proof outlines a bounding argument in terms of 0-1 mistakes).
- Theorem 2.1
- Statement: If $\pi$ satisfies $\mathbb{E}_{s \sim d_{\pi^*}}\big[e_\pi(s)\big] = \epsilon$, then $J(\pi) \leq J(\pi^*) + T^2 \epsilon$.
- Interpretation: If the learned policy has a small error probability $\epsilon$ under $d_{\pi^*}$, then its total cost is not much worse than $J(\pi^*)$. The "extra cost" is at most $T^2 \epsilon$.
- Outline of the Proof
- Set up per-time-step error:
- Let $\epsilon_t = \mathbb{E}_{s \sim d_{\pi^*}^t}\big[e_\pi(s)\big]$ be the probability of a mistake by $\pi$ at time step $t$ under the distribution $d_{\pi^*}^t$ (i.e., the states visited by the expert at time $t$).
- Overall, $\epsilon$ is the average of these per-step mistake probabilities: $\epsilon = \frac{1}{T} \sum_{t=1}^{T} \epsilon_t$.
- Track $p_t$ = probability of no mistake so far:
- Define $p_t$ as the probability that $\pi$ has not made any mistake (w.r.t. $\pi^*$) in the first $t$ steps.
- The proof then considers two distributions for time $t$:
- $d_t$: the distribution of states at time $t$ conditional on $\pi$ having made no mistakes in the first $t-1$ steps.
- $d_t'$: the distribution of states at time $t$ conditional on $\pi$ having made at least one mistake in the first $t-1$ steps.
- They combine these to relate the cost of $\pi$ to that of $\pi^*$.
- Relate cost under $\pi$ to cost under $\pi^*$:
- If $\pi$ hasn't made a mistake yet, the cost remains similar to that of $\pi^*$ (since $\pi$ and $\pi^*$ have been taking the same actions so far).
- If $\pi$ has made a mistake, the proof bounds the immediate cost by at most 1 (or relates it to a small "one-time" penalty).
- Bounding total cost:
- They use the chain of inequalities $\mathbb{E}_{s \sim d_\pi^t}\big[C_\pi(s)\big] \leq p_{t-1}\,\mathbb{E}_{s \sim d_t}\big[C_\pi(s)\big] + (1 - p_{t-1}) \cdot 1$.
- The term $p_{t-1}\,\mathbb{E}_{s \sim d_t}\big[C_\pi(s)\big]$ is the cost if no mistake has been made so far.
- The term $(1 - p_{t-1})$ is a conservative bound (at most 1) on the cost if a mistake has occurred.
- They connect $1 - p_{t-1}$ to the mistake probabilities $\epsilon_1, \dots, \epsilon_{t-1}$ and sum over all time steps.
- Key identity: $d_\pi^t = p_{t-1}\, d_t + (1 - p_{t-1})\, d_t'$, so the mistake probability at time $t$ decomposes as $\mathbb{E}_{s \sim d_\pi^t}\big[e_\pi(s)\big] = p_{t-1}\,\mathbb{E}_{s \sim d_t}\big[e_\pi(s)\big] + (1 - p_{t-1})\,\mathbb{E}_{s \sim d_t'}\big[e_\pi(s)\big]$.
- This expresses that the overall mistake probability at time $t$ can be decomposed according to whether or not a mistake has happened in earlier steps.
- They use this to show $p_{t-1}\,\mathbb{E}_{s \sim d_t}\big[e_\pi(s)\big] \leq \epsilon_t$ and $1 - p_{t-1} \leq \sum_{i=1}^{t-1} \epsilon_i$, and eventually sum the bounds over $t$ to get the final result.
- Conclusion:
- After carefully bounding the total cost $J(\pi)$ term by term, they find $J(\pi) \leq J(\pi^*) + T^2 \epsilon$, which completes the proof.
- High-Level Intuition
- If $\pi$ rarely disagrees with $\pi^*$ (i.e., it has a small $\epsilon$ under $d_{\pi^*}$), then the only way it can incur a large total cost is if those rare mistakes cause large deviations into costly states.
- The factor $T^2$ arises because one mistake might shift the state distribution away from the expert's, and that shift can persist. In the worst case, each of the roughly $\epsilon T$ expected mistakes can cost up to $T$ in subsequent steps, hence the $T^2 \epsilon$ bound.
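To see the compounding-error intuition numerically, here is a small self-contained simulation (entirely invented, not from the paper): a length-$T$ "corridor" task in which the expert always stays on the corridor at cost 0, the learned policy errs with probability $\epsilon$ per on-corridor step, and a single error is unrecoverable and costs 1 at every remaining step. The average total cost grows roughly like $\epsilon T^2 / 2$ for small $\epsilon T$, matching the worst-case picture behind Theorem 2.1.

```python
import random

def rollout_cost(T: int, eps: float) -> int:
    """Total cost of one episode in a toy 'corridor' task (worst-case flavour)."""
    cost, off_corridor = 0, False
    for _ in range(T):
        if off_corridor:
            cost += 1                   # an early mistake keeps costing every step
        elif random.random() < eps:
            off_corridor = True         # first mistake: drift off the expert's states
            cost += 1
    return cost

def avg_cost(T: int, eps: float, n: int = 20000) -> float:
    return sum(rollout_cost(T, eps) for _ in range(n)) / n

for T in (10, 50, 100):
    eps = 0.01
    print(T, round(avg_cost(T, eps), 2), eps * T * T / 2)  # empirical vs ~ eps*T^2/2
```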