Implementing Inverse Q-Learning

March 1, 2024

How can we teach computers to learn from expert demonstrations? While traditional reinforcement learning (RL) can take millions of interactions to learn simple tasks - equivalent to years of experience - humans can often learn from just a few examples. This project explores IQ-Learn, a novel approach to imitation learning that aims to bridge this gap.

The concept of learning from demonstrations has a rich history. In 1998, Stuart Russell first posed inverse reinforcement learning (IRL) as a fundamental problem: given examples of an agent's behavior, can we determine what reward function they're optimizing? Early approaches struggled with computational complexity and required nested optimization - first guessing rewards, then solving for optimal policies, then updating rewards based on the match to expert behavior.

More recent approaches like behavioral cloning take a simpler route: directly copying expert actions using supervised learning. However, this often fails when the agent deviates from expert trajectories, having no data on how to recover. IQ-Learn offers an elegant alternative - directly learning a Q-function that captures both what to do (policy) and why (rewards).

The Reinforcement Learning Foundation

At its heart, reinforcement learning is about learning to make good decisions through trial and error. Imagine you're learning to play chess - each move you make leads to a new board position, and eventually to either winning or losing. To get better, you need to figure out which moves in which situations tend to lead to wins.

More formally, we model this as a Markov Decision Process (MDP) with:

  • States $s$ (like the board position in chess)
  • Actions $a$ (like possible moves)
  • A reward function $r(s,a)$ (like +1 for winning, -1 for losing)
  • State transitions $P(s'|s,a)$ (how the board changes after a move)
  • A discount factor $\gamma$ (making future rewards worth less than immediate ones)

The goal is to find a policy $\pi(a|s)$ that maximizes expected future rewards. Q-learning achieves this by learning a Q-function $Q(s,a)$ that estimates the expected sum of future rewards starting from state $s$, taking action $a$, and then following the optimal policy:

$$Q(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(s'|s,a)}\big[\max_{a'} Q(s',a')\big]$$

This is the famous Bellman equation, and it tells us something profound: the value of taking an action now depends on both the immediate reward and the best possible future value. Once we know the true Q-function, the optimal policy is simple - just pick the action with the highest Q-value in each state.
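
To make this concrete, here is a minimal sketch of the classic tabular Q-learning update implied by the Bellman equation. The function signature, learning rate, and discount factor are illustrative assumptions, not part of our implementation.

import numpy as np

def q_learning_step(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Bellman target: immediate reward plus the discounted best future value
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    # Move the current estimate a small step (alpha) toward the target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Repeating this update over many transitions drives Q toward a fixed point of the Bellman equation, after which acting greedily with respect to Q gives the optimal policy.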

The Inverse Problem

But what if we don't know the rewards? In many real-world situations, we can't easily write down what makes an action "good" - but we can learn from experts who already know what they're doing. This is where inverse reinforcement learning comes in.

Traditional IRL methods try to:

  1. Guess a reward function
  2. Run RL to find the optimal policy for those rewards
  3. Compare that policy to the expert
  4. Adjust the rewards and repeat

This is computationally intensive and can be unstable. IQ-Learn offers an elegant alternative: what if we could learn the Q-function directly from expert behavior, and extract both the policy and rewards from it?

The IQ-Learn Algorithm

IQ-Learn proposes a clever modification to Q-learning that allows it to learn directly from expert demonstrations, without access to rewards. Instead of learning from reward signals, it trains the Q-function so that expert actions receive higher values than the alternatives:

$$\max_Q \; \mathbb{E}_{(s,a,s') \sim \rho_E}\big[Q(s,a) - \gamma V(s')\big] - (1-\gamma)\,\mathbb{E}_{s \sim \rho_0}\big[V(s)\big]$$

where $V(s) = \max_a Q(s,a)$ is the value function, $\rho_E$ is the expert's state-action distribution, and $\rho_0$ is the initial state distribution.

The authors showed this has several elegant properties:

  1. The objective is equivalent to minimizing certain statistical distances between expert and learned behavior
  2. The learned Q-function implicitly defines a reward function
  3. The optimal policy can be recovered directly using $\pi(a|s) \propto \exp(Q(s,a))$ (see the sketch below)
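
For discrete actions, turning a learned Q-function into this soft policy is a small computation. Here is a minimal sketch assuming the Q-values for a single state arrive as a NumPy array, with any temperature already folded into Q:

import numpy as np

def policy_from_q(q_values):
    # pi(a|s) proportional to exp(Q(s,a)); subtract the max for numerical stability
    shifted = q_values - np.max(q_values)
    probs = np.exp(shifted)
    return probs / probs.sum()

Sampling from this distribution gives a stochastic imitation policy; taking the argmax gives its greedy counterpart.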

Our Implementation

While the original paper demonstrated IQ-Learn's effectiveness on complex environments like MuJoCo and Atari games, we focused on understanding the core algorithm in a simpler setting. We specifically:

  1. Implemented the base algorithm using total variation distance only
  2. Used discrete action spaces rather than continuous control
  3. Developed a configurable GridWorld environment for systematic testing

Try It Yourself!

This is our GridWorld environment where you can experience what the AI learns from. Use the arrow keys to navigate the red square (your player) to the yellow square (the goal) while avoiding black walls.

The checkboxes control the difficulty:

  • Random Walls: Randomizes the position of walls each time you reach the goal
  • Random Start: Places your player in a random starting position instead of the top-left corner
  • Random Goal: Places the goal in a random position instead of the bottom-right corner

You can turn on multiple options at once - try starting with no randomization, then gradually increase the difficulty to understand how the environment becomes more challenging.

Click the grid to focus it before using the arrow keys. When you reach the goal, the grid will reset with new positions based on your selected options.


The Environment

Our GridWorld presents three levels of difficulty:

  1. Static Walls: Fixed wall positions and goal location
  2. Random Walls: Random wall positions but fixed goal
  3. Random Everything: Both walls and goal randomized

This progression lets us systematically evaluate how the algorithms handle increasing complexity. Players (or agents) can move in four directions while avoiding walls to reach the goal.
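
To make the setup concrete, here is a minimal sketch of how such an environment can be represented. The class name, default layout, and three-plane state encoding are illustrative assumptions rather than our exact implementation:

import numpy as np

class GridWorld:
    # Actions: up, down, left, right as (row, column) offsets
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, size=5, walls=((1, 1), (2, 3)), start=(0, 0), goal=None):
        self.size = size
        self.walls = set(walls)
        self.start = start
        self.goal = goal if goal is not None else (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.pos = self.start
        return self._state()

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        # Moves into walls or off the grid leave the agent where it is
        if 0 <= r < self.size and 0 <= c < self.size and (r, c) not in self.walls:
            self.pos = (r, c)
        done = self.pos == self.goal
        return self._state(), float(done), done  # reward 1.0 only at the goal

    def _state(self):
        # Three one-hot planes (agent, goal, walls) flattened to a feature vector
        planes = np.zeros((3, self.size, self.size), dtype=np.float32)
        planes[0][self.pos] = 1.0
        planes[1][self.goal] = 1.0
        for w in self.walls:
            planes[2][w] = 1.0
        return planes.flatten()

Note that IQ-Learn itself never consumes the reward returned by step; it only sees expert state-action transitions.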

Implementation Details

Our implementation centers around several key components:

  1. The Q-Network
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Two-layer MLP: flattened grid state in, one Q-value per action out
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )

    def get_Q(self, state):
        return self.network(state)
  2. The IQ-Learn Update. We implemented the core IQ-Learn objective; the iq_loss helper it calls is sketched after this list:
def iq_learn_update(self, state, action, next_state, done):
    # The batch (state, action, next_state, done) comes from expert demonstrations
    Q_vals = self.model.get_Q(state)
    current_Q = Q_vals.gather(2, action.unsqueeze(1)).squeeze()  # Q(s, a) of the expert action
    current_V = get_max_Q(Q_vals)                                # V(s)  = max_a Q(s, a)
    next_V = get_max_Q(self.model.get_Q(next_state))             # V(s') = max_a Q(s', a')
    y = (1 - done) * self.gamma * next_V                         # discounted next value, zeroed at episode end
    loss = iq_loss(current_Q, current_V, y)
    return loss
  3. State Processing. To avoid "dead neurons" during training, we added small random noise to state representations:
import numpy as np

def preprocess_state(state):
    # Small Gaussian jitter keeps ReLU units from going permanently inactive
    noise = np.random.normal(0, 0.01, state.shape)
    return state + noise
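
The update above delegates the objective itself to an iq_loss helper that is not shown. The sketch below is one minimal reading of the objective from earlier in the post: it rewrites the initial-state term with the telescoping identity $(1-\gamma)\,\mathbb{E}_{\rho_0}[V(s_0)] = \mathbb{E}_{\rho}[V(s) - \gamma V(s')]$ estimated on the expert batch, and it omits the regularizer that pins down a specific statistical distance, so treat it as an illustration rather than our exact loss.

def iq_loss(current_Q, current_V, y):
    # First term: E_expert[Q(s,a) - gamma * V(s')]; y already holds gamma * V(s'),
    # zeroed on terminal transitions
    expert_term = (current_Q - y).mean()
    # Second term: (1 - gamma) * E[V(s0)], rewritten via the telescoping identity
    # and estimated on the sampled expert transitions
    value_term = (current_V - y).mean()
    # The objective is maximized, so its negation is minimized as a loss
    return -(expert_term - value_term)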

Training Dynamics

Our experiments revealed several interesting patterns:

  1. Convergence Behavior

    • IQ-Learn typically converged within 200 training steps
    • DQN required substantially more training time
    • Both showed degraded performance with increasing environment complexity
  2. Value Functions

    • DQN learned clear value gradients towards the goal
    • IQ-Learn's value maps appeared less structured but still produced effective policies
    • Both struggled with wall avoidance in complex scenarios
  3. State Coverage

    Expert demonstrations showed distinct patterns:

    • Clustered near optimal paths in static environments
    • More dispersed exploration in random configurations
    • Concentrated in central regions with random goals

Limitations and Insights

Our implementation revealed several key limitations and insights:

  1. Architectural Limitations

    • The simple MLP architecture struggled with spatial relationships
    • Both algorithms had difficulty generalizing wall avoidance strategies
    • A CNN architecture might better capture spatial structure
  2. Training Challenges

    • IQ-Learn was sensitive to state preprocessing
    • Batch size significantly affected convergence
    • Learning rate scheduling proved crucial for stability (see the example after this list)
  3. Performance Metrics

    We developed a distance ratio metric to evaluate performance:

    def distance_ratio(path_length, optimal_distance):
        return path_length / optimal_distance

    This helped quantify how close each path was to optimal: a ratio of 1.0 means the agent took the shortest possible route.
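
Since learning-rate scheduling kept coming up, here is a small example of the kind of setup we mean, reusing the Model class from above; the optimizer choice, schedule type, and step sizes are illustrative assumptions rather than the exact values we used.

import torch

q_net = Model(input_size=75, hidden_size=64, output_size=4)  # sizes are illustrative
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
# Halve the learning rate every 100 updates to damp late-training oscillations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for step in range(200):
    optimizer.zero_grad()
    # loss = agent.iq_learn_update(...)  # compute the IQ-Learn loss on an expert batch
    # loss.backward()
    optimizer.step()
    scheduler.step()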

Comparison to Original Paper

Our implementation focused on a subset of the original paper's scope:

  • We used only the basic IQ-Learn algorithm without extensions
  • We tested in simpler environments with discrete actions
  • We focused on understanding learning dynamics rather than benchmarking against other methods

The original paper showed additional capabilities we didn't explore:

  • Continuous control in complex environments
  • Multiple statistical distances beyond total variation
  • Actor-critic variants for high-dimensional spaces

Looking Forward

Our implementation suggests several promising directions:

  1. Architectural Improvements

    • Testing convolutional architectures for better spatial reasoning
    • Exploring attention mechanisms for handling varying environment sizes
    • Implementing hierarchical representations for complex tasks
  2. Algorithm Extensions

    • Implementing additional statistical distances
    • Testing continuous action spaces
    • Exploring curriculum learning approaches
  3. Analysis Tools

    • Better visualization of learned policies
    • More detailed analysis of failure cases
    • Tools for understanding reward function recovery

The code and detailed technical documentation are available in the project repository. Special thanks to my project partner Celine Aubuchon, without whom the game implementation, the analysis of agent behavior, and my fuller understanding of the original work would not have been possible.