Back to all papers
Deep Dive #250 min read

WOSAC Challenge Deep Dive

Waymo Open Sim Agents Challenge evaluation framework, realism metrics, and winning strategies from 2023-2025.

WOSAC Challenge Deep Dive: The Waymo Open Sim Agents Challenge

A comprehensive learning guide to understanding, implementing, and excelling at the premier benchmark for realistic traffic simulation in autonomous driving.


Table of Contents

  1. Executive Summary
  2. Challenge Overview
  3. Evaluation Metrics Deep Dive
  4. Baseline Approaches
  5. Winning Strategies
  6. Interactive Code Examples
  7. Submission Format
  8. Mental Models
  9. Hands-On Exercises
  10. Interview Questions

1. Executive Summary

What is WOSAC?

The Waymo Open Sim Agents Challenge (WOSAC) is the first public benchmark for evaluating multi-agent traffic simulation systems. Introduced in 2023 and running annually, it establishes standardized metrics and protocols for assessing how realistically simulation agents behave compared to real-world driving data.

Why WOSAC Matters

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE SIMULATION BOTTLENECK                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Real World Testing          vs.        Simulation Testing                 │
│   ─────────────────                      ──────────────────                 │
│   • $1M+ per vehicle                     • $0.01 per scenario               │
│   • 1 scenario/hour                      • 1M scenarios/hour                │
│   • Safety risks                         • Zero risk                        │
│   • Weather dependent                    • Any condition                    │
│                                                                             │
│   BUT: Simulation is only useful if agents behave REALISTICALLY            │
│                                                                             │
│   WOSAC answers: "How do we MEASURE realism?"                              │
└─────────────────────────────────────────────────────────────────────────────┘

The Core Problem WOSAC Solves

Autonomous vehicle development requires testing against millions of scenarios. But simulation is only valuable if:

  1. Simulated agents behave like real humans - not too aggressive, not too passive
  2. Multi-agent interactions are realistic - agents react to each other
  3. Edge cases emerge naturally - rare but critical scenarios appear

WOSAC provides the measurement framework to quantify these properties.

Key Statistics

MetricValue
Challenge Years2023, 2024, 2025
Scenarios per Evaluation44,097 (test set)
Rollouts per Scenario32 parallel simulations
Simulation Horizon8 seconds at 10 Hz
History Context1.1 seconds
Top Realism Score (2025)0.7858 (SMART-R1)

References


2. Challenge Overview

Task Definition

Official Task: "Given the agent tracks for the past 1 second on a corresponding map, and optionally the associated lidar for this time interval, simulate 32 realistic joint futures for all the agents in the scene."

┌─────────────────────────────────────────────────────────────────────────────┐
│                         WOSAC TASK VISUALIZATION                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   TIME AXIS                                                                 │
│   ─────────────────────────────────────────────────────────────────────►    │
│                                                                             │
│   ◄────── 1.1s History ──────►◄─────────── 8.0s Future ──────────────►     │
│                                                                             │
│   [t=-10, t=-9, ..., t=0]     [t=1, t=2, ..., t=80]                        │
│         11 frames                   80 frames @ 10Hz                        │
│                                                                             │
│   ┌─────────────────────┐     ┌─────────────────────────────────────┐      │
│   │   INPUTS (Given)    │     │   OUTPUTS (Generate)                │      │
│   │   • Agent positions │     │   • 32 joint scene rollouts         │      │
│   │   • Agent velocities│     │   • All agents' trajectories        │      │
│   │   • Agent headings  │     │   • x, y, z, heading per timestep   │      │
│   │   • HD Map          │     │                                     │      │
│   │   • (Optional) LiDAR│     │                                     │      │
│   └─────────────────────┘     └─────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Input Format

The challenge uses the Waymo Open Motion Dataset (WOMD), providing:

# Scenario Proto Structure (simplified)
scenario = {
    'scenario_id': 'unique_identifier_string',

    # Agent Information (up to 128 agents per scene)
    'tracks': {
        'object_id': [int],           # Unique ID per agent
        'object_type': [int],         # 1=Vehicle, 2=Pedestrian, 3=Cyclist
        'states': {
            'center_x': [N_agents, 91],    # 91 timesteps total
            'center_y': [N_agents, 91],
            'center_z': [N_agents, 91],
            'heading': [N_agents, 91],      # Yaw angle in radians
            'velocity_x': [N_agents, 91],
            'velocity_y': [N_agents, 91],
            'length': [N_agents],           # Bounding box
            'width': [N_agents],
            'height': [N_agents],
            'valid': [N_agents, 91],        # Validity mask
        }
    },

    # Map Information
    'map_features': {
        'lane': [...],                 # Lane centerlines
        'road_line': [...],            # Road boundaries
        'road_edge': [...],            # Drivable area edges
        'stop_sign': [...],
        'crosswalk': [...],
        'speed_bump': [...],
    },

    # Traffic Signals
    'dynamic_map_states': {
        'traffic_light_state': [...],  # Per timestep
    }
}

Output Format

Each submission must provide 32 parallel rollouts per scenario:

# Output Structure
output = {
    'scenario_id': 'matching_input_id',
    'joint_scenes': [  # List of 32 rollouts
        {
            'simulated_trajectories': [  # One per sim agent
                {
                    'object_id': int,
                    'center_x': [80],    # 80 future timesteps
                    'center_y': [80],
                    'center_z': [80],
                    'heading': [80],
                }
            ]
        }
    ] * 32
}

What Makes This Challenging?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MULTI-AGENT SIMULATION CHALLENGES                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. DISTRIBUTION MATCHING (not point prediction)                           │
│      ────────────────────────────────────────────                           │
│      • Must capture DIVERSITY of plausible futures                          │
│      • 32 samples must cover the behavioral distribution                    │
│      • Single "best guess" is insufficient                                  │
│                                                                             │
│   2. CLOSED-LOOP SIMULATION                                                 │
│      ───────────────────────                                                │
│      • Agents must REACT to each other                                      │
│      • Cannot pre-compute trajectories independently                        │
│      • Covariate shift from training (open-loop) to eval (closed-loop)     │
│                                                                             │
│   3. MULTI-SCALE REALISM                                                    │
│      ────────────────────                                                   │
│      • Kinematic feasibility (physics)                                      │
│      • Social compliance (interactions)                                     │
│      • Map adherence (road rules)                                           │
│                                                                             │
│   4. JOINT CONSISTENCY                                                      │
│      ─────────────────                                                      │
│      • All agents simulated TOGETHER                                        │
│      • Cannot optimize per-agent independently                              │
│      • Emergent behaviors from interactions                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Dataset Splits

SplitScenariosPurpose
Training486,995Model training
Validation44,097Development & tuning
Testing44,920Final evaluation (hidden)

3. Evaluation Metrics Deep Dive

The Realism Meta-Metric

WOSAC's primary ranking criterion is the Realism Meta-Metric, a weighted combination of 9 component metrics across three categories.

┌─────────────────────────────────────────────────────────────────────────────┐
│                     REALISM META-METRIC HIERARCHY                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                         REALISM META-METRIC                                 │
│                              (0-1 scale)                                    │
│                                  │                                          │
│              ┌───────────────────┼───────────────────┐                      │
│              │                   │                   │                      │
│              ▼                   ▼                   ▼                      │
│      ┌──────────────┐    ┌──────────────┐    ┌──────────────┐              │
│      │  KINEMATIC   │    │ INTERACTIVE  │    │  MAP-BASED   │              │
│      │   METRICS    │    │   METRICS    │    │   METRICS    │              │
│      │   (4 items)  │    │   (3 items)  │    │   (2 items)  │              │
│      └──────────────┘    └──────────────┘    └──────────────┘              │
│              │                   │                   │                      │
│      ┌───┬───┼───┬───┐   ┌──────┼──────┐     ┌──────┼──────┐               │
│      │   │   │   │   │   │      │      │     │      │      │               │
│      ▼   ▼   ▼   ▼   ▼   ▼      ▼      ▼     ▼      ▼      ▼               │
│     LS  LA  AS  AA      DNO   COL*   TTC   DRE   OFF*                      │
│                                                                             │
│     * = Double-weighted (2x) for safety emphasis                           │
│                                                                             │
│     LS = Linear Speed          DNO = Distance to Nearest Object            │
│     LA = Linear Acceleration   COL = Collision Indicator                   │
│     AS = Angular Speed         TTC = Time to Collision                     │
│     AA = Angular Acceleration  DRE = Distance to Road Edge                 │
│                                OFF = Offroad Indicator                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Master Formula

The overall Realism Meta-Metric is computed as:

                    1     N    M
        M^K = ───────── · Σ    Σ   w_j · m_{i,j}^K
               N · W     i=1  j=1

Where:
    N = Number of scenarios
    M = 9 component metrics
    K = 32 parallel rollouts
    w_j = Weight for metric j (1.0 or 2.0)
    W = Sum of all weights (11 total: 7×1 + 2×2)
    m_{i,j}^K = Likelihood score for metric j on scenario i

3.1 Kinematic Metrics

These metrics ensure agents move in physically plausible ways.

Linear Speed

Definition: Unsigned magnitude of velocity

        ‖v_t‖ = ‖(x_{t+1} - x_t) / Δt‖_2

Where:
    x_t = [x_t, y_t, z_t]  (3D position)
    Δt = 0.1 seconds (10 Hz)

Intuition: Measures if agents move at realistic speeds. A car suddenly jumping 100 m/s would get low likelihood.

Typical Range: 0-30 m/s for vehicles, 0-2 m/s for pedestrians

Linear Acceleration

Definition: Signed magnitude of velocity change

        a_t = (‖v_{t+1}‖ - ‖v_t‖) / Δt

Intuition: Captures smooth vs. jerky motion. Real vehicles have bounded acceleration capabilities.

Typical Range: -8 to +4 m/s² for comfortable driving

Angular Speed

Definition: Rate of heading change

        ω_t = d(θ_{t+1}, θ_t) / Δt

Where d(·,·) is the minimal angular difference:
    d(θ_a, θ_b) = min(|θ_a - θ_b|, 2π - |θ_a - θ_b|)

Intuition: How fast is the agent turning? Vehicles can't spin arbitrarily fast.

Typical Range: -0.5 to +0.5 rad/s for normal driving

Angular Acceleration

Definition: Rate of angular speed change

        α_t = d(ω_{t+1}, ω_t) / Δt

Intuition: Steering smoothness. Jerky steering inputs get penalized.

3.2 Interactive Metrics

These metrics capture multi-agent social behavior.

Distance to Nearest Object

Definition: Signed distance to closest other agent

        d_{near}(i, t) = min_{j ≠ i} GJK_distance(box_i(t), box_j(t))

Where:
    GJK_distance = Gilbert-Johnson-Keerthi algorithm
    box_i(t) = Oriented bounding box of agent i at time t

Negative values indicate overlap (collision)

Intuition: Are agents maintaining realistic following distances? Too close = dangerous, too far = unrealistic.

Collision Indicator (2x Weight)

Definition: Binary indicator of any collision occurrence

        collision_i = 1  if  ∃t : d_{near}(i, t) < 0
                     = 0  otherwise

Intuition: Did the agent collide with anything during the rollout? This is CRITICAL for safety.

Why Double-Weighted: Collisions are safety-critical failures. A simulator with many collisions is useless for AV testing.

Time to Collision (TTC)

Definition: Time until collision assuming constant velocities

        TTC_t = t_collision - t_current

Where t_collision is computed assuming:
    - Both agents maintain current velocity
    - Linear extrapolation of trajectories
    - First intersection time of bounding boxes

Intuition: Are agents maintaining safe time gaps? Real drivers adjust speed to maintain TTC > 2-3 seconds.

3.3 Map-Based Metrics

These metrics ensure agents follow road structure.

Distance to Road Edge

Definition: Signed distance to nearest drivable boundary

        d_{road}(i, t) = min_{edge ∈ road_edges} distance(center_i(t), edge)

Positive = on road
Negative = off road

Intuition: Are agents staying on valid road surfaces?

Offroad Indicator (2x Weight)

Definition: Binary indicator of road departure

        offroad_i = 1  if  ∃t : d_{road}(i, t) < 0
                  = 0  otherwise

Why Double-Weighted: Vehicles driving through buildings or fields is obviously unrealistic and safety-critical.

3.4 Likelihood Computation

The key insight of WOSAC is treating simulation as distribution matching. Rather than comparing single trajectories, it compares distributions.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LIKELIHOOD COMPUTATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   STEP 1: Extract feature values from 32 rollouts                          │
│   ─────────────────────────────────────────────────                         │
│                                                                             │
│   Rollout 1: speed = [5.2, 5.3, 5.1, ...]                                  │
│   Rollout 2: speed = [5.1, 5.4, 5.2, ...]                                  │
│      ...                                                                    │
│   Rollout 32: speed = [5.3, 5.2, 5.4, ...]                                 │
│                                                                             │
│   STEP 2: Build histogram from simulated samples                           │
│   ──────────────────────────────────────────────                            │
│                                                                             │
│   Histogram bins: [0-2], [2-4], [4-6], [6-8], ...                          │
│   Counts:           0      2      25     5     ...                         │
│   Normalize → Categorical distribution P_sim                                │
│                                                                             │
│   STEP 3: Compute likelihood of REAL data under P_sim                      │
│   ────────────────────────────────────────────────────                      │
│                                                                             │
│   Real speed = 5.5 m/s → falls in bin [4-6]                                │
│   P_sim([4-6]) = 25/32 = 0.78                                              │
│   NLL = -log(0.78) = 0.25                                                  │
│                                                                             │
│   STEP 4: Average over time with Laplace smoothing                         │
│   ─────────────────────────────────────────────────                         │
│                                                                             │
│   m = exp(-[Σ_t 1{v_t} · NLL_t] / Σ_t 1{v_t})                             │
│                                                                             │
│   Laplace smoothing: Add pseudocount 0.1 to avoid log(0)                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Mathematical Formulation:

For each component metric m_j:

1. Extract values: V = {v_1, v_2, ..., v_32} from 32 rollouts

2. Build histogram: H(b) = count of samples in bin b

3. Normalize with Laplace smoothing:
   P(b) = (H(b) + 0.1) / (32 + 0.1 × n_bins)

4. Compute NLL of ground truth:
   NLL = -log(P(bin(v_real)))

5. Time-average:
   m_j = exp(- Σ_t [valid_t × NLL_t] / Σ_t valid_t)

3.5 2025 Metric Updates

The 2025 challenge introduced refinements:

ChangeDescription
Traffic Light ViolationNew metric penalizing red light running
TTC FilteringImproved vehicle selection for TTC calculation
Weight AdjustmentsFine-tuned relative importance of components

Summary Table

CategoryMetricWeightFormulaRange
KinematicLinear Speed1x||v|| = ||(x_{t+1} - x_t)/Δt||0-30 m/s
KinematicLinear Accel1xa = (v_{t+1} - v_t)/Δt-8 to +4 m/s²
KinematicAngular Speed1xω = Δθ/Δt-0.5 to +0.5 rad/s
KinematicAngular Accel1xα = Δω/ΔtVariable
InteractiveDist to Nearest1xmin GJK distance0-50 m
InteractiveCollision2xBinary indicator{0, 1}
InteractiveTime to Collision1xConstant-velocity extrapolation0-10 s
Map-BasedDist to Road Edge1xSigned distance-5 to +20 m
Map-BasedOffroad2xBinary indicator{0, 1}

4. Baseline Approaches

Understanding baselines reveals what WOSAC rewards and penalizes.

4.1 Constant Velocity Baseline

Approach: Assume each agent continues with current velocity forever.

def constant_velocity_rollout(initial_state, n_steps=80):
    """
    Simplest possible baseline: linear extrapolation.
    """
    x, y, heading = initial_state['x'], initial_state['y'], initial_state['heading']
    vx, vy = initial_state['vx'], initial_state['vy']
    dt = 0.1  # 10 Hz

    trajectory = []
    for _ in range(n_steps):
        x += vx * dt
        y += vy * dt
        # heading unchanged
        trajectory.append({'x': x, 'y': y, 'heading': heading})

    return trajectory

Performance: Realism Meta ~0.08 (very poor)

Why It Fails:

  • Zero angular motion → fails angular metrics completely
  • No interaction awareness → many collisions
  • Ignores map → drives off-road
  • No deceleration → unrealistic at intersections
CONSTANT VELOCITY FAILURE MODES:
─────────────────────────────────

    Before:                     After 8 seconds:

       ┌────┐                      ┌────┐
       │ A  │→                     │ A  │→  (drove through intersection)
       └────┘                      └────┘
          ↑                           ↑
       ┌──┴──┐                    ┌──┴──┐
       │Road │                    │Road │
       └─────┘                    └─────┘
          ↑                           ↑
       ┌────┐                      ┌────┐
       │ B  │↑                     │ B  │↑  (collision with A!)
       └────┘                      └────┘

4.2 Log Replay (Oracle)

Approach: Directly replay recorded trajectories from the dataset.

def log_replay_rollout(scenario, agent_id):
    """
    Copy ground truth future trajectories.
    """
    return scenario.tracks[agent_id].states[11:91]  # Future 80 steps

Performance: Realism Meta ~0.72 (surprisingly not perfect!)

Why It's Not Perfect:

  • Replicates exact same trajectory 32 times → no diversity
  • Histogram has all mass in one bin → NLL can be high if real trajectory has ANY variation
  • Cannot adapt to different AV behaviors in closed-loop testing
LOG REPLAY LIMITATION:
─────────────────────

    Real Distribution:              Log Replay:

    P(speed)                        P(speed)
    │                               │
    │   ████                        │        █
    │  ██████                       │        █
    │ ████████                      │        █
    └──────────── speed             └──────────── speed
       ↑                                     ↑
    Natural variation               Single point = poor coverage

4.3 Intelligent Driver Model (IDM)

Approach: Classical car-following model with simple rules.

def idm_acceleration(v, s, v_lead, params):
    """
    Intelligent Driver Model for longitudinal control.

    Args:
        v: Current velocity
        s: Gap to lead vehicle
        v_lead: Lead vehicle velocity
        params: IDM parameters
    """
    v0 = params['desired_velocity']      # e.g., 30 m/s
    T = params['time_headway']           # e.g., 1.5 s
    a = params['max_acceleration']       # e.g., 3 m/s²
    b = params['comfortable_decel']      # e.g., 2 m/s²
    s0 = params['min_gap']               # e.g., 2 m

    # Desired gap
    delta_v = v - v_lead
    s_star = s0 + max(0, v * T + v * delta_v / (2 * np.sqrt(a * b)))

    # Acceleration
    acc = a * (1 - (v / v0)**4 - (s_star / s)**2)
    return np.clip(acc, -b, a)

Performance: Realism Meta ~0.45

Strengths:

  • Maintains safe following distances
  • Avoids rear-end collisions
  • Smooth acceleration profiles

Weaknesses:

  • Too reactive/accommodating
  • Cannot handle complex intersections
  • No lane changes or lateral behavior
  • Assumes simple single-lane scenarios

4.4 Baseline Comparison

MethodRealism MetaKinematicInteractiveMap-BasedCollisions
Constant Velocity0.080.020.150.12High
Log Replay0.720.850.650.70Low
IDM0.450.550.480.35Medium
Random Noise0.050.010.100.08Very High

Key Insight: Even "cheating" by replaying ground truth doesn't achieve perfect scores, because WOSAC measures distributional coverage, not just accuracy.


5. Winning Strategies

5.1 WOSAC 2023: Multiverse Transformer (MVTA)

1st Place - Realism Meta: 0.5168

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MULTIVERSE TRANSFORMER ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ENCODER                                                                   │
│   ───────                                                                   │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐                  │
│   │   Agent     │     │    Map      │     │    ADV      │                  │
│   │  History    │     │  Polylines  │     │   State     │                  │
│   └──────┬──────┘     └──────┬──────┘     └──────┬──────┘                  │
│          │                   │                   │                          │
│          └───────────────────┼───────────────────┘                          │
│                              ▼                                              │
│                    ┌─────────────────┐                                      │
│                    │   Transformer   │                                      │
│                    │    Encoder      │                                      │
│                    │  (Self-Attn)    │                                      │
│                    └────────┬────────┘                                      │
│                             │                                               │
│   DECODER (Autoregressive)  ▼                                               │
│   ────────────────────────────                                              │
│                    ┌─────────────────┐                                      │
│                    │   Transformer   │◄──── Cross-attention                 │
│                    │    Decoder      │      to encoder features             │
│                    └────────┬────────┘                                      │
│                             │                                               │
│                    ┌────────▼────────┐                                      │
│                    │   GMM Head      │                                      │
│                    │  (64 modes)     │                                      │
│                    └────────┬────────┘                                      │
│                             │                                               │
│                             ▼                                               │
│                    Position, Velocity, Heading                              │
│                    for next 0.1s timestep                                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Innovations

  1. Variable-Length History Sampling

    • Training: Randomly split trajectories into history/future
    • Creates more training samples from each trajectory
    • Improves generalization to different context lengths
  2. Receding Horizon Prediction

    • Predict 1-second ahead, but only use first 0.1s
    • Promotes multi-modal diversity
    • Reduces compounding error
  3. Top-K Sampling Strategy

    • Don't always pick max-likelihood trajectory
    • Periodically sample from top-3 modes
    • Balances realism with behavioral diversity
  4. Loss Function

loss = (
    1.0 * gaussian_nll_loss(pred_pos, gt_pos) +
    0.5 * l1_loss(pred_vel, gt_vel) +
    0.5 * l1_loss(pred_heading, gt_heading)
)

5.2 WOSAC 2023: MTR+++ (2nd Place)

Realism Meta: 0.4697

Collision-Mitigation Postprocessing

The key innovation was a graph-based collision avoidance algorithm:

def collision_mitigation(trajectories_per_agent, n_candidates=6):
    """
    Graph-based selection to avoid collisions.

    For each agent, we have n_candidates trajectory options.
    Goal: Select one trajectory per agent minimizing collisions.
    """
    n_agents = len(trajectories_per_agent)

    # Build collision adjacency matrix
    # adj[i,j,k,l] = 1 if agent i's trajectory k collides with agent j's trajectory l
    adj = compute_collision_matrix(trajectories_per_agent)

    # Greedy selection: find dense subgraph (density >= 0.95)
    selected = {}
    for agent_idx in range(n_agents):
        best_traj = None
        min_collisions = float('inf')

        for traj_idx in range(n_candidates):
            # Count collisions with already-selected trajectories
            collisions = sum(
                adj[agent_idx, other, traj_idx, selected[other]]
                for other in selected
            )
            if collisions < min_collisions:
                min_collisions = collisions
                best_traj = traj_idx

        selected[agent_idx] = best_traj

    return selected

5.3 WOSAC 2024: SMART

1st Place - Realism Meta: 0.7614

Next-Token Prediction Paradigm

SMART revolutionized the approach by treating trajectory generation as language modeling:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SMART: TRAJECTORY AS LANGUAGE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Traditional Approach:                                                     │
│   ─────────────────────                                                     │
│   Input → Encoder → Decoder → Continuous (x, y) coordinates                │
│                                                                             │
│   SMART Approach:                                                           │
│   ───────────────                                                           │
│   Input → Tokenizer → Decoder-Only → Next Token → Detokenize              │
│                                                                             │
│   TOKENIZATION:                                                             │
│   ─────────────                                                             │
│                                                                             │
│   Continuous trajectory:  [(0,0), (0.5,0.1), (1.1,0.3), ...]              │
│                               │                                             │
│                               ▼  K-Disk Clustering                          │
│                                                                             │
│   Token sequence:         [Token_42, Token_17, Token_89, ...]              │
│                                                                             │
│   Vocabulary size: ~2048 motion tokens + ~1000 map tokens                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Architecture Details

class SMARTModel:
    """
    Decoder-only transformer for trajectory generation.
    """
    def __init__(self):
        self.motion_vocab_size = 2048
        self.map_vocab_size = 1000

        # Factorized attention
        self.temporal_attention = SelfAttention()   # Within agent history
        self.map_attention = CrossAttention()        # Agent → nearby roads
        self.agent_attention = SelfAttention()       # Agent → other agents

    def forward(self, motion_tokens, map_tokens, agent_mask):
        """
        Predict next motion token for each agent.
        """
        # Temporal: attend to own history
        h = self.temporal_attention(motion_tokens)

        # Map: attend to nearby road tokens (50m radius)
        h = self.map_attention(h, map_tokens)

        # Agent: attend to neighboring agents
        h = self.agent_attention(h, agent_mask)

        # Predict next token
        logits = self.output_head(h)  # [batch, agents, vocab_size]
        return logits

Performance Breakthrough

Model SizeRealism MetaInference Time
SMART-7M0.7591<15 ms/frame
SMART-101M0.7614~30 ms/frame
Zero-shot (NuPlan only)0.7210-

5.4 WOSAC 2025: SMART-R1 and TrajTok

Top Entries - Realism Meta: 0.7858

SMART-R1: Reinforcement Learning Fine-Tuning

The key insight: directly optimize the WOSAC metric as a reward signal.

class SMARTR1:
    """
    Three-stage training pipeline:
    1. Behavior Cloning (BC) - open-loop pretraining
    2. Closed-loop SFT - address covariate shift
    3. RFT - direct metric optimization
    """

    def compute_mpo_loss(self, rollouts, reference_policy):
        """
        Metric-oriented Policy Optimization.

        Unlike GRPO, exploits predictable reward structure.
        """
        rewards = self.compute_wosac_metrics(rollouts)

        # Simple advantage: reward minus threshold
        alpha = 0.7  # Empirical threshold
        advantages = rewards - alpha

        # Policy gradient with KL penalty
        log_probs = self.policy.log_prob(rollouts)
        ref_log_probs = reference_policy.log_prob(rollouts)
        kl_penalty = log_probs - ref_log_probs

        loss = -(advantages * log_probs - self.beta * kl_penalty).mean()
        return loss

TrajTok: Improved Tokenization

TrajTok addresses limitations of k-disk clustering:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    TRAJTOK vs K-DISKS TOKENIZATION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   K-DISKS (Previous):                                                       │
│   ───────────────────                                                       │
│   • Pure data-driven clustering                                             │
│   • Limited to observed trajectories                                        │
│   • Asymmetric (left/right turns have different coverage)                  │
│   • Vocabulary: 2048 tokens                                                 │
│                                                                             │
│   TRAJTOK (2025):                                                           │
│   ───────────────                                                           │
│   • Hybrid rule + data approach                                             │
│   • Grid-based spatial division                                             │
│   • Explicit symmetry (x-axis flipping)                                     │
│   • Noise filtering for robustness                                          │
│   • Vocabulary: 8040 (vehicle), 3001 (ped), 2798 (cyclist)                 │
│                                                                             │
│   IMPROVEMENTS:                                                             │
│   ─────────────                                                             │
│   ┌──────────────────┬──────────┬──────────┐                               │
│   │ Metric           │ K-Disks  │ TrajTok  │                               │
│   ├──────────────────┼──────────┼──────────┤                               │
│   │ Realism Meta     │ 0.7814   │ 0.7852   │                               │
│   │ Map-based        │ 0.9153   │ 0.9207   │                               │
│   │ Interactive      │ 0.8089   │ 0.8116   │                               │
│   └──────────────────┴──────────┴──────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

5.5 Evolution of Winning Approaches

┌─────────────────────────────────────────────────────────────────────────────┐
│                    WOSAC WINNING APPROACH TIMELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   2023                    2024                    2025                      │
│   ────                    ────                    ────                      │
│                                                                             │
│   MVTA (0.52)             SMART (0.76)            SMART-R1 (0.79)          │
│      │                       │                       │                      │
│      ▼                       ▼                       ▼                      │
│   Transformer            Next-Token              RL Fine-Tuning             │
│   + GMM Head             Prediction              + WOSAC Reward             │
│   + Autoregressive       + Discrete Tokens       + Improved Tokens          │
│                                                                             │
│   KEY INSIGHT:           KEY INSIGHT:            KEY INSIGHT:               │
│   Multi-modal output     Treat trajectories      Directly optimize          │
│   with top-k sampling    as language             the evaluation metric      │
│                                                                             │
│   SCORE JUMP: +0.37 (from baselines)             SCORE JUMP: +0.03          │
│   SCORE JUMP: +0.24                              (diminishing returns)      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

6. Interactive Code Examples

6.1 Computing Kinematic Metrics

import numpy as np
from typing import Dict, List, Tuple

def compute_kinematic_metrics(
    trajectories: np.ndarray,  # Shape: [n_rollouts, n_agents, n_steps, 4]
    dt: float = 0.1
) -> Dict[str, np.ndarray]:
    """
    Compute kinematic metrics for WOSAC evaluation.

    Args:
        trajectories: Simulated trajectories [x, y, z, heading]
        dt: Timestep in seconds (0.1 for 10 Hz)

    Returns:
        Dictionary of metric values per agent per timestep
    """
    n_rollouts, n_agents, n_steps, _ = trajectories.shape

    # Extract components
    positions = trajectories[..., :3]  # [n_rollouts, n_agents, n_steps, 3]
    headings = trajectories[..., 3]    # [n_rollouts, n_agents, n_steps]

    # Linear velocity: ||v|| = ||(x_{t+1} - x_t) / dt||
    velocities = np.diff(positions, axis=2) / dt  # [n_rollouts, n_agents, n_steps-1, 3]
    linear_speed = np.linalg.norm(velocities, axis=-1)  # [n_rollouts, n_agents, n_steps-1]

    # Linear acceleration: (||v_{t+1}|| - ||v_t||) / dt
    linear_accel = np.diff(linear_speed, axis=2) / dt  # [n_rollouts, n_agents, n_steps-2]

    # Angular velocity: d(theta_{t+1}, theta_t) / dt
    heading_diff = np.diff(headings, axis=2)  # [n_rollouts, n_agents, n_steps-1]
    # Wrap to [-pi, pi]
    heading_diff = np.arctan2(np.sin(heading_diff), np.cos(heading_diff))
    angular_speed = heading_diff / dt

    # Angular acceleration
    angular_accel = np.diff(angular_speed, axis=2) / dt

    return {
        'linear_speed': linear_speed,
        'linear_acceleration': linear_accel,
        'angular_speed': angular_speed,
        'angular_acceleration': angular_accel,
    }


def compute_metric_likelihood(
    simulated_values: np.ndarray,  # [n_rollouts, ...]
    ground_truth_value: float,
    n_bins: int = 100,
    value_range: Tuple[float, float] = None,
    laplace_smoothing: float = 0.1
) -> float:
    """
    Compute likelihood of ground truth under simulated distribution.

    This is the core WOSAC metric computation.
    """
    # Flatten simulated values
    sim_flat = simulated_values.flatten()

    # Determine bin range
    if value_range is None:
        value_range = (sim_flat.min(), sim_flat.max())

    # Build histogram
    hist, bin_edges = np.histogram(sim_flat, bins=n_bins, range=value_range)

    # Laplace smoothing
    hist = hist + laplace_smoothing

    # Normalize to probability
    prob = hist / hist.sum()

    # Find bin for ground truth
    bin_idx = np.digitize(ground_truth_value, bin_edges) - 1
    bin_idx = np.clip(bin_idx, 0, n_bins - 1)

    # Compute NLL
    nll = -np.log(prob[bin_idx])

    # Convert to likelihood (0-1 scale)
    likelihood = np.exp(-nll)

    return likelihood


# Example usage
if __name__ == "__main__":
    # Simulate some trajectories
    np.random.seed(42)
    n_rollouts = 32
    n_agents = 10
    n_steps = 80

    # Random walk trajectories
    trajectories = np.zeros((n_rollouts, n_agents, n_steps, 4))
    trajectories[..., 0] = np.cumsum(np.random.randn(n_rollouts, n_agents, n_steps) * 0.5, axis=2)
    trajectories[..., 1] = np.cumsum(np.random.randn(n_rollouts, n_agents, n_steps) * 0.5, axis=2)
    trajectories[..., 3] = np.cumsum(np.random.randn(n_rollouts, n_agents, n_steps) * 0.1, axis=2)

    # Compute metrics
    metrics = compute_kinematic_metrics(trajectories)

    print("Kinematic Metrics Shapes:")
    for name, values in metrics.items():
        print(f"  {name}: {values.shape}")

    # Compute likelihood for one agent
    gt_speed = 5.0  # m/s
    speed_likelihood = compute_metric_likelihood(
        metrics['linear_speed'][:, 0, :],  # First agent
        gt_speed,
        value_range=(0, 30)
    )
    print(f"\nSpeed likelihood for GT=5.0 m/s: {speed_likelihood:.4f}")

6.2 Computing Interactive Metrics

import numpy as np
from scipy.spatial import ConvexHull
from typing import List, Tuple

def get_corners(x: float, y: float, heading: float,
                length: float, width: float) -> np.ndarray:
    """Get bounding box corners in world coordinates."""
    # Half dimensions
    l2, w2 = length / 2, width / 2

    # Local corners (centered at origin)
    corners_local = np.array([
        [l2, w2],
        [l2, -w2],
        [-l2, -w2],
        [-l2, w2]
    ])

    # Rotation matrix
    cos_h, sin_h = np.cos(heading), np.sin(heading)
    R = np.array([[cos_h, -sin_h], [sin_h, cos_h]])

    # Transform to world
    corners_world = corners_local @ R.T + np.array([x, y])
    return corners_world


def polygon_distance(poly1: np.ndarray, poly2: np.ndarray) -> float:
    """
    Compute signed distance between two convex polygons.

    Positive = separated
    Negative = overlapping

    Uses simplified GJK-like algorithm.
    """
    from scipy.spatial import ConvexHull
    from shapely.geometry import Polygon

    p1 = Polygon(poly1)
    p2 = Polygon(poly2)

    if p1.intersects(p2):
        # Overlapping - return negative distance
        intersection = p1.intersection(p2)
        return -np.sqrt(intersection.area)
    else:
        # Separated - return positive distance
        return p1.distance(p2)


def compute_interactive_metrics(
    trajectories: np.ndarray,  # [n_rollouts, n_agents, n_steps, 4]
    agent_sizes: np.ndarray,   # [n_agents, 2] (length, width)
    dt: float = 0.1
) -> dict:
    """
    Compute interactive metrics for WOSAC.
    """
    n_rollouts, n_agents, n_steps, _ = trajectories.shape

    # Distance to nearest object
    dist_to_nearest = np.full((n_rollouts, n_agents, n_steps), np.inf)

    # Collision indicator
    collision_occurred = np.zeros((n_rollouts, n_agents), dtype=bool)

    for r in range(n_rollouts):
        for t in range(n_steps):
            for i in range(n_agents):
                x_i, y_i, _, h_i = trajectories[r, i, t]
                corners_i = get_corners(x_i, y_i, h_i,
                                       agent_sizes[i, 0], agent_sizes[i, 1])

                min_dist = np.inf
                for j in range(n_agents):
                    if i == j:
                        continue

                    x_j, y_j, _, h_j = trajectories[r, j, t]
                    corners_j = get_corners(x_j, y_j, h_j,
                                           agent_sizes[j, 0], agent_sizes[j, 1])

                    dist = polygon_distance(corners_i, corners_j)
                    min_dist = min(min_dist, dist)

                    if dist < 0:
                        collision_occurred[r, i] = True

                dist_to_nearest[r, i, t] = min_dist

    # Time to collision (simplified: assume constant velocity)
    ttc = compute_ttc(trajectories, agent_sizes, dt)

    return {
        'distance_to_nearest': dist_to_nearest,
        'collision_indicator': collision_occurred,
        'time_to_collision': ttc,
    }


def compute_ttc(trajectories: np.ndarray, agent_sizes: np.ndarray,
                dt: float) -> np.ndarray:
    """
    Compute time-to-collision assuming constant velocities.
    """
    n_rollouts, n_agents, n_steps, _ = trajectories.shape
    ttc = np.full((n_rollouts, n_agents, n_steps), np.inf)

    # Compute velocities
    velocities = np.diff(trajectories[..., :2], axis=2) / dt

    for r in range(n_rollouts):
        for t in range(n_steps - 1):
            for i in range(n_agents):
                pos_i = trajectories[r, i, t, :2]
                vel_i = velocities[r, i, t]

                min_ttc = np.inf
                for j in range(n_agents):
                    if i == j:
                        continue

                    pos_j = trajectories[r, j, t, :2]
                    vel_j = velocities[r, j, t]

                    # Relative position and velocity
                    rel_pos = pos_j - pos_i
                    rel_vel = vel_j - vel_i

                    # Time to closest approach
                    if np.dot(rel_vel, rel_vel) > 1e-6:
                        t_cpa = -np.dot(rel_pos, rel_vel) / np.dot(rel_vel, rel_vel)
                        if t_cpa > 0:
                            # Check if collision occurs
                            closest_dist = np.linalg.norm(rel_pos + t_cpa * rel_vel)
                            collision_threshold = (agent_sizes[i, 0] + agent_sizes[j, 0]) / 2

                            if closest_dist < collision_threshold:
                                min_ttc = min(min_ttc, t_cpa)

                ttc[r, i, t] = min_ttc

    return ttc


# Example usage
if __name__ == "__main__":
    # Two vehicles approaching each other
    n_rollouts, n_agents, n_steps = 32, 2, 80

    trajectories = np.zeros((n_rollouts, n_agents, n_steps, 4))

    # Agent 0: moving right
    trajectories[:, 0, :, 0] = np.linspace(0, 40, n_steps)  # x
    trajectories[:, 0, :, 1] = 0  # y
    trajectories[:, 0, :, 3] = 0  # heading = 0 (facing right)

    # Agent 1: moving left
    trajectories[:, 1, :, 0] = np.linspace(50, 10, n_steps)  # x
    trajectories[:, 1, :, 1] = 0  # y
    trajectories[:, 1, :, 3] = np.pi  # heading = pi (facing left)

    # Vehicle sizes
    agent_sizes = np.array([[4.5, 2.0], [4.5, 2.0]])

    metrics = compute_interactive_metrics(trajectories, agent_sizes)

    print("Interactive Metrics:")
    print(f"  Collision rate: {metrics['collision_indicator'].mean():.2%}")
    print(f"  Min distance: {metrics['distance_to_nearest'].min():.2f}m")
    print(f"  Min TTC: {metrics['time_to_collision'][metrics['time_to_collision'] < np.inf].min():.2f}s")

6.3 Computing Map-Based Metrics

import numpy as np
from shapely.geometry import Point, LineString, Polygon
from typing import List, Dict

def compute_map_metrics(
    trajectories: np.ndarray,      # [n_rollouts, n_agents, n_steps, 4]
    road_polygons: List[Polygon],  # List of drivable area polygons
    road_edges: List[LineString],  # List of road boundary lines
) -> Dict[str, np.ndarray]:
    """
    Compute map-based metrics for WOSAC.
    """
    n_rollouts, n_agents, n_steps, _ = trajectories.shape

    # Distance to road edge
    dist_to_edge = np.zeros((n_rollouts, n_agents, n_steps))

    # Offroad indicator
    offroad_occurred = np.zeros((n_rollouts, n_agents), dtype=bool)

    # Combine road polygons into single drivable area
    from shapely.ops import unary_union
    drivable_area = unary_union(road_polygons)

    for r in range(n_rollouts):
        for i in range(n_agents):
            for t in range(n_steps):
                x, y = trajectories[r, i, t, :2]
                point = Point(x, y)

                # Check if on road
                is_on_road = drivable_area.contains(point)

                if is_on_road:
                    # Positive distance to edge
                    dist = drivable_area.exterior.distance(point)
                    dist_to_edge[r, i, t] = dist
                else:
                    # Negative distance (off road)
                    dist = -drivable_area.exterior.distance(point)
                    dist_to_edge[r, i, t] = dist
                    offroad_occurred[r, i] = True

    return {
        'distance_to_road_edge': dist_to_edge,
        'offroad_indicator': offroad_occurred,
    }


def create_sample_road_geometry():
    """Create sample road geometry for testing."""
    from shapely.geometry import box

    # Simple straight road
    road = box(-10, -5, 100, 5)  # x: -10 to 100, y: -5 to 5

    # Add intersection
    intersection = box(40, -20, 60, 20)

    drivable_area = road.union(intersection)

    return [drivable_area], [drivable_area.exterior]


# Example
if __name__ == "__main__":
    # Create sample trajectories
    n_rollouts, n_agents, n_steps = 32, 3, 80
    trajectories = np.zeros((n_rollouts, n_agents, n_steps, 4))

    # Agent 0: stays on road
    trajectories[:, 0, :, 0] = np.linspace(0, 80, n_steps)
    trajectories[:, 0, :, 1] = 0

    # Agent 1: goes slightly off road
    trajectories[:, 1, :, 0] = np.linspace(0, 80, n_steps)
    trajectories[:, 1, :, 1] = np.linspace(0, 8, n_steps)  # Drifts off

    # Agent 2: stays in intersection area
    trajectories[:, 2, :, 0] = 50
    trajectories[:, 2, :, 1] = np.linspace(-15, 15, n_steps)

    road_polygons, road_edges = create_sample_road_geometry()

    metrics = compute_map_metrics(trajectories, road_polygons, road_edges)

    print("Map-Based Metrics:")
    print(f"  Offroad rate: {metrics['offroad_indicator'].mean():.2%}")
    print(f"  Min road distance: {metrics['distance_to_road_edge'].min():.2f}m")

6.4 Full Realism Meta-Metric Computation

import numpy as np
from typing import Dict

def compute_realism_meta_metric(
    kinematic_metrics: Dict[str, np.ndarray],
    interactive_metrics: Dict[str, np.ndarray],
    map_metrics: Dict[str, np.ndarray],
    ground_truth: Dict[str, np.ndarray],
) -> Dict[str, float]:
    """
    Compute the full WOSAC Realism Meta-Metric.

    Returns individual component scores and final meta-metric.
    """

    # Metric weights (collision and offroad are 2x)
    weights = {
        'linear_speed': 1.0,
        'linear_acceleration': 1.0,
        'angular_speed': 1.0,
        'angular_acceleration': 1.0,
        'distance_to_nearest': 1.0,
        'collision_indicator': 2.0,  # Safety-critical
        'time_to_collision': 1.0,
        'distance_to_road_edge': 1.0,
        'offroad_indicator': 2.0,    # Safety-critical
    }

    total_weight = sum(weights.values())  # = 11

    # Compute likelihood for each metric
    likelihoods = {}

    # Kinematic metrics
    for name in ['linear_speed', 'linear_acceleration',
                 'angular_speed', 'angular_acceleration']:
        likelihoods[name] = compute_metric_likelihood(
            kinematic_metrics[name],
            ground_truth[name],
            n_bins=100
        )

    # Interactive metrics
    likelihoods['distance_to_nearest'] = compute_metric_likelihood(
        interactive_metrics['distance_to_nearest'],
        ground_truth['distance_to_nearest']
    )

    # Collision: special handling for binary metric
    collision_rate = interactive_metrics['collision_indicator'].mean()
    gt_collision = ground_truth.get('collision_indicator', 0.05)  # ~5% baseline
    likelihoods['collision_indicator'] = 1.0 - abs(collision_rate - gt_collision)

    likelihoods['time_to_collision'] = compute_metric_likelihood(
        interactive_metrics['time_to_collision'],
        ground_truth['time_to_collision']
    )

    # Map metrics
    likelihoods['distance_to_road_edge'] = compute_metric_likelihood(
        map_metrics['distance_to_road_edge'],
        ground_truth['distance_to_road_edge']
    )

    # Offroad: binary metric
    offroad_rate = map_metrics['offroad_indicator'].mean()
    gt_offroad = ground_truth.get('offroad_indicator', 0.02)  # ~2% baseline
    likelihoods['offroad_indicator'] = 1.0 - abs(offroad_rate - gt_offroad)

    # Compute category scores
    kinematic_score = np.mean([
        likelihoods['linear_speed'],
        likelihoods['linear_acceleration'],
        likelihoods['angular_speed'],
        likelihoods['angular_acceleration'],
    ])

    interactive_score = np.mean([
        likelihoods['distance_to_nearest'],
        likelihoods['collision_indicator'],
        likelihoods['time_to_collision'],
    ])

    map_score = np.mean([
        likelihoods['distance_to_road_edge'],
        likelihoods['offroad_indicator'],
    ])

    # Compute weighted meta-metric
    meta_metric = sum(
        weights[name] * likelihoods[name]
        for name in weights
    ) / total_weight

    return {
        'realism_meta_metric': meta_metric,
        'kinematic_score': kinematic_score,
        'interactive_score': interactive_score,
        'map_score': map_score,
        **{f'likelihood_{k}': v for k, v in likelihoods.items()}
    }


# Example evaluation
if __name__ == "__main__":
    # Simulated metrics (from previous examples)
    kinematic = {
        'linear_speed': np.random.rayleigh(5, (32, 10, 79)),
        'linear_acceleration': np.random.normal(0, 1, (32, 10, 78)),
        'angular_speed': np.random.normal(0, 0.1, (32, 10, 79)),
        'angular_acceleration': np.random.normal(0, 0.05, (32, 10, 78)),
    }

    interactive = {
        'distance_to_nearest': np.random.exponential(5, (32, 10, 80)),
        'collision_indicator': np.random.random((32, 10)) < 0.05,
        'time_to_collision': np.random.exponential(3, (32, 10, 80)),
    }

    map_based = {
        'distance_to_road_edge': np.random.exponential(2, (32, 10, 80)),
        'offroad_indicator': np.random.random((32, 10)) < 0.02,
    }

    # Ground truth (typical values)
    ground_truth = {
        'linear_speed': 5.5,
        'linear_acceleration': 0.2,
        'angular_speed': 0.05,
        'angular_acceleration': 0.01,
        'distance_to_nearest': 4.0,
        'collision_indicator': 0.03,
        'time_to_collision': 3.5,
        'distance_to_road_edge': 2.5,
        'offroad_indicator': 0.01,
    }

    results = compute_realism_meta_metric(
        kinematic, interactive, map_based, ground_truth
    )

    print("WOSAC Evaluation Results:")
    print(f"  Realism Meta-Metric: {results['realism_meta_metric']:.4f}")
    print(f"  Kinematic Score:     {results['kinematic_score']:.4f}")
    print(f"  Interactive Score:   {results['interactive_score']:.4f}")
    print(f"  Map Score:           {results['map_score']:.4f}")

7. Submission Format

7.1 Proto Message Structure

WOSAC submissions use Protocol Buffers for efficient serialization:

// sim_agents_submission.proto

message SimulatedTrajectory {
  // Object identifier (must match scenario)
  int32 object_id = 1;

  // Future positions (80 timesteps at 10 Hz)
  repeated float center_x = 2;
  repeated float center_y = 3;
  repeated float center_z = 4;
  repeated float heading = 5;
}

message JointScene {
  // One trajectory per simulated agent
  repeated SimulatedTrajectory simulated_trajectories = 1;
}

message ScenarioRollouts {
  // Scenario identifier
  string scenario_id = 1;

  // 32 parallel rollouts
  repeated JointScene joint_scenes = 2;
}

message SimAgentsChallengeSubmission {
  // Multiple scenarios
  repeated ScenarioRollouts scenario_rollouts = 1;

  // Metadata
  SubmissionType submission_type = 2;
  string account_name = 3;
  string unique_method_name = 4;
  repeated string authors = 5;
  string affiliation = 6;
  string description = 7;
  string method_link = 8;

  // Data usage flags
  bool uses_lidar_data = 9;
  bool uses_camera_data = 10;
  bool uses_public_model_pretraining = 11;
  string num_model_parameters = 12;

  // Compliance
  bool acknowledge_complies_with_closed_loop_requirement = 13;
}

7.2 Complete Submission Pipeline

import os
import tarfile
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2
from waymo_open_dataset.protos import sim_agents_submission_pb2
from waymo_open_dataset.utils.sim_agents import submission_specs

def create_wosac_submission(
    model,
    data_path: str,
    output_dir: str,
    method_name: str = "my_method",
    authors: list = ["Anonymous"],
    affiliation: str = "Research Lab"
):
    """
    Complete pipeline to create a WOSAC submission.
    """
    os.makedirs(output_dir, exist_ok=True)

    # Get submission configuration
    challenge_type = submission_specs.ChallengeType.SIM_AGENTS
    config = submission_specs.get_submission_config(challenge_type)

    # Process each shard
    filenames = tf.io.gfile.glob(f"{data_path}/*.tfrecord*")
    output_filenames = []

    for shard_idx, shard_filename in enumerate(filenames):
        print(f"Processing shard {shard_idx + 1}/{len(filenames)}")

        # Load scenarios from shard
        dataset = tf.data.TFRecordDataset([shard_filename])
        scenario_rollouts = []

        for scenario_bytes in dataset.as_numpy_iterator():
            scenario = scenario_pb2.Scenario.FromString(scenario_bytes)

            # Get agents to simulate
            sim_agent_ids = submission_specs.get_sim_agent_ids(
                scenario, challenge_type
            )

            # Generate 32 rollouts
            joint_scenes = []
            for rollout_idx in range(config.n_rollouts):  # 32
                # Run model simulation
                trajectories = model.simulate(
                    scenario,
                    sim_agent_ids,
                    n_steps=config.n_simulation_steps,  # 80
                    seed=rollout_idx
                )

                # Create JointScene proto
                simulated_trajectories = []
                for agent_id, traj in zip(sim_agent_ids, trajectories):
                    simulated_trajectories.append(
                        sim_agents_submission_pb2.SimulatedTrajectory(
                            object_id=int(agent_id),
                            center_x=traj[:, 0].tolist(),
                            center_y=traj[:, 1].tolist(),
                            center_z=traj[:, 2].tolist(),
                            heading=traj[:, 3].tolist(),
                        )
                    )

                joint_scenes.append(
                    sim_agents_submission_pb2.JointScene(
                        simulated_trajectories=simulated_trajectories
                    )
                )

            # Create ScenarioRollouts proto
            scenario_rollouts.append(
                sim_agents_submission_pb2.ScenarioRollouts(
                    scenario_id=scenario.scenario_id,
                    joint_scenes=joint_scenes,
                )
            )

        # Create shard submission
        shard_submission = sim_agents_submission_pb2.SimAgentsChallengeSubmission(
            scenario_rollouts=scenario_rollouts,
            submission_type=sim_agents_submission_pb2.SimAgentsChallengeSubmission.SIM_AGENTS_SUBMISSION,
            account_name='your_email@example.com',
            unique_method_name=method_name,
            authors=authors,
            affiliation=affiliation,
            description='Submission description',
            method_link='https://github.com/your/repo',
            uses_lidar_data=False,
            uses_camera_data=False,
            uses_public_model_pretraining=False,
            num_model_parameters='1000000',
            acknowledge_complies_with_closed_loop_requirement=True,
        )

        # Save shard
        shard_suffix = f'-{shard_idx:05d}-of-{len(filenames):05d}'
        output_filename = f'submission.binproto{shard_suffix}'
        output_path = os.path.join(output_dir, output_filename)

        with open(output_path, 'wb') as f:
            f.write(shard_submission.SerializeToString())

        output_filenames.append(output_filename)

    # Create tarball
    tarball_path = os.path.join(output_dir, 'submission.tar.gz')
    with tarfile.open(tarball_path, 'w:gz') as tar:
        for filename in output_filenames:
            tar.add(
                os.path.join(output_dir, filename),
                arcname=filename
            )

    print(f"Submission created: {tarball_path}")
    return tarball_path


# Validation function
def validate_submission(submission_path: str, scenario: scenario_pb2.Scenario):
    """
    Validate submission proto against scenario requirements.
    """
    with open(submission_path, 'rb') as f:
        submission = sim_agents_submission_pb2.SimAgentsChallengeSubmission.FromString(
            f.read()
        )

    errors = []

    for scenario_rollouts in submission.scenario_rollouts:
        # Check number of rollouts
        if len(scenario_rollouts.joint_scenes) != 32:
            errors.append(
                f"Scenario {scenario_rollouts.scenario_id}: "
                f"Expected 32 rollouts, got {len(scenario_rollouts.joint_scenes)}"
            )

        for rollout_idx, joint_scene in enumerate(scenario_rollouts.joint_scenes):
            for traj in joint_scene.simulated_trajectories:
                # Check trajectory length
                if len(traj.center_x) != 80:
                    errors.append(
                        f"Scenario {scenario_rollouts.scenario_id}, "
                        f"Rollout {rollout_idx}, Agent {traj.object_id}: "
                        f"Expected 80 timesteps, got {len(traj.center_x)}"
                    )

    if errors:
        print("Validation FAILED:")
        for error in errors[:10]:  # Show first 10 errors
            print(f"  - {error}")
        return False

    print("Validation PASSED")
    return True

7.3 Submission Checklist

┌─────────────────────────────────────────────────────────────────────────────┐
│                      WOSAC SUBMISSION CHECKLIST                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   BEFORE SUBMISSION:                                                        │
│   □ All 44,920 test scenarios processed                                     │
│   □ Exactly 32 rollouts per scenario                                        │
│   □ Exactly 80 timesteps per agent per rollout                             │
│   □ All required agents included (valid at t=10)                           │
│   □ center_x, center_y, center_z, heading all populated                    │
│   □ object_id matches scenario proto                                        │
│   □ scenario_id matches exactly                                             │
│                                                                             │
│   COMPLIANCE:                                                               │
│   □ Closed-loop simulation (not pre-computed trajectories)                  │
│   □ No access to future ground truth during simulation                      │
│   □ acknowledge_complies_with_closed_loop_requirement = True               │
│                                                                             │
│   METADATA:                                                                 │
│   □ Valid email in account_name                                             │
│   □ Unique method_name (no special characters)                              │
│   □ Authors listed                                                          │
│   □ Description explains approach                                           │
│   □ Data usage flags accurate (lidar, camera, pretraining)                 │
│   □ Parameter count specified                                               │
│                                                                             │
│   FILE FORMAT:                                                              │
│   □ Sharded .binproto files                                                 │
│   □ Packaged as submission.tar.gz                                           │
│   □ Total size < 20 GB                                                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

8. Mental Models

8.1 What Does "Realistic" Mean?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE THREE PILLARS OF REALISM                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                           REALISTIC BEHAVIOR                                │
│                                  ║                                          │
│              ╔═══════════════════╬═══════════════════╗                      │
│              ║                   ║                   ║                      │
│              ▼                   ▼                   ▼                      │
│   ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐              │
│   │   PHYSICALLY    │ │    SOCIALLY     │ │   CONTEXTUALLY  │              │
│   │   PLAUSIBLE     │ │   APPROPRIATE   │ │    GROUNDED     │              │
│   └────────┬────────┘ └────────┬────────┘ └────────┬────────┘              │
│            │                   │                   │                        │
│   • Kinematic limits   • Collision avoidance  • Stay on roads              │
│   • Smooth motion      • Safe following       • Obey signals               │
│   • Vehicle dynamics   • Yielding behavior    • Lane discipline            │
│   • Turn radius        • Merging etiquette    • Speed limits               │
│                                                                             │
│   MEASURED BY:         MEASURED BY:           MEASURED BY:                  │
│   Kinematic metrics    Interactive metrics    Map-based metrics            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

8.2 Distribution vs. Point Prediction

┌─────────────────────────────────────────────────────────────────────────────┐
│              WHY 32 ROLLOUTS? THE DISTRIBUTION MATCHING INSIGHT             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   WRONG APPROACH: Single Best Prediction                                    │
│   ─────────────────────────────────────                                     │
│                                                                             │
│   Real futures:     Predicted:          Problem:                           │
│                                                                             │
│      ╱─────         ─────────           Only covers ONE possibility        │
│     ╱                                   Miss rare but valid behaviors      │
│    ●─────────       ●─────────          Fails when real differs           │
│     ╲                                                                       │
│      ╲─────                                                                 │
│                                                                             │
│   CORRECT APPROACH: Distribution Matching                                   │
│   ───────────────────────────────────────                                   │
│                                                                             │
│   Real futures:     32 Samples:         Benefit:                           │
│                                                                             │
│      ╱─────         ╱─────────          COVERS the distribution            │
│     ╱              ╱─────────           Histogram density matches          │
│    ●─────────     ●──────────           Real sample gets HIGH likelihood   │
│     ╲              ╲─────────                                               │
│      ╲─────         ╲─────────                                              │
│                                                                             │
│   KEY INSIGHT: We don't predict THE future, we predict POSSIBLE futures    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

8.3 The Goldilocks Zone of Agent Behavior

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AGENT BEHAVIOR SPECTRUM                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   TOO PASSIVE                 JUST RIGHT                   TOO AGGRESSIVE  │
│   (IDM-like)                  (Human-like)                 (Log Replay)    │
│                                                                             │
│   ◄────────────────────────────────┼────────────────────────────►          │
│                                    │                                        │
│   Characteristics:              Characteristics:            Characteristics:│
│   • Always yields               • Context-dependent         • Never yields  │
│   • Large gaps                  • Appropriate gaps          • Tiny gaps     │
│   • Never enters               • Confident merging          • Forces entry  │
│     contested space            • Negotiates right-of-way   • Ignores others│
│                                                                             │
│   WOSAC Score: ~0.45           WOSAC Score: ~0.75+          WOSAC Score:    │
│                                                             ~0.72 (oracle)  │
│   Problem:                     Goal:                        Problem:        │
│   Unrealistic traffic flow     Match real driver behavior  No adaptation   │
│   AV never challenged          Emergent interactions       to AV behavior  │
│                                                                             │
│   Example:                     Example:                     Example:        │
│   Car waits forever            Car yields, then goes       Car goes always │
│   at 4-way stop                when appropriate            regardless      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

8.4 Closed-Loop vs. Open-Loop

┌─────────────────────────────────────────────────────────────────────────────┐
│               OPEN-LOOP vs. CLOSED-LOOP SIMULATION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   OPEN-LOOP (Prediction):                                                   │
│   ───────────────────────                                                   │
│                                                                             │
│   t=0      t=1      t=2      t=3      t=4                                  │
│    │        │        │        │        │                                    │
│    ●───────►◯───────►◯───────►◯───────►◯    Predict all at once            │
│         (no feedback from simulation)                                       │
│                                                                             │
│   Problem: Predictions diverge from reality                                │
│            No reaction to other agents' actual behavior                    │
│                                                                             │
│   CLOSED-LOOP (Simulation):                                                │
│   ─────────────────────────                                                 │
│                                                                             │
│   t=0      t=1      t=2      t=3      t=4                                  │
│    │        │        │        │        │                                    │
│    ●───────►●───────►●───────►●───────►●                                   │
│         │        │        │        │                                        │
│         └────────┴────────┴────────┘                                        │
│              (feedback at each step)                                        │
│                                                                             │
│   Benefit: Agents REACT to each other                                      │
│            Emergent interactions                                           │
│            Matches how AV testing actually works                           │
│                                                                             │
│   WOSAC REQUIRES: Closed-loop simulation                                   │
│   Submissions must NOT pre-compute trajectories                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

8.5 Why Metrics Matter for AV Safety

┌─────────────────────────────────────────────────────────────────────────────┐
│              THE SIMULATION→SAFETY PIPELINE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   GOAL: Prove AV is safe before deployment                                 │
│                                                                             │
│   ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐    │
│   │   Train    │    │   Test in  │    │  Identify  │    │   Deploy   │    │
│   │   AV       │───►│ Simulation │───►│   Fixes    │───►│   Safely   │    │
│   │  Planner   │    │            │    │            │    │            │    │
│   └────────────┘    └────────────┘    └────────────┘    └────────────┘    │
│                            │                                                │
│                            ▼                                                │
│                     BUT: Simulation only useful if                         │
│                          agents behave realistically!                      │
│                                                                             │
│   IF AGENTS TOO PASSIVE:                IF AGENTS TOO AGGRESSIVE:          │
│   ──────────────────────                ─────────────────────────          │
│   • AV never challenged                 • AV trained to be timid           │
│   • Edge cases not found                • False failure modes              │
│   • Overconfident deployment            • Unnecessary conservatism         │
│                                                                             │
│   WOSAC METRICS ANSWER:                                                    │
│   "Are the simulated agents behaving like real humans?"                    │
│                                                                             │
│   High WOSAC score → High-quality simulation → Trustworthy AV testing     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

9. Hands-On Exercises

Exercise 1: Implement Constant Velocity Baseline

Objective: Create the simplest possible WOSAC submission.

"""
Exercise 1: Constant Velocity Baseline

Tasks:
1. Load a scenario from WOMD
2. Extract initial states for all sim agents
3. Propagate with constant velocity for 80 steps
4. Add small noise for diversity across 32 rollouts
5. Package into submission proto
"""

import numpy as np
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2
from waymo_open_dataset.protos import sim_agents_submission_pb2

def exercise_1_constant_velocity(scenario_bytes: bytes) -> sim_agents_submission_pb2.ScenarioRollouts:
    """
    YOUR CODE HERE

    Steps:
    1. Parse scenario from bytes
    2. Extract agent states at t=10 (current time)
    3. Compute velocities from t=9 to t=10
    4. Generate 32 rollouts with small noise
    5. Return ScenarioRollouts proto
    """
    # Parse scenario
    scenario = scenario_pb2.Scenario.FromString(scenario_bytes)

    # TODO: Extract initial states
    # Hint: scenario.tracks contains all agent trajectories
    # Hint: Each track has states.center_x, states.center_y, etc.

    # TODO: Get sim agent IDs
    # Hint: Valid agents at timestep 10

    # TODO: Compute velocities
    # Hint: v = (pos[10] - pos[9]) / 0.1

    # TODO: Generate 32 rollouts
    # Hint: Add Gaussian noise with std=0.01 per step

    # TODO: Create proto messages

    pass  # Replace with your implementation


# Test your implementation
def test_exercise_1():
    """Test constant velocity baseline."""
    # Create dummy scenario
    # In practice, load from WOMD tfrecord

    # Verify output format
    # - 32 joint scenes
    # - 80 timesteps per trajectory
    # - All sim agents present
    pass


# Solution hint structure
SOLUTION_SKELETON = """
def exercise_1_solution(scenario_bytes):
    scenario = scenario_pb2.Scenario.FromString(scenario_bytes)

    # Get valid agents at current time (index 10)
    current_time = 10
    sim_agent_ids = []
    initial_states = {}

    for track in scenario.tracks:
        if track.states[current_time].valid:
            sim_agent_ids.append(track.id)
            initial_states[track.id] = {
                'x': track.states[current_time].center_x,
                'y': track.states[current_time].center_y,
                'z': track.states[current_time].center_z,
                'heading': track.states[current_time].heading,
                'vx': (track.states[current_time].center_x -
                       track.states[current_time-1].center_x) / 0.1,
                'vy': (track.states[current_time].center_y -
                       track.states[current_time-1].center_y) / 0.1,
            }

    # Generate 32 rollouts
    joint_scenes = []
    for rollout in range(32):
        trajectories = []
        for agent_id in sim_agent_ids:
            state = initial_states[agent_id]
            traj = {'x': [], 'y': [], 'z': [], 'heading': []}

            x, y, z = state['x'], state['y'], state['z']
            heading = state['heading']

            for t in range(80):
                # Add noise
                noise_x = np.random.normal(0, 0.01)
                noise_y = np.random.normal(0, 0.01)

                x += state['vx'] * 0.1 + noise_x
                y += state['vy'] * 0.1 + noise_y

                traj['x'].append(x)
                traj['y'].append(y)
                traj['z'].append(z)
                traj['heading'].append(heading)

            trajectories.append((agent_id, traj))

        # Create JointScene
        ...

    return scenario_rollouts
"""

Exercise 2: Compute Metrics Locally

Objective: Understand metric computation by implementing it yourself.

"""
Exercise 2: Local Metric Computation

Tasks:
1. Implement linear speed metric
2. Implement collision detection
3. Implement offroad detection
4. Combine into mini realism score
"""

import numpy as np

def exercise_2_metrics(
    trajectories: np.ndarray,  # [32, n_agents, 80, 4]
    ground_truth: np.ndarray,  # [n_agents, 80, 4]
    road_boundary: np.ndarray, # Road polygon vertices
) -> dict:
    """
    YOUR CODE HERE

    Compute metrics for 32 simulated rollouts against ground truth.

    Returns:
        {
            'linear_speed_likelihood': float,
            'collision_rate': float,
            'offroad_rate': float,
            'mini_realism_score': float,
        }
    """

    # TODO: Compute linear speeds for all rollouts
    # Shape: [32, n_agents, 79]

    # TODO: Build histogram of simulated speeds

    # TODO: Compute likelihood of ground truth under histogram

    # TODO: Detect collisions (simplified: center-to-center distance < threshold)

    # TODO: Detect offroad (point-in-polygon test)

    # TODO: Combine into score

    pass


def test_exercise_2():
    """Test metric computation."""
    np.random.seed(42)

    # Generate test data
    n_agents = 5
    trajectories = np.random.randn(32, n_agents, 80, 4)
    trajectories[..., :2] = np.cumsum(trajectories[..., :2] * 0.5, axis=2)

    ground_truth = trajectories[0]  # Use first rollout as GT

    # Simple rectangular road
    road_boundary = np.array([
        [-100, -10],
        [100, -10],
        [100, 10],
        [-100, 10],
    ])

    metrics = exercise_2_metrics(trajectories, ground_truth, road_boundary)

    print("Computed Metrics:")
    for name, value in metrics.items():
        print(f"  {name}: {value:.4f}")

Exercise 3: Improve Baseline with Simple Rules

Objective: Add rule-based improvements to constant velocity.

"""
Exercise 3: Rule-Enhanced Baseline

Add simple rules to improve constant velocity baseline:
1. Collision avoidance: Slow down if about to collide
2. Road following: Steer back if drifting off road
3. Speed limits: Cap maximum speed
"""

def exercise_3_rule_enhanced(
    scenario,
    max_speed: float = 20.0,  # m/s
    safe_distance: float = 5.0,  # meters
    road_margin: float = 2.0,  # meters from edge
) -> sim_agents_submission_pb2.ScenarioRollouts:
    """
    YOUR CODE HERE

    Implement rule-enhanced simulation:

    For each timestep:
    1. Compute default action (constant velocity)
    2. Check collision risk with other agents
       - If distance < safe_distance: reduce speed
    3. Check road boundary
       - If distance < road_margin: steer toward road center
    4. Apply speed limit
    5. Update state
    """

    # TODO: Initialize from scenario

    # TODO: Simulation loop with rules

    # TODO: Package into proto

    pass


# Hints for implementation
EXERCISE_3_HINTS = """
Collision Avoidance Logic:
    for each agent i:
        for each other agent j:
            dist = ||pos_i - pos_j||
            if dist < safe_distance:
                # Reduce speed proportionally
                speed_factor = dist / safe_distance
                velocity *= speed_factor

Road Following Logic:
    for each agent:
        dist_to_edge = compute_road_distance(pos)
        if dist_to_edge < road_margin:
            # Compute steering toward road center
            center_direction = compute_center_direction(pos, road)
            heading += steering_gain * center_direction

Speed Limit Logic:
    speed = ||velocity||
    if speed > max_speed:
        velocity *= max_speed / speed
"""

Exercise 4: Analyze Winning Solutions

Objective: Understand what makes top solutions work.

"""
Exercise 4: Solution Analysis

Tasks:
1. Read MVTA paper and identify 3 key innovations
2. Compare SMART tokenization to continuous prediction
3. Explain why RL fine-tuning helps (SMART-R1)
"""

# Answer template
EXERCISE_4_ANSWERS = """
Question 1: What are 3 key innovations in MVTA (2023 winner)?

YOUR ANSWER:
1. _______________________________________________
2. _______________________________________________
3. _______________________________________________

Question 2: Why does tokenization (SMART) outperform continuous prediction?

YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________

Question 3: How does RL fine-tuning (SMART-R1) improve over behavior cloning?

YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________

Question 4: What is the "covariate shift" problem in closed-loop simulation?

YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________

Question 5: Why are collision and offroad metrics weighted 2x?

YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________
"""

# Reference answers (reveal after attempting)
REFERENCE_ANSWERS = """
Q1 Key Innovations:
1. Variable-length history sampling - creates more training data
2. Receding horizon prediction - only use first 0.1s of 1s prediction
3. Top-K sampling - balance realism with diversity

Q2 Tokenization Benefits:
- Discretization acts as regularization
- Captures multi-modal distribution naturally
- Avoids mode collapse of continuous GMM
- Enables language model architectures (decoder-only)

Q3 RL Fine-Tuning Benefits:
- Directly optimizes evaluation metric (WOSAC score)
- Addresses distribution mismatch between training and testing
- Behavior cloning only mimics, doesn't optimize for metric
- RL explores beyond training distribution

Q4 Covariate Shift:
- Training: model sees ground truth history
- Testing: model sees its own predictions as history
- Errors compound over time
- Solution: closed-loop training or RL fine-tuning

Q5 Safety Weights:
- Collisions and offroad are safety-critical failures
- A simulator with many collisions is unusable
- Double weighting emphasizes these must be rare
- Reflects real-world importance hierarchy
"""

10. Interview Questions

Conceptual Understanding

Q1: Why is WOSAC framed as distribution matching rather than trajectory prediction?

<details> <summary>Answer</summary>

Distribution matching is fundamentally different from trajectory prediction because:

  1. Multiple Valid Futures: At any moment, there are many plausible ways a driver could behave. A car approaching an intersection might turn left, go straight, or turn right - all valid.

  2. Evaluation Challenge: If we predict "turn left" but the real driver went straight, a point prediction metric would say we're wrong. But "turn left" might still be a perfectly realistic behavior.

  3. 32 Samples Solution: By generating 32 rollouts, we approximate the distribution of possible behaviors. The metric then asks: "Does this distribution assign reasonable probability to what actually happened?"

  4. Practical Impact: This encourages models to capture behavioral diversity rather than just the most likely outcome, which is crucial for finding edge cases in simulation.

</details>

Q2: Explain why log replay achieves only ~0.72 on WOSAC despite being "ground truth".

<details> <summary>Answer</summary>

Log replay's imperfect score reveals key insights:

  1. No Diversity: Replaying the same trajectory 32 times puts all histogram mass in one bin. If real behavior has ANY natural variation, the likelihood is suboptimal.

  2. Single Sample: We only observed ONE outcome from the real distribution. The true distribution might have variance that log replay doesn't capture.

  3. Adaptation: Log replay cannot adapt to different AV behaviors. In real testing, the AV might act differently than in the recording, requiring agents to react.

  4. Metric Design: WOSAC intentionally penalizes this "cheating" approach to encourage learning generalizable behaviors.

</details>

Q3: What is covariate shift and how do modern WOSAC solutions address it?

<details> <summary>Answer</summary>

Covariate Shift occurs when the distribution of inputs at test time differs from training:

  • Training: Model sees ground-truth history from the dataset
  • Testing: Model sees its own previous predictions as history

Consequences:

  • Small errors accumulate over 80 timesteps
  • Model enters states never seen during training
  • Performance degrades significantly over time

Solutions:

  1. Closed-Loop Training (MVTA): Train autoregressively, feeding predictions back as input

  2. Scheduled Teacher Forcing (TrafficBots): Gradually reduce ground-truth exposure during training

  3. Top-K Sampling During Training (SMART): Intentionally use non-optimal tokens to expose model to diverse states

  4. RL Fine-Tuning (SMART-R1): Explicitly train in closed-loop with WOSAC metric as reward

</details>

Technical Deep-Dives

Q4: Walk through the mathematical computation of a kinematic metric likelihood.

<details> <summary>Answer</summary>

Let's compute linear speed likelihood step-by-step:

STEP 1: Extract speeds from 32 rollouts
─────────────────────────────────────────
For rollout k, agent i, timestep t:
    v_{k,i,t} = ||pos_{k,i,t+1} - pos_{k,i,t}|| / 0.1

Result: speeds array of shape [32, N_agents, 79]

STEP 2: Build histogram for one agent at one timestep
─────────────────────────────────────────────────────
speeds_flat = speeds[:, agent_i, timestep_t]  # 32 values

bins = [0, 2, 4, 6, ..., 30]  # 15 bins
hist, edges = histogram(speeds_flat, bins)
# Example: hist = [2, 5, 10, 8, 4, 3, 0, 0, ...]

STEP 3: Apply Laplace smoothing
────────────────────────────────
hist_smoothed = hist + 0.1
# Prevents zero probability bins

STEP 4: Normalize to probability
─────────────────────────────────
prob = hist_smoothed / sum(hist_smoothed)
# Example: prob = [0.065, 0.159, 0.315, ...]

STEP 5: Find bin for ground truth
──────────────────────────────────
gt_speed = 5.5 m/s
bin_idx = digitize(gt_speed, edges) = 2  # [4-6] bin

STEP 6: Compute NLL
────────────────────
nll = -log(prob[bin_idx]) = -log(0.315) = 1.15

STEP 7: Average over time
──────────────────────────
For each timestep t:
    nll_t = compute_nll(t)

average_nll = mean(nll_t for t in valid_timesteps)

STEP 8: Convert to likelihood
──────────────────────────────
likelihood = exp(-average_nll)
# Example: exp(-1.15) = 0.317
</details>

Q5: How does SMART's tokenization work and why is it effective?

<details> <summary>Answer</summary>

SMART Tokenization Process:

1. TRAJECTORY SEGMENTATION
   ──────────────────────────
   Split continuous trajectory into 0.5-second segments
   Each segment = 5 timesteps at 10 Hz

2. K-DISK CLUSTERING
   ───────────────────
   Collect all segments from training data
   Apply k-means clustering (k=2048)
   Each cluster centroid = one token

3. VOCABULARY
   ────────────
   Motion tokens: ~2048 for each agent type
   Map tokens: ~1000 for road segments

4. TOKENIZATION AT RUNTIME
   ─────────────────────────
   For each 0.5s segment:
     Find nearest cluster centroid
     Return token index

Why It's Effective:

  1. Built-in Discretization: Prevents mode collapse common in continuous GMM outputs

  2. Natural Multi-modality: Each token represents a distinct behavior mode

  3. Language Model Architecture: Enables decoder-only transformers (like GPT) which scale efficiently

  4. Regularization: Quantization adds implicit regularization, reducing overfitting

  5. Rolling Matching: During training, tokens matched based on previous prediction (not GT) reduces covariate shift

</details>

Q6: Design a new metric to capture "naturalness" of lane changes.

<details> <summary>Answer</summary>

Proposed: Lane Change Naturalness Metric

DEFINITION:
───────────
Measure how similar simulated lane changes are to real lane changes
in terms of duration, lateral displacement profile, and context.

FEATURE EXTRACTION:
───────────────────
For each detected lane change:
1. Duration: T_lc (time from start to end)
2. Lateral Profile: y(t) displacement curve
3. Context: gap acceptance (front/rear distances)

COMPUTATION:
────────────
1. Detect lane changes in both simulated and real data:
   - Lateral displacement > threshold (e.g., 3m)
   - Sustained lane position after change

2. Extract features for each lane change:
   lc_features = {
       'duration': T_lc,
       'max_lateral_speed': max(dy/dt),
       'smoothness': mean(d²y/dt²),
       'front_gap': distance to lead vehicle,
       'rear_gap': distance to following vehicle,
   }

3. Build distributions from 32 rollouts

4. Compute likelihood of real lane change features

INTEGRATION:
────────────
Can be added to Interactive metrics category
Weight: 1.0 (or 2.0 if safety-critical)

CHALLENGES:
───────────
- Lane change detection is non-trivial
- Requires accurate lane information from map
- Sparse events (not every scenario has lane changes)
</details>

System Design

Q7: How would you scale a WOSAC submission to handle 45,000 scenarios efficiently?

<details> <summary>Answer</summary>

Scaling Strategy:

1. PARALLELIZATION
   ─────────────────
   - Scenarios are independent → embarrassingly parallel
   - Use distributed computing (e.g., Ray, Spark)
   - Target: 100+ scenarios per GPU simultaneously

2. BATCHING
   ─────────
   - Batch multiple scenarios per forward pass
   - Batch 32 rollouts together (same scenario)
   - Use dynamic batching for variable agent counts

3. EFFICIENT SIMULATION
   ─────────────────────
   - Use JAX/Waymax for GPU-accelerated simulation
   - JIT compile the simulation step
   - Avoid Python loops in inner simulation

4. INFRASTRUCTURE
   ───────────────
   Recommended setup:
   - 8× A100 GPUs
   - ~5 hours for full test set
   - Sharded output (150 files)

5. PIPELINE
   ──────────
   ┌─────────┐     ┌──────────┐     ┌─────────┐
   │  Load   │────►│ Simulate │────►│  Save   │
   │ Shards  │     │  (GPU)   │     │ Shards  │
   └─────────┘     └──────────┘     └─────────┘
       │               │                │
       └───────────────┼────────────────┘
                       │
              Parallel across GPUs

6. VALIDATION
   ────────────
   - Validate each shard immediately after creation
   - Check: 32 rollouts, 80 steps, all agents
   - Fail fast on errors
</details>

Q8: You notice your model has high collision rate (15%) vs. baseline (5%). Debug this.

<details> <summary>Answer</summary>

Debugging Framework:

STEP 1: REPRODUCE AND ISOLATE
─────────────────────────────
- Identify scenarios with highest collision rates
- Visualize specific collision events
- Categorize collision types:
  □ Rear-end collisions
  □ Side-swipe during lane change
  □ Intersection conflicts
  □ Pedestrian/cyclist interactions

STEP 2: ANALYZE ROOT CAUSES
───────────────────────────
For each category, investigate:

A) Data Issue:
   - Training data has collisions → model learns them
   - Fix: Filter collision trajectories from training

B) Covariate Shift:
   - Model enters unseen states → erratic behavior
   - Fix: Closed-loop training, scheduled teacher forcing

C) Multi-Agent Coordination:
   - Independent sampling → no collision avoidance
   - Fix: Joint prediction, iterative refinement

D) Speed/Acceleration Bounds:
   - Unrealistic kinematics → impossible avoidance
   - Fix: Add kinematic constraints in output

STEP 3: TARGETED FIXES
──────────────────────
Priority order:
1. Add collision loss during training
2. Implement closed-loop fine-tuning
3. Add inference-time collision mitigation (MTR+++)
4. Ensemble multiple samples, select least-collision set

STEP 4: ABLATION STUDY
──────────────────────
Test each fix independently:
| Fix                    | Collision Rate |
|------------------------|----------------|
| Baseline               | 15%            |
| + Collision loss       | 10%            |
| + Closed-loop training | 7%             |
| + Inference mitigation | 4%             |

STEP 5: VERIFY NO REGRESSION
────────────────────────────
Check other metrics didn't degrade:
- Kinematic metrics (might be too conservative)
- Diversity (might collapse to "safe" behaviors)
</details>

Q9: Compare WOSAC to other autonomous driving benchmarks. What's unique?

<details> <summary>Answer</summary>

Benchmark Comparison:

┌───────────────┬─────────────────┬─────────────────┬─────────────────┐
│   Benchmark   │      Task       │    Evaluation   │   Unique to     │
│               │                 │                 │     WOSAC       │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│    WOSAC      │ Multi-agent     │ Distribution    │ ✓ Closed-loop   │
│               │ simulation      │ matching        │ ✓ 32 rollouts   │
│               │                 │ (likelihood)    │ ✓ Joint scenes  │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│   WOMD-MP     │ Motion          │ minADE, minFDE  │ Single agent    │
│               │ prediction      │ miss rate       │ Open-loop       │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│   nuScenes    │ Detection +     │ mAP, NDS        │ Perception      │
│               │ Prediction      │ ADE, FDE        │ focused         │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│   Argoverse   │ Motion          │ minADE, MR      │ Trajectory      │
│               │ forecasting     │ brier-minFDE    │ diversity       │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│   CARLA       │ Closed-loop     │ Task success    │ Synthetic       │
│               │ driving         │ infraction rate │ environment     │
└───────────────┴─────────────────┴─────────────────┴─────────────────┘

WOSAC's Unique Contributions:

1. DISTRIBUTION OVER POINT PREDICTION
   - Other benchmarks: "Is this trajectory close to GT?"
   - WOSAC: "Does this distribution cover GT?"

2. MULTI-AGENT JOINT SIMULATION
   - Other benchmarks: Predict each agent independently
   - WOSAC: All agents simulated together

3. CLOSED-LOOP REQUIREMENT
   - Other benchmarks: Often open-loop evaluation
   - WOSAC: Must be truly autoregressive

4. INTERPRETABLE METRIC DECOMPOSITION
   - Kinematic + Interactive + Map-based
   - Identifies specific failure modes

5. REAL DATA SCALE
   - 45,000 real scenarios
   - Diverse conditions (urban, highway, etc.)
</details>

Q10: If you were to improve WOSAC metrics, what would you add?

<details> <summary>Answer</summary>

Proposed Improvements:

1. SEMANTIC BEHAVIOR METRICS
   ──────────────────────────
   Current: Low-level kinematics
   Proposed: High-level behavior recognition

   Examples:
   - Yield compliance at intersections
   - Lane change appropriateness
   - Gap acceptance realism
   - Turn signal correlation (if available)

2. LONG-TAIL SCENARIO WEIGHTING
   ─────────────────────────────
   Current: Uniform scenario weighting
   Proposed: Upweight rare/critical scenarios

   Rare scenarios:
   - Near-miss situations
   - Construction zones
   - Emergency vehicles
   - Adverse weather

3. TEMPORAL CONSISTENCY METRICS
   ─────────────────────────────
   Current: Per-timestep evaluation
   Proposed: Trajectory-level patterns

   Examples:
   - Smoothness over entire trajectory
   - Intention consistency (committed actions)
   - Reaction time to events

4. ADVERSARIAL ROBUSTNESS
   ────────────────────────
   Current: Fixed scenarios
   Proposed: Perturbed initial conditions

   Test if model handles:
   - Slightly different starting positions
   - Noisy observations
   - Missing agents

5. COMPUTATIONAL EFFICIENCY METRIC
   ────────────────────────────────
   Current: Only accuracy metrics
   Proposed: Realism per FLOP

   Rationale:
   - Real-time simulation matters
   - Encourages efficient architectures
   - Separate track for resource-constrained solutions

6. CAUSALITY-AWARE METRICS
   ─────────────────────────
   Current: All agents weighted equally
   Proposed: Weight by causal relevance to ego

   Focus on agents that:
   - Are in ego's path
   - Could interact with ego
   - Influence ego's decisions
</details>

Summary

The Waymo Open Sim Agents Challenge represents a critical benchmark for autonomous vehicle simulation. Key takeaways:

  1. Metrics Matter: The realism meta-metric's design (distribution matching, safety weighting) directly shapes what models learn.

  2. Evolution of Approaches: From transformers (MVTA) to tokenization (SMART) to RL fine-tuning (SMART-R1), each generation addresses previous limitations.

  3. Closed-Loop is Essential: The gap between open-loop training and closed-loop evaluation is the central challenge.

  4. Distribution Over Points: Generating 32 diverse rollouts that cover the behavioral distribution is more important than single accurate predictions.

  5. Safety-Critical Emphasis: Double-weighting collisions and offroad events reflects real-world priorities.


References

Primary Sources

Official Resources

Additional Reading


Last Updated: January 2026