WOSAC Challenge Deep Dive: The Waymo Open Sim Agents Challenge
A comprehensive learning guide to understanding, implementing, and excelling at the premier benchmark for realistic traffic simulation in autonomous driving.
Table of Contents
- Executive Summary
- Challenge Overview
- Evaluation Metrics Deep Dive
- Baseline Approaches
- Winning Strategies
- Interactive Code Examples
- Submission Format
- Mental Models
- Hands-On Exercises
- Interview Questions
1. Executive Summary
What is WOSAC?
The Waymo Open Sim Agents Challenge (WOSAC) is the first public benchmark for evaluating multi-agent traffic simulation systems. Introduced in 2023 and running annually, it establishes standardized metrics and protocols for assessing how realistically simulation agents behave compared to real-world driving data.
Why WOSAC Matters
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE SIMULATION BOTTLENECK │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Real World Testing vs. Simulation Testing │
│ ───────────────── ────────────────── │
│ • $1M+ per vehicle • $0.01 per scenario │
│ • 1 scenario/hour • 1M scenarios/hour │
│ • Safety risks • Zero risk │
│ • Weather dependent • Any condition │
│ │
│ BUT: Simulation is only useful if agents behave REALISTICALLY │
│ │
│ WOSAC answers: "How do we MEASURE realism?" │
└─────────────────────────────────────────────────────────────────────────────┘
The Core Problem WOSAC Solves
Autonomous vehicle development requires testing against millions of scenarios. But simulation is only valuable if:
- Simulated agents behave like real humans - not too aggressive, not too passive
- Multi-agent interactions are realistic - agents react to each other
- Edge cases emerge naturally - rare but critical scenarios appear
WOSAC provides the measurement framework to quantify these properties.
Key Statistics
| Metric | Value |
|---|---|
| Challenge Years | 2023, 2024, 2025 |
| Scenarios per Evaluation | 44,920 (test set); 44,097 (validation set) |
| Rollouts per Scenario | 32 parallel simulations |
| Simulation Horizon | 8 seconds at 10 Hz |
| History Context | 1.1 seconds |
| Top Realism Score (2025) | 0.7858 (SMART-R1) |
2. Challenge Overview
Task Definition
Official Task: "Given the agent tracks for the past 1 second on a corresponding map, and optionally the associated lidar for this time interval, simulate 32 realistic joint futures for all the agents in the scene."
┌─────────────────────────────────────────────────────────────────────────────┐
│ WOSAC TASK VISUALIZATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIME AXIS │
│ ─────────────────────────────────────────────────────────────────────► │
│ │
│ ◄────── 1.1s History ──────►◄─────────── 8.0s Future ──────────────► │
│ │
│ [t=-10, t=-9, ..., t=0] [t=1, t=2, ..., t=80] │
│ 11 frames 80 frames @ 10Hz │
│ │
│ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
│ │ INPUTS (Given) │ │ OUTPUTS (Generate) │ │
│ │ • Agent positions │ │ • 32 joint scene rollouts │ │
│ │ • Agent velocities│ │ • All agents' trajectories │ │
│ │ • Agent headings │ │ • x, y, z, heading per timestep │ │
│ │ • HD Map │ │ │ │
│ │ • (Optional) LiDAR│ │ │ │
│ └─────────────────────┘ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Input Format
The challenge uses the Waymo Open Motion Dataset (WOMD), providing:
# Scenario Proto Structure (simplified)
scenario = {
'scenario_id': 'unique_identifier_string',
# Agent Information (up to 128 agents per scene)
'tracks': {
'object_id': [int], # Unique ID per agent
'object_type': [int], # 1=Vehicle, 2=Pedestrian, 3=Cyclist
'states': {
'center_x': [N_agents, 91], # 91 timesteps total
'center_y': [N_agents, 91],
'center_z': [N_agents, 91],
'heading': [N_agents, 91], # Yaw angle in radians
'velocity_x': [N_agents, 91],
'velocity_y': [N_agents, 91],
'length': [N_agents], # Bounding box
'width': [N_agents],
'height': [N_agents],
'valid': [N_agents, 91], # Validity mask
}
},
# Map Information
'map_features': {
'lane': [...], # Lane centerlines
'road_line': [...], # Road boundaries
'road_edge': [...], # Drivable area edges
'stop_sign': [...],
'crosswalk': [...],
'speed_bump': [...],
},
# Traffic Signals
'dynamic_map_states': {
'traffic_light_state': [...], # Per timestep
}
}
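To make this structure concrete, here is a minimal sketch of loading one scenario proto from a WOMD shard. The file path is a placeholder, and the sketch assumes the waymo-open-dataset package (used again in Section 7) and TensorFlow are installed:
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2
def load_first_scenario(tfrecord_path: str) -> scenario_pb2.Scenario:
    """Parse and return the first Scenario proto found in a WOMD shard."""
    dataset = tf.data.TFRecordDataset([tfrecord_path])
    for record in dataset.as_numpy_iterator():
        return scenario_pb2.Scenario.FromString(record)
    raise ValueError(f"No records found in {tfrecord_path}")
# Placeholder path: substitute a real WOMD training shard
scenario = load_first_scenario("womd/training/training.tfrecord-00000-of-01000")
print(scenario.scenario_id, "with", len(scenario.tracks), "tracks")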
Output Format
Each submission must provide 32 parallel rollouts per scenario:
# Output Structure
output = {
'scenario_id': 'matching_input_id',
'joint_scenes': [ # List of 32 rollouts
{
'simulated_trajectories': [ # One per sim agent
{
'object_id': int,
'center_x': [80], # 80 future timesteps
'center_y': [80],
'center_z': [80],
'heading': [80],
}
]
}
] * 32
}
What Makes This Challenging?
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-AGENT SIMULATION CHALLENGES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DISTRIBUTION MATCHING (not point prediction) │
│ ──────────────────────────────────────────── │
│ • Must capture DIVERSITY of plausible futures │
│ • 32 samples must cover the behavioral distribution │
│ • Single "best guess" is insufficient │
│ │
│ 2. CLOSED-LOOP SIMULATION │
│ ─────────────────────── │
│ • Agents must REACT to each other │
│ • Cannot pre-compute trajectories independently │
│ • Covariate shift from training (open-loop) to eval (closed-loop) │
│ │
│ 3. MULTI-SCALE REALISM │
│ ──────────────────── │
│ • Kinematic feasibility (physics) │
│ • Social compliance (interactions) │
│ • Map adherence (road rules) │
│ │
│ 4. JOINT CONSISTENCY │
│ ───────────────── │
│ • All agents simulated TOGETHER │
│ • Cannot optimize per-agent independently │
│ • Emergent behaviors from interactions │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Dataset Splits
| Split | Scenarios | Purpose |
|---|---|---|
| Training | 486,995 | Model training |
| Validation | 44,097 | Development & tuning |
| Testing | 44,920 | Final evaluation (hidden) |
3. Evaluation Metrics Deep Dive
The Realism Meta-Metric
WOSAC's primary ranking criterion is the Realism Meta-Metric, a weighted combination of 9 component metrics across three categories.
┌─────────────────────────────────────────────────────────────────────────────┐
│ REALISM META-METRIC HIERARCHY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ REALISM META-METRIC │
│ (0-1 scale) │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ KINEMATIC │ │ INTERACTIVE │ │ MAP-BASED │ │
│ │ METRICS │ │ METRICS │ │ METRICS │ │
│ │ (4 items) │ │ (3 items) │ │ (2 items) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ┌───┬───┼───┬───┐ ┌──────┼──────┐ ┌──────┼──────┐ │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ │
│ LS LA AS AA DNO COL* TTC DRE OFF* │
│ │
│ * = Double-weighted (2x) for safety emphasis │
│ │
│ LS = Linear Speed DNO = Distance to Nearest Object │
│ LA = Linear Acceleration COL = Collision Indicator │
│ AS = Angular Speed TTC = Time to Collision │
│ AA = Angular Acceleration DRE = Distance to Road Edge │
│ OFF = Offroad Indicator │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Master Formula
The overall Realism Meta-Metric is computed as:
M^K = (1 / (N · W)) · Σ_{i=1}^{N} Σ_{j=1}^{M} w_j · m_{i,j}^K
Where:
N = Number of scenarios
M = 9 component metrics
K = 32 parallel rollouts
w_j = Weight for metric j (1.0 or 2.0)
W = Sum of all weights (11 total: 7×1 + 2×2)
m_{i,j}^K = Likelihood score for metric j on scenario i
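As a sanity check on the weighting, the meta-metric for a single scenario is just a weighted average of the nine component likelihoods. The values below are made up purely to illustrate the arithmetic:
# Illustrative per-metric likelihood scores for one scenario (values are made up)
likelihoods = {
    'linear_speed': 0.80, 'linear_acceleration': 0.75,
    'angular_speed': 0.85, 'angular_acceleration': 0.82,
    'distance_to_nearest': 0.70, 'collision_indicator': 0.95,
    'time_to_collision': 0.78, 'distance_to_road_edge': 0.88,
    'offroad_indicator': 0.97,
}
# Collision and offroad are double-weighted; everything else gets weight 1
weights = {name: 2.0 if name in ('collision_indicator', 'offroad_indicator') else 1.0
           for name in likelihoods}
W = sum(weights.values())                           # 7*1 + 2*2 = 11
meta = sum(weights[n] * likelihoods[n] for n in likelihoods) / W
print(f"Realism meta-metric for this scenario: {meta:.4f}")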
3.1 Kinematic Metrics
These metrics ensure agents move in physically plausible ways.
Linear Speed
Definition: Unsigned magnitude of velocity
‖v_t‖ = ‖(x_{t+1} - x_t) / Δt‖_2
Where:
x_t = [x_t, y_t, z_t] (3D position)
Δt = 0.1 seconds (10 Hz)
Intuition: Measures whether agents move at realistic speeds. A car whose speed suddenly jumps to 100 m/s would receive a very low likelihood.
Typical Range: 0-30 m/s for vehicles, 0-2 m/s for pedestrians
Linear Acceleration
Definition: Signed magnitude of velocity change
a_t = (‖v_{t+1}‖ - ‖v_t‖) / Δt
Intuition: Captures smooth vs. jerky motion. Real vehicles have bounded acceleration capabilities.
Typical Range: -8 to +4 m/s² for comfortable driving
Angular Speed
Definition: Rate of heading change
ω_t = d(θ_{t+1}, θ_t) / Δt
Where d(·,·) is the signed heading difference wrapped to [-π, π]:
d(θ_a, θ_b) = atan2(sin(θ_a - θ_b), cos(θ_a - θ_b))
Intuition: How fast is the agent turning? Vehicles can't spin arbitrarily fast.
Typical Range: -0.5 to +0.5 rad/s for normal driving
Angular Acceleration
Definition: Rate of angular speed change
α_t = (ω_{t+1} - ω_t) / Δt
Intuition: Steering smoothness. Jerky steering inputs get penalized.
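As a quick illustration, all four kinematic features fall directly out of finite differences of a position/heading sequence; the toy single-agent trajectory below is made up, and Section 6.1 gives the full batched version:
import numpy as np
dt = 0.1  # 10 Hz
# Toy single-agent samples: xy positions and headings over four frames (made up)
xy = np.array([[0.0, 0.0], [0.5, 0.0], [1.1, 0.1], [1.8, 0.3]])
heading = np.array([0.00, 0.05, 0.15, 0.30])
speed = np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt        # linear speed
lin_acc = np.diff(speed) / dt                                   # linear acceleration
dtheta = np.arctan2(np.sin(np.diff(heading)), np.cos(np.diff(heading)))  # wrapped
ang_speed = dtheta / dt                                         # angular speed
ang_acc = np.diff(ang_speed) / dt                               # angular acceleration
print(speed, lin_acc, ang_speed, ang_acc)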
3.2 Interactive Metrics
These metrics capture multi-agent social behavior.
Distance to Nearest Object
Definition: Signed distance to closest other agent
d_{near}(i, t) = min_{j ≠ i} GJK_distance(box_i(t), box_j(t))
Where:
GJK_distance = Gilbert-Johnson-Keerthi algorithm
box_i(t) = Oriented bounding box of agent i at time t
Negative values indicate overlap (collision)
Intuition: Are agents maintaining realistic following distances? Too close = dangerous, too far = unrealistic.
Collision Indicator (2x Weight)
Definition: Binary indicator of any collision occurrence
collision_i = 1 if ∃t : d_{near}(i, t) < 0
= 0 otherwise
Intuition: Did the agent collide with anything during the rollout? This is CRITICAL for safety.
Why Double-Weighted: Collisions are safety-critical failures. A simulator with many collisions is useless for AV testing.
Time to Collision (TTC)
Definition: Time until collision assuming constant velocities
TTC_t = t_collision - t_current
Where t_collision is computed assuming:
- Both agents maintain current velocity
- Linear extrapolation of trajectories
- First intersection time of bounding boxes
Intuition: Are agents maintaining safe time gaps? Real drivers adjust speed to maintain TTC > 2-3 seconds.
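A minimal point-agent sketch of this constant-velocity TTC is shown below. Like the compute_ttc example in Section 6.2, it returns the time of closest approach rather than the exact first-contact time, and the numbers are made up:
import numpy as np
def point_ttc(pos_i, vel_i, pos_j, vel_j, collision_radius=4.5):
    """Constant-velocity time-to-closest-approach between two point agents;
    returns inf if the closest approach stays outside collision_radius
    (roughly one vehicle length, echoing the threshold used in Section 6.2)."""
    rel_pos = np.asarray(pos_j, float) - np.asarray(pos_i, float)
    rel_vel = np.asarray(vel_j, float) - np.asarray(vel_i, float)
    speed_sq = np.dot(rel_vel, rel_vel)
    if speed_sq < 1e-9:
        return np.inf                       # no relative motion
    t_cpa = -np.dot(rel_pos, rel_vel) / speed_sq
    if t_cpa <= 0:
        return np.inf                       # agents are separating
    closest = np.linalg.norm(rel_pos + t_cpa * rel_vel)
    return t_cpa if closest < collision_radius else np.inf
# Follower at 15 m/s, lead car 30 m ahead at 5 m/s -> closing at 10 m/s
print(point_ttc([0, 0], [15, 0], [30, 0], [5, 0]))   # 3.0 s (time of closest approach)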
3.3 Map-Based Metrics
These metrics ensure agents follow road structure.
Distance to Road Edge
Definition: Signed distance to nearest drivable boundary
d_{road}(i, t) = min_{edge ∈ road_edges} distance(center_i(t), edge)
Positive = on road
Negative = off road
Intuition: Are agents staying on valid road surfaces?
Offroad Indicator (2x Weight)
Definition: Binary indicator of road departure
offroad_i = 1 if ∃t : d_{road}(i, t) < 0
= 0 otherwise
Why Double-Weighted: Vehicles driving through buildings or fields is obviously unrealistic and safety-critical.
3.4 Likelihood Computation
The key insight of WOSAC is treating simulation as distribution matching. Rather than comparing single trajectories, it compares distributions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LIKELIHOOD COMPUTATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Extract feature values from 32 rollouts │
│ ───────────────────────────────────────────────── │
│ │
│ Rollout 1: speed = [5.2, 5.3, 5.1, ...] │
│ Rollout 2: speed = [5.1, 5.4, 5.2, ...] │
│ ... │
│ Rollout 32: speed = [5.3, 5.2, 5.4, ...] │
│ │
│ STEP 2: Build histogram from simulated samples │
│ ────────────────────────────────────────────── │
│ │
│ Histogram bins: [0-2], [2-4], [4-6], [6-8], ... │
│ Counts: 0 2 25 5 ... │
│ Normalize → Categorical distribution P_sim │
│ │
│ STEP 3: Compute likelihood of REAL data under P_sim │
│ ──────────────────────────────────────────────────── │
│ │
│ Real speed = 5.5 m/s → falls in bin [4-6] │
│ P_sim([4-6]) = 25/32 = 0.78 │
│ NLL = -log(0.78) = 0.25 │
│ │
│ STEP 4: Average over time with Laplace smoothing │
│ ───────────────────────────────────────────────── │
│ │
│ m = exp(-[Σ_t 1{v_t} · NLL_t] / Σ_t 1{v_t}) │
│ │
│ Laplace smoothing: Add pseudocount 0.1 to avoid log(0) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Mathematical Formulation:
For each component metric m_j:
1. Extract values: V = {v_1, v_2, ..., v_32} from 32 rollouts
2. Build histogram: H(b) = count of samples in bin b
3. Normalize with Laplace smoothing:
P(b) = (H(b) + 0.1) / (32 + 0.1 × n_bins)
4. Compute NLL of ground truth:
NLL = -log(P(bin(v_real)))
5. Time-average:
m_j = exp(- Σ_t [valid_t × NLL_t] / Σ_t valid_t)
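A tiny numeric walk-through of these five steps, with made-up samples and coarse 2 m/s bins:
import numpy as np
# Step 1: speed samples from 32 rollouts at one timestep (made up)
sim_speeds = np.concatenate([np.full(2, 3.0), np.full(25, 5.0), np.full(5, 7.0)])
# Steps 2-3: histogram over bins [0-2), [2-4), [4-6), [6-8) with Laplace smoothing
bins = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
hist, _ = np.histogram(sim_speeds, bins=bins)
prob = (hist + 0.1) / (len(sim_speeds) + 0.1 * len(hist))
# Step 4: NLL of the real value under the simulated distribution
real_speed = 5.5
bin_idx = np.digitize(real_speed, bins) - 1
nll = -np.log(prob[bin_idx])
# Step 5: exponentiating the (time-averaged) NLL gives the 0-1 score
print(f"P(bin) = {prob[bin_idx]:.3f}, NLL = {nll:.3f}, score = {np.exp(-nll):.3f}")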
3.5 2025 Metric Updates
The 2025 challenge introduced refinements:
| Change | Description |
|---|---|
| Traffic Light Violation | New metric penalizing red light running |
| TTC Filtering | Improved vehicle selection for TTC calculation |
| Weight Adjustments | Fine-tuned relative importance of components |
Summary Table
| Category | Metric | Weight | Formula | Range |
|---|---|---|---|---|
| Kinematic | Linear Speed | 1x | ‖v‖ = ‖(x_{t+1} - x_t)/Δt‖ | 0-30 m/s |
| Kinematic | Linear Accel | 1x | a = (v_{t+1} - v_t)/Δt | -8 to +4 m/s² |
| Kinematic | Angular Speed | 1x | ω = Δθ/Δt | -0.5 to +0.5 rad/s |
| Kinematic | Angular Accel | 1x | α = Δω/Δt | Variable |
| Interactive | Dist to Nearest | 1x | min GJK distance | 0-50 m |
| Interactive | Collision | 2x | Binary indicator | {0, 1} |
| Interactive | Time to Collision | 1x | Constant-velocity extrapolation | 0-10 s |
| Map-Based | Dist to Road Edge | 1x | Signed distance | -5 to +20 m |
| Map-Based | Offroad | 2x | Binary indicator | {0, 1} |
4. Baseline Approaches
Understanding baselines reveals what WOSAC rewards and penalizes.
4.1 Constant Velocity Baseline
Approach: Assume each agent continues with current velocity forever.
def constant_velocity_rollout(initial_state, n_steps=80):
"""
Simplest possible baseline: linear extrapolation.
"""
x, y, heading = initial_state['x'], initial_state['y'], initial_state['heading']
vx, vy = initial_state['vx'], initial_state['vy']
dt = 0.1 # 10 Hz
trajectory = []
for _ in range(n_steps):
x += vx * dt
y += vy * dt
# heading unchanged
trajectory.append({'x': x, 'y': y, 'heading': heading})
return trajectory
Performance: Realism Meta ~0.08 (very poor)
Why It Fails:
- Zero angular motion → fails angular metrics completely
- No interaction awareness → many collisions
- Ignores map → drives off-road
- No deceleration → unrealistic at intersections
CONSTANT VELOCITY FAILURE MODES:
─────────────────────────────────
Before: After 8 seconds:
┌────┐ ┌────┐
│ A │→ │ A │→ (drove through intersection)
└────┘ └────┘
↑ ↑
┌──┴──┐ ┌──┴──┐
│Road │ │Road │
└─────┘ └─────┘
↑ ↑
┌────┐ ┌────┐
│ B │↑ │ B │↑ (collision with A!)
└────┘ └────┘
4.2 Log Replay (Oracle)
Approach: Directly replay recorded trajectories from the dataset.
def log_replay_rollout(scenario, agent_id):
"""
Copy ground truth future trajectories.
"""
return scenario.tracks[agent_id].states[11:91] # Future 80 steps
Performance: Realism Meta ~0.72 (surprisingly not perfect!)
Why It's Not Perfect:
- Replicates exact same trajectory 32 times → no diversity
- Histogram has all mass in one bin → NLL can be high if real trajectory has ANY variation
- Cannot adapt to different AV behaviors in closed-loop testing
LOG REPLAY LIMITATION:
─────────────────────
Real Distribution: Log Replay:
P(speed) P(speed)
│ │
│ ████ │ █
│ ██████ │ █
│ ████████ │ █
└──────────── speed └──────────── speed
↑ ↑
Natural variation Single point = poor coverage
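This brittleness can be shown in a few lines using the same histogram recipe as Section 3.4. The logged speed is offset slightly here to mimic measurement noise or discretization; the exact numbers are illustrative only:
import numpy as np
def histogram_likelihood(samples, real_value, bin_edges):
    """Histogram likelihood with Laplace smoothing, as in Section 3.4."""
    hist, edges = np.histogram(samples, bins=bin_edges)
    prob = (hist + 0.1) / (len(samples) + 0.1 * len(hist))
    idx = np.clip(np.digitize(real_value, edges) - 1, 0, len(hist) - 1)
    return prob[idx]
bin_edges = np.linspace(0.0, 10.0, 21)        # 0.5 m/s wide speed bins
real_speed = 5.7                              # logged speed, slightly off the replayed value
replay = np.full(32, 5.0)                                        # 32 identical samples
diverse = np.random.default_rng(0).normal(5.0, 0.5, size=32)     # spread samples
print("log-replay likelihood:", histogram_likelihood(replay, real_speed, bin_edges))
print("diverse likelihood   :", histogram_likelihood(diverse, real_speed, bin_edges))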
4.3 Intelligent Driver Model (IDM)
Approach: Classical car-following model with simple rules.
import numpy as np
def idm_acceleration(v, s, v_lead, params):
"""
Intelligent Driver Model for longitudinal control.
Args:
v: Current velocity
s: Gap to lead vehicle
v_lead: Lead vehicle velocity
params: IDM parameters
"""
v0 = params['desired_velocity'] # e.g., 30 m/s
T = params['time_headway'] # e.g., 1.5 s
a = params['max_acceleration'] # e.g., 3 m/s²
b = params['comfortable_decel'] # e.g., 2 m/s²
s0 = params['min_gap'] # e.g., 2 m
# Desired gap
delta_v = v - v_lead
s_star = s0 + max(0, v * T + v * delta_v / (2 * np.sqrt(a * b)))
# Acceleration
acc = a * (1 - (v / v0)**4 - (s_star / s)**2)
return np.clip(acc, -b, a)
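A quick sanity check of the function above; the parameter values are typical textbook choices, not values fitted to WOMD:
params = {
    'desired_velocity': 30.0,   # m/s
    'time_headway': 1.5,        # s
    'max_acceleration': 3.0,    # m/s^2
    'comfortable_decel': 2.0,   # m/s^2
    'min_gap': 2.0,             # m
}
# Ego at 12 m/s only 8 m behind a lead car doing 10 m/s -> brakes, clipped to -2 m/s^2
print(idm_acceleration(v=12.0, s=8.0, v_lead=10.0, params=params))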
Performance: Realism Meta ~0.45
Strengths:
- Maintains safe following distances
- Avoids rear-end collisions
- Smooth acceleration profiles
Weaknesses:
- Too reactive/accommodating
- Cannot handle complex intersections
- No lane changes or lateral behavior
- Assumes simple single-lane scenarios
4.4 Baseline Comparison
| Method | Realism Meta | Kinematic | Interactive | Map-Based | Collisions |
|---|---|---|---|---|---|
| Constant Velocity | 0.08 | 0.02 | 0.15 | 0.12 | High |
| Log Replay | 0.72 | 0.85 | 0.65 | 0.70 | Low |
| IDM | 0.45 | 0.55 | 0.48 | 0.35 | Medium |
| Random Noise | 0.05 | 0.01 | 0.10 | 0.08 | Very High |
Key Insight: Even "cheating" by replaying ground truth doesn't achieve perfect scores, because WOSAC measures distributional coverage, not just accuracy.
5. Winning Strategies
5.1 WOSAC 2023: Multiverse Transformer (MVTA)
1st Place - Realism Meta: 0.5168
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTIVERSE TRANSFORMER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ENCODER │
│ ─────── │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Agent │ │ Map │ │ ADV │ │
│ │ History │ │ Polylines │ │ State │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Transformer │ │
│ │ Encoder │ │
│ │ (Self-Attn) │ │
│ └────────┬────────┘ │
│ │ │
│ DECODER (Autoregressive) ▼ │
│ ──────────────────────────── │
│ ┌─────────────────┐ │
│ │ Transformer │◄──── Cross-attention │
│ │ Decoder │ to encoder features │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ GMM Head │ │
│ │ (64 modes) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ Position, Velocity, Heading │
│ for next 0.1s timestep │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Innovations
- Variable-Length History Sampling
  - Training: Randomly split trajectories into history/future
  - Creates more training samples from each trajectory
  - Improves generalization to different context lengths
- Receding Horizon Prediction
  - Predict 1 second ahead, but only use the first 0.1 s
  - Promotes multi-modal diversity
  - Reduces compounding error
- Top-K Sampling Strategy (see the sampling sketch after the loss function)
  - Don't always pick the max-likelihood trajectory
  - Periodically sample from the top-3 modes
  - Balances realism with behavioral diversity
- Loss Function
loss = (
1.0 * gaussian_nll_loss(pred_pos, gt_pos) +
0.5 * l1_loss(pred_vel, gt_vel) +
0.5 * l1_loss(pred_heading, gt_heading)
)
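The receding-horizon and top-K ideas referenced above can be sketched in a few lines with a stand-in single-agent predictor. Nothing below is MVTA's actual model; it only illustrates committing one 0.1 s step at a time and occasionally sampling among the best modes:
import numpy as np
rng = np.random.default_rng(0)
def toy_predict_modes(state, n_modes=8, dt=0.1):
    """Stand-in predictor (not MVTA): candidate next states from jittered velocities."""
    pos, vel = state[:2], state[2:]
    cand_vel = vel + rng.normal(0.0, 0.3, size=(n_modes, 2))       # candidate velocities
    next_states = np.hstack([pos + cand_vel * dt, cand_vel])       # next [x, y, vx, vy]
    scores = -np.linalg.norm(cand_vel - vel, axis=1)               # prefer small changes
    return next_states, scores
def rollout_top_k(state, n_steps=80, k=3, resample_every=10):
    """Receding-horizon rollout: re-predict every step, commit only one 0.1 s step,
    and periodically sample among the top-k modes to keep rollouts diverse."""
    trajectory = []
    for t in range(n_steps):
        candidates, scores = toy_predict_modes(state)
        top = np.argsort(scores)[::-1][:k]
        idx = rng.choice(top) if t % resample_every == 0 else top[0]
        state = candidates[idx]                                    # commit this step only
        trajectory.append(state[:2])
    return np.array(trajectory)
print(rollout_top_k(np.array([0.0, 0.0, 5.0, 0.0])).shape)         # (80, 2)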
5.2 WOSAC 2023: MTR+++ (2nd Place)
Realism Meta: 0.4697
Collision-Mitigation Postprocessing
The key innovation was a graph-based collision avoidance algorithm:
def collision_mitigation(trajectories_per_agent, n_candidates=6):
"""
Graph-based selection to avoid collisions.
For each agent, we have n_candidates trajectory options.
Goal: Select one trajectory per agent minimizing collisions.
"""
n_agents = len(trajectories_per_agent)
# Build collision adjacency matrix
# adj[i,j,k,l] = 1 if agent i's trajectory k collides with agent j's trajectory l
adj = compute_collision_matrix(trajectories_per_agent)
    # Greedy approximation: per agent, pick the candidate with the fewest collisions
    # against already-selected trajectories (simplifying the paper's dense-subgraph
    # search with density >= 0.95)
selected = {}
for agent_idx in range(n_agents):
best_traj = None
min_collisions = float('inf')
for traj_idx in range(n_candidates):
# Count collisions with already-selected trajectories
collisions = sum(
adj[agent_idx, other, traj_idx, selected[other]]
for other in selected
)
if collisions < min_collisions:
min_collisions = collisions
best_traj = traj_idx
selected[agent_idx] = best_traj
return selected
5.3 WOSAC 2024: SMART
1st Place - Realism Meta: 0.7614
Next-Token Prediction Paradigm
SMART revolutionized the approach by treating trajectory generation as language modeling:
┌─────────────────────────────────────────────────────────────────────────────┐
│ SMART: TRAJECTORY AS LANGUAGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Approach: │
│ ───────────────────── │
│ Input → Encoder → Decoder → Continuous (x, y) coordinates │
│ │
│ SMART Approach: │
│ ─────────────── │
│ Input → Tokenizer → Decoder-Only → Next Token → Detokenize │
│ │
│ TOKENIZATION: │
│ ───────────── │
│ │
│ Continuous trajectory: [(0,0), (0.5,0.1), (1.1,0.3), ...] │
│ │ │
│ ▼ K-Disk Clustering │
│ │
│ Token sequence: [Token_42, Token_17, Token_89, ...] │
│ │
│ Vocabulary size: ~2048 motion tokens + ~1000 map tokens │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
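The tokenize/detokenize step can be illustrated with a toy nearest-neighbor codebook of per-step displacements. SMART builds its vocabulary with k-disk clustering of real motion data, whereas the codebook below is random and purely for illustration:
import numpy as np
rng = np.random.default_rng(0)
vocab = rng.normal(0.0, 1.0, size=(256, 2))      # toy codebook of (dx, dy) motions
def tokenize(displacements: np.ndarray) -> np.ndarray:
    """Map each continuous (dx, dy) step to the id of its nearest codebook entry."""
    dists = np.linalg.norm(displacements[:, None, :] - vocab[None, :, :], axis=-1)
    return dists.argmin(axis=1)
def detokenize(token_ids: np.ndarray, start: np.ndarray) -> np.ndarray:
    """Rebuild an absolute trajectory by accumulating the token displacements."""
    return start + np.cumsum(vocab[token_ids], axis=0)
traj = np.cumsum(rng.normal(0.5, 0.2, size=(80, 2)), axis=0)       # toy trajectory
tokens = tokenize(np.diff(traj, axis=0))                           # 79 motion tokens
recon = detokenize(tokens, start=traj[0])
print("first tokens:", tokens[:5])
print("mean reconstruction error (m):", np.abs(recon - traj[1:]).mean())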
Architecture Details
class SMARTModel:
"""
Decoder-only transformer for trajectory generation.
"""
def __init__(self):
self.motion_vocab_size = 2048
self.map_vocab_size = 1000
# Factorized attention
self.temporal_attention = SelfAttention() # Within agent history
self.map_attention = CrossAttention() # Agent → nearby roads
self.agent_attention = SelfAttention() # Agent → other agents
def forward(self, motion_tokens, map_tokens, agent_mask):
"""
Predict next motion token for each agent.
"""
# Temporal: attend to own history
h = self.temporal_attention(motion_tokens)
# Map: attend to nearby road tokens (50m radius)
h = self.map_attention(h, map_tokens)
# Agent: attend to neighboring agents
h = self.agent_attention(h, agent_mask)
# Predict next token
logits = self.output_head(h) # [batch, agents, vocab_size]
return logits
Performance Breakthrough
| Model Size | Realism Meta | Inference Time |
|---|---|---|
| SMART-7M | 0.7591 | <15 ms/frame |
| SMART-101M | 0.7614 | ~30 ms/frame |
| Zero-shot (NuPlan only) | 0.7210 | - |
5.4 WOSAC 2025: SMART-R1 and TrajTok
Top Entries - Realism Meta: 0.7858
SMART-R1: Reinforcement Learning Fine-Tuning
The key insight: directly optimize the WOSAC metric as a reward signal.
class SMARTR1:
"""
Three-stage training pipeline:
1. Behavior Cloning (BC) - open-loop pretraining
2. Closed-loop SFT - address covariate shift
3. RFT - direct metric optimization
"""
def compute_mpo_loss(self, rollouts, reference_policy):
"""
Metric-oriented Policy Optimization.
Unlike GRPO, exploits predictable reward structure.
"""
rewards = self.compute_wosac_metrics(rollouts)
# Simple advantage: reward minus threshold
alpha = 0.7 # Empirical threshold
advantages = rewards - alpha
# Policy gradient with KL penalty
log_probs = self.policy.log_prob(rollouts)
ref_log_probs = reference_policy.log_prob(rollouts)
kl_penalty = log_probs - ref_log_probs
loss = -(advantages * log_probs - self.beta * kl_penalty).mean()
return loss
TrajTok: Improved Tokenization
TrajTok addresses limitations of k-disk clustering:
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRAJTOK vs K-DISKS TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ K-DISKS (Previous): │
│ ─────────────────── │
│ • Pure data-driven clustering │
│ • Limited to observed trajectories │
│ • Asymmetric (left/right turns have different coverage) │
│ • Vocabulary: 2048 tokens │
│ │
│ TRAJTOK (2025): │
│ ─────────────── │
│ • Hybrid rule + data approach │
│ • Grid-based spatial division │
│ • Explicit symmetry (x-axis flipping) │
│ • Noise filtering for robustness │
│ • Vocabulary: 8040 (vehicle), 3001 (ped), 2798 (cyclist) │
│ │
│ IMPROVEMENTS: │
│ ───────────── │
│ ┌──────────────────┬──────────┬──────────┐ │
│ │ Metric │ K-Disks │ TrajTok │ │
│ ├──────────────────┼──────────┼──────────┤ │
│ │ Realism Meta │ 0.7814 │ 0.7852 │ │
│ │ Map-based │ 0.9153 │ 0.9207 │ │
│ │ Interactive │ 0.8089 │ 0.8116 │ │
│ └──────────────────┴──────────┴──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
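The grid-plus-symmetry idea can be illustrated with a toy per-step motion tokenizer. The bin edges and vocabulary size below are made up and are not TrajTok's actual construction; the point is only that token ids come from a regular grid and that each token has an explicit left/right mirror:
import numpy as np
DX_EDGES = np.linspace(-1.0, 4.0, 26)    # longitudinal displacement bins (toy values)
DY_EDGES = np.linspace(-1.5, 1.5, 16)    # lateral displacement bins, symmetric about 0
def grid_tokenize(dx: float, dy: float) -> int:
    """Assign a per-step (dx, dy) motion to a regular-grid token id."""
    ix = int(np.clip(np.digitize(dx, DX_EDGES) - 1, 0, len(DX_EDGES) - 2))
    iy = int(np.clip(np.digitize(dy, DY_EDGES) - 1, 0, len(DY_EDGES) - 2))
    return ix * (len(DY_EDGES) - 1) + iy
def mirror_token(token: int) -> int:
    """Token id of the same motion with dy flipped (explicit left/right symmetry)."""
    n_dy = len(DY_EDGES) - 1
    ix, iy = divmod(token, n_dy)
    return ix * n_dy + (n_dy - 1 - iy)
left_drift = grid_tokenize(1.2, 0.4)
print(left_drift, mirror_token(left_drift) == grid_tokenize(1.2, -0.4))   # True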
5.5 Evolution of Winning Approaches
┌─────────────────────────────────────────────────────────────────────────────┐
│ WOSAC WINNING APPROACH TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 2023 2024 2025 │
│ ──── ──── ──── │
│ │
│ MVTA (0.52) SMART (0.76) SMART-R1 (0.79) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Transformer Next-Token RL Fine-Tuning │
│ + GMM Head Prediction + WOSAC Reward │
│ + Autoregressive + Discrete Tokens + Improved Tokens │
│ │
│ KEY INSIGHT: KEY INSIGHT: KEY INSIGHT: │
│ Multi-modal output Treat trajectories Directly optimize │
│ with top-k sampling as language the evaluation metric │
│ │
│ SCORE JUMP: +0.37 (from baselines) SCORE JUMP: +0.03 │
│ SCORE JUMP: +0.24 (diminishing returns) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
6. Interactive Code Examples
6.1 Computing Kinematic Metrics
import numpy as np
from typing import Dict, List, Tuple
def compute_kinematic_metrics(
trajectories: np.ndarray, # Shape: [n_rollouts, n_agents, n_steps, 4]
dt: float = 0.1
) -> Dict[str, np.ndarray]:
"""
Compute kinematic metrics for WOSAC evaluation.
Args:
trajectories: Simulated trajectories [x, y, z, heading]
dt: Timestep in seconds (0.1 for 10 Hz)
Returns:
Dictionary of metric values per agent per timestep
"""
n_rollouts, n_agents, n_steps, _ = trajectories.shape
# Extract components
positions = trajectories[..., :3] # [n_rollouts, n_agents, n_steps, 3]
headings = trajectories[..., 3] # [n_rollouts, n_agents, n_steps]
# Linear velocity: ||v|| = ||(x_{t+1} - x_t) / dt||
velocities = np.diff(positions, axis=2) / dt # [n_rollouts, n_agents, n_steps-1, 3]
linear_speed = np.linalg.norm(velocities, axis=-1) # [n_rollouts, n_agents, n_steps-1]
# Linear acceleration: (||v_{t+1}|| - ||v_t||) / dt
linear_accel = np.diff(linear_speed, axis=2) / dt # [n_rollouts, n_agents, n_steps-2]
# Angular velocity: d(theta_{t+1}, theta_t) / dt
heading_diff = np.diff(headings, axis=2) # [n_rollouts, n_agents, n_steps-1]
# Wrap to [-pi, pi]
heading_diff = np.arctan2(np.sin(heading_diff), np.cos(heading_diff))
angular_speed = heading_diff / dt
# Angular acceleration
angular_accel = np.diff(angular_speed, axis=2) / dt
return {
'linear_speed': linear_speed,
'linear_acceleration': linear_accel,
'angular_speed': angular_speed,
'angular_acceleration': angular_accel,
}
def compute_metric_likelihood(
simulated_values: np.ndarray, # [n_rollouts, ...]
ground_truth_value: float,
n_bins: int = 100,
value_range: Tuple[float, float] = None,
laplace_smoothing: float = 0.1
) -> float:
"""
Compute likelihood of ground truth under simulated distribution.
This is the core WOSAC metric computation.
"""
# Flatten simulated values
sim_flat = simulated_values.flatten()
# Determine bin range
if value_range is None:
value_range = (sim_flat.min(), sim_flat.max())
# Build histogram
hist, bin_edges = np.histogram(sim_flat, bins=n_bins, range=value_range)
# Laplace smoothing
hist = hist + laplace_smoothing
# Normalize to probability
prob = hist / hist.sum()
# Find bin for ground truth
bin_idx = np.digitize(ground_truth_value, bin_edges) - 1
bin_idx = np.clip(bin_idx, 0, n_bins - 1)
# Compute NLL
nll = -np.log(prob[bin_idx])
# Convert to likelihood (0-1 scale)
likelihood = np.exp(-nll)
return likelihood
# Example usage
if __name__ == "__main__":
# Simulate some trajectories
np.random.seed(42)
n_rollouts = 32
n_agents = 10
n_steps = 80
# Random walk trajectories
trajectories = np.zeros((n_rollouts, n_agents, n_steps, 4))
trajectories[..., 0] = np.cumsum(np.random.randn(n_rollouts, n_agents, n_steps) * 0.5, axis=2)
trajectories[..., 1] = np.cumsum(np.random.randn(n_rollouts, n_agents, n_steps) * 0.5, axis=2)
trajectories[..., 3] = np.cumsum(np.random.randn(n_rollouts, n_agents, n_steps) * 0.1, axis=2)
# Compute metrics
metrics = compute_kinematic_metrics(trajectories)
print("Kinematic Metrics Shapes:")
for name, values in metrics.items():
print(f" {name}: {values.shape}")
# Compute likelihood for one agent
gt_speed = 5.0 # m/s
speed_likelihood = compute_metric_likelihood(
metrics['linear_speed'][:, 0, :], # First agent
gt_speed,
value_range=(0, 30)
)
print(f"\nSpeed likelihood for GT=5.0 m/s: {speed_likelihood:.4f}")
6.2 Computing Interactive Metrics
import numpy as np
from shapely.geometry import Polygon
from typing import List, Tuple
def get_corners(x: float, y: float, heading: float,
length: float, width: float) -> np.ndarray:
"""Get bounding box corners in world coordinates."""
# Half dimensions
l2, w2 = length / 2, width / 2
# Local corners (centered at origin)
corners_local = np.array([
[l2, w2],
[l2, -w2],
[-l2, -w2],
[-l2, w2]
])
# Rotation matrix
cos_h, sin_h = np.cos(heading), np.sin(heading)
R = np.array([[cos_h, -sin_h], [sin_h, cos_h]])
# Transform to world
corners_world = corners_local @ R.T + np.array([x, y])
return corners_world
def polygon_distance(poly1: np.ndarray, poly2: np.ndarray) -> float:
"""
Compute signed distance between two convex polygons.
Positive = separated
Negative = overlapping
Uses simplified GJK-like algorithm.
"""
p1 = Polygon(poly1)
p2 = Polygon(poly2)
if p1.intersects(p2):
# Overlapping - return negative distance
intersection = p1.intersection(p2)
return -np.sqrt(intersection.area)
else:
# Separated - return positive distance
return p1.distance(p2)
def compute_interactive_metrics(
trajectories: np.ndarray, # [n_rollouts, n_agents, n_steps, 4]
agent_sizes: np.ndarray, # [n_agents, 2] (length, width)
dt: float = 0.1
) -> dict:
"""
Compute interactive metrics for WOSAC.
"""
n_rollouts, n_agents, n_steps, _ = trajectories.shape
# Distance to nearest object
dist_to_nearest = np.full((n_rollouts, n_agents, n_steps), np.inf)
# Collision indicator
collision_occurred = np.zeros((n_rollouts, n_agents), dtype=bool)
for r in range(n_rollouts):
for t in range(n_steps):
for i in range(n_agents):
x_i, y_i, _, h_i = trajectories[r, i, t]
corners_i = get_corners(x_i, y_i, h_i,
agent_sizes[i, 0], agent_sizes[i, 1])
min_dist = np.inf
for j in range(n_agents):
if i == j:
continue
x_j, y_j, _, h_j = trajectories[r, j, t]
corners_j = get_corners(x_j, y_j, h_j,
agent_sizes[j, 0], agent_sizes[j, 1])
dist = polygon_distance(corners_i, corners_j)
min_dist = min(min_dist, dist)
if dist < 0:
collision_occurred[r, i] = True
dist_to_nearest[r, i, t] = min_dist
# Time to collision (simplified: assume constant velocity)
ttc = compute_ttc(trajectories, agent_sizes, dt)
return {
'distance_to_nearest': dist_to_nearest,
'collision_indicator': collision_occurred,
'time_to_collision': ttc,
}
def compute_ttc(trajectories: np.ndarray, agent_sizes: np.ndarray,
dt: float) -> np.ndarray:
"""
Compute time-to-collision assuming constant velocities.
"""
n_rollouts, n_agents, n_steps, _ = trajectories.shape
ttc = np.full((n_rollouts, n_agents, n_steps), np.inf)
# Compute velocities
velocities = np.diff(trajectories[..., :2], axis=2) / dt
for r in range(n_rollouts):
for t in range(n_steps - 1):
for i in range(n_agents):
pos_i = trajectories[r, i, t, :2]
vel_i = velocities[r, i, t]
min_ttc = np.inf
for j in range(n_agents):
if i == j:
continue
pos_j = trajectories[r, j, t, :2]
vel_j = velocities[r, j, t]
# Relative position and velocity
rel_pos = pos_j - pos_i
rel_vel = vel_j - vel_i
# Time to closest approach
if np.dot(rel_vel, rel_vel) > 1e-6:
t_cpa = -np.dot(rel_pos, rel_vel) / np.dot(rel_vel, rel_vel)
if t_cpa > 0:
# Check if collision occurs
closest_dist = np.linalg.norm(rel_pos + t_cpa * rel_vel)
collision_threshold = (agent_sizes[i, 0] + agent_sizes[j, 0]) / 2
if closest_dist < collision_threshold:
min_ttc = min(min_ttc, t_cpa)
ttc[r, i, t] = min_ttc
return ttc
# Example usage
if __name__ == "__main__":
# Two vehicles approaching each other
n_rollouts, n_agents, n_steps = 32, 2, 80
trajectories = np.zeros((n_rollouts, n_agents, n_steps, 4))
# Agent 0: moving right
trajectories[:, 0, :, 0] = np.linspace(0, 40, n_steps) # x
trajectories[:, 0, :, 1] = 0 # y
trajectories[:, 0, :, 3] = 0 # heading = 0 (facing right)
# Agent 1: moving left
trajectories[:, 1, :, 0] = np.linspace(50, 10, n_steps) # x
trajectories[:, 1, :, 1] = 0 # y
trajectories[:, 1, :, 3] = np.pi # heading = pi (facing left)
# Vehicle sizes
agent_sizes = np.array([[4.5, 2.0], [4.5, 2.0]])
metrics = compute_interactive_metrics(trajectories, agent_sizes)
print("Interactive Metrics:")
print(f" Collision rate: {metrics['collision_indicator'].mean():.2%}")
print(f" Min distance: {metrics['distance_to_nearest'].min():.2f}m")
print(f" Min TTC: {metrics['time_to_collision'][metrics['time_to_collision'] < np.inf].min():.2f}s")
6.3 Computing Map-Based Metrics
import numpy as np
from shapely.geometry import Point, LineString, Polygon
from typing import List, Dict
def compute_map_metrics(
trajectories: np.ndarray, # [n_rollouts, n_agents, n_steps, 4]
road_polygons: List[Polygon], # List of drivable area polygons
road_edges: List[LineString], # List of road boundary lines
) -> Dict[str, np.ndarray]:
"""
Compute map-based metrics for WOSAC.
"""
n_rollouts, n_agents, n_steps, _ = trajectories.shape
# Distance to road edge
dist_to_edge = np.zeros((n_rollouts, n_agents, n_steps))
# Offroad indicator
offroad_occurred = np.zeros((n_rollouts, n_agents), dtype=bool)
# Combine road polygons into single drivable area
from shapely.ops import unary_union
drivable_area = unary_union(road_polygons)
for r in range(n_rollouts):
for i in range(n_agents):
for t in range(n_steps):
x, y = trajectories[r, i, t, :2]
point = Point(x, y)
# Check if on road
is_on_road = drivable_area.contains(point)
if is_on_road:
# Positive distance to edge
dist = drivable_area.exterior.distance(point)
dist_to_edge[r, i, t] = dist
else:
# Negative distance (off road)
dist = -drivable_area.exterior.distance(point)
dist_to_edge[r, i, t] = dist
offroad_occurred[r, i] = True
return {
'distance_to_road_edge': dist_to_edge,
'offroad_indicator': offroad_occurred,
}
def create_sample_road_geometry():
"""Create sample road geometry for testing."""
from shapely.geometry import box
# Simple straight road
road = box(-10, -5, 100, 5) # x: -10 to 100, y: -5 to 5
# Add intersection
intersection = box(40, -20, 60, 20)
drivable_area = road.union(intersection)
return [drivable_area], [drivable_area.exterior]
# Example
if __name__ == "__main__":
# Create sample trajectories
n_rollouts, n_agents, n_steps = 32, 3, 80
trajectories = np.zeros((n_rollouts, n_agents, n_steps, 4))
# Agent 0: stays on road
trajectories[:, 0, :, 0] = np.linspace(0, 80, n_steps)
trajectories[:, 0, :, 1] = 0
# Agent 1: goes slightly off road
trajectories[:, 1, :, 0] = np.linspace(0, 80, n_steps)
trajectories[:, 1, :, 1] = np.linspace(0, 8, n_steps) # Drifts off
# Agent 2: stays in intersection area
trajectories[:, 2, :, 0] = 50
trajectories[:, 2, :, 1] = np.linspace(-15, 15, n_steps)
road_polygons, road_edges = create_sample_road_geometry()
metrics = compute_map_metrics(trajectories, road_polygons, road_edges)
print("Map-Based Metrics:")
print(f" Offroad rate: {metrics['offroad_indicator'].mean():.2%}")
print(f" Min road distance: {metrics['distance_to_road_edge'].min():.2f}m")
6.4 Full Realism Meta-Metric Computation
import numpy as np
from typing import Dict
def compute_realism_meta_metric(
kinematic_metrics: Dict[str, np.ndarray],
interactive_metrics: Dict[str, np.ndarray],
map_metrics: Dict[str, np.ndarray],
ground_truth: Dict[str, np.ndarray],
) -> Dict[str, float]:
"""
Compute the full WOSAC Realism Meta-Metric.
Returns individual component scores and final meta-metric.
"""
# Metric weights (collision and offroad are 2x)
weights = {
'linear_speed': 1.0,
'linear_acceleration': 1.0,
'angular_speed': 1.0,
'angular_acceleration': 1.0,
'distance_to_nearest': 1.0,
'collision_indicator': 2.0, # Safety-critical
'time_to_collision': 1.0,
'distance_to_road_edge': 1.0,
'offroad_indicator': 2.0, # Safety-critical
}
total_weight = sum(weights.values()) # = 11
# Compute likelihood for each metric
likelihoods = {}
# Kinematic metrics
for name in ['linear_speed', 'linear_acceleration',
'angular_speed', 'angular_acceleration']:
likelihoods[name] = compute_metric_likelihood(
kinematic_metrics[name],
ground_truth[name],
n_bins=100
)
# Interactive metrics
likelihoods['distance_to_nearest'] = compute_metric_likelihood(
interactive_metrics['distance_to_nearest'],
ground_truth['distance_to_nearest']
)
# Collision: special handling for binary metric
collision_rate = interactive_metrics['collision_indicator'].mean()
gt_collision = ground_truth.get('collision_indicator', 0.05) # ~5% baseline
likelihoods['collision_indicator'] = 1.0 - abs(collision_rate - gt_collision)
likelihoods['time_to_collision'] = compute_metric_likelihood(
interactive_metrics['time_to_collision'],
ground_truth['time_to_collision']
)
# Map metrics
likelihoods['distance_to_road_edge'] = compute_metric_likelihood(
map_metrics['distance_to_road_edge'],
ground_truth['distance_to_road_edge']
)
# Offroad: binary metric
offroad_rate = map_metrics['offroad_indicator'].mean()
gt_offroad = ground_truth.get('offroad_indicator', 0.02) # ~2% baseline
likelihoods['offroad_indicator'] = 1.0 - abs(offroad_rate - gt_offroad)
# Compute category scores
kinematic_score = np.mean([
likelihoods['linear_speed'],
likelihoods['linear_acceleration'],
likelihoods['angular_speed'],
likelihoods['angular_acceleration'],
])
interactive_score = np.mean([
likelihoods['distance_to_nearest'],
likelihoods['collision_indicator'],
likelihoods['time_to_collision'],
])
map_score = np.mean([
likelihoods['distance_to_road_edge'],
likelihoods['offroad_indicator'],
])
# Compute weighted meta-metric
meta_metric = sum(
weights[name] * likelihoods[name]
for name in weights
) / total_weight
return {
'realism_meta_metric': meta_metric,
'kinematic_score': kinematic_score,
'interactive_score': interactive_score,
'map_score': map_score,
**{f'likelihood_{k}': v for k, v in likelihoods.items()}
}
# Example evaluation
if __name__ == "__main__":
# Simulated metrics (from previous examples)
kinematic = {
'linear_speed': np.random.rayleigh(5, (32, 10, 79)),
'linear_acceleration': np.random.normal(0, 1, (32, 10, 78)),
'angular_speed': np.random.normal(0, 0.1, (32, 10, 79)),
'angular_acceleration': np.random.normal(0, 0.05, (32, 10, 78)),
}
interactive = {
'distance_to_nearest': np.random.exponential(5, (32, 10, 80)),
'collision_indicator': np.random.random((32, 10)) < 0.05,
'time_to_collision': np.random.exponential(3, (32, 10, 80)),
}
map_based = {
'distance_to_road_edge': np.random.exponential(2, (32, 10, 80)),
'offroad_indicator': np.random.random((32, 10)) < 0.02,
}
# Ground truth (typical values)
ground_truth = {
'linear_speed': 5.5,
'linear_acceleration': 0.2,
'angular_speed': 0.05,
'angular_acceleration': 0.01,
'distance_to_nearest': 4.0,
'collision_indicator': 0.03,
'time_to_collision': 3.5,
'distance_to_road_edge': 2.5,
'offroad_indicator': 0.01,
}
results = compute_realism_meta_metric(
kinematic, interactive, map_based, ground_truth
)
print("WOSAC Evaluation Results:")
print(f" Realism Meta-Metric: {results['realism_meta_metric']:.4f}")
print(f" Kinematic Score: {results['kinematic_score']:.4f}")
print(f" Interactive Score: {results['interactive_score']:.4f}")
print(f" Map Score: {results['map_score']:.4f}")
7. Submission Format
7.1 Proto Message Structure
WOSAC submissions use Protocol Buffers for efficient serialization:
// sim_agents_submission.proto
message SimulatedTrajectory {
// Object identifier (must match scenario)
int32 object_id = 1;
// Future positions (80 timesteps at 10 Hz)
repeated float center_x = 2;
repeated float center_y = 3;
repeated float center_z = 4;
repeated float heading = 5;
}
message JointScene {
// One trajectory per simulated agent
repeated SimulatedTrajectory simulated_trajectories = 1;
}
message ScenarioRollouts {
// Scenario identifier
string scenario_id = 1;
// 32 parallel rollouts
repeated JointScene joint_scenes = 2;
}
message SimAgentsChallengeSubmission {
// Multiple scenarios
repeated ScenarioRollouts scenario_rollouts = 1;
// Metadata
SubmissionType submission_type = 2;
string account_name = 3;
string unique_method_name = 4;
repeated string authors = 5;
string affiliation = 6;
string description = 7;
string method_link = 8;
// Data usage flags
bool uses_lidar_data = 9;
bool uses_camera_data = 10;
bool uses_public_model_pretraining = 11;
string num_model_parameters = 12;
// Compliance
bool acknowledge_complies_with_closed_loop_requirement = 13;
}
7.2 Complete Submission Pipeline
import os
import tarfile
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2
from waymo_open_dataset.protos import sim_agents_submission_pb2
from waymo_open_dataset.utils.sim_agents import submission_specs
def create_wosac_submission(
model,
data_path: str,
output_dir: str,
method_name: str = "my_method",
authors: list = ["Anonymous"],
affiliation: str = "Research Lab"
):
"""
Complete pipeline to create a WOSAC submission.
"""
os.makedirs(output_dir, exist_ok=True)
# Get submission configuration
challenge_type = submission_specs.ChallengeType.SIM_AGENTS
config = submission_specs.get_submission_config(challenge_type)
# Process each shard
filenames = tf.io.gfile.glob(f"{data_path}/*.tfrecord*")
output_filenames = []
for shard_idx, shard_filename in enumerate(filenames):
print(f"Processing shard {shard_idx + 1}/{len(filenames)}")
# Load scenarios from shard
dataset = tf.data.TFRecordDataset([shard_filename])
scenario_rollouts = []
for scenario_bytes in dataset.as_numpy_iterator():
scenario = scenario_pb2.Scenario.FromString(scenario_bytes)
# Get agents to simulate
sim_agent_ids = submission_specs.get_sim_agent_ids(
scenario, challenge_type
)
# Generate 32 rollouts
joint_scenes = []
for rollout_idx in range(config.n_rollouts): # 32
# Run model simulation
trajectories = model.simulate(
scenario,
sim_agent_ids,
n_steps=config.n_simulation_steps, # 80
seed=rollout_idx
)
# Create JointScene proto
simulated_trajectories = []
for agent_id, traj in zip(sim_agent_ids, trajectories):
simulated_trajectories.append(
sim_agents_submission_pb2.SimulatedTrajectory(
object_id=int(agent_id),
center_x=traj[:, 0].tolist(),
center_y=traj[:, 1].tolist(),
center_z=traj[:, 2].tolist(),
heading=traj[:, 3].tolist(),
)
)
joint_scenes.append(
sim_agents_submission_pb2.JointScene(
simulated_trajectories=simulated_trajectories
)
)
# Create ScenarioRollouts proto
scenario_rollouts.append(
sim_agents_submission_pb2.ScenarioRollouts(
scenario_id=scenario.scenario_id,
joint_scenes=joint_scenes,
)
)
# Create shard submission
shard_submission = sim_agents_submission_pb2.SimAgentsChallengeSubmission(
scenario_rollouts=scenario_rollouts,
submission_type=sim_agents_submission_pb2.SimAgentsChallengeSubmission.SIM_AGENTS_SUBMISSION,
account_name='your_email@example.com',
unique_method_name=method_name,
authors=authors,
affiliation=affiliation,
description='Submission description',
method_link='https://github.com/your/repo',
uses_lidar_data=False,
uses_camera_data=False,
uses_public_model_pretraining=False,
num_model_parameters='1000000',
acknowledge_complies_with_closed_loop_requirement=True,
)
# Save shard
shard_suffix = f'-{shard_idx:05d}-of-{len(filenames):05d}'
output_filename = f'submission.binproto{shard_suffix}'
output_path = os.path.join(output_dir, output_filename)
with open(output_path, 'wb') as f:
f.write(shard_submission.SerializeToString())
output_filenames.append(output_filename)
# Create tarball
tarball_path = os.path.join(output_dir, 'submission.tar.gz')
with tarfile.open(tarball_path, 'w:gz') as tar:
for filename in output_filenames:
tar.add(
os.path.join(output_dir, filename),
arcname=filename
)
print(f"Submission created: {tarball_path}")
return tarball_path
# Validation function
def validate_submission(submission_path: str, scenario: scenario_pb2.Scenario):
"""
Validate submission proto against scenario requirements.
"""
with open(submission_path, 'rb') as f:
submission = sim_agents_submission_pb2.SimAgentsChallengeSubmission.FromString(
f.read()
)
errors = []
for scenario_rollouts in submission.scenario_rollouts:
# Check number of rollouts
if len(scenario_rollouts.joint_scenes) != 32:
errors.append(
f"Scenario {scenario_rollouts.scenario_id}: "
f"Expected 32 rollouts, got {len(scenario_rollouts.joint_scenes)}"
)
for rollout_idx, joint_scene in enumerate(scenario_rollouts.joint_scenes):
for traj in joint_scene.simulated_trajectories:
# Check trajectory length
if len(traj.center_x) != 80:
errors.append(
f"Scenario {scenario_rollouts.scenario_id}, "
f"Rollout {rollout_idx}, Agent {traj.object_id}: "
f"Expected 80 timesteps, got {len(traj.center_x)}"
)
if errors:
print("Validation FAILED:")
for error in errors[:10]: # Show first 10 errors
print(f" - {error}")
return False
print("Validation PASSED")
return True
7.3 Submission Checklist
┌─────────────────────────────────────────────────────────────────────────────┐
│ WOSAC SUBMISSION CHECKLIST │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEFORE SUBMISSION: │
│ □ All 44,920 test scenarios processed │
│ □ Exactly 32 rollouts per scenario │
│ □ Exactly 80 timesteps per agent per rollout │
│ □ All required agents included (valid at t=10) │
│ □ center_x, center_y, center_z, heading all populated │
│ □ object_id matches scenario proto │
│ □ scenario_id matches exactly │
│ │
│ COMPLIANCE: │
│ □ Closed-loop simulation (not pre-computed trajectories) │
│ □ No access to future ground truth during simulation │
│ □ acknowledge_complies_with_closed_loop_requirement = True │
│ │
│ METADATA: │
│ □ Valid email in account_name │
│ □ Unique method_name (no special characters) │
│ □ Authors listed │
│ □ Description explains approach │
│ □ Data usage flags accurate (lidar, camera, pretraining) │
│ □ Parameter count specified │
│ │
│ FILE FORMAT: │
│ □ Sharded .binproto files │
│ □ Packaged as submission.tar.gz │
│ □ Total size < 20 GB │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8. Mental Models
8.1 What Does "Realistic" Mean?
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE THREE PILLARS OF REALISM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ REALISTIC BEHAVIOR │
│ ║ │
│ ╔═══════════════════╬═══════════════════╗ │
│ ║ ║ ║ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PHYSICALLY │ │ SOCIALLY │ │ CONTEXTUALLY │ │
│ │ PLAUSIBLE │ │ APPROPRIATE │ │ GROUNDED │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ • Kinematic limits • Collision avoidance • Stay on roads │
│ • Smooth motion • Safe following • Obey signals │
│ • Vehicle dynamics • Yielding behavior • Lane discipline │
│ • Turn radius • Merging etiquette • Speed limits │
│ │
│ MEASURED BY: MEASURED BY: MEASURED BY: │
│ Kinematic metrics Interactive metrics Map-based metrics │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.2 Distribution vs. Point Prediction
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY 32 ROLLOUTS? THE DISTRIBUTION MATCHING INSIGHT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ WRONG APPROACH: Single Best Prediction │
│ ───────────────────────────────────── │
│ │
│ Real futures: Predicted: Problem: │
│ │
│ ╱───── ───────── Only covers ONE possibility │
│ ╱ Miss rare but valid behaviors │
│ ●───────── ●───────── Fails when real differs │
│ ╲ │
│ ╲───── │
│ │
│ CORRECT APPROACH: Distribution Matching │
│ ─────────────────────────────────────── │
│ │
│ Real futures: 32 Samples: Benefit: │
│ │
│ ╱───── ╱───────── COVERS the distribution │
│ ╱ ╱───────── Histogram density matches │
│ ●───────── ●────────── Real sample gets HIGH likelihood │
│ ╲ ╲───────── │
│ ╲───── ╲───────── │
│ │
│ KEY INSIGHT: We don't predict THE future, we predict POSSIBLE futures │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.3 The Goldilocks Zone of Agent Behavior
┌─────────────────────────────────────────────────────────────────────────────┐
│ AGENT BEHAVIOR SPECTRUM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TOO PASSIVE JUST RIGHT TOO AGGRESSIVE │
│ (IDM-like) (Human-like) (Log Replay) │
│ │
│ ◄────────────────────────────────┼────────────────────────────► │
│ │ │
│ Characteristics: Characteristics: Characteristics:│
│ • Always yields • Context-dependent • Never yields │
│ • Large gaps • Appropriate gaps • Tiny gaps │
│ • Never enters • Confident merging • Forces entry │
│ contested space • Negotiates right-of-way • Ignores others│
│ │
│ WOSAC Score: ~0.45 WOSAC Score: ~0.75+ WOSAC Score: │
│ ~0.72 (oracle) │
│ Problem: Goal: Problem: │
│ Unrealistic traffic flow Match real driver behavior No adaptation │
│ AV never challenged Emergent interactions to AV behavior │
│ │
│ Example: Example: Example: │
│ Car waits forever Car yields, then goes Car goes always │
│ at 4-way stop when appropriate regardless │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.4 Closed-Loop vs. Open-Loop
┌─────────────────────────────────────────────────────────────────────────────┐
│ OPEN-LOOP vs. CLOSED-LOOP SIMULATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ OPEN-LOOP (Prediction): │
│ ─────────────────────── │
│ │
│ t=0 t=1 t=2 t=3 t=4 │
│ │ │ │ │ │ │
│ ●───────►◯───────►◯───────►◯───────►◯ Predict all at once │
│ (no feedback from simulation) │
│ │
│ Problem: Predictions diverge from reality │
│ No reaction to other agents' actual behavior │
│ │
│ CLOSED-LOOP (Simulation): │
│ ───────────────────────── │
│ │
│ t=0 t=1 t=2 t=3 t=4 │
│ │ │ │ │ │ │
│ ●───────►●───────►●───────►●───────►● │
│ │ │ │ │ │
│ └────────┴────────┴────────┘ │
│ (feedback at each step) │
│ │
│ Benefit: Agents REACT to each other │
│ Emergent interactions │
│ Matches how AV testing actually works │
│ │
│ WOSAC REQUIRES: Closed-loop simulation │
│ Submissions must NOT pre-compute trajectories │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.5 Why Metrics Matter for AV Safety
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE SIMULATION→SAFETY PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GOAL: Prove AV is safe before deployment │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Train │ │ Test in │ │ Identify │ │ Deploy │ │
│ │ AV │───►│ Simulation │───►│ Fixes │───►│ Safely │ │
│ │ Planner │ │ │ │ │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │ │
│ ▼ │
│ BUT: Simulation only useful if │
│ agents behave realistically! │
│ │
│ IF AGENTS TOO PASSIVE: IF AGENTS TOO AGGRESSIVE: │
│ ────────────────────── ───────────────────────── │
│ • AV never challenged • AV trained to be timid │
│ • Edge cases not found • False failure modes │
│ • Overconfident deployment • Unnecessary conservatism │
│ │
│ WOSAC METRICS ANSWER: │
│ "Are the simulated agents behaving like real humans?" │
│ │
│ High WOSAC score → High-quality simulation → Trustworthy AV testing │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
9. Hands-On Exercises
Exercise 1: Implement Constant Velocity Baseline
Objective: Create the simplest possible WOSAC submission.
"""
Exercise 1: Constant Velocity Baseline
Tasks:
1. Load a scenario from WOMD
2. Extract initial states for all sim agents
3. Propagate with constant velocity for 80 steps
4. Add small noise for diversity across 32 rollouts
5. Package into submission proto
"""
import numpy as np
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2
from waymo_open_dataset.protos import sim_agents_submission_pb2
def exercise_1_constant_velocity(scenario_bytes: bytes) -> sim_agents_submission_pb2.ScenarioRollouts:
"""
YOUR CODE HERE
Steps:
1. Parse scenario from bytes
2. Extract agent states at t=10 (current time)
3. Compute velocities from t=9 to t=10
4. Generate 32 rollouts with small noise
5. Return ScenarioRollouts proto
"""
# Parse scenario
scenario = scenario_pb2.Scenario.FromString(scenario_bytes)
# TODO: Extract initial states
# Hint: scenario.tracks contains all agent trajectories
# Hint: Each track has states.center_x, states.center_y, etc.
# TODO: Get sim agent IDs
# Hint: Valid agents at timestep 10
# TODO: Compute velocities
# Hint: v = (pos[10] - pos[9]) / 0.1
# TODO: Generate 32 rollouts
# Hint: Add Gaussian noise with std=0.01 per step
# TODO: Create proto messages
pass # Replace with your implementation
# Test your implementation
def test_exercise_1():
"""Test constant velocity baseline."""
# Create dummy scenario
# In practice, load from WOMD tfrecord
# Verify output format
# - 32 joint scenes
# - 80 timesteps per trajectory
# - All sim agents present
pass
# Solution hint structure
SOLUTION_SKELETON = """
def exercise_1_solution(scenario_bytes):
scenario = scenario_pb2.Scenario.FromString(scenario_bytes)
# Get valid agents at current time (index 10)
current_time = 10
sim_agent_ids = []
initial_states = {}
for track in scenario.tracks:
if track.states[current_time].valid:
sim_agent_ids.append(track.id)
initial_states[track.id] = {
'x': track.states[current_time].center_x,
'y': track.states[current_time].center_y,
'z': track.states[current_time].center_z,
'heading': track.states[current_time].heading,
'vx': (track.states[current_time].center_x -
track.states[current_time-1].center_x) / 0.1,
'vy': (track.states[current_time].center_y -
track.states[current_time-1].center_y) / 0.1,
}
# Generate 32 rollouts
joint_scenes = []
for rollout in range(32):
trajectories = []
for agent_id in sim_agent_ids:
state = initial_states[agent_id]
traj = {'x': [], 'y': [], 'z': [], 'heading': []}
x, y, z = state['x'], state['y'], state['z']
heading = state['heading']
for t in range(80):
# Add noise
noise_x = np.random.normal(0, 0.01)
noise_y = np.random.normal(0, 0.01)
x += state['vx'] * 0.1 + noise_x
y += state['vy'] * 0.1 + noise_y
traj['x'].append(x)
traj['y'].append(y)
traj['z'].append(z)
traj['heading'].append(heading)
trajectories.append((agent_id, traj))
# Create JointScene
...
return scenario_rollouts
"""
Exercise 2: Compute Metrics Locally
Objective: Understand metric computation by implementing it yourself.
"""
Exercise 2: Local Metric Computation
Tasks:
1. Implement linear speed metric
2. Implement collision detection
3. Implement offroad detection
4. Combine into mini realism score
"""
import numpy as np
def exercise_2_metrics(
trajectories: np.ndarray, # [32, n_agents, 80, 4]
ground_truth: np.ndarray, # [n_agents, 80, 4]
road_boundary: np.ndarray, # Road polygon vertices
) -> dict:
"""
YOUR CODE HERE
Compute metrics for 32 simulated rollouts against ground truth.
Returns:
{
'linear_speed_likelihood': float,
'collision_rate': float,
'offroad_rate': float,
'mini_realism_score': float,
}
"""
# TODO: Compute linear speeds for all rollouts
# Shape: [32, n_agents, 79]
# TODO: Build histogram of simulated speeds
# TODO: Compute likelihood of ground truth under histogram
# TODO: Detect collisions (simplified: center-to-center distance < threshold)
# TODO: Detect offroad (point-in-polygon test)
# TODO: Combine into score
pass
def test_exercise_2():
"""Test metric computation."""
np.random.seed(42)
# Generate test data
n_agents = 5
trajectories = np.random.randn(32, n_agents, 80, 4)
trajectories[..., :2] = np.cumsum(trajectories[..., :2] * 0.5, axis=2)
ground_truth = trajectories[0] # Use first rollout as GT
# Simple rectangular road
road_boundary = np.array([
[-100, -10],
[100, -10],
[100, 10],
[-100, 10],
])
metrics = exercise_2_metrics(trajectories, ground_truth, road_boundary)
print("Computed Metrics:")
for name, value in metrics.items():
print(f" {name}: {value:.4f}")
Exercise 3: Improve Baseline with Simple Rules
Objective: Add rule-based improvements to constant velocity.
"""
Exercise 3: Rule-Enhanced Baseline
Add simple rules to improve constant velocity baseline:
1. Collision avoidance: Slow down if about to collide
2. Road following: Steer back if drifting off road
3. Speed limits: Cap maximum speed
"""
def exercise_3_rule_enhanced(
scenario,
max_speed: float = 20.0, # m/s
safe_distance: float = 5.0, # meters
road_margin: float = 2.0, # meters from edge
) -> sim_agents_submission_pb2.ScenarioRollouts:
"""
YOUR CODE HERE
Implement rule-enhanced simulation:
For each timestep:
1. Compute default action (constant velocity)
2. Check collision risk with other agents
- If distance < safe_distance: reduce speed
3. Check road boundary
- If distance < road_margin: steer toward road center
4. Apply speed limit
5. Update state
"""
# TODO: Initialize from scenario
# TODO: Simulation loop with rules
# TODO: Package into proto
pass
# Hints for implementation
EXERCISE_3_HINTS = """
Collision Avoidance Logic:
for each agent i:
for each other agent j:
dist = ||pos_i - pos_j||
if dist < safe_distance:
# Reduce speed proportionally
speed_factor = dist / safe_distance
velocity *= speed_factor
Road Following Logic:
for each agent:
dist_to_edge = compute_road_distance(pos)
if dist_to_edge < road_margin:
# Compute steering toward road center
center_direction = compute_center_direction(pos, road)
heading += steering_gain * center_direction
Speed Limit Logic:
speed = ||velocity||
if speed > max_speed:
velocity *= max_speed / speed
"""
Exercise 4: Analyze Winning Solutions
Objective: Understand what makes top solutions work.
"""
Exercise 4: Solution Analysis
Tasks:
1. Read MVTA paper and identify 3 key innovations
2. Compare SMART tokenization to continuous prediction
3. Explain why RL fine-tuning helps (SMART-R1)
"""
# Answer template
EXERCISE_4_ANSWERS = """
Question 1: What are 3 key innovations in MVTA (2023 winner)?
YOUR ANSWER:
1. _______________________________________________
2. _______________________________________________
3. _______________________________________________
Question 2: Why does tokenization (SMART) outperform continuous prediction?
YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________
Question 3: How does RL fine-tuning (SMART-R1) improve over behavior cloning?
YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________
Question 4: What is the "covariate shift" problem in closed-loop simulation?
YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________
Question 5: Why are collision and offroad metrics weighted 2x?
YOUR ANSWER:
________________________________________________
________________________________________________
________________________________________________
"""
# Reference answers (reveal after attempting)
REFERENCE_ANSWERS = """
Q1 Key Innovations:
1. Variable-length history sampling - creates more training data
2. Receding horizon prediction - only use first 0.1s of 1s prediction
3. Top-K sampling - balance realism with diversity
Q2 Tokenization Benefits:
- Discretization acts as regularization
- Captures multi-modal distribution naturally
- Avoids mode collapse of continuous GMM
- Enables language model architectures (decoder-only)
Q3 RL Fine-Tuning Benefits:
- Directly optimizes evaluation metric (WOSAC score)
- Addresses distribution mismatch between training and testing
- Behavior cloning only mimics, doesn't optimize for metric
- RL explores beyond training distribution
Q4 Covariate Shift:
- Training: model sees ground truth history
- Testing: model sees its own predictions as history
- Errors compound over time
- Solution: closed-loop training or RL fine-tuning
Q5 Safety Weights:
- Collisions and offroad are safety-critical failures
- A simulator with many collisions is unusable
- Double weighting emphasizes these must be rare
- Reflects real-world importance hierarchy
"""
10. Interview Questions
Conceptual Understanding
Q1: Why is WOSAC framed as distribution matching rather than trajectory prediction?
<details> <summary>Answer</summary>Distribution matching is fundamentally different from trajectory prediction because:
- Multiple Valid Futures: At any moment, there are many plausible ways a driver could behave. A car approaching an intersection might turn left, go straight, or turn right - all valid.
- Evaluation Challenge: If we predict "turn left" but the real driver went straight, a point prediction metric would say we're wrong. But "turn left" might still be a perfectly realistic behavior.
- 32 Samples Solution: By generating 32 rollouts, we approximate the distribution of possible behaviors. The metric then asks: "Does this distribution assign reasonable probability to what actually happened?"
- Practical Impact: This encourages models to capture behavioral diversity rather than just the most likely outcome, which is crucial for finding edge cases in simulation.
</details>
Q2: Explain why log replay achieves only ~0.72 on WOSAC despite being "ground truth".
<details> <summary>Answer</summary>Log replay's imperfect score reveals key insights:
- No Diversity: Replaying the same trajectory 32 times puts all histogram mass in one bin. If real behavior has ANY natural variation, the likelihood is suboptimal.
- Single Sample: We only observed ONE outcome from the real distribution. The true distribution might have variance that log replay doesn't capture.
- Adaptation: Log replay cannot adapt to different AV behaviors. In real testing, the AV might act differently than in the recording, requiring agents to react.
- Metric Design: WOSAC intentionally penalizes this "cheating" approach to encourage learning generalizable behaviors.
</details>
Q3: What is covariate shift and how do modern WOSAC solutions address it?
<details> <summary>Answer</summary>Covariate Shift occurs when the distribution of inputs at test time differs from training:
- Training: Model sees ground-truth history from the dataset
- Testing: Model sees its own previous predictions as history
Consequences:
- Small errors accumulate over 80 timesteps
- Model enters states never seen during training
- Performance degrades significantly over time
Solutions:
- Closed-Loop Training (MVTA): Train autoregressively, feeding predictions back as input
- Scheduled Teacher Forcing (TrafficBots): Gradually reduce ground-truth exposure during training
- Top-K Sampling During Training (SMART): Intentionally use non-optimal tokens to expose the model to diverse states
- RL Fine-Tuning (SMART-R1): Explicitly train in closed-loop with the WOSAC metric as reward
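For example, scheduled teacher forcing (solution 2 above) can be implemented as a coin flip whose bias toward ground truth decays over training. A minimal sketch with an arbitrarily chosen linear schedule:
import numpy as np

def choose_history(gt_state, predicted_state, epoch, total_epochs, rng=np.random):
    """Scheduled teacher forcing: condition on ground truth early in training,
    on the model's own prediction later, with a linearly decaying probability."""
    p_ground_truth = max(0.0, 1.0 - epoch / total_epochs)
    if rng.random() < p_ground_truth:
        return gt_state        # open-loop: logged history
    return predicted_state     # closed-loop: the model's own rollout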
</details>
Technical Deep-Dives
Q4: Walk through the mathematical computation of a kinematic metric likelihood.
<details> <summary>Answer</summary>Let's compute linear speed likelihood step-by-step:
STEP 1: Extract speeds from 32 rollouts
─────────────────────────────────────────
For rollout k, agent i, timestep t:
v_{k,i,t} = ||pos_{k,i,t+1} - pos_{k,i,t}|| / 0.1
Result: speeds array of shape [32, N_agents, 79]
STEP 2: Build histogram for one agent at one timestep
─────────────────────────────────────────────────────
speeds_flat = speeds[:, agent_i, timestep_t] # 32 values
bins = [0, 2, 4, 6, ..., 30] # 15 bins
hist, edges = histogram(speeds_flat, bins)
# Example: hist = [2, 5, 10, 8, 4, 3, 0, 0, ...]
STEP 3: Apply Laplace smoothing
────────────────────────────────
hist_smoothed = hist + 0.1
# Prevents zero probability bins
STEP 4: Normalize to probability
─────────────────────────────────
prob = hist_smoothed / sum(hist_smoothed)
# Example: prob = [0.063, 0.152, 0.301, ...]
STEP 5: Find bin for ground truth
──────────────────────────────────
gt_speed = 5.5 m/s
bin_idx = digitize(gt_speed, edges) - 1 = 2  # falls in the [4, 6) bin
STEP 6: Compute NLL
────────────────────
nll = -log(prob[bin_idx]) = -log(0.301) = 1.20
STEP 7: Average over time
──────────────────────────
For each timestep t:
nll_t = compute_nll(t)
average_nll = mean(nll_t for t in valid_timesteps)
STEP 8: Convert to likelihood
──────────────────────────────
likelihood = exp(-average_nll)
# Example: exp(-1.20) = 0.30
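The eight steps condense to a few lines of NumPy. The sketch below scores a single agent/timestep with the same 2 m/s bins and 0.1 smoothing used in the example numbers; the function name and toy data are illustrative.
import numpy as np

def speed_nll(sim_speeds, gt_speed, smoothing=0.1):
    """NLL of one ground-truth speed under the histogram of 32 simulated speeds."""
    edges = np.arange(0, 32, 2)                                # 15 bins of 2 m/s (Step 2)
    hist, _ = np.histogram(sim_speeds, bins=edges)
    prob = (hist + smoothing) / (hist + smoothing).sum()       # Steps 3-4
    bin_idx = np.clip(np.digitize(gt_speed, edges) - 1, 0, len(prob) - 1)  # Step 5
    return -np.log(prob[bin_idx])                               # Step 6

# Toy example: 32 simulated speeds around 5 m/s, ground truth at 5.5 m/s
sim_speeds = np.random.default_rng(0).normal(5.0, 1.5, size=32)
nll = speed_nll(sim_speeds, 5.5)
print(nll, np.exp(-nll))   # Steps 7-8: average NLLs over time, then exponentiate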
</details>
Q5: How does SMART's tokenization work and why is it effective?
<details> <summary>Answer</summary>SMART Tokenization Process:
1. TRAJECTORY SEGMENTATION
──────────────────────────
Split continuous trajectory into 0.5-second segments
Each segment = 5 timesteps at 10 Hz
2. K-DISK CLUSTERING
───────────────────
Collect all segments from training data
Apply the k-disk clustering algorithm (vocabulary size ≈ 2048)
Each selected anchor segment = one token
3. VOCABULARY
────────────
Motion tokens: ~2048 for each agent type
Map tokens: ~1000 for road segments
4. TOKENIZATION AT RUNTIME
─────────────────────────
For each 0.5s segment:
Find nearest cluster centroid
Return token index
Why It's Effective:
- Built-in Discretization: Prevents mode collapse common in continuous GMM outputs
- Natural Multi-modality: Each token represents a distinct behavior mode
- Language Model Architecture: Enables decoder-only transformers (like GPT) which scale efficiently
- Regularization: Quantization adds implicit regularization, reducing overfitting
- Rolling Matching: During training, tokens are matched against the previous prediction (not GT), which reduces covariate shift
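A toy version of the tokenize/detokenize step, with a random array standing in for the real k-disk vocabulary; all names and sizes here are illustrative, not SMART's actual values.
import numpy as np

# Toy stand-in for the real vocabulary: 2048 anchor segments of 5 timesteps,
# each storing (dx, dy, dheading). The real anchors come from k-disk clustering.
rng = np.random.default_rng(0)
vocab = rng.normal(0.0, 1.0, size=(2048, 5, 3))

def tokenize_segment(segment, vocab):
    """Return the index of the nearest vocabulary anchor for one 0.5 s segment."""
    dists = np.linalg.norm((vocab - segment).reshape(len(vocab), -1), axis=1)
    return int(np.argmin(dists))

def detokenize(token_id, vocab):
    """Map a token index back to its anchor segment (the decoded motion)."""
    return vocab[token_id]

segment = rng.normal(0.0, 1.0, size=(5, 3))   # one observed 0.5 s motion segment
token = tokenize_segment(segment, vocab)
reconstructed = detokenize(token, vocab)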
</details>
Q6: Design a new metric to capture "naturalness" of lane changes.
<details> <summary>Answer</summary>Proposed: Lane Change Naturalness Metric
DEFINITION:
───────────
Measure how similar simulated lane changes are to real lane changes
in terms of duration, lateral displacement profile, and context.
FEATURE EXTRACTION:
───────────────────
For each detected lane change:
1. Duration: T_lc (time from start to end)
2. Lateral Profile: y(t) displacement curve
3. Context: gap acceptance (front/rear distances)
COMPUTATION:
────────────
1. Detect lane changes in both simulated and real data:
- Lateral displacement > threshold (e.g., 3m)
- Sustained lane position after change
2. Extract features for each lane change:
lc_features = {
'duration': T_lc,
'max_lateral_speed': max(dy/dt),
'smoothness': mean(|d²y/dt²|),
'front_gap': distance to lead vehicle,
'rear_gap': distance to following vehicle,
}
3. Build distributions from 32 rollouts
4. Compute likelihood of real lane change features
INTEGRATION:
────────────
Can be added to Interactive metrics category
Weight: 1.0 (or 2.0 if safety-critical)
CHALLENGES:
───────────
- Lane change detection is non-trivial
- Requires accurate lane information from map
- Sparse events (not every scenario has lane changes)
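A rough sketch of the feature-extraction step for a single lane change, assuming a lane-relative lateral offset y(t) is already available (in practice this requires map matching); the gap features are omitted because they need the surrounding agents, and the function name and thresholds are illustrative.
import numpy as np

def lane_change_features(y, dt=0.1, displacement_threshold=3.0):
    """Extract naturalness features from one lane-relative lateral profile y(t).

    Returns None if total lateral displacement stays below the threshold
    (i.e., no lane change detected).
    """
    if np.abs(y[-1] - y[0]) < displacement_threshold:
        return None
    dy = np.gradient(y, dt)                     # lateral speed
    ddy = np.gradient(dy, dt)                   # lateral acceleration
    # Duration: time spent between 10% and 90% of the total displacement
    progress = (y - y[0]) / (y[-1] - y[0])
    moving = (progress > 0.1) & (progress < 0.9)
    return {
        'duration': float(moving.sum() * dt),
        'max_lateral_speed': float(np.max(np.abs(dy))),
        'smoothness': float(np.mean(np.abs(ddy))),
    }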
</details>
System Design
Q7: How would you scale a WOSAC submission to handle 45,000 scenarios efficiently?
<details> <summary>Answer</summary>Scaling Strategy:
1. PARALLELIZATION
─────────────────
- Scenarios are independent → embarrassingly parallel
- Use distributed computing (e.g., Ray, Spark)
- Target: 100+ scenarios per GPU simultaneously
2. BATCHING
─────────
- Batch multiple scenarios per forward pass
- Batch 32 rollouts together (same scenario)
- Use dynamic batching for variable agent counts
3. EFFICIENT SIMULATION
─────────────────────
- Use JAX/Waymax for GPU-accelerated simulation
- JIT compile the simulation step
- Avoid Python loops in inner simulation
4. INFRASTRUCTURE
───────────────
Recommended setup:
- 8× A100 GPUs
- ~5 hours for full test set
- Sharded output (150 files)
5. PIPELINE
──────────
┌─────────┐ ┌──────────┐ ┌─────────┐
│ Load │────►│ Simulate │────►│ Save │
│ Shards │ │ (GPU) │ │ Shards │
└─────────┘ └──────────┘ └─────────┘
│ │ │
└───────────────┼────────────────┘
│
Parallel across GPUs
6. VALIDATION
────────────
- Validate each shard immediately after creation
- Check: 32 rollouts, 80 steps, all agents
- Fail fast on errors
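A minimal shape for the shard-level parallelism using only the standard library; distributed frameworks like Ray follow the same pattern. simulate_scenario is a placeholder for your model call, and the shard sizes are illustrative, not the official split.
from concurrent.futures import ProcessPoolExecutor

def simulate_scenario(scenario_index):
    """Placeholder: run your model to produce 32 rollouts for one scenario."""
    return scenario_index  # stand-in for a ScenarioRollouts proto

def simulate_shard(shard_index, scenarios_per_shard=300):
    """Simulate all scenarios in one shard and return them for writing."""
    start = shard_index * scenarios_per_shard
    return [simulate_scenario(i) for i in range(start, start + scenarios_per_shard)]

if __name__ == '__main__':
    # Shards (and scenarios) are independent, so fan them out across workers.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for shard_index, rollouts in enumerate(pool.map(simulate_shard, range(150))):
            print(f'shard {shard_index}: {len(rollouts)} scenarios simulated')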
</details>
Q8: You notice your model has high collision rate (15%) vs. baseline (5%). Debug this.
<details> <summary>Answer</summary>Debugging Framework:
STEP 1: REPRODUCE AND ISOLATE
─────────────────────────────
- Identify scenarios with highest collision rates
- Visualize specific collision events
- Categorize collision types:
□ Rear-end collisions
□ Side-swipe during lane change
□ Intersection conflicts
□ Pedestrian/cyclist interactions
STEP 2: ANALYZE ROOT CAUSES
───────────────────────────
For each category, investigate:
A) Data Issue:
- Training data has collisions → model learns them
- Fix: Filter collision trajectories from training
B) Covariate Shift:
- Model enters unseen states → erratic behavior
- Fix: Closed-loop training, scheduled teacher forcing
C) Multi-Agent Coordination:
- Independent sampling → no collision avoidance
- Fix: Joint prediction, iterative refinement
D) Speed/Acceleration Bounds:
- Unrealistic kinematics → impossible avoidance
- Fix: Add kinematic constraints in output
STEP 3: TARGETED FIXES
──────────────────────
Priority order:
1. Add collision loss during training
2. Implement closed-loop fine-tuning
3. Add inference-time collision mitigation (MTR+++)
4. Ensemble multiple samples, select least-collision set
STEP 4: ABLATION STUDY
──────────────────────
Test each fix independently:
| Fix | Collision Rate |
|------------------------|----------------|
| Baseline | 15% |
| + Collision loss | 10% |
| + Closed-loop training | 7% |
| + Inference mitigation | 4% |
STEP 5: VERIFY NO REGRESSION
────────────────────────────
Check other metrics didn't degrade:
- Kinematic metrics (might be too conservative)
- Diversity (might collapse to "safe" behaviors)
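As an illustration of fix 1, a simple differentiable collision penalty that can be added to the training loss. It is written against PyTorch tensors (assuming a PyTorch training loop) and uses a plain hinge on center-to-center distance, whereas production implementations typically use oriented bounding boxes.
import torch

def collision_loss(pred_xy, threshold=3.0):
    """Hinge penalty on pairwise center distances that drop below a threshold.

    pred_xy: predicted positions, shape [batch, timesteps, n_agents, 2].
    Returns a scalar that is zero when every agent pair stays >= threshold apart.
    """
    b, t, a, _ = pred_xy.shape
    flat = pred_xy.reshape(b * t, a, 2)
    dist = torch.cdist(flat, flat)                               # [b*t, a, a]
    dist = dist + torch.eye(a, device=pred_xy.device) * 1e6      # mask self-distances
    return torch.relu(threshold - dist).mean()                   # >0 only when too close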
</details>
Q9: Compare WOSAC to other autonomous driving benchmarks. What's unique?
<details> <summary>Answer</summary>Benchmark Comparison:
┌───────────────┬─────────────────┬─────────────────┬─────────────────┐
│ Benchmark │ Task │ Evaluation │ Unique to │
│ │ │ │ WOSAC │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│ WOSAC │ Multi-agent │ Distribution │ ✓ Closed-loop │
│ │ simulation │ matching │ ✓ 32 rollouts │
│ │ │ (likelihood) │ ✓ Joint scenes │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│ WOMD-MP │ Motion │ minADE, minFDE │ Single agent │
│ │ prediction │ miss rate │ Open-loop │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│ nuScenes │ Detection + │ mAP, NDS │ Perception │
│ │ Prediction │ ADE, FDE │ focused │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Argoverse │ Motion │ minADE, MR │ Trajectory │
│ │ forecasting │ brier-minFDE │ diversity │
├───────────────┼─────────────────┼─────────────────┼─────────────────┤
│ CARLA │ Closed-loop │ Task success │ Synthetic │
│ │ driving │ infraction rate │ environment │
└───────────────┴─────────────────┴─────────────────┴─────────────────┘
WOSAC's Unique Contributions:
1. DISTRIBUTION OVER POINT PREDICTION
- Other benchmarks: "Is this trajectory close to GT?"
- WOSAC: "Does this distribution cover GT?"
2. MULTI-AGENT JOINT SIMULATION
- Other benchmarks: Predict each agent independently
- WOSAC: All agents simulated together
3. CLOSED-LOOP REQUIREMENT
- Other benchmarks: Often open-loop evaluation
- WOSAC: Must be truly autoregressive
4. INTERPRETABLE METRIC DECOMPOSITION
- Kinematic + Interactive + Map-based
- Identifies specific failure modes
5. REAL DATA SCALE
- 45,000 real scenarios
- Diverse conditions (urban, highway, etc.)
</details>
Q10: If you were to improve WOSAC metrics, what would you add?
<details> <summary>Answer</summary>Proposed Improvements:
1. SEMANTIC BEHAVIOR METRICS
──────────────────────────
Current: Low-level kinematics
Proposed: High-level behavior recognition
Examples:
- Yield compliance at intersections
- Lane change appropriateness
- Gap acceptance realism
- Turn signal correlation (if available)
2. LONG-TAIL SCENARIO WEIGHTING
─────────────────────────────
Current: Uniform scenario weighting
Proposed: Upweight rare/critical scenarios
Rare scenarios:
- Near-miss situations
- Construction zones
- Emergency vehicles
- Adverse weather
3. TEMPORAL CONSISTENCY METRICS
─────────────────────────────
Current: Per-timestep evaluation
Proposed: Trajectory-level patterns
Examples:
- Smoothness over entire trajectory
- Intention consistency (committed actions)
- Reaction time to events
4. ADVERSARIAL ROBUSTNESS
────────────────────────
Current: Fixed scenarios
Proposed: Perturbed initial conditions
Test if model handles:
- Slightly different starting positions
- Noisy observations
- Missing agents
5. COMPUTATIONAL EFFICIENCY METRIC
────────────────────────────────
Current: Only accuracy metrics
Proposed: Realism per FLOP
Rationale:
- Real-time simulation matters
- Encourages efficient architectures
- Separate track for resource-constrained solutions
6. CAUSALITY-AWARE METRICS
─────────────────────────
Current: All agents weighted equally
Proposed: Weight by causal relevance to ego
Focus on agents that:
- Are in ego's path
- Could interact with ego
- Influence ego's decisions
</details>
Summary
The Waymo Open Sim Agents Challenge represents a critical benchmark for autonomous vehicle simulation. Key takeaways:
- Metrics Matter: The realism meta-metric's design (distribution matching, safety weighting) directly shapes what models learn.
- Evolution of Approaches: From transformers (MVTA) to tokenization (SMART) to RL fine-tuning (SMART-R1), each generation addresses previous limitations.
- Closed-Loop is Essential: The gap between open-loop training and closed-loop evaluation is the central challenge.
- Distribution Over Points: Generating 32 diverse rollouts that cover the behavioral distribution is more important than single accurate predictions.
- Safety-Critical Emphasis: Double-weighting collisions and offroad events reflects real-world priorities.
References
Primary Sources
- The Waymo Open Sim Agents Challenge (arXiv:2305.12032)
- SMART: Scalable Multi-agent Real-time Motion Generation (arXiv:2405.15677)
- Multiverse Transformer: 1st Place WOSAC 2023 (arXiv:2306.11868)
- MTR+++: 2nd Place WOSAC 2023 (arXiv:2306.15914)
- TrajTok: 2025 WOSAC Technical Report (arXiv:2506.21618)
Official Resources
- Waymo Open Dataset Challenges
- WOSAC Research Page
- Waymax GitHub Repository
- WOSAC Submission Tutorial (Waymo)
- WOSAC Submission via Waymax
Additional Reading
- TrafficBots V1.5 (3rd Place WOSAC 2024)
- Trajeglish: Traffic Modeling as Next-Token Prediction
- WOSAC Topic Overview (EmergentMind)
Last Updated: January 2026