WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Source-first digest for monthly 2026_05 rank 17, rank_id p043.

Routing status: success
Extraction route: full_markdown
PDF extraction: not used

Motivation / Background

The paper argues that interactive video world models need a broader evaluation contract than ordinary text-to-video systems. A useful interactive world model has to render plausible video, initialize the requested world, execute user controls, remember state over turns, and obey physical or causal constraints. The authors frame these as five game-engine-like roles: renderer, director, controller, memory, and engine. Existing benchmarks cover parts of this stack, but not the whole combination of open-domain scenes, first- and third-person perspectives, multiple interaction types, multi-turn state, and physics.

Figure 1 is the paper's compact thesis: a benchmark case should specify both the world setting and a multi-turn interaction sequence, then evaluate not just the final clip quality but whether each requested turn remains controllable and coherent.

**Figure 1. WBench overview.** The benchmark combines world settings, four interaction classes, unified navigation control, and five evaluation dimensions.

Benchmark family	Main coverage	Missing piece relative to WBench
Video-generation suites such as VBench	Visual quality, perceptual realism, text alignment	No action input or multi-turn interaction
World-model suites such as MIND / WorldMark	Navigation or memory-oriented interaction	Limited semantic interactions, perspective coverage, or physics scope
Autonomous-driving or robotics suites	Domain-specific control and dynamics	Not open-domain, not both first- and third-person
WBench	Text, camera, and action inputs; FPP and TPP; navigation, subject action, event editing, perspective switching; quality, setting, interaction, consistency, physics	The paper still limits itself to discrete action sequences rather than continuous control

Table 1. Benchmark-coverage framing. This digest table condenses the source benchmark-comparison table and related-work discussion. The key gap is unified evaluation of multi-turn interactive video, not simply another video-quality leaderboard.

Claims And Evidence

Claim id	Main claim	Support	Evidence anchors
C1	WBench fills a real benchmark gap by jointly covering open-domain settings, both perspectives, four interaction types, multi-turn evaluation, and five evaluation dimensions.	5	problem framing, dataset design, evaluation suite, benchmark coverage table
C2	The dataset is broad enough to expose different failure modes: 289 cases, 1,058 turns, 2-9 turns per case, multiple scenes, styles, subjects, and interaction subtypes.	5	dataset design, dataset statistics table, dataset distribution figure
C3	The navigation design is a meaningful cross-paradigm bridge because text, 6-DoF pose, and discrete keyboard controls are aligned to the same underlying action semantics.	5	navigation design, navigation action table, navigation definition figure
C4	The evaluation suite is not one scalar metric; it uses 22 sub-metrics across video quality, setting adherence, interaction adherence, consistency, and physics compliance.	5	evaluation suite, metric summary table, NavScore equations
C5	Across 20 models, no model dominates all dimensions; text-driven models lead setting and physics, while native-control world models lead navigation.	5	experiment protocol, key results table, results pattern
C6	Navigation is structurally decoupled from rendering, consistency, and physical compliance, so high video quality does not imply controllable motion.	5	cross-dimension analysis, correlation figure, turn degradation figure
C7	The automatic metrics are reasonably aligned with human preference at model-ranking granularity, with Spearman correlations at least 0.94 across ten aspects.	5	human validation, human alignment figure, human platform figure
C8	The benchmark is strong but not complete: it focuses on discrete actions, still relies partly on LMM-based physical judging, and needs broader domains and real-time evaluation.	4	limitations

Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly backed by source definitions, figures, tables, or experiments. A score of 4 means the paper gives substantial evidence but the claim still depends on benchmark design choices or automated judges.

Core Technical Idea

WBench treats an interactive world model as a conditional generator:

o_{t+1} \sim f_\theta(o_{t+1} \mid o_{\le t}, a_{\le t}).

The benchmark decomposes each case into a world setting \(\mathcal{W}\), which fixes the initial world state \(o_0\), and an interaction sequence \(\mathcal{I}=(a_0,a_1,\ldots,a_{T-1})\), which specifies what the user does across consecutive turns. This separation is the important technical move: failures can be attributed to wrong initialization, wrong action execution, poor memory, or poor physics rather than collapsed into a generic video score.

Figure 2 shows the dataset distribution, and Table 2 gives the digest-scale summary.

**Figure 2. Dataset distribution.** The source figure summarizes perspective, interaction type, subject, scene, style, event-editing subtype, perspective-switching subtype, and turn-depth distributions.

Dataset axis	Value
Total cases / turns	289 cases / 1,058 turns
Turns per case	2-9 turns, average 3.7
Perspective split	62% first-person, 38% third-person
Interaction split	57% navigation, 20% subject action, 17% event editing, 6% perspective switching
Scene split	Nature 31%, urban 21%, indoor 17%, workspace 13%, fantasy 10%, sports/game 8%
Explicit-subject cases	194 cases
Subject split	Human 64%, animal 9%, robot 9%, vehicle 7%, miscellaneous object 10%
Style split	52% photorealistic, 48% other styles including anime, cartoon, CG, oil painting, ink wash, pencil sketch, flat, and abstract
Longer sequences	12% of cases have 5-9 turns, usually mixing subject action and event editing

Table 2. WBench dataset statistics. The benchmark is designed to probe long-horizon state rather than single-shot prompt following.

The evaluation suite has five dimensions and 22 sub-metrics. All sub-metric scores are linearly rescaled to \([0,100]\), higher is better.

Dimension	What it checks	Representative sub-metrics or mechanisms
Video quality	Perceptual quality independent of control	Aesthetic quality, imaging quality, temporal flicker, dynamic degree, motion smoothness, HPSv3-Norm
Setting adherence	Whether the generated video realizes the world setting	Scene adherence and subject adherence via VLM checks
Interaction adherence	Whether requested turns are executed	NavScore with pose estimation; event editing, subject action, and perspective switching via structured VLM protocols
Consistency	Whether identity, background, geometry, perspective, and continuity persist	Subject/background consistency, gated spatial consistency, segment continuity, perspective consistency, geometric and photometric consistency
Physical compliance	Whether events obey plausible physical and causal rules	Causal fidelity plus visual plausibility from a fine-tuned Qwen3-VL-30B-A3B judge

Table 3. Evaluation suite. WBench is best read as a diagnostic grid. It deliberately keeps interaction, memory, and physics separate because they fail differently.

The most distinctive metric is NavScore, which compares estimated camera trajectories against synthetic ground-truth trajectories derived from the action sequence. The benchmark uses MegaSaM for camera pose estimation, then resamples both predicted and ground-truth trajectories to \(K=20\) arc-length points per turn. The normalized translation and rotation errors are:

\mathrm{nATE}_t = \min\left(\frac{\mathrm{ATE}_t}{\max(L_{\mathrm{pred}}, 0.5)}, 1\right), \qquad \mathrm{nATE}_r = \min\left(\frac{\mathrm{ATE}_r}{\max(\Theta_{\mathrm{pred}}, 10^\circ)}, 1\right).

For repeated or symmetric actions, WBench also computes cross-turn consistency:

\mathrm{cnATE}_t = \frac{1}{P}\sum_{p=1}^{P}\mathrm{nATE}^{(p)}_t, \qquad \mathrm{cnATE}_r = \frac{1}{P}\sum_{p=1}^{P}\mathrm{nATE}^{(p)}_r.

\mathrm{Cons} = 1 - \frac{\mathrm{cnATE}_t + \mathrm{cnATE}_r}{2}.

The final navigation score is:

\mathrm{Acc} = 1 - \frac{\mathrm{nATE}_t + \mathrm{nATE}_r}{2}, \qquad \mathrm{NavScore} = \frac{\mathrm{Acc} + \mathrm{Cons}}{2}.

This design avoids treating scale mismatch as the only error. The adaptive ground truth follows each model's motion magnitude, then penalizes wrong direction or wrong shape, as illustrated in Figure 16.

Method Details

The dataset starts from world settings with four attributes: scene, style, perspective, and subject. Initial frames are generated or collected, then manually checked for quality and prompt-frame consistency. Interactions are setting-aware: annotators create physically executable, semantically coherent sequences, such as reasonable movement through a space or a weather transition in an outdoor scene.

Figure 6, Figure 7, Figure 8, Figure 9, and Figure 10 are the appendix galleries that make the coverage concrete rather than purely statistical.

**Figure 6. All-case gallery.** Thumbnail overview of all WBench cases.

**Figure 7. Scene and style gallery.** The source caption pairs photorealistic and stylized variants across nature, urban, indoor, workspace, fantasy, and sports/game scenes.

**Figure 8. Style gallery.** Representative initial frames span realistic, anime, cartoon, oil painting, ink wash, flat, and pencil sketch styles.

**Figure 9. Perspective gallery.** The perspective axis includes disembodied first-person, embodied first-person, and third-person cases.

**Figure 10. Subject gallery.** Third-person subjects include humans, animals, vehicles, robots, and other controllable objects.

The interaction taxonomy has four top-level classes. Navigation uses translational W/S/A/D and rotational arrow-style controls. Subject action covers manipulation, locomotion, tool use, combat, and gestures. Event editing covers exogenous changes such as weather shifts, object appearances, time-of-day shifts, and object-state transitions. Perspective switching covers first-person to third-person transitions, third-person to first-person transitions, same-subject switches, multi-subject switches, and scope-mode transitions, shown in Figure 11.

**Figure 11. Perspective-switching taxonomy.** The figure shows representative same-subject, multi-subject, and scope-mode transitions.

The navigation design is perspective-dependent. The same key means camera motion in first-person mode but subject or orbital motion in third-person mode; Table 4 and Figure 12 capture that mapping.

Type	Key	First-person semantics	Third-person semantics
Translation	`W`	Camera pushes forward	Subject walks forward
Translation	`S`	Camera pulls backward	Subject steps backward
Translation	`A`	Camera strafes left	Subject moves left
Translation	`D`	Camera strafes right	Subject moves right
Rotation	`left`	View turns left	Camera orbits left
Rotation	`right`	View turns right	Camera orbits right
Rotation	`up`	View tilts up	Camera elevates
Rotation	`down`	View tilts down	Camera descends

Table 4. Navigation action semantics. The mapping is central to fair comparison because it lets text, pose, and action interfaces be evaluated against equivalent spatial intent.

**Figure 12. Navigation action definition.** Visual reference for the same action keys under first-person and third-person perspectives.

**Figure 13. Navigation distribution.** Translational actions account for 62.8% of atomic navigation actions and rotations for 37.0%; trajectory categories include round-trip, progressive, repeat, L-shape, loop, and zigzag.

For web-only systems such as Genie 3 and Happy Oyster, the authors automate the web interface instead of changing the benchmark. Figure 14 shows the pipeline: fill prompt and image, wait for world loading, execute each 5-second turn, and download the recorded video.

**Figure 14. Web-based evaluation pipeline.** The pipeline enables web-only interactive systems to be evaluated at scale without a public API.

Several consistency and physics metrics are worth calling out because they guard against shortcut behavior. Gated spatial consistency penalizes a model that appears consistent only because it barely moves:

\mathrm{score} = s_{\mathrm{ret}} \cdot \min\left(1, \frac{1 - s_{\mathrm{min}}}{\tau}\right), \qquad \tau = 0.15.

Perspective consistency tracks target centroids with SAM2 masks:

s_{\text{centroid}} = \max\left(0, 1 - \frac{\sqrt{\sigma_{c_x}^2 + \sigma_{c_y}^2}}{0.3}\right) \times p.

Reconstruction consistency uses depth and pose to back-project and reproject pixels:

\mathbf{X}_i = d_i(\mathbf{u}_i)K^{-1}\mathbf{u}_i, \qquad \mathbf{X}_j = R_{ji}\mathbf{X}_i + \mathbf{t}_{ji}, \qquad \hat{\mathbf{u}}_j = \pi(K\mathbf{X}_j).

It then measures geometric error and photometric agreement:

e_{\mathrm{rel}}(\mathbf{u}_i)=\frac{\lVert\hat{\mathbf{u}}_j-\mathbf{u}_j\rVert_2}{D}, \qquad s_{\mathrm{geo}}=\frac{1}{1+\bar{e}_{\mathrm{rel}}}, \qquad s_{\mathrm{photo}}=\mathrm{PSNR}(\hat{I}_{i\rightarrow j}, I_j).

For physical compliance, causal fidelity uses a two-stage VLM protocol: global rendering-physics and causal-consistency scoring, then scene-aware scoring over selected sub-dimensions such as fluid/smoke, collision, surface tracks, deformation, wind, reflection, and human motion. Visual plausibility is a separate fine-tuned Qwen3-VL-30B-A3B score. The visual-plausibility head renormalizes probabilities over five rating tokens and takes their expected value:

\tilde{p}_c = \frac{p_c}{\sum_{c'\in\mathcal{C}}p_{c'}}, \qquad \mathcal{C}=\{\texttt{Perfect},\texttt{Good},\texttt{Fair},\texttt{Poor},\texttt{Bad}\}.

\hat{s} = 5\tilde{p}_{\texttt{Perfect}}+ 4\tilde{p}_{\texttt{Good}}+ 3\tilde{p}_{\texttt{Fair}}+ 2\tilde{p}_{\texttt{Poor}}+ \tilde{p}_{\texttt{Bad}}, \qquad \mathcal{L}_{\mathrm{VP}}=(\hat{s}-s)^2.

Experiments And Results

The experiments evaluate 20 models across three paradigms. The text-driven group has 9 models and supports all four interaction types on the full 289-case split. The camera-controlled group has 5 models and the action-conditioned group has 6 models; both are restricted to the shared 158-case navigation subset. Semantic interactions are text-only and therefore only appear for text-driven systems.

Result slice	Best / key value	Digest reading
Video quality	Seedance 1.5 average 82.1; Wan 2.7 average 81.5	Visual quality is close to saturated and no longer the main bottleneck
Setting adherence	Wan 2.7 average 91.4; Kling 3.0 average 91.0	Text-driven systems remain much stronger at broad world-setting following
Navigation, text-driven	YUME 1.5 reaches 72.0	Navigation-specific tuning helps text-driven models but does not close the gap
Navigation, camera-controlled	HY-World 1.5 reaches 87.5; group average 76.0	Native geometric control is strongest for camera movement
Navigation, action-conditioned	Happy Oyster reaches 85.1; Matrix-Game 3.0 reaches 83.5; group average 77.7	Direct action interfaces are competitive with camera-control models
Semantic interactions	Kling 3.0 / Wan 2.7 lead event editing and subject action; perspective switching average is 30.7	Promptable semantic edits are still hard, and perspective switching is the weakest semantic interaction
Consistency	LingBot-World average 89.9	Camera-control and explicit state designs can excel at temporal and spatial consistency
Physical average	Wan 2.7 reaches 71.8; text-driven average 67.0 vs camera 64.2 and action 61.7	Physical correctness appears more tied to broad generative priors than control specialization

Table 5. Key result slices. These values come from the main results table and the per-dimension analysis. The paper's strongest conclusion is that different model families win different capabilities.

The paper's main result is not "model X wins." It is that interactive world modeling decomposes into partially independent skills. Text-driven models have broad semantic and physical priors. Camera- and action-conditioned models are better at motion control. Some models get high consistency partly by producing little motion, which is why gated spatial consistency matters.

Figure 3 shows the cross-dimension correlations and setting-level difficulty deviations.

Figure 3. Cross-dimension correlation and setting difficulty. — **Figure 3. Cross-dimension analysis.** Navigation is nearly independent of video quality, consistency, and physical compliance; physical scores correlate more strongly with video quality and consistency.

The strongest diagnostic points are:

Navigation has near-zero correlation with video quality \((r=-0.12)\), consistency \((r=-0.05)\), and physical compliance \((r=-0.15)\).
Physical compliance correlates with video quality \((r=0.84)\) and consistency \((r=0.72)\), but not navigation.
For text-driven models, semantic interaction aligns more with setting adherence \((r=0.76)\) than navigation \((r=0.29)\).
Camera control does not imply perspective consistency: HY-World 1.5 leads navigation but is only rank 8 among 11 world models on perspective consistency, and Matrix-Game 3.0 is navigation rank 3 but perspective rank 11.
Structured difficulty is visible: first-person navigation is easier \((z=+1.0)\), sports/game scenes and animal subjects are harder \((z=-1.9)\), and workspace scenes and robot subjects are easier \((z=+1.6\) and \(z=+1.0)\).

Figure 4 shows why the multi-turn part matters: navigation loses 33 points from turn 1 to turn 4+, while event editing and subject action fall by 13 and 9 points, and perspective switching stays low because its baseline is already weak.

**Figure 4. Per-turn degradation.** Navigation is the most fragile over turns because pose errors compound as the spatial reference frame drifts.

The human-alignment study recruits 400 crowdsourced annotators for blind pairwise comparisons across ten evaluation aspects. Ties count as 0.5 for each side, and per-model win rates are correlated against WBench scores. Figure 5 reports Spearman correlations of at least 0.94, with event editing, subject action, perspective switching, and spatial consistency reaching 1.00.

**Figure 5. Human preference alignment.** The paper uses this result to argue that the automatic metrics are reliable for model-level ranking, even though some sub-metrics rely on specialist models and VLM judges.

The qualitative figures show how the metrics behave on concrete cases. Figure 15 and Figure 16 explain navigation scoring. Figure 17, Figure 18, Figure 19, Figure 20, Figure 21, and Figure 22 cover semantic interaction, consistency, and physical compliance examples.

**Figure 15. Navigation qualitative comparison.** Happy Oyster and HY-World 1.5 follow the instructed directions, while HY-Video 1.5 reverses rotation direction in two turns.

**Figure 16. Adaptive ground-truth construction.** The ground-truth trajectory adapts to predicted motion magnitude, so direction errors are penalized without unfairly penalizing different motion amplitudes.

**Figure 17. Event-editing qualitative comparison.** The examples contrast models that perform requested event transitions with models that leave the scene unchanged or introduce unrelated artifacts.

**Figure 18. Subject-action qualitative comparison.** The examples probe whether the controlled subject executes the requested action without losing identity or scene coherence.

**Figure 19. Perspective-switching qualitative comparison.** Perspective switching is difficult because a successful turn must visibly transition, land in the requested target perspective, and keep the viewpoint structurally valid.

**Figure 20. Spatial-consistency qualitative comparison.** The gated score penalizes near-static video that looks consistent only because the scene barely changes.

**Figure 21. Physics-compliance qualitative comparison.** The examples focus on physical causality and visible rule compliance rather than instruction adherence alone.

**Figure 22. Physics sub-dimension radar.** Causal fidelity is broken down across the seven Track 2 physics sub-dimensions.

**Figure 23. Human annotation platform.** Annotators perform blind pairwise comparisons using A better, B better, Tie, or Discard.

Practical Takeaways

For benchmark users, WBench is most useful when the goal is diagnosing which part of interactive video world modeling fails: scene setup, control, memory, semantic editing, perspective, or physics.
For model builders, the results argue against optimizing only video quality. Navigation is almost independent of video quality, and physical compliance follows different signals from control.
For evaluation design, WBench's strongest idea is the dual-track navigation setup: one benchmark can compare text-prompted I2V, camera-conditioned generation, and discrete-action world models without forcing all systems into a single interface.
For spatial-intelligence work, the hard cases are especially relevant: third-person control, sports/game scenes, animal subjects, long navigation sequences, and perspective switching expose subject-camera coupling failures.
For deployment claims, the paper gives a caution: a model can look visually polished and still fail basic interaction or multi-turn controllability.
Limitations: the current test set uses discrete action sequences rather than continuous control; physical compliance depends partly on LMM-based judging that may miss subtle effects; and the authors name broader domains plus real-time evaluation as future extensions.

Reference Coverage

All digest anchors are linked here as a final coverage check.

Evidence anchors: problem framing, dataset design, evaluation suite, NavScore, navigation design, experiment protocol, results pattern, cross-dimension analysis, human validation, limitations.
Table anchors: benchmark coverage, dataset statistics, evaluation suite, navigation actions, key results.
Figure anchors: Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, Figure 21, Figure 22, Figure 23.