arXiv · 2026 · avg 5.92 · interest 9.30 · 99 HF · interactive world models · video evaluation · physics consistency

The paper introduces WBench, a multi-turn benchmark for interactive video world models covering video quality, setting adherence, interaction adherence, consistency, and physics compliance. It includes 289 cases and 1,058 turns with multiple interaction types and finds that no evaluated state-of-the-art model is strong across all dimensions.

Source-first digest for monthly 2026_05 rank 17, rank_id p043.

Motivation / Background

The paper argues that interactive video world models need a broader evaluation contract than ordinary text-to-video systems. A useful interactive world model has to render plausible video, initialize the requested world, execute user controls, remember state over turns, and obey physical or causal constraints. The authors frame these as five game-engine-like roles: renderer, director, controller, memory, and engine. Existing benchmarks cover parts of this stack, but not the whole combination of open-domain scenes, first- and third-person perspectives, multiple interaction types, multi-turn state, and physics.

Figure 1 is the paper's compact thesis: a benchmark case should specify both the world setting and a multi-turn interaction sequence, then evaluate not just the final clip quality but whether each requested turn remains controllable and coherent.

Figure 1. WBench overview. The benchmark combines world settings, four interaction classes, unified navigation control, and five evaluation dimensions.
| Benchmark family | Main coverage | Missing piece relative to WBench |
|---|---|---|
| Video-generation suites such as VBench | Visual quality, perceptual realism, text alignment | No action input or multi-turn interaction |
| World-model suites such as MIND / WorldMark | Navigation or memory-oriented interaction | Limited semantic interactions, perspective coverage, or physics scope |
| Autonomous-driving or robotics suites | Domain-specific control and dynamics | Not open-domain, not both first- and third-person |
| WBench | Text, camera, and action inputs; FPP and TPP; navigation, subject action, event editing, perspective switching; quality, setting, interaction, consistency, physics | The paper still limits itself to discrete action sequences rather than continuous control |

Table 1. Benchmark-coverage framing. This digest table condenses the source benchmark-comparison table and related-work discussion. The key gap is unified evaluation of multi-turn interactive video, not simply another video-quality leaderboard.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
|---|---|---|---|
| C1 | WBench fills a real benchmark gap by jointly covering open-domain settings, both perspectives, four interaction types, multi-turn evaluation, and five evaluation dimensions. | 5 | problem framing, dataset design, evaluation suite, benchmark coverage table |
| C2 | The dataset is broad enough to expose different failure modes: 289 cases, 1,058 turns, 2-9 turns per case, multiple scenes, styles, subjects, and interaction subtypes. | 5 | dataset design, dataset statistics table, dataset distribution figure |
| C3 | The navigation design is a meaningful cross-paradigm bridge because text, 6-DoF pose, and discrete keyboard controls are aligned to the same underlying action semantics. | 5 | navigation design, navigation action table, navigation definition figure |
| C4 | The evaluation suite is not one scalar metric; it uses 22 sub-metrics across video quality, setting adherence, interaction adherence, consistency, and physics compliance. | 5 | evaluation suite, metric summary table, NavScore equations |
| C5 | Across 20 models, no model dominates all dimensions; text-driven models lead setting and physics, while native-control world models lead navigation. | 5 | experiment protocol, key results table, results pattern |
| C6 | Navigation is structurally decoupled from rendering, consistency, and physical compliance, so high video quality does not imply controllable motion. | 5 | cross-dimension analysis, correlation figure, turn degradation figure |
| C7 | The automatic metrics are reasonably aligned with human preference at model-ranking granularity, with Spearman correlations at least 0.94 across ten aspects. | 5 | human validation, human alignment figure, human platform figure |
| C8 | The benchmark is strong but not complete: it focuses on discrete actions, still relies partly on LMM-based physical judging, and needs broader domains and real-time evaluation. | 4 | limitations |

Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly backed by source definitions, figures, tables, or experiments. A score of 4 means the paper gives substantial evidence but the claim still depends on benchmark design choices or automated judges.

Core Technical Idea

WBench treats an interactive world model as a conditional generator:

$$ o_{t+1} \sim f_\theta(o_{t+1} \mid o_{\le t}, a_{\le t}). $$

The benchmark decomposes each case into a world setting \(\mathcal{W}\), which fixes the initial world state \(o_0\), and an interaction sequence \(\mathcal{I}=(a_0,a_1,\ldots,a_{T-1})\), which specifies what the user does across consecutive turns. This separation is the important technical move: failures can be attributed to wrong initialization, wrong action execution, poor memory, or poor physics rather than collapsed into a generic video score.
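The case structure and the turn-by-turn rollout can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the class names `WorldSetting` and `Case`, the `rollout` helper, and the `step` callable standing in for \(f_\theta\) are all hypothetical, and observations are plain strings here instead of video clips.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WorldSetting:
    # The four setting attributes named in the digest.
    scene: str
    style: str
    perspective: str
    subject: str


@dataclass
class Case:
    setting: WorldSetting      # W: fixes the initial world state o_0
    actions: List[str]         # I = (a_0, ..., a_{T-1}), one action per turn


def rollout(step: Callable[[Sequence[str], Sequence[str]], str],
            o0: str,
            actions: Sequence[str]) -> List[str]:
    """Generate o_1..o_T, conditioning each turn on all past observations and actions."""
    observations = [o0]
    for t in range(len(actions)):
        # step plays the role of f_theta(o_{t+1} | o_{<=t}, a_{<=t}).
        observations.append(step(observations, actions[: t + 1]))
    return observations


def toy_step(obs: Sequence[str], acts: Sequence[str]) -> str:
    # Hypothetical stand-in model: append the latest action to the latest observation.
    return f"{obs[-1]}+{acts[-1]}"


case = Case(WorldSetting("urban", "photorealistic", "first-person", "none"), ["W", "A"])
history = rollout(toy_step, "o0", case.actions)
```

Keeping the setting and the action list as separate fields mirrors the attribution argument above: a scorer can inspect `history[0]` against the setting and each later entry against its own action.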

Figure 2 shows the dataset distribution, and Table 2 gives the digest-scale summary.

Figure 2. Dataset distribution. The source figure summarizes perspective, interaction type, subject, scene, style, event-editing subtype, perspective-switching subtype, and turn-depth distributions.
| Dataset axis | Value |
|---|---|
| Total cases / turns | 289 cases / 1,058 turns |
| Turns per case | 2-9 turns, average 3.7 |
| Perspective split | 62% first-person, 38% third-person |
| Interaction split | 57% navigation, 20% subject action, 17% event editing, 6% perspective switching |
| Scene split | Nature 31%, urban 21%, indoor 17%, workspace 13%, fantasy 10%, sports/game 8% |
| Explicit-subject cases | 194 cases |
| Subject split | Human 64%, animal 9%, robot 9%, vehicle 7%, miscellaneous object 10% |
| Style split | 52% photorealistic, 48% other styles including anime, cartoon, CG, oil painting, ink wash, pencil sketch, flat, and abstract |
| Longer sequences | 12% of cases have 5-9 turns, usually mixing subject action and event editing |

Table 2. WBench dataset statistics. The benchmark is designed to probe long-horizon state rather than single-shot prompt following.

The evaluation suite has five dimensions and 22 sub-metrics. All sub-metric scores are linearly rescaled to \([0,100]\), higher is better.

| Dimension | What it checks | Representative sub-metrics or mechanisms |
|---|---|---|
| Video quality | Perceptual quality independent of control | Aesthetic quality, imaging quality, temporal flicker, dynamic degree, motion smoothness, HPSv3-Norm |
| Setting adherence | Whether the generated video realizes the world setting | Scene adherence and subject adherence via VLM checks |
| Interaction adherence | Whether requested turns are executed | NavScore with pose estimation; event editing, subject action, and perspective switching via structured VLM protocols |
| Consistency | Whether identity, background, geometry, perspective, and continuity persist | Subject/background consistency, gated spatial consistency, segment continuity, perspective consistency, geometric and photometric consistency |
| Physical compliance | Whether events obey plausible physical and causal rules | Causal fidelity plus visual plausibility from a fine-tuned Qwen3-VL-30B-A3B judge |

Table 3. Evaluation suite. WBench is best read as a diagnostic grid. It deliberately keeps interaction, memory, and physics separate because they fail differently.

The most distinctive metric is NavScore, which compares estimated camera trajectories against synthetic ground-truth trajectories derived from the action sequence. The benchmark uses MegaSaM for camera pose estimation, then resamples both predicted and ground-truth trajectories to \(K=20\) arc-length points per turn. The normalized translation and rotation errors are:

$$ \mathrm{nATE}_t = \min\left(\frac{\mathrm{ATE}_t}{\max(L_{\mathrm{pred}}, 0.5)}, 1\right), \qquad \mathrm{nATE}_r = \min\left(\frac{\mathrm{ATE}_r}{\max(\Theta_{\mathrm{pred}}, 10^\circ)}, 1\right). $$
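The resampling and normalization steps can be illustrated with NumPy. This is a sketch under assumptions: the function names are hypothetical, trajectories are treated as simple polylines of camera positions, and the ATE value is taken as an input because the digest does not spell out how trajectories are aligned before the error is measured.

```python
import numpy as np


def path_length(points):
    """Total polyline length, used here as L_pred."""
    pts = np.asarray(points, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())


def resample_by_arc_length(points, k=20):
    """Resample a polyline to k points equally spaced along its arc length."""
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    if s[-1] == 0.0:
        # Static trajectory: repeat the single position k times.
        return np.repeat(pts[:1], k, axis=0)
    targets = np.linspace(0.0, s[-1], k)
    cols = [np.interp(targets, s, pts[:, d]) for d in range(pts.shape[1])]
    return np.stack(cols, axis=1)


def normalized_error(ate, magnitude, floor):
    """min(ATE / max(magnitude, floor), 1), matching the nATE form above."""
    return min(ate / max(magnitude, floor), 1.0)


traj = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [2.0, 0.0, 0.0]]
resampled = resample_by_arc_length(traj, k=20)
nate_t = normalized_error(0.5, path_length(traj), floor=0.5)   # translation
nate_r = normalized_error(30.0, 5.0, floor=10.0)               # rotation, degrees
```

The floors (0.5 for translation, 10 degrees for rotation) keep the ratio bounded when a model barely moves, and the clamp at 1 caps the penalty for any single turn.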

For repeated or symmetric actions, WBench also computes cross-turn consistency:

$$ \mathrm{cnATE}_t = \frac{1}{P}\sum_{p=1}^{P}\mathrm{nATE}^{(p)}_t, \qquad \mathrm{cnATE}_r = \frac{1}{P}\sum_{p=1}^{P}\mathrm{nATE}^{(p)}_r. $$
$$ \mathrm{Cons} = 1 - \frac{\mathrm{cnATE}_t + \mathrm{cnATE}_r}{2}. $$

The final navigation score is:

$$ \mathrm{Acc} = 1 - \frac{\mathrm{nATE}_t + \mathrm{nATE}_r}{2}, \qquad \mathrm{NavScore} = \frac{\mathrm{Acc} + \mathrm{Cons}}{2}. $$

This design avoids penalizing scale mismatch as an error in its own right. The adaptive ground truth follows each model's motion magnitude, so the score penalizes wrong direction or wrong trajectory shape instead of a different motion amplitude, as illustrated in Figure 16.
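As a worked example of the aggregation, the snippet below combines per-turn errors and cross-turn pair errors into Acc, Cons, and NavScore exactly as the equations above read. The function name `nav_score` is hypothetical, and requiring at least one repeated or symmetric pair is an assumption; the digest does not say how cases with \(P=0\) are handled.

```python
def nav_score(nate_t, nate_r, pair_nate_t, pair_nate_r):
    """Return (Acc, Cons, NavScore) from normalized errors in [0, 1]."""
    if not pair_nate_t or len(pair_nate_t) != len(pair_nate_r):
        raise ValueError("need P >= 1 pairs with matching lengths")
    acc = 1.0 - (nate_t + nate_r) / 2.0
    cn_t = sum(pair_nate_t) / len(pair_nate_t)   # cnATE_t
    cn_r = sum(pair_nate_r) / len(pair_nate_r)   # cnATE_r
    cons = 1.0 - (cn_t + cn_r) / 2.0
    return acc, cons, (acc + cons) / 2.0


# nATE_t = 0.2, nATE_r = 0.4, and two cross-turn pairs.
acc, cons, score = nav_score(0.2, 0.4, [0.1, 0.3], [0.2, 0.2])
```

With these inputs Acc is 0.7, Cons is 0.8, and NavScore is 0.75 on the unit scale, before any rescaling to \([0,100]\).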

Method Details

The dataset starts from world settings with four attributes: scene, style, perspective, and subject. Initial frames are generated or collected, then manually checked for quality and prompt-frame consistency. Interactions are setting-aware: annotators create physically executable, semantically coherent sequences, such as reasonable movement through a space or a weather transition in an outdoor scene.

Figure 6, Figure 7, Figure 8, Figure 9, and Figure 10 are the appendix galleries that make the coverage concrete rather than purely statistical.

Figure 6. All-case gallery. Thumbnail overview of all WBench cases.
Figure 7. Scene and style gallery. The source caption pairs photorealistic and stylized variants across nature, urban, indoor, workspace, fantasy, and sports/game scenes.
Figure 8. Style gallery. Representative initial frames span realistic, anime, cartoon, oil painting, ink wash, flat, and pencil sketch styles.
Figure 9. Perspective gallery. The perspective axis includes disembodied first-person, embodied first-person, and third-person cases.
Figure 10. Subject gallery. Third-person subjects include humans, animals, vehicles, robots, and other controllable objects.

The interaction taxonomy has four top-level classes. Navigation uses translational W/S/A/D and rotational arrow-style controls. Subject action covers manipulation, locomotion, tool use, combat, and gestures. Event editing covers exogenous changes such as weather shifts, object appearances, time-of-day shifts, and object-state transitions. Perspective switching covers first-person to third-person transitions, third-person to first-person transitions, same-subject switches, multi-subject switches, and scope-mode transitions, shown in Figure 11.

Figure 11. Perspective-switching taxonomy. The figure shows representative same-subject, multi-subject, and scope-mode transitions.

The navigation design is perspective-dependent. The same key means camera motion in first-person mode but subject or orbital motion in third-person mode; Table 4 and Figure 12 capture that mapping.

| Type | Key | First-person semantics | Third-person semantics |
|---|---|---|---|
| Translation | W | Camera pushes forward | Subject walks forward |
| Translation | S | Camera pulls backward | Subject steps backward |
| Translation | A | Camera strafes left | Subject moves left |
| Translation | D | Camera strafes right | Subject moves right |
| Rotation | left | View turns left | Camera orbits left |
| Rotation | right | View turns right | Camera orbits right |
| Rotation | up | View tilts up | Camera elevates |
| Rotation | down | View tilts down | Camera descends |

Table 4. Navigation action semantics. The mapping is central to fair comparison because it lets text, pose, and action interfaces be evaluated against equivalent spatial intent.
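One way to picture the perspective-dependent mapping is a lookup keyed by action key and perspective. The dictionary and the `describe_action` helper are hypothetical names for illustration; the entries are transcribed from Table 4, and the benchmark's own representation may differ.

```python
# (first-person semantics, third-person semantics) for each navigation key.
NAV_SEMANTICS = {
    "W": ("camera pushes forward", "subject walks forward"),
    "S": ("camera pulls backward", "subject steps backward"),
    "A": ("camera strafes left", "subject moves left"),
    "D": ("camera strafes right", "subject moves right"),
    "left": ("view turns left", "camera orbits left"),
    "right": ("view turns right", "camera orbits right"),
    "up": ("view tilts up", "camera elevates"),
    "down": ("view tilts down", "camera descends"),
}


def describe_action(key: str, perspective: str) -> str:
    """Return the spatial intent of a key under 'first_person' or 'third_person'."""
    first, third = NAV_SEMANTICS[key]
    if perspective == "first_person":
        return first
    if perspective == "third_person":
        return third
    raise ValueError(f"unknown perspective: {perspective}")


prompt_turn = [describe_action(k, "third_person") for k in ["W", "left"]]
```

A text-driven model could be given the resulting phrases as a prompt, while a pose- or action-conditioned model receives the same intent through its native interface.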

Figure 12. Navigation action definition. Visual reference for the same action keys under first-person and third-person perspectives.
Figure 13. Navigation distribution. Translational actions account for 62.8% of atomic navigation actions and rotations for 37.0%; trajectory categories include round-trip, progressive, repeat, L-shape, loop, and zigzag.

For web-only systems such as Genie 3 and Happy Oyster, the authors automate the web interface instead of changing the benchmark. Figure 14 shows the pipeline: fill prompt and image, wait for world loading, execute each 5-second turn, and download the recorded video.

Figure 14. Web-based evaluation pipeline. The pipeline enables web-only interactive systems to be evaluated at scale without a public API.

Several consistency and physics metrics are worth calling out because they guard against shortcut behavior. Gated spatial consistency penalizes a model that appears consistent only because it barely moves:

$$ \mathrm{score} = s_{\mathrm{ret}} \cdot \min\left(1, \frac{1 - s_{\mathrm{min}}}{\tau}\right), \qquad \tau = 0.15. $$
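A short numeric sketch shows how the gate behaves. The digest does not define \(s_{\mathrm{ret}}\) and \(s_{\mathrm{min}}\); the reading used here, that \(s_{\mathrm{ret}}\) is the similarity when the view returns to an earlier location and \(s_{\mathrm{min}}\) is the lowest similarity to that reference along the way, is an assumption, as is the function name.

```python
def gated_spatial_consistency(s_ret: float, s_min: float, tau: float = 0.15) -> float:
    """score = s_ret * min(1, (1 - s_min) / tau), with similarities in [0, 1]."""
    gate = min(1.0, (1.0 - s_min) / tau)
    return s_ret * gate


moving = gated_spatial_consistency(s_ret=0.9, s_min=0.5)         # gate saturates at 1
near_static = gated_spatial_consistency(s_ret=0.9, s_min=0.97)   # gate is 0.2
```

Under that reading, a video whose frames never drift far from the reference (high \(s_{\mathrm{min}}\)) keeps only a fraction of its raw consistency, which is the shortcut the metric is meant to block.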

Perspective consistency tracks target centroids with SAM2 masks:

$$ s_{\text{centroid}} = \max\left(0, 1 - \frac{\sqrt{\sigma_{c_x}^2 + \sigma_{c_y}^2}}{0.3}\right) \times p. $$
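The centroid term can be sketched as follows. Several details are assumptions: centroids are taken as mask centers in coordinates normalized to \([0,1]\), the variances are population variances over frames, and \(p\) is passed in as given because the digest does not define it. The function name is hypothetical.

```python
import numpy as np


def centroid_score(centroids, p, scale=0.3):
    """max(0, 1 - sqrt(var_x + var_y) / scale) * p for per-frame (c_x, c_y) centroids."""
    c = np.asarray(centroids, dtype=float)
    spread = float(np.sqrt(c[:, 0].var() + c[:, 1].var()))
    return max(0.0, 1.0 - spread / scale) * p


steady = centroid_score([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]], p=0.8)
jumpy = centroid_score([[0.2, 0.5], [0.8, 0.5]], p=1.0)
```

A target that stays put scores the full \(p\), while a target whose centroid spread reaches the 0.3 scale drops to zero.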

Reconstruction consistency uses depth and pose to back-project and reproject pixels:

$$ \mathbf{X}_i = d_i(\mathbf{u}_i)K^{-1}\mathbf{u}_i, \qquad \mathbf{X}_j = R_{ji}\mathbf{X}_i + \mathbf{t}_{ji}, \qquad \hat{\mathbf{u}}_j = \pi(K\mathbf{X}_j). $$

It then measures geometric error and photometric agreement:

$$ e_{\mathrm{rel}}(\mathbf{u}_i)=\frac{\lVert\hat{\mathbf{u}}_j-\mathbf{u}_j\rVert_2}{D}, \qquad s_{\mathrm{geo}}=\frac{1}{1+\bar{e}_{\mathrm{rel}}}, \qquad s_{\mathrm{photo}}=\mathrm{PSNR}(\hat{I}_{i\rightarrow j}, I_j). $$
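The reprojection chain and the two scores can be illustrated with NumPy on a single pixel. This is a sketch under assumptions: the function names are hypothetical, \(D\) is treated as a caller-supplied normalization length because the digest does not define it, and PSNR assumes image values in \([0,1]\).

```python
import numpy as np


def reproject(u_i, depth, K, R_ji, t_ji):
    """Back-project pixel u_i = (x, y) with depth d_i, move it to frame j, and project."""
    u_h = np.array([u_i[0], u_i[1], 1.0])
    X_i = depth * (np.linalg.inv(K) @ u_h)   # 3D point in camera i
    X_j = R_ji @ X_i + t_ji                  # same point in camera j
    p = K @ X_j
    return p[:2] / p[2]                      # perspective division (pi)


def geo_score(u_hat, u_obs, D):
    """s_geo = 1 / (1 + mean relative reprojection error)."""
    diff = np.asarray(u_hat, dtype=float) - np.asarray(u_obs, dtype=float)
    e_rel = np.linalg.norm(diff, axis=-1) / D
    return 1.0 / (1.0 + float(np.mean(e_rel)))


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between a warped image and the target."""
    mse = float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))
    return float("inf") if mse == 0.0 else float(10.0 * np.log10(peak ** 2 / mse))


K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
same = reproject((50.0, 50.0), 2.0, K, np.eye(3), np.zeros(3))
shifted = reproject((50.0, 50.0), 2.0, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this toy setup a 0.1-unit sideways camera translation moves the principal-point pixel at depth 2 by five pixels, which is the displacement a geometrically consistent video should reproduce.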

For physical compliance, causal fidelity uses a two-stage VLM protocol: global rendering-physics and causal-consistency scoring, then scene-aware scoring over selected sub-dimensions such as fluid/smoke, collision, surface tracks, deformation, wind, reflection, and human motion. Visual plausibility is a separate fine-tuned Qwen3-VL-30B-A3B score. The visual-plausibility head renormalizes probabilities over five rating tokens and takes their expected value:

$$ \tilde{p}_c = \frac{p_c}{\sum_{c'\in\mathcal{C}}p_{c'}}, \qquad \mathcal{C}=\{\texttt{Perfect},\texttt{Good},\texttt{Fair},\texttt{Poor},\texttt{Bad}\}. $$
$$ \hat{s} = 5\tilde{p}_{\texttt{Perfect}}+ 4\tilde{p}_{\texttt{Good}}+ 3\tilde{p}_{\texttt{Fair}}+ 2\tilde{p}_{\texttt{Poor}}+ \tilde{p}_{\texttt{Bad}}, \qquad \mathcal{L}_{\mathrm{VP}}=(\hat{s}-s)^2. $$
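The renormalize-then-expect step is easy to check by hand. The sketch below takes token probabilities as a plain dictionary; the function names are hypothetical, treating a missing rating token as probability zero is an assumption, and the target \(s\) is read as a reference rating on the same 1 to 5 scale.

```python
RATING_VALUES = {"Perfect": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}


def expected_rating(token_probs):
    """Renormalize over the five rating tokens and return the expected score in [1, 5]."""
    mass = {c: token_probs.get(c, 0.0) for c in RATING_VALUES}
    total = sum(mass.values())
    if total <= 0.0:
        raise ValueError("no probability mass on rating tokens")
    return sum(RATING_VALUES[c] * mass[c] / total for c in RATING_VALUES)


def vp_loss(token_probs, target):
    """Squared error between the expected rating and the reference rating."""
    return (expected_rating(token_probs) - target) ** 2


# 20% of the mass sits on a non-rating token and is dropped by renormalization.
probs = {"Perfect": 0.4, "Good": 0.4, "the": 0.2}
s_hat = expected_rating(probs)
loss = vp_loss(probs, target=4.0)
```

Here the renormalized mass is 0.5 on `Perfect` and 0.5 on `Good`, so the expected rating is 4.5 and the squared error against a reference of 4 is 0.25.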

Experiments And Results

The experiments evaluate 20 models across three paradigms. The text-driven group has 9 models and supports all four interaction types on the full 289-case split. The camera-controlled group has 5 models and the action-conditioned group has 6 models; both are restricted to the shared 158-case navigation subset. Semantic interactions are text-only and therefore only appear for text-driven systems.

| Result slice | Best / key value | Digest reading |
|---|---|---|
| Video quality | Seedance 1.5 average 82.1; Wan 2.7 average 81.5 | Visual quality is close to saturated and no longer the main bottleneck |
| Setting adherence | Wan 2.7 average 91.4; Kling 3.0 average 91.0 | Text-driven systems remain much stronger at broad world-setting following |
| Navigation, text-driven | YUME 1.5 reaches 72.0 | Navigation-specific tuning helps text-driven models but does not close the gap |
| Navigation, camera-controlled | HY-World 1.5 reaches 87.5; group average 76.0 | Native geometric control is strongest for camera movement |
| Navigation, action-conditioned | Happy Oyster reaches 85.1; Matrix-Game 3.0 reaches 83.5; group average 77.7 | Direct action interfaces are competitive with camera-control models |
| Semantic interactions | Kling 3.0 / Wan 2.7 lead event editing and subject action; perspective switching average is 30.7 | Promptable semantic edits are still hard, and perspective switching is the weakest semantic interaction |
| Consistency | LingBot-World average 89.9 | Camera-control and explicit state designs can excel at temporal and spatial consistency |
| Physical average | Wan 2.7 reaches 71.8; text-driven average 67.0 vs camera 64.2 and action 61.7 | Physical correctness appears more tied to broad generative priors than control specialization |

Table 5. Key result slices. These values come from the main results table and the per-dimension analysis. The paper's strongest conclusion is that different model families win different capabilities.

The paper's main result is not "model X wins." It is that interactive world modeling decomposes into partially independent skills. Text-driven models have broad semantic and physical priors. Camera- and action-conditioned models are better at motion control. Some models get high consistency partly by producing little motion, which is why gated spatial consistency matters.

Figure 3 shows the cross-dimension correlations and setting-level difficulty deviations.

Figure 3. Cross-dimension analysis. Navigation is nearly independent of video quality, consistency, and physical compliance; physical scores correlate more strongly with video quality and consistency.

Figure 4 shows why the multi-turn part matters: navigation loses 33 points from turn 1 to turn 4+, while event editing and subject action fall by 13 and 9 points, and perspective switching stays low because its baseline is already weak.

Figure 4. Per-turn degradation. Navigation is the most fragile over turns because pose errors compound as the spatial reference frame drifts.

The human-alignment study recruits 400 crowdsourced annotators for blind pairwise comparisons across ten evaluation aspects. Ties count as 0.5 for each side, and per-model win rates are correlated against WBench scores. Figure 5 reports Spearman correlations of at least 0.94, with event editing, subject action, perspective switching, and spatial consistency reaching 1.00.
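The win-rate and rank-correlation computation can be sketched in plain Python. The function names and the toy data are hypothetical; excluding `Discard` votes from the totals is an assumption, and the Spearman implementation is a generic one (Pearson correlation of average ranks), not the paper's script.

```python
from collections import defaultdict


def win_rates(comparisons):
    """comparisons: (model_a, model_b, outcome), outcome in {'A', 'B', 'tie', 'discard'}."""
    wins, games = defaultdict(float), defaultdict(int)
    for a, b, outcome in comparisons:
        if outcome == "discard":
            continue                          # assumption: discarded votes are ignored
        games[a] += 1
        games[b] += 1
        if outcome == "A":
            wins[a] += 1.0
        elif outcome == "B":
            wins[b] += 1.0
        elif outcome == "tie":
            wins[a] += 0.5                    # a tie counts as 0.5 for each side
            wins[b] += 0.5
        else:
            raise ValueError(f"unknown outcome: {outcome}")
    return {m: wins[m] / games[m] for m in games}


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


votes = [("m1", "m2", "A"), ("m1", "m3", "tie"), ("m2", "m3", "B"), ("m1", "m2", "discard")]
rates = win_rates(votes)
```

In practice the per-model win rates for one aspect would be passed to `spearman` together with the matching automatic scores, giving one correlation per aspect.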

Figure 5. Human preference alignment. The paper uses this result to argue that the automatic metrics are reliable for model-level ranking, even though some sub-metrics rely on specialist models and VLM judges.

The qualitative figures show how the metrics behave on concrete cases. Figure 15 and Figure 16 explain navigation scoring. Figure 17, Figure 18, Figure 19, Figure 20, Figure 21, and Figure 22 cover semantic interaction, consistency, and physical compliance examples.

Figure 15. Navigation qualitative comparison. Happy Oyster and HY-World 1.5 follow the instructed directions, while HY-Video 1.5 reverses rotation direction in two turns.
Figure 16. Adaptive ground-truth construction. The ground-truth trajectory adapts to predicted motion magnitude, so direction errors are penalized without unfairly penalizing different motion amplitudes.
Figure 17. Event-editing qualitative comparison. The examples contrast models that perform requested event transitions with models that leave the scene unchanged or introduce unrelated artifacts.
Figure 18. Subject-action qualitative comparison. The examples probe whether the controlled subject executes the requested action without losing identity or scene coherence.
Figure 19. Perspective-switching qualitative comparison. Perspective switching is difficult because a successful turn must visibly transition, land in the requested target perspective, and keep the viewpoint structurally valid.
Figure 20. Spatial-consistency qualitative comparison. The gated score penalizes near-static video that looks consistent only because the scene barely changes.
Figure 21. Physics-compliance qualitative comparison. The examples focus on physical causality and visible rule compliance rather than instruction adherence alone.
Figure 22. Physics sub-dimension radar. Causal fidelity is broken down across the seven Track 2 physics sub-dimensions.
Figure 23. Human annotation platform. Annotators perform blind pairwise comparisons using A better, B better, Tie, or Discard.

Practical Takeaways
