Benchmarking whether video world models remember off-screen change

Current World Models Lack a Persistent State Core

WRBench makes the model look away and come back: the scene is fixed, the event happens off-screen, and the returned view must show what changed.

If a cat jumps onto a bed while hidden, it should still be on the bed when the camera returns.


Figure 1. The scene and camera path are fixed. The question is simple: after the camera looks away and returns, is the cat on the bed, or did the video reset, duplicate, or lose it?

A cat jumps onto the bed while the camera looks away. Is it on the bed when the camera returns?

Start in a bedroom with a cat on the floor. Turn the camera away. The prompt says the cat jumps onto the bed. Turn back. A real world keeps unfolding while it is not being watched. Evaluated video generators often do something weaker: the cat may drift, duplicate, vanish, or reset. A world model should preserve what happened, not just draw a plausible returned frame.

Stage 01 · Visible (control before memory): Did the camera move as asked?

Before testing memory, WRBench checks whether the generated video actually follows the requested look-away-and-return motion.

Stage 02 · Access (visible evidence): Is the visible part readable?

The visible frames must be coherent enough to judge the scene and the action before the hidden change is tested.

Stage 03 · Return (returned state): Does the returned view remember?

When the target comes back into view, WRBench checks whether its location and state match the event that happened off-screen.

Look away → find again → remember result. Failure case: the return is visible, but the state is wrong.

Cat jumps onto bed: three outcomes

The same teaser scene and camera path are shown across three models. The expected outcome is simple: after the camera returns, the cat should be on the bed.

InSpatio World 14B · Expected endpoint

The cat reaches the bed, and the returned view still supports that outcome.

Wan-Fun 2.2-5B · Dragged target

The cat is carried along with the camera instead of completing a clean jump.

Hunyuan WorldPlay · Return problem

The camera returns, but the expected cat-on-bed state is not established.

Why this case: it is the same teaser and overview scene, so the expected endpoint is unambiguous. The montage anchors the task; the three clips above show whether a model reaches that endpoint, drags the target, or fails to re-establish the returned state.


Frame montage: the camera looks away during the jump, then returns to check whether the off-screen outcome was kept.

Six things we learned — the core of WRBench.

From 23 models and 9,600 videos, the pattern is consistent: models can make clean videos, follow a camera, and bring objects back into view, yet still lose what changed while hidden.

Six headline findings, stated as mechanisms rather than a single leaderboard rank.

Where WRBench sits among video and world-model benchmarks.

Table 1 from the paper. Prior benchmarks cover important parts of video and world-model evaluation. WRBench adds a specific stress test: change the viewpoint, hide the target, return, and ask whether the hidden event outcome stayed true.

Legend: stated benchmark target · adjacent / mixed target · not a stated target
State robustness means the benchmark separates camera following, re-finding the target, and preserving the returned state. Evolution consistency asks whether the off-screen event outcome remains true when the camera comes back. WRBench is the only row that targets all three paper-level requirements.

23 models — separate preservation, access, and return.

WRBench does not collapse everything into one score. The table separates whether the camera was followed, whether the visible frames are readable, whether the target returns, and whether the returned state is correct.

Stage 1 · Vis. spatial / state: scored while the target stays on screen.

Visible consistency: often strong, but it does not guarantee a correct return.

Stage 2 · Reobs. support: how often the target comes back.

Re-observation: whether the return can be judged, not whether it is correct.

Stage 3 · Reobs. spatial / state: scored only when the return is judgeable.

Re-observation consistency: the outcome WRBench is designed to measure (2,073 judgeable clips).

† Reobs. spatial and Reobs. state are conditional means over the judgeable re-observation subset. Avg. is the paper's mean of Cam Align., Integ., Vis. spatial/state, and Reobs. spatial/state, excluding re-observation support. Italic re-observation scores mark low re-observation support (fewer than 10 judgeable clips or support below 10%). Cam Prec. is not applicable to prompt-only API rows.
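
As a rough illustration of the footnote, the sketch below computes Avg. and the low-support flag for one table row. The field names, the unweighted mean over the six checks, and the example numbers are assumptions made for illustration; they are not taken from the paper's code or results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ModelRow:
    """One leaderboard row. Field names are hypothetical, not the paper's."""
    cam_align: float
    integ: float
    vis_spatial: float
    vis_state: float
    reobs_spatial: float  # conditional mean over judgeable clips
    reobs_state: float    # conditional mean over judgeable clips
    judgeable_clips: int  # clips whose return could be judged
    total_clips: int      # all clips scored for this model


def avg_score(row: ModelRow) -> float:
    """Unweighted mean of the six checks (assumed); support is excluded."""
    return mean([
        row.cam_align, row.integ,
        row.vis_spatial, row.vis_state,
        row.reobs_spatial, row.reobs_state,
    ])


def has_low_support(row: ModelRow) -> bool:
    """True when the row's re-observation scores would be set in italics."""
    support = row.judgeable_clips / row.total_clips
    return row.judgeable_clips < 10 or support < 0.10


# Made-up numbers, for illustration only.
example = ModelRow(0.8, 0.9, 0.7, 0.6, 0.5, 0.4, judgeable_clips=30, total_clips=400)
print(round(avg_score(example), 3), has_low_support(example))
```

Keeping support out of `avg_score` mirrors the footnote: a row can post a high conditional score while still being flagged because few returns were judgeable.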

No model wins every dimension. Lingbot-World Act pairs high visible spatial/state scores (0.874 / 0.719) with the weakest requested-camera precision (0.468) among controllable local models. Strong visible tracking still does not mean the requested camera, or the returned world state, is solved.

Tracking shots expose the re-observation gap.

The diagnosis is not that one model loses. The hard part is specific: a model may show the target again but still fail to keep the off-screen change. In-place changes such as folding, tipping, or sitting are especially fragile.

Question 1 · Finding 1: Does the tracking shot buy a correct return?

Visible quality and re-observation consistency barely move together, so score the return, not the frame.

Question 2 · Finding 2: Which events break on return?

In-place state change is the universal hard case; relocation is easier to preserve.

Question 3 · Findings 3–6: What increment actually closes the re-observation gap?

Scale, architecture, and training mostly add re-observation support, not re-observation consistency.

Figure · condition frontiers

Figure 4. Best score each viewpoint condition type reaches on every diagnostic dimension; frontiers expand on visible/support axes, not re-observation consistency.

The setup · how to read it

A richer input buys access and clean frames, not re-observation consistency.

How to read this: each spoke is one diagnostic dimension; the colored outline is the best score reached by a viewpoint condition type. A bigger outline = better on that axis.

  • The re-observation support and visible-quality frontiers separate sharply by condition type — more external footage pushes them out.
  • The re-observation consistency frontier barely grows, and where it looks high it rests on sparse re-observation support.
  • Lingbot-World Act pairs high visible spatial/state 0.874 / 0.719 with requested-camera precision 0.468, exposing the visible-vs-camera split.
  • Takeaway: handing a model more of the already-seen scene mostly lets it look back, not get the hidden outcome right.

Figure · metric correlations

Figure 5. Model-level correlations among the diagnostic metrics (23 rows), with visible and re-observation consistency forming separate blocks.

Finding 1 · dimensions

Measure re-observation consistency, not frame quality.

How to read this: darker = stronger correlation. If two metrics rose together, image quality would predict a correct return. They don’t.

  • Visible spatial and state lock together at r = 0.97; the returned pair forms its own block at r = 0.94.
  • But the two blocks only loosely track each other — visible→returned reaches just 0.60–0.79.
  • Re-observation support sits apart and even inverts, down to −0.42 with visible spatial.
  • Takeaway: score re-observation consistency — visible quality is already a solved-looking block on its own.
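
A model-level correlation table like the one described above can be sketched as follows. This assumes that r denotes the Pearson correlation computed across model rows; the metric keys and the toy values are hypothetical and are not WRBench results.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equally long score columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """All pairwise correlations; each list holds one value per model."""
    return {(a, b): pearson(columns[a], columns[b]) for a in columns for b in columns}


# Toy values for four hypothetical models, not WRBench results.
toy = {
    "vis_spatial": [0.60, 0.70, 0.80, 0.90],
    "vis_state": [0.50, 0.62, 0.69, 0.81],
    "reobs_support": [0.40, 0.35, 0.20, 0.10],
}
matrix = correlation_matrix(toy)
print(round(matrix[("vis_spatial", "vis_state")], 2))
print(round(matrix[("vis_spatial", "reobs_support")], 2))
```

In the toy data the two visible metrics rise together while support falls as visible quality rises, which is the kind of block structure and sign inversion the heatmap is read for.
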
Finding 1 · mechanism

Camera motion decides whether the return test runs, not whether the re-observation spatial/state scores (D5/D6) are right.

How to read this: each group of bars is a camera condition. Watch how re-observation support grows while re-observation consistency bars stay flat.

  • Panning the camera swings re-observation support from near zero to ~40% — about two orders of magnitude.
  • Across the two pan directions, support triples (13% → 40%), yet returned state shifts by less than 0.01.
  • Takeaway: camera motion creates the opportunity to check the hidden state; it does not make the answer correct.

Figure 6. Static hold vs. horizontal camera pan: re-observation support moves, re-observation consistency does not.

Finding 2 · which events break

Changing an object in place is the universal hard case.

How to read this: each event is split into two switches — did the object move, and did its state change. The boxes compare flipping one at a time.

  • Relocating an object helps the later return: returned state +0.038 (p < 0.01).
  • An in-place change hurts: visible position −0.114 and returned state −0.068 (p < 0.001).
  • A move gives a new coordinate to track; an in-place change gives no anchor, so the altered object drifts and smears where it sits.
  • Takeaway: in-place state change is the universal hard case — relocation is easier to preserve on return.

Figure 7. Metrics under a 2×2 event design; stars mark paired Wilcoxon significance.
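
For readers unfamiliar with the paired test named in the caption, here is a minimal stdlib sketch of a Wilcoxon signed-rank test with an exact two-sided p-value. The paper does not state here which variant it used (exact or normal approximation, or how ties and zeros were handled), so the choices below and the toy scores are assumptions.

```python
from itertools import product
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, exact two-sided p) for paired samples a and b."""
    diffs = [y - x for x, y in zip(a, b) if y != x]  # zero differences are dropped
    n = len(diffs)
    if n == 0 or n > 20:
        raise ValueError("this sketch enumerates 2**n sign patterns; use 1 <= n <= 20")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # expected W+ if every sign were a coin flip
    observed = abs(w_plus - center)
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, keep in zip(ranks, signs) if keep)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-model scores under two event conditions, not paper data.
in_place = [0.61, 0.58, 0.70, 0.66, 0.59, 0.63]
relocated = [0.66, 0.60, 0.71, 0.72, 0.65, 0.62]
w_plus, p_value = wilcoxon_signed_rank(in_place, relocated)
print(w_plus, p_value)
```

Pairing by model is what makes the test fit a 2×2 design: each model contributes one difference per factor flip, so model-to-model skill differences cancel out.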

Findings 3–6 · does anything fix it?

No tested increment of scale, architecture, or training closes the re-observation gap.

How to read this: the left panel asks which model inputs make the return visible. The Wan-family panels then vary scale, architecture, and training signal to ask whether any of them preserve the hidden outcome.

  • Condition type (F3): a richer viewpoint interface decides which models can pose the return test, not whether they pass it. The bottleneck moves from whether the object returns to whether the return is correct.
  • Scale (F4): bigger Wan backbones add re-observation support, but scaling from 1.3B to 14B even lowers returned state from 0.66 to 0.62.
  • Architecture (F5): carriers store where to look back, not what changed while hidden.
  • Training (F6): no public loss supervises the unobserved outcome — a long-to-short recipe is proposed to write it back into state.
  • Takeaway: every tested increment mostly moves re-observation support or visible quality — not re-observation consistency.

Figure 8A · Wan scale / version diagnostics

Figure 8B · Wan architecture diagnostics

Figure 8C · Wan training-signal diagnostics

More appendix cases.

These clips match appendix montages in the paper. They show three recurring patterns: camera direction controls whether a return can be judged, returning into view does not guarantee a correct state, and in-place changes remain hard.

Frame-level failure cases.

Appendix montages show how a video can look readable, move the camera, and show visible action, yet still return to the wrong location or state.

How WRBench measures it.

WRBench fixes the scene, the event, and the camera path, then keeps enough generation detail to compare very different models fairly. The six checks separate visible consistency, re-finding the target, and preserving the returned state.

Natural-25

25 scene families × 4 event tiers × 3 camera conditions. Events span fold, jump, knock, place, sit, and tip — from spatial relocation to in-place state change.
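
The size of this grid can be checked with a short enumeration. The scene and camera identifiers below are placeholders, and reading the four event tiers as a 2×2 over "object relocates" and "state changes in place" is an assumption based on the spatial/state design described elsewhere on this page.

```python
from itertools import product

# Placeholder identifiers; the real scene-family and camera names are not listed here.
scene_families = [f"scene_{i:02d}" for i in range(25)]
camera_conditions = ["camera_a", "camera_b", "camera_c"]

# Assumed tiers: every combination of (relocates, state_changes).
event_tiers = [
    (relocates, state_changes)
    for relocates in (False, True)
    for state_changes in (False, True)
]

cases = list(product(scene_families, event_tiers, camera_conditions))
print(len(cases))  # 25 * 4 * 3 = 300
```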

What each model saw

Every generated clip is paired with the exact input condition the model received, so prompt-only, image-conditioned, and video-conditioned systems are compared on the evidence they were actually given.

Human calibration

2,547 deduplicated annotator verdicts over 1,156 comparison pairs calibrate automatic evaluators dimension by dimension.

Four input routes

Prompt-only, model-inferred camera, geometry cache, and source video routes carry different amounts of already-seen scene evidence.

Each test is more than a clip

Each test starts from the same scene and event, then changes the camera path so the target either stays visible, becomes hidden, or returns. The prompt does not reveal the final answer; the generated video must carry the off-screen outcome.

For every model, WRBench keeps the model input, the requested camera path, the generated video, and the scoring evidence together.
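
One way to picture such a bundle is a single record per generated clip. The sketch below is hypothetical: the class, field names, and file paths are invented for illustration and do not describe the released data format. Only the four route names come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class InputRoute(Enum):
    PROMPT_ONLY = "prompt-only"
    MODEL_INFERRED_CAMERA = "model-inferred camera"
    GEOMETRY_CACHE = "geometry cache"
    SOURCE_VIDEO = "source video"


@dataclass
class ClipRecord:
    """Everything needed to audit one generated clip (hypothetical layout)."""
    model: str
    route: InputRoute
    model_input: str      # what the model actually received (prompt text or a file path)
    camera_path: str      # identifier of the requested camera path
    generated_video: str  # path to the generated clip
    # Score per check; None marks a check that could not be judged.
    scores: Dict[str, Optional[float]] = field(default_factory=dict)


def group_by_route(records: List[ClipRecord]) -> Dict[InputRoute, List[ClipRecord]]:
    """Group records so models are compared within the evidence they were given."""
    grouped: Dict[InputRoute, List[ClipRecord]] = {route: [] for route in InputRoute}
    for record in records:
        grouped[record.route].append(record)
    return grouped


records = [
    ClipRecord("model_a", InputRoute.PROMPT_ONLY, "a cat jumps onto the bed",
               "path_01", "out/model_a.mp4", {"reobs_state": None}),
    ClipRecord("model_b", InputRoute.SOURCE_VIDEO, "inputs/scene.mp4",
               "path_01", "out/model_b.mp4", {"reobs_state": 0.7}),
]
print({route.value: len(items) for route, items in group_by_route(records).items()})
```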

Scene + event → Camera path → Generated video → Six checks

WRBench pipeline: camera following, visible quality, visible consistency, re-observation support, and re-observation consistency are measured separately.


Natural-25: scene families × spatial/state event tiers × camera conditions.

Six checks, with return scored only when visible.

Human calibration grounds the scores.

2,547 deduplicated annotator verdicts over 1,156 released comparison pairs validate each WRBench check separately — not one collapsed preference score. Agreement uses prevalence-robust AC1; rank ρ is Spearman alignment between the automatic margin and the ordered human label.
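
The two statistics can be sketched in a few lines. This assumes that AC1 refers to Gwet's AC1 and shows only its two-rater form; the paper's multi-annotator variant is not specified here. The label coding (-1, 0, +1 for the ordered human label) and all data are illustrative.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, List, Sequence


def gwet_ac1(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable],
             categories: Sequence[Hashable]) -> float:
    """Gwet's AC1 for two raters over a fixed category set."""
    n, q = len(rater_a), len(categories)
    if n == 0 or n != len(rater_b) or q < 2:
        raise ValueError("need equal-length ratings and at least two categories")
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts = Counter(rater_a) + Counter(rater_b)
    # Chance agreement uses the average marginal share of each category.
    shares = [counts[c] / (2 * n) for c in categories]
    chance = sum(p * (1 - p) for p in shares) / (q - 1)
    return (observed - chance) / (1 - chance)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; ties share the mean rank (quadratic, fine for a sketch)."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


# Illustrative data only.
ac1 = gwet_ac1(["A", "A", "B", "tie"], ["A", "B", "B", "tie"], ["A", "B", "tie"])
rho = spearman([-0.3, -0.1, 0.0, 0.2, 0.4], [-1, -1, 0, 1, 1])
print(round(ac1, 3), round(rho, 3))
```

AC1's chance term depends on how evenly the categories are used rather than on the product of marginals, which is why it stays stable when one verdict dominates.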

Visual integrity uses a separate 190-pair holdout. Rev. counts opposite-direction threshold decisions; thresholds are fixed before reporting. Disagreement appears mainly as ties around small differences, not systematic reversal.
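
One possible reading of the reversal count is sketched below: the automatic margin is turned into a three-way decision with a fixed threshold, and a reversal is a case where the automatic and human decisions point in opposite directions. The symmetric threshold, the -1/0/+1 coding, and the numbers are assumptions, not the paper's definitions.

```python
from typing import Dict, Sequence


def decide(margin: float, threshold: float) -> int:
    """+1 if the first clip wins, -1 if the second wins, 0 for a tie."""
    if margin > threshold:
        return 1
    if margin < -threshold:
        return -1
    return 0


def tally(margins: Sequence[float], human_labels: Sequence[int],
          threshold: float) -> Dict[str, int]:
    """Split automatic-vs-human outcomes into agreement, tie mismatch, and reversal."""
    result = {"agree": 0, "tie_mismatch": 0, "reversal": 0}
    for margin, human in zip(margins, human_labels):
        auto = decide(margin, threshold)
        if auto == human:
            result["agree"] += 1
        elif auto * human == -1:  # both decisive, opposite directions
            result["reversal"] += 1
        else:  # exactly one side called a tie
            result["tie_mismatch"] += 1
    return result


# Illustrative margins and human labels; the threshold is fixed up front.
print(tally([0.3, -0.3, 0.02, 0.3], [1, 1, 1, 0], threshold=0.05))
```

Separating tie mismatches from reversals is what lets a report say that disagreement clusters around small differences instead of flipping direction.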

For re-observation dimensions, annotators first decide whether hidden-and-returned evidence is judgeable — human-aligned judgeability makes the conditional denominator meaningful and keeps re-observation support distinct from returned correctness.
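
The distinction between support and returned correctness comes down to which denominator is used. The sketch below is a minimal illustration with hypothetical names and made-up verdicts: support divides by all clips, while the conditional score divides only by clips judged to contain hidden-and-returned evidence.

```python
from statistics import mean
from typing import List, NamedTuple, Optional, Tuple


class ReturnVerdict(NamedTuple):
    judgeable: bool               # did the clip show hidden-and-returned evidence?
    state_score: Optional[float]  # returned-state score, or None when not judgeable


def summarize(verdicts: List[ReturnVerdict]) -> Tuple[float, Optional[float]]:
    """Return (support over all clips, mean score over judgeable clips only)."""
    judged = [v.state_score for v in verdicts
              if v.judgeable and v.state_score is not None]
    support = len(judged) / len(verdicts)
    conditional = mean(judged) if judged else None
    return support, conditional


# Made-up verdicts: model A rarely returns, model B returns often.
model_a = [ReturnVerdict(True, 0.9)] * 2 + [ReturnVerdict(False, None)] * 8
model_b = [ReturnVerdict(True, 0.6)] * 8 + [ReturnVerdict(False, None)] * 2
print(summarize(model_a))
print(summarize(model_b))
```

In this toy case model A has the higher conditional score but rests on two clips, which is why the two quantities are reported side by side and never merged.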

Citation

@misc{lu2026currentworldmodels,
  title={Current World Models Lack a Persistent State Core},
  author={Jinpeng Lu and Dexu Zhu and Haoyuan Shi and Linghan Cai and Guo Tang and Yinda Chen and Jie Cao and Duyu Tang and Yi Zhang and Yong Dai and Xiaozhu Ju},
  year={2026},
  eprint={2606.20545},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.20545}
}