Stream-T1: Test-Time Scaling for Streaming Video Generation

Source-first digest for monthly 2026_05 rank 23, rank_id p039.

Routing status: success, route full_markdown
PDF extraction: not used

Motivation / Background

The paper studies test-time scaling for streaming video generation. Its starting point is that video generation can benefit from spending more compute at inference time, but conventional diffusion-video test-time scaling explores a high-dimensional latent space and repeats multi-step denoising for many candidates. The authors argue that this is especially inefficient for long videos because the method has to search over both visual content and temporal history. Streaming generation changes the search geometry: video is produced chunk by chunk, often with only a few denoising steps per chunk, so the paper frames it as a "shallow search tree with wide branches." This problem framing is the basis for the claim matrix in Table 1.

The proposed Stream-T1 framework adds three test-time mechanisms around a LongLive-style autoregressive video model: Stream-Scaled Noise Propagation, Stream-Scaled Reward Pruning, and Stream-Scaled Memory Sinking. The paper's core claim is not simply that it samples more candidates. It claims that Stream-T1 actively steers the next chunk by reusing historically good noise, selecting candidates with short-term and long-term reward signals, and updating the KV-cache sink according to reward-derived memory gates. The high-level flow is shown in Figure 1, and the full branch trajectory for the illustrated case is shown in Figure 2.

Claims And Evidence

Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly supported by paper equations, tables, figures, or ablations. A score of 4 means the paper reports direct evidence, but the claim still depends on the authors' benchmark, reward model, or evaluation setup. A score of 3 means the paper gives plausible support but leaves a numerical inconsistency, missing ablation, or qualitative gap.

Claim id	Main claim	Support	Evidence anchors
C1	Streaming video generation is a more cost-effective target for video test-time scaling than standard diffusion-video search because chunks use a smaller local search space and fewer denoising steps.	4	problem framing, contribution framing, pipeline
C2	Stream-T1 is a three-part active test-time framework rather than plain candidate selection: it propagates noise, prunes by rewards, and updates memory.	5	method overview, method map, pipeline, search trajectory
C3	Noise propagation correlates the current chunk's initial noise with the previous best trajectory while preserving a standard Gaussian marginal.	5	key equations, noise propagation
C4	Reward pruning combines frame-level spatial quality and sliding-window temporal quality, then prunes a beam-expanded pool from \(K \times M\) candidates back to top \(K\).	5	reward pruning, key equations, method map
C5	Memory sinking uses reward feedback to decide whether evicted KV-cache chunks are discarded, EMA-compressed, or appended as new long-term anchors.	5	memory sinking, key equations, pipeline
C6	Against CausVid, Self-forcing, and LongLive, Stream-T1 reports the strongest overall 5-second and 30-second benchmark profile, with a few metric-specific exceptions.	4	short-video results, long-video results, 5s table, 30s table, qualitative results
C7	Compared with passive test-time scaling baselines, the active Stream-T1 intervention is stronger on most 30-second metrics, but the recovered table shows Best-of-N higher on imaging quality.	3	TTS comparison, TTS table
C8	The ablation supports all three modules as useful for temporal consistency, although the best imaging-quality number comes from removing memory sinking.	4	ablation evidence, ablation table, ablation figure

Core Technical Idea

The core technical idea is to make every generated chunk depend on a rewarded history, not only on the model's default autoregressive state. Built on LongLive, Stream-T1 maintains a beam of candidate histories. Before producing a new chunk, it initializes the chunk's noise from historically successful noise. After generation, it scores the expanded candidate set using both image- and video-level reward models. After pruning, it updates memory according to whether the evicted context is high quality and whether it marks a transition. This flow is summarized in Table 2.

**Figure 1. Stream-T1 chunk-level test-time scaling pipeline.** Each chunk passes through noise propagation, reward pruning, and memory sinking. Provenance: source label `fig:pipeline`, original LaTeX asset `fig/pipeline.png`, copied from the ranking cache to `figs/fig001_01_pipeline.jpg`.

**Figure 2. Complete search path for the pipeline example.** The source caption links this trajectory to the case shown in Figure 1. Provenance: source label `fig:search`, original LaTeX asset `fig/search.png`, copied from the ranking cache to `figs/fig002_01_search.jpg`.

Component	When it acts	Signal	What it changes	Digest reading
Beam expansion	Per chunk	\(K\) histories expanded by \(M\) alternatives	Candidate pool grows to \(K \times M\)	This is the ordinary search substrate.
Stream-Scaled Noise Propagation	Before synthesis	Previous best noise latent plus fresh Gaussian noise	Initial latent noise for the next chunk	Adds temporal correlation without changing the Gaussian marginal.
Stream-Scaled Reward Pruning	After chunk synthesis	Frame reward and sliding-window video reward	Keeps top \(K\) candidate histories	Balances local aesthetics and longer temporal coherence.
Stream-Scaled Memory Sinking	After pruning	Short-score quality gate and long-score transition detector	Routes evicted KV-cache chunks through discard, EMA, or append	Tries to avoid both context pollution and over-smoothed memory.

The autoregressive diffusion chunk is modeled as a few-step denoising composition:

p_\theta(x^i \mid x^{\lt i}, c) = f_{\theta,t_1} \circ f_{\theta,t_2} \circ \dots \circ f_{\theta,t_T}(x^i_{t_T}).

Noise propagation initializes the first chunk from a standard Gaussian and later chunks by spherical interpolation against the previous selected chunk noise:

\begin{aligned} x_T^0 &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \\ x_T^n &= \beta x_T^{n-1} + \sqrt{1-\beta^2}\epsilon,\quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{aligned}

Reward pruning splits candidate evaluation into a short image score and a long sliding-window video score:

\begin{aligned} S^n_{\text{short}} &= \frac{1}{F}\sum_{f=1}^F \text{ImageReward}(x_0^n[:,f]), \\ S^n_{\text{long}} &= \text{VideoReward}(\text{output}[:,\max(0,n-w+1):n]). \end{aligned}

The final pruning score increases short-score weight as generation advances, but caps it by \(\tau\):

S^n_{\text{final}} = \begin{cases} \frac{n}{N}S^n_{\text{short}} + (1-\frac{n}{N})S^n_{\text{long}}, & \frac{n}{N} \le \tau, \\ \tau S^n_{\text{short}} + (1-\tau)S^n_{\text{long}}, & \frac{n}{N} > \tau. \end{cases}

Memory sinking uses two gates:

\mathcal{C}_{\text{quality}} := S^n_{\text{short}}-\overline{S}_{\text{short}} > \tau_{\text{short}}, \qquad \mathcal{C}_{\text{transition}} := S^{n-1}_{\text{long}}-S^n_{\text{long}} > \tau_{\text{long}}.

The routing choices are: keep the sink unchanged when the chunk fails the quality gate; EMA-compress \(K^n,V^n\) into the latest sink when quality is high and no transition is detected; or append \(K^n,V^n\) as a new anchor when quality is high and the long reward indicates a transition.

Method Details

Stream-Scaled Noise Propagation is motivated by the observation that nearby latent states often share generation quality. Instead of sampling each chunk's initial noise independently, the method conditions \(x_T^n\) on the previously selected \(x_T^{n-1}\) through the interpolation in the key equations. The design gives adjacent chunks a controllable correlation through \(\beta\), while the fresh \(\epsilon\) term preserves the marginal Gaussian distribution. This is why C3 receives a high support score in Table 1: the equation directly backs the stated mechanism.

Stream-Scaled Reward Pruning is the search filter. For each chunk, \(K\) live histories are each expanded into \(M\) alternatives, producing \(K \times M\) candidate continuations. A short score averages frame-level image rewards for spatial fidelity, and a long score uses a video reward model over a sliding window of recent chunks for text alignment, visual quality, and motion coherence. The final score in the key equations increases the short-score contribution later in the video while capping that contribution to reduce repetition or stagnation.

Stream-Scaled Memory Sinking is the long-horizon context mechanism. LongLive starts from a sliding-window cache with attention sinks, but fixed sinks can over-anchor early frames and uniform EMA can blur distinct semantic states. Stream-T1 uses the reward scores from pruning to decide what to do with the KV-cache chunk evicted from the local window: discard low-quality chunks, EMA-compress high-quality non-transition chunks, or append high-quality transition chunks as new anchors. The append/EMA/discard policy is the most direct mechanism behind the paper's long-video consistency claim.

The experiments instantiate Stream-T1 on LongLive built from Wan2.1-T2V-1.3B. The initial KV cache has attention window size 9 and sink size 3. The short reward can use image reward models such as HPSv3, ImageReward, and MHP; the long reward can use video reward models such as VisionReward, VideoAlign, and VideoLLaMA3 over a 10-chunk sliding window. The generated videos are reported at 16 FPS and \(832 \times 480\) resolution. These settings are summarized in Table 3.

Setting	Value reported in source
Base generator	LongLive built on Wan2.1-T2V-1.3B
Test-time algorithm	Beam search with \(K\) histories and \(M\) chunk expansions
Initial KV-cache strategy	Attention window size 9 and sink size 3
Short-sequence reward	Image reward models such as HPSv3, ImageReward, and MHP
Long-sequence reward	VisionReward, VideoAlign, and VideoLLaMA3 over a 10-chunk window
Output setting	16 FPS at \(832 \times 480\)
Short benchmark	946 VBench prompts, 5-second clips
Long benchmark	First 128 MovieGen prompts, 30-second clips
Metrics	VBench/VBench-long plus VideoAlign VQ, MQ, and TA

Experiments And Results

For 5-second clips, the paper compares CausVid, Self-forcing, LongLive, and Stream-T1 on 946 VBench prompts. Table 4 shows that Stream-T1 has the best Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality, VideoAlign MQ, and VideoAlign TA. The exceptions matter: Self-forcing has the highest Imaging Quality, and CausVid has the highest VideoAlign VQ.

Method	Subject	Background	Motion	Imaging	Aesthetic	VQ	MQ	TA
CausVid	96.33	95.56	98.66	69.69	62.90	0.433	0.550	1.020
Self-forcing	95.26	95.67	98.67	71.61	63.97	0.099	0.088	1.193
LongLive	97.00	96.78	99.12	71.28	65.28	0.285	0.350	1.193
Stream-T1 on LongLive	97.25	97.05	99.15	71.42	65.98	0.426	0.629	1.305

For 30-second videos, Table 5 is stronger for Stream-T1. It reports the best VBench-long Subject Consistency, Background Consistency, Motion Smoothness, Imaging Quality, and Aesthetic Quality, plus best VideoAlign VQ and TA. CausVid remains higher on VideoAlign MQ. The qualitative comparison in Figure 3 is consistent with the paper's claim that Stream-T1 preserves long-horizon appearance and temporal coherence better than the baselines.

Method	Subject	Background	Motion	Imaging	Aesthetic	VQ	MQ	TA
CausVid	97.91	96.74	98.15	66.32	59.71	-0.144	0.328	0.501
Self-forcing	97.18	96.37	98.35	68.35	59.19	-0.461	-0.216	0.656
LongLive	97.90	96.82	98.78	68.99	61.56	-0.169	-0.002	1.073
Stream-T1 on LongLive	98.43	97.18	99.03	69.10	62.11	-0.073	0.226	1.170

Figure 3. Qualitative comparisons with CausVid, Self-forcing, and LongLive. — **Figure 3. Qualitative comparisons.** The source caption says Stream-T1 improves temporal consistency and frame-level fidelity relative to CausVid, Self-forcing, and LongLive. Provenance: source label `fig:results`, original LaTeX asset `fig/results.png`, copied from the ranking cache to `figs/fig003_01_results.jpg`.

The paper also compares active Stream-T1 against passive test-time scaling variants on 30-second generation. Table 6 reports that Stream-T1 improves over LongLive, Best-of-N, and Beam Search on most metrics. There is a source-level inconsistency: the paper text says Stream-T1 is state-of-the-art across all evaluative metrics, but the recovered table lists Best-of-N at 69.34 Imaging Quality versus Stream-T1 at 69.10. The digest therefore scores C7 as 3 rather than 5.

Method	Subject	Background	Motion	Imaging	Aesthetic	VQ	MQ	TA
LongLive	97.90	96.82	98.78	68.99	61.56	-0.169	-0.002	1.073
LongLive + Best-of-N	98.13	96.88	98.86	69.34	61.97	-0.083	0.062	1.160
LongLive + Beam Search	98.28	97.03	98.90	69.05	61.85	-0.077	0.165	1.159
LongLive + Stream-T1	98.43	97.18	99.03	69.10	62.11	-0.073	0.226	1.170

The ablation studies isolate the three modules on 30-second generation. The text and Figure 4 say removing memory sinking hurts background stability, removing noise propagation introduces local structural artifacts, and removing reward pruning causes semantic misalignment and lower aesthetic quality. Table 7 supports the broad story, but it also shows tradeoffs: removing memory sinking gives the highest Imaging Quality, while Stream-T1 wins the consistency, motion, aesthetic, VQ, MQ, and TA metrics.

Figure 4. Qualitative ablation of Stream-T1 components. — **Figure 4. Qualitative ablation.** The source caption attributes background instability to removing memory sinking, local structural artifacts to removing noise propagation, and semantic/aesthetic degradation to removing reward pruning. Provenance: source label `fig:ablation`, original LaTeX asset `fig/ablation.png`, copied from the ranking cache to `figs/fig004_01_ablation.jpg`.

Variant	Subject	Background	Motion	Imaging	Aesthetic	VQ	MQ	TA
w/o Memory Sinking	98.30	97.04	98.92	69.51	61.90	-0.083	0.188	1.146
w/o Noise Propagation	98.35	97.14	98.98	69.07	61.99	-0.094	0.176	1.164
w/o Reward Pruning	98.04	96.88	98.87	69.17	61.22	-0.173	0.014	1.035
Full Stream-T1	98.43	97.18	99.03	69.10	62.11	-0.073	0.226	1.170

Practical Takeaways

The useful reading is that Stream-T1 is a test-time controller for streaming video generation: it does not introduce a new backbone, but it uses extra inference-time candidate generation, reward evaluation, and memory routing to make chunk-by-chunk video more stable. It is most compelling when the application can afford more generation-time compute for long-video quality. It is less compelling if the strict requirement is single-pass latency, because the method explicitly expands candidates and evaluates reward models at test time.

Practical point	Why it matters
Best fit	Long or streaming text-to-video generation where temporal drift is more damaging than extra inference compute.
Main mechanism to copy	Couple beam search with active state changes: noise initialization, reward-aware pruning, and reward-aware memory updates.
Most concrete equations	Noise interpolation, short/long reward fusion, and quality/transition memory gates in the key equations.
Strongest evidence	30-second VBench-long and VideoAlign improvements in Table 5, plus qualitative stability in Figure 3.
Main caveat	The reported gains depend on the selected reward models and benchmark prompts; there is no independent reproduction in the source.
Table inconsistency	The passive-TTS comparison claims all-metric superiority, but Table 6 lists a higher imaging-quality score for Best-of-N.

Reference Coverage

Anchor coverage links: claims table, problem framing, contribution framing, method overview, pipeline figure, search figure, method map, key equations, noise evidence, reward evidence, memory evidence, implementation evidence, implementation table, short-video evidence, 5s results table, long-video evidence, 30s results table, qualitative results figure, TTS comparison evidence, TTS comparison table, ablation evidence, ablation figure, ablation table, caveats, and practical takeaways table.