HF Monthly Digest

25 paper digests
2026-05
#1
arXiv 2026 | avg 9.50 | interest 9.00 | 389 HF

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

This paper addresses multi-agent interactive video generation, where several independently controlled agents must stay consistent over time and across viewpoints. It introduces Gamma-World with Simplex Rotary Agent Encoding, Sparse Hub Attention, and causal diffusion distillation, improving fidelity, controllability, and generalization from two to four players.

#2
arXiv 2026 | avg 9.46 | interest 10.00 | 347 HF

MolmoAct2: Action Reasoning Models for Real-world Deployment

This paper targets practical open Vision-Language-Action robot deployment, where existing systems are closed, hardware-bound, slow, or unreliable. MolmoAct2 combines a spatial embodied-reasoning VLM, new robot datasets, an open action tokenizer, a flow-matching continuous-action expert, and adaptive-depth reasoning to outperform strong baselines across simulation and real-world benchmarks.

#3
arXiv 2026 | avg 7.81 | interest 8.70 | 269 HF

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

This paper argues that document VQA needs evidence attribution, not just final-answer scoring, because models can answer correctly while citing the wrong source region. CiteVQA evaluates answers with element-level bounding-box citations and shows a large attribution hallucination gap across 20 MLLMs.
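The gap between answer accuracy and attributed accuracy can be pictured with a small scoring sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2), exact-match answer scoring, and an IoU threshold of 0.5; these choices, the field names, and the function names are illustrative assumptions, not the benchmark's actual protocol.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attribution_gap(samples, threshold=0.5):
    """Return (answer accuracy, attributed accuracy, gap) over the samples.

    Each sample is a dict with hypothetical keys: pred_answer, gold_answer,
    pred_box, gold_box. A prediction counts as attributed only if the answer
    is correct AND the cited box overlaps the gold evidence box enough.
    """
    correct = 0
    attributed = 0
    for s in samples:
        answer_ok = s["pred_answer"] == s["gold_answer"]
        cite_ok = iou(s["pred_box"], s["gold_box"]) >= threshold
        if answer_ok:
            correct += 1
            if cite_ok:
                attributed += 1
    n = len(samples)
    return correct / n, attributed / n, (correct - attributed) / n
```

A model that answers correctly while citing the wrong region raises the first number but not the second, which is the kind of gap the summary describes.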

#4
arXiv 2026 | avg 7.16 | interest 9.40 | 191 HF

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

This paper targets the split between multimodal understanding and generation in current VLM systems. SenseNova-U1 uses the NEO-unify architecture to train native unified models that perform strongly across vision-language understanding, any-to-image and interleaved generation, and show preliminary promise for VLA and world-model scenarios.

#5
arXiv 2026 | avg 6.65 | interest 8.00 | 206 HF

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

This paper treats agent skills as trainable external state for a frozen agent rather than one-shot or loosely revised text. SkillOpt uses a separate optimizer model to make bounded edits accepted only by held-out validation gains, improving performance across benchmarks, models, and execution harnesses without extra inference-time calls.
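The acceptance rule described above (bounded edits, kept only when held-out validation improves) can be sketched as a simple loop. Everything here is a hypothetical illustration: `propose_edit` stands in for the optimizer model, `validate` for a held-out scoring run of the frozen agent, and the length-based bound is a crude proxy for whatever edit budget the paper actually uses.

```python
def optimize_skill(skill, propose_edit, validate, rounds=10, max_edit_chars=200):
    """Greedy, validation-gated skill editing (illustrative sketch).

    skill:        current skill text used by the frozen agent
    propose_edit: callable(str) -> str, hypothetical optimizer-model call
    validate:     callable(str) -> float, hypothetical held-out score
    """
    best_score = validate(skill)
    for _ in range(rounds):
        candidate = propose_edit(skill)
        # Bounded edit: skip proposals that change the skill length too much.
        if abs(len(candidate) - len(skill)) > max_edit_chars:
            continue
        score = validate(candidate)
        # Accept only strict improvements on the held-out split.
        if score > best_score:
            skill, best_score = candidate, score
    return skill, best_score
```

Because the loop only rewrites the stored skill text, the agent itself makes no additional calls at inference time; it simply loads the accepted version.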

#6
arXiv 2026 | avg 6.64 | interest 9.60 | 143 HF

PhysBrain 1.0 Technical Report

This paper studies how large-scale human egocentric video can supply physical commonsense for vision-language-action systems beyond what robot trajectories cover. PhysBrain 1.0 extracts structured scene, action, dynamics, and depth-aware supervision from video, transfers the resulting priors to VLA policies, and achieves strong multimodal QA and embodied-control results.

#7
arXiv 2026 | avg 6.60 | interest 7.80 | 210 HF

Code as Agent Harness

This survey frames code as the operational harness for agentic AI, covering how code connects agents to reasoning, action, environment modeling, planning, memory, tool use, feedback, and multi-agent coordination. It organizes representative methods and applications while outlining challenges in evaluation, verification, shared state, oversight, and multimodal extension.

#8
arXiv 2026 | avg 6.40 | interest 9.60 | 125 HF

RLDX-1 Technical Report

This paper addresses the limits of current VLAs on complex real-world dexterous tasks requiring motion awareness, long-term memory, and physical sensing. RLDX-1 uses a Multi-Stream Action Transformer with modality-specific streams, cross-modal attention, data synthesis, specialized learning, and real-time inference optimizations to outperform recent frontier VLAs in simulation and humanoid tasks.

#9
arXiv 2026 | avg 6.28 | interest 9.30 | 127 HF

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

This paper addresses the speed and geometry limitations of visual grounding systems that serialize bounding boxes into independent coordinate tokens. LocateAnything uses Parallel Box Decoding and a 138-million-sample data engine to improve both localization accuracy and decoding throughput across diverse grounding and detection benchmarks.

#10
ICML 2026 | avg 6.27 | interest 8.80 | 145 HF

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

This paper tackles the lack of large, diverse training data for GUI agents. Video2GUI automatically filters internet tutorial videos and converts them into grounded interaction trajectories, producing WildGUI with 12 million trajectories and improving GUI grounding and action benchmarks after pretraining.

#11
arXiv 2026 | avg 6.22 | interest 6.50 | 231 HF

Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

This paper studies a collapse mode in very deep Diffusion Transformers where token representations become mean-dominated and lose centered variation. It identifies Mean Mode Screaming as the trigger and proposes Mean-Variance Split Residuals, which stabilize 400-layer training and support a 1000-layer DiT scale-validation run.

#12
arXiv 2026 | avg 6.19 | interest 6.80 | 217 HF

Heterogeneous Scientific Foundation Model Collaboration

This paper addresses the limits of language-only agentic systems in scientific domains that rely on specialized non-linguistic foundation models. Eywa wraps domain-specific predictive models with language-model reasoning interfaces and improves performance across physical, life, and social science tasks involving structured or specialized data.

#13
arXiv 2026 | avg 6.14 | interest 9.80 | 96 HF

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

The paper presents Qwen-VLA, a unified vision-language-action model for embodied tasks such as manipulation, navigation, and trajectory prediction across environments and robot embodiments. It extends Qwen's vision-language stack with a DiT action decoder, embodiment-aware prompting, and large-scale joint pretraining, achieving strong benchmark and out-of-distribution generalization results.

#14
arXiv 2026 | avg 6.08 | interest 7.40 | 185 HF

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

This paper addresses autonomous scientific discovery as an iterative process rather than a linear paper-generation pipeline. AutoResearchClaw uses multi-agent debate, self-healing execution, verifiable reporting, human-in-the-loop intervention modes, and cross-run learning to outperform AI Scientist v2 on an experiment-stage benchmark.

#15
arXiv 2026 | avg 6.05 | interest 9.50 | 101 HF

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

The paper provides OpenSearch-VL, an open recipe for training multimodal deep search agents. It builds shortcut-resistant SFT and RL datasets, a tool environment for text and visual search operations, and a fatal-aware GRPO variant, yielding large benchmark gains and performance comparable to proprietary systems on some tasks.

#16
arXiv 2026 | avg 6.01 | interest 9.00 | 118 HF

MMSkills: Towards Multimodal Skills for General Visual Agents

The paper argues that reusable skills for visual agents need multimodal procedural knowledge rather than text-only instructions. MMSkills packages procedures with state cards and multi-view keyframes, generated from public trajectories and consulted at inference time to improve GUI and game-based visual-agent performance.

#17
arXiv 2026 | avg 5.92 | interest 9.30 | 99 HF

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

The paper introduces WBench, a multi-turn benchmark for interactive video world models covering video quality, setting adherence, interaction adherence, consistency, and physics compliance. It includes 289 cases and 1,058 turns with multiple interaction types and finds that no evaluated state-of-the-art model is strong across all dimensions.

#18
arXiv 2026 | avg 5.92 | interest 8.00 | 149 HF

When Vision Speaks for Sound

This paper shows that video-capable MLLMs often infer or hallucinate audio content from visual cues rather than verifying the audio stream. The Thud framework probes this audio-visual Clever Hans effect with Shift, Mute, and Swap interventions, and a two-stage alignment recipe improves intervention performance while preserving general video and audio-visual QA ability.

#19
arXiv 2026 | avg 5.92 | interest 6.60 | 204 HF

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

This paper analyzes how sequence-level RLVR rewards translate into token-level probability updates and shows that centroid-based updates can be diluted by shared high-frequency token patterns. DelTA reweights token-gradient directions to emphasize discriminative credit assignment, improving math reasoning and generalizing to code generation, other backbones, and out-of-domain settings.

#20
arXiv 2026 | avg 5.86 | interest 8.50 | 125 HF

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

This paper improves distillation for autoregressive streaming video diffusion models by questioning uniform supervision from teacher rollouts. Stream-R1 reweights losses using reward-guided inter-rollout reliability and intra-rollout spatiotemporal saliency, improving visual quality, motion quality, and text alignment without architecture changes or extra inference cost.

#21
arXiv 2026 | avg 5.82 | interest 7.30 | 169 HF

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

This paper examines whether MLLMs genuinely ground personality judgments in behavior or rely on superficial first impressions. It introduces Grounded Personality Reasoning, the MM-OCEAN video dataset, and a multi-tier benchmark showing that many correct Big Five ratings are not supported by retrieved behavioral cues.

#22
arXiv 2026 | avg 5.75 | interest 6.50 | 195 HF

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

This paper investigates why on-policy self-distillation can fail for math reasoning, tracing the issue to privileged contexts that overboost already implied tokens and suppress deliberative search tokens. Anti-Self-Distillation reverses the divergence direction with an entropy gate, matching GRPO accuracy in fewer steps and improving final accuracy across multiple model sizes.

#23
arXiv 2026 | avg 5.74 | interest 8.80 | 104 HF

Stream-T1: Test-Time Scaling for Streaming Video Generation

The paper applies test-time scaling to streaming video generation to reduce candidate exploration cost and add temporal guidance. Stream-T1 uses noise propagation from prior chunks, reward-based candidate pruning, and memory sinking for evicted context, improving temporal consistency, motion smoothness, and frame quality on 5s and 30s benchmarks.

#24
arXiv 2026 | avg 5.72 | interest 9.70 | 68 HF

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

Spatial foundation models are usually evaluated in narrow settings, leaving their robustness across tasks, viewpoints, domains, input densities, and hardware constraints unclear. SpatialBench evaluates 41 models on 19 datasets and 546 scenes across six paradigms, showing current models are not all-rounders and introducing DA-Next-5M plus DA-Next to address a key data gap.

#25
arXiv 2026 | avg 5.70 | interest 5.80 | 218 HF

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

This paper presents MinT, infrastructure for training, exporting, serving, evaluating, and rolling back very large catalogs of LoRA-adapted LLM policies without materializing full merged checkpoints. It keeps base models resident, moves adapter revisions through the lifecycle, and demonstrates million-scale addressable LoRA catalogs over shared 1T-class base models.
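The idea of serving many adapters over one resident base model rests on the standard LoRA identity: the output is W x + scale * B (A x), so the merged matrix W + B A never has to be built. The sketch below illustrates that identity with plain Python lists; the `AdapterCatalog` class, its method names, and the (policy, revision) keying are hypothetical and are not MinT's API.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class AdapterCatalog:
    """One shared base weight plus many low-rank (A, B) adapter revisions."""

    def __init__(self, base_weight):
        self.base = base_weight   # stays resident, shared by all policies
        self.adapters = {}        # (policy_id, revision) -> (A, B, scale)

    def register(self, policy_id, revision, a, b, scale=1.0):
        self.adapters[(policy_id, revision)] = (a, b, scale)

    def forward(self, policy_id, revision, x):
        a, b, scale = self.adapters[(policy_id, revision)]
        base_out = matvec(self.base, x)
        # Low-rank path B (A x); the merged matrix W + B A is never formed.
        delta = matvec(b, matvec(a, x))
        return [y + scale * d for y, d in zip(base_out, delta)]
```

In this picture, rolling back a policy amounts to addressing an earlier revision key, since the base weights are untouched by any adapter.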