HF Daily Digest

22 paper digests
2026-05-29
#1
arXiv 2026 | avg 9.50 | interest 10.00 | 73 HF

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA addresses fragmented embodied intelligence models by casting manipulation, navigation, and trajectory prediction as a unified vision-language-action problem. It extends the Qwen stack with a DiT-based action decoder, embodiment-aware prompt conditioning, and large-scale joint pretraining, with experiments reporting strong multi-task and out-of-distribution performance across simulated and real robot settings.

#2
arXiv 2026 | avg 6.28 | interest 6.50 | 49 HF

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

CollectionLoRA addresses the deployment cost and interference caused by loading many customized image-editing LoRAs with acceleration modules. It uses multi-teacher on-policy distillation, routing, prompt-space isolation, and coarse-to-fine objectives to compress up to 50 visual effects plus few-step generation into a single LoRA while preserving concept fidelity.

#3
arXiv 2026 | avg 6.22 | interest 8.50 | 32 HF

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

YoCausal evaluates whether video diffusion models understand causality rather than merely fitting temporal correlations. It builds a two-level benchmark from temporally reversed real-world videos and finds across 13 models that arrow-of-time perception does not imply human-level causal cognition.

#4
arXiv 2026 | avg 6.11 | interest 10.00 | 18 HF

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

This paper probes whether VLM spatial reasoning reflects structured 3D understanding or shortcuts from natural image statistics. Using contrastive representation analysis and the SpatialTunnel benchmark, it finds a persistent vertical-distance entanglement bias and links better-separated spatial axes to more robust spatial reasoning.

#5
CVPR 2026 | avg 5.92 | interest 9.00 | 23 HF

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom targets Video-LLM latency by moving visual token compression into the vision encoder rather than only compressing after encoding. The training-free method uses early-stage compression and decoupled spatial token selection, reducing time-to-first-token and FLOPs while maintaining accuracy comparable to full-token baselines.

#6
arXiv 2026 | avg 5.55 | interest 9.00 | 17 HF

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

LoMo addresses VLM fragility when semantically equivalent content is moved between text and rendered-image carriers. It creates interleaved multimodal supervision by rendering selected text spans as images, improving cross-modal invariance and multimodal reasoning across 13 benchmarks.

#7
arXiv 2026 | avg 5.12 | interest 9.50 | 6 HF

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Ptah addresses the difficulty of producing verifiable multimodal deep-research reports that combine textual synthesis with visual evidence. Its multi-agent harness plans, researches, writes, maintains a Visual Working Memory, and uses a verifier for factual grounding, citation fidelity, and cross-modal consistency, with experiments showing more reliable and usable reports than baselines.

#8
CVPR 2026 | avg 5.06 | interest 10.00 | 1 HF

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

GASP improves VLM 3D spatial reasoning by injecting geometric priors into transformer layers rather than relying on 3D VQA fine-tuning or specialized 3D encoders. It trains a correspondence head with contrastive point-correspondence and depth-consistency objectives, substantially improving internal correspondence behavior and downstream spatial benchmark scores without training on 3D VQA data.

#9
arXiv 2026 | avg 4.62 | interest 9.00 | 2 HF

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

This paper improves semantic correspondence by adding 3D foundation-model priors to 2D foundation features that often confuse symmetric or repeated object parts. It estimates object geometry and pose with SAM3D, renders PartField descriptors, filters matches by geodesic distance, and trains a lightweight adapter that improves correspondence with less manual geometric supervision.

#10
arXiv 2026 | avg 4.55 | interest 8.00 | 9 HF

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

AsyncTool evaluates LLM agents' asynchronous tool calling in multi-task settings where tool responses are delayed. The benchmark introduces heterogeneous concurrent tasks and efficiency metrics, and finds that latency substantially degrades current agents unless they coordinate task switching, dependency tracking, and state maintenance.
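
The cost of delayed tool responses can be illustrated with a small sketch. This is not the AsyncTool benchmark or its metrics; `call_tool`, the tool names, and the delays are hypothetical stand-ins. It only contrasts waiting on each delayed call in turn with keeping several calls in flight at once.

```python
import asyncio
import time


async def call_tool(name: str, delay: float) -> str:
    """Hypothetical tool whose response arrives after `delay` seconds."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def run_sequential(tools):
    # Block on every tool response before issuing the next call.
    return [await call_tool(name, delay) for name, delay in tools]


async def run_concurrent(tools):
    # Issue all calls up front and collect results in the original order.
    return list(await asyncio.gather(*(call_tool(n, d) for n, d in tools)))


def timed(runner, tools):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(runner(tools))
    return results, time.perf_counter() - start


# Assumed tool names and latencies, for illustration only.
TOOLS = [("search", 0.05), ("weather", 0.05), ("calendar", 0.05)]

if __name__ == "__main__":
    for runner in (run_sequential, run_concurrent):
        results, elapsed = timed(runner, TOOLS)
        print(runner.__name__, results, f"{elapsed:.3f}s")
```

A real agent would also have to track which pending call belongs to which task and which calls depend on earlier results, which is the coordination the summary says current agents struggle with.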

#11
CVPR 2026 | avg 4.18 | interest 8.00 | 3 HF

NeuROK: Generative 4D Neural Object Kinematics

NeuROK addresses generative 4D object dynamics by learning a data-driven latent kinematic state parameterization and decoder for plausible object deformations. A transformer trained on large-scale 4D data enables simulation in a low-dimensional latent space and shows advantages across diverse dynamic object types.

#12
arXiv 2026 | avg 4.00 | interest 8.00 | 0 HF

PhoneWorld: Scaling Phone-Use Agent Environments

PhoneWorld scales phone-use agent environments by converting real GUI trajectories and screenshots into controllable mock Android apps, executable tasks, automatic verifiers, and training rollouts. Its 34-app instantiation improves multiple mobile-agent benchmarks under a fixed training budget and shows further gains from more PhoneWorld supervision and broader app coverage.

#13
arXiv 2026 | avg 3.93 | interest 7.00 | 7 HF

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

LiteCoder-Terminal-Gen synthesizes executable, verifiable terminal environments from domain specifications to train long-horizon language agents without relying on scraped repositories. The resulting supervised and RL resources improve Qwen-family terminal agents, and Direct Multi-turn Preference Optimization provides additional gains.

#14
ICML 2026 | avg 3.68 | interest 6.50 | 7 HF

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

This work analyzes hybrid multi-agent systems that combine cloud LLMs with on-device smaller language models, focusing on tradeoffs among accuracy, monetary cost, and edge energy use. By adapting representative architectures, it finds that smaller models can benefit from LLM assistance, but the optimal hybrid design is task-dependent and more frontier compute does not reliably improve performance.

#15
arXiv 2026 | avg 3.67 | interest 5.50 | 15 HF

Xetrieval: Mechanistically Explaining Dense Retrieval

Xetrieval explains dense retrieval decisions at the embedding level by internalizing reasoning into sentence embeddings and decomposing them into sparse, human-interpretable features. Aggregating feature overlaps across document-side views yields retrieval explanations, with experiments showing coherent features, stronger intervention effects, and task-level feature steering.

#16
arXiv 2026 | avg 3.50 | interest 6.50 | 4 HF

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

This paper tackles reward design for factual question answering by introducing CorVer, a lightweight corpus-grounded process reward based on Wikipedia co-occurrence statistics instead of neural verifiers. Across six instruction-tuned models and five QA benchmarks, CorVer improves all tested model-benchmark cells, beats several neural-verifier baselines in most feasible settings, and trains faster.
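
To make "co-occurrence statistics as a reward" concrete, here is a toy sketch. It is not CorVer's actual reward: the pointwise mutual information (PMI) score, the whitespace tokenizer, and the `step_reward` rule are all assumptions chosen for illustration, and the corpus is a handful of made-up strings rather than Wikipedia.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, per document, which terms appear and which pairs appear together."""
    term_counts, pair_counts = Counter(), Counter()
    for doc in docs:
        terms = set(doc.lower().split())
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return term_counts, pair_counts


def pmi(a, b, term_counts, pair_counts, n_docs):
    """Pointwise mutual information of two terms; -inf if they never co-occur."""
    joint = pair_counts[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log(joint * n_docs / (term_counts[a] * term_counts[b]))


def step_reward(question_terms, step_terms, term_counts, pair_counts, n_docs):
    """Hypothetical rule: fraction of step terms that co-occur more often
    than chance (PMI > 0) with at least one question term."""
    if not step_terms:
        return 0.0
    supported = sum(
        any(pmi(q, s, term_counts, pair_counts, n_docs) > 0 for q in question_terms)
        for s in step_terms
    )
    return supported / len(step_terms)


# Made-up toy corpus standing in for a real reference corpus.
DOCS = ["paris france", "paris france seine", "berlin germany",
        "berlin germany spree", "france germany"]
TERMS, PAIRS = cooccurrence_counts(DOCS)

if __name__ == "__main__":
    print(step_reward(["france"], ["paris"], TERMS, PAIRS, len(DOCS)))
    print(step_reward(["france"], ["berlin"], TERMS, PAIRS, len(DOCS)))
```

The appeal of a count-based score like this is that it needs only table lookups at training time, which is consistent with the summary's claim of faster training than neural verifiers.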

#17
arXiv 2026 | avg 3.49 | interest 4.50 | 20 HF

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

This paper studies the quantitative limits of exact parametric memory in LoRA fine-tuning and proposes a Parametric Memory Law relating loss reduction to effective parameters and sequence length. It also introduces MemFT, a threshold-guided training strategy that reallocates effort toward sub-threshold tokens to improve memory fidelity and efficiency.

#18
arXiv 2026 | avg 3.43 | interest 5.50 | 11 HF

Is Position Bias in Dense Retrievers Built In-or Learned from Data?

This paper examines whether dense retrievers' preference for early-position evidence is architectural or learned from training data. Using synthetic position-targeted training sets across eight pretrained models, it finds that skewed evidence positions steer retrieval bias and that balanced training substantially reduces positional sensitivity while maintaining competitive retrieval performance.
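
A position-targeted training set can be sketched as follows. This is a minimal illustration under assumptions, not the paper's data pipeline: the three position buckets, the `position_weights` parameter, and the filler sentences are hypothetical. The idea is simply to control where the evidence sentence lands in each synthetic document.

```python
import random


def build_document(evidence, fillers, position, rng):
    """Insert the evidence sentence at the start, middle, or end of shuffled fillers."""
    body = fillers[:]
    rng.shuffle(body)
    index = {"start": 0, "middle": len(body) // 2, "end": len(body)}[position]
    body.insert(index, evidence)
    return " ".join(body)


def build_training_set(pairs, fillers, position_weights, seed=0):
    """Sample an evidence position per (query, evidence) pair.

    `position_weights` maps position name to sampling weight: skewed weights
    produce a position-biased set, equal weights a balanced one.
    """
    rng = random.Random(seed)
    positions = list(position_weights)
    weights = [position_weights[p] for p in positions]
    examples = []
    for query, evidence in pairs:
        position = rng.choices(positions, weights=weights, k=1)[0]
        examples.append({
            "query": query,
            "document": build_document(evidence, fillers, position, rng),
            "position": position,
        })
    return examples


if __name__ == "__main__":
    pairs = [("who wrote it?", "Ada wrote it."), ("when?", "It was 1843.")]
    fillers = ["Filler one.", "Filler two.", "Filler three.", "Filler four."]
    balanced = build_training_set(
        pairs, fillers, {"start": 1.0, "middle": 1.0, "end": 1.0})
    for example in balanced:
        print(example["position"], "|", example["document"])
```

Training one retriever on a skewed set and another on a balanced set, then comparing how retrieval scores change as the evidence moves, is the kind of comparison the summary describes.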

#19
arXiv 2026 | avg 3.25 | interest 6.50 | 0 HF

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

This paper addresses weak performance of large language and multimodal models on time-series anomaly detection by building VisAnomBench, a benchmark with curated anomaly explanations from public datasets. Fine-tuning yields VisAnomReasoner, a parameter-efficient VLM that improves anomaly localization and outperforms baselines on VisAnomBench and a cross-benchmark TSB-AD-U evaluation.

#20
EMNLP 2026 | avg 2.94 | interest 5.50 | 3 HF

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

This paper argues that constrained decoding can harm LLM reasoning when applied too early, while free-form generation lacks reliable structure. It proposes In-Writing, a single-call method that lets the model reason freely before a trigger token activates structured decoding, reducing premature triggering and improving accuracy across classification and reasoning tasks.
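
The reason-then-constrain idea can be sketched with a toy decoding loop. This is not the In-Writing implementation: the `<answer>` trigger string, the scripted stand-in model, and the single-label schema are assumptions, and a real system would mask logits rather than filter a ranked token list.

```python
from typing import Callable, List, Sequence, Set

TRIGGER = "<answer>"  # hypothetical trigger token


def decode(rank_next: Callable[[List[str]], Sequence[str]],
           allowed: Set[str], max_steps: int = 32) -> List[str]:
    """Greedy decoding that is unconstrained until TRIGGER is emitted,
    then restricted to `allowed` tokens for the final label."""
    tokens: List[str] = []
    constrained = False
    for _ in range(max_steps):
        candidates = rank_next(tokens)  # best candidate first
        if constrained:
            # Structured phase: take the best candidate the schema permits.
            choice = next((tok for tok in candidates if tok in allowed), None)
            if choice is None:
                raise ValueError("no allowed token among candidates")
            tokens.append(choice)
            break  # toy schema: exactly one label after the trigger
        choice = candidates[0]  # free-form phase: plain greedy choice
        tokens.append(choice)
        if choice == TRIGGER:
            constrained = True
    return tokens


# Scripted stand-in for a language model: ranked candidates at each step.
SCRIPT = [
    ["The", "positive"],
    ["review", "negative"],
    ["praises", "."],
    ["it", "."],
    [TRIGGER, "so"],
    ["probably", "positive", "negative"],
]


def scripted_model(prefix: List[str]) -> Sequence[str]:
    return SCRIPT[min(len(prefix), len(SCRIPT) - 1)]


if __name__ == "__main__":
    print(decode(scripted_model, allowed={"positive", "negative"}))
```

In the last step the unconstrained top candidate is "probably", which is not a valid label; the constraint only takes effect after the trigger, so the free-form tokens before it are left untouched.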

#21
arXiv 2026 | avg 2.69 | interest 5.00 | 3 HF

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

This paper investigates why larger models learn tasks that smaller models miss, arguing that scaling helps models retain rare or complex tasks under data-induced competition for representational resources. Synthetic task-mixture experiments and OLMo pretraining runs suggest larger models suffer less gradient interference, preserve more rare-task features, and better learn infrequent complex tasks.

#22
arXiv 2025 | avg 2.56 | interest 5.00 | 1 HF

Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

This paper introduces Local Linear Attention, an attention mechanism derived from nonparametric statistics and test-time regression to interpolate between linear and softmax attention. It provides theoretical bias-variance advantages for associative memory, proposes memory-efficient and hardware-efficient implementations, and shows empirical gains on regression, recall, and state-tracking tasks.
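
The nonparametric-statistics connection can be written out using textbook estimators. The block below is a sketch in assumed notation (query q, keys k_i, values v_i, key dimension d), not the paper's own formulation: softmax attention coincides with a locally constant (Nadaraya-Watson) kernel fit, and local linear regression extends that fit with a slope term. How the paper regularizes or implements the local linear fit is not stated in the summary.

```latex
% Softmax attention as a locally constant (Nadaraya-Watson) fit around the query q
w_i(q) = \exp\!\left(q^\top k_i / \sqrt{d}\right), \qquad
\hat{f}_{\mathrm{NW}}(q) = \frac{\sum_i w_i(q)\, v_i}{\sum_i w_i(q)}
  = \arg\min_{b} \sum_i w_i(q)\, \lVert v_i - b \rVert^2

% Local linear regression adds a slope term W and reads off the intercept
(\hat{b}, \hat{W}) = \arg\min_{b,\, W} \sum_i w_i(q)\,
  \lVert v_i - b - W (k_i - q) \rVert^2,
\qquad \hat{f}_{\mathrm{LL}}(q) = \hat{b}
```

In classical kernel regression, the slope term is what reduces the bias of the locally constant fit near the boundary of the data, which is one standard motivation for preferring the local linear form.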