arXiv · 2026 · avg 6.14 · interest 9.8096 HF · vision-language-action · embodied intelligence · robot generalization

The paper presents Qwen-VLA, a unified vision-language-action model that covers manipulation, navigation, and trajectory prediction across environments and robot embodiments. It extends Qwen's vision-language stack with a DiT action decoder, embodiment-aware prompting, and large-scale joint pretraining, and reports strong benchmark and out-of-distribution generalization results.

Source-first digest for checked paper rank 1, rank_id p002.

Motivation / Background

Embodied AI is still split across narrow model families: robot manipulation policies, navigation agents, egocentric human-motion predictors, and vision-language systems are often trained with different architectures, output conventions, and evaluation protocols. Qwen-VLA argues that these tasks share the same core conditional prediction problem: given visual context, language, and an embodiment/control description, predict a future action or trajectory sequence.

The paper's main move is to extend a strong Qwen3.5 vision-language backbone with a DiT-style flow-matching action expert, then train it over a heterogeneous mixture of robot, human, simulation, navigation, and vision-language data. The intended payoff is a single model that can switch between task families and robot embodiments through text prompts rather than task-specific heads; the full architecture is summarized in Figure 1.

Figure 1. Overview of Qwen-VLA. The original caption describes a unified embodied model trained on mixed manipulation, navigation, and vision-language understanding data to generate both robot actions and textual responses. I place it here because it is the clearest high-level evidence for the paper's unification claim.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Manipulation, navigation, egocentric action modeling, and trajectory prediction can be handled as one action-and-trajectory prediction problem with a shared VLM + action decoder. | 4 | overview, action interface, simulation manipulation, navigation |
| C2 | Embodiment-aware prompting plus a padded/masked action tensor lets one architecture cover different robot platforms, action dimensions, control rates, and horizons. | 4 | action interface, robot embodiments, projection ablation, state conditioning |
| C3 | The staged recipe, especially text-to-action DiT pretraining before visual grounding, improves action learning and downstream performance. | 5 | training recipe, T2A ablations, RL stages |
| C4 | Large-scale joint pretraining transfers to real ALOHA manipulation and improves out-of-distribution robustness. | 4 | data mixture, real ALOHA results, static OOD manipulation, real OOD qualitative evidence, limitations |
| C5 | A single generalist can match or exceed many specialist manipulation policies across several simulation benchmarks. | 5 | simulation manipulation |
| C6 | The same model family can compete in VLN-CE navigation and zero-shot dynamic manipulation. | 4 | navigation, DOMINO |
| C7 | Vision-language co-training helps difficult embodied tasks without obvious degradation on simpler action benchmarks. | 4 | VL co-training |
| C8 | Proprioceptive state is not essential for the reported RoboTwin setting; visual observations plus embodiment prompts are almost enough. | 3 | state conditioning |

Scores are support-from-paper scores, not independent reproduction scores. Broad deployment claims are capped below 5 when the evidence is strong but limited by benchmark scope, short-horizon evaluation, or single-platform real-world validation.

Core Technical Idea

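Qwen-VLA has two modules: a Qwen3.5 vision-language backbone and a DiT-style flow-matching action expert that decodes continuous actions.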

At time step \(t\), all tasks are written as:

$$ p_\theta(y_{t:t+H-1} \mid o_t, x, e, z). $$

Here \(o_t\) is visual context, \(x\) is the task instruction, \(e\) is the embodiment/control prompt, and \(z\) is an optional task-family identifier. The target \(y\) is a future action or trajectory sequence: robot end-effector deltas, joint positions, navigation waypoints, human hand trajectories, or related structured motion.

The model does not claim that all robots share the same physical action semantics. Instead, it standardizes the tensor interface. Each dataset keeps its native control convention; a sample with \(c\) action channels fills the leading \(c\) channels of a fixed \(K\)-channel action tensor, and the remaining \(K-c\) padded channels are masked. The flow-matching action loss first averages the squared velocity error over the valid horizon steps of each active channel (mask \(M_{h,k}\)), then averages across the \(c\) active channels:

$$ \ell_k = \frac{\sum_{h=1}^{H} M_{h,k} \left\| \left(v_\theta(\mathbf{Y}_\tau,\tau \mid o_{1:t},x,e,z) - (\mathbf{Y}_1-\mathbf{Y}_0)\right)_{h,k} \right\|_2^2} {\sum_{h=1}^{H} M_{h,k}}, $$
$$ \mathcal{L}_{act} = \mathbb{E}_{\tau,\mathbf{Y}_0,\mathbf{Y}_1} \left[\frac{1}{c}\sum_{k=0}^{c-1}\ell_k\right]. $$

This is the core engineering pattern: keep the VLM/action decoder shared, put embodiment semantics in text, and make the continuous action side robust to different dimensions through padding, masks, and per-dataset normalization.
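The padding-and-masking interface and the two-level averaging in \(\mathcal{L}_{act}\) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the helper names `pad_actions`, `interpolate`, and `masked_fm_loss` are hypothetical, and the straight-line interpolation \(\mathbf{Y}_\tau = (1-\tau)\mathbf{Y}_0 + \tau\mathbf{Y}_1\) is an assumption inferred from the regression target \(\mathbf{Y}_1-\mathbf{Y}_0\).

```python
def pad_actions(actions, k_max):
    """Pad an H x c action chunk to H x k_max and build the matching 0/1 mask."""
    padded, mask = [], []
    for row in actions:
        c = len(row)
        padded.append(list(row) + [0.0] * (k_max - c))
        mask.append([1.0] * c + [0.0] * (k_max - c))
    return padded, mask


def interpolate(y0, y1, tau):
    """Assumed linear path between noise y0 and target y1 at flow time tau."""
    return [[(1.0 - tau) * a + tau * b for a, b in zip(r0, r1)]
            for r0, r1 in zip(y0, y1)]


def masked_fm_loss(v_pred, y0, y1, mask, c):
    """Mean over valid steps per channel, then mean over the c active channels."""
    horizon = len(y1)
    total = 0.0
    for k in range(c):
        num = sum(mask[h][k] * (v_pred[h][k] - (y1[h][k] - y0[h][k])) ** 2
                  for h in range(horizon))
        # Assumed: every active channel has at least one valid step.
        den = sum(mask[h][k] for h in range(horizon))
        total += num / den
    return total / c


# Hypothetical 2-step chunk for a 2-channel robot inside a 4-channel tensor.
y1, mask = pad_actions([[1.0, 2.0], [3.0, 4.0]], k_max=4)
y0 = [[0.0] * 4 for _ in y1]            # stand-in for sampled noise
y_tau = interpolate(y0, y1, tau=0.5)     # what the network would see at tau = 0.5
perfect = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(y0, y1)]
print(masked_fm_loss(perfect, y0, y1, mask, c=2))
```

Because the loop only visits the first `c` channels, whatever the network predicts in the padded channels has no effect on the loss, which is the property the masked interface relies on.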

Method Details

Embodiment Prompting

Each training example is prepended with a prompt of this form:

The robot is {robot_tag} with {single arm / dual arms}[, waist][, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.

The prompt supplies platform, arm configuration, control convention, control frequency, and prediction horizon. Table 1 shows the range of robot and human embodiments represented by this interface. At deployment, the paper says the prompt can be swapped to describe a new physical robot while leaving the model architecture unchanged.
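The template above maps directly onto a small string builder. The sketch below is only an illustration of that template: the function name `build_embodiment_prompt`, its argument names, and the example control rate and chunk size are hypothetical and not taken from the paper.

```python
def build_embodiment_prompt(robot_tag, dual_arm, fps, chunk_size, instruction,
                            has_waist=False, has_mobile_base=False):
    """Fill the embodiment prompt template with platform and control details."""
    body = "dual arms" if dual_arm else "single arm"
    if has_waist:
        body += ", waist"
    if has_mobile_base:
        body += ", and mobile base"
    task = instruction.rstrip(".")  # avoid a doubled period at the end
    return (f"The robot is {robot_tag} with {body}. "
            f"The control frequency is {fps} Hz. "
            f"Please predict the next {chunk_size} control actions "
            f"to execute the following task: {task}.")


# Hypothetical values: 5 Hz control and an 8-step chunk are placeholders.
print(build_embodiment_prompt("WidowX", False, 5, 8, "put the spoon on the towel"))
```

Swapping `robot_tag`, the arm configuration, or the control rate changes only the text the model conditions on, which is how the digest describes retargeting to a new platform.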

| Robot | Arms | Action type |
| --- | --- | --- |
| WidowX | Single | \(\Delta\)EEF + G |
| Google Robot | Single | \(\Delta\)EEF + G |
| Franka Panda | Single / Dual | \(\Delta\)EEF + G; Abs Joint + G |
| ARX5 | Dual | \(\Delta\)EEF + G |
| Fourier GR-1 | Dual | \(\Delta\)EEF + G |
| Mobile ALOHA | Dual | \(\Delta\)EEF + G; Abs Joint + G |
| AgiBot A2-D | Dual | Abs Joint + G; Abs Joint + DH |
| Galaxea R1 | Dual | Abs Joint + G |
| AIRBOT MMK2 | Dual | Abs Joint + DH |
| TienKung | Dual | Abs Joint + G; Abs Joint + DH |
| Real Human | Dual | \(\Delta\)EEF from MANO |

Table 1. Representative robot embodiments. EEF means end-effector pose, G means gripper, DH means dexterous hand. This table supports the claim that the training corpus spans single-arm, dual-arm, dexterous, and human-motion formats.

Data Mixture

Table 2 shows that the pretraining mixture is dominated by robot manipulation but deliberately keeps navigation, synthetic simulation, and vision-language supervision in the same recipe.

| Data source | Proportion |
| --- | --- |
| Robot manipulation trajectories | 74.2% |
| Human egocentric trajectories | 6.0% |
| Navigation trajectories | 7.5% |
| Synthetic simulation trajectories | 3.7% |
| General vision-language data | 3.4% |
| Spatial grounding, 2D | 2.5% |
| Autonomous driving VQA | 2.4% |
| Fine-grained embodied action caption | 0.2% |
| Total | 100.0% |

The paper's rationale for keeping non-action supervision in the mix is that action data teaches control, while VQA, captioning, spatial grounding, and navigation preserve perception, object vocabulary, spatial semantics, and instruction following. Table 3 illustrates the fine-grained action-caption supervision used to make language more operational.

The robot manipulation portion includes public real-robot datasets, in-house trajectories, simulation trajectories, and more than 10,000 hours of heterogeneous robot interaction. The egocentric data contributes future wrist and hand-articulation targets, including PCA-reduced hand-pose coefficients. The synthetic pipeline contributes both vision-conditioned and text-only action data. The navigation data contributes waypoint-style continuous decisions.
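One way to read the reported proportions is as sampling weights over data sources. The sketch below makes that assumption explicit; the digest does not say how the paper actually schedules batches, and the dictionary keys and the `sample_sources` helper are hypothetical names.

```python
import random

# Proportions (percent) as reported in the data-mixture table.
MIXTURE = {
    "robot_manipulation": 74.2,
    "human_egocentric": 6.0,
    "navigation": 7.5,
    "synthetic_simulation": 3.7,
    "general_vision_language": 3.4,
    "spatial_grounding_2d": 2.5,
    "autonomous_driving_vqa": 2.4,
    "fine_grained_action_caption": 0.2,
}


def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to the table values."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(10_000)
share = draws.count("robot_manipulation") / len(draws)
print(f"robot_manipulation share in sample: {share:.3f}")
```

`random.choices` normalizes the weights itself, so the small rounding gap between the listed rows and the 100.0% total does not need special handling.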

| Coarse label | Fine-grained action caption |
| --- | --- |
| Pick up, rotate, and place the ceramic bowl. | Step 1: Pick up the ceramic bowl from the right far edge. Step 2: Rotate the bowl clockwise for two full circles. Step 3: Place the bowl at the center of the table. |

Table 3. Fine-grained embodied action caption example. This table is included because it shows how the paper tries to bridge coarse task labels and richer action-language supervision.

Figure 2. Training recipe of Qwen-VLA.
Figure 2. Training recipe. Stage I trains the DiT decoder text-to-action without visual input. Stage II grounds the decoder in vision during continued pretraining. Stage III branches into multi-task and real-robot SFT. Stage IV applies RL for closed-loop success.

The recipe in Figure 2 is staged because the VLM is already pretrained while the action decoder starts from random initialization. The paper frames text-to-action (T2A) pretraining as a "decompression" stage: a short language instruction and embodiment prompt must expand into a high-dimensional trajectory. Once the DiT has a language-indexed action prior, visual grounding can happen during continued pretraining (CPT) without making every early update pay the full cost of image conditioning.

Synthetic Simulation

Figure 3 shows how the synthetic data covers both short atomic manipulation tasks and longer compositional task sequences.

Figure 3. Synthetic simulation examples. The original caption contrasts short-horizon placement with a longer compositional grouping task. I include it with the data section because it shows why the synthetic data is meant to cover both atomic manipulation and multi-step instructions.

The synthetic vision-conditioned data uses an internal early version of RoboInF with IsaacLab and cuRobo. The paper reports 20 tabletop scenes, 10 object-pose configurations per scene, 450 manipulation tasks, 300 successful trajectories per task, and visual/control randomization. It also segments motion-planned trajectories into subtask trajectories, so the model sees both whole-task and intermediate-stage supervision.

Objectives And Post-Training

The joint pretraining objective combines continuous action flow matching with next-token vision-language loss:

$$ \mathcal{L}_{vl} = -\sum_i \log p_\theta(w_i \mid w_{< i}, o_{1:t}), $$
$$ \mathcal{L} = \lambda_{act}\mathcal{L}_{act} + \lambda_{vl}\mathcal{L}_{vl}. $$

After CPT, the paper defines Qwen-VLA-Base. SFT jointly fine-tunes on vision-language samples, navigation episodes, and manipulation demonstrations. The SFT weights are reported as 0.1 for vision-language next-token prediction and 1.0 for manipulation/navigation action prediction.
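The weighted objective with the reported SFT weights can be written out as a tiny sketch. The function names are hypothetical, and the code assumes the two weights are applied as simple scalar multipliers on the two loss terms, which is the most direct reading of the formula but is not confirmed in detail by the digest.

```python
import math


def next_token_nll(token_probs):
    """L_vl for one sequence: negative sum of log-probabilities of the target tokens."""
    return -sum(math.log(p) for p in token_probs)


def joint_loss(l_act, l_vl, lambda_act=1.0, lambda_vl=0.1):
    """L = lambda_act * L_act + lambda_vl * L_vl, defaults set to the reported SFT weights."""
    return lambda_act * l_act + lambda_vl * l_vl


# Hypothetical numbers: an action loss of 0.8 and three target tokens.
l_vl = next_token_nll([0.9, 0.5, 0.7])
print(joint_loss(l_act=0.8, l_vl=l_vl))
```

With these weights, a unit change in the vision-language loss moves the total ten times less than the same change in the action loss.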

RL uses PPO with sparse binary success rewards in simulation. For the actor:

$$ \mathcal{L}^{actor}(\theta) = -\mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t, \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t \right) \right], $$

and the total loss is:

$$ \mathcal{L}(\theta)=\mathcal{L}^{actor}(\theta)+c_v\mathcal{L}^{value}(\theta). $$
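The clipped actor objective and the combined loss are standard PPO and can be sketched as follows. The clip range `eps=0.2` and value coefficient `c_v=0.5` are common defaults used here purely as placeholders; the digest does not report the paper's values, and the function names are hypothetical.

```python
def ppo_actor_loss(ratios, advantages, eps=0.2):
    """Negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * a, clipped * a))
    return -sum(terms) / len(terms)


def ppo_total_loss(actor_loss, value_loss, c_v=0.5):
    """L = L_actor + c_v * L_value."""
    return actor_loss + c_v * value_loss


# Hypothetical rollout: the first ratio is clipped from above, the second from below.
actor = ppo_actor_loss(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
print(ppo_total_loss(actor, value_loss=0.3))
```

The `min` keeps the update pessimistic: a positive-advantage sample stops contributing extra gradient once its ratio exceeds `1 + eps`, and a negative-advantage sample does the same below `1 - eps`.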

The paper calls out a nontrivial PPO detail: flow-matching policies do not expose autoregressive token probabilities. Qwen-VLA estimates log-probabilities by treating denoising transitions as Gaussian steps in a stochastic process, storing denoising states during rollout, and recomputing the velocity field under the current policy during PPO updates.
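A rough sketch of that idea: if each stored denoising transition is modeled as a Gaussian step around a deterministic Euler update, its log-probability has a closed form, and the PPO ratio follows from re-evaluating the velocity field on the stored states. Everything specific here is an assumption for illustration, including the isotropic noise scale `sigma`, the Euler mean `y + v * dt`, the summation over steps, and the function names; the paper's exact formulation may differ.

```python
import math


def gaussian_step_logprob(y, y_next, velocity, dt, sigma):
    """log N(y_next; y + velocity * dt, sigma^2 I) for one denoising transition."""
    logp = 0.0
    for yi, yn, vi in zip(y, y_next, velocity):
        z = (yn - (yi + vi * dt)) / sigma
        logp += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return logp


def chain_log_ratio(stored_steps, v_old_fn, v_new_fn, dt, sigma):
    """Sum of per-step log-prob differences; exp() of this is the PPO ratio."""
    total = 0.0
    for tau, y, y_next in stored_steps:
        new = gaussian_step_logprob(y, y_next, v_new_fn(y, tau), dt, sigma)
        old = gaussian_step_logprob(y, y_next, v_old_fn(y, tau), dt, sigma)
        total += new - old
    return total


# Hypothetical stored rollout: (flow time, state, next state) for two steps.
steps = [(0.0, [0.0, 0.0], [0.05, -0.02]), (0.5, [0.05, -0.02], [0.10, 0.00])]
old_policy = lambda y, tau: [0.4, -0.1]   # stand-in velocity field at rollout time
new_policy = lambda y, tau: [0.5, -0.1]   # stand-in velocity field after an update
ratio = math.exp(chain_log_ratio(steps, old_policy, new_policy, dt=0.1, sigma=0.3))
print(ratio)
```

Storing the states `(tau, y, y_next)` during rollout is what makes the recomputation possible: only the velocity field changes between the old and the current policy, while the transitions themselves stay fixed.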

Experiments And Results

Simulation Manipulation

Table 4 compares Qwen-VLA against specialist manipulation policies across simulated manipulation benchmarks.

| Method | Type | LIBERO | RoboCasa-GR1 | Simpler-WidowX | RoboTwin-Easy | RoboTwin-Hard |
| --- | --- | --- | --- | --- | --- | --- |
| \(\pi_0\) | Specialist | 94.4 | - | - | 65.9 | 58.4 |
| StarVLA-OFT | Specialist | 96.6 | 48.8 | 64.6 | 50.4 | - |
| GR00T N1.6 | Specialist | 97.2 | 49.9 | 63.2 | 47.6 | - |
| \(\pi_{0.5}\) | Specialist | 97.6 | 37.0 | 46.9 | 82.7 | 76.8 |
| ABot-M0 | Specialist | 98.6 | 58.3 | - | 86.0 | 85.0 |
| Being-H0.5 | Specialist | 97.6 | 53.3 | - | - | - |
| Qwen-VLA-Base | Generalist | 90.8 | 40.4 | 64.3 | 64.3 | 66.4 |
| Qwen-VLA-Instruct | Generalist | 97.9 | 56.7 | 73.7 | 86.1 | 87.2 |

Table 4. Robot manipulation results across benchmarks. Qwen-VLA-Instruct is best on Simpler-WidowX and both RoboTwin splits, second on LIBERO and RoboCasa-GR1. This directly supports C5.

The main nuance is that the paper compares a single generalist against specialists that are usually adapted per benchmark. The generalist claim is strongest where Qwen-VLA-Instruct wins or nearly matches the best specialist. It is weaker on RoboCasa-GR1, where ABot-M0 remains higher at 58.3 versus 56.7.

Real ALOHA Manipulation

Figure 4 shows the physical ALOHA task families, with in-domain and OOD results summarized in Table 5 and Table 6.

Figure 4. Real-world ALOHA evaluation tasks.
Figure 4. ALOHA task overview. This figure is included because it defines the real-world short-horizon, long-horizon, and fine-grained manipulation tasks used for transfer evaluation.

| Model | Pick and Place | Table Cleaning | Bowl Stacking | Bowl Pick and Place | Towel Folding | Fine-grained Manipulation | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GR00T N1.6 | 30.8 | 38.5 | 53.8 | 19.2 | 19.2 | 10.3 | 28.6 |
| \(\pi_{0.5}\) | 73.1 | 84.6 | 88.5 | 69.2 | 80.8 | 33.3 | 71.6 |
| Qwen-VLA-aloha, w/o pretrain | 30.8 | 53.8 | 61.5 | 64.1 | 50.0 | 30.8 | 48.5 |
| Qwen-VLA-aloha, w/ pretrain | 96.2 | 92.3 | 98.7 | 87.2 | 65.4 | 61.5 | 83.6 |

Table 5. Real-world in-domain performance. Fine-tuning from Qwen-VLA-Base raises the ALOHA average from 48.5 to 83.6, which is the paper's clearest real-world transfer evidence.

| Model | Color | Instance | Position | Background | Instruction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GR00T N1.6 | 46.2 | 38.5 | 3.8 | 19.2 | 19.2 | 25.4 |
| \(\pi_{0.5}\) | 57.7 | 61.5 | 19.2 | 26.9 | 42.3 | 41.5 |
| Qwen-VLA-aloha, w/o pretrain | 42.3 | 30.8 | 34.6 | 30.8 | 42.3 | 36.2 |
| Qwen-VLA-aloha, w/ pretrain | 88.5 | 76.9 | 53.8 | 80.8 | 84.6 | 76.9 |

Table 6. Real-world OOD performance. The pretrained ALOHA variant is best in all five reported generalization categories. This supports the transfer claim, though the evidence still comes from a single physical platform.

Navigation

Table 7 tests whether the same model family can operate in VLN-CE navigation rather than only robot manipulation.

| Method | R2R NE ↓ | R2R OS ↑ | R2R SR ↑ | R2R SPL ↑ | RxR NE ↓ | RxR SR ↑ | RxR SPL ↑ | RxR nDTW ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NaVid | 5.7 | 49.2 | 41.9 | 36.5 | 5.7 | 45.7 | 38.2 | - |
| Uni-NaVid | 5.6 | 53.3 | 47.0 | 42.7 | 6.2 | 48.7 | 40.9 | - |
| NaVILA | 5.2 | 62.5 | 54.0 | 49.0 | 6.8 | 49.3 | 44.0 | 58.8 |
| StreamVLN | 5.0 | 64.2 | 56.9 | 51.9 | 6.2 | 52.9 | 46.0 | 61.9 |
| Qwen-VLA-Base | 5.2 | 61.7 | 53.8 | 49.4 | 6.4 | 55.1 | 45.8 | 56.2 |
| Qwen-VLA-Instruct | 5.1 | 69.0 | 57.5 | 51.2 | 5.8 | 59.6 | 47.8 | 57.1 |

Table 7. VLN-CE results. Qwen-VLA-Instruct leads R2R Oracle Success and Success Rate, and leads RxR SR/SPL. StreamVLN remains best on R2R NE/SPL and RxR nDTW, so the navigation claim is competitive rather than dominant.

Static OOD Manipulation

The SimplerEnv-OOD benchmark fine-tunes only on simple Bridge pick-and-place data, then evaluates unseen spatial relations, reversed color-object bindings, and novel manipulation primitives.

Table 8 isolates these static out-of-distribution manipulation shifts.

| Method | MoveAway | MoveRight | PlaceNear | PlaceRight | PutFront | StackYellow | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \(\pi_{0.5}\) | 26.1 | 0.0 | 0.0 | 32.1 | 13.0 | 4.2 | 12.6 |
| Qwen-VLA-Base | 31.3 | 31.6 | 16.7 | 47.1 | 6.3 | 18.8 | 25.3 |
| Qwen-VLA-Instruct | 43.8 | 33.3 | 39.6 | 47.9 | 4.2 | 22.9 | 32.0 |

Table 8. SimplerEnv-OOD. Qwen-VLA-Instruct improves the average from 12.6 for \(\pi_{0.5}\) to 32.0. The one exception is PutFront, where \(\pi_{0.5}\) is higher.

Dynamic OOD Manipulation

Table 9 evaluates zero-shot dynamic manipulation on DOMINO against both dynamic-data fine-tuned and zero-shot baselines.

| Method | Training category | SR % ↑ | MS ↑ |
| --- | --- | --- | --- |
| OpenVLA | Fine-tuned on dynamic data | 1.5 | 6.1 |
| RDT-1B | Fine-tuned on dynamic data | 5.3 | 17.7 |
| \(\pi_0\) | Fine-tuned on dynamic data | 8.2 | 24.0 |
| \(\pi_{0.5}\) | Fine-tuned on dynamic data | 9.6 | 26.2 |
| InternVLA-M1 | Fine-tuned on dynamic data | 5.4 | 27.6 |
| VLA-Adapter | Fine-tuned on dynamic data | 4.4 | 24.3 |
| \(\pi_0\)-FAST | Fine-tuned on dynamic data | 3.5 | 20.9 |
| OpenVLA-OFT | Fine-tuned on dynamic data | 9.1 | 24.1 |
| StarVLA-OFT | Fine-tuned on dynamic data | 10.9 | 30.5 |
| PUMA | Fine-tuned on dynamic data | 17.2 | 35.0 |
| OpenVLA-OFT | Zero-shot to dynamic manipulation | 6.7 | 20.0 |
| \(\pi_{0.5}\) | Zero-shot to dynamic manipulation | 7.5 | 20.4 |
| LingBot-VLA w/ depth | Zero-shot to dynamic manipulation | 11.8 | 26.7 |
| LingBot-VA | Zero-shot to dynamic manipulation | 24.1 | 36.1 |
| Qwen-VLA-Base | Zero-shot to dynamic manipulation | 21.1 | 37.4 |
| Qwen-VLA-Instruct | Zero-shot to dynamic manipulation | 26.6 | 39.5 |

Table 9. DOMINO. Qwen-VLA-Instruct is best overall in success rate and manipulation score. This is notable because the paper says it uses only current-frame observations and no dynamic-manipulation fine-tuning.

The qualitative ALOHA examples in Figure 5 show what the real-world OOD categories look like, but they should be read alongside the quantitative OOD table.

Figure 5. Qualitative out-of-distribution generalization.
Figure 5. Qualitative real-world OOD. The figure shows color-conditioned grasping, novel object grasping, compositional clean-up, unseen object interaction, and background robustness on ALOHA. This is qualitative evidence rather than a substitute for the quantitative OOD table.

T2A And Training Ablations

Figure 6 is the main evidence for the staged text-to-action initialization recipe.

Figure 6. T2A pretraining ablations.
Figure 6. T2A ablations. The paper reports the best SFT success rate at about 20% synthetic + 80% real T2A data, full-sequence prediction instead of chunk prediction, no image tokens during T2A, Sigmoid-Normal timestep sampling at T2A with Beta at SFT, and a 2,000-step T2A duration.

Key numbers from the text:

Figure 7 shows the effect of mixing vision-language data with VLA action training.

Figure 7. Vision-language co-training ablations.
Figure 7. Vision-language co-training. VL+VLA improves harder object/compositional benchmarks while staying comparable on simpler ones. The caption reports +4.9 pp on RoboCasa-GR1 and +4.6 pp on RoboTwin-2.0, and faster/higher convergence with a pretrained DiT.

The projection design ablation in Table 10 motivates the paper's lightweight zero-padding interface.

| Benchmark | Bridge Only | Robocasa Only | Multi-MLP | Concat. | Zero-Pad |
| --- | --- | --- | --- | --- | --- |
| Bridge | 62.8 | - | 63.3 | 63.0 | 63.0 |
| Robocasa | - | 53.4 | 52.1 | 52.8 | 53.2 |

Table 10. Projection design ablation. Co-training is close to single-embodiment training. Zero-Padding is adopted because it is architecturally light: \(2h d_{\max}\) projection parameters instead of \(2h\sum_i d_i\).
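To make the parameter comparison concrete, the two counts can be evaluated for a made-up configuration. The hidden width `h = 1024` and the per-embodiment action dimensions `[7, 14, 26]` below are hypothetical placeholders, not values from the paper; only the two formulas come from the caption.

```python
def zero_pad_params(h, dims):
    """Shared projection sized for the largest action dimension: 2 * h * d_max."""
    return 2 * h * max(dims)


def per_embodiment_params(h, dims):
    """Separate projections for every embodiment: 2 * h * sum(d_i)."""
    return 2 * h * sum(dims)


h = 1024                # hypothetical hidden width
dims = [7, 14, 26]      # hypothetical action dimensions for three embodiments
shared = zero_pad_params(h, dims)
separate = per_embodiment_params(h, dims)
print(shared, separate, round(separate / shared, 2))
```

The shared count depends only on the largest action dimension, so adding another embodiment with a smaller action space adds no projection parameters, while the per-embodiment count keeps growing with each addition.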

Table 11 separates the cumulative contribution of CPT, supervised fine-tuning, and RL.

| Stage | Simpler | RoboCasa | RoboTwin-E | RoboTwin-H | LIBERO | SimplerOOD | DOMINO SR | DOMINO MS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPT | 64.3 | 40.4 | 64.3 | 66.4 | 90.8 | 25.3 | 21.1 | 37.4 |
| + SFT | 70.8 | 56.0 | 86.3 | 87.1 | 97.8 | 31.6 | 25.7 | 39.1 |
| + RL | 73.7 | 56.7 | 86.1 | 87.2 | 97.9 | 32.0 | 26.6 | 39.5 |

Table 11. Cumulative post-training effect. SFT contributes the large jump; RL adds smaller gains, the largest on Simpler (+2.9), with a 0.2-point dip on RoboTwin-E, and does not erase held-out performance in the reported table.
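The per-stage contributions can be read off by differencing the rows of the cumulative table. The short script below does only that arithmetic on the reported numbers; the variable and function names are my own.

```python
BENCHMARKS = ["Simpler", "RoboCasa", "RoboTwin-E", "RoboTwin-H",
              "LIBERO", "SimplerOOD", "DOMINO SR", "DOMINO MS"]
CPT = [64.3, 40.4, 64.3, 66.4, 90.8, 25.3, 21.1, 37.4]
SFT = [70.8, 56.0, 86.3, 87.1, 97.8, 31.6, 25.7, 39.1]
RL = [73.7, 56.7, 86.1, 87.2, 97.9, 32.0, 26.6, 39.5]


def stage_gain(before, after):
    """Per-benchmark change between two stages, rounded to one decimal."""
    return [round(a - b, 1) for b, a in zip(before, after)]


sft_gain = stage_gain(CPT, SFT)
rl_gain = stage_gain(SFT, RL)
for name, s, r in zip(BENCHMARKS, sft_gain, rl_gain):
    print(f"{name:<11} SFT {s:+.1f}   RL {r:+.1f}")
print(f"mean SFT gain {sum(sft_gain) / len(sft_gain):.2f}, "
      f"mean RL gain {sum(rl_gain) / len(rl_gain):.2f}")
```

The differences show SFT gains of more than 15 points on RoboCasa and both RoboTwin splits, while every RL change is under 3 points in magnitude.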

Table 12 is the narrow basis for the claim that explicit proprioceptive state is only marginally useful in this RoboTwin setup.

| Conditioning | RoboTwin-Easy | RoboTwin-Hard |
| --- | --- | --- |
| No State | 88.7 | 87.4 |
| State in VLM Prompt | 89.3 | 88.7 |
| State in DiT | 89.4 | 88.3 |

Table 12. State conditioning ablation. Explicit proprioceptive state gives only marginal improvements in this setting. The evidence supports a narrow claim: on RoboTwin-2.0, the paper's visual setup plus relative action prediction makes extra state less important.

Practical Takeaways

The limitations section explicitly says embodied action data is still much smaller than vision-language pretraining data, joint VL/navigation/action optimization creates trade-offs, and evaluations remain short-horizon and benchmark-driven. Those caveats matter when interpreting the "unified embodied foundation model" framing.