Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA addresses the fragmentation of embodied-intelligence models by casting manipulation, navigation, and trajectory prediction as a single vision-language-action problem. It extends the Qwen stack with a DiT-based (Diffusion Transformer) action decoder, embodiment-aware prompt conditioning, and large-scale joint pretraining. The reported experiments show strong multi-task and out-of-distribution performance in both simulated and real-robot settings.
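
The summary does not say how embodiment-aware prompt conditioning is formatted. As a rough illustration only, the sketch below assumes the robot's embodiment is described in a short text prefix placed before the task instruction, so that one model can be told which action space it is driving. The `Embodiment` fields, the `build_prompt` helper, and the bracketed prefix layout are all hypothetical and are not taken from Qwen-VLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot body (not a Qwen-VLA type)."""
    name: str          # illustrative label, e.g. "7dof_arm"
    action_dim: int    # number of values in one action vector
    control_mode: str  # illustrative control convention


def build_prompt(embodiment: Embodiment, task: str, instruction: str) -> str:
    """Prefix the instruction with embodiment and task tags (assumed layout)."""
    header = (
        f"[embodiment={embodiment.name}; "
        f"action_dim={embodiment.action_dim}; "
        f"control={embodiment.control_mode}]"
    )
    return f"{header} [task={task}] {instruction}"


# Two hypothetical embodiments that share one prompt format.
arm = Embodiment("7dof_arm", 7, "end_effector_delta")
base = Embodiment("mobile_base", 2, "velocity")

arm_prompt = build_prompt(arm, "manipulation", "pick up the red cup")
base_prompt = build_prompt(base, "navigation", "go to the kitchen door")
```

Under this assumption, the same instruction produces different conditioning text for different robots, which is one plausible way a shared model could distinguish action spaces of different dimensionality.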