ICML 2026 · avg 6.27 · interest 8.80 · 145 HF · GUI agents · multimodal data · interaction trajectories

This paper tackles the lack of large, diverse training data for GUI agents. Video2GUI automatically filters internet tutorial videos and converts them into grounded interaction trajectories, producing WildGUI with 12.7 million trajectories. Continual pretraining on WildGUI improves GUI grounding and agent benchmarks.

Source-first digest for monthly 2026_05 rank 10, rank_id p019.

Motivation / Background

The paper addresses a specific bottleneck for generalized GUI agents: strong models need trajectories that connect instructions, visual observations, actions, coordinates, and multi-step state changes, but existing datasets are usually manually labeled, simulator-bound, platform-specific, or too small. The source comparison in Table 1 frames WildGUI as the scale jump: it covers website, mobile, and desktop environments with 12.7M instructions, 124.5M images, and 9.7 average turns.

The core bet is that internet tutorial videos already contain broad, real user behavior. The missing piece is not more manual labeling; it is an automated Video2GUI pipeline that can filter videos, extract task trajectories, and spatially ground actions. The overview in Figure 1 shows the three stages: coarse-to-fine video filtering, trajectory extraction, and action spatial grounding.

Figure 1. Video2GUI pipeline overview. The pipeline starts with large-scale video filtering, converts useful tutorial segments into instruction-trajectory records, and grounds low-level GUI actions to precise screen targets.

The authors start from over 500M YouTube metadata entries, reduce the pool to about 20M videos through metadata filtering, then retain 4.16M high-quality videos, about 300K hours, after content scoring. From that source pool, they construct WildGUI, a GUI pretraining dataset with 12.7M trajectories spanning more than 1,500 applications and websites.

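| Dataset | Website | Mobile | Desktop | Environments | Instructions | Images | Avg turns | Instruction level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiniWoB++ | Y | Y | N | 114 | 100 | 17,971 | 3.6 | Low-level |
| MIND2WEB | Y | N | N | 137 | 2,350 | 2,350 | 7.3 | High-level |
| AITW | N | Y | N | 357 | 30,378 | 715,142 | 6.5 | High and low |
| AndroidControl | N | Y | N | 833 | 14,538 | 15,283 | 4.8 | High and low |
| GUI-Net | Y | Y | Y | 280 | 1M | 1M | 4.7 | High-level |
| WildGUI | Y | Y | Y | 1,500+ | 12.7M | 124.5M | 9.7 | High and low |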

Table 1. Dataset scale comparison. The table is a compact transcription of the source dataset-comparison table, keeping representative baselines and the WildGUI row.
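The funnel figures and the WildGUI row above imply a few ratios that are easy to check by hand. The snippet below is a back-of-the-envelope sketch, not a calculation from the paper: the inputs are the rounded values quoted in this digest ("over 500M", "about 20M", "about 300K hours"), so the outputs are approximate, and the variable names are illustrative.

```python
# Rounded figures quoted in this digest; "over" and "about" values are approximations.
METADATA_ENTRIES = 500_000_000  # "over 500M" YouTube metadata entries
METADATA_KEPT = 20_000_000      # "about 20M" candidates after metadata filtering
VIDEOS_KEPT = 4_160_000         # 4.16M videos after content scoring
HOURS_KEPT = 300_000            # "about 300K hours" of retained video
TRAJECTORIES = 12_700_000       # 12.7M trajectories in WildGUI
IMAGES = 124_500_000            # 124.5M screenshots in WildGUI


def ratio(numerator: float, denominator: float) -> float:
    """Plain division, named so the derived quantities below read clearly."""
    return numerator / denominator


metadata_keep_rate = ratio(METADATA_KEPT, METADATA_ENTRIES)  # share surviving metadata filter
scoring_keep_rate = ratio(VIDEOS_KEPT, METADATA_KEPT)        # share surviving content scoring
minutes_per_video = 60 * ratio(HOURS_KEPT, VIDEOS_KEPT)      # mean retained video length
trajectories_per_video = ratio(TRAJECTORIES, VIDEOS_KEPT)    # mean trajectories per video
images_per_trajectory = ratio(IMAGES, TRAJECTORIES)          # mean screenshots per trajectory

if __name__ == "__main__":
    print(f"metadata keep rate:     {metadata_keep_rate:.1%}")
    print(f"scoring keep rate:      {scoring_keep_rate:.1%}")
    print(f"minutes per video:      {minutes_per_video:.2f}")
    print(f"trajectories per video: {trajectories_per_video:.2f}")
    print(f"images per trajectory:  {images_per_trajectory:.2f}")
```

Under these rounded inputs, roughly 4% of metadata entries survive the first filter and about a fifth of those survive content scoring. The screenshots-per-trajectory ratio lands near the reported 9.7-turn average, which is a loose consistency check rather than a verified relationship.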

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | GUI-agent progress is data-limited because existing trajectory datasets are costly, narrow, or hard to scale. | 5 | problem framing, dataset comparison |
| C2 | Video2GUI is a plausible automated source-first pipeline: metadata filtering narrows the web-scale pool, video scoring keeps high-quality tutorials, trajectory extraction creates instruction-action records, and spatial grounding attaches coordinates. | 4 | video filtering, trajectory and grounding, pipeline figure |
| C3 | WildGUI is substantially larger and broader than prior GUI-agent datasets, with 12.7M trajectories, 124.5M screenshots, and coverage across 1,500+ apps and websites. | 5 | dataset scale, dataset table, dataset statistics figures |
| C4 | Continual pretraining on WildGUI improves GUI grounding, offline agent, and online agent benchmarks for both Qwen2.5-VL and Mimo-VL. | 5 | training objective, grounding results, offline results, online results |
| C5 | Scaling WildGUI pretraining data up to 200B tokens continues to improve ScreenSpot-Pro and OSWorld-G, without clear saturation in the reported range. | 4 | scaling analysis, scaling figure |
| C6 | The three pretraining losses and the two-stage training recipe matter, especially trajectory modeling for online AndroidWorld and Stage 2 alignment for instruction following. | 4 | ablation evidence, ablation table |
| C7 | The pipeline is useful but not free: its highest-quality setting depends on Gemini-3-Pro annotation/grounding, expert-validated filters, and a one-time construction cost. | 4 | data quality evidence, video scoring table, cost caveats |

Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly backed by the paper's definitions, tables, or experiments; a score of 4 means the paper presents substantial evidence but still relies on internal model annotation, filtering thresholds, or non-public release assumptions.

Core Technical Idea

The coarse-to-fine filtering stage is built to avoid downloading and processing internet video blindly. First, a metadata classifier is trained from 10K DeepSeek-V3-labeled samples and applied to more than 500M YouTube metadata records, reducing the candidate pool to about 20M. Second, a Qwen2.5-Omni video quality scorer is trained on about 200 hours of Gemini-3-Pro-labeled video, scoring the first minute of each candidate for topic relevance, instruction clarity, and recording quality. The paper then keeps videos scoring at least 4.2 on all three dimensions, giving 4.16M videos and about 300K hours. Table 2 summarizes the filter.

| Stage | Model / signal | Training or annotation source | Output scale |
| --- | --- | --- | --- |
| Metadata filter | Qwen2.5-7B classifier head | 10K video metadata records labeled by DeepSeek-V3, with upsampling for positives | 500M+ metadata entries to about 20M candidates |
| Video quality scorer | Qwen2.5-Omni with three regression heads | About 200 hours labeled by Gemini-3-Pro across topic relevance, instruction clarity, recording quality | 20M candidates to 4.16M videos |
| Quality threshold | Minimum 4.2 on all three dimensions | Manual verification and scorer test set | About 300K hours of high-quality GUI tutorial content |

Table 2. Coarse-to-fine video filtering. The filter trades cheap metadata pruning against more expensive visual/audio scoring only after the candidate pool is smaller.
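The thresholding rule in the last row of Table 2 can be sketched as a simple predicate. This is a minimal illustration assuming the scorer emits one number per dimension; the `VideoScores` record, its field names, and the `video_id` values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List

THRESHOLD = 4.2  # minimum score on every dimension, as quoted in the digest


@dataclass(frozen=True)
class VideoScores:
    """Hypothetical container for the scorer's three regression outputs."""
    video_id: str
    topic_relevance: float
    instruction_clarity: float
    recording_quality: float


def passes_quality_filter(scores: VideoScores, threshold: float = THRESHOLD) -> bool:
    """Keep a video only if its weakest dimension still reaches the threshold."""
    weakest = min(
        scores.topic_relevance,
        scores.instruction_clarity,
        scores.recording_quality,
    )
    return weakest >= threshold


def filter_videos(scored: Iterable[VideoScores]) -> List[VideoScores]:
    """Return the subset of scored videos that clears all three dimensions."""
    return [s for s in scored if passes_quality_filter(s)]


# Illustrative inputs only; these scores are made up.
EXAMPLES = [
    VideoScores("vid_a", 4.5, 4.2, 4.8),  # weakest dimension is exactly 4.2: kept
    VideoScores("vid_b", 4.9, 4.9, 4.1),  # one dimension below 4.2: dropped
]
KEPT_IDS = [s.video_id for s in filter_videos(EXAMPLES)]
```

Taking the minimum makes the "all three dimensions" requirement explicit: a video with excellent topic relevance but poor recording quality is still rejected.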

The trajectory extraction stage asks Gemini-3-Pro to turn each tutorial video into task-level instruction-trajectory pairs. For a video \(V\), the extracted data are:

$$ \mathcal{D}(V) = \{(u^{(k)}, e^{(k)})\}_{k=1}^{N} $$

Each \(u^{(k)}\) is a natural-language task instruction and each \(e^{(k)}\) is an interaction trajectory with timestamped steps. Long videos are split into at most 4-minute segments, and later segments receive previous extraction results as text context so the annotator can preserve task continuity across segment boundaries.
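The segmentation and context carry-over described above can be sketched as follows. The digest only states the 4-minute cap and that earlier results are passed as text, so the fixed back-to-back windows, the newline-joined context string, and the `annotate_segment` callable are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

MAX_SEGMENT_S = 240.0  # "at most 4-minute segments"


def split_into_segments(
    duration_s: float, max_segment_s: float = MAX_SEGMENT_S
) -> List[Tuple[float, float]]:
    """Cut [0, duration_s] into back-to-back windows no longer than max_segment_s."""
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def annotate_video(
    duration_s: float,
    annotate_segment: Callable[[float, float, str], List[str]],
) -> List[str]:
    """Annotate segments in order, feeding earlier results to later calls as text.

    `annotate_segment(start, end, context)` stands in for the VLM annotator call;
    it is a hypothetical interface, not the paper's API.
    """
    results: List[str] = []
    for start, end in split_into_segments(duration_s):
        context = "\n".join(results)  # empty for the first segment
        results.extend(annotate_segment(start, end, context))
    return results
```

A 10-minute video would yield windows of 0-240 s, 240-480 s, and 480-600 s under this scheme, with the third call seeing the records produced by the first two.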

The action spatial grounding stage compensates for the low visual resolution of long-context video inputs. For an action at timestamp \(t\), the system extracts frames at \(t-0.5s\), \(t\), and \(t+0.5s\), then predicts the action target from the low-level instruction and action type:

$$ b_t = g_\phi(o_{t-0.5s}, o_t, o_{t+0.5s}, \tau_t). $$

The first frame with a valid grounding result is kept. Manual verification of 200 sampled actions reports more than 95% correct parameterization. The action vocabulary is summarized in Table 3.
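One way to read the "first valid frame" rule is as a short fallback loop over the three timestamps. The sketch below assumes the frames are tried one at a time in the order \(t-0.5s\), \(t\), \(t+0.5s\), which is an interpretation; the equation above could equally describe a single joint call over all three frames. The `grounder` callable and the clamping at zero are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical grounder: (frame timestamp, low-level instruction, action type) -> target or None
Grounder = Callable[[float, str, str], Optional[Point]]


def candidate_timestamps(t: float, offset: float = 0.5) -> List[float]:
    """Timestamps of the frames before, at, and after the action.

    Clamping the earlier timestamp at 0 is an assumption for actions near the video start.
    """
    return [max(0.0, t - offset), t, t + offset]


def ground_action(
    t: float, instruction: str, action_type: str, grounder: Grounder
) -> Optional[Tuple[float, Point]]:
    """Return (frame timestamp, target) for the first frame with a valid result."""
    for ts in candidate_timestamps(t):
        target = grounder(ts, instruction, action_type)
        if target is not None:
            return ts, target
    return None  # no frame produced a usable target; the action would be left ungrounded
```

Trying neighbouring frames helps when the annotated timestamp falls slightly before the target element appears or just after it has changed state.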

| Environment | Action groups | Parameter examples |
| --- | --- | --- |
| Desktop | click, doubleClick, tripleClick, rightClick, middleClick, press, input, hotkey, scroll, drag, moveTo, wait, finished | x, y; [keys]; text; direction, x, y, distance; start, end |
| Mobile | click, longpress, scroll, pinch, input, drag, press, open, multi_touch, finished | x, y; x, y, duration; app_name; pointers; status |

Table 3. Desktop and mobile action space. The digest table condenses the source action-space table while preserving the action coverage used by trajectory annotation.

The model training recipe has two stages. Stage 1 continually pretrains Qwen2.5-VL and Mimo-VL for one epoch on WildGUI, about 200B tokens, with three complementary tasks: GUI grounding, GUI action prediction, and GUI trajectory modeling. The mixed objective is:

$$ \mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{ground}} + \mathcal{L}_{\text{action}} + \mathcal{L}_{\text{traj}}. $$

Stage 2 fine-tunes on curated open-source GUI datasets for three epochs, about 15B tokens, to align the broad pretraining behavior with cleaner supervised instruction-following data.

The appendix also defines the training losses for the filter models. The metadata classifier uses binary cross-entropy:

$$ \mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N} \left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]. $$

The video scorer uses mean squared error over the three quality dimensions:

$$ \mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{3}(y_{ij}-\hat{y}_{ij})^2. $$
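Both filter losses can be written out directly from the formulas above. The following is a plain-Python sketch for checking the arithmetic, not the authors' training code; the `eps` clipping is an added numerical safeguard that does not appear in the formula.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy over N samples, as in the L_CE formula.

    Predictions are clipped to [eps, 1 - eps] to avoid log(0); the clipping is
    an assumption added here for numerical safety.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def quality_mse(
    y_true: Sequence[Sequence[float]], y_pred: Sequence[Sequence[float]]
) -> float:
    """Squared error summed over the three quality dimensions, averaged over N videos.

    This follows the L_MSE formula as written: it divides by N, not by 3N.
    """
    total = 0.0
    for row_true, row_pred in zip(y_true, y_pred):
        for target, estimate in zip(row_true, row_pred):
            total += (target - estimate) ** 2
    return total / len(y_true)
```

For example, a classifier that outputs 0.5 for every sample has a cross-entropy of \(\ln 2\) regardless of the labels, which gives a useful baseline when reading training curves.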

Method Details

WildGUI is positioned as a pretraining-scale dataset rather than a benchmark. It is built from 500M+ metadata entries, 20M filtered candidates, 4.16M retained videos, and about 300K hours of source tutorial content. The final source dataset contains 12.7M GUI operation trajectories, 124.5M screenshots, over 1,500 environments, and a 9.7-turn average. The distribution plots in Figure 6 and Figure 7 show why this matters: the dataset is intended to cover platforms, software categories, web categories, video durations, trajectory lengths, and action types rather than a single app family.

Figure 6. WildGUI platform and category distributions. The source caption describes distributions over platforms, software categories, and website categories.
Figure 7. WildGUI duration, step, and action-type statistics. The source caption covers video duration, steps per trajectory, and desktop/mobile action types.

The paper's extraction prompts ask for more than raw action labels. Each segment annotation includes user task instruction, dense caption, task plan, platform, application type, website name, and action trajectory. Each action stores timestamp, action type, low-level grounding instruction, action rationale, action parameters, and the expected interface change. That design is meant to train not only point-and-click behavior but also a rudimentary GUI world model.
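The annotation fields listed above map naturally onto a nested record. The dataclasses below are a hypothetical schema for illustration: the field names, the types, and the example values are assumptions, and the actual WildGUI release format may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ActionStep:
    """One step in a trajectory; field names are illustrative, not the release schema."""
    timestamp_s: float            # when the action occurs in the video
    action_type: str              # e.g. "click", "scroll", "input"
    grounding_instruction: str    # low-level instruction used for spatial grounding
    rationale: str                # why the action is taken
    parameters: Dict[str, Any]    # e.g. {"x": ..., "y": ...} after grounding
    expected_change: str          # the interface change the action should cause


@dataclass
class SegmentAnnotation:
    """Per-segment annotation combining task-level and step-level fields."""
    task_instruction: str
    dense_caption: str
    task_plan: List[str]          # assumed to be a list of plan steps
    platform: str                 # website, mobile, or desktop
    application_type: str
    website_name: str             # may be empty for non-web tasks (assumption)
    trajectory: List[ActionStep] = field(default_factory=list)


# Made-up record showing the shape only; none of these values come from WildGUI.
EXAMPLE = SegmentAnnotation(
    task_instruction="Rename the first sheet to 'Budget'",
    dense_caption="A user renames a worksheet tab in a spreadsheet editor.",
    task_plan=["Open the tab menu", "Choose rename", "Type the new name"],
    platform="desktop",
    application_type="spreadsheet",
    website_name="",
    trajectory=[
        ActionStep(
            timestamp_s=12.5,
            action_type="rightClick",
            grounding_instruction="Right-click the 'Sheet1' tab",
            rationale="The rename option lives in the tab context menu.",
            parameters={"x": 120, "y": 980},
            expected_change="A context menu appears above the tab.",
        )
    ],
)
EXAMPLE_DICT = asdict(EXAMPLE)
```

Keeping `expected_change` next to each action is what lets the same record supervise both action prediction and a next-state style objective.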

The examples in Figure 8, Figure 9, Figure 10, and Figure 11 show source WildGUI trajectory records rather than downstream benchmark outputs.

Figure 8. WildGUI example 1. Example trajectory visualization copied from the ranking figure cache.
Figure 9. WildGUI example 2. Example trajectory visualization copied from the ranking figure cache.
Figure 10. WildGUI example 3. Example trajectory visualization copied from the ranking figure cache.
Figure 11. WildGUI example 4. Example trajectory visualization copied from the ranking figure cache.

Experiments And Results

The authors evaluate three capabilities: GUI grounding, offline mobile/CN GUI-agent execution, and online interaction in OSWorld and AndroidWorld. The grounding results in Table 4 are the cleanest evidence that WildGUI helps both base architectures. Qwen2.5-VL-7B improves from 26.8 to 41.9 on ScreenSpot-Pro average and from 27.3 to 53.7 on OSWorld-G average. Mimo-VL-7B improves from 41.2 to 56.9 on ScreenSpot-Pro average and from 54.7 to 67.6 on OSWorld-G average, exceeding Qwen3-VL-32B on OSWorld-G average and matching or surpassing most open-source grounding baselines.

| Model | ScreenSpot-Pro Avg | OSWorld-G Avg | Notes |
| --- | --- | --- | --- |
| Seed1.5-VL | 60.9 | 62.9 | Proprietary baseline |
| Qwen3-VL-32B | 54.9 | 60.6 | Open-source baseline evaluated by authors |
| Qwen2.5-VL-7B | 26.8 | 27.3 | Base model |
| Qwen2.5-VL-7B + WildGUI | 41.9 | 53.7 | +15.1 ScreenSpot-Pro, +26.4 OSWorld-G |
| Mimo-VL-7B | 41.2 | 54.7 | Base model |
| Mimo-VL-7B + WildGUI | 56.9 | 67.6 | +15.7 ScreenSpot-Pro, +12.9 OSWorld-G |

Table 4. GUI grounding results. Compact transcription of the source ScreenSpot-Pro and OSWorld-G table, retaining the headline averages and gains.
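The gains in the Notes column of Table 4 are simple differences between the "+ WildGUI" and base rows. The short check below recomputes them from the table values; the dictionary layout and function name are illustrative.

```python
from typing import Dict, Tuple

# (ScreenSpot-Pro Avg, OSWorld-G Avg) pairs transcribed from Table 4.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Qwen2.5-VL-7B": {"base": (26.8, 27.3), "wildgui": (41.9, 53.7)},
    "Mimo-VL-7B": {"base": (41.2, 54.7), "wildgui": (56.9, 67.6)},
}


def gains(model: str) -> Tuple[float, float]:
    """Absolute point gains (ScreenSpot-Pro, OSWorld-G), rounded to one decimal."""
    before = RESULTS[model]["base"]
    after = RESULTS[model]["wildgui"]
    return (round(after[0] - before[0], 1), round(after[1] - before[1], 1))


if __name__ == "__main__":
    for name in RESULTS:
        print(name, gains(name))
```

The recomputed differences agree with the Notes column: the weaker base model gains more on OSWorld-G, while the ScreenSpot-Pro gains are nearly identical across the two architectures.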

The offline agent results in Table 5 show smaller but consistent gains. Mimo-VL-7B + WildGUI reaches 91.8 step success on AndroidControl-Low, 71.4 on AndroidControl-High, and 71.0 on CAGUI, all above the corresponding base model. Qwen2.5-VL also improves, especially on CAGUI, where type accuracy rises from 74.2 to 88.3 and step success from 55.2 to 65.4.

| Model | AndroidControl-Low Step SR | AndroidControl-High Step SR | CAGUI Type Acc. | CAGUI Step SR |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 85.0 | 62.9 | 74.2 | 55.2 |
| Qwen2.5-VL-7B + WildGUI | 90.3 | 64.5 | 88.3 | 65.4 |
| Mimo-VL-7B | 87.9 | 65.6 | 82.2 | 63.4 |
| Mimo-VL-7B + WildGUI | 91.8 | 71.4 | 90.3 | 71.0 |

Table 5. Offline GUI-agent results. The table keeps the reported step-success and CAGUI metrics most relevant to agent execution.

For online evaluation, Figure 2 supports the key generalization claim. Mimo-VL-7B with Stage 1 WildGUI pretraining plus Stage 2 post-training reaches 31.9% on AndroidWorld, compared with 16.4% for the base model and 23.3% for Stage 2 only. On OSWorld it reaches 12.3%, compared with 10.4% for Stage 2 only. These absolute numbers are still modest, but the relative improvement suggests offline video-derived data can transfer to live dynamic environments.

Figure 2. Online OSWorld and AndroidWorld results. The chart compares base, Stage 2 only, and Stage 1 + Stage 2 training on interactive environments.

The scaling study in Figure 3 varies pretraining from 0 to 200B tokens and compares against a Stage 2 only baseline. ScreenSpot-Pro rises from about 41% to 56.9%, and OSWorld-G rises from about 55% to 67.6%. The authors report no evident saturation in this range and note that OSWorld-G surpasses Stage 2 only around 50B tokens.

Figure 3. Scaling law for WildGUI pretraining. Reported accuracy increases as WildGUI pretraining tokens scale toward 200B.

The ablation in Table 6 clarifies what the model is using. Among the three pretraining losses, removing \(\mathcal{L}_{\text{ground}}\) hurts ScreenSpot-Pro most, dropping it from 56.9 to 49.8, and removing \(\mathcal{L}_{\text{action}}\) costs the most on CAGUI, 71.0 to 65.3. Removing \(\mathcal{L}_{\text{traj}}\) preserves static metrics better but drops AndroidWorld from 31.9 to 24.1, consistent with the claim that trajectory modeling matters for long-horizon online execution. Removing Stage 2 is catastrophic, dropping ScreenSpot-Pro to 28.2 and AndroidWorld to 6.0, so WildGUI pretraining is not a replacement for alignment-style supervised post-training.

| Setting | ScreenSpot-Pro | CAGUI | AndroidWorld |
| --- | --- | --- | --- |
| Ours | 56.9 | 71.0 | 31.9 |
| w/o \(\mathcal{L}_{\text{ground}}\) | 49.8 | 69.8 | 28.4 |
| w/o \(\mathcal{L}_{\text{action}}\) | 50.5 | 65.3 | 27.6 |
| w/o \(\mathcal{L}_{\text{traj}}\) | 54.6 | 70.2 | 24.1 |
| w/o Stage 1 | 49.3 | 64.2 | 23.3 |
| w/o Stage 2 | 28.2 | 45.7 | 6.0 |

Table 6. Training-objective and stage ablation. The table is copied from the source ablation and normalized to Markdown.
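To see which component each benchmark depends on, it helps to turn Table 6 into drops relative to the full recipe. The snippet below does that with the table values; the setting labels and helper names are illustrative shorthand.

```python
from typing import Dict, Iterable

FULL = {"ScreenSpot-Pro": 56.9, "CAGUI": 71.0, "AndroidWorld": 31.9}

# Values transcribed from Table 6.
ABLATIONS: Dict[str, Dict[str, float]] = {
    "w/o L_ground": {"ScreenSpot-Pro": 49.8, "CAGUI": 69.8, "AndroidWorld": 28.4},
    "w/o L_action": {"ScreenSpot-Pro": 50.5, "CAGUI": 65.3, "AndroidWorld": 27.6},
    "w/o L_traj": {"ScreenSpot-Pro": 54.6, "CAGUI": 70.2, "AndroidWorld": 24.1},
    "w/o Stage 1": {"ScreenSpot-Pro": 49.3, "CAGUI": 64.2, "AndroidWorld": 23.3},
    "w/o Stage 2": {"ScreenSpot-Pro": 28.2, "CAGUI": 45.7, "AndroidWorld": 6.0},
}
LOSS_ONLY = ["w/o L_ground", "w/o L_action", "w/o L_traj"]


def drops(setting: str) -> Dict[str, float]:
    """Points lost on each metric when the named component is removed."""
    return {m: round(FULL[m] - v, 1) for m, v in ABLATIONS[setting].items()}


def worst_setting(metric: str, candidates: Iterable[str]) -> str:
    """The ablation among `candidates` with the largest drop on `metric`."""
    return max(candidates, key=lambda s: FULL[metric] - ABLATIONS[s][metric])


if __name__ == "__main__":
    for metric in FULL:
        print(metric, "most sensitive loss:", worst_setting(metric, LOSS_ONLY))
```

Restricted to the three losses, each benchmark is most sensitive to a different one: grounding for ScreenSpot-Pro, action prediction for CAGUI, and trajectory modeling for AndroidWorld. Across all settings, removing Stage 2 is the largest drop on every metric.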

The data-quality check in Figure 4 is the paper's main guardrail against "large but noisy" data. Five expert participants rated 300 sampled data points. The video-quality pipeline improved average video scores from 1.22 to 2.12 after metadata filtering and to 4.45 after video scoring. For trajectory quality, WildGUI achieved 4.62, above TongUI at 3.35 and VideoAgentTrek at 4.05; the reported Krippendorff's \(\alpha\) is 0.84.

Figure 4. Human evaluation of video and trajectory quality. Expert ratings support the claim that filtering improves video quality and that extracted WildGUI trajectories compare favorably with prior video-derived GUI datasets.

The scorer itself is evaluated in Figure 5, which reports alignment with ground-truth annotations over topic relevance, instruction clarity, and screen recording quality.

Figure 5. Video scoring model evaluation. The figure reports the content-quality scorer's performance across the three filtering dimensions.

Practical Takeaways

The most important practical takeaway is that GUI-agent data may be hidden in ordinary tutorial videos, but the conversion pipeline has to solve two hard alignment problems: intent-to-step extraction and step-to-coordinate grounding. Video2GUI uses strong VLMs for both, then makes the result cheaper to scale with learned filters and scorers.

For model builders, WildGUI-like data is most compelling as Stage 1 pretraining, not as the only source of supervision. The ablation shows that removing Stage 2 hurts every reported metric, so clean supervised post-training remains necessary. The reported recipe also assumes large infrastructure: 24,000 Stage 1 steps, 2,000 Stage 2 steps, maximum 4,096 visual tokens, 32,768 sequence length, and a cluster with 256 NVIDIA GPUs.

The cost section is unusually concrete. With Gemini-3-Pro as the default annotator, trajectory extraction costs about $0.0653 per sample and action spatial grounding adds about $0.011, for about $0.0763 per sample end to end. The authors also evaluate Qwen3.5-397B-A17B as an open-source alternative but report roughly 15-20% lower annotation quality, mainly in temporal alignment and spatial grounding. That makes the release of the resulting dataset and pipeline more valuable than re-running the full pipeline for every downstream user.
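The per-sample figures above add up as stated, and they scale linearly with the number of samples. The sketch below shows the arithmetic; the 1,000,000-sample workload is a hypothetical input chosen for illustration, not a figure from the paper, and the digest does not say exactly what unit a "sample" is.

```python
EXTRACTION_COST_USD = 0.0653  # trajectory extraction, per sample (from the digest)
GROUNDING_COST_USD = 0.011    # action spatial grounding, per sample (from the digest)


def per_sample_cost() -> float:
    """End-to-end annotation cost per sample in USD, rounded to four decimals."""
    return round(EXTRACTION_COST_USD + GROUNDING_COST_USD, 4)


def total_cost(n_samples: int) -> float:
    """Linear cost estimate in USD; ignores retries, failures, and download or storage costs."""
    return n_samples * per_sample_cost()


# Share of the per-sample cost attributable to spatial grounding.
GROUNDING_SHARE = GROUNDING_COST_USD / per_sample_cost()

if __name__ == "__main__":
    hypothetical_samples = 1_000_000  # assumption for illustration only
    print(f"per sample: ${per_sample_cost():.4f}")
    print(f"grounding share: {GROUNDING_SHARE:.1%}")
    print(f"{hypothetical_samples:,} samples: ${total_cost(hypothetical_samples):,.0f}")
```

Grounding accounts for roughly one seventh of the per-sample cost, so most of the annotation budget goes to trajectory extraction from video.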

The main caveat is that the reported evidence validates the pipeline internally rather than through an independent reproduction. The benchmark gains are broad and numerically strong, but the upstream labels depend on model annotation, quality thresholds, and expert sampling. Users should treat WildGUI as a high-value pretraining corpus whose quality should still be audited for target domains, languages, applications, and safety-sensitive workflows.

Reference Coverage

Anchor coverage check: evidence anchors are linked from the claims table and again here for problem framing, video filtering, trajectory and grounding, agent training, dataset scale, main results, scaling, ablation, data quality, and cost caveats. Figure anchors are linked here for Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, and Figure 11. Table anchors are linked here for Table 1, Table 2, Table 3, Table 4, Table 5, and Table 6.