arXiv · 2026 · avg 6.01 · interest 9.00 · 118 HF · visual agents · multimodal skills · agent memory

The paper argues that reusable skills for visual agents need multimodal procedural knowledge rather than text-only instructions. MMSkills packages procedures with state cards and multi-view keyframes, generated from public trajectories and consulted at inference time to improve GUI and game-based visual-agent performance.

Source-first digest for monthly checked paper rank 16, rank_id p030.

Motivation / Background

Visual agents often reuse skills as text prompts, code snippets, execution graphs, or learned routines. MMSkills argues that this representation is incomplete for agents that operate from screenshots or game frames. A reusable GUI or game procedure is not only an instruction sequence; it also needs evidence for when the procedure is applicable, when it should be skipped, what visual cue confirms progress, and what visible state signals completion.

The paper's introductory MMSkills example shows the central failure mode: text-only guidance can describe a chart-creation procedure yet fail to indicate whether the live spreadsheet is in a state where that procedure applies.

Figure 1. Concrete MMSkills example. The original caption says a multimodal skill package combines a textual procedure, runtime state cards, and multi-view visual evidence; branch-loaded MMSkills align the skill evidence with the live screen and return state-aware guidance.

The paper formalizes this as multimodal procedural knowledge. Its answer is a three-part framework: a multimodal skill package, an agentic trajectory-to-skill Generator, and a branch-loaded runtime agent. The system overview is the best high-level map of those pieces.

Figure 2. MMSkills framework overview. MMSkills stores a textual procedure, runtime state cards, and multi-view keyframes; generates such packages from public non-test trajectories; and uses branch loading so the main agent receives compact structured guidance instead of a full bundle of reference images.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Visual-agent skills need multimodal procedural knowledge, not only text or code-level action recipes. | 4 | problem framing, package equations, Figure 1, Figure 2 |
| C2 | MMSkills defines a compact reusable package that binds a descriptor, procedure, runtime state cards, and aligned keyframe bundles. | 5 | package equations, method overview |
| C3 | The Generator can turn public non-evaluation trajectories into reusable multimodal skill libraries through clustering, procedure induction, grounding, and audit. | 4 | generator pipeline, source data coverage, source/package table |
| C4 | Branch loading reduces direct-context pressure by selecting needed state cards and views in a temporary branch, then returning structured guidance to the main trajectory. | 4 | runtime equations, branch runtime, prompt surfaces, prompt table |
| C5 | MMSkills improves reported performance across OSWorld and auxiliary GUI/game benchmarks for multiple model families. | 5 | OSWorld results, OSWorld table, auxiliary results, auxiliary table |
| C6 | State cards, keyframes, branch loading, and view selection each matter; removing them weakens the system. | 4 | ablation evidence, ablation figure |
| C7 | MMSkills changes behavior, not only final success rate: agents invoke skills more often, take fewer steps, and show less repetitive low-level action. | 4 | skill usage, usage table, behavior shift, behavior figures |
| C8 | The approach remains limited by source-trajectory coverage, generated-skill errors, visual-grounding mistakes, and extra branch-loading cost. | 3 | limitations, case studies |

Support scores are support-from-paper scores, not reproduction scores. A score of 5 means the claim is directly backed by equations, tables, or controlled experiments inside the paper; lower scores mark claims with more qualitative or implementation-dependent evidence.

Core Technical Idea

MMSkills changes the stored skill artifact. A text-only skill says what to do. An MMSkill says what to do, when it applies, when it does not apply, what visual cue matters, and what evidence verifies progress or completion.

The core package is:

$$ M=(D,P,S,K), \label{eq:mmskill} $$

where \(D\) is a compact descriptor, \(P\) is the reusable textual procedure, \(S=\{S_j\}_{j=1}^{m}\) is a set of runtime state cards, and \(K=\{K_j\}_{j=1}^{m}\) is a set of aligned keyframe bundles.

A runtime state card is:

$$ \begin{split} S_j = (& \text{when\_to\_use}_j, \text{when\_not\_to\_use}_j, \text{visible\_cues}_j,\\ & \text{verification\_cue}_j, \mathcal{V}_j), \qquad \mathcal{V}_j=\text{available\_views}_j . \end{split} \label{eq:state-card} $$

The available view vocabulary is:

$$ \mathcal{V}=\{\text{full\_frame},\text{focus\_crop},\text{before},\text{after}\}. $$

Each procedural state has a matched image bundle:

$$ K_j=\{K_j^{v}:v\in\mathcal{V}_j,\ v\in\mathcal{V}\}. \label{eq:keyframe-bundle} $$

This representation is the paper's main conceptual move. State cards are not captions; they are decision nodes. Keyframes are not demonstration replay; they are compact evidence for recognizing or verifying runtime state.
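
The package definition above can be written down as a small data structure. The following is a minimal sketch, not the paper's implementation: the class names `StateCard` and `MMSkill`, the encoding of the procedure as a list of text steps, and the encoding of each keyframe bundle as a mapping from view name to image path are all assumptions made for illustration. Only the field names of the state card and the four view names come from the equations.

```python
from dataclasses import dataclass

# View vocabulary from the digest: full_frame, focus_crop, before, after.
VIEW_VOCAB = frozenset({"full_frame", "focus_crop", "before", "after"})


@dataclass
class StateCard:
    """One runtime state card S_j; field names follow the state-card equation."""
    when_to_use: str
    when_not_to_use: str
    visible_cues: list[str]
    verification_cue: str
    available_views: frozenset[str]  # V_j, expected to be a subset of VIEW_VOCAB


@dataclass
class MMSkill:
    """Package M = (D, P, S, K). The concrete encodings are assumptions."""
    descriptor: str                  # D
    procedure: list[str]             # P, as ordered text steps (assumed encoding)
    state_cards: list[StateCard]     # S
    keyframes: list[dict[str, str]]  # K, per card: view name -> image path (assumed)

    def validate(self) -> None:
        """Check that S and K are aligned and that every view name is known."""
        if len(self.state_cards) != len(self.keyframes):
            raise ValueError("S and K must have the same length m")
        for j, (card, bundle) in enumerate(zip(self.state_cards, self.keyframes)):
            if not card.available_views <= VIEW_VOCAB:
                raise ValueError(f"card {j}: unknown view name")
            if set(bundle) != set(card.available_views):
                raise ValueError(f"card {j}: keyframe bundle does not match available views")


# Hypothetical example loosely based on the chart-creation scenario.
example_card = StateCard(
    when_to_use="data range is selected and no chart exists yet",
    when_not_to_use="a chart is already embedded in the sheet",
    visible_cues=["Insert menu", "selected cell range"],
    verification_cue="chart object is visible on the sheet",
    available_views=frozenset({"full_frame", "focus_crop"}),
)
example_skill = MMSkill(
    descriptor="create a chart from a selected range",
    procedure=["select the data range", "open Insert > Chart", "confirm the dialog"],
    state_cards=[example_card],
    keyframes=[{"full_frame": "card0_full.png", "focus_crop": "card0_focus.png"}],
)
example_skill.validate()
```

The `validate` method encodes the one structural constraint the equations make explicit: each state card has a matched keyframe bundle whose views are drawn from that card's available views.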

The Generator maps public trajectories into a domain-specific skill library:

$$ \mathcal{G}_{\mathcal{F}}:\mathcal{T}_d\mapsto\mathcal{M}_d, \label{eq:meta-generator} $$

with the staged pipeline:

$$ \begin{aligned} \mathcal{T}_d &\xrightarrow{\text{Phase 0: embed+cluster}} \mathcal{C}_d \xrightarrow{\text{Phase 1: cluster plan}} \mathcal{A}_d \xrightarrow{\text{Phase 2: merge}} \mathcal{R}_d \\ &\xrightarrow{\text{Phase 3: text draft}} \widehat{\mathcal{M}}_d \xrightarrow{\text{Phase 4: image ground+audit}} \mathcal{M}_d . \end{aligned} \label{eq:generator-pipeline} $$

The paper emphasizes that source trajectories are disjoint from evaluation tasks. The goal is to convert public interaction experience into reusable procedures and diagnostic visual states, not to store raw demonstrations or replay test episodes.
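
The staged pipeline is essentially a composition of five functions. The sketch below shows only that composition shape, under heavy simplifying assumptions: every stage body is a toy stand-in (grouping by a precomputed `topic` string instead of embedding and clustering, copying steps from one trajectory instead of inducing a procedure, keeping any available screenshot instead of grounding and auditing keyframes). All function names and dictionary keys are hypothetical.

```python
from collections import defaultdict


def phase0_cluster(trajectories):
    """Stand-in for embed+cluster: group trajectories by a precomputed topic key."""
    clusters = defaultdict(list)
    for traj in trajectories:
        clusters[traj["topic"]].append(traj)
    return dict(clusters)


def phase1_plan(clusters):
    """Stand-in for cluster planning: one candidate skill plan per cluster."""
    return [{"name": topic, "sources": trajs} for topic, trajs in clusters.items()]


def phase2_merge(plans):
    """Stand-in for merging: combine plans whose normalized names collide."""
    merged = {}
    for plan in plans:
        key = plan["name"].strip().lower()
        merged.setdefault(key, {"name": key, "sources": []})["sources"].extend(plan["sources"])
    return list(merged.values())


def phase3_draft(plans):
    """Stand-in for text drafting: reuse the first source's steps as the procedure."""
    return [
        {"descriptor": p["name"], "procedure": p["sources"][0]["steps"], "sources": p["sources"]}
        for p in plans
    ]


def phase4_ground_and_audit(drafts):
    """Stand-in for image grounding and audit: drop drafts with no usable screenshot."""
    skills = []
    for draft in drafts:
        frames = [t["screenshot"] for t in draft["sources"] if t.get("screenshot")]
        if frames:
            skills.append(
                {"descriptor": draft["descriptor"], "procedure": draft["procedure"], "keyframes": frames}
            )
    return skills


def generate_library(trajectories):
    """Compose the five stages: T_d -> C_d -> A_d -> R_d -> draft library -> M_d."""
    return phase4_ground_and_audit(
        phase3_draft(phase2_merge(phase1_plan(phase0_cluster(trajectories))))
    )


toy_trajectories = [
    {"topic": "Insert chart", "steps": ["select range", "insert chart"], "screenshot": "a.png"},
    {"topic": " insert chart", "steps": ["pick data", "add chart"], "screenshot": "b.png"},
    {"topic": "rename file", "steps": ["right-click", "rename"], "screenshot": None},
]
toy_library = generate_library(toy_trajectories)
```

In this toy run the two chart trajectories merge into one skill and the rename trajectory is dropped by the audit stand-in because it has no screenshot. The real Generator is described as agentic, so each stage would involve model calls rather than these deterministic rules.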

Method Details

Branch-Loaded Runtime

At runtime, the main agent either acts directly or consults a skill branch:

$$ \begin{aligned} \text{direct}: \quad & A_t = \pi_{\text{main}}(O_t,H_t,\mathcal{C}_I),\\ \text{branch}: \quad & G_t = \text{Branch}(O_t,H_t,M_t),\quad A_t = \pi_{\text{main}}(O_t,H_t,\mathcal{C}_I,G_t). \end{aligned} \label{eq:runtime-modes} $$

The branch returns structured guidance:

$$ G_t = (\text{applicable}_t,\text{subgoal}_t,\text{plan}_t,\text{do\_not\_do}_t,\text{verify}_t). \label{eq:branch-summary} $$

The branch first selects which state cards and views to inspect:

$$ (J_t,R_t)=\text{SelectViews}(O_t,H_{t-1},P_t,S_t), \qquad V_t=\{K_j^v:j\in J_t,\ v\in R_{t,j}\}. \label{eq:branch-stage1} $$

Then it plans from the live observation, selected cards, and selected views:

$$ G_t=\text{PlanBranch}(O_t,H_{t-1},P_t,\{S_j:j\in J_t\},V_t). \label{eq:branch-stage2} $$

The key engineering claim is that reference images should not compete with the live screenshot in the main action context. Branch loading treats skill evidence like progressive disclosure: inspect just enough evidence in a temporary branch, distill the guidance, and let the main agent still ground actions in the current observation.
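
The two-stage branch can be sketched as two functions whose composition returns the guidance tuple. This is an illustration under stated assumptions, not the paper's code: the observation is represented as a plain string rather than a screenshot, stage 1 uses substring matching on visible cues as a stand-in for a model call, and the function names, dictionary keys, and the `max_cards` bound are hypothetical. Only the five guidance fields come from the branch-summary equation.

```python
def select_views(observation, history, skill, max_cards=2):
    """Stage 1 stand-in: choose state cards J_t and, per card, views R_{t,j}.

    A card is selected when one of its visible cues occurs in the observation
    text. A real system would ask a vision-language model instead.
    """
    selected = {}
    for j, card in enumerate(skill["state_cards"]):
        if len(selected) >= max_cards:
            break
        if any(cue in observation for cue in card["visible_cues"]):
            selected[j] = card["available_views"][:1]  # load at most one view per card
    return selected


def plan_branch(observation, history, skill, selected):
    """Stage 2 stand-in: build G_t from the selected cards. Emits no GUI action."""
    if not selected:
        return {"applicable": False, "subgoal": None, "plan": [], "do_not_do": [], "verify": None}
    first = skill["state_cards"][next(iter(selected))]
    return {
        "applicable": True,
        "subgoal": first["when_to_use"],
        "plan": list(skill["procedure"]),
        "do_not_do": [first["when_not_to_use"]],
        "verify": first["verification_cue"],
    }


def branch(observation, history, skill):
    """Temporary branch: select evidence, plan, and return only compact guidance."""
    selected = select_views(observation, history, skill)
    return plan_branch(observation, history, skill, selected)


# Hypothetical skill and observations for illustration.
chart_skill = {
    "procedure": ["select the data range", "open Insert > Chart"],
    "state_cards": [
        {
            "when_to_use": "range selected, no chart present",
            "when_not_to_use": "chart already exists",
            "visible_cues": ["Insert menu"],
            "verification_cue": "chart object visible",
            "available_views": ["full_frame", "focus_crop"],
        }
    ],
}
guidance_hit = branch("toolbar shows Insert menu and a selected range", [], chart_skill)
guidance_miss = branch("terminal window with a shell prompt", [], chart_skill)
```

The design point the sketch preserves is that the reference views never leave the branch: the main policy only sees the returned dictionary alongside its own live observation.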

Prompt Interface

The prompt-surface figure exposes the concrete implementation interface behind the equations: one main skill-calling prompt, one gated state-view selection prompt, and one planner-JSON prompt.

Figure 5. Prompt surfaces for branch-loaded MMSkills. Stage 1 decides whether visual reference images are needed and which views to load. Stage 2 returns a compact planner object rather than a GUI action.

The prompt interface table condenses the appendix prompt templates.

| Surface | Reads | Emits | Guardrail |
| --- | --- | --- | --- |
| Main agent | Current screenshot, previous steps, candidate skills, active planner memo | One GUI action, DONE, WAIT, FAIL, or LOAD_SKILL(...) | Use skill hints as references only; verify before DONE; avoid repeated unproductive loops. |
| Stage 1 branch | Live screenshot, skill text, state-card manifests, previous steps | LOAD_STATE_VIEWS(...) JSON | Load no images when text is enough; keep evidence minimal, within a bounded number of states/views. |
| Stage 2 branch | Live screenshot, Stage 1 decision, selected state cards/views | JSON guidance with applicability, subgoal, plan, do-not-do, fallback, expected state, completion scope | Do not emit GUI actions; reference views are never coordinate templates. |
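
The Stage 2 guardrail ("do not emit GUI actions") is the kind of constraint a caller can check mechanically before passing guidance to the main agent. The sketch below assumes a JSON object whose keys mirror the seven guidance fields listed in the table; the exact key spellings, the forbidden-key list, and the function name `parse_guidance` are hypothetical and do not come from the paper's prompt templates.

```python
import json

# Key spellings are assumptions derived from the field list in the table.
REQUIRED_KEYS = {
    "applicability", "subgoal", "plan", "do_not_do",
    "fallback", "expected_state", "completion_scope",
}
# Hypothetical keys that would indicate a leaked GUI action or coordinate template.
FORBIDDEN_KEYS = {"action", "click", "coordinates"}


def parse_guidance(raw: str) -> dict:
    """Parse planner JSON and enforce the Stage 2 guardrails structurally."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("guidance must be a JSON object")
    missing = REQUIRED_KEYS - set(obj)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    leaked = FORBIDDEN_KEYS & set(obj)
    if leaked:
        raise ValueError(f"guidance must not contain GUI actions: {sorted(leaked)}")
    return obj


sample = json.dumps({
    "applicability": True,
    "subgoal": "insert a chart for the selected range",
    "plan": ["open Insert menu", "choose Chart"],
    "do_not_do": ["do not re-select the range"],
    "fallback": "use the toolbar chart icon",
    "expected_state": "chart wizard dialog is open",
    "completion_scope": "subtask",
})
guidance = parse_guidance(sample)
```

A structural check like this cannot verify that the guidance is correct, only that it has the agreed shape and stays on the planner side of the interface.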

Source Data And Skill Library Scale

The paper's source construction is important because it separates skill generation from evaluation. GUI MMSkills come from OpenCUA Ubuntu/macOS source trajectories, VAB-Minecraft skills use official training episodes, and Super Mario skills come from separate source cases. The source coverage table summarizes the library scale most relevant for interpreting the results.

| Source/evaluation area | Evaluation tasks | Unique skills | State cards | Views | Notes |
| --- | --- | --- | --- | --- | --- |
| OSWorld | 360 | 247 | 879 | 1,898 | Average 1.21 skills/task; 879 full-frame and 876 focus views; 143 transition cards. |
| macOSWorld | 143 | 248 | 522 | 1,097 | Average 1.73 skills/task; 522 full-frame and 522 focus views; 53 transition cards. |
| VAB-Minecraft | official test set | 24 | 79 | 185 | Official training set used as source; average 3.29 cards/skill. |
| Super Mario Bros | held-out cases | 10 | 34 | 48 | Skills extracted from separate source runs over four source cases. |
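
A quick arithmetic check helps when reading this table. The sketch below copies the counts from the table and derives cards per skill and views per card. It also tests one reading that is an inference, not a statement from the paper: for the two GUI rows, the full-frame, focus, and transition counts in the notes add up exactly to the Views column, which suggests the transition count refers to before/after views.

```python
# Counts copied from the source coverage table; nothing here is new data.
rows = {
    "OSWorld": {"skills": 247, "cards": 879, "views": 1898,
                "full": 879, "focus": 876, "transition": 143},
    "macOSWorld": {"skills": 248, "cards": 522, "views": 1097,
                   "full": 522, "focus": 522, "transition": 53},
    "VAB-Minecraft": {"skills": 24, "cards": 79, "views": 185},
    "Super Mario Bros": {"skills": 10, "cards": 34, "views": 48},
}


def summarize(row):
    """Derive per-skill and per-card ratios, plus the view-sum check where possible."""
    out = {
        "cards_per_skill": round(row["cards"] / row["skills"], 2),
        "views_per_card": round(row["views"] / row["cards"], 2),
    }
    if "full" in row:
        # Inferred reading: full-frame + focus + transition should equal total views.
        out["views_sum_matches"] = (
            row["full"] + row["focus"] + row["transition"] == row["views"]
        )
    return out


summary = {name: summarize(row) for name, row in rows.items()}
for name, stats in summary.items():
    print(name, stats)
```

The VAB-Minecraft ratio of 79 cards over 24 skills reproduces the 3.29 cards/skill figure in the notes, which is also what fixes the column assignment for that row.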

Experiments And Results

The paper evaluates four questions: overall performance, ablations, skill-usage dynamics, and low-level behavior shift. The strongest quantitative evidence is in the OSWorld and auxiliary benchmark tables.

The OSWorld table condenses the paper's application-level OSWorld results to the overall success-rate column. MMSkills improve overall success for every listed base model.

| Base model | No skill | Text-only | MMSkills | MMSkills gain vs no skill |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | 44.08 | 40.76 | 50.11 | +6.03 |
| Gemini 3 Flash | 36.65 | 40.27 | 47.97 | +11.32 |
| Qwen3-VL-235B | 21.34 | 28.57 | 39.17 | +17.83 |
| GLM-5V | 28.71 | 36.61 | 38.51 | +9.80 |
| Kimi-K2.6 | 34.98 | 39.66 | 46.59 | +11.61 |
| Qwen3-VL-8B-Instruct | 10.78 | 14.93 | 25.40 | +14.62 |

The Qwen3 rows are the cleanest evidence for the "smaller/weaker model" claim. Qwen3-VL-235B gains 17.83 points on OSWorld, while Qwen3-VL-8B-Instruct more than doubles from 10.78 to 25.40.
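
The gain column and the "more than doubles" statement can be recomputed directly from the OSWorld scores. The snippet below is a small worked check using only the numbers in the table; the helper name `compare` and the choice to report a relative ratio are illustrative additions.

```python
# (no skill, text-only, MMSkills) overall OSWorld success rates from the table.
osworld = {
    "Gemini 3.1 Pro": (44.08, 40.76, 50.11),
    "Gemini 3 Flash": (36.65, 40.27, 47.97),
    "Qwen3-VL-235B": (21.34, 28.57, 39.17),
    "GLM-5V": (28.71, 36.61, 38.51),
    "Kimi-K2.6": (34.98, 39.66, 46.59),
    "Qwen3-VL-8B-Instruct": (10.78, 14.93, 25.40),
}


def compare(scores):
    """Return absolute gains in points and the MMSkills / no-skill ratio."""
    no_skill, text_only, mmskills = scores
    return {
        "gain_vs_no_skill": round(mmskills - no_skill, 2),
        "gain_vs_text_only": round(mmskills - text_only, 2),
        "text_only_vs_no_skill": round(text_only - no_skill, 2),
        "ratio_vs_no_skill": round(mmskills / no_skill, 2),
    }


comparison = {model: compare(scores) for model, scores in osworld.items()}
for model, stats in comparison.items():
    print(f"{model}: {stats}")
```

Two things stand out when the derived columns are printed side by side: text-only skills lower the Gemini 3.1 Pro score relative to no skill, and the ratio for Qwen3-VL-8B-Instruct is above 2, which is the basis for the doubling statement.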

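The auxiliary results table shows a similar pattern beyond Ubuntu desktop tasks, with exceptions on macOSWorld. There, the MMSkills margin over text-only skills is largest for Gemini 3 Flash (+11.88) and Qwen3-VL-235B (+8.39); GLM-5V gains 16.78 points over no skill, but its text-only variant already reaches the same 51.75, and Qwen3-VL-8B-Instruct stays at its no-skill score of 6.29. VAB-Minecraft and Super Mario show consistent improvements in the completed model runs.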

| Base model | macOSWorld overall (no skill / text-only / MMSkills) | VAB-Minecraft success (no skill / text-only / MMSkills) | Super Mario total performance (no skill / text-only / MMSkills) |
| --- | --- | --- | --- |
| Gemini 3 Flash | 55.94 / 53.85 / 65.73 | 67.24 / 68.96 / 73.28 | 411.00 / 548.00 / 624.00 |
| Qwen3-VL-235B | 47.55 / 43.36 / 51.75 | 52.59 / 55.17 / 62.07 | 454.50 / 610.50 / 788.00 |
| GLM-5V | 34.97 / 51.75 / 51.75 | 56.03 / 61.20 / 68.10 | 612.75 / 794.50 / 950.50 |
| Qwen3-VL-8B-Instruct | 6.29 / 4.90 / 6.29 | 23.28 / 29.31 / 38.79 | 415.25 / 596.50 / 764.00 |

Ablations

The ablation figure supports two linked claims. First, complete MMSkills outperform text-only skills, MMSkills without state cards, and MMSkills without images. Second, branch loading and view selection are distinct: direct full loading can hurt by injecting unfiltered reference evidence, while the full two-stage branch-loaded design performs best.

Figure 3. Ablation results. Panel A removes runtime state cards or visual keyframes from the package. Panel B compares direct loading, branch loading, and view selection. The paper reports bars as percentage-point gains over the no-skill baseline.

Skill Usage And Interaction Dynamics

The skill usage table shows that MMSkills are not just extra context. They are called more often than text-only skills, yet reduce average trajectory length in every listed MMSkills condition.

| Benchmark | Model | Text-only invoked / steps | MMSkills invoked / steps | MMSkills selected views |
| --- | --- | --- | --- | --- |
| OSWorld | Gemini 3 Flash | 41.11% / 15.64 | 62.50% / 11.86 | 79 full / 241 focus / 8 before / 24 after |
| OSWorld | Qwen3-VL-235B | 37.50% / 13.34 | 65.28% / 9.87 | 40 full / 27 focus / 17 before / 13 after |
| VAB-Minecraft | Gemini 3 Flash | 68.97% / 17.30 | 81.90% / 13.75 | 105 full / 205 focus / 15 before / 12 after |
| VAB-Minecraft | Qwen3-VL-235B | 54.31% / 31.36 | 64.66% / 27.07 | 98 full / 196 focus / 13 before / 10 after |

This is a useful diagnostic result. Text-only skills can add overhead; MMSkills increase skill invocation but reduce steps, suggesting that state-conditioned visual evidence helps the model recognize relevance and avoid exploratory loops.
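
To make the "more invocations, fewer steps" pattern concrete, the relative step reduction and the mix of selected views can be derived from the usage table. The snippet uses only the table's numbers; the derived quantities (percentage step reduction, focus-view share) and the helper names are illustrative choices, not metrics reported by the paper.

```python
# (benchmark, model, text-only steps, MMSkills steps, (full, focus, before, after))
usage = [
    ("OSWorld", "Gemini 3 Flash", 15.64, 11.86, (79, 241, 8, 24)),
    ("OSWorld", "Qwen3-VL-235B", 13.34, 9.87, (40, 27, 17, 13)),
    ("VAB-Minecraft", "Gemini 3 Flash", 17.30, 13.75, (105, 205, 15, 12)),
    ("VAB-Minecraft", "Qwen3-VL-235B", 31.36, 27.07, (98, 196, 13, 10)),
]


def step_reduction_pct(text_steps, mm_steps):
    """Relative reduction in average steps when moving from text-only to MMSkills."""
    return 100.0 * (text_steps - mm_steps) / text_steps


def focus_share_pct(views):
    """Share of selected views that are focus crops."""
    _full, focus, _before, _after = views
    return 100.0 * focus / sum(views)


report = [
    {
        "benchmark": benchmark,
        "model": model,
        "step_reduction_pct": round(step_reduction_pct(text_steps, mm_steps), 1),
        "total_views": sum(views),
        "focus_share_pct": round(focus_share_pct(views), 1),
    }
    for benchmark, model, text_steps, mm_steps, views in usage
]
for row in report:
    print(row)
```

The derived view mix also shows that view selection is model-dependent: on OSWorld, Gemini 3 Flash selects mostly focus crops, while Qwen3-VL-235B selects far fewer views overall and leans toward full frames.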

Behavioral Shift

The main behavior figure argues that MMSkills change action patterns. The paper highlights Qwen3-VL-235B: click share drops from 75.8% to 63.7%, exact repeated actions fall from 21.8% to 6.2%, keyboard and DONE actions increase, and the longest same-mode run decreases.
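
The behavior statistics named above can be computed from an action log. The paper's exact definitions are not given in this digest, so the sketch below fixes its own assumptions: an action is a `(mode, argument)` tuple, an "exact repeated action" is one identical to the immediately preceding action, the repeat rate is normalized by the total number of actions, and a "same-mode run" is a maximal stretch of consecutive actions sharing a mode. All function names are hypothetical.

```python
from itertools import groupby


def click_share(actions):
    """Fraction of actions whose mode is 'click'."""
    if not actions:
        return 0.0
    return sum(1 for mode, _ in actions if mode == "click") / len(actions)


def exact_repeat_rate(actions):
    """Fraction of actions identical to the action right before them (assumed definition)."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / len(actions)


def longest_same_mode_run(actions):
    """Length of the longest stretch of consecutive actions with the same mode."""
    return max(
        (sum(1 for _ in group) for _, group in groupby(actions, key=lambda a: a[0])),
        default=0,
    )


# Hypothetical five-step trajectory.
trajectory = [
    ("click", (10, 20)),
    ("click", (10, 20)),   # exact repeat of the previous action
    ("click", (30, 40)),
    ("type", "=SUM(A1:A5)"),
    ("done", None),
]
stats = {
    "click_share": click_share(trajectory),
    "exact_repeat_rate": exact_repeat_rate(trajectory),
    "longest_same_mode_run": longest_same_mode_run(trajectory),
}
```

For this toy trajectory the click share is 3/5, the exact repeat rate is 1/5, and the longest same-mode run is 3. Different normalizations (for example, dividing repeats by the number of adjacent pairs) would shift the values slightly without changing the direction of a comparison.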

Figure 4. Behavioral shift on OSWorld. Panels report action primitive distribution, low-level primitives per task, and repetitive behavior statistics for OSWorld.

The additional behavior figure extends the same analysis to GLM-5V and Kimi-K2.6.

Figure 6. Additional behavioral shift for GLM-5V and Kimi-K2.6. The appendix uses the same metrics as Figure 4: primitive mix, low-level primitives per task, and repetition statistics.

Interaction Case Studies

The appendix case studies are qualitative, but useful for understanding the runtime mechanics. The Calc case shows multiple spreadsheet skills being consulted at different stages of table construction. The terminal case shows branch guidance helping the agent recover from a brittle command path and verify the final archive structure.

Figure 7. LibreOffice Calc interaction case. Colored turn labels distinguish direct GUI actions, skill loading, branch guidance, evidence-gated reasoning, and final completion.
Figure 8. Terminal file-organization interaction case. The case illustrates a branch-loaded skill helping with command repair and final archive verification.

Practical Takeaways

1. The most transferable idea is the state card. A useful visual-agent skill should explicitly encode when to use it, when not to use it, visible cues, and verification cues. The state card is what turns an instruction into reusable procedural knowledge.

2. The strongest systems result is that multimodal skill evidence helps weaker models most. On OSWorld, Qwen3-VL-235B gains 17.83 points and Qwen3-VL-8B-Instruct gains 14.62, the two largest gains among the listed base models, which suggests that external visual procedural knowledge can compensate for weaker internal priors.

3. The branch is not just an implementation detail. Directly loading many reference images into the main action context can distract or over-anchor the model. The branch-loaded design lets the system inspect selected reference evidence, then return compact guidance.

4. The paper's evidence is strongest for benchmark performance and ablations, moderate for behavioral interpretation, and weaker for broad deployment claims. The interaction cases are useful illustrations, but not proof of robustness outside the evaluated tasks.

5. A production version would need stronger skill-library governance: source filtering, privacy controls for screenshots, generated-skill audits, online repair when a skill is misleading, and cost controls for branch calls.

Limitations

The paper's own limitations are practical and important. MMSkills depend on source-trajectory coverage, so the system is only as broad as the public non-test experience used to build it. Generated state cards and grounded keyframes can be wrong. Branch loading adds inference cost and wall-clock latency. The method also stores visual evidence, so real deployments need privacy filtering, access controls, and audit mechanisms before skills are shared or reused.
