arXiv · 2026 · avg 4.00 · interest 8.00 · 0 HF · mobile agents · GUI environments · verifiable tasks

PhoneWorld scales phone-use agent environments by converting real GUI trajectories and screenshots into controllable mock Android apps, executable tasks, automatic verifiers, and training rollouts. Its 34-app instantiation improves multiple mobile-agent benchmarks under a fixed training budget and shows further gains from more PhoneWorld supervision and broader app coverage.

Source-first digest for checked paper rank 29, rank_id p058.

Motivation / Background

Phone-use agents operate through screenshots, touch actions, mobile navigation, and app state rather than through stable APIs. The paper argues that this makes environment supply a bottleneck: real mobile apps are dynamic, difficult to reset, and expensive to turn into reproducible evaluation and training setups. Existing systems such as AndroidWorld, MobileWorld, and A3 make mobile-agent evaluation more reliable, but they do not solve the upstream problem of repeatedly constructing new controllable phone-use environments.

PhoneWorld addresses that upstream bottleneck. It converts real GUI trajectories and representative screenshots into runnable mock Android apps, executable tasks, rule-based verifiers, and verifier-confirmed training rollouts. The key shift is that trajectories are not treated only as demonstrations; they are used to decide which screens matter, how navigation flows between screens, what state changes must persist, and which user goals can be checked automatically. Figure 1 shows the full source-to-environment pipeline, and Table 1 summarizes the scale of the current suite.

Figure 1. PhoneWorld turns real GUI traces into runnable phone-use environments. The original caption describes five stages: real-app traces, structure recovery, build specification, app construction, and tasks/verification/rollouts. I place it here because it is the clearest evidence for the paper's claim that PhoneWorld is infrastructure for both evaluation and training, not just a fixed benchmark.

| Component | Summary |
| --- | --- |
| Apps | 34 mock Android apps across 16 domains |
| Reusable modules | 18 shared modules reused across app families |
| Benchmark | 120 audited tasks: 102 single-app + 18 cross-app |
| Verification | Answer-based checks and SQLite-based state checks |
| Task pool | 7,936 generated tasks used for large-scale rollout generation |
| Training rollouts | 3,354 successful episodes / 36,193 interaction steps |

Table 1. Summary of the PhoneWorld suite. This table grounds the scale claims: PhoneWorld is not only an app-construction recipe, but a concrete 34-app suite with held-out evaluation tasks and successful rollout data.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Real trajectories plus screenshots can be converted into controllable, resettable phone-use environments with tasks, verifiers, and rollouts. | 4 | pipeline, worked example, worked-example figure, prompt verifier, suite |
| C2 | The current PhoneWorld instantiation has enough breadth to test environment scaling: 34 apps, 16 domains, 18 reusable modules, and held-out audited tasks. | 5 | suite, evaluation benchmarks, evaluation setup |
| C3 | Under a matched total training budget, replacing 10K auxiliary AndroidWorld steps with broad PhoneWorld supervision improves all four reported benchmarks. | 5 | partial replacement |
| C4 | PhoneWorld supervision is strong but complementary to AndroidWorld supervision rather than a wholesale replacement. | 4 | full replacement, add-only control, add-only table |
| C5 | Under a fixed 10K PhoneWorld budget, broader app coverage is the strongest reported scaling signal. | 4 | scaling figure |
| C6 | The construction process is scalable but not fully automatic; human audit and selective abstraction remain part of the system. | 3 | construction loop, limitations |

Scores are support-from-paper scores, not independent reproduction scores. I cap the broad scaling and construction-process claims below 5 where the paper gives strong experimental evidence but does not fully expose artifacts, app-level QA statistics, or public benchmark access.

Core Technical Idea

PhoneWorld is an environment factory for phone-use agents. For each target app, it takes two source signals: real GUI trajectories and representative screenshots.

The pipeline recovers a page taxonomy, classifies screenshots into that taxonomy, estimates page visitation frequency, and builds a weighted page-transition graph. High-frequency pages become P0 build targets; moderate pages become P1; long-tail pages become P2 unless they are required by tasks. This matters because the system does not aim to clone every app feature. It preserves the screens, transitions, visible content, and mutable actions that matter for mobile agents.
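The prioritization step can be illustrated with a minimal sketch. It assumes each trajectory has already been classified into a sequence of page-type labels. The cut-off shares, the page names, and the rule that a task-required long-tail page is promoted to P1 are all assumptions made for this example; the paper does not report them.

```python
from collections import Counter

# Hypothetical cut-offs chosen for this toy data; the paper does not state its thresholds.
P0_MIN_SHARE = 0.25
P1_MIN_SHARE = 0.10


def page_frequencies(trajectories):
    """Share of all classified screenshots that fall on each page type."""
    counts = Counter(page for traj in trajectories for page in traj)
    total = sum(counts.values())
    return {page: n / total for page, n in counts.items()}


def transition_graph(trajectories):
    """Weighted edges: how often page A is immediately followed by page B."""
    edges = Counter()
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            if src != dst:  # skip consecutive screenshots of the same page
                edges[(src, dst)] += 1
    return edges


def assign_priorities(freqs, task_required=()):
    """Map each page type to P0, P1, or P2 by frequency share.

    Assumption: a long-tail page that a task needs is promoted to P1.
    """
    priorities = {}
    for page, share in freqs.items():
        if share >= P0_MIN_SHARE:
            priorities[page] = "P0"
        elif share >= P1_MIN_SHARE or page in task_required:
            priorities[page] = "P1"
        else:
            priorities[page] = "P2"
    return priorities


# Toy classified trajectories (page names are illustrative).
trajectories = [
    ["chat_list", "chat", "chat_list", "contacts"],
    ["chat_list", "search", "group_profile"],
    ["chat_list", "chat", "chat", "contacts", "settings"],
]
freqs = page_frequencies(trajectories)
edges = transition_graph(trajectories)
priorities = assign_priorities(freqs, task_required={"group_profile"})
```

In this toy run, `chat_list` and `chat` land in P0, `contacts` in P1, and `group_profile` is lifted out of the long tail only because a task needs it.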

Figure 2. Worked example of constructing a QQ-like PhoneWorld environment. The paper uses this example to show the path from high-frequency real-app pages to page priorities, transition graph, generated mock pages, and a SQLite-checkable task. It is the most concrete source-side evidence for the environment-construction workflow.

The build specification layer converts recovered structure into per-page PRDs, reusable UI components, and a data architecture. The data layer separates read-only app content from mutable state in a resettable SQLite database. Read-only content lets agents browse and search realistic data offline; mutable tables record state changes such as favorites, cart edits, comments, messages, or profile updates. The same database is later used by verifiers.
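A small sketch may help make the read-only versus mutable split concrete. It uses Python's standard `sqlite3` module rather than the Android stack the paper builds on, and the table names, columns, and seed rows are hypothetical.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, not from the paper.
SCHEMA = """
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE cart_items (product_id INTEGER NOT NULL, quantity INTEGER NOT NULL);
"""
SEED_PRODUCTS = [(1, "USB-C cable"), (2, "Phone stand")]  # read-only app content
MUTABLE_TABLES = ("cart_items",)                          # state the agent can change


def create_app_db(path=":memory:"):
    """Create the database and load the deterministic read-only content."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO products VALUES (?, ?)", SEED_PRODUCTS)
    conn.commit()
    return conn


def reset_mutable_state(conn):
    """Return the app to its initial state by clearing only the mutable tables."""
    for table in MUTABLE_TABLES:  # names come from the constant above, not user input
        conn.execute(f"DELETE FROM {table}")
    conn.commit()


conn = create_app_db()
conn.execute("INSERT INTO cart_items VALUES (1, 2)")  # simulated agent action
cart_before = conn.execute("SELECT COUNT(*) FROM cart_items").fetchone()[0]
reset_mutable_state(conn)
cart_after = conn.execute("SELECT COUNT(*) FROM cart_items").fetchone()[0]
product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Keeping the reset limited to the mutable tables means every episode starts from the same browsable content, which is what lets the same database double as the verifier's source of truth.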

The construction loop is deliberately pragmatic: a coding agent writes Kotlin/Jetpack Compose code, compiles the APK, runs a self-review checklist, and fixes issues such as schema mismatches, dead buttons, and missing routes. Across apps, recurring UI and logic patterns are promoted into reusable modules. The paper reports 18 modules, including search, feeds, comments, cart/checkout, address management, messaging, settings, and media player patterns.

Method Details

Structure Recovery And Build Specification

PhoneWorld first asks a coding/vision-language process to identify recurring page types from screenshots. A lightweight vision-language model then classifies the screenshot corpus in parallel. The classified trajectories yield a page-frequency distribution and a page-transition graph. Those two artifacts guide what the mock app must preserve: high-use screens, common navigation paths, and stateful interactions that tasks will later require.

For each prioritized page, a vision-language model writes a structured PRD covering layout, interactive elements, transitions, and visual attributes. A coding agent then implements the app with shared modules and a local data backend. The paper's central engineering choice is to make the environment resettable and inspectable: mutable app state lives in SQLite, read-only content is initialized deterministically, and local BM25 search gives reproducible retrieval behavior.
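To show why local BM25 search gives reproducible retrieval, here is a minimal, self-contained Okapi BM25 sketch. It is not the paper's implementation: the tokenizer, the `K1` and `B` values, the Lucene-style IDF variant, and the index-based tie-break are assumptions chosen for illustration.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults; the paper does not report its settings


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()


class BM25Index:
    def __init__(self, documents):
        self.docs = [tokenize(d) for d in documents]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        self.term_freqs = [Counter(d) for d in self.docs]
        doc_freq = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Lucene-style IDF, which stays positive even for very common terms.
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, length = self.term_freqs[i], len(self.docs[i])
        norm = K1 * (1 - B + B * length / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (K1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def search(self, query, top_k=3):
        """Document indices by descending score; ties break by index for determinism."""
        order = sorted(range(len(self.docs)), key=lambda i: (-self.score(query, i), i))
        return order[:top_k]


# Toy read-only content (illustrative strings).
index = BM25Index(["Project Group weekly sync", "Family group chat", "Project budget notes"])
first_run = index.search("project group")
second_run = index.search("project group")
```

Because the corpus is fixed and scoring has no randomness or network dependency, repeated queries return the same ranking, which is the property a verifier needs.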

Task Synthesis And Verification

PhoneWorld generates tasks from the artifacts already produced during app construction: app content, database schema, and UI specification. This grounding is meant to ensure that generated goals refer to visible entities, ask for feasible operations, and have deterministic verification rules. The paper uses two verifier types: answer-based checks for information-seeking tasks, and SQLite-based checks for state-changing tasks. Figure 3 shows the source prompt-template example as a local SVG rendering because the LaTeX figure is a text box rather than an external graphic.

Figure 3. Example of a synthesized verifiable task. The task asks the agent to search QQ for "Project Group" and tap Favorite; verification queries user_collections in the app's SQLite database. This illustrates why the environment construction step and verifier construction step must share the same content and schema.
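The two verifier types can be sketched as follows. Only the `user_collections` table name comes from the example above; the `item_name` column, the rule format, and the normalization used for answer matching are assumptions for illustration.

```python
import sqlite3


def verify_state(conn, sql, params, expected):
    """SQLite-based check: run a query and compare its first value to `expected`."""
    row = conn.execute(sql, params).fetchone()
    return row is not None and row[0] == expected


def verify_answer(agent_answer, reference):
    """Answer-based check: exact match after case and whitespace normalization."""
    def normalize(text):
        return " ".join(text.lower().split())
    return normalize(agent_answer) == normalize(reference)


# Toy database standing in for the mock app's state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_collections (item_name TEXT)")  # column name is assumed

RULE_SQL = "SELECT COUNT(*) FROM user_collections WHERE item_name = ?"
RULE_PARAMS = ("Project Group",)

before = verify_state(conn, RULE_SQL, RULE_PARAMS, 1)  # task not yet performed
conn.execute("INSERT INTO user_collections VALUES ('Project Group')")  # simulated Favorite tap
after = verify_state(conn, RULE_SQL, RULE_PARAMS, 1)
```

The rule only makes sense if the task generator and the app agree on the table layout and on the literal entity name, which is why both are derived from the same construction artifacts.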

Training And Evaluation Setup

The training experiments use the same Qwen3.5-9B backbone, LlamaFactory training for two epochs, normalized coordinates, and the original phone screenshot resolution \(1080 \times 2400\). The matched-budget runs keep a shared AndroidWorld base corpus of 36,193 steps fixed and vary the other 36,193 steps, so the total budget is \(36{,}193 + 36{,}193 = 72{,}386\) steps.

| Benchmark | Environment | Metric | Role in the paper |
| --- | --- | --- | --- |
| HYMobileBench | Offline | Step SR | Real-device mobile performance proxy |
| AndroidControl | Offline | Step SR | Android control transfer |
| AndroidWorld | Online real app | Task SR | Out-of-domain real-app transfer |
| PhoneWorld | Online mock apps | Task SR | In-domain PhoneWorld evaluation |

Table 2. Evaluation benchmarks used in this paper. The table matters because the central result is not merely in-domain PhoneWorld improvement; the partial-replacement setting also improves the real-app AndroidWorld benchmark and two offline control benchmarks.

Experiments And Results

Matched-Budget Partial Replacement

The main matched-budget experiment asks whether a small amount of broad PhoneWorld data improves a strong AndroidWorld-based baseline. The baseline uses 36,193 shared AndroidWorld base steps plus 36,193 auxiliary AndroidWorld steps. The 10K PhoneWorld replacement keeps the same 72,386-step total: it retains 26,193 auxiliary AndroidWorld steps and replaces the other 10,000 with verifier-confirmed PhoneWorld rollouts from all 34 apps. Table 3 is the key evidence.
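The budget bookkeeping is easy to check with a few lines of arithmetic. The function and key names below are illustrative; only the step counts come from the paper.

```python
BASE_ANDROIDWORLD = 36_193  # shared AndroidWorld base, fixed in every matched-budget run
AUX_BUDGET = 36_193         # auxiliary slot whose contents vary between runs


def matched_budget_mix(phoneworld_steps):
    """Split the auxiliary slot between AndroidWorld and PhoneWorld steps."""
    if not 0 <= phoneworld_steps <= AUX_BUDGET:
        raise ValueError("PhoneWorld steps must fit inside the auxiliary budget")
    return {
        "androidworld_base": BASE_ANDROIDWORLD,
        "androidworld_aux": AUX_BUDGET - phoneworld_steps,
        "phoneworld": phoneworld_steps,
    }


baseline = matched_budget_mix(0)
partial = matched_budget_mix(10_000)
full = matched_budget_mix(36_193)
totals = [sum(mix.values()) for mix in (baseline, partial, full)]
```

All three mixes sum to the same total, so differences between runs reflect the composition of the data and not its amount.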

| Benchmark | Baseline | 10K PhoneWorld replacement | Absolute change |
| --- | --- | --- | --- |
| HYMobileBench | 15.5 | 33.2 | +17.7 |
| AndroidControl | 53.7 | 59.7 | +6.0 |
| AndroidWorld | 56.9 | 71.6 | +14.7 |
| PhoneWorld | 12.5 | 65.0 | +52.5 |

Table 3. Matched-budget partial replacement. Replacing only 10K auxiliary AndroidWorld steps with broad PhoneWorld supervision improves all four reported benchmarks. The strongest gain is in-domain PhoneWorld, but AndroidWorld and the offline benchmarks also improve, which supports the claim that PhoneWorld is not just overfitting its own mock apps.

Full Replacement Control

Table 4 tests the endpoint where the full auxiliary AndroidWorld corpus is replaced by 36,193 PhoneWorld steps. This is useful because it separates "PhoneWorld data is useful" from "PhoneWorld data can replace every AndroidWorld signal."

| Benchmark | Baseline | Full PhoneWorld replacement | Absolute change |
| --- | --- | --- | --- |
| HYMobileBench | 15.5 | 33.2 | +17.7 |
| AndroidControl | 53.7 | 59.3 | +5.6 |
| AndroidWorld | 56.9 | 46.6 | -10.3 |
| PhoneWorld | 12.5 | 73.3 | +60.8 |

Table 4. Matched-budget full-replacement control. Full replacement strongly improves PhoneWorld and keeps gains on the two offline benchmarks, but it hurts AndroidWorld by 10.3 points. The paper interprets this as complementarity: AndroidWorld still contributes a distinct real-app transfer signal.
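The absolute-change columns in Tables 3 and 4 can be recomputed from the reported scores, which also makes the one regression easy to spot. The dictionary and function names are illustrative; the numbers are copied from the two tables.

```python
BASELINE = {"HYMobileBench": 15.5, "AndroidControl": 53.7, "AndroidWorld": 56.9, "PhoneWorld": 12.5}
PARTIAL_10K = {"HYMobileBench": 33.2, "AndroidControl": 59.7, "AndroidWorld": 71.6, "PhoneWorld": 65.0}
FULL = {"HYMobileBench": 33.2, "AndroidControl": 59.3, "AndroidWorld": 46.6, "PhoneWorld": 73.3}


def absolute_change(run, baseline=BASELINE):
    """Per-benchmark score difference in points, rounded to one decimal place."""
    return {name: round(run[name] - baseline[name], 1) for name in baseline}


def regressions(run, baseline=BASELINE):
    """Benchmarks where the run scores below the baseline."""
    return [name for name, delta in absolute_change(run, baseline).items() if delta < 0]


partial_delta = absolute_change(PARTIAL_10K)
full_delta = absolute_change(FULL)
partial_regressions = regressions(PARTIAL_10K)
full_regressions = regressions(FULL)
```

Partial replacement has no regressions, while full replacement regresses only on AndroidWorld, matching the complementarity reading above.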

Scaling Amount And App Coverage

The second scaling study adds PhoneWorld supervision on top of the shared AndroidWorld base, using 0, 10K, 20K, or 36,193 PhoneWorld steps. PhoneWorld task success rises from 14.2 to 64.2, 70.0, and 73.3. The first 10K steps give the largest jump, and later additions still help but with smaller returns.

The third scaling study fixes the PhoneWorld budget at 10K steps and varies how many apps those steps cover. Figure 4 summarizes both curves. The app-coverage result is the stronger practical point: with the same 10K PhoneWorld budget, expanding coverage from 5 to 34 apps raises PhoneWorld from 46.7 to 65.0, HYMobileBench from 14.9 to 33.2, and AndroidWorld from 61.2 to 71.6, while AndroidControl stays roughly stable.

Figure 4. Main scaling results for PhoneWorld. Left: more PhoneWorld supervision improves PhoneWorld task success. Right: under a fixed 10K PhoneWorld budget, broader app coverage yields broader cross-benchmark gains. This is the main evidence for the paper's claim that environment diversity matters, not just trajectory count.

| Added PhoneWorld steps | AndroidWorld | PhoneWorld |
| --- | --- | --- |
| 0K | 46.6 | 14.2 |
| 10K | 45.7 | 64.2 |
| 20K | 45.2 | 70.0 |
| 36K | 46.6 | 73.3 |

Table 5. Supplementary add-only analysis. Adding PhoneWorld supervision to the shared AndroidWorld base substantially improves PhoneWorld while leaving AndroidWorld roughly unchanged. This supports the paper's interpretation that the AndroidWorld drop in full replacement comes from removing auxiliary AndroidWorld data, not from adding PhoneWorld data itself.
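The diminishing returns in the add-only setting can be quantified as marginal PhoneWorld gain per 1,000 added steps. This is a back-of-the-envelope calculation on the Table 5 numbers, not an analysis from the paper. It takes the last row as 36,193 steps, the figure given in the text, since the table rounds it to 36K.

```python
# (added PhoneWorld steps, PhoneWorld task success) from the add-only analysis
ADD_ONLY = [(0, 14.2), (10_000, 64.2), (20_000, 70.0), (36_193, 73.3)]


def marginal_gain_per_1k(points):
    """Points of task success gained per 1,000 added steps in each interval."""
    gains = []
    for (steps_a, score_a), (steps_b, score_b) in zip(points, points[1:]):
        gains.append((score_b - score_a) / ((steps_b - steps_a) / 1000))
    return gains


gains = marginal_gain_per_1k(ADD_ONLY)
rounded_gains = [round(g, 2) for g in gains]
```

The first 10K steps yield roughly 5 points per 1,000 steps, and each later interval yields well under 1, which is the shape the text describes as largest jump first with smaller returns afterwards.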

Practical Takeaways

The limitations are material. PhoneWorld apps are selective abstractions rather than full replicas, the benchmark is intentionally compact and manually audited, HYMobileBench is internal, and the PhoneWorld APKs/tasks are not publicly released at submission time. The system is AI-driven, but the paper does not claim full automation; human audit remains part of the construction process.