PhoneWorld: Scaling Phone-Use Agent Environments

Source-first digest for checked paper rank 29, rank_id p058.

Routing status: success
PDF extraction: not used
Table recovery: latex_flattened/main.flattened.tex was consulted only to recover the missing numeric bodies of tab:main_result and tab:replacement.

Motivation / Background

Phone-use agents operate through screenshots, touch actions, mobile navigation, and app state rather than through stable APIs. The paper argues that this makes environment supply a bottleneck: real mobile apps are dynamic, difficult to reset, and expensive to turn into reproducible evaluation and training setups. Existing systems such as AndroidWorld, MobileWorld, and A3 make mobile-agent evaluation more reliable, but they do not solve the upstream problem of repeatedly constructing new controllable phone-use environments.

PhoneWorld addresses that upstream bottleneck. It converts real GUI trajectories and representative screenshots into runnable mock Android apps, executable tasks, rule-based verifiers, and verifier-confirmed training rollouts. The key shift is that trajectories are not treated only as demonstrations; they are used to decide which screens matter, how navigation flows between screens, what state changes must persist, and which user goals can be checked automatically. Figure 1 shows the full source-to-environment pipeline, and Table 1 summarizes the scale of the current suite.

**Figure 1. PhoneWorld turns real GUI traces into runnable phone-use environments.** The original caption describes five stages: real-app traces, structure recovery, build specification, app construction, and tasks/verification/rollouts. I place it here because it is the clearest evidence for the paper's claim that PhoneWorld is infrastructure for both evaluation and training, not just a fixed benchmark.

Component	Summary
Apps	34 mock Android apps across 16 domains
Reusable modules	18 shared modules reused across app families
Benchmark	120 audited tasks: 102 single-app + 18 cross-app
Verification	Answer-based checks and SQLite-based state checks
Task pool	7,936 generated tasks used for large-scale rollout generation
Training rollouts	3,354 successful episodes / 36,193 interaction steps

Table 1. Summary of the PhoneWorld suite. This table grounds the scale claims: PhoneWorld is not only an app-construction recipe, but a concrete 34-app suite with held-out evaluation tasks and successful rollout data.

Claims And Evidence

Claim id	Main claim	Support score (max 5)	Evidence anchors
C1	Real trajectories plus screenshots can be converted into controllable, resettable phone-use environments with tasks, verifiers, and rollouts.	4	pipeline, worked example, worked-example figure, prompt verifier, suite
C2	The current PhoneWorld instantiation has enough breadth to test environment scaling: 34 apps, 16 domains, 18 reusable modules, and held-out audited tasks.	5	suite, evaluation benchmarks, evaluation setup
C3	Under a matched total training budget, replacing 10K auxiliary AndroidWorld steps with broad PhoneWorld supervision improves all four reported benchmarks.	5	partial replacement
C4	PhoneWorld supervision is strong but complementary to AndroidWorld supervision rather than a wholesale replacement.	4	full replacement, add-only control, add-only table
C5	Under a fixed 10K PhoneWorld budget, broader app coverage is the strongest reported scaling signal.	4	scaling figure
C6	The construction process is scalable but not fully automatic; human audit and selective abstraction remain part of the system.	3	construction loop, limitations

Scores are support-from-paper scores, not independent reproduction scores. I cap the broad scaling and construction-process claims below 5 where the paper gives strong experimental evidence but does not fully expose artifacts, app-level QA statistics, or public benchmark access.

Core Technical Idea

PhoneWorld is an environment factory for phone-use agents. For each target app, it takes two source signals:

representative screenshots, which define visual layout and visible content;
real exploratory-use trajectories, which define page importance, navigation flow, state-changing actions, and recurrent user goals.

The pipeline recovers a page taxonomy, classifies screenshots into that taxonomy, estimates page visitation frequency, and builds a weighted page-transition graph. High-frequency pages become P0 build targets, moderate-frequency pages become P1, and long-tail pages become P2 unless tasks require them. This prioritization matters because the system does not aim to clone every app feature; it preserves the screens, transitions, visible content, and mutable actions that matter for mobile agents.
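The frequency-to-priority step can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name `assign_priorities`, the cutoffs `p0_min` and `p1_min`, the choice to promote task-required long-tail pages to P1, and the sample page labels are all assumptions, since the digest does not report concrete thresholds.

```python
from collections import Counter

def assign_priorities(page_visits, task_required=(), p0_min=0.15, p1_min=0.05):
    """Map each page type to P0/P1/P2 from its share of visits.

    page_visits: iterable of page-type labels, one per classified screenshot.
    task_required: page types that tasks depend on; these never stay at P2.
    p0_min, p1_min: hypothetical frequency cutoffs (not taken from the paper).
    """
    counts = Counter(page_visits)
    total = sum(counts.values())
    priorities = {}
    for page, n in counts.items():
        share = n / total
        if share >= p0_min:
            tier = "P0"
        elif share >= p1_min:
            tier = "P1"
        else:
            tier = "P2"
        # Assumption: a long-tail page that a task needs is promoted to P1.
        if tier == "P2" and page in task_required:
            tier = "P1"
        priorities[page] = tier
    return priorities

# Made-up visit counts for a messaging-style app (100 screenshots in total).
visits = (
    ["chat_list"] * 50 + ["chat"] * 30 + ["profile"] * 12
    + ["settings"] * 6 + ["about"] * 1 + ["legal"] * 1
)
tiers = assign_priorities(visits, task_required={"about"})
```

With these made-up counts, `chat_list` and `chat` land in P0, `profile` and `settings` in P1, `about` is promoted because a task requires it, and `legal` stays in P2.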

**Figure 2. Worked example of constructing a QQ-like PhoneWorld environment.** The paper uses this example to show the path from high-frequency real-app pages to page priorities, transition graph, generated mock pages, and a SQLite-checkable task. It is the most concrete source-side evidence for the environment-construction workflow.

The build specification layer converts recovered structure into per-page PRDs, reusable UI components, and a data architecture. The data layer separates read-only app content from mutable state in a resettable SQLite database. Read-only content lets agents browse and search realistic data offline; mutable tables record state changes such as favorites, cart edits, comments, messages, or profile updates. The same database is later used by verifiers.
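The split between read-only content and mutable state can be illustrated with a small resettable database. This is a sketch under stated assumptions: the real apps are Android/Kotlin, whereas Python's standard `sqlite3` module is used here for brevity, and the table names `products` and `cart_items`, the seed rows, and the function `reset_environment` are hypothetical.

```python
import sqlite3

# Hypothetical read-only catalog content, seeded identically on every reset.
READ_ONLY_SEED = [(1, "Wireless Earbuds"), (2, "Phone Case")]

def reset_environment(conn):
    """Restore the database to its initial state before an episode."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS products;
        DROP TABLE IF EXISTS cart_items;
        -- Read-only content the agent can browse and search.
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- Mutable state that agent actions write to and verifiers read.
        CREATE TABLE cart_items (product_id INTEGER NOT NULL, quantity INTEGER NOT NULL);
        """
    )
    conn.executemany("INSERT INTO products VALUES (?, ?)", READ_ONLY_SEED)
    conn.commit()

conn = sqlite3.connect(":memory:")
reset_environment(conn)

# Simulate an agent action that mutates state.
conn.execute("INSERT INTO cart_items VALUES (1, 2)")
conn.commit()
cart_before = conn.execute("SELECT COUNT(*) FROM cart_items").fetchone()[0]

# Resetting wipes mutable state and re-seeds the read-only content.
reset_environment(conn)
cart_after = conn.execute("SELECT COUNT(*) FROM cart_items").fetchone()[0]
product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Because the seed data is fixed, every episode starts from the same state, which is what lets a verifier attribute any later change in the mutable tables to the agent.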

The construction loop is deliberately pragmatic: a coding agent writes Kotlin/Jetpack Compose code, compiles the APK, runs a self-review checklist, and fixes issues such as schema mismatches, dead buttons, and missing routes. Across apps, recurring UI and logic patterns are promoted into reusable modules. The paper reports 18 modules, including search, feeds, comments, cart/checkout, address management, messaging, settings, and media player patterns.

Method Details

Structure Recovery And Build Specification

PhoneWorld first asks a coding/vision-language process to identify recurring page types from screenshots. A lightweight vision-language model then classifies the screenshot corpus in parallel. The classified trajectories yield a page-frequency distribution and a page-transition graph. Those two artifacts guide what the mock app must preserve: high-use screens, common navigation paths, and stateful interactions that tasks will later require.
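Once each screenshot carries a page-type label, the two artifacts reduce to counting. The sketch below assumes trajectories arrive as ordered lists of labels; the function name `build_transition_graph`, the sample labels, and the choice to ignore consecutive steps on the same page type are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def build_transition_graph(trajectories):
    """Count page visits and page-to-page transitions.

    trajectories: list of lists of page-type labels in visit order.
    Returns (page_freq, edges): page_freq maps page -> share of all visits,
    edges maps (src, dst) -> number of observed transitions.
    """
    visits = Counter()
    edges = Counter()
    for traj in trajectories:
        visits.update(traj)
        for src, dst in zip(traj, traj[1:]):
            if src != dst:  # assumption: staying on the same page type is not an edge
                edges[(src, dst)] += 1
    total = sum(visits.values())
    page_freq = {page: n / total for page, n in visits.items()}
    return page_freq, dict(edges)

# Made-up classified trajectories for a shopping-style app.
trajectories = [
    ["home", "search", "results", "detail"],
    ["home", "search", "results", "results", "detail", "cart"],
    ["home", "cart"],
]
page_freq, edges = build_transition_graph(trajectories)
```

The edge weights indicate which navigation paths the mock app must support, while the frequency distribution indicates which pages deserve the most build effort.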

For each prioritized page, a vision-language model writes a structured PRD covering layout, interactive elements, transitions, and visual attributes. A coding agent then implements the app with shared modules and a local data backend. The paper's central engineering choice is to make the environment resettable and inspectable: mutable app state lives in SQLite, read-only content is initialized deterministically, and local BM25 search gives reproducible retrieval behavior.
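To show why a local BM25 index gives reproducible retrieval, here is a compact Okapi BM25 scorer. It is an illustrative sketch, not the paper's code: the class name `BM25Index`, whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the sample documents are all assumptions.

```python
import math
from collections import Counter

class BM25Index:
    """Tiny in-memory Okapi BM25 index over whitespace-tokenized documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [doc.lower().split() for doc in docs]
        self.tf = [Counter(tokens) for tokens in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        df = Counter(term for tokens in self.docs for term in set(tokens))
        n = len(self.docs)
        # IDF variant with +1 inside the log, so scores stay non-negative.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        norm = self.k1 * (1 - self.b + self.b * len(self.docs[i]) / self.avgdl)
        total = 0.0
        for term in query.lower().split():
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, top_k=3):
        """Return indices of matching documents, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        # Sort by score descending, then by index, so ties break deterministically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for s, i in scored[:top_k] if s > 0]

docs = ["Project Group weekly sync", "Family group chat", "Project budget notes", "Weather alerts"]
index = BM25Index(docs)
hits = index.search("project group")
```

There is no network call and no learned ranker, so the same query over the same seeded content always returns the same ordering; the explicit tie-break on document index removes the remaining source of nondeterminism.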

Task Synthesis And Verification

PhoneWorld generates tasks from the artifacts already produced during app construction: app content, database schema, and UI specification. This grounding is meant to ensure that generated goals refer to visible entities, ask for feasible operations, and have deterministic verification rules. The paper uses two verifier types: answer-based checks for information-seeking tasks, and SQLite-based checks for state-changing tasks. Figure 3 shows the source prompt-template example as a local SVG rendering because the LaTeX figure is a text box rather than an external graphic.

**Figure 3. Example of a synthesized verifiable task.** The task asks the agent to search QQ for "Project Group" and tap Favorite; verification queries `user_collections` in the app's SQLite database. This illustrates why the environment construction step and verifier construction step must share the same content and schema.
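The two verifier types can be sketched as small rule functions. The paper names the `user_collections` table, but the column `item_name`, the answer-normalization rule, and the function names `verify_answer` and `verify_favorited` are hypothetical choices made only for this illustration.

```python
import sqlite3

def verify_answer(agent_answer, expected):
    """Answer-based check: compare after trimming whitespace and lowercasing."""
    return agent_answer.strip().lower() == expected.strip().lower()

def verify_favorited(conn, item_name):
    """State-based check: the item must appear in user_collections."""
    row = conn.execute(
        "SELECT COUNT(*) FROM user_collections WHERE item_name = ?", (item_name,)
    ).fetchone()
    return row[0] > 0

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the digest gives the table name but not its columns.
conn.execute("CREATE TABLE user_collections (item_name TEXT NOT NULL)")

before = verify_favorited(conn, "Project Group")
# Simulate the state change caused by the agent tapping Favorite.
conn.execute("INSERT INTO user_collections VALUES (?)", ("Project Group",))
after = verify_favorited(conn, "Project Group")
```

The state-based check inspects the database rather than the screen, so it passes only if the action actually persisted, regardless of what the final screenshot shows.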

Training And Evaluation Setup

All training experiments use the same Qwen3.5-9B backbone, two epochs of LlamaFactory training, normalized coordinates, and the original phone screenshot resolution \(1080 \times 2400\). The matched-budget runs hold a shared AndroidWorld base corpus of 36,193 steps fixed and vary only the composition of the remaining 36,193 auxiliary steps, so the total budget is \(36{,}193 + 36{,}193 = 72{,}386\) steps.

Benchmark	Environment	Metric	Role in the paper
HYMobileBench	Offline	Step SR	Real-device mobile performance proxy
AndroidControl	Offline	Step SR	Android control transfer
AndroidWorld	Online real app	Task SR	Out-of-domain real-app transfer
PhoneWorld	Online mock apps	Task SR	In-domain PhoneWorld evaluation

Table 2. Evaluation benchmarks used in this paper. The table matters because the central result is not merely in-domain PhoneWorld improvement; the partial-replacement setting also improves the real-app AndroidWorld benchmark and two offline control benchmarks.

Experiments And Results

Matched-Budget Partial Replacement

The main matched-budget experiment asks whether a small amount of broad PhoneWorld data improves a strong AndroidWorld-based baseline. The baseline uses 36,193 shared AndroidWorld base steps plus 36,193 auxiliary AndroidWorld steps. The 10K PhoneWorld replacement keeps the same 72,386-step total, retains 26,193 auxiliary AndroidWorld steps, and replaces 10,000 auxiliary steps with verifier-confirmed PhoneWorld rollouts from all 34 apps. Table 3 is the key evidence.

Benchmark	Baseline	10K PhoneWorld replacement	Absolute change (points)
HYMobileBench	15.5	33.2	+17.7
AndroidControl	53.7	59.7	+6.0
AndroidWorld	56.9	71.6	+14.7
PhoneWorld	12.5	65.0	+52.5

Table 3. Matched-budget partial replacement. Replacing only 10K auxiliary AndroidWorld steps with broad PhoneWorld supervision improves all four reported benchmarks. The strongest gain is in-domain PhoneWorld, but AndroidWorld and the offline benchmarks also improve, which supports the claim that PhoneWorld is not just overfitting its own mock apps.
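The budget accounting and the reported deltas can be rechecked with a few lines of arithmetic. The step counts and scores below are copied from the text and Table 3; the variable names are my own.

```python
BASE_STEPS = 36_193        # shared AndroidWorld base corpus
AUX_STEPS = 36_193         # auxiliary slot whose composition varies
PHONEWORLD_STEPS = 10_000  # PhoneWorld rollouts swapped into the auxiliary slot

baseline_mix = {"androidworld_base": BASE_STEPS, "androidworld_aux": AUX_STEPS}
replacement_mix = {
    "androidworld_base": BASE_STEPS,
    "androidworld_aux": AUX_STEPS - PHONEWORLD_STEPS,
    "phoneworld": PHONEWORLD_STEPS,
}

# benchmark: (baseline, 10K PhoneWorld replacement), as reported in Table 3
results = {
    "HYMobileBench": (15.5, 33.2),
    "AndroidControl": (53.7, 59.7),
    "AndroidWorld": (56.9, 71.6),
    "PhoneWorld": (12.5, 65.0),
}
deltas = {name: round(new - old, 1) for name, (old, new) in results.items()}
all_improved = all(delta > 0 for delta in deltas.values())
```

Both mixes sum to 72,386 steps, so the comparison isolates data composition rather than data volume, and the recomputed deltas match the table's last column.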

Full Replacement Control

Table 4 tests the endpoint where the full auxiliary AndroidWorld corpus is replaced by 36,193 PhoneWorld steps. This is useful because it separates "PhoneWorld data is useful" from "PhoneWorld data can replace every AndroidWorld signal."

Benchmark	Baseline	Full PhoneWorld replacement	Absolute change (points)
HYMobileBench	15.5	33.2	+17.7
AndroidControl	53.7	59.3	+5.6
AndroidWorld	56.9	46.6	-10.3
PhoneWorld	12.5	73.3	+60.8

Table 4. Matched-budget full-replacement control. Full replacement strongly improves PhoneWorld and keeps gains on the two offline benchmarks, but it hurts AndroidWorld by 10.3 points. The paper interprets this as complementarity: AndroidWorld still contributes a distinct real-app transfer signal.

Scaling Amount And App Coverage

The data-amount scaling study adds PhoneWorld supervision on top of the shared AndroidWorld base, using 0, 10K, 20K, or 36,193 PhoneWorld steps. PhoneWorld task success rises from 14.2 to 64.2, 70.0, and 73.3, respectively. The first 10K steps give the largest jump; later additions still help, but with smaller returns.

The app-coverage scaling study fixes the PhoneWorld budget at 10K steps and varies how many apps those steps cover. Figure 4 summarizes both curves. The app-coverage result is the stronger practical point: with the same 10K PhoneWorld budget, expanding coverage from 5 to 34 apps raises PhoneWorld from 46.7 to 65.0, HYMobileBench from 14.9 to 33.2, and AndroidWorld from 61.2 to 71.6, while AndroidControl stays roughly stable.

**Figure 4. Main scaling results for PhoneWorld.** Left: more PhoneWorld supervision improves PhoneWorld task success. Right: under a fixed 10K PhoneWorld budget, broader app coverage yields broader cross-benchmark gains. This is the main evidence for the paper's claim that environment diversity matters, not just trajectory count.

Added PhoneWorld steps	AndroidWorld	PhoneWorld
0K	46.6	14.2
10K	45.7	64.2
20K	45.2	70.0
36K	46.6	73.3

Table 5. Supplementary add-only analysis. Adding PhoneWorld supervision to the shared AndroidWorld base substantially improves PhoneWorld while leaving AndroidWorld roughly unchanged. This supports the paper's interpretation that the AndroidWorld drop in full replacement comes from removing auxiliary AndroidWorld data, not from adding PhoneWorld data itself.
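The diminishing returns in the add-only setting become explicit when the gain is normalized per 1K added steps. This is my own arithmetic on the Table 5 PhoneWorld column, not a metric the paper reports, and it uses 36,193 as the exact step count behind the 36K row.

```python
# (added PhoneWorld steps, PhoneWorld task SR) from Table 5
add_only = [(0, 14.2), (10_000, 64.2), (20_000, 70.0), (36_193, 73.3)]

marginal = []
for (s0, sr0), (s1, sr1) in zip(add_only, add_only[1:]):
    # Task-SR points gained per 1,000 additional PhoneWorld steps.
    gain_per_1k = (sr1 - sr0) / ((s1 - s0) / 1000)
    marginal.append(round(gain_per_1k, 2))
```

The three intervals yield roughly 5.0, 0.58, and 0.20 points per 1K steps, which quantifies the observation that the first 10K steps carry most of the in-domain gain.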

Practical Takeaways

The most reusable idea is the environment-construction loop, not any single benchmark number: use real traces to identify the functional app skeleton, build resettable state, and generate verifier-grounded tasks from the same schema.
PhoneWorld's strongest result is the 10K partial replacement setting. It improves all four reported benchmarks under the same training budget, so the gain cannot be explained by simply adding more steps.
Environment breadth appears more valuable than narrow data volume under a fixed PhoneWorld budget. The 5-to-34 app comparison is the cleanest evidence for scaling coverage.
Full replacement is a warning against treating controllable mock apps as a substitute for real-app data. PhoneWorld contributes broad consumer-behavior supervision; AndroidWorld contributes a different real-app transfer signal.
SQLite-backed mutable state is the important verifier design pattern. It makes tasks resettable, permits deterministic success checks, and creates a direct path from evaluation to successful-rollout harvesting.

The limitations are material. PhoneWorld apps are selective abstractions rather than full replicas, the benchmark is intentionally compact and manually audited, HYMobileBench is internal, and the PhoneWorld APKs/tasks are not publicly released at submission time. The system is AI-driven, but the paper does not claim full automation; human audit remains part of the construction process.