ICML 2026 · avg 3.68 · interest 6.507 HF · multi-agent systems · efficient inference

This work analyzes hybrid multi-agent systems that combine cloud LLMs with on-device smaller language models, focusing on tradeoffs among accuracy, monetary cost, and edge energy use. By adapting representative architectures, it finds that smaller models can benefit from LLM assistance, but the optimal hybrid design is task-dependent and more frontier compute does not reliably improve performance.

Source-first digest for checked paper rank 38, rank_id p022.

Motivation / Background

The paper starts from a practical tension in agentic AI. Frontier LLMs are strong enough to run long-horizon tool workflows, but cloud API costs can grow quickly because agents repeatedly plan, act, observe, and recover. Smaller language models can run on phones or laptops and offer privacy, availability, and cost advantages, but they are weaker and have tighter context and KV-cache limits.

The paper studies whether hybrid systems can do more than route a request to either a small or large model. Its main setup gives different roles to the models: an on-device Executor performs the token-heavy ReAct loop, while a cloud Supervisor plans, verifies, replans, or advises only at scheduled intervals. The architecture map in Figure 1 is the entry point for the whole paper.

Figure 1. PEVR and EVA hybrid MAS topology.
Figure 1. System diagrams for PEVR and EVA. Original caption: "System diagrams for the PEVR (Top, a) and EVA (Top, b) Hybrid Multi-Agent Systems, with pseudocode (Bottom) for each architecture." The copied source-side graphic shows the topology panels: Supervisor $S$, Executor $E$, environment $\mathcal{E}$, observations $o^t$, actions $a^t$, and supervisor interventions. The paper source also contains the pseudocode text; I summarize its operational differences in Table 1.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | A cloud-supervised, device-executed MAS can improve over monolithic edge execution while using much less cloud spend than a monolithic cloud agent. | 4 | architecture overview, experimental setup, main trade-off curves, reverse assignment, KV efficiency |
| C2 | There is no single best hybrid MAS architecture: PEVR is better suited to stateful UI assistance, while EVA is better suited to deep-search QA. | 5 | main trade-off curves, domain mechanism analysis, verifier ablation, verifier table |
| C3 | More cloud supervision is not automatically better; too-frequent intervention can degrade performance, especially in deep search. | 5 | main trade-off curves, intervention distributions, false-positive rates |
| C4 | The architecture's restart and verification policy is a mechanism, not just a cost knob: PEVR replans are useful when plans remain actionable, while EVA advice avoids damaging restarts in search. | 4 | architecture comparison, domain mechanism analysis, summarization ablation |
| C5 | The best hybrid direction in their study is device Executor plus cloud Supervisor; reversing the roles is less accurate and more expensive than a cloud-only agent. | 4 | reverse assignment |
| C6 | Hybrid MASs are not equivalent to simple model routing: the best MAS solves some tasks that neither monolithic edge nor monolithic cloud solves. | 3 | task overlap |
| C7 | Context resets help make long-horizon agents more feasible on memory-constrained devices by reducing KV-cache growth while improving AppWorld task success. | 4 | KV-cache equation, KV efficiency table |
| C8 | The reported cost and edge-energy comparison is useful for relative design analysis, but the edge-energy side is an estimate rather than direct device measurement. | 3 | energy model, limitations |

Scores are support-from-paper scores, not independent reproduction scores. Claims about broad deployment are capped when evidence comes from a fixed model family, estimated energy rather than direct device measurement, or a limited set of benchmarks.

Core Technical Idea

The paper adapts two common multi-agent patterns to a cloud-device split. In both, the on-device Executor does the long ReAct trajectory and the cloud Supervisor only checks progress every $T_v$ steps. A supervisor intervention always resets the Executor context, replacing a growing trajectory with a compact handoff.

| Architecture | Initial cloud action | Executor loop | Verification target | Intervention payload | Why it matters |
| --- | --- | --- | --- | --- | --- |
| PEVR: Plan-Execute-Verify-Replan | Supervisor writes an initial natural-language plan. | Executor follows the plan through ReAct tool use. | Supervisor checks alignment with the plan. | A revised plan for the remaining task. | Strong when a detailed plan stays valid and early mistakes are costly. |
| EVA: Execute-Verify-Advise | No initial plan; the Executor receives the user query directly. | Executor acts directly through ReAct tool use. | Supervisor checks progress relative to the query. | A summary of prior work plus advice. | Strong when repeated replanning can disrupt search and the Executor needs lighter correction. |

Table 1. PEVR versus EVA. This table is placed here because the paper's main result is not a new model architecture; it is the interaction between role assignment, verification target, and restart prompt.

The design desiderata are explicit: cover very different model capabilities, keep multi-turn execution on device where possible, expose cloud involvement through a tunable verification interval, and keep edge contexts bounded. That design makes the verification interval a single knob that changes cost, intervention frequency, and context-reset frequency at the same time.
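The shared control flow of the two architectures can be sketched as a single loop. This is a minimal Python sketch based on the description above, not the paper's pseudocode: `run_hybrid`, `Verdict`, the demo stubs, and the exact contents of the reset context are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    ok: bool           # True when the Supervisor sees no need to intervene
    handoff: str = ""  # revised plan (PEVR) or summary plus advice (EVA)


def run_hybrid(
    query: str,
    executor_step: Callable[[List[str]], Tuple[str, bool]],
    supervisor_verify: Callable[[str, List[str]], Verdict],
    initial_plan: Optional[str],
    verify_every: int,
    max_turns: int,
) -> Tuple[List[str], int, int]:
    """Return (trajectory, intervention count, peak context length in items)."""
    pevr = initial_plan is not None
    # PEVR verifies against the current plan; EVA verifies against the query.
    target = initial_plan if pevr else query
    context = [target]
    trajectory: List[str] = []
    interventions = 0
    peak = len(context)
    for turn in range(1, max_turns + 1):
        step, done = executor_step(context)  # one on-device ReAct step
        context.append(step)
        trajectory.append(step)
        peak = max(peak, len(context))
        if done:
            break
        if turn % verify_every == 0:  # scheduled cloud check
            verdict = supervisor_verify(target, context)
            if not verdict.ok:
                interventions += 1
                if pevr:
                    target = verdict.handoff  # the replan becomes the new target
                # Assumed reset: drop the trajectory, keep query plus handoff.
                context = [query, verdict.handoff]
    return trajectory, interventions, peak


def demo_executor(context):
    # Hypothetical stand-in for the Executor: acts forever, never finishes.
    return f"step with {len(context)} context items", False


def demo_supervisor(target, context):
    # Hypothetical stand-in for the Supervisor: always requests a restart.
    return Verdict(ok=False, handoff="handoff text")


_, n_interventions, peak_items = run_hybrid(
    "user query", demo_executor, demo_supervisor, None, verify_every=2, max_turns=6
)
print(n_interventions, peak_items)
```

With these stubs the EVA-style run prints `3 4`: three interventions in six turns, and a context that never exceeds four items. The single `verify_every` argument therefore sets cloud calls, intervention opportunities, and context resets together, which is the coupling the paragraph above describes.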

Method Details

Benchmarks And Models

The experimental setup in Table 2 spans short retrieval, long information aggregation, and stateful UI execution. The cloud model is GPT-4o through Azure. Edge Executors are Qwen3 4B, 8B, 14B, and 32B served with vLLM; the 32B setting uses fp8 quantization of KV-cache and weights so it fits on one A100 for the experiments.

| Benchmark | Task character | Metric reported | Main-turn budget | Verification intervals |
| --- | --- | --- | --- | --- |
| HotpotQA | Short-horizon multi-hop Wikipedia QA | ROUGE-1 F1 | 10 | 1, 2, 3, 5 steps |
| FanOutQA | Longer fan-out information aggregation | ROUGE-1 F1 | 20 | 1, 2, 3, 5, 10 steps |
| AppWorld | Stateful application/API task execution | Test pass ratio and task success | 40 | 1, 2, 4, 8, 16 steps |

Table 2. Experimental setup. This table is reconstructed from the paper text and anchors the later results: AppWorld rewards structured, stateful execution, while HotpotQA and FanOutQA are deep-search settings where restarts can disrupt reasoning.

Cost, Energy, And KV-Cache Accounting

The paper evaluates cloud cost by summing cloud inference costs over a multi-turn trajectory. For Azure GPT-4o, it uses per-token subscription prices from the source text: 2.5 USD per 1M prefill tokens, 1.25 USD per 1M cached tokens, and 10 USD per 1M generated tokens.
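The cost accounting amounts to a per-call sum over the Supervisor's calls. The sketch below uses the three prices quoted above; the function names, the token counts in the example, and the assumption that cached prompt tokens are billed only at the cached rate (not also at the prefill rate) are mine, not the paper's.

```python
# USD per 1M tokens, as quoted in the text for Azure GPT-4o.
GPT4O_USD_PER_M = {"prefill": 2.50, "cached": 1.25, "generated": 10.00}


def call_cost_usd(prefill_tokens, cached_tokens, generated_tokens, prices=GPT4O_USD_PER_M):
    """Cost of one cloud call; cached tokens are assumed to be billed separately."""
    return (
        prefill_tokens * prices["prefill"]
        + cached_tokens * prices["cached"]
        + generated_tokens * prices["generated"]
    ) / 1_000_000


def trajectory_cost_usd(calls):
    """Sum the cost of every Supervisor call in one multi-turn trajectory."""
    return sum(call_cost_usd(*call) for call in calls)


# Hypothetical trajectory: (prefill, cached, generated) tokens per cloud call.
example_calls = [(3000, 0, 400), (500, 3000, 300)]
print(f"{trajectory_cost_usd(example_calls):.4f} USD")
```

Because generated tokens cost four times as much as prefill tokens at these prices, a Supervisor that writes long replans can cost more than one that reads a long trajectory and answers briefly.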

For edge energy, the paper uses a back-of-the-envelope operation model rather than direct device measurement. A single inference round is decomposed as:

$$ E = E_{\mathrm{prefill}} + E_{\mathrm{decode}}. $$

For a dense Transformer with $\mathcal{N}$ parameters, it approximates operations per token as:

$$ \mathrm{Ops/token} \approx 2\mathcal{N}, $$

so for $n_p$ prompt (prefill) tokens and $n_d$ generated (decode) tokens, the total operation count is:

$$ \mathrm{Ops}_{\mathrm{total}} \approx 2\mathcal{N}(n_p + n_d). $$

Given hardware efficiency $\eta$ in operations per joule, the estimated energy is:

$$ E \approx \frac{2\mathcal{N}(n_p + n_d)}{\eta}. $$

The paper's example for a 4B model with 1000 prompt tokens, 200 generated tokens, and $\eta = 1.5 \times 10^{12}$ Ops/J yields about 6.4 J. This supports relative comparison, but the paper itself notes the model omits DRAM access, CPU/runtime overhead, display and thermal costs, and sustained-efficiency loss.
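The 6.4 J figure can be checked directly from the formula. The function name below is an assumption; the inputs are the ones quoted in the example above.

```python
def edge_energy_joules(n_params, n_prompt, n_decode, ops_per_joule):
    """Estimate E = 2 * N * (n_p + n_d) / eta for one inference round."""
    return 2.0 * n_params * (n_prompt + n_decode) / ops_per_joule


energy = edge_energy_joules(
    n_params=4e9, n_prompt=1000, n_decode=200, ops_per_joule=1.5e12
)
print(f"{energy:.1f} J")  # 2 * 4e9 * 1200 / 1.5e12
```

The estimate is linear in total tokens processed, so under this model a context reset lowers energy only to the extent that it reduces the number of tokens the Executor re-reads or generates.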

For memory, the paper tracks maximum context length over each trajectory and estimates KV-cache size as:

$$ M_{\mathrm{KV}} = 2 \cdot L \cdot H_{\mathrm{KV}} \cdot d_h \cdot b_{\mathrm{act}} \cdot C, $$

where $L$ is the layer count, $H_{\mathrm{KV}}$ is the number of key-value heads, $d_h$ is the per-head dimension, $b_{\mathrm{act}}$ is bytes per cached activation, and $C$ is context length; the leading factor of 2 accounts for storing both keys and values. This equation is important because PEVR and EVA do not merely shift compute between models; they periodically shorten the active Executor context.
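Since the estimate is linear in $C$, shortening the context shrinks the cache proportionally. The sketch below evaluates the formula for a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128); these values are chosen for round numbers and are not taken from the paper or from any specific Qwen3 checkpoint.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_act, context_len):
    """M_KV = 2 * L * H_KV * d_h * b_act * C (keys and values for every layer)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_act * context_len


GIB = 1024 ** 3

# Hypothetical configuration, fp16 cache (2 bytes per activation).
full = kv_cache_bytes(32, 8, 128, bytes_per_act=2, context_len=32_768)
reset = kv_cache_bytes(32, 8, 128, bytes_per_act=2, context_len=8_192)
fp8 = kv_cache_bytes(32, 8, 128, bytes_per_act=1, context_len=32_768)

print(full / GIB, reset / GIB, fp8 / GIB)
```

For this configuration the three values are 4, 1, and 2 GiB: cutting the peak context to a quarter cuts the cache to a quarter, while switching to a 1-byte cache format only halves it.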

Experiments And Results

Main Trade-Off Curves

The central empirical result is in Figure 2: across AppWorld, FanOutQA, and HotpotQA, multi-agent configurations create new accuracy-cost operating points between monolithic edge and monolithic cloud systems.

Figure 2. Main hybrid MAS trade-off curves.
Figure 2. Experimental results comparing monolithic systems against PEVR and EVA. Original caption: "Experimental results comparing Monolithic systems (edge-only and cloud-only) against Multi-Agent systems (PEVR and EVA). We present one row per benchmark, compare performance against both API cost in USD (on the two leftmost columns), and energy cost in Joules (on the two rightmost columns). For MASs, each line corresponds to a sweep on the verification interval." This figure is the primary evidence for C1, C2, and C3.

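The key observations map onto claims C1 to C3: hybrid configurations improve over monolithic edge execution at much lower cloud spend than a cloud-only agent, the better architecture depends on the domain (PEVR for AppWorld, EVA for deep-search QA), and more frequent cloud supervision does not reliably improve performance.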

Why PEVR And EVA Split By Domain

Figure 3 makes the mechanism visible. On AppWorld, PEVR and EVA have comparable intervention-count distributions, but PEVR achieves better pass ratios at fixed intervention counts. The paper attributes this to actionable initial plans and replans in a stateful environment where early mistakes are costly. On FanOutQA, restarts correlate with worse performance, and PEVR intervenes more often; EVA's more conservative, advice-based behavior is less disruptive.

Figure 3. Supervisor intervention counts and performance.
Figure 3. PEVR and EVA performance by number of Supervisor interventions. Original caption: "Comparison of PEVR and EVA performance as a function of the number of Supervisor interventions (top), and the distribution of interventions (bottom). Results obtained using Qwen 3 14B as Edge model on AppWorld and FanOutQA, respectively with verification interval of 8 and 3, and max number of turns 40 and 20." I include it because it supports the paper's claim that the restart/intervention policy is a causal-looking mechanism behind the domain split.

| Benchmark | Verifier | False negative | False positive |
| --- | --- | --- | --- |
| AppWorld | Plan-based (PEVR) | 5.3% | 6.2% |
| AppWorld | Query-based (EVA) | 6.0% | 1.9% |
| FanOutQA | Plan-based (PEVR) | 8.4% | 7.7% |
| FanOutQA | Query-based (EVA) | 14.8% | 6.1% |

Table 3. Verifier ablation. The paper labels FanOutQA tasks as successful if ROUGE-1 F1 is greater than 0.5. The table supports C3 and C4: PEVR has fewer false negatives but more false positives, and false positives matter when an unnecessary intervention triggers a harmful restart.

Role Assignment: Why Cloud Supervisor Plus Device Executor Wins

Table 4 compares the intended hybrid direction against the reversed alternative. The intended direction keeps the token-heavy Executor on device and uses GPT-4o as an intermittent Supervisor. The reverse direction puts GPT-4o in the Executor role and uses Qwen as Supervisor. In these experiments, the reverse direction never exceeds cloud-only accuracy on either benchmark and is substantially more expensive on AppWorld (0.58 to 0.79 USD versus 0.37 USD for cloud-only).

| Executor | Supervisor | AppWorld task success | AppWorld cost (USD) | FanOutQA ROUGE-1 F1 | FanOutQA cost (USD) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | none | 0.25 | 0.37 | 0.14 | 0.19 |
| Qwen 32B | GPT-4o | 0.21 | 0.09 | 0.23 | 0.11 |
| Qwen 14B | GPT-4o | 0.19 | 0.08 | 0.12 | 0.04 |
| Qwen 8B | GPT-4o | 0.16 | 0.08 | 0.09 | 0.04 |
| Qwen 4B | GPT-4o | 0.11 | 0.13 | 0.06 | 0.04 |
| GPT-4o | Qwen 32B | 0.25 | 0.67 | 0.14 | 0.17 |
| GPT-4o | Qwen 14B | 0.19 | 0.79 | 0.10 | 0.17 |
| GPT-4o | Qwen 8B | 0.21 | 0.58 | 0.13 | 0.17 |
| GPT-4o | Qwen 4B | 0.22 | 0.61 | 0.13 | 0.17 |
| Qwen 32B | none | 0.07 | 0.00 | 0.15 | 0.00 |

Table 4. Reverse role assignment. This table was recovered from latex_flattened/main.flattened.tex because paper.md leaves the converted table body empty. It is strong evidence for C5: if the expensive model performs the whole execution trajectory and the cheap model only supervises, the system loses the cost advantage and does not recover better accuracy.
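To read the AppWorld columns of Table 4 relative to the cloud-only baseline, a short script is enough. The numbers are copied from the table; the function name and the choice to report cost ratio and success difference are my own framing, not an analysis from the paper.

```python
# (executor, supervisor, AppWorld task success, AppWorld cost in USD), from Table 4.
CLOUD_ONLY = ("GPT-4o", None, 0.25, 0.37)
FORWARD = [  # device Executor, cloud Supervisor
    ("Qwen 32B", "GPT-4o", 0.21, 0.09),
    ("Qwen 14B", "GPT-4o", 0.19, 0.08),
    ("Qwen 8B", "GPT-4o", 0.16, 0.08),
    ("Qwen 4B", "GPT-4o", 0.11, 0.13),
]
REVERSE = [  # cloud Executor, device Supervisor
    ("GPT-4o", "Qwen 32B", 0.25, 0.67),
    ("GPT-4o", "Qwen 14B", 0.19, 0.79),
    ("GPT-4o", "Qwen 8B", 0.21, 0.58),
    ("GPT-4o", "Qwen 4B", 0.22, 0.61),
]


def relative_to_baseline(rows, baseline=CLOUD_ONLY):
    """Return (executor, supervisor, cost ratio, success difference) per row."""
    _, _, base_success, base_cost = baseline
    return [
        (ex, sup, round(cost / base_cost, 2), round(success - base_success, 2))
        for ex, sup, success, cost in rows
    ]


for label, rows in (("forward", FORWARD), ("reverse", REVERSE)):
    for ex, sup, cost_ratio, success_diff in relative_to_baseline(rows):
        print(f"{label:8s} {ex:9s} + {sup:9s} cost x{cost_ratio:<5} success {success_diff:+.2f}")
```

Every forward configuration costs less than the cloud-only baseline on AppWorld, while every reverse configuration costs more without exceeding its task success.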

Hybrid MASs Are Not Just Model Routing

The overlap analysis in Figure 4 compares the best monolithic cloud agent, the best monolithic edge agent, and the best MAS configuration. The result is not that one system dominates. Each solves some tasks the others miss, and the MAS solves nontrivial unique subsets: 20 AppWorld tasks, 49 FanOutQA tasks, and 26 HotpotQA tasks are shown in the MAS-only regions.

Figure 4. Unique task completions by edge, cloud, and MAS systems.
Figure 4. Venn diagram of unique completed tasks. Original caption: "Venn diagram showing how many unique test tasks were completed by monolithic Edge, Cloud and MAS architectures. The Edge model is Qwen 3 14B with verification interval of 3 for AppWorld and 8 for FanOutQA. All systems show unique capabilities at solving different tasks." This supports C6, but I score it as 3 because it is an overlap analysis rather than a mechanistic proof that a routing policy could not learn the same choices.
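The overlap analysis reduces to set arithmetic over solved task IDs. The sketch below uses made-up task IDs purely for illustration; the function name and the "oracle router" framing are assumptions. The point is that even a perfect router choosing between the edge and cloud agents can solve at most the union of their tasks, so the MAS-only region is what routing cannot reach.

```python
def venn_regions(edge, cloud, mas):
    """Split solved-task sets into exclusive regions of a three-way Venn diagram."""
    return {
        "edge_only": edge - cloud - mas,
        "cloud_only": cloud - edge - mas,
        "mas_only": mas - edge - cloud,
        "all_three": edge & cloud & mas,
    }


# Hypothetical task IDs, not data from the paper.
edge_solved = {"t1", "t2", "t3"}
cloud_solved = {"t2", "t3", "t4", "t5"}
mas_solved = {"t3", "t5", "t6", "t7"}

regions = venn_regions(edge_solved, cloud_solved, mas_solved)
oracle_router = edge_solved | cloud_solved  # best case for edge-or-cloud routing

print(sorted(regions["mas_only"]))
print(len(oracle_router), len(oracle_router | mas_solved))
```

In this toy example the MAS adds two tasks beyond the oracle router's reach, which is the kind of evidence Figure 4 offers for C6.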

Context Efficiency And KV-Cache Growth

The KV-cache results in Table 5 are the strongest memory-side evidence. On AppWorld, PEVR has higher task success and a lower maximum KV-cache footprint than monolithic execution at every turn budget, and the memory gap widens as the budget grows. The memory saving is largest for Qwen3 32B at 80 turns: PEVR reports 0.19 task success and 7.90 GB KV-cache, versus monolithic 0.09 task success and 13.12 GB.

| Max turns | Architecture | Qwen3 8B task success | Qwen3 8B KV (GB) | Qwen3 32B task success | Qwen3 32B KV (GB) |
| --- | --- | --- | --- | --- | --- |
| 20 | Monolithic | 0.00 | 3.52 | 0.05 | 6.59 |
| 20 | PEVR | 0.07 | 3.34 | 0.18 | 6.53 |
| 40 | Monolithic | 0.02 | 4.82 | 0.07 | 11.34 |
| 40 | PEVR | 0.09 | 3.65 | 0.16 | 6.98 |
| 80 | Monolithic | 0.00 | 5.17 | 0.09 | 13.12 |
| 80 | PEVR | 0.11 | 3.82 | 0.19 | 7.90 |

Table 5. Context efficiency on AppWorld. This table was recovered from latex_flattened/main.flattened.tex because the numeric body was missing from paper.md. It directly supports C7 by tying context resets to lower KV-cache growth and higher task success.
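The memory side of Table 5 can be summarized as absolute and relative savings per setting. The KV-cache values are copied from the table; the helper name and the percentage framing are mine.

```python
# (max turns, model) -> (monolithic KV GB, PEVR KV GB), copied from Table 5.
KV_GB = {
    (20, "8B"): (3.52, 3.34),
    (40, "8B"): (4.82, 3.65),
    (80, "8B"): (5.17, 3.82),
    (20, "32B"): (6.59, 6.53),
    (40, "32B"): (11.34, 6.98),
    (80, "32B"): (13.12, 7.90),
}


def kv_savings(table):
    """Return {setting: (GB saved by PEVR, percent saved)} for each setting."""
    return {
        key: (round(mono - pevr, 2), round(100 * (mono - pevr) / mono, 1))
        for key, (mono, pevr) in table.items()
    }


savings = kv_savings(KV_GB)
largest = max(savings, key=lambda key: savings[key][0])
for key in sorted(savings):
    print(key, savings[key])
print("largest absolute saving:", largest)
```

The savings are small at 20 turns and grow with the turn budget for both models, consistent with the reading that periodic resets cap context growth that monolithic execution accumulates.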

Summarization Is Not The Whole EVA Story

Figure 5 tests whether EVA's advantage on FanOutQA comes mainly from summarizing prior work after a restart. Removing summarization does not substantially change EVA, while PEVR still degrades as cloud cost/interventions rise. The paper concludes that the PEVR/EVA gap is more likely caused by verifier criteria and restart prompt type than by summarization alone.

Figure 5. EVA summarization ablation.
Figure 5. EVA with and without summarization. Original caption: "Comparison of PEVR, EVA, and EVA without summarization. Results obtained using Qwen3-14B as Executor on FanOutQA. Removing the summarization feature does not significantly affect the performance of EVA." This figure supports C4 by separating summarization from the broader advice-versus-replan restart policy.

Limitations

The paper is careful about several limits. It evaluates three domains, but does not include robotics or coding agents. It fixes the cloud model to GPT-4o and the edge family to Qwen3 variants. The cost and compute burden led the authors to prioritize breadth over multi-seed repetitions. The edge energy model is explicitly a comparative estimate, not direct measurement on physical mobile or laptop hardware.

Practical Takeaways