arXiv | 2026 | avg 4.55 | interest 8.00 | 9 HF | tool-use agents | agent benchmarks | temporal coordination

AsyncTool evaluates LLM agents' asynchronous tool calling in multi-task settings where tool responses are delayed. The benchmark introduces heterogeneous concurrent tasks and efficiency metrics, and finds that latency substantially degrades current agents unless they coordinate task switching, dependency tracking, and state maintenance.

Source-first digest for checked paper rank 22, rank_id p021.

Motivation / Background

Modern tool-using agents are usually evaluated as if one task is active at a time and tool calls return immediately. AsyncTool argues that this misses a central real-world constraint: external function calls can take time, and a useful agent should make progress on other independent tasks while waiting for delayed results.

The benchmark reframes the agent as a concurrent scheduler. Each task has an ordered tool-call trajectory with intra-task dependencies, while different tasks can be interleaved. The agent must call the right tool with valid arguments, preserve task state across pending calls, resume a task only when its dependency has returned, and avoid confusing tools or state across tasks. Figure 1 shows how the benchmark is built, and Figure 2 illustrates the asynchronous executor interaction.

Figure 1. Dataset construction pipeline. The original caption says the pipeline collects raw data from tool-use benchmarks, categorizes it by scenario, reinforces task-step dependencies, reconstructs execution trajectories with Gemini 2.5 Pro, manually verifies accuracy and determinism, and uses hybrid evolution plus filtering to produce the final multitasking dataset. I place it here because it is the paper's central evidence for the benchmark construction claim.
Figure 2. Asynchronous executor example. The paper's example gives the agent two tasks at once. When a result for one task is pending, the agent can switch to an independent task; when the delayed result becomes available, it can resume the dependent trajectory. This figure supports the claim that AsyncTool evaluates temporal coordination rather than plain sequential tool calling.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
|---|---|---|---|
| C1 | AsyncTool fills a real benchmark gap by jointly testing function calls, multi-task execution, multi-step trajectories, cross-scenario composition, and an asynchronous executor. | 4 | motivation, benchmark comparison, data construction |
| C2 | The dataset is deliberately constructed from validated single-task trajectories and then composed into diverse two-task and three-task asynchronous settings. | 4 | data pipeline, task inventory, task-type split, composition table |
| C3 | The evaluation protocol captures correctness at step, sub-task, and task levels, not only final answer success. | 5 | metric definitions, trajectory equations, main results |
| C4 | Current agents struggle under asynchronous delayed feedback; even strong models achieve modest end-to-end overall scores. | 5 | main results, latency ablation, task-count ablation |
| C5 | Good asynchronous performance requires the right kind of switching: frequent switching alone is insufficient without state maintenance and dependency tracking. | 4 | scheduling trade-off, error modes, score figure, score visualization |
| C6 | Existing function-calling strength, as measured by BFCL, does not fully predict AsyncTool performance. | 3 | BFCL comparison, BFCL score plot |
| C7 | Few-shot trajectory examples help some smaller open models but do not solve the benchmark. | 4 | few-shot table |

Scores are support-from-paper scores, not independent reproduction scores. Dataset novelty is capped below 5 because it is substantiated by feature comparison and construction details rather than an external benchmark audit; core metric and result claims receive higher support because the paper gives explicit definitions and quantitative tables.

Core Technical Idea

AsyncTool is not a new agent architecture. It is an evaluation environment for agents that must juggle multiple delayed tool-use tasks. The important object is a sub-task tuple:

$$ S_i = (I_i, Q_i, T_i, E_i), $$

where \(I_i\) is the task identifier, \(Q_i\) is the task query, \(T_i\) is the available API list, and \(E_i\) is the hidden environment state. The model response must explicitly include \(I_i\), so the executor can tell which pending sub-task the agent is trying to advance.
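
As a minimal sketch of how the tuple might be held in code, the following Python uses a dataclass with one field per component. The class name `SubTask`, the field names, the `route_response` helper, and the response format (a dict with a `task_id` key) are all assumptions made for illustration; the paper's actual data schema is not given in this digest.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SubTask:
    """One sub-task S_i = (I_i, Q_i, T_i, E_i). Field names are hypothetical."""
    task_id: str                                   # I_i: identifier the agent must echo back
    query: str                                     # Q_i: natural-language task query
    tools: list                                    # T_i: names of the available APIs
    env_state: dict = field(default_factory=dict)  # E_i: environment state, hidden from the agent


def route_response(response: dict, subtasks: dict) -> SubTask:
    """Return the sub-task a model response is trying to advance, based on its task id."""
    task_id: Any = response.get("task_id")
    if task_id not in subtasks:
        raise KeyError(f"response does not name a known sub-task: {task_id!r}")
    return subtasks[task_id]


subtasks = {
    "t1": SubTask("t1", "Send a message to Bob", ["send_message"]),
    "t2": SubTask("t2", "List the files in /tmp", ["ls", "cat"]),
}
target = route_response({"task_id": "t2", "tool": "ls", "args": {}}, subtasks)
print(target.query)
```

The explicit identifier is what lets an executor attribute an interleaved call to the right pending sub-task; a response without a valid id cannot be routed at all.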

For each sub-task, the source trajectory is an ordered sequence of tool calls:

$$ \mathcal{T}_i = \langle a_1, a_2, \dots, a_k \rangle, $$

where each action \(a_j\) is a pair of tool name and arguments. The appendix defines predicted and ground-truth trajectories as:

$$ \mathcal{T}^{\text{pred}} = \langle a_1^{\text{pred}}, a_2^{\text{pred}}, \dots, a_{n}^{\text{pred}}\rangle, $$
$$ \mathcal{T}^{\text{gt}} = \langle a_1^{\text{gt}}, a_2^{\text{gt}}, \dots, a_{n}^{\text{gt}}\rangle. $$

The asynchronous twist is that after an agent issues a tool call, the environment may report that the result is not yet available. The agent can wait, press on with the same task before its dependency has resolved (which is an error), or switch to another task that is ready. AsyncTool scores whether it eventually produces the correct tool trajectories and environment states while managing those delays.
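
The delayed-feedback loop can be sketched with a toy turn-based executor. This is an illustration under stated assumptions, not the benchmark's executor: `DelayedExecutor`, `run_interleaved`, the fixed `delay` measured in turns, and the greedy "first unblocked task" policy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DelayedExecutor:
    """Toy executor: a tool result becomes available `delay` turns after the call."""
    delay: int = 2
    turn: int = 0
    pending: dict = field(default_factory=dict)  # task_id -> (ready_turn, result)

    def call(self, task_id: str, tool: str) -> None:
        # A task may not issue its next call while the previous result is pending.
        if task_id in self.pending:
            raise RuntimeError(f"{task_id} still has a pending call")
        self.pending[task_id] = (self.turn + self.delay, f"{tool}:ok")

    def poll(self, task_id: str) -> Optional[str]:
        # Return the result if it is ready, otherwise None ("not yet available").
        ready_turn, result = self.pending[task_id]
        if self.turn < ready_turn:
            return None
        del self.pending[task_id]
        return result


def run_interleaved(plans: dict, delay: int = 2) -> list:
    """Each turn, issue one call on the first task that is not blocked."""
    ex = DelayedExecutor(delay=delay)
    next_step = {tid: 0 for tid in plans}
    issued = []
    while ex.pending or any(next_step[t] < len(plans[t]) for t in plans):
        for tid in list(ex.pending):  # collect results that are ready this turn
            ex.poll(tid)
        for tid, steps in plans.items():
            if tid not in ex.pending and next_step[tid] < len(steps):
                ex.call(tid, steps[next_step[tid]])
                issued.append((ex.turn, tid, steps[next_step[tid]]))
                next_step[tid] += 1
                break
        ex.turn += 1
    return issued


issued = run_interleaved({"A": ["search", "book"], "B": ["read", "write"]})
print(issued)
```

With two tasks, the agent fills A's waiting turns with B's calls, so a call is issued on every turn. With task A alone, the second call cannot be issued until turn 2, and the intermediate turn is spent idle.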

Method Details

Benchmark Construction

The construction pipeline starts from NESTFUL and BFCLv3, because those benchmarks already provide tool APIs, executors, task descriptions, and execution paths. The authors extract 12 tools and 358 tasks into an original single-task dataset, reconstruct task descriptions and strictly ordered trajectories with Gemini 2.5 Pro, and then manually correct errors in arguments, dependency order, initial task interpretation, and task-trajectory alignment.

The human validation instructions require annotators to inspect task goals, validate ordered function calls, check tool names and argument formats, verify dependency relations, execute or inspect trajectory results, correct unambiguous errors, rewrite ambiguous descriptions, and remove unreliable instances. This matters because an asynchronous benchmark is only meaningful if the single-task dependency chains are deterministic and executable before they are composed.

Table 1 summarizes the 12 single-task categories used before composition. It is referenced here because it shows that the benchmark is not limited to one tool domain.

| Category | Abbrev. | Tasks | Avg. trajectory length |
|---|---|---|---|
| Data Management | DM | 59 | 2.14 |
| Filesystem | FS | 53 | 3.58 |
| Data generation | DG | 29 | 2.14 |
| MessageAPI | MA | 23 | 3.48 |
| Number operations | NO | 24 | 2.12 |
| SocialConnect | SC | 18 | 4.11 |
| String Manipulation | SM | 34 | 2.09 |
| TicketPurchase | TP | 21 | 2.81 |
| TradingBot | TB | 15 | 3.27 |
| TravelPlanning | TP* | 20 | 3.10 |
| DataFormat | DF | 44 | 2.09 |
| Machine Operation | MO | 18 | 3.00 |

Table 2 gives the final 712 multitasking instances. The benchmark uses both similar-task and cross-task mixtures, with two-task and three-task settings.

| Composition | Number of tasks | Samples | Share |
|---|---|---|---|
| SIMILAR | 2 | 120 | 16.85% |
| CROSS | 2 | 132 | 18.54% |
| SIMILAR | 3 | 240 | 33.71% |
| CROSS | 3 | 220 | 30.90% |

What AsyncTool Adds

Table 3 is the paper's compact argument for novelty. It claims that prior benchmarks cover some pieces of the problem, but not the full combination of async executor, function calling, multi-tasking, multi-step execution, and cross-scenario composition.

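| Benchmark | Async executor | Function call | Multi-task | Multi-step | Cross-scenario |
|---|---|---|---|---|---|
| tau-bench | No | Yes | No | No | No |
| BFCL v3 | No | Yes | No | Yes | No |
| NESTFUL | No | Yes | No | Yes | No |
| TimeArena | Yes | No | Yes | Yes | No |
| C3-Bench | No | Yes | Yes | Yes | No |
| Robotouille | Yes | No | No | Yes | No |
| AsyncTool | Yes | Yes | Yes | Yes | Yes |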

Metrics

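AsyncTool scores agents at three levels: the step level (function and parameter F1 over individual tool calls), the sub-task level (sub-task accuracy), and the task level (final environment matching and the overall score).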

This creates a sharp distinction between local tool-call fluency and full asynchronous success. A model can have high step-level scores while still failing overall because it mixes task states, violates dependencies, or produces an inconsistent final environment.
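
The gap between step-level fluency and trajectory-level success can be illustrated with a small scoring sketch. The digest does not reproduce the paper's exact formulas, so the definitions here are assumptions: F1 is computed as a multiset overlap between predicted and ground-truth calls, and a sub-task counts as correct only on an exact ordered match. The function names are hypothetical.

```python
from collections import Counter


def multiset_f1(pred_items, gt_items):
    """F1 over the multiset overlap of predicted and ground-truth items."""
    pred, gt = Counter(pred_items), Counter(gt_items)
    overlap = sum((pred & gt).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gt.values())
    return 2 * precision * recall / (precision + recall)


def call_key(call):
    """Make a (tool_name, args_dict) action hashable so it can be counted."""
    name, args = call
    return (name, tuple(sorted(args.items())))


def function_f1(pred, gt):
    # Step level, tool names only.
    return multiset_f1([name for name, _ in pred], [name for name, _ in gt])


def parameter_f1(pred, gt):
    # Step level, tool name plus arguments.
    return multiset_f1([call_key(c) for c in pred], [call_key(c) for c in gt])


def subtask_correct(pred, gt):
    # Sub-task level: the whole ordered trajectory must match.
    return [call_key(c) for c in pred] == [call_key(c) for c in gt]


gt = [("search", {"q": "flights"}), ("book", {"id": 7})]
pred = [("search", {"q": "flights"}), ("book", {"id": 9})]  # wrong argument on step 2

print(function_f1(pred, gt), parameter_f1(pred, gt), subtask_correct(pred, gt))
```

In this example both tool names are right, so function F1 is perfect, but one wrong argument halves parameter F1 and fails the sub-task outright. Task-level environment matching would add a further check on the final state, which is not modeled here.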

Experiments And Results

Main Results

Table 4 distills the main result table. The full source table reports eight metrics; this digest keeps the most diagnostic columns: step-level function and parameter F1, sub-task accuracy, task-level environment matching, and overall score.

| Model | Group | Func. F1 | Param. F1 | Sub-task Acc. | Task Env. | Overall |
|---|---|---|---|---|---|---|
| Qwen-Max | closed | 86.22 | 73.62 | 52.44 | 50.14 | 25.56 |
| Kimi-K2 | closed | 96.14 | 80.46 | 56.79 | 51.69 | 24.44 |
| Gemini 2.5 Pro | closed | 89.08 | 78.27 | 62.05 | 54.35 | 32.44 |
| GPT-5 | closed | 92.21 | 80.11 | 60.67 | 58.43 | 31.32 |
| GPT-4o | closed | 93.92 | 82.26 | 61.41 | 60.53 | 31.74 |
| GPT-4.1 | closed | 96.22 | 84.08 | 67.14 | 64.89 | 38.06 |
| LLaMA-3.1-8B-Instruct | open <20B | 78.29 | 43.69 | 12.47 | 14.61 | 1.26 |
| Qwen2.5-7B-Instruct | open <20B | 82.40 | 65.01 | 26.38 | 25.84 | 6.04 |
| Qwen2.5-14B-Instruct | open <20B | 81.32 | 70.22 | 46.28 | 38.20 | 18.82 |
| Qwen3-8B | open <20B | 63.05 | 53.61 | 29.30 | 28.65 | 10.67 |
| Qwen3-14B | open <20B | 85.02 | 72.67 | 47.19 | 44.66 | 18.82 |
| LLaMA-3.1-70B-Instruct | open >20B | 89.60 | 47.10 | 17.83 | 16.43 | 2.81 |
| LLaMA-3.3-70B-Instruct | open >20B | 73.00 | 40.32 | 20.54 | 18.26 | 5.34 |
| GLM-4-32B | open >20B | 60.59 | 51.41 | 33.97 | 29.78 | 15.17 |
| Qwen3-32B | open >20B | 79.95 | 70.37 | 46.71 | 41.43 | 19.10 |
| Qwen2.5-32B-Instruct | open >20B | 94.24 | 81.73 | 56.48 | 49.72 | 24.86 |
| Qwen3-30B-A3B-Instruct | open >20B | 94.29 | 80.03 | 53.03 | 47.47 | 21.49 |
| DeepSeek-V3.1-Terminus | open >20B | 86.10 | 75.32 | 56.21 | 49.30 | 28.93 |

GPT-4.1 is the best reported model, but its overall score is only 38.06. That is the most important result: even when function and parameter F1 are high, end-to-end asynchronous task completion remains difficult.

Figure 3. Overall and sub-task scores. The original caption says TC is overall score and SC is sub-task-level score. The figure is relevant because it visualizes the gap between partial sub-task completion and full task-level asynchronous success.

Scheduling And Efficiency

The paper introduces Same-task Streak as an efficiency-oriented behavior metric: the longest consecutive sequence of turns spent on the same task, averaged over samples. Lower values indicate stronger interleaving. Figure 4 shows why this cannot be optimized blindly.
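
Following the definition above, Same-task Streak can be sketched as the longest run of consecutive turns on one task, averaged across samples. The function names and the input format (one list of task ids per sample, one id per turn) are assumptions for illustration.

```python
from itertools import groupby


def longest_streak(turn_task_ids):
    """Length of the longest run of consecutive turns spent on the same task."""
    return max((sum(1 for _ in run) for _, run in groupby(turn_task_ids)), default=0)


def same_task_streak(samples):
    """Average the per-sample longest streak over all samples."""
    return sum(longest_streak(s) for s in samples) / len(samples)


samples = [
    ["A", "B", "A", "B"],  # fully interleaved: longest streak 1
    ["A", "A", "A", "B"],  # mostly sequential: longest streak 3
]
print(same_task_streak(samples))
```

Note that the metric only sees which task each turn belongs to, not whether the call was correct, which is why a low value on its own says nothing about task completion.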

Figure 4. Accuracy and scheduling trade-off. The lower-right region is ideal: high overall score and low same-task streak. The paper argues that GPT-4.1, Gemini 2.5 Pro, and GPT-4o combine stronger accuracy with compact switching, while some weaker models switch often without completing tasks. This supports the claim that timing and state tracking matter more than switching frequency alone.
Figure 5. BFCL accuracy versus AsyncTool scores. This plot compares BFCL overall accuracy with AsyncTool overall scores. It is included as evidence that conventional function-calling strength does not directly transfer to delayed multi-task execution.

Task Type, Task Count, And Latency

The remaining ablations separate category difficulty from scaling difficulty. Figure 6 shows the task-type heatmap before the task-count and latency tables, making clear that asynchronous difficulty is uneven across tool domains.

Figure 6. Performance by task category (task-type heatmap). The original caption says red indicates poorer performance and green indicates better performance. I include it because AsyncTool's difficulty is not uniform across tool categories or model families.

Table 5 uses the appendix task-count experiment to isolate one difficulty factor. Moving from two to three tasks reduces overall scores by about one-third to more than one-half for the four reported open models; moving from three to four tasks causes another large drop.

| Model | Overall (2 tasks) | Overall (3 tasks) | Drop, 2 to 3 | Overall (4 tasks) | Drop, 3 to 4 |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 17.06 | 10.87 | 36.28% | 6.33 | 41.77% |
| GLM-4-9B-chat | 21.43 | 12.83 | 40.13% | 8.17 | 36.32% |
| Qwen2.5-72B-Instruct | 50.79 | 33.70 | 33.65% | 22.00 | 34.72% |
| LLaMA3.3-70B-Instruct | 20.63 | 8.70 | 57.83% | 3.67 | 57.81% |
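
The drop columns are relative decreases, that is, (before - after) / before. A short check on the two-to-three-task figures, using the overall scores from the table (the helper name `relative_drop` is mine):

```python
def relative_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent, rounded to 2 decimals."""
    return round(100 * (before - after) / before, 2)


two_to_three = {
    "Qwen2.5-7B-Instruct": relative_drop(17.06, 10.87),
    "GLM-4-9B-chat": relative_drop(21.43, 12.83),
    "Qwen2.5-72B-Instruct": relative_drop(50.79, 33.70),
    "LLaMA3.3-70B-Instruct": relative_drop(20.63, 8.70),
}
for model, drop in two_to_three.items():
    print(f"{model}: {drop}%")
```

The smallest of these drops is roughly one-third and the largest is above one-half, which is the range quoted in the text.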

Table 6 summarizes the reported latency ablations for selected closed-source models. Scores are not monotonic across every model and delay regime, but the broader result supports the claim that response timing changes behavior and difficulty.

| Model | Main overall | Fixed delay 2 | Random delay 0-1 | Random delay 1-2 |
|---|---|---|---|---|
| Qwen-Max | 25.56 | 30.76 | 34.41 | 30.20 |
| Gemini 2.5 Pro | 32.44 | 25.70 | 35.25 | 29.35 |
| GPT-4o | 31.74 | 26.54 | 35.53 | 28.79 |
| GPT-4.1 | 38.06 | 26.40 | 34.55 | 29.49 |

The source text's main qualitative conclusion is stronger than the raw ablation table alone: when tool responses are delayed, lower-performing models often continue dependent calls prematurely or hallucinate parameters based on assumed results. Higher-performing models are better at moving to independent tasks and resuming when dependencies are actually available.

Few-shot And Error Modes

Table 7 reports few-shot prompting with one successful trajectory as a reference. It helps several smaller open models, especially LLaMA-3.1-8B and Qwen2.5-14B, but the resulting overall scores remain low in absolute terms.

| Model | Base overall | Few-shot overall | Change |
|---|---|---|---|
| Qwen2.5-7B-Instruct | 6.04 | 8.29 | +2.25 |
| LLaMA-3.1-8B-Instruct | 1.26 | 6.74 | +5.48 |
| Qwen3-8B | 10.67 | 11.24 | +0.57 |
| Qwen2.5-14B-Instruct | 18.32 | 21.91 | +3.59 |
| Qwen2.5-72B-Instruct | 31.04 | 34.55 | +3.51 |

The paper's error analysis identifies three high-level asynchronous failures:

These failure modes explain why high function-call fluency does not guarantee high overall score. The hard part is maintaining separate task state and dependency information through delayed, interleaved interactions.

Practical Takeaways