AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Motivation / Background

Modern tool-using agents are usually evaluated as if one task is active at a time and tool calls return immediately. AsyncTool argues that this misses a central real-world constraint: external function calls can take time, and a useful agent should make progress on other independent tasks while waiting for delayed results.

The benchmark reframes the agent as a concurrent scheduler. Each task has an ordered tool-call trajectory with intra-task dependencies, while different tasks can be interleaved. The agent must call the right tool with valid arguments, preserve task state across pending calls, resume a task only when its dependency has returned, and avoid confusing tools or state across tasks. Figure 1 shows how the benchmark is built, and Figure 2 illustrates the asynchronous executor interaction.

**Figure 1. Dataset construction pipeline.** The original caption says the pipeline collects raw data from tool-use benchmarks, categorizes it by scenario, reinforces task-step dependencies, reconstructs execution trajectories with Gemini 2.5 Pro, manually verifies accuracy and determinism, and uses hybrid evolution plus filtering to produce the final multitasking dataset. I place it here because it is the paper's central evidence for the benchmark construction claim.

**Figure 2. Asynchronous executor example.** The paper's example gives the agent two tasks at once. When a result for one task is pending, the agent can switch to an independent task; when the delayed result becomes available, it can resume the dependent trajectory. This figure supports the claim that AsyncTool evaluates temporal coordination rather than plain sequential tool calling.

Claims And Evidence

Claim ID	Main claim	Support score	Evidence anchors
C1	AsyncTool fills a real benchmark gap by jointly testing function calls, multi-task execution, multi-step trajectories, cross-scenario composition, and an asynchronous executor.	4	motivation, benchmark comparison, data construction
C2	The dataset is deliberately constructed from validated single-task trajectories and then composed into diverse two-task and three-task asynchronous settings.	4	data pipeline, task inventory, task-type split, composition table
C3	The evaluation protocol captures correctness at step, sub-task, and task levels, not only final answer success.	5	metric definitions, trajectory equations, main results
C4	Current agents struggle under asynchronous delayed feedback; even strong models achieve modest end-to-end overall scores.	5	main results, latency ablation, task-count ablation
C5	Good asynchronous performance requires the right kind of switching: frequent switching alone is insufficient without state maintenance and dependency tracking.	4	scheduling trade-off, error modes, score figure, score visualization
C6	Existing function-calling strength, as measured by BFCL, does not fully predict AsyncTool performance.	3	BFCL comparison, BFCL score plot
C7	Few-shot trajectory examples help some smaller open models but do not solve the benchmark.	4	few-shot table

The support scores rate how strongly the paper itself backs each claim; they are not independent reproduction scores. Dataset novelty is capped below 5 because it is substantiated by feature comparison and construction details rather than an external benchmark audit; core metric and result claims receive higher support because the paper gives explicit definitions and quantitative tables.

Core Technical Idea

AsyncTool is not a new agent architecture. It is an evaluation environment for agents that must juggle multiple delayed tool-use tasks. The important object is a sub-task tuple:

$$ S_i = (I_i, Q_i, T_i, E_i), $$

where \(I_i\) is the task identifier, \(Q_i\) is the task query, \(T_i\) is the available API list, and \(E_i\) is the hidden environment state. The model response must explicitly include \(I_i\), so the executor can tell which pending sub-task the agent is trying to advance.

For each sub-task, the source trajectory is an ordered sequence of tool calls:

$$ \mathcal{T}_i = \langle a_1, a_2, \dots, a_k \rangle, $$

where each action \(a_j\) is a pair of tool name and arguments. The appendix defines predicted and ground-truth trajectories as:

$$ \mathcal{T}^{\text{pred}} = \langle a_1^{\text{pred}}, a_2^{\text{pred}}, \dots, a_{n}^{\text{pred}} \rangle, $$

$$ \mathcal{T}^{\text{gt}} = \langle a_1^{\text{gt}}, a_2^{\text{gt}}, \dots, a_{n}^{\text{gt}} \rangle. $$

The asynchronous twist is that after an agent issues a tool call, the environment may report that the result is not yet available. The agent can then wait, switch to another available task, or (incorrectly) continue the same task before its dependency has returned. AsyncTool scores whether the agent eventually produces the correct tool trajectories and environment states while managing those delays.
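
The delayed-result interaction described above can be sketched as a small turn-based simulation. This is a minimal sketch, not the paper's executor: the `ToyAsyncExecutor` class, the `PENDING` sentinel, the fixed `delay` in turns, and the tool names in the demo are all assumptions introduced for illustration.

```python
PENDING = object()  # hypothetical sentinel: "result not yet available"


class ToyAsyncExecutor:
    """Toy executor: a call issued at turn t is ready at turn t + delay."""

    def __init__(self, delay=2):
        self.delay = delay
        self.turn = 0
        self.in_flight = {}  # task_id -> (ready_turn, result)

    def call(self, task_id, tool, args):
        # One outstanding call per task: the next step depends on this result.
        if task_id in self.in_flight:
            raise RuntimeError(f"task {task_id} already has a pending call")
        self.in_flight[task_id] = (self.turn + self.delay, f"{tool} ok")
        self.turn += 1  # issuing a call consumes one turn

    def poll(self, task_id):
        # Return the result once it is ready; polling does not advance time.
        ready_turn, result = self.in_flight[task_id]
        if self.turn < ready_turn:
            return PENDING
        del self.in_flight[task_id]
        return result

    def wait(self):
        self.turn += 1  # idle turn: every task is blocked


def run_interleaved(trajectories, delay=2):
    """Issue one call per turn on any unblocked task; idle only if all are blocked."""
    ex = ToyAsyncExecutor(delay)
    next_step = {tid: 0 for tid in trajectories}
    order = []
    while ex.in_flight or any(next_step[t] < len(s) for t, s in trajectories.items()):
        acted = False
        for tid, steps in trajectories.items():
            if tid in ex.in_flight and ex.poll(tid) is PENDING:
                continue  # dependency not returned yet: switch to another task
            if next_step[tid] < len(steps):
                tool, args = steps[next_step[tid]]
                ex.call(tid, tool, args)
                order.append((tid, tool))
                next_step[tid] += 1
                acted = True
                break
        if not acted:
            ex.wait()
    return order


# Hypothetical two-task example; tool names are placeholders.
demo = {"A": [("ls", {}), ("cat", {})], "B": [("add", {}), ("mul", {})]}
print(run_interleaved(demo))
```

With a delay of two turns, the scheduler alternates between tasks A and B instead of idling, and it never issues a task's next call while that task's previous result is still pending.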

Method Details

Benchmark Construction

The construction pipeline starts from NESTFUL and BFCLv3, because those benchmarks already provide tool APIs, executors, task descriptions, and execution paths. The authors extract 12 tools and 358 tasks into an original single-task dataset, reconstruct task descriptions and strictly ordered trajectories with Gemini 2.5 Pro, and then manually correct errors in arguments, dependency order, initial task interpretation, and task-trajectory alignment.

The human validation instructions require annotators to inspect task goals, validate ordered function calls, check tool names and argument formats, verify dependency relations, execute or inspect trajectory results, correct unambiguous errors, rewrite ambiguous descriptions, and remove unreliable instances. This matters because an asynchronous benchmark is only meaningful if the single-task dependency chains are deterministic and executable before they are composed.

Table 1 summarizes the 12 single-task categories used before composition. It is referenced here because it shows that the benchmark is not limited to one tool domain.

Category	Abbrev.	Tasks	Avg. trajectory length
Data Management	DM	59	2.14
Filesystem	FS	53	3.58
Data generation	DG	29	2.14
MessageAPI	MA	23	3.48
Number operations	NO	24	2.12
SocialConnect	SC	18	4.11
String Manipulation	SM	34	2.09
TicketPurchase	TP	21	2.81
TradingBot	TB	15	3.27
TravelPlanning	TP*	20	3.10
DataFormat	DF	44	2.09
Machine Operation	MO	18	3.00

Table 2 gives the final 712 multitasking instances. The benchmark uses both similar-task and cross-task mixtures, with two-task and three-task settings.

Composition	Number of tasks	Samples	Share
SIMILAR	2	120	16.85%
CROSS	2	132	18.54%
SIMILAR	3	240	33.71%
CROSS	3	220	30.90%
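
The shares in Table 2 follow directly from the sample counts. The short check below recomputes them; the dictionary layout is only an illustrative choice.

```python
# Sample counts copied from Table 2, keyed by (composition, number of tasks).
samples = {
    ("SIMILAR", 2): 120,
    ("CROSS", 2): 132,
    ("SIMILAR", 3): 240,
    ("CROSS", 3): 220,
}

total = sum(samples.values())
shares = {key: round(100 * count / total, 2) for key, count in samples.items()}

# Share of instances that use three concurrent tasks.
three_task_share = round(
    100 * sum(c for (_, n), c in samples.items() if n == 3) / total, 2
)

print(total, shares, three_task_share)
```

The counts sum to 712, and roughly 64.6% of the instances are three-task settings, so the harder configuration dominates the benchmark.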

What AsyncTool Adds

Table 3 is the paper's compact argument for novelty. It claims that prior benchmarks cover some pieces of the problem, but not the full combination of async executor, function calling, multi-tasking, multi-step execution, and cross-scenario composition.

Benchmark	Async executor	Function call	Multi-task	Multi-step	Cross-scenario
tau-bench	No	Yes	No	No	No
BFCL v3	No	Yes	No	Yes	No
NESTFUL	No	Yes	No	Yes	No
TimeArena	Yes	No	Yes	Yes	No
C3-Bench	No	Yes	Yes	Yes	No
Robotouille	Yes	No	No	Yes	No
AsyncTool	Yes	Yes	Yes	Yes	Yes

Metrics

AsyncTool scores agents at three levels:

Step level: F1 for function-name correctness and parameter correctness.
Sub-task level: trajectory completion, environment matching, and their combined sub-task accuracy.
Task level: trajectory completion and environment consistency count only when all sub-task metrics within the task are satisfied.

This creates a sharp distinction between local tool-call fluency and full asynchronous success. A model can have high step-level scores while still failing overall because it mixes task states, violates dependencies, or produces an inconsistent final environment.
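
The three levels can be illustrated with a small scoring sketch. It is an approximation under stated assumptions, not the paper's scorer: the multiset F1 matching rule, the exact ordered-match test for trajectory completion, and all function names here are hypothetical choices made for illustration.

```python
from collections import Counter


def f1(pred, gold):
    """Multiset F1 between predicted and gold items (assumed matching rule)."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_subtask(pred_calls, gold_calls, pred_env, gold_env):
    """Score one sub-task; calls are (tool_name, hashable_args) pairs."""
    trajectory_ok = pred_calls == gold_calls  # exact ordered match (assumption)
    env_ok = pred_env == gold_env
    return {
        "func_f1": f1([name for name, _ in pred_calls], [name for name, _ in gold_calls]),
        "param_f1": f1(pred_calls, gold_calls),
        "trajectory": trajectory_ok,
        "env": env_ok,
        "subtask_acc": trajectory_ok and env_ok,
    }


def score_task(subtask_scores):
    """A task counts only if every sub-task inside it succeeds."""
    return all(s["subtask_acc"] for s in subtask_scores)


# Hypothetical example: right tools, one wrong argument in sub-task A.
gold_a = [("mkdir", (("path", "a"),)), ("touch", (("path", "a/f"),))]
pred_a = [("mkdir", (("path", "a"),)), ("touch", (("path", "a/g"),))]
score_a = score_subtask(pred_a, gold_a, {"files": ["a/g"]}, {"files": ["a/f"]})
score_b = score_subtask(gold_a, gold_a, {"files": ["a/f"]}, {"files": ["a/f"]})
print(score_a, score_task([score_a, score_b]))
```

In this example sub-task A has a perfect function-name F1 but only 0.5 parameter F1, and the whole task fails even though sub-task B is flawless, which mirrors the gap between step-level fluency and task-level success.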

Experiments And Results

Main Results

Table 4 distills the main result table. The full source table reports eight metrics; this digest keeps the most diagnostic columns: step-level function and parameter F1, sub-task accuracy, task-level environment matching, and overall score.

Model	Group	Func.	Param.	Sub-task Acc.	Task Env.	Overall
Qwen-Max	closed	86.22	73.62	52.44	50.14	25.56
Kimi-K2	closed	96.14	80.46	56.79	51.69	24.44
Gemini 2.5 Pro	closed	89.08	78.27	62.05	54.35	32.44
GPT-5	closed	92.21	80.11	60.67	58.43	31.32
GPT-4o	closed	93.92	82.26	61.41	60.53	31.74
GPT-4.1	closed	96.22	84.08	67.14	64.89	38.06
LLaMA-3.1-8B-Instruct	open <20B	78.29	43.69	12.47	14.61	1.26
Qwen2.5-7B-Instruct	open <20B	82.40	65.01	26.38	25.84	6.04
Qwen2.5-14B-Instruct	open <20B	81.32	70.22	46.28	38.20	18.82
Qwen3-8B	open <20B	63.05	53.61	29.30	28.65	10.67
Qwen3-14B	open <20B	85.02	72.67	47.19	44.66	18.82
LLaMA-3.1-70B-Instruct	open >20B	89.60	47.10	17.83	16.43	2.81
LLaMA-3.3-70B-Instruct	open >20B	73.00	40.32	20.54	18.26	5.34
GLM-4-32B	open >20B	60.59	51.41	33.97	29.78	15.17
Qwen3-32B	open >20B	79.95	70.37	46.71	41.43	19.10
Qwen2.5-32B-Instruct	open >20B	94.24	81.73	56.48	49.72	24.86
Qwen3-30B-A3B-Instruct	open >20B	94.29	80.03	53.03	47.47	21.49
DeepSeek-V3.1-Terminus	open >20B	86.10	75.32	56.21	49.30	28.93

GPT-4.1 is the best reported model, but its overall score is only 38.06. That is the most important result: even when function and parameter F1 are high, end-to-end asynchronous task completion remains difficult.

**Figure 3. Overall and sub-task scores.** The original caption says TC is overall score and SC is sub-task-level score. The figure is relevant because it visualizes the gap between partial sub-task completion and full task-level asynchronous success.

Scheduling And Efficiency

The paper introduces Same-task Streak as an efficiency-oriented behavior metric: the longest consecutive sequence of turns spent on the same task, averaged over samples. Lower values indicate stronger interleaving. Figure 4 shows why this cannot be optimized blindly.

**Figure 4. Accuracy and scheduling trade-off.** The lower-right region is ideal: high overall score and low same-task streak. The paper argues that GPT-4.1, Gemini 2.5 Pro, and GPT-4o combine stronger accuracy with compact switching, while some weaker models switch often without completing tasks. This supports the claim that timing and state tracking matter more than switching frequency alone.
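
The Same-task Streak defined above can be computed from a per-turn log of task identifiers. This is a minimal sketch under the assumption that each sample is recorded as a list of task ids, one per agent turn; the function names are hypothetical.

```python
from itertools import groupby


def longest_streak(task_ids):
    """Length of the longest run of consecutive turns spent on one task."""
    return max((sum(1 for _ in run) for _, run in groupby(task_ids)), default=0)


def same_task_streak(samples):
    """Average the per-sample longest streak over all samples."""
    return sum(longest_streak(s) for s in samples) / len(samples)


interleaved = ["A", "B", "A", "B"]  # switches every turn
sticky = ["A", "A", "A", "B"]       # stays on task A for three turns
print(same_task_streak([interleaved, sticky]))
```

A low value only shows that the agent switches often. It says nothing about whether the switches respected dependencies, which is why the metric is read together with the overall score.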

**Figure 5. BFCL comparison.** This plot compares BFCL overall accuracy with AsyncTool overall scores. It is included as evidence that conventional function-calling strength does not directly transfer to delayed multi-task execution.

Task Type, Task Count, And Latency

The remaining ablations separate category difficulty from scaling difficulty. Figure 6 shows the task-type heatmap before the task-count and latency tables, making clear that asynchronous difficulty is uneven across tool domains.

**Figure 6. Task-type heatmap.** The original caption says red indicates poorer performance and green indicates better performance. I include it because AsyncTool's difficulty is not uniform across tool categories or model families.

Table 5 uses the appendix task-count experiment to isolate one difficulty factor. Moving from two to three tasks reduces overall scores by about one-third to more than one-half for the four reported open models; moving from three to four tasks causes another large drop.

Model	#2 overall	#3 overall	Drop #2 to #3	#4 overall	Drop #3 to #4
Qwen2.5-7B-Instruct	17.06	10.87	36.28%	6.33	41.77%
GLM-4-9B-chat	21.43	12.83	40.13%	8.17	36.32%
Qwen2.5-72B-Instruct	50.79	33.70	33.65%	22.00	34.72%
LLaMA3.3-70B-Instruct	20.63	8.70	57.83%	3.67	57.81%
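
The drop columns in Table 5 are relative decreases. The sketch below recomputes the two-to-three-task column from the rounded scores in the table; `relative_drop` is a hypothetical helper name, and results recomputed from rounded inputs can differ from a published figure in the last digit.

```python
def relative_drop(before, after):
    """Percentage decrease from `before` to `after`, rounded to two decimals."""
    return round(100 * (before - after) / before, 2)


# (#2 overall, #3 overall) pairs copied from Table 5.
two_vs_three = {
    "Qwen2.5-7B-Instruct": (17.06, 10.87),
    "GLM-4-9B-chat": (21.43, 12.83),
    "Qwen2.5-72B-Instruct": (50.79, 33.70),
    "LLaMA3.3-70B-Instruct": (20.63, 8.70),
}

drops = {model: relative_drop(a, b) for model, (a, b) in two_vs_three.items()}
print(drops)
```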

Table 6 summarizes the reported latency ablations for selected closed-source models. Scores are not monotonic across every model and delay regime, but the broader result supports the claim that response timing changes behavior and difficulty.

Model	Main overall	Fixed delay 2	Random delay 0-1	Random delay 1-2
Qwen-Max	25.56	30.76	34.41	30.20
Gemini 2.5 Pro	32.44	25.70	35.25	29.35
GPT-4o	31.74	26.54	35.53	28.79
GPT-4.1	38.06	26.40	34.55	29.49

The source text's main qualitative conclusion goes beyond what the raw ablation table shows on its own: when tool responses are delayed, lower-performing models often continue dependent calls prematurely or hallucinate parameters based on assumed results. Higher-performing models are better at moving to independent tasks and resuming when dependencies are actually available.

Few-shot And Error Modes

Table 7 reports few-shot prompting with one successful trajectory as a reference. It helps several smaller open models, especially LLaMA-3.1-8B and Qwen2.5-14B, but the resulting overall scores remain low in absolute terms.

Model	Base overall	Few-shot overall	Change
Qwen2.5-7B-Instruct	6.04	8.29	+2.25
LLaMA-3.1-8B-Instruct	1.26	6.74	+5.48
Qwen3-8B	10.67	11.24	+0.57
Qwen2.5-14B-Instruct	18.82	21.91	+3.09
Qwen2.5-72B-Instruct	31.04	34.55	+3.51

The paper's error analysis identifies three high-level asynchronous failures:

Premature continuation: the model executes a later dependent call before the previous tool result has returned.
Task neglect: the model over-focuses on the most recently presented task and fails to complete earlier tasks, especially in tri-task settings.
Tool confusion: the model switches across tasks but uses a tool from the wrong scenario, causing cascading failures.

These failure modes explain why high function-call fluency does not guarantee high overall score. The hard part is maintaining separate task state and dependency information through delayed, interleaved interactions.
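
The three failure modes can be detected mechanically from an interaction log. The sketch below is an illustration under simplifying assumptions, not the paper's error analysis: the event format, the `audit_turns` name, and the rule that a task is "neglected" when it never receives a call are all hypothetical.

```python
def audit_turns(turns, task_tools):
    """Flag asynchronous failures in a turn log.

    turns: list of ("call", task_id, tool) or ("result", task_id) events.
    task_tools: task_id -> set of tool names that belong to that task.
    Returns (turn_index, task_id, label) tuples; the index is None for neglect.
    """
    pending = set()  # tasks whose latest call has not returned yet
    called = set()   # tasks that received at least one call
    errors = []
    for i, event in enumerate(turns):
        kind, task_id = event[0], event[1]
        if kind == "result":
            pending.discard(task_id)
            continue
        tool = event[2]
        if task_id in pending:
            # A dependent call was issued before the previous result arrived.
            errors.append((i, task_id, "premature_continuation"))
        if tool not in task_tools[task_id]:
            # The tool belongs to another task's scenario.
            errors.append((i, task_id, "tool_confusion"))
        pending.add(task_id)
        called.add(task_id)
    # Simplified neglect rule: the task was never advanced at all.
    errors += [(None, t, "task_neglect") for t in task_tools if t not in called]
    return errors


# Hypothetical three-task log; tool names are placeholders.
log = [("call", "A", "ls"), ("call", "A", "cat"), ("result", "A"), ("call", "B", "ls")]
tools = {"A": {"ls", "cat"}, "B": {"add"}, "C": {"send"}}
print(audit_turns(log, tools))
```

A harness could run a check like this as a guard before forwarding a call to the executor, which is one way to turn the diagnosed failures into per-task state tracking and dependency checks.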

Practical Takeaways

The reusable idea is the evaluation framing: tool-use agents should be tested as schedulers under delayed feedback, not only as single-threaded function callers.
The benchmark is useful for agent harness design because its failures map to concrete engineering needs: per-task state tracking, dependency guards, pending-call management, and explicit scheduling policies.
The main result is sobering: GPT-4.1 leads with 38.06 overall, while many open models have respectable step-level scores but very low task-level success.
Same-task Streak is a useful diagnostic but not a standalone objective. Frequent switching can be harmful if the agent does not preserve state and dependency order.
The evidence for benchmark construction is credible from source-side details, but an external audit of dataset determinism and leakage would be needed before treating AsyncTool as fully settled.
The paper is benchmark-focused; it diagnoses coordination failures but does not provide a new agent architecture that fixes them.