arXiv20262026avg 6.60interest 7.80210 HF agentic harnessestool useverification

This survey frames code as the operational harness for agentic AI, covering how code connects agents to reasoning, action, environment modeling, planning, memory, tool use, feedback, and multi-agent coordination. It organizes representative methods and applications while outlining challenges in evaluation, verification, shared state, oversight, and multimodal extension.

Source-first digest for monthly 2026_05 rank 7, rank_id p007.

Motivation / Background

The paper argues that code should no longer be viewed only as something language models generate. In modern agent systems, code is also the runtime substrate that lets agents reason through execution, act through tools and policies, model environments through inspectable state, and revise behavior through feedback. This reframing matters because long-running agents fail less from one isolated bad completion than from weak surrounding systems: missing state, weak verifiers, unsafe tools, stale context, brittle memory, and poor coordination.

The source introduction separates three parts of an agent system: model-internal capabilities, system-provided harness infrastructure, and agent-initiated code artifacts. The paper focuses on the third part: code artifacts that agents create, execute, observe, revise, persist, and share inside the task loop. Figure 1 is the high-level taxonomy, and Table 2 compresses that taxonomy into the digest's layer map.

Figure 1. Taxonomy of code as agent harness.
Figure 1. Taxonomy of code as agent harness. The paper organizes the field into harness interfaces, harness mechanisms, and multi-agent scaling, with applications and open problems downstream of those layers.

The paper's central contribution is therefore conceptual and synthetic: it gives a vocabulary for code-centered agent systems in which executable artifacts are the medium for reasoning, action, state, feedback, verification, and coordination. This is a useful lens for reading coding agents, GUI agents, embodied agents, scientific agents, personalization agents, DevOps workflows, and enterprise automation as variations of the same code-harness problem.

Claims And Evidence

The main claims are listed in Table 1. Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly backed by the source taxonomy, section structure, figures, or tables; a score of 4 means the paper makes a strong literature-synthesis argument but does not independently benchmark it.

Claim id Main claim Support Evidence anchors
C1 Code is becoming an agent harness: an executable, inspectable, stateful medium for reasoning, acting, feedback, and verification, not just a generated artifact. 5 problem framing, taxonomy, Figure 1
C2 The survey's three-layer taxonomy, interface, mechanisms, and scaling, gives a coherent map of code-centered agent systems. 5 taxonomy, layer map, interface roadmap, mechanisms roadmap, MAS roadmap
C3 At the interface layer, code plays three recurring roles: executable reasoning substrate, action interface, and environment/world representation. 5 interface evidence, Figure 2, Figure 3
C4 Reliability depends on harness mechanisms around the model: planning, memory, tool use, Plan-Execute-Verify control, and adaptive harness optimization. 5 mechanisms evidence, mechanism map, PEV evidence, AHE evidence
C5 Multi-agent coding systems need shared code-centric substrates because role specialization increases the cost of stale or inconsistent shared state. 4 MAS evidence, shared-state evidence, MAS state table
C6 A major trend is that execution feedback bridges linguistic reasoning and formal verification, but only when the harness has adequate oracles and state representations. 4 patterns evidence, PEV evidence, open problem table
C7 The code-as-harness view generalizes beyond coding assistants to GUI/OS agents, embodied agents, scientific discovery agents, and personalization systems. 4 applications evidence, Figure 12
C8 The open agenda is harness engineering: evaluation beyond final success, semantic verification, regression-free self-evolution, transactional shared state, governance, and multimodal grounding. 5 open problems, open problem table

Core Technical Idea

The core idea is to make agent-initiated code artifacts first-class state in the agent loop. A code artifact is not merely an output string; it is something the harness can execute, inspect, store, test, diff, share, and roll back. This makes code a natural interface between model judgment and environment dynamics. Figure 2 shows the interface view, while Figure 3 places representative work into reasoning, acting, and environment-modeling roles.

Figure 2. Code as the harness interface.
Figure 2. Code as the harness interface. Code connects agents to reasoning, action, environment modeling, tool calls, state tracking, and feedback traces.
Figure 3. Roadmap of the harness interface.
Figure 3. Roadmap of the harness interface. The interface layer is organized around code for reasoning, code for acting, and code for environment modeling.

Table 2 is the digest's compact view of the paper's full taxonomy.

Layer What code does Representative artifacts Main failure pressure
Harness interface Converts model outputs into executable reasoning, actions, and world-state representations. Programs, symbolic specs, tool calls, policies, simulators, tests, traces. Natural-language reasoning is hard to verify and transient.
Harness mechanisms Keeps long-horizon agent loops bounded, observable, and revisable. Plans, memory stores, tool schemas, sandboxes, validators, telemetry, permission tiers. Agents need state, feedback, and governance beyond one generation step.
Scaling harness Makes code a shared substrate for multiple specialized agents. Repositories, blackboards, execution logs, role-specific patches, tests, workflow states. Agents can diverge in belief, context, assumptions, and accepted state.

Table 2. Three-layer code-as-harness map. This digest table synthesizes the source taxonomy and section organization.

The extraction equation index contains no substantive technical equation; it only captured a malformed author-affiliation display. The paper is nevertheless formal in several places through compact notations. The most useful notations are conceptual rather than algorithmic:

$$ \text{agent loop} \approx \text{model judgment} + \text{harness infrastructure} + \text{agent-initiated code artifacts} $$

For GUI and OS agents, the paper models the environment as a partially observable program world:

$$ \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R\rangle, \quad s \in \mathcal{S}, \quad o \in \mathcal{O}, \quad \pi(a \mid h) $$

For multi-agent synchronization, the important formal object is the gap between true shared program state and an agent's belief, summarized in the paper's discussion of belief divergence as \( |B_k - S_k| \). For score-based convergence, the survey cites the form \(s(r) = 1 - m(r) / tc(r)\), where a candidate program's mismatch count is normalized by test count. These notations are not the paper's main contribution; they support the larger claim that code-agent harnesses need explicit state, objective feedback, and convergence criteria.

Method Details

The paper's survey method is a taxonomy over code-centered agent systems. It does not introduce a new model or benchmark. Instead, it groups existing systems by the role code plays in the harness and then tracks what reliability problem each role solves.

At the interface layer, code has three jobs. First, code supports reasoning by externalizing intermediate computation into executable programs, symbolic solvers, proof scripts, traces, or process-reward units. Second, code supports acting by compiling high-level intent into robot policies, GUI actions, software commands, behavior trees, or reusable skills. Third, code supports environment modeling by representing state, dynamics, and evaluation through repositories, simulations, tests, logs, and executable world models.

At the mechanisms layer, the paper turns from what code is to how a harness manages code over time. Figure 4 gives the overview, and Table 3 summarizes the five mechanism families.

Figure 4. Roadmap of harness mechanisms.
Figure 4. Roadmap of harness mechanisms. Planning, memory, tool use, verification/control, and harness engineering are treated as interacting control surfaces.
Mechanism Harness role What the paper emphasizes Digest reading
Planning Decides what the agent should do next and when to revise. Linear decomposition, structure-grounded planning, search-based planning, orchestration-based planning. Planning is a contract over future state transitions, not just hidden reasoning.
Memory/context Maintains task state across long interactions. Working, semantic, experiential, long-term, multi-agent memory; compaction and state offloading. Memory quality depends on provenance, write gates, retrieval keys, and evaluation reliability.
Tool use Expands actions through governed executable interfaces. Function tools, environment-interaction tools, verification tools, workflow orchestration tools. Tools need schemas, permissions, lifecycle hooks, result sanitization, and traces.
PEV control Regulates state transitions through plan, execute, verify. Sandboxes, permission tiers, deterministic sensors, runtime feedback, human-review gates. Debugging is better read as cybernetic control over executable state.
Harness engineering Optimizes the harness itself using telemetry. Deep telemetry, Evolution Agents, governed harness mutation. The wrapper around the model is becoming a measurable, revisable software system.

Table 3. Harness mechanism map. This digest table condenses the mechanisms section and its roadmap figure.

Figure 5, Figure 6, Figure 7, Figure 8, and Figure 9 are the paper's mechanism-specific roadmaps.

Figure 5. Planning mechanisms for agent harnesses.
Figure 5. Planning mechanisms. The planning section treats plans as explicit scaffolds, graph- or repository-grounded constraints, search trajectories, and workflow orchestration structures.
Figure 6. Memory and context engineering mechanisms.
Figure 6. Memory and context engineering. The paper frames memory as a state-management layer, not merely a longer prompt or vector store.
Figure 7. Tool-using mechanisms.
Figure 7. Tool use. Tools are grouped by function-oriented grounding, environment interaction, verification-driven feedback, and workflow orchestration.
Figure 8. Plan-Execute-Verify harness control.
Figure 8. Plan-Execute-Verify loop. The PEV framing treats planning as contract formation, execution as a sandboxed state transition, and verification as deterministic sensing plus repair or escalation.
Figure 9. Adaptive harness engineering.
Figure 9. Harness engineering. The survey's AHE section treats telemetry, evolution agents, and governed harness mutation as ways to improve the runtime around a model.

At the scaling layer, the paper argues that multi-agent systems make the shared-code problem unavoidable. A planner, coder, tester, reviewer, executor, and security checker may all be individually useful, but the system can still fail if they read different versions of the repository, optimize against different validators, or carry incompatible assumptions. Figure 10 and Figure 11 summarize this MAS layer.

Figure 10. Multi-agent harness overview.
Figure 10. Multi-agent harness overview. Role-specialized agents coordinate through shared code artifacts, feedback, and workflow topology.
Figure 11. Multi-agent roadmap.
Figure 11. Multi-agent roadmap. The MAS section organizes systems by workflow collaboration, shared repository state, execution verification, and adaptive coordination.

The survey's most actionable MAS distinction is the kind of shared state a system exposes. Table 4 summarizes the four representations discussed in the paper.

Shared-state representation What agents share Strength Limitation
Implicit / file-only Latest code files or conversational context. Simple and cheap. State divergence is mostly invisible.
Repository-based File tree, dependencies, call graphs, summaries, version history. Good for localization and structural reasoning. Can miss runtime behavior.
Execution-based Test results, compiler diagnostics, traces, fuzzing, profiling, simulation snapshots. Supplies objective oracle signals. Only as strong as the attached verifier.
Blackboard / shared-state Persistent global store or structured state graph. Queryable, durable coordination substrate. Needs authority, consistency, and conflict policies.

Table 4. Shared code-harness representations. The paper's position is that most systems still lack a formal shared substrate despite code's unique ability to execute and produce objective signals.

Experiments And Results

This paper is a survey, so its "results" are structured synthesis results rather than new experiments. The most important empirical-style output is the pattern analysis in the multi-agent section and the cross-domain agenda in the final section.

The survey identifies seven recurring patterns:

The applications section uses five domains to show that the framework is not limited to software engineering. Figure 12 is the paper's application map. Code assistants operate over repositories and tests; GUI/OS agents act in program worlds built from DOM trees, accessibility APIs, screenshots, and event loops; embodied agents use code as policy and skill memory; scientific agents turn hypotheses, experiments, and analyses into executable pipelines; personalization agents use code to represent preference state, feedback processing, constraints, and policy adaptation.

Figure 12. Applications of code as agent harness.
Figure 12. Applications. The survey maps code-as-harness concepts onto coding assistants, GUI/OS agents, scientific discovery, personalization, and embodied agents.

Table 5 summarizes the open agenda. The strongest through-line is that final task success is too weak a metric for agentic systems: a harness can pass a test while hiding weak oracle coverage, unsafe permissions, stale memory, or state conflicts.

Open problem Why it matters Practical version
Harness-level evaluation and oracle adequacy Final accuracy mixes model ability, tool quality, environment difficulty, and verifier strength. Track trajectory cost, recovery, replayability, safety compliance, and verifier coverage.
Semantic verification beyond executable feedback Green tests can be an incomplete proxy for the real specification. Compose tests, static analysis, fuzzing, formal specs, critiques, coverage, and human review with explicit scope.
Self-evolving harnesses without regression Harness changes affect future agent behavior and can overfit or weaken safety. Treat harness mutation like safety-critical code: change contracts, regression suites, canaries, rollback.
Transactional shared program state Multi-agent and human-agent workflows can conflict at the level of assumptions, not only file diffs. Require read sets, write sets, version dependencies, verifier obligations, and conflict policies.
HITL safety and accountability as harness state High-risk actions need enforceable gates, not prompt-only caution. Store approvals, denials, risk tiers, audit evidence, and policy updates as durable state.
Multimodal code-harness systems GUI, robotics, and scientific settings need visual/physical evidence, not just text logs. Preserve raw observations, structured annotations, grounding contracts, and calibrated visual feedback.
Science of harness engineering The field needs principles for the whole closed loop, not just better base models. Build auditable telemetry, component metrics, and benchmarks that expose long-horizon harness failures.

Table 5. Open harness-engineering agenda. The paper's future work section treats the harness as the main object of measurement, design, and governance.

Practical Takeaways

For builders of agent systems, the practical message is direct: make code artifacts and execution evidence first-class, durable, and auditable. A strong harness should not only call tools; it should record plans, state assumptions, permissions, tool arguments, command outputs, tests, failures, approvals, and rollback points.

The second takeaway is that memory and tool use should be designed together. A memory item is useful only if it preserves enough provenance to be checked against source state, and a tool result is useful only if the harness decides whether it should update memory, trigger verification, or remain transient.

The third takeaway is that multi-agent systems should start with shared-state design, not role taxonomy. Adding a planner, reviewer, or tester is cheap; keeping their beliefs synchronized with repository state, execution traces, and human constraints is the real systems problem.

The fourth takeaway is that self-improving harnesses need stronger governance than ordinary code edits. A new retrieval policy, tool schema, permission rule, or verifier can change future agent behavior. The paper's AHE framing suggests treating those changes as evidence-carrying, replayable, regression-tested software changes.

Limitations are also clear. The paper is a broad survey and position piece, not a controlled experimental comparison. Its support is strongest for taxonomy, synthesis, and design agenda claims; it is weaker for causal claims about which harness design is best under a fixed benchmark. The LaTeX extraction produced all twelve figures and a successful source-first route, but no substantive display-equation index, so this digest reports conceptual and inline formal notations rather than algorithm-specific equations.

Reference Coverage

Anchor coverage links: evidence-problem-framing, fig-taxonomy, evidence-taxonomy, table-claims, evidence-interface, fig-interface-overview, fig-interface-roadmap, table-layer-map, evidence-equation-note, evidence-mechanisms, fig-mechanisms-roadmap, table-mechanism-map, fig-planning, fig-memory, fig-tool, evidence-pev, fig-verification, evidence-ahe, fig-harness-engineering, evidence-mas, fig-mas-overview, fig-mas-roadmap, evidence-shared-state, table-mas-state, evidence-patterns, evidence-applications, fig-applications, evidence-open-problems, and table-open-problems.