arXiv 2026 · multimodal agents · visual evidence · research automation

Ptah addresses the difficulty of producing verifiable multimodal deep-research reports that combine textual synthesis with visual evidence. Its multi-agent harness plans, researches, writes, maintains a Visual Working Memory, and uses a verifier for factual grounding, citation fidelity, and cross-modal consistency, with experiments showing more reliable and usable reports than baselines.

Source-first digest for checked paper rank 11, rank_id p026.

Motivation / Background

The paper starts from a gap between deep search and deep research. Deep search usually aims at short, checkable answers, while deep research asks an agent to synthesize scattered evidence into a long report. That second setting is harder to verify because there is no deterministic ground truth, and it is harder to present well because useful reports need text interleaved with charts, diagrams, screenshots, and other visual evidence.

The authors argue that existing deep-research systems fail in two coupled ways: mistakes from early search and planning stages can propagate into final reports, and images are often inserted as post-hoc decoration instead of being tied to claims, citations, and section intent. Figure 1 makes this framing concrete: images can reduce cognitive load and support claims, but the same visual channel can harm credibility through invalid references, uninformative images, or poor placement.

Figure 1. Interleaved image-text research reports: benefits and challenges.
Figure 1. Interleaved image-text research reports. The original caption says the figure illustrates how images enhance report quality and the challenges of generating high-quality interleaved image-text reports. I place it here because it motivates the paper's core claim that multimodal report generation needs evidence-aware visual handling, not only better prose generation.

Claims And Evidence

The digest's evidence map is Table A. Scores are support-from-paper scores, not independent reproduction scores.

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Ptah turns multimodal deep research into a staged, inspectable workflow rather than a single monolithic generation pass. | 4 | motivation, harness overview, planning/research/writing details |
| C2 | Visual evidence is treated as structured working state through visual requirements, webpage image extraction, VLM selection, and Visual Working Memory. | 4 | harness overview, visual working memory, example reports |
| C3 | A verifier agent improves stability and factual grounding by checking stage outputs, citations, protocol compliance, and cross-modal consistency. | 4 | method stages, FACT results, verifier ablation, verifier latency |
| C4 | PtahEval fills a measurement gap by adding image-level and rendered-page presentation scores to existing deep-research benchmarks. | 5 | PtahEval protocol, PtahEval scores |
| C5 | Ptah improves multimodal report quality and credibility over the evaluated baselines. | 4 | content results, PtahEval scores, FACT results, human evaluation |
| C6 | Test-time scaling contributes to content quality, image quality, and rendered HTML quality. | 4 | TTS method, TTS ablation |
| C7 | The multi-agent decomposition is not only a quality device; parallel section-level research reduces wall-clock research latency. | 4 | stage latency, parallel latency |
| C8 | The system remains bounded by current open-source model reliability and by manually defined stage boundaries. | 5 | limitations |

Core Technical Idea

Ptah is a harness around agents, tools, intermediate state, and verification. The paper does not propose a new foundation model. Its contribution is an execution pattern for long-form multimodal report generation: make the report plan visual-aware, let researchers collect claim-grounded text and source-aligned image candidates, then let a writer compose an interleaved report through declarative image operations.

The task formulation represents a report as ordered blocks:

$$ r = (b_1, b_2, \dots, b_M), $$

where each block is either text \(t_i\) or a visual element \(v_i\). During execution, the harness maintains a state \(s_t = (q, \mathcal{M}_t, \tau_{<t})\), where \(\mathcal{M}_t\) contains plans, evidence, citations, numerical data, and visual candidates. That state is the key object: the system does not just produce prose; it accumulates inspectable intermediate artifacts before final rendering.
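As a rough illustration of this formulation, the report blocks and the harness state could be modelled as plain data classes. This is a minimal sketch under stated assumptions, not the paper's implementation: the class names (`TextBlock`, `VisualBlock`, `HarnessState`), the field names, and the `record` helper are hypothetical, and \(\mathcal{M}_t\) is reduced to a dictionary of lists.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class TextBlock:
    text: str  # a text block t_i


@dataclass
class VisualBlock:
    image_url: str  # a visual element v_i
    caption: str = ""


# A report r = (b_1, ..., b_M) is an ordered list of such blocks.
Block = Union[TextBlock, VisualBlock]


def _empty_memory() -> dict:
    # Hypothetical layout for M_t, following the categories named in the text.
    return {"plans": [], "evidence": [], "citations": [],
            "numerical_data": [], "visual_candidates": []}


@dataclass
class HarnessState:
    query: str                                        # q
    memory: dict = field(default_factory=_empty_memory)  # M_t
    trajectory: list = field(default_factory=list)    # tau_{<t}

    def record(self, kind: str, item) -> None:
        """Store an intermediate artifact and log the step that produced it."""
        self.memory[kind].append(item)
        self.trajectory.append((kind, item))


state = HarnessState(query="example query")
state.record("evidence", "claim with a source")
report = [TextBlock("Intro paragraph."),
          VisualBlock("https://example.com/fig.png", caption="Example figure")]
```

The point of the sketch is only that every artifact lands in an inspectable structure before any final rendering happens.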

The full Planning-Research-Writing lifecycle is summarized in Figure 2.

Figure 2. Overview of Ptah.
Figure 2. Overview of Ptah. The original caption describes Ptah as a multi-agent harness for verifiable multimodal deep research. I include it as the central method figure because it shows the three stages, verifier accept/reject loops, Visual Working Memory, image operations, and final HTML refinement.

The verifier is placed between lifecycle stages rather than only after final generation. This matters because the paper's problem is error accumulation: bad plans, unsupported research packages, invalid references, or misaligned images become harder to fix once they are already woven into a final report.
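The accept/reject pattern between stages can be sketched as a bounded retry loop. The code below is an illustrative assumption rather than Ptah's actual control flow: `run_stage`, the `max_rounds` limit, and the toy planner and verifier are all hypothetical stand-ins.

```python
from typing import Callable, Tuple


def run_stage(produce: Callable[[str], str],
              verify: Callable[[str], Tuple[bool, str]],
              max_rounds: int = 3) -> str:
    """Re-run a stage with verifier feedback until it is accepted."""
    feedback = ""
    for _ in range(max_rounds):
        output = produce(feedback)
        accepted, feedback = verify(output)
        if accepted:
            return output
    raise RuntimeError(f"stage rejected after {max_rounds} rounds: {feedback}")


attempts = []


def toy_planner(feedback: str) -> str:
    # Hypothetical agent: revises its plan only once it receives feedback.
    attempts.append(feedback)
    return "plan with visuals" if feedback else "plan"


def toy_verifier(output: str) -> Tuple[bool, str]:
    # Hypothetical check: accept only plans that mention visuals.
    return ("visuals" in output, "add visual requirements")


plan = run_stage(toy_planner, toy_verifier)
```

Because rejection happens before the next stage starts, a bad plan never reaches the researchers in this sketch, which mirrors the error-accumulation argument above.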

Method Details

Planning Stage

The planner uses text search to explore the user's query and emits a structured plan. The plan contains an overview, section-level research goals, expected evidence types, and explicit visual specifications. Those visual specifications say what kind of image should support a section, where it should appear, and what communicative role it should serve.

The verifier checks the plan in two ways. Rule-based checks validate interaction protocol, tool-use constraints, and JSON format. LLM-based rubric checks assess query coverage, section coherence, and whether visual requirements actually match the intended argument. Failed plans are revised before research starts.
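The rule-based half of this check is the easier one to picture. The following sketch validates a plan's JSON shape; the schema (the `overview`, `sections`, `goal`, `evidence_types`, and `visual_spec` keys) is an assumption made for illustration, since the digest does not give the paper's actual plan format.

```python
import json

# Hypothetical schema, chosen to mirror the plan contents described above.
REQUIRED_PLAN_KEYS = {"overview", "sections"}
REQUIRED_SECTION_KEYS = {"goal", "evidence_types", "visual_spec"}


def rule_check_plan(raw: str) -> list:
    """Return a list of rule violations; an empty list means the plan passes."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["plan must be a JSON object"]

    errors = [f"missing key: {k}" for k in sorted(REQUIRED_PLAN_KEYS - set(plan))]
    sections = plan.get("sections", [])
    if not isinstance(sections, list):
        return errors + ["sections must be a list"]
    for i, section in enumerate(sections):
        if not isinstance(section, dict):
            errors.append(f"section {i} must be an object")
            continue
        for k in sorted(REQUIRED_SECTION_KEYS - set(section)):
            errors.append(f"section {i} missing key: {k}")
    return errors


good_plan = json.dumps({
    "overview": "example overview",
    "sections": [{"goal": "g", "evidence_types": ["table"],
                  "visual_spec": {"kind": "chart"}}],
})
bad_plan = json.dumps({"overview": "example overview",
                       "sections": [{"goal": "g", "evidence_types": []}]})
```

Here `rule_check_plan(good_plan)` returns an empty list, while `bad_plan` is flagged for the missing `visual_spec`. The LLM-based rubric checks would run only after these cheap structural checks pass.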

Research Stage

For each planned section, a researcher performs an independent investigation. The output is a structured research package containing key findings, evidence, numerical data, tables, references, and writing instructions. This decomposition is meant to support broad coverage without losing traceability.

In parallel, researchers extract candidate images from visited webpages. Low-resolution, duplicated, irrelevant, or non-informative images are filtered out, and a VLM selector keeps images that satisfy the planner's visual requirements. Retained images are stored in Visual Working Memory with source URL, surrounding context, section association, and intended role. This is the paper's main design answer to post-hoc image insertion.
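The filtering and storage step might look roughly like the sketch below. Everything here is an assumption for illustration: the `ImageCandidate` fields follow the metadata listed above, the `min_side` threshold is arbitrary, duplicates are detected by URL only, and `selector` is a plain callable standing in for the VLM selector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageCandidate:
    url: str        # source URL
    width: int
    height: int
    context: str    # surrounding text on the source page
    section: str    # associated report section
    role: str       # intended communicative role


def filter_candidates(candidates, min_side=256, selector=lambda c: True):
    """Drop low-resolution and duplicate images, then apply a selector."""
    seen, kept = set(), []
    for cand in candidates:
        if min(cand.width, cand.height) < min_side:
            continue  # low resolution
        if cand.url in seen:
            continue  # duplicate (by URL only, a simplification)
        seen.add(cand.url)
        if selector(cand):
            kept.append(cand)
    return kept


candidates = [
    ImageCandidate("https://example.com/a.png", 800, 600, "ctx", "intro", "overview"),
    ImageCandidate("https://example.com/a.png", 800, 600, "ctx", "intro", "overview"),
    ImageCandidate("https://example.com/icon.png", 32, 32, "ctx", "intro", "decoration"),
    ImageCandidate("https://example.com/b.png", 1200, 900, "ctx", "methods", "diagram"),
]
visual_memory = filter_candidates(candidates)
```

Keeping the source URL, context, section, and role on each retained entry is what lets a later writing step tie an image back to a claim instead of treating it as decoration.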

Writing Stage

The writer receives the global plan, verified research packages, and Visual Working Memory. It generates text and image directives jointly, then the harness arbitrates among three image operations.

After the first report draft, Ptah applies verifier-guided test-time scaling. The six lifecycle refinement hooks are Section Refine, Image Refine, Overall Refine, HTML Generate, HTML Refine, and Render. The important detail is that refinement is not only prose polishing: image placement, image editing/deletion, global layout, and browser-rendered readability are also refined.

PtahEval

Figure 3 shows how PtahEval extends existing deep-research benchmarks with multimodal evaluation.

Figure 3. PtahEval evaluation protocol.
Figure 3. PtahEval protocol. The original caption calls this an illustration of the PtahEval evaluation. I place it in the method section because it defines the two evaluation axes used later in the experiments.

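PtahEval keeps the base benchmark score, then adds two VLM-judged dimensions: Image Content Quality (ICQ), which scores the inserted images, and MPQ, which scores the presentation of the rendered page.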

For MPQ, the generated report is rendered as a web page and a \(1000 \times 2000\) pixel first-screen screenshot is judged. This choice is practical: the authors evaluate what a reader sees, not just a text file with image tags.

Implementation And Tools

The experiments use Qwen3-32B as Planner, Researcher, and Verifier (including the LLM-based rubric checks), with Qwen3-VL-32B-Instruct as Writer and as the image selector in the research stage. The benchmark evaluator uses Qwen3-VL-235B-A22B-Instruct.

The tool stack includes text search, image search, image generation, image editing, and code execution. In the reported setup, text/image search use Serper, webpage parsing uses Jina Reader, and image generation/editing/evaluation components are accessed through SiliconFlow APIs. The paper says these APIs are replaceable interfaces rather than fixed parts of the core design.

Experiments And Results

The main benchmark setup uses DeepResearch Bench and DeepConsult. Baselines include direct Qwen3-32B and QwQ-32B report generation, text-only search agents ReAct, Search-o1, and WebThinker, plus LLM-I as a multimodal-generation baseline.

Table 1 collects the main content-quality values that are visible in the extracted Markdown. Some cells in the source table were blank after conversion, so this digest only reports the numeric cells available in paper.md.

| Method | DeepResearch Bench visible values | DeepConsult visible values |
| --- | --- | --- |
| WebThinker | Comprehensiveness 44.63; Insight/Depth 43.26; Instruction-Following 46.86; Readability 46.61; Overall 45.00 | Instruction-Following 2.94; Comprehensiveness 17.64; Completeness 2.94; Average 7.35 |
| Ptah | Comprehensiveness 42.97; Insight/Depth 44.32; Readability 47.95; Overall 45.16 | Instruction-Following 13.73; Comprehensiveness 18.63; Completeness 17.64; Writing Quality 14.71; Average 16.18 |

Table 1. Main content-quality results. The strongest numeric claim here is on DeepConsult, where Ptah reaches an average of 16.18 versus WebThinker's 7.35. On DeepResearch Bench, Ptah's overall score is only marginally above WebThinker's (45.16 versus 45.00), and WebThinker leads on Comprehensiveness (44.63 versus 42.97); the paper emphasizes Ptah's stronger Insight/Depth and Readability.

Table 2 gives the PtahEval values visible in the paper's extracted Markdown.

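| Method | VC | CMA | IC | ES | ICQ Avg. | DLB | IS | VED | VE | MPQ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM-I | 2.10 | 2.28 | 1.96 | 1.52 | 1.97 | - | - | 3.25 | - | - |
| Ptah | 4.42 | 4.79 | 4.35 | 4.01 | 4.39 | 3.72 | 3.78 | 3.61 | 3.74 | 3.71 |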

Table 2. PtahEval results on DeepResearch Bench. Ptah scores above 4.0 on every ICQ dimension, more than double LLM-I's score in each case, and has complete MPQ results, whereas only one LLM-I MPQ cell is visible. This supports the claim that visual evidence is clearer, better aligned, and presented with better rendered-page ergonomics than in the multimodal baseline.
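The Avg. columns in Table 2 appear to be unweighted means of the four sub-scores in each group. That is an inference from the visible numbers, not a formula stated in this digest, and the short check below only confirms that the reported averages are consistent with it to within rounding.

```python
from statistics import mean

# Sub-scores copied from Table 2 (VC, CMA, IC, ES and DLB, IS, VED, VE).
icq_subscores = {
    "LLM-I": [2.10, 2.28, 1.96, 1.52],
    "Ptah": [4.42, 4.79, 4.35, 4.01],
}
mpq_subscores = {"Ptah": [3.72, 3.78, 3.61, 3.74]}

# Reported averages from the same table.
reported_icq = {"LLM-I": 1.97, "Ptah": 4.39}
reported_mpq = {"Ptah": 3.71}

# Gap between the recomputed mean and the reported value, per method.
icq_gap = {m: abs(mean(v) - reported_icq[m]) for m, v in icq_subscores.items()}
mpq_gap = {m: abs(mean(v) - reported_mpq[m]) for m, v in mpq_subscores.items()}
```

Every gap stays below 0.01, which is what two-decimal rounding of a simple mean would produce.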

The paper's credibility claim is easiest to inspect in Table 3.

| Method | Citation Accuracy | Effective Citations | Avg. Search Calls |
| --- | --- | --- | --- |
| ReAct | 37.28 | 0.23 | 4.17 |
| Search-o1 | 40.91 | 0.31 | 2.78 |
| WebThinker | 60.74 | 2.32 | 5.91 |
| Ptah w/o Verifier | 30.29 | 4.75 | 5.13 |
| Ptah | 87.53 | 9.64 | 12.82 |

Table 3. FACT evaluation on DeepResearch Bench. Ptah reaches 87.53 citation accuracy and 9.64 effective citations per task, substantially above WebThinker's 60.74 and 2.32, while also making more search calls on average (12.82 versus 5.91). The w/o Verifier row shows that removing the verifier drops citation accuracy from 87.53 to 30.29, below every baseline in the table.

Figure 4 is the paper's human-evaluation evidence for PtahEval and rendered multimodal quality.

Figure 4. Human evaluation of Ptah against LLM-I and WebThinker.
Figure 4. Human evaluation. The original caption describes human evaluation of Ptah against LLM-I and WebThinker on DeepResearch Bench via PtahEval. The bars show frequent Ptah wins: 88-96% on Image Content Quality dimensions versus LLM-I, 92-100% on MPQ dimensions versus LLM-I, and 80-92% on MPQ dimensions versus WebThinker.

The paper also reports a user-centric human evaluation on 20 DeepResearch Bench reports. Table 4 summarizes the win-or-tie rates over WebThinker.

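| Evaluator | Readability | Usability | Information Acquisition | Overall |
| --- | --- | --- | --- | --- |
| Expert E1 | 85% | 90% | 95% | 95% |
| Expert E2 | 85% | 80% | 95% | 90% |
| General U1 | 90% | 95% | 100% | 100% |
| General U2 | 95% | 90% | 95% | 95% |
| Average | 88.75% | 88.75% | 96.25% | 95.00% |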

Table 4. User-centric human evaluation. This supports the usability claim, but the sample is still small and centered on report preference rather than independent factual reproduction.

Table 5 isolates the role of test-time scaling.

| Method | DRB | ICQ | MPQ | Avg. Images / Failures |
| --- | --- | --- | --- | --- |
| LLM-I | 36.36 | 1.97 | 3.00 | 0.74 / 0.14 |
| Ptah w/o TTS | 42.13 | 2.77 | 3.49 | 5.06 / 0.38 |
| Ptah | 45.16 | 4.39 | 3.71 | 3.76 / 0.12 |

Table 5. Test-time scaling ablation. Removing TTS drops DRB by 3.03 points (45.16 to 42.13), ICQ from 4.39 to 2.77, and MPQ from 3.71 to 3.49. The full system also inserts fewer images on average (3.76 versus 5.06) and has fewer image failures (0.12 versus 0.38) than the no-TTS variant.

The cost side is reported in Table 6. The full pipeline takes 1015 seconds on average, with research as the dominant stage.

| Stage | Avg. Time (s) |
| --- | --- |
| Planning Stage | 192 |
| Research Stage | 459 |
| Writing Stage | 121 |
| TTS | 243 |
| Total | 1015 |

Table 6. Stage-wise latency. Research is the largest component because it performs open-ended evidence collection, webpage inspection, and image-pool construction.

Table 7 shows the efficiency benefit of parallel researchers.

| Research Execution | Avg. Time (s) | Relative Change |
| --- | --- | --- |
| Parallel | 459 | 1.00x |
| Sequential | 1328 | 2.89x slower |

Table 7. Parallel research latency. Parallel section-level research reduces research-stage wall-clock time by 65.4% relative to sequential execution.
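Section-level parallelism of this kind can be sketched with a thread pool, one task per planned section. This is an illustrative assumption, not the paper's scheduler: `research_section` is a hypothetical stand-in for a researcher agent, and `max_workers` is arbitrary. The last lines recompute the Table 7 ratios from the reported averages.

```python
from concurrent.futures import ThreadPoolExecutor


def research_section(section: str) -> dict:
    """Hypothetical stand-in for one researcher agent's investigation."""
    return {"section": section, "findings": [f"finding for {section}"]}


def research_all(sections, max_workers=4):
    """Run one researcher per section concurrently, preserving plan order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so packages line up with the plan.
        return list(pool.map(research_section, sections))


packages = research_all(["background", "methods", "results"])

# Reported research-stage averages from Table 7, in seconds.
parallel_s, sequential_s = 459, 1328
reduction = 1 - parallel_s / sequential_s   # about 0.654, i.e. 65.4% less time
slowdown = sequential_s / parallel_s        # about 2.89x
```

Threads suit this workload on the assumption that researchers spend most of their time waiting on search and model APIs rather than on local computation.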

Table 8 gives the verifier-strength trade-off.

| Setting | Time (s) |
| --- | --- |
| Current Verifier - Planning | 192 |
| Current Verifier - Research | 459 |
| DeepSeek-R1 Verifier - Planning | 853 |
| DeepSeek-R1 Verifier - Research | 1408 |

Table 8. Verifier latency. A stronger verifier can trigger more expensive verification and additional revision rounds, so the paper frames verifier choice as a quality-efficiency trade-off.

Figure 5 shows first-screen examples of Ptah reports.

Figure 5. Example generated multimodal reports.
Figure 5. Example cases. The original caption says these are first-screen views of multimodal analytical reports generated by Ptah. I include this as qualitative evidence for the rendered-report style, not as a substitute for quantitative scores.

Table 9 is an additional same-framework ablation on removing images from Ptah outputs.

| Method | DRB Overall | MPQ Avg. |
| --- | --- | --- |
| WebThinker | 45.00 | 3.11 |
| Ptah w/ images | 45.16 | 3.71 |
| Ptah w/o images | 45.10 | 3.29 |

Table 9. Visual-elements ablation. Removing images barely changes the text-oriented DRB score (45.16 versus 45.10) but lowers MPQ from 3.71 to 3.29, supporting the claim that visuals mainly improve multimodal presentation rather than text quality.

Practical Takeaways

The limitations section explicitly says stable autonomous long-horizon multimodal search and generation remains challenging with current open-source models. The modular stage design is therefore a reliability choice, not proof that the decomposition is the only possible architecture.