arXiv · 2026 · avg 7.81 · interest 8.70 · 269 HF · multimodal grounding · document intelligence · evidence attribution

This paper argues that document VQA needs evidence attribution, not just final-answer scoring, because models can answer correctly while citing the wrong source region. CiteVQA evaluates answers with element-level bounding-box citations and shows a large attribution hallucination gap across 20 MLLMs.

Source-first digest for monthly 2026_05 rank 3, rank_id p003.

Motivation / Background

The paper targets a gap in document visual question answering evaluation. Existing Doc-VQA benchmarks mostly score whether the final text answer is correct, but they do not verify whether the answer is grounded in the exact source region that supports it. The authors argue that this is unsafe in legal, finance, medical, and other audit-heavy document settings, because a model can produce the right answer while citing the wrong table, paragraph, page, or visual element.

CiteVQA changes the task definition: a model must output both an answer and element-level bounding-box citations. A response is useful only if a human can visually verify the answer against the cited source regions. Figure 1 shows the task framing, dataset position, and the gap between answer accuracy and Strict Attributed Accuracy.

Figure 1. CiteVQA overview. The benchmark requires correct answers and precise evidence citations. The source caption highlights that models can retain high answer accuracy while losing Strict Attributed Accuracy because of attribution hallucination.

The benchmark is also positioned against prior document VQA resources in Table 1. The main difference is not raw document count; it is the combination of long documents, element-level evidence, and joint answer-plus-citation scoring.

| Benchmark | Documents | Average pages | Evidence granularity | Joint answer/evidence metric |
| --- | --- | --- | --- | --- |
| DocVQA | 12,767 | 1.0 | Page | No |
| InfoVQA | 5,485 | 1.0 | Page | No |
| MP-DocVQA | 6,000 | 8.3 | Page | No |
| MMLongBench-Doc | 135 | 47.5 | Page | No |
| SlideVQA | 2,619 | 20.0 | Block | No |
| ViDoRe V3 | 190 | 137.0 | Block | No |
| CiteVQA | 711 | 40.6 | Element | Yes |

Table 1. Benchmark comparison. This digest table is copied from the source benchmark-comparison table and normalized to plain text labels. CiteVQA is smaller than single-page VQA datasets but tests longer, more realistic documents and element-level traceability.

Claims And Evidence

| Claim id | Main claim | Support score | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Answer-only Doc-VQA metrics miss an important failure mode: correct text paired with wrong evidence. | 5 | problem framing, attribution hallucination results, case studies |
| C2 | CiteVQA provides a realistic evidence-attribution benchmark: 1,897 questions over 711 PDFs, seven domains, two languages, and long multi-page documents. | 5 | dataset statistics, question distribution figure |
| C3 | The annotation pipeline is scalable and structured enough to generate element-level citations, using document linking, evidence package extraction, QA templating, verification, and evidence ablation. | 4 | pipeline overview, pipeline details, expert audit |
| C4 | Strict Attributed Accuracy (SAA) is a strong audit metric because it only credits a sample when the answer is correct and the supporting citation is either relevant or localized. | 5 | metric equations, main results |
| C5 | Current MLLMs show severe attribution hallucination: top systems can answer well, but their SAA and evidence recall remain much lower. | 5 | main results table, coarse localization analysis, case studies |
| C6 | Better localization appears to help answer correctness, but the paper's evidence is correlational and ablation-based rather than a causal training intervention. | 4 | attribution and accuracy analysis, search-space ablation |
| C7 | CiteVQA is costly to reproduce and evaluate because it relies on strong MLLMs, coordinate-level checking, and high-resolution document inputs. | 4 | experiment setup, resolution sensitivity, limitations |

Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly backed by the paper's definitions, tables, or experiments; a score of 4 means the paper presents substantial evidence but still depends on its automated pipeline, judge setup, or internal ablations.

Core Technical Idea

The core idea is to treat trustworthy document VQA as a joint answer and citation problem. Each sample is represented as \((D, Q, A_{\text{gt}}, \mathcal{B}_{\text{gt}})\), where \(\mathcal{B}_{\text{gt}}\) is the set of ground-truth evidence boxes. The crucial subset \(\mathcal{B}_{\text{crucial}}\) is identified by masking ablation: if masking a box prevents a strong model from answering correctly, the box is labeled crucial.

The main localization metric is crucial-evidence recall at IoU@0.5:

$$ \text{Rec.} = \frac{1}{|\mathcal{B}_{\text{crucial}}|} \sum_{b_{\text{gt}} \in \mathcal{B}_{\text{crucial}}} \mathbf{1}_{\left( \max_{b_{\text{pred}} \in \mathcal{B}_{\text{pred}}} \text{IoU}(b_{\text{pred}}, b_{\text{gt}}) \ge 0.5 \right)} $$
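The recall formula can be illustrated with a minimal Python sketch. It assumes boxes are axis-aligned `(x0, y0, x1, y1)` tuples that already sit on the same page, and it returns 0 when there are no crucial boxes; both choices are assumptions of this sketch, not details confirmed by the paper. The function names `iou` and `crucial_recall` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def crucial_recall(crucial, predicted, threshold=0.5):
    """Share of crucial boxes matched by at least one predicted box."""
    if not crucial:
        return 0.0  # assumption: undefined case treated as zero
    hits = sum(
        1 for gt in crucial
        if any(iou(pred, gt) >= threshold for pred in predicted)
    )
    return hits / len(crucial)


# Two crucial boxes; only the first is covered by a prediction.
crucial = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(0, 0, 10, 8), (50, 50, 60, 60)]
recall = crucial_recall(crucial, predicted)
```

Here the first prediction overlaps the first crucial box with IoU 0.8, and no prediction touches the second, so recall is 0.5.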

Relevance and answer correctness are judged on 0-5 scales:

$$ \text{Rel.} = \frac{1}{n}\sum_{i=1}^{n} \mathcal{J}_{\text{rel}}(A_i,b_i) \in [0,5] $$
$$ \text{Ans.} = \mathcal{J}_{\text{ans}}(\{A_1,\dots,A_n\}, A_{\text{gt}}) \in [0,5] $$

The headline metric, Strict Attributed Accuracy, is binary at the sample level:

$$ \text{SAA} = \mathbf{1}_{(\text{Ans.} \ge 4 \land (\text{Rel.} \ge 4 \lor \text{Rec.} \ge 0.6))} $$

This definition matters because it gives models two routes to attribution credit. They can cite regions that overlap crucial evidence, or they can cite regions judged semantically relevant, but they still need a correct answer. The supplementary metrics in Table 2 separate page navigation, box precision, and localization F1.
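The decision rule is small enough to write out directly. The sketch below assumes `ans` and `rel` are on the raw 0-5 judge scale and `rec` is a fraction in [0, 1], as in the formulas above; the function name is hypothetical.

```python
def strict_attributed_accuracy(ans, rel, rec):
    """Return 1 only if the answer is correct AND the citation is
    either judged relevant or overlaps enough crucial evidence."""
    answer_ok = ans >= 4
    attribution_ok = rel >= 4 or rec >= 0.6
    return int(answer_ok and attribution_ok)


# (ans, rel, rec) -> SAA for four illustrative cases
right_answer_wrong_evidence = strict_attributed_accuracy(5, 2, 0.0)
credited_via_relevance = strict_attributed_accuracy(5, 4.5, 0.0)
credited_via_recall = strict_attributed_accuracy(5, 1, 1.0)
wrong_answer_good_evidence = strict_attributed_accuracy(3, 5, 1.0)
```

The first case is the attribution-hallucination pattern the paper targets: the answer passes, but neither attribution route does, so the sample scores 0.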

| Metric | What it checks | Formula or decision rule |
| --- | --- | --- |
| Page recall | Whether the model reached the right evidence page at all | \(\text{Page.} = \frac{\lvert \{p \in \mathcal{P}_{\text{crucial}} \mid \exists \hat{p} \in \mathcal{P}_{\text{pred}}, \hat{p}=p\} \rvert}{\lvert \mathcal{P}_{\text{crucial}} \rvert}\) |
| Precision | Whether predicted boxes avoid irrelevant evidence | \(\text{Prec.} = \frac{1}{\lvert \mathcal{B}_{\text{pred}} \rvert} \sum_{b_{\text{pred}} \in \mathcal{B}_{\text{pred}}} \mathbf{1}_{(\max_{b_{\text{gt}}} \text{IoU}(b_{\text{pred}}, b_{\text{gt}}) \ge 0.5)}\) |
| F1 | Localization balance between recall and precision | \(F_1 = 2 \cdot \frac{\text{Prec.}\cdot\text{Rec.}}{\text{Prec.}+\text{Rec.}}\) |

Table 2. Supplementary localization metrics. These metrics explain whether a model fails because it cannot find the page, cannot draw tight boxes, or produces many irrelevant citations.
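A rough sketch of the three supplementary metrics follows. To stay short, it assumes the best IoU of each predicted box against the ground truth has already been computed, and it returns 0 for empty inputs; these conventions and all function names are assumptions of the sketch rather than the paper's implementation. Pages can be plain integers or `(doc_id, page)` pairs for multi-document samples.

```python
def page_recall(crucial_pages, predicted_pages):
    """Share of crucial pages that appear among the predicted pages."""
    crucial = set(crucial_pages)
    if not crucial:
        return 0.0
    return len(crucial & set(predicted_pages)) / len(crucial)


def box_precision(best_ious, threshold=0.5):
    """Share of predicted boxes whose best ground-truth IoU clears the threshold."""
    if not best_ious:
        return 0.0
    return sum(1 for v in best_ious if v >= threshold) / len(best_ious)


def f1(prec, rec):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0


page = page_recall({3, 7}, [3, 4])             # one of two crucial pages found
prec = box_precision([0.8, 0.1, 0.55, 0.0])    # two of four boxes match
score = f1(prec, 0.25)
```

Reading the three values together separates failure modes: a low `page` value means navigation failed, while a decent `page` value with low `prec` points to loose or spurious boxes.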

Method Details

The benchmark construction pipeline has five major stages. Figure 2 is the paper's overview: documents are linked, parsed into evidence packages, converted into QA tasks through templates, verified, and reduced to crucial evidence through ablation.

Figure 2. CiteVQA construction pipeline. The workflow aggregates semantically related PDFs, asks agents to navigate MinerU parsing outputs, synthesizes realistic QA pairs from templates, verifies answerability, and uses evidence ablation to identify crucial evidence.

Table 3 captures the benchmark scale. The most important facts for a reader are that the average document is 40.6 pages, 48% of questions are multi-document, and nearly 30% of cited evidence is non-textual.

| Statistic | Value |
| --- | --- |
| Documents | 711 |
| Macro / micro document types | 7 / 30 |
| Average / median pages | 40.6 / 30.0 |
| Language split, EN / ZH | 451 / 260 |
| Total questions | 1,897 |
| Single-doc questions | 987 (52.0%) |
| Multi-doc, one gold document | 487 (25.7%) |
| Multi-doc, multiple gold documents | 423 (22.3%) |
| Complex synthesis questions | 839 (44.23%) |
| Factual retrieval questions | 499 (26.30%) |
| Multimodal parsing questions | 352 (18.56%) |
| Quantitative reasoning questions | 207 (10.91%) |
| Evidence source: text / table / image / equation | 70.12% / 21.99% / 7.04% / 0.84% |
| Average / max evidences per question | 2.57 / 10 |

Table 3. CiteVQA dataset statistics. This table is copied from the source statistics table and reformatted for the digest.

Figure 3 is the visual counterpart to Table 3. It matters because it shows the benchmark is not just a set of text-span retrieval tasks: it includes domain-specific question mixes, evidence locality variation, and cross-page evidence spans.

Figure 3. Question types and evidence distribution. The figure analyzes domain-specific question types, where evidence appears within documents, and how many pages evidence spans.

The pipeline starts from more than 100 million raw PDFs, preselects roughly 250k candidates, then uses a two-stage MLLM annotation process for coarse domain/language classification and fine-grained subcategory classification. The final benchmark keeps 711 documents across seven domains and 30 subcategories.

For multi-document tasks, each document receives a semantic profile containing high-level metadata such as document type, core thesis, and section units. Dense retrieval selects the top \(K_{\text{doc}}=5\) candidate documents for an anchor. An LLM then performs section-level matching and returns up to five association groups, each with one to three related segments. When matched pages are assembled into a synthetic multi-document workspace, the paper keeps a bijective mapping \(f_{\text{map}}\) back to the original PDF coordinates so synthesized evidence remains traceable.
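The traceability requirement can be pictured with a small sketch of a page-level bijection. The paper names the mapping \(f_{\text{map}}\) but this digest does not describe its implementation, so everything below is hypothetical: the function names, the `(doc_id, page)` keys, and the assumption that pages are copied into the workspace without rescaling, so box coordinates carry over unchanged.

```python
def build_page_map(selected):
    """Map workspace page index -> (doc_id, original_page) and back.

    `selected` lists source pages in workspace order. A repeated source
    page would break the one-to-one property, so it is rejected.
    """
    forward = {i: src for i, src in enumerate(selected)}
    inverse = {src: i for i, src in forward.items()}
    if len(inverse) != len(forward):
        raise ValueError("duplicate source page: mapping would not be bijective")
    return forward, inverse


def to_original(forward, workspace_box):
    """Translate (workspace_page, x0, y0, x1, y1) to original PDF coordinates."""
    page, x0, y0, x1, y1 = workspace_box
    doc_id, original_page = forward[page]
    return (doc_id, original_page, x0, y0, x1, y1)


forward, inverse = build_page_map([("A", 12), ("A", 13), ("B", 4)])
traced = to_original(forward, (2, 10, 20, 110, 80))
```

Keeping both directions lets evidence found in the workspace be reported against the source PDF, and lets source-side annotations be located in the workspace.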

Evidence extraction uses MinerU2.5 to produce document IDs, page numbers, bounding boxes, and OCR content. High-performance MLLM agents then navigate this parsed space to concatenate supporting facts into an Evidence Package. QA construction uses templates distilled from open-source datasets across academic technology, medical and health, business finance, industrial and construction, and government/legal domains.

Quality control has three layers. First, candidate QA pairs are retained only when a powerful MLLM can answer from the evidence screenshots alone. Second, a zero-document self-test using Qwen3-VL-235B-A22B discards questions answerable without document context. Third, evidence ablation masks each BBox element one at a time; a box becomes crucial evidence if masking it breaks answerability.
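The third layer is a leave-one-out loop, sketched below under strong simplifications. In the paper the masking is applied to document content and answerability is judged by a strong model; here `can_answer` is a hypothetical callable standing in for that judge, and boxes are represented by string labels.

```python
def find_crucial(boxes, can_answer):
    """Mask one box at a time; a box is crucial if the question
    can no longer be answered from the remaining boxes."""
    crucial = []
    for i, box in enumerate(boxes):
        visible = boxes[:i] + boxes[i + 1:]
        if not can_answer(visible):
            crucial.append(box)
    return crucial


# Toy judge: the question needs both revenue figures; the footnote is optional.
required = {"revenue_2023", "revenue_2024"}


def can_answer(visible):
    return required <= set(visible)


boxes = ["revenue_2023", "footnote", "revenue_2024"]
crucial = find_crucial(boxes, can_answer)
```

One property of one-at-a-time masking is worth keeping in mind: if two boxes each suffice on their own, masking either leaves the question answerable, so neither would be labeled crucial by this loop.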

The paper adds two validation checks. A PhD-level expert audit of 200 sampled outputs reports human averages of 2.97 for difficulty, 4.43 for answer quality, and 4.91 for crucial-evidence quality on a 5-point scale. A separate auxiliary training validation uses 3k CiteVQA-generated samples from ViDoRe V3 PDFs and reports that training on the synthetic data nearly matches training on human-annotated data in the AgenticOCR-style setup.

Experiments And Results

The evaluation covers 20 MLLMs across proprietary and open-source families. Gemini models use the native File API; long-context models use 150 DPI page screenshots; standard-context models downscale pages to at most \(1024 \times 1024\); GLM-5V-Turbo uses \(768 \times 768\). All models use a unified prompt with sampling temperature 1.0. Qwen3-VL-235B-A22B is the primary automated judge, and a 200-sample human study reports no statistically significant difference from automated judges for Rel. and Ans. scores.

Table 4 extracts the key overall rows from the paper's main results table. Rel. and Ans. are normalized to 100 by multiplying the original 0-5 scores by 20, matching the paper's table convention.

| Model | Overall Rec. | Overall Rel. | Overall Ans. | Overall SAA | Digest reading |
| --- | --- | --- | --- | --- | --- |
| Gemini-3.1-Pro-Preview | 66.0 | 83.6 | 86.1 | 76.0 | Best overall attribution and SAA |
| Gemini-3-Flash-Preview | 45.4 | 75.7 | 84.5 | 65.4 | Strong SAA but weaker localization than Gemini-3.1-Pro |
| GPT-5.4 | 31.0 | 67.5 | 87.1 | 59.0 | Best answer score, but lower attribution |
| Gemini-2.5-Pro | 27.4 | 59.8 | 82.2 | 47.0 | Solid answer accuracy, weaker citation grounding |
| Qwen3-VL-235B-A22B | 11.3 | 35.3 | 72.3 | 22.5 | Strongest open-source MLLM row in the paper |
| Qwen3-VL-32B | 6.6 | 30.5 | 72.3 | 17.3 | High answer score relative to low localization |
| Qwen3-VL-8B | 1.0 | 14.7 | 61.2 | 7.5 | Severe evidence-attribution failure |

Table 4. Key overall model results. The headline pattern is the gap between answer quality and evidence attribution. GPT-5.4 reaches 87.1 answer accuracy but only 59.0 SAA; Qwen3-VL-235B-A22B reaches 72.3 answer accuracy but only 22.5 SAA.
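To make the gap concrete, the sketch below subtracts SAA from the normalized answer score for each row of Table 4. The difference is an illustrative quantity introduced here, not a metric defined in the paper: Ans. is a mean judge score rescaled to 0-100, while SAA is a share of samples, so the subtraction is only a rough indicator.

```python
# (Overall Ans., Overall SAA) copied from Table 4
rows = {
    "Gemini-3.1-Pro-Preview": (86.1, 76.0),
    "Gemini-3-Flash-Preview": (84.5, 65.4),
    "GPT-5.4": (87.1, 59.0),
    "Gemini-2.5-Pro": (82.2, 47.0),
    "Qwen3-VL-235B-A22B": (72.3, 22.5),
    "Qwen3-VL-32B": (72.3, 17.3),
    "Qwen3-VL-8B": (61.2, 7.5),
}

# Round to one decimal to match the table's precision.
gaps = {model: round(ans - saa, 1) for model, (ans, saa) in rows.items()}
smallest = min(gaps, key=gaps.get)
largest = max(gaps, key=gaps.get)

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:<24} {gap:5.1f}")
```

On these rows the difference ranges from 10.1 points for Gemini-3.1-Pro-Preview to 55.0 points for Qwen3-VL-32B, which matches the digest's reading that the open-source models answer far better than they attribute.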

The authors call this gap Attribution Hallucination. It is not just a fine-grained bounding-box issue: the appendix reports that page-level recall is also low for many models, which means models often fail before precise box localization even begins.

Multi-document complexity worsens the problem. For Gemini-3.1-Pro-Preview, crucial-evidence recall falls from 68.9 in single-document tasks to 55.3 in multi-document, multi-gold tasks. The paper reports similar drops in page-level recall and F1 for advanced systems in multi-document scenarios, showing that cross-document navigation remains a hard frontier.

Figure 4. Fine-grained SAA results by question type (radar chart). The paper reports that quantitative reasoning is comparatively easier for top models, with Gemini-3.1-Pro-Preview reaching 82.6, while multimodal parsing is a bottleneck because it requires locating visual elements from descriptive cues before parsing their content.

The paper also argues that evidence attribution is not merely a post-hoc explanation. Once evidence quality, measured as \(\max(\text{Rel.}, \text{Rec.})\), moves above the 0-30 region, answer accuracy tends to rise with it. The search-space ablations in Table 5 support that interpretation: when the model is given ground-truth pages or the gold document, answer scores improve.

| Model | Scenario | Base setting | GT page or gold-document setting | Gain |
| --- | --- | --- | --- | --- |
| Qwen3.5-27B | Single-doc, GT pages | 79.3 | 84.6 | +5.3 |
| Qwen3-VL-32B | Single-doc, GT pages | 75.3 | 79.9 | +4.6 |
| Qwen3.5-9B | Single-doc, GT pages | 73.2 | 75.2 | +2.0 |
| Qwen3-VL-8B | Single-doc, GT pages | 67.0 | 71.1 | +4.1 |
| Qwen3.5-27B | Multi-doc, gold document | 73.1 | 81.6 | +8.5 |
| Qwen3-VL-32B | Multi-doc, gold document | 67.6 | 72.6 | +5.0 |
| Qwen3.5-9B | Multi-doc, gold document | 58.4 | 68.1 | +9.7 |
| Qwen3-VL-8B | Multi-doc, gold document | 53.3 | 66.7 | +13.4 |

Table 5. Search-space ablation. Narrowing the search space improves answer performance, especially for weaker models in multi-document settings. This is evidence that localization is a bottleneck, though not a full causal proof that better autonomous citation always improves reasoning.
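A quick check of the Table 5 numbers shows how much larger the multi-document gains are. The averages below are computed for this digest from the table's base and narrowed scores; they are not figures reported by the paper.

```python
from statistics import mean

# (base, narrowed) answer scores copied from Table 5
single_doc = {
    "Qwen3.5-27B": (79.3, 84.6),
    "Qwen3-VL-32B": (75.3, 79.9),
    "Qwen3.5-9B": (73.2, 75.2),
    "Qwen3-VL-8B": (67.0, 71.1),
}
multi_doc = {
    "Qwen3.5-27B": (73.1, 81.6),
    "Qwen3-VL-32B": (67.6, 72.6),
    "Qwen3.5-9B": (58.4, 68.1),
    "Qwen3-VL-8B": (53.3, 66.7),
}


def mean_gain(rows):
    """Average improvement when the search space is narrowed."""
    return mean(narrowed - base for base, narrowed in rows.values())


single_gain = mean_gain(single_doc)  # about 4.0 points
multi_gain = mean_gain(multi_doc)    # about 9.15 points
```

Across these four models the average gain is roughly 4.0 points with ground-truth pages and roughly 9.15 points with the gold document, consistent with the table caption's point that narrowing helps most in multi-document settings.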

Figure 5 gives the compact visual case study used in the main paper. It is the clearest intuition for why SAA is different from answer accuracy.

Figure 5. Simple attribution case study. Qwen3-VL-235B-A22B answers correctly but cites blank or incomplete crops, so SAA is 0. Gemini-3.1-Pro-Preview grounds the answer in near-correct crops and receives SAA 1.

Figure 6. Domain-specific SAA scores (radar chart). The appendix reports that academic-technology documents are easiest, with Gemini-3.1-Pro-Preview reaching 85.0 SAA, while publishing and media documents are hardest, with the highest SAA only 63.3 because of non-linear layout and image-text interleaving.

| Resolution strategy for Qwen3-VL-235B-A22B | Total pixels | Rec. | Rel. | Ans. | SAA |
| --- | --- | --- | --- | --- | --- |
| Full resolution | \(1024^2\) | 11.3 | 35.3 | 72.3 | 22.5 |
| Half-pixel scaling | \(1024^2 / 2 \approx 724^2\) | 4.2 | 23.6 | 66.8 | 11.8 |
| Quarter-pixel scaling | \(1024^2 / 4 = 512^2\) | 1.6 | 17.2 | 53.5 | 5.3 |

Table 6. Resolution sensitivity. Going from full to quarter resolution lowers the answer score from 72.3 to 53.5 but cuts SAA from 22.5 to 5.3 and recall from 11.3 to 1.6, so precise citation depends much more heavily on visual fidelity than answering does.
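The pixel budgets in Table 6 translate into page dimensions as in the sketch below. It assumes an aspect-preserving downscale that never upsamples and that floors to whole pixels; the paper's exact resizing procedure is not described in this digest, and the function name is hypothetical.

```python
import math


def fit_to_pixel_budget(width, height, budget):
    """Scale (width, height) down so that width * height <= budget,
    preserving aspect ratio and never enlarging the page."""
    scale = min(1.0, math.sqrt(budget / (width * height)))
    return int(width * scale), int(height * scale)


full = 1024 * 1024
full_size = fit_to_pixel_budget(1024, 1024, full)          # unchanged
half_size = fit_to_pixel_budget(1024, 1024, full // 2)     # side shrinks by sqrt(2)
quarter_size = fit_to_pixel_budget(1024, 1024, full // 4)  # side halves
```

Halving the pixel count shrinks each side by a factor of \(\sqrt{2}\), which is why the half-pixel row corresponds to roughly \(724 \times 724\) rather than \(512 \times 512\).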

The appendix includes two larger case studies. Figure 7 shows a model with the correct text answer but wrong pricing-table citation; Figure 8 shows both semantic and visual failure in a multi-step numerical question.

Figure 7. Case study 1. Gemini-3.1-Pro-Preview correctly cites the "Contract Optional Years" table and gets SAA 1. GPT-5.4 gives the correct answer but cites an incorrect pricing table, yielding Rec. 0 and SAA 0.

Figure 8. Case study 2. Gemini-2.5-Pro correctly calculates \(0.40 - 0.14 = 0.26\) eV and cites the corresponding evidence. Qwen3-VL-8B extracts wrong values, cites irrelevant regions, and receives Ans. 1 and SAA 0.

Practical Takeaways

The paper's stated limitations are important: domain-specific definitions of authoritative evidence may vary, large-scale replication requires substantial compute, and coordinate-level evaluation is more expensive than standard VQA scoring. There is also a risk that models overfit to CiteVQA's document distributions and metrics instead of learning general evidence attribution.

Reference Coverage