CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Source-first digest for monthly 2026_05 rank 3, rank_id p003.

Routing status: success
PDF extraction: not used

Motivation / Background

The paper targets a gap in document visual question answering evaluation. Existing Doc-VQA benchmarks mostly score whether the final text answer is correct, but they do not verify whether the answer is grounded in the exact source region that supports it. The authors argue that this is unsafe in legal, finance, medical, and other audit-heavy document settings, because a model can produce the right answer while citing the wrong table, paragraph, page, or visual element.

CiteVQA changes the task definition: a model must output both an answer and element-level bounding-box citations. A response is useful only if a human can visually verify the answer against the cited source regions. Figure 1 shows the task framing, dataset position, and the gap between answer accuracy and Strict Attributed Accuracy.

**Figure 1. CiteVQA overview.** The benchmark requires correct answers and precise evidence citations. The source caption highlights that models can retain high answer accuracy while losing Strict Attributed Accuracy because of attribution hallucination.

The benchmark is also positioned against prior document VQA resources in Table 1. The main difference is not raw document count; it is the combination of long documents, element-level evidence, and joint answer-plus-citation scoring.

Benchmark	Documents	Average pages	Evidence granularity	Joint answer/evidence metric
DocVQA	12,767	1.0	Page	No
InfoVQA	5,485	1.0	Page	No
MP-DocVQA	6,000	8.3	Page	No
MMLongBench-Doc	135	47.5	Page	No
SlideVQA	2,619	20.0	Block	No
ViDoRe V3	190	137.0	Block	No
CiteVQA	711	40.6	Element	Yes

Table 1. Benchmark comparison. This digest table is copied from the source benchmark-comparison table and normalized to plain text labels. CiteVQA is smaller than single-page VQA datasets but tests longer, more realistic documents and element-level traceability.

Claims And Evidence

Claim id	Main claim	Support	Evidence anchors
C1	Answer-only Doc-VQA metrics miss an important failure mode: correct text paired with wrong evidence.	5	problem framing, attribution hallucination results, case studies
C2	CiteVQA provides a realistic evidence-attribution benchmark: 1,897 questions over 711 PDFs, seven domains, two languages, and long multi-page documents.	5	dataset statistics, question distribution figure
C3	The annotation pipeline is scalable and structured enough to generate element-level citations, using document linking, evidence package extraction, QA templating, verification, and evidence ablation.	4	pipeline overview, pipeline details, expert audit
C4	Strict Attributed Accuracy (SAA) is a strong audit metric because it only credits a sample when the answer is correct and the supporting citation is either relevant or localized.	5	metric equations, main results
C5	Current MLLMs show severe attribution hallucination: top systems can answer well, but their SAA and evidence recall remain much lower.	5	main results table, coarse localization analysis, case studies
C6	Better localization appears to help answer correctness, but the paper's evidence is correlational and ablation-based rather than a causal training intervention.	4	attribution and accuracy analysis, search-space ablation
C7	CiteVQA is costly to reproduce and evaluate because it relies on strong MLLMs, coordinate-level checking, and high-resolution document inputs.	4	experiment setup, resolution sensitivity, limitations

Support scores are support-from-paper scores, not independent reproduction scores. A score of 5 means the claim is directly backed by the paper's definitions, tables, or experiments; a score of 4 means the paper presents substantial evidence but still depends on its automated pipeline, judge setup, or internal ablations.

Core Technical Idea

The core idea is to treat trustworthy document VQA as a joint answer and citation problem. Each sample is represented as \((D, Q, A_{\text{gt}}, \mathcal{B}_{\text{gt}})\), where \(\mathcal{B}_{\text{gt}}\) is the set of ground-truth evidence boxes. The crucial subset \(\mathcal{B}_{\text{crucial}}\) is identified by masking ablation: if masking a box prevents a strong model from answering correctly, the box is labeled crucial.

The main localization metric is crucial-evidence recall at IoU@0.5:

\text{Rec.} = \frac{1}{|\mathcal{B}_{\text{crucial}}|} \sum_{b_{\text{gt}} \in \mathcal{B}_{\text{crucial}}} \mathbf{1}_{\left( \max_{b_{\text{pred}} \in \mathcal{B}_{\text{pred}}} \text{IoU}(b_{\text{pred}}, b_{\text{gt}}) \ge 0.5 \right)}

Relevance and answer correctness are judged on 0-5 scales:

\text{Rel.} = \frac{1}{n}\sum_{i=1}^{n} \mathcal{J}_{\text{rel}}(A_i,b_i) \in [0,5]

\text{Ans.} = \mathcal{J}_{\text{ans}}(\{A_1,\dots,A_n\}, A_{\text{gt}}) \in [0,5]

The headline metric, Strict Attributed Accuracy, is binary at the sample level:

\text{SAA} = \mathbf{1}_{(\text{Ans.} \ge 4 \land (\text{Rel.} \ge 4 \lor \text{Rec.} \ge 0.6))}

This definition matters because it gives models two routes to attribution credit. They can cite regions that overlap crucial evidence, or they can cite regions judged semantically relevant, but they still need a correct answer. The supplementary metrics in Table 2 separate page navigation, box precision, and localization F1.

Metric	What it checks	Formula or decision rule
Page recall	Whether the model reached the right evidence page at all	\(\text{Page.} = \frac{	\{p \in \mathcal{P}_{\text{crucial}} \mid \exists \hat{p} \in \mathcal{P}_{\text{pred}}, \hat{p}=p\}	}{	\mathcal{P}_{\text{crucial}}	}\)
Precision	Whether predicted boxes avoid irrelevant evidence	\(\text{Prec.} = \frac{1}{	\mathcal{B}_{\text{pred}}	} \sum_{b_{\text{pred}}} \mathbf{1}_{(\max_{b_{\text{gt}}} \text{IoU}(b_{\text{pred}}, b_{\text{gt}}) \ge 0.5)}\)
F1	Localization balance between recall and precision	\(F_1 = 2 \cdot \frac{\text{Prec.}\cdot\text{Rec.}}{\text{Prec.}+\text{Rec.}}\)

Table 2. Supplementary localization metrics. These metrics explain whether a model fails because it cannot find the page, cannot draw tight boxes, or produces many irrelevant citations.

Method Details

The benchmark construction pipeline has five major stages. Figure 2 is the paper's overview: documents are linked, parsed into evidence packages, converted into QA tasks through templates, verified, and reduced to crucial evidence through ablation.

**Figure 2. CiteVQA construction pipeline.** The workflow aggregates semantically related PDFs, asks agents to navigate MinerU parsing outputs, synthesizes realistic QA pairs from templates, verifies answerability, and uses evidence ablation to identify crucial evidence.

Table 3 captures the benchmark scale. The most important facts for a reader are that the average document is 40.6 pages, 48% of questions are multi-document, and nearly 30% of cited evidence is non-textual.

Statistic	Value
Documents	711
Macro / micro document types	7 / 30
Average / median pages	40.6 / 30.0
Language split, EN / ZH	451 / 260
Total questions	1,897
Single-doc questions	987 (52.0%)
Multi-doc, one gold document	487 (25.7%)
Multi-doc, multiple gold documents	423 (22.3%)
Complex synthesis questions	839 (44.23%)
Factual retrieval questions	499 (26.30%)
Multimodal parsing questions	352 (18.56%)
Quantitative reasoning questions	207 (10.91%)
Evidence source: text / table / image / equation	70.12% / 21.99% / 7.04% / 0.84%
Average / max evidences per question	2.57 / 10

Table 3. CiteVQA dataset statistics. This table is copied from the source statistics table and reformatted for the digest.

Figure 3 is the visual counterpart to Table 3. It matters because it shows the benchmark is not just a set of text-span retrieval tasks: it includes domain-specific question mixes, evidence locality variation, and cross-page evidence spans.

Figure 3. Question type and evidence distribution. — **Figure 3. Question types and evidence distribution.** The figure analyzes domain-specific question types, where evidence appears within documents, and how many pages evidence spans.

The pipeline starts from more than 100 million raw PDFs, preselects roughly 250k candidates, then uses a two-stage MLLM annotation process for coarse domain/language classification and fine-grained subcategory classification. The final benchmark keeps 711 documents across seven domains and 30 subcategories.

For multi-document tasks, each document receives a semantic profile containing high-level metadata such as document type, core thesis, and section units. Dense retrieval selects the top \(K_{\text{doc}}=5\) candidate documents for an anchor. An LLM then performs section-level matching and returns up to five association groups, each with one to three related segments. When matched pages are assembled into a synthetic multi-document workspace, the paper keeps a bijective mapping \(f_{\text{map}}\) back to the original PDF coordinates so synthesized evidence remains traceable.

Evidence extraction uses MinerU2.5 to produce document IDs, page numbers, bounding boxes, and OCR content. High-performance MLLM agents then navigate this parsed space to concatenate supporting facts into an Evidence Package. QA construction uses templates distilled from open-source datasets across academic technology, medical and health, business finance, industrial and construction, and government/legal domains.

Quality control has three layers. First, candidate QA pairs are retained only when a powerful MLLM can answer from the evidence screenshots alone. Second, a zero-document self-test using Qwen3-VL-235B-A22B discards questions answerable without document context. Third, evidence ablation masks each BBox element one at a time; a box becomes crucial evidence if masking it breaks answerability.

The paper adds two validation checks. A PhD-level expert audit of 200 sampled outputs reports human averages of 2.97 for difficulty, 4.43 for answer quality, and 4.91 for crucial-evidence quality on a 5-point scale. A separate auxiliary training validation uses 3k CiteVQA-generated samples from ViDoRe V3 PDFs and reports that the synthetic pipeline nearly reaches human-annotated training data performance in the AgenticOCR-style setup.

Experiments And Results

The evaluation covers 20 MLLMs across proprietary and open-source families. Gemini models use the native File API; long-context models use 150 DPI page screenshots; standard-context models downscale pages to at most \(1024 \times 1024\); GLM-5V-Turbo uses \(768 \times 768\). All models use a unified prompt with sampling temperature 1.0. Qwen3-VL-235B-A22B is the primary automated judge, and a 200-sample human study reports no statistically significant difference from automated judges for Rel. and Ans. scores.

Table 4 extracts the key overall rows from the paper's main results table. Rel. and Ans. are normalized to 100 by multiplying the original 0-5 scores by 20, matching the paper's table convention.

Model	Overall Rec.	Overall Rel.	Overall Ans.	Overall SAA	Digest reading
Gemini-3.1-Pro-Preview	66.0	83.6	86.1	76.0	Best overall attribution and SAA
Gemini-3-Flash-Preview	45.4	75.7	84.5	65.4	Strong SAA but weaker localization than Gemini-3.1-Pro
GPT-5.4	31.0	67.5	87.1	59.0	Best answer score, but lower attribution
Gemini-2.5-Pro	27.4	59.8	82.2	47.0	Solid answer accuracy, weaker citation grounding
Qwen3-VL-235B-A22B	11.3	35.3	72.3	22.5	Strongest open-source MLLM row in the paper
Qwen3-VL-32B	6.6	30.5	72.3	17.3	High answer score relative to low localization
Qwen3-VL-8B	1.0	14.7	61.2	7.5	Severe evidence-attribution failure

Table 4. Key overall model results. The headline pattern is the gap between answer quality and evidence attribution. GPT-5.4 reaches 87.1 answer accuracy but only 59.0 SAA; Qwen3-VL-235B-A22B reaches 72.3 answer accuracy but only 22.5 SAA.

The authors call this gap Attribution Hallucination. It is not just a fine-grained bounding-box issue: the appendix reports that page-level recall is also low for many models, which means models often fail before precise box localization even begins.

Multi-document complexity worsens the problem. For Gemini-3.1-Pro-Preview, crucial-evidence recall falls from 68.9 in single-document tasks to 55.3 in multi-document, multi-gold tasks. The paper reports similar drops in page-level recall and F1 for advanced systems in multi-document scenarios, showing that cross-document navigation remains a hard frontier.

Figure 4. Ability-specific SAA radar. — **Figure 4. Fine-grained results by question type.** The paper reports that quantitative reasoning is comparatively easier for top models, with Gemini-3.1-Pro-Preview reaching 82.6, while multimodal parsing is a bottleneck because it requires locating visual elements from descriptive cues before parsing their content.

The paper also argues that evidence attribution is not merely a post-hoc explanation. After models leave the 0-30 evidence-quality region, answer accuracy tends to rise with \(\max(\text{Rel.}, \text{Rec.})\). The search-space ablations in Table 5 support that interpretation: when the model is given ground-truth pages or the gold document, answer scores improve.

Model	Scenario	Base setting	GT page or gold-document setting	Gain
Qwen3.5-27B	Single-doc, GT pages	79.3	84.6	+5.3
Qwen3-VL-32B	Single-doc, GT pages	75.3	79.9	+4.6
Qwen3.5-9B	Single-doc, GT pages	73.2	75.2	+2.0
Qwen3-VL-8B	Single-doc, GT pages	67.0	71.1	+4.1
Qwen3.5-27B	Multi-doc, gold document	73.1	81.6	+8.5
Qwen3-VL-32B	Multi-doc, gold document	67.6	72.6	+5.0
Qwen3.5-9B	Multi-doc, gold document	58.4	68.1	+9.7
Qwen3-VL-8B	Multi-doc, gold document	53.3	66.7	+13.4

Table 5. Search-space ablation. Narrowing the search space improves answer performance, especially for weaker models in multi-document settings. This is evidence that localization is a bottleneck, though not a full causal proof that better autonomous citation always improves reasoning.

Figure 5 gives the compact visual case study used in the main paper. It is the clearest intuition for why SAA is different from answer accuracy.

Figure 5. Simple attribution case study. — **Figure 5. Simple case study.** Qwen3-VL-235B-A22B answers correctly but cites blank or incomplete crops, so SAA is 0. Gemini-3.1-Pro-Preview grounds the answer in near-correct crops and receives SAA 1.

Figure 6. Domain-specific SAA radar. — **Figure 6. Domain-specific SAA scores.** The appendix reports that academic-technology documents are easiest, with Gemini-3.1-Pro-Preview reaching 85.0 SAA, while publishing and media documents are hardest, with the highest SAA only 63.3 because of non-linear layout and image-text interleaving.

Resolution strategy for Qwen3-VL-235B-A22B	Total pixels	Rec.	Rel.	Ans.	SAA
Full resolution	\(1024^2\)	11.3	35.3	72.3	22.5
Half-pixel scaling	\(1024^2 / 2 \approx 724^2\)	4.2	23.6	66.8	11.8
Quarter-pixel scaling	\(1024^2 / 4 = 512^2\)	1.6	17.2	53.5	5.3

Table 6. Resolution sensitivity. Lowering image resolution moderately hurts answer accuracy but sharply collapses evidence attribution, so precise citation depends on visual fidelity.

The appendix includes two larger case studies. Figure 7 shows a model with the correct text answer but wrong pricing-table citation; Figure 8 shows both semantic and visual failure in a multi-step numerical question.

**Figure 7. Case study 1.** Gemini-3.1-Pro-Preview correctly cites the "Contract Optional Years" table and gets SAA 1. GPT-5.4 gives the correct answer but cites an incorrect pricing table, yielding Rec. 0 and SAA 0.

**Figure 8. Case study 2.** Gemini-2.5-Pro correctly calculates \(0.40 - 0.14 = 0.26\) eV and cites the corresponding evidence. Qwen3-VL-8B extracts wrong values, cites irrelevant regions, and receives Ans. 1 and SAA 0.

Practical Takeaways

For model evaluation, CiteVQA is most useful when the deployment needs auditable document answers, not just final-answer text. The benchmark directly checks whether cited visual evidence supports the output.
For model builders, the main gap is localization. The strongest answerer in the table is not the strongest attributed answerer, and open-source models show especially low evidence recall.
For retrieval and agentic document systems, page navigation is a first-class problem. The paper's supplementary analysis shows that models often fail to identify the right page before bounding-box precision becomes relevant.
For dataset users, the automated pipeline is strong but not free of assumptions. It depends on high-end MLLMs, MinerU parsing quality, automated judges, and ablation-based definitions of crucial evidence.
For deployment, image resolution matters. Reducing the pixel budget sharply reduces Rec. and SAA, so evidence attribution should be tested under the exact rendering and token-budget constraints of the production system.
For follow-up research, the most actionable direction is training and evaluating models that emit evidence as a structured part of reasoning, not as a cosmetic citation after generating an answer.

The paper's stated limitations are important: domain-specific definitions of authoritative evidence may vary, large-scale replication requires substantial compute, and coordinate-level evaluation is more expensive than standard VQA scoring. There is also a risk that models overfit to CiteVQA's document distributions and metrics instead of learning general evidence attribution.

Reference Coverage

Additional local references for validation and navigation: fig-ability-radar, fig-domain-radar.