arXiv | 2026 | avg 3.50 | interest 6.50 | 4 HF | verifiable rewards | factual grounding | process supervision

This paper tackles reward design for factual question answering by introducing CorVer, a lightweight corpus-grounded process reward based on Wikipedia co-occurrence statistics instead of neural verifiers. Across six instruction-tuned models and five QA benchmarks, CorVer improves all tested model-benchmark cells, beats several neural-verifier baselines in most feasible settings, and trains faster.

Source-first digest for checked paper rank 42, rank_id p029.

Motivation / Background

Knowledge-intensive QA is an awkward target for RL. Math and code can often use deterministic reward functions, but factual QA responses can mix a correct answer with unsupported claims in the same reasoning trace. Outcome-level rewards see only the final answer, while sentence-level factuality rewards usually require NLI models, LLM judges, retrieval, or knowledge-verification services for every generated sentence.

CorVer targets that gap. It turns Wikipedia subject-object co-occurrence into a process reward: extract one factual triplet from each generated sentence, query an Infini-gram Wikipedia index for the subject and object content words, convert the count into a small sentence reward, and align that sentence reward back to the generated tokens. Figure 1 frames the paper's motivation: factual QA lacks the cheap verification loop that has made RL easier for math and code.

Figure 1. Verifiable rewards beyond math and code. Original caption: math and code tasks enjoy programmatic, deterministic reward signals; prior sentence-level factuality methods rely on neural verification pipelines that become costly at RL scale; CorVer fills the gap with a corpus-indexed co-occurrence statistic that requires no neural verifier in the reward loop. I place it here because it states the paper's core motivation.

The key caveat is already visible in the motivation: CorVer is not a fact checker. A co-occurrence count can be a useful directional signal, but it cannot prove the predicate is correct. The rest of the paper argues that this weak but cheap signal is useful enough when it is calibrated, bounded, and combined with a response-level answer reward.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | A corpus-indexed co-occurrence signal can provide lightweight sentence-level process supervision for factual QA without putting neural verifiers in the reward loop. | 5 | problem setup, pipeline, reward map, cost |
| C2 | Co-occurrence count is a calibrated directional proxy for sentence factuality, especially at the zero-count and high-count extremes. | 4 | calibration figure, calibration table, case study, limitations |
| C3 | CorVer improves factual QA accuracy over the raw model across all 30 tested model-benchmark cells from 3B to 14B. | 5 | main results, cross-model scaling, full scaling summary |
| C4 | Under feasible training configurations, CorVer beats the four neural-verifier factuality-RL baselines in most cells while training faster. | 4 | main results, training-time comparison, cost table |
| C5 | The gain is not just from adding another scalar reward; per-token sentence alignment and the QuCo signal both matter. | 4 | ablation table, aggregation variants, penalty sweep |
| C6 | The method's gains track corpus coverage more than rare-entity rescue, which is both evidence for how the signal works and a limitation. | 4 | PopQA quartiles, limitations |
| C7 | On Qwen3-8B, part of the gain comes from reducing refusals rather than indiscriminate guessing, but that diagnostic is model-specific. | 3 | refusal decomposition, checkpoint curve |

Scores are support-from-paper scores, not independent reproduction scores. Claims with single-model diagnostics or asymmetric baseline configurations are capped below 5 even when the reported results are strong.

Core Technical Idea

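CorVer has three reward channels: a response-level judge reward \(R^{\mathrm{j}}\), a format reward \(R^{\mathrm{f}}\), and a sentence-level co-occurrence reward \(r_i^{\mathrm{c}}\).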

Figure 2 is the compact view: generated text is split into sentences, a 0.5B QuCo extractor pulls a subject-object relation triplet, the subject and object content words are counted in a Wikipedia Infini-gram index, and the resulting sentence reward is combined with GRPO.

Figure 2. CorVer pipeline. Original caption: each sentence is scored for Wikipedia co-occurrence via an Infini-gram index; the per-sentence reward is mapped to token-level returns through \(\sigma\) and combined with response-level signals in a policy-gradient update, instantiated with GRPO. This figure is the main evidence for the process-supervision mechanism.

For a factual sentence \(s_i\), CorVer extracts the first valid triplet, discards the relation for the main query, reduces the head and tail to content words, and counts their co-occurrences within a bounded window:

$$ c_i = \mathrm{count}_{\mathcal{W}}\left(w_1 \wedge \cdots \wedge w_k\right). $$

The count becomes a bounded auxiliary reward:

$$ r_i^{\mathrm{c}} = \begin{cases} 0.0 & \text{if no valid triplet}, \\ -0.3 & \text{if } c_i = 0, \\ -0.1 & \text{if } 0 < c_i < 5, \\ 0.0 & \text{if } 5 \leq c_i < 20, \\ +0.1 & \text{if } c_i \geq 20. \end{cases} $$

| Co-occurrence condition | Sentence reward | Interpretation |
| --- | --- | --- |
| No valid triplet | 0.0 | Do not shape the sentence. |
| \(c_i = 0\) | -0.3 | Strong negative signal for an unsupported subject-object pair. |
| \(0 < c_i < 5\) | -0.1 | Weak negative signal. |
| \(5 \leq c_i < 20\) | 0.0 | Neutral middle bucket. |
| \(c_i \geq 20\) | +0.1 | Weak positive signal for a well-covered pair. |

Table 1. Four-tier co-occurrence reward map. The concrete bins come from the appendix implementation section and the human calibration audit. Table 1 is important because it shows how deliberately small the co-occurrence term is.
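The reward map in Table 1 is small enough to write out directly. The following is a minimal sketch, not the paper's implementation; the function name `cooccurrence_reward` and the use of `None` to signal "no valid triplet" are assumptions made for illustration.

```python
from typing import Optional


def cooccurrence_reward(count: Optional[int]) -> float:
    """Map a sentence's co-occurrence count c_i to its bounded reward r_i^c.

    `count` is None when no valid triplet was extracted (hypothetical convention).
    """
    if count is None:
        return 0.0  # no valid triplet: do not shape the sentence
    if count < 0:
        raise ValueError("co-occurrence count must be non-negative")
    if count == 0:
        return -0.3  # unsupported subject-object pair
    if count < 5:
        return -0.1  # weak negative signal
    if count < 20:
        return 0.0  # neutral middle bucket
    return 0.1  # well-covered pair


for c in (None, 0, 3, 5, 19, 20):
    print(c, cooccurrence_reward(c))
```

The asymmetry is visible in the code: the strongest penalty is three times the size of the only positive value, so the term mostly discourages unsupported pairs rather than rewarding popular ones.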

The per-token return adds the sentence reward only to tokens that align to that sentence:

$$ R^{\mathrm{r}}(x,y) = \lambda_{\mathrm{j}} R^{\mathrm{j}}(x,y) + \lambda_{\mathrm{f}} R^{\mathrm{f}}(y), $$
$$ R_t(x,y) = R^{\mathrm{r}}(x,y) + \mathbf{1}[\sigma(t)>0] \lambda_{\mathrm{c}} r_{\sigma(t)}^{\mathrm{c}}. $$

That is the main design decision. A response can get a positive final-answer reward while unsupported sentences receive a local negative shaping term, and two sentences in one completion can receive different local advantages.
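To make the two return equations concrete, here is a hedged sketch of how per-token returns could be assembled. The function names and the list-based encoding of \(\sigma\) (a 1-based sentence index per token, with 0 for unaligned tokens) are illustrative assumptions; the paper's actual tensor layout is not described in this digest.

```python
from typing import List


def response_reward(judge: float, fmt: float,
                    lambda_j: float = 1.0, lambda_f: float = 1.0) -> float:
    """R^r(x, y) = lambda_j * R^j(x, y) + lambda_f * R^f(y)."""
    return lambda_j * judge + lambda_f * fmt


def per_token_returns(r_response: float,
                      sentence_rewards: List[float],
                      sigma: List[int],
                      lambda_c: float = 1.0) -> List[float]:
    """R_t = R^r + 1[sigma(t) > 0] * lambda_c * r^c_{sigma(t)}.

    sigma[t] is the 1-based index of the sentence that token t aligns to,
    or 0 if the token aligns to no scored sentence (assumed encoding).
    """
    returns = []
    for s in sigma:
        local = lambda_c * sentence_rewards[s - 1] if s > 0 else 0.0
        returns.append(r_response + local)
    return returns


# A GOOD answer (+2.0) with valid format (+1.0), two sentences:
# the first well covered (+0.1), the second with zero co-occurrence (-0.3).
r_resp = response_reward(judge=2.0, fmt=1.0)
returns = per_token_returns(r_resp, [0.1, -0.3], sigma=[0, 1, 1, 1, 2, 2])
print([round(r, 2) for r in returns])  # [3.0, 3.1, 3.1, 3.1, 2.7, 2.7]
```

Every token shares the same response-level term, so only the local shaping term separates the tokens of the supported sentence from those of the unsupported one.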

Method Details

The paper uses a local English Wikipedia 20231101 Infini-gram index with about 6.4M articles and 5.5B tokens. Counts are bounded to a 1,000-token inter-clause window, so \(c_i\) measures passage-like position-level co-occurrence, not document-level co-occurrence. The extractor is QuCo-extractor-0.5B, a Qwen2.5-0.5B-Instruct triplet model. For each sentence, the implementation keeps only the first valid head-relation-tail triplet whose head and tail are non-empty and non-pronominal.
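The "first valid triplet" rule can be sketched as a simple filter over the extractor's output. This is an illustration only: the pronoun list and the helper names `is_valid` and `first_valid_triplet` are assumptions, and the extractor itself (a 0.5B model in the paper) is replaced here by a hand-written list of candidate triplets.

```python
from typing import List, Optional, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)

# Illustrative pronoun list; the paper's exact filter is not given in this digest.
PRONOUNS = {
    "he", "she", "it", "they", "him", "her", "them",
    "his", "its", "their", "this", "that", "these", "those",
    "who", "which",
}


def is_valid(triplet: Triplet) -> bool:
    """A triplet is valid if head and tail are non-empty and non-pronominal."""
    head, _relation, tail = triplet
    for span in (head, tail):
        text = span.strip().lower()
        if not text or text in PRONOUNS:
            return False
    return True


def first_valid_triplet(candidates: List[Triplet]) -> Optional[Triplet]:
    """Return the first valid triplet in extraction order, or None."""
    for triplet in candidates:
        if is_valid(triplet):
            return triplet
    return None


candidates = [
    ("She", "was born in", "Newport News"),      # pronominal head: skipped
    ("Ella Fitzgerald", "was born in", ""),      # empty tail: skipped
    ("Ella Fitzgerald", "was born in", "Newport News"),
]
print(first_valid_triplet(candidates))
```

Returning `None` when nothing survives the filter corresponds to the "no valid triplet" case, which receives a reward of 0.0.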

The reward scale is intentionally conservative. The response judge maps GOOD, BAD, and NA to \(+2.0, -1.0, -1.0\), the format reward is \(\pm 1.0\), and \(\lambda_{\mathrm{f}}=\lambda_{\mathrm{j}}=\lambda_{\mathrm{c}}=1.0\). The source text notes that the maximum co-occurrence contribution for a typical completion is about \(0.3\), an order of magnitude smaller than the judge reward swing of \(3.0\) (from \(-1.0\) to \(+2.0\)). This is why CorVer is best read as a local shaping signal, not a replacement for a correctness label.

Training uses LoRA with rank 128 and alpha 256, \(G=16\) generations per prompt, prompt batch 24, max completion length 1024, and 100 GRPO steps for the canonical runs. The method trains directly from the raw instruction-tuned model rather than from an SFT cold start. The paper reports that the small 3B/4B models needed mastered anchor questions mixed into the self-filtered learning-zone pool, while models of 8B and larger did not.
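For reference, the canonical hyperparameters above can be collected in one place. This is a sketch of a configuration record, not the paper's config file: the class name and field names are hypothetical, and only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorVerRunConfig:
    """Canonical run settings as reported in the digest (field names are hypothetical)."""
    lora_rank: int = 128
    lora_alpha: int = 256
    generations_per_prompt: int = 16   # G
    prompt_batch_size: int = 24        # prompts per GRPO step
    max_completion_length: int = 1024  # tokens
    grpo_steps: int = 100

    @property
    def completions_per_step(self) -> int:
        """Number of sampled completions scored in one GRPO step."""
        return self.prompt_batch_size * self.generations_per_prompt


config = CorVerRunConfig()
print(config.completions_per_step)  # 384
```

A frozen dataclass keeps the canonical settings immutable, so a sweep such as the zero-count penalty study would construct new records rather than mutate this one.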

The evaluation setup is summarized in Table 2. Training prompts come only from NQ-Open train and WebQuestions, so TriviaQA, PopQA, SimpleQA, and TruthfulQA are out-of-distribution for the RL prompt source.

| Item | Paper setting |
| --- | --- |
| Benchmarks | TriviaQA, NQ-Open, PopQA, SimpleQA, TruthfulQA |
| Evaluation sizes | 17,944; 3,610; 14,267; 4,326; 817 questions |
| Main models | Llama-3.2-3B, Qwen3-4B, Llama-3.1-8B, Qwen3-8B |
| Scaling models | Adds OLMo-2-13B and Qwen3-14B |
| Baselines | Raw, FoRAG, RLFH, FSPO, KnowRL |
| Metric | Factual QA accuracy with substring plus alias matching and lenient regex parsing |
| Diagnostics | NA/refusal rate, format success, answer length |

Table 2. Experimental setup. Table 2 is included to make the scope of the reported claims explicit.
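The metric row describes substring-plus-alias matching. The sketch below shows one plausible reading of that idea, assuming a simple lowercase-and-strip-punctuation normalization; the paper's exact normalization and regex parsing are not given in this digest, so `normalize` and `lenient_match` are illustrative names only.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, replace non-alphanumeric ASCII with spaces, collapse whitespace.

    This assumed normalization drops non-ASCII characters; a real grader
    would likely handle them more carefully.
    """
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def lenient_match(prediction: str, aliases: Iterable[str]) -> bool:
    """True if any normalized gold alias occurs as a substring of the prediction."""
    pred = normalize(prediction)
    for alias in aliases:
        norm_alias = normalize(alias)
        if norm_alias and norm_alias in pred:
            return True
    return False


print(lenient_match("The answer is Barack Obama.", ["Barack Obama", "Obama"]))  # True
print(lenient_match("I am not sure.", ["Paris"]))                               # False
print(lenient_match("It is definitely not Paris.", ["Paris"]))                  # True
```

The third call illustrates why a substring grader is lenient: any mention of a gold alias counts, even inside a negation, which is one reason to treat absolute accuracies from such a grader with care.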

Calibration And Reward Robustness

The main validation question is whether \(c_i\) is directionally meaningful. Figure 5 and Table 3 report the human audit: correctness rises monotonically from 24.0% for \(c_i=0\) to 81.0% for \(c_i \geq 20\).

Figure 5. Reward signal calibration. Original caption: empirical \(P(\mathrm{correct} \mid c_i)\) across five co-occurrence buckets, \(N=700\) manually annotated sentences, with Wilson 95% confidence intervals. I place it here because it is the strongest evidence that the reward bins are not arbitrary.

| Bucket | n | Correct | P(correct) | 95% CI |
| --- | --- | --- | --- | --- |
| \(c_i = 0\) | 200 | 48 | 24.0% | [18.6, 30.4] |
| \(1 \leq c_i \leq 4\) | 100 | 53 | 53.0% | [43.3, 62.5] |
| \(5 \leq c_i \leq 9\) | 100 | 70 | 70.0% | [60.4, 78.1] |
| \(10 \leq c_i \leq 19\) | 100 | 73 | 73.0% | [63.6, 80.7] |
| \(c_i \geq 20\) | 200 | 162 | 81.0% | [75.0, 85.8] |

Table 3. Five-bucket calibration audit. The two thresholds used by the reward map sit at visible precision transitions: \(c_i=5\) and \(c_i=20\).
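The confidence intervals in Table 3 can be checked with the standard Wilson score formula. The sketch below assumes \(z = 1.96\) for the 95% level; the function name and the bucket labels are illustrative, while the counts are copied from the table.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (n, correct) per bucket, copied from Table 3.
BUCKETS = {
    "c = 0": (200, 48),
    "1 <= c <= 4": (100, 53),
    "5 <= c <= 9": (100, 70),
    "10 <= c <= 19": (100, 73),
    "c >= 20": (200, 162),
}

for label, (n, correct) in BUCKETS.items():
    lo, hi = wilson_interval(correct, n)
    print(f"{label:>14}: {100 * correct / n:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Under this assumption the computed intervals agree with the table, and the intervals for the \(c_i = 0\) and \(c_i \geq 20\) buckets do not overlap, which is the separation the reward map relies on.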

Figure 6 checks whether the zero-count penalty is fragile. The full recipe is retrained with zero-count penalties from -0.1 to -1.0 on Llama-3.2-3B-Instruct; both weaker and stronger penalties underperform the canonical -0.3 setting.

Figure 6. Zero-count penalty sensitivity. Original caption: sensitivity of CorVer accuracy on Llama-3.2-3B-Instruct to the zero-count penalty \(r_i^{\mathrm{c}}(c_i=0)\), evaluated on TriviaQA validation with \(N=17,944\). This supports the chosen -0.3 penalty as an empirical sweet spot in the tested setting.

The qualitative case study in Table 9 shows the same idea at the sentence level: zero co-occurrence can catch plausible but wrong entities that a GPT-4o-mini offline judge labels correct.

Experiments And Results

Main Accuracy Results

Table 4 compresses the main table into model-level averages and the two exceptions noted in the source. CorVer improves every Raw cell in the four-model main comparison; compared with the four prior factuality-RL baselines, it wins 18 of 20 model-benchmark cells under the paper's feasible baseline configurations.

| Model | Raw avg | Best non-CorVer avg | CorVer avg | CorVer vs Raw avg | Notes |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B | 22.52 | 25.55 | 27.89 | +5.37 | CorVer best on all five benchmarks. |
| Llama-3.1-8B | 30.64 | 31.21 | 35.27 | +4.63 | Largest average margin over the best non-CorVer baseline (+4.06); second-largest gain over Raw. |
| Qwen3-4B | 20.85 | 22.12 | 22.69 | +1.84 | CorVer narrowly ahead of FSPO average. |
| Qwen3-8B | 24.37 | 24.65 | 26.11 | +1.74 | FSPO beats CorVer on PopQA by 0.26 pp; RLFH beats CorVer on SimpleQA by 0.58 pp. |

Table 4. Main result summary. The average deltas in Table 4 are computed from the source table. The paper's own wording emphasizes that CorVer improves every Raw cell and beats prior baselines in 18 of 20 cells, with the two baseline wins both on Qwen3-8B and within a point.

Scaling Across Six Models

Figure 3 and Table 5 address generality. The scaling heatmap reports positive CorVer-minus-Raw gains in all 30 cells across six instruction-tuned models and five benchmarks.

Figure 3. Cross-model scaling of CorVer gains over Raw. Original caption: gains across six models, 3B to 14B, and five benchmarks; every cell is positive. This is the strongest evidence for C3.

| Scale | Model | TriviaQA (Raw -> CorVer) | NQ-Open (Raw -> CorVer) | PopQA (Raw -> CorVer) | SimpleQA (Raw -> CorVer) | TruthfulQA (Raw -> CorVer) |
| --- | --- | --- | --- | --- | --- | --- |
| 3B | Llama-3.2-3B-Instruct | 55.39 -> 62.24 | 34.13 -> 43.41 | 15.92 -> 23.75 | 1.55 -> 2.57 | 5.63 -> 7.47 |
| 4B | Qwen3-4B | 51.14 -> 53.77 | 24.65 -> 26.59 | 17.51 -> 19.33 | 2.52 -> 3.12 | 8.45 -> 10.65 |
| 8B | Llama-3.1-8B-Instruct | 71.86 -> 76.52 | 40.66 -> 48.34 | 28.85 -> 35.30 | 5.20 -> 5.92 | 6.61 -> 10.28 |
| 8B | Qwen3-8B | 62.84 -> 63.99 | 29.61 -> 32.80 | 20.34 -> 21.83 | 2.57 -> 2.73 | 6.49 -> 9.18 |
| 13B | OLMo-2-13B-Instruct | 67.48 -> 73.19 | 32.80 -> 37.87 | 25.56 -> 31.04 | 2.17 -> 3.44 | 6.00 -> 9.30 |
| 14B | Qwen3-14B | 67.51 -> 71.28 | 31.39 -> 37.76 | 19.44 -> 25.37 | 1.85 -> 3.91 | 7.71 -> 9.79 |

Table 5. Full scaling summary. Table 5 preserves the Raw and CorVer values behind the 30 positive cells.
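The "all 30 cells positive" statement and the per-model averages can be recomputed from the Table 5 values. The following sketch copies the numbers from the table; the variable and function names are illustrative.

```python
from statistics import mean

BENCHMARKS = ("TriviaQA", "NQ-Open", "PopQA", "SimpleQA", "TruthfulQA")

# (Raw, CorVer) accuracy pairs copied from Table 5, in BENCHMARKS order.
TABLE5 = {
    "Llama-3.2-3B-Instruct": [(55.39, 62.24), (34.13, 43.41), (15.92, 23.75), (1.55, 2.57), (5.63, 7.47)],
    "Qwen3-4B": [(51.14, 53.77), (24.65, 26.59), (17.51, 19.33), (2.52, 3.12), (8.45, 10.65)],
    "Llama-3.1-8B-Instruct": [(71.86, 76.52), (40.66, 48.34), (28.85, 35.30), (5.20, 5.92), (6.61, 10.28)],
    "Qwen3-8B": [(62.84, 63.99), (29.61, 32.80), (20.34, 21.83), (2.57, 2.73), (6.49, 9.18)],
    "OLMo-2-13B-Instruct": [(67.48, 73.19), (32.80, 37.87), (25.56, 31.04), (2.17, 3.44), (6.00, 9.30)],
    "Qwen3-14B": [(67.51, 71.28), (31.39, 37.76), (19.44, 25.37), (1.85, 3.91), (7.71, 9.79)],
}


def averages(cells):
    """Return (Raw average, CorVer average) over the five benchmarks, rounded to 2 dp."""
    raw_avg = mean(raw for raw, _ in cells)
    corver_avg = mean(corver for _, corver in cells)
    return round(raw_avg, 2), round(corver_avg, 2)


positive_cells = sum(corver > raw for cells in TABLE5.values() for raw, corver in cells)
print(positive_cells)  # 30

for model, cells in TABLE5.items():
    print(model, averages(cells))
```

For the four main models, the rounded averages produced here match the Raw and CorVer averages reported in the main result summary.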

Cost And Feasibility

At the canonical rollout density, the reward loop issues about:

$$ N_{\mathrm{steps}} B G \bar{m}_{\mathrm{sent}} \approx 100 \cdot 24 \cdot 16 \cdot 3 \approx 1.2 \times 10^5 $$

sentence-level reward calls per training run. Figure 4 and Table 6 show why the paper cares so much about per-call cost.
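The estimate is simple arithmetic, but it helps to see how quickly per-call latency turns into wall-clock time at this volume. In the sketch below the call count follows the formula above, while the latencies are hypothetical round numbers chosen for illustration and the calls are assumed to run sequentially; none of the latencies are measurements from the paper.

```python
def reward_calls(steps: int = 100, prompt_batch: int = 24,
                 generations: int = 16, sentences_per_completion: int = 3) -> int:
    """N_steps * B * G * mean sentences per completion."""
    return steps * prompt_batch * generations * sentences_per_completion


def added_hours(calls: int, seconds_per_call: float) -> float:
    """Wall-clock hours spent in the reward loop if calls run one after another."""
    return calls * seconds_per_call / 3600.0


calls = reward_calls()
print(calls)  # 115200

# Hypothetical per-call latencies, for illustration only.
for seconds in (0.01, 0.1, 0.5):
    print(f"{seconds:>5} s/call -> {added_hours(calls, seconds):.2f} h")
```

Batching and parallel verifier replicas would reduce these figures, but the linear dependence on per-call cost is the point: at roughly \(10^5\) calls, fractions of a second per call add up to hours.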

Figure 4. Reward computation cost. Original caption: average training time per method across four base models, with bars annotating each baseline's slowdown relative to CorVer. I place it here because it is the main evidence for the "lightweight" claim.

| Method | Llama-3.2-3B | Qwen3-4B | Llama-3.1-8B | Qwen3-8B | Avg slowdown |
| --- | --- | --- | --- | --- | --- |
| CorVer | 2.0 h | 4.3 h | 2.5 h | 4.1 h | 1x |
| FoRAG | 21.0 h | 26.9 h | 11.8 h | 20.1 h | 6.6x |
| RLFH | 15.5 h | 23.1 h | 9.4 h | 10.1 h | 4.8x |
| FSPO | 11.7 h | 27.6 h | 12.8 h | 65.8 h | 8.4x |
| KnowRL | 21.1 h | 21.9 h | 17.1 h | 36.4 h | 7.8x |

Table 6. Training wall-clock comparison. Table 6 is not an equal-budget comparison; it reflects the paper's argument that neural-verifier baselines cannot be run at the same rollout configuration without prohibitive cost.
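The digest does not state how the "Avg slowdown" column is averaged. A quick check, sketched below with the hours copied from Table 6, suggests it is the mean of the per-model slowdown ratios rather than the ratio of mean hours; the function names are illustrative.

```python
# Training hours per base model, in the column order of Table 6.
HOURS = {
    "CorVer": [2.0, 4.3, 2.5, 4.1],
    "FoRAG": [21.0, 26.9, 11.8, 20.1],
    "RLFH": [15.5, 23.1, 9.4, 10.1],
    "FSPO": [11.7, 27.6, 12.8, 65.8],
    "KnowRL": [21.1, 21.9, 17.1, 36.4],
}


def mean_of_ratios(method: str, reference: str = "CorVer") -> float:
    """Average the per-model slowdowns (baseline hours / reference hours)."""
    ratios = [h / ref for h, ref in zip(HOURS[method], HOURS[reference])]
    return sum(ratios) / len(ratios)


def ratio_of_means(method: str, reference: str = "CorVer") -> float:
    """Slowdown of total (equivalently mean) hours across the four models."""
    return sum(HOURS[method]) / sum(HOURS[reference])


for method in ("FoRAG", "RLFH", "FSPO", "KnowRL"):
    print(f"{method}: mean of ratios {mean_of_ratios(method):.1f}x, "
          f"ratio of means {ratio_of_means(method):.1f}x")
```

The mean-of-ratios values round to 6.6x, 4.8x, 8.4x, and 7.8x, matching the table, whereas the ratio of means for FoRAG is about 6.2x. The choice matters for reading the column: a single slow cell such as FSPO on Qwen3-8B pulls the mean of ratios up noticeably.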

What Drives The Gain

Table 7 holds the base model fixed at Llama-3.1-8B-Instruct and removes components. The full configuration beats every ablation on every benchmark. The most important comparison is A3 versus full: A3 keeps the same QuCo scalar magnitude but removes per-token sentence alignment, and it recovers only part of the gain.

| Variant | TriviaQA | NQ-Open | PopQA | SimpleQA | TruthfulQA |
| --- | --- | --- | --- | --- | --- |
| Raw | 71.86 | 40.66 | 28.85 | 5.20 | 6.61 |
| CorVer full | 76.52 | 48.34 | 35.30 | 5.92 | 10.28 |
| A1: no QuCo | 71.3 | 45.9 | 34.6 | 5.4 | 6.8 |
| A2: no judge | 76.1 | 42.6 | 31.7 | 4.8 | 7.1 |
| A3: no per-token alignment | 72.9 | 46.3 | 34.9 | 5.8 | 6.6 |
| A4: no self-filter | 75.0 | 46.0 | 33.5 | 5.1 | 8.0 |

Table 7. Component ablation. Table 7 supports the claim that CorVer needs both the response-level judge and the token-aligned sentence reward, especially outside TriviaQA.

Table 8 compares alternative triplet aggregation rules. The canonical "first valid triplet" rule is not the most exhaustive factual checker, but it is the best training signal in the reported experiment: Min catches more within-sentence issues but collapses length and accuracy, while RelCheck adds relation tokens and cost but underperforms.

| Variant | Correct (%) | NA (%) | Mean length | Time |
| --- | --- | --- | --- | --- |
| First, canonical | 62.24 | 5.04 | 150 | 1.0x |
| Min | 57.27 | 0.33 | 40 | 1.2x |
| RelCheck | 61.37 | 6.58 | 150 | 1.7x |

Table 8. Triplet aggregation variants. Table 8 explains a practical choice: a weaker, first-triplet proxy can train better than a harsher multi-triplet rule because the policy can exploit the latter by shortening outputs.
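The difference between the First and Min rules can be illustrated on a single sentence with several extracted pairs. This sketch assumes that Min takes the minimum reward over all valid triplets in the sentence, which is a reading of the variant's name rather than a definition given in this digest; RelCheck is omitted because its mechanics are not described here. All function names are illustrative.

```python
from typing import List


def reward_from_count(count: int) -> float:
    """Co-occurrence reward bins from the reward map (valid triplets only)."""
    if count == 0:
        return -0.3
    if count < 5:
        return -0.1
    if count < 20:
        return 0.0
    return 0.1


def aggregate_first(counts: List[int]) -> float:
    """Canonical rule: score only the first valid triplet of the sentence."""
    return reward_from_count(counts[0]) if counts else 0.0


def aggregate_min(counts: List[int]) -> float:
    """Assumed Min rule: the sentence gets its worst triplet's reward."""
    return min(reward_from_count(c) for c in counts) if counts else 0.0


# Co-occurrence counts for the valid triplets of one sentence, in extraction order.
sentence_counts = [35, 0]
print(aggregate_first(sentence_counts))  # 0.1
print(aggregate_min(sentence_counts))    # -0.3
```

Under this reading, every extra triplet in a sentence is another chance to trigger the penalty under Min, so writing less is the safest policy, which is consistent with the collapse in mean length that Table 8 reports for that variant.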

The paper's qualitative Figure 7 has no external graphics asset, so I reconstruct its content as Table 9. Each row is a single sentence where the co-occurrence reward assigns \(r_i^{\mathrm{c}}=-0.3\) and the offline GPT-4o-mini judge incorrectly says the sentence is correct.

| Case | Prompt / gold answer | Generated error | Extracted pair | Co-occurrence result | Offline LLM judge |
| --- | --- | --- | --- | --- | --- |
| A | Ronkonkoma, NY zip code / 11779 | "11777" | Ronkonkoma NY - 11777 | \(c_i=0\), reward -0.30 | Correct |
| B | Kevin James's wife in Grown Ups / Maria Bello | Jennifer Coolidge appears in Grown Ups | Jennifer Coolidge - Grown Ups | \(c_i=0\), reward -0.30 | Correct |
| C | Ella Fitzgerald's parent / William Fitzgerald | Temperance Mary Tempie Height Fitzgerald | Ella Fitzgerald - Height | \(c_i=0\), reward -0.30 | Correct |

Table 9. Reconstructed case-study figure. The source caption says these are three single-sentence examples from the zero-frequency bucket; all were independently verified as factually incorrect by a human annotator.

Coverage, Refusals, And Checkpoints

Table 10 is the paper's most important limitation analysis. If CorVer mainly rescued rare entities, gains should be largest in Q1. Instead, gains are larger for better-covered PopQA entities, especially Q3 and Q4.

| Model | PopQA quartile | Raw | CorVer | Delta |
| --- | --- | --- | --- | --- |
| Llama-3.1-8B | Q1 rare | 18.94 | 24.41 | +5.47 |
| Llama-3.1-8B | Q2 | 16.51 | 21.64 | +5.13 |
| Llama-3.1-8B | Q3 | 24.50 | 32.89 | +8.39 |
| Llama-3.1-8B | Q4 popular | 57.06 | 64.55 | +7.50 |
| OLMo-2-13B | Q1 rare | 15.63 | 19.31 | +3.68 |
| OLMo-2-13B | Q2 | 13.27 | 17.61 | +4.33 |
| OLMo-2-13B | Q3 | 23.74 | 29.25 | +5.51 |
| OLMo-2-13B | Q4 popular | 50.60 | 59.63 | +9.03 |

Table 10. PopQA popularity-quartile gains. Table 10 shows that the reward works best where the corpus has enough coverage to distinguish correct from incorrect co-occurrences.

For Qwen3-8B, Table 11 shows a large reduction in refusal rates. The paper checks the subset of questions that Raw refused but CorVer attempted and reports nontrivial correctness on that subset, which supports the interpretation that the model is recovering some factual recall rather than just guessing everywhere.

| Dataset | Raw NA% | CorVer NA% | Correct on Raw-refused subset |
| --- | --- | --- | --- |
| TriviaQA | 8.32 | 4.37 | 24.9 |
| NQ-Open | 13.27 | 4.13 | 17.5 |
| PopQA | 18.30 | 2.58 | 7.0 |
| SimpleQA | 27.99 | 2.96 | 1.3 |
| TruthfulQA | 16.65 | 3.30 | 8.1 |

Table 11. Qwen3-8B refusal decomposition. This diagnostic is useful but narrow: the source text says the Llama family shows the opposite NA-rate shift, so this is not a universal mechanism.

Figure 8 supports the uniform step-100 checkpoint choice. On a longer Llama-3.2-3B CorVer run, the largest jump comes in the first 100 steps; step 200 is higher but costs twice as much, and later checkpoints drift slightly down.

Figure 8. Checkpoint selection. Original caption: TriviaQA accuracy of Llama-3.2-3B-Instruct CorVer as a function of GRPO step, evaluated every 50 steps to 300; step 100 captures the largest single jump over Raw, while step 200 is the empirical peak. This is why the paper uses step 100 uniformly.

Practical Takeaways

Limitations And Residual Risk

The reward is a corpus-grounded proxy, not a claim-level verifier. It ignores predicate semantics in the main query, keeps only one triplet per sentence, and depends on English Wikipedia coverage. Separately, the evaluation uses a lenient substring-plus-alias grader. Every method is scored with the same grader, so within-paper comparisons remain meaningful, but absolute accuracies should be read cautiously.

I did not use raw PDF extraction. I did not consult latex_flattened/main.flattened.tex because the numeric tables needed for this digest were recoverable from paper.md; the HTML-rowspan tables were awkward, but not malformed enough to require fallback recovery.