Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Motivation / Background

Knowledge-intensive QA is an awkward target for RL. Math and code can often use deterministic reward functions, but factual QA responses can mix a correct answer with unsupported claims in the same reasoning trace. Outcome-level rewards see only the final answer, while sentence-level factuality rewards usually require NLI models, LLM judges, retrieval, or knowledge-verification services for every generated sentence.

CorVer targets that gap. It turns Wikipedia subject-object co-occurrence into a process reward: extract one factual triplet from each generated sentence, query an Infini-gram Wikipedia index for the subject and object content words, convert the count into a small sentence reward, and align that sentence reward back to the generated tokens. Figure 1 frames the paper's motivation: factual QA lacks the cheap verification loop that has made RL easier for math and code.

**Figure 1. Verifiable rewards beyond math and code.** Original caption: math and code tasks enjoy programmatic, deterministic reward signals; prior sentence-level factuality methods rely on neural verification pipelines that become costly at RL scale; CorVer fills the gap with a corpus-indexed co-occurrence statistic that requires no neural verifier in the reward loop. I place it here because it states the paper's core motivation.

The key caveat is already visible in the motivation: CorVer is not a fact checker. A co-occurrence count can be a useful directional signal, but it cannot prove the predicate is correct. The rest of the paper argues that this weak but cheap signal is useful enough when it is calibrated, bounded, and combined with a response-level answer reward.

Claims And Evidence

Claim id	Main claim	Support	Evidence anchors
C1	A corpus-indexed co-occurrence signal can provide lightweight sentence-level process supervision for factual QA without putting neural verifiers in the reward loop.	5	problem setup, pipeline, reward map, cost
C2	Co-occurrence count is a calibrated directional proxy for sentence factuality, especially at the zero-count and high-count extremes.	4	calibration figure, calibration table, case study, limitations
C3	CorVer improves factual QA accuracy over the raw model across all 30 tested model-benchmark cells from 3B to 14B.	5	main results, cross-model scaling, full scaling summary
C4	Under feasible training configurations, CorVer beats the four neural-verifier factuality-RL baselines in most cells while training faster.	4	main results, training-time comparison, cost table
C5	The gain is not just from adding another scalar reward; per-token sentence alignment and the QuCo signal both matter.	4	ablation table, aggregation variants, penalty sweep
C6	The method's gains track corpus coverage more than rare-entity rescue, which is both evidence for how the signal works and a limitation.	4	PopQA quartiles, limitations
C7	On Qwen3-8B, part of the gain comes from reducing refusals rather than indiscriminate guessing, but that diagnostic is model-specific.	3	refusal decomposition, checkpoint curve

Scores are support-from-paper scores, not independent reproduction scores. Claims with single-model diagnostics or asymmetric baseline configurations are capped below 5 even when the reported results are strong.

Core Technical Idea

CorVer has three reward channels:

A response-level string-match judge reward for whether the answer is correct.
A response-level format reward for the required think / answer template.
A sentence-level co-occurrence reward, aligned back to the tokens in that sentence.

Figure 2 is the compact view: generated text is split into sentences, a 0.5B QuCo extractor pulls a subject-relation-object triplet from each sentence, the subject and object content words are counted in a Wikipedia Infini-gram index, and the resulting sentence reward is combined with the response-level rewards in a GRPO update.

**Figure 2. CorVer pipeline.** Original caption: each sentence is scored for Wikipedia co-occurrence via an Infini-gram index; the per-sentence reward is mapped to token-level returns through \(\sigma\) and combined with response-level signals in a policy-gradient update, instantiated with GRPO. This figure is the main evidence for the process-supervision mechanism.

For a factual sentence \(s_i\), CorVer extracts the first valid triplet, discards the relation for the main query, reduces the head and tail to content words \(w_1, \ldots, w_k\), and counts their bounded-window co-occurrences in the Wikipedia index \(\mathcal{W}\):

c_i = \mathrm{count}_{\mathcal{W}}\left(w_1 \wedge \cdots \wedge w_k\right).

The count becomes a bounded auxiliary reward:

r_i^{\mathrm{c}} = \begin{cases} 0.0 & \text{if no valid triplet}, \\ -0.3 & \text{if } c_i = 0, \\ -0.1 & \text{if } 0 < c_i < 5, \\ 0.0 & \text{if } 5 \leq c_i < 20, \\ +0.1 & \text{if } c_i \geq 20. \end{cases}

Co-occurrence condition	Sentence reward	Interpretation
No valid triplet	0.0	Do not shape the sentence.
\(c_i = 0\)	-0.3	Strong negative signal for an unsupported subject-object pair.
\(0 < c_i < 5\)	-0.1	Weak negative signal.
\(5 \leq c_i < 20\)	0.0	Neutral middle bucket.
\(c_i \geq 20\)	+0.1	Weak positive signal for a well-covered pair.

Table 1. Four-tier co-occurrence reward map. The concrete bins come from the appendix implementation section and the human calibration audit. Table 1 is important because it shows how deliberately small the co-occurrence term is.
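
The reward map in Table 1 can be sketched as a small lookup function. This is a minimal illustration rather than the paper's code: the function name `cooccurrence_reward` and the convention of passing `None` when no valid triplet was extracted are assumptions made here.

```python
from typing import Optional


def cooccurrence_reward(count: Optional[int]) -> float:
    """Map a co-occurrence count c_i to the bounded sentence reward of Table 1.

    `count` is None when no valid triplet was extracted (assumed convention).
    """
    if count is None:
        return 0.0   # no valid triplet: do not shape the sentence
    if count < 0:
        raise ValueError("co-occurrence count must be non-negative")
    if count == 0:
        return -0.3  # unsupported subject-object pair
    if count < 5:
        return -0.1  # weak negative signal
    if count < 20:
        return 0.0   # neutral middle bucket
    return 0.1       # well-covered pair


for c in (None, 0, 3, 5, 19, 20):
    print(c, cooccurrence_reward(c))
```

Keeping the thresholds in one function makes it easy to see that the largest magnitude any single sentence can contribute is 0.3.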

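The response-level reward \(R^{\mathrm{r}}\) combines the judge reward \(R^{\mathrm{j}}\) and the format reward \(R^{\mathrm{f}}\). The per-token return then adds the sentence reward only to tokens that align to that sentence, where \(\sigma(t)\) is the index of the sentence containing token \(t\) and \(\sigma(t)=0\) marks a token outside any scored sentence: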

R^{\mathrm{r}}(x,y) = \lambda_{\mathrm{j}} R^{\mathrm{j}}(x,y) + \lambda_{\mathrm{f}} R^{\mathrm{f}}(y),

R_t(x,y) = R^{\mathrm{r}}(x,y) + \mathbf{1}[\sigma(t)>0] \lambda_{\mathrm{c}} r_{\sigma(t)}^{\mathrm{c}}.

That is the main design decision. A response can get a positive final-answer reward while unsupported sentences receive a local negative shaping term, and two sentences in one completion can receive different local advantages.
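
The two return equations above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, `sentence_ids[t]` plays the role of \(\sigma(t)\) with 0 meaning "not aligned to any scored sentence", and the sentence rewards are supplied directly rather than computed from an index.

```python
from typing import List, Sequence


def response_reward(judge: float, fmt: float,
                    lambda_j: float = 1.0, lambda_f: float = 1.0) -> float:
    """R^r = lambda_j * R^j + lambda_f * R^f, shared by every token."""
    return lambda_j * judge + lambda_f * fmt


def per_token_returns(r_response: float,
                      sentence_ids: Sequence[int],
                      sentence_rewards: Sequence[float],
                      lambda_c: float = 1.0) -> List[float]:
    """R_t = R^r + 1[sigma(t) > 0] * lambda_c * r^c_{sigma(t)}.

    sentence_rewards[i - 1] holds r^c_i for the 1-indexed sentence i.
    """
    returns = []
    for sid in sentence_ids:
        shaping = lambda_c * sentence_rewards[sid - 1] if sid > 0 else 0.0
        returns.append(r_response + shaping)
    return returns


# Correct answer (+2.0), valid format (+1.0); sentence 1 is well covered,
# sentence 2 has zero co-occurrence.
r = response_reward(judge=2.0, fmt=1.0)
returns = per_token_returns(r, [0, 1, 1, 2, 2, 2, 0], [0.1, -0.3])
print([round(x, 2) for x in returns])
```

In this toy completion the tokens of the second sentence receive a lower return than those of the first even though both share the same response-level reward, which is the local credit assignment the paragraph describes.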

Method Details

The paper uses a local English Wikipedia 20231101 Infini-gram index with about 6.4M articles and 5.5B tokens. Counts are bounded to a 1,000-token inter-clause window, so \(c_i\) measures passage-like, position-level co-occurrence rather than document-level co-occurrence. The extractor is QuCo-extractor-0.5B, a Qwen2.5-0.5B-Instruct triplet model. For each sentence, the implementation keeps only the first valid head-relation-tail triplet whose head and tail are non-empty and non-pronominal.
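
The "first valid triplet" rule can be sketched as a filter over the extractor's candidates. Everything named here is hypothetical: the `Triplet` type, the pronoun list, and the stopword list are illustrative stand-ins, since the digest does not give the actual filters, and the extractor and index query are left out entirely.

```python
from typing import Iterable, List, NamedTuple, Optional


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


# Illustrative lists only (assumptions, not the paper's actual filters).
PRONOUNS = frozenset({"i", "you", "he", "she", "it", "we", "they",
                      "him", "her", "them", "this", "that"})
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "at", "to", "and"})


def is_valid(t: Triplet) -> bool:
    """Head and tail must be non-empty and non-pronominal."""
    head, tail = t.head.strip(), t.tail.strip()
    if not head or not tail:
        return False
    return head.lower() not in PRONOUNS and tail.lower() not in PRONOUNS


def first_valid_triplet(triplets: Iterable[Triplet]) -> Optional[Triplet]:
    """Return the first candidate that passes the validity check, else None."""
    return next((t for t in triplets if is_valid(t)), None)


def query_words(t: Triplet) -> List[str]:
    """Drop the relation and keep head and tail content words for the count query."""
    return [w for w in (t.head + " " + t.tail).split()
            if w.lower() not in STOPWORDS]


candidates = [Triplet("She", "was born in", "Warsaw"),
              Triplet("Marie Curie", "won", "the Nobel Prize")]
chosen = first_valid_triplet(candidates)
print(chosen, query_words(chosen))
```

A sentence with no valid candidate yields `None`, which corresponds to the "no valid triplet" row of the reward map.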

The reward scale is intentionally conservative. The response judge maps GOOD, BAD, and NA to \(+2.0, -1.0, -1.0\), the format reward is \(\pm 1.0\), and \(\lambda_{\mathrm{f}}=\lambda_{\mathrm{j}}=\lambda_{\mathrm{c}}=1.0\). The source text notes that the maximum co-occurrence contribution for a typical completion is about \(0.3\), an order of magnitude smaller than the judge reward swing of \(3.0\) (from \(-1.0\) to \(+2.0\)). This is why CorVer is best read as a local shaping signal, not a replacement for the correctness label.

Training uses LoRA with rank 128 and alpha 256, \(G=16\) generations per prompt, a prompt batch of 24, a maximum completion length of 1024, and 100 GRPO steps for the canonical runs. The method trains directly from the raw instruction-tuned model rather than from an SFT cold start. The paper reports that the small 3B/4B models needed mastered anchor questions mixed into the self-filtered learning-zone pool, while models of 8B and larger did not.
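
For reference, the reported hyperparameters can be collected in one configuration object. This is a sketch only: the class and field names are hypothetical and do not correspond to any particular training library, and the derived quantities assume that each GRPO step samples `generations_per_prompt` completions for every prompt in the batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorVerTrainingConfig:
    """Canonical-run settings as reported in the digest (names are hypothetical)."""
    lora_rank: int = 128
    lora_alpha: int = 256
    generations_per_prompt: int = 16   # G
    prompt_batch_size: int = 24
    max_completion_length: int = 1024
    grpo_steps: int = 100

    @property
    def completions_per_step(self) -> int:
        # Assumes every prompt in the batch gets G sampled completions.
        return self.prompt_batch_size * self.generations_per_prompt

    @property
    def total_completions(self) -> int:
        return self.completions_per_step * self.grpo_steps


cfg = CorVerTrainingConfig()
print(cfg.completions_per_step, cfg.total_completions)
```

Under that assumption a canonical run scores 384 completions per step and 38,400 completions in total.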

The evaluation setup is summarized in Table 2. Training prompts come only from NQ-Open train and WebQuestions, so TriviaQA, PopQA, SimpleQA, and TruthfulQA are out-of-distribution for the RL prompt source.

Item	Paper setting
Benchmarks	TriviaQA, NQ-Open, PopQA, SimpleQA, TruthfulQA
Evaluation sizes	17,944; 3,610; 14,267; 4,326; 817 questions
Main models	Llama-3.2-3B, Qwen3-4B, Llama-3.1-8B, Qwen3-8B
Scaling models	Adds OLMo-2-13B and Qwen3-14B
Baselines	Raw, FoRAG, RLFH, FSPO, KnowRL
Metric	Factual QA accuracy with substring plus alias matching and lenient regex parsing
Diagnostics	NA/refusal rate, format success, answer length

Table 2. Experimental setup. Table 2 is included to make the scope of the reported claims explicit.

Calibration And Reward Robustness

The main validation question is whether \(c_i\) is directionally meaningful. Figure 5 and Table 3 report the human audit: correctness rises monotonically from 24.0% for \(c_i=0\) to 81.0% for \(c_i \geq 20\).

**Figure 5. Reward signal calibration.** Original caption: empirical \(P(\mathrm{correct} \mid c_i)\) across five co-occurrence buckets, \(N=700\) manually annotated sentences, with Wilson 95% confidence intervals. I place it here because it is the strongest evidence that the reward bins are not arbitrary.

Bucket	n	Correct	P(correct)	95% CI
\(c_i = 0\)	200	48	24.0%	[18.6, 30.4]
\(1 \leq c_i \leq 4\)	100	53	53.0%	[43.3, 62.5]
\(5 \leq c_i \leq 9\)	100	70	70.0%	[60.4, 78.1]
\(10 \leq c_i \leq 19\)	100	73	73.0%	[63.6, 80.7]
\(c_i \geq 20\)	200	162	81.0%	[75.0, 85.8]

Table 3. Five-bucket calibration audit. The two thresholds used by the reward map sit at visible precision transitions: \(c_i=5\) and \(c_i=20\).
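
The intervals in Table 3 can be checked with the standard Wilson score formula. The sketch below assumes \(z = 1.96\) for the 95% level; the function name is my own. With that choice it reproduces the table's intervals to one decimal place.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (correct, n) per bucket, copied from Table 3.
buckets = {
    "c = 0": (48, 200),
    "1 <= c <= 4": (53, 100),
    "5 <= c <= 9": (70, 100),
    "10 <= c <= 19": (73, 100),
    "c >= 20": (162, 200),
}

for name, (k, n) in buckets.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {100 * k / n:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

The intervals for the two extreme buckets do not overlap with each other, which is what makes the zero-count penalty and the high-count bonus the best-supported parts of the reward map.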

Figure 6 checks whether the zero-count penalty is fragile. The full recipe is retrained with zero-count penalties from -0.1 to -1.0 on Llama-3.2-3B-Instruct; both weaker and stronger penalties underperform the canonical -0.3 setting.

**Figure 6. Zero-count penalty sensitivity.** Original caption: sensitivity of CorVer accuracy on Llama-3.2-3B-Instruct to the zero-count penalty \(r_i^{\mathrm{c}}(c_i=0)\), evaluated on TriviaQA validation with \(N=17,944\). This supports the chosen -0.3 penalty as an empirical sweet spot in the tested setting.

The qualitative case study in Table 9 shows the same idea at the sentence level: zero co-occurrence can catch plausible but wrong entities that a GPT-4o-mini offline judge labels correct.

Experiments And Results

Main Accuracy Results

Table 4 compresses the main table into model-level averages and the two exceptions noted in the source. CorVer improves every Raw cell in the four-model main comparison; compared with the four prior factuality-RL baselines, it wins 18 of 20 model-benchmark cells under the paper's feasible baseline configurations.

Model	Raw avg	Best non-CorVer avg	CorVer avg	CorVer vs Raw avg	Notes
Llama-3.2-3B	22.52	25.55	27.89	+5.37	CorVer best on all five benchmarks; largest main-table average improvement.
Llama-3.1-8B	30.64	31.21	35.27	+4.63	Highest CorVer average in the main table.
Qwen3-4B	20.85	22.12	22.69	+1.84	CorVer narrowly ahead of FSPO average.
Qwen3-8B	24.37	24.65	26.11	+1.74	FSPO beats CorVer on PopQA by 0.26 pp; RLFH beats CorVer on SimpleQA by 0.58 pp.

Table 4. Main result summary. The average deltas in Table 4 are computed from the source table. The paper's own wording emphasizes that CorVer improves every Raw cell and beats prior baselines in 18 of 20 cells, with the two baseline wins both on Qwen3-8B and within a point.

Scaling Across Six Models

Figure 3 and Table 5 address generality. The scaling heatmap reports positive CorVer-minus-Raw gains in all 30 cells across six instruction-tuned models and five benchmarks.

**Figure 3. Cross-model scaling of CorVer gains over Raw.** Original caption: gains across six models, 3B to 14B, and five benchmarks; every cell is positive. This is the strongest evidence for C3.

Scale	Model	TriviaQA Raw -> CorVer	NQ-Open Raw -> CorVer	PopQA Raw -> CorVer	SimpleQA Raw -> CorVer	TruthfulQA Raw -> CorVer
3B	Llama-3.2-3B-Instruct	55.39 -> 62.24	34.13 -> 43.41	15.92 -> 23.75	1.55 -> 2.57	5.63 -> 7.47
4B	Qwen3-4B	51.14 -> 53.77	24.65 -> 26.59	17.51 -> 19.33	2.52 -> 3.12	8.45 -> 10.65
8B	Llama-3.1-8B-Instruct	71.86 -> 76.52	40.66 -> 48.34	28.85 -> 35.30	5.20 -> 5.92	6.61 -> 10.28
8B	Qwen3-8B	62.84 -> 63.99	29.61 -> 32.80	20.34 -> 21.83	2.57 -> 2.73	6.49 -> 9.18
13B	OLMo-2-13B-Instruct	67.48 -> 73.19	32.80 -> 37.87	25.56 -> 31.04	2.17 -> 3.44	6.00 -> 9.30
14B	Qwen3-14B	67.51 -> 71.28	31.39 -> 37.76	19.44 -> 25.37	1.85 -> 3.91	7.71 -> 9.79

Table 5. Full scaling summary. Table 5 preserves the Raw and CorVer values behind the 30 positive cells.
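
The "all 30 cells positive" count and the per-model averages can be re-derived directly from the Raw and CorVer values in Table 5. The sketch below only re-does that arithmetic; the variable names are my own.

```python
# (Raw, CorVer) pairs in the order TriviaQA, NQ-Open, PopQA, SimpleQA, TruthfulQA.
table5 = {
    "Llama-3.2-3B-Instruct": [(55.39, 62.24), (34.13, 43.41), (15.92, 23.75), (1.55, 2.57), (5.63, 7.47)],
    "Qwen3-4B": [(51.14, 53.77), (24.65, 26.59), (17.51, 19.33), (2.52, 3.12), (8.45, 10.65)],
    "Llama-3.1-8B-Instruct": [(71.86, 76.52), (40.66, 48.34), (28.85, 35.30), (5.20, 5.92), (6.61, 10.28)],
    "Qwen3-8B": [(62.84, 63.99), (29.61, 32.80), (20.34, 21.83), (2.57, 2.73), (6.49, 9.18)],
    "OLMo-2-13B-Instruct": [(67.48, 73.19), (32.80, 37.87), (25.56, 31.04), (2.17, 3.44), (6.00, 9.30)],
    "Qwen3-14B": [(67.51, 71.28), (31.39, 37.76), (19.44, 25.37), (1.85, 3.91), (7.71, 9.79)],
}


def model_averages(cells):
    """Return (mean Raw, mean CorVer) over the five benchmarks."""
    raw = sum(r for r, _ in cells) / len(cells)
    corver = sum(c for _, c in cells) / len(cells)
    return raw, corver


positive_cells = sum(c > r for cells in table5.values() for r, c in cells)
total_cells = sum(len(cells) for cells in table5.values())
print(f"positive cells: {positive_cells}/{total_cells}")

for model, cells in table5.items():
    raw, corver = model_averages(cells)
    print(f"{model}: Raw {raw:.2f} -> CorVer {corver:.2f} ({corver - raw:+.2f})")
```

These are unweighted means over benchmarks, so the large TriviaQA and NQ-Open numbers dominate the averages while SimpleQA and TruthfulQA contribute little in absolute terms.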

Cost And Feasibility

At the canonical rollout density, the reward loop issues about:

N_{\mathrm{steps}} B G \bar{m}_{\mathrm{sent}} \approx 100 \cdot 24 \cdot 16 \cdot 3 \approx 1.2 \times 10^5

sentence-level reward calls per training run. Figure 4 and Table 6 show why the paper cares so much about per-call cost.
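
To see why per-call cost matters at this volume, the call count can be multiplied by a per-call latency. The latencies below are hypothetical placeholders, not measurements from the paper, and the estimate assumes strictly serial calls with no batching or parallelism, so it is only a rough sensitivity check.

```python
def reward_calls(n_steps: int, prompt_batch: int, generations: int,
                 mean_sentences: float) -> float:
    """N_steps * B * G * mean sentences per completion."""
    return n_steps * prompt_batch * generations * mean_sentences


def serial_reward_hours(calls: float, seconds_per_call: float) -> float:
    """Wall-clock hours if every reward call ran one after another."""
    return calls * seconds_per_call / 3600.0


calls = reward_calls(n_steps=100, prompt_batch=24, generations=16, mean_sentences=3)
print(f"reward calls per run: {calls}")

# Hypothetical per-call latencies in seconds (illustrative only).
for latency in (0.01, 1.0):
    print(f"{latency:>5} s/call -> {serial_reward_hours(calls, latency):.2f} h")
```

Under these assumptions a hundredfold difference in per-call latency translates directly into a hundredfold difference in reward-loop time, which is the lever the paper's cost argument relies on.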

**Figure 4. Reward computation cost.** Original caption: average training time per method across four base models, with bars annotating each baseline's slowdown relative to CorVer. I place it here because it is the main evidence for the "lightweight" claim.

Method	Llama-3.2-3B	Qwen3-4B	Llama-3.1-8B	Qwen3-8B	Avg slowdown
CorVer	2.0 h	4.3 h	2.5 h	4.1 h	1x
FoRAG	21.0 h	26.9 h	11.8 h	20.1 h	6.6x
RLFH	15.5 h	23.1 h	9.4 h	10.1 h	4.8x
FSPO	11.7 h	27.6 h	12.8 h	65.8 h	8.4x
KnowRL	21.1 h	21.9 h	17.1 h	36.4 h	7.8x

Table 6. Training wall-clock comparison. Table 6 is not an equal-budget comparison; it reflects the paper's argument that neural-verifier baselines cannot be run at the same rollout configuration without prohibitive cost.
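
The "Avg slowdown" column appears to be the mean of the per-model slowdown ratios rather than a ratio of mean hours. That reading is an inference from the numbers, not something the source states; the sketch below recomputes it from the hours in Table 6.

```python
# Hours per method in the column order of Table 6:
# Llama-3.2-3B, Qwen3-4B, Llama-3.1-8B, Qwen3-8B.
hours = {
    "CorVer": [2.0, 4.3, 2.5, 4.1],
    "FoRAG": [21.0, 26.9, 11.8, 20.1],
    "RLFH": [15.5, 23.1, 9.4, 10.1],
    "FSPO": [11.7, 27.6, 12.8, 65.8],
    "KnowRL": [21.1, 21.9, 17.1, 36.4],
}


def avg_slowdown(method: str, reference: str = "CorVer") -> float:
    """Mean over models of (method hours / reference hours)."""
    ratios = [m / r for m, r in zip(hours[method], hours[reference])]
    return sum(ratios) / len(ratios)


for method in hours:
    print(f"{method}: {avg_slowdown(method):.1f}x")
```

Averaging ratios gives each base model equal weight, so the 65.8 h FSPO run on Qwen3-8B moves FSPO's average more than it would under a ratio of total hours.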

What Drives The Gain

Table 7 holds the base model fixed at Llama-3.1-8B-Instruct and removes components. The full configuration beats every ablation on every benchmark. The most important comparison is A3 versus full: A3 keeps the same QuCo scalar magnitude but removes per-token sentence alignment, and it recovers only part of the gain.

Variant	TriviaQA	NQ-Open	PopQA	SimpleQA	TruthfulQA
Raw	71.86	40.66	28.85	5.20	6.61
CorVer full	76.52	48.34	35.30	5.92	10.28
A1: no QuCo	71.3	45.9	34.6	5.4	6.8
A2: no judge	76.1	42.6	31.7	4.8	7.1
A3: no per-token alignment	72.9	46.3	34.9	5.8	6.6
A4: no self-filter	75.0	46.0	33.5	5.1	8.0

Table 7. Component ablation. Table 7 supports the claim that CorVer needs both the response-level judge and the token-aligned sentence reward, especially outside TriviaQA.
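
One way to read Table 7 is to ask what fraction of the full Raw-to-CorVer gain each ablation retains on each benchmark. The "fraction recovered" metric below is my own framing for illustration, not a quantity reported in the paper.

```python
BENCHMARKS = ["TriviaQA", "NQ-Open", "PopQA", "SimpleQA", "TruthfulQA"]
raw = [71.86, 40.66, 28.85, 5.20, 6.61]
full = [76.52, 48.34, 35.30, 5.92, 10.28]
ablations = {
    "A1: no QuCo": [71.3, 45.9, 34.6, 5.4, 6.8],
    "A2: no judge": [76.1, 42.6, 31.7, 4.8, 7.1],
    "A3: no per-token alignment": [72.9, 46.3, 34.9, 5.8, 6.6],
    "A4: no self-filter": [75.0, 46.0, 33.5, 5.1, 8.0],
}


def fraction_recovered(variant):
    """(variant - Raw) / (full - Raw) per benchmark; 1.0 means the full gain."""
    return [(v - r) / (f - r) for v, r, f in zip(variant, raw, full)]


for name, scores in ablations.items():
    cells = ", ".join(f"{b} {x:.2f}"
                      for b, x in zip(BENCHMARKS, fraction_recovered(scores)))
    print(f"{name}: {cells}")
```

Values near or below zero, such as A3 on TruthfulQA, mark benchmarks where removing a component erases essentially all of the gain.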

Table 8 compares alternative triplet aggregation rules. The canonical "first valid triplet" rule is not the most exhaustive factual checker, but it is the best training signal in the reported experiment: Min catches more within-sentence issues but collapses length and accuracy, while RelCheck adds relation tokens and cost but underperforms.

Variant	Correct (%)	NA (%)	Mean length	Time
First, canonical	62.24	5.04	150	1.0x
Min	57.27	0.33	40	1.2x
RelCheck	61.37	6.58	150	1.7x

Table 8. Triplet aggregation variants. Table 8 explains a practical choice: a weaker, first-triplet proxy can train better than a harsher multi-triplet rule because the policy can exploit the latter by shortening outputs.

The paper's qualitative Figure 7 has no external graphics asset, so I reconstruct its content as Table 9. Each row is a single sentence where the co-occurrence reward assigns \(r_i^{\mathrm{c}}=-0.3\) and the offline GPT-4o-mini judge incorrectly says the sentence is correct.

Case	Prompt / gold answer	Generated error	Extracted pair	Co-occurrence result	Offline LLM judge
A	Ronkonkoma, NY zip code / 11779	"11777"	Ronkonkoma NY - 11777	\(c_i=0\), reward -0.30	Correct
B	Kevin James's wife in Grown Ups / Maria Bello	Jennifer Coolidge appears in Grown Ups	Jennifer Coolidge - Grown Ups	\(c_i=0\), reward -0.30	Correct
C	Ella Fitzgerald's parent / William Fitzgerald	Temperance Mary Tempie Height Fitzgerald	Ella Fitzgerald - Height	\(c_i=0\), reward -0.30	Correct

Table 9. Reconstructed case-study figure. The source caption says these are three single-sentence examples from the zero-frequency bucket; all were independently verified as factually incorrect by a human annotator.

Coverage, Refusals, And Checkpoints

Table 10 is the paper's most important limitation analysis. If CorVer mainly rescued rare entities, gains should be largest in Q1. Instead, gains are larger for better-covered PopQA entities, especially Q3 and Q4.

Model	PopQA quartile	Raw	CorVer	Delta
Llama-3.1-8B	Q1 rare	18.94	24.41	+5.47
Llama-3.1-8B	Q2	16.51	21.64	+5.13
Llama-3.1-8B	Q3	24.50	32.89	+8.39
Llama-3.1-8B	Q4 popular	57.06	64.55	+7.50
OLMo-2-13B	Q1 rare	15.63	19.31	+3.68
OLMo-2-13B	Q2	13.27	17.61	+4.33
OLMo-2-13B	Q3	23.74	29.25	+5.51
OLMo-2-13B	Q4 popular	50.60	59.63	+9.03

Table 10. PopQA popularity-quartile gains. Table 10 shows that the reward works best where the corpus has enough coverage to distinguish correct from incorrect co-occurrences.

For Qwen3-8B, Table 11 shows a large reduction in refusal rates. The paper checks the subset of questions that Raw refused but CorVer attempted and reports nontrivial correctness on that subset, which supports the interpretation that the model is recovering some factual recall rather than just guessing everywhere.

Dataset	Raw NA%	CorVer NA%	Correct on Raw-refused subset
TriviaQA	8.32	4.37	24.9
NQ-Open	13.27	4.13	17.5
PopQA	18.30	2.58	7.0
SimpleQA	27.99	2.96	1.3
TruthfulQA	16.65	3.30	8.1

Table 11. Qwen3-8B refusal decomposition. This diagnostic is useful but narrow: the source text says the Llama family shows the opposite NA-rate shift, so this is not a universal mechanism.

Figure 8 supports the uniform step-100 checkpoint choice. On a longer Llama-3.2-3B CorVer run, the largest jump comes in the first 100 steps; step 200 is higher but costs twice as much, and later checkpoints drift slightly down.

**Figure 8. Checkpoint selection.** Original caption: TriviaQA accuracy of Llama-3.2-3B-Instruct CorVer as a function of GRPO step, evaluated every 50 steps to 300; step 100 captures the largest single jump over Raw, while step 200 is the empirical peak. This is why the paper uses step 100 uniformly.

Practical Takeaways

The reusable idea is not "co-occurrence equals truth"; it is "a cheap, bounded, calibrated corpus statistic can be useful as a dense auxiliary reward when paired with an ordinary answer reward."
The strongest empirical support is breadth: all 30 Raw-vs-CorVer cells improve across six models and five benchmarks.
The strongest engineering support is cost: at roughly 120k sentence-level reward calls per run, avoiding neural verifier calls changes what configurations are feasible.
The most important limitation is predicate blindness. A subject and object can co-occur in Wikipedia while the generated relation is wrong.
Corpus coverage is a real constraint. The PopQA analysis shows larger gains on better-covered entities, so this is not a magic fix for the rarest facts.
The "first triplet" rule is pragmatic. It misses some multi-fact sentence errors, but harsher aggregation changed model behavior in worse ways.
The paper validates CorVer only under GRPO. PPO, REINFORCE, DPO-style preference construction, larger web indexes, and combinations with retrieval or NLI rewards are future-work claims, not demonstrated results.

Limitations And Residual Risk

The reward is a corpus-grounded proxy, not a claim-level verifier. It ignores predicate semantics in the main query, keeps only one triplet per sentence, depends on English Wikipedia coverage, and uses a lenient substring-plus-alias evaluator for absolute accuracies. The relative comparisons use the same grader, so the tables are still meaningful for within-paper comparisons, but absolute correctness should be read cautiously.

I did not use raw PDF extraction. I did not consult latex_flattened/main.flattened.tex because the numeric tables needed for this digest were recoverable from paper.md; the HTML-rowspan tables were awkward, but not malformed enough to require fallback recovery.