DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Source-first digest for monthly 2026_05 rank 19, rank_id p009.

Routing status: success; route full_markdown
Source repair: flattened TeX was used only to recover table cells that were dropped by Markdown conversion
PDF extraction: not used

Motivation / Background

DelTA addresses a granularity mismatch in reinforcement learning from verifiable rewards (RLVR): the reward is a response-level scalar, but the policy update is applied through token-level probability terms. The paper starts from the observation that RLVR can induce sparse token-level distributional shifts even though every token in a response shares the same sequence-level advantage. This raises the central question: which token probabilities are increased or decreased by an RLVR update, and what determines those changes?

The paper's answer is that a sequence-level RLVR update implicitly behaves like a linear discriminator in token-gradient space. Positive-advantage responses and negative-advantage responses each contribute an aggregate token-gradient direction; a candidate token is locally encouraged when its own token-gradient vector aligns more with the positive aggregate than the negative aggregate. The main claims are summarized in Table 1.

The failure mode is subtle. Standard group-relative objectives build each side's aggregate as a centroid over all tokens on that side. But high-reward and low-reward reasoning traces share many high-frequency tokens, such as formatting markers, repeated problem entities, and boilerplate reasoning scaffolds. Those shared directions can pull both centroids toward common background structure, making the induced discriminator less sensitive to sparse directions that actually distinguish good from bad responses.

Claims And Evidence

Support scores in Table 1 are source-support scores, not independent reproduction scores. A score of 5 means the claim is directly backed by source text, equations, tables, or figures. A score of 4 means the source evidence is strong but still depends on assumptions such as benchmark representativeness, proxy validity, or single-paper experimental scope.

Claim id	Main claim	Support	Evidence anchors
C1	Sequence-level RLVR induces a local discriminator over token-gradient vectors, determining which candidate-token probabilities move up or down.	5	motivation, local discriminator, key equations
C2	Standard side-wise centroids can be diluted by shared high-frequency token directions, so within-side summarization is not necessarily good between-side discrimination.	4	motivation, local discriminator, diagnostics
C3	DelTA estimates discriminative token coefficients and reweights a self-normalized RLVR surrogate to make positive and negative effective centroids more contrastive.	5	method, implementation, Figure 1, key equations
C4	On seven math benchmarks, DelTA beats the strongest same-scale baseline by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base.	5	main results, Table 3, Figure 2
C5	The learned coefficients carry useful token-level signal: top-lambda token selection outperforms full-token DAPO while bottom-lambda token selection collapses.	4	diagnostics, Figure 3, Figure 4
C6	The benefit is not confined to the main Qwen3 math setup: the paper reports gains on Olmo3-7B, code generation, and OOD GPQA-D/MMLU-Pro.	4	generalization, Table 5
C7	The method is lightweight relative to RLVR training but still has caveats: proxy gradients, extra forward passes, mostly math-centered evaluation, and no independent reproduction in this digest.	4	implementation, limitations, Table 5

Core Technical Idea

The core move is to reinterpret the RLVR update as both a parameter update and a token-gradient classifier. For a candidate token \(x\) under context \(c\), a local first-order step gives:

\Delta \log \pi(x \mid c) \approx \left( \nabla_\theta \log \pi_\theta(x \mid c) \big|_{\theta=\theta_{\mathrm{old}}} \right)^\top \Delta\theta .

For a DAPO-style group-relative RLVR objective, the local update direction decomposes into positive- and negative-advantage token-gradient aggregates:

\Delta\theta_{\mathrm{RLVR}} \propto \sum_{i:\hat A_i>0}\sum_{t=1}^{|o_i|} \hat A_i v_{i,t} - \sum_{i:\hat A_i<0}\sum_{t=1}^{|o_i|} |\hat A_i| v_{i,t}, \quad v_{i,t}:= \nabla_\theta \log \pi_\theta(o_{i,t}\mid q,o_{i,{<}t}) \big|_{\theta=\theta_{\mathrm{old}}}.

After normalizing each side, the update can be written as a contrast between side-wise centroids:

\Delta\theta_{\mathrm{RLVR}} \propto M_+\bar\mu_+ - M_-\bar\mu_- .

Substituting this into the log-probability change makes the discriminator view explicit:

\Delta \log \pi(x \mid c) \propto M_+ \left(\nabla_\theta \log \pi_\theta(x \mid c)\right)^\top \bar\mu_+ - M_- \left(\nabla_\theta \log \pi_\theta(x \mid c)\right)^\top \bar\mu_- .

The candidate token is encouraged when the positive-side score exceeds the negative-side score. This explains why a response-level reward can still create sparse token-level probability movement: selection is induced by geometry in token-gradient space, not by explicit token rewards.

Digest object	Source label	Role
Group-normalized advantage and importance ratio	unlabelled prelim equation	Defines the response-level advantage shared by every token in one sampled response.
DAPO clipped surrogate	unlabelled prelim equation	Provides the representative critic-free RLVR objective used for the local analysis.
Local token log-probability change	`eq:local-logprob-change`	Connects a parameter update to candidate-token probability movement.
Local RLVR update	`eq:local-rlvr-update`	Splits sampled-token gradients into positive and negative advantage sides.
Centroid decomposition	`eq:centroid-decomposition`	Shows the update as a mass-weighted contrast between side-wise centroids.
Local discriminator score	`eq:local-discriminator-score`	Makes the positive-vs-negative token-gradient scoring rule explicit.
DelTA assignment score	`eq:DelTA-alpha`	Computes side-specific soft assignment scores from opposite-side distance margins.
DelTA weighted objective	`eq:DelTA-weighted-objective`	Replaces uniform token averaging with bounded, self-normalized token coefficients.

Table 2. Key equations. The source equations come from equations.json and paper.md; Table 2 is the digest's compact map of the mathematical flow.

Method Details

DelTA implements the discriminator view by assigning a coefficient to each rollout token. The method starts from the original positive and negative centroids, estimates how side-specific each token-gradient vector is, refines the centroids with those scores, maps the final scores to bounded coefficients, and uses the coefficients inside the RLVR surrogate. Figure 1 is the paper's overview of this pipeline.

Figure 1. Overview of DelTA. — **Figure 1. DelTA overview.** DelTA estimates token coefficients from the contrast between positive- and negative-advantage token-gradient aggregates, then reweights the sequence-level RLVR objective. Provenance: source `fig_0001`, label `fig:main`, original LaTeX asset `figs/main_4fin.pdf`, copied from ranking cache `local source artifact2026_05/assets/figures/p009/fig001_01_main_4fin.jpg`.

For a positive-advantage token, DelTA gives a larger raw score when its token-gradient vector is closer to the positive centroid than to the negative centroid. The source writes the entropy-regularized assignment in objective form and gives the closed-form score:

\alpha_{i,t}^{(k)} = \begin{cases} \sigma\!\left( \dfrac{ \|v_{i,t}-\mu_-^{(k)}\|_2^2 - \|v_{i,t}-\mu_+^{(k)}\|_2^2 }{ \gamma_+^{(k)} } \right), & \hat A_i>0,\\ \sigma\!\left( \dfrac{ \|v_{i,t}-\mu_+^{(k)}\|_2^2 - \|v_{i,t}-\mu_-^{(k)}\|_2^2 }{ \gamma_-^{(k)} } \right), & \hat A_i<0 . \end{cases}

The score is high for token-gradient vectors that are more characteristic of their own side than of the opposite side. DelTA then updates each centroid as a score-weighted within-side average, recomputes final scores \(\alpha_{i,t}^{\star}\), maps them into \(\lambda_{i,t}=\lambda_{\min}+(\lambda_{\max}-\lambda_{\min})\alpha_{i,t}^{\star}\), and optimizes:

J_{\mathrm{DelTA}}(\theta) = \mathbb{E}\!\left[ \frac{1}{\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\lambda_{i,t}} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \lambda_{i,t} \min\!\Big( r_{i,t}(\theta)\hat A_i,\; \mathrm{clip}\!\big( r_{i,t}(\theta), 1-\epsilon_{\mathrm{low}}, 1+\epsilon_{\mathrm{high}} \big)\hat A_i \Big) \right].

Implementation details matter because exact full-parameter token gradients would be too expensive at RLVR scale. DelTA uses a layer-restricted token-gradient proxy based on the LM-head output row:

\nabla_{W_{y_t}}\log p_t(y_t) = \bigl(1-p_t(y_t)\bigr)h_t .

This proxy is only used to compute stop-gradient coefficients; the weighted RLVR objective still updates the full model. In the main experiments, DelTA sets \([\lambda_{\min},\lambda_{\max}]=[0.8,1.2]\), uses one refinement iteration \(K=1\), computes coefficients once per rollout batch, holds them fixed across optimization epochs, and recomputes them for newly sampled trajectories. The implementation can keep the standard DAPO token-count normalizer by using \(\bar\lambda_{i,t}=\lambda_{i,t}N/Z\), which is equivalent to the self-normalized objective.

The paper reports that with \(K=1\), DelTA needs \(K+2=3\) extra actor forward passes for coefficient estimation when hidden states are not cached. On 8 NVIDIA B200 GPUs, the first DelTA training step takes 37 seconds longer than DAPO, reported as about 10.2 percent of the total first-step time of DelTA. That overhead is nonzero, but the paper argues it is modest relative to long-response RLVR rollout generation.

Experiments And Results

The main setup trains Qwen3-8B-Base and Qwen3-14B-Base on DeepMath-103K with VeRL, disables dynamic sampling for all methods, and compares DelTA against DAPO, DAPO with Forking Tokens, SAPO, and FIPO under the same training hyperparameters. Evaluation covers AIME24, AIME25, AIME26, HMMT25-Feb, HMMT25-Nov, HMMT26-Feb, and Brumo25, with 16 sampled responses per problem and 30,000-token maximum generation length. Table 3 condenses the source main-results table.

Backbone	Method	AIME24	AIME25	AIME26	HMMT25 Feb	HMMT25 Nov	HMMT26 Feb	Brumo25	Avg.
Qwen3-8B-Base	DAPO	34.79	23.33	24.17	13.54	12.08	16.86	36.46	22.95
Qwen3-8B-Base	DAPO w/ FT	36.67	23.96	26.46	15.62	15.42	17.05	39.17	24.80
Qwen3-8B-Base	SAPO	38.75	24.37	26.25	14.58	16.04	17.42	39.37	25.14
Qwen3-8B-Base	FIPO	37.50	23.13	23.96	14.58	12.92	17.99	37.71	23.89
Qwen3-8B-Base	DelTA	43.13	26.46	28.12	18.33	18.54	20.27	44.79	28.40
Qwen3-14B-Base	DAPO	51.25	32.29	39.79	19.79	30.00	25.38	48.13	35.09
Qwen3-14B-Base	DAPO w/ FT	54.37	33.75	41.46	20.42	31.67	24.81	52.08	36.77
Qwen3-14B-Base	SAPO	53.96	34.17	41.46	20.62	28.33	24.05	50.21	35.94
Qwen3-14B-Base	FIPO	54.58	35.00	42.50	21.46	32.29	24.43	52.08	37.29
Qwen3-14B-Base	DelTA	56.87	37.92	45.21	26.04	32.92	26.89	54.79	39.91

Table 3. Main math results. DelTA has the best average at both scales and beats the strongest same-scale baseline average by 3.26 points on Qwen3-8B-Base and 2.62 points on Qwen3-14B-Base.

Figure 2 supports the claim that DelTA's gains are not just a shorter-answer effect: the paper reports that DAPO plateaus and shifts toward shorter responses with rising entropy, while DelTA continues improving reward, maintains longer responses, and has lower entropy.

Figure 2c. Entropy curve. — **Figure 2. Training dynamics.** Reward, response length, and entropy curves compare DelTA with DAPO. Provenance: source `fig_0002`, label `fig:training_dynamics`, original LaTeX assets `figs/dyna/reward.pdf`, `figs/dyna/leng.pdf`, and `figs/dyna/entropy.pdf`, copied from ranking cache files `fig002_01_reward.jpg`, `fig002_02_leng.jpg`, and `fig002_03_entropy.jpg`.

The analysis section asks whether the opposite-side comparison and the coefficient design are actually necessary. Table 4 collects the quantitative diagnostic rows, while Figure 3 and Figure 4 show source visual evidence for token selection and token-weight interpretation.

Diagnostic	Setting	AIME25	AIME26	HMMT25	HMMT26	Avg.	Interpretation
Within-side comparison	DelTA	26.46	28.12	18.54	20.27	23.27	Reference.
Within-side comparison	DAPO	23.33	24.17	12.08	16.86	19.05	Baseline.
Within-side comparison	Within-side only	21.67	22.08	11.04	17.05	17.94	Own-side centrality alone is worse than DAPO.
Component ablation	Full DelTA	26.46	28.12	18.54	20.27	23.27	Reference.
Component ablation	w/o adaptive \(\gamma\)	25.00	26.04	16.04	17.99	21.19	Scale adaptation helps.
Component ablation	w/o \(h(\alpha)\)	24.37	26.87	15.42	17.42	20.93	Soft assignment helps.
Component ablation	w/o \(\lambda\)-norm	24.37	26.25	15.83	19.32	21.39	Coefficient-mass normalization helps.
Component ablation	w/o range map	24.79	25.83	15.83	17.05	20.78	Bounded coefficients are more stable than raw scores.
Component ablation	w/o refinement	23.13	25.42	15.42	16.29	19.97	One-shot initial centroids are insufficient.
Proxy ablation	Base DelTA	26.46	28.12	18.54	20.27	23.27	Default output-row proxy.
Proxy ablation	Top-\(K\) hidden-gradient proxy	27.08	27.71	20.83	21.78	24.29	Another proxy can work even better.
Proxy ablation	Random \(\lambda\)	22.50	22.50	11.87	16.67	18.34	Random reweighting is not enough.

Table 4. Diagnostics. The results support the paper's claim that opposite-side contrast, bounded soft coefficients, refinement, normalization, and non-random coefficient signal each matter.

Figure 3a. Reward under token-selection strategies.

Figure 3b. Accuracy under token-selection strategies. — **Figure 3. Token-selection evidence.** The paper uses DelTA coefficients only for hard token selection: top-\(\lambda\) tokens improve over full-token DAPO, random 50 percent selection stays close to DAPO, and bottom-\(\lambda\) tokens collapse. Provenance: source `fig_0003`, labels `fig:reward_mask` and `fig:acc_mask`, original LaTeX assets `figs/mask/reward_mask.pdf` and `figs/mask/acc_mask.pdf`, copied from ranking cache files `fig003_01_reward_mask.jpg` and `fig003_02_acc_mask.jpg`.

The paper also tests transfer beyond the main Qwen3 math setting. Table 5 summarizes the main supplementary results and the reported overhead.

Evidence family	Baseline	DelTA	Reported gain
Olmo3-7B-Base seven-math average	DAPO 19.01	DelTA 22.80	+3.79 average points
Code generation weighted average	DAPO 47.7	DelTA 49.5	+1.8 average points
Qwen3-8B OOD weighted average	DAPO 58.87	DelTA 62.38	+3.51 average points
Qwen3-14B OOD weighted average	DAPO 66.77	DelTA 68.40	+1.63 average points
First-step wall-clock overhead	DAPO reference	DelTA +37 seconds	About 10.2 percent of DelTA first-step time

Table 5. Transfer and cost. The supplementary evidence supports generality across architecture, code generation, and OOD reasoning, while also showing the extra compute cost of coefficient estimation.

Figure 4 is qualitative but useful for interpreting what the coefficients emphasize. The paper reports that high-weight tokens include reasoning- or transformation-associated tokens such as scaffold, prime, =y, forward, and backward, while low-weight tokens include more background-like or entity-specific tokens such as Seat, domain, players, Vander, and Hamilton.

Figure 4b. Low-weight token cloud. — **Figure 4. Token weight analysis.** Token clouds visualize token types with high and low average DelTA coefficients across roughly \(10^8\) generated tokens. Provenance: source `fig_0004`, label `fig:token_cloud`, original LaTeX assets `figs/token_c/high_w.pdf` and `figs/token_c/low_w.pdf`, copied from ranking cache files `fig004_01_high_w.jpg` and `fig004_02_low_w.jpg`.

The limitations are important for interpreting the result. First, DelTA's coefficients are estimated with a layer-restricted proxy rather than exact full-parameter token gradients. The source argues this is a practical necessity and supports robustness with proxy ablations, but it remains an approximation. Second, the evaluation is centered on math reasoning, with supplementary code, architecture, and OOD tests rather than broad RLVR coverage over multi-turn tool use or diverse verifiable domains. Third, coefficient estimation adds compute. The paper reports a modest overhead in its setting, but users with very long responses or constrained memory may face different tradeoffs. Finally, this digest does not independently reproduce the experiments.

Practical Takeaways

For RLVR work, the paper's most useful contribution is the discriminator lens: a response-level advantage does not explain token-level learning alone; the geometry of the aggregated token gradients determines which token probabilities move. That framing gives a concrete way to reason about credit assignment without adding dense process rewards or a learned value function.

DelTA is attractive when an existing DAPO/GRPO-style trainer already computes old log-probabilities and hidden states, because its coefficients are stop-gradient quantities that can be inserted as per-token multipliers. The method keeps the training objective close to the baseline, but changes which token-gradient directions receive emphasis.

The strongest empirical signal is the combination of Table 3, Table 4, and Figure 3: DelTA beats strong same-scale baselines, the within-side-only and random-weight controls fail, and the top-\(\lambda\)/bottom-\(\lambda\) split suggests the coefficients identify useful and harmful token-gradient directions rather than merely sparsifying the loss.

The main caution is that DelTA's mechanism is approximate and source-validated rather than independently reproduced here. Treat the method as a promising RLVR objective modification with strong internal evidence, not as a universally validated credit-assignment solution.

Reference Coverage

This digest links all explicit anchors used above: evidence anchors motivation, local discriminator, method, implementation, main results, diagnostics, generalization, and limitations; table anchors claims, key equations, main results, diagnostics, and transfer/cost; figure anchors DelTA overview, training dynamics, token selection, and token clouds.