arXiv · 2026 · avg 3.43 · interest 5.50 · 11 HF · retrieval bias · evidence placement

This paper examines whether dense retrievers' preference for early-position evidence is architectural or learned from training data. Using synthetic position-targeted training sets across eight pretrained models, it finds that skewed evidence positions steer retrieval bias and that balanced training substantially reduces positional sensitivity while maintaining competitive retrieval performance.


Motivation / Background

Dense retrievers are often used as the first stage in open-domain QA and retrieval-augmented generation, so a systematic preference for early document evidence can become a downstream evidence-missing failure. The paper starts from an unresolved causal question: is this position bias mostly baked into transformer architecture and pooling, or can retrieval fine-tuning data redirect the bias?

The authors argue that prior evidence is mostly observational. MS MARCO and many natural corpora are early-skewed, and earlier dense-retrieval studies found primacy bias across architectures, but those results do not isolate the training-position distribution. This paper builds a controlled intervention: generate position-targeted query-document training examples, verify that the query is answerable from the intended document segment, and fine-tune the same model families under begin-, middle-, end-, and uniformly distributed evidence positions.

The central design is visible in the controlled sampling setup: the retained-pool table shows the verified candidate pool, while the model overview table shows why the test is not tied to one architecture family.

| Length bin (tokens) | Begin | Middle | End |
|---|---|---|---|
| 256-512 | 105,652 | 13,934 | 21,405 |
| 512-1024 | 86,495 | 16,660 | 21,427 |
| 1024-2048 | 60,357 | 13,594 | 16,691 |
| 2048-4096 | 43,946 | 10,527 | 13,363 |
| 4096-8192 | 39,200 | 8,189 | 9,796 |
| Total | 335,650 | 62,904 | 82,682 |

Table. Retained candidate examples. Candidates are retained after the multi-reranker verification rule with margin threshold \(\delta=0.3\), before downsampling. I include this table because it explains why controlled downsampling is needed: the high-confidence pool is large but strongly begin-skewed.

| Model | Type | Positional encoding | Pooling | Params | Max length |
|---|---|---|---|---|---|
| BERT-base | Encoder | APE | CLS | 110M | 512 |
| Longformer-base | Encoder | APE | Mean | 149M | 4k |
| ModernBERT-base | Encoder | RoPE | CLS | 149M | 8k |
| ModernBERT-large | Encoder | RoPE | CLS | 395M | 8k |
| GPT-2-medium | Decoder | APE | Last token | 355M | 1k |
| BLOOM-560M | Decoder | ALiBi | Last token | 560M | 2k |
| TinyLlama-NoPE | Decoder | NoPE | Last token | 1.1B | 2k |
| Qwen3-0.6B | Decoder | RoPE | Last token | 0.6B | 32k |

Table. Model families. The experiments span encoder and decoder retrievers, multiple positional encodings, and different pooling strategies. This matters because a consistent direction shift across this table would weaken a purely architectural explanation.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
|---|---|---|---|
| C1 | Retrieval-level position-bias direction is strongly shaped by the evidence-position distribution used during fine-tuning. | 5 | controlled setup, main position-aware results, mirror reversal, representation shift |
| C2 | Position-balanced fine-tuning reduces positional sensitivity while preserving competitive mean retrieval performance. | 5 | main results table, PosIR results, PSI definition |
| C3 | Architecture and pretraining are not sufficient explanations for the observed retrieval-level bias direction. | 4 | model overview, main results, pooling ablation, limitations |
| C4 | Standard BEIR scores can partly reward early-position priors when benchmark evidence is early-skewed. | 4 | BEIR evidence-position figure, BEIR table, BEIR interpretation |
| C5 | The generated position-targeted training data is reasonably high-confidence under model-based checks, but still not human-labeled ground truth. | 4 | data pool, verification rule, retained pool, LLM audit, limitations |
| C6 | Fine-tuning can reshape query-document and document-only representations toward the emphasized evidence position. | 4 | evidence-moving table, document-segment heatmap, all-model heatmap |

Scores are support-from-paper scores, not independent reproduction scores. Claims about causality are strong within the controlled synthetic setting, but the paper itself narrows them because generated queries, segment content, discourse role, and difficulty can remain entangled with physical position.

Core Technical Idea

The paper turns position bias into a controlled data intervention. For each English Wikipedia document, it divides the text into beginning, middle, and end segments, asks GPT-4o-mini to generate a query answerable from a target segment, filters the generated query with multiple cross-encoder rerankers, and then constructs four matched fine-tuning sets: begin-concentrated (\(\mathcal{M}_B\)), middle-concentrated (\(\mathcal{M}_M\)), end-concentrated (\(\mathcal{M}_E\)), and uniform (\(\mathcal{M}_U\)).

The important control is that the four configurations are matched in training scale and document-length distribution. Each concentrated configuration samples 8,189 examples from the target position in each length bin, giving 40,945 examples. The uniform configuration samples 2,729 examples from each position within each length bin, giving 40,935 examples. This makes the main comparison about evidence position rather than dataset size or document length.
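The sampling arithmetic above can be checked directly from the retained-pool table. The sketch below is only an illustration: the rule it uses (take the smallest cell in the table as the per-bin quota, then integer-divide by three for the uniform configuration) is an inference that happens to reproduce the reported 8,189 and 2,729, not a procedure stated in the paper. The names `RETAINED`, `per_bin_quota`, and `config_sizes` are hypothetical.

```python
# Retained candidate counts per length bin, copied from the retained-pool table.
RETAINED = {
    "256-512":   {"begin": 105_652, "middle": 13_934, "end": 21_405},
    "512-1024":  {"begin": 86_495,  "middle": 16_660, "end": 21_427},
    "1024-2048": {"begin": 60_357,  "middle": 13_594, "end": 16_691},
    "2048-4096": {"begin": 43_946,  "middle": 10_527, "end": 13_363},
    "4096-8192": {"begin": 39_200,  "middle": 8_189,  "end": 9_796},
}
POSITIONS = ("begin", "middle", "end")


def per_bin_quota(retained):
    """Largest sample size that every (bin, position) cell can supply.

    Assumption: the quota is the minimum cell count in the table.
    """
    return min(count for row in retained.values() for count in row.values())


def config_sizes(retained):
    """Training-set sizes for the concentrated and uniform configurations."""
    quota = per_bin_quota(retained)
    n_bins = len(retained)
    uniform_quota = quota // len(POSITIONS)  # per position, per length bin
    return {
        "quota": quota,
        "concentrated_total": quota * n_bins,
        "uniform_quota": uniform_quota,
        "uniform_total": uniform_quota * len(POSITIONS) * n_bins,
    }


SIZES = config_sizes(RETAINED)
print(SIZES)
# {'quota': 8189, 'concentrated_total': 40945, 'uniform_quota': 2729, 'uniform_total': 40935}
```

The ten-example gap between 40,945 and 40,935 comes from the integer division: 8,189 is not divisible by three.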

Method Details

Position-Targeted Data Construction

The query generator is not trusted by itself. The paper uses three cross-encoder rerankers as position verifiers: bge-reranker-v2-m3, gte-multilingual-reranker-base, and jina-reranker-v2-base-multilingual. For a query \(q\), a segment \(s_i\), and a reranker \(R\), the reranker score is:

$$ r_{R,i} = R(q, s_i), \quad i \in \mathcal{P}. $$

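Here \(\mathcal{P}\) indexes the document's segments and \(\mathcal{R}\) is the set of three rerankers. A candidate is retained only when every reranker scores the intended target segment \(t \in \mathcal{P}\) above the best non-target segment by at least \(\delta\):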

$$ r_{R,t} - \max_{u \neq t} r_{R,u} \geq \delta, \quad \forall R \in \mathcal{R}. $$

For the main experiments, \(\delta=0.3\). The segment-wise LLM audit table, which the main text references, is the paper's sanity check that larger reranker margins correlate with segment-exclusive answerability.

| Reranker condition | Target Yes | Distractor No | Exclusive |
|---|---|---|---|
| Failed top-rank check | 87.7% | 73.8% | 51.4% |
| \(0 \le m_{\mathrm{cons}} < 0.1\) | 93.4% | 83.5% | 67.0% |
| \(0.1 \le m_{\mathrm{cons}} < 0.2\) | 93.7% | 90.2% | 77.7% |
| \(0.2 \le m_{\mathrm{cons}} < 0.3\) | 94.7% | 94.4% | 85.3% |
| \(0.3 \le m_{\mathrm{cons}}\) | 95.4% | 97.0% | 90.4% |

Table. Segment-wise LLM audit. The highest-margin stratum, which corresponds to the retained pool used for training, reaches 95.4% Target Yes, 97.0% Distractor No, and 90.4% Exclusive in the model-based audit.
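The retention rule defined earlier in this subsection can be sketched in a few lines. This is a minimal illustration, not the paper's code: the score values are made up, and treating \(m_{\mathrm{cons}}\) as the minimum margin across rerankers is an assumption (it is the reading under which "every reranker clears \(\delta\)" and "\(m_{\mathrm{cons}} \ge \delta\)" coincide). The function names are hypothetical.

```python
def margin(scores, target):
    """Target-segment score minus the best non-target score, for one reranker."""
    best_other = max(v for seg, v in scores.items() if seg != target)
    return scores[target] - best_other


def consensus_margin(per_reranker_scores, target):
    """Smallest margin across rerankers (assumed meaning of m_cons)."""
    return min(margin(s, target) for s in per_reranker_scores.values())


def is_retained(per_reranker_scores, target, delta=0.3):
    """True when every reranker prefers the target segment by at least delta."""
    return consensus_margin(per_reranker_scores, target) >= delta


# Hypothetical reranker scores for one generated query targeting the middle segment.
EXAMPLE = {
    "bge-reranker-v2-m3": {"begin": 0.10, "middle": 0.85, "end": 0.20},
    "gte-multilingual-reranker-base": {"begin": 0.15, "middle": 0.80, "end": 0.30},
    "jina-reranker-v2-base-multilingual": {"begin": 0.05, "middle": 0.90, "end": 0.25},
}

# Same query, but one reranker finds the end segment nearly as relevant.
AMBIGUOUS = {name: dict(scores) for name, scores in EXAMPLE.items()}
AMBIGUOUS["gte-multilingual-reranker-base"]["end"] = 0.60

KEEP_EXAMPLE = is_retained(EXAMPLE, "middle")
KEEP_AMBIGUOUS = is_retained(AMBIGUOUS, "middle")
```

A single dissenting reranker is enough to reject a candidate, which is consistent with the strongly reduced pool sizes in the retained-pool table.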

Training And Evaluation

All eight models are fine-tuned as bi-encoder retrievers with InfoNCE loss and chunk-aware negatives. The paper deliberately avoids hard negative mining because it could introduce position-dependent confounds. The shared settings are AdamW, batch size 256, 3 epochs, warmup ratio 0.1, similarity scale 20.0, seed 42, and model-specific learning rates.
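For readers unfamiliar with the objective, here is a dependency-free sketch of InfoNCE with cosine similarity and the similarity scale of 20.0 mentioned above. It is an assumption-laden stand-in: the digest does not detail how chunk-aware negatives are built, so this version uses plain in-batch negatives, and the helper names `cosine` and `info_nce` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query_vecs, doc_vecs, scale=20.0):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is the positive for query_vecs[i]; every other document in
    the batch acts as a negative (in-batch negatives, an assumption here).
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


QUERIES = [[1.0, 0.0], [0.0, 1.0]]
ALIGNED_LOSS = info_nce(QUERIES, [[1.0, 0.0], [0.0, 1.0]])
SWAPPED_LOSS = info_nce(QUERIES, [[0.0, 1.0], [1.0, 0.0]])
```

With perfectly aligned pairs the loss is close to zero, and with swapped pairs it is close to the scale value, which shows how the scale sharpens the softmax over candidates.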

Evaluation uses nDCG@10 separately on beginning, middle, and end evidence subsets. The summary statistic is Position Sensitivity Index:

$$ \mathrm{PSI} = 1 - \frac{\min(s)}{\max(s)}, \quad \mathrm{where}\ \max(s) > 0. $$

Here \(s=\{s_{\mathrm{begin}},s_{\mathrm{mid}},s_{\mathrm{end}}\}\). A lower PSI means less sensitivity to where the relevant evidence appears. This metric is always interpreted with mean nDCG@10 so that a uniformly bad retriever is not mistaken for a robust one.
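The PSI definition translates directly into code. The sketch below also reports the mean alongside PSI, which is the pairing the paragraph above recommends. The position-wise scores are hypothetical values chosen for illustration, and `psi` and `summarize` are hypothetical names.

```python
def psi(scores):
    """Position Sensitivity Index: 1 - min(s) / max(s), defined for max(s) > 0."""
    highest = max(scores)
    if highest <= 0:
        raise ValueError("PSI is undefined when the best score is not positive")
    return 1.0 - min(scores) / highest


def summarize(scores_by_position):
    """Mean nDCG@10 and PSI for one model, given position-wise nDCG@10."""
    values = list(scores_by_position.values())
    return {"mean": sum(values) / len(values), "psi": psi(values)}


# Hypothetical position-wise nDCG@10 values.
SKEWED = summarize({"begin": 0.8, "mid": 0.6, "end": 0.4})
FLAT_BUT_WEAK = summarize({"begin": 0.1, "mid": 0.1, "end": 0.1})
```

`FLAT_BUT_WEAK` has a PSI of zero yet a mean of only 0.1, which is why PSI alone cannot certify a retriever as robust.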

Experiments And Results

Position-Aware Retrieval Benchmarks

Figure 1 and Table 1 are the main evidence for C1 and C2. The figure shows the position-wise curves; the table recovers the mean nDCG@10 and PSI values from the LaTeX source because the Markdown table header is malformed.

Figure 1. Position-wise nDCG@10 across training configurations (main position-wise results). The original caption reports SQuAD-PosQ on the top row and FineWeb-PosQ on the bottom row. Columns are \(\mathcal{M}_B\), \(\mathcal{M}_M\), \(\mathcal{M}_E\), and \(\mathcal{M}_U\). I place it here because it visually shows the training-position direction effect.

| Benchmark / model | \(\mathcal{M}_B\) nDCG | \(\mathcal{M}_M\) nDCG | \(\mathcal{M}_E\) nDCG | \(\mathcal{M}_U\) nDCG | \(\mathcal{M}_B\) PSI | \(\mathcal{M}_M\) PSI | \(\mathcal{M}_E\) PSI | \(\mathcal{M}_U\) PSI |
|---|---|---|---|---|---|---|---|---|
| SQuAD-PosQ / BERT-base | 0.438 | 0.432 | 0.408 | 0.487 | 0.281 | 0.371 | 0.533 | 0.136 |
| SQuAD-PosQ / Longformer-base | 0.481 | 0.539 | 0.489 | 0.543 | 0.304 | 0.236 | 0.331 | 0.143 |
| SQuAD-PosQ / ModernBERT-base | 0.466 | 0.520 | 0.463 | 0.516 | 0.433 | 0.286 | 0.341 | 0.088 |
| SQuAD-PosQ / ModernBERT-large | 0.490 | 0.564 | 0.499 | 0.557 | 0.348 | 0.166 | 0.335 | 0.082 |
| SQuAD-PosQ / GPT-2-medium | 0.231 | 0.287 | 0.244 | 0.283 | 0.592 | 0.233 | 0.431 | 0.080 |
| SQuAD-PosQ / BLOOM-560M | 0.359 | 0.447 | 0.361 | 0.465 | 0.536 | 0.290 | 0.431 | 0.080 |
| SQuAD-PosQ / TinyLlama-NoPE | 0.362 | 0.455 | 0.418 | 0.462 | 0.561 | 0.271 | 0.389 | 0.132 |
| SQuAD-PosQ / Qwen3-0.6B | 0.546 | 0.626 | 0.556 | 0.655 | 0.395 | 0.283 | 0.409 | 0.068 |
| FineWeb-PosQ / ModernBERT-base | 0.554 | 0.570 | 0.571 | 0.640 | 0.476 | 0.422 | 0.343 | 0.108 |
| FineWeb-PosQ / ModernBERT-large | 0.592 | 0.647 | 0.595 | 0.675 | 0.426 | 0.305 | 0.361 | 0.116 |
| FineWeb-PosQ / Qwen3-0.6B | 0.578 | 0.533 | 0.535 | 0.604 | 0.359 | 0.245 | 0.272 | 0.116 |

Table 1. Mean nDCG@10 and PSI. Uniform training has the lowest PSI for every model on SQuAD-PosQ and for every evaluated long-context model on FineWeb-PosQ. On SQuAD-PosQ, the paper reports that \(\mathcal{M}_U\) reduces PSI by 57-87% relative to the worst skewed configuration.

The paper gives concrete directional examples. On SQuAD-PosQ, Qwen3-0.6B scores 0.672 in the 0-100 bucket under begin training but 0.415 under end training; in the 500-3120 bucket, the end-trained model scores 0.702 versus 0.407 for begin training. On FineWeb-PosQ, ModernBERT-large similarly favors the position it was trained on: beginning evidence scores 0.778 under \(\mathcal{M}_B\) versus 0.475 under \(\mathcal{M}_E\), while end evidence scores 0.743 under \(\mathcal{M}_E\) versus 0.447 under \(\mathcal{M}_B\). This is the clearest evidence that the bias direction can be redirected by training data.

Standard BEIR Evaluation

The BEIR result is more nuanced. Figure 2 shows that several evaluated BEIR subsets place evidence near the beginning. Table 2 then shows that begin-trained retrievers obtain the highest average nDCG@10 across these subsets. This supports the paper's warning that standard benchmark gains can reflect evidence-location skew rather than position robustness.

Figure 2. BEIR evidence start-position distributions. The original caption says evidence start positions are normalized by relevant-document length, with dashed red means and dotted begin/middle/end boundaries. I include it because it explains why BEIR can reward an early-position prior.

| BEIR subset | \(\mathcal{M}_B\) | \(\mathcal{M}_M\) | \(\mathcal{M}_E\) | \(\mathcal{M}_U\) |
|---|---|---|---|---|
| SciFact | 0.351 | 0.368 | 0.340 | 0.393 |
| HotpotQA | 0.338 | 0.192 | 0.165 | 0.284 |
| FEVER | 0.491 | 0.164 | 0.156 | 0.357 |
| Climate-FEVER | 0.153 | 0.125 | 0.109 | 0.154 |
| Average | 0.333 | 0.212 | 0.193 | 0.297 |

Table 2. BEIR nDCG@10 averaged over all eight models. Begin training wins on average, especially on FEVER and HotpotQA where evidence is early-concentrated. Uniform training is best on SciFact and effectively tied on Climate-FEVER, where the early skew is weaker.

PosIR And Mirror Reversal

Figure 3 and Table 3 show that the same broad pattern extends to PosIR for the long-context models. Uniform training has the best mean nDCG@10 and lowest PSI for ModernBERT-base, ModernBERT-large, and Qwen3-0.6B.

Figure 3. Position-wise nDCG@10 on selected PosIR domains. The original caption reports four selected domains: Subject Education, News Media, Law Judiciary, and Finance Economics. I place it here because it tests whether the controlled finding survives a separate position-aware benchmark.

| Model | \(\mathcal{M}_B\) nDCG | \(\mathcal{M}_M\) nDCG | \(\mathcal{M}_E\) nDCG | \(\mathcal{M}_U\) nDCG | \(\mathcal{M}_B\) PSI | \(\mathcal{M}_M\) PSI | \(\mathcal{M}_E\) PSI | \(\mathcal{M}_U\) PSI |
|---|---|---|---|---|---|---|---|---|
| ModernBERT-base | 0.341 | 0.359 | 0.358 | 0.411 | 0.547 | 0.596 | 0.570 | 0.158 |
| ModernBERT-large | 0.377 | 0.419 | 0.362 | 0.423 | 0.472 | 0.383 | 0.562 | 0.138 |
| Qwen3-0.6B | 0.409 | 0.401 | 0.386 | 0.450 | 0.341 | 0.562 | 0.590 | 0.261 |

Table 3. PosIR mean nDCG@10 and PSI. The paper reports PSI reductions relative to the worst skewed configuration of 73.5% for ModernBERT-base, 75.4% for ModernBERT-large, and 55.8% for Qwen3-0.6B.
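The reported reductions can be reproduced from the PSI columns of Table 3. The short check below assumes "relative to the worst skewed configuration" means one minus the ratio of the uniform PSI to the largest of the three skewed PSIs. The names `POSIR_PSI` and `relative_psi_reduction` are hypothetical.

```python
# PSI values copied from Table 3 (B, M, E = skewed configurations, U = uniform).
POSIR_PSI = {
    "ModernBERT-base":  {"B": 0.547, "M": 0.596, "E": 0.570, "U": 0.158},
    "ModernBERT-large": {"B": 0.472, "M": 0.383, "E": 0.562, "U": 0.138},
    "Qwen3-0.6B":       {"B": 0.341, "M": 0.562, "E": 0.590, "U": 0.261},
}


def relative_psi_reduction(psi_by_config):
    """Fractional PSI drop of uniform training versus the worst skewed config."""
    worst_skewed = max(psi_by_config[c] for c in ("B", "M", "E"))
    return 1.0 - psi_by_config["U"] / worst_skewed


REDUCTIONS = {
    model: round(100 * relative_psi_reduction(values), 1)
    for model, values in POSIR_PSI.items()
}
print(REDUCTIONS)
# {'ModernBERT-base': 73.5, 'ModernBERT-large': 75.4, 'Qwen3-0.6B': 55.8}
```

Under that reading, the table values and the stated percentages agree to one decimal place.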

The mirror-reversal diagnostic further tests physical position rather than document origin. Documents are split into five segments and reversed, so front-origin evidence moves to the back and back-origin evidence moves to the front. The reversal front-back gap \(\Delta_{\mathrm{rev}}=\mathrm{B{\to}F}-\mathrm{F{\to}B}\) is positive when the model favors currently front-placed evidence. For Qwen3-0.6B, \(\mathcal{M}_B\) has \(+0.236\), \(\mathcal{M}_E\) has \(-0.230\), and \(\mathcal{M}_U\) drops to \(+0.039\). This mirrors the main conclusion: the concentrated training configurations prefer their trained physical positions, while uniform training is less position-sensitive.
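A minimal sketch of the mirror-reversal transformation and the gap \(\Delta_{\mathrm{rev}}\) follows. Two details are assumptions rather than facts from the paper: segments are cut to near-equal token counts, and token order inside each segment is preserved while segment order is reversed. All function names are hypothetical, and the scores passed to `delta_rev` would come from an actual retrieval evaluation.

```python
def split_segments(tokens, n_segments=5):
    """Cut a token list into n_segments contiguous, near-equal pieces."""
    base, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        size = base + (1 if i < extra else 0)  # spread the remainder up front
        segments.append(tokens[start:start + size])
        start += size
    return segments


def mirror_reverse(tokens, n_segments=5):
    """Reverse segment order so front-origin text ends up at the back."""
    reversed_segments = reversed(split_segments(tokens, n_segments))
    return [tok for segment in reversed_segments for tok in segment]


def delta_rev(score_back_to_front, score_front_to_back):
    """Positive when the model favors evidence that is currently at the front."""
    return score_back_to_front - score_front_to_back


REVERSED = mirror_reverse(list(range(10)))
print(REVERSED)
# [8, 9, 6, 7, 4, 5, 2, 3, 0, 1]
```

Because the evidence text is identical before and after reversal, a sign flip in `delta_rev` between \(\mathcal{M}_B\) and \(\mathcal{M}_E\) points to physical position rather than segment content.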

Representation-Level Analyses

The paper then asks whether the ranking-level pattern appears inside embeddings. The evidence-moving experiment relocates query-relevant evidence to ten uniformly spaced positions inside the same document and measures query-document cosine similarity. Table 4 shows that the highest-similarity insertion position tracks the training configuration for ModernBERT-base and Qwen3-0.6B.

| Model | Config | Peak position | Lowest position | \(\Delta \times 10^3\) |
|---|---|---|---|---|
| ModernBERT-base | \(\mathcal{M}_B\) | 1 | 9 | 21.2 |
| ModernBERT-base | \(\mathcal{M}_M\) | 4 | 10 | 9.4 |
| ModernBERT-base | \(\mathcal{M}_E\) | 9 | 1 | 20.6 |
| ModernBERT-base | \(\mathcal{M}_U\) | 10 | 2 | 1.9 |
| Qwen3-0.6B | \(\mathcal{M}_B\) | 1 | 10 | 21.5 |
| Qwen3-0.6B | \(\mathcal{M}_M\) | 5 | 10 | 27.1 |
| Qwen3-0.6B | \(\mathcal{M}_E\) | 9 | 1 | 20.6 |
| Qwen3-0.6B | \(\mathcal{M}_U\) | 9 | 10 | 5.5 |

Table 4. Evidence-moving cosine analysis. Uniform training sharply reduces the peak-minus-lowest gap, especially for ModernBERT-base, but does not make all representation behavior perfectly flat.
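The evidence-moving procedure and the three summary columns of Table 4 can be sketched as below. This is an illustration under assumptions: documents are treated as lists of sentences, the ten slots are spread evenly over the filler sentences, and the similarity values are invented to mimic a begin-trained profile. `insert_evidence` and `peak_lowest_gap` are hypothetical names, and the embedding step is omitted.

```python
def insert_evidence(filler_sentences, evidence, slot, n_slots=10):
    """Return a document with evidence placed at one of n_slots even positions.

    slot is 1-based: slot 1 puts the evidence first, slot n_slots puts it last.
    """
    if not 1 <= slot <= n_slots:
        raise ValueError("slot must be between 1 and n_slots")
    index = round((slot - 1) * len(filler_sentences) / (n_slots - 1))
    return filler_sentences[:index] + [evidence] + filler_sentences[index:]


def peak_lowest_gap(similarities):
    """1-based peak slot, 1-based lowest slot, and (peak - lowest) * 1e3."""
    peak = max(range(len(similarities)), key=similarities.__getitem__)
    lowest = min(range(len(similarities)), key=similarities.__getitem__)
    gap = (similarities[peak] - similarities[lowest]) * 1e3
    return peak + 1, lowest + 1, gap


FILLER = [f"filler {i}" for i in range(9)]
FIRST = insert_evidence(FILLER, "EVIDENCE", slot=1)
LAST = insert_evidence(FILLER, "EVIDENCE", slot=10)

# Hypothetical query-document cosine similarities for slots 1..10.
SIMS = [0.721, 0.715, 0.712, 0.710, 0.708, 0.706, 0.704, 0.702, 0.700, 0.701]
PEAK, LOWEST, GAP = peak_lowest_gap(SIMS)
```

In a real run, each of the ten documents would be embedded and compared with the query embedding to produce `SIMS`; a flat profile yields a gap near zero.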

The document-only analysis tells the same story. In Figure 4, ModernBERT-base and Qwen3-0.6B show only mild pretrained tendencies before retrieval fine-tuning, then shift toward the position emphasized during fine-tuning. Figure 5 gives the broader all-model view.

Figure 4. Full-document and segment embedding similarity for ModernBERT-base and Qwen3-0.6B. The original caption describes mean cosine similarity between full-document embeddings and segment embeddings \(p_1\)-\(p_{10}\). I include it because it is the clearest representation-level visualization in the main text.

Figure 5. Full-document and segment embedding similarity for all eight models. This appendix figure is included because it shows that model-specific residual tendencies remain, even though fine-tuning often redirects the representation profile.

Finally, Figure 6 checks whether the effect is an artifact of a single pooling choice. For ModernBERT-base, the paper trains the same four positional distributions with CLS, mean, max, and last-token pooling. Pooling affects absolute performance, but the direction of position preference still follows the training-position distribution.

Figure 6. ModernBERT-base pooling ablation. The original caption reports position-wise nDCG@10 under four pooling strategies. I include it because it supports the claim that the observed direction effect is not merely a pooling artifact.
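The four pooling strategies compared in the ablation differ only in how token vectors are collapsed into one embedding. The sketch below shows that difference on a toy input. It assumes the CLS vector is the first token and ignores padding masks, which a real implementation would need. The function name `pool` and the toy vectors are hypothetical.

```python
def pool(token_vectors, strategy):
    """Collapse a list of token vectors into a single embedding.

    Assumptions: token_vectors[0] is the CLS token, token_vectors[-1] is the
    last real token, and there is no padding to mask out.
    """
    if strategy == "cls":
        return list(token_vectors[0])
    if strategy == "last":
        return list(token_vectors[-1])
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Toy sequence of three tokens with two-dimensional hidden states.
TOKENS = [[1.0, 0.0], [3.0, 4.0], [2.0, -2.0]]
POOLED = {name: pool(TOKENS, name) for name in ("cls", "mean", "max", "last")}
```

CLS and last-token pooling read a single position, while mean and max pooling aggregate over all positions, so a shared direction of bias across all four is hard to attribute to the pooling operator alone.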

Practical Takeaways

The limitations section explicitly cautions that beginning-, middle-, and end-targeted examples use different target segments and generated queries. It also notes residual labeling errors, possible verifier-induced biases, single-seed training, limited hyperparameter exploration, and no end-to-end production retrieval evaluation. Those caveats are important: the paper identifies training-position distribution as a major controllable factor, not as the only factor behind dense-retriever position bias.