Is Position Bias in Dense Retrievers Built In or Learned from Data?


Motivation / Background

Dense retrievers are often used as the first stage in open-domain QA and retrieval-augmented generation, so a systematic preference for evidence early in a document can propagate downstream as missing evidence. The paper starts from an unresolved causal question: is this position bias mostly baked into the transformer architecture and pooling, or can the retrieval fine-tuning data redirect it?

The authors argue that prior evidence is mostly observational. MS MARCO and many natural corpora are early-skewed, and earlier dense-retrieval studies found primacy bias across architectures, but those results do not isolate the training-position distribution. This paper builds a controlled intervention: generate position-targeted query-document training examples, verify that the query is answerable from the intended document segment, and fine-tune the same model families under begin-, middle-, end-, and uniformly distributed evidence positions.

The central design is visible in the controlled sampling setup: the retained-pool table shows the verified candidate pool, while the model overview table shows why the test is not tied to one architecture family.

Length bin	Begin	Middle	End
256-512	105,652	13,934	21,405
512-1024	86,495	16,660	21,427
1024-2048	60,357	13,594	16,691
2048-4096	43,946	10,527	13,363
4096-8192	39,200	8,189	9,796
Total	335,650	62,904	82,682

Table. Retained candidate examples. Candidates are retained after the multi-reranker verification rule with margin threshold \(\delta=0.3\), before downsampling. I include this table because it explains why controlled downsampling is needed: the high-confidence pool is large but strongly begin-skewed.

Model	Type	Positional encoding	Pooling	Params	Max length
BERT-base	Encoder	APE	CLS	110M	512
Longformer-base	Encoder	APE	Mean	149M	4k
ModernBERT-base	Encoder	RoPE	CLS	149M	8k
ModernBERT-large	Encoder	RoPE	CLS	395M	8k
GPT-2-medium	Decoder	APE	Last token	355M	1k
BLOOM-560M	Decoder	ALiBi	Last token	560M	2k
TinyLlama-NoPE	Decoder	NoPE	Last token	1.1B	2k
Qwen3-0.6B	Decoder	RoPE	Last token	0.6B	32k

Table. Model families. The experiments span encoder and decoder retrievers, multiple positional encodings, and different pooling strategies. This matters because a consistent direction shift across this table would weaken a purely architectural explanation.

Claims And Evidence

Claim id	Main claim	Support	Evidence anchors
C1	Retrieval-level position-bias direction is strongly shaped by the evidence-position distribution used during fine-tuning.	5	controlled setup, main position-aware results, mirror reversal, representation shift
C2	Position-balanced fine-tuning reduces positional sensitivity while preserving competitive mean retrieval performance.	5	main results table, PosIR results, PSI definition
C3	Architecture and pretraining are not sufficient explanations for the observed retrieval-level bias direction.	4	model overview, main results, pooling ablation, limitations
C4	Standard BEIR scores can partly reward early-position priors when benchmark evidence is early-skewed.	4	BEIR evidence-position figure, BEIR table, BEIR interpretation
C5	The generated position-targeted training data is reasonably high-confidence under model-based checks, but still not human-labeled ground truth.	4	data pool, verification rule, retained pool, LLM audit, limitations
C6	Fine-tuning can reshape query-document and document-only representations toward the emphasized evidence position.	4	evidence-moving table, document-segment heatmap, all-model heatmap

Scores are support-from-paper scores, not independent reproduction scores. Claims about causality are strong within the controlled synthetic setting, but the paper itself narrows them because generated queries, segment content, discourse role, and difficulty can remain entangled with physical position.

Core Technical Idea

The paper turns position bias into a controlled data intervention. For each English Wikipedia document, it divides the text into beginning, middle, and end segments, asks GPT-4o-mini to generate a query answerable from a target segment, filters the generated query with multiple cross-encoder rerankers, and then constructs matched fine-tuning sets:

\(\mathcal{M}_B\): all training queries target the beginning segment.
\(\mathcal{M}_M\): all training queries target the middle segment.
\(\mathcal{M}_E\): all training queries target the end segment.
\(\mathcal{M}_U\): training queries are distributed uniformly over beginning, middle, and end.

The important control is that the four configurations are matched in training scale and document-length distribution. Each concentrated configuration samples 8,189 examples from the target position in each length bin, giving 40,945 examples. The uniform configuration samples 2,729 examples from each position within each length bin, giving 40,935 examples. This makes the main comparison about evidence position rather than dataset size or document length.
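The sample sizes above can be re-derived from the retained-pool table. The sketch below assumes that the per-bin budget is the smallest cell in the table and that the uniform configuration splits that budget evenly across the three positions, rounding down. Both rules are inferred from the reported numbers rather than stated in this digest, and the variable names are illustrative.

```python
# Retained pool counts per length bin as (Begin, Middle, End),
# copied from the retained-pool table.
POOL = {
    "256-512": (105_652, 13_934, 21_405),
    "512-1024": (86_495, 16_660, 21_427),
    "1024-2048": (60_357, 13_594, 16_691),
    "2048-4096": (43_946, 10_527, 13_363),
    "4096-8192": (39_200, 8_189, 9_796),
}

# Assumption: the per-bin budget is the smallest cell anywhere in the table,
# so every position can supply the same number of examples in every bin.
per_bin = min(min(counts) for counts in POOL.values())

# Assumption: the uniform configuration divides that budget over the three
# positions and rounds down.
per_position = per_bin // 3

concentrated_total = per_bin * len(POOL)
uniform_total = per_position * 3 * len(POOL)

print(per_bin, per_position, concentrated_total, uniform_total)
# prints: 8189 2729 40945 40935
```

The ten-example difference between the two totals comes from the integer division, which is why the configurations are closely matched rather than identical in size.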

Method Details

Position-Targeted Data Construction

The query generator is not trusted by itself. The paper uses three cross-encoder rerankers as position verifiers: bge-reranker-v2-m3, gte-multilingual-reranker-base, and jina-reranker-v2-base-multilingual. For a query \(q\), a segment \(s_i\) whose position index \(i\) ranges over the set of segment positions \(\mathcal{P}\), and a reranker \(R\), the reranker score is:

\[
r_{R,i} = R(q, s_i), \quad i \in \mathcal{P}.
\]

The candidate is retained only when every reranker in the verifier set \(\mathcal{R}\) scores the intended target segment \(t\) above the best non-target segment by at least \(\delta\):

\[
r_{R,t} - \max_{u \neq t} r_{R,u} \geq \delta, \quad \forall R \in \mathcal{R}.
\]

For the main experiments, \(\delta=0.3\). The LLM audit table below is the paper's sanity check that larger reranker margins correlate with segment-exclusive answerability.

Reranker condition	Target Yes	Distractor No	Exclusive
Failed top-rank check	87.7%	73.8%	51.4%
\(0 \le m_{\mathrm{cons}} < 0.1\)	93.4%	83.5%	67.0%
\(0.1 \le m_{\mathrm{cons}} < 0.2\)	93.7%	90.2%	77.7%
\(0.2 \le m_{\mathrm{cons}} < 0.3\)	94.7%	94.4%	85.3%
\(0.3 \le m_{\mathrm{cons}}\)	95.4%	97.0%	90.4%

Table. Segment-wise LLM audit. The highest-margin stratum, which corresponds to the retained pool used for training, reaches 95.4% Target Yes, 97.0% Distractor No, and 90.4% Exclusive in the model-based audit.
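The retention rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the reranker names and scores are hypothetical, and reading \(m_{\mathrm{cons}}\) as the smallest margin across rerankers is an assumption. That reading is consistent with the rule, because "every reranker clears \(\delta\)" is the same condition as "the smallest margin clears \(\delta\)".

```python
def consensus_margin(scores, target):
    """Smallest target-minus-best-distractor margin across rerankers.

    scores: dict mapping reranker name -> dict mapping segment name -> score.
    Assumption: this is how m_cons in the audit table is defined.
    """
    margins = []
    for per_segment in scores.values():
        best_other = max(v for seg, v in per_segment.items() if seg != target)
        margins.append(per_segment[target] - best_other)
    return min(margins)


def is_retained(scores, target, delta=0.3):
    """True when every reranker prefers the target segment by at least delta."""
    return consensus_margin(scores, target) >= delta


# Hypothetical scores for one query aimed at the middle segment.
strong = {
    "reranker_a": {"begin": 0.10, "middle": 0.85, "end": 0.20},
    "reranker_b": {"begin": 0.30, "middle": 0.90, "end": 0.15},
    "reranker_c": {"begin": 0.20, "middle": 0.90, "end": 0.10},
}
# Same query, but one reranker is only weakly convinced (margin near 0.2).
weak = dict(strong, reranker_c={"begin": 0.40, "middle": 0.60, "end": 0.10})

print(is_retained(strong, "middle"), is_retained(weak, "middle"))
# prints: True False
```

A single dissenting reranker is enough to drop a candidate, which helps explain why the retained pool is much smaller for middle and end targets than for begin targets.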

Training And Evaluation

All eight models are fine-tuned as bi-encoder retrievers with InfoNCE loss and chunk-aware negatives. The paper deliberately avoids hard negative mining because it could introduce position-dependent confounds. The shared settings are AdamW, batch size 256, 3 epochs, warmup ratio 0.1, similarity scale 20.0, seed 42, and model-specific learning rates.
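For readers unfamiliar with the objective, here is a minimal pure-Python sketch of InfoNCE with the reported similarity scale of 20.0. It assumes cosine similarity and plain in-batch negatives. The paper's chunk-aware negative construction is not reproduced here, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query_vecs, doc_vecs, scale=20.0):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is the positive for query_vecs[i]; every other document
    in the batch acts as a negative (assumption: in-batch negatives only).
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: aligned pairs give a near-zero loss, swapped pairs a large one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(queries, aligned_docs)
swapped_loss = info_nce(queries, swapped_docs)
print(aligned_loss < swapped_loss)
# prints: True
```

Because the negatives come from the batch rather than from a mined hard-negative pool, the position of evidence in the negatives is not selected by another model, which matches the paper's stated reason for avoiding hard negative mining.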

Evaluation uses nDCG@10 separately on beginning, middle, and end evidence subsets. The summary statistic is Position Sensitivity Index:

\[
\mathrm{PSI} = 1 - \frac{\min(s)}{\max(s)}, \quad \text{where } \max(s) > 0.
\]

Here \(s=\{s_{\mathrm{begin}},s_{\mathrm{mid}},s_{\mathrm{end}}\}\) collects the nDCG@10 scores on the three evidence subsets. A lower PSI means less sensitivity to where the relevant evidence appears. PSI is always read alongside mean nDCG@10 so that a uniformly bad retriever is not mistaken for a robust one.
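A small sketch makes the PSI definition and its caveat concrete. The position-wise scores below are hypothetical, chosen only to illustrate the behavior, and are not taken from the paper.

```python
def psi(scores):
    """Position Sensitivity Index: 1 - min(s) / max(s), defined for max(s) > 0."""
    best = max(scores)
    if best <= 0:
        raise ValueError("PSI is undefined when max(s) <= 0")
    return 1.0 - min(scores) / best


def mean(scores):
    return sum(scores) / len(scores)


# Hypothetical nDCG@10 values ordered as (begin, mid, end).
front_biased = (0.60, 0.45, 0.30)   # strong early-position preference
uniformly_bad = (0.05, 0.05, 0.05)  # flat, but poor everywhere

for name, s in [("front_biased", front_biased), ("uniformly_bad", uniformly_bad)]:
    print(name, round(psi(s), 3), round(mean(s), 3))
```

The second retriever reaches a PSI of zero while being nearly useless, which is why the metric is paired with mean nDCG@10.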

Experiments And Results

Position-Aware Retrieval Benchmarks

Figure 1 and Table 1 are the main evidence for C1 and C2. The figure shows the position-wise curves; the table reports the mean nDCG@10 and PSI values, recovered from the paper's LaTeX source.

**Figure 1. Main position-wise results: nDCG@10 across training configurations.** The original caption reports SQuAD-PosQ on the top row and FineWeb-PosQ on the bottom row. Columns are \(\mathcal{M}_B\), \(\mathcal{M}_M\), \(\mathcal{M}_E\), and \(\mathcal{M}_U\). I place it here because it visually shows the training-position direction effect.

Benchmark / model	\(\mathcal{M}_B\) nDCG	\(\mathcal{M}_M\) nDCG	\(\mathcal{M}_E\) nDCG	\(\mathcal{M}_U\) nDCG	\(\mathcal{M}_B\) PSI	\(\mathcal{M}_M\) PSI	\(\mathcal{M}_E\) PSI	\(\mathcal{M}_U\) PSI
SQuAD-PosQ / BERT-base	0.438	0.432	0.408	0.487	0.281	0.371	0.533	0.136
SQuAD-PosQ / Longformer-base	0.481	0.539	0.489	0.543	0.304	0.236	0.331	0.143
SQuAD-PosQ / ModernBERT-base	0.466	0.520	0.463	0.516	0.433	0.286	0.341	0.088
SQuAD-PosQ / ModernBERT-large	0.490	0.564	0.499	0.557	0.348	0.166	0.335	0.082
SQuAD-PosQ / GPT-2-medium	0.231	0.287	0.244	0.283	0.592	0.233	0.431	0.080
SQuAD-PosQ / BLOOM-560M	0.359	0.447	0.361	0.465	0.536	0.290	0.431	0.080
SQuAD-PosQ / TinyLlama-NoPE	0.362	0.455	0.418	0.462	0.561	0.271	0.389	0.132
SQuAD-PosQ / Qwen3-0.6B	0.546	0.626	0.556	0.655	0.395	0.283	0.409	0.068
FineWeb-PosQ / ModernBERT-base	0.554	0.570	0.571	0.640	0.476	0.422	0.343	0.108
FineWeb-PosQ / ModernBERT-large	0.592	0.647	0.595	0.675	0.426	0.305	0.361	0.116
FineWeb-PosQ / Qwen3-0.6B	0.578	0.533	0.535	0.604	0.359	0.245	0.272	0.116

Table 1. Mean nDCG@10 and PSI. Uniform training has the lowest PSI for every model on SQuAD-PosQ and for every evaluated long-context model on FineWeb-PosQ. On SQuAD-PosQ, the paper reports that \(\mathcal{M}_U\) reduces PSI by 57-87% relative to the worst skewed configuration.

The paper gives concrete directional examples. On SQuAD-PosQ, Qwen3-0.6B scores 0.672 in the 0-100 bucket under begin training but 0.415 under end training; in the 500-3120 bucket, the end-trained model scores 0.702 versus 0.407 for begin training. On FineWeb-PosQ, ModernBERT-large similarly favors the position it was trained on: beginning evidence scores 0.778 under \(\mathcal{M}_B\) versus 0.475 under \(\mathcal{M}_E\), while end evidence scores 0.743 under \(\mathcal{M}_E\) versus 0.447 under \(\mathcal{M}_B\). This is the clearest evidence that the bias direction can be redirected by training data.

Standard BEIR Evaluation

The BEIR result is more nuanced. Figure 2 shows that several evaluated BEIR subsets place evidence near the beginning. Table 2 then shows that begin-trained retrievers obtain the highest average nDCG@10 across these subsets. This supports the paper's warning that standard benchmark gains can reflect evidence-location skew rather than position robustness.

**Figure 2. BEIR evidence start-position distributions.** The original caption says evidence start positions are normalized by relevant-document length, with dashed red means and dotted begin/middle/end boundaries. I include it because it explains why BEIR can reward an early-position prior.

BEIR subset	\(\mathcal{M}_B\)	\(\mathcal{M}_M\)	\(\mathcal{M}_E\)	\(\mathcal{M}_U\)
SciFact	0.351	0.368	0.340	0.393
HotpotQA	0.338	0.192	0.165	0.284
FEVER	0.491	0.164	0.156	0.357
Climate-FEVER	0.153	0.125	0.109	0.154
Average	0.333	0.212	0.193	0.297

Table 2. BEIR nDCG@10 averaged over all eight models. Begin training wins on average, especially on FEVER and HotpotQA where evidence is early-concentrated. Uniform training is best on SciFact and effectively tied on Climate-FEVER, where the early skew is weaker.

PosIR And Mirror Reversal

Figure 3 and Table 3 show that the same broad pattern extends to PosIR for the long-context models. Uniform training has the best mean nDCG@10 and lowest PSI for ModernBERT-base, ModernBERT-large, and Qwen3-0.6B.

**Figure 3. PosIR position-wise results: nDCG@10 on selected PosIR domains.** The original caption reports four selected domains: Subject Education, News Media, Law Judiciary, and Finance Economics. I place it here because it tests whether the controlled finding survives a separate position-aware benchmark.

Model	\(\mathcal{M}_B\) nDCG	\(\mathcal{M}_M\) nDCG	\(\mathcal{M}_E\) nDCG	\(\mathcal{M}_U\) nDCG	\(\mathcal{M}_B\) PSI	\(\mathcal{M}_M\) PSI	\(\mathcal{M}_E\) PSI	\(\mathcal{M}_U\) PSI
ModernBERT-base	0.341	0.359	0.358	0.411	0.547	0.596	0.570	0.158
ModernBERT-large	0.377	0.419	0.362	0.423	0.472	0.383	0.562	0.138
Qwen3-0.6B	0.409	0.401	0.386	0.450	0.341	0.562	0.590	0.261

Table 3. PosIR mean nDCG@10 and PSI. The paper reports PSI reductions relative to the worst skewed configuration of 73.5% for ModernBERT-base, 75.4% for ModernBERT-large, and 55.8% for Qwen3-0.6B.
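The reported reductions can be checked directly against the PSI columns of Table 3. The sketch assumes that "worst skewed configuration" means the largest PSI among \(\mathcal{M}_B\), \(\mathcal{M}_M\), and \(\mathcal{M}_E\) for each model. The helper name is illustrative.

```python
# PSI values copied from Table 3 (keys B, M, E, U are the four configurations).
TABLE3_PSI = {
    "ModernBERT-base": {"B": 0.547, "M": 0.596, "E": 0.570, "U": 0.158},
    "ModernBERT-large": {"B": 0.472, "M": 0.383, "E": 0.562, "U": 0.138},
    "Qwen3-0.6B": {"B": 0.341, "M": 0.562, "E": 0.590, "U": 0.261},
}


def reduction_vs_worst_skewed(psi_by_config):
    """Percent PSI reduction of uniform training relative to the worst skewed run."""
    worst = max(psi_by_config[c] for c in ("B", "M", "E"))
    return 100.0 * (1.0 - psi_by_config["U"] / worst)


reductions = {m: reduction_vs_worst_skewed(row) for m, row in TABLE3_PSI.items()}
for model, value in reductions.items():
    print(f"{model}: {value:.1f}%")
# prints:
# ModernBERT-base: 73.5%
# ModernBERT-large: 75.4%
# Qwen3-0.6B: 55.8%
```

Under that reading the table and the reported percentages agree, and the worst skewed configuration differs by model (\(\mathcal{M}_M\) for ModernBERT-base, \(\mathcal{M}_E\) for the other two).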

The mirror-reversal diagnostic further tests physical position rather than document origin. Documents are split into five segments and reversed, so front-origin evidence moves to the back and back-origin evidence moves to the front. The reversal front-back gap \(\Delta_{\mathrm{rev}}=\mathrm{B{\to}F}-\mathrm{F{\to}B}\) is positive when the model favors currently front-placed evidence. For Qwen3-0.6B, \(\mathcal{M}_B\) has \(+0.236\), \(\mathcal{M}_E\) has \(-0.230\), and \(\mathcal{M}_U\) drops to \(+0.039\). This mirrors the main conclusion: the concentrated training configurations prefer their trained physical positions, while uniform training is less position-sensitive.
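The transformation behind this diagnostic is easy to sketch. The version below assumes a token-level split into five near-equal contiguous segments whose internal order is preserved. The digest does not state the split granularity, so that choice and the function names are illustrative.

```python
def mirror_reverse(tokens, n_segments=5):
    """Split tokens into n_segments contiguous chunks and reverse the chunk order.

    Token order inside each chunk is preserved, so only the physical
    position of each chunk within the document changes.
    """
    size, extra = divmod(len(tokens), n_segments)
    chunks, start = [], 0
    for i in range(n_segments):
        # The first `extra` chunks absorb one leftover token each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(tokens[start:end])
        start = end
    return [tok for chunk in reversed(chunks) for tok in chunk]


def delta_rev(score_back_to_front, score_front_to_back):
    """Reversal gap: positive when currently front-placed evidence is favored."""
    return score_back_to_front - score_front_to_back


print(mirror_reverse(list(range(10))))
# prints: [8, 9, 6, 7, 4, 5, 2, 3, 0, 1]
```

Because the evidence text is unchanged and only its location moves, a sign flip in \(\Delta_{\mathrm{rev}}\) between \(\mathcal{M}_B\) and \(\mathcal{M}_E\) points to physical position rather than to the content of front-origin or back-origin segments.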

Representation-Level Analyses

The paper then asks whether the ranking-level pattern appears inside embeddings. The evidence-moving experiment relocates query-relevant evidence to ten uniformly spaced positions inside the same document and measures query-document cosine similarity. Table 4 shows that the highest-similarity insertion position tracks the training configuration for ModernBERT-base and Qwen3-0.6B.

Model	Config	Peak position	Lowest position	\(\Delta \times 10^3\)
ModernBERT-base	\(\mathcal{M}_B\)	1	9	21.2
ModernBERT-base	\(\mathcal{M}_M\)	4	10	9.4
ModernBERT-base	\(\mathcal{M}_E\)	9	1	20.6
ModernBERT-base	\(\mathcal{M}_U\)	10	2	1.9
Qwen3-0.6B	\(\mathcal{M}_B\)	1	10	21.5
Qwen3-0.6B	\(\mathcal{M}_M\)	5	10	27.1
Qwen3-0.6B	\(\mathcal{M}_E\)	9	1	20.6
Qwen3-0.6B	\(\mathcal{M}_U\)	9	10	5.5

Table 4. Evidence-moving cosine analysis. Uniform training sharply reduces the peak-minus-lowest gap, especially for ModernBERT-base, but does not make all representation behavior perfectly flat.
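The three summary columns of Table 4 can be computed from a ten-point similarity profile as sketched below. The profile values are hypothetical, and measuring insertion offsets in sentence-like units is an assumption, since the digest does not specify how the ten positions are placed.

```python
def insertion_offsets(n_units, n_positions=10):
    """Offsets (in units such as sentences) spread uniformly from start to end.

    Assumption: the first position is the very start of the document and
    the last position is the very end.
    """
    return [round(k * n_units / (n_positions - 1)) for k in range(n_positions)]


def summarize(similarities):
    """Return (peak position, lowest position, gap x 1000), positions 1-indexed."""
    indices = range(len(similarities))
    peak = max(indices, key=lambda i: similarities[i]) + 1
    lowest = min(indices, key=lambda i: similarities[i]) + 1
    gap_x1000 = (max(similarities) - min(similarities)) * 1000
    return peak, lowest, gap_x1000


# Hypothetical query-document cosine similarities for positions 1 to 10.
profile = [0.721, 0.715, 0.712, 0.710, 0.708, 0.706, 0.704, 0.702, 0.700, 0.701]
peak, lowest, gap = summarize(profile)
print(peak, lowest, round(gap, 1))
```

A begin-trained model would be expected to show a profile like this one, with the peak at position 1. The table's point is that the peak moves with the training configuration while the gap shrinks under uniform training.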

The document-only analysis tells the same story. In Figure 4, ModernBERT-base and Qwen3-0.6B show only mild pretrained tendencies before retrieval fine-tuning, then shift toward the position emphasized during fine-tuning. Figure 5 gives the broader all-model view.

**Figure 4. Document-segment similarity for ModernBERT-base and Qwen3-0.6B.** The original caption describes mean cosine similarity between full-document embeddings and segment embeddings \(p_1\)-\(p_{10}\). I include it because it is the clearest representation-level visualization in the main text.

**Figure 5. All-model document-segment similarity.** This appendix figure shows full-document and segment embedding similarity for all eight models. It is included because it shows that model-specific residual tendencies remain, even though fine-tuning often redirects the representation profile.

Finally, Figure 6 checks whether the effect is an artifact of a single pooling choice. For ModernBERT-base, the paper trains the same four positional distributions with CLS, mean, max, and last-token pooling. Pooling affects absolute performance, but the direction of position preference still follows the training-position distribution.

**Figure 6. ModernBERT-base pooling ablation.** The original caption reports position-wise nDCG@10 under four pooling strategies. I include it because it supports the claim that the observed direction effect is not merely a pooling artifact.

Practical Takeaways

For retriever training, evidence-position balance is a practical mitigation lever. The strongest result is not that one architecture is robust, but that balanced evidence positions consistently reduce PSI across diverse retrievers.
Standard retrieval benchmarks can hide or even reward position bias. If a benchmark's relevant evidence tends to start early, begin-trained models can look better without being robust to later evidence.
The result is most relevant to RAG pipelines that retrieve long documents or passages with non-front-loaded answers. Data curation should check where evidence appears, not only what topic or domain the text covers.
The paper's causal claim is strong inside its controlled synthetic setup, but it does not prove that physical position is the only cause. Segment content, discourse role, generated-query semantics, and difficulty can still be entangled.
The training data is filtered by rerankers and audited by an LLM, not human annotated. The audit is useful evidence, but it is still model-based validation.
The experiments are single-seed, omit hard-negative mining, and evaluate retrieval-level behavior rather than end-to-end RAG. Follow-up work should test multilingual, domain-specific, human-validated, and downstream QA settings.

The limitations section explicitly cautions that beginning-, middle-, and end-targeted examples use different target segments and generated queries. It also notes residual labeling errors, possible verifier-induced biases, single-seed training, limited hyperparameter exploration, and no end-to-end production retrieval evaluation. Those caveats are important: the paper identifies training-position distribution as a major controllable factor, not as the only factor behind dense-retriever position bias.