arXiv 2026 | avg 3.67 | interest 5.50 | 15 HF | retrieval explainability | representation analysis

Xetrieval explains dense retrieval decisions at the embedding level by internalizing reasoning into sentence embeddings and decomposing them into sparse, human-interpretable features. Aggregating feature overlaps across document-side views yields retrieval explanations, with experiments showing coherent features, stronger intervention effects, and task-level feature steering.

Source-first digest for checked paper rank 39, rank_id p016.

Motivation / Background

Dense retrievers rank documents by comparing high-dimensional query and document embeddings. That makes them effective, but it hides the reason a document scored highly: the system exposes a similarity value, not the latent factors that aligned the query and document. Figure 1 frames this opacity problem at the retrieval-result level.

Figure 1. Dense retrieval opacity. The original caption says dense retrieval offers limited insight into the rationales underlying individual retrieval results. I place it here because it defines the problem Xetrieval is built to solve.

Xetrieval attacks the problem at the embedding level. Instead of explaining results through only lexical overlap or a generated post-hoc rationale, it tries to decompose the same embedding space used by the retriever into sparse, named features. The overview in Figure 2 is the central map: a reasoning internalizer enriches document embeddings with CoT-like reasoning views, and a sparse autoencoder converts raw and reasoned embeddings into feature activations that can be shared, named, and intervened on.

Figure 2. Overview of Xetrieval. The reasoning internalizer injects reasoning-oriented signals into sentence embeddings, while the mechanistic explainer decomposes enriched embeddings into sparse, human-readable features for feature-level analysis and intervention.

The key promise is not that Xetrieval explains an embedding model's internal neural circuitry. The paper is explicit that the analysis is limited to the sentence embedding output layer. The practical claim is narrower and still useful: for a query-document pair, Xetrieval can identify sparse features co-activated by the query and document-side views, attach short natural-language hypotheses to those features, and use feature interventions to test whether the selected directions matter for retrieval scores.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
|---|---|---|---|
| C1 | Xetrieval gives embedding-level explanations of dense retrieval by decomposing query and document representations into shared sparse, human-readable factors. | 4 | framework overview, sparse explanation interface, feature explanation examples |
| C2 | The reasoning internalizer approximates CoT-enhanced document representations with a single feed-forward pass and usually recovers useful retrieval gains without autoregressive CoT generation. | 4 | internalizer objective, main NDCG table, efficiency figure |
| C3 | Reasoning-enriched embeddings are easier for the mechanistic explainer to decompose into useful sparse features than raw embeddings. | 4 | reasoning comparison, detection score distribution, SAE trade-off |
| C4 | TopK SAE at the reported sparsity setting is a reasonable explainer backbone because it balances reconstruction quality, mono-semanticity, and retrieval retention. | 4 | SAE trade-off, SAE training details |
| C5 | Xetrieval explanations are computationally practical compared with explicit CoT reasoning while staying competitive with CoT-enhanced retrieval on the reported scaling comparison. | 4 | efficiency figure, main NDCG table |
| C6 | The selected features are not only descriptive: erasing, retaining, amplifying, or suppressing them changes similarity scores and retrieval performance more strongly than baseline feature sets. | 4 | pair-level attribution, task-level steering, intervention details |
| C7 | The experimental evidence spans several retrieval tasks and retriever families, but the strongest numeric support is still based on sampled benchmark subsets and selected main-text retrievers. | 3 | benchmark statistics, main NDCG table, limitations |
| C8 | The paper's "mechanistic" scope is embedding-level rather than full circuit-level interpretability. | 5 | limitations |

Scores are support-from-paper scores, not independent reproduction scores. I cap broad deployment and mechanism claims below 5 when the evidence is convincing but limited by sampled corpora, output-layer analysis, or figure-only quantitative details.

Core Technical Idea

Xetrieval starts from a standard dense retrieval setup. A query encoder and document encoder produce embeddings:

$$ \mathbf{q} = E_Q(q)\in\mathbb{R}^m,\qquad \mathbf{z} = E_D(d)\in\mathbb{R}^m. $$

The retriever scores relevance with a dot product or cosine similarity:

$$ s(q,d)=\langle \mathbf{q}, \mathbf{z}\rangle \quad \text{or} \quad s(q,d)=\frac{\langle \mathbf{q},\mathbf{z}\rangle}{\|\mathbf{q}\|_2\|\mathbf{z}\|_2}. $$
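
The two scoring rules can be sketched in a few lines of plain Python. This is an illustration only: the function names and the toy vectors are assumptions, not part of the paper.

```python
import math

def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Inner product divided by the product of the L2 norms."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / norm

# Toy embeddings (made-up values): z points in the same direction as q.
q = [1.0, 2.0, 2.0]
z = [2.0, 4.0, 4.0]
dot_score = dot(q, z)
cos_score = cosine(q, z)
```

When embeddings are L2-normalized, the two scores coincide, so the choice only matters for retrievers that leave embeddings unnormalized.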

The explainer then maps the query and document-side representations into sparse codes:

$$ \mathbf{c}_q = g(\tilde{\mathbf{q}}), \qquad \mathbf{c}_d = g(\tilde{\mathbf{z}}). $$

Feature activations are thresholded:

$$ a_{q,j}=\mathbb{I}[c_{q,j}>\tau], \qquad a_{d,j}=\mathbb{I}[c_{d,j}>\tau]. $$

The local explanation is the set of jointly active features:

$$ \mathcal{O}(q,d)=\{j \mid a_{q,j}a_{d,j}=1\}, \qquad \mathcal{E}(q,d)=\{(j,h_j)\}_{j\in\mathcal{O}(q,d)}. $$

Here \(h_j\) is the natural-language hypothesis attached to feature \(j\). This is why the paper calls the explanation embedding-level: the explanation is derived from sparse factors in the retrieval representation space, not from token-level attention or a separately generated rationale.
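
The thresholding and overlap rules above can be sketched as follows, assuming the sparse codes have already been computed by the explainer. The function name `overlap_explanation`, the default threshold, and the toy hypotheses are hypothetical and chosen only for illustration.

```python
def overlap_explanation(c_q, c_d, hypotheses, tau=0.0):
    """Return [(j, h_j)] for features whose activation exceeds tau in both codes."""
    if len(c_q) != len(c_d):
        raise ValueError("query and document codes must have the same length")
    shared = [j for j, (cq, cd) in enumerate(zip(c_q, c_d)) if cq > tau and cd > tau]
    # Features without a stored hypothesis get a placeholder label.
    return [(j, hypotheses.get(j, "<unnamed feature>")) for j in shared]

# Toy sparse codes over five features (made-up values).
c_q = [0.0, 1.2, 0.0, 0.4, 0.0]
c_d = [0.7, 0.9, 0.0, 0.0, 0.3]
hypotheses = {1: "protein folding", 3: "lab protocol"}
explanation = overlap_explanation(c_q, c_d, hypotheses)
```

Only feature 1 is active on both sides here, so the explanation contains that single feature and its hypothesis.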

The second move is to enrich the document side with reasoning views. For each aspect \(t\in\{\textsc{Summary},\textsc{Purpose},\textsc{QA}\}\), a one-hidden-layer MLP maps a raw sentence embedding to a reasoning-enhanced embedding:

$$ \hat{\mathbf{z}}^{(t)}_i=\mathcal{R}_t(\mathbf{z}_i),\qquad \hat{\mathbf{z}}^{(t)}_i\in\mathbb{R}^m. $$

It is trained with MSE against embeddings of LLM-generated reasoning texts:

$$ \mathcal{L}_t = \mathbb{E}_i \left[ \left\| \mathcal{R}_t(\mathbf{z}_i)-\mathbf{z}^{(t)}_i \right\|_2^2 \right]. $$
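
A minimal NumPy sketch of one internalizer and its loss follows. The hidden width, the ReLU activation, the random initialization, and the toy sizes are all assumptions; the digest only states that the map is a one-hidden-layer MLP trained with MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
m, hidden, n = 8, 16, 4  # toy sizes, not the paper's

# Parameters of a one-hidden-layer MLP R_t: R^m -> R^m.
W1 = rng.normal(scale=0.1, size=(m, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, m))
b2 = np.zeros(m)

def internalizer(z):
    """Map raw embeddings (n, m) to reasoning-enhanced embeddings (n, m)."""
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2  # ReLU is an assumed choice

def internalizer_loss(z_raw, z_teacher):
    """Mean over documents of the squared L2 distance to the teacher embedding."""
    diff = internalizer(z_raw) - z_teacher
    return float(np.mean(np.sum(diff ** 2, axis=1)))

z_raw = rng.normal(size=(n, m))      # stand-in for raw document embeddings
z_teacher = rng.normal(size=(n, m))  # stand-in for embeddings of LLM-written reasoning text
loss = internalizer_loss(z_raw, z_teacher)
```

The loss here sums over embedding dimensions before averaging over documents, matching the squared-norm form of the objective; a per-dimension mean would differ only by a constant factor of m.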

At explanation time, the document is represented by multiple views:

$$ \mathcal{V}(d)=\{\mathbf{z}_d\}\cup\{\hat{\mathbf{z}}_d^{(t)}:t\in\mathcal{T}\}. $$

Xetrieval then aggregates query-overlap features across those views:

$$ O(q,d)= \left\{ j \mid a_{q,j}\cdot \max_{\mathbf{v}\in\mathcal{V}(d)}a_{\mathbf{v},j}=1 \right\}. $$

This view aggregation is the technical difference between direct sparse decomposition and Xetrieval. Direct decomposition only sees raw query and document embeddings; Xetrieval can expose relevance factors that are weak in the original document representation but salient after summary, purpose, or QA reasoning internalization.
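
The difference between direct decomposition and view aggregation can be sketched with sets. The helper names, the threshold, and the toy codes are illustrative assumptions.

```python
def active(code, tau=0.0):
    """Indices of features whose activation exceeds tau."""
    return {j for j, c in enumerate(code) if c > tau}

def aggregated_overlap(c_q, view_codes, tau=0.0):
    """Features active in the query and in at least one document-side view."""
    doc_active = set()
    for code in view_codes.values():
        doc_active |= active(code, tau)
    return sorted(active(c_q, tau) & doc_active)

# Toy sparse codes over four features (made-up values).
c_q = [0.9, 0.0, 0.5, 0.3]
views = {
    "raw":     [0.8, 0.0, 0.0, 0.0],
    "summary": [0.6, 0.2, 0.0, 0.0],
    "purpose": [0.0, 0.0, 0.7, 0.0],
    "qa":      [0.0, 0.0, 0.0, 0.0],
}
direct = aggregated_overlap(c_q, {"raw": views["raw"]})  # raw view only
full = aggregated_overlap(c_q, views)                    # raw plus reasoning views
```

In this toy case feature 2 is inactive in the raw document code but active in the purpose view, so it appears only in the aggregated overlap.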

Method Details

Training Data And Evaluation Scope

The reasoning internalizer is trained on StackExchange-derived documents. Table 1 shows the training-domain mix: 11,796 documents across politics, math, programming, science, robotics, economics, philosophy, sustainability, and related communities. The paper uses this corpus to generate aspect-specific teacher texts, embeds those texts with the same retriever, and trains three MLP internalizers for summary, purpose, and QA.

| Community | Docs | Community | Docs |
|---|---|---|---|
| politics | 1,000 | mathematica | 1,000 |
| codereview | 600 | economics | 600 |
| cs | 600 | chemistry | 600 |
| StackOverflow | 600 | ai | 600 |
| bioinformatics | 600 | codegolf | 600 |
| math | 600 | robotics | 600 |
| earthscience | 600 | mathoverflow | 600 |
| biology | 600 | philosophy | 600 |
| software-engineering | 600 | sustainability | 432 |
| computergraphics | 364 | | |
| Total | 11,796 | | |

Table 1. Reasoning-internalizer training domains. This table is recovered from main.flattened.tex because the Markdown table was malformed. It matters because the paper's reasoning views are not trained on one narrow domain.

Table 2 gives the sampled benchmark sizes used for retrieval evaluation. The main metric is NDCG@10.

| Dataset | Queries | Documents |
|---|---|---|
| BRIGHT | 1,384 | 12,000 |
| NQ | 8,383 | 8,383 |
| MuTual | 846 | 3,542 |
| TREC-NEWS | 57 | 9,968 |
| Signal-1M | 97 | 10,000 |
| Robust04 | 249 | 15,790 |
| ArguAna | 1,406 | 8,674 |

Table 2. Sampled benchmark statistics. This table is included because it bounds the scale and task diversity behind the reported retrieval gains.
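
For readers unfamiliar with the metric, NDCG@10 can be sketched as below. This uses the common linear-gain form with a log2 rank discount; whether the paper's evaluation uses linear or exponential gain is not stated in this digest, so treat that choice as an assumption.

```python
import math

def dcg(rels):
    """Discounted cumulative gain: rel / log2(rank + 1), ranks starting at 1."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10, all_rels=None):
    """NDCG@k for one query.

    ranked_rels: relevance grades in the order the retriever returned them.
    all_rels: grades of every judged document, used for the ideal ranking;
              defaults to ranked_rels when nothing else is known.
    """
    pool = all_rels if all_rels is not None else ranked_rels
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# One relevant document returned at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0])
```

The benchmark-level number is the mean of this per-query score over all queries in the sampled set.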

Mechanistic Explainer

The mechanistic explainer is a sparse autoencoder. Given an embedding \(\mathbf{x}\), the SAE encoder creates a sparse code and the decoder reconstructs the embedding:

$$ \mathbf{c}=g(\mathbf{x}),\qquad \tilde{\mathbf{x}}=W\mathbf{c}+\mathbf{b}. $$

The columns of \(W\) act as learned feature directions. The training objective combines reconstruction and sparsity:

$$ \mathcal{L} = \mathbb{E}_{\mathbf{x}} \left[ \|\mathbf{x}-(Wg(\mathbf{x})+\mathbf{b})\|_2^2 +\lambda\,\Omega(g(\mathbf{x})) \right]. $$

The paper evaluates ReLU, TopK, BatchTopK, Gated, JumpReLU, P-Annealing, and GatedAnnealing SAE variants. The trade-off in Figure 3 is central: a lower \(L_0\) (fewer active features per embedding) gives cleaner mono-semantic features but worse reconstruction and retrieval retention, while higher active-feature counts recover more retrieval behavior at the cost of interpretability. The authors choose TopK with \(k=256\).

Figure 3. SAE evaluation. The original caption compares SAE variants across sparsity levels \(L_0\), reconstruction error, mono-semanticity, and retrieval retention. I place it here because it is the evidence for choosing TopK-SAE as the explainer backbone.
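
A TopK SAE forward pass can be sketched in NumPy as below. The linear-plus-ReLU encoder, the random weights, and the toy sizes are assumptions made for illustration; the digest specifies only the decoder form \(W\mathbf{c}+\mathbf{b}\) and the choice of TopK with \(k=256\).

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, k):
    """Sparse code that keeps the k largest non-negative pre-activations."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)
    code = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]  # indices of the k largest entries
    code[idx] = pre[idx]
    return code

def decode(code, W, b):
    """Reconstruct the embedding; columns of W are the feature directions."""
    return W @ code + b

rng = np.random.default_rng(0)
m, n_features, k = 16, 64, 4  # toy sizes; the paper reports k=256
W_enc = rng.normal(size=(m, n_features))
b_enc = np.zeros(n_features)
W = rng.normal(size=(m, n_features))
b = np.zeros(m)

x = rng.normal(size=m)
code = topk_encode(x, W_enc, b_enc, k)
x_hat = decode(code, W, b)
recon_error = float(np.sum((x - x_hat) ** 2))
```

With a hard TopK constraint the number of active features is bounded by k directly, so the sparsity penalty \(\Omega\) in the objective above is typically unnecessary for this variant.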

Feature Naming And Quality Checks

Feature descriptions are generated by taking top-activating samples for a sparse feature and asking an LLM to summarize the common semantic pattern into a short hypothesis. The paper evaluates these hypotheses with an intruder-detection style Detection Score: an LLM receives activating and non-activating examples and judges whether each conforms to the feature hypothesis.

Figure 4 and Figure 5 are the main evidence that reasoning internalization helps explanation quality. Reasoned embeddings have lower reconstruction error and more active features under the same sparsity controls, and the final Xetrieval explainer has a stronger detection-score distribution than random or raw SAE baselines.

Figure 4. Raw versus reasoned embeddings. The original caption compares reconstruction error and active-feature counts between raw and reasoned embeddings. It supports the claim that internalized reasoning makes sparse decomposition richer and easier.
Figure 5. Detection score distribution. The original caption compares Raw SAE, Random SAE, and the Mechanistic Explainer with kernel density estimates. It supports the claim that the generated feature hypotheses are more distinguishable for Xetrieval.

Experiments And Results

Retrieval Benefit Of Internalized Reasoning

Table 3 is the main recovered NDCG@10 table for DeepSeek-V3-powered reasoning. The most useful pattern is that the reasoning internalizer usually improves over the unenhanced retriever and often recovers a large part of the explicit CoT reasoner's gain. The exceptions matter: for Qwen3-4B, the baseline average is 69.2 while the internalizer average is 68.7 and CoT average is 68.8, so "reasoning helps" is not uniformly true for every strong base retriever and every benchmark.

| Retriever | Enhancement | BRIGHT | NQ | MuTual | TREC | Signal1M | Robust04 | ArguAna | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| gte-base | None | 37.0 | 81.0 | 28.8 | 92.2 | 73.8 | 77.1 | 41.7 | 61.7 |
| gte-base | Reasoning Internalizer | 39.0 | 80.8 | 29.6 | 92.3 | 74.2 | 80.2 | 40.9 | 62.4 |
| gte-base | CoT Reasoner | 43.8 | 83.3 | 30.3 | 93.4 | 74.6 | 84.0 | 41.7 | 64.4 |
| e5-large | None | 31.5 | 83.3 | 47.1 | 90.4 | 66.8 | 77.3 | 34.2 | 61.5 |
| e5-large | Reasoning Internalizer | 37.9 | 84.2 | 46.5 | 90.3 | 70.3 | 81.1 | 39.2 | 64.2 |
| e5-large | CoT Reasoner | 43.8 | 86.3 | 47.0 | 92.8 | 72.0 | 82.1 | 41.3 | 66.5 |
| qwen3-4b | None | 51.2 | 84.0 | 45.2 | 92.3 | 74.1 | 87.0 | 50.7 | 69.2 |
| qwen3-4b | Reasoning Internalizer | 51.7 | 83.5 | 44.9 | 91.9 | 72.8 | 87.1 | 49.3 | 68.7 |
| qwen3-4b | CoT Reasoner | 54.8 | 84.6 | 45.8 | 92.9 | 73.2 | 86.7 | 43.8 | 68.8 |
| snowflake | None | 34.8 | 48.1 | 36.2 | 22.5 | 64.8 | 24.1 | 37.2 | 38.3 |
| snowflake | Reasoning Internalizer | 38.8 | 68.9 | 36.3 | 64.9 | 67.9 | 42.7 | 38.6 | 51.2 |
| snowflake | CoT Reasoner | 44.0 | 74.2 | 33.0 | 77.6 | 67.4 | 46.0 | 40.5 | 54.7 |

Table 3. NDCG@10 (%) of dense retrievers under different enhancements. This table is recovered from main.flattened.tex; paper.md contained an empty converted table block. The main text states that the reasoning internalizer and CoT reasoner are powered by DeepSeek-V3.
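
As a quick consistency check on the qwen3-4b exception, the reported averages can be recomputed from the seven per-benchmark scores in Table 3. The snippet assumes the Avg. column is an unweighted mean over the seven benchmarks, which the recomputation bears out.

```python
# Per-benchmark NDCG@10 (%) for qwen3-4b, copied from Table 3 in column order:
# BRIGHT, NQ, MuTual, TREC, Signal1M, Robust04, ArguAna.
qwen_rows = {
    "None": [51.2, 84.0, 45.2, 92.3, 74.1, 87.0, 50.7],
    "Reasoning Internalizer": [51.7, 83.5, 44.9, 91.9, 72.8, 87.1, 49.3],
    "CoT Reasoner": [54.8, 84.6, 45.8, 92.9, 73.2, 86.7, 43.8],
}

# Unweighted mean over benchmarks, rounded to one decimal like the table.
averages = {name: round(sum(v) / len(v), 1) for name, v in qwen_rows.items()}
```

The recomputed means match the table, and most of the CoT deficit relative to the baseline comes from the ArguAna column (43.8 versus 50.7).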

Explanation Efficiency

Xetrieval's main efficiency claim is that it avoids per-document autoregressive CoT generation. The CoT reasoner generates reasoning text for documents, while the internalizer runs an MLP over cached embeddings. Figure 6 is the paper's scaling evidence: as the Biology subset size increases, explicit CoT explanation time grows substantially, while Xetrieval's stays nearly flat; the paired retrieval plot shows Xetrieval staying above the base retriever and close to CoT-enhanced retrieval as candidate size changes.

Figure 6. Explanation efficiency and retrieval trend. The original caption compares explanation-time trends between the CoT reasoner and Xetrieval, and compares retrieval performance trends between the base retriever, CoT reasoner, and Xetrieval on the BRIGHT Biology subset.

Local Attribution

The pair-level intervention asks whether the selected features are actually tied to a query-document similarity score. For a query-document pair, the paper compares three feature sets: direct-decomposition overlaps, Xetrieval overlaps across reasoned document views, and non-overlap active features as a control. It then intervenes on the original document embedding, either erasing the selected feature span or retaining only that span.

The appendix describes the projection used for these interventions. For feature set \(S\), decoder directions \(W_S\), decoder bias \(\mathbf{b}\), and \(\lambda=10^{-6}\):

$$ P_S(\mathbf{z}_d-\mathbf{b}) = W_S(W_S^\top W_S+\lambda I)^{-1}W_S^\top(\mathbf{z}_d-\mathbf{b}). $$

The two edited document embeddings are:

$$ \mathbf{z}_d^{\setminus S} = \mathbf{z}_d - P_S(\mathbf{z}_d-\mathbf{b}), \qquad \mathbf{z}_d^{S} = \mathbf{b}+P_S(\mathbf{z}_d-\mathbf{b}). $$
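
The projection and the two edits can be sketched in NumPy as follows. The decoder matrix, bias, embedding, and feature set are random stand-ins, and the function names are hypothetical; only the formulas and \(\lambda=10^{-6}\) come from the text above.

```python
import numpy as np

def project_onto_features(z, W, b, S, lam=1e-6):
    """Ridge-regularized projection of (z - b) onto the span of decoder columns S."""
    W_S = W[:, S]
    gram = W_S.T @ W_S + lam * np.eye(len(S))
    return W_S @ np.linalg.solve(gram, W_S.T @ (z - b))

def erase(z, W, b, S, lam=1e-6):
    """Remove the selected feature span from the document embedding."""
    return z - project_onto_features(z, W, b, S, lam)

def retain(z, W, b, S, lam=1e-6):
    """Keep only the selected feature span, plus the decoder bias."""
    return b + project_onto_features(z, W, b, S, lam)

rng = np.random.default_rng(0)
m, n_features = 8, 32          # toy sizes
W = rng.normal(size=(m, n_features))
b = rng.normal(scale=0.1, size=m)
z_d = rng.normal(size=m)
S = [3, 7, 11]                 # hypothetical selected feature indices

z_erased = erase(z_d, W, b, S)
z_retained = retain(z_d, W, b, S)
```

Solving the regularized Gram system with `np.linalg.solve` avoids forming an explicit inverse, and the small \(\lambda\) keeps the system well conditioned when selected decoder directions are nearly collinear. The edited embeddings would then be scored against the query embedding to measure the change in similarity.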

Figure 7 reports the intervention effect. Erasing Xetrieval-selected features produces the largest similarity drop; retaining only those features preserves or increases similarity more effectively than direct decomposition. The non-overlap controls behave less consistently and can even increase score when erased, which supports the interpretation that they are query-irrelevant or distracting document factors.

Figure 7. Pair-level document-side interventions. The original caption reports cosine-similarity changes after erasing or retaining selected feature spans for Xetrieval, direct decomposition, and non-overlap active features.

Task-Level Steering

For global feature utility, the paper defines a Retrieval Utility Score (RUS). A feature's co-activation indicator is

$$ I_j(q,d)=a_{q,j}a_{d,j}. $$

The paper scores features by how often they co-activate on positive pairs minus negative pairs:

$$ \mathrm{RUS}(f_j)= \sum_{(q,d)\in\mathcal{D}_{pos}}I_j(q,d) - \sum_{(q,d)\in\mathcal{D}_{neg}}I_j(q,d). $$

The top RUS features form a key set. During steering, selected activations are amplified with \(\alpha>1\) or suppressed with \(\alpha<1\), then retrieval is evaluated on BRIGHT, ArguAna, and NQ. Figure 8 shows that amplifying key features improves retrieval while suppressing them hurts, and that Xetrieval-selected features have stronger steering effects than direct raw-SAE features.

Figure 8. Task-level feature steering. The original caption reports retrieval results when steering key and non-key features identified by basic SAE and Xetrieval.
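
The RUS count and the activation scaling used for steering can be sketched in plain Python. Representing each pair by its sets of active feature indices, the function names, and the toy data are assumptions; how the steered code is mapped back into the retrieval embedding is not detailed in this digest and is left out.

```python
def rus_scores(pos_pairs, neg_pairs, n_features):
    """Co-activation count on positive pairs minus negative pairs, per feature.

    Each pair is (query_active_set, doc_active_set) of feature indices.
    """
    scores = [0] * n_features
    for sign, pairs in ((1, pos_pairs), (-1, neg_pairs)):
        for q_active, d_active in pairs:
            for j in q_active & d_active:
                scores[j] += sign
    return scores

def steer(code, key_features, alpha):
    """Scale selected activations: alpha > 1 amplifies, alpha < 1 suppresses."""
    return [c * alpha if j in key_features else c for j, c in enumerate(code)]

# Toy relevance data over four features (made-up values).
pos = [({0, 1}, {1, 2}), ({1, 3}, {1, 3})]
neg = [({0, 3}, {3}), ({2}, {2})]
scores = rus_scores(pos, neg, n_features=4)

# Take the single highest-RUS feature as the key set and amplify it.
key = {max(range(4), key=scores.__getitem__)}
steered = steer([0.5, 1.0, 0.0, 0.2], key, alpha=2.0)
```

Feature 3 co-activates on one positive and one negative pair, so its score cancels to zero, while feature 1 co-activates only on positives and becomes the key feature.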

What The Evidence Supports

The strongest evidence is intervention-based. If the feature set were only a plausible label set, erasing or retaining it would not necessarily move the original embedding similarity in the predicted direction. Figure 7 and Figure 8 therefore do important work: they connect the chosen feature directions to local similarity and task-level ranking behavior.

The retrieval-benefit evidence is useful but less clean. Table 3 shows large gains for weak or mid-strength retrievers such as Snowflake and e5-large, but it also shows the internalizer slightly underperforming the strong qwen3-4b baseline on average. This does not invalidate the explainer, but it means the digest should separate two claims: Xetrieval's reasoning views often help retrieval and explanation, while retrieval-score improvement is not guaranteed for every already-strong embedder.

Practical Takeaways

The paper's limitation statement says the analysis is confined to the sentence embedding level, specifically the output layer of the embedding model. It also says SAE decomposition offers limited fidelity and granularity compared with stronger mechanisms such as transcoders. That limitation should travel with any downstream use of Xetrieval explanations: they are useful sparse evidence surfaces, not definitive causal accounts of the full retriever.