Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Motivation / Background

This paper asks whether VLM spatial-reasoning accuracy reflects structured 3D understanding or shortcut use from natural image statistics. The target shortcut is simple: in many photographs, objects that are farther away on the ground plane also appear higher in the image. A VLM can therefore answer "farther" questions by leaning on vertical image position instead of representing distance as its own spatial axis.

The authors call this failure mode vertical-distance entanglement. It matters for embodied and robotic settings because a model that appears good on standard spatial benchmarks may still fail when the natural correlation between "higher in the image" and "farther in 3D" is weakened, reversed, or synthetically controlled. Figure 1 summarizes the paper's thesis: many VLMs follow the perspective shortcut, while stronger spatial models separate axes and remain correct on counter-heuristic cases.

**Figure 1. Paper overview.** The original figure states that many VLMs confuse 2D vertical position with 3D distance, fail on counter-heuristic examples, and can be diagnosed with `SpatialTunnel` plus contrastive probing. It is the clearest high-level statement of the paper's motivation and contribution.

The geometric reason the shortcut exists is the standard ground-plane projection relation. For a camera at height \(H_c\) with focal length \(f\), a ground-plane point at depth \(Z\) projects to:

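\[ v_{\mathrm{img}}(Z)=\frac{fH_c}{Z}>0. \]

Here \(v_{\mathrm{img}}\) is read as the vertical offset of the projected point from the horizon line. As \(Z\) grows, \(v_{\mathrm{img}}\) shrinks toward zero, so farther ground-plane objects sit closer to the horizon and therefore higher in the image. The paper's core concern is that VLMs often internalize this correlation as if it were the depth relation itself.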
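A minimal numeric sketch of the projection relation above, reading \(v_{\mathrm{img}}\) as an offset measured from the horizon. The function name and the focal length and camera height values are illustrative assumptions, not values from the paper.

```python
def ground_point_image_offset(depth_z, focal_length, camera_height):
    """Offset of a ground-plane point from the horizon: f * H_c / Z."""
    if depth_z <= 0:
        raise ValueError("depth must be positive")
    return focal_length * camera_height / depth_z


# Assumed illustrative camera: f = 800 px, H_c = 1.5 m.
depths = (2.0, 4.0, 8.0, 16.0)
offsets = [ground_point_image_offset(z, 800.0, 1.5) for z in depths]
print(offsets)  # [600.0, 300.0, 150.0, 75.0]
```

Each doubling of depth halves the offset, so the projected point moves toward the horizon, which is the "higher means farther" regularity the shortcut exploits.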

Claims And Evidence

Claim id	Main claim	Support score	Evidence anchors
C1	VLMs across multiple model families exhibit vertical-distance entanglement: they do better when the true depth relation agrees with the "higher means farther" heuristic than when it contradicts it.	5	problem overview, consistent/counter setup, real benchmark split, real accuracy, SpatialTunnel results
C2	Existing real-image depth benchmarks are skewed toward perspective-consistent examples, so aggregate benchmark accuracy can overstate robust 3D reasoning.	5	distribution table, benchmark-performance table
C3	`SpatialTunnel` isolates this shortcut by decoupling vertical image position from depth while keeping depth ordering fixed.	4	SpatialTunnel design, SpatialTunnel table, Molmo heatmaps
C4	Spatial fine-tuning can improve average accuracy while still leaving distance representations weak or entangled; data scale alone is not a guarantee.	4	real accuracy, SpatialTunnel table, axis coherence
C5	Contrastive probing gives a representation-level diagnostic: higher distance coherence and lower VD-EI align with better counter-heuristic robustness.	4	probing method, internal analysis, PCA, coherence table
C6	Stronger spatial models such as RoboRefer and Qwen3-VL-235B have more separated spatial axes and stronger cross-benchmark performance.	4	spatial benchmarks, PCA, coherence table
C7	Shortcut reliance is not limited to vertical image position; apparent size can create a similar depth cue failure.	3	object-size extension

Scores are support-from-paper scores, not independent reproduction scores. C3-C6 are capped below 5 because the evidence is strong within the paper but still diagnostic/correlational rather than causal proof of the full training mechanisms.

Core Technical Idea

The paper combines two diagnostics:

A behavioral split that classifies depth questions as perspective-consistent or counter-heuristic depending on whether the farther object appears higher in the image.
A representation probe that uses swapped question pairs to test whether the model's hidden-state deltas form separable horizontal, vertical, and distance axes.

The behavioral split asks whether the shortcut appears in outputs. The probing method asks whether the shortcut is geometrically present in representation space. This distinction is important: two models can have similar benchmark accuracy but very different internal spatial structure.

The consistent/counter categorization is visualized in Figure 2. This figure is referenced by the main text as the operational definition behind the real-benchmark and SpatialTunnel splits.

**Figure 2. Consistent vs. counter examples.** A consistent example has the farther object higher in the image; a counter example has the farther object lower. The paper computes the label by comparing vertical centers of the queried objects in pixel space.

Method Details

Model And Data Setup

The experiments cover Molmo-7B-O-0924, NVILA-Lite-2B, Qwen2.5-VL-3B-Instruct, RoboRefer-2B-SFT, and Qwen3-VL-235B-A22B-Instruct. For Molmo, NVILA, and Qwen2.5-VL, the authors train spatial fine-tuning variants at 80k, 400k, 800k, and 2M samples. The sampled spatial data mix comes from SAT, RoboSpatial, SPAR-7M, RefSpatial, and PRISM. RoboRefer is treated as a depth/spatially supervised reference model sharing the NVILA base family, and Qwen3-VL-235B is treated as a very large-scale reference model.

Consistent-Counter Split

For real-image benchmarks, each depth-related example is grouped as:

Consistent: the farther object has a smaller image \(y\)-coordinate, meaning it is higher in the image.
Counter: the farther object appears lower.
Ambiguous: the vertical center difference is too small to support a clean split.

The split is a behavioral test for whether models rely on the elevation cue. If a model truly reasons about 3D depth, consistent and counter accuracy should be close. Large positive gaps imply shortcut dependence.
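The three-way split above can be sketched as a small labeling function. This is an illustration rather than the authors' code: `classify_pair` and its parameters are hypothetical names, the 5% threshold is taken from the ambiguous-row definition in Table 1, and treating the boundary as a strict inequality is an assumption.

```python
def classify_pair(y_far, y_near, image_height, ambiguous_frac=0.05):
    """Label one depth example as consistent, counter, or ambiguous.

    y_far and y_near are vertical pixel centers of the farther and nearer
    object, with y = 0 at the top of the image (smaller y means higher).
    """
    diff = y_near - y_far  # positive when the farther object is higher
    if abs(diff) < ambiguous_frac * image_height:
        return "ambiguous"
    return "consistent" if diff > 0 else "counter"


print(classify_pair(100, 300, 480))  # consistent
print(classify_pair(300, 100, 480))  # counter
print(classify_pair(200, 210, 480))  # ambiguous
```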

SpatialTunnel

SpatialTunnel is a synthetic Blender benchmark built around a single-point-perspective corridor. Two objects are placed at fixed depths while their angular positions on the tunnel cross-section are swept independently. This creates a \(16 \times 16\) grid of positions where vertical image location changes without changing the true depth order. Figure 3 shows this design.

For each image, the model answers binary depth questions such as whether one object is closer or farther than the other. The probability-based scoring rule extracts the first-token logits for Yes and No:

\[ p=\sigma(\ell_{\texttt{Yes}}-\ell_{\texttt{No}}). \]

The correctness score is \(v=p\) when the ground-truth answer is Yes, and \(v=1-p\) when it is No. Reported metrics are mean correctness \(v\), consistent correctness \(v_{\text{cons}}\), counter correctness \(v_{\text{ctr}}\), and the shortcut gap:

\[ \Delta = v_{\text{cons}} - v_{\text{ctr}}. \]
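The scoring rule and the shortcut gap can be sketched in a few lines. The function names, the record layout, and the toy logits are assumptions for illustration; only the formulas for \(p\), \(v\), and \(\Delta\) come from the text above.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Numerically stable sigmoid of the Yes/No logit difference."""
    x = logit_yes - logit_no
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def correctness(logit_yes, logit_no, answer_is_yes):
    """v = p if the ground truth is Yes, else 1 - p."""
    p = yes_probability(logit_yes, logit_no)
    return p if answer_is_yes else 1.0 - p


def shortcut_gap(records):
    """records: (logit_yes, logit_no, answer_is_yes, split) tuples,
    where split is 'consistent' or 'counter'."""
    scores = {"consistent": [], "counter": []}
    for logit_yes, logit_no, answer_is_yes, split in records:
        scores[split].append(correctness(logit_yes, logit_no, answer_is_yes))
    v_cons = sum(scores["consistent"]) / len(scores["consistent"])
    v_ctr = sum(scores["counter"]) / len(scores["counter"])
    return v_cons, v_ctr, v_cons - v_ctr


# Toy logits (made up): confident on the consistent item, undecided on the counter item.
toy = [(2.0, 0.0, True, "consistent"), (0.0, 0.0, True, "counter")]
v_cons, v_ctr, gap = shortcut_gap(toy)
print(v_ctr)  # 0.5
```

Because \(v\) is a probability rather than a hard 0/1 outcome, a model that always hedges at equal logits scores 0.5 on both splits and shows no gap.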

Contrastive Probing

Figure 6 summarizes the representation-level probe. For each VQA example, the authors build a minimally swapped question pair, such as "Is A left of B?" vs. "Is B left of A?", so the ground-truth spatial relation flips while the image and object identities remain fixed. They extract the final-token hidden state at an intermediate layer and compute the delta vector:

\[ \delta = h_{q_2} - h_{q_1}. \]

**Figure 6. Contrastive probing.** The figure shows how swapped question pairs produce hidden-state displacement vectors that isolate relational direction in embedding space.

For each spatial axis, deltas from opposing categories are sign-corrected so they point in a canonical direction:

\[ \tilde{\delta}^{(i)} = \begin{cases} \delta^{(i)} & \text{if category is canonical, e.g. far},\\ -\delta^{(i)} & \text{if category is opposite, e.g. close}. \end{cases} \]

Axis coherence is then the mean pairwise cosine similarity over the sign-corrected set:

\[ \mathrm{Coh}_{\mathrm{axis}} = \frac{2}{N(N-1)} \sum_{i < j}\cos(\tilde{\delta}^{(i)}, \tilde{\delta}^{(j)}). \]
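Sign correction and axis coherence can be sketched with plain lists standing in for hidden-state deltas. The helper names and the 2-D toy vectors are assumptions; real deltas would be high-dimensional and are assumed here to be nonzero.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sign_correct(deltas, categories, canonical):
    """Flip deltas from the opposite category so all point the canonical way."""
    return [d if c == canonical else [-x for x in d]
            for d, c in zip(deltas, categories)]


def axis_coherence(corrected):
    """Mean pairwise cosine over all i < j, i.e. 2 / (N (N - 1)) times the sum."""
    pairs = list(combinations(corrected, 2))
    if not pairs:
        raise ValueError("need at least two delta vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy deltas (made up): two agree after sign correction, one is orthogonal.
deltas = [[1, 0], [-1, 0], [0, 1]]
categories = ["far", "close", "far"]
coh = axis_coherence(sign_correct(deltas, categories, canonical="far"))
print(coh)  # 1/3: one aligned pair out of three
```

A coherence near 1 means the swapped-question deltas share one direction; a value near 0 means the model has no stable direction for that relation.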

The vertical-distance entanglement index compares perspective-aligned and perspective-opposing mean deltas:

\[ \mathrm{VD\text{-}EI} = \tfrac{1}{4}\left[ \cos(\mu_{\text{above}},\mu_{\text{far}}) +\cos(\mu_{\text{below}},\mu_{\text{close}}) -\cos(\mu_{\text{above}},\mu_{\text{close}}) -\cos(\mu_{\text{below}},\mu_{\text{far}}) \right]. \]

Positive VD-EI means the hidden representation aligns above with far and below with close.
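A small sketch of the index on toy vectors shows its two extremes. It assumes each \(\mu\) is the plain mean of the raw (not sign-corrected) deltas in that category, which the text above does not spell out; the helper names and the 2-D vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def vd_ei(above, below, far, close):
    """Vertical-distance entanglement index from per-category delta lists."""
    mu_a, mu_b = mean_vector(above), mean_vector(below)
    mu_f, mu_c = mean_vector(far), mean_vector(close)
    return 0.25 * (cosine(mu_a, mu_f) + cosine(mu_b, mu_c)
                   - cosine(mu_a, mu_c) - cosine(mu_b, mu_f))


# Fully entangled toy case: "above" and "far" share one direction.
entangled = vd_ei(above=[[1.0, 0.0]], below=[[-1.0, 0.0]],
                  far=[[1.0, 0.0]], close=[[-1.0, 0.0]])
# Disentangled toy case: vertical and distance axes are orthogonal.
separated = vd_ei(above=[[0.0, 1.0]], below=[[0.0, -1.0]],
                  far=[[1.0, 0.0]], close=[[-1.0, 0.0]])
print(entangled, separated)
```

In the first case the index reaches its maximum of 1; in the second every cosine term vanishes and the index is 0, which is the target for a model that keeps the two axes separate.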

Experiments And Results

Real Benchmark Evidence

Table 1 shows that real-image benchmarks are dominated by perspective-consistent examples. This supports C2: a model can score well on aggregate depth accuracy while repeatedly failing the rarer counter cases.

Type	EmbSpatial-Bench	CV-Bench-3D	Definition
Consistent	976 (80.9%)	363 (60.5%)	Ground truth aligns with heuristic
Counter	129 (10.7%)	65 (10.8%)	Ground truth contradicts heuristic
Ambiguous	101 (8.4%)	172 (28.7%)	Vertical difference below 5% of image height

Table 1. Distribution of consistent, counter, and ambiguous examples. The skew toward consistent examples mirrors natural perspective statistics.
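To make the skew concrete, the sketch below computes what a hypothetical policy that always answers "higher means farther" would score on the non-ambiguous examples. This baseline is not reported in the source; it is plain arithmetic on the Table 1 counts, and the names are illustrative.

```python
# Counts copied from Table 1.
counts = {
    "EmbSpatial-Bench": {"consistent": 976, "counter": 129, "ambiguous": 101},
    "CV-Bench-3D": {"consistent": 363, "counter": 65, "ambiguous": 172},
}


def heuristic_accuracy(c):
    """Accuracy of an always-'higher is farther' policy, ambiguous items excluded."""
    return c["consistent"] / (c["consistent"] + c["counter"])


for name, c in counts.items():
    print(f"{name}: {heuristic_accuracy(c):.1%}")
# EmbSpatial-Bench: 88.3%
# CV-Bench-3D: 84.8%
```

Under this assumption, a pure shortcut with no depth representation at all would already answer most non-ambiguous questions correctly, which is why the consistent/counter breakdown matters more than the aggregate.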

Table 2 is the main real-image evidence for C1. Every listed model has lower counter accuracy than consistent accuracy on both benchmarks. The most striking example is Qwen2.5-VL-3B with 2M spatial samples: 60.9% consistent vs. 24.0% counter on EmbSpatial-Bench, a gap of 36.9 points.

Model	EmbSpatial Consistent	EmbSpatial Counter	CV-3D Consistent	CV-3D Counter
Molmo-7B-O-0924	63.5	34.9	93.1	75.4
+ 80k	60.6	29.5	80.2	56.9
+ 400k	62.7	27.1	89.5	56.9
+ 800k	65.2	34.1	88.7	70.8
+ 2M	65.3	39.5	90.6	72.3
NVILA-Lite-2B	49.0	27.1	74.4	40.0
+ 80k	57.7	15.5	71.6	50.8
+ 400k	61.1	34.1	81.3	58.5
+ 800k	63.2	38.8	84.6	67.7
+ 2M	60.7	41.1	97.2	93.8
RoboRefer-2B-SFT	87.0	59.7	98.9	95.4
Qwen2.5-VL-3B	54.7	32.6	75.5	55.4
+ 80k	50.6	30.2	69.7	60.0
+ 400k	52.6	27.1	65.8	58.5
+ 800k	55.8	26.4	61.2	58.5
+ 2M	60.9	24.0	62.0	53.8
Qwen3-VL-235B	73.3	41.7	98.1	90.8

Table 2. Accuracy on consistent vs. counter examples. Values are accuracies from depth-related examples in EmbSpatial-Bench and CV-Bench-3D. The gap persists across architectures, scales, and fine-tuning levels.

SpatialTunnel Evidence

Table 3 confirms the same pattern in a controlled synthetic environment. Because SpatialTunnel balances the geometry and decouples vertical position from depth, the positive \(\Delta\) values are stronger evidence that the shortcut is model-intrinsic rather than merely a property of evaluation-set skew.

Model	\(v\)	\(v_{\text{cons}}\)	\(v_{\text{ctr}}\)	\(\Delta\)
Molmo-7B-O-0924	0.528	0.565	0.487	+0.078
+ 80k	0.496	0.507	0.486	+0.021
+ 400k	0.501	0.593	0.409	+0.184
+ 800k	0.531	0.628	0.430	+0.198
+ 2M	0.666	0.703	0.630	+0.073
NVILA-Lite-2B	0.488	0.504	0.471	+0.033
+ 80k	0.499	0.562	0.438	+0.124
+ 400k	0.669	0.804	0.538	+0.267
+ 800k	0.646	0.728	0.571	+0.157
+ 2M	0.812	0.875	0.749	+0.127
RoboRefer-2B-SFT	0.793	0.816	0.770	+0.046
Qwen2.5-VL-3B	0.570	0.776	0.360	+0.416
+ 80k	0.512	0.585	0.440	+0.145
+ 400k	0.503	0.588	0.418	+0.171
+ 800k	0.499	0.600	0.398	+0.202
+ 2M	0.500	0.648	0.353	+0.295
Qwen3-VL-235B	0.908	0.948	0.880	+0.068

Table 3. Consistent vs. counter correctness on SpatialTunnel. \(v\) is mean correctness, \(v_{\text{cons}}\) is correctness on consistent samples, \(v_{\text{ctr}}\) is correctness on counter samples, and \(\Delta\) is their difference.

The Molmo heatmaps in Figure 4 and Figure 5 show the spatial pattern behind the aggregate scores. Consistent cells become easier with scaling, while counter cells remain harder and are especially degraded at intermediate fine-tuning scales: in Table 3, Molmo's counter correctness dips to 0.409 at 400k and 0.430 at 800k before recovering to 0.630 at 2M.

**Figure 4. Molmo SpatialTunnel consistent heatmaps.** Redder cells indicate higher correctness. The source caption says accuracy on perspective-consistent cells improves from base to 400k to 2M samples.

**Figure 5. Molmo SpatialTunnel counter heatmaps.** Counter cells are substantially harder, showing that the model is not merely learning a depth-invariant object-pair relation.

Benchmark Accuracy Is Not Enough

Table 4 broadens the evaluation across EmbSpatial, CV-Bench, and BLINK. The pattern is not a clean "more spatial fine-tuning means better spatial understanding" story. For example, NVILA 2M reaches 93.8 on CV-3D Depth but only 62.9 on BLINK Spatial Relation, while Qwen 2M is 78.3 on BLINK Spatial Relation but 52.2 on CV-3D Distance.

Model	EmbSpatial Overall	CV-2D Relation	CV-3D Depth	CV-3D Distance	BLINK Rel. Depth	BLINK Spat. Rel.
Molmo-7B-O-0924	60.7	76.3	84.5	68.5	78.2	70.6
+ 80k	52.9	62.3	71.0	67.5	72.6	60.8
+ 400k	64.9	84.3	80.0	70.8	72.6	68.5
+ 800k	69.1	90.0	82.0	70.8	75.0	61.5
+ 2M	74.3	93.7	87.3	81.3	71.0	69.2
NVILA-Lite-2B	54.0	58.6	69.2	52.3	64.5	67.1
+ 80k	65.1	78.9	66.2	60.8	53.2	74.1
+ 400k	62.1	83.2	74.3	67.0	71.8	63.6
+ 800k	69.7	85.5	78.2	71.3	57.3	65.0
+ 2M	69.4	91.4	93.8	87.2	70.2	62.9
RoboRefer-2B-SFT	92.0	96.5	95.7	90.5	84.7	79.7
Qwen2.5-VL-3B	62.3	67.4	70.3	60.2	68.6	83.9
+ 80k	57.3	59.7	64.7	61.5	58.1	79.7
+ 400k	58.6	58.2	62.0	54.5	58.9	78.3
+ 800k	60.9	59.4	58.7	51.2	58.1	79.0
+ 2M	65.7	68.8	58.5	52.2	53.2	78.3
Qwen3-VL-235B	82.0	96.5	93.3	91.0	84.7	90.2

Table 4. Performance across spatial benchmarks. Fine-tuned variants fluctuate across benchmark formats. RoboRefer and Qwen3-VL-235B are more consistently strong.

Representation-Level Evidence

Figure 7 links counter accuracy to distance coherence. The paper reports that \(\mathrm{Coh}_D\) computed on SpatialTunnel correlates with counter accuracy on EmbSpatial-Bench and CV-Bench-3D with Spearman \(\rho=0.759\) and \(\rho=0.804\), respectively, both with \(p<10^{-3}\). Figure 8 shows that within the NVILA family, RoboRefer occupies the desirable region of high distance coherence and lower VD-EI.

**Figure 7. Counter accuracy vs. distance coherence.** The source caption describes a positive relation between counter-example accuracy and internal distance coherence; labels indicate fine-tuning sample scale.

**Figure 8. Distance coherence vs. VD-EI.** RoboRefer is highlighted as a high-coherence, lower-entanglement reference within the NVILA family.
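The kind of rank correlation behind Figure 7 can be sketched with a tie-aware Spearman coefficient. The pairing below uses the \(\mathrm{Coh}_D\) column of Table 5 and the EmbSpatial counter column of Table 2 for the 16 models that appear in both. This digest does not state which dataset the Table 5 values were computed on or which models entered the reported statistic, so the pairing is an assumption and the printed value need not match the reported \(\rho\).

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Row order: Molmo (base, 80k, 400k, 800k, 2M), NVILA (same five),
# RoboRefer, Qwen2.5-VL (same five).
coh_d = [0.075, 0.072, 0.096, 0.107, 0.112,
         0.052, 0.070, 0.095, 0.089, 0.104,
         0.182,
         0.043, 0.040, 0.042, 0.045, 0.052]
emb_counter = [34.9, 29.5, 27.1, 34.1, 39.5,
               27.1, 15.5, 34.1, 38.8, 41.1,
               59.7,
               32.6, 30.2, 27.1, 26.4, 24.0]

rho = spearman(coh_d, emb_counter)
print(f"Spearman rho on the digest tables: {rho:.3f}")
```

Spearman is a natural fit here because it only asks whether models with higher distance coherence also rank higher on counter accuracy, without assuming the relation is linear.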

The PCA visualization in Figure 9 makes the same point visually. Molmo 2M, NVILA 2M, and Qwen 2M separate horizontal and vertical deltas better than distance deltas; RoboRefer and Qwen3 show much cleaner separation of all three axes.

**Figure 9. PCA of delta vectors.** Each point is a contrastive-probing delta vector. The paper interprets separated horizontal, vertical, and distance clusters as evidence of more structured spatial representation.

Model	\(\mathrm{Coh}_H\)	\(\mathrm{Coh}_V\)	\(\mathrm{Coh}_D\)	VD-EI
Molmo-7B	0.143	0.228	0.075	0.279
+ 80k	0.122	0.332	0.072	0.388
+ 400k	0.236	0.597	0.096	0.459
+ 800k	0.247	0.559	0.107	0.514
+ 2M	0.239	0.574	0.112	0.474
NVILA-2B	0.323	0.289	0.052	0.539
+ 80k	0.295	0.497	0.070	0.606
+ 400k	0.242	0.574	0.095	0.589
+ 800k	0.278	0.498	0.089	0.591
+ 2M	0.241	0.553	0.104	0.550
RoboRefer-2B	0.649	0.830	0.182	0.362
Qwen2.5-3B	0.367	0.293	0.043	0.457
+ 80k	0.386	0.315	0.040	0.456
+ 400k	0.450	0.452	0.042	0.451
+ 800k	0.473	0.538	0.045	0.429
+ 2M	0.485	0.586	0.052	0.472

Table 5. Axis coherence and VD-Entanglement Index. Distance coherence is the weakest axis across the reported models. RoboRefer has the highest coherence on all three axes and lower VD-EI than the NVILA scaling variants.

Additional Evidence From Source Appendices

The appendix extends the cue analysis from vertical position to object size. Figure 10 shows that Molmo and NVILA degrade when the farther object becomes larger than the nearer object, while Qwen remains near chance and therefore cannot be read as robust. This supports C7: high average depth accuracy can reflect multiple correlated visual cues, not just true 3D reasoning.

**Figure 10. Correctness as a function of object size.** The source caption reports that Molmo and NVILA degrade as the farther object grows larger than the nearer one, while Qwen stays near chance.

Figure 11 serves as an external-validity check on the probing metric: the absolute \(\mathrm{Coh}_D\) values differ between synthetic SpatialTunnel and real EmbSpatial-Bench, but the relative ordering within model families is largely preserved.

**Figure 11. Cross-domain distance coherence.** The paper uses this to argue that \(\mathrm{Coh}_D\) is a reusable relative comparison signal when models are evaluated under the same condition.

Practical Takeaways

The main reusable idea is the split between behavior and representation: benchmark accuracy says whether a model answered correctly, while contrastive probing asks whether it encoded spatial axes cleanly.
The highest-value diagnostic is distance coherence. In this paper, \(\mathrm{Coh}_D\) is consistently the weakest axis and tracks counter-heuristic robustness better than aggregate accuracy alone.
Spatial fine-tuning is not automatically enough. Some fine-tuned variants improve benchmark scores while maintaining or increasing shortcut gaps.
RoboRefer and Qwen3-VL-235B are the strongest evidence that better spatial representations are possible, but the paper does not isolate a single causal ingredient; RoboRefer differs in both data scale and supervision, and Qwen3 differs heavily in model scale.
The limitation to remember is that SpatialTunnel is synthetic and diagnostic. It isolates one failure mode well, but it is not a full proxy for embodied 3D competence.
For future model evaluation, the paper argues for reporting counter-heuristic splits and representation structure, not only standard benchmark averages.