Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Source-first digest for checked paper rank p040.

Routing status: success
PDF extraction: not used
Table recovery note: the benchmark and ablation tables were malformed in paper.md; their numeric values were recovered from latex_flattened/main.flattened.tex.

Motivation / Background

Semantic correspondence asks for matches between semantically equivalent object parts across images. It is harder than low-level feature matching because viewpoint, articulation, intra-class shape, occlusion, and background all vary. The paper starts from a concrete failure mode: strong 2D foundation features such as DINOv2 and Stable Diffusion can still confuse left and right sides of symmetric objects or collapse repeated parts such as wheels, legs, and windows.

Prior 3D-aware correspondence pipelines partly address this by using pose annotations and coarse category-level spherical proxies. Geometry Matters argues that this is the wrong granularity: semantic correspondence needs per-instance geometry when the ambiguity is local, repeated, or asymmetric. The paper therefore uses single-image 3D foundation models as teachers: SAM3/SAM3D produce an object mask, mesh, and camera; a render-and-compare refinement improves the pose; PartField descriptors are rendered into the image plane; and geodesic distances on the reconstructed meshes filter candidate matches before training a small adapter.

The motivation is easiest to see in Figure 1: SD+DINO produces wrong repeated-part matches, a geodesic filter removes many mistakes but becomes sparse, and adding PartField features gives denser, more accurate pseudo-correspondences.

SD and DINO baseline correspondences — **Figure 1. 3D foundation priors improve candidate generation and filtering.** Existing SD+DINO pipelines suffer from left-right and repeated-part confusion. A geodesic filter removes wrong matches but can leave few correspondences. Adding PartField features yields denser and more accurate correspondences under large pose changes.

Geodesic filtering correspondences — **Figure 1. 3D foundation priors improve candidate generation and filtering.** Existing SD+DINO pipelines suffer from left-right and repeated-part confusion. A geodesic filter removes wrong matches but can leave few correspondences. Adding PartField features yields denser and more accurate correspondences under large pose changes.

Claims And Evidence

Claim id	Main claim	Support	Evidence anchors
C1	Single-image 3D foundation priors can replace manual pose annotations as a practical geometry source for weakly supervised semantic correspondence.	4	reconstruction, correspondence pipeline, main results
C2	Rendered PartField descriptors complement DINO and Stable Diffusion features by resolving symmetric and repeated structures that 2D features confuse.	4	teaser, PartField PCA, PartField similarity, category pattern
C3	Bicyclic geodesic filtering produces cleaner pseudo-labels than spherical, triplane, or direct PartField-similarity filters, and the cleaner labels improve final PCK.	5	weight search, filter validation, ablation
C4	3D-SC is strongest among weakly supervised methods without human annotations on the main reported benchmarks, especially SPair-71k and SPair-Geo-Aware.	5	main results
C5	Gains are concentrated in rigid, symmetric, man-made categories; within-part localization and deformable animals remain weaker.	4	category pattern, limitations
C6	The reduced-supervision claim is real but bounded: the method still depends on dataset category labels or detector-quality masks, SAM3D pose/shape quality, and PartField's part-level granularity.	4	implementation details, limitations

Scores are support-from-paper scores, not independent reproduction scores. I cap the broad reduced-supervision and geometry-transfer claims below 5 because the evidence is strong on standard benchmarks but still depends on the quality and assumptions of the upstream 3D foundation models.

Core Technical Idea

The paper treats 3D foundation models as a source of pseudo-supervision for a 2D correspondence adapter. The pipeline is:

1. Reconstruct and canonicalize a mesh for each object instance. 2. Render PartField descriptors from that mesh into image coordinates. 3. Fuse rendered PartField with DINO and Stable Diffusion descriptors to propose candidate matches. 4. Lift each candidate match back to the two meshes and reject it if the 3D surface locations are geodesically inconsistent. 5. Train a lightweight adapter on top of frozen DINO+SD features using the retained pseudo-labels.

The key design choice is not simply adding a 3D feature. The paper uses 3D twice: PartField improves feature quality during candidate generation, and mesh geodesics filter candidate labels before supervision. Figure 3 is the main method diagram and shows how the image-space and mesh-space checks interact.

Pseudo-label correspondence pipeline — **Figure 3. Pseudo-label correspondences pipeline.** DINO, SD, and PartField features propose candidates via nearest-neighbor search with relaxed cyclic consistency. Each candidate is then lifted onto reconstructed meshes and geometrically verified with bicyclic geodesic error before training the adapter.

Method Details

Canonicalized 3D Object Reconstruction

Figure 2 shows the upstream geometry stage. The method obtains an instance mask with SAM3, reconstructs an object-centric mesh and camera with SAM3D, then refines scale and translation by matching the rendered silhouette to the observed mask. The refinement uses a distance-transform phase first, because soft IoU has poor gradients when the rendered and observed masks do not overlap.

The distance fields are:

\mathcal{D}_{\text{out}}(p) = \frac{1}{d}\min_{p':\tilde{\mathbf{M}}(p')=1}\|p-p'\|_2^2, \qquad \mathcal{D}_{\text{in}}(p) = \frac{1}{d}\min_{p':\tilde{\mathbf{M}}(p')=0}\|p-p'\|_2^2.

The DT loss combines an outside attraction term with an inside coverage term:

\mathcal{L}_{\text{DT}} = \frac{1}{HW}\sum_p \left[ \hat{\mathbf{M}}_p\mathcal{D}_{\text{out}}(p) + \mathcal{D}_{\text{in}}(p)(1-\lambda\hat{\mathbf{M}}_p) \right].

After overlap is established, the method switches to soft IoU:

\mathcal{L}_{\text{IoU}} = 1 - \frac{\sum_p \hat{\mathbf{M}}_p\mathbf{M}_p} {\sum_p(\hat{\mathbf{M}}_p+\mathbf{M}_p-\hat{\mathbf{M}}_p\mathbf{M}_p)}.

For yaw canonicalization, the paper renders each mesh at eight known yaw angles, asks OrientAnything V2 for apparent orientation, chooses a correction from \(\{0^\circ,90^\circ,180^\circ,270^\circ\}\), and aggregates by majority vote:

\Delta\psi^* = \underset{\Delta\psi \in \{0^\circ,90^\circ,180^\circ,270^\circ\}}{\arg\min} \left|\psi_{\text{est}}+\Delta\psi-\psi_{\text{known}}\right|.

The supplement reports that 79 of 1,319 refined meshes required a non-zero yaw correction, or 5.99 percent. The most affected classes were bus, boat, train, and cow.

PartField Features And Feature Fusion

PartField predicts a continuous per-vertex descriptor field on a 3D shape. The method rasterizes these descriptors into the image using the SAM3D camera and refined pose. Vertices outside the frustum or foreground mask are discarded; pixels without projected descriptors are filled by nearest-neighbor propagation.

The PCA visualizations in Figure 5 support the paper's claim that PartField features are spatially coherent within semantic parts while remaining comparable across instances.

PartField PCA projection across car instances — **Figure 5. PCA visualizations of PartField features.** Consistent colors within parts indicate coherent descriptors, and similar colors across instances indicate rough cross-instance part alignment.

PartField PCA projection across chair instances — **Figure 5. PCA visualizations of PartField features.** Consistent colors within parts indicate coherent descriptors, and similar colors across instances indicate rough cross-instance part alignment.

Figure 6 is more diagnostic: a queried car wheel or chair leg activates the corresponding repeated part instead of all visually similar parts. This is the mechanism the paper uses to explain gains on rigid symmetric categories.

PartField query similarity on car wheel — **Figure 6. PartField features reduce repeated-part and symmetry ambiguity.** The queried wheel and chair leg produce localized similarity responses rather than activating all similar-looking repeated parts.

PartField query similarity on chair leg — **Figure 6. PartField features reduce repeated-part and symmetry ambiguity.** The queried wheel and chair leg produce localized similarity responses rather than activating all similar-looking repeated parts.

The fused image descriptor concatenates independently normalized SD, DINO, and PartField features with square-root weights:

\mathcal{F}_{\text{fused}} = \left( \sqrt{\alpha}\widehat{\mathcal{F}}_{\text{SD}}, \sqrt{\beta}\widehat{\mathcal{F}}_{\text{DINO}}, \sqrt{\gamma}\widehat{\mathcal{F}}_{\text{PF}} \right), \qquad \gamma = 1-\alpha-\beta.

The default weights are \(\alpha=1/2\), \(\beta=1/3\), and \(\gamma=1/6\). The square-root form makes the dot product in fused space equivalent to a weighted average of cosine similarities from the three normalized feature spaces.

The paper's weight search in Figure 7 shows that multiple mixtures work, but the chosen one keeps PartField present without letting the coarser 3D part descriptor dominate within-part localization.

Feature fusion weight search heatmap — **Figure 7. Feature fusion weight search.** PCK@0.10 of pseudo-correspondences before filtering on SPair-71k validation as a function of SD and DINO weights, with the PartField weight determined by the simplex constraint.

Pseudo-Label Filtering And Adapter Training

Candidate matches are proposed by nearest-neighbor search in fused feature space, then filtered by relaxed cyclic consistency:

\left\| \mathrm{NN}^{t\rightarrow s}_{\text{fused}}(\hat{p}^t)-p^s \right\|_2 \lt \tau_{cc}\max(h,w).

The important second filter is geometric. For a source-target candidate, the method lifts both pixels onto their meshes. It then uses PartField nearest-neighbor search to predict a cross-mesh counterpart and compares that predicted surface point with the target surface point induced by the image-space match. The forward geodesic error is:

d_{\text{geo}}^{s\rightarrow t} = d_{\mathcal{M}_t}(\hat{\mathbf{v}}^t,\bar{\mathbf{v}}^t).

The final bicyclic geodesic error averages forward and backward errors and normalizes by mesh bounding-box diagonals:

d_{\text{geo}}^{s\rightleftarrows t} = \frac{1}{2} \left( \frac{d_{\text{geo}}^{s\rightarrow t}}{\mathrm{diag}(\mathcal{M}_t)} + \frac{d_{\text{geo}}^{t\rightarrow s}}{\mathrm{diag}(\mathcal{M}_s)} \right).

Retained pseudo-labels are:

\mathcal{P} = \left\{ (p^s,p^t) \mid d_{\text{geo}}^{s\rightleftarrows t} \leq \tau_{\text{geo}} \right\}.

The adapter is a four-layer, 5M-parameter model trained on top of frozen DINOv2 and Stable Diffusion features. It uses sparse contrastive supervision on pseudo-correspondences plus dense regression through a window soft-argmax:

\mathcal{L}_{\text{sparse}} = \mathrm{CL}\left(\mathcal{F}^s(\mathcal{P}^s), \mathcal{F}^t(\mathcal{P}^t)\right),

\mathcal{L}_{\text{dense}} = \sum_{(p^s,p^t)\in\mathcal{P}} \left\|\hat{p}^t-(p^t+\epsilon)\right\|_2.

Implementation settings matter for interpreting the results. The paper uses \(\lambda=4\), \(\tau_{cc}=0.05\), and \(\tau_{\text{geo}}=0.05\). SD and DINO features are extracted from high-resolution images, PartField descriptors are rasterized at \(60^2\), and training runs for 200k iterations with 50 pseudo-labels sampled per pair per iteration.

Experiments And Results

Table 1 is the central quantitative evidence. The compact version below keeps the PCK@0.1 columns needed to judge the main claims; values come from the recovered LaTeX table.

Method	Supervision type	SPair-71k	SPair-Geo-Aware	AP-10K intra	AP-10K cross-species	AP-10K cross-family	SPairU
DINOv2+NN	Unsupervised	53.9	42.0	60.9	57.3	47.4	54.9
DIFT	Unsupervised	52.9	42.5	50.3	46.0	35.0	47.4
Spherical Map	Weak, human pose annotations	64.4	--	65.4	63.1	51.0	61.0
DIY-SC	Weak, human annotations	71.6	67.5	70.6	69.8	57.8	67.9
SD+DINOv2	Weak, no human annotations	59.9	49.3	62.9	59.3	48.3	59.4
DIY-SC+OriAny	Weak, no human annotations	69.6	65.8	69.3	66.8	54.0	66.3
3D-SC (ours)	Weak, no human annotations	73.0	70.8	69.6	68.5	56.9	67.3

Table 1. Evaluation on standard benchmarks. 3D-SC is strongest among the no-human-annotation weakly supervised methods on all listed columns and beats the human-annotation DIY-SC baseline on SPair-71k and SPair-Geo-Aware. The AP-10K gains are smaller but still positive against DIY-SC+OriAny.

Main Benchmark Interpretation

On SPair-71k, 3D-SC reaches 73.0 PCK@0.1, improving over DIY-SC+OriAny by 3.4 points. On the geometry-targeted SPair-Geo-Aware subset, it reaches 70.8, improving over DIY-SC+OriAny by 5.0 points and over SD+DINOv2 by 21.5 points. This is the cleanest support for the paper's core hypothesis: if the benchmark emphasizes symmetric and repeated parts, instance-specific 3D geometry helps.

On AP-10K, the method reports 69.6/68.5/56.9 on intra-species, cross-species, and cross-family splits. That beats the no-human-annotation baseline, but the gap is modest compared with the SPair-Geo-Aware gain. The paper's explanation is plausible: PartField's part-level cues are less reliable for deformable animals and unusual poses.

Pseudo-Annotation Quality

Figure 4 visually supports the pseudo-label quality claim: 3D-SC annotations are denser and more geometrically consistent than DIY-SC in the paper's examples.

Qualitative pseudo-annotations from 3D-SC and DIY-SC — **Figure 4. Qualitative pseudo-annotations.** The paper compares 3D-SC and DIY-SC pseudo-ground-truth annotations and reports denser, more geometrically consistent 3D-SC matches.

Figure 8 extends the same evidence with more categories. I include it because the qualitative claim is not just one cherry-picked pair; the supplement uses the same visual comparison across additional objects.

Supplementary pseudo-annotation comparison 1 — **Figure 8. Additional pseudo-annotation comparison.** Across object categories, 3D-SC produces denser pseudo-annotations that cover more of the object surface while avoiding left-right ambiguity better than the spherical-prior baseline.

Supplementary pseudo-annotation comparison 2 — **Figure 8. Additional pseudo-annotation comparison.** Across object categories, 3D-SC produces denser pseudo-annotations that cover more of the object surface while avoiding left-right ambiguity better than the spherical-prior baseline.

Filter And Ablation Evidence

Table 2 isolates the filter. With SD+DINO+PartField candidates, geodesic filtering has the lowest false positive rate, 1.78 percent, while keeping 1,634 candidates per pair.

Feature set	Filter	FPR	Candidates per pair
SD+DINO	Spherical mapper	10.95	1856
SD+DINO	Triplane	13.15	1948
SD+DINO	PF feature similarity	2.81	1608
SD+DINO	Geodesic	1.82	1543
SD+DINO+PartField	Spherical mapper	10.75	2001
SD+DINO+PartField	Triplane	13.07	2090
SD+DINO+PartField	PF feature similarity	2.47	1694
SD+DINO+PartField	Geodesic	1.78	1634

Table 2. Filtering evaluation on the validation set. FPR is the false positive rate among unfiltered wrong predictions. The geodesic filter is best under both feature sets.

Table 3 shows how the components accumulate. Pseudo-labeling, cyclic consistency, geodesic filtering, PartField, DINOv3, and capped sampling all contribute to the final 73.0 SPair-71k PCK@0.1.

Pseudo labels	PartField	Cyclic consistency	Geodesic filter	Sampling cap	DINO	SPair-71k PCK@0.1
no	no	no	no	yes	v2	64.9
yes	no	no	no	yes	v2	67.0
yes	no	yes	no	yes	v2	67.6
yes	no	yes	yes	yes	v2	71.6
yes	no	yes	yes	yes	v3	72.4
yes	yes	no	no	yes	v2	66.9
yes	yes	yes	no	yes	v2	68.8
yes	yes	yes	yes	yes	v2	72.1
yes	yes	yes	yes	no	v3	72.4
yes	yes	yes	yes	yes	v3	73.0
DIY-SC baseline	--	--	--	--	v3	72.1
DIY-SC+OriAny baseline	--	--	--	--	v3	70.4

Table 3. Ablations on SPair-71k. The largest single jump in the visible sequence is adding geodesic filtering after pseudo-labels and cyclic consistency. The final row also shows a 0.6 point gain from the sampling cap relative to the same v3 configuration without it.

Category Pattern

The per-category table in the supplement is too wide for the digest, but Table 4 captures the main pattern the authors emphasize: the largest gains over DIY-SC+OriAny appear on rigid, symmetric, man-made categories. Regressions appear on deformable or less PartField-friendly categories.

Category pattern	Reported delta or behavior	Interpretation
Bus	+10.8 over DIY-SC+OriAny	Strong symmetry and repeated structures; 3D geometry is highly useful.
TV/monitor	+9.8	Rigid object with viewpoint-sensitive part identity.
Bottle	+8.8	Geometry helps distinguish visually similar object regions.
Car	+6.9	Front/rear and left/right ambiguity is a central failure case.
Train	+6.2	Repeated rigid components benefit from geometric filtering.
Motorcycle	+5.1	Rigid structure and repeated wheels match the PartField prior.
Chair	+4.0	Repeated legs and front/back ambiguity benefit from 3D cues.
Sheep, cat, cow	-2.7, -1.5, -1.7	Deformable animals are weaker for the current PartField/SAM3D setup.

Table 4. Per-category result pattern. The strongest gains align with the method's stated target: symmetric or repeated rigid structure. This also exposes the current boundary of the approach.

Practical Takeaways

The reusable idea is to use 3D foundation models as a source of geometry-aware pseudo-labels for a 2D task, not necessarily to make the final model 3D-heavy. The final adapter still runs on frozen 2D features; 3D is used offline to produce better supervision.

The strongest evidence is the combination of SPair-Geo-Aware results, geodesic-filter FPR, and ablations. Those three pieces line up: geometry helps most where the benchmark demands geometric disambiguation, the geodesic filter removes false matches directly, and the ablation shows that filtering improves downstream PCK.

The weak spots are also clear. The pipeline inherits failures from SAM3D pose and shape estimates. PartField descriptors are coarse part-level descriptors, so they help less when the task requires precise within-part localization. Deformable animals are not as clean a fit as rigid man-made objects. The method also assumes object category information or detector-quality masks are available, even if it removes the manual pose labels required by prior approaches.

For future paper digestion, the biggest lesson is that the source Markdown can render figures well but still mangle LaTeX tables. Here, the main benchmark and ablation tables needed recovery from latex_flattened/main.flattened.tex; without that recovery, the digest would miss the strongest quantitative evidence.