OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Source-first digest for monthly 2026_05 rank 15, rank_id p042.

Motivation / Background

OpenSearch-VL targets the reproducibility gap around frontier multimodal search agents. The paper argues that strong systems now need to do more than answer from a single visual encoder pass: they must search, verify, repair weak visual evidence, and reason over multi-step tool observations. The authors frame existing frontier systems as hard to reproduce because the training data, trajectory generation, tool environment, and RL recipe are often proprietary or underspecified.

The paper's open recipe has three moving parts: high-quality image-grounded search data, a tool environment that combines retrieval with active perception, and a fatal-aware GRPO objective for long-horizon tool rollouts. The introduction states the intended public release scope as data, code, models, and training recipe, but this digest treats that as a paper-stated release plan rather than independently verified availability. The main claims and their support levels are summarized in Table 1.

The authors' core motivation is that multimodal deep search fails when the problem is reduced to one-step retrieval. Their training data is therefore designed to force a visual-to-text chain: first identify or repair the visual anchor, then traverse textual relations, then verify the answer. This matters for questions where the image alone does not expose the answer and a naive reverse-image or text search would create shortcuts.

Claims And Evidence

Support scores in Table 1 are source-support scores, not independent reproduction scores. A score of 5 means the claim is directly backed by source text, equations, tables, or figures. A score of 4 means the paper gives strong internal evidence but still depends on assumptions such as judge quality or benchmark representativeness. A score below 4 flags a claim that is partly prospective or externally fragile.

Claim id	Main claim	Support	Evidence anchors
C1	OpenSearch-VL is presented as a fully open recipe for training multimodal deep-search agents, spanning data, tools, SFT, and RL.	4	open recipe, training setup, limitations
C2	The data pipeline deliberately reduces shortcut solving by combining Wikipedia path sampling, fuzzy entity rewriting, source-anchor visual grounding, staged filtering, and a degradation-enhancement subset.	5	data pipeline, Figure 1, data ablation
C3	The tool environment is broader than retrieval-only search: it includes text search, image search, OCR, crop, sharpen, super-resolution, and perspective correction.	5	tool environment, Table 3, case study
C4	Fatal-aware GRPO preserves useful pre-failure reasoning by masking post-failure tokens and clamping fatal-trajectory advantages to be non-negative.	5	fatal-aware GRPO, key equations, Figure 2, Figure 5
C5	The reported benchmark gains are large: the 30B-A3B OpenSearch-VL agent improves the Qwen3-VL-30B-A3B agentic baseline average from 47.8 to 61.6.	5	main results, Table 4
C6	The ablations support both main design families: removing data-pipeline safeguards causes large SFT drops, and adding fatal masking plus one-sided clamping beats vanilla GRPO.	5	ablation, Table 5, Figure 3, Figure 4
C7	The recipe is promising but not fully self-contained experimentally: tool drift, proprietary LLM judges, API dependence, and missing multi-seed error bars remain limitations.	5	limitations, main results

Table 1. Main claims with source-support scores and evidence anchors.

Core Technical Idea

The core idea is to train a multimodal agent on tasks that require active visual grounding plus external evidence acquisition, rather than tasks answerable from parametric memory, the raw image, or one retrieval call. Figure 1 is the source paper's overview of that data pipeline.

**Figure 1. Data curation pipeline.** The source figure shows path sampling over Wikipedia, canonical QA generation, fuzzy rewriting, source-anchor image grounding, filtering, enhancement, trajectory synthesis, and rejection sampling.

The data starts from a constrained random walk over the Wikipedia hyperlink graph. Each path assigns a role to nodes: \(v_0\) is the visual anchor, intermediate nodes are bridges, and \(v_h\) supplies the answer attribute. The paper then rewrites explicit entity names into fuzzy descriptors while keeping the answer invariant and unique. The final visual question replaces the source anchor with an image reference, forcing the agent to identify the visual entry point before following textual relations.
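
The path-sampling step can be illustrated with a minimal sketch. It assumes a toy link graph with invented page names, treats "constrained" as meaning no revisits plus an out-degree cap for hub avoidance, and uses hypothetical helper names (`sample_path`, `assign_roles`, `max_out_degree`); the paper's actual constraints may differ.

```python
import random
from typing import Dict, List, Optional

# Toy hyperlink graph; page names and links are invented for illustration.
TOY_GRAPH: Dict[str, List[str]] = {
    "zoo": ["founder", "country"],
    "founder": ["spouse", "country"],
    "spouse": ["zoo"],
    "country": ["zoo", "founder", "spouse", "capital"],
    "capital": [],
}


def sample_path(
    graph: Dict[str, List[str]],
    start: str,
    hops: int,
    max_out_degree: int,
    rng: random.Random,
) -> Optional[List[str]]:
    """Random walk of `hops` edges that never revisits a page or enters a hub."""
    path = [start]
    for _ in range(hops):
        candidates = [
            nxt
            for nxt in graph.get(path[-1], [])
            # Assumed constraints: simple path, and skip high out-degree hubs.
            if nxt not in path and len(graph.get(nxt, [])) <= max_out_degree
        ]
        if not candidates:
            return None  # dead end; a caller would resample
        path.append(rng.choice(candidates))
    return path


def assign_roles(path: List[str]) -> Dict[str, object]:
    """Map path positions to the roles v_0 (anchor), bridges, and v_h (answer)."""
    return {"anchor": path[0], "bridges": path[1:-1], "answer_node": path[-1]}


path = sample_path(TOY_GRAPH, "zoo", hops=2, max_out_degree=2, rng=random.Random(0))
roles = assign_roles(path)
```

In this toy graph the "country" page exceeds the out-degree cap, so the walk is forced through "founder" to "spouse"; the anchor would later be replaced by an image and the answer attribute read from the final node.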

The authors add two filters that directly target shortcut behavior. First, a frozen Qwen3-VL-32B model discards examples answerable without tools. Second, the same model discards examples solvable with a single ImageSearch call. A 10 percent enhancement subset further degrades images with blur, downsampling, or perspective distortion, then pairs those cases with the corresponding image-repair tools. The resulting SFT corpus contains 36,592 expert trajectories retained after rejection sampling and filtering, with an average of 6.3 tool-invocation turns; the RL set contains 8K examples sampled from the filtered VQA pool.
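
The two shortcut filters can be sketched as a simple staged pass. This is a hedged illustration: the solver callables stand in for the frozen model, exact string match stands in for whatever answer check the paper uses, and all names (`staged_filter`, `no_tool_solver`, `one_search_solver`) and the example records are hypothetical.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Solver = Callable[[Example], str]


def staged_filter(
    examples: List[Example],
    no_tool_solver: Solver,
    one_search_solver: Solver,
) -> List[Example]:
    """Keep only examples that neither shortcut solver answers correctly."""
    kept = []
    for ex in examples:
        # Stage 1: drop examples answerable without any tool call.
        if no_tool_solver(ex) == ex["answer"]:
            continue
        # Stage 2: drop examples solvable with a single image-search call.
        if one_search_solver(ex) == ex["answer"]:
            continue
        kept.append(ex)
    return kept


# Stub data and solvers (assumptions): placeholders for model-backed checks.
examples = [
    {"id": "parametric", "answer": "A1"},
    {"id": "one_hop", "answer": "A2"},
    {"id": "multi_hop", "answer": "A3"},
]


def no_tool(ex: Example) -> str:
    return "A1" if ex["id"] == "parametric" else "unknown"


def one_search(ex: Example) -> str:
    return "A2" if ex["id"] == "one_hop" else "unknown"


kept_ids = [ex["id"] for ex in staged_filter(examples, no_tool, one_search)]
```

Only the example that resists both shortcuts survives, which is the behavior the filtering stage is meant to enforce.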

**Figure 6. Source-anchor example.** The appendix uses an Australia Zoo image as the visual anchor for a multi-hop question whose answer is Terri Irwin's Australian citizenship date.

The appendix example illustrates why the anchor/answer split matters. If the image were the answer entity, reverse-image search would expose the answer directly. If fuzzy rewriting were removed, a plain text query would do the same. If hub avoidance were removed, the descriptors would become ambiguous. The example and Figure 6 make the paper's data construction less abstract: the question is intentionally designed so the image is load-bearing but not answer-revealing.

Method Details

OpenSearch-VL defines a multi-turn trajectory in which the agent conditions on accumulated images, previous actions, and previous observations, then emits either a tool call or a final response. Table 2 lists the most important formal objects, and Table 3 lists the tool suite.

Object	Source label	Digest interpretation
\(h_l = (\mathcal{I}_l, q, \mathbf{a}_{<l}, \mathbf{o}_{<l})\)	`eq:history`	The state is not just text; it includes the active visual context plus action and observation history.
\(\tau = \{(h_0,a_0,o_0), \dots, (h_L,a_L)\}\)	`eq:trajectory`	Training and RL operate over full multi-turn ReAct-style trajectories.
\(o_l = \mathcal{E}(c_l,h_l)\) routed to image or text observations	`eq:obs-routing`	Visual tools can return new images, while retrieval and OCR return text-like observations.
\(\pi_\theta(\tau \mid I_0,q) = \prod_l P_\theta(z_l \mid h_l)P_\theta(c_l \mid h_l,z_l)\)	`eq:traj-likelihood`	The model is trained on reasoning traces and commands, not on exogenous tool observations.
\(a(q_f)=a(q_t)\), \(\lvert\mathcal{R}(q_f)\rvert=1\), and no path aliases in \(q_f\)	`eq:fuzz-invariants`	Fuzzy rewrites must preserve answer, uniqueness, and non-leakage.

Table 2. Key formulation equations. The equations are taken from the paper's TeX source.

Tool family	Tools	Role in the recipe
Retrieval	`TextSearch`, `ImageSearch`	Acquire external factual and visual-entity evidence.
Image enhancement	`Sharpen`, `SuperResolution`, `PerspectiveCorrect`	Repair blur, low resolution, or skew before search or OCR.
Attention and parsing	`Crop`, `OCR`	Focus on diagnostic regions and decode text/layout from images.

Table 3. OpenSearch-VL tool suite. The source table describes seven tools and the appendix gives implementation details, including Serper/JINA/Qwen3-32B for text search, Polaris Lens for image search, OpenCV-style local image enhancement, and remote PaddleX-backed OCR.
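
The observation routing in `eq:obs-routing` and the history update in `eq:history` can be sketched together. The tool names come from Table 3; the classes, the `route` and `step` helpers, and the choice to represent images by string ids are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ImageObs:
    image_id: str  # assumption: images are referenced by id


@dataclass
class TextObs:
    text: str


Observation = Union[ImageObs, TextObs]

# Visual tools return new images; retrieval and OCR return text-like output.
IMAGE_TOOLS = {"Crop", "Sharpen", "SuperResolution", "PerspectiveCorrect"}
TEXT_TOOLS = {"TextSearch", "ImageSearch", "OCR"}


@dataclass
class History:
    question: str
    images: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    observations: List[Observation] = field(default_factory=list)


def route(tool: str, raw_output: str) -> Observation:
    """Wrap a raw tool output in the observation type for its tool family."""
    if tool in IMAGE_TOOLS:
        return ImageObs(image_id=raw_output)
    if tool in TEXT_TOOLS:
        return TextObs(text=raw_output)
    raise ValueError(f"unknown tool: {tool}")


def step(history: History, tool: str, raw_output: str) -> Observation:
    """Record one action/observation pair and grow the image context if needed."""
    obs = route(tool, raw_output)
    history.actions.append(tool)
    history.observations.append(obs)
    if isinstance(obs, ImageObs):
        history.images.append(obs.image_id)
    return obs


h = History(question="q", images=["img0"])
step(h, "Crop", "img1")
step(h, "OCR", "EXIT 12")
```

After these two steps the image context holds both the original and the cropped image, while the OCR result is stored as a text observation.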

Training is sequential. First, SFT maximizes the likelihood of reasoning trace \(z_l\) and command \(c_l\) over the 36,592 expert trajectories:

\[ \max_\theta \sum_{i=1}^{N} \sum_{l=1}^{L_i} \left[ \log P_\theta(z_l^{(i)} \mid h_l^{(i)}) + \log P_\theta(c_l^{(i)} \mid h_l^{(i)}, z_l^{(i)}) \right]. \]
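
A small sketch can make the practical consequence of this objective concrete: only reasoning and command tokens contribute to the likelihood, while tool observations are excluded. The token representation (a segment label paired with the model's probability for that token) and the function name are assumptions for illustration.

```python
import math
from typing import List, Tuple

# (segment kind, model probability assigned to the token); toy values.
Token = Tuple[str, float]

TRAINED_SEGMENTS = {"reasoning", "command"}


def sft_log_likelihood(tokens: List[Token]) -> float:
    """Sum log-probabilities over reasoning and command tokens only."""
    return sum(math.log(p) for kind, p in tokens if kind in TRAINED_SEGMENTS)


tokens = [
    ("reasoning", 0.5),
    ("command", 0.25),
    ("observation", 0.01),  # exogenous tool output: excluded from the loss
    ("reasoning", 0.5),
]
ll = sft_log_likelihood(tokens)
```

The low-probability observation token has no effect on the result, which matches the digest's reading that the model is not trained to predict tool outputs.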

The paper's implementation details say all three model sizes use full-parameter finetuning with the vision tower and multimodal projector unfrozen, DeepSpeed ZeRO-3, bfloat16, a 32K-token cutoff, and 256 H20 GPUs for SFT. RL then uses an async SGLang rollout engine with Megatron-LM actor training over 64 H20 GPUs. These details support the "recipe" claim, but they also show that reproducing the exact scale is expensive.

Figure 2 summarizes the RL stage. The RL reward is a composite of format correctness, terminal answer accuracy, and query quality:

\[ r(\tau) = r_{\text{fmt}}(\tau) \cdot \left[\alpha r_{\text{acc}}(\tau) + (1-\alpha) r_{\text{query}}(\tau)\right], \qquad \alpha = 0.8. \]

**Figure 2. RL training pipeline.** Rollouts interact with the real environment, receive composite rewards, and are optimized with fatal-aware masking plus one-sided advantage clamping.
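
The composite reward is simple enough to evaluate by hand; the sketch below does so in code. It assumes each component lies in [0, 1], which the digest does not state, and the function name is hypothetical.

```python
def composite_reward(
    r_fmt: float, r_acc: float, r_query: float, alpha: float = 0.8
) -> float:
    """Format-gated mix of answer accuracy and query quality."""
    return r_fmt * (alpha * r_acc + (1.0 - alpha) * r_query)


# Correct answer with mediocre queries, well-formed output.
good = composite_reward(r_fmt=1.0, r_acc=1.0, r_query=0.5)
# Wrong answer but perfect queries: only the 0.2 query share remains.
wrong = composite_reward(r_fmt=1.0, r_acc=0.0, r_query=1.0)
# Malformed output: the multiplicative format term zeroes everything.
malformed = composite_reward(r_fmt=0.0, r_acc=1.0, r_query=1.0)
```

Because the format term multiplies the bracket rather than adding to it, a malformed trajectory earns nothing even when its answer and queries are good.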

The paper's fatal-state logic is concrete: a trajectory becomes fatal after \(K=3\) consecutive tool-execution errors, including malformed calls, timeouts, or argument-parsing failures. The token mask then keeps generated tokens before the fatal step \(f_i\) and zeros out post-failure generated tokens, where \(s(t)\) is the step that token \(t\) belongs to and \(M_{\text{gen}}\) selects model-generated tokens:

\[ M(y_{i,t}) = M_{\text{gen}}(y_{i,t}) \cdot \mathbb{1}[s(t) < f_i]. \]
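
The fatal-step detection and token mask can be sketched as follows. Several details are assumptions: steps are 0-indexed, the fatal step is taken to be the step at which the third consecutive error occurs, a non-fatal rollout gets a fatal index one past its last step, and the helper names are hypothetical.

```python
from typing import List


def fatal_step(step_errors: List[bool], k: int = 3) -> int:
    """Index of the step where the k-th consecutive error occurs.

    Returns len(step_errors), i.e. one past the last step, if never fatal.
    """
    run = 0
    for step, failed in enumerate(step_errors):
        run = run + 1 if failed else 0
        if run >= k:
            return step
    return len(step_errors)


def token_mask(token_steps: List[int], is_generated: List[bool], f: int) -> List[int]:
    """Keep a token only if it is model-generated and its step precedes f."""
    return [int(gen and s < f) for s, gen in zip(token_steps, is_generated)]


# One rollout with 7 steps; errors at steps 1, 3, 4, 5, 6.
errors = [False, True, False, True, True, True, True]
f = fatal_step(errors)

# Eight tokens tagged with their step; the second one is a tool observation.
steps = [0, 0, 1, 2, 3, 4, 5, 6]
generated = [True, False, True, True, True, True, True, True]
mask = token_mask(steps, generated, f)
```

Under these assumptions the isolated error at step 1 does not trigger the fatal state, and tokens from the fatal step onward are zeroed while the earlier prefix is kept.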

The advantage rule keeps the ordinary group-normalized GRPO advantage \(\widetilde{r}_i\) for non-fatal trajectories but clamps fatal trajectories at zero from below to avoid negative gradients on potentially useful prefixes:

\[ \hat{A}_i = \begin{cases} \widetilde{r}_i & \text{if } f_i = L_i + 1 \text{ (non-fatal)},\\ \max(\widetilde{r}_i, 0) & \text{if } f_i \le L_i \text{ (fatal)}. \end{cases} \]
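
A minimal sketch of the clamped advantage, under stated assumptions: group normalization is taken to be the common mean and standard-deviation form over all rollouts in the group, the small `eps` stabilizer and the function name are hypothetical, and the reward values are invented.

```python
import statistics
from typing import List


def fatal_aware_advantages(
    rewards: List[float], fatal: List[bool], eps: float = 1e-6
) -> List[float]:
    """Normalize over the whole group, then clamp only fatal rollouts at zero."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    normalized = [(r - mean) / (std + eps) for r in rewards]
    return [max(a, 0.0) if is_fatal else a for a, is_fatal in zip(normalized, fatal)]


# Harder group: the fatal rollout (last) still beats the group mean.
adv_hard = fatal_aware_advantages([0.0, 0.0, 0.1, 0.2], [False, False, False, True])

# Easier group: the fatal rollout (last) falls below the successful ones.
adv_easy = fatal_aware_advantages([1.0, 0.9, 0.9, 0.1], [False, False, False, True])
```

In the harder group the fatal rollout keeps a positive advantage, while in the easier group it is clamped to zero; non-fatal rollouts below the mean keep their negative advantage in both cases.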

The final objective is the search-augmented GRPO surrogate over the multimodal environment \(\mathcal{E}\), with the fatal-aware mask \(M_{i,t} = M(y_{i,t})\), the clamped advantage \(\hat{A}_i\), the token-level importance ratio \(\rho_{i,t}(\theta)\), and \(G\) rollouts per group:

\[ \mathcal{J}(\theta)= \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{\sum_t M_{i,t}} \sum_{t=1}^{|\tau_i|} M_{i,t} \min\left( \rho_{i,t}(\theta)\hat{A}_{i}, \mathrm{clip}_{1-\epsilon}^{1+\epsilon}(\rho_{i,t}(\theta))\hat{A}_{i} \right) \right]. \]
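
The masked, clipped surrogate can be evaluated for a single group with plain Python. This is a sketch under assumptions: the clip width `eps=0.2` is a common default rather than a value given in the digest, fully masked rollouts are assumed to contribute nothing, and the inputs are toy numbers.

```python
from typing import List


def clipped_surrogate(
    ratios: List[List[float]],
    masks: List[List[int]],
    advantages: List[float],
    eps: float = 0.2,
) -> float:
    """Group average of per-rollout masked means of the clipped objective."""
    total = 0.0
    for rho_i, m_i, a_i in zip(ratios, masks, advantages):
        kept = sum(m_i)
        if kept == 0:
            continue  # assumption: nothing to learn from a fully masked rollout
        acc = 0.0
        for rho, m in zip(rho_i, m_i):
            clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
            acc += m * min(rho * a_i, clipped * a_i)
        total += acc / kept
    return total / len(ratios)


# Two rollouts of two tokens each; the second rollout's last token is masked.
ratios = [[1.5, 1.0], [0.5, 1.0]]
masks = [[1, 1], [1, 0]]
advantages = [1.0, -1.0]
objective = clipped_surrogate(ratios, masks, advantages)
```

For the first rollout the ratio 1.5 is clipped to 1.2, giving a masked mean of 1.1; for the second, the pessimistic branch yields -0.8 over its single kept token, so the group value is 0.15.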

The key mathematical move is not only the mask. The appendix derives group normalization over all \(G\) rollouts, including fatal ones, then applies the one-sided clamp only when assigning fatal-trajectory advantages. This creates a non-negative bias for fatal trajectories, which the authors explicitly acknowledge, but they argue it is benign because most fatal trajectories are clamped to zero and the preserved tail overlaps the high-score regime of non-fatal rollouts. The aggregate visualization in Figure 5 is the paper's empirical sanity check for that argument.

**Figure 5. Aggregate fatal clamping distribution.** Across 10,000 groups and 47,978 fatal rollouts, the paper reports that 91.8 percent fall below zero and are clamped, while 8.2 percent are preserved.
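
To give a sense of scale, the counts implied by these figures can be worked out directly. The results are approximate because the reported percentages are rounded.

```latex
\[
47{,}978 \times 0.918 \approx 44{,}044 \ \text{(clamped)}, \qquad
47{,}978 \times 0.082 \approx 3{,}934 \ \text{(preserved)}, \qquad
\frac{47{,}978}{10{,}000} \approx 4.8 \ \text{fatal rollouts per group}.
\]
```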

Experiments And Results

The evaluation covers seven knowledge-intensive multimodal search benchmarks: SimpleVQA, VDR, MMSearch, LiveVQA, BrowseComp-VL, FVQA, and InfoSeek. Baselines are grouped into direct reasoning, RAG workflow, and agentic workflow. Correctness is judged by a GPT-4o prompt aligned with the VDR-style evaluation protocol. Table 4 condenses the main benchmark table around the key comparisons.

Comparison	Avg Pass@1	Main interpretation
GPT-5 direct reasoning	45.1	Strong closed baseline without an explicit agent loop.
Gemini-2.5-Pro direct reasoning	46.0	Best direct-reasoning average in the source table.
GPT-5 RAG workflow	53.6	Retrieval helps, but the workflow is still not the trained OpenSearch-VL agent.
SenseNova-MARS-8B agentic	52.7	Strongest prior 8B-scale open agentic baseline in the table.
OpenSearch-VL-8B	56.6	Beats SenseNova-MARS-8B by 3.9 average points.
Qwen3-VL-30B-A3B agentic baseline	47.8	Same model family before the OpenSearch-VL recipe.
OpenSearch-VL-30B-A3B	61.6	Improves the corresponding Qwen3-VL agentic baseline by 13.8 average points.
Qwen3-VL-32B agentic baseline	48.0	Same model family at 32B.
OpenSearch-VL-32B	63.7	Highest average in the source table.

Table 4. Main result digest. The source table also reports per-benchmark values; for the 30B-A3B comparison, the paper highlights gains of +13.3 on VDR, +24.5 on MMSearch, +10.2 on FVQA, and +16.2 on InfoSeek.

The result pattern is consistent with the paper's thesis: plain direct reasoning is weaker than retrieval-augmented settings, but the largest gains come from a trained agentic loop with visual tools and RL. The 8B model already beats the listed 8B agentic baselines, and the 30B/32B variants exceed the corresponding Qwen3-VL agentic baselines by double-digit average margins.

The ablations in Table 5 are especially important because they test the recipe components rather than only the final benchmark score.

Ablation family	Setting	Avg score	Delta / conclusion
SFT data	Full pipeline	64.6	Reference pipeline.
SFT data	Without source-anchor grounding	53.1	-11.5, shortcut-resistant grounding matters.
SFT data	Without fuzzy entity rewriting	54.3	-10.3, entity leakage is a major risk.
SFT data	Without staged filtering	56.4	-8.2, easy examples dilute the target behavior.
SFT data	Without enhancement subset	63.3	-1.3, image repair improves robustness but is not the main gain.
RL recipe	Qwen3-VL-8B baseline	53.7	Pre-SFT baseline.
RL recipe	+ SFT only	64.6	SFT provides the largest first jump.
RL recipe	+ Vanilla GRPO	67.6	Online exploration helps.
RL recipe	+ Hard masking	67.7	Nearly flat relative to vanilla GRPO.
RL recipe	+ Fatal masking only	69.1	Prefix preservation helps.
RL recipe	+ Fatal masking + one-sided clamp	71.8	Best reported 8B ablation, +4.2 over vanilla GRPO.

Table 5. Ablation digest. The top panel supports the data-pipeline design, and the bottom panel supports fatal-aware RL over vanilla or hard-masking variants.
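
The deltas quoted in Table 5 follow directly from the average scores, as the short check below shows. The scores are copied from the table; the dictionary and function names are illustrative.

```python
from typing import Dict

SFT_ABLATION: Dict[str, float] = {
    "Full pipeline": 64.6,
    "Without source-anchor grounding": 53.1,
    "Without fuzzy entity rewriting": 54.3,
    "Without staged filtering": 56.4,
    "Without enhancement subset": 63.3,
}

RL_ABLATION: Dict[str, float] = {
    "Qwen3-VL-8B baseline": 53.7,
    "+ SFT only": 64.6,
    "+ Vanilla GRPO": 67.6,
    "+ Hard masking": 67.7,
    "+ Fatal masking only": 69.1,
    "+ Fatal masking + one-sided clamp": 71.8,
}


def deltas(scores: Dict[str, float], reference: str) -> Dict[str, float]:
    """Score difference of every setting against the named reference row."""
    base = scores[reference]
    return {name: round(s - base, 1) for name, s in scores.items() if name != reference}


sft_deltas = deltas(SFT_ABLATION, "Full pipeline")
rl_deltas = deltas(RL_ABLATION, "+ Vanilla GRPO")

for name, d in {**sft_deltas, **rl_deltas}.items():
    print(f"{name}: {d:+.1f}")
```

Measured against vanilla GRPO, hard masking adds 0.1 points, fatal masking alone adds 1.5, and fatal masking with the one-sided clamp adds 4.2, which is why the digest describes hard masking as nearly flat.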

The training curves in Figure 3 reinforce the ablation table by showing that fatal-aware GRPO maintains longer rollouts while reaching higher batch-level accuracy than vanilla GRPO and hard masking.

**Figure 3. RL training dynamics.** The source figure plots average rollout turns and batch accuracy during RL training.

Figure 4 gives the intuition behind the one-sided clamp. In harder groups, some fatal prefixes still exceed the group mean and should be reinforced. In easier groups, fatal rollouts score below the successful rollouts, so their advantages are set to zero instead of producing a negative gradient on their prefixes.

**Figure 4. Two clamping cases.** The source figure contrasts preserved positive fatal prefixes with clamped sub-mean fatal prefixes.

The case study in Figure 7 shows the intended runtime behavior: crop first, image search to identify the bridge, then text search to verify the opening year. This is a qualitative example rather than benchmark evidence, but it makes the "verify, do not guess" behavior concrete.

**Figure 7. Case study.** The agent crops a road sign, uses image search to identify Kessock Bridge, then uses text search to verify the answer as 1982.

The limitations section is candid about several remaining risks. The tool environment can drift because web search ranking, fetching, and summarization can fail or change. The composite reward uses proprietary GPT-4o/GPT-5.4-style judges, which raises real concerns about cost, version dependence, and open reproducibility. The query-quality reward scores textual query behavior and does not fully judge visual operations such as crop quality. Finally, the paper does not report multi-seed error bars for the large-scale runs, so the source table supports the reported recipe but not fine-grained variance claims.

Practical Takeaways

For builders of multimodal search agents, the paper's most useful takeaway is that data design and environment design are inseparable. If the dataset can be solved by entity-name leakage or one reverse-image call, RL will optimize the wrong behavior. OpenSearch-VL therefore makes the visual anchor useful but not answer-revealing, filters out no-tool and one-tool examples, and uses process judges to select trajectories that actually chain tools.

The second takeaway is that retrieval-only agents are incomplete for real images. The tool suite treats crop, OCR, sharpening, super-resolution, and perspective correction as first-class actions, not preprocessing hacks. That is a practical design choice for documents, signs, charts, screenshots, and low-quality photos where the agent must repair perception before it can retrieve facts.

The third takeaway is about RL failure handling. In long-horizon tool environments, a rollout can contain useful early reasoning followed by an unrecoverable tool-error cascade. Hard-discarding the rollout wastes data, while training on the suffix injects noise. Fatal-aware masking plus one-sided clamping is a pragmatic compromise: keep useful prefixes, suppress invalid suffixes, and never push the model away from a valid prefix simply because later tool calls failed.

The fourth takeaway is cautionary. The recipe is operationally heavy: large GPU counts, online tools, external APIs, and proprietary judges all appear in the source. The paper is valuable as a detailed recipe and internal evidence package, but exact reproduction will depend on whether the released code/data/models and tool stack match the paper's described environment.

Reference Coverage

Anchor coverage links: open recipe, data pipeline, tool environment, training setup, fatal-aware GRPO, key equations, main results, ablation, case study, limitations, Table 1, Table 2, Table 3, Table 4, Table 5, Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, and Figure 7.