arXiv 2026 | avg 5.55 | interest 9.00 | 17 HF | VLM fusion | cross-modal representation | multimodal reasoning

LoMo addresses VLM fragility when semantically equivalent content is moved between text and rendered-image carriers. It creates interleaved multimodal supervision by rendering selected text spans as images, improving cross-modal invariance and multimodal reasoning across 13 benchmarks.

Source-first digest for checked paper rank 9, rank_id p013.

Motivation / Background

The paper starts from a concrete failure mode in current VLMs: semantically identical text can produce different behavior depending on whether it is presented as tokens or rendered as an image. The authors call this carrier sensitivity. In the motivating experiment, the same questions are evaluated under normal text input and under a rendered-text-as-image protocol. Accuracy drops for all four tested VLMs; for example, Qwen3.5-9B falls from 68.95 to 60.25 and LLaVA-OV1.5-8B falls from 53.52 to 41.84.

The paper argues that this is not just OCR weakness. It measures hidden-state distance between a text input and its rendered-image counterpart, then shows a monotonic relationship: samples in the closest distance bin lose 7.75 points, while samples in the farthest bin lose 21.23 points. Figure 1 is the central motivation figure because it connects the observed accuracy drop, the pairwise cross-modal distance, and LoMo's claimed 14.2% reduction in average distance.

Figure 1. Carrier sensitivity and modality gap. The original caption says current VLMs exhibit carrier sensitivity driven by an underlying modality gap. It shows accuracy drops when questions are rendered as images, a monotonic link between pairwise cross-modal distance and accuracy drop, and LoMo's shift toward smaller cross-modal distances.

LoMo is the proposed data-side fix. Instead of changing the VLM architecture, it takes a text-only training example, renders a local span as an image, perturbs that image, and inserts it back into the surrounding text. The model must then answer from a text -> visual -> text input while the supervision target remains unchanged. The design is summarized in Figure 2.

Claims And Evidence

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Carrier sensitivity is a systematic VLM failure mode and correlates with a measurable cross-carrier modality gap. | 4 | carrier sensitivity, rendered-evaluation results, alignment metrics |
| C2 | Local modality substitution gives cross-carrier supervision without architecture changes, extra annotations, or inference overhead. | 4 | LoMo transformation, method overview, training setup |
| C3 | LoMo improves standard multimodal benchmark accuracy across two backbones and 13 benchmarks. | 5 | main results, radar figure |
| C4 | LoMo is especially helpful when text is delivered through pixels, sharply reducing the rendered-evaluation degradation. | 5 | main results, carrier sensitivity figure |
| C5 | The middle-span interleaving structure and perceptual distortion are meaningful design choices rather than incidental details. | 4 | component ablation, rewrite ratio, rendering position |
| C6 | The gain is not merely due to increasing the number of image-bearing training samples. | 4 | matched image:text ratio |
| C7 | LoMo does not substantially damage pure-text abilities in the reported setup, but the evidence is bounded by model families and training-scale choices. | 3 | pure-text sanity check, limitations |
| C8 | LoMo improves internal cross-modal alignment metrics alongside task accuracy. | 4 | alignment decomposition, data-scale figure |

Scores are support-from-paper scores, not independent reproduction scores. C1 is capped at 4 because the distance analysis is correlational even though the trend is clean. C7 is capped at 3 because the pure-text gains are small and one metric regresses on each backbone (MMLU-Pro for LLaVA-OV1.5-8B, GSM8K for Qwen3.5-9B).

Core Technical Idea

LoMo is a data transformation, not a new model. Given a text-only supervised example \((x, a)\), it selects a middle span \(x_{\text{mid}}\), renders that span into an image, applies a visual perturbation, and returns a mixed-carrier input whose answer is still \(a\).

$$ (x_{\text{pre}},\, x_{\text{mid}},\, x_{\text{suf}}) = \mathcal{S}(x), \qquad I' = \mathcal{A}\big(\mathcal{R}(x_{\text{mid}})\big). $$
$$ \mathcal{T}(x) \triangleq (x_{\text{pre}},\, I',\, x_{\text{suf}}), \qquad (x, a) \;\longrightarrow\; \big((x_{\text{pre}},\, I',\, x_{\text{suf}}),\, a\big). $$

The important supervision signal is that the target answer is unchanged, so the model must learn to treat a rendered local text span as a semantic continuation of the surrounding token text. Figure 2 makes the mechanism concrete: choose a coherent span, render it, optionally distort it, and insert it into the original position.

Figure 2. Overview of LoMo. The original caption describes the three stages: structure-aware span localization, visual rendering, and perceptual distortion. I place it here because it is the clearest depiction of the text -> visual carrier -> text training instance.

The paper also rewrites the SFT objective to expose the intended alignment pressure. Standard SFT optimizes:

$$ \mathcal{L}_{\text{SFT}}(\theta; x, a) = -\log p_\theta(a \mid x). $$

LoMo instead optimizes the same answer under the transformed carrier:

$$ \mathcal{L}_{\text{LoMo}}(\theta; x, a) = -\log p_\theta\!\big(a \,\big|\, \mathcal{T}(x)\big) = \underbrace{-\log p_\theta(a \mid x)}_{\text{standard SFT supervision}} \;+\; \underbrace{\log \frac{p_\theta(a \mid x)}{p_\theta\!\big(a \,\big|\, \mathcal{T}(x)\big)}}_{\text{cross-carrier alignment}}. $$

Taking expectation over \(a \sim p_\theta(\cdot \mid x)\) turns the second term into a KL-style agreement pressure:

$$ \begin{aligned} \mathbb{E}_{a \sim p_\theta(\cdot \mid x)}\!\!\left[-\log p_\theta\!\big(a \,\big|\, \mathcal{T}(x)\big)\right] &= \mathbb{E}_{a \sim p_\theta(\cdot \mid x)}\!\!\left[-\log p_\theta(a \mid x)\right] \\ &\quad+ \mathrm{KL}\!\Big(p_\theta(\cdot \mid x) \,\Big\Vert\, p_\theta\!\big(\cdot \,\big|\, \mathcal{T}(x)\big)\Big). \end{aligned} $$

The claim is not that LoMo explicitly minimizes a separate KL loss in implementation. The claim is that training on semantically equivalent mixed-carrier inputs turns the ordinary answer loss into an implicit pressure for text and rendered-text carriers to agree.
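The expectation step is the standard identity that cross-entropy equals entropy plus KL divergence. A minimal numeric check, assuming two hypothetical three-way answer distributions (the numbers are illustrative and do not come from the paper):

```python
import math


def cross_entropy(p, q):
    """E_{a ~ p}[-log q(a)] for discrete distributions given as lists."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical answer distributions over three candidate answers.
p_text = [0.7, 0.2, 0.1]   # stands in for p_theta(. | x)
p_mixed = [0.5, 0.3, 0.2]  # stands in for p_theta(. | T(x))

# Left side: expected LoMo loss when answers are drawn from p_theta(. | x).
lhs = cross_entropy(p_text, p_mixed)
# Right side: expected SFT loss (entropy of p_text) plus the KL term.
rhs = cross_entropy(p_text, p_text) + kl(p_text, p_mixed)
gap = abs(lhs - rhs)
```

The two sides agree up to floating-point error, and the KL term vanishes only when the text and mixed-carrier distributions coincide, which is the agreement pressure the paragraph above describes.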

Method Details

Span Localization, Rendering, And Distortion

LoMo's pipeline is intentionally simple:

1. If the prompt has at most three sentences, render the whole text as the target span. Otherwise, split the input with a formula-aware chunker.
2. Treat text and formula blocks as atomic units so equations are not cut in half:

$$ x \mapsto ((t_1, l_1), (m_1, l_2), \dots, (t_n, l_{2n-1})). $$

3. Select the middle one-third of the block sequence as \(x_{\text{mid}}\), keeping prefix and suffix in text.
4. Render math spans with a LaTeX renderer and non-math spans with a text renderer. If LaTeX rendering fails, fall back to text rendering rather than dropping the sample.
5. Trim excess margins and sample one of Clean, Rotate, Blur, ShadowOrStain, or Wave as the final perceptual distortion.
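Steps 1 to 3 can be sketched as follows. This is not the paper's chunker: the set of math delimiters, the sentence-splitting rule, and the rounding of the middle-third boundaries are all assumptions, and the names `formula_aware_chunk` and `extract_middle` are hypothetical.

```python
import re

# Assumed math delimiters: $$...$$, \(...\), \[...\], and $...$.
_MATH = re.compile(r"(\$\$.+?\$\$|\\\(.+?\\\)|\\\[.+?\\\]|\$[^$]+\$)", re.DOTALL)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def formula_aware_chunk(text):
    """Split text into ("text" | "math", content) blocks, keeping formulas whole."""
    blocks = []
    # re.split with one capturing group alternates prose (even index)
    # and matched formulas (odd index).
    for i, piece in enumerate(_MATH.split(text)):
        if i % 2 == 1:
            blocks.append(("math", piece))
            continue
        for sentence in _SENTENCE_END.split(piece.strip()):
            if sentence:
                blocks.append(("text", sentence))
    return blocks


def extract_middle(blocks):
    """Return (prefix, middle, suffix); middle is roughly the central third."""
    n = len(blocks)
    start = n // 3
    # Keep at least one block in the middle whenever any block exists.
    end = max(start + 1, (2 * n) // 3)
    return blocks[:start], blocks[start:end], blocks[end:]


prompt = "First sentence. Second one has $a^2 + b^2$ inside. Third. Fourth."
pre, mid, suf = extract_middle(formula_aware_chunk(prompt))
```

Because formulas are emitted as single blocks, the middle-span boundaries can only fall between blocks, so an equation is either rendered whole or left whole in text.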

The appendix algorithm can be summarized as:

Input: text-only instance (x, a)
If SentenceCount(x) <= 3:
  (x_pre, x_mid, x_suf) = (empty, x, empty)
Else:
  C = FormulaAwareChunk(x)
  (x_pre, x_mid, x_suf) = ExtractMiddle(C)
If ContainsMath(x_mid):
  I = LaTeXRender(x_mid), falling back to TextRender
Else:
  I = TextRender(x_mid)
I = TrimMargin(I)
I' = A(I), with the distortion A sampled from Clean, Rotate, Blur, ShadowOrStain, Wave
Return ((x_pre, I', x_suf), a)
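The control flow of the summary above can be written as a small runnable sketch. Several pieces are placeholders rather than the paper's implementation: `split_sentences` stands in for the formula-aware chunker, `contains_math` is a crude delimiter check, no LaTeX-failure fallback is modeled, and `image_spec` is a dictionary describing what would be rendered instead of actual pixels.

```python
import random
import re

DISTORTIONS = ("Clean", "Rotate", "Blur", "ShadowOrStain", "Wave")
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Placeholder chunker: plain sentence splitting, not formula-aware."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def contains_math(text):
    """Crude check for LaTeX delimiters; an assumption, not the paper's rule."""
    return bool(re.search(r"\$|\\\(|\\\[", text))


def lomo_transform(x, a, chunk=split_sentences, rng=None):
    """Map a text-only example (x, a) to ((x_pre, image_spec, x_suf), a)."""
    rng = rng or random.Random()
    if len(split_sentences(x)) <= 3:
        # Short prompts are rendered whole.
        pre, mid, suf = "", x, ""
    else:
        blocks = chunk(x)
        n = len(blocks)
        start = n // 3
        end = max(start + 1, (2 * n) // 3)
        pre, mid, suf = (" ".join(part) for part in
                         (blocks[:start], blocks[start:end], blocks[end:]))
    # image_spec describes what would be rendered; no pixels are produced here.
    image_spec = {
        "source": mid,
        "renderer": "latex" if contains_math(mid) else "text",
        "trim_margin": True,
        "distortion": rng.choice(DISTORTIONS),
    }
    return (pre, image_spec, suf), a


prompt = "One. Two. Three. Four. Five. Six."
(x_pre, image_spec, x_suf), answer = lomo_transform(prompt, "42", rng=random.Random(0))
```

The answer is passed through untouched, which mirrors the point that only the input carrier changes while the supervision target stays fixed.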

Training Setup

The experiments use two open VLM backbones: LLaVA-OneVision-1.5-8B-Base and Qwen3.5-9B-Base. The training pool is the LLaVA-OneVision 1.5 SFT corpus: 2M multimodal instruction examples plus 2M text-only instruction examples. Standard SFT trains on the same 4M examples without rendering; LoMo renders 50% of the text-only examples and keeps the optimizer, learning-rate schedule, number of steps, and data scale matched.
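A short sketch of the data mixing, assuming the rewritten subset is chosen by uniform random sampling (the digest does not say how the 50% is selected). The function names and the seed are hypothetical. The second helper computes the resulting ratio of image-bearing to text-only samples.

```python
import random


def apply_rewrite_ratio(text_only, ratio, transform, seed=0):
    """Apply transform to a fixed fraction of text-only examples."""
    rng = random.Random(seed)
    k = round(len(text_only) * ratio)
    chosen = set(rng.sample(range(len(text_only)), k))
    return [transform(ex) if i in chosen else ex
            for i, ex in enumerate(text_only)]


def image_to_text_ratio(n_multimodal, n_text_only, ratio):
    """Image-bearing : text-only sample ratio after rewriting."""
    rewritten = round(n_text_only * ratio)
    remaining = n_text_only - rewritten
    image_bearing = n_multimodal + rewritten
    return float("inf") if remaining == 0 else image_bearing / remaining


# Toy pool of 8 text-only prompts; the lambda stands in for the LoMo transform.
mixed = apply_rewrite_ratio(["q%d" % i for i in range(8)], 0.5,
                            lambda ex: ("IMG", ex))
# 2M multimodal + 2M text-only with a 50% rewrite ratio.
ratio = image_to_text_ratio(2_000_000, 2_000_000, 0.5)
```

With 2M multimodal and 2M text-only examples, a 50% rewrite ratio leaves 3M image-bearing and 1M text-only samples, so `ratio` is 3.0 while the total stays at 4M.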

Implementation details are conventional SFT rather than a new architecture: one node with 8 NVIDIA H200 GPUs, maximum sequence length 32,768, maximum image resolution 2,560,000 pixels, FlashAttention 2, bf16, DeepSpeed ZeRO Stage 1, sequence packing, AdamW, one epoch, frozen vision tower, and updates to the language model plus multimodal projector.

Evaluation Protocols And Metrics

The paper evaluates each benchmark two ways. Standard Evaluation uses the original image plus text question. Rendered Evaluation renders the whole question as an image and uses that rendered image in place of text. Because the linguistic content is identical, the rendered protocol isolates carrier sensitivity.

For internal alignment, the paper uses MIR as a set-level visual/text token distribution gap and a paired distance:

$$ d = 1 - \cos(\bar{h}_{\text{text}},\, \bar{h}_{\text{img}}). $$

These metrics matter because Figure 4 shows that LoMo improves both task accuracy and representation alignment as the data scale grows.
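The paired distance can be computed as below. The sketch assumes that the bar in \(\bar{h}\) denotes mean pooling over token hidden states, which the digest does not state explicitly; the helper names are hypothetical and the vectors are toy values.

```python
import math


def mean_pool(hidden_states):
    """Average a list of per-token vectors into a single vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def paired_distance(h_text, h_img):
    """d = 1 - cos(mean(h_text), mean(h_img)) for one text/image pair."""
    t, v = mean_pool(h_text), mean_pool(h_img)
    dot = sum(a * b for a, b in zip(t, v))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Toy hidden states: two text tokens and one image token, dimension 2.
same = paired_distance([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 4.0]])
orthogonal = paired_distance([[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0]])
```

Identical pooled representations give a distance near 0 and orthogonal ones give 1, so a lower \(d\) means the text and rendered-image carriers land closer together in hidden-state space.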

Experiments And Results

Main Benchmark Results

Table 1 is the main evidence for the paper's performance claims. Under Standard Evaluation, LoMo improves the average by +2.68 for LLaVA-OV1.5-8B and +2.82 for Qwen3.5-9B. Under Rendered Evaluation, the average gains grow to +18.86 and +11.92, respectively. This is the strongest support for the claim that LoMo specifically attacks carrier sensitivity.

Table 1. Main results across 13 multimodal benchmarks under two evaluation protocols. Original caption: Standard Evaluation feeds the original multimodal inputs (image + text question) to the model. Rendered Evaluation renders the entire text question as a single image. \(\Delta\) denotes the absolute change of LoMo over Standard SFT.

| Protocol | Model | Setting | MMMU | MMMU-P | MathV | ZeroB | WeMath | SVQA | HalB | MM-IF | MLB-Doc | DocVQA | CC-OCR | V* | CntB | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | LLaVA-OV1.5-8B | Standard SFT | 51.78 | 35.24 | 51.30 | 10.18 | 22.76 | 35.51 | 40.35 | 58.40 | 15.49 | 73.05 | 46.76 | 47.71 | 42.97 | 40.88 |
| Standard | LLaVA-OV1.5-8B | + LoMo | 51.22 | 36.36 | 53.90 | 11.98 | 25.14 | 38.62 | 42.82 | 61.61 | 18.06 | 74.77 | 48.97 | 51.70 | 51.12 | 43.56 |
| Standard | LLaVA-OV1.5-8B | Delta | -0.56 | +1.12 | +2.60 | +1.80 | +2.38 | +3.11 | +2.47 | +3.21 | +2.57 | +1.72 | +2.21 | +3.99 | +8.15 | +2.68 |
| Standard | Qwen3.5-9B | Standard SFT | 59.44 | 39.83 | 67.20 | 16.54 | 38.19 | 42.57 | 51.53 | 51.74 | 40.79 | 85.89 | 65.61 | 68.13 | 80.08 | 54.43 |
| Standard | Qwen3.5-9B | + LoMo | 63.00 | 46.18 | 66.60 | 17.66 | 40.00 | 43.51 | 52.99 | 57.23 | 42.90 | 91.99 | 65.31 | 71.86 | 85.01 | 57.25 |
| Standard | Qwen3.5-9B | Delta | +3.56 | +6.35 | -0.60 | +1.12 | +1.81 | +0.94 | +1.46 | +5.49 | +2.11 | +6.10 | -0.30 | +3.73 | +4.93 | +2.82 |
| Rendered | LLaVA-OV1.5-8B | Standard SFT | 21.22 | 16.01 | 18.10 | 3.59 | 0.95 | 22.42 | 41.36 | 28.83 | 3.94 | 15.38 | 14.49 | 8.12 | 3.67 | 15.24 |
| Rendered | LLaVA-OV1.5-8B | + LoMo | 35.56 | 27.59 | 39.50 | 7.63 | 11.33 | 31.06 | 61.65 | 48.83 | 6.69 | 58.89 | 39.89 | 36.13 | 38.49 | 34.10 |
| Rendered | LLaVA-OV1.5-8B | Delta | +14.34 | +11.58 | +21.40 | +4.04 | +10.38 | +8.64 | +20.29 | +20.00 | +2.75 | +43.51 | +25.40 | +28.01 | +34.82 | +18.86 |
| Rendered | Qwen3.5-9B | Standard SFT | 49.52 | 33.06 | 56.90 | 15.94 | 23.43 | 39.95 | 43.37 | 41.25 | 28.87 | 47.24 | 49.60 | 61.58 | 71.66 | 43.26 |
| Rendered | Qwen3.5-9B | + LoMo | 62.48 | 45.29 | 65.50 | 16.62 | 39.72 | 43.26 | 47.05 | 54.32 | 36.57 | 91.73 | 66.14 | 64.73 | 83.98 | 55.18 |
| Rendered | Qwen3.5-9B | Delta | +12.96 | +12.23 | +8.60 | +0.68 | +16.29 | +3.31 | +3.68 | +13.07 | +7.70 | +44.49 | +16.54 | +3.15 | +12.32 | +11.92 |
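As a worked check of how the Delta and Avg. columns relate, the snippet below recomputes them for the Standard-protocol LLaVA-OV1.5-8B rows. It uses only the two-decimal values printed in Table 1, so any difference from unrounded scores is invisible here; the variable names are hypothetical.

```python
# Standard-protocol LLaVA-OV1.5-8B rows copied from Table 1 (13 benchmarks).
standard_sft = [51.78, 35.24, 51.30, 10.18, 22.76, 35.51, 40.35,
                58.40, 15.49, 73.05, 46.76, 47.71, 42.97]
lomo = [51.22, 36.36, 53.90, 11.98, 25.14, 38.62, 42.82,
        61.61, 18.06, 74.77, 48.97, 51.70, 51.12]

# Per-benchmark absolute change of LoMo over Standard SFT.
deltas = [round(l - s, 2) for s, l in zip(standard_sft, lomo)]

# Row averages, rounded to two decimals as in the table.
avg_sft = round(sum(standard_sft) / len(standard_sft), 2)
avg_lomo = round(sum(lomo) / len(lomo), 2)

# The reported average Delta matches the difference of the rounded averages.
delta_of_averages = round(avg_lomo - avg_sft, 2)
# Averaging the printed per-benchmark deltas lands 0.01 lower after rounding.
mean_of_deltas = round(sum(deltas) / len(deltas), 2)
```

The averages reproduce 40.88 and 43.56, and their difference reproduces the reported +2.68. Averaging the thirteen printed deltas gives 2.67 after rounding, a discrepancy small enough to be a rounding artifact of the two-decimal table entries.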

The radar figure gives the compact visual version of the Standard Evaluation results. It is useful because it shows that the average gain is not concentrated in a single category: LoMo improves most axes for both backbones, with small regressions on LLaVA MMMU and Qwen MathVista/CC-OCR.

Figure 3. Standard-evaluation radar results. The original caption reports consistent improvements over Standard SFT across two backbones: +2.68 on LLaVA-OV1.5-8B and +2.82 on Qwen3.5-9B over 13 benchmarks.

Representation Alignment And Data Scale

Figure 4 is the main evidence that LoMo's benefits track alignment, not only downstream accuracy. At the 4M scale, LoMo's average accuracy gain reaches +2.68, MIR is 0.122 lower than Standard SFT, and paired cross-modal distance drops to 0.49 while Standard SFT rises to 0.57.

Figure 4. Data-scale ablations. The original caption says LoMo consistently outperforms standard SFT across data scales on average accuracy, mean MIR, and pairwise cross-modal distance.

Component Ablations

Table 2 separates the pieces of LoMo. Full-Text Rendering improves the average from 40.88 to 42.07, LoMo without perceptual distortion reaches 43.10, and full LoMo reaches 43.56. The paper's interpretation is that span localization and interleaving are the dominant ingredients, while perceptual distortion adds a further, smaller gain.

Table 2. Component ablation of LoMo on LLaVA-OV1.5-8B. Original caption: Full-Text Rendering renders the entire input as images without Structure-Aware Span Localization or Perceptual Distortion. PD: Perceptual Distortion.

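| Method | MMMU | MMMU-P | MathV | ZeroB | WeMath | SVQA | HalB | MM-IF | MLB-Doc | DocVQA | CC-OCR | V* | CntB | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard SFT | 51.78 | 35.24 | 51.30 | 10.18 | 22.76 | 35.51 | 40.35 | 58.40 | 15.49 | 73.05 | 46.76 | 47.71 | 42.97 | 40.88 |
| + Full-Text Rendering | 50.56 | 35.53 | 55.90 | 11.68 | 22.76 | 35.60 | 39.45 | 59.51 | 16.87 | 71.80 | 47.35 | 47.32 | 52.55 | 42.07 |
| + LoMo w/o PD | 51.00 | 35.38 | 55.40 | 11.98 | 25.33 | 36.59 | 45.17 | 60.00 | 16.59 | 73.77 | 48.15 | 50.65 | 50.31 | 43.10 |
| + LoMo | 51.22 | 36.36 | 53.90 | 11.98 | 25.14 | 38.62 | 42.82 | 61.61 | 18.06 | 74.77 | 48.97 | 51.70 | 51.12 | 43.56 |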

Rewrite Ratio, Position, And Matched Exposure

Table 3 shows that rewriting some text-only data is consistently useful, but rewriting all of it is not best. The average peaks at a 50% rewrite ratio, 43.56, then falls to 42.68 at 100%, suggesting that LoMo needs a mixed diet of normal text and interleaved text-image examples.

Table 3. Quantitative results of different rewrite ratios on LLaVA-OV1.5-8B. Original caption: The rewrite ratio controls the fraction of text-only training samples reformatted into text-image interleaved sequences with LoMo. \(\Delta\) denotes the average improvement over the Standard SFT baseline.

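| Setting | Rewrite ratio | MMMU | MMMU-P | MathV | ZeroB | WeMath | SVQA | HalB | MM-IF | MLB-Doc | DocVQA | CC-OCR | V* | CntB | Avg. | Delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard SFT | 0% | 51.78 | 35.24 | 51.30 | 10.18 | 22.76 | 35.51 | 40.35 | 58.40 | 15.49 | 73.05 | 46.76 | 47.71 | 42.97 | 40.88 | -- |
| LoMo | 25% | 50.44 | 37.19 | 51.60 | 12.13 | 24.67 | 37.33 | 42.84 | 62.58 | 17.14 | 74.52 | 49.84 | 51.18 | 46.24 | 42.90 | +2.02 |
| LoMo | 50% | 51.22 | 36.36 | 53.90 | 11.98 | 25.14 | 38.62 | 42.82 | 61.61 | 18.06 | 74.77 | 48.97 | 51.70 | 51.12 | 43.56 | +2.68 |
| LoMo | 75% | 50.56 | 36.13 | 52.40 | 13.62 | 27.14 | 37.23 | 44.25 | 61.22 | 17.51 | 74.74 | 49.19 | 49.08 | 49.03 | 43.24 | +2.36 |
| LoMo | 100% | 50.67 | 36.32 | 53.80 | 11.23 | 26.38 | 37.63 | 41.37 | 61.07 | 17.14 | 74.77 | 48.23 | 50.07 | 46.18 | 42.68 | +1.80 |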

Table 4 tests where the rendered span should sit. Middle placement is best at 43.56; prefix, suffix, and multi-span are all lower. This supports the method rationale that a text-image-text structure forces fusion between visual and textual carriers.

Table 4. Quantitative results of different rendering positions on LLaVA-OV1.5-8B. Original caption: Prefix renders the first one-third of the prompt, Middle renders the central one-third, Suffix renders the last one-third, and Multi-Span renders two short spans.

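| Setting | Position | MMMU | MMMU-P | MathV | ZeroB | WeMath | SVQA | HalB | MM-IF | MLB-Doc | DocVQA | CC-OCR | V* | CntB | Avg. | Delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard SFT | -- | 51.78 | 35.24 | 51.30 | 10.18 | 22.76 | 35.51 | 40.35 | 58.40 | 15.49 | 73.05 | 46.76 | 47.71 | 42.97 | 40.88 | -- |
| LoMo | Prefix | 52.89 | 36.49 | 53.50 | 11.83 | 24.00 | 35.85 | 40.75 | 59.06 | 15.77 | 74.60 | 47.98 | 51.70 | 47.21 | 42.44 | +1.56 |
| LoMo | Middle | 51.22 | 36.36 | 53.90 | 11.98 | 25.14 | 38.62 | 42.82 | 61.61 | 18.06 | 74.77 | 48.97 | 51.70 | 51.12 | 43.56 | +2.68 |
| LoMo | Suffix | 50.78 | 36.71 | 52.10 | 10.63 | 26.10 | 37.58 | 40.20 | 60.64 | 15.12 | 75.58 | 49.59 | 49.93 | 45.29 | 42.33 | +1.45 |
| LoMo | Multi-Span | 50.89 | 37.22 | 55.00 | 10.48 | 24.57 | 37.48 | 41.49 | 62.01 | 16.32 | 75.30 | 49.72 | 50.52 | 43.29 | 42.64 | +1.76 |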

Table 5 addresses a potential confound: LoMo converts some text-only samples into image-bearing samples, so perhaps the benefit comes from more visual exposure. The matched 1:1 image:text comparison still gives +2.45 average points over Standard SFT, close to the original +2.68, so this concern is partially controlled.

Table 5. Controlled comparison under different image-bearing to text-only sample ratios on LLaVA-OV1.5-8B. Original caption: The original setting uses LoMo's natural 3:1 ratio after rewriting, while the matched setting controls the effective ratio back to 1:1 to match Standard SFT.

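| Setting | Image:Text ratio | MMMU | MMMU-P | MathV | ZeroB | WeMath | SVQA | HalB | MM-IF | MLB-Doc | DocVQA | CC-OCR | V* | CntB | Avg. | Delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard SFT | 1:1 | 51.78 | 35.24 | 51.30 | 10.18 | 22.76 | 35.51 | 40.35 | 58.40 | 15.49 | 73.05 | 46.76 | 47.71 | 42.97 | 40.88 | -- |
| LoMo | 3:1 Original | 51.22 | 36.36 | 53.90 | 11.98 | 25.14 | 38.62 | 42.82 | 61.61 | 18.06 | 74.77 | 48.97 | 51.70 | 51.12 | 43.56 | +2.68 |
| LoMo | 1:1 Matched | 51.44 | 36.24 | 51.40 | 12.57 | 24.86 | 38.27 | 43.70 | 62.10 | 18.88 | 74.02 | 48.37 | 51.57 | 49.97 | 43.33 | +2.45 |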

Pure-Text Sanity Check

Table 6 is a narrow safety check. LoMo slightly improves the average on both backbones, +0.28 and +0.58, but the changes are small and not uniformly positive. The most important reading is "no obvious pure-text collapse" rather than a strong pure-text improvement claim.

Table 6. Pure-text capability sanity check. Original caption: Standard SFT and LoMo are evaluated on five pure-text benchmarks covering general knowledge, mathematical reasoning, code generation, and instruction following.

| Model | Setting | MMLU-Pro | HumanEval | LCB-V6 | GSM8K | IFEval | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-OV1.5-8B | Standard SFT | 62.58 | 82.62 | 29.19 | 92.87 | 75.42 | 68.54 |
| LLaVA-OV1.5-8B | + LoMo | 62.27 | 82.93 | 29.91 | 93.40 | 75.60 | 68.82 |
| LLaVA-OV1.5-8B | Delta | -0.31 | +0.31 | +0.72 | +0.53 | +0.18 | +0.28 |
| Qwen3.5-9B | Standard SFT | 70.74 | 71.34 | 33.29 | 95.38 | 72.27 | 68.60 |
| Qwen3.5-9B | + LoMo | 71.16 | 71.95 | 33.43 | 94.54 | 74.86 | 69.19 |
| Qwen3.5-9B | Delta | +0.42 | +0.61 | +0.14 | -0.84 | +2.59 | +0.58 |

Practical Takeaways

The paper's limitations section says LoMo is applied only during SFT, uses a block-level middle-span heuristic, and is evaluated under compute constraints on two backbones. Those caveats matter because the method is promising as a data recipe, but not yet shown as a universal pretraining or post-training principle.