SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Source-first digest for monthly 2026_05 rank 4, rank_id p011.

Routing status: success
PDF extraction: not used

Motivation / Background

Most recent multimodal systems still split understanding and generation into different mechanisms. Perception usually depends on pretrained vision encoders, while image generation usually depends on VAEs or other latent bottlenecks. SenseNova-U1 argues that this split is not only an implementation detail: it creates separate objectives, representation spaces, and deployment paths that make unified multimodal intelligence harder to obtain.

The paper's answer is NEO-unify, a native pixel-word architecture that removes separate VEs and VAEs, uses lightweight patch interfaces, and trains understanding and generation inside one Mixture-of-Transformers backbone. The intended outcome is a model that can read, reason, draw, edit, interleave text and images, and show early VLA/world-model behavior without routing across separate modality-specific systems. The high-level scope is summarized in Figure 1.

Figure 1. SenseNova-U1 overview. — **Figure 1. Overview of SenseNova-U1.** The original caption says the model uses lightweight encoding and decoding interfaces to support perception, synthesis, and interleaved vision-language generation inside a single end-to-end architecture.

Claims And Evidence

Claim id	Main claim	Support	Evidence anchors
C1	Understanding and generation can be trained as synergistic views of one native pixel-word process rather than separate VE/VAE pipelines.	4	overview, architecture, joint objective, unified reasoning
C2	The near-lossless visual interface preserves useful semantic and pixel-level information without pretrained vision encoders or latent autoencoders.	3	visual interface, reconstruction ablation
C3	Native MoT routing lets understanding and generation co-train with limited objective conflict.	4	MoT design, co-training ablation, scaling curves
C4	SenseNova-U1 remains competitive on multimodal understanding, text reasoning, agentic benchmarks, and spatial intelligence despite being a unified model.	4	understanding table, text and agent table
C5	Unified modeling does not sacrifice image generation: the model is strong on compositional, dense-prompt, text-centric, infographic, and knowledge-centric generation.	5	generation table, text-rich generation table, infographic results, showcase figures
C6	Image editing works in the same native framework, but specialist editors still lead on several editing axes.	3	editing table, reasoning-editing table, limitations
C7	Interleaved generation and bidirectional understanding-generation tests support real cross-modal synergy, not only colocated skills.	4	interleaved table, Uni-MMMU and RealUnify
C8	VLA and world-model behavior is promising but preliminary because the paper provides visual qualitative evidence rather than full embodied benchmark suites.	2	VLA visualization, world modeling visualization

Scores are support-from-paper scores, not independent reproduction scores. I cap claims when evidence is mainly qualitative, when baselines are mostly benchmark-specific, or when the result is inherited from a smaller NEO-unify ablation rather than measured directly on the final SenseNova-U1 variants.

Core Technical Idea

SenseNova-U1 makes three architectural moves:

Near-lossless visual interface. Images/noise are converted into 32 by 32 patch tokens through two convolutional layers with GELU and 2D sinusoidal positions. The generation stream predicts pixel patches directly with an MLP-like head, rather than decoding through a VAE.
Native pixel-word sequence modeling. Text tokens and image tokens are projected into one embedding space, delimited by image boundary tokens, and processed jointly.
Native Mixture-of-Transformers. Clean understanding tokens and noisy generation tokens share attention context but use token-type-routed projections, normalizations, and feedforward blocks. This keeps the architecture unified while decoupling parameters enough to reduce interference.

The architecture is shown in Figure 2, and the two released variants are summarized in Table 1.

Figure 2. NEO-unify architecture. — **Figure 2. Architecture.** SenseNova-U1 operates directly on native pixels and text. It uses a two-layer convolutional patch encoder, an MLP-like pixel decoder, and a native MoT backbone. The paper emphasizes the 32x compression ratio as a practical compromise between near-lossless information retention and efficient unified modeling.

Configuration	SenseNova-U1-8B-MoT	SenseNova-U1-A3B-MoT
Patch size	\(32 \times 32\)	\(32 \times 32\)
Pre-buffer	Yes	No
Layers	42	48
Heads, Q / KV	32 / 8	32 / 4
Head size, T / H / W	64 / 32 / 32	64 / 32 / 32
Hidden size	4,096	2,048
Understanding / generation experts	1 / 1	128 / 32, A8
Understanding / generation params	8.2B / 8.2B	30.0B / 8.2B, A3B

Table 1. Model configurations. This table comes from the LaTeX source table tab:model_config; it matters because the paper's main comparison is not only small versus large, but dense symmetric streams versus stream-wise MoE.

The training objective combines text understanding and pixel-space visual generation:

\mathcal{L}_\text{total} = \lambda_1 \mathcal{L}_\text{Und} + \lambda_2 \mathcal{L}_\text{Gen}.

For understanding, the model uses autoregressive next-token loss over text conditioned on multimodal context:

\mathcal{L}_\text{Und} = -\frac{1}{N}\sum_{n=1}^{N} \log p_\theta(x_n \mid x_{\lt n}, \mathbf{c}).

For generation, it applies rectified-flow style pixel-space modeling. A clean image \(\mathbf{x}\) and Gaussian noise \(\boldsymbol\epsilon\) are interpolated with a resolution-adaptive noise scale \(\sigma_R\):

\mathbf{z}_t = t\,\mathbf{x} + (1-t)\,\sigma_R\,\boldsymbol\epsilon, \qquad t \in [0,1].

The model predicts a clean signal and converts it into a velocity:

\mathbf{v}_\theta(\mathbf{z}_t,t) = \frac{\hat{\mathbf{x}}_\theta(\mathbf{z}_t,t,\mathbf{s}_t)-\mathbf{z}_t}{1-t}.

The generation loss is:

\mathcal{L}_{\text{Gen}} = \mathbb{E}_{t,\mathbf{x},\boldsymbol\epsilon,(H,W)} \left[ \left\|\mathbf{v}_\theta(\mathbf{z}_t,t)-\mathbf{v}^\star\right\|_2^2 \right], \qquad \mathbf{v}^\star= \frac{\mathbf{x}-\mathbf{z}_t}{1-t}.

For image-conditioned generation and editing, the paper uses a unified classifier-free guidance rule with separate text and image-context strengths:

\begin{aligned} \nabla_{\mathbf{x}} \log p(\mathbf{x}\mid c_{\mathrm{img}}, c_{\mathrm{txt}}) =&\ \gamma \left( \nabla_{\mathbf{x}} \log p(\mathbf{x}\mid c_{\mathrm{img}}, c_{\mathrm{txt}}) - \nabla_{\mathbf{x}} \log p(\mathbf{x}\mid c_{\mathrm{img}}) \right) \\ &+ \gamma_{\mathrm{img}} \left( \nabla_{\mathbf{x}} \log p(\mathbf{x}\mid c_{\mathrm{img}}) - \nabla_{\mathbf{x}} \log p(\mathbf{x}) \right) + \nabla_{\mathbf{x}} \log p(\mathbf{x}). \end{aligned}

The reported practical setting is \(\gamma = 4\), \(\gamma_{\mathrm{img}} = 1\), timestep shift 3.0, and global CFG renormalization.

Method Details

Training Recipe

The model is trained in a staged sequence, with pure understanding first, generation pretraining second, joint mid-training third, and unified SFT fourth. Table 2 compresses the stage table recovered from the flattened LaTeX source.

Stage	Main role	Steps	Data mix	Key settings
Stage 1, understanding warmup	Adapt NEO backbone and attention fusion	120K	100% understanding	LR \(2\times10^{-5}\), seq length 32768, 0.75T tokens
Stage 2-I, generation pretraining	Establish text-to-image generation branch	120K	100% generation	LR \(2\times10^{-4}\), gen resolution \(256^2\) to \(512^2\)
Stage 2-II, high-res generation	Continue high-resolution generation	60K	100% generation	LR \(1\times10^{-4}\), gen resolution \(512^2\) to \(2048^2\)
Stage 2-III, expanded generation	Add editing, reasoning, and interleaved data	120K	56% generation, 37% editing, 7% interleaved	Cosine LR \(1\times10^{-4}\) to \(2\times10^{-5}\)
Stage 3, unified mid-training	Jointly train both branches	84K	33% understanding, 37% generation, 24% editing, 6% interleaved	\(\lambda_1=0.1\), \(\lambda_2=1.0\), 1.19T tokens
Stage 4, unified SFT	Instruction alignment across modalities	9K	Same ratios as Stage 3	Cosine LR \(2\times10^{-5}\) to 0

Table 2. Training stages. The important design is not simply "more data"; Stage 2 deliberately lets the generation branch learn pixel synthesis before the unified model is jointly optimized.

After SFT, the paper adds generation post-training. Dynamic resolution warmup gates candidate resolution sampling by difficulty:

\hat{p}_i = p_i \cdot \mathrm{clamp}\left( \frac{\min(e / E_{\text{warm}}, 1)-d_i}{\delta}+1, 0, 1 \right).

Text-rendering RL uses OCR overlap:

r_{\text{ocr}} = \frac{ \left|\mathcal{C}(\hat{T}) \cap \mathcal{C}(T^\star)\right| }{ \left|\mathcal{C}(\hat{T}) \cup \mathcal{C}(T^\star)\right| }.

For one reward group, OCR and style rewards are combined as:

r = r_{\text{ocr}} + \lambda_{\text{sty}} r_{\text{sty}}.

The text-rendering RL run uses 600 epochs, learning rate \(1\times10^{-5}\), KL coefficient \(\beta=0.01\), \(N=48\) prompts per epoch, \(K=16\) samples per prompt, 10-step flow matching, guidance 4.0, and a 200-epoch resolution warmup. The later general RL stage interleaves text/style and aesthetic reward groups; the 8B variant trains for 1,600 epochs and the A3B variant for 200.

Inference Infrastructure

The paper keeps the user-facing model unified but splits serving into two specialized engines because understanding and generation have different runtime bottlenecks. Figure 3 shows the disaggregated LightLLM/LightX2V serving setup, and Figure 4 shows the hybrid attention kernel.

Figure 3. Disaggregated LightLLM and LightX2V inference. — **Figure 3. Inference architecture.** LightLLM handles multimodal understanding, text streaming, and control flow, while LightX2V handles image generation. The engines exchange generation state through pinned shared memory.

Figure 4. Hybrid attention kernel. — **Figure 4. Hybrid attention.** Text rows retain causal masking, while image rows attend to the preceding text prefix and full image span. The kernel preserves the causal fast path for pure-text blocks and expands the key range only when a block contains image tokens.

The infrastructure supports separate and colocated deployment. For \(2048 \times 2048\) generation with SenseNova-U1-8B-MoT, both modes support TP2+CFG2; in separate mode, the paper reports per-step latencies of 0.415 seconds on 5090 and 0.443 seconds on L40S GPUs.

Data Construction

The understanding corpus and generation corpus are both heavily filtered. Figure 5 shows the understanding pipeline, Figure 6 summarizes the corpus distribution, and Figure 7 shows the generation filtering pipeline.

Figure 5. Understanding data pipeline. — **Figure 5. Understanding pipeline.** The mid-training corpus is curated across ten vertical domains by distribution-balanced sampling, prompt augmentation, and multi-criteria filtering.

Figure 6. Training corpus distribution. — **Figure 6. Corpus distribution.** The source figure presents sunburst charts for understanding, text-to-image, editing, and interleaved datasets. The text gives several key proportions: mid-training data includes General 39.2%, Agent and Spatial 22.3%, Knowledge Reasoning 19.3%, and Pure Text 19.2%; SFT includes roughly 15% spatial intelligence, 13% general multimodal understanding, 12% reasoning, 11% general NLP, 11% OCR/document analysis, and 10% agentic function calling.

Figure 7. Generation data pipeline. — **Figure 7. Generation pipeline.** Text-to-image and editing data go through low-level filtering, deduplication, VLM captioning, and quality filtering. Text-to-image data is roughly Nature 40.5%, People 26.7%, and Design 20.7%; editing data is dominated by natural scenes 52.3% and human subjects 14.7%.

For interleaved data, the paper reports a compact corpus spanning video, lifestyle, infographics, and reasoning. Lifestyle is about 44%, infographics about 29%, video about 19%, and reasoning about 8%, with the reasoning subset including explicit chain-of-thought traces.

Experiments And Results

Understanding, Text, Agent, And Spatial Results

The paper evaluates understanding with EvalScope under an LLM-as-a-judge llm_recall strategy, using gpt-4o-mini-2024-07-18 as judge. The setup uses temperature 0.6, top-p 0.95, top-k 20, repetition penalty 1.0, 40,960 max sequence length, 600 second timeout, and internal thinking enabled.

Table 3 condenses the multimodal understanding and spatial-intelligence table recovered from flattened LaTeX. The strongest evidence is not that SenseNova-U1 wins every column; it is that an encoder-free unified model stays close to or above strong understanding-only VLMs across diverse categories.

Benchmark	SenseNova-U1 8B-Think	Qwen3VL 8B-Think	Qwen3.5 9B	SenseNova-U1 30BA3B-Think	Qwen3VL 30BA3B-Think	Qwen3.5 35BA3B
MMMU	74.78	74.10	78.40	80.55	76.00	81.40
MMMU-Pro	67.69	60.40	70.10	72.83	63.00	75.10
MathVista-mini	84.20	81.40	85.70	85.30	81.90	86.20
MathVision	75.82	62.70	78.90	79.63	65.70	83.90
MMBench-EN	90.25	87.50	90.10	91.59	88.90	91.50
OCRBench-v2	61.30	61.55	66.54	68.64	61.50	73.71
VSI-Bench	62.66	56.61*	55.67*	56.90	51.56*	58.10*
MindCube-Tiny	62.01	43.17	57.59	70.86	40.86	63.46

Table 3. Selected multimodal and spatial benchmarks. Asterisks mark Qwen variants that used 128 frames for best performance under the paper's EASI note; the standard protocol otherwise uses 32 frames on VSI-Bench.

Benchmark	SenseNova-U1 8B-Think	Qwen3VL 8B-Think	Qwen3.5 9B	SenseNova-U1 30BA3B-Think	Qwen3VL 30BA3B-Think	Qwen3.5 35BA3B
MMLU-Pro	81.44	77.30	82.50	84.04	80.50	85.30
MMLU-Redux	87.61	88.80	91.10	89.44	90.90	93.30
C-Eval	84.40	83.88	88.20	85.89	87.29	90.20
SuperGPQA	49.67	51.20	58.20	59.71	56.40	63.40
IFEval	91.13	83.20	91.50	92.39	81.70	91.90
IFBench	67.01	29.93	64.50	79.79	34.69	70.20
Tau2	71.70	31.65	79.10	75.39	46.40	81.20
Claw eval	58.90	21.70	65.40	58.50	22.10	36.50

Table 4. Selected text, instruction, and agent results. The instruction-following and agent rows are the clearest wins for SenseNova-U1 over Qwen3VL variants; dense reasoning-focused Qwen3.5 still leads on several knowledge rows.

Image Generation

For general generation, the paper reports GenEval, DPG-Bench, OneIG, TIIF, CVTG-2K, LongText-Bench, WISE, IGenBench, and BizGenEval. Figure 8 and Figure 9 show qualitative examples, while Table 5 and Table 6 summarize the strongest quantitative claims.

Figure 8. Infographic and human generation examples. — **Figure 8. Text-to-image showcases.** The source caption describes SenseNova-U1-8B-MoT examples in infographics and human generation.

Figure 9. Editing and interleaved examples. — **Figure 9. Editing and interleaved showcases.** The source caption describes SenseNova-U1-8B-MoT examples in image editing and interleaved generation.

Benchmark	Metric	SenseNova-U1 8BA3B	SenseNova-U1 8B	Qwen-Image 20B	BAGEL 7B
GenEval	Overall	0.91	0.91	0.87	0.82
DPG-Bench	Overall	88.14	87.78	88.32	85.07
OneIG-EN	Overall	0.543	0.549	0.539	0.361
OneIG-ZH	Overall	0.540	0.535	0.548	0.370
TIIF short	Overall	89.25	89.74	86.14	-
TIIF long	Overall	87.36	89.17	86.83	-
WISE with CoT	Overall	0.81	0.78	-	0.70
WISE without CoT	Overall	0.74	0.69	0.63	0.49

Table 5. Selected generation benchmarks. SenseNova-U1 is particularly strong on GenEval and TIIF. On DPG-Bench it is competitive but not the top overall row because Qwen-Image reports 88.32 and Seedream 4.5 reports 88.63.

Benchmark	Metric	SenseNova-U1 8B	SenseNova-U1 8BA3B	Qwen-Image 20B	FLUX.1-dev 12B
CVTG-2K	Word accuracy average	0.940	0.881	0.829	0.497
LongText-Bench-EN	Accuracy	0.979	0.950	0.943	0.607
LongText-Bench-ZH	Accuracy	0.962	0.955	0.946	0.005

Table 6. Text-centric generation. The 8B variant is the strongest SenseNova-U1 text-rendering model in these rows, especially on CVTG-2K and LongText-Bench.

Benchmark	Metric	SenseNova-U1 8B	SenseNova-U1 8BA3B	Qwen-Image 20B	Z-Image-Turbo 6B
IGenBench	Overall	0.51	0.42	0.36	0.35
BizGenEval	Average hard / easy	39.7 / 61.7	28.2 / 51.9	2.8 / 23.8	-

Table 7. Infographic and business visual generation. IGenBench and BizGenEval are important because they test layouts, charts, data constraints, text rendering, and domain knowledge rather than only natural-image aesthetics.

Image Editing

The editing results are more nuanced. SenseNova-U1 is competitive among unified open-source models, but it does not beat the strongest dedicated editing systems such as Qwen-Image-Edit-2511 or Seedream in the general editing table. The paper explicitly attributes part of the gap to editing data limitations.

Benchmark	Metric	SenseNova-U1 8BA3B	SenseNova-U1 8B	Qwen-Image-Edit-2511 20B	BAGEL 7B
ImgEdit	Overall	3.91	3.90	4.51	3.20
GEdit-Bench-EN	Score	8.07	8.27	8.30	7.36
GEdit-Bench-ZH	Score	7.36	7.49	8.20	6.83
GEdit-Bench overall	Score	7.32	7.47	7.88	6.52

Table 8. General editing. The unified native model clears older unified baselines, but dedicated editors still lead, especially on ImgEdit and GEdit overall.

RISEBench row	Temporal	Causal	Spatial	Logical	Overall	IR	AC	VP
SenseNova-U1-SFT 8BA3B with CoT	24.7	46.7	28.0	20.0	30.0	63.2	84.1	87.4
SenseNova-U1-SFT 8B with CoT	31.8	33.3	27.0	15.3	26.9	60.8	86.6	88.2
SenseNova-U1-SFT 8BA3B	25.9	41.1	26.0	7.1	25.3	57.4	82.6	85.4
SenseNova-U1-SFT 8B	22.4	33.3	27.0	11.8	23.9	58.2	84.1	82.4
Qwen-Image-Edit-2511 20B	21.2	18.9	31.0	4.7	19.4	49.9	71.0	91.5

Table 9. Reasoning editing. CoT helps substantially on RISEBench. For example, A3B logical score moves from 7.1 to 20.0, and overall rises from 25.3 to 30.0.

Interleaved Generation And Unified Reasoning

Interleaved generation is where the paper most directly tests whether the model can coordinate language and images over multiple steps. Table 10 covers OpenING and VBVR-Image preview; Table 11 covers generation-aids-understanding and bidirectional synergy.

Benchmark	Metric	SenseNova-U1-SFT 8BA3B	SenseNova-U1-SFT 8B	Reference baseline
OpenING	Overall with CoT	9.16	9.07	GPT-4o + DALL-E3: 8.20
VBVR-Image preview	Overall	68.9	68.8	VBVR-BAGEL: 36.5; BAGEL: 29.1

Table 10. Interleaved generation. These rows support the claim that SenseNova-U1 can sustain multi-step image-text coherence and reasoning-through-generation better than the compared unified baselines.

Benchmark	Metric	SenseNova-U1-SFT 8B	SenseNova-U1-SFT 8BA3B	BAGEL 7B	Ovis-U1 1.2B
Uni-MMMU GaU	Avg	35.0	32.6	20.5	14.1
RealUnify	Avg-UEG	55.7	52.8	47.7	39.3
RealUnify	Avg-GEU	47.5	47.0	35.8	29.5
RealUnify	Overall avg	52.4	50.5	42.9	35.4

Table 11. Unified reasoning. Uni-MMMU measures generation assisting understanding. RealUnify separates Understanding Enhances Generation and Generation Enhances Understanding, which is why it is central evidence for C7.

Ablation Studies

The ablations address three questions: whether the encoder-free visual interface preserves detail, whether MoT co-training creates manageable interaction between objectives, and whether data scale improves both generation and unified capabilities. Figure 10, Figure 11, Figure 12, Figure 13a, Figure 13b, Figure 13c, and Figure 13d show the visual evidence.

**Figure 10. Reconstruction ablation.** The source caption shows out-of-domain reconstruction with 2B NEO-unify under a frozen understanding branch. The text reports 31.56 PSNR and 0.85 SSIM on MS-COCO 2017 after 90K pretraining steps, approaching FLUX.1-dev VAE at 31.56 PSNR and 0.93 SSIM.

**Figure 11. Editing ablation.** The source caption validates ImgEdit prompts with 2B NEO-unify under a frozen understanding branch. The paper reports an ImgEdit score of 3.32 after 60K mixed-training steps using public text-to-image and editing datasets.

Figure 12. Understanding-generation co-training. — **Figure 12. Co-training conflict.** The source caption states that GEdit-Bench scores are normalized to a 0-100 scale. The main text says understanding remains stable while generation converges rapidly, suggesting limited intrinsic conflict under the native MoT backbone.

Figures 13a-13d. Data-scaling curves. These curves cover DPG-Bench, WISE, GEdit-Bench, and RISEBench. The paper interprets them as evidence that diverse data scale improves generation quality and understanding-generation synergy.

Preliminary VLA And World Modeling

The final experiments are qualitative visualizations rather than full embodied evaluations. They are still useful because they show what the authors mean by a path from passive multimodal understanding toward acting and predicting state transitions.

Figure 14. VLA behavior visualization. — **Figure 14. VLA visualization.** The figure samples four frames from robotic manipulation videos and asks the model to reason about action-relevant dynamics, object states, manipulation trajectories, and task progression.

**Figure 15. World modeling visualization.** The model receives an input image and action-oriented instruction, then predicts a plausible visual outcome while preserving scene consistency and object coherence. Because this is qualitative, it supports C8 only weakly.

Practical Takeaways

1. The real contribution is architectural unification, not just benchmark breadth. SenseNova-U1 tries to remove the VE/VAE split and force pixel-word understanding, image synthesis, editing, and interleaving through one native modeling interface.

2. Parameter decoupling inside a unified backbone is the pragmatic compromise. The model is "single architecture" at the sequence and attention level, but projections, normalizations, and feedforward blocks are stream-routed by token type. That choice is what makes the unification claim more credible than a purely shared-backbone design.

3. Text rendering and structured visual generation are unusually central. The strongest generation evidence is not only natural-image quality; it is GenEval, TIIF, CVTG-2K, LongText-Bench, IGenBench, BizGenEval, and WISE, which stress layout, text, knowledge, and reasoning.

4. Editing is good but not solved. SenseNova-U1 beats many unified baselines, but dedicated editors still lead on general editing. The paper's own limitation is that editing supervision is still dominated by public resources and lacks broader pipelines and preference optimization.

5. The VLA/world-model story should be treated as roadmap evidence. The figures are promising, but they are not a substitute for benchmarked embodied control or closed-loop world-model evaluation.

6. For follow-up work, the most interesting tests are interference and transfer. A strong reproduction would vary the MoT decoupling, VE/VAE removal, data mixture, and post-training rewards to test whether the claimed synergy survives outside this training stack.

Reference Coverage

Additional local references for validation and navigation: evidence-data, evidence-inference, evidence-training, fig-vla, fig-world-modeling, table-editing.
Additional local references for validation and navigation: table-infographic, table-reasoning-editing, table-text-agent.