Introducing BLIP3o-NEXT, an improved multimodal foundation model that advances the state of the art in image generation. BLIP3o-NEXT demonstrates exceptional capability in producing high-fidelity images, with particular strength in accurately following complex user instructions and rendering detailed text.
AR + Diffusion Architecture
Combines autoregressive and diffusion models for high-quality image generation.
Discrete Image Token Supervision
Uses discrete image tokens as additional supervision.
Reinforcement Learning
GRPO training enhances prompt alignment and text rendering accuracy.
Fully Open-Source
All training code, datasets, and models released for full reproducibility.
AR + Diffusion Architecture: As in BLIP3o, BLIP3o-NEXT generates intermediate features with the autoregressive model and then conditions the diffusion model on these features to generate images.
Discrete Image Token Supervision: We add discrete SigLIP2 image token prediction as extra training supervision, jointly optimizing a cross-entropy objective and the diffusion objective. Training with these two objectives not only accelerates convergence of image generation training but also yields better prompt alignment and image quality.
Reinforcement Learning: The introduction of discrete image tokens unlocks seamless compatibility with existing language-model RL frameworks. Using Group Relative Policy Optimization (GRPO), we train BLIP3o-NEXT to improve prompt alignment and text rendering in image generation.
Fully Open-Source: We release every component of BLIP3o-NEXT: datasets, model weights (including the pretrained, instruction-tuned, and RL models), and complete training pipelines for multimodal pretraining, instruction tuning, and RL.
AR + Diffusion
In the Autoregressive + Diffusion architecture, the autoregressive model (a language model or vision language model) first takes in a user prompt and generates continuous latent embeddings. The diffusion model then generates images conditioned on the latent embeddings.
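As a rough illustration, the two-stage flow can be written as follows; the module names and methods (`ar_model.generate_image_latents`, `diffusion_decoder.sample`) are placeholders for illustration rather than the released API.

```python
import torch

@torch.no_grad()
def generate_image(prompt: str, tokenizer, ar_model, diffusion_decoder):
    """Two-stage AR + Diffusion inference sketch (hypothetical interfaces)."""
    # 1) The autoregressive (vision-)language model reads the prompt and
    #    produces a sequence of continuous latent embeddings.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    latents = ar_model.generate_image_latents(prompt_ids)  # [1, num_image_tokens, hidden_dim]

    # 2) The diffusion model samples an image conditioned on those latents.
    return diffusion_decoder.sample(condition=latents)     # [1, 3, H, W]
```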
Discrete Image Token Supervision
Figure 1: BLIP3o-NEXT architecture with discrete image token supervision. The autoregressive model generates discrete image tokens, and their hidden representations serve as conditioning for the diffusion model. We jointly optimize the cross-entropy and flow-matching objectives during training.
In our training pipeline, the AR model learns to predict discrete image tokens via teacher forcing, and the hidden states of these predicted discrete tokens are used as conditioning inputs for the diffusion model. We jointly optimize a cross-entropy objective on those discrete tokens alongside the diffusion loss for end-to-end image generation; a minimal sketch of this joint objective follows the list below. This approach brings several key benefits:
1) Compared with continuous features, discrete tokens align naturally with autoregressive architectures, enabling faster training convergence than BLIP3o.
2) Autoregressive image token prediction excels at tasks requiring spatial structure (e.g., rendering text or composing multiple objects), while the diffusion model excels at producing images with high visual fidelity. By having the AR model lay down a discrete "blueprint" and feeding its hidden representations into the diffusion model, we combine structural accuracy with high-fidelity image outputs.
3) Training the AR and diffusion models together ensures they see the same distributions at training and inference time. This joint optimization avoids the mismatch that occurs when the diffusion model is trained to reconstruct from features (e.g., CLIP embeddings) while the AR model is separately trained to predict those features.
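The sketch below illustrates this joint objective, assuming hypothetical module interfaces and a simple rectified-flow formulation; it is meant to show the two loss terms, not to reproduce the released training code.

```python
import torch
import torch.nn.functional as F

def joint_loss(ar_model, diffusion_model, prompt_ids, target_image_tokens, image_latents,
               ce_weight=1.0, diffusion_weight=1.0):
    """Minimal sketch of the joint CE + flow-matching objective (hypothetical interfaces)."""
    # Teacher forcing: the AR model predicts the discrete SigLIP2 image tokens;
    # `hidden_states` are its hidden states at the image-token positions.
    logits, hidden_states = ar_model(prompt_ids, target_image_tokens)

    # Cross-entropy supervision on the discrete image tokens.
    ce_loss = F.cross_entropy(logits.flatten(0, 1), target_image_tokens.flatten())

    # Flow-matching loss: interpolate between noise and the clean image latents and
    # regress the velocity, conditioning the diffusion model on the AR hidden states.
    noise = torch.randn_like(image_latents)
    t = torch.rand(image_latents.size(0), 1, 1, device=image_latents.device)
    x_t = (1 - t) * noise + t * image_latents
    velocity_target = image_latents - noise
    velocity_pred = diffusion_model(x_t, t.squeeze(), condition=hidden_states)
    fm_loss = F.mse_loss(velocity_pred, velocity_target)

    return ce_weight * ce_loss + diffusion_weight * fm_loss
```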
For discrete image tokens, we apply a vector-quantization (VQ) step that maps continuous SigLIP2 embeddings into a finite set of codebook entries. Training this SigLIP2 quantizer follows the VQ-VAE paradigm, learning the codebook and encoder/decoder jointly to reconstruct the SigLIP2 features. Consistent with BLIP3o's findings, semantic features align better with the autoregressive model than VAE latents, so we quantize SigLIP2's semantic representations instead of VAE latents. SigLIP2 can also serve as the image understanding encoder, unifying image generation and understanding in one semantic space.
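A quantizer of this kind can be sketched as below in the standard VQ-VAE style (nearest-codebook lookup, commitment loss, straight-through estimator); the feature dimension and codebook size are illustrative, and the full setup additionally trains an encoder/decoder to reconstruct the original SigLIP2 features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigLIP2Quantizer(nn.Module):
    """Sketch of a VQ-VAE-style quantizer over SigLIP2 features (illustrative sizes)."""

    def __init__(self, feature_dim=1152, codebook_size=16384, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feature_dim)
        self.beta = beta

    def forward(self, features):  # features: [B, N, D] SigLIP2 embeddings
        flat = features.reshape(-1, features.size(-1))
        # Nearest codebook entry for every feature vector -> discrete image tokens.
        distances = torch.cdist(flat, self.codebook.weight)   # [B*N, K]
        indices = distances.argmin(dim=-1)
        quantized = self.codebook(indices).view_as(features)

        # Codebook + commitment losses, plus the straight-through estimator
        # so gradients flow back to the encoder.
        vq_loss = F.mse_loss(quantized, features.detach()) \
                + self.beta * F.mse_loss(features, quantized.detach())
        quantized = features + (quantized - features).detach()
        return quantized, indices.view(features.shape[:-1]), vq_loss
```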
Reinforcement Learning
To further enhance prompt alignment and text rendering, we employ Group Relative Policy Optimization (GRPO) to fine-tune the autoregressive model. Because the model predicts discrete image tokens, it integrates seamlessly with existing reinforcement learning frameworks designed for language models. We focus on two verifiable reward tasks:
1) Prompt Alignment: verifying count, color, and position (e.g., "generate two red apples"), following a GenEval-style setup.
2) Text Rendering: scoring rendered text with an OCR-based evaluator.
During the RL training phase, the AR model samples discrete image tokens conditioned on the input prompt. The hidden states associated with these tokens are then provided to a diffusion model, which remains frozen during this process, to generate the corresponding images. A reward model subsequently evaluates each image based on prompt alignment or OCR-derived text accuracy, and these reward signals are used to update the AR model via policy gradients. This paradigm leverages and extends established RL methodologies for text generation, opening new avenues for reinforcement learning applications that incorporate both text and image modalities.
Figure 2: GRPO training pipeline for BLIP3o-NEXT. The autoregressive model generates discrete tokens, the frozen diffusion model decodes their hidden states into images, and a reward model provides feedback for policy optimization. Only the AR model parameters are updated during RL training.
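At a high level, one GRPO step can be sketched as follows; the sampling, decoding, and reward interfaces are hypothetical placeholders, and the surrogate loss is simplified (the full GRPO objective also uses importance ratios, clipping, and a KL penalty).

```python
import torch

def grpo_step(ar_model, frozen_diffusion, reward_fn, prompt_ids, group_size=8):
    """One simplified GRPO update for image-token generation (hypothetical interfaces)."""
    # 1) Sample a group of discrete image-token sequences for the same prompt,
    #    keeping per-token log-probs and the hidden states needed for decoding.
    samples = [ar_model.sample_image_tokens(prompt_ids) for _ in range(group_size)]

    # 2) Decode each sample with the frozen diffusion model, then score the image
    #    with a verifiable reward (GenEval-style checker or OCR accuracy).
    with torch.no_grad():
        rewards = torch.tensor([
            reward_fn(frozen_diffusion.sample(condition=s.hidden_states)) for s in samples
        ])

    # 3) Group-relative advantages: normalize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 4) Policy-gradient loss on the AR model only (REINFORCE-style surrogate here).
    loss = -(advantages * torch.stack([s.log_probs.sum() for s in samples])).mean()
    loss.backward()
    return loss.item(), rewards.mean().item()
```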
Prompt Alignment
Figure 3: Training reward progression during GRPO training on GenEval dataset, showing the optimization of reward function over training steps.
Figure 4: GenEval performance improvement during GRPO training, evaluated every 50 training steps from 0 to 400 steps.
We perform GRPO training on BLIP3o-NEXT using synthesized data and report the training reward curve together with qualitative and quantitative results on the GenEval benchmark. The reward function matches the official GenEval evaluation metric implementation. We compare BLIP3o-NEXT against recent strong unified multimodal foundation models, including BLIP3o, FLUX.1-dev, Metaqueries, and BAGEL.
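For intuition, a GenEval-style verifiable reward for the prompt-alignment task can be sketched as below; `detect_objects` and `spec` are illustrative placeholders for the object detector and parsed prompt specification used by the official evaluator, and the scoring rules are simplified.

```python
def geneval_style_reward(image, spec, detect_objects):
    """Simplified verifiable reward for prompts like "two red apples" (illustrative only).

    `spec` is a dict such as {"object": "apple", "count": 2, "color": "red"};
    `detect_objects` is a placeholder for a detector returning (label, color, bbox) results.
    """
    detections = [d for d in detect_objects(image) if d.label == spec["object"]]

    # Count check: the number of detected target objects must match the prompt.
    count_ok = len(detections) == spec.get("count", 1)

    # Color check: every detected target object must have the requested color (if any).
    color_ok = all(d.color == spec["color"] for d in detections) if "color" in spec else True

    # Binary reward, matching the pass/fail style of GenEval's per-prompt scoring.
    return float(count_ok and color_ok)
```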
| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall |
|---|---|---|---|---|---|---|---|
| FLUX.1-dev | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Metaqueries XL | - | - | - | - | - | - | 0.80† |
| BAGEL | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88† |
| BLIP3o | - | - | - | - | - | - | 0.84 |
| BLIP3o-NEXT (3B) | 0.99 | 0.95 | 0.88 | 0.90 | 0.92 | 0.79 | 0.91 |
Table 4: Performance evaluation on GenEval. †With Prompt Rewrite.
Figure 5: Qualitative results on prompt following.
Text Rendering
Figure 6: OCR training reward progression during GRPO training, showing the optimization of the OCR reward function over training steps.
We perform GRPO training on BLIP3o-NEXT using synthesized data and report the training reward curve and quantitative results on prompts containing various text. The reward function is based on an OCR model.
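A minimal OCR-based reward can be sketched as follows; `easyocr` is chosen purely for illustration (it is not necessarily the OCR evaluator in our pipeline), and the reward is a simple character-level similarity between the prompt's target text and the recognized text.

```python
import difflib
import easyocr  # illustrative choice of OCR engine, not necessarily the one we use

_reader = easyocr.Reader(["en"])

def ocr_reward(image_path: str, target_text: str) -> float:
    """Reward in [0, 1]: how closely the rendered text matches the prompt's target text."""
    # Run OCR and concatenate all recognized text fragments.
    recognized = " ".join(_reader.readtext(image_path, detail=0))

    # Character-level similarity between target and recognized text (case-insensitive).
    return difflib.SequenceMatcher(
        None, target_text.lower(), recognized.lower()
    ).ratio()
```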
Figure 7: Qualitative results on text rendering.
Complete Training Recipe
Below we provide a complete training recipe for BLIP3o-NEXT, including hyperparameters and dataset compositions. This comprehensive documentation is designed to facilitate reproducibility and support the open-source research community in building upon our work.
| Stage | Learning Rate | Number of Steps | Batch Size | Dataset | GPUs |
|---|---|---|---|---|---|
| Pretrain | 1e-4 | 336K | 512 | BLIP3o-Pretrain | 32 × H200 |
| Finetuning | 5e-5 | 7K | 128 | BLIP3o-60k + ShareGPT-4o | 1 × H100 |
| RL-Text | 1e-6 | 900 | 512 | Synthesized prompts | 32 × H100 |
| RL-GenEval | 5e-6 | 400 | 256 | Synthesized prompts | 16 × H100 |
Table 5: Complete training recipe for BLIP3o-NEXT across different training stages.
Acknowledgements
We thank the research community for their foundational work on language models, multimodal models, and diffusion models. Special thanks to the teams behind Qwen3, SANA 1.5, and SigLIP2 for their excellent contributions. We also thank the Tar team for open-sourcing the excellent VQ-SigLIP2.
BibTeX
@misc{chen2025blip3ofamilyfullyopen,
title={BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset},
author={Jiuhai Chen and Zhiyang Xu and Xichen Pan and Yushi Hu and Can Qin and Tom Goldstein and Lifu Huang and Tianyi Zhou and Saining Xie and Silvio Savarese and Le Xue and Caiming Xiong and Ran Xu},
year={2025},
eprint={2505.09568},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.09568},
}
@misc{blip3oNext2025,
title = {BLIP3o-NEXT: A Next-Generation Multimodal Foundation Model},
url = {https://jiuhaichen.github.io/BLIP3o-NEXT.github.io/},
author = {Jiuhai Chen and Zhiyang Xu and Xichen Pan and Shusheng Yang and Can Qin and An Yan and Honglu Zhou and Zeyuan Chen and Tianyi Zhou and Silvio Savarese and Le Xue and Caiming Xiong and Ran Xu},
month = {Aug},
year = {2025}
}