Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

1University of Maryland, 2Microsoft Research

Abstract

1. We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformers trained by contrastive learning, Florence-2 captures different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks.

2. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion" (DBFusion) to fuse visual features extracted from different depths and under multiple prompts. Our training consists of end-to-end pretraining of the whole model, followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs.

3. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced.

LLaVA vs. Florence-VL

Comparison of LLaVA-style MLLMs with our Florence-VL. LLaVA-style models use CLIP, pretrained with contrastive learning, to generate a single high-level image feature. In contrast, Florence-VL leverages Florence-2, pretrained with generative modeling across various vision tasks such as image captioning, OCR, and grounding. This allows Florence-VL to use Florence-2 as its image encoder and flexibly extract multiple task-specific image features.


Florence-VL

An overview of Florence-VL, which extracts visual features of different depths and breadths from Florence-2, combines them using DBFusion, and projects the fused features into an LLM's input space.


Breadth. We focus on three distinct tasks that contribute to image understanding, resulting in three different image embeddings (a minimal extraction sketch follows this list):

  • Detailed Image Caption: Describe what is shown in the image with a paragraph.
  • OCR: Provide the text shown in the image.
  • Dense Region Caption: Locate the objects in the image, with their descriptions.
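
As a rough illustration, the sketch below queries Florence-2 with these three task prompts through Hugging Face transformers. The prompt tokens follow the Florence-2 model card; the checkpoint name and the encoder hidden-state read-out used to obtain prompt-conditioned features are illustrative assumptions and may differ from the exact Florence-VL implementation.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
florence2 = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

# Task prompts for the three "breadth" views (tokens as documented for Florence-2).
TASK_PROMPTS = {
    "detailed_caption": "<MORE_DETAILED_CAPTION>",
    "ocr": "<OCR>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
}

@torch.no_grad()
def prompt_conditioned_features(image: Image.Image, prompt: str) -> torch.Tensor:
    """Return the last encoder hidden state for one task prompt (assumed feature hook)."""
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    out = florence2.generate(
        **inputs,
        max_new_tokens=1,            # only the encoder pass is needed here
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
    # Assumption: the last encoder layer over (prompt + image) tokens is the
    # feature map consumed downstream; Florence-VL's actual hook may differ.
    return out.encoder_hidden_states[-1]

image = Image.open("example.jpg")
breadth_features = {name: prompt_conditioned_features(image, p)
                    for name, p in TASK_PROMPTS.items()}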

Depth. We also integrate lower-level features from DaViT with the higher-level, prompt-conditioned features.

DBFusion. We concatenate features along the channel dimension, which yields better performance and training efficiency than other fusion strategies.
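
A minimal sketch of this fusion step, assuming a LLaVA-style two-layer MLP projector and illustrative tensor shapes; the actual projector design and dimensions follow the Florence-VL paper and code.

import torch
import torch.nn as nn

class DBFusionProjector(nn.Module):
    """Concatenate depth/breadth features along the channel dim, then project to the LLM width."""
    def __init__(self, feat_dim: int, num_branches: int, llm_dim: int):
        super().__init__()
        # Assumption: a simple two-layer MLP projector (LLaVA-style).
        self.proj = nn.Sequential(
            nn.Linear(feat_dim * num_branches, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, branch_feats: list[torch.Tensor]) -> torch.Tensor:
        # Each branch: (batch, num_tokens, feat_dim); token counts must already match
        # across branches (e.g., via pooling/interpolation) for channel-wise concatenation.
        fused = torch.cat(branch_feats, dim=-1)   # (batch, num_tokens, feat_dim * num_branches)
        return self.proj(fused)                   # (batch, num_tokens, llm_dim)

# Example: three breadth branches (caption, OCR, grounding) plus one lower-level DaViT branch.
feats = [torch.randn(1, 576, 1024) for _ in range(4)]       # illustrative shapes
projector = DBFusionProjector(feat_dim=1024, num_branches=4, llm_dim=4096)
llm_tokens = projector(feats)   # ready to prepend to the LLM's text embeddings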

Visualization

Visualization of the first three PCA components: we apply PCA to image features generated from the Detailed Caption, OCR, and Grounding prompts, excluding the background by setting a threshold on the first PCA component (a code sketch of this procedure follows the list below).

  • The image features derived from the Detailed Caption prompt (second column) capture the general context of the image.
  • Those from the OCR prompt (third column) focus primarily on text information.
  • Those from the Grounding prompt (fourth column) highlight spatial relationships between objects.
  • Additionally, we visualize the final-layer features from OpenAI CLIP (ViT-L/14@336) in the last column, showing that CLIP features often miss region-level details, such as text information.
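
A minimal sketch of this visualization, assuming a zero threshold on the first component to mask the background; the exact threshold and sign convention are not specified above and are illustrative.

import torch

def pca_rgb(patch_feats: torch.Tensor, grid: tuple[int, int], thresh: float = 0.0) -> torch.Tensor:
    """patch_feats: (num_patches, dim) features from one prompt; returns an (H, W, 3) PCA image."""
    x = patch_feats - patch_feats.mean(dim=0, keepdim=True)
    # Top-3 principal directions via (randomized) SVD of the centered feature matrix.
    _, _, v = torch.pca_lowrank(x, q=3)
    comps = x @ v                               # (num_patches, 3)
    mask = comps[:, 0] > thresh                 # assumed foreground/background split
    comps = (comps - comps.min(0).values) / (comps.max(0).values - comps.min(0).values + 1e-6)
    comps[~mask] = 0.0                          # background patches rendered black
    h, w = grid
    return comps.reshape(h, w, 3)

# Example with illustrative shapes: a 24x24 patch grid of 1024-dim features.
rgb = pca_rgb(torch.randn(24 * 24, 1024), grid=(24, 24))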


Training Recipe

17M Detailed Image Captions. To build a state-of-the-art MLLM, we use images from CC12M, Redcaps, and Commonpool during the pretraining stage, with detailed captions sourced from PixelProse and ShareGPT4V.

10M High-Quality Instruction Data. For the instruction-tuning stage, we also curate a high-quality instruction-tuning dataset, sourced from Cambrian-7M, Vision Flan, and ShareGPT4V, along with additional data from Docmatix to improve chart and diagram comprehension.

State-of-the-Art MLLM Performance


BibTeX

@article{chen2024florence,
  title={Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion},
  author={Chen, Jiuhai and Yang, Jianwei and Wu, Haiping and Li, Dianqi and Gao, Jianfeng and Zhou, Tianyi and Xiao, Bin},
  journal={arXiv preprint arXiv:2412.04424},
  year={2024}
}