Ovis-U1: An Open-Source 3B Multimodal LLM Advancing Toward Human-Level Task Performance
Progress toward human-level task performance in Artificial General Intelligence (AGI) is being propelled by multimodal large language models (MLLMs). By combining multiple modalities, these systems can draw on richer inputs and deliver stronger capabilities at inference time. In this article, we explore Ovis-U1: an open-source, 3-billion-parameter model released by the Alibaba Ovis team. Its strengths span multimodal understanding, text-to-image generation, and editing of user-provided images.
Key Takeaways
- Ovis-U1 is a 3-billion-parameter open-source multimodal large language model developed by Alibaba.
- It supports capabilities such as multimodal understanding, text-to-image generation, and image editing.
- The model was trained using a diverse mixture of datasets spanning multiple tasks and domains.
- You can run the model on a GPU server or experiment with it directly through Hugging Face Spaces.
Training Process
| Stage | Trained Parameters | Task | Steps / Batch Size / Learning Rate | Description |
|---|---|---|---|---|
| 0 | Refiner + Visual Decoder | Text-to-Image Generation | 500 / 1024 / 1e-4 | Visual decoder pretraining begins from random initialization to form foundational image generation ability. The visual decoder and refiner produce images from LLM embeddings using text-to-image data. |
| 1 | Adapter | Understanding, Text-to-Image Generation, Image Editing | 1.5k / 8192 / 5e-4 | Adapter pretraining aligns visual and textual embeddings. The adapter starts from random initialization and is trained during this stage across understanding, text-to-image, and image editing tasks. |
| 2 | Visual Encoder + Adapter | Understanding, Text-to-Image Generation, Image Editing | 2.6k / 8192 / 1e-4 | Visual encoder alignment fine-tunes both the visual encoder and the adapter to better match visual and textual representations. All three task categories are used, and generation helps support embedding alignment. |
| 3 | Visual Encoder + Adapter + LLM | Understanding | 23 / 2240 / 5e-5 | Understanding learning trains the visual encoder, adapter, and LLM on understanding tasks. After this stage, these parameters are fixed to preserve understanding capability. |
| 4 | Refiner + Visual Decoder | Text-to-Image Generation | 275 / 256 / 5e-5 | Generation learning trains the refiner and visual decoder to align with the improved text and image embeddings after the LLM is tuned in Stage 3. This stage delivers stronger text-to-image performance. |
| 5 | Refiner + Visual Decoder | Text-to-Image Generation, Image Editing | 325 / 256 / 5e-5 | Generation fine-tuning extends the text-to-image foundation by fine-tuning the decoder for both text-to-image and image editing tasks. |
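To make the staged recipe above concrete, here is a minimal PyTorch sketch of how this kind of stage-wise training is commonly wired up: freeze everything, then unfreeze only the modules listed for the current stage. The `UnifiedModel` container, the `nn.Linear` stand-ins, and the optimizer choice are illustrative assumptions, not the actual Ovis-U1 implementation; only the module names and learning rates mirror the table.

```python
import torch
import torch.nn as nn

# Hypothetical container mirroring the components named in the table above.
# The real Ovis-U1 code differs; this only illustrates training different
# parameter groups in different stages.
class UnifiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(1024, 2048)   # stand-in for the vision encoder
        self.adapter        = nn.Linear(2048, 2048)   # visual-to-text embedding adapter
        self.llm            = nn.Linear(2048, 2048)   # stand-in for the language model
        self.refiner        = nn.Linear(2048, 2048)   # stand-in for the generation refiner
        self.visual_decoder = nn.Linear(2048, 1024)   # stand-in for the visual decoder

# Which modules receive gradients in each stage, per the table above.
STAGE_TRAINABLE = {
    0: ["refiner", "visual_decoder"],
    1: ["adapter"],
    2: ["visual_encoder", "adapter"],
    3: ["visual_encoder", "adapter", "llm"],
    4: ["refiner", "visual_decoder"],
    5: ["refiner", "visual_decoder"],
}

# Stage-specific learning rates from the table.
STAGE_LR = {0: 1e-4, 1: 5e-4, 2: 1e-4, 3: 5e-5, 4: 5e-5, 5: 5e-5}

def configure_stage(model: nn.Module, stage: int) -> torch.optim.Optimizer:
    """Freeze everything, then unfreeze only the modules trained in this stage."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for name in STAGE_TRAINABLE[stage]:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
            trainable.append(p)
    # Optimizer choice here is an assumption for illustration.
    return torch.optim.AdamW(trainable, lr=STAGE_LR[stage])

model = UnifiedModel()
optimizer = configure_stage(model, stage=2)  # e.g., visual encoder alignment
```

The same pattern explains why Stage 3 can "lock in" understanding: once the visual encoder, adapter, and LLM stop receiving gradients, the later generation stages (4 and 5) cannot degrade that capability.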
Data Mix
Let’s review the data used to train the model.
| Task | Datasets used | Additional information |
|---|---|---|
| Multimodal understanding | COYO, Wukong, Laion-5B, ShareGPT4V, CC3M | The researchers built a data preprocessing pipeline that filters noisy data, enhances caption quality, and balances dataset ratios to achieve optimal training performance. |
| Text-to-Image Generation | Laion-5B, JourneyDB | Using Laion-5B, the researchers first choose samples with an aesthetic score above 6. They then apply the Qwen2-VL model to generate detailed descriptions for each selected image, forming the Laion-aes6 dataset (a minimal sketch of this filtering step follows the table). |
| Image+Text-to-Image Generation (Image Editing) | OmniEdit, UltraEdit, SeedEdit | Datasets used to strengthen the model’s image editing capability. |
| Reference-Image-Driven Image Generation | Subjects200K, SynCD, StyleBooth | Subjects200K and SynCD were used for subject-driven image generation, while StyleBooth was used for style-driven image generation. |
| Pixel-Level Controlled Image Generation | MultiGen_20M | To enable canny-to-image (canny = edge detection), depth-to-image, inpainting, and outpainting. |
| In-House Data | In-house datasets | Additional datasets incorporating style-driven data, content removal, style translation, de-noise/de-blur data, colourization data, text rendering data, etc. |
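As a concrete illustration of the Laion-aes6 construction mentioned in the text-to-image row above, the sketch below filters records by aesthetic score and recaptions the survivors. The record fields and the `caption_with_qwen2_vl` helper are placeholders for illustration, not the authors’ actual pipeline.

```python
# Minimal sketch of the Laion-aes6 construction described above: keep only
# samples whose aesthetic score exceeds 6, then recaption them with a
# vision-language model. Field names and the captioning helper are
# illustrative placeholders.

def caption_with_qwen2_vl(image_path: str) -> str:
    """Placeholder: in practice this would call Qwen2-VL to produce a
    detailed description of the image."""
    return f"A detailed description of {image_path}"

def build_laion_aes6(samples: list[dict]) -> list[dict]:
    curated = []
    for sample in samples:
        if sample["aesthetic_score"] > 6.0:          # aesthetic-score filter
            curated.append({
                "image": sample["image"],
                "caption": caption_with_qwen2_vl(sample["image"]),
            })
    return curated

# Toy usage with made-up records standing in for Laion-5B metadata.
raw = [
    {"image": "img_001.jpg", "aesthetic_score": 6.4},
    {"image": "img_002.jpg", "aesthetic_score": 4.9},
]
print(build_laion_aes6(raw))  # only img_001.jpg survives the filter
```

This filter-then-recaption pattern is what lets a noisy web-scale corpus like Laion-5B yield a smaller, higher-quality text-to-image training set.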
What About Reinforcement Learning?
In the paper’s conclusion, the authors note that Ovis-U1 does not yet include a reinforcement learning stage, even though such a stage has proven critical for optimizing large models. They also point out that designing effective ways to align unified multimodal models with human preferences remains an important open research question.
Now that we have reviewed the model architecture and training process, let’s run the model.
Implementation
Start by setting up a GPU server (e.g., by centron). Once it’s ready, clone the repository and install the necessary packages by running the shell commands below in a terminal. Alternatively, you can try the model directly on Hugging Face Spaces.
```bash
# Install git-lfs for handling large files
apt install git-lfs
# Clone the Ovis-U1-3B repository from Hugging Face Spaces
git-lfs clone https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B
# Change directory into the cloned repository
cd Ovis-U1-3B
# Install pip for Python package management
apt install python3-pip
# Install required Python packages from requirements.txt
pip install -r requirements.txt
# Install additional Python packages for wheel and spaces
pip install wheel spaces
# Install PyTorch with CUDA 12.8 support and upgrade existing installations
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --upgrade
# Install xformers for optimized transformer operations
pip install -U xformers
# Install flash_attn for attention mechanism optimization
pip install flash_attn==2.7.4.post1
# Run the main application script
python app.py
```
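Running `python app.py` launches the Gradio demo bundled with the Space. If you would rather load the weights in your own script, the sketch below follows the standard `trust_remote_code` loading pattern used by other Ovis releases. The auto class, dtype choice, and the assumption that the weights are published under `AIDC-AI/Ovis-U1-3B` on the Hub should be checked against the model card; the multimodal preprocessing and generation calls live in the repository’s custom code (see app.py for complete examples).

```python
# Minimal sketch of loading Ovis-U1-3B directly with transformers.
# The auto class and repository id are assumptions; consult app.py or the
# Hugging Face model card for the exact API.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis-U1-3B",
    torch_dtype=torch.bfloat16,   # keep memory usage low on a single GPU
    trust_remote_code=True,       # the repo ships custom modeling code
).to("cuda").eval()

# Image/text preprocessing and generation are handled by the repository's
# custom code; see app.py in the cloned Space for understanding,
# text-to-image, and image-editing examples.
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```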
Final Thoughts
We’re incredibly excited about the continued evolution of multimodal large language models (MLLMs). The combination of carefully curated datasets, architectural innovations, and iterative capability improvements makes this area of AI especially compelling to watch. It’s fascinating to see how each advancement pushes these models closer to more versatile and practical real-world applications. Feel free to try it out for yourself!


