Ovis-U1: An Open-Source 3B Multimodal LLM Advancing Toward Human-Level Task Performance
Progress toward human-level task performance in Artificial General Intelligence (AGI) is being propelled by multimodal large language models (MLLMs). By combining multiple modalities, these systems can draw on richer inputs and deliver stronger capabilities at inference time. In this article, we explore Ovis-U1: an open-source, 3-billion-parameter model released by the Alibaba Ovis team. Its strengths span multimodal understanding, text-to-image generation, and editing of user-provided images.
Key Takeaways
- Ovis-U1 is a 3-billion-parameter open-source multimodal large language model developed by Alibaba.
- It supports capabilities such as multimodal understanding, text-to-image generation, and image editing.
- The model was trained using a diverse mixture of datasets spanning multiple tasks and domains.
- You can run the model on a GPU server or experiment with it directly through Hugging Face Spaces.
Training Process
| Stage | Trained Parameters | Task | Steps / Batch Size / Learning Rate | Description |
|---|---|---|---|---|
| 0 | Refiner + Visual Decoder | Text-to-Image Generation | 500 / 1024 / 1e-4 | Visual decoder pretraining begins from random initialization to form foundational image generation ability. The visual decoder and refiner produce images from LLM embeddings using text-to-image data. |
| 1 | Adapter | Understanding, Text-to-Image Generation, Image Editing | 1.5k / 8192 / 5e-4 | Adapter pretraining aligns visual and textual embeddings. The adapter starts from random initialization and is trained during this stage across understanding, text-to-image, and image editing tasks. |
| 2 | Visual Encoder + Adapter | Understanding, Text-to-Image Generation, Image Editing | 2.6k / 8192 / 1e-4 | Visual encoder alignment fine-tunes both the visual encoder and the adapter to better match visual and textual representations. All three task categories are used, and generation helps support embedding alignment. |
| 3 | Visual Encoder + Adapter + LLM | Understanding | 23 / 2240 / 5e-5 | Understanding learning trains the visual encoder, adapter, and LLM on understanding tasks. After this stage, these parameters are fixed to preserve understanding capability. |
| 4 | Refiner + Visual Decoder | Text-to-Image Generation | 275 / 256 / 5e-5 | Generation learning trains the refiner and visual decoder to align with the improved text and image embeddings after the LLM is tuned in Stage 3. This stage delivers stronger text-to-image performance. |
| 5 | Refiner + Visual Decoder | Text-to-Image Generation, Image Editing | 325 / 256 / 5e-5 | Generation fine-tuning extends the text-to-image foundation by fine-tuning the decoder for both text-to-image and image editing tasks. |
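To make the staged recipe above concrete, here is a minimal PyTorch sketch of how this kind of stage-wise training is commonly wired up: freeze everything, then unfreeze only the modules listed for the current stage. The `UnifiedModel` container, the `nn.Linear` stand-ins, and the optimizer choice are illustrative assumptions, not the actual Ovis-U1 implementation; only the module names and learning rates mirror the table.

```python
import torch
import torch.nn as nn

# Hypothetical container mirroring the components named in the table above.
# The real Ovis-U1 code differs; this only illustrates training different
# parameter groups in different stages.
class UnifiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(1024, 2048)   # stand-in for the vision encoder
        self.adapter        = nn.Linear(2048, 2048)   # visual-to-text embedding adapter
        self.llm            = nn.Linear(2048, 2048)   # stand-in for the language model
        self.refiner        = nn.Linear(2048, 2048)   # stand-in for the generation refiner
        self.visual_decoder = nn.Linear(2048, 1024)   # stand-in for the visual decoder

# Which modules receive gradients in each stage, per the table above.
STAGE_TRAINABLE = {
    0: ["refiner", "visual_decoder"],
    1: ["adapter"],
    2: ["visual_encoder", "adapter"],
    3: ["visual_encoder", "adapter", "llm"],
    4: ["refiner", "visual_decoder"],
    5: ["refiner", "visual_decoder"],
}

# Stage-specific learning rates from the table.
STAGE_LR = {0: 1e-4, 1: 5e-4, 2: 1e-4, 3: 5e-5, 4: 5e-5, 5: 5e-5}

def configure_stage(model: nn.Module, stage: int) -> torch.optim.Optimizer:
    """Freeze everything, then unfreeze only the modules trained in this stage."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for name in STAGE_TRAINABLE[stage]:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
            trainable.append(p)
    # Optimizer choice here is an assumption for illustration.
    return torch.optim.AdamW(trainable, lr=STAGE_LR[stage])

model = UnifiedModel()
optimizer = configure_stage(model, stage=2)  # e.g., visual encoder alignment
```

The same pattern explains why Stage 3 can "lock in" understanding: once the visual encoder, adapter, and LLM stop receiving gradients, the later generation stages (4 and 5) cannot degrade that capability.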
Data Mix
Let’s review the data used to train the model.
| Task | Datasets used | Additional information |
|---|---|---|
| Multimodal understanding | COYO, Wukong, Laion-5B, ShareGPT4V, CC3M | The researchers built a data preprocessing pipeline that filters noisy data, enhances caption quality, and balances dataset ratios to achieve optimal training performance. |
| Text-to-Image Generation | Laion-5B, JourneyDB | Using Laion-5B, the researchers first choose samples with an aesthetic score above 6. They then apply the Qwen2-VL model to generate detailed descriptions for each selected image, forming the Laion-aes6 dataset (a minimal sketch of this filtering step follows the table). |
| Image+Text-to-Image Generation (Image Editing) | OmniEdit, UltraEdit, SeedEdit | Datasets used to strengthen the model’s image editing capability. |
| Reference-Image-Driven Image Generation | Subjects200K, SynCD, StyleBooth | Subjects200K and SynCD were used for subject-driven image generation, while StyleBooth was used for style-driven image generation. |
| Pixel-Level Controlled Image Generation | MultiGen_20M | To enable canny-to-image (canny = edge detection), depth-to-image, inpainting, and outpainting. |
| In-House Data | In-house datasets | Additional datasets incorporating style-driven data, content removal, style translation, de-noise/de-blur data, colourization data, text rendering data, etc. |
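As a concrete illustration of the Laion-aes6 construction mentioned in the text-to-image row above, the sketch below filters records by aesthetic score and recaptions the survivors. The record fields and the `caption_with_qwen2_vl` helper are placeholders for illustration, not the authors’ actual pipeline.

```python
# Minimal sketch of the Laion-aes6 construction described above: keep only
# samples whose aesthetic score exceeds 6, then recaption them with a
# vision-language model. Field names and the captioning helper are
# illustrative placeholders.

def caption_with_qwen2_vl(image_path: str) -> str:
    """Placeholder: in practice this would call Qwen2-VL to produce a
    detailed description of the image."""
    return f"A detailed description of {image_path}"

def build_laion_aes6(samples: list[dict]) -> list[dict]:
    curated = []
    for sample in samples:
        if sample["aesthetic_score"] > 6.0:          # aesthetic-score filter
            curated.append({
                "image": sample["image"],
                "caption": caption_with_qwen2_vl(sample["image"]),
            })
    return curated

# Toy usage with made-up records standing in for Laion-5B metadata.
raw = [
    {"image": "img_001.jpg", "aesthetic_score": 6.4},
    {"image": "img_002.jpg", "aesthetic_score": 4.9},
]
print(build_laion_aes6(raw))  # only img_001.jpg survives the filter
```

This filter-then-recaption pattern is what lets a noisy web-scale corpus like Laion-5B yield a smaller, higher-quality text-to-image training set.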
What About Reinforcement Learning?
In the paper’s conclusion, the authors note that Ovis-U1 does not yet include a reinforcement learning stage, even though such a stage has proven critical for optimizing large models. They also point out that designing effective ways to align unified multimodal models with human preferences remains an important open research question.
Now that we have reviewed the model architecture and training process, let’s run the model.
Implementation
Start by setting up a GPU server (e.g., by centron). Once it’s ready, clone the repository and install the necessary packages by running the shell commands below in a terminal. Alternatively, you can try the model directly on Hugging Face Spaces.
```bash
# Install git-lfs for handling large files
apt install git-lfs
# Clone the Ovis-U1-3B repository from Hugging Face Spaces
git-lfs clone https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B
# Change directory into the cloned repository
cd Ovis-U1-3B
# Install pip for Python package management
apt install python3-pip
# Install required Python packages from requirements.txt
pip install -r requirements.txt
# Install additional Python packages for wheel and spaces
pip install wheel spaces
# Install PyTorch with CUDA 12.8 support and upgrade existing installations
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --upgrade
# Install xformers for optimized transformer operations
pip install -U xformers
# Install flash_attn for attention mechanism optimization
pip install flash_attn==2.7.4.post1
# Run the main application script
python app.py
```
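Running `python app.py` launches the Gradio demo bundled with the Space. If you would rather load the weights in your own script, the sketch below follows the standard `trust_remote_code` loading pattern used by other Ovis releases. The auto class, dtype choice, and the assumption that the weights are published under `AIDC-AI/Ovis-U1-3B` on the Hub should be checked against the model card; the multimodal preprocessing and generation calls live in the repository’s custom code (see app.py for complete examples).

```python
# Minimal sketch of loading Ovis-U1-3B directly with transformers.
# The auto class and repository id are assumptions; consult app.py or the
# Hugging Face model card for the exact API.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis-U1-3B",
    torch_dtype=torch.bfloat16,   # keep memory usage low on a single GPU
    trust_remote_code=True,       # the repo ships custom modeling code
).to("cuda").eval()

# Image/text preprocessing and generation are handled by the repository's
# custom code; see app.py in the cloned Space for understanding,
# text-to-image, and image-editing examples.
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```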
Final Thoughts
We’re incredibly excited about the continued evolution of multimodal large language models (MLLMs). The combination of carefully curated datasets, architectural innovations, and iterative capability improvements makes this area of AI especially compelling to watch. It’s fascinating to see how each advancement pushes these models closer to more versatile and practical real-world applications. Feel free to try it out for yourself!


