Content

1 Key Takeaways
2 Release Details
3 Weight Initialization
4 Pixtral Architecture
5 Continual Pretraining
6 Supervised Fine-Tuning (SFT)
7 Implementation
8 References
9 Conclusion

Vijona

2 hours ago

ServiceNow Apriel-1.5-15B-Thinker: Multimodal Reasoning Model Overview

What makes ServiceNow’s new multimodal reasoning model, Apriel-1.5-15B-Thinker, especially notable is its focus on midtraining instead of relying heavily on post-training and Reinforcement Learning (RL). A form of post-training, supervised fine-tuning (SFT), is still used, but only for text-based data that contains reasoning traces, not for image data.

The model has 15 billion parameters, is open-weight, multimodal, and can be deployed on a single GPU. For context, gpt-oss, OpenAI's open-weight reasoning model family, is text-only and comes in 20B and 120B parameter variants, so Apriel-1.5-15B-Thinker has fewer total parameters than either variant while also accepting image input. Despite its relatively compact size, the model delivers strong performance. It is reported to be at least ten times smaller than any other model that scores above 50 on the Artificial Analysis Intelligence Index.

Key Takeaways

Midtraining Emphasis: Apriel-1.5-15B-Thinker is distinctive because it relies on a midtraining pipeline that includes depth upscaling, staged continual pre-training, and text-only SFT, rather than extensive post-training and RL.
Memory Efficiency and Performance: Although it has only 15 billion parameters, Apriel-1.5-15B-Thinker is an open-weight multimodal reasoning model that can run on a single GPU. It offers a memory-efficient alternative to larger models while still achieving strong results, including a score above 50 on the Artificial Analysis Intelligence Index at one-tenth the size of comparable models.
Open-Weight Release: The release provides the model checkpoint, training recipes, and evaluation protocols, supporting transparency and further research.
Staged Training Approach: The model benefits from a multi-stage training process. This includes initialization from Pixtral-12B-Base-2409, depth upscaling, projection network realignment, and two-stage continual pretraining, first focused on text and then on images. This is followed by Supervised Fine-Tuning (SFT) using carefully selected high-signal data.
General-Purpose Capabilities: Apriel-1.5-15B-Thinker is built for a broad range of instruction-based tasks, including code assistance, logical reasoning, and function calling. However, it is not intended for safety-critical use cases where complete factual accuracy is required.

Release Details

The open-weight release of Apriel-1.5-15B-Thinker includes the model checkpoint, full training recipes, and evaluation protocols. To our knowledge, however, the specific datasets used for training are not included in the release.

The paper only provides broad descriptions of the data types involved, including:

Pretraining-style corpora
Web-style text and image data
Reasoning-focused samples
Verified and unverified synthetic data

Weight Initialization

This model was not pretrained from scratch. Instead, the researchers initialized training with the weights of Pixtral-12B-Base-2409, a 12-billion-parameter multimodal model. This base model uses a LLaVA-style architecture, with a vision encoder connected to a multimodal decoder through a two-layer fully connected projection network.

It is interesting to consider why this model was selected for weight initialization, since it was released in September 2024 and smaller, more performant alternatives now exist. The paper states only that the goal was “to enable multimodal capabilities in a compute efficient manner” and notes that the researchers “used a version from Unsloth, which is no longer available at https://huggingface.co/unsloth as of this writing.”

Pixtral Architecture

Midtraining

Because the term “midtraining” is still relatively new, its definition can vary. For that reason, papers often define how they use the term.

Note: For Apriel-1.5-15B-Thinker, the midtraining pipeline is organized into three stages: depth upscaling, staged continual pre-training, and high-quality text-only SFT.

Depth Upscaling

Depth upscaling in this paper means that the decoder was expanded from 40 to 48 hidden layers. The authors used a related strategy in Apriel-Nemotron-15B-Thinker, where they scaled a 12B model into a stable 15B model by adding transformer layers. These additional layers were initialized using techniques such as averaging, max-pooling, averaging alternate layers, and layer duplication.
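To make the idea of growing a decoder from 40 to 48 layers concrete, here is a minimal sketch of the layer-duplication variant only. The choice of which layers to duplicate (evenly spaced ones) and the function name are assumptions made for illustration; the article does not say which layers were copied or how the other initialization techniques were combined. Layers are represented by plain labels rather than real weights.

```python
def upscale_by_duplication(layers, target_depth):
    """Return a deeper layer list by duplicating evenly spaced layers.

    `layers` is any list of per-layer objects (here: plain string labels).
    Which layers get duplicated is an assumption made for illustration.
    """
    extra = target_depth - len(layers)
    if extra < 0 or extra > len(layers):
        raise ValueError("target_depth must be between 1x and 2x the current depth")
    if extra == 0:
        return list(layers)
    # Pick `extra` distinct source indices spread evenly across the stack.
    step = len(layers) / extra
    to_duplicate = {int(i * step + step / 2) for i in range(extra)}
    upscaled = []
    for idx, layer in enumerate(layers):
        upscaled.append(layer)
        if idx in to_duplicate:
            # A real model would deep-copy the layer's weights here.
            upscaled.append(layer)
    return upscaled


original = [f"layer_{i}" for i in range(40)]
deeper = upscale_by_duplication(original, 48)
print(len(original), "->", len(deeper))
```

The duplicated layers start out identical to their neighbours, which is why further training on a text corpus is needed before the extra depth becomes useful.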

Afterward, the model was trained on a large text-token corpus. Part of this data had already been included in earlier training, while the rest came from sources such as high-quality web content, technical literature, mathematical problem sets, programming code, and StackExchange discussions.

Compared with training a new model from scratch, this approach is more efficient in terms of compute and data. The trade-off is that adding more layers can improve performance while also making inference more computationally expensive. Before the projection network alignment stage, the researchers also averaged the weights from six evenly spaced intermediate checkpoints produced during depth upscaling.
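Checkpoint averaging as described above can be sketched in a few lines. This is an illustrative version that stores each parameter as a flat list of floats; a real implementation would average tensors from saved state dictionaries. The helper names, the toy step count, and the assumption that the last selected checkpoint is the final one are all hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of several checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats; a real implementation would operate on tensors instead.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged


def evenly_spaced_steps(total_steps, count):
    """Pick `count` evenly spaced step numbers, ending at `total_steps`.

    Including the final step is an assumption, not something the source states.
    """
    return [round(total_steps * (i + 1) / count) for i in range(count)]


# Toy example: two tiny "checkpoints" with a single parameter each.
merged = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
steps = evenly_spaced_steps(6000, 6)
print(merged, steps)
```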

Projection Network Realignment

After scaling the layers, the researchers realigned the projection network using image captioning datasets, multimodal instruction-response pairs, and document understanding tasks. The pretrained weights of the encoder and decoder were not modified during this step.

This likely helped stabilize the model’s performance by ensuring that the expanded decoder could still interpret visual features from the encoder effectively. The checkpoint produced during projection network realignment was then used for the later training stages.
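Training only the projection network while keeping the encoder and decoder fixed amounts to choosing which parameters receive gradient updates. The sketch below expresses that choice by parameter-name prefix. The parameter names and the "projector." prefix are hypothetical and will differ in the actual checkpoint; in PyTorch the resulting flags would typically be applied by setting each parameter's requires_grad attribute.

```python
def trainable_flags(param_names, trainable_prefixes):
    """Map each parameter name to True (updated) or False (frozen)."""
    prefixes = tuple(trainable_prefixes)
    return {name: name.startswith(prefixes) for name in param_names}


# Hypothetical parameter names, for illustration only.
names = [
    "vision_encoder.blocks.0.weight",
    "projector.linear_1.weight",
    "projector.linear_2.weight",
    "decoder.layers.0.attn.weight",
]

# Projection network realignment: only the projector is updated.
flags = trainable_flags(names, ["projector."])
for name, is_trainable in flags.items():
    print(f"{name}: {'train' if is_trainable else 'frozen'}")
```

The same helper can describe the other stages in this article by changing the prefix list, for example projector plus decoder when the vision encoder is frozen.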

Training Setup

The researchers used a sequence length of 8192 tokens with sequence packing, along with a learning rate of 5e-5 and linear decay, for both depth upscaling and projection network realignment.

The sequence length is the maximum number of tokens in a single training sequence, that is, how many tokens the model processes together in one forward and backward pass. It is closely related to the context window available at inference time. A longer sequence length allows the model to consider a much broader context at once, which is especially useful for tasks such as multimodal instruction following and document understanding, where information may be spread across distant parts of the input.

Sequence packing means combining multiple shorter training examples into one longer sequence, up to the maximum length of 8192 tokens in this case. This improves GPU efficiency by reducing the amount of padding required, meaning less computation is wasted on empty tokens. As a result, training becomes more efficient and can be completed faster.
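The padding savings from sequence packing are easy to see with a small example. The following is a minimal greedy packer, not the packing routine used by the researchers. Truncating over-long examples and packing in arrival order are simplifying assumptions, and production implementations often also adjust attention masks so that packed examples do not attend to each other.

```python
def pack_sequences(examples, max_len=8192):
    """Greedily pack tokenized examples into sequences of at most max_len tokens.

    `examples` is a list of token-id lists. Examples longer than max_len are
    truncated here for simplicity; that policy is an assumption.
    """
    packed = []
    current = []
    for tokens in examples:
        tokens = tokens[:max_len]
        # Start a new packed sequence when the next example would not fit.
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed


def padding_fraction(sequences, max_len):
    """Share of positions that are padding when every sequence is padded to max_len."""
    total = len(sequences) * max_len
    used = sum(len(seq) for seq in sequences)
    return 1 - used / total


# Toy token lists with made-up lengths.
examples = [[0] * n for n in (3000, 4000, 2000, 5000, 1000)]
packed = pack_sequences(examples, max_len=8192)
print([len(seq) for seq in packed])
print(round(padding_fraction(examples, 8192), 3), round(padding_fraction(packed, 8192), 3))
```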

The learning rate controls the size of each optimization step during gradient descent. With linear decay, training starts at a learning rate of 5 × 10⁻⁵, or 0.00005, and reduces it by the same amount at every step, typically toward zero by the end of training. This allows larger updates early in training and smaller, more stable updates later, helping the model converge more reliably without overshooting the optimal solution.
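A linear decay schedule can be written as a one-line interpolation between the starting learning rate and a final value. The sketch below assumes the rate decays all the way to zero and uses a made-up total step count; neither detail is given in the article.

```python
def linear_decay_lr(step, total_steps, peak_lr=5e-5, final_lr=0.0):
    """Learning rate at `step` under linear decay from peak_lr to final_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr + (final_lr - peak_lr) * progress


total_steps = 1000  # hypothetical run length
for step in (0, 250, 500, 750, 1000):
    print(step, linear_decay_lr(step, total_steps))
```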

Continual Pretraining

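The continual pretraining process is split into two stages. The first stage mixes text-only and multimodal data to strengthen foundational reasoning and image understanding, while the second stage uses targeted image data to sharpen visual reasoning.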

The term continual pretraining may sound unusual. In this context, it appears to mean that the researchers continued pretraining the depth-upscaled model derived from Pixtral-12B rather than starting a new pretraining run. This differs from fine-tuning, which is usually intended to improve performance on a specific task. Here, the goal was to improve general multimodal performance.

The following table summarizes how the researchers structured staged continual pretraining:

Feature	CPT Stage 1: Foundational Reasoning and Multimodal Data	CPT Stage 2: Targeted Visual Reasoning Data
Purpose	To improve textual reasoning and build broad multimodal capabilities and foundational image understanding.	To further strengthen visual reasoning, especially spatial structure, compositional understanding, and fine-grained perception.
Dataset Composition	A mixture of text-only and multimodal tokens: 50% text-only tokens covering mathematical and scientific reasoning, coding, and general knowledge; 20% replayed tokens from the decoder upscaling stage; and 30% multimodal tokens covering document and chart understanding, image captioning, long-form image descriptions, OCR, and reasoning over visual mathematical and logical problems.	A targeted multimodal dataset created through a synthetic data generation pipeline applied to large raw image collections. Main categories include image reconstruction, visual matching, object detection, and counting.
Unfrozen and Frozen Components	The vision encoder, projection network, and decoder were all unfrozen and updated during training.	The vision encoder was frozen and not updated. The projection network and decoder were updated.
Sequence Length	32768 with sequence packing.	16384 with sequence packing.
Learning Rate	5e-5 with cosine decay and 10% warmup.	1e-5 with cosine decay and 10% warmup.
Loss Computation	Computed across all tokens in the sequence.	Computed only on responses for instruction-response samples.
Final Checkpoint	The weights of three evenly spaced intermediate checkpoints were averaged.	The final checkpoint from this stage was used as the base model for later stages, including SFT.
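Both stages in the table use cosine decay with 10% warmup, which differs from the linear decay used earlier. The following sketch shows one common way to implement such a schedule. The linear shape of the warmup, the decay to zero, and the total step count are assumptions; the table only gives the peak learning rates and the warmup fraction.

```python
import math


def cosine_with_warmup_lr(step, total_steps, peak_lr, warmup_ratio=0.1, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp up linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: follow half a cosine wave from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical run length
for name, peak in (("CPT Stage 1", 5e-5), ("CPT Stage 2", 1e-5)):
    samples = [cosine_with_warmup_lr(s, total_steps, peak) for s in (0, 50, 100, 550, 1000)]
    print(name, samples)
```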

Supervised Fine-Tuning (SFT)

Depth upscaling and continual pretraining produced a base model with solid reasoning abilities, but Supervised Fine-Tuning (SFT) helped improve it further. The researchers were careful about compute usage, so they selected curated data with high-signal prompts and used an existing open-weight model, gpt-oss-120b, as the annotator instead of training a separate annotator model.

Aspect	Details
Dataset	Millions of high-quality instruction-response pairs with explicit reasoning traces.
Domains	Math, coding, science, tool calling, conversations, instruction-following, security, content moderation, and robustness.
Annotator Model	gpt-oss-120b, selected instead of DeepSeek-R1-0528 for compute efficiency.
Verification	Execution-verified data for verifiable domains, with samples evolved toward greater complexity.
Data Processing	De-duplication, content filtering, heuristic filtering, LLM-as-Judge verification, execution-based checks, rejection sampling, format checks, and benchmark decontamination.
Initial Training	4 epochs at 32768 sequence length.
Smaller Run 1	25% stratified subset, 4 epochs at 32768 sequence length.
Smaller Run 2	49152 sequence length using mixed-length samples.
Updates	Decoder only, using text data only.
Loss Computation	Response tokens only.
Final Model	Weight average of two smaller runs to achieve cost-effective performance gains.
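Computing the loss on response tokens only is usually done by masking the prompt positions in the label sequence. The sketch below shows the idea with plain lists. The function name and the toy token ids are hypothetical, and the value -100 is the default ignore index of PyTorch's cross-entropy loss rather than something stated in the article.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_labels(prompt_ids, response_ids, response_only=True):
    """Return (input_ids, labels) for one instruction-response sample.

    With response_only=True, prompt positions are masked so they do not
    contribute to the loss. With response_only=False, every token counts.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if response_only:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels


# Toy token ids, for illustration only.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)
print(labels)
```

Setting response_only=False corresponds to the all-token loss used in the first continual pretraining stage.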

Implementation

The Apriel model family is designed to support a wide range of general-purpose instruction tasks, including code assistance and generation, logical reasoning and multi-step problem-solving, question answering, and information retrieval. These models also perform well in function calling, complex instruction execution, and agent-based applications.

However, they are not designed for use in safety-critical environments without human supervision, or in any context where absolute factual accuracy is required.

As mentioned earlier, inference can be performed on a single GPU. Start by setting up a suitable GPU-enabled cloud instance. To work in a notebook on that instance, follow a general tutorial for setting up a GPU-based AI/ML development environment with JupyterLab.

An AI/ML-ready or inference-optimized image is recommended. Many cloud infrastructure providers offer GPU instances based on NVIDIA or AMD hardware.

In the terminal:

pip install transformers==4.48 jinja2==3.1.0 torch torchvision jupyter
pip install huggingface-hub
huggingface-cli download ServiceNow-AI/Apriel-1.5-15b-Thinker
jupyter lab --allow-root

In this example code snippet, two sample prompts are used. The first is a simple text prompt asking for the capital of France. The second analyzes an image.

#Tested with transformers==4.48

import re
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load model
model_id = "ServiceNow-AI/Apriel-1.5-15b-Thinker"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Example 1: Text-only prompt
chat = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"},
        ],
    }
]

inputs = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
inputs.pop("token_type_ids", None)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)

generated_ids = output_ids[:, inputs['input_ids'].shape[1]:]
output = processor.decode(generated_ids[0], skip_special_tokens=True)
response = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)[0].strip()

print("Text-only Response:", response)

# Example 2: Image understanding
url = "https://picsum.photos/id/237/200/300"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

chat = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Which animal is this?"},
            {"type": "image"},
        ],
    }
]

prompt = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)

generated_ids = output_ids[:, inputs['input_ids'].shape[1]:]
output = processor.decode(generated_ids[0], skip_special_tokens=True)
response = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)[0].strip()

print("Image Response:", response)

You can test the model yourself using more demanding prompts.
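One practical caveat with more demanding prompts: the snippet above indexes the first regular-expression match directly, which raises an IndexError if the model's output does not contain both markers, for example when generation stops at max_new_tokens before the final response is written. A more forgiving helper might look like the sketch below. The function name and the fallback behaviour of returning the raw text are assumptions, not part of the model's documented usage.

```python
import re

FINAL_RESPONSE_PATTERN = re.compile(
    r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", re.DOTALL
)


def extract_final_response(output):
    """Return the text between the final-response markers.

    Falls back to the stripped raw output when the markers are missing,
    instead of raising an IndexError.
    """
    match = FINAL_RESPONSE_PATTERN.search(output)
    if match:
        return match.group(1).strip()
    return output.strip()


complete = "Some reasoning...\n[BEGIN FINAL RESPONSE]\nParis\n[END FINAL RESPONSE]"
truncated = "Some reasoning that was cut off "
print(extract_final_response(complete))
print(extract_final_response(truncated))
```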

References

(Paper) Apriel-1.5-15B-Thinker: Mid-training is all you need
Hugging Face Model Page

Conclusion

Apriel-1.5-15B-Thinker is a 15B-parameter model that can run on a single GPU. It uses depth upscaling, staged continual pre-training, and text-only SFT to achieve strong reasoning performance despite its relatively small parameter count. Test the model and share your thoughts.

Source: digitalocean.com
