ServiceNow Apriel-1.5-15B-Thinker: Multimodal Reasoning Model Overview
What makes ServiceNow’s new multimodal reasoning model, Apriel-1.5-15B-Thinker, especially notable is its focus on midtraining instead of relying heavily on post-training and Reinforcement Learning (RL). A form of post-training, supervised fine-tuning (SFT), is still used, but only for text-based data that contains reasoning traces, not for image data.
The model has 15 billion parameters, is open-weight, multimodal, and can be deployed on a single GPU. For context, by raw parameter count this is smaller than gpt-oss, OpenAI's open-weight reasoning model (text-only), which is available in 20B and 120B parameter variants. A smaller parameter count generally means a smaller memory footprint at the same numeric precision. Despite its relatively compact size, the model delivers strong performance. Reportedly, every other model that scores above 50 on the Artificial Analysis Intelligence Index is at least ten times its size.
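As a rough back-of-the-envelope check on the single-GPU claim, the sketch below estimates the memory needed just to hold the weights. The bytes-per-parameter values are assumptions about the storage format (2 for bfloat16, 4 for float32), and the estimate ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 15e9  # 15 billion parameters, as stated for Apriel-1.5-15B-Thinker

# Assumed storage formats; only the weights are counted here.
for label, bytes_per_param in [("bfloat16", 2), ("float32", 4)]:
    gb = weight_memory_gb(NUM_PARAMS, bytes_per_param)
    print(f"{label}: ~{gb:.0f} GB of weights")
```

Under these assumptions, the bfloat16 weights come to about 30 GB, which is why a single GPU with enough memory is plausible for inference.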
Key Takeaways
- Midtraining Emphasis: Apriel-1.5-15B-Thinker is distinctive because it relies on a midtraining pipeline that includes depth upscaling, staged continual pre-training, and text-only SFT, rather than extensive post-training and RL.
- Memory Efficiency and Performance: With 15 billion parameters, Apriel-1.5-15B-Thinker is an open-weight multimodal reasoning model that can run on a single GPU. It offers a memory-efficient alternative to larger models while still scoring above 50 on the Artificial Analysis Intelligence Index at roughly one-tenth the size of comparably scoring models.
- Open-Weight Release: The release provides the model checkpoint, training recipes, and evaluation protocols, supporting transparency and further research.
- Staged Training Approach: The model benefits from a multi-stage training process. This includes initialization from Pixtral-12B-Base-2409, depth upscaling, projection network realignment, and two-stage continual pretraining, first focused on text and then on images. This is followed by Supervised Fine-Tuning (SFT) using carefully selected high-signal data.
- General-Purpose Capabilities: Apriel-1.5-15B-Thinker is built for a broad range of instruction-based tasks, including code assistance, logical reasoning, and function calling. However, it is not intended for safety-critical use cases where complete factual accuracy is required.
Release Details
The open-weight release of Apriel-1.5-15B-Thinker includes the model checkpoint, full training recipes, and evaluation protocols. To our knowledge, however, the specific datasets used for training are not included in the release.
The paper only provides broad descriptions of the data types involved, including:
- Pretraining-style corpora
- Web-style text and image data
- Reasoning-focused samples
- Verified and unverified synthetic data
Weight Initialization
This model was not pretrained from the beginning. Instead, the researchers initialized training with the weights of Pixtral-12B-Base-2409, a 12-billion-parameter multimodal model. This base model uses a LLaVA-style architecture, with a vision encoder connected to a multimodal decoder through a two-layer fully connected projection network.
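To make the LLaVA-style wiring more concrete, here is a minimal, framework-free sketch of a two-layer fully connected projection that maps vision-encoder patch embeddings into the decoder's embedding space. The dimensions, the random initialization, and the GELU activation between the two layers are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def gelu(x: float) -> float:
    """Exact GELU activation (assumed here; the paper's choice is not stated above)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight, bias):
    """Apply a fully connected layer; weight has shape [out_dim][in_dim]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def make_layer(in_dim, out_dim, rng):
    """Create a toy layer with small random weights and zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project(patch_embedding, layer1, layer2):
    """Two-layer projection: linear -> activation -> linear."""
    hidden = [gelu(h) for h in linear(patch_embedding, *layer1)]
    return linear(hidden, *layer2)

rng = random.Random(0)
VISION_DIM, DECODER_DIM = 8, 16  # toy sizes, far smaller than a real model
layer1 = make_layer(VISION_DIM, DECODER_DIM, rng)
layer2 = make_layer(DECODER_DIM, DECODER_DIM, rng)

# Four fake image patches coming out of the "vision encoder".
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
image_tokens = [project(p, layer1, layer2) for p in patches]
print(len(image_tokens), "image tokens of width", len(image_tokens[0]))
```

The projected vectors have the same width as the decoder's token embeddings, so they can be placed in the input sequence alongside text tokens.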
It is interesting to consider why this model was selected for weight initialization, since it was released in September 2024 (the "2409" in its name) and smaller, more performant alternatives now exist. The paper explains that the goal was “to enable multimodal capabilities in a compute efficient manner” and that the researchers “used a version from Unsloth, which is no longer available at https://huggingface.co/unsloth as of this writing.”
[Figure: Pixtral architecture]
Midtraining
Because the term “midtraining” is still relatively new, its definition can vary. For that reason, papers often define how they use the term.
Note: For Apriel-1.5-15B-Thinker, the midtraining pipeline is organized into three stages: depth upscaling, staged continual pre-training, and high-quality text-only SFT.
Depth Upscaling
Depth upscaling in this paper means that the decoder was expanded from 40 to 48 hidden layers. The authors used a related strategy in Apriel-Nemotron-15B-Thinker, where they scaled a 12B model into a stable 15B model by adding transformer layers. These additional layers were initialized using techniques such as averaging, max-pooling, averaging alternate layers, and layer duplication.
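The layer-duplication idea can be sketched in a few lines. This toy example grows a 40-layer stack to 48 layers by inserting copies of evenly spaced layers; which layers are duplicated and where the copies are placed are assumptions made for illustration, and each "layer" is a plain dict rather than a real transformer block.

```python
import copy

def upscale_by_duplication(layers, target_depth):
    """Return a deeper stack by inserting copies of evenly spaced layers."""
    extra = target_depth - len(layers)
    if extra < 0 or extra > len(layers):
        raise ValueError("target_depth must be between len(layers) and 2 * len(layers)")
    if extra == 0:
        return list(layers)
    step = len(layers) / extra
    # Indices of the layers to duplicate, spread evenly over the stack.
    sources = {int(i * step + step / 2) for i in range(extra)}
    upscaled = []
    for index, layer in enumerate(layers):
        upscaled.append(layer)
        if index in sources:
            # The copy starts with identical weights and is trained further afterwards.
            upscaled.append(copy.deepcopy(layer))
    return upscaled

# Toy stand-in for a 40-layer decoder.
base_layers = [{"source_layer": i, "weights": [float(i)]} for i in range(40)]
deeper = upscale_by_duplication(base_layers, target_depth=48)
print(len(base_layers), "->", len(deeper))
```

Because each new layer starts as a copy of an existing one, the upscaled model begins from a reasonable initialization instead of random weights, which is the point of initializing the added layers from the existing ones.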
Afterward, the model was trained on a large text-token corpus. Part of this data had already been included in earlier training, while the rest came from sources such as high-quality web content, technical literature, mathematical problem sets, programming code, and StackExchange discussions.
Compared with training a new model from scratch, this approach is more efficient in terms of compute and data. The trade-off is that adding more layers can improve performance while also making inference more computationally expensive. Before the projection network alignment stage, the researchers also averaged the weights from six evenly spaced intermediate checkpoints produced during depth upscaling.
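Checkpoint averaging is simple to illustrate. The sketch below picks six evenly spaced checkpoints from a run and averages their weights element by element. The number of saved checkpoints (12), the parameter name, and the list-of-floats representation are hypothetical; a real implementation would average tensors in each state dict.

```python
def evenly_spaced(items, count):
    """Pick `count` items at evenly spaced positions, always ending on the last item."""
    if not 1 <= count <= len(items):
        raise ValueError("count must be between 1 and len(items)")
    return [items[(i + 1) * len(items) // count - 1] for i in range(count)]

def average_checkpoints(checkpoints):
    """Average every parameter element-wise across the given checkpoints."""
    averaged = {}
    for name in checkpoints[0]:
        tensors = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(values) / len(values) for values in zip(*tensors)]
    return averaged

# Hypothetical run that saved 12 checkpoints; checkpoint k holds weights [k, 2k].
saved = [{"decoder.w": [float(k), 2.0 * k]} for k in range(1, 13)]
selected = evenly_spaced(saved, count=6)
merged = average_checkpoints(selected)
print(merged)
```

Averaging weights from several points along the training trajectory is a cheap way to smooth out noise in any single checkpoint without extra training.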
Projection Network Realignment
After scaling the layers, the researchers realigned the projection network using image captioning datasets, multimodal instruction-response pairs, and document understanding tasks. The pretrained weights of the encoder and decoder were not modified during this step.
This likely helped stabilize the model’s performance by ensuring that the expanded decoder could still interpret visual features from the encoder effectively. The checkpoint produced during projection network realignment was then used for the later training stages.
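One way to picture this step is as a filter over parameter names: only the projection network receives updates, while everything else stays frozen. The parameter names and prefixes below are hypothetical and chosen only for illustration.

```python
def split_parameters(param_names, trainable_prefixes):
    """Split parameter names into (trainable, frozen) based on name prefixes."""
    prefixes = tuple(trainable_prefixes)
    trainable = [name for name in param_names if name.startswith(prefixes)]
    frozen = [name for name in param_names if not name.startswith(prefixes)]
    return trainable, frozen

# Hypothetical parameter names for a LLaVA-style model.
param_names = [
    "vision_encoder.blocks.0.weight",
    "projector.linear_1.weight",
    "projector.linear_2.weight",
    "decoder.layers.0.weight",
    "decoder.layers.47.weight",
]

# Projection network realignment: only the projector is updated.
trainable, frozen = split_parameters(param_names, trainable_prefixes=["projector."])
print("trainable:", trainable)
print("frozen:", frozen)
```

In a PyTorch training loop, the same split would typically be applied by setting `requires_grad` to `False` on the frozen parameters before building the optimizer.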
Training Setup
The researchers used a sequence length of 8192 tokens with sequence packing, along with a learning rate of 5e-5 and linear decay, for both depth upscaling and projection network realignment.
The sequence length defines how many tokens the model processes per training sequence in a single forward and backward pass. It is closely related to the context window, the maximum number of tokens the model can attend to at once. A longer sequence length allows the model to consider a much broader context, which is especially useful for tasks such as multimodal instruction following and document understanding, where information may be spread across distant parts of the input.
Sequence packing means combining multiple shorter training examples into one longer sequence, up to the maximum length of 8192 tokens in this case. This improves GPU efficiency by reducing the amount of padding required, meaning less computation is wasted on empty tokens. As a result, training becomes more efficient and can be completed faster.
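A minimal greedy version of sequence packing looks like the sketch below. It concatenates tokenized examples in order and starts a new packed sequence whenever the next example would exceed the maximum length. Truncating over-long examples and packing in arrival order are simplifying assumptions, and the sketch does not handle attention masking across example boundaries, which real implementations usually need.

```python
def pack_sequences(examples, max_length=8192):
    """Greedily concatenate tokenized examples into sequences of at most max_length tokens."""
    packed, current = [], []
    for tokens in examples:
        tokens = tokens[:max_length]  # assumption: over-long examples are truncated
        if current and len(current) + len(tokens) > max_length:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed

# Toy example with a maximum length of 10 instead of 8192.
examples = [[1] * 4, [2] * 3, [3] * 5, [4] * 2, [5] * 6]
packed = pack_sequences(examples, max_length=10)

real_tokens = sum(len(e) for e in examples)
padding_unpacked = len(examples) * 10 - real_tokens
padding_packed = len(packed) * 10 - real_tokens
print("packed lengths:", [len(p) for p in packed])
print("padding without packing:", padding_unpacked, "with packing:", padding_packed)
```

In this toy case, packing cuts the number of padding tokens from 30 to 10, which is the efficiency gain described above.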
The learning rate controls the size of each optimization step during gradient descent. With linear decay, training starts at a learning rate of 5 × 10⁻⁵, or 0.00005, and gradually reduces it to zero over time. This allows larger updates early in training and smaller, more stable updates later, helping the model converge more reliably without overshooting the optimal solution.
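The linear decay schedule can be written as a one-line function. The total number of steps below is a hypothetical value, since the number of training steps is not given here.

```python
def linear_decay_lr(step, total_steps, peak_lr=5e-5):
    """Learning rate that falls linearly from peak_lr at step 0 to zero at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must be between 0 and total_steps")
    return peak_lr * (1.0 - step / total_steps)

TOTAL_STEPS = 1000  # hypothetical training length
for step in (0, 250, 500, 750, 1000):
    print(step, f"{linear_decay_lr(step, TOTAL_STEPS):.2e}")
```

Halfway through training the learning rate is half of its starting value, and it reaches zero on the final step.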
Continual Pretraining
The continual pretraining process is split into two stages. The first stage focuses on text data, while the second stage focuses on image data.
The term continual pretraining may sound unusual. In this context, it appears to mean that the researchers continued the pretraining process of the base model, Pixtral-12B. This differs from fine-tuning, which is usually intended to improve performance on a specific task. Here, the goal was to improve general multimodal performance.
The following table summarizes how the researchers structured staged continual pretraining:
| Feature | CPT Stage 1: Foundational Reasoning and Multimodal Data | CPT Stage 2: Targeted Visual Reasoning Data |
|---|---|---|
| Purpose | To improve textual reasoning and build broad multimodal capabilities and foundational image understanding. | To further strengthen visual reasoning, especially spatial structure, compositional understanding, and fine-grained perception. |
| Dataset Composition | A mixture of text-only and multimodal tokens: 50% text-only tokens covering mathematical and scientific reasoning, coding, and general knowledge; 20% replayed tokens from the decoder upscaling stage; and 30% multimodal tokens covering document and chart understanding, image captioning, long-form image descriptions, OCR, and reasoning over visual mathematical and logical problems. | A targeted multimodal dataset created through a synthetic data generation pipeline applied to large raw image collections. Main categories include image reconstruction, visual matching, object detection, and counting. |
| Unfrozen and Frozen Components | The vision encoder, projection network, and decoder were all unfrozen and updated during training. | The vision encoder was frozen and not updated. The projection network and decoder were updated. |
| Sequence Length | 32768 with sequence packing. | 16384 with sequence packing. |
| Learning Rate | 5e-5 with cosine decay and 10% warmup. | 1e-5 with cosine decay and 10% warmup. |
| Loss Computation | Computed across all tokens in the sequence. | Computed only on responses for instruction-response samples. |
| Final Checkpoint | The weights of three evenly spaced intermediate checkpoints were averaged. | The final checkpoint from this stage was used as the base model for later stages, including SFT. |
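The learning-rate settings in the table can be sketched as a cosine schedule with warmup. The table only states "cosine decay and 10% warmup", so the linear shape of the warmup, the decay all the way to zero, and the total step count are assumptions.

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical training length
STAGE_PEAK_LR = {"CPT Stage 1": 5e-5, "CPT Stage 2": 1e-5}  # peak values from the table

for stage, peak in STAGE_PEAK_LR.items():
    samples = [cosine_lr_with_warmup(s, TOTAL_STEPS, peak) for s in (0, 50, 100, 550, 1000)]
    print(stage, [f"{lr:.2e}" for lr in samples])
```

The lower peak learning rate in Stage 2 is consistent with a more targeted stage that makes smaller adjustments to an already capable model.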
Supervised Fine-Tuning (SFT)
Depth upscaling and continual pretraining produced a base model with solid reasoning abilities, but Supervised Fine-Tuning (SFT) helped improve it further. The researchers were careful about compute usage, so they selected curated data with high-signal prompts and used an existing open-weight model, gpt-oss-120b, as the annotator instead of training a separate annotator model.
| Aspect | Details |
|---|---|
| Dataset | Millions of high-quality instruction-response pairs with explicit reasoning traces. |
| Domains | Math, coding, science, tool calling, conversations, instruction-following, security, content moderation, and robustness. |
| Annotator Model | gpt-oss-120b, selected instead of DeepSeek-R1-0528 for compute efficiency. |
| Verification | Execution-verified data for verifiable domains, with samples evolved toward greater complexity. |
| Data Processing | De-duplication, content filtering, heuristic filtering, LLM-as-Judge verification, execution-based checks, rejection sampling, format checks, and benchmark decontamination. |
| Initial Training | 4 epochs at 32768 sequence length. |
| Smaller Run 1 | 25% stratified subset, 4 epochs at 32768 sequence length. |
| Smaller Run 2 | 49152 sequence length using mixed-length samples. |
| Updates | Decoder only, using text data only. |
| Loss Computation | Response tokens only. |
| Final Model | Weight average of two smaller runs to achieve cost-effective performance gains. |
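Computing the loss on response tokens only is commonly implemented by masking the prompt positions in the label sequence. The sketch below shows that idea with made-up token IDs. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; whether the authors used exactly this mechanism is an assumption.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response, masking prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token IDs for an instruction and its reasoning-trace response.
prompt_ids = [11, 12, 13]
response_ids = [21, 22]
input_ids, labels = build_labels(prompt_ids, response_ids)
print("input_ids:", input_ids)
print("labels:   ", labels)
```

The model still sees the full prompt as context, but only the response positions contribute to the gradient.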
Implementation
The Apriel model family is designed to support a wide range of general-purpose instruction tasks, including code assistance and generation, logical reasoning and multi-step problem-solving, question answering, and information retrieval. These models also perform well in function calling, complex instruction execution, and agent-based applications.
However, they are not designed for use in safety-critical environments without human supervision, or in any context where absolute factual accuracy is required.
As mentioned earlier, inference can be performed on a single GPU. Start by setting up a suitable GPU-enabled cloud instance. If you have not done this before, a general tutorial on setting up a GPU-based AI/ML development environment with JupyterLab will cover the steps.
An AI/ML-ready or inference-optimized image is recommended. Many cloud infrastructure providers offer GPU instances based on NVIDIA or AMD hardware.
In the terminal:
pip install transformers==4.48 jinja2==3.1.0 accelerate torch torchvision jupyter
pip install huggingface-hub
huggingface-cli download ServiceNow-AI/Apriel-1.5-15b-Thinker
jupyter lab --allow-root
In this example code snippet, two sample prompts are used. The first is a simple text prompt asking for the capital of France. The second analyzes an image.
# Tested with transformers==4.48
import re
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText
# Load the model in bfloat16; device_map="auto" relies on the accelerate package
model_id = "ServiceNow-AI/Apriel-1.5-15b-Thinker"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Example 1: Text-only prompt
chat = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"},
        ],
    }
]
inputs = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
# Move tensors to the model's device and drop a key that generate() does not accept
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
inputs.pop("token_type_ids", None)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
# Keep only the newly generated tokens, not the prompt
generated_ids = output_ids[:, inputs['input_ids'].shape[1]:]
output = processor.decode(generated_ids[0], skip_special_tokens=True)
# Extract the final answer; fall back to the raw output if the markers are missing
# (for example, when generation stops at max_new_tokens before the final response)
match = re.search(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)
response = match.group(1).strip() if match else output.strip()
print("Text-only Response:", response)
# Example 2: Image understanding
url = "https://picsum.photos/id/237/200/300"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
chat = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Which animal is this?"},
            {"type": "image"},
        ],
    }
]
prompt = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
generated_ids = output_ids[:, inputs['input_ids'].shape[1]:]
output = processor.decode(generated_ids[0], skip_special_tokens=True)
match = re.search(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)
response = match.group(1).strip() if match else output.strip()
print("Image Response:", response)
You can test the model yourself using more demanding prompts.
References
- (Paper) Apriel-1.5-15B-Thinker: Mid-training is all you need
- Hugging Face Model Page
Conclusion
Apriel-1.5-15B-Thinker is a 15B-parameter model that can run on a single GPU. It uses depth upscaling, staged continual pre-training, and text-only SFT to achieve strong reasoning performance despite its relatively small parameter count. Test the model and share your thoughts.


