DeepSeek-OCR for Efficient Document Processing
Large Language Models (LLMs) and Vision-Language Models (VLMs) often struggle with the high computational effort required to process long documents. As documents become longer, token counts increase as well, which leads to greater memory consumption, slower inference, and higher operating costs.
DeepSeek-OCR is a proof-of-concept approach designed to improve efficiency by applying optical context compression. This method represents document pages as visual tokens, which can greatly reduce the number of tokens compared with a purely text-based format. To measure its effectiveness, the approach is evaluated through OCR (Optical Character Recognition). The paper describes OCR as an ideal test environment for this vision-text compression method because it naturally maps visual input to text output while also providing measurable evaluation metrics. DeepSeek-OCR reduces token counts by 7–20x while still achieving strong benchmark performance, making it a practical option for efficient document processing at scale and training data generation.
DeepSeek-OCR is built around two primary components: DeepEncoder, which compresses document images into a compact set of visual tokens, and DeepSeek3B-MoE-A570M, a decoder that reconstructs the original text from those tokens. The model aims to balance efficiency with accuracy and delivers competitive results on benchmarks such as OmniDocBench and Fox while using fewer tokens than many existing approaches.
Benchmark Performance
Other OCR models covered previously include Dolphin, olmOCR, RolmOCR, and SmolDocling.
Key Takeaways
Optical Context Compression for Lower Computational Cost
DeepSeek-OCR introduces optical context compression, a technique that encodes document pages as visual tokens. Because a page needs 7–20x fewer visual tokens than conventional text tokens, the decoder processes shorter sequences and the overall computational cost drops.
Architecture
The model pairs DeepEncoder, which tokenizes and compresses page images using SAM and CLIP, with DeepSeek3B-MoE-A570M, an efficient Mixture-of-Experts (MoE) decoder that reconstructs the text.
Efficiency and Accuracy
DeepSeek-OCR provides a strong balance between performance and resource usage. It reaches approximately 97% OCR precision at compression ratios below 10x, meaning the number of text tokens remains within 10 times the number of vision tokens. It also outperforms existing models on benchmarks such as OmniDocBench while requiring significantly fewer tokens.
Training Data
The model was trained on more than 30 million PDF pages across over 100 languages, as well as specialized OCR 2.0 data containing charts, formulas, and figures. This gives it strong capabilities across many document types and complex visual elements.
Use Cases
DeepSeek-OCR is well suited for large-scale document digitization, training data generation for LLMs and VLMs, multilingual document processing, and structured data extraction from technical documents.
DeepSeek-OCR Architecture
DeepEncoder: Visual Tokenization
DeepEncoder is a vision encoder designed to keep activation memory low, even when processing high-resolution inputs.
Local attention via SAM (Segment Anything Model): With 80M parameters, SAM captures fine visual details and layout information.
Global attention via CLIP (Contrastive Language–Image Pre-training): With 300M parameters, CLIP extracts semantic features from compressed visual tokens.
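To see why this split keeps activation memory low, the token flow through the encoder can be sketched numerically. The sketch below assumes 16×16 image patches and a 16x token compressor sitting between the SAM and CLIP stages; both figures come from the DeepSeek-OCR paper rather than from this article, and the function name is hypothetical.

```python
def deepencoder_token_flow(side: int, patch: int = 16, compress: int = 16) -> dict:
    """Return token counts at each encoder stage for a square input image.

    Assumed values (from the paper, not this article): 16x16 patches and a
    16x token compressor between the SAM and CLIP stages.
    """
    if side % patch != 0:
        raise ValueError("side must be a multiple of the patch size")
    patch_tokens = (side // patch) ** 2        # processed by SAM (local attention)
    vision_tokens = patch_tokens // compress   # processed by CLIP (global attention)
    return {"patch_tokens": patch_tokens, "vision_tokens": vision_tokens}


flow = deepencoder_token_flow(1024)
print(flow)  # {'patch_tokens': 4096, 'vision_tokens': 256}
```

Under these assumptions, the expensive global attention only ever sees the compressed token set, while the many patch tokens are handled by cheaper local attention.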
Decoder: DeepSeek3B-MoE-A570M
The decoder uses DeepSeek’s Mixture-of-Experts (MoE) architecture. During inference, it activates only a subset of its 3B total parameters, around 570M. The advantage of MoE is that it delivers efficient processing while offering performance comparable to larger models. The decoder reconstructs the original text from the compressed visual tokens while preserving layout and content where possible.
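The idea of activating only a subset of parameters can be illustrated with a toy top-k router. This is a generic sketch of Mixture-of-Experts gating, not DeepSeek's actual router: the expert count, the scalar "experts", and the function names are all hypothetical.

```python
import math
import random


def top_k_route(gate_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exp = {i: math.exp(gate_logits[i] - peak) for i in top}  # numerically stable softmax
    total = sum(exp.values())
    return {i: w / total for i, w in exp.items()}


def moe_forward(x: float, experts, gate_logits: list[float], k: int = 2) -> float:
    """Weighted sum over the selected experts only; the rest never run."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy setup: 8 scalar "experts", of which only 2 run per token.
random.seed(0)
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [random.gauss(0, 1) for _ in experts]
print(moe_forward(1.0, experts, gate_logits, k=2))

# Active share implied by the figures above: about 570M of 3B parameters.
print(f"active fraction: about {570e6 / 3e9:.0%}")  # active fraction: about 19%
```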
Training Data
DeepSeek-OCR was trained on a broad and diverse dataset to provide reliable performance across many document formats and languages. Its training data includes more than 30 million PDF pages in over 100 languages, with a strong focus on Chinese and English. The model was also trained on OCR 2.0 data, including 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. This expands its capabilities beyond standard text extraction and allows it to handle specialized content such as scientific diagrams and financial charts. This extensive training approach enables the model to process many kinds of documents and languages while maintaining strong results on complex visual elements.
Performance and Benchmarks
Compression vs. Accuracy
DeepSeek-OCR’s accuracy depends on the selected compression ratio. At compression levels below 10x, the model achieves around 97% OCR precision and can reconstruct the original text with only minimal loss. At 20x compression, accuracy falls to roughly 60%, which may still be acceptable for archive-related or secondary use cases.
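As a rough illustration, the compression ratio is the number of text tokens a page would need divided by the number of vision tokens used to encode it. The helper below maps a ratio onto the coarse bands quoted in this article; it does not interpolate between them, and the function names and the example page are hypothetical.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens the page would need divided by vision tokens actually used."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def expected_precision_band(ratio: float) -> str:
    """Map a ratio to the coarse bands reported in this article (no interpolation)."""
    if ratio < 10:
        return "about 97% (minimal loss)"
    if ratio <= 20:
        return "degrading, roughly 60% at 20x"
    return "beyond the reported range"


# Hypothetical page: 900 text tokens encoded with 100 vision tokens.
ratio = compression_ratio(900, 100)
print(ratio, expected_precision_band(ratio))  # 9.0 about 97% (minimal loss)
```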
Comparative Results
On the OmniDocBench benchmark, DeepSeek-OCR performs better than competing models while using fewer tokens. With 100 tokens per page, it exceeds GOT-OCR2.0, which usually uses 256 tokens per page. With fewer than 800 tokens per page, it also outperforms MinerU2.0, which often requires more than 6,000 tokens per page.
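The size of these savings is easy to work out from the quoted figures. The short calculation below treats the stated bounds (fewer than 800, more than 6,000) as point values, so the MinerU2.0 comparison is a lower bound; the corpus size is a hypothetical number chosen for illustration.

```python
def reduction(baseline_tokens: int, candidate_tokens: int) -> float:
    """How many times fewer tokens the candidate uses than the baseline."""
    return baseline_tokens / candidate_tokens


# Tokens per page as quoted above.
print(f"vs GOT-OCR2.0: {reduction(256, 100):.2f}x fewer tokens")          # 2.56x
print(f"vs MinerU2.0: at least {reduction(6000, 800):.1f}x fewer tokens")  # 7.5x

pages = 1_000_000  # hypothetical corpus size
saved = (6000 - 800) * pages
print(f"tokens saved on {pages:,} pages: at least {saved:,}")
```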
Practical Applications
DeepSeek-OCR can be used in several practical scenarios. For large-scale document digitization, libraries, legal organizations, and research institutions can process high document volumes more efficiently. AI labs can use the model to generate training data and create text-image pairs for LLM pretraining, helping to address data scarcity. Because the model supports more than 100 languages, it is useful for multilingual document processing in global environments. Its ability to parse charts, tables, and formulas also makes it valuable for structured data extraction from technical and financial documents.
Implementation
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
# trust_remote_code=True is needed because the model class ships with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",  # requires the flash-attn package
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

# Run OCR through the infer() helper defined in the checkpoint's remote code
# (usage as shown on the model card; verify against the current version).
prompt = "<image>\nFree OCR."
model.infer(
    tokenizer,
    prompt=prompt,
    image_file="document.png",
    output_path="ocr_output",  # with save_results=True, results are written here
    base_size=1024,            # Base mode: 1024×1024, 256 vision tokens
    image_size=1024,
    crop_mode=False,
    save_results=True,
)
Selecting a Resolution Mode
| Mode | Resolution | Vision Tokens | Typical Use Case |
|---|---|---|---|
| Tiny | 512×512 | 64 | Quick previews, low-resolution documents |
| Small | 640×640 | 100 | Standard documents |
| Base | 1024×1024 | 256 | High-resolution pages |
| Large | 1280×1280 | 400 | Complex layouts |
| Gundam | Dynamic | 795+ | Multi-column and dense documents |
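One way to choose among these modes is to pick the cheapest one that keeps the compression ratio under a target. The sketch below uses the fixed modes from the table and a hypothetical `pick_mode` helper. The `base_size`, `image_size`, and `crop_mode` values in `infer_kwargs` follow the settings listed on the model's Hugging Face card at the time of writing; treat them, and the `infer()` interface they target, as assumptions to check against the current remote code.

```python
from typing import Optional

# Fixed modes from the table above: (name, side length in pixels, vision tokens).
FIXED_MODES = [
    ("Tiny", 512, 64),
    ("Small", 640, 100),
    ("Base", 1024, 256),
    ("Large", 1280, 400),
]


def pick_mode(estimated_text_tokens: int, max_ratio: float = 10.0) -> Optional[str]:
    """Return the cheapest fixed mode whose compression ratio stays under max_ratio.

    Returns None when even Large is too small; Gundam (dynamic) is then the
    remaining option.
    """
    for name, _side, vision_tokens in FIXED_MODES:
        if estimated_text_tokens / vision_tokens < max_ratio:
            return name
    return None


def infer_kwargs(mode: str) -> dict:
    """Keyword arguments for the assumed infer() interface (see lead-in)."""
    if mode == "Gundam":
        return {"base_size": 1024, "image_size": 640, "crop_mode": True}
    side = {name: side for name, side, _ in FIXED_MODES}[mode]
    return {"base_size": side, "image_size": side, "crop_mode": False}


mode = pick_mode(1500) or "Gundam"
print(mode, infer_kwargs(mode))
# Base {'base_size': 1024, 'image_size': 1024, 'crop_mode': False}
```

The default threshold of 10x mirrors the ratio below which the article reports around 97% precision; a stricter or looser target can be passed as `max_ratio`.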
Limitations and Considerations
Several factors deserve attention when using DeepSeek-OCR. Compression ratios above 10x can reduce accuracy, especially on dense or low-resolution documents. Gundam mode improves support for multi-column layouts, but very complex documents such as newspapers may still require manual review because of their layout structure. The model runs best on NVIDIA GPUs with CUDA support.
Frequently Asked Questions
What is DeepSeek-OCR?
DeepSeek-OCR is an open-source Vision-Language Model (VLM) developed by DeepSeek-AI for efficient document understanding and OCR tasks. It converts document images into structured text while using optical context compression to significantly reduce computational overhead and improve processing efficiency.
How does DeepSeek-OCR achieve high efficiency?
The model uses optical context compression through its DeepEncoder component. Instead of converting a full page into a long sequence of text tokens, it compresses the visual information into a compact set of visual tokens, 7–20x fewer than the equivalent text tokens, which the DeepSeek3B-MoE-A570M decoder then decodes. This token reduction enables faster inference and lower memory usage.
What is the architecture of DeepSeek-OCR?
The model uses a two-part architecture:
- DeepEncoder: Compresses document images into visual tokens using SAM for local visual detail and CLIP for global semantic context.
- DeepSeek3B-MoE-A570M: An efficient Mixture-of-Experts (MoE) decoder that reconstructs text from visual tokens. It has 3 billion total parameters but activates only about 570 million during inference.
What is the trade-off between compression and accuracy?
DeepSeek-OCR keeps accuracy high, around 97% OCR precision, at compression ratios below 10x. As compression approaches 20x, accuracy drops to around 60%. Users need to choose a resolution mode that matches their precision and efficiency goals.
What kind of data was DeepSeek-OCR trained on?
The model was trained on a large dataset of more than 30 million PDF pages in over 100 languages. It was also trained on OCR 2.0 data, which includes millions of synthetic charts, chemical formulas, and geometric figures. This helps it handle complex and specialized visual elements beyond simple text.
Can DeepSeek-OCR handle multilingual documents?
Yes. Since the training data covers more than 100 languages, including a strong focus on Chinese and English, DeepSeek-OCR is suitable for multilingual document processing and global use cases.
What are the primary use cases for DeepSeek-OCR?
Key applications include:
- Large-scale document digitization: Efficiently processing large volumes of documents, such as archives or legal records.
- AI training data generation: Creating high-quality text-image pairs for pretraining other LLMs and VLMs.
- Structured data extraction: Parsing complex elements such as charts, tables, and scientific formulas from technical documents.
- Multilingual processing: Handling documents in more than 100 languages.
Conclusion
Built from DeepEncoder and DeepSeek3B-MoE-A570M, DeepSeek-OCR combines optical context compression, multi-resolution support, and open-source availability. This makes it useful for applications ranging from archival digitization to training data generation for LLMs and VLMs.
For users who want to explore its capabilities, the model is available on GitHub and Hugging Face and can be run on GPU-based cloud infrastructure. Its architecture and performance indicate broader potential for AI efficiency and long-context processing.


