Document Processing with SmolDocling: Efficient Multimodal OCR and Document Understanding
The world is becoming more digital every day. From the early era of personal computing to the widespread use of email, social platforms, e-commerce, remote work, Large Language Models, and Agentic AI, digital transformation has created an enormous demand for digitizing content and extracting useful information. In this context, document processing serves as the link between physical documents and digital systems, turning printed text into usable digital data. It is the technology that enables a smartphone to translate foreign menus, supports historians in preserving old manuscripts, and reduces the need for time-consuming manual data entry.
At a fundamental level, document processing is the process of converting text found in images into machine-readable formats, enabling applications such as document conversion and document understanding. Document conversion focuses on transforming scanned text into formats such as Markdown, Word, or PDF. Document understanding goes a step further by identifying structure, meaning, and useful insights within the content rather than only recognizing text.
SmolDocling
SmolDocling is a lightweight multimodal vision language model created for efficient document processing. With only 256 million parameters, it supports full-page conversion while retaining layout, structure, and spatial relationships within the document. Because of its compact size, it is a cost-efficient solution for document processing, requiring less compute and memory. This makes it especially suitable for rapid prototyping and deployment on edge hardware.
Prerequisites
For the best performance, SmolDocling requires NVIDIA GPUs with CUDA support. Cloud GPU instances equipped with e.g. H100 GPUs provide the compute resources necessary for efficient processing at production scale.
Features and Capabilities
SmolDocling processes documents with DocTags, a markup format designed to preserve both context and layout details. It keeps formatting intact through bounding box detection and includes specialized recognition for code, formulas, charts, tables, and figures. The model also preserves document structure by correctly grouping lists and linking captions to related content.
The DocTags format specifies the type, position, and content of elements such as text, tables, images, and code. It uses nested tags to maintain relationships between elements, such as captions inside images, entries inside lists, and specialized Optimized Table-Structure Language (OTSL) notation for representing tables. This method preserves both the visual arrangement and the semantic structure of complex documents, making SmolDocling well suited for end-to-end document conversion tasks.
DocTags
| Tag Type | Description |
|---|---|
| XML-like Syntax | Uses XML-style notation with opening and closing tags for text blocks and standalone tags for instructions, for example <text>hello world</text> and <page_break>. |
| Document Structure | Complete DocTags fragments enclosed in <doctag>...</doctag> can represent one or multiple pages separated by <page_break> tags. |
| Block Type Tags | <text>, <caption>, <footnote>, <formula>, <title>, <page_footer>, <page_header>, <picture>, <section_header>, <document_index>, <code>, <otsl>, <list_item>, <ordered_list>, <unordered_list> |
| Location Encoding | Elements may include nested location tags that define bounding box coordinates: <loc_x1><loc_y1><loc_x2><loc_y2> using a 0–500 grid system. |
| Table Structure | Uses OTSL vocabulary for tables with extensions such as <fcel> (full cell), <ecel> (empty cell), <ched> (column headers), <rhed> (row headers), and <srow> (table sections). |
| List Handling | <list_item> elements inside <ordered_list> or <unordered_list> determine the list type. |
| Captions | <picture> and <otsl> elements can contain a <caption> tag to provide descriptive information. |
| Code Handling | <code> elements retain formatting and include a <_programming-language_> classification tag with support for 57 languages. |
| Image Classification | <picture> elements include <image_class> tags for more than 20 image categories, including charts, diagrams, code, and more. |
| Uniform Representation | Cropped page elements use the same DocTags representation as their full-page equivalents. |
Additional SmolDocling features are summarized below:
| Feature | Description |
|---|---|
| OCR + Layout Preservation | Extracts text while preserving spatial organization. |
| Specialized Recognition | Supports code blocks, formulas, tables, and charts. |
| Full-Page Conversion | Processes every element on the page at the same time. |
| Fast Inference | Runs in 0.35 seconds per page on A100 GPUs. |
| DocTags Markup | Represents document content and layout in a structured format. |
SmolDocling works together with Docling to support flexible import and export workflows. Planned improvements include one-shot multi-page inference, better chart recognition, and chemical structure detection.
Model Architecture
SmolDocling is built on SmolVLM, a model from HuggingFace. The conversion process from document page images to DocTags sequences works as follows. First, the input images pass through a vision encoder and are then reshaped through projection and pooling methods. After that, the processed image embeddings are combined with the text embeddings taken from the user prompt in an interleaved sequence. Finally, this combined representation is passed into a large language model, which autoregressively generates the DocTags sequence.
Data
Multiple dataset collections were used to strengthen the model across different capabilities. Datasets used for training with a focus on document understanding and image captioning include The Cauldron, Docmatix, and MathWriting.
Competitive Performance
SmolDocling performs competitively against models that are up to 27 times larger while lowering compute requirements. It works well across business documents, research papers, technical reports, patents, and forms. In contrast to many OCR models that primarily target scientific papers, SmolDocling is intended for a broad variety of document types.
Conclusion
In this tutorial, we explored SmolDocling, a compact yet capable vision language model built specifically for document conversion tasks. By using its unified DocTags output format, SmolDocling can efficiently process many document types, from plain text to complex forms and even code listings, while requiring far fewer computing resources than larger alternatives. Its strong balance of efficiency and accuracy makes it a valuable option for developers and organizations that want to implement document understanding capabilities.


