OCR a full page in 0.35 seconds: a multimodal document conversion model that scores 10% higher than Qwen2.5 VL!

Written by
Clara Bennett
Updated on: July 10, 2025
Recommendation

A new breakthrough in efficient document conversion: the SmolDocling model processes pages extremely fast, and its performance exceeds that of Qwen2.5 VL.

Core content:
1. SmolDocling, launched jointly by the Docling team and IBM Research, offers efficient document conversion
2. A comprehensive feature set, including OCR, layout recognition, and code and formula recognition
3. Recommended reading for exploring the development of AI Agents and multimodal systems in more depth

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
SmolDocling is a multimodal image-to-text model designed for efficient document conversion, and it currently ranks 2nd on the Hugging Face trending list.
SmolDocling was jointly launched by the Docling team and IBM Research. It takes only 0.35 seconds per page on average on an A100 GPU, and at 256M parameters it is far more efficient than Qwen2.5 VL (7B)!
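To put these figures in perspective, a quick back-of-the-envelope calculation may help. The sketch below assumes the 0.35-second average holds steadily on a single A100 and ignores batching, I/O, and model loading; the variable names are illustrative only.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
# Assumption: the 0.35 s/page average holds steadily on one A100,
# ignoring batching, I/O, and model-loading time.
SECONDS_PER_PAGE = 0.35
SMOLDOCLING_PARAMS = 256e6   # 256M parameters
QWEN25_VL_PARAMS = 7e9       # 7B parameters

# Sustained throughput for one GPU over an hour.
pages_per_hour = 3600 / SECONDS_PER_PAGE

# How many times larger the 7B model is in parameter count.
size_ratio = QWEN25_VL_PARAMS / SMOLDOCLING_PARAMS

print(f"Pages per hour on one GPU: {pages_per_hour:.0f}")
print(f"Qwen2.5 VL (7B) is about {size_ratio:.1f}x larger")
```

Under these assumptions, one GPU would handle roughly 10,000 pages per hour with a model about 27 times smaller than the 7B baseline.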
SmolDocling features:
DocTags for efficient tokenization - Introduces DocTags, an efficient and concise document representation that is fully compatible with DoclingDocuments.
OCR (Optical Character Recognition) - Accurately extracts text from images.
Layout and localization - Preserves the document structure and the bounding boxes of document elements.
Code recognition - Detects and formats code blocks, including indentation.
Formula recognition - Recognizes and processes mathematical expressions.
Chart recognition - Extracts and interprets chart data.
Table recognition - Supports structured table extraction, including column and row headers.
Figure classification - Distinguishes between figures and graphical elements.
Caption correspondence - Links captions to the relevant images and figures.
List grouping - Correctly organizes and structures list elements.
Full-page conversion - Processes the entire page for comprehensive document conversion, covering all page elements (code, formulas, tables, charts, etc.).
OCR with bounding boxes - Performs OCR on a region specified by a bounding box.
General document processing - Trained on both scientific and non-scientific documents.
Seamless Docling integration - Imports into Docling and exports to multiple formats.
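To make the DocTags and bounding-box features more concrete, here is a minimal sketch of how a tagged page with location tokens could be turned into structured elements. It assumes a simplified DocTags-like syntax (an element tag followed by four `<loc_N>` tokens on a 0-500 grid); the tag names, the grid size, and the `parse_doctags_like` helper are illustrative assumptions, not the official format or parser. In practice, the docling-core library is the intended way to load SmolDocling output.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: location tokens are normalized to a 0-500 grid.
GRID = 500

@dataclass
class Element:
    kind: str                                # tag name, e.g. "text"
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    content: str

# Matches <tag><loc_a><loc_b><loc_c><loc_d>content</tag> in the simplified syntax.
ELEMENT_RE = re.compile(
    r"<(?P<tag>[a-z_]+)>"
    r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
    r"(?P<content>.*?)</(?P=tag)>",
    re.DOTALL,
)

def parse_doctags_like(markup: str, page_w: int, page_h: int) -> List[Element]:
    """Hypothetical helper: extract elements and scale boxes to pixel space."""
    elements = []
    for m in ELEMENT_RE.finditer(markup):
        # Groups 2-5 hold the four location tokens (group 1 is the tag).
        x1, y1, x2, y2 = (int(m.group(i)) for i in range(2, 6))
        bbox = (
            x1 * page_w / GRID,
            y1 * page_h / GRID,
            x2 * page_w / GRID,
            y2 * page_h / GRID,
        )
        elements.append(Element(m.group("tag"), bbox, m.group("content").strip()))
    return elements

# Made-up sample page in the simplified syntax.
sample = (
    "<doctag>"
    "<title><loc_50><loc_40><loc_450><loc_60>Introduction</title>"
    "<text><loc_50><loc_70><loc_450><loc_120>SmolDocling converts pages.</text>"
    "</doctag>"
)

elements = parse_doctags_like(sample, page_w=1000, page_h=1400)
for el in elements:
    print(el.kind, el.bbox, el.content)
```

Because the coordinates are stored on a normalized grid rather than in pixels, the same output can be mapped onto a page image of any resolution, which is one plausible reason such a representation stays compact.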
Model: https://hf-mirror.com/ds4sd/SmolDocling-256M-preview
Paper: https://arxiv.org/pdf/2503.11576 (SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion)