A multimodal document conversion model that OCRs a full page in 0.35 seconds and scores 10% higher than Qwen2.5 VL!

Written by

Clara Bennett

Updated on: July 10, 2025

SmolDocling is a multimodal image-to-text model designed for efficient document conversion, and it currently ranks 2nd on the Hugging Face trending list.

SmolDocling was jointly released by the Docling team and IBM Research. It averages just 0.35 seconds per page on an A100 GPU, and with only 256M parameters it is more efficient than Qwen2.5 VL (7B)!
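To put those figures in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes pages are processed one after another on a single A100 at the quoted average latency, and the function names are made up for illustration.

```python
# Figures quoted in the article.
SECONDS_PER_PAGE = 0.35        # average latency per page on an A100
SMOLDOCLING_PARAMS = 256e6     # 256M parameters
QWEN25_VL_PARAMS = 7e9         # 7B parameters


def pages_per_hour(seconds_per_page: float) -> float:
    """Sequential throughput implied by a per-page latency (no batching assumed)."""
    return 3600.0 / seconds_per_page


def size_ratio(large_params: float, small_params: float) -> float:
    """How many times more parameters the larger model has."""
    return large_params / small_params


if __name__ == "__main__":
    print(f"{pages_per_hour(SECONDS_PER_PAGE):.0f} pages per hour")
    print(f"{size_ratio(QWEN25_VL_PARAMS, SMOLDOCLING_PARAMS):.1f}x fewer parameters")
```

Under these assumptions, a single GPU works through roughly ten thousand pages per hour with a model about 27 times smaller than the 7B baseline.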

SmolDocling features:

DocTags efficient tagging - Introduces DocTags, an efficient and concise document representation that is fully compatible with DoclingDocuments.

Optical character recognition (OCR) - Accurately extracts text from images.

Layout and positioning - Preserves the document structure and the bounding boxes of document elements.

Code recognition - Detects and formats code blocks, including indentation.

Formula recognition - Recognizes and processes mathematical expressions.

Chart recognition - Extracts and interprets chart data.

Table recognition - Supports structured table extraction, including column and row headers.

Figure classification - Distinguishes between figures and graphical elements.

Caption correspondence - Links captions to the relevant images and figures.

List grouping - Properly organizes and structures list elements.

Full-page conversion - Processes the entire page for comprehensive document conversion, covering all page elements (code, formulas, tables, charts, etc.).

OCR with bounding boxes - Performs OCR on a region specified by a bounding box.

General document processing - Trained on both scientific and non-scientific documents.

Seamless Docling integration - Import into Docling and export in multiple formats.
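To make the DocTags and layout features more concrete, the sketch below parses a small hand-written DocTags-style string into elements with pixel bounding boxes. It is an illustration rather than the official parser: the sample string is invented, the `<loc_x1><loc_y1><loc_x2><loc_y2>` token layout and the 0-500 coordinate grid are assumptions about the format, and `parse_doctags` and `Element` are hypothetical names. It also handles only flat elements that carry four location tokens. For real conversions, rely on the Docling integration.

```python
import re
from dataclasses import dataclass

GRID = 500  # assumed size of the DocTags location grid (coordinates 0..500)


@dataclass
class Element:
    tag: str                                  # element type, e.g. "text"
    bbox: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
    text: str


# Matches <tag><loc_x1><loc_y1><loc_x2><loc_y2>content</tag> (assumed layout).
_ELEMENT = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<text>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_doctags(doctags: str, width: int, height: int) -> list[Element]:
    """Extract flat elements and rescale their grid coordinates to pixels."""
    elements = []
    for m in _ELEMENT.finditer(doctags):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        bbox = (x1 * width / GRID, y1 * height / GRID,
                x2 * width / GRID, y2 * height / GRID)
        elements.append(Element(m.group("tag"), bbox, m.group("text").strip()))
    return elements


# Invented sample output for a 1000 x 2000 pixel page.
SAMPLE = (
    "<doctag>"
    "<section_header_level_1><loc_50><loc_40><loc_450><loc_60>Introduction"
    "</section_header_level_1>"
    "<text><loc_50><loc_80><loc_450><loc_200>SmolDocling converts pages.</text>"
    "</doctag>"
)

if __name__ == "__main__":
    for element in parse_doctags(SAMPLE, width=1000, height=2000):
        print(element.tag, element.bbox, element.text)
```

Keeping coordinates on a fixed grid means the same tags describe a page at any resolution; the caller only needs the image size to recover pixel positions.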

Model: https://hf-mirror.com/ds4sd/SmolDocling-256M-preview

Paper: https://arxiv.org/pdf/2503.11576 (SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion)