SmolDocling: a RAG powerhouse that runs on consumer graphics cards — the smallest OCR champion is open source!

Written by
Clara Bennett
Updated on: July 10, 2025

The latest release from the IBM Research team: an OCR tool that runs comfortably on consumer-grade graphics cards.

Core content:
1. A detailed look at SmolDocling's model parameters and architecture
2. Hardware-friendly low VRAM usage and fast page processing
3. Multimodal processing and open-source advantages, plus the limitations of small models



Recently, the IBM Research team released SmolDocling, a 256M-parameter visual language model focused on full-document OCR and multimodal processing. It is claimed to process a page in 0.35 seconds and to run on consumer-grade graphics cards. That sounds great, but what are the concrete parameters and capabilities? Let's break it down and see how solid it really is.

Parameters and architecture: small but carefully designed

The core of SmolDocling is a 256M-parameter visual language model (VLM). Small as it is, the design cuts no corners. According to the official disclosure, it evolved from SmolVLM, combines the document-transcription capabilities of the Docling ecosystem, and outputs a new format, DocTags, which fully preserves the context and location information of page elements. The key parameters:
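As a rough illustration (the exact tag vocabulary is defined by the Docling project and may differ from this sketch), a DocTags transcription wraps each page element in a typed tag, with quantized location tokens encoding its position on the page:

```
<doctag>
  <section_header><loc_42><loc_18><loc_310><loc_35>1. Introduction</section_header>
  <text><loc_42><loc_40><loc_480><loc_120>SmolDocling is a compact VLM for ...</text>
  <formula><loc_100><loc_130><loc_400><loc_160>E = mc^2</formula>
</doctag>
```

Because element type, content, and coordinates live in one serialized stream, a single decoder pass can recover both the text and the layout.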

  •  Parameter scale: 256M, miniature compared with large models that routinely run to billions of parameters. This means extremely low VRAM requirements — it can run in under 500MB of VRAM, even on old cards like the GTX 1060.
  •  Visual encoder: a lightweight SigLIP (the 93M-parameter version, patch-16/512) that processes images at a higher resolution than typical VLMs. The team says the design was inspired by research from Apple and Google; the higher resolution improves detail capture, so fine elements such as formulas and charts are recognized more accurately.
  •  Language backbone: a SmolLM2-family model. The release does not spell this out, but SmolVLM is built on SmolLM2, and given the 256M total budget it would have to be the smallest (~135M) variant rather than the 1.7B one. The context window is 2048 tokens, enough for most document pages.
  •  Multimodal fusion: image and text information are combined through a cross-attention mechanism to produce structured text output. Training uses a single end-to-end objective, which simplifies the pipeline.
  •  Training data: 5.5M formulas (including 4.7M LaTeX formulas extracted from arXiv), 9.3M code snippets across 56 languages, 2.5M charts (bar charts, pie charts, etc.), plus a large number of public datasets. The data was strictly cleaned and re-rendered to ensure quality.

Advantages: efficiency and capability

Hardware-friendly

At 256M parameters in total (the 93M visual encoder included), its VRAM usage is absurdly low. An ordinary laptop can run it — the fan barely spins up, so it is both energy-saving and quiet. Compared with 2B-parameter models such as Qwen2-VL, SmolDocling is the lightweight king of lightweights.
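The VRAM claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below counts weight memory only — activations, KV cache, and framework overhead are excluded, so real usage will be somewhat higher:

```python
# Rough weight-memory estimate for a 256M-parameter model.
# Weights only: activations, KV cache, and framework overhead not included.

PARAMS = 256_000_000  # total parameters, visual encoder included

def weights_mb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in megabytes."""
    return num_params * bytes_per_param / (1024 ** 2)

fp32 = weights_mb(PARAMS, 4)  # 4 bytes/param -> ~977 MB
bf16 = weights_mb(PARAMS, 2)  # 2 bytes/param -> ~488 MB, matching the "<500MB" claim
int8 = weights_mb(PARAMS, 1)  # 1 byte/param  -> ~244 MB

print(f"fp32: {fp32:.0f} MB, bf16: {bf16:.0f} MB, int8: {int8:.0f} MB")
```

So the "<500MB of VRAM" figure lines up with loading the weights in bf16/fp16; full fp32 would roughly double that.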

Fast

The official claim is 0.35 seconds per page. Real-world timing varies with document complexity and hardware, but getting results for a 10-page PDF in a few seconds is no problem. Complex documents such as scientific papers and contracts parse quickly, and even footnotes, formulas, and tables are captured.

Hardcore multimodal capability

It supports full parsing of text, layout, code, formulas, charts, and tables, and can also classify figures and match them to captions. Feed it a paper, for example, and the LaTeX formulas, table structures, and chart text can all be extracted, with accuracy that does not lag behind large models.

Fully open source

The model, datasets, and tooling are all open source and compatible with Hugging Face's transformers library and vLLM. Developers can get started quickly, and fine-tuning and customization are straightforward.
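A minimal loading sketch with transformers, assuming the model id from the public release (`ds4sd/SmolDocling-256M-preview`) and the "Convert this page to docling." prompt from the model card — verify both before relying on them. The heavy parts live inside `main()` so nothing downloads at import time:

```python
# Hedged sketch of running SmolDocling via Hugging Face transformers.
# MODEL_ID and PROMPT follow the public model card; check it for the
# current id and prompt format before use.

MODEL_ID = "ds4sd/SmolDocling-256M-preview"
PROMPT = "Convert this page to docling."

def main(image_path: str) -> str:
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )

    # Build a chat-style prompt with one image slot plus the instruction.
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": PROMPT}],
    }]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=text, images=[image], return_tensors="pt")

    generated = model.generate(**inputs, max_new_tokens=4096)
    # Drop the prompt tokens; keep only the generated DocTags markup.
    new_tokens = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=False)[0]

if __name__ == "__main__":
    print(main("page.png"))
```

The returned string is DocTags markup, which the Docling tooling can then convert into Markdown, JSON, or other formats.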

Disadvantages: Limitations of small models

Weak spots in complex scenes

High-resolution scans and handwritten manuscripts are prone to failure: some testers report getting back nothing but garbled characters, and stability falls short of commercial OCR.

Limited domain depth

With so few parameters, its knowledge is limited. Its grasp of specialized content such as chemical formulas and legal terminology is shallow, the output can feel less polished, and Chinese support is weak — a real drawback for users in China.

An immature ecosystem

The Docling ecosystem is just getting started, with few docs and tutorials. Parameter tuning can feel like guesswork, and novices may stumble.

Summary: real potential, but no miracle

SmolDocling is a little beast that balances efficiency and capability: with only 256M parameters it does work usually reserved for large models, it is fast, undemanding on hardware, and solidly multimodal. It suits users on tight budgets who want to save time. But it is no panacea — complex scenes and specialized domains still need work. If you want to try it, head over to Hugging Face; the cost-performance ratio is hard to beat.