"Document Processing Terminator" ByteDance Dolphin open source: From contracts to test papers, it can handle everything, multi-language OCR + intelligent typesetting restoration, B-side enterprises just need

ByteDance's Dolphin is an open-source model that efficiently parses multi-language documents.
Core content:
1. Dolphin project overview: lightweight, efficient document parsing model
2. Two-stage parsing method: page-level layout analysis and element-level content analysis
3. Heterogeneous anchor prompts: dedicated prompts designed for different document elements
Document processing is an indispensable task in many fields. Academic research, business operations, education, and technology development all depend on extracting and parsing information from documents efficiently and accurately. Traditional document parsing methods struggle with complex layouts, diverse element types (text, tables, formulas, and so on), and demanding efficiency and accuracy requirements. The rapid development of artificial intelligence, and of large models in particular, has opened new opportunities for document parsing. Dolphin, the document parsing model open sourced by ByteDance, emerged in this context and pairs a distinctive technical architecture with strong benchmark performance.
1. Project Overview
Dolphin is a lightweight, efficient document parsing model open sourced by ByteDance. It follows a two-stage approach, parsing structure first and content second, and handles many kinds of document images, including academic papers, business reports, and technical documents. It performs well across a variety of document parsing tasks, with results reported to exceed those of models such as GPT-4.1 and Mistral-OCR. With 322M parameters, Dolphin is small and fast. It parses document elements such as text, tables, and formulas, and outputs results in JSON, Markdown, HTML, and other formats, which makes it easy to integrate with other systems. The open source code and pre-trained models lower the barrier for developers building on it.
2. Technical Principle
1. Two-stage parsing method
Dolphin uses a two-stage parsing method that targets the efficiency and accuracy bottlenecks of traditional approaches.
1. Page-level layout analysis: In the first stage, Dolphin uses a Swin Transformer to encode the input document image and extract visual features. The decoder then generates a sequence of document elements, each with a category (such as title, table, or chart) and a coordinate position. The goal of this stage is to produce structured layout information in natural reading order, which the element-level stage builds on.
2. Element-level content parsing: In the second stage, Dolphin crops each element from the original image using the layout produced in the first stage, then parses all elements in parallel with type-specific prompts. For example, tables are parsed into HTML with a dedicated prompt, while formulas and text paragraphs share a prompt, with formulas emitted as LaTeX. Parallel parsing improves throughput, and the task-specific prompts let the model handle each element type appropriately.
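The two stages can be sketched in a few lines of Python. This is only an illustration of the control flow, not Dolphin's implementation: `Element`, `crop`, `parse_page`, and the two stand-in callables are hypothetical names, the image is a plain list of pixel rows, and a thread pool stands in for whatever batching the real model uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Element:
    kind: str    # e.g. "title", "para", "table", "formula"
    bbox: tuple  # (left, top, right, bottom) in pixel coordinates


def crop(image, bbox):
    """Cut a rectangular region out of an image stored as a list of pixel rows."""
    left, top, right, bottom = bbox
    return [row[left:right] for row in image[top:bottom]]


def parse_page(image, analyze_layout, parse_element, max_workers=4):
    """Stage 1: get elements in reading order. Stage 2: parse each crop in parallel."""
    elements = analyze_layout(image)
    crops = [crop(image, element.bbox) for element in elements]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so reading order is preserved.
        contents = list(pool.map(parse_element, elements, crops))
    return list(zip(elements, contents))


# Stand-in callables (assumptions) so the sketch runs without a model.
def fake_layout(image):
    return [Element("title", (0, 0, 4, 1)), Element("para", (0, 1, 4, 3))]


def fake_parse(element, region):
    return f"{element.kind}:{len(region)}x{len(region[0])}"


page = [[0] * 4 for _ in range(3)]  # a tiny image, 4 pixels wide and 3 rows tall
result = parse_page(page, fake_layout, fake_parse)
for element, content in result:
    print(element.kind, "->", content)
```

The point of the structure is that stage 2 only needs each element's crop and type, so the per-element calls are independent and can run side by side.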
2. Heterogeneous anchor prompts
Dolphin's other core technical feature is heterogeneous anchor prompting. In the element-level parsing stage, Dolphin uses a dedicated prompt for each type of document element. These prompts guide the model to recognize and parse the content of the corresponding element accurately, and help it respect the structural relationships between elements. For example, a dedicated prompt for table elements produces structured HTML output, while text paragraphs and formulas are parsed through a shared prompt, with formulas emitted as LaTeX. This mechanism makes Dolphin more flexible and efficient when dealing with complex document layouts and diverse element types.
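One way to picture the mechanism is as a lookup from element type to prompt. The sketch below is an assumption-heavy illustration: the prompt strings are placeholders rather than wording confirmed by this article, and `prompt_for` and `group_by_prompt` are hypothetical helpers. The element types `table`, `formula`, and `text` match the `--element_type` values used in the deployment section.

```python
# Placeholder prompt strings (assumptions); the real wording may differ.
TABLE_PROMPT = "Parse the table in the image."
TEXT_PROMPT = "Read text in the image."

# Tables get a dedicated prompt; formulas and paragraphs deliberately share one.
ANCHOR_PROMPTS = {
    "table": TABLE_PROMPT,
    "formula": TEXT_PROMPT,
    "text": TEXT_PROMPT,
}


def prompt_for(element_type: str) -> str:
    """Return the anchor prompt for an element type, defaulting to the text prompt."""
    return ANCHOR_PROMPTS.get(element_type, TEXT_PROMPT)


def group_by_prompt(element_types):
    """Group element indices by prompt so each group can be decoded as one batch."""
    groups = {}
    for index, element_type in enumerate(element_types):
        groups.setdefault(prompt_for(element_type), []).append(index)
    return groups


groups = group_by_prompt(["text", "table", "formula", "title"])
print(groups)
```

Grouping by prompt is a design choice of this sketch: elements that share a prompt can be sent through the model together, which fits the parallel parsing described above.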
3. Main functions
1. Layout analysis
Dolphin can identify various elements in a document, such as titles, charts, tables, footnotes, etc., and generate a sequence of elements in a natural reading order. This function is crucial for understanding the overall structure of a document, especially when dealing with complex academic papers and technical documents, and can help users quickly locate and extract key information.
2. Content Extraction
Dolphin can parse an entire document page into structured JSON or Markdown for later processing and display. Structured output is easier to manipulate and integrate, whether the goal is data storage, information retrieval, or further content analysis.
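To make the relationship between the two output formats concrete, here is a minimal sketch that renders a list of parsed elements as Markdown. The element schema (`label` and `text` keys) and the `to_markdown` helper are assumptions made for illustration; the article does not specify Dolphin's actual JSON layout.

```python
import json


def to_markdown(elements):
    """Render parsed elements (already in reading order) as a Markdown string."""
    blocks = []
    for element in elements:
        label, text = element["label"], element["text"].strip()
        if label == "title":
            blocks.append(f"# {text}")
        elif label == "formula":
            blocks.append(f"$$\n{text}\n$$")
        else:
            # Paragraphs pass through unchanged; HTML tables are valid inside Markdown.
            blocks.append(text)
    return "\n\n".join(blocks)


# Hypothetical parse result for one page.
parsed = [
    {"label": "title", "text": "Quarterly Report"},
    {"label": "para", "text": "Revenue grew in Q2."},
    {"label": "formula", "text": r"g = \frac{r_2 - r_1}{r_1}"},
    {"label": "table", "text": "<table><tr><td>Q2</td></tr></table>"},
]

as_json = json.dumps(parsed, indent=2)  # the structured form, ready to store
markdown = to_markdown(parsed)          # the display form
print(markdown)
```

Keeping the JSON as the source of truth and deriving Markdown from it means the same parse can feed both a database and a viewer.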
3. Text paragraph analysis
Dolphin recognizes and extracts text in multiple languages (such as Chinese and English) while preserving the original format and layout information. This matters for cross-language document processing and multi-language information extraction, and makes the model more versatile.
4. Formula Recognition
Dolphin recognizes complex formulas, both inline and block-level, and outputs them in LaTeX format. This feature is particularly important for academic research and technical documents, where formulas are often among the most critical and complex pieces of information. By recognizing and parsing formulas accurately, Dolphin helps users understand and reuse the mathematical content in documents.
5. Table analysis
Dolphin can parse complex table structures, extract cell contents, and generate tables in HTML format. Tables are a common form of data presentation in documents. Dolphin's table parsing extracts the data and converts it into structured HTML, making it easier for users to analyze and further process.
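Once a table is available as HTML, getting at the cell data takes only the standard library. The sketch below uses Python's `html.parser` and is independent of Dolphin itself; `TableRows` and `table_to_rows` are illustrative names, the sample HTML is made up, and merged cells (`rowspan`/`colspan`) are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


html = (
    "<table><tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Pen</td><td> 1.50 </td></tr></table>"
)
rows = table_to_rows(html)
print(rows)
```

From here the rows can be written to CSV or loaded into a data frame for analysis.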
4. Performance
1. Efficiency
Dolphin's lightweight architecture and parallel parsing mechanism make it efficient at run time. With 322M parameters, it is small enough to run well in resource-constrained environments. In practical tests, Dolphin runs much faster than comparable models; on complex documents, for example, it is nearly 2 times faster than Mathpix, which helps it respond quickly in real applications.
2. Accuracy
Dolphin performs well across a variety of document parsing tasks. In page-level parsing, for both plain text documents and documents containing complex elements (such as tables, formulas, and charts), Dolphin accurately extracts content and structural information, and its edit distance is better than that of existing advanced models on multiple benchmarks. In element-level parsing, its accuracy on text paragraphs, formulas, and tables is also at an industry-leading level, meeting the needs of high-precision document parsing in different scenarios.
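For readers unfamiliar with the edit distance metric, here is a minimal sketch of how it can be computed. It implements the standard Levenshtein distance; dividing by the longer string's length is one common normalization and is an assumption here, since the article does not state which variant the benchmarks use. Lower is better, and 0 means the prediction matches the reference exactly.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows in memory."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length, so the score lies in [0, 1]."""
    longest = max(len(prediction), len(reference))
    return edit_distance(prediction, reference) / longest if longest else 0.0


print(edit_distance("kitten", "sitting"))
print(normalized_edit_distance("abcd", "abcf"))
```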
5. Application Scenarios
1. Academic Research
In academic research, Dolphin can help researchers quickly parse the text, formulas, and charts in papers, so that they can organize literature and analyze data more efficiently. Converting paper content into a structured format makes key information easier to extract and speeds up the research process.
2. Commercial Office
In commercial office scenarios, Dolphin can extract key information from business documents for tasks such as contract review and report generation. It quickly and accurately identifies the text content and structure of documents, helping users locate and extract important information, which improves work efficiency and reduces manual processing costs.
3. Education
In the field of education, Dolphin can digitize textbooks and test papers to support online learning and multilingual teaching. By converting paper textbooks and test papers into electronic formats, students and teachers can conduct learning and teaching activities more conveniently, and it also helps to share and disseminate educational resources.
4. Technology Development
In the field of technology development, Dolphin can parse technical documents to facilitate code management and technical communication. It can help developers quickly extract key information from technical documents, such as code snippets, technical parameters, etc., so as to better understand and apply related technologies and improve development efficiency.
5. Daily Application
In daily office work, Dolphin can quickly process various types of documents, such as meeting minutes and reports, helping users improve office efficiency and save time and energy.
6. Online Experience
To help users explore Dolphin's capabilities, ByteDance provides an online demo. Users can visit [Demo-Dolphin](http://115.190.42.15:8888/dolphin/) to try it.
On this online platform, users can upload their own document images and view Dolphin's parsing results in real time. Trying the demo gives a more direct sense of Dolphin's performance and application scenarios, and provides a reference for later practical use.
7. Deployment and Use
1. Environment setup
1. Clone the repository: First, clone the official Dolphin repository from GitHub with the following commands:
git clone https://github.com/ByteDance/Dolphin.git
cd Dolphin
2. Install dependencies: Install the dependency libraries required by the project and run the following command:
pip install -r requirements.txt
2. Download the pre-trained model
Users can choose one of the following two ways to download the pre-trained model:
1. Original model format (based on configuration file): Download the pre-trained model file from Baidu Yun (follow and send “Dolphin” to get the link) and place it in the `./checkpoints` folder.
2. Hugging Face model format: Visit the Hugging Face model card page, or download the model from Hugging Face Hub using the following commands:
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin ./hf_model
Or use the Hugging Face CLI tool:
huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
3. Inference and Use
Dolphin provides two inference frameworks, each supporting document parsing at the page level and the element level.
1. Page-level analysis
Using the original framework (based on the configuration file):
# Process a single document image
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results
# Process all document images in the directory
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results
Using Hugging Face framework:
# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results
# Process all document images in the directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results
2. Element-level parsing
Using the original framework (based on the configuration file):
# Process a single table image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/table_1.jpeg --element_type table
# Process a single formula image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula
# Process a single text paragraph image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/para_1.jpg --element_type text
Using Hugging Face framework:
# Process a single table image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/table_1.jpeg --element_type table
# Process a single formula image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula
# Process a single text paragraph image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/para_1.jpg --element_type text
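Since each element-level command above handles one image, a small wrapper can help when there are many crops to process. The following is a minimal sketch, not part of the Dolphin repository: `element_command` and the `jobs` mapping are hypothetical, while the script name, flags, and sample paths are the ones shown above. It only prints the commands; actually running them would additionally require `import subprocess` and a Dolphin checkout with the model downloaded.

```python
import shlex
import sys

ELEMENT_TYPES = {"table", "formula", "text"}


def element_command(image_path, element_type, model_path="./hf_model"):
    """Build the argument list for one demo_element_hf.py call."""
    if element_type not in ELEMENT_TYPES:
        raise ValueError(f"unsupported element type: {element_type}")
    return [
        sys.executable, "demo_element_hf.py",
        "--model_path", model_path,
        "--input_path", image_path,
        "--element_type", element_type,
    ]


# Image path -> element type, using the sample files from the commands above.
jobs = {
    "./demo/element_imgs/table_1.jpeg": "table",
    "./demo/element_imgs/line_formula.jpeg": "formula",
    "./demo/element_imgs/para_1.jpg": "text",
}

commands = [element_command(path, kind) for path, kind in jobs.items()]
for command in commands:
    print(shlex.join(command))
    # To execute instead of print: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when image paths contain spaces.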
8. Conclusion
As a document parsing model open sourced by ByteDance, Dolphin advances the field with its two-stage parsing method, heterogeneous anchor prompting, and lightweight architecture. It processes many types of document images efficiently and accurately, and it shows practical value across multiple application scenarios. In the future, Dolphin is expected to play a role in more fields and open further possibilities for intelligent document processing.