Dolphin-API: A hands-on guide to serving ByteDance's Dolphin multimodal document parsing model as an API

Explore how ByteDance's Dolphin model advances multimodal document parsing.
Core content:
1. Dolphin's performance breakthrough and its "analyze first, then parse" paradigm
2. Core components: the Swin Transformer visual encoder and the MBart text decoder
3. NougatModel: the vision-language core that carries out both parsing stages
Dolphin is a multimodal document image parsing model that ByteDance quietly released and open-sourced on platforms such as Hugging Face in May 2025. Rather than simply piling up parameters, it relies on a carefully designed architecture: with only about 322M parameters, it achieves a remarkable performance breakthrough in document parsing. Based on the project source code, this article wraps the model in a Dolphin API, deploys it with Docker, and makes it callable from the Dify platform.
Dolphin Basics
Overview
The core feature of Dolphin is its innovative two-stage "analyze first, then parse" paradigm. In the first stage, the model performs page-level layout analysis on the entire document image, identifies the elements it contains, and generates a sequence of elements arranged in natural human reading order. In the second stage, these identified heterogeneous elements serve as "anchors": the model parses the image regions corresponding to the anchors in parallel, using task prompts specific to each element type. This design lets Dolphin efficiently and accurately process document images in which text paragraphs, complex tables, mathematical formulas, illustrations, and other elements are interwoven. The model currently supports parsing Chinese and English documents, and outputs the final results in structured formats such as Markdown or JSON that are easy to process downstream.
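For illustration only (the field names and schema below are hypothetical, not Dolphin's actual output format), the parsed result of a page can be thought of as an ordered list of typed elements, where stage one supplies the type, position, and reading order, and stage two fills in the parsed content:

# Hypothetical sketch of a parsed page: stage 1 yields the typed,
# ordered elements with bounding boxes; stage 2 fills in "text".
layout = [
    {"type": "title", "bbox": [60, 40, 980, 110], "text": "1. Introduction"},
    {"type": "paragraph", "bbox": [60, 130, 980, 420], "text": "Document parsing is ..."},
    {"type": "table", "bbox": [60, 450, 980, 800], "text": "| Model | ED |\n|---|---|\n| ... | ... |"},
    {"type": "formula", "bbox": [60, 830, 980, 900], "text": "E = mc^2"},
]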
Core Components
Visual Encoder: Swin Transformer
The visual encoder is responsible for extracting rich, deep visual features from the input document image. Dolphin uses Swin Transformer as its visual backbone. Swin Transformer is known for its hierarchical feature representation and efficient shifted-window self-attention mechanism, which capture the global layout of the image and its local details at the same time. Before being fed to the encoder, the image is carefully preprocessed: its long side is scaled to a fixed size and the result is padded into a square, preserving the original aspect ratio to avoid text distortion and retain as much information as possible.
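A minimal sketch of this preprocessing (the target size, white padding, and function name are assumptions for illustration; Dolphin's actual prepare_image helper lives in the project source):

from PIL import Image

def pad_to_square(image: Image.Image, target_size: int = 896) -> Image.Image:
    # Scale the long side to target_size, keeping the aspect ratio
    w, h = image.size
    scale = target_size / max(w, h)
    resized = image.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    # Pad with white pixels to a square canvas so that text is not
    # distorted by non-uniform scaling
    canvas = Image.new("RGB", (target_size, target_size), (255, 255, 255))
    canvas.paste(resized, (0, 0))
    return canvas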
Text Decoder: Based on the MBart Architecture
The text decoder receives the image features from the visual encoder together with a task-specific text prompt, and autoregressively generates structured text sequences: paragraph content, Markdown representations of tables, LaTeX code for formulas, and so on. Dolphin's text decoder is adapted from the MBart architecture, using only its decoder portion. Through a cross-attention mechanism it fuses the visual features extracted by the Swin Transformer with the input prompt, "translating" the visual information into the corresponding textual description. The decoder consists of multiple Transformer layers with a hidden dimension of 1024.
Core Model: NougatModel
NougatModel combines the Swin Transformer visual encoder and the MBart-based text decoder into the main body of Dolphin's vision-language model, and carries out the actual "describe the image" and "generate structure from the image" tasks. NougatModel is invoked in both of Dolphin's parsing stages. In the layout analysis stage, it receives the complete page image plus a layout recognition prompt, and outputs the sequence and positions of the page elements. In the content parsing stage, it receives cropped element image patches plus a content parsing prompt specific to the element type, and outputs the element's concrete text content. In this way, a single unified model architecture efficiently completes a variety of parsing subtasks simply by varying the input granularity and the prompting strategy.
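Putting the two stages together, the inference flow looks roughly like the following (a hypothetical sketch: the generate method, prompt strings, and element fields are illustrative stand-ins for the project's actual interfaces):

def parse_document(page_image, model):
    # Stage 1: page-level layout analysis, producing typed elements
    # in natural reading order
    elements = model.generate(page_image, prompt="Parse the reading order of this document.")
    results = []
    # Stage 2: crop each element and parse it with a type-specific prompt
    for elem in elements:
        crop = page_image.crop(elem["bbox"])
        task_prompt = {
            "table": "Parse the table in the image.",
            "formula": "Read formula in the image.",
        }.get(elem["type"], "Read text in the image.")
        results.append(model.generate(crop, prompt=task_prompt))
    return results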
Document parsing performance comparison
| Category | Method | Model size | Avg. ED | FPS |
|---|---|---|---|---|
| Integrated approach | MinerU | 1.2B | 0.1732 | 0.0350 |
| | Mathpix | - | 0.0924 | 0.0944 |
| Expert VLM | Nougat | 250M | 0.6131 | 0.0673 |
| | Kosmos-2.5 | 1.3B | 0.2691 | 0.0841 |
| | Vary | 7B | - | - |
| | Fox | 1.8B | - | - |
| | GOT | 580M | 0.1411 | 0.0604 |
| | olmOCR | 7B | 0.1148 | 0.0427 |
| | SmolDocling | 256M | 0.4636 | 0.0140 |
| | Mistral-OCR | - | 0.0737 | 0.0996 |
| General VLM | InternVL-2.5 | 8B | 0.4037 | 0.0444 |
| | InternVL-3 | 8B | 0.2089 | 0.0431 |
| | MiniCPM-o 2.6 | 8B | 0.2882 | 0.0494 |
| | GLM4v-plus | 9B | 0.2481 | 0.0427 |
| | Gemini-1.5 pro | - | 0.1348 | 0.0376 |
| | Gemini-2.5 pro | - | 0.1432 | 0.0231 |
| | Claude3.5-Sonnet | - | 0.1358 | 0.0320 |
| | GPT-4o-202408 | - | 0.2453 | 0.0368 |
| | GPT-4.1-250414 | - | 0.2133 | 0.0337 |
| | Step-1v-8k | - | 0.1227 | 0.0417 |
| | Qwen2-VL | 7B | 0.2550 | 0.0315 |
| | Qwen2.5-VL | 7B | 0.1112 | 0.0343 |
| Dolphin | Dolphin | 322M | 0.0575 | 0.1729 |
ED (edit distance): lower is better. FPS (frames per second): higher is better.
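For reference, the ED metric here is the Levenshtein (edit) distance between the predicted text and the ground truth, typically normalized by length; a minimal sketch of the computation (the benchmark's exact normalization may differ):

def edit_distance(pred: str, ref: str) -> float:
    # Classic dynamic-programming Levenshtein distance, kept to a
    # single rolling row; normalized by the reference length
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)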
Dolphin-API interface transformation
Rewriting the API with FastAPI
Based on the Dolphin source code and the FastAPI framework, we construct the Dolphin-API interface. In the snippet below, model, process_page, prepare_image, and setup_output_dirs come from the Dolphin project source.
import io
import os
import tempfile

import cv2
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from PIL import Image

app = FastAPI(
    title="DOLPHIN API",
    description="API for document layout analysis and text recognition",
    version="1.0.0"
)

@app.post("/analyze")
async def analyze_document(file: UploadFile = File(...)):
    """
    Analyze a document image and return the parsing results.

    Args:
        file: the uploaded image file

    Returns:
        the parsing results in JSON format
    """
    try:
        # Build an image object directly from the uploaded file content
        content = await file.read()
        image = Image.open(io.BytesIO(content)).convert("RGB")
        # Create a temporary directory to hold intermediate results
        with tempfile.TemporaryDirectory() as temp_dir:
            # Make sure the required subdirectories exist
            setup_output_dirs(temp_dir)
            # Preprocess the image with the project's prepare_image helper
            padded_image, dimensions = prepare_image(image)
            # Save the processed image to a temporary file,
            # since process_page expects a file path
            temp_image_path = os.path.join(temp_dir, "temp_image.png")
            try:
                cv2.imwrite(temp_image_path, padded_image)
                # Run the two-stage parsing pipeline
                json_path, recognition_results = process_page(
                    image_path=temp_image_path,
                    model=model,
                    save_dir=temp_dir,
                    max_batch_size=16
                )
                return JSONResponse(content=recognition_results)
            finally:
                # Make sure the temporary image file is deleted
                if os.path.exists(temp_image_path):
                    try:
                        os.remove(temp_image_path)
                    except Exception as e:
                        print(f"Warning: failed to delete temporary file {temp_image_path}: {e}")
    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={"error": f"Error while processing image: {str(e)}"}
        )
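To serve the app, run it with uvicorn on the port that the Docker commands below expose (the module name api is an assumption; substitute your actual filename):

# Entry point: serve the FastAPI app on port 6500
# ("api:app" assumes this file is named api.py)
if __name__ == "__main__":
    import uvicorn
    uvicorn.run("api:app", host="0.0.0.0", port=6500)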
Docker image building and deployment
Writing a Dockerfile
# Base image (an assumption: choose a CUDA-enabled PyTorch image that matches your driver)
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
# Copy the dependency file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
# Install FastAPI, uvicorn, and huggingface_hub
RUN pip install --no-cache-dir fastapi uvicorn python-multipart huggingface_hub -i https://mirrors.aliyun.com/pypi/simple/
# Copy the project files
COPY . .
# Set the environment variable to use the mirror site
ENV HF_ENDPOINT=https://hf-mirror.com
# Download the model weights at build time
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/Dolphin', local_dir='./hf_model')"
# Expose the service port and start the API (assumes the app lives in api.py)
EXPOSE 6500
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "6500"]
Build the image and start the service
## Build the image
docker build -t dolphin-api:0.1 .
## Start the service (expose the port externally)
docker run -d --gpus all -p 6500:6500 --ipc=host --name dolphin-api dolphin-api:0.1
## Start the service (expose it only to the local Dify network)
docker run -d --gpus all --network docker_ssrf_proxy_network --name dolphin-api dolphin-api:0.1
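The second variant omits -p, so no host port is published; by attaching the container to Dify's docker_ssrf_proxy_network (the exact network name depends on your Dify compose project), the API is reachable from Dify's containers by container name while remaining invisible to the outside.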
Testing via the Swagger UI
Currently the API only accepts image formats, so we upload a screenshot for testing.
Dify workflow integration
First, create a Chatflow in Dify and arrange the following conversation flow. For reference, see the earlier MinerU article, "MinerU-API | Support multi-format parsing to further improve Dify's document capabilities".
The Dolphin-API step is a code node that runs the following Python script:
import requests

# file_path, file_name, and mime_type are provided by the Dify code node
headers = {
    'accept': 'application/json',
}
with open(file_path, 'rb') as f:
    files = {
        'file': (file_name, f, mime_type),
    }
    response = requests.post('http://dolphin-api:6500/analyze', headers=headers, files=files)
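Because the Dolphin-API container and Dify share the same Docker network, the hostname dolphin-api is resolved by Docker's built-in DNS. The code node can then return response.json() (the structured parsing result) as its output variable for the downstream LLM node to consume.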
Finally, test the workflow on the page.