Dolphin-API: A hands-on guide to serving ByteDance's Dolphin multimodal document parsing model as an API

Explore how ByteDance's Dolphin model advances multimodal document parsing.
Core content:
1. Dolphin's performance breakthrough and its "analyze first, then parse" paradigm
2. Core components: the Swin Transformer visual encoder and the MBart text decoder
3. NougatModel: the vision-language core that carries out both parsing stages
Dolphin is a multimodal document image parsing model that ByteDance quietly released and open-sourced on platforms such as Hugging Face in May 2025. Rather than simply piling up parameters, it relies on a carefully designed architecture: with only about 322M parameters, it achieves a remarkable performance breakthrough in document parsing. Based on the project source code, this article wraps the model in a Dolphin API, deploys it with Docker, and makes it callable from the Dify platform.
Dolphin Basics
Overview
The core feature of Dolphin is its innovative two-stage "analyze first, then parse" paradigm. In the first stage, the model performs page-level layout analysis on the entire document image, identifies the elements it contains, and generates a sequence of elements arranged in natural human reading order. In the second stage, these identified heterogeneous elements serve as "anchors": the model parses the image regions corresponding to the anchors in parallel, using task prompts specific to each element type. This design lets Dolphin efficiently and accurately process document images in which text paragraphs, complex tables, mathematical formulas, illustrations, and other elements are interwoven. The model currently supports parsing Chinese and English documents, and outputs the final results in structured formats such as Markdown or JSON that are easy to process downstream.
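For illustration only (the field names and schema below are hypothetical, not Dolphin's actual output format), the parsed result of a page can be thought of as an ordered list of typed elements, where stage one supplies the type, position, and reading order, and stage two fills in the parsed content:

# Hypothetical sketch of a parsed page: stage 1 yields the typed,
# ordered elements with bounding boxes; stage 2 fills in "text".
layout = [
    {"type": "title", "bbox": [60, 40, 980, 110], "text": "1. Introduction"},
    {"type": "paragraph", "bbox": [60, 130, 980, 420], "text": "Document parsing is ..."},
    {"type": "table", "bbox": [60, 450, 980, 800], "text": "| Model | ED |\n|---|---|\n| ... | ... |"},
    {"type": "formula", "bbox": [60, 830, 980, 900], "text": "E = mc^2"},
]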
Core Components
Visual Encoder: Swin Transformer
The visual encoder is responsible for extracting rich, deep visual features from the input document image. Dolphin uses Swin Transformer as its visual backbone. Swin Transformer is known for its hierarchical feature representation and efficient shifted-window self-attention mechanism, which capture the global layout of the image and its local details at the same time. Before being fed to the encoder, the image is carefully preprocessed: its long side is scaled to a fixed size and the result is padded into a square, preserving the original aspect ratio to avoid text distortion and retain as much information as possible.
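A minimal sketch of this preprocessing (the target size, white padding, and function name are assumptions for illustration; Dolphin's actual prepare_image helper lives in the project source):

from PIL import Image

def pad_to_square(image: Image.Image, target_size: int = 896) -> Image.Image:
    # Scale the long side to target_size, keeping the aspect ratio
    w, h = image.size
    scale = target_size / max(w, h)
    resized = image.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    # Pad with white pixels to a square canvas so that text is not
    # distorted by non-uniform scaling
    canvas = Image.new("RGB", (target_size, target_size), (255, 255, 255))
    canvas.paste(resized, (0, 0))
    return canvas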
Text Decoder: Based on the MBart Architecture
The text decoder receives the image features from the visual encoder together with a task-specific text prompt, and autoregressively generates structured text sequences: paragraph content, Markdown representations of tables, LaTeX code for formulas, and so on. Dolphin's text decoder is adapted from the MBart architecture, using only its decoder portion. Through a cross-attention mechanism it fuses the visual features extracted by the Swin Transformer with the input prompt, "translating" the visual information into the corresponding textual description. The decoder consists of multiple Transformer layers with a hidden dimension of 1024.
Core Model: NougatModel
NougatModel combines the Swin Transformer visual encoder and the MBart-based text decoder into the main body of Dolphin's vision-language model, and carries out the actual "describe the image" and "generate structure from the image" tasks. NougatModel is invoked in both of Dolphin's parsing stages. In the layout analysis stage, it receives the complete page image plus a layout recognition prompt, and outputs the sequence and positions of the page elements. In the content parsing stage, it receives cropped element image patches plus a content parsing prompt specific to the element type, and outputs the element's concrete text content. In this way, a single unified model architecture efficiently completes a variety of parsing subtasks simply by varying the input granularity and the prompting strategy.
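Putting the two stages together, the inference flow looks roughly like the following (a hypothetical sketch: the generate method, prompt strings, and element fields are illustrative stand-ins for the project's actual interfaces):

def parse_document(page_image, model):
    # Stage 1: page-level layout analysis, producing typed elements
    # in natural reading order
    elements = model.generate(page_image, prompt="Parse the reading order of this document.")
    results = []
    # Stage 2: crop each element and parse it with a type-specific prompt
    for elem in elements:
        crop = page_image.crop(elem["bbox"])
        task_prompt = {
            "table": "Parse the table in the image.",
            "formula": "Read formula in the image.",
        }.get(elem["type"], "Read text in the image.")
        results.append(model.generate(crop, prompt=task_prompt))
    return results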
Document parsing performance comparison
| Category | Method | Model size | Avg. ED | FPS |
|---|---|---|---|---|
| Integrated approach | MinerU | 1.2B | 0.1732 | 0.0350 |
| | Mathpix | - | 0.0924 | 0.0944 |
| Expert VLM | Nougat | 250M | 0.6131 | 0.0673 |
| | Kosmos-2.5 | 1.3B | 0.2691 | 0.0841 |
| | Vary | 7B | - | - |
| | Fox | 1.8B | - | - |
| | GOT | 580M | 0.1411 | 0.0604 |
| | olmOCR | 7B | 0.1148 | 0.0427 |
| | SmolDocling | 256M | 0.4636 | 0.0140 |
| | Mistral-OCR | - | 0.0737 | 0.0996 |
| General VLM | InternVL-2.5 | 8B | 0.4037 | 0.0444 |
| | InternVL-3 | 8B | 0.2089 | 0.0431 |
| | MiniCPM-o 2.6 | 8B | 0.2882 | 0.0494 |
| | GLM4v-plus | 9B | 0.2481 | 0.0427 |
| | Gemini-1.5 pro | - | 0.1348 | 0.0376 |
| | Gemini-2.5 pro | - | 0.1432 | 0.0231 |
| | Claude3.5-Sonnet | - | 0.1358 | 0.0320 |
| | GPT-4o-202408 | - | 0.2453 | 0.0368 |
| | GPT-4.1-250414 | - | 0.2133 | 0.0337 |
| | Step-1v-8k | - | 0.1227 | 0.0417 |
| | Qwen2-VL | 7B | 0.2550 | 0.0315 |
| | Qwen2.5-VL | 7B | 0.1112 | 0.0343 |
| Dolphin | Dolphin | 322M | 0.0575 | 0.1729 |
ED (edit distance): lower is better. FPS (frames per second): higher is better.
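For reference, the ED metric here is the Levenshtein (edit) distance between the predicted text and the ground truth, typically normalized by length; a minimal sketch of the computation (the benchmark's exact normalization may differ):

def edit_distance(pred: str, ref: str) -> float:
    # Classic dynamic-programming Levenshtein distance, kept to a
    # single rolling row; normalized by the reference length
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)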
Dolphin-API interface transformation
Rewriting the API with FastAPI
Based on the Dolphin source code and the FastAPI framework, we construct the Dolphin-API interface. In the snippet below, model, process_page, prepare_image, and setup_output_dirs come from the Dolphin project source.
import io
import os
import tempfile

import cv2
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from PIL import Image

app = FastAPI(
    title="DOLPHIN API",
    description="API for document layout analysis and text recognition",
    version="1.0.0"
)

@app.post("/analyze")
async def analyze_document(file: UploadFile = File(...)):
    """
    Analyze a document image and return the parsing results.

    Args:
        file: the uploaded image file

    Returns:
        the parsing results in JSON format
    """
    try:
        # Build an image object directly from the uploaded file content
        content = await file.read()
        image = Image.open(io.BytesIO(content)).convert("RGB")
        # Create a temporary directory to hold intermediate results
        with tempfile.TemporaryDirectory() as temp_dir:
            # Make sure the required subdirectories exist
            setup_output_dirs(temp_dir)
            # Preprocess the image with the project's prepare_image helper
            padded_image, dimensions = prepare_image(image)
            # Save the processed image to a temporary file,
            # since process_page expects a file path
            temp_image_path = os.path.join(temp_dir, "temp_image.png")
            try:
                cv2.imwrite(temp_image_path, padded_image)
                # Run the two-stage parsing pipeline
                json_path, recognition_results = process_page(
                    image_path=temp_image_path,
                    model=model,
                    save_dir=temp_dir,
                    max_batch_size=16
                )
                return JSONResponse(content=recognition_results)
            finally:
                # Make sure the temporary image file is deleted
                if os.path.exists(temp_image_path):
                    try:
                        os.remove(temp_image_path)
                    except Exception as e:
                        print(f"Warning: failed to delete temporary file {temp_image_path}: {e}")
    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={"error": f"Error while processing image: {str(e)}"}
        )
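To serve the app, run it with uvicorn on the port that the Docker commands below expose (the module name api is an assumption; substitute your actual filename):

# Entry point: serve the FastAPI app on port 6500
# ("api:app" assumes this file is named api.py)
if __name__ == "__main__":
    import uvicorn
    uvicorn.run("api:app", host="0.0.0.0", port=6500)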
Docker image building and deployment
Writing a Dockerfile
# Base image (an assumption: choose a CUDA-enabled PyTorch image that matches your driver)
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
# Copy the dependency file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
# Install FastAPI, uvicorn, and huggingface_hub
RUN pip install --no-cache-dir fastapi uvicorn python-multipart huggingface_hub -i https://mirrors.aliyun.com/pypi/simple/
# Copy the project files
COPY . .
# Set the environment variable to use the mirror site
ENV HF_ENDPOINT=https://hf-mirror.com
# Download the model weights at build time
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/Dolphin', local_dir='./hf_model')"
# Expose the service port and start the API (assumes the app lives in api.py)
EXPOSE 6500
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "6500"]
Build the image and start the service
## Build the image
docker build -t dolphin-api:0.1 .
## Start the service (expose the port externally)
docker run -d --gpus all -p 6500:6500 --ipc=host --name dolphin-api dolphin-api:0.1
## Start the service (expose it only to the local Dify network)
docker run -d --gpus all --network docker_ssrf_proxy_network --name dolphin-api dolphin-api:0.1
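The second variant omits -p, so no host port is published; by attaching the container to Dify's docker_ssrf_proxy_network (the exact network name depends on your Dify compose project), the API is reachable from Dify's containers by container name while remaining invisible to the outside.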
Testing via the Swagger UI
Currently the API only accepts image formats, so we upload a screenshot for testing.
Dify workflow integration
First, create a Chatflow in Dify and arrange the following conversation flow. For reference, see the earlier MinerU article, "MinerU-API | Support multi-format parsing to further improve Dify's document capabilities".
The Dolphin-API step is a code node that runs the following Python script:
import requests

# file_path, file_name, and mime_type are provided by the Dify code node
headers = {
    'accept': 'application/json',
}
with open(file_path, 'rb') as f:
    files = {
        'file': (file_name, f, mime_type),
    }
    response = requests.post('http://dolphin-api:6500/analyze', headers=headers, files=files)
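Because the Dolphin-API container and Dify share the same Docker network, the hostname dolphin-api is resolved by Docker's built-in DNS. The code node can then return response.json() (the structured parsing result) as its output variable for the downstream LLM node to consume.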
Finally, test the workflow on the page.