Intelligent data extraction tool - Detailed explanation and use of MinerU

Written by
Silas Grey
Updated on: July 3, 2025
Recommendation

MinerU is a powerful document processing tool for the AI era that makes it easy to extract information from PDFs.
Core content:
1. Open source background and function overview of the intelligent data extraction tool MinerU
2. Core capabilities of multi-type conversion, multi-language recognition and multi-element analysis
3. User-friendly interface and API services to improve document processing efficiency and experience

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
With the rapid development of AI technology, processing large volumes of unstructured data has become an urgent problem. PDF is one of the most common file formats, and extracting information from PDF documents efficiently and accurately is a pain point for many companies and research institutions. To address this, OpenDataLab, the large-model data foundation team at the Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory), has open-sourced a new intelligent data extraction tool, MinerU.

MinerU converts PDF documents that mix complex elements such as images, formulas, tables and footnotes into Markdown and JSON, greatly improving the efficiency of AI corpus preparation. Because it is fast, accurate, open source and easy to use, MinerU has been adopted by many users and large-model developers. Eight months after launch, it had nearly 30,000 GitHub stars, and developers have praised it as "a magical tool for document extraction and conversion in the era of large models."


Features



Multi-type conversion capability

MinerU supports extraction from multiple types of PDF documents, including text-based PDFs, layer-based PDFs, and scanned PDFs. When you input a PDF, the system first runs a document classification module, which extracts the PDF metadata and checks whether the embedded text is garbled. MinerU can also convert other document types (such as images, PPT, and Word documents) into PDF before extraction.
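
To make the classification step more concrete, here is a minimal Python sketch of how a text-layer check might look. It is not MinerU's implementation: the function names, the 50-character threshold and the 20% garbled ratio are all assumptions chosen for illustration, and the input is assumed to be the text already pulled from each page's text layer.

```python
from __future__ import annotations

import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like decoding debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        # U+FFFD is the replacement character; Cc/Co/Cn are control,
        # private-use and unassigned code points.
        if c == "\ufffd" or unicodedata.category(c) in {"Cc", "Co", "Cn"}
    )
    return bad / len(chars)


def classify_pdf(
    page_texts: list[str],
    min_chars: int = 50,        # hypothetical threshold
    max_garbled: float = 0.2,   # hypothetical threshold
) -> str:
    """Return 'text' when the embedded text layer looks usable, else 'ocr'."""
    joined = "".join(page_texts)
    visible = sum(1 for c in joined if not c.isspace())
    avg_chars = visible / max(len(page_texts), 1)
    if avg_chars < min_chars:
        return "ocr"  # little or no text layer: probably a scanned PDF
    if garbled_ratio(joined) > max_garbled:
        return "ocr"  # a text layer exists but is unreadable
    return "text"
```

A document that comes back as 'ocr' would be routed to the OCR path, while a 'text' document could be read directly from its text layer.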

Multi-language recognition

MinerU recognizes text across languages. It currently supports Chinese (Simplified and Traditional), English, Russian, Japanese, Korean and other languages, which gives it broad application potential worldwide.

Multi-element analysis

MinerU can accurately analyze multiple elements and extract comprehensive information, including:

  • Text content
  • Formulas (including mathematical formulas, chemical equations, etc.)
  • Tables
  • Charts
  • Table headers and footers
  • Image descriptions

Deleting layout elements

MinerU can accurately identify layout elements, delete headers/footers/footnotes, and retain only the main text content, ensuring that the extracted text is semantically coherent and free of interfering information.
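
As a rough illustration of this filtering idea, the sketch below drops blocks whose type marks them as page furniture and keeps the rest in order. The block schema (a dict with "type" and "text" keys) and the type names are assumptions made for this example, not MinerU's internal format.

```python
from __future__ import annotations

# Hypothetical type labels for blocks that are not part of the main text.
DISCARDED_TYPES = {"header", "footer", "footnote", "page_number"}


def keep_main_content(blocks: list[dict]) -> list[dict]:
    """Return only the blocks that belong to the main text, in original order."""
    return [b for b in blocks if b.get("type") not in DISCARDED_TYPES]


blocks = [
    {"type": "header", "text": "Annual Report"},
    {"type": "text", "text": "Revenue grew."},
    {"type": "footnote", "text": "1 Unaudited."},
    {"type": "text", "text": "Costs fell."},
]
main_text = " ".join(b["text"] for b in keep_main_content(blocks))
print(main_text)  # Revenue grew. Costs fell.
```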

Multi-format output

MinerU supports multiple output formats, including:

  • Multimodal Markdown and NLP Markdown
  • JSON sorted by reading order
  • An intermediate format rich in information
  • Image and table extraction results

User-friendly interface

MinerU has launched a new client for the mainstream operating systems (Windows, macOS and Linux). It requires no programming and no login: just download it and start using it. After installing the client from the official MinerU website, you drag and drop a file, or enter the URL of the file to be converted and click Confirm, and the document is parsed and exported automatically.

API Services

The MinerU online API service is aligned with release 1.0 of the MinerU open source project. It provides batch parsing of URLs and local files, querying and downloading of parsing results, and configuration of model-related parameters. You can try it for free after filling out an application questionnaire. Thanks to continued optimization of compute scheduling and stronger batch processing, MinerU handles large numbers of concurrent documents more efficiently and responds quickly to both batch jobs and single large files.


Technical Architecture



MinerU's technical architecture integrates the most advanced document parsing models, covering layout detection, formula detection, formula recognition, OCR and table recognition. The following are MinerU's core technical components:

Layout Detection

MinerU uses fine-tuned DocLayout-YOLO and LayoutLMv3 models to locate different elements in documents, including images, tables, text, titles, and formulas. These models have been fine-tuned on a variety of PDF document annotations, achieving accurate extraction results on diverse PDF documents such as papers, textbooks, research reports, and financial reports, and showing high robustness in the face of challenges such as blur and watermarks.

Formula Detection

MinerU uses a fine-tuned YOLOv8 model to locate formulas in documents, including inline formulas and block formulas. This advanced formula detection capability ensures that mathematical content can be accurately identified and extracted.

Formula Recognition

MinerU uses the UniMERNet model for formula recognition, an algorithm designed to recognize the wide variety of formulas found in real-world scenarios. Built on large-scale training data and a carefully designed model, it achieves excellent recognition performance on complex long formulas, handwritten formulas, and noisy screenshots of formulas.
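
Once a formula has been detected as inline or block and recognized as LaTeX, it has to be written into the Markdown output. The sketch below shows one common convention (single dollar signs for inline formulas, double dollar signs on their own lines for block formulas). The function name and the choice of delimiters are assumptions for illustration. MinerU's own output conventions may differ.

```python
def formula_to_markdown(latex: str, inline: bool) -> str:
    """Wrap recognized LaTeX in Markdown math delimiters."""
    latex = latex.strip()
    if inline:
        return f"${latex}$"
    # Block formulas get their own lines so Markdown renderers treat them as display math.
    return f"$$\n{latex}\n$$"


print(formula_to_markdown(" E = mc^2 ", inline=True))  # $E = mc^2$
```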

OCR Capability

MinerU uses PaddleOCR for text recognition. PaddleOCR is an end-to-end optical character recognition (OCR) engine based on PaddlePaddle (an open source deep learning platform developed by Baidu) with the following features:

  • Provides a complete OCR process from text detection, text recognition to post-processing of text recognition results
  • The model structure and inference speed have been optimized, allowing it to run quickly on a variety of hardware while maintaining high recognition accuracy.
  • Supports text recognition in multiple languages, including Chinese, English, French, German, Japanese, and Korean
  • Integrates a variety of advanced text detection and recognition models, such as DB for text detection, CRNN and STAR-Net for text recognition
  • Provides a series of pre-trained models that users can use directly for text recognition
  • Provides detailed documentation and sample code, allowing users to easily get started and quickly integrate OCR functions into their own applications

Table Recognition

MinerU provides two table recognition methods:

  1. StructEqTable: An efficient toolkit that converts table images to LaTeX/HTML/Markdown. The latest version adopts the InternVL2-1B base model, improves Chinese recognition accuracy, and expands multi-format output options.
  2. PaddleOCR+TableMaster: PaddleOCR first detects the location of text lines in the document image, and TableMaster focuses on the recognition of table structure, detecting the existence of tables in document images and locating the boundaries of tables.

In the process of table recognition, PaddleOCR provides accurate text recognition, while TableMaster is responsible for identifying and reconstructing the structure of the table. Combining the two, the system is able to extract complete table information from the image, including the structure and content of the table.
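
The combination of the two outputs can be pictured with a small sketch: the structure model supplies cell boxes with row and column indices, the OCR engine supplies text boxes, and each text line is placed in the cell that contains its center. Everything here (the Cell class, the box format, the center-containment rule) is a simplifying assumption for illustration, not the actual PaddleOCR or TableMaster interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cell:
    """A table cell from a (hypothetical) structure model; box is (x0, y0, x1, y1)."""
    row: int
    col: int
    box: tuple
    texts: list = field(default_factory=list)


def center(box: tuple) -> tuple:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(box: tuple, point: tuple) -> bool:
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def fill_cells(cells: list[Cell], ocr_lines: list[tuple]) -> list[list[str]]:
    """Assign each OCR line (box, text) to the first cell containing its center."""
    if not cells:
        return []
    for box, text in ocr_lines:
        point = center(box)
        for cell in cells:
            if contains(cell.box, point):
                cell.texts.append(text)
                break
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell.row][cell.col] = " ".join(cell.texts)
    return grid


cells = [
    Cell(0, 0, (0, 0, 100, 50)), Cell(0, 1, (100, 0, 200, 50)),
    Cell(1, 0, (0, 50, 100, 100)), Cell(1, 1, (100, 50, 200, 100)),
]
ocr_lines = [
    ((10, 10, 90, 40), "Name"), ((110, 10, 190, 40), "Score"),
    ((10, 60, 90, 90), "Ada"), ((110, 60, 190, 90), "95"),
]
table = fill_cells(cells, ocr_lines)
print(table)  # [['Name', 'Score'], ['Ada', '95']]
```

A real system also has to cope with merged cells and text that straddles cell borders, which this sketch ignores.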


How to use



MinerU provides a variety of usage methods to meet the needs of different users:

Graphical client

MinerU has launched a new client for the mainstream operating systems (Windows, macOS and Linux). It requires no programming and no login: just download it and start using it. Documents are extracted through simple interaction in the graphical interface.

Steps:

  1. Download the client from the official MinerU website and install it
  2. Drag and drop the file to be processed, or enter the URL of the file to be converted and click Confirm
  3. The client uses release 1.0 of the MinerU open source project, which supports content extraction from document types such as pdf, doc, docx, ppt and pptx
  4. The client offers configuration switches for recognition mode, model, language and other options
  5. You can export Markdown files, as well as key intermediate files such as content_list.json and layout.json
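
Exported intermediate files are convenient for downstream processing. The sketch below assumes a simplified content list in which each entry has a "type" field, an optional "text", an optional heading level "text_level", and an optional "img_path". Treat this schema as an assumption for illustration and inspect your own content_list.json before relying on any field name.

```python
import json


def content_list_to_markdown(items):
    """Render a list of content entries (assumed schema) as Markdown."""
    parts = []
    for item in items:
        kind = item.get("type")
        if kind == "text":
            text = item.get("text", "").strip()
            if not text:
                continue
            level = item.get("text_level")  # assumed: 1 means a top-level heading
            parts.append(f"{'#' * level} {text}" if level else text)
        elif kind in {"image", "table"}:
            path = item.get("img_path")
            if path:
                parts.append(f"![{kind}]({path})")
    return "\n\n".join(parts)


sample = json.loads("""
[
  {"type": "text", "text": "Quarterly Report", "text_level": 1},
  {"type": "text", "text": "Revenue grew in the third quarter."},
  {"type": "image", "img_path": "images/chart.jpg"}
]
""")
markdown = content_list_to_markdown(sample)
print(markdown)
```

With a real export you would load the file with json.load instead of the inline sample.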

Online API interface

The MinerU online API service is aligned with release 1.0 of the MinerU open source project and provides the following features:

  • Batch parsing of URLs and local files
  • Querying and downloading parsing results
  • Configuring model-related parameters

Users fill out an application questionnaire and, once approved, can try the service for free. Thanks to continued optimization of compute scheduling and stronger batch processing, MinerU handles large numbers of concurrent documents more efficiently and responds quickly to both batch jobs and single large files.
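
A submit-then-query workflow like this is usually driven by a polling loop on the client side. The sketch below shows only that generic pattern and does not call the MinerU service: fetch_status is a caller-supplied function, and the "state", "done", "failed" and "error" names are hypothetical, so map them to whatever the real API documentation specifies.

```python
import time
from typing import Callable


def wait_for_result(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status() until the job reports completion, failure, or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        state = status.get("state")  # hypothetical field name
        if state == "done":
            return status
        if state == "failed":
            raise RuntimeError(status.get("error", "parsing failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("parsing did not finish in time")
        sleep(interval)
```

Passing the status function in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to test with canned responses.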

Local installation and operation

The following steps are required to install and run MinerU in your local environment:

  1. Basic environment

    • MinerU is a Python project
    • Local environment used here: Windows 10, PyCharm, Python 3.10

  2. Build a local virtual environment

    • Create a Python virtual environment directly in PyCharm (Python Virtualenv Environment)
    • Install and run MinerU's dependencies inside that virtual environment
    • Add the virtual environment to your project structure

  3. Install MinerU's runtime dependencies

    • Installation command: pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
    • Or, with a mirror that may be faster: pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple

  4. Download the model files

    • There are two model download scripts in the scripts directory of the code:
      • download_models_hf.py: downloads the model files from Hugging Face
      • download_models.py: downloads the model files from ModelScope
    • The model download produces two paths: models and LayoutReader
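
After the installation steps, a quick sanity check can confirm that the virtual environment is the one being used. The sketch below assumes the magic-pdf package is imported under the module name magic_pdf and uses Python 3.10 as the baseline only because that is the environment described above. Neither value is presented here as an official requirement.

```python
from __future__ import annotations

import importlib.util
import sys


def environment_report(
    module_name: str = "magic_pdf",          # assumed import name of magic-pdf
    min_python: tuple[int, int] = (3, 10),   # mirrors the environment in this article
) -> dict:
    """Report whether the interpreter and the package look ready to use."""
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "module_installed": importlib.util.find_spec(module_name) is not None,
    }


print(environment_report())
```

If module_installed is False, the interpreter running the script is probably not the one from the virtual environment where pip installed the package.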

Container installation and operation

On Linux, MinerU can also be installed and run in a container:

  1. CPU container environment

    • Build the image: docker build -t magic-pdf-cpu-test .
    • Run the container: docker run -d magic-pdf-cpu-test
    • Enter the container and run the magic-pdf commands to perform PDF extraction on the CPU

  2. GPU container environment

    • The container image must be built on a machine with a CUDA-capable GPU
    • Build command: docker build -t magic-pdf-gpu-test .
    • Run the container: docker run --rm --device nvidia.com/gpu=all --security-opt=label=disable -it magic-pdf-gpu-test
    • Once inside the container, you can use the GPU to perform PDF extraction
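
Whether inside a container or in a local environment, extraction is started from the magic-pdf command line. The helper below only assembles the argument list. The -p (input path), -o (output directory) and -m (method: auto, txt or ocr) flags reflect my understanding of the 1.x command line and should be treated as assumptions, so confirm them with magic-pdf --help for your installed version.

```python
from __future__ import annotations


def build_magic_pdf_command(pdf_path: str, output_dir: str, method: str = "auto") -> list[str]:
    """Assemble a magic-pdf invocation (flags assumed; verify with --help)."""
    if method not in {"auto", "txt", "ocr"}:
        raise ValueError(f"unsupported method: {method}")
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]


cmd = build_magic_pdf_command("demo.pdf", "output")
print(" ".join(cmd))  # magic-pdf -p demo.pdf -o output -m auto
```

To actually run the extraction, pass the list to subprocess.run(cmd, check=True). Building the command as a list avoids shell quoting problems with file names that contain spaces.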


Comparison with other tools


MinerU

Advantages:

  • Supports multiple input types and offers high-precision, model-based PDF parsing.
  • Automatically identifies and removes non-content elements such as headers, footers, footnotes and page numbers, leaving clean document content.
  • Converts formulas to LaTeX, which suits academic and technical documents.
  • Supports CPU and GPU acceleration and runs on Windows, Linux and macOS.

Disadvantages:

  • Demands substantial GPU resources and is relatively complex to configure.
  • Table processing is slow.


Marker

Marker is a lightweight open source PDF-to-Markdown tool with some OCR capability, well suited to basic document processing tasks. It is fast, but its ability to parse complex documents is limited.

Advantages:

  • Open source and free, with fast processing (4 times faster than similar tools).
  • Suitable for rapid deployment by users with a technical background.

Disadvantages:

  • Cannot parse complex layouts well.
  • Relies on local GPU resources.


Docling

Docling adopts a modular design, supports multi-format document parsing, and can be integrated with AI frameworks, which makes it suitable for enterprise-level contract and report automation. However, some functions depend on commercial models, and it requires a CUDA environment.

Advantages:

  • Compatible with the IBM ecosystem and supports mixed processing of multiple formats.
  • Modular design allows for easy integration.

Disadvantages:

  • Requires a CUDA environment.
  • Some features depend on commercial models.


Markitdown

Markitdown is open-sourced by Microsoft and supports conversion of multiple formats and AI-enhanced processing, making it suitable for multi-format content creation. However, some functions rely on the OpenAI API, and some format conversions may lose structure.

Advantages:

  • The most comprehensive format support, and developer-friendly (Python API/CLI).

Disadvantages:

  • Relies on an external API, and some functions require a paid model.


OmniParse

OmniParse may produce character errors when processing PDF documents, especially formulas, but it provides a web interface for easy operation and can process many types of files.

Advantages:

  • Supports processing of multiple types of files.
  • Provides a web interface for simple operation.

Disadvantages:

  • Character errors may occur when processing PDF documents.


Llamaparse

Llamaparse is designed for RAG, supports complex PDF parsing, and can generate knowledge graphs, which makes it suitable for legal and technical document analysis. However, it is slow and requires an API key.

Advantages:

  • High parsing accuracy, with semantic optimization of semi-structured data.

Disadvantages:

  • Slow processing speed.
  • The free quota is limited and an API key is required.


On the whole, MinerU excels at multimodal content processing and at formula recognition and conversion, which makes it particularly suitable for academic literature and technical documents that contain many mathematical formulas. It also supports GPU acceleration, so it is more efficient on large-scale document sets. In contrast, Marker is better suited to simple PDF documents, while Docling leans toward enterprise-level applications, especially when integration with AI frameworks is required. Each tool has its own application scenarios and target users, and the right choice depends on your specific needs. For example, if you need to process a large number of simple documents quickly, consider Marker; for projects that require high-precision parsing and specialized features, MinerU may be the better choice.


Project Application



MinerU was born during the pre-training of Shusheng-Puyu (InternLM). It focuses on symbol conversion in scientific and technical literature and is mainly used for data cleaning: cleaning pre-training data for large models and cleaning unstructured data for RAG applications.

MinerU project list:

https://github.com/opendatalab/MinerU/blob/master/projects/README_zh-CN.md

Medical AI Assistant: https://github.com/PancrePal-xiaoyibao/MinerU-xyb

MinerU × CAMEL-AI: One-click PDF extraction, facilitating multi-agent cross-document collaboration and in-depth analysis


Summary



MinerU is a powerful intelligent data extraction tool that processes complex PDF documents efficiently and accurately, extracts text, formulas, tables, images and other elements, and converts them into structured Markdown or JSON. Its open source license, multi-platform support, multiple usage methods and strong performance make it a good choice for large-model corpus preparation and automated document processing.

MinerU's technical architecture integrates advanced document parsing models covering layout detection, formula detection, formula recognition, OCR and table recognition, which supports accurate extraction from a wide range of complex PDF documents. Whether the input is an academic paper, a textbook, a legal contract, a medical report, an annual report or an engineering drawing, MinerU can provide high-quality extraction results.

MinerU still has problems with certain types of PDF documents, but its overall performance is already very good, and it has shown great potential in application scenarios such as RAG. For enterprises and research institutions that need to process large numbers of PDF documents, MinerU is a tool worth considering.