Markify: An open source document parsing tool optimized for LLMs that makes PDF parsing easy

Meet Markify, a new open source document parsing tool that helps you handle PDFs with ease!
Core content:
1. Markify: A PDF parsing tool that combines the advantages of Microsoft Markitdown and MinerU
2. Supports unified conversion of multiple file formats to Markdown, efficient and accurate PDF parsing
3. Seamless integration with LlamaIndex, plus a quick installation and testing guide
Whether in RAG applications or the currently popular Deep Research applications, parsing files in many formats is a persistent challenge, and PDF is the hardest case. Because PDFs have complex structures and widely varying layouts, parsing quality differs considerably from tool to tool. Although many PDF parsing tools are available, few offer a high-quality, unified solution. We previously evaluated existing tools in detail in "Solving the PDF Parsing Problem: The Best Choice for Efficient Parsing of Complex PDFs in RAG" and "Microsoft's open source Markitdown can convert any file to Markdown format. How about PDF parsing?" Although markitdown [1] solves the problem of converting many formats to Markdown, its PDF parsing still falls short.
In 2024, a new PDF parsing tool, MinerU [2], made its debut. It has since received 27.7K stars on GitHub and quickly became a standout in the document processing field. MinerU is a powerful open source document data extraction tool developed in China that focuses on converting complex documents such as PDFs into machine-readable formats, which makes it well suited to academic research, technical writing, and large model training. However, it is released under the AGPL v3 license, a strong copyleft license: integrating MinerU directly can require the whole project to be open-sourced under the same terms, which is often unacceptable for commercial projects.
To solve this problem, I have officially launched Markify [3], a tool that combines the advantages of Microsoft Markitdown and MinerU. Markify converts PDF, Word, PPT, Excel, images, audio, web pages, CSV, JSON, XML, and even ZIP archives into Markdown, and it uses MinerU for efficient, accurate PDF parsing. Because Markify exposes this functionality as an HTTP service, client projects call it over the network rather than linking AGPL code directly, which avoids the AGPL copyleft problem and lets Markify be integrated into a wide range of projects.
This article first introduces Markify's features and conversion results, then explains how to integrate it with LlamaIndex, and finally provides a quick installation and testing guide.
1. Introduction to Markify
Markify provides a unified parsing framework for multiple file formats. For PDF parsing in particular, it offers three modes to suit different scenarios:
Fast mode (simple): based on pdfminer (the built-in PDF parser of markitdown), it focuses on efficient text extraction and suits scenarios where only plain text is needed and layout fidelity matters little.
Advanced mode (advanced): built on MinerU's deep analysis, it not only extracts text accurately but also recognizes and converts complex tables and images, and automatically references the extracted images by URL in the Markdown output.
Cloud mode: still under development; it is intended to provide cloud-based parsing in the future.
2. Conversion effect display
In each comparison, the left side shows the original PDF and the right side shows the Markdown preview after conversion.
2.1 Overall conversion effect
When converting the recently popular PIKE-RAG paper [4], Markify extracted the text accurately, and the overall layout of the output is clear and easy to read.
2.2 Table extraction effect
Markify accurately identifies complex tables in the document and converts them into Markdown tables that render well.
2.3 Image extraction effect
For images, Markify uploads each image to the server and embeds a link to it in the Markdown, so that text and images appear together in the converted document.
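As an illustration of what referencing uploaded images by URL can look like, the sketch below rewrites relative image paths in a Markdown string into absolute URLs. This is not Markify's actual implementation: the function name `rewrite_image_links` and the base URL are hypothetical, and the pattern only covers simple `![alt](path)` references.

```python
import re
from urllib.parse import urljoin

# Matches simple Markdown images: ![alt](target), where target has no spaces.
_IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_links(markdown: str, base_url: str) -> str:
    """Point relative image targets at base_url; leave absolute URLs untouched."""
    base = base_url.rstrip("/") + "/"

    def _replace(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if target.startswith(("http://", "https://")):
            return match.group(0)  # already a network reference
        return f"![{alt}]({urljoin(base, target)})"

    return _IMAGE_PATTERN.sub(_replace, markdown)
```

Calling `rewrite_image_links("![fig](images/a.png)", "https://example.com/files")` turns the local path into a link under the (made-up) `https://example.com/files/` prefix, so the Markdown can be rendered anywhere the server is reachable.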
These cases show how well Markify handles PDF parsing. Text, tables, and images are all converted into high-quality Markdown, which provides a solid foundation for downstream model processing.
3. Seamless integration with LlamaIndex
To further simplify data preprocessing for LLMs, Markify also supports integration with LlamaIndex. LlamaIndex defines the BaseReader interface, and you only need to implement this interface to plug in a custom file parser. The following example shows a custom MyFileLoader that loads PDF files into LlamaIndex through the Markify API, just as you would with LlamaParse:
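# Import path for llama-index 0.10 or later
from llama_index.core.readers.base import BaseReader

class MyFileLoader(BaseReader):
    def __init__(self, conversion_service_url, poll_interval=5, timeout=300, mode='advanced'):
        ...
        # Base URL of the Markify HTTP service, without a trailing slash
        self.service_url = conversion_service_url.rstrip('/')
        self.poll_interval = poll_interval
        self.timeout = timeout
        self.mode = mode

    ...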
In practice, you only need to register MyFileLoader for the .pdf extension; files in other formats can also be handled by Markify:
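# Import path for llama-index 0.10 or later
from llama_index.core import SimpleDirectoryReader

# settings and file_path are defined elsewhere in the application
pdf_loader = MyFileLoader(
    conversion_service_url=settings.markify_api_base,
    poll_interval=5,
    timeout=settings.markify_api_timeout
)
documents = SimpleDirectoryReader(input_files=[file_path], file_extractor={
    ".pdf": pdf_loader,
}).load_data()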
In this way, Markify integrates seamlessly with LlamaIndex and is just as efficient and stable as LlamaParse. I have put the complete MyFileLoader implementation in the comment section, and interested readers are welcome to join in to learn more details.
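The `poll_interval` and `timeout` constructor parameters suggest that the loader submits a file and then polls the service until the conversion finishes. The sketch below shows such a loop under that assumption. It is not the author's implementation: `poll_until_done`, the `get_status` callable, and the status strings "completed" and "failed" are hypothetical stand-ins for whatever the real HTTP calls return.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    poll_interval: float = 5,
    timeout: float = 300,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until it reports a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):  # hypothetical terminal states
            return status
        # Give up if the next poll would land after the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"conversion still '{status}' after {timeout}s")
        sleep(poll_interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in a loader, `get_status` would wrap the HTTP request that queries the task status.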
4. Installation and Usage Guide
To make Markify easier to integrate, we provide an HTTP API service built on FastAPI. Because clients reach Markify only through HTTP calls, the AGPL copyleft obligation does not extend to them, and internal projects do not need to be open-sourced.
4.1 Installation
First clone the source code:
git clone https://github.com/KylinMountain/markify
Enter the project directory, create and activate a conda environment, and install the dependencies:
cd markify
conda create --name markify python=3.10
conda activate markify
pip install -r requirements.txt
4.2 Start API Service
When Markify starts for the first time, it automatically downloads the MinerU model files from ModelScope (if the download is slow, you can set the environment variable MINERU_USE_MODELSCOPE=false to download from HuggingFace instead):
uvicorn main:app --reload --port 20926
After startup, open http://localhost:20926/docs in a browser to view the API documentation. The API supports uploading documents, querying task status, and downloading files.
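FastAPI applications also publish a machine-readable OpenAPI schema, by default at /openapi.json, alongside the /docs page. Assuming Markify keeps that default, the following sketch lists the available endpoints from the schema rather than guessing their paths. The helper names `list_endpoints` and `fetch_schema` are illustrative and not part of Markify.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(schema: dict) -> list[tuple[str, str, str]]:
    """Return (METHOD, path, summary) for every operation in an OpenAPI schema."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method, details in operations.items():
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                endpoints.append((method.upper(), path, details.get("summary", "")))
    return sorted(endpoints, key=lambda e: (e[1], e[0]))


def fetch_schema(base_url: str = "http://localhost:20926") -> dict:
    """Download the schema; assumes FastAPI's default /openapi.json location."""
    with urlopen(f"{base_url.rstrip('/')}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the service running, `for method, path, summary in list_endpoints(fetch_schema()): print(method, path, summary)` prints one line per endpoint, which is a quick way to confirm the upload, status, and download routes before writing a client.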
4.3 Start the Streamlit Client
Run the following command to start the Streamlit client, then open http://localhost:8501/ in a browser to start converting files:
streamlit run ./client/streamlit_client.p
In the Streamlit client, choose one of the three PDF processing modes described above. The list of conversions is displayed on the right, and once a conversion is complete you can download the resulting Markdown document.
5. Conclusion
Markify combines the advantages of Markitdown and MinerU to provide a unified, high-quality file parsing solution, with particular strength in PDF parsing. Whether you need text, table, or image extraction, Markify covers a wide range of scenarios. In addition, because the API service is built on FastAPI, you can integrate it into existing projects through HTTP calls, avoid the AGPL copyleft problem, and connect it to LLM frameworks such as LlamaIndex.
In short, Markify offers a new solution and higher parsing efficiency for RAG applications and document preprocessing. We hope you will try this open source tool, help develop it, and contribute to open source!