MarkItDown: a tool that easily converts multiple file formats to Markdown, with 45K+ GitHub stars

Explore MarkItDown, Microsoft's open-source document conversion tool, which converts a wide range of file formats to Markdown.
Core content:
1. The importance of Markdown in LLM and RAG applications
2. MarkItDown's functional features and supported formats
3. MarkItDown's integration with large language models and configuration options
With the widespread adoption of LLM (large language model) applications, Markdown has become the preferred document format for LLM and RAG (retrieval-augmented generation) pipelines, for two main reasons:
First, Markdown is a lightweight markup language that is concise and easy to read and write, which makes it a good choice for writing and storing documents, especially documents that will be processed by an LLM or indexed for RAG.
Second, Markdown's structure makes text processing more efficient. For example, before a document is vectorized, the Markdown file is typically split into segments along its heading levels. This structured segmentation preserves the context and structure of the text, which helps improve both the quality of the vectorization and RAG retrieval.
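The heading-based segmentation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method of any particular RAG framework; the names `Chunk` and `split_by_headings` are hypothetical, and only ATX-style headings (`#` to `######`) are handled.

```python
import re
from dataclasses import dataclass

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # code-fence marker; headings inside fenced code are ignored


@dataclass
class Chunk:
    path: list  # heading trail, e.g. ["Guide", "Install"]
    text: str   # body text found directly under the innermost heading


def split_by_headings(markdown):
    """Split Markdown into one chunk per heading, keeping the heading trail."""
    chunks = []
    trail = []   # (level, title) pairs for the currently open headings
    body = []    # lines collected since the last heading
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in trail], text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()  # the collected body belongs to the previous heading trail
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading trail, so the trail can be prepended to the chunk text before embedding to keep the surrounding context attached to every segment.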
MarkItDown is a lightweight Python utility that converts various files to Markdown for use with large language models (LLMs) and related text analysis pipelines. Compared with traditional text extraction tools, MarkItDown focuses on preserving important document structure and content, such as headings, lists, tables, and links. Although its output is mainly intended for text analysis tools, it is also a handy tool for anyone who needs to quickly convert multiple file formats to Markdown.
GitHub address: https://github.com/microsoft/markitdown
1. Multi-format support
MarkItDown supports a wide variety of file formats, covering common office and multimedia file types, including:
PDF
PowerPoint
Word
Excel
Images (EXIF metadata and OCR)
Audio (EXIF metadata and speech transcription)
HTML
Text-based formats (CSV, JSON, XML)
ZIP files
YouTube URLs
EPUB
2. Flexible configuration options
Optional dependencies: MarkItDown groups its dependencies by format, so users can install only what they need. For example,
pip install 'markitdown[pdf,docx,pptx]'
installs only the dependencies for PDF, DOCX, and PPTX files.
Plugin support: MarkItDown supports third-party plugins that extend its functionality. Plugins are disabled by default and can be enabled on the command line with the --use-plugins flag:
markitdown --use-plugins path-to-file.pdf
3. Integration with large language models
MarkItDown can integrate with large language models such as GPT-4o to generate richer descriptive output, such as descriptions of images. To enable this feature, pass the llm_client and llm_model parameters.
4. Other Features
Docker support: MarkItDown provides a Dockerfile, so users can build an image and run the tool in a container.
Command line and Python API: MarkItDown provides both a command line tool and a Python API, so users can choose whichever fits their needs.
02 — MarkItDown Installation
Install from PyPI:
pip install 'markitdown[all]'
Or install from source:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
1. Command line usage
Basic usage: convert a file to Markdown and print the result to the console.
markitdown path-to-file.pdf
Specify output file: use the -o option to write the result to a file.
markitdown path-to-file.pdf -o document.md
Pipeline input: pass the file contents to markitdown through a pipe.
cat path-to-file.pdf | markitdown
2. Python API Usage
Basic usage: use MarkItDown to convert a file from Python.
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)  # Set to True to enable plugins
result = md.convert("test.xlsx")
print(result.text_content)
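Building on the basic usage above, a whole folder can be converted in one pass. The sketch below is an illustration under stated assumptions rather than part of MarkItDown itself: `convert_folder` and `SUFFIXES` are hypothetical names, the extension list is an arbitrary subset, and the converter is passed in as an argument so that any object with a `convert(path)` method returning a result with `text_content` will work.

```python
from pathlib import Path

# Assumed subset of extensions to pick up; adjust to the formats you installed.
SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv", ".json", ".xml"}


def convert_folder(converter, src_dir, out_dir, suffixes=SUFFIXES):
    """Convert every matching file in src_dir and write <filename>.md to out_dir.

    `converter` is any object with a convert(path) method whose result exposes
    a text_content attribute, for example a MarkItDown instance.
    Returns (written_paths, failures), where failures holds (name, error) pairs.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            result = converter.convert(str(path))
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((path.name, str(exc)))
            continue
        # Keep the original extension in the name so a.pdf and a.docx do not collide.
        target = out / (path.name + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written, failed
```

With MarkItDown installed, a call such as `convert_folder(MarkItDown(enable_plugins=False), "docs", "docs_md")` would write one Markdown file per matching input; the folder names here are placeholders.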
Integration with large language models: generate image descriptions with GPT-4o.
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
3. Docker Usage
Build the Docker image:
docker build -t markitdown:latest .
Run the Docker container:
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
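The same Docker invocation can be driven from Python when the conversion is one step in a larger script. This is a minimal sketch that assumes the image has already been built as `markitdown:latest` and that the `docker` executable is on the PATH; the helper names `docker_command` and `convert_with_docker` are hypothetical.

```python
import subprocess
from pathlib import Path


def docker_command(image="markitdown:latest"):
    """Argument list equivalent to: docker run --rm -i <image>."""
    return ["docker", "run", "--rm", "-i", image]


def convert_with_docker(input_path, output_path, image="markitdown:latest"):
    """Stream input_path to the container's stdin and save its stdout as Markdown."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        # check=True raises CalledProcessError if the container exits non-zero.
        subprocess.run(docker_command(image), stdin=src, stdout=dst, check=True)
    return Path(output_path)
```

Passing the open files as `stdin` and `stdout` mirrors the shell redirections `<` and `>` in the command above, so the file never has to be mounted into the container.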