Are scanned PDFs too painful? pdf-craft converts them to Markdown or EPUB in seconds and automatically generates the table of contents, notes, and aligned citations

Written by

Silas Grey

Updated on: June 28, 2025


PDF-Craft in Action

Convert PDF to Markdown

from pdf_craft import PDFPageExtractor, MarkDownWriter

# Placeholder: path where the generated Markdown file should be written
markdown_path = "/path/to/markdown/file"

extractor = PDFPageExtractor(
  device="cpu",  # To use CUDA, change this to a value such as device="cuda:0"
  model_dir_path="/path/to/model/dir/path",  # Folder where the AI models are downloaded and installed
)
with MarkDownWriter(markdown_path, "images", "utf-8") as md:
  for block in extractor.extract(pdf="/path/to/pdf/file"):
    md.write(block)

If the original PDF contains figures (or tables or formulas), an assets directory is created at the same level as the *.md file to hold the saved images.

The images in the assets directory are referenced from the Markdown file by relative paths.
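
To make the relative-path point concrete, here is a small standard-library sketch of how a link relative to the Markdown file can be computed. It is not pdf-craft code: the function name relative_image_link and the example paths are hypothetical.

```python
import posixpath
from pathlib import PurePosixPath


def relative_image_link(markdown_path: str, image_path: str, alt: str = "") -> str:
    """Build a Markdown image reference relative to the Markdown file's folder."""
    md_dir = PurePosixPath(markdown_path).parent
    # relpath expresses image_path as seen from the folder holding the .md file
    rel = posixpath.relpath(image_path, start=str(md_dir))
    return f"![{alt}]({rel})"


link = relative_image_link(
    "/books/demo/book.md",          # hypothetical Markdown output
    "/books/demo/assets/fig1.png",  # hypothetical saved image
    "Figure 1",
)
print(link)  # ![Figure 1](assets/fig1.png)
```

Because the link is relative, the Markdown file and its image folder can be moved together without breaking the references.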

Convert PDF to EPUB
First, create a PDF extractor object:

from pdf_craft import PDFPageExtractor

extractor = PDFPageExtractor(
  device="cpu",  # To use CUDA, change this to a value such as device="cuda:0"
  model_dir_path="/path/to/model/dir/path",  # Folder where the AI models are downloaded and installed
)

Then send the extracted content to an LLM for analysis, which produces the material the EPUB file is built from:

from pdf_craft import analyse
from pdf_craft import LLM

llm = LLM(
  key="sk-XXXXX",  # API key provided by the LLM vendor
  url="https://api.deepseek.com",  # URL provided by the LLM vendor
  model="deepseek-chat",  # Model provided by the LLM vendor
  token_encoding="o200k_base",  # Local encoding used to estimate token counts (unrelated to the LLM; if unsure, keep "o200k_base")
)

analyse(
  llm=llm,  # LLM configuration prepared in the previous step
  pdf_page_extractor=extractor,  # The PDFPageExtractor object created above
  pdf_path="/path/to/pdf/file",  # PDF file path
  analysing_dir_path="/path/to/analysing/dir",  # Folder for intermediate analysis state
  output_dir_path="/path/to/output/files",  # The analysis results are written to this folder
)

output_dir_path is the folder where the results of scanning and analysis are saved (there will be multiple files).
analysing_dir_path is used to store intermediate state during the analysis.
After the analysis is complete, the output_dir_path folder is passed as a parameter to the EPUB generation step.
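
The article stops before showing that final call. The sketch below is a guess at what it probably looks like: generate_epub_file and its from_dir_path and epub_file_path parameters are assumptions based on the pdf-craft README for releases that expose analyse, so check them against your installed version. The wrapper build_epub and the paths in the usage comment are hypothetical.

```python
def build_epub(output_dir_path: str, epub_file_path: str) -> None:
    """Turn the analysis results stored in output_dir_path into an EPUB file."""
    # Imported lazily so this module still loads where pdf-craft is not installed.
    # ASSUMPTION: pdf_craft exposes generate_epub_file(from_dir_path=..., epub_file_path=...)
    from pdf_craft import generate_epub_file

    generate_epub_file(
        from_dir_path=output_dir_path,  # folder written by analyse(...)
        epub_file_path=epub_file_path,  # where the EPUB should be saved
    )


# Usage (requires pdf-craft and a finished analyse(...) run):
# build_epub("/path/to/output/files", "/path/to/output/epub")
```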

PDF-Craft main functions:

Converts PDF to Markdown using local AI models, with no internet connection required
Converts PDF to the structured EPUB e-book format
Intelligently identifies and filters out interfering elements such as headers, footers, footnotes, and page numbers
Automatically processes charts and formulas and keeps them in the converted file as images
Uses an LLM to build the book structure and generate an EPUB with a table of contents and chapters

PDF-Craft conversion logic

First, split the PDF pages into images

Second, use DocLayout-YOLO to identify the block elements in each image, including headers, footers, paragraphs, titles, pictures, tables, charts, and page numbers

Then, use layoutreader to sort the blocks

Next, use OnnxOCR to recognize the text in the block

Finally, the text recognized by OCR is sent to DeepSeek, which reconstructs the structure of the book from specific information (such as the table of contents) and generates an EPUB file with a table of contents and chapters.
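
The local part of these steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not pdf-craft's implementation: the blocks arrive already labeled and with text (standing in for the DocLayout-YOLO and OnnxOCR outputs), sort_blocks uses a naive top-to-bottom rule instead of layoutreader, and every name here (Block, NOISE_KINDS, drop_noise, sort_blocks, page_to_text) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Block kinds treated as page furniture and dropped (hypothetical labels).
NOISE_KINDS = {"header", "footer", "page_number"}


@dataclass
class Block:
    kind: str                       # e.g. "title", "paragraph", "header"
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    text: str = ""                  # would be filled in by the OCR stage


def drop_noise(blocks: list[Block]) -> list[Block]:
    """Remove headers, footers and page numbers from the detected blocks."""
    return [b for b in blocks if b.kind not in NOISE_KINDS]


def sort_blocks(blocks: list[Block]) -> list[Block]:
    """Naive reading order: top to bottom, then left to right.

    Only adequate for single-column pages; a real reading-order
    model handles multi-column layouts.
    """
    return sorted(blocks, key=lambda b: (b.box[1], b.box[0]))


def page_to_text(blocks: list[Block]) -> str:
    """Filter, order, and join the text of one page."""
    ordered = sort_blocks(drop_noise(blocks))
    return "\n\n".join(b.text for b in ordered if b.text)


page = [
    Block("footer", (0, 950, 600, 980), "12"),
    Block("paragraph", (50, 200, 550, 400), "Body text."),
    Block("title", (50, 80, 550, 120), "Chapter 1"),
    Block("header", (0, 10, 600, 40), "Book Title"),
]
print(page_to_text(page))
```

The header and footer are discarded, and the title is emitted before the paragraph because its box sits higher on the page.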

During this parsing and building process, the LLM also reads the notes and citations on each page, which are then presented in a new format in the EPUB file.
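
To give a feel for what aligning citations with their notes involves, here is a minimal sketch that links [n] markers in body text to note entries. It swaps in a regular expression for the LLM, so it only handles markers that are already clean. The function name align_citations, the HTML shape, and the sample text are all hypothetical.

```python
from __future__ import annotations

import re


def align_citations(body: str, notes: dict[int, str]) -> tuple[str, list[str]]:
    """Link [n] markers in the body to their notes.

    Returns the rewritten body and the notes actually cited, in order of
    first appearance. Markers with no matching note are left untouched.
    """
    cited: list[int] = []

    def link(match: re.Match) -> str:
        number = int(match.group(1))
        if number not in notes:
            return match.group(0)
        if number not in cited:
            cited.append(number)
        return f'<a href="#note-{number}">[{number}]</a>'

    rewritten = re.sub(r"\[(\d+)\]", link, body)
    endnotes = [f'<p id="note-{n}">[{n}] {notes[n]}</p>' for n in cited]
    return rewritten, endnotes


body, endnotes = align_citations(
    "See the proof [2] and the remark [9].",
    {1: "Unused example note.", 2: "Example note text."},
)
print(body)
print(endnotes)
```

Marker [2] becomes a link to its note, the unmatched [9] is left alone, and the uncited note 1 is omitted from the endnotes.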