Scanned PDFs too painful? pdf-craft converts them to Markdown/EPUB in seconds and automatically generates the table of contents, notes, and aligned citations

PDF-Craft converts PDF files to Markdown or EPUB with one click, automatically generating the table of contents, notes, and aligned citations.
Core content:
1. Introduction and installation of PDF-Craft
2. Convert PDF files to Markdown format
3. Convert PDF files to EPUB format
PDF-Craft in Action
Convert PDF to Markdown
from pdf_craft import PDFPageExtractor, MarkDownWriter

extractor = PDFPageExtractor(
    device="cpu",  # To use CUDA, change this to the form device="cuda:0".
    model_dir_path="/path/to/model/dir/path",  # Folder where the AI models are downloaded and installed
)

markdown_path = "/path/to/markdown/file"  # Placeholder: path of the Markdown file to write

with MarkDownWriter(markdown_path, "images", "utf-8") as md:
    for block in extractor.extract(pdf="/path/to/pdf/file"):
        md.write(block)
After the run completes, a *.md file is written to the specified path. If the original PDF contains figures (or tables, formulas), an assets directory is created at the same level as the *.md file to hold the saved images.
The images in the assets directory are referenced in the Markdown file by relative paths.
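Because the Markdown file refers to its images by relative path, a quick sanity check after conversion is to confirm that every referenced image actually exists. The sketch below uses only the standard library. The function name find_missing_images and the regular expression for the ![alt](path) syntax are illustrative assumptions and are not part of pdf-craft.

```python
import re
from pathlib import Path

# Matches Markdown image syntax ![alt](target); an optional "title" after the target is ignored.
IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)')


def find_missing_images(markdown_path: str) -> list[str]:
    """Return relative image targets in the Markdown file that do not exist on disk."""
    md_file = Path(markdown_path)
    text = md_file.read_text(encoding="utf-8")
    missing = []
    for target in IMAGE_LINK.findall(text):
        # Skip absolute URLs; only relative paths are resolved against the .md file's folder.
        if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", target):
            continue
        if not (md_file.parent / target).is_file():
            missing.append(target)
    return missing
```

Calling find_missing_images on the generated Markdown file returns an empty list when every relative image link resolves.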
Convert PDF to EPUB
First, create a PDF extraction object
from pdf_craft import PDFPageExtractor

extractor = PDFPageExtractor(
    device="cpu",  # To use CUDA, change this to the form device="cuda:0".
    model_dir_path="/path/to/model/dir/path",  # Folder where the AI models are downloaded and installed
)
Next, send the extracted content to the LLM for analysis, which produces the data used to generate the EPUB file
from pdf_craft import analyse, LLM

llm = LLM(
    key="sk-XXXXX",  # Key provided by the LLM vendor
    url="https://api.DeepSeek.com",  # URL provided by the LLM vendor
    model="deepseek-chat",  # Model provided by the LLM vendor
    token_encoding="o200k_base",  # Local encoding name used for token estimation (unrelated to the LLM; if unsure, keep "o200k_base")
)

analyse(
    llm=llm,  # LLM configuration prepared above
    pdf_page_extractor=extractor,  # The PDFPageExtractor object created in the previous step
    pdf_path="/path/to/pdf/file",  # PDF file path
    analysing_dir_path="/path/to/analysing/dir",  # Folder for intermediate analysis state
    output_dir_path="/path/to/output/files",  # The analysis results are written to this folder
)
Note the two folder paths in the code above. output_dir_path is the folder where the results of the scan and analysis (there will be multiple files) are saved. analysing_dir_path is used to store intermediate states during the analysis process.
After the analysis is completed, the output_dir_path folder is passed as a parameter to the EPUB generation step, which produces the EPUB file.
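The call that consumes output_dir_path is not shown in this article. In the pdf-craft releases that expose analyse, it appears to be generate_epub_file(from_dir_path=..., epub_file_path=...); treat that function name and both parameter names as assumptions and check them against the version you have installed. The wrapper build_epub is purely illustrative: it verifies that the analysis output exists before importing pdf_craft.

```python
from pathlib import Path


def build_epub(output_dir_path: str, epub_file_path: str) -> Path:
    """Generate an EPUB from the analysis results (illustrative wrapper)."""
    results_dir = Path(output_dir_path)
    # Fail early if analyse() has not produced any files yet.
    if not results_dir.is_dir() or not any(results_dir.iterdir()):
        raise FileNotFoundError(f"no analysis results found in {results_dir}")

    # ASSUMPTION: function and parameter names as in older pdf-craft releases;
    # verify them against the version you have installed.
    from pdf_craft import generate_epub_file

    generate_epub_file(
        from_dir_path=str(results_dir),
        epub_file_path=epub_file_path,
    )
    return Path(epub_file_path)
```

A typical call would be build_epub("/path/to/output/files", "/path/to/output/epub"), reusing the same output_dir_path that was given to analyse.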
PDF-Craft main functions:
Convert PDF to Markdown using local AI models, without an internet connection.
Convert PDF to the structured EPUB e-book format.
Intelligently identify and filter interference elements such as headers, footers, footnotes, and page numbers.
Automatically process charts and formulas and keep them in the converted file as images.
Combine LLM technology to build the book structure and generate an EPUB with a table of contents and chapters.
PDF-Craft conversion logic
First, split the PDF pages into images
Second, use DocLayout-YOLO to identify the block elements in each image, including headers, footers, paragraphs, titles, pictures, tables, charts, and page numbers
Then, use layoutreader to sort the blocks
Next, use OnnxOCR to recognize the text in the block
Finally, the text recognized by OCR is sent to DeepSeek, which uses specific information (such as the table of contents) to construct the structure of the book and generate an EPUB file with a table of contents and chapters.
During this parsing and building process, the notes and citations on each page are read by the LLM and then presented in a new format in the EPUB file.
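The layout steps above can be sketched in plain Python: each detected block carries a kind, a reading-order rank, and its OCR text; interference elements are dropped and the rest is joined in reading order. Everything here (the Block dataclass, the kind labels, assemble_page) is a hypothetical stand-in for illustration and does not reflect pdf-craft's internal data model. In the real pipeline the kinds come from DocLayout-YOLO, the order from layoutreader, and the text from OnnxOCR.

```python
from dataclasses import dataclass

# Hypothetical labels for blocks that should not appear in the body text.
INTERFERENCE_KINDS = {"header", "footer", "page_number"}


@dataclass
class Block:
    kind: str   # e.g. "title", "paragraph", "header" (from layout detection)
    order: int  # reading-order rank (from the sorting step)
    text: str   # recognized text (from OCR)


def assemble_page(blocks: list[Block]) -> str:
    """Drop interference blocks and join the rest in reading order."""
    kept = [b for b in blocks if b.kind not in INTERFERENCE_KINDS]
    kept.sort(key=lambda b: b.order)
    return "\n\n".join(b.text for b in kept)


page = [
    Block("page_number", 3, "12"),
    Block("paragraph", 2, "Body text."),
    Block("header", 0, "Book Title"),
    Block("title", 1, "Chapter 1"),
]
print(assemble_page(page))
```

Keeping the filter as a set of kind labels makes it easy to adjust which elements count as interference without touching the ordering logic.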