PDF-Craft increases your document processing efficiency by 300%

Written by

Silas Grey

Updated on: July 9th, 2025

In our daily work we often need to process PDF documents, especially scanned books and papers. Today we look at PDF-Craft, a powerful open-source tool that converts PDF files to Markdown or EPUB and can intelligently identify chapters, annotations, and references.

Tool Features

Support reading PDF files page by page
Extract text using DocLayout-YOLO combined with a custom algorithm
Smart filtering of headers, footers, footnotes and page numbers
Joins text that continues across page boundaries
Text recognition using OnnxOCR
Support local GPU acceleration
Optional integration of LLM services for more advanced processing

Environment requirements

Python 3.10 or above (3.10.16 recommended)
Optional: CUDA environment (for GPU acceleration)
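The requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library. The helper names `python_is_supported` and `pick_device` are illustrative and not part of PDF-Craft, and probing CUDA through PyTorch is an assumption about your environment. If PyTorch is not installed, the sketch falls back to "cpu".

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def pick_device() -> str:
    """Return "cuda:0" if PyTorch is installed and sees a GPU, else "cpu".

    Assumption: CUDA availability is probed through PyTorch. This helper is
    hypothetical and only produces a device string for the extractor.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch  # imported lazily so the check also works without PyTorch

    return "cuda:0" if torch.cuda.is_available() else "cpu"
```

The string returned by `pick_device()` can be passed as the `device` argument shown in the conversion examples.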

Practical steps

1. Basic Installation

pip install pdf-craft
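To confirm that the package landed in the active environment, the installed version can be looked up with the standard library. This is a small sketch. The helper name `installed_version` is illustrative, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "pdf-craft"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

Calling `installed_version()` returns `None` when the install did not succeed in the current interpreter, which is a common symptom of mixing virtual environments.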

2. PDF to Markdown conversion practice

This is the most basic function. It does not call any remote LLM service and runs entirely on local computing power. The required models are downloaded the first time they are needed. Illustrations, tables, and formulas in the document are inserted into the Markdown file as screenshots.

from pdf_craft import PDFPageExtractor, MarkDownWriter

# Placeholder output path; replace with your own
markdown_path = "/path/to/output.md"

# Initialize the extractor
extractor = PDFPageExtractor(
    device="cpu",  # Change to "cuda:0" when using GPU
    model_dir_path="/path/to/model/dir/path",  # AI model storage directory
)

# Start conversion
with MarkDownWriter(markdown_path, "images", "utf-8") as md:
    for block in extractor.extract(pdf="/path/to/pdf/file"):
        md.write(block)

3. Advanced Practice of Converting PDF to EPUB

This function is more powerful, but it requires an LLM service.

Step 1: Configure PDF Extractor

from pdf_craft import PDFPageExtractor

extractor = PDFPageExtractor(
    device="cpu",  # Change to "cuda:0" when using GPU
    model_dir_path="/path/to/model/dir/path",
)

Step 2: Configure LLM service

from pdf_craft import LLM

llm = LLM(
    key="sk-XXXXX",  # the key provided by the LLM vendor
    url="https://api.deepseek.com",  # LLM API address
    model="deepseek-chat",  # model name
    token_encoding="o200k_base",
)
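Hardcoding the API key in a script makes it easy to leak. One option is to read the settings from environment variables, as in the sketch below. The variable names `PDF_CRAFT_LLM_KEY`, `PDF_CRAFT_LLM_URL`, and `PDF_CRAFT_LLM_MODEL` and the helper `load_llm_settings` are hypothetical and are not defined by PDF-Craft. The defaults simply mirror the values used in Step 2.

```python
import os


def load_llm_settings(env=os.environ) -> dict:
    """Collect LLM settings from environment variables.

    The variable names are hypothetical; pick whatever fits your deployment.
    """
    key = env.get("PDF_CRAFT_LLM_KEY")
    if not key:
        raise RuntimeError("Set PDF_CRAFT_LLM_KEY before running the conversion")
    return {
        "key": key,
        "url": env.get("PDF_CRAFT_LLM_URL", "https://api.deepseek.com"),
        "model": env.get("PDF_CRAFT_LLM_MODEL", "deepseek-chat"),
        "token_encoding": "o200k_base",
    }
```

The resulting dictionary uses the same keyword names as Step 2, so it could be passed on as `LLM(**load_llm_settings())`.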

Step 3: Perform PDF Analysis

from pdf_craft import analyse

analyse(
    llm=llm,
    pdf_page_extractor=extractor,  # the extractor created in Step 1
    pdf_path="/path/to/pdf/file",
    analysing_dir_path="/path/to/analysing/dir",  # intermediate results
    output_dir_path="/path/to/output/files",
)

Step 4: Generate EPUB File

from pdf_craft import generate_epub_file

generate_epub_file(
    from_dir_path="/path/to/output/files",  # the output_dir_path used in Step 3
    epub_file_path="/path/to/output/epub",
)

Operation and maintenance considerations

Model storage management

The required models will be automatically downloaded when running for the first time
It is recommended to pre-download the model and specify a fixed model directory
Note the disk space occupied by the model file
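To keep track of how much disk space the model directory takes, a short standard-library script is enough. This is a sketch. The helper names `dir_size_bytes` and `format_size` are illustrative, and the directory you pass in is whatever you configured as `model_dir_path`.

```python
from pathlib import Path


def dir_size_bytes(path) -> int:
    """Sum the sizes of all regular files below `path` (0 if it does not exist)."""
    root = Path(path)
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_size(num_bytes: int) -> str:
    """Render a byte count with a binary unit, capped at GiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

For example, `format_size(dir_size_bytes("/path/to/model/dir/path"))` gives a readable figure that can be logged after the first run.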

Interrupt recovery mechanism

Use the analysing_dir_path directory to resume an interrupted analysis
Remember to clear or delete the old analysis directory before starting a new task
It is recommended to implement a regular backup mechanism
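The points above can be sketched with the standard library: keep the analysis directory when resuming, wipe it when starting a new task, and archive it for backup. The helper names `prepare_analysis_dir` and `backup_dir` are hypothetical. The sketch only manages the directory and does not call PDF-Craft itself.

```python
import shutil
from pathlib import Path


def prepare_analysis_dir(path, fresh: bool) -> Path:
    """Return a ready-to-use analysis directory.

    fresh=True removes leftovers from a previous task;
    fresh=False keeps them so an interrupted run can resume.
    """
    root = Path(path)
    if fresh and root.exists():
        shutil.rmtree(root)
    root.mkdir(parents=True, exist_ok=True)
    return root


def backup_dir(path, backup_base) -> Path:
    """Archive `path` as <backup_base>.zip and return the archive path.

    Keep `backup_base` outside `path`, otherwise the archive would be
    written into the directory it is archiving.
    """
    archive = shutil.make_archive(str(backup_base), "zip", root_dir=str(path))
    return Path(archive)
```

A typical flow would call `prepare_analysis_dir(..., fresh=True)` once per new PDF and `fresh=False` on every retry of the same PDF.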

Performance optimization suggestions

Prioritize CUDA acceleration in environments with GPU
Plan batch jobs so that they do not exhaust system resources
Monitor CPU/GPU usage and adjust the number of concurrent tasks as appropriate
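One way to cap concurrency for batch jobs is a bounded worker pool. The sketch below is generic: `run_batch` is a hypothetical helper, and `convert` stands for any function you write that converts one PDF. Whether a single extractor instance can safely be shared between threads is not stated here, so a conservative setup creates the extractor inside `convert` or uses `max_workers=1` on a single GPU.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(pdf_paths, convert, max_workers: int = 2):
    """Run convert(path) for every path with at most `max_workers` in flight.

    Returns (results, errors): two dicts keyed by path, so one failing
    document does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # collect the failure and keep going
                errors[path] = exc
    return results, errors
```

Raising or lowering `max_workers` is then the single knob to adjust after watching CPU and GPU usage.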

PDF-Craft is a powerful PDF processing tool, especially suitable for converting scanned books. Through reasonable configuration and use, the efficiency of document processing can be greatly improved. It is recommended to select appropriate functional modules according to specific needs and hardware conditions during actual deployment.