PDF-Craft increases your document processing efficiency by 300%

Written by
Silas Grey
Updated on: July 9, 2025
Recommendation

PDF-Craft is a powerful tool to improve PDF document processing efficiency. It helps you convert formats easily.

Core content:
1. Introduction to PDF-Craft tool and its intelligent recognition function
2. Environment requirements and basic installation steps
3. Practical operation of PDF to Markdown and PDF to EPUB

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


In daily work we often have to process PDF documents, especially scanned books and papers. PDF-Craft is an open source tool that converts PDF files to Markdown or EPUB and can recognize chapters, annotations, and references along the way.

Tool Features 

  1. Reads PDF files page by page
  2. Extracts text using DocLayout-YOLO combined with a custom algorithm
  3. Filters out headers, footers, footnotes, and page numbers
  4. Joins text that continues across page boundaries
  5. Recognizes text with OnnxOCR
  6. Supports local GPU acceleration
  7. Optionally integrates an LLM service for more advanced processing

Environment requirements

  • Python 3.10 or above (3.10.16 recommended)
  • Optional: CUDA environment (for GPU acceleration)
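
Before installing, it can help to confirm that the interpreter meets the minimum version and to decide which device string to use. The following is a minimal sketch, not part of PDF-Craft: the helper names python_is_supported and pick_device are hypothetical, and probing CUDA through PyTorch is an assumption (if PyTorch is missing, the sketch simply falls back to "cpu").

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the stated minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def pick_device() -> str:
    """Return "cuda:0" when a CUDA device is visible, otherwise "cpu".

    Assumption: CUDA availability is probed through PyTorch. If PyTorch is
    not installed, or the probe fails for any reason, fall back to the CPU.
    """
    try:
        import torch  # optional dependency, only used for the probe

        return "cuda:0" if torch.cuda.is_available() else "cpu"
    except Exception:
        return "cpu"


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Device:", pick_device())
```

The string returned by pick_device() has the same form as the device values used in the examples in this article, so it could be passed as the device argument.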

Practical steps 

1. Basic Installation

pip install pdf-craft

2. PDF to Markdown conversion practice

This is the most basic function. It does not call a remote LLM service and runs entirely on local computing power. The required models are downloaded the first time they are needed. Illustrations, tables, and formulas in the document are inserted into the Markdown file as screenshots.

from pdf_craft import PDFPageExtractor, MarkDownWriter

# Output Markdown file (placeholder path)
markdown_path = "/path/to/output/markdown/file"

# Initialize the extractor
extractor = PDFPageExtractor(
    device="cpu",  # Change to "cuda:0" when using GPU
    model_dir_path="/path/to/model/dir/path",  # AI model storage directory
)

# Start conversion
with MarkDownWriter(markdown_path, "images", "utf-8") as md:
    for block in extractor.extract(pdf="/path/to/pdf/file"):
        md.write(block)
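
When several PDFs need converting, one option is to plan an output Markdown path per input file before running the conversion above on each of them. The sketch below uses only pathlib and does not call PDF-Craft; the helper name plan_markdown_outputs and the one-subfolder-per-book layout are assumptions made for illustration, chosen so that each book's screenshots can sit next to its own Markdown file.

```python
from pathlib import Path


def plan_markdown_outputs(pdf_paths, out_dir):
    """Map each PDF path to a Markdown path inside its own subfolder.

    Non-PDF paths are skipped. Nothing is read from or written to disk.
    """
    out_dir = Path(out_dir)
    plan = {}
    for pdf in map(Path, pdf_paths):
        if pdf.suffix.lower() != ".pdf":
            continue  # ignore files that are not PDFs
        plan[pdf] = out_dir / pdf.stem / f"{pdf.stem}.md"
    return plan


if __name__ == "__main__":
    for src, dst in plan_markdown_outputs(["books/a.pdf", "books/b.pdf"], "out").items():
        print(src, "->", dst)
```

Each planned path could then serve as the markdown_path for one conversion run, after its parent folder has been created.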

3. Advanced Practice of Converting PDF to EPUB

This conversion is more capable, but it requires an LLM service.

Step 1: Configure PDF Extractor

from pdf_craft import PDFPageExtractor

extractor = PDFPageExtractor(
    device="cpu",  # Change to "cuda:0" when using GPU
    model_dir_path="/path/to/model/dir/path",
)

Step 2: Configure LLM service

from pdf_craft import LLM

llm = LLM(
    key="sk-XXXXX",  # the key provided by the LLM vendor
    url="https://api.deepseek.com",  # LLM API address
    model="deepseek-chat",  # model name
    token_encoding="o200k_base",
)
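
Hard-coding an API key in a script makes it easy to leak. A minimal sketch of reading the same four settings from environment variables instead is shown below. The function name load_llm_settings and the PDF_CRAFT_LLM_* variable names are hypothetical and not defined by PDF-Craft; the defaults simply repeat the values from the example above.

```python
import os


def load_llm_settings(env=os.environ) -> dict:
    """Collect LLM settings from environment variables.

    The variable names are hypothetical; rename them to suit your setup.
    Raises RuntimeError when no API key is provided.
    """
    key = env.get("PDF_CRAFT_LLM_KEY")
    if not key:
        raise RuntimeError("PDF_CRAFT_LLM_KEY is not set")
    return {
        "key": key,
        "url": env.get("PDF_CRAFT_LLM_URL", "https://api.deepseek.com"),
        "model": env.get("PDF_CRAFT_LLM_MODEL", "deepseek-chat"),
        "token_encoding": env.get("PDF_CRAFT_LLM_ENCODING", "o200k_base"),
    }
```

The dictionary keys match the keyword arguments used in the LLM(...) call above, so the result could be unpacked as LLM(**load_llm_settings()).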

Step 3: Perform PDF Analysis

from pdf_craft import analyse

analyse(
    llm=llm,
    pdf_page_extractor=extractor,  # the extractor configured in Step 1
    pdf_path="/path/to/pdf/file",
    analysing_dir_path="/path/to/analysing/dir",  # intermediate state, used to resume
    output_dir_path="/path/to/output/files",
)

Step 4: Generate EPUB File

from pdf_craft import generate_epub_file

generate_epub_file(
    from_dir_path="/path/to/output/files",  # the output directory from Step 3
    epub_file_path="/path/to/output/epub",
)

Operation and maintenance considerations 

  1. Model storage management
    • The required models are downloaded automatically on the first run
    • It is recommended to pre-download the models and specify a fixed model directory
    • Keep an eye on the disk space the model files occupy
  2. Interrupted-run recovery
    • Use the analysing_dir_path directory to resume an interrupted analysis
    • Clear or delete the old analysis directory before starting a new task
    • Consider setting up regular backups
  3. Performance optimization
    • Prefer CUDA acceleration in environments with a GPU
    • Plan batch processing tasks so they do not exhaust resources
    • Monitor CPU/GPU usage and adjust concurrency as needed
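
The directory housekeeping described in this list can be scripted with the standard library alone. The sketch below is one possible approach and does not call PDF-Craft: the helper names prepare_analysis_dir and free_gib are hypothetical, and the resume flag is an assumption about how you might choose between continuing an interrupted run and starting a new task.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_analysis_dir(path, resume: bool) -> Path:
    """Return a ready-to-use analysis directory.

    resume=True keeps existing contents so an interrupted run can continue;
    resume=False deletes leftovers from a previous task first.
    """
    path = Path(path)
    if not resume and path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


def free_gib(path) -> float:
    """Free space, in GiB, on the filesystem that holds an existing path."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so no real data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        work = prepare_analysis_dir(Path(tmp) / "analysing", resume=False)
        print("analysis dir:", work)
        print(f"free space: {free_gib(tmp):.1f} GiB")
```

Calling free_gib on the model directory before the first run gives a quick check that there is room for the downloaded models.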

PDF-Craft is a capable PDF processing tool that is especially well suited to converting scanned books. With sensible configuration, it can noticeably speed up document processing. In actual deployments, choose the functional modules that match your needs and hardware.