Synthetic Data Kit: Corpus Extraction Solution for LLM Fine-tuning

Updated on: June 24, 2025
Recommendation
Meta's latest open-source toolkit generates high-quality LLM fine-tuning datasets with a handful of commands.
Core content:
1. Meta provides an open-source solution to the data-acquisition problem in LLM fine-tuning
2. A workflow that turns raw data into fine-tuning gold: ingest, create, curate, save-as
3. Support for multiple file formats and fine-tuning tasks improves data quality and fine-tuning efficiency
Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Expert (TVP)
The toolkit walks raw data through four steps:

- Ingest: Import data from PDF, HTML, YouTube, DOCX, PPT, or plain text. The toolkit parses and organizes your files into a clear directory structure.
- Create: Generate question-answer pairs, Chain-of-Thought (CoT) reasoning examples, or summaries using a local LLM (via vLLM). You can customize the number and type of examples, and even use your own prompt templates.
- Curate: Use Llama as a judge to filter and score your synthetic examples, ensuring that only the highest-quality data makes it through.
- Save-As: Export the filtered data in the format your fine-tuning workflow requires; Alpaca, OpenAI, ChatML, and many more are supported (a sketch of one such record follows this list).
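To make the export formats concrete, here is a generic illustration of a chat-style (OpenAI/ChatML-like) record. The exact field layout the toolkit emits is not shown in this article, so treat this as an assumption rather than the toolkit's documented schema.

```python
# Generic illustration of a chat-style fine-tuning record.
# NOT the toolkit's documented output; the field layout is an assumption.
import json

record = {
    "messages": [
        {"role": "user", "content": "Detailed legal question?"},
        {"role": "assistant", "content": "Precise legal answer."},
    ]
}

# Fine-tuning datasets are typically stored one JSON object per line (JSONL).
print(json.dumps(record))
```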
```mermaid
graph TD
    %% SDK's command tree
    SDK[synthetic-data-kit] --> SystemCheck[system-check]
    SDK --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]
    Ingest --> PDFFile[PDF File]
    Ingest --> HTMLFile[HTML File]
    Ingest --> YouTubeURL[YouTube URL]
    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]
    Curate --> Filter[Filter by Quality]
    SaveAs --> JSONL[JSONL Format]
    SaveAs --> Alpaca[Alpaca Format]
    SaveAs --> FT[Fine-Tuning Format]
    SaveAs --> ChatML[ChatML Format]
```
```bash
# Install from PyPI
conda create -n synthetic-data python=3.10
conda activate synthetic-data
pip install synthetic-data-kit
```
```bash
# Or, clone the repository to get the latest features:
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
```
```bash
# Start a local vLLM server:
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

# Create the necessary directory structure:
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}

# Check if the system is ready:
synthetic-data-kit system-check
```
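Before running the pipeline, it can help to confirm the server is actually answering. Here is a minimal sketch against vLLM's OpenAI-compatible endpoint, assuming the port used above:

```python
# Minimal connectivity check against the vLLM OpenAI-compatible server
# started above; /v1/models lists the models the server is serving.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)

print([m["id"] for m in models.get("data", [])])
```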
```bash
# Import a PDF
synthetic-data-kit ingest research_paper.pdf

# Generate 30 question-answer pairs with a quality threshold
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0

# Filter by quality
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5

# Save in OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft
```
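Between the create and curate steps it can help to peek at the intermediate file. A minimal sketch, assuming create wrote a JSON document at the path used above (the exact schema is not shown here, so inspect before relying on specific keys):

```python
# Sketch: preview the intermediate file produced by `create`.
# The path comes from the commands above; the schema is an assumption.
import json

with open("data/generated/research_paper_qa_pairs.json") as f:
    data = json.load(f)

# Print a short preview without assuming a specific structure.
print(json.dumps(data, indent=2)[:500])
```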
```yaml
# Example configuration
vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"
generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
curate:
  threshold: 7.0
  batch_size: 8
```
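To sanity-check edits to this file, you can load it back with PyYAML. A minimal sketch, assuming the configuration above is saved as config.yaml (the path is an assumption):

```python
# Minimal sketch: load and inspect the example configuration above.
# Requires PyYAML (pip install pyyaml); "config.yaml" is an assumed path.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["vllm"]["api_base"])         # http://localhost:8000/v1
print(cfg["generation"]["num_pairs"])  # 25
print(cfg["curate"]["threshold"])      # 7.0
```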
```yaml
prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.

    Below is a chunk of text about: {summary}...

    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.

    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]

    Text:
    ---
    {text}
```
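Since the template demands valid JSON, a small parser that fails loudly on malformed output is handy. A sketch; parse_qa_pairs is a hypothetical helper, not a toolkit API:

```python
# Sketch: parse and sanity-check a model response against the shape the
# prompt above requests (a JSON array of question/answer objects).
import json

def parse_qa_pairs(raw: str) -> list[dict]:
    pairs = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(pairs, list):
        raise ValueError("expected a JSON array")
    for p in pairs:
        if set(p) != {"question", "answer"}:
            raise ValueError(f"unexpected keys: {set(p)}")
    return pairs

example = '[{"question": "What is stare decisis?", "answer": "The doctrine of following precedent."}]'
print(parse_qa_pairs(example))
```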
```bash
# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
```
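For a cross-platform alternative, the same loop can be driven from Python. A minimal sketch that shells out to the CLI commands already shown, with paths and flags copied from the script above:

```python
# Sketch: Python equivalent of the batch script above, shelling out to
# the synthetic-data-kit CLI. Uses only commands shown in this article.
import subprocess
from pathlib import Path

def run(*args: str) -> None:
    subprocess.run(args, check=True)  # raise if any step fails

for pdf in sorted(Path("data/pdf").glob("*.pdf")):
    name = pdf.stem
    run("synthetic-data-kit", "ingest", str(pdf))
    run("synthetic-data-kit", "create", f"data/output/{name}.txt", "-n", "20")
    run("synthetic-data-kit", "curate", f"data/generated/{name}_qa_pairs.json", "-t", "7.5")
    run("synthetic-data-kit", "save-as", f"data/cleaned/{name}_cleaned.json", "-f", "chatml")
```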
```bash
# Curate generated QA pairs with a quality threshold of 7.0:
synthetic-data-kit curate data/generated/report_qa_pairs.json -t 7.0
```
During curation, the LLM judge scores each example on four criteria:

- Accuracy: factual correctness
- Relevance: relevance to the source content
- Clarity: clear language
- Practicality: value for model learning
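To see how such scores turn into the `-t` threshold in practice, here is an illustrative filter. The "rating" field name is an assumption, not the toolkit's documented schema:

```python
# Illustrative only: filter judged examples by score, mirroring
# `curate -t 7.0`. The "rating" field name is an assumption.
def filter_by_threshold(examples: list[dict], threshold: float = 7.0) -> list[dict]:
    return [ex for ex in examples if ex.get("rating", 0.0) >= threshold]

judged = [
    {"question": "Q1?", "answer": "A1.", "rating": 8.5},
    {"question": "Q2?", "answer": "A2.", "rating": 6.0},
]
print(filter_by_threshold(judged))  # keeps only the 8.5-rated pair
```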