Synthetic Data Kit: Corpus Extraction Solution for LLM Fine-tuning

Written by
Iris Vance
Updated on: June 24, 2025

Meta's latest open source toolkit generates high-quality LLM fine-tuning datasets in one click.

Core content:
1. Meta's open source solution to the data acquisition problem in LLM fine-tuning
2. The workflow from raw data to fine-tuning gold: ingest, create, curate, save-as
3. Support for multiple file formats and fine-tuning tasks, improving data quality and fine-tuning efficiency



The biggest challenge in fine-tuning mainstream LLMs for specific tasks is that high-quality, task-specific data is hard to come by. Meta’s Synthetic Data Kit (SDK) provides a streamlined open source solution for generating, filtering, and formatting synthetic datasets — without the need for data scientists.

Its real significance is that it packages the data-preparation know-how of the fine-tuning process into a tool. Most real-world datasets are messy and disorganized, and rarely come in the "user/assistant" format that LLMs expect. Meta's synthetic data toolkit fills this gap, letting you generate reasoning trajectories, question-answer pairs, and more, all ready for fine-tuning.

From raw data to fine-tuned gold

The SDK's workflow is simple and modular, built around four core commands:
  • Ingest: Import data from PDF, HTML, YouTube, DOCX, PPT or plain text. The toolkit parses and organizes your files into a clear directory structure.
  • Create: Generate question-answer pairs, Chain of Thought (CoT) reasoning examples, or summaries using the local LLM (via vLLM). You can customize the number and type of examples, and even use your own prompt templates.
  • Curate: Use Llama as a judge to filter and score your synthetic examples, ensuring only the highest quality data makes it through.
  • Save-As: Export the filtered data in the format your fine-tuning workflow requires; supports Alpaca, OpenAI, ChatML, and many more.

The SDK's command tree:

```mermaid
graph LR
    SDK[synthetic-data-kit] --> SystemCheck[system-check]
    SDK --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]
    Ingest --> PDFFile[PDF File]
    Ingest --> HTMLFile[HTML File]
    Ingest --> YouTubeURL[YouTube URL]
    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]
    Curate --> Filter[Filter by Quality]
    SaveAs --> JSONL[JSONL Format]
    SaveAs --> Alpaca[Alpaca Format]
    SaveAs --> FT[Fine-Tuning Format]
    SaveAs --> ChatML[ChatML Format]
```


Quick Start

```bash
# Install from PyPI
conda create -n synthetic-data python=3.10
conda activate synthetic-data
pip install synthetic-data-kit
```

```bash
# Or, clone the repository to get the latest features
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
```

The toolkit assumes a vLLM server running your chosen teacher model in the background. If you have spare capacity, a larger teacher model generally yields better results.

```bash
# Serve the teacher model
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

# Create the necessary directory structure
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}

# Check if the system is ready
synthetic-data-kit system-check
```
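Before kicking off a run, you can confirm the server is actually responding. vLLM exposes an OpenAI-compatible API, so a plain curl against the models endpoint is enough (assuming the default port used above):

```bash
# List the served model; a JSON response means vLLM is up
curl -s http://localhost:8000/v1/models
```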

For example, here is how to convert a research paper in PDF format into a fine-tuning dataset:
```bash
# Import the PDF
synthetic-data-kit ingest research_paper.pdf
# Generate 30 question-answer pairs with a quality threshold
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0
# Filter by quality
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5
# Save in OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft
```
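To sanity-check the intermediate output, you can inspect the generated JSON directly. A minimal sketch with jq (assuming jq is installed, and assuming the question/answer field names that the default prompt template requests, as shown later in this article):

```bash
# Peek at the first generated pair
jq '.[0]' data/generated/research_paper_qa_pairs.json
```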

It can also process YouTube videos, HTML files, and batch process entire folders using a simple bash script.

The tool is highly configurable

Override any parameter — temperature, chunk size, number of generated pairs, filter thresholds — via the CLI or YAML configuration files, and even use custom prompt templates for domain-specific tasks.

The configuration file is at config/config.yaml:

```yaml
# Example configuration
vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"
generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
curate:
  threshold: 7.0
  batch_size: 8
```

An example of a custom prompt for legal question-answer pairs:
```yaml
prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.

    Below is a chunk of text about: {summary}...

    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.

    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]

    Text:
    ---
    {text}
```

Currently the SDK supports six parsers:
  • PDFParser (pdfminer)
  • HTMLParser (beautifulsoup4)
  • YouTubeParser (pytube and youtube_transcript_api)
  • DOCXParser (docx)
  • PPTParser (pptx)
  • TXTParser (simple reading of text files)
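The parser is chosen automatically from the input type. For illustration (the file names and the video ID below are hypothetical placeholders):

```bash
# Each input routes to the matching parser
synthetic-data-kit ingest report.docx
synthetic-data-kit ingest slides.pptx
synthetic-data-kit ingest page.html
synthetic-data-kit ingest "https://www.youtube.com/watch?v=VIDEO_ID"
```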

The create command is synthetic-data-kit create [text file path] --type [format type], where the format type options include qa (question-answer pairs) and cot (chain-of-thought reasoning).
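For example, to generate chain-of-thought traces instead of QA pairs, combine --type with the -n flag shown earlier (a sketch; check synthetic-data-kit create --help for the exact options in your version):

```bash
# Generate 20 CoT reasoning examples from the ingested text
synthetic-data-kit create data/output/research_paper.txt --type cot -n 20
```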

```bash
# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
```

The SDK then loads the input text file, chunks it into manageable passages (based on the configured chunk size), generates a summary for each chunk (these summaries are embedded in the prompts), builds prompts from the template in the configuration file, calls the vLLM API to generate responses, and saves the generated samples in JSON format.
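Under the hood this is a call to vLLM's OpenAI-compatible endpoint. A minimal sketch of the per-chunk generation request, made directly with curl (the prompt text here is illustrative, not the SDK's actual template):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      {"role": "user", "content": "Create 5 question-answer pairs based ONLY on this text: <chunk>"}
    ],
    "temperature": 0.7
  }'
```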

Corpus quality screening

```bash
synthetic-data-kit curate data/generated/report_qa_pairs.json -t 7.0
```

This critical quality control process is as follows:
1) Load the JSON file containing the generated samples and assess each one along the following dimensions (you can also define your own evaluation criteria, as in <Learning LLM thinking: Self-evaluation of corpus quality>):
  • Accuracy: factual correctness
  • Relevance: relevance to the source content
  • Clarity: clear language
  • Practicality: value for model learning

2) Build prompts from the data and these dimensions and send them to the vLLM server, where the LLM assigns a quality score, for example: "Score each question-answer pair on a scale of 1-10 based on the following aspects."
3) Examples with scores below the configured threshold (default 7.0) are removed.

Finally, the filtered samples are loaded and converted into other common training formats such as Alpaca, ChatML, or JSONL.
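The same curated file can be exported several times in different formats. The ft and chatml format names appear earlier in this article; alpaca is an assumption based on the list of supported formats:

```bash
# Export the curated data in multiple formats
synthetic-data-kit save-as data/cleaned/report_cleaned.json -f ft
synthetic-data-kit save-as data/cleaned/report_cleaned.json -f chatml
synthetic-data-kit save-as data/cleaned/report_cleaned.json -f alpaca
```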

Meta's synthetic data toolkit is a game changer for anyone who needs high-quality, task-specific data for LLMs. Whether you are a researcher, developer, or startup founder, it makes generating, filtering, and formatting synthetic datasets for fine-tuning easy.