OpenAI's reinforcement fine-tuning is finally live: build AI experts with just a few dozen samples

OpenAI's reinforcement fine-tuning (RFT) lets AI models become domain experts, significantly improving performance on specific tasks.
Core content:
1. Basic concepts and advantages of reinforcement fine-tuning (RFT)
2. Scenarios where RFT applies, with real-world case studies
3. OpenAI's fine-tuning guide and discount policy
I have some good news to share with you! Remember the Reinforcement Fine-Tuning (RFT) that I mentioned last December? Now, it has officially landed in the OpenAI o4-mini model!
Simply put, RFT uses chain-of-thought reasoning and task-specific grading to improve model performance in specific, complex domains - taking a model from roughly high-school level to expert-physician level on a given task. With reinforcement fine-tuning you can boost a model's professional ability in a particular field and build all kinds of AI experts.
In addition, GPT-4.1 nano is now open for fine-tuning! This means you can take OpenAI's fastest and cheapest model and "train" it for your specific scenario, maximizing cost-effectiveness!
Currently, RFT is open to verified organizations. OpenAI also offers a bonus: sharing your dataset will not only help improve future OpenAI models, it will also earn you a 50% discount.
OpenAI has published an official guide to reinforcement fine-tuning; here are the key points.
Reinforcement fine-tuning (RFT): what can it do?
The core goal of RFT is to improve model performance on specific tasks whose answers can be verified.
When is the best time to use RFT?
It is particularly suited to agentic workflows that require the model to make correct, verifiable decisions. RFT uses clear scoring criteria and "graders" (based on code or on large language models, LLMs) to measure task success, factual accuracy, or policy compliance.
OpenAI's early users were mainly concentrated in three scenarios:
1. Turning instructions into code: convert open-ended instructions into structured code, configuration, or templates, where the output must pass deterministic tests.
2. Extracting the essence from messy text: pull verifiable facts and summaries out of unstructured text and output them in JSON or another structured format.
3. Applying complex rules precisely: make fine-grained labeling or policy decisions when the information is subtle, high-volume, hierarchical, or high-stakes.
Real-world cases: flexing RFT's muscles
The following companies have made a name for themselves using RFT:
1. Turning instructions into code
The model needs to understand implicit domain constraints and generate structured outputs such as code, queries, or infrastructure templates. The output must satisfy multiple correctness conditions, and success or failure is usually a deterministic score.
ChipStack: Designing "intelligent wiring" for semiconductors
• Company: ChipStack, maker of AI-driven chip design and verification tools.
• Pain point: binding design interfaces to verification IP (pre-built verification components) is time-consuming and labor-intensive, involves extensive signal mapping, and requires deep domain knowledge.
• Goal: train an OpenAI model to complete this task automatically. ChipStack prepared a dataset of fewer than 50 samples and ran multiple RFT experiments.
• Grader idea: a grader defined in Python compares the predicted output (a set of name-value pairs) against the expected answer and computes an F1 score from precision and recall; a sketch of the idea follows below.
• Results: both the o1-mini and o3-mini models improved by about 12 percentage points. The fine-tuned models became much better at recognizing when not to apply routing, which is critical for commercial verification IPs that contain many optional signals.
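ChipStack's actual grader isn't public, but a minimal sketch of the idea might look like the following, assuming both the model output and the reference answer are dictionaries of signal-name to mapped-value pairs (all names here are illustrative):

```python
# Minimal sketch of an F1-style grader over name-value pairs.
# Assumes the model output and the reference answer are both dicts of
# signal name -> mapped value; the example names are illustrative.

def f1_grade(predicted: dict[str, str], expected: dict[str, str]) -> float:
    """Return an F1 score in [0, 1] comparing predicted vs. expected pairs."""
    if not predicted and not expected:
        return 1.0  # nothing to map on either side: perfect agreement
    # A pair counts as correct only if the name exists and the value matches.
    true_positives = sum(
        1 for name, value in predicted.items() if expected.get(name) == value
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of three predicted mappings match; one expected pair is missed.
print(f1_grade(
    predicted={"clk": "sys_clk", "rst_n": "reset_n", "valid": "vld"},
    expected={"clk": "sys_clk", "rst_n": "reset_n", "ready": "rdy"},
))  # ~0.67
```

A continuous score like this gives RFT a gradient to climb: partially correct mappings still earn partial credit instead of a flat fail.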
2. Extracting the essence from messy text
These tasks extract verifiable facts or entities from unstructured input into well-defined schemas (e.g. JSON, code, or citations). Accurate, continuous scoring methods (e.g. F1, fuzzy matching, numerical accuracy) are key.
Ambience Healthcare: Accurately assigning ICD-10 medical codes
• Company: Ambience, an AI platform that reduces administrative burden for clinicians and keeps documentation accurate and compliant.
• Pain point: ICD-10 coding (about 70,000 codes) is one of the most complex administrative tasks in medicine, and mistakes can result in huge fines.
• Goal: train a reasoning system that listens to patient-visit audio, combines it with EHR information, and recommends ICD-10 codes more accurately than expert clinicians.
• Results: on a golden test set of hundreds of patient visits, RFT moved the model from 13 percentage points behind human experts to 12 percentage points ahead, eliminating roughly a quarter of the coding errors made by trained physicians.
• o3-mini (base): 0.39
• Physician baseline: 0.45
• RFT-tuned o3-mini: 0.57
3. Applying complex rules precisely
These tasks often hinge on nuance and require clear classification guidelines and consensus among domain experts. A consistent grading signal is crucial for RFT to work.
Accordance: "Expert-level" reasoning for tax analysis
• Company: Accordance, building a platform for tax, audit, and CPA teams.
• Pain point: the tax domain is extremely complex, regulations keep changing, and the reasoning demands are high.
• Goal: build a system that handles complex tax scenarios with high accuracy and adapts as tax law changes.
• Grader idea: a detailed scoring checklist that awards points for each aspect of the tax analysis (a sketch follows below), for example:
• [+0.05] correctly identifying the equity percentage
• [+0.10] correctly calculating the annual distribution
• [+0.15] properly allocating ordinary income
• ...and more than ten other fine-grained scoring items.
• Results: performance on tax-analysis tasks is nearly 40% higher than the base model, and the fine-tuned model outperforms other mainstream models on benchmarks such as TaxBench. In the judgment of tax experts, it demonstrates expert-level reasoning.
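Accordance's actual rubric is not public, but a checklist grader of this kind could be sketched in plain Python roughly as follows; the check functions, field names, and expected values are hypothetical placeholders, and only the point weights mirror the example above:

```python
# Hedged sketch of a checklist-style grader that awards partial credit.
# The checks and expected values are hypothetical stand-ins for real
# domain-specific logic; only the weights echo the example above.
from typing import Callable

Check = tuple[float, Callable[[dict], bool]]  # (points, predicate over the parsed answer)

RUBRIC: list[Check] = [
    (0.05, lambda ans: ans.get("equity_percentage") == 0.35),
    (0.10, lambda ans: abs(ans.get("annual_distribution", 0) - 120_000) < 1),
    (0.15, lambda ans: ans.get("ordinary_income_allocation") == "partner_a"),
    # ...in practice, ten or more fine-grained items like these.
]

def rubric_grade(answer: dict) -> float:
    """Sum the points for every rubric item the answer satisfies."""
    return sum(points for points, check in RUBRIC if check(answer))

# Example: right equity percentage and allocation, wrong distribution -> 0.20.
print(rubric_grade({
    "equity_percentage": 0.35,
    "annual_distribution": 118_500,
    "ordinary_income_allocation": "partner_a",
}))
```

Because each item earns its own partial credit, an answer that nails most of the analysis but fumbles one calculation still scores well above one that gets everything wrong - exactly the kind of graded signal RFT learns from.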
Evaluations (Evals) are the cornerstone
OpenAI strongly recommends creating and running an evaluation (eval) for your task before attempting RFT. If your model already scores at the very bottom or the very top of the eval, RFT won't help: RFT needs the grader to distinguish between answers of different quality in order to learn. If the model's scores fall somewhere between the floor and the ceiling, the task is a promising candidate.
An effective eval can also reveal pain points that human experts broadly agree on but current models struggle with - exactly where RFT can show its strength. A quick way to sanity-check whether your task sits in that sweet spot is sketched below.
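For example, you could run your grader over the base model's outputs on an eval set and inspect the score distribution; the scores, floor, ceiling, and thresholds below are assumptions standing in for your own eval harness, not an official OpenAI recipe:

```python
# Rough sketch: decide whether a task is a promising RFT candidate by looking
# at the base model's score distribution on an eval set. The scores, floor,
# ceiling, and thresholds are illustrative.
from statistics import mean

def rft_is_promising(scores: list[float], floor: float = 0.0, ceiling: float = 1.0) -> bool:
    """Promising if the base model is neither hopeless nor already perfect."""
    at_floor = sum(s <= floor for s in scores) / len(scores)
    at_ceiling = sum(s >= ceiling for s in scores) / len(scores)
    print(f"mean={mean(scores):.2f}  at_floor={at_floor:.0%}  at_ceiling={at_ceiling:.0%}")
    return at_floor < 0.9 and at_ceiling < 0.9  # heuristic thresholds; tune to taste

# Made-up baseline scores: some signal, plenty of headroom -> worth trying RFT.
baseline_scores = [0.0, 0.2, 0.4, 0.0, 0.6, 0.3, 1.0, 0.1]
print("worth trying RFT:", rft_is_promising(baseline_scores))
```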
How to get better results from RFT?
To get more out of reinforcement fine-tuning, work on two fronts: clarifying the task definition and strengthening the grading scheme.
Redefine or clarify your task
Good tasks give the model a fair chance to learn and allow you to quantify improvements.
• Start with tasks the model can occasionally solve: if the model currently gets everything wrong, RFT has no foothold to learn from.
• Make sure every answer can be scored: the grader must be able to score outputs automatically; multiple grader types are supported (including custom Python graders and LLM judges).
• Remove ambiguity about the "right answer": if experts disagree on the answer, the task is too vague. Reword the prompt, add context, or split the task.
• Limit the opportunity to guess: if the question is multiple-choice with obvious options, the model may simply guess. Add categories, require short open-ended text, or adjust the format so guessing is expensive.
Strengthen your grader
A clear and robust scoring scheme is crucial for RFT.
• Use smooth scores instead of pass/fail: graded, partial-credit scores provide a better training signal.
• Beware of "reward hacking": models may find shortcuts that earn high scores without truly mastering the skill.
• Avoid data skew: if one label accounts for too large a share of the dataset, rebalance the data or up-weight the rare cases.
• Use an LLM judge when code-based scoring isn't enough: for complex open-ended responses, have another OpenAI model do the scoring (a hedged sketch follows below). When you do, make sure to:
• Evaluate the judge itself: test the LLM judge against multiple candidate answers and the correct answer to confirm its scores are stable and aligned with your preferences.
• Provide a few-shot set of examples: include examples of excellent, average, and poor answers in the prompt to improve the judge's accuracy.
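A hedged sketch of such an LLM judge, using the OpenAI Python SDK's chat completions API, might look like this; the judge prompt, few-shot examples, and model name are illustrative and not OpenAI's built-in grader configuration format:

```python
# Hedged sketch of an LLM judge that scores an answer from 0.0 to 1.0.
# The prompt wording, few-shot examples, and judge model are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You grade answers to a tax-analysis question on a 0.0-1.0 scale.
Reply with only a number.

Examples:
Question: What is the partner's equity percentage?  Reference: 35%
Answer: "35%" -> 1.0            (excellent: exact and complete)
Answer: "about a third" -> 0.5  (average: roughly right, imprecise)
Answer: "70%" -> 0.0            (poor: wrong)
"""

def judge(question: str, reference: str, candidate: str) -> float:
    """Ask a judge model to score `candidate` against `reference`."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # illustrative choice of judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nReference: {reference}\nAnswer: {candidate}"},
        ],
    )
    try:
        return max(0.0, min(1.0, float(response.choices[0].message.content.strip())))
    except ValueError:
        return 0.0  # an unparseable judge reply counts as zero

# Before trusting the judge, run it on answers you have already graded by hand
# and check that its scores are stable and match your preferences.
```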