From SFT to RFT: The Evolution of AI Model Training

Written by
Caleb Hayes
Updated on: July 9, 2025

Explore new breakthroughs in AI model training and how RFT technology revolutionizes traditional SFT methods.

Core content:
1. The principle and core idea of RFT technology
2. Comparative analysis of RFT and traditional SFT
3. Case analysis of RFT in practical applications


Reinforcement Fine-tuning (RFT) Paradigm

Analyzing the principles and applications of reinforcement fine-tuning for AI models through examples

What is RFT?

Reinforcement Fine-Tuning (RFT) is an AI model training method that combines reinforcement learning with fine-tuning. It optimizes large language models through reward-driven training cycles, enabling them to achieve better results with less data.

Core idea: Unlike traditional supervised fine-tuning (SFT), which directly imitates labeled data, RFT uses a "scorer" (or reward model) to provide feedback on the model's output, guiding the model to optimize in the desired direction.

RFT vs Supervised Fine-tuning (SFT)

Feature | Supervised Fine-Tuning (SFT) | Reinforcement Fine-Tuning (RFT)
Core idea | Trains the model directly on labeled data to match the desired output | Uses reward signals to guide the model to generate better outputs
Data requirements | Requires a large amount of labeled example data | Often needs only a handful of examples (dozens)
Learning method | Learns by imitating existing input-output pairs | Discovers the optimal strategy through trial and error and feedback
Innovation | Limited by the diversity of the training data | Can discover creative solutions
Human involvement | Mainly in the initial data annotation stage | Mainly in the reward function design stage




How does RFT work?


The RFT workflow consists of three main steps (a minimal code sketch follows the list):

  1. Model generation:
    The base model generates multiple candidate outputs based on the input prompt
  2. Reward Assessment:
    The reward function evaluates each output and assigns a score
  3. Model Update:
    The model optimizes its parameters based on the reward signal
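
Here is a minimal Python sketch of a single RFT step, using toy stand-ins for the model and the reward function. The names toy_model, reward_fn, and rft_step are illustrative, not any particular library's API, and the actual parameter update is only indicated in a comment.

import random

def toy_model(prompt: str) -> str:
    """Pretend base model: samples one of a few canned answers."""
    return random.choice(["answer A", "answer B", "answer C"])

def reward_fn(prompt: str, output: str) -> float:
    """Scores one output; a real reward function would check correctness, format, etc."""
    return 1.0 if output == "answer B" else 0.0

def rft_step(prompt: str, n_candidates: int = 4):
    # 1. Model generation: sample several candidate outputs for the prompt
    candidates = [toy_model(prompt) for _ in range(n_candidates)]
    # 2. Reward assessment: score every candidate
    scored = [(output, reward_fn(prompt, output)) for output in candidates]
    # 3. Model update: a real system would apply a policy-gradient-style update
    #    that pushes the model toward high-reward outputs; here we simply return
    #    the scores the optimizer would receive.
    return scored

print(rft_step("example prompt"))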

The key role of the reward function

The reward function defines what a "good" output is. It can:

  • Give high scores to completely correct outputs
  • Give partial credit for partially correct outputs
  • Provide positive feedback for innovative solutions
  • Penalize incorrect or otherwise undesirable outputs
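
As a rough illustration, a reward function with these properties might look like the sketch below; the input fields (is_valid, checks_passed, checks_total, is_novel) are hypothetical placeholders for whatever checks a real grader performs.

def grade_output(result: dict) -> float:
    """Illustrative reward combining a penalty, partial credit, a bonus, and a cap at full credit."""
    if not result["is_valid"]:                                  # penalize unusable or harmful output
        return -1.0
    score = result["checks_passed"] / result["checks_total"]    # partial credit for partially correct output
    if result.get("is_novel"):                                  # positive feedback for an innovative solution
        score += 0.1
    return min(score, 1.0)                                      # completely correct output gets the full score

print(grade_output({"is_valid": True, "checks_passed": 3, "checks_total": 4}))  # 0.75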

Key Benefits of RFT

  • Data efficiency:
    RFT typically requires only a small number of examples (tens instead of thousands) for effective fine-tuning, significantly reducing data collection and annotation costs.
  • Partial rewards:
    Rewards can be provided for partially correct solutions, allowing the model to improve gradually, rather than being limited to binary feedback of completely correct or completely wrong.
  • Discover innovative solutions:
    RFT encourages the model to explore multiple solution paths and may discover more effective solutions than manually designed methods.
  • Enhanced reasoning capabilities:
    Reinforcement learning helps the model develop complex reasoning strategies, which is difficult to achieve through imitation learning (SFT) alone.


RFT Example Analysis

Example 1: Code Generation Optimization


Task: Convert natural language descriptions into SQL queries

Input:

Find the names and email addresses of all customers who purchased "Premium Member" products in January 2023 and spent more than 1,000 yuan

Expected Output:

SELECT c.customer_name, c.email
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Premium Member'
  AND o.order_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND o.total_amount > 1000;

RFT application method:

The reward function here can evaluate:

  • Syntactic correctness (whether the SQL is executable)
  • Whether the correct columns (customer name and email address) are returned
  • Whether the filter conditions are complete (date range, product name, amount threshold)
  • Whether the table joins are correct

Even if the model's SQL query is wrong in some respects (such as missing a JOIN), the correct parts can still earn partial rewards, helping the model gradually learn to construct correct queries.
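
A toy grader along these lines might look like the sketch below; the string checks are placeholders, and a real reward function would more likely parse the SQL or execute it against a test database and compare result sets.

def sql_reward(sql: str) -> float:
    """Toy grader for the SQL task above: each criterion earns partial credit."""
    s = sql.upper().strip()
    score = 0.0
    if "C.CUSTOMER_NAME" in s and "C.EMAIL" in s:            # correct columns returned
        score += 0.25
    if s.count("JOIN") >= 3:                                  # required table joins present
        score += 0.25
    conditions = ["'PREMIUM MEMBER'", "2023-01", "> 1000"]    # product name, date range, amount threshold
    score += 0.25 * sum(c in s for c in conditions) / len(conditions)
    if s.startswith("SELECT") and s.endswith(";"):            # rough executability proxy
        score += 0.25
    return score

A query that is missing one JOIN but otherwise correct still scores well above zero, which is exactly the partial-credit behavior described above.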

Example 2: Mathematical reasoning task


Task: Solve complex math problems

Question:

A store offers a 25% discount on all items. After the discount, the price of a shirt is 150 yuan. What is the original price of the shirt?

Expected solution process:

Step 1: Let the original price be x yuan.
Step 2: A 25% discount means the selling price is 75% of the original price.
Step 3: We can write the equation: 0.75x = 150.
Step 4: Solve for x: x = 150 ÷ 0.75 = 200.
Answer: The original price of the shirt is 200 yuan.

RFT application method:

In this case, the reward function can evaluate:

  • Whether the final answer is correct (200 yuan)
  • Whether the derivation is sound (setting a variable, interpreting the discount, and correctly forming the equation)
  • Whether the calculation is correct (150 ÷ 0.75 = 200)
  • Whether the explanation is clear and complete

Even if the model's final answer is wrong, a sound reasoning process can still earn partial rewards, which encourages the model to develop structured problem-solving capabilities.
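
As a sketch, a grader for this specific problem could award partial credit for the intermediate steps as well as the final answer; the string checks below are toy placeholders for real step-level verification.

import re

def math_reward(solution: str) -> float:
    """Toy grader for the discount problem: credit for the equation setup, the division step, and the answer."""
    normalized = solution.replace(" ", "")
    score = 0.0
    if "0.75x=150" in normalized:                             # equation set up correctly
        score += 0.3
    if "150/0.75" in normalized or "150÷0.75" in normalized:  # correct division step
        score += 0.2
    if re.search(r"\b200\b", solution):                       # correct final answer (200 yuan)
        score += 0.5
    return score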

Example 3: Structured Information Extraction

    

Task: Extract company information from unstructured text

Input text:

Future Technology Co., Ltd. was founded in 2018 and is headquartered at No. 12, Science Park, Haidian District, Beijing. The company is mainly engaged in the research and development of artificial intelligence and cloud computing technologies, with annual revenue of approximately RMB 250 million. CEO Li Ming can be contacted by phone at 010-88889999 or email at contact@future-tech.example.com.

Expected output format:

{"Company Name": "Future Technology Co., Ltd.", "Year of Establishment": 2018, "Headquarters Address": "No. 12, Science and Technology Park, Haidian District, Beijing", "Business Area": ​​["Artificial Intelligence", "Cloud Computing"], "Annual Revenue": "250 million yuan", "CEO": "Li Ming", "Contact Information": {"Phone":"010-88889999","Email": "contact@future-tech.example.com"}}
RFT application method:

The reward function can evaluate each extracted field independently:

  • Extraction accuracy for each field (company name, year of establishment, address, etc.)
  • Whether the format meets the requirements (e.g., whether the JSON is valid)
  • Whether all available information has been extracted (completeness)

Even if the model extracts only part of the information correctly (for example, it gets the company name and address right but misses the revenue figure), it still receives the corresponding share of the reward. This fine-grained feedback helps the model improve its extraction of specific fields.
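
A per-field grader of this kind could be sketched as follows; the expected dictionary is a hand-written reference record, shortened for the example.

import json

def extraction_reward(model_output: str, expected: dict) -> float:
    """Toy per-field grader: the output must be valid JSON, and each correctly extracted field earns an equal share."""
    try:
        extracted = json.loads(model_output)                   # format check: valid JSON
    except json.JSONDecodeError:
        return 0.0                                             # malformed output earns no reward
    correct = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return correct / len(expected)                             # per-field accuracy and completeness

expected = {"Company Name": "Future Technology Co., Ltd.", "Year of Establishment": 2018}
print(extraction_reward('{"Company Name": "Future Technology Co., Ltd."}', expected))  # 0.5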

Synergy between RFT and SFT

RFT and SFT are not mutually exclusive, but complementary methods. In practical applications, the common workflow is:

Phase 1 (SFT): Use supervised fine-tuning with a large amount of labeled data to give the model basic domain knowledge and capabilities.

Phase 2 (RFT): Use reinforcement fine-tuning to further optimize the model, building more advanced capabilities or adapting it to specific performance metrics.
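
A schematic of this two-phase workflow; supervised_finetune and reinforcement_finetune are hypothetical placeholders standing in for a real training library (the stubs below simply return the model unchanged).

def supervised_finetune(model, labeled_dataset):
    """Phase 1 stub: imitate labeled input-output pairs to build basic domain knowledge."""
    return model

def reinforcement_finetune(model, prompts, reward_fn):
    """Phase 2 stub: optimize the model's outputs against the reward function."""
    return model

def train_domain_model(base_model, labeled_dataset, prompts, reward_fn):
    model = supervised_finetune(base_model, labeled_dataset)     # broad domain grounding first
    model = reinforcement_finetune(model, prompts, reward_fn)    # then targeted optimization
    return model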

Actual case: Medical diagnostic assistant

First, use SFT to let the model learn basic medical knowledge and diagnostic processes (based on medical case data sets)

Then use RFT, with a reward function that evaluates whether the model:

  • Asks relevant follow-up questions
  • Considers multiple possible diagnoses
  • Clearly explains its reasoning process
  • Recommends appropriate next steps

Summary

The reinforcement fine-tuning (RFT) paradigm provides an effective way to optimize the performance of AI models by applying the ideas of reinforcement learning to the model fine-tuning process:

  • Achieve better results with less data
  • Allow partial rewards to promote gradual learning
  • Encourage the discovery of innovative solutions
  • Enhance the model’s reasoning capabilities
  • Complementary to traditional supervised fine-tuning

As technology develops, RFT is becoming more accessible, making it easier for domain experts to leverage this powerful tool to optimize model performance without deep technical knowledge.