From SFT to RFT: The Evolution of AI Model Training

Written by
Caleb Hayes
Updated on: July 9, 2025

Explore new breakthroughs in AI model training and how RFT technology revolutionizes traditional SFT methods.

Core content:
1. The principle and core idea of RFT technology
2. Comparative analysis of RFT and traditional SFT
3. Case analysis of RFT in practical applications


Reinforcement Fine-tuning (RFT) Paradigm

Analyzing the principles and applications of reinforcement fine-tuning for AI models through examples

What is RFT?

Reinforcement Fine-Tuning (RFT) is an AI model training method that combines reinforcement learning with fine-tuning. It optimizes large language models through reward-driven training cycles, enabling them to achieve better results with less data.

Core idea: Unlike traditional supervised fine-tuning (SFT), which directly imitates labeled data, RFT uses a "scorer" (or reward model) to provide feedback on the model's output, guiding the model to optimize in the desired direction.

RFT vs Supervised Fine-tuning (SFT)

Feature | Supervised Fine-Tuning (SFT) | Reinforcement Fine-Tuning (RFT)
Core idea | Trains the model directly on labeled data to match the desired output | Uses reward signals to guide the model to generate better outputs
Data requirements | Requires a large amount of labeled example data | Often needs only a handful of examples (dozens)
Learning method | Learns by imitating existing input-output pairs | Discovers the optimal strategy through trial and error and feedback
Innovation | Limited by the diversity of the training data | Can discover creative solutions
Human involvement | Mainly in the initial data annotation stage | Mainly in the reward function design stage




How does RFT work?


The RFT workflow consists of three main steps (a minimal code sketch follows the list):

  1. Model generation:
    The base model generates multiple candidate outputs based on the input prompt
  2. Reward Assessment:
    The reward function evaluates each output and assigns a score
  3. Model Update:
    The model optimizes its parameters based on the reward signal
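
Here is a minimal Python sketch of a single RFT step, using toy stand-ins for the model and the reward function. The names toy_model, reward_fn, and rft_step are illustrative, not any particular library's API, and the actual parameter update is only indicated in a comment.

import random

def toy_model(prompt: str) -> str:
    """Pretend base model: samples one of a few canned answers."""
    return random.choice(["answer A", "answer B", "answer C"])

def reward_fn(prompt: str, output: str) -> float:
    """Scores one output; a real reward function would check correctness, format, etc."""
    return 1.0 if output == "answer B" else 0.0

def rft_step(prompt: str, n_candidates: int = 4):
    # 1. Model generation: sample several candidate outputs for the prompt
    candidates = [toy_model(prompt) for _ in range(n_candidates)]
    # 2. Reward assessment: score every candidate
    scored = [(output, reward_fn(prompt, output)) for output in candidates]
    # 3. Model update: a real system would apply a policy-gradient-style update
    #    that pushes the model toward high-reward outputs; here we simply return
    #    the scores the optimizer would receive.
    return scored

print(rft_step("example prompt"))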

The key role of the reward function

The reward function defines what a "good" output is. It can:

  • Give high scores to completely correct outputs
  • Give partial credit for partially correct outputs
  • Provide positive feedback for innovative solutions
  • Penalize incorrect or otherwise undesirable outputs
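
As a rough illustration, a reward function with these properties might look like the sketch below; the input fields (is_valid, checks_passed, checks_total, is_novel) are hypothetical placeholders for whatever checks a real grader performs.

def grade_output(result: dict) -> float:
    """Illustrative reward combining a penalty, partial credit, a bonus, and a cap at full credit."""
    if not result["is_valid"]:                                  # penalize unusable or harmful output
        return -1.0
    score = result["checks_passed"] / result["checks_total"]    # partial credit for partially correct output
    if result.get("is_novel"):                                  # positive feedback for an innovative solution
        score += 0.1
    return min(score, 1.0)                                      # completely correct output gets the full score

print(grade_output({"is_valid": True, "checks_passed": 3, "checks_total": 4}))  # 0.75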

Key Benefits of RFT

  • Data efficiency:
    RFT typically requires only a small number of examples (tens instead of thousands) for effective fine-tuning, significantly reducing data collection and annotation costs.
  • Partial rewards:
    Rewards can be provided for partially correct solutions, allowing the model to improve gradually, rather than being limited to binary feedback of completely correct or completely wrong.
  • Discover innovative solutions:
    RFT encourages the model to explore multiple solution paths and may discover more effective solutions than manually designed methods.
  • Enhanced reasoning capabilities:
    Reinforcement learning helps the model develop complex reasoning strategies, which is difficult to achieve through imitation learning (SFT) alone.


RFT Example Analysis

Example 1: Code Generation Optimization


Task: Convert natural language descriptions into SQL queries

Input:

Find the names and email addresses of all customers who purchased "Premium Member" products in January 2023 and spent more than 1,000 yuan

Expected Output:

SELECT c.customer_name, c.email
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_name = 'Premium Member'
  AND o.order_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND o.total_amount > 1000;

RFT application method:

The reward function here can evaluate:

  • Syntactic correctness (whether the SQL is executable)
  • Whether the correct columns (customer name and email address) are returned
  • Whether the filter conditions are complete (date range, product name, amount threshold)
  • Whether the table joins are correct

Even if the model's SQL query is wrong in some respects (such as missing a JOIN), the correct parts can still earn partial rewards, helping the model gradually learn to construct correct queries.
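
A toy grader along these lines might look like the sketch below; the string checks are placeholders, and a real reward function would more likely parse the SQL or execute it against a test database and compare result sets.

def sql_reward(sql: str) -> float:
    """Toy grader for the SQL task above: each criterion earns partial credit."""
    s = sql.upper().strip()
    score = 0.0
    if "C.CUSTOMER_NAME" in s and "C.EMAIL" in s:            # correct columns returned
        score += 0.25
    if s.count("JOIN") >= 3:                                  # required table joins present
        score += 0.25
    conditions = ["'PREMIUM MEMBER'", "2023-01", "> 1000"]    # product name, date range, amount threshold
    score += 0.25 * sum(c in s for c in conditions) / len(conditions)
    if s.startswith("SELECT") and s.endswith(";"):            # rough executability proxy
        score += 0.25
    return score

A query that is missing one JOIN but otherwise correct still scores well above zero, which is exactly the partial-credit behavior described above.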

Example 2: Mathematical reasoning task


Task: Solve complex math problems

Question:

A store offers a 25% discount on all items. After the discount, the price of a shirt is 150 yuan. What is the original price of the shirt?

Expected solution process:

Step 1: Let the original price be x yuan.
Step 2: A 25% discount means the selling price is 75% of the original price.
Step 3: We can write the equation: 0.75x = 150.
Step 4: Solve for x: x = 150 ÷ 0.75 = 200.
Answer: The original price of the shirt is 200 yuan.

RFT application method:

In this case, the reward function can evaluate:

  • Whether the final answer is correct (200 yuan)
  • Whether the derivation is sound (setting a variable, interpreting the discount, and correctly forming the equation)
  • Whether the calculation is correct (150 ÷ 0.75 = 200)
  • Whether the explanation is clear and complete

Even if the model's final answer is wrong, a sound reasoning process can still earn partial rewards, which encourages the model to develop structured problem-solving capabilities.
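
As a sketch, a grader for this specific problem could award partial credit for the intermediate steps as well as the final answer; the string checks below are toy placeholders for real step-level verification.

import re

def math_reward(solution: str) -> float:
    """Toy grader for the discount problem: credit for the equation setup, the division step, and the answer."""
    normalized = solution.replace(" ", "")
    score = 0.0
    if "0.75x=150" in normalized:                             # equation set up correctly
        score += 0.3
    if "150/0.75" in normalized or "150÷0.75" in normalized:  # correct division step
        score += 0.2
    if re.search(r"\b200\b", solution):                       # correct final answer (200 yuan)
        score += 0.5
    return score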

Example 3: Structured Information Extraction

    

Task: Extract company information from unstructured text

Input text:

Future Technology Co., Ltd. was founded in 2018 and is headquartered at No. 12, Science Park, Haidian District, Beijing. The company is mainly engaged in the research and development of artificial intelligence and cloud computing technologies, with annual revenue of approximately RMB 250 million. CEO Li Ming can be contacted by phone at 010-88889999 or email at contact@future-tech.example.com.

Expected output format:

{"Company Name": "Future Technology Co., Ltd.", "Year of Establishment": 2018, "Headquarters Address": "No. 12, Science and Technology Park, Haidian District, Beijing", "Business Area": ​​["Artificial Intelligence", "Cloud Computing"], "Annual Revenue": "250 million yuan", "CEO": "Li Ming", "Contact Information": {"Phone":"010-88889999","Email": "contact@future-tech.example.com"}}
RFT application method:

The reward function can evaluate each extracted field independently:

  • Extraction accuracy for each field (company name, year of establishment, address, etc.)
  • Whether the format meets the requirements (e.g., whether the JSON is valid)
  • Whether all available information has been extracted (completeness)

Even if the model extracts only part of the information correctly (for example, it gets the company name and address right but misses the revenue figure), it still receives the corresponding share of the reward. This fine-grained feedback helps the model improve its extraction of specific fields.
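
A per-field grader of this kind could be sketched as follows; the expected dictionary is a hand-written reference record, shortened for the example.

import json

def extraction_reward(model_output: str, expected: dict) -> float:
    """Toy per-field grader: the output must be valid JSON, and each correctly extracted field earns an equal share."""
    try:
        extracted = json.loads(model_output)                   # format check: valid JSON
    except json.JSONDecodeError:
        return 0.0                                             # malformed output earns no reward
    correct = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return correct / len(expected)                             # per-field accuracy and completeness

expected = {"Company Name": "Future Technology Co., Ltd.", "Year of Establishment": 2018}
print(extraction_reward('{"Company Name": "Future Technology Co., Ltd."}', expected))  # 0.5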

Synergy between RFT and SFT

RFT and SFT are not mutually exclusive, but complementary methods. In practical applications, the common workflow is:

Phase 1 (SFT): Use supervised fine-tuning with a large amount of labeled data to give the model basic domain knowledge and capabilities.

Phase 2 (RFT): Use reinforcement fine-tuning to further optimize the model, building more advanced capabilities or adapting it to specific performance metrics.
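
A schematic of this two-phase workflow; supervised_finetune and reinforcement_finetune are hypothetical placeholders standing in for a real training library (the stubs below simply return the model unchanged).

def supervised_finetune(model, labeled_dataset):
    """Phase 1 stub: imitate labeled input-output pairs to build basic domain knowledge."""
    return model

def reinforcement_finetune(model, prompts, reward_fn):
    """Phase 2 stub: optimize the model's outputs against the reward function."""
    return model

def train_domain_model(base_model, labeled_dataset, prompts, reward_fn):
    model = supervised_finetune(base_model, labeled_dataset)     # broad domain grounding first
    model = reinforcement_finetune(model, prompts, reward_fn)    # then targeted optimization
    return model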

Actual case: Medical diagnostic assistant

First, use SFT to let the model learn basic medical knowledge and diagnostic processes (based on medical case data sets)

Then use RFT, with a reward function that evaluates whether the model:

  • Asks relevant follow-up questions
  • Considers multiple possible diagnoses
  • Clearly explains its reasoning process
  • Recommends appropriate next steps

Summary

The reinforcement fine-tuning (RFT) paradigm provides an effective way to optimize the performance of AI models by applying the ideas of reinforcement learning to the model fine-tuning process:

  • Achieve better results with less data
  • Allow partial rewards to promote gradual learning
  • Encourage the discovery of innovative solutions
  • Enhance the model’s reasoning capabilities
  • Complementary to traditional supervised fine-tuning

As technology develops, RFT is becoming more accessible, making it easier for domain experts to leverage this powerful tool to optimize model performance without deep technical knowledge.