Accuracy 3 times higher than DeepSeek and o1! The first serverless reinforcement fine-tuning platform needs only a dozen data points.

A major breakthrough in the field of AIGC! Predibase has released RFT, the first serverless reinforcement fine-tuning platform, which significantly improves the performance of large language models.
Core content:
1. Predibase released RFT, the first end-to-end reinforcement fine-tuning platform, with support for serverless, end-to-end training
2. RFT drives reinforcement learning through rewards and custom functions, greatly reducing dependence on labeled data
3. After RFT fine-tuning, Qwen2.5-Coder-32B-instruct improves significantly, reaching more than 3 times the accuracy of DeepSeek-R1 and o1
Early this morning, Predibase, a well-known large-model training and development platform, released the first end-to-end reinforcement fine-tuning platform (RFT).
Compared with traditional supervised fine-tuning, RFT does not rely on large amounts of labeled data. Instead, it uses rewards and custom functions for continuous reinforcement learning. It also supports serverless, end-to-end training: everything from data management and model training to application deployment can be completed on a single platform.
In other words, all you need is a browser: set the fine-tuning targets, upload data, and you can complete what used to be a very complicated large-model fine-tuning process.
Online experience address: https://predibase.com/reinforcement-fine-tuning-playground
To demonstrate the power of RFT, Predibase fine-tuned a model based on Alibaba's open-source Qwen2.5-Coder-32B-instruct that is specifically designed to translate PyTorch code into Triton.
This is a task that most LLMs struggle with: it requires a deep understanding of both frameworks and complex reasoning about computational efficiency. Before fine-tuning, the accuracy of Qwen2.5-Coder-32B-instruct on this task was relatively low. The sketch below illustrates what such a translation looks like.
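To make the task concrete, here is a minimal sketch (not taken from Predibase's benchmark data) of a PyTorch-to-Triton translation: the same elementwise addition written first as ordinary PyTorch and then as a hand-written Triton kernel. The function names and the BLOCK_SIZE choice are illustrative.

```python
import torch
import triton
import triton.language as tl

def add_pytorch(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The PyTorch source the model receives as input.
    return x + y

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # The Triton kernel the model is expected to produce:
    # each program instance handles one BLOCK_SIZE-wide slice.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add_triton(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Host-side wrapper that launches the kernel over a 1D grid.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The benchmark problems are considerably harder than this toy case, which is where the deep framework knowledge and efficiency reasoning come in.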
With RFT, Predibase combines cold-start supervised fine-tuning, reinforcement learning, and curriculum learning during training, using only a dozen labeled data points.
Benchmark tests on the KernelBench dataset show that after reinforcement fine-tuning, Qwen2.5-Coder-32B-instruct achieves an accuracy 3 times higher than DeepSeek-R1 and OpenAI's o1, and more than 4 times higher than Claude 3.7 Sonnet, while being much smaller than all three models.
Currently, Predibase has open-sourced the fine-tuned Qwen2.5-Coder-32B-instruct model.
Open source address: https://huggingface.co/predibase/Predibase-T2T-32B-RFT
In terms of technical advantages, RFT does not rely on large amounts of labeled data. Traditional methods need large amounts of manually labeled data to guide model learning, which is costly and time-consuming. RFT instead guides learning through a reward function that evaluates the model's outputs against the specific requirements of the task, and that evaluation defines the model's optimization goal.
RFT is also more adaptable and flexible. Traditional methods depend on the quality and quantity of labeled data; if that data is limited or inaccurate, model performance is capped. RFT lets users customize the reward function for their specific task and flexibly define the optimization goal.
For example, in a code generation task, the reward function can verify the correctness of the generated code; in a question-answering task, it can evaluate the relevance and accuracy of the answer. The sketch below shows what such functions might look like.
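As a rough illustration, here is a minimal sketch of custom reward functions for the two examples above. This is not Predibase's API; the `solve` entry point, the test-case format, and the token-overlap scoring are all illustrative assumptions.

```python
def correctness_reward(completion: str, test_cases: list[dict]) -> float:
    """Code-generation reward: fraction of test cases the generated code passes.
    Assumes the model is prompted to emit raw Python that defines solve()."""
    namespace: dict = {}
    try:
        exec(completion, namespace)   # run the candidate code (sandbox this in practice)
        solve = namespace["solve"]    # hypothetical required entry point
    except Exception:
        return 0.0                    # unrunnable code earns no reward
    passed = 0
    for case in test_cases:
        try:
            if solve(*case["args"]) == case["expected"]:
                passed += 1
        except Exception:
            pass                      # a runtime error on a case earns no credit
    return passed / len(test_cases) if test_cases else 0.0

def answer_reward(answer: str, reference: str) -> float:
    """Question-answering reward: crude relevance score via token overlap
    with a reference answer (a real setup would use a stronger judge)."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(r) if r else 0.0
```

During training, the score such a function assigns to each sampled completion takes the place of the per-example label that supervised fine-tuning would otherwise require.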
RFT also supports continuous improvement. Traditional fine-tuning is usually a one-shot process, and it is difficult to keep improving the model after training. With RFT, as the reward function is refined and more feedback data accumulates, the model can keep learning and adapt to changing task requirements.
In terms of training and inference efficiency, traditional methods usually run in a local environment, demand significant hardware resources, and require manual management of the training and deployment process.
The RFT platform provided by Predibase is a fully managed, serverless platform: users do not need to manage the underlying servers or infrastructure, and the platform automatically handles the entire process of training, deployment, and inference, greatly reducing the complexity of development and operations. In addition, RFT uses a multi-LoRA framework and streaming micro-batching to achieve efficient training and inference; the sketch below illustrates the multi-LoRA idea.
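As a point of reference for the multi-LoRA idea (this is not Predibase's implementation), the sketch below uses Hugging Face peft to show how several lightweight adapters can share one frozen base model, so that many fine-tunes are trained and served on the same hardware. The adapter repository names are hypothetical.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the shared, frozen base model once (a 32B model needs substantial GPU memory).
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", device_map="auto"
)

# Attach a first LoRA adapter under a name (hypothetical adapter repos).
model = PeftModel.from_pretrained(base, "your-org/adapter-task-a", adapter_name="task_a")
# A second adapter attaches to the same frozen base weights.
model.load_adapter("your-org/adapter-task-b", adapter_name="task_b")

# Switch adapters per request without reloading the 32B base model.
model.set_adapter("task_a")
```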
RFT also supports curriculum learning for complex tasks. Traditional methods usually need large amounts of labeled data covering many situations before a model can learn an effective strategy for a complex task. RFT instead trains the model on progressively harder examples, so that it can gradually handle more complex tasks; this is particularly effective where deep reasoning is required (see the sketch after this paragraph).
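The curriculum idea itself is simple to sketch. Assuming each training example carries a difficulty score (the `Example` class, stage count, and scoring scheme below are illustrative, not Predibase's), later stages fold harder examples into the training pool:

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    difficulty: int  # e.g. 1 = simple elementwise op, 3 = fused multi-stage kernel

def curriculum_stages(examples: list[Example], stages: int = 3):
    """Yield the training pool stage by stage: easy examples first,
    harder ones folded in as training progresses."""
    ordered = sorted(examples, key=lambda e: e.difficulty)
    for stage in range(1, stages + 1):
        cutoff = len(ordered) * stage // stages
        yield ordered[:cutoff]  # each stage trains on everything up to its cutoff
```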
In terms of model deployment, traditional approaches usually require additional tools and configuration, and it is difficult to guarantee high performance. Predibase's inference engine natively supports RFT-trained models and provides a high-performance serverless deployment path, so users can quickly move trained models into production with enterprise-grade service-level support.
RFT also generalizes better. Traditional methods can cause the model to overfit the labeled data and perform poorly on unseen inputs. Because RFT guides learning through the reward function, the model generalizes better to unseen data and is more robust in practical applications.
Predibase said that DeepSeek's open-sourcing of R1 had a huge impact on the global AI field and made many people realize how important reinforcement fine-tuning is for training large models. Inspired by this, they built this end-to-end serverless reinforcement fine-tuning platform.