A paradigm shift! Reinforcement learning is no longer exclusive to fine-tuning: Microsoft uses it directly for base-model pre-training

Written by Jasper Cole
Updated on: June 13, 2025

Microsoft's breakthrough research: Reinforcement learning is directly integrated into pre-training, allowing large models to upgrade from "rote memorization" to "real thinking".

Core content:
1. Analysis of the limitations of traditional pre-training and reinforcement learning
2. How the RPT method integrates reinforcement learning into the pre-training stage
3. The improvement of model reasoning ability brought by the new method and its potential impact


In a nutshell: reinforcement learning is no longer the "dessert" served after pre-training. RPT turns it into the "staple food" of pre-training itself, using the original text of the corpus as the answer key, rewarding correct predictions and withholding reward for wrong ones, so that the model "grows a brain" from the ground up.

Phase 1: Identifying core concepts

Imagine that most LLMs are like students frantically "doing exercises". Their learning method (pre-training) is to keep doing "fill-in-the-blank questions": given the preceding context, predict the next word (Next-Token Prediction). This method is very effective and lets the model absorb a huge amount of knowledge and language patterns, but it may end up memorizing surface associations rather than achieving real understanding and reasoning. On the other hand, reinforcement learning (RL) is like a coach: it trains models to complete specific tasks or align with human preferences through "rewards" and "punishments", which can effectively improve a model's reasoning ability. However, RL is usually applied only in the fine-tuning stage and requires expensive manually labeled data or domain-specific question-answer data, which makes it hard to scale up.

This paper attempts to combine the two, proposing a new method called Reinforcement Pre-Training (RPT).

Analysis of thesis motivation

The starting point of the paper is to solve the two core pain points of the existing technical route:

  • Limitations of traditional pre-training : The standard "Next-Token Prediction" (NTP) pre-training method, although highly scalable (it can be used with any amount of text data), tends to let the model learn the superficial co-occurrence relationship between words, that is, rote memorization, rather than encouraging the model to deeply understand and reason "why" the next word should be this.
  • Limitations of existing reinforcement learning applications : Reinforcement learning (RL) has been shown to effectively improve models' reasoning and alignment capabilities (e.g. RLHF). However, applying RL to LLMs faces challenges in scalability and generality. RLHF relies on expensive human feedback data, and the reward model can be hacked; RL with verifiable rewards (e.g. rewarding a correct answer to a math problem) relies on domain-specific datasets with labeled answers, whose limited volume makes them hard to use for general, large-scale pre-training.
  • Core motivation (bridging the gap):  The authors want to bridge "scalable but potentially shallow" self-supervised pre-training and "powerful but hard to scale" reinforcement learning. The goal is a new pre-training paradigm that, like traditional pre-training, can exploit massive amounts of unlabeled web text, while, like reinforcement learning, explicitly motivating the model to develop stronger reasoning rather than simple memorization. The "cherry on the cake" metaphor in Figure 1 of the paper is vivid: in traditional pipelines, pre-training (NTP) is the body of the cake and RL is just the cherry on top (fine-tuning); RPT wants the entire cake (pre-training) to be infused with the flavor of RL and reasoning.

Analysis of the main contributions of the paper

Main innovations

  • A new pre-training paradigm, RPT (Reinforcement Pre-Training), was proposed.
  • The traditional "predict the next word" task is redefined as a "next-token reasoning" task and trained using reinforcement learning.
  • A method is designed that allows reinforcement learning to be applied scalably to a large, general, unlabeled pre-training text corpus, rather than being limited to a specific labeled dataset.

Key technologies or methods

  • **Task Redefinition (Next-Token Reasoning)**: Before predicting the next word, the model is required to first generate a "thinking process" (similar to the Chain-of-Thought) and then give the predicted word.
  • **Intrinsic Verifiable Reward**: This is the key to achieving scale! The reward signal comes directly from the corpus itself - if the word predicted by the model matches the ground-truth in the corpus, a reward (such as +1) is given, otherwise no reward (such as 0) is given. This reward is objective, rule-clear, and automatic, does not require manual annotation, and greatly reduces the risk of rewards being "gamed".
  • Reinforcement learning framework : Using on-policy RL, the model generates multiple different "thinking + prediction" trajectories (rollouts) for a given context. The model parameters are then updated according to the reward (right or wrong) obtained by each trajectory, reinforcing the thinking processes that lead to correct predictions.

Significant results

  • Improved language model capabilities : The model trained with RPT significantly exceeds the base model trained with traditional methods in the most basic "predicting the next word" accuracy, especially when predicting difficult words (high entropy words). A 14B RPT model can even reach or exceed the performance of a larger 32B traditional model. This means that RPT allows the model to learn more "deeply".
  • Better fine-tuning foundation : The RPT pre-trained model provides a better starting point for subsequent reinforcement learning fine-tuning. Compared with the traditional pre-trained model, the RL fine-tuning for specific tasks based on it has better results. This shows that the goals of pre-training and fine-tuning are more consistent.
  • **Good scalability**: The paper shows that RPT has a good scaling law property, that is, as the amount of training calculations increases, the accuracy of the model predicting the next word will continue to improve predictably. This proves that RPT has the potential to become a sustainable technical path for the future scale development of models.
  • Changes in reasoning patterns : Analysis shows that RPT encourages the model to generate a different reasoning pattern from simple problem solving, using more hypothetical, deductive and other thinking methods, and engaging in more exploratory thinking.

Identification of Difficulties in Understanding

Key concepts/methods

  • Difference between standard next word prediction (NTP) and reinforcement learning objective functions.
  • How to transform a deterministic NTP task, which is usually trained with supervised learning (cross entropy loss), into an RL task that includes "exploration" and "reward".
  • The specific design of the reward signal (especially the prefix-matching reward in the paper, which works at the byte level and respects token boundaries).
  • How to apply concepts such as On-policy, Rollout, Trajectory, Policy Update in RL to generate "thinking process + prediction words".
  • “Next-Token Reasoning” itself: How the model learns to “think” before predicting a word.

The most challenging part

The most difficult thing to grasp may be this "paradigm shift": how the seemingly simple imitation task of "predicting the next word" is placed inside a reinforcement learning framework. Readers need to understand:

  • State = the current context.
  • Action = generating a whole passage (the thinking-process token sequence + the final predicted token).
  • Trajectory = the whole process of executing an action from that state.
  • Reward = a sparse reward (0 or 1) given only at the end of the trajectory, depending on whether the final predicted token matches the real token.
  • Policy = the model itself, which determines the probability distribution over which "thinking + prediction" trajectories are generated in a given state.

The goal of RL is to adjust the policy (the model parameters) to maximize the expected cumulative reward. This is a completely different optimization objective and process from NTP, which maximizes the log probability of the correct word at each position. A minimal sketch of this mapping follows below.
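To make the mapping concrete, here is a minimal sketch (field and function names are illustrative, not from the paper) of how one RPT step could be represented as an RL transition; the simplified string-prefix check stands in for the byte-level reward defined later:

```python
from dataclasses import dataclass

@dataclass
class RPTTrajectory:
    """One rollout for a single next-token-reasoning step (illustrative structure)."""
    context: str      # state: the text prefix the model conditions on
    thinking: str     # action, part 1: the chain-of-thought tokens
    prediction: str   # action, part 2: the final predicted continuation (the "conclusion")
    reward: float     # terminal reward: 1.0 or 0.0, assigned only at the end

def sparse_terminal_reward(prediction: str, ground_truth: str) -> float:
    """Simplified stand-in for the paper's byte-level prefix-matching reward."""
    return 1.0 if prediction and ground_truth.startswith(prediction) else 0.0
```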

Key concepts that need to be explained

The core mechanism loop of RPT: how "next-word prediction" is reconstructed into an RL-based "next-token reasoning" task, including how the model generates multiple trajectories with a thinking process (rollouts), how a simple verifiable reward is defined against the real tokens in the corpus, and how the model is updated via RL.  (This corresponds to Figure 3 and Formulas 3 and 4 in the paper.)

Concept Dependencies

Conceptual relationship analysis

  • First, understand what standard next-token prediction (NTP) is and what its limitations are (this is the background and starting point).
  • Second, understand the basic idea of reinforcement learning (RL), i.e. reward-driven learning, and the concept of chain-of-thought (CoT) reasoning (a model can generate intermediate thinking steps).
  • Then the paper's next-token reasoning task definition can be understood: it combines CoT with NTP.
  • Next, understand how a verifiable reward is cleverly designed on top of that task definition; this is the key bridge connecting massive text corpora and RL.
  • Finally, all of the above concepts are integrated into the overall RPT framework (Context -> Rollout -> Reward -> Update).

Best entry point

It’s best to start with a comparison: first describe how the familiar “standard next word prediction (NTP)” works, and then introduce the innovation of the paper - what if the model is allowed to “think” before making a prediction, and is rewarded based on whether the prediction is right or wrong? That is, compare “NTP” with “Next-Token Reasoning + RL reward”, which naturally leads to the core mechanism that we need to explain in detail.


Phase 2: In-depth explanation of core concepts

After understanding RPT's "ambition", let's tackle the hardest and most essential innovation of the paper: how does it turn predicting the next word into a reinforcement learning game? We will focus on the core mechanism of RPT: how "next-word prediction" is reconstructed into an RL-based "next-token reasoning" task (generate thinking trajectories -> define rewards -> RL update).

Metaphor from daily life: the apprentice writes a poem and the master corrects it

Let's imagine a scene: a calligraphy apprentice is learning to continue writing ancient poems. His master has a complete copy of "Three Hundred Tang Poems" (the pre-training corpus).

  • **Traditional method (NTP)**: The master gives the first half of a line, such as "The moon shines brightly in front of the bed," and the apprentice writes down the next character he thinks is most likely, such as "疑". The master then tells him to raise his confidence in "疑". The apprentice practices repeatedly, making his guesses more and more confident, but he may simply be memorizing this combination.
  • The RPT method of the paper : The master also gives the apprentice "The moon shines brightly before the bed." This time, the master asks the apprentice to:
  1. Not write the answer directly.
  2. Take out several sheets of scratch paper (e.g. 8 sheets, the Rollouts).
  3. On each sheet, write down his own thinking process (Reasoning / Chain-of-Thought), and at the end write the next character he derives (the Prediction). For example:
    • Draft 1: "Thinking: In front of the bed, the moon, the light, the mood is cold, maybe it is doubt... Conclusion: 疑 (doubt)"
    • Draft 2: "Thinking: What does the moonlight look like when it shines on the ground? Perhaps frost... Conclusion: frost"
    • Draft 3: "Thinking: What should come after 'light'? ... Conclusion: 照 (shine)"
    • ... (8 drafts in total)
  4. The master collects the 8 drafts and looks only at the "conclusion" at the end of each one.
  5. The master opens "Three Hundred Tang Poems": the next character of the standard answer is "疑" (the Ground Truth).
  6. The master grades (Reward):
    • Draft 1's conclusion is "疑", which matches the standard answer! The master gives this draft a "big red flower" (reward = 1).
    • Draft 2's conclusion "frost" does not match the standard answer. No red flower (reward = 0).
    • Draft 3's conclusion "照" does not match the standard answer. No red flower (reward = 0).
    • ...
  7. Learning feedback (RL Update) : After receiving the feedback, the apprentice adjusts himself: "Oh! The thinking path 'the mood is cold, perhaps it is doubt' earns red flowers! Next time I meet a similar scene, I will think more in that direction. The ideas 'it looks like frost' and 'what should follow light' did not earn red flowers, so I will use them less." The apprentice's goal is not to guess any single character, but to learn a way of thinking that reliably earns "big red flowers".

This metaphor illustrates the core mechanism: when faced with the same context, multiple attempts at "thinking + conclusion" are generated, simple rewards are given based on whether the conclusion matches the objective standard answer, and ultimately those thinking paths that can lead to correct conclusions are strengthened.

Establishing a correspondence between metaphors and actual technology

Let's match the elements in the metaphor with the technical concepts one by one:

| Key element of the metaphor | Corresponding technical concept | Why the correspondence holds |
| --- | --- | --- |
| The apprentice | The large language model (LLM), i.e. the policy $\pi_\theta$ | The model, like the apprentice, generates actions (thinking + prediction) from the current state (the context) and adjusts its own parameters based on the rewards it receives. |
| The master and "Three Hundred Tang Poems" | Pre-training corpus & reward calculation mechanism | The corpus provides the context and the objective standard answer (ground truth); the reward mechanism automatically scores the model output against that answer. |
| The first half of the line, "The moon shines brightly before my bed" | Context $x_{<t}$ | The input from which the model predicts, i.e. the state in reinforcement learning. |
| The next character "疑" in "Three Hundred Tang Poems" | Ground truth | The objective criterion for verifying whether the model's prediction is correct. |
| Taking out several (G = 8) sheets of draft paper | Rollouts: sampling $G$ responses $\{o_t^i\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid x_{<t})$ | The model samples multiple candidate responses for the same context, exploring different reasoning paths. |
| The thinking process on each draft ("the mood is cold...") | Chain-of-thought reasoning sequence | The intermediate tokens generated before the final answer, representing the model's "thinking". |
| The final conclusion character on each draft ("疑") | The next token(s) predicted by the model | The final output after "thinking", compared against the standard answer. |
| A whole draft (thinking process + conclusion) | A complete response / trajectory $o_t^i$ | A complete unit from state input to final output and reward evaluation. |
| The master awarding a "big red flower" (1 or 0) | Verifiable reward signal (Formula 3) | A simple, objective, binary reward based on whether the prediction matches the ground truth; the paper uses a prefix-matching reward. |
| The apprentice adjusting his thinking to earn more red flowers | RL update (Formula 4), adjusting the parameters $\theta$ | The model updates its parameters with an RL algorithm (GRPO in the paper) to increase the probability of generating high-reward trajectories (correct thinking + prediction) in the future. |
| (The master only asks the apprentice to continue difficult lines) | Entropy-based data filtering | The paper filters out tokens that are too easy to predict, so the model focuses on tokens that require reasoning to predict correctly. |

Diving into technical details

Now, we transition from "apprentice poetry writing" to actual technical principles and mathematical formulas. The core lies in reward definition and optimization goals.

Let's first look at the traditional NTP objective (Formula 1 in the paper):

  • Original mathematical form:

    $$\mathcal{J}_{\text{NTP}}(\theta) = \sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_{<t}\right)$$

  • Symbol replacement version: traditional training objective (model parameters $\theta$) = the sum, over all positions in the sequence, of [the log probability the model assigns to the "real next token" given the preceding context and the parameters].

  • Explanation : This is maximum likelihood estimation. At every position, the model tries to place as much probability mass as possible on the correct token. It corresponds to the apprentice guessing the character directly, with the master telling him to raise his confidence in the correct one. (A small code sketch follows.)
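As a point of reference, here is a minimal PyTorch-style sketch of this objective, assuming a causal LM that returns per-token logits (names are illustrative):

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token-prediction loss (negative log-likelihood, Formula 1).

    logits:    [batch, seq_len, vocab] unnormalized scores from a causal LM
    input_ids: [batch, seq_len] token ids of the training text
    """
    # Predict token t from positions < t: shift logits and targets by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    # Cross-entropy = -log P_theta(x_t | x_<t), averaged over all positions.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```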

Now let's look at the core of the RPT:

Key Technology 1: Definition of Reward Signal

For each context, the model generates $G$ trajectories, and we need to assign each trajectory a reward. The paper designs a "prefix-matching reward". Why so elaborate? Why not simply compare the predicted token with the ground-truth token? Because the prediction may span several tokens or involve words outside the vocabulary, a direct token-level comparison is awkward; instead both sides are converted to the byte level, and we check whether the prediction matches a legal prefix of the real continuation.

  • Original mathematical form (Formula 3):

    $$r^i = \begin{cases} 1, & \text{if } \hat{y}^i = y_{[1:l^i]} \ \text{and} \ l^i \in L_{\text{gt}} \\ 0, & \text{otherwise} \end{cases}$$

    (where $\hat{y}^i$ is the byte sequence of the $i$-th prediction and $l^i$ is its length in bytes; $y_{[1:l^i]}$ is the first $l^i$ bytes of the real continuation; and $L_{\text{gt}}$ is the set of byte lengths corresponding to all valid token boundaries in the real continuation).

  • Symbol replacement version: the reward of a trajectory is 1 if its predicted bytes exactly reproduce the beginning of the real continuation and stop at a valid token boundary; otherwise it is 0.

  • Explanation of the key steps (a code sketch of this check follows below):
  1. Convert the model's prediction and the real continuation text into byte sequences.
  2. Check whether the predicted byte sequence is strictly equal to the beginning of the real byte sequence.
  3. Check whether the length of the predicted byte sequence corresponds to the end position of a complete token in the real sequence (it must not match only half a token).
  4. If both conditions are met, the reward is 1; otherwise it is 0. This is a very clear, objective, binary signal.
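A minimal Python sketch of this prefix-matching reward, assuming a HuggingFace-style tokenizer with `encode`/`decode` methods (the function name and the exact token-boundary bookkeeping are illustrative, not the paper's code):

```python
def prefix_matching_reward(prediction: str, ground_truth: str, tokenizer) -> int:
    """Return 1 if `prediction` is a valid token-boundary prefix of `ground_truth` (Formula 3)."""
    pred_bytes = prediction.encode("utf-8")
    gt_bytes = ground_truth.encode("utf-8")

    # Byte lengths of every valid token boundary in the ground-truth continuation.
    boundary_lengths = set()
    cumulative = 0
    for token_id in tokenizer.encode(ground_truth, add_special_tokens=False):
        cumulative += len(tokenizer.decode([token_id]).encode("utf-8"))
        boundary_lengths.add(cumulative)

    l = len(pred_bytes)
    # Condition 1: the prediction reproduces the first l bytes of the ground truth.
    # Condition 2: l lands exactly on a token boundary (no half tokens).
    if l > 0 and gt_bytes[:l] == pred_bytes and l in boundary_lengths:
        return 1
    return 0
```

For instance, with a typical tokenizer and the continuation "the cat sat", predictions "the" and "the cat" would earn reward 1, while "the ca" would earn 0 because it stops in the middle of a token.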

Key Technology 2: RPT Optimization Objective

With the reward, the goal of the model is to maximize the expected reward through RL.

  • Original mathematical form (Formula 4):

    $$\mathcal{J}_{\text{RPT}}(\theta) = \mathbb{E}_{(x_{<t},\, x_{\geq t}) \sim \mathcal{D}} \;\, \mathbb{E}_{o \sim \pi_\theta(\cdot \mid x_{<t})} \left[\, r(o, x_{\geq t}) \,\right]$$

    (Note: strictly speaking, RL algorithms such as PPO/GRPO use these trajectories and their rewards to compute a policy gradient for the update; this formula expresses the core intent of maximizing the expected reward.)

  • Symbol replacement version: the RPT training objective (model parameters $\theta$) = the "expected value" of [the reward of each attempt], where the data (preceding context, true continuation) is sampled from the corpus and multiple attempts (thinking + prediction) are generated from the given context according to the current model policy.

  • Explanation of the key steps :
  1. Sample a context $x_{<t}$ and its real continuation $x_{\geq t}$ from the dataset $\mathcal{D}$.
  2. The model $\pi_\theta$ generates $G$ trajectories $o^1, \dots, o^G$ conditioned on $x_{<t}$.
  3. The reward $r^i$ of each trajectory is computed according to Formula 3.
  4. An on-policy RL algorithm (GRPO in the paper) uses these (state, action/trajectory, reward) triples to compute a gradient and update the model parameters $\theta$ (see the GRPO-style sketch below). The update increases the probability of trajectories that received reward 1 and decreases the probability of those that received reward 0. The model therefore learns not just to output the correct token, but to generate the entire thinking process that leads to it.
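To make step 4 concrete, here is a minimal sketch of a GRPO-style update for one context, assuming the per-trajectory log-probabilities under the current model are already available; the group-normalized advantage without PPO-style clipping or a KL term is a simplification, not the paper's exact recipe:

```python
import torch

def grpo_style_loss(logprobs: torch.Tensor, rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Simplified GRPO-style policy-gradient loss for one group of G rollouts.

    logprobs: [G] sum of log-probabilities of each full trajectory (thinking + prediction)
    rewards:  [G] binary prefix-matching rewards from Formula 3
    """
    # GRPO's key idea: use the group itself as the baseline.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # REINFORCE-style surrogate: raise the log-prob of above-average trajectories,
    # lower it for below-average ones. (PPO clipping and KL penalty omitted.)
    return -(advantages.detach() * logprobs).mean()
```

Maximizing the expected reward in Formula 4 then amounts to repeatedly minimizing this loss with an optimizer such as Adam.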

Mapping technical details to metaphors

  • Formula 3 (the reward) in the metaphor : it is the "rule" the master uses when grading. He compares the apprentice's conclusion character with the character in "Three Hundred Tang Poems". Not only must the characters match (byte prefix matching), the apprentice must also have written a complete character, not half of one (the token-boundary condition). If the rule is fully satisfied, the apprentice gets a big red flower (1); otherwise nothing (0). The rule is simple and clear, and the apprentice cannot bargain with the master or play tricks (avoiding reward hacking).
  • Formula 4 (the objective) in the metaphor : it is the apprentice's "learning goal". His ultimate goal is to maximize the expected number of "big red flowers". To achieve it, he must adjust his way of thinking (update the parameters $\theta$). If the path "think about the mood" earned a red flower, the apprentice reinforces that idea (raising its probability); if the path "what should follow 'light'" earned nothing, he weakens it (lowering its probability).
  • Comparison with Formula 1 (NTP) : the traditional NTP objective is equivalent to the apprentice guessing the character directly without writing down any thinking. The master does not hand out red flowers; he only says "your confidence in '疑' is not high enough", and the apprentice adjusts only that confidence. RPT, by contrast, rewards the entire process of "thinking and then drawing the correct conclusion".
  • Limitations of the metaphor :
    • It simplifies the actual mathematics of the RL update (advantage functions, gradient computation, etc.).
    • The correspondence between "character" and "token/byte", as well as the token-boundary set, is simplified.
    • The apprentice's "thinking" is conscious, while the model's token generation is sampling from a probability distribution.

Summary

The core connection lies in the metaphor of "the apprentice writing poetry", which vividly shows how RPT trains the model through "exploration" (multiple drafts), "objective evaluation" (awarding red flowers by checking against the answer), and "feedback learning" (reinforcing the ideas that earn red flowers).

  • This correspondence helps us understand that RPT does not let the model simply "remember" the next word, but forces the model to learn how to "deduce" the correct next word through the RL reward mechanism.
  • The most critical mathematical principle can be summarized by analogy: Formula 3 defines the objective standard of the "big red flower" (the prediction must exactly match the real answer), and Formula 4 defines the ultimate goal of the apprentice - to adjust the thinking strategy (model parameters) to maximize the expected number of "big red flowers".  This mechanism allows the model to transform from a "fill-in-the-blank person who memorizes answers" to a "strategist who learns how to think in order to score points".

Phase 3: Detailing the process steps

Now that we understand the core mechanism and metaphor of RPT, let’s break it down step by step to see what the entire data flow and processing process would look like if we were to implement RPT.

The following is the complete process by which Reinforcement Pre-Training (RPT) works. The whole process can be divided into a preparation stage and a cyclic training stage.

Pre-computation / Setup

Input Preparation

  • A base language model (DeepSeek-R1-Distill-Qwen-14B in the paper). This model already has basic language and reasoning ability and serves as the starting point for training (the apprentice enrolling).
  • A large-scale pre-training corpus (the OmniMATH mathematics corpus in the paper) containing a large number of text sequences.
  • A smaller proxy model (a 1.5B model in the paper) used to run the entropy-based data filtering described next.

Corpus filtering (optional, but adopted in the paper)

  • Input : the original corpus and the proxy model.
  • Process :
    • Traverse every text sequence in the original corpus.
    • For each position in a sequence, take the text preceding it as the context and feed it to the proxy model.
    • The proxy model computes the probability distribution of the next token and its entropy (e.g. the entropy over the top-K tokens). The higher the entropy, the harder the token is to predict and the less certain the model is; the lower the entropy, the easier it is to guess (such as the period at the end of a sentence).
    • Set an entropy threshold and keep only the positions whose next-token entropy exceeds it, together with their contexts. The purpose is to filter out tokens that can be predicted without any reasoning, so that compute is focused on the "difficult" tokens that require reasoning (the master only picks difficult lines to test the apprentice). A sketch of this step follows below.
  • Output : a filtered dataset of "difficult" prediction points, where each element can be viewed as a pair (context, real continuation).
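A minimal sketch of this entropy-based filter, assuming the proxy model is a causal LM whose forward pass returns logits (the threshold value and top-K choice are illustrative):

```python
import torch

def is_hard_position(model, input_ids: torch.Tensor, top_k: int = 16, threshold: float = 0.5) -> bool:
    """Keep a position for RPT training only if the proxy model's next-token entropy is high.

    model:     a causal LM (e.g. a HuggingFace model) whose output exposes `.logits`
    input_ids: [1, seq_len] token ids of the context
    """
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]           # next-token logits at the last position
    probs = torch.softmax(logits, dim=-1)
    top_probs = torch.topk(probs, top_k).values
    top_probs = top_probs / top_probs.sum()               # renormalize over the top-K candidates
    entropy = -(top_probs * torch.log(top_probs)).sum().item()
    return entropy > threshold                            # high entropy = hard to predict = keep
```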

Hyperparameter Setting

Set the learning rate, the batch size, the number of trajectories G generated per context by the RL algorithm (GRPO/PPO; G = 8 in the paper), the sampling temperature (which controls exploration; 0.8 in the paper), the maximum generation length, and so on.
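Collected as a configuration sketch, using only the values mentioned in this article; the HuggingFace model identifier is assumed and the remaining numbers are placeholders, not the paper's settings:

```python
rpt_config = {
    "base_model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",  # assumed HF id of the starting checkpoint
    "num_rollouts_G": 8,           # trajectories sampled per context
    "temperature": 0.8,            # sampling temperature during rollout generation
    "batch_size": 256,             # placeholder: not specified in this article
    "learning_rate": 1e-6,         # placeholder: not specified in this article
    "max_generation_length": 1024, # placeholder: not specified in this article
}
```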

Training Loop

This stage will iteratively execute many steps, each of which includes the following processes:

Batch Sampling

  • Input : the filtered dataset and a batch size B.
  • Process : randomly sample B examples from the dataset to form a batch. Each example is a pair (context, real continuation).
  • Output : a batch of B samples.

Rollout Generation (On-Policy Sampling)

  • Input : the current model, a sample from the batch, the number of trajectories G, the sampling temperature, and a preset prompt template.
  • Process :
    • Wrap the context in the prompt template (for example, the prompt tells the model: "think about the next token, write down your reasoning, and put the final answer inside \boxed{}").
    • Feed the constructed input to the current model.
    • Sample from the model at the set temperature, repeating G times to independently generate G different complete responses.
    • From each response, parse out the thinking-process token sequence (for example, the span between the special markers <think> and </think>) and the final predicted token sequence (for example, the content of the last \boxed{}).
    • The thinking and the prediction together form one trajectory.
    • Do this for all B samples in the batch.
  • Output : for each sample in the batch, a set of G trajectories together with the corresponding real continuation. A parsing sketch follows below.
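A minimal sketch of the parsing step, assuming the response uses <think>...</think> markers and a final \boxed{...} answer as described above (the regex details are illustrative):

```python
import re

def parse_rollout(response: str) -> tuple[str, str]:
    """Split one sampled response into (thinking, prediction)."""
    think_match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else ""
    # Take the content of the last \boxed{...} as the predicted continuation.
    boxed = re.findall(r"\\boxed\{(.*?)\}", response, re.DOTALL)
    prediction = boxed[-1].strip() if boxed else ""
    return thinking, prediction

# Example:
# parse_rollout("<think>the mood is cold, maybe doubt</think> The answer is \\boxed{疑}")
# -> ("the mood is cold, maybe doubt", "疑")
```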

Reward Calculation

  • Input : for one sample, the predicted part of each of the G trajectories generated by the model, plus the actual continuation text.
  • Process :
    • Precompute the byte sequence of the real continuation and the set of byte lengths corresponding to all valid token boundaries within it.
    • For each trajectory's prediction:
      • Convert it to a byte sequence and record its byte length.
      • Apply the reward rule (Formula 3): check whether it equals the first bytes of the real continuation and whether its length belongs to the set of valid token boundaries.
      • If both conditions hold, assign reward 1.
      • Otherwise, assign reward 0.
    • Perform this for all trajectories of all B samples in the batch.
  • Output : a reward value for every trajectory of every sample in the batch. The whole batch now consists of (context, trajectory, reward) triples; a batch-level usage sketch follows below.
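Reusing the `prefix_matching_reward` sketch from earlier, the batch-level computation could look like this (purely illustrative glue code; the dictionary keys are assumptions):

```python
def score_batch(batch, tokenizer) -> list[list[int]]:
    """Compute rewards for every rollout of every example in the batch."""
    all_rewards = []
    for example in batch:
        rewards = [
            prefix_matching_reward(prediction, example["continuation"], tokenizer)
            for (_thinking, prediction) in example["rollouts"]
        ]
        all_rewards.append(rewards)   # e.g. [1, 0, 0, 1, 0, 0, 0, 0] for G = 8 rollouts
    return all_rewards
```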

Model parameter update (Policy Update)

  • Input : the whole batch of data, including every context, every generated trajectory (all thinking and prediction tokens), the corresponding rewards, and the current model.
  • Process :
    • Use an on-policy RL algorithm (the paper uses GRPO, whose core idea is similar to PPO).
    • The algorithm recomputes the (log-)probabilities of these trajectories under the current model.
    • It combines the reward signals (possibly with a baseline or advantage estimate to reduce variance) with the trajectory probabilities to compute a policy gradient.
    • Based on this gradient and the learning rate, an optimizer (such as Adam) updates the model's parameters. The goal of the update is to maximize the expected reward (Formula 4), i.e. to increase the probability of trajectories that earned reward 1 and decrease the probability of trajectories that earned reward 0.
  • Output : updated model parameters, yielding the new model used in the rollout-generation step of the next training iteration.

Loop : Repeat the steps of data sampling -> trajectory generation -> reward calculation -> model parameter update until the preset number of training steps is reached or convergence occurs.
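Putting the stages together, a skeleton of the training loop might read as follows; every function named here is either one of the sketches above or an assumed helper, not the paper's actual code:

```python
def rpt_training_loop(model, tokenizer, dataset, optimizer, num_steps: int, G: int = 8):
    """Skeleton of the RPT loop: sample -> rollout -> reward -> policy update."""
    for step in range(num_steps):
        batch = sample_batch(dataset)                                    # assumed helper: B (context, continuation) pairs
        for example in batch:
            responses = generate_rollouts(model, example["context"], G)  # assumed helper: G sampled responses
            example["rollouts"] = [parse_rollout(r) for r in responses]
        rewards = score_batch(batch, tokenizer)                          # Formula 3 rewards, 0/1 per rollout
        loss = policy_gradient_loss(model, batch, rewards)               # assumed helper: GRPO/PPO-style surrogate
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                                 # Formula 4: raise the expected reward
    return model
```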

Evaluation

After training:

  • Input : the trained RPT model and the contexts of a test set.
  • Process :
    • For language modeling tasks : given a context, let the model generate its thinking process plus a prediction and score the prediction's accuracy (see the sketch below); or decode the next token directly, greedily or by taking the most probable token as a traditional model would, and measure accuracy.
    • For downstream tasks : either run zero-shot tests directly, or use the model as the starting point for further reinforcement learning fine-tuning on a task-specific dataset.
  • Output : the model's performance metrics on the various tasks.
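A minimal accuracy-evaluation sketch for the reasoning mode, reusing `parse_rollout` and `prefix_matching_reward` from above (`generate_one_response` is an assumed decoding helper):

```python
def next_token_accuracy(model, tokenizer, test_set) -> float:
    """Fraction of test positions where the reasoned prediction prefix-matches the ground truth."""
    correct = 0
    for example in test_set:
        response = generate_one_response(model, example["context"])   # assumed greedy / low-temperature decode
        _thinking, prediction = parse_rollout(response)
        correct += prefix_matching_reward(prediction, example["continuation"], tokenizer)
    return correct / len(test_set)
```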

Phase 4: Experimental design and verification analysis

A new method must be tested in experiments before it can stand firm. Now let’s become reviewers and examine the experimental part of the RPT paper to see how the author builds a chain of evidence to prove that RPT is advanced and effective.

Interpretation of the main experimental design: verification of the core argument

The core proposition of the paper

As a new pre-training paradigm, RPT can: (1) improve the model’s basic language modeling capabilities (i.e., the accuracy of next word prediction); (2) stimulate the model’s reasoning capabilities; (3) provide a better foundation for subsequent RL fine-tuning; and (4) have good computational scalability (Scaling property).

Analysis of the rationality of the main experimental design and selection

The authors designed several key experiments to directly respond to these claims:

  • Language model performance test (Table 1, Figure 4):  directly verifies claim (1).
  • Extensibility law test (Figure 5):  Directly verify claim (4).
  • Subsequent RL fine-tuning tests (Table 2):  directly verify claim (3).
  • Zero-shot downstream task testing (Table 3):  directly verifies claims (2) and (1).

Let's look at the rationality of the choice:

Datasets
  • Training/Validation: OmniMATH (more than 4,000 math competition problems and solutions).
    • Reasonableness : It is reasonable to choose a mathematical data set because mathematical texts naturally contain rigorous logic and reasoning processes, which are very suitable for verifying the goal of RPT "stimulating reasoning". The prediction of the next word often depends on the understanding of the previous mathematical concepts and steps, rather than simple pattern matching.
    • Limitations : The conclusion of the paper also acknowledges that it is currently trained mainly on mathematical corpora, and its effectiveness on texts in broader, general fields (such as news and novels) still needs to be verified in future work.
  • RL Fine-tuning: Skywork-OR1 (questions with verifiable answers).
  • Zero-Shot Evaluation: MMLU-Pro (multi-task understanding) and SuperGPQA (graduate-level, interdisciplinary reasoning problems).
    • Reasonableness : These are all recognized and challenging benchmarks for measuring the general capabilities and complex reasoning capabilities of models. They cover a wide range of fields and are of high difficulty. They can effectively test whether the models trained by RPT really have stronger and transferable reasoning capabilities.
Metrics
  • Next-Token Prediction Accuracy: the accuracy of predicting the next token.
    • Reasonableness : this is the most direct indicator of the model's basic language-modeling ability and directly reflects whether RPT improves it. In particular, the authors split the test data into Easy / Medium / Hard difficulty levels and report accuracy separately. This design is critical and reasonable, since it can reveal whether RPT brings gains especially on the "hard" tokens (the ones that genuinely require reasoning).
  • Accuracy on downstream tasks (MMLU-Pro, SuperGPQA, Skywork-OR1): task accuracy.
    • Reasonableness : a standard metric for measuring a model's ability to solve specific tasks and reason.
  • $R^2$ (coefficient of determination): measures the goodness of fit of the scaling-law curve.
    • Reasonableness : it quantifies how well the experimental data points match the fitted scaling trend, showing that the performance improvement is predictable and stable.
Baselines
  • R1-Distill-Qwen-14B: The direct base model of the RPT-14B model in the paper. This is the core comparison object. The author also evaluates it in two modes: (a) standard next word prediction mode; (b) reasoning mode (i.e., it also generates thought process and then predicts during testing).
  • Qwen2.5-14B: the base model from which R1-Distill-Qwen-14B was derived.
  • R1-Distill-Qwen-32B: A model with much larger number of parameters.
    • Reasonableness : The selection is very reasonable and representative.
  1. Compared with the Base model (14B) of the same size, the gain of the RPT method itself is demonstrated.
  2. The comparison of the Base model in "inference mode" is specially added to exclude the possibility that the performance improvement comes only from the form of "thinking during testing" rather than the RPT training process itself (the idea of ​​the ablation experiment).
  3. The comparison with a larger model (32B) is to demonstrate the efficiency of RPT. It is very convincing to see whether the 14B RPT model can match or even surpass the larger model not trained with RPT.
  4. Table 2 also adds a "+ Continual NTP training" baseline, i.e. continuing to train the Base model on the same data but with the traditional NTP objective. This shows that the improvement does not come simply from "training longer on OmniMATH data" but from the RPT training method itself.

How the main experimental results support the core contribution

  • Table 1 & Figure 4 : RPT-14B has higher next word prediction accuracy than Base 14B model at all difficulty levels. It is particularly noteworthy that Base 14B has extremely low accuracy in inference mode (1.41-3.31%), proving that the model does not know how to "infer the next word" without RPT training; while RPT-14B's performance even matches or exceeds that of the 32B model, directly supporting the claim that RPT significantly improves language modeling capabilities.
  • Table 2 : After subsequent RLVR fine-tuning, the final performance of the model starting with RPT-14B (58.3) is higher than the model starting with Base 14B (52.7), and higher than the model starting with Base 14B + NTP training (13.0). This directly supports the claim that RPT can provide a better foundation for subsequent RL fine-tuning.
  • Table 3 : The zero-shot performance of RPT-14B (inference mode) on MMLU-Pro and SuperGPQA not only surpasses Base 14B (both modes), but even significantly surpasses the 32B model (standard mode), which strongly supports the claim that RPT can improve the general reasoning ability of the model.
  • Figure 5 : As the amount of training compute increases, accuracy rises steadily, and the power-law fit is very tight (high $R^2$), supporting the claim that RPT scales well.
  • Conclusion : The main experiment forms a closed loop, which quantitatively supports the core contribution of the paper from four aspects: basic ability, reasoning ability, fine-tuning potential and scalability, by comparing with appropriate and powerful baselines on standard datasets and indicators.

Ablation Experiment Analysis: Contribution of Internal Components

Strictly speaking, the paper does not have a typical "Ablation Study" table that removes modules one by one. However, the author has achieved the effect of an ablation experiment through clever comparative experiments , verifying the necessity of key designs:

Key Module/Design 1: RPT training process itself (vs. thinking only at inference time)

  • Verification experiment : In Table 1, the comparison is RPT-14B (trained with RPT, evaluated in reasoning mode) vs. R1-Distill-Qwen-14B in next-token-reasoning mode (not trained with RPT, only prompted to reason at inference time).
  • Corresponding innovation : RPT training paradigm, learning how to perform "next word inference".
  • Results and proof : Without RPT training, the Base model's accuracy in reasoning mode is dismal (e.g. Hard: 1.41), far below its own standard prediction mode (20.43) and far below RPT-14B (23.75). This huge gap quantitatively shows that the improvement does not come from the mere form of "generating a thinking process at inference time" but from RPT training, through which the model truly learned how to think effectively in order to predict the next token. The RPT training process is necessary and irreplaceable.

Key Module/Design 2: Type of training objective (RL objective of RPT vs. traditional NTP objective)

  • Verification experiment : In Table 2, the starting points for RL fine-tuning are compared: RPT-14B vs. R1-Distill-Qwen-14B + Continual NTP training. The latter continues training the Base model on the same data as RPT but with the traditional NTP objective.
  • Corresponding innovation : RPT is pre-trained through RL objectives, which can provide a better basis for fine-tuning.
  • Results and proof : Using the NTP target to continue training, the model's reasoning ability (Before RL column) dropped sharply (from 51.2 to 10.7), and subsequent RL fine-tuning also recovered slowly (13.0). RPT-14B provided a high starting point (56.3) and end point (58.3). This proves that the performance improvement does not come from "having seen these training data", but from the consistency of the RL training target adopted by RPT and the subsequent RL fine-tuning target. The RL-based RPT training method is necessary to build a better fine-tuning foundation.

(Appendix A) Reward Function Design

The paper mentioned in the appendix that they tried different reward function designs (such as matching only the first token, dense rewards, etc.) and found that the performance was comparable to the proposed prefix matching reward. This shows that the RPT framework has a certain robustness to the specific details of the reward function, and its core advantage may come more from the framework itself of "reconstructing NTP as an RL task", as long as the reward is based on correctness.

Analysis of in-depth/innovative experiments: insight into the intrinsic characteristics of the method

In addition to conventional comparison and ablation experiments, the authors designed two very clever experiments to provide deeper insights:

Experiment 1: Reasoning Pattern Analysis (Figure 6, Table 4, Appendix F)

  • Experiment type : Visualization/qualitative analysis + case study + statistical analysis.
  • Purpose of the experiment : to show that the thinking process elicited by RPT's "next-token reasoning" differs in nature from the "structured problem solving" the model performs on ordinary tasks, and to show intuitively what the model is "thinking", demonstrating that it is not simple pattern matching.
  • Experimental Design :
  1. Define 6 types of reasoning patterns (Transition, Reflection, Breakdown, Hypothesis, Divergent, Deduction) and their keywords.
  2. Two models were compared: the RPT-14B (performing a next-word reasoning task) and the Base 14B model (performing a standard math problem-solving task).
  3. The proportion of keywords in various reasoning patterns during the thinking process generated by the two models was statistically analyzed (Figure 6).
  4. Show specific thinking process text samples of RPT-14B (Table 4 and Case Studies) and conduct qualitative analysis.
  • Experimental conclusion and value :
    • The statistical results (Fig. 6) clearly show that problem solving relies more on breakdown (decomposing the problem), while RPT's next-token reasoning uses significantly more hypothesis (conjecture) and deduction patterns.
    • The examples (Table 4) intuitively show how the model analyzes semantics, proposes multiple possibilities ("Alternatively..."), reflects on itself ("Wait..."), considers text structure clues, etc.
    • This experiment reveals the deep characteristics of the RPT method: it not only improves the accuracy, but also changes the model's internal "way of thinking", making it more exploratory and reasoning, which is highly consistent with the goal of "promoting deeper understanding rather than superficial memory" claimed in the paper and provides a mechanistic explanation.

Experiment 2: Scaling Analysis by Difficulty (Figure 5)

    • Experiment type : parameter sensitivity analysis (for computational parameters) + robustness/stress testing (for data difficulty).

    • Experimental purpose : to show that RPT is not only scalable, but that its scalability is stable and reliable across data of different difficulty levels, and to see whether adding compute keeps improving performance on hard tokens.

    • Experimental Design :

    1. Not only does it draw the overall scaling curve, but it also divides the data into three categories: Easy/Medium/Hard based on the entropy value.
    2. Plot accuracy against training compute (FLOPs) for each of the three subsets, fit a power-law formula, and compute the goodness of fit ($R^2$); a fitting sketch appears after this section.
  • Experimental conclusion and value :

    • The results show that at all difficulty levels, performance improves steadily with compute and is highly consistent with a power law (the $R^2$ values are very high).
    • This reveals the robustness and potential of the RPT method: it is not a technique that can only solve simple problems. Increasing computing power can continuously and predictably improve the performance of the model on difficult samples. This proves that the foundation of RPT as a "scaling paradigm" is solid, and provides confidence and theoretical basis for investing more computing power in larger-scale RPT training in the future. At the same time, combined with the results in Table 1 (RPT has improved significantly on Hard data), it shows that the value of RPT is more reflected in difficult tasks.
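For reference, fitting such a scaling curve can be sketched as follows; the saturating power-law functional form and the data points are illustrative assumptions, not the paper's formula or numbers:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(compute, a, b, alpha):
    """Assumed saturating power law: accuracy approaches `a` as training compute grows."""
    return a - b * np.power(compute, -alpha)

# Illustrative data: normalized training compute and next-token accuracy for one difficulty bucket.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # e.g. FLOPs / FLOPs_min
accuracy = np.array([0.30, 0.33, 0.36, 0.38, 0.40])

params, _ = curve_fit(scaling_curve, compute, accuracy, p0=[0.5, 0.2, 0.5], maxfev=10000)
pred = scaling_curve(compute, *params)
r_squared = 1 - np.sum((accuracy - pred) ** 2) / np.sum((accuracy - np.mean(accuracy)) ** 2)
print(f"fitted parameters: {params}, R^2 = {r_squared:.3f}")
```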

    Through these carefully designed experiments, the authors constructed a complete demonstration chain from performance indicators to internal mechanisms, from current effects to future potential, which fully demonstrated the effectiveness and advancement of the RPT method.