Rethink what you know: a new paradigm for self-checking and self-correction in large models, with no manual labeling required

This paper presents a new method that lets large models check and correct themselves, breaking with the traditional paradigm of relying on manual annotation.
Core content:
1. Motivation: how to make large models examine and verify their own answers during complex reasoning
2. Main innovation: a reinforcement learning framework that trains the model to solve problems and self-verify at the same time
3. Key results: large gains in both problem-solving accuracy and self-verification accuracy
❝To sum it up in one sentence: this paper teaches the big model to play both sides at once. It has to answer the questions and grade them itself, and if it grades badly it gets penalized. You could call it a split-personality way of learning.
Phase 1: Identifying core concepts
This stage is introductory: no formulas or symbols yet, only natural-language description.
1. Motivation analysis of the paper
This paper studies how to make a large language model (LLM) not only "compute the correct answer" during complex reasoning, but also "examine" and "verify" its own work. A common reinforcement learning scheme today gives the model a "verifiable reward": when the model's answer can be programmatically judged right or wrong (as in math or programming problems), that right/wrong signal guides learning. However, the authors found that existing practice often leads to "superficial self-examination": the model may answer correctly, yet never really learns to rigorously reflect on or verify its own conclusions.
To address this, the authors propose a new idea: within the same reinforcement learning process, teach the model both "how to solve problems" and "how to judge whether its answers are correct". They integrate the two tasks into one unified training procedure: in each training iteration, the model solves problems on the spot, then scores its own solutions, and the parameters are updated from both signals together. The motivation is to tie "problem solving" and "self-verification" closely together during training, so the model forms a genuine, effective self-checking mechanism instead of merely learning its surface form.
Simply put: "the model should be able both to solve problems and to act as the referee", and these two abilities should improve together in the same reinforcement learning loop.
2. Analysis of the main contributions of the paper
List the main innovations claimed in the paper
The main innovations claimed in the paper can be summarized as follows:
A single online RL process for both tasks. Previous frameworks that train problem solving and self-verification together often let the model learn to solve first and then train a verifier "from scratch", or simply insert self-reflection prompts into the solving process. The authors emphasize that here, "problem solving" and "verification" receive feedback and improve inside the same online reinforcement learning process.
Verifiable rewards supervise both the solver and the verifier. The binary reward (1 or 0) originally given for a correct answer is also used to judge whether the model's verification output matches the actual outcome: since we can programmatically judge whether a solution is correct, we can equally judge whether the "predicted score" (the score the model assigns during self-verification) is accurate. In this way, the model is reinforced not only to produce the right answer but also to verify it reliably.
Significantly improved self-verification and more reliable reasoning. In experiments, the authors show that their method substantially improves both problem-solving accuracy and self-verification accuracy. Moreover, the self-verification ability helps at inference time as well, making the final results more robust.
Identify the key technologies or methods that support these innovations
The key technology supporting these innovations is integrating the two processes of "problem solving" and "verification" into a single online reinforcement learning loop, with a reward mechanism based on "verifiable outcomes" for both. In the same learning iteration, the model must both try to generate the correct answer and try to evaluate the correctness of that answer accurately; the two abilities promote each other through the shared reward signal and the shared policy update.
What are the most significant results of the paper? They need not be numerical; results can also matter for what they demonstrate.
The most notable or meaningful results in the paper include:
In the math problem-solving setting, the model not only clearly exceeds the baseline in accuracy, but also learns much better "self-scoring": its accuracy at predicting whether its own answer is correct is far higher than that of ordinary models. At inference time, self-verification can be used for answer re-ranking or weighted voting, letting the model exploit its confidence in the reasoning steps and further improve performance. This suggests the model has learned a deeper form of self-awareness rather than mere surface-level answer matching.
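To make the inference-time use concrete, here is a minimal sketch of verification-weighted voting over sampled answers; the input format and weighting rule are illustrative assumptions rather than the paper's exact procedure.

```python
from collections import defaultdict

def verification_weighted_vote(candidates):
    """candidates: list of (answer_text, self_verification_score) pairs,
    where the score is the model's own confidence that the answer is correct.
    Returns the answer with the largest accumulated weighted vote."""
    votes = defaultdict(float)
    for answer, confidence in candidates:
        votes[answer] += confidence  # identical answers accumulate weight
    return max(votes, key=votes.get)

# Example: three sampled answers; the model is most confident about "4".
print(verification_weighted_vote([("4", 0.9), ("5", 0.3), ("4", 0.8)]))  # -> "4"
```

The same scores could instead be used simply to re-rank the candidates and keep the top one.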
3. Identification of Difficulties in Understanding
Analyze which concepts/methods are key to understanding the paper
To deeply understand this paper, readers need to grasp the following key concepts or methods:
Verifiable rewards. Unlike traditional manual scoring or human-preference scoring, the rewards here are right/wrong signals that can be judged automatically. For example, in a math problem, if the model's final answer matches the reference answer the reward is 1, otherwise 0. The mechanism applies to both "problem solving" and "self-verification" (a minimal sketch of such a verifier appears after this list).
Online reinforcement learning with "self-verification". The core idea is to include both the answer-generation process and the subsequent self-verification process in the same reinforcement learning trajectory. This "train simultaneously" idea, and the "generate first, then verify, then update" procedure, need careful attention.
How self-verification helps problem solving in turn. The paper does not just train a verifier; it stresses that after training, the model has learned to reflect while solving. During reasoning it is more inclined to check its own thinking rather than simply produce output according to the prompt.
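As promised above, here is a minimal sketch of such an automatic verifier for math answers; the answer-extraction pattern is an assumption for illustration, since the real judge depends on the dataset's answer format.

```python
import re

def outcome_verifier(solution_text, reference_answer):
    """Return 1 if the extracted final answer matches the reference, else 0.
    Assumes the solution ends with something like 'final answer = 4'."""
    match = re.search(r"final answer\s*=\s*(.+)$", solution_text.strip(), re.IGNORECASE)
    if match is None:
        return 0                       # no parsable answer -> no reward
    predicted = match.group(1).strip()
    return 1 if predicted == str(reference_answer).strip() else 0

print(outcome_verifier("2+2=4, final answer = 4", 4))  # -> 1
print(outcome_verifier("2+2=5, final answer = 5", 4))  # -> 0
```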
Find out which of these concepts are most challenging
The most challenging parts, or the ones most likely to feel abstract to readers, are:
Why must these two things happen in the same training loop, and how are online generation and online verification actually implemented? Reinforcement learning has to be done "per trajectory", and here the trajectory contains both the token stream of the "solving action" and the token stream of the "verification action". How the two are put together, how the overall reward is computed, and how the policy is updated for both at once are the hardest parts to grasp.
Identify key concepts that require focused explanation
From the above analysis, "how to integrate problem solving and self-verification into the same online reinforcement learning framework" is the most innovative and least intuitive core concept of the paper. Specifically, there are two points:
how verifiable rewards are used for each of the two tasks (problem solving and self-verification), and how these two subtasks reinforce each other within an on-policy reinforcement learning loop.
4. Concept Dependencies
Sorting out the relationships between core concepts
Combining the above analysis: readers first need to know what verifiable rewards are and how they differ from traditional reinforcement learning rewards before they can understand why self-verification can also be "scored". On top of that, they need to understand the online reinforcement learning procedure: the model first solves, then verifies, and then the policy parameters are updated from both together. Finally comes the effect of self-verification at inference time: how it changes the model's internal reasoning behavior.
Determine the best entry point for explanation
A suitable entry point: first introduce verifiable rewards → then show how online reinforcement learning folds problem solving and self-verification into one cycle → finally show that the self-verification behavior learned this way also helps at inference time, leading to better problem-solving performance.
Phase 2: In-depth explanation of core concepts
The focus of this stage is to introduce metaphors and link metaphors with formula symbols.
1. Design life metaphors
Choose an everyday scenario or an activity that is easy to understand
Let’s use the analogy of students having to both answer questions and grade themselves in an exam.
Use this metaphor to show how the core mechanics work
Suppose there is a student named Xiao Ming who takes a math test. The test format is very special:
- Doing the test: Xiao Ming must first answer every question on the paper.
- Self-evaluation: before handing in the paper, Xiao Ming must also attach a judgment of whether each of his answers is correct.
- External scoring: finally, an "absolutely fair automatic marking system" judges whether Xiao Ming's answers are correct, and also checks whether "Xiao Ming's self-evaluation" was accurate.
Xiao Ming's score is based on both his answers and his self-evaluations, and consists of two parts:
- Problem-solving points: a correct answer earns 1 point. If the answer is also in the required format (for example, written in the designated area of the answer sheet), an extra point may be given; otherwise no points are given, or points may be deducted.
- Self-evaluation points: if Xiao Ming says "I got this question right" and the system confirms it is right, he earns self-evaluation points. Conversely, if he says it is right but it is actually wrong, his self-evaluation counts as a wrong judgment and earns nothing; likewise, if he says "I got it wrong" when it is actually right, he also loses the self-evaluation points.
Because this exam is run as "online learning", after each round Xiao Ming reflects on the scores the system gives him (problem-solving points + self-evaluation points), improves his solving and self-grading strategies, and then takes the next round.
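To make the scoring rule concrete, here is a tiny sketch of how one question's total score could be computed in this exam; the exact point values are illustrative assumptions, not the paper's reward scheme.

```python
def round_score(answer_correct, self_assessment_says_correct):
    """Toy version of the exam scoring rule.
    answer_correct: did the automatic marking system accept the answer?
    self_assessment_says_correct: did Xiao Ming claim his answer was right?"""
    solving_points = 1.0 if answer_correct else 0.0
    # Self-evaluation points are earned only when the claim matches reality.
    self_eval_points = 1.0 if self_assessment_says_correct == answer_correct else 0.0
    return solving_points + self_eval_points

print(round_score(True, True))    # 2.0: right answer, accurate self-evaluation
print(round_score(False, True))   # 0.0: wrong answer, over-confident self-evaluation
print(round_score(False, False))  # 1.0: wrong answer, but honest self-evaluation
```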
"Xiao Ming is doing the test" "Xiao Ming rates himself (self-evaluation)" "Automatic Grading System" “After getting the total score, we will improve our strategy in the next round” “Xiao Ming is doing a problem” ↔ “The model is generating problem-solving answers (problem-solving trajectory)”
Xiao Ming will eventually write out the detailed problem-solving steps and answers; the model will output tokens step by step until the final answer is obtained. “Xiao Ming scores himself (self-evaluation)” ↔ “The model verifies the answers it generates (self-verification trajectory)”
Xiao Ming will write a sentence like "I think I got it right" or "I think I got it wrong"; the model will score and explain the answer it just produced. “Automated grading system” ↔ “outcome verifier”
When Xiao Ming completes the answer and self-evaluation, an automatic system will calculate whether Xiao Ming's answer is correct and whether Xiao Ming's self-evaluation is correct. The model uses a rule/program to score whether the model's answer is correct or not, and also scores the accuracy of the model's judgment in the self-verification phase. “After getting the total score, we will improve the strategy in the next round” ↔ “Reinforcement learning algorithms such as PPO will update the problem-solving strategy + verification strategy at the same time”
The next time Xiao Ming answers a question or self-evaluates, he may think more carefully. In the next training iteration, the model will update its parameters through policy gradients to learn better and more robust problem-solving and self-verification methods. Original mathematical form:
Symbol-substitution version: "For Xiao Ming to get a higher total score, we adjust his answering strategy in the direction that raises his score. Assume that at every step of every question, Xiao Ming has a probability of writing down each possible answer fragment (token). We keep fine-tuning these probability distributions so that behaviors leading to higher scores become more likely, and behaviors leading to lower scores become less likely."
Here $x$ is the question (prompt), $y$ is the model's complete answer sequence, and $\theta$ denotes the model parameters. $\pi_\theta(y \mid x)$ can be understood as "the preference distribution in Xiao Ming's mind over all possible ways of writing the answer."
The key quantity is $A_t$ (the advantage function): "how much better the output at this step is, right now, than average performance". If it is better than average, its probability should be increased; otherwise it should be suppressed.
Original mathematical form: a common form of the advantage function, over a group of sampled attempts, is:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

In practice a baseline or another estimator (such as GAE) may be used instead, which can be written as:

$$A_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual at each step.
Symbol replacement version: "The value (advantage) of a certain answer step = the total reward for this question - the level that people usually get on similar questions." For example: replace "A_i = (r_i - mean(r))/std(r)" with "Advantage value of a certain attempt = (reward for this attempt - average reward for all attempts)/volatility of rewards for all attempts" (This example is the concept of a general advantage function, the specific implementation in the paper may be slightly different, but the core idea is the same)
When $A_t$ is positive, "writing down this step" was better than average and its probability should be increased; when it is negative, the probability should be reduced.
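A small numeric sketch of the group-normalized advantage described above; a real implementation would typically add a small epsilon to the denominator, and the paper's exact estimator may differ.

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """Advantage of each attempt = (its reward - mean reward) / std of rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Eight rollouts of the same question, three of them correct (reward 1).
print(group_normalized_advantages([1, 0, 0, 1, 0, 0, 1, 0]))
# Correct attempts get positive advantages, incorrect ones negative.
```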
Original mathematical form: (the paper does not give a single unified formula for verifiable rewards, but describes the mechanism)
Reward for solving: $r^{\text{solve}} = 1$ if the final answer is correct (and properly formatted), otherwise $0$ (or a penalty). Verification reward: $r^{\text{verify}} = 1$ if the model's self-judgment agrees with the answer's actual correctness, otherwise $0$. Symbol-substitution version: "If Xiao Ming not only answers correctly but also correctly judges whether he got it right, he collects both the solving reward and the self-evaluation reward; otherwise he loses one or the other."
- Solving reward: if the final answer is correct and the format is correct, the solving reward is 1; if the answer is correct but the format is wrong, it may earn nothing or a small penalty; if the answer is wrong, the reward is 0 or -1.
- Verification reward: if the score or judgment the model gives in the self-verification phase agrees with the actual right/wrong outcome, the verification reward is 1, otherwise 0.

Original mathematical form (simplified PPO core objective):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(\rho_t(\theta)\,A_t,\ \operatorname{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\right]$$

where $\rho_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the ratio between the new and old policies.
Symbol replacement version: "We want Xiao Ming to adjust his answer preferences appropriately each time based on the difference between his last test strategy and this test strategy. We cannot make a drastic change all at once (limit the range of change through a 'clipping' operation) to avoid going to extremes."
In the metaphor, "Xiao Ming doing the test + Xiao Ming's self-evaluation" is the model generating an answer + the model generating a verification conclusion; together they form one complete "answer trajectory". The score the "automatic grading system" gives that trajectory corresponds to the verifiable reward. "Xiao Ming reflects on his score and takes a better strategy into the next exam" corresponds to a PPO policy update: after each update, the policy parameters move in the direction of "solving more correctly and verifying more reliably". The whole process repeats for many rounds, and the answers and self-evaluations of each round are sampled on-policy, which keeps the verification policy and the solving policy tied to each other.

- $\pi_\theta(a_t \mid s_t)$ (the probability the policy network assigns to an action) corresponds to "Xiao Ming's tendency to write down a particular word or symbol given the current question and the steps already written."
- $A_t$ (the advantage function) corresponds to "after Xiao Ming wrote down a certain step, how much higher his final score was than his average score when writing this step in similar situations."
- $r^{\text{solve}}$ (solving reward) and $r^{\text{verify}}$ (verification reward) correspond to "the scores the automatic marking system gives to Xiao Ming's answering part and self-evaluation part, respectively."
- The PPO clip operation corresponds to "the teacher telling Xiao Ming that even if he did very well or very badly this time, he should not overhaul his study methods all at once, but proceed step by step."

Xiao Ming's behavior policy is like the model's "solving policy + self-checking policy"; the exam scoring system is the "verifiable reward mechanism" that scores both solving and self-evaluation; and Xiao Ming improving his solving and self-grading after each test corresponds to the online policy-gradient / reinforcement-learning update of the model.

At a high level, each training round proceeds as follows:

- Batch sampling: take a batch of questions from the training data; the model generates answers for each question, producing several complete solutions.
- Answer evaluation and verifiable reward: each solution is submitted to a "verifiable external judge" that checks correctness and computes the solving reward.
- Construct the self-verification task: based on the model's own answer, build a new "verification prompt" asking the model to score or judge the answer it just produced; then use the external judge to compare the model's verdict with the true outcome to get the verification reward.
- Concatenate the solving and verification trajectories and update together: treat each question's "solving sequence" and "self-verification sequence" as one batch when computing advantages and policy gradients, and update the model parameters with an RL algorithm such as PPO.
- Repeat the above steps for many rounds; the model's solving policy and self-verification policy improve together.
In more detail, and focusing on how each step's output becomes the next step's input:

Step 1 — Prepare the input. The input dataset consists of "questions (with their reference answers)", such as math or programming problems. In each iteration, a batch of questions is sampled first (e.g. batch size = 1024).

Step 2 — The model generates solutions (generation phase). For each question, the model outputs the solution steps and the final answer under the current policy. It can sample several times (e.g. rollout = 8) to produce multiple candidate solutions. These solution outputs (token sequences) are recorded and passed to the judging phase.

Step 3 — The external judge computes the "solving reward". This is the outcome verifier (OV) mentioned in the paper: it compares the final answer against the reference answer, checks whether the format is valid, and assigns a discrete reward (e.g. 1/0/-1). This yields a batch of training samples of the form (question, solution, solving reward).

Step 4 — Construct a "verification problem" and let the model verify itself. Using the solving result (question & answer & reward) as context and a fixed template, a verification prompt is built that asks the model to score or judge its previous answer. In this "self-verification phase" the model again generates text, including an explanation of "why it is right or wrong" and a numerical verdict (e.g. 1 for believing it is right, 0 for believing it is wrong).

Step 5 — The external judge computes the "verification reward". The model's self-verification verdict is compared with the actual right/wrong outcome. If the model says "the answer is correct" and it actually is, the verification reward is 1; otherwise it is 0 or a negative value. This yields new training samples of the form (verification prompt, verification output, verification reward).

Step 6 — Merge the "solving trajectory" and "verification trajectory" into one RL update batch. The token sequence generated while solving and the token sequence generated during self-verification are each labeled with their own reward (one the solving reward, one the verification reward) and their own advantage estimates, then merged into a single PPO (or other RL) parameter update.

Step 7 — Repeat. After this update, the model's policy (both "how to solve" and "how to verify") changes at the same time; the next round repeats the cycle of "sample questions → solve → compute reward → self-verify → compute reward → merge and update", iterating until training converges or the preset maximum number of steps is reached.

The input/output chain between steps is explicit: the output of step 1 (a batch of questions) is the input of step 2. The output of step 2 (the model's solutions), together with the questions from step 1, is the input of step 3, which computes the solving reward. The output of step 3 (the solving reward / correctness), together with the question and answer from steps 1 and 2, is the input of step 4, which builds the verification prompt. The output of step 4 (the model's verification verdict) and the output of step 3 (the actual correctness) are the input of step 5, which computes the verification reward. Everything produced in steps 2-5 (questions, answers, solving rewards, verification prompts, verification outputs, verification rewards, and the corresponding generated token sequences) is the input of step 6, the reinforcement-learning update. The model updated in step 6 is then used in step 2 of the next iteration.
Make sure your metaphors are simple and intuitive, preferably with scenarios that most people are familiar with.
In this scenario, Xiao Ming doing the test corresponds to generating the solution, and Xiao Ming grading himself corresponds to self-verification. The automatic grading system is like the "verifiable reward" in the paper: it judges right and wrong directly, without human intervention, and so gives an unambiguous score.
2. Establish a correspondence between metaphors and actual technology
List the key elements of the metaphor
Explain the actual technical concepts corresponding to each element
Explain why these correspondences are reasonable
This metaphor works because it captures what the paper calls "online self-verification": solving and verifying happen within the same process, influence each other, and are updated jointly by the reward signal, rather than verification being split off into an independent post-processing module.
3. Dive into technical details
Transition from metaphor to actual technical principles
Building on the metaphor, we now use the key formulas and algorithmic terms from the paper to explain the reinforcement learning mechanism that "trains solving and self-verification simultaneously". Readers can keep the metaphor in mind and map each abstract symbol back to the "student taking an exam" scenario.
Explain the relevant mathematical formulas or algorithms
1. Policy Gradient
2. Advantage Function
3. Verifiable Reward
In this way, each "answer trajectory" contains not only the token sequence of the solving action but also the token sequence of the verification action. Within the same RL algorithm, the solving reward and the verification reward together provide the overall reward signal.
4. The role of PPO (Proximal Policy Optimization)
Explain the key steps in technical implementation
The paper mainly uses the PPO algorithm, which combines the reward values and policy probabilities described above to perform the update. PPO has techniques to control the update magnitude, such as the clip operation or a KL penalty. The core idea is to push both the solving policy and the verification policy toward "higher scores" while preventing any single update from being too extreme.
4. Map technical details to metaphors
Explain how each technical step is represented in the metaphor
Explain how metaphors can help understand technical details
The metaphor lets us picture abstract reinforcement learning concepts (policies, rewards, advantage functions, PPO updates) as a student taking exams, being graded, and improving, which makes the core mechanism of "training problem solving and self-verification together, online" much more intuitive.
Map the concepts in the mathematical formula to specific behaviors or phenomena in the metaphor
Point out the limitations of the metaphor, if any
Real reinforcement learning involves many hyperparameters and implementation details, such as the learning rate, how the value function is updated, and entropy regularization; none of that complexity appears in the "Xiao Ming's exam" scenario. Still, the metaphor helps readers understand "how solving and self-evaluation are scored and updated at the same time within one exam".
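For a feel of those extra knobs, a hypothetical configuration sketch is shown below; apart from the batch size and rollout count, which echo numbers used in the walkthrough, none of these values come from the paper.

```python
# Hypothetical PPO hyperparameters (illustrative only, not from the paper).
ppo_config = {
    "learning_rate": 1e-6,      # how fast the policy parameters move each update
    "clip_eps": 0.2,            # range of the PPO clipping operation
    "kl_coef": 0.01,            # optional KL penalty against the reference policy
    "gamma": 1.0,               # discount factor over tokens
    "gae_lambda": 0.95,         # GAE mixing coefficient for advantage estimation
    "entropy_coef": 0.0,        # entropy regularization strength
    "rollouts_per_problem": 8,  # sampled solutions per question
    "batch_size": 1024,         # questions per training iteration
}
```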
5. Conclusion
Reaffirming the core connection between metaphor and actual technology
Through the everyday scenario of "Xiao Ming must both answer the questions and grade himself, after which an automatic system does the objective marking", we have built a metaphor for the paper's core mechanism of "problem solving + self-verification + verifiable rewards + online RL training".
Emphasize how this correspondence helps understand the entire concept
We can see that each element of the exam story maps onto one component of the training framework, which is what makes the whole picture easy to hold in mind.
Use metaphors to summarize the most critical mathematical principles
This correspondence lets us visualize the otherwise complex reinforcement learning framework and the self-verification idea. Once readers grasp this process, it becomes easy to see why putting "problem solving" and "self-verification" in the same training loop lets the model learn genuinely useful "self-audit" skills. The most critical mathematical principle, the policy gradient, works just as Xiao Ming adjusts his "solving and self-grading habits (policy)" according to his "advantage" on each exam (doing better or worse than usual), while PPO keeps that adjustment stable.
Phase 3: Detailing the process steps
1. What is the specific end-to-end process by which the model or solution proposed in the paper works? Give a detailed example of how an input is processed by the model and method to obtain the output.
The core method of the paper is an online reinforcement learning framework used to train a large language model's "problem solving (generation)" and "self-verification (verification)" abilities simultaneously. The entire process breaks down into the main stages walked through step by step above.
This ensures that during training, problem solving and self-verification are "bound" to the same RL trajectory: the model must score its own answer immediately after producing it, and it receives the corresponding reward or penalty signals at the same time.
The whole process, and in particular how each step's output feeds the next step's input, was expanded in detail in that walkthrough.
2. The focus of this stage is on the detailed description of the process. You need to ensure that the input of one process can be used as the input of the next process after being processed by this process.
The walkthrough above already guarantees this: the input/output chain described at the end of the step list spells out exactly how each step's output becomes the next step's input.
3. You should ensure that someone who has not read the paper can accurately restore the specific process pseudocode of the method or process based on your detailed description
To let readers "write it out themselves by following along", a simplified pseudocode is given below that captures the key logic described above. It can be read as a summary of the paper's algorithm flow:
```python
# Assume that we use PPO as the backbone RL method
# Initialize the policy (pi_theta) and the value function (V_phi) from the model parameters
policy_model, value_model = initialize_policy_and_value()

for iteration in range(num_iterations):
    # Step 1: Sample problems
    problem_batch = sample_problems(batch_size)  # list of problems

    generation_trajectories = []    # stores (question, solution tokens, solving reward, log_probs, values)
    verification_trajectories = []  # stores (verification prompt, verification tokens, verification reward, log_probs, values)

    # Steps 2 & 3: Generate solutions and obtain solving rewards
    for problem_prompt in problem_batch:
        # Under the current policy, sample multiple solutions (rollouts)
        for _ in range(num_generation_rollouts):
            # generated_solution_tokens: list of tokens
            # log_probs_gen: log probability for each token
            # values_gen: value estimate for each state (token prefix)
            generated_solution_tokens, log_probs_gen, values_gen = \
                policy_model.generate_with_logprobs_values(problem_prompt)
            generated_solution_text = detokenize(generated_solution_tokens)

            # Evaluate with the external judge
            reward_for_solution = outcome_verifier(problem_prompt, generated_solution_text)  # e.g., 0 or 1

            # Record the solving trajectory
            generation_trajectories.append({
                "prompt": problem_prompt,
                "solution_tokens": generated_solution_tokens,
                "reward": reward_for_solution,
                "log_probs": log_probs_gen,
                "values": values_gen,
            })

    # Steps 4 & 5: Self-verification phase and verification rewards
    for gen_traj_data in generation_trajectories:
        problem_prompt = gen_traj_data["prompt"]
        solution_text = detokenize(gen_traj_data["solution_tokens"])
        actual_solution_correctness = (gen_traj_data["reward"] == 1)  # True if the solution was correct

        verification_prompt = make_verification_prompt(problem_prompt, solution_text)

        for _ in range(num_verification_rollouts):
            # verification_output_tokens: list of tokens
            # log_probs_ver: log probability for each token
            # values_ver: value estimate for each state (token prefix)
            verification_output_tokens, log_probs_ver, values_ver = \
                policy_model.generate_with_logprobs_values(verification_prompt)
            verification_output_text = detokenize(verification_output_tokens)

            # Determine whether the self-verification is accurate
            # parse_verification_assessment: extracts the model's belief
            # (e.g., True if the model thinks the solution is correct)
            model_assessment_is_correct = parse_verification_assessment(verification_output_text)
            reward_for_verification = 1.0 if (model_assessment_is_correct == actual_solution_correctness) else 0.0

            verification_trajectories.append({
                "prompt": verification_prompt,
                "verification_tokens": verification_output_tokens,
                "reward": reward_for_verification,
                "log_probs": log_probs_ver,
                "values": values_ver,
            })

    # Step 6: Merge the trajectories and do one RL update (PPO)
    # generation_trajectories and verification_trajectories are combined into one batch
    # (computing GAE advantages, formatting, etc.), then one or more PPO optimization steps run.
    # Note: the solving trajectories and verification trajectories compute their advantages
    # and losses separately, but both contribute gradients to the same policy network (pi_theta).
    ppo_update_step(policy=policy_model, value_function=value_model,
                    generation_data=generation_trajectories,
                    verification_data=verification_trajectories)
# end for
```
In this pseudocode:
- `outcome_verifier`: computes the solving reward based on whether the solution is correct.
- `make_verification_prompt`: builds the verification-task prompt from the question and the answer.
- `parse_verification_assessment`: parses, from the model's verification output, its judgment of the original answer (for example, whether the model thinks the original answer is "right" or "wrong").
- `ppo_update_step`: the core PPO update logic, which uses all collected trajectory data (both solving and verification) to compute the loss and update the policy network and value network.
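For completeness, here are hedged sketches of the two prompt-related helpers; the template mirrors the wording used in the worked example below, and the parsing rule is a deliberately simple assumption rather than the paper's actual implementation.

```python
def make_verification_prompt(problem_prompt, solution_text):
    """Build a self-verification prompt from the question and the model's own answer.
    (Hypothetical template; the paper uses its own fixed wording.)"""
    return (
        f"The following is the answer to the question '{problem_prompt}': "
        f"'{solution_text}'. "
        "Please determine whether this answer is correct "
        "(output 'correct' or 'wrong') and explain the reason."
    )

def parse_verification_assessment(verification_output_text):
    """Return True if the model claims the solution is correct, else False.
    (Toy parser; a real one would extract an explicit score or label.)"""
    text = verification_output_text.lower()
    if "incorrect" in text or "wrong" in text:
        return False
    return "correct" in text

print(parse_verification_assessment("This solution is wrong. 2+2 should equal 4."))  # -> False
```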
By the end of training, the policy is both stronger at solving problems and better at self-criticism while answering: it knows when it is likely to be wrong.
Complete example: a math problem from input to output
Input

A question from the training set: "Question: Find the value of 2 + 2." Reference answer: 4.

Model solving (generation phase)

The model receives "Question: Find the value of 2 + 2." as `problem_prompt` and generates `generated_solution_tokens`, which might be ["2", "+", "2", "=", "5", ",", " final", " answer", "=", "5"] (detokenized: "2+2=5, final answer=5"). The per-token `log_probs_gen` and `values_gen` are recorded at the same time.

External judge computes the "solving reward"

`outcome_verifier("Question: Find the value of 2 + 2.", "2+2=5, final answer=5")` returns `reward_for_solution = 0`. This entry is appended to `generation_trajectories`.

Constructing the "verification problem" and letting the model verify itself (verification phase)

`make_verification_prompt("Question: Find the value of 2 + 2.", "2+2=5, final answer=5")` produces: "The following is the answer to the question 'Question: Find the value of 2 + 2.': '2+2=5, final answer=5'. Please determine whether this answer is correct (output 'correct' or 'wrong') and explain the reason."
The model receives this `verification_prompt` and generates `verification_output_tokens`, which might detokenize to: "This solution is wrong. 2+2 should equal 4." The per-token `log_probs_ver` and `values_ver` are recorded at the same time.

External judge computes the "verification reward"

`parse_verification_assessment("This solution is wrong. 2+2 should equal 4.")` returns `model_assessment_is_correct = False` (the model thinks the original answer is wrong). `actual_solution_correctness` is also `False` (because the original answer "2+2=5" is indeed wrong). Since `model_assessment_is_correct == actual_solution_correctness` (False == False), `reward_for_verification = 1.0`. This entry is appended to `verification_trajectories`.

PPO update

`ppo_update_step` uses the data in `generation_trajectories` and `verification_trajectories` (tokens, rewards, log_probs, values) to compute advantages and losses and to update the policy and value parameters. For the solving part, because `reward_for_solution = 0`, the model learns to lower the probability of producing wrong answers such as "2+2=5". For the verification part, because `reward_for_verification = 1.0`, the model learns to raise the probability of correctly judging that "2+2=5 is wrong".

Through many such iterations, the model eventually outputs "2+2=4" more reliably, and can also accurately judge and score its own answers, forming a "self-consistent" policy that can both solve problems and review its own solutions.