A brief explanation of GRPO in DeepSeek: a magical algorithm in reinforcement learning

Written by
Audrey Miles
Updated on: July 13, 2025
Recommendation

An in-depth interpretation of the GRPO algorithm in reinforcement learning, exploring how it improves the decision-making ability of agents.

Core content:
1. The importance and application of the GRPO algorithm in reinforcement learning
2. Basic concepts of reinforcement learning: intelligent agents, states, actions, and rewards
3. Two types of strategies: deterministic strategies and stochastic (random) strategies


In this era of rapid development of artificial intelligence, reinforcement learning is a key technology: it lets machines learn how to act well through constant trial and error, much as humans do. Today, we will talk about GRPO (Group Relative Policy Optimization), a very powerful algorithm in reinforcement learning. Next, we will take you into the wonderful world of DeepSeek's GRPO (see Paper Review: DeepSeek-R1: Enhanced Reasoning Ability of Large Language Models Driven by Reinforcement Learning) and see what it is all about.

1. Understanding the “microcosm” of reinforcement learning

Before understanding GRPO, let's take a look at the "microcosm" of reinforcement learning. Imagine an agent, a "little explorer" living in a specific environment. The environment can be in various situations, called "states". In each state the agent must make a choice, and these choices are "actions". When the agent takes an action, the environment gives it feedback based on the effect of that action, and this feedback is the "reward". If the action works well, the reward is high; if it works poorly, the reward is low, or even a punishment. The goal of reinforcement learning is for the agent to learn the best set of behavioral strategies so that it collects the most reward over its long-term interaction with the environment.

For example, let a robot find an exit in a maze. Each position in the maze is a state, and the robot can choose to move forward, turn left, turn right, etc. If the robot finds the exit, it will get a big reward; if it hits the wall, it may get a small penalty. The robot learns how to get out of the maze as quickly as possible by constantly trying different ways of moving. This is the process of reinforcement learning.
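
To make the state-action-reward loop concrete, here is a minimal Python sketch of a toy maze. The grid size, reward values, and the MazeEnv class are illustrative assumptions made up for this article, not part of any particular library.

```python
import random

class MazeEnv:
    """A tiny illustrative 3x3 maze: states are grid cells, the exit is at (2, 2)."""
    def __init__(self):
        self.state = (0, 0)   # starting position
        self.exit = (2, 2)

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        r, c = self.state[0] + dr, self.state[1] + dc
        if not (0 <= r <= 2 and 0 <= c <= 2):
            return self.state, -1.0, False    # bumped into a wall: small penalty, stay put
        self.state = (r, c)
        if self.state == self.exit:
            return self.state, 10.0, True     # found the exit: big reward
        return self.state, -0.1, False        # ordinary step: tiny cost, encourages short routes

env = MazeEnv()
done = False
while not done:
    action = random.choice(["up", "down", "left", "right"])  # a purely random strategy, for illustration
    state, reward, done = env.step(action)
print("Reached the exit at", state)
```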

In reinforcement learning (DeepSeek-R1: The synergistic power of reinforcement learning and knowledge distillation), a "strategy" (what the RL literature calls a policy, the "P" in GRPO) is like a guide for the agent's actions. There are two types. One is a deterministic strategy, which works like a fixed rule: whenever a certain state is encountered, the agent chooses the same fixed action. For example, whenever the robot is at a particular intersection in the maze, it always turns left. The other is a stochastic (random) strategy, which assigns a probability to each possible action, and the agent samples its action from these probabilities. When the robot is at an intersection, it chooses to go forward, turn left, or turn right with certain probabilities, which gives it a chance to explore different paths.
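
As a rough sketch (the intersections, actions, and probabilities below are invented purely for illustration), the two kinds of strategy can be written in Python like this:

```python
import random

# Deterministic strategy: a fixed rule that maps each state to exactly one action.
deterministic_strategy = {
    "intersection_1": "left",      # always turn left here
    "intersection_2": "forward",   # always go straight here
}

# Stochastic strategy: each state maps to a probability distribution over actions.
stochastic_strategy = {
    "intersection_1": {"forward": 0.3, "left": 0.5, "right": 0.2},
}

def choose_action(strategy, state):
    rule = strategy[state]
    if isinstance(rule, str):                 # deterministic: always the same action
        return rule
    actions, probs = zip(*rule.items())       # stochastic: sample according to the probabilities
    return random.choices(actions, weights=probs, k=1)[0]

print(choose_action(deterministic_strategy, "intersection_1"))  # always 'left'
print(choose_action(stochastic_strategy, "intersection_1"))     # 'left' about half the time
```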

There is also the "value function", which is used to evaluate how good a state or an action is. The state-value function evaluates how good a state is. For example, in the maze, some positions are closer to the exit, so their state value is higher; other positions are boxed in by walls and hard to escape from, so their state value is low. The action-value function evaluates how good a particular action is in a particular state. At a certain position in the maze, walking forward may bring the robot closer to the exit, so that action has high value; if walking forward leads to a dead end, that action has low value. The value function is closely tied to the strategy: it helps the agent know which states and actions bring more reward, which in turn makes the strategy better.
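
In standard RL notation (these are the textbook definitions, not something stated in the original text), the state-value and action-value functions of a strategy $\pi$ are the expected discounted sums of future rewards:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right],$$

where $\gamma \in [0, 1)$ is a discount factor. A maze cell next to the exit has a high $V^{\pi}(s)$ because the large exit reward is only a lightly discounted step or two away.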


In addition, the Actor-Critic architecture plays an important role in reinforcement learning. The Actor is the part of the agent responsible for learning and updating the strategy and for selecting actions based on the current state. The Critic is like a critic: it evaluates the value of states and provides feedback to the Actor, telling it which actions were well chosen and which need improvement. Working together, the two let the agent learn a good strategy more effectively.
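
Below is a minimal tabular sketch of the Actor-Critic idea, assuming a toy problem with two states, two actions, and made-up learning rates; real systems replace these tables with neural networks.

```python
import math

states, actions = ["s0", "s1"], ["a0", "a1"]
actor_logits = {(s, a): 0.0 for s in states for a in actions}   # Actor: action preferences -> softmax policy
critic_value = {s: 0.0 for s in states}                         # Critic: estimated value of each state
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.9                # illustrative learning rates and discount

def policy_probs(s):
    """Softmax over the Actor's preferences for state s."""
    exps = {a: math.exp(actor_logits[(s, a)]) for a in actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_update(s, a, reward, s_next):
    # Critic: the TD error says how much better or worse things went than expected.
    td_error = reward + gamma * critic_value[s_next] - critic_value[s]
    critic_value[s] += alpha_critic * td_error
    # Actor: raise the probability of the chosen action in proportion to the TD error.
    probs = policy_probs(s)
    for b in actions:
        grad = (1.0 if b == a else 0.0) - probs[b]   # gradient of the log-probability of action a
        actor_logits[(s, b)] += alpha_actor * td_error * grad

# Example: the agent took action "a0" in state "s0", received reward 1.0, and moved to "s1".
actor_critic_update("s0", "a0", reward=1.0, s_next="s1")
```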

2. GRPO debuts: like a smart "little coach"

Now, the protagonist GRPO arrives! GRPO (DeepSeek Success Inspiration: From TRPO to GRPO Training LLM) is a reinforcement learning algorithm that helps the model learn better, like a smart "little coach". Its core approach is to sample a group of actions, compare them against each other, and then make small, controlled updates to the model based on how each member of the group performs relative to the rest.

For example, suppose a robot is playing a "treasure hunt" game. In the game, every time the robot encounters an intersection, it has to choose a path to take. At the beginning, the robot has no idea which path will lead to the treasure, so it can only choose randomly. At this time, GRPO begins to play a role.

GRPO lets the robot try different paths, which is how it explores different possibilities. Starting from its current strategy, the robot samples several paths and then compares their outcomes to see which path is smoother and more likely to lead to the treasure. Finally, based on this comparison, the robot makes a small adjustment to its strategy so that next time it is more likely to choose the better paths.

For example, suppose the robot encounters three roads at a certain intersection: Road A, Road B and Road C. It first walks each road several times and records the results. After these trials, it finds that on Road A it found small treasures 2 times out of 3; on Road B, only once out of 3; and on Road C, every time. The robot now knows that Road C works best. However, it does not switch to choosing only Road C: it still occasionally walks Road A and Road B, because those roads might hold new discoveries later. Moreover, when the robot adjusts its strategy it does not become extreme all at once; it does not jump from choosing a road at random to choosing only Road C forever. Instead, it slowly increases the probability of choosing Road C, say from 30% to 50%. In this way the robot can exploit the good paths it has already found while continuing to explore the others, so it does not miss any possible opportunity.
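
Using the numbers from this example (treasures found out of 3 tries as each road's reward), the comparison and the gentle probability adjustment might look like the sketch below; the 0.2 step size and the renormalization are illustrative choices, not GRPO's exact update rule.

```python
rewards = {"Road A": 2, "Road B": 1, "Road C": 3}     # treasures found in 3 tries of each road

mean_reward = sum(rewards.values()) / len(rewards)     # group average = 2.0
advantages = {road: r - mean_reward for road, r in rewards.items()}
print(advantages)   # {'Road A': 0.0, 'Road B': -1.0, 'Road C': 1.0}

# A gentle, non-extreme update: nudge probabilities toward roads with positive advantage.
probs = {"Road A": 1/3, "Road B": 1/3, "Road C": 1/3}
step_size = 0.2                                        # a small step keeps the change controlled
new_probs = {road: max(p + step_size * advantages[road], 0.01) for road, p in probs.items()}
total = sum(new_probs.values())
new_probs = {road: p / total for road, p in new_probs.items()}  # renormalize so they sum to 1
print(new_probs)    # Road C rises to roughly 53%, while A and B are still explored occasionally
```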


3. GRPO’s Magical Steps

  1. Group sampling
    In GRPO, when the robot is in a certain state, that is, at a certain position in the game, it "fishes out" a group of actions according to its current strategy, just like casting a net. For example, at the intersection above, it selects several different ways of moving from all the possible ones according to the strategy. This is group sampling. This step gives the robot several different directions to try, so it has the opportunity to explore multiple possibilities.
  2. Reward scoring
    After the robot has tried the different paths, it needs to score them. This is where the reward function helps. The reward function is like a referee, giving each path a score based on the outcome of choosing it. If the robot finds a lot of treasure along a certain path, that path scores high; if it walks for a long time and finds nothing, the score is low. This score is an evaluation of the quality of the action (that is, of the path chosen).
  3. Advantage calculation
    After calculating the score of each path, the robot also needs to see whether each path is better or worse than the average level. This is advantage calculation. For example, if the average number of treasures found by the paths tried by the robot is 2, and path A finds 3 treasures, then the advantage of path A is positive, indicating that it is better than the average level; if path B only finds 1 treasure, then its advantage is negative, which means it is worse than the average level. Through advantage calculation, the robot can clearly know the relative goodness of each action.
  4. Policy update
    Once it knows each action's advantage, the robot can adjust its strategy. For actions with positive advantage, the robot increases the probability of choosing them in the future; for actions with negative advantage, it reduces that probability. However, the robot does not change too drastically all at once, and it will not pick a single action every time just because its advantage is positive; it still keeps a certain degree of exploration. This is the policy update.
  5. Stability guarantee: KL divergence constraint
    In order to prevent the robot from going too far when adjusting its strategy, GRPO also sets a "safety rope", which is the KL divergence constraint. Its function is to ensure that the new strategy is not too different from the original strategy. Just like when the robot adjusts its routing strategy, it will not suddenly change from the original random routing to a completely different and particularly strange routing method. This ensures that the robot's learning process is stable and the learning effect will not deteriorate due to sudden large changes.
  6. Ultimate goal: increase rewards
    The ultimate goal of GRPO is for the robot to collect more and more treasure in this "treasure hunt" game. It keeps repeating the steps above so that the robot's strategy gets better and better and the paths it chooses are more likely to lead to treasure, while keeping the strategy stable rather than letting it become erratic in the pursuit of high rewards. (The formula sketched after this list puts these steps together.)
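
Putting these steps together: in the DeepSeek papers, the GRPO objective combines a clipped policy-ratio term weighted by a group-normalized advantage with a KL penalty toward a reference model. A simplified, sequence-level form (omitting token-level details) looks roughly like this:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, \hat{A}_i,\ \operatorname{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\ 1 - \varepsilon,\ 1 + \varepsilon \right) \hat{A}_i \right) - \beta\, D_{\mathrm{KL}}\left( \pi_\theta \,\Vert\, \pi_{\text{ref}} \right) \right]$$

Here $q$ is the current state (the prompt or intersection), $o_1, \ldots, o_G$ are the $G$ sampled actions or outputs (group sampling), $r_i$ are their rewards (reward scoring), $\hat{A}_i$ is the group-relative advantage (advantage calculation), the clipped ratio keeps each adjustment small (policy update), and the $\beta$-weighted KL term is the "safety rope" toward the reference policy $\pi_{\text{ref}}$ (stability guarantee).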

4. The Power of GRPO

  1. Reduce fluctuations and stabilize learning
    GRPO updates the strategy by comparing a group of actions rather than relying on the result of a single action, which greatly reduces fluctuation in the updates. In the "treasure hunt" game, if you decided how to walk next time based on a single attempt at one path, you might get lucky, find treasure once, and keep taking that path even though it usually fails. GRPO looks at the results of a whole group of actions, like pooling the experience of many attempts, so the signal is more stable and the learning process is smoother.
  2. Control change and prevent loss of control
    The KL divergence constraint is a "safety rope" that keeps the strategy changes within a reasonable range. During the learning process, if the strategy changes too much, the robot may suddenly become unable to play the game. With this constraint, the robot will make small steps every time it adjusts its strategy, and will not make big mistakes, ensuring the stability and reliability of learning.
  3. Improve efficiency and save resources
    GRPO does not need to try all possible actions to know which one is good. Through group sampling and comparison, it can quickly find relatively good actions and then update the strategy. This is like in the "treasure hunt" game, the robot does not need to walk all the roads in the maze to find the treasure, but only needs to try some roads to know which roads are more worth walking. This can save a lot of time and energy and improve learning efficiency.

5. The wonderful application of GRPO in large language models

Now, many large language models, including familiar chatbots, have also begun to use GRPO to improve their capabilities. When we give a chatbot a question, or "prompt", it generates several different answers, just as the robot in the GRPO example samples several paths. This is group sampling: the chatbot tries to answer the question in several different ways.

Then, there will be a reward model to evaluate the quality of these answers. The reward model is like a strict teacher, scoring each answer based on accuracy, logic, language fluency, etc. If the answer is accurate and organized, and the language is fluent, the score will be high; if the answer is irrelevant, the score will be low.

Next, the advantage of each answer is calculated to see which answers are better than the average and which are worse than the average. Based on this result, the chatbot will adjust its "answer strategy" and be more inclined to generate answers with high scores in the future. At the same time, in order to ensure the stability of the answers and prevent them from suddenly becoming strange, the KL divergence constraint will be used to control the change of the strategy.
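
Here is a minimal, runnable sketch of one such iteration in Python. The generate_answers and reward_model_score functions are toy stand-ins (canned text and random scores), not a real model or library API; a real trainer would plug in the language model, a trained reward model, and a clipped policy-gradient update with the KL penalty applied during training.

```python
import random
import statistics

# --- Toy stand-ins: a real system would call an LLM and a trained reward model instead. ---
def generate_answers(prompt, n):
    return [f"candidate answer {i} to: {prompt}" for i in range(n)]

def reward_model_score(prompt, answer):
    return random.uniform(0.0, 1.0)   # pretend score for accuracy, logic, and fluency

def grpo_iteration(prompt, group_size=4):
    # 1. Group sampling: several candidate answers to the same prompt.
    answers = generate_answers(prompt, n=group_size)
    # 2. Reward scoring: grade each answer.
    rewards = [reward_model_score(prompt, a) for a in answers]
    # 3. Group-relative advantages: compare each answer to the group's own average.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0          # avoid division by zero
    advantages = [(r - mean_r) / std_r for r in rewards]
    # 4. In a real trainer, these advantages would weight a clipped policy-gradient loss,
    #    with a KL penalty keeping the updated model close to a reference model.
    return list(zip(answers, rewards, advantages))

for answer, reward, adv in grpo_iteration("What is reinforcement learning?"):
    print(f"reward={reward:.2f}  advantage={adv:+.2f}  {answer}")
```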

By repeating this process, which is called iterative training, the chatbot will become more and more powerful, and the answers it generates will increasingly meet our expectations, becoming more accurate, more useful, and more interesting.

6. A popular analogy for the GRPO algorithm

To better understand how the GRPO algorithm works, we can compare it to a student learning to solve problems.

Suppose you are a student learning how to solve math problems. Instead of telling you whether each answer is right or wrong, your teacher (GRPO algorithm) gives you a set of similar problems and asks you to try different solutions. If one of your solutions is better than the others (i.e. you get a higher reward), the teacher will encourage you to use it more; if one of your solutions is worse than the others (i.e. you get a lower reward), the teacher will suggest you use it less. In this way, you gradually learn how to solve math problems better without the teacher explaining in detail every time why each step is right or wrong.

Similarly, in the GRPO algorithm, the model (i.e., the agent) learns how to better complete the task by trying different outputs (i.e., solutions). The algorithm adjusts the strategy (i.e., the solution method) based on the reward for each output (i.e., the quality of the solution), making it more likely that the output with better performance will be generated. This process is achieved through the relative reward mechanism within the group, which is both efficient and stable.

GRPO is a very important algorithm in the field of reinforcement learning. Its group-based comparisons give models a distinctive way to learn and optimize. Whether in robot tasks or in the training of large language models, GRPO plays an important role.