Using Process Reward Models to Realize the DeepSeek Moment for Enterprise-Level AI Agents

Written by
Caleb Hayes
Updated on: July 15, 2025
Recommendation

Gain an in-depth understanding of how AI agents achieve deep thinking and reasoning capabilities through process reward models.

Core content:
1. The basic concept and function of the process reward model (PRM)
2. The difference between PRM and the traditional outcome reward model (ORM)
3. The application advantages and practical cases of PRM in enterprise-level AI scenarios

Preface
I have been busy meeting clients and investors and attending various roadshows recently, so I haven't published for a while. Today, let's talk about combining AI agents with process reward models.
With the rise of DeepSeek, the Process Reward Model (PRM) has started to attract widespread attention.
PRM is one of the key technologies behind DeepSeek's deep thinking and reasoning capabilities. In this article, I will explain in detail how to use PRM to bring about a DeepSeek Moment for AI agents.
In this article you will learn:
1. What is a PRM?
2. How does PRM differ from the traditional outcome reward model (ORM)?
3. Why is PRM better suited to enterprise-level AI scenarios?
4. What does the PRM implementation process look like?
5. A practical example of PRM improving an AI agent's deep thinking capabilities

1. Concept of process reward model
The process reward model is a key technique in reinforcement learning. It provides fine-grained, immediate feedback at each step of task execution, rather than relying solely on the sparse reward of the final result. It is particularly important in complex tasks such as robot control, dialogue generation, and game strategy, where it can accelerate learning and improve policy stability.
The counterpart of the process reward model is the outcome reward model (ORM). The main differences between the two are:
1. Reward frequency: ORM gives a reward only for the final task result, so rewards are sparse; PRM gives a reward at each step or subtask stage, so feedback is much denser (see the sketch after this list).
2. Training stability: ORM tends to fall into local optima, while PRM converges more easily and yields smoother gradient updates.
3. Applicable scenarios: ORM is better suited to relatively simple tasks such as maze navigation and object recognition; PRM is better suited to long-horizon, complex tasks such as multi-turn dialogue and robot control.
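
To make the difference in reward frequency concrete, here is a minimal sketch; the trajectory, step names, and scores are purely illustrative assumptions:

# Minimal sketch: the same 3-step trajectory rewarded under ORM vs. PRM.
# Step names and scores are illustrative only.
trajectory = ["understand_intent", "retrieve_knowledge", "generate_answer"]

def orm_rewards(steps, task_succeeded):
    # ORM: a single sparse reward attached to the final outcome.
    return [0.0] * (len(steps) - 1) + [1.0 if task_succeeded else -1.0]

def prm_rewards(steps, step_scores):
    # PRM: dense feedback, one reward per intermediate step.
    return dict(zip(steps, step_scores))

print(orm_rewards(trajectory, task_succeeded=True))  # [0.0, 0.0, 1.0]
print(prm_rewards(trajectory, [0.3, 1.0, 0.8]))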

2. Why is PRM more suitable for enterprise-level AI scenarios?

In enterprise business scenarios, the accuracy requirements for answers are higher: not only must the final result be correct, but the thinking and action process must also be correct and conform to the enterprise's actual business logic. Compared with ORM, PRM therefore better satisfies the enterprise-level requirement for process accuracy.

First, the thinking process must be accurate. Given the user's question, the AI needs to produce an accurate chain of reasoning. For example, if a user asks "What was the sales volume in East China last week?", the correct thinking process is:

Step 1: Determine the date range of last week;
Step 2: Determine the system ID of the East China region;
Step 3: Determine that the sales field is Sales and locate the wide table that contains it;
Step 4: Assemble the complete SQL statement.

Second, the action process must be accurate. Based on the thinking process, the AI needs to take the corresponding actions. In enterprise-level scenarios, the AI's most common actions are not dialogue but data queries or API calls. Continuing the example above, the AI should generate the following SQL query:

SELECT sale_date, SUM(quantity) AS total_sales
FROM sales
WHERE
    -- Filter out last week's records
    YEAR(sale_date) = YEAR(CURDATE() - INTERVAL 1 WEEK)
    AND WEEK(sale_date) = WEEK(CURDATE() - INTERVAL 1 WEEK)
GROUP BY sale_date
ORDER BY sale_date;

If the thought process and action process are accurate, then the right business results will naturally follow.


3. PRM implementation process

Below, we use a customer-service question-and-answer scenario as an example to briefly walk through how PRM is implemented.

1. Task decomposition and reward definition

We can decompose the customer-service dialogue system into three steps: [understand intent], [retrieve knowledge], and [generate answer], and define a reward signal and scoring rule for each step.

Understand intent: the reward signal is intent-classification accuracy, computed by comparing the predicted intent with the ground-truth label: +0.3 for a correct prediction and -0.1 for an incorrect one.

Retrieve knowledge: the reward signal is the relevance of the retrieved results, which can be computed with BM25 or vector similarity: 1 point for strong relevance, 0 points for no relevance.
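
As a concrete sketch of the retrieval reward, relevance could be computed with BM25 via the rank_bm25 package; the corpus, tokenization, and the top-1 scoring rule below are illustrative assumptions:

# Sketch: BM25-based relevance reward for the knowledge-retrieval step.
# Corpus, tokenization, and scoring rule are illustrative assumptions.
from rank_bm25 import BM25Okapi

knowledge_base = [
    "How to reset your account password",
    "Refund policy for damaged goods",
    "Shipping times for international orders",
]
bm25 = BM25Okapi([doc.lower().split() for doc in knowledge_base])

def retrieval_reward(user_query, retrieved_doc_index):
    scores = bm25.get_scores(user_query.lower().split())
    # 1 point if the retrieved document is the most relevant one, 0 otherwise.
    return 1.0 if scores.argmax() == retrieved_doc_index else 0.0

print(retrieval_reward("how do I get a refund for a damaged item", 1))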

Generate answer: the answer is judged on fluency, completeness of information, and friendliness, and a dedicated scoring model can be built for this. For example, train a binary classifier that takes the conversation history (say, the last three turns) as input and predicts whether the user will leave positive feedback.

At the same time, assign a weight to each step, say 0.3 for intent understanding, 0.5 for knowledge retrieval, and 0.2 for answer generation. Sample code is shown below:

import torch
from transformers import BertTokenizer, BertModel

class DialogueRewardModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.intent_head = torch.nn.Linear(768, 3)      # 3 intent classes
        self.retrieval_score = torch.nn.Linear(768, 1)  # defined but unused: retrieval is scored via cosine similarity below
        self.fluency_head = torch.nn.Linear(768, 1)

    def forward(self, user_query, bot_response, retrieved_doc, true_intent_label):
        # Encode the user input
        query_embed = self.bert(**user_query).last_hidden_state.mean(dim=1)

        # Phase 1: intent-understanding reward
        # (negative cross-entropy, so a correct intent prediction yields a higher reward)
        intent_logits = self.intent_head(query_embed)
        intent_reward = -torch.nn.functional.cross_entropy(intent_logits, true_intent_label)

        # Phase 2: retrieval reward
        doc_embed = self.bert(**retrieved_doc).last_hidden_state.mean(dim=1)
        retrieval_sim = torch.cosine_similarity(query_embed, doc_embed, dim=1)

        # Phase 3: generation reward
        response_embed = self.bert(**bot_response).last_hidden_state.mean(dim=1)
        fluency_score = self.fluency_head(response_embed).squeeze(-1)

        # Combined reward
        total_reward = 0.3 * intent_reward + 0.5 * retrieval_sim + 0.2 * fluency_score
        return total_reward
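
A brief usage sketch of the reward model above; the example texts, intent label, and tokenization details are illustrative assumptions:

# Sketch: computing a process reward for one dialogue turn.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
reward_model = DialogueRewardModel()

user_query = tokenizer("Where is my order?", return_tensors="pt")
bot_response = tokenizer("Your order shipped yesterday and should arrive on Friday.", return_tensors="pt")
retrieved_doc = tokenizer("Orders ship within 24 hours of payment confirmation.", return_tensors="pt")
true_intent = torch.tensor([1])  # hypothetical intent class "order status"

with torch.no_grad():
    reward = reward_model(user_query, bot_response, retrieved_doc, true_intent)
print(reward)  # one-element tensor combining the three step-level rewards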


2. Prepare training data

You can use historical customer-service conversation data to build a trajectory dataset of (state, action, reward) tuples.

S represents the state space, which includes the current user question, the conversation history, and a user-sentiment score; A represents the action space, which includes generating answer text or calling the knowledge-base API; R represents the reward, i.e. the reward the AI obtains for taking action a_n in state s_n.
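
A minimal sketch of what one such trajectory record might look like; the field names and values are illustrative assumptions:

# Sketch: one trajectory built from historical customer-service logs.
# Field names and values are illustrative assumptions.
trajectory = [
    {
        "state": {
            "user_query": "My order hasn't arrived yet",
            "history": ["Hi, how can I help you?"],
            "sentiment": -0.4,
        },
        "action": {"type": "call_kb_api", "query": "late delivery policy"},
        "reward": 1.0,  # retrieval step judged relevant
    },
    {
        "state": {
            "user_query": "My order hasn't arrived yet",
            "history": ["Hi, how can I help you?", "[KB] Late delivery policy ..."],
            "sentiment": -0.4,
        },
        "action": {"type": "generate_answer", "text": "Sorry for the delay, your parcel ..."},
        "reward": 0.8,  # generation step scored by the answer-quality model
    },
]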


3. Perform reinforcement learning

Integrate the PRM into a reinforcement learning framework such as PPO, replacing the reward originally provided by the environment, as shown in the following code:

# Dialogue policy based on Hugging Face Transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

class DialoguePolicy:
    def __init__(self):
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
        self.model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

    def generate_response(self, state):
        input_text = f"User: {state['user_query']}\nBot:"
        inputs = self.tokenizer(input_text, return_tensors="pt")
        outputs = self.model.generate(**inputs, max_length=100)
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response.split("Bot:")[-1].strip()

# Training loop (simplified PPO framework)
def train_ppo(policy, reward_model, epochs=10):
    optimizer = torch.optim.Adam(policy.model.parameters(), lr=1e-5)
    for epoch in range(epochs):
        state = env.reset()  # assume env is a simulated conversation environment
        for t in range(max_steps):
            action = policy.generate_response(state)
            next_state, done = env.step(action)
            reward = reward_model(state, action)  # the PRM supplies the reward
            # PPO update logic (simplified)
            advantage = calculate_advantage(reward, policy, next_state)
            loss = -torch.mean(advantage)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if done:
                break
            state = next_state
In actual training, the weights can also be adjusted dynamically. For example, early in training, intent-recognition accuracy matters more, so its weight can be raised to 0.5; later in training, answer-generation quality matters more, so the generation weight can be raised to 0.5 instead.
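
A minimal sketch of such a schedule; the epoch breakpoint and exact weights are illustrative assumptions:

# Sketch: shifting reward weights as training progresses.
def stage_weights(epoch, total_epochs):
    if epoch < total_epochs // 2:
        # Early training: emphasize intent recognition.
        return {"intent": 0.5, "retrieval": 0.3, "generation": 0.2}
    # Late training: emphasize answer generation.
    return {"intent": 0.2, "retrieval": 0.3, "generation": 0.5}

intent_reward, retrieval_sim, fluency_score = 0.3, 1.0, 0.8  # placeholder step scores
w = stage_weights(epoch=8, total_epochs=10)
total_reward = (w["intent"] * intent_reward
                + w["retrieval"] * retrieval_sim
                + w["generation"] * fluency_score)
print(total_reward)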

4. An enterprise-level PRM practice case
Below is a practical case from our own work: using PRM to build an enterprise-level AI agent with deep thinking capabilities.
Background
In equipment operation and maintenance (O&M) settings such as factory production lines, HVAC systems, and power-distribution systems, O&M engineers need access to equipment operating data, including voltage, current, vibration, temperature, and humidity. These parameters are captured by dedicated sensors or virtual points and synchronized to the factory's data center. Engineers rely on this data for daily diagnosis and troubleshooting.
Requirement
When a fault occurs, we want the AI to obtain real-time sensor data and, after deep thinking and quantitative analysis, quickly help engineers narrow the troubleshooting scope through dialogue and guide them to the root cause and solution faster.
Difficulties
1. The factory environment is complex, including the connection and control relationships between IoT devices as well as the physical locations of the various devices;
2. The system is complex: the same fault symptom may have multiple possible causes, and even the most experienced engineers find it hard to troubleshoot quickly;
3. IoT data must be acquired accurately and in real time, and the AI must make judgments automatically according to quantitative analysis rules.
Solution
Apply reinforcement learning with a process reward model to the content the AI outputs.
The first step is to reinforce the AI's thinking process.
For example, if a user asks "What is the supply and return water temperature difference on the first floor?", the correct thinking process is:
Step 1: Find the supply-pipe temperature sensor on the first floor;
Step 2: Read the value of the supply-pipe temperature sensor;
Step 3: Find the return-pipe temperature sensor on the first floor;
Step 4: Read the value of the return-pipe temperature sensor;
Step 5: Compute the difference between the two values.
Engineers evaluate every step of the AI's thinking, rewarding it when correct and penalizing it when wrong. After multiple iterations, the accuracy of the AI's thinking process is ensured.
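
A minimal sketch of how such step-by-step evaluation might be turned into training rewards; the step texts, labels, and +1/-1 scores are illustrative assumptions:

# Sketch: converting an engineer's per-step evaluation of a reasoning chain
# into process rewards. Step texts and scores are illustrative.
reasoning_chain = [
    "Find the supply-pipe temperature sensor on the first floor",
    "Read the value of the supply-pipe temperature sensor",
    "Find the return-pipe temperature sensor on the second floor",  # wrong floor
    "Read the value of the return-pipe temperature sensor",
    "Compute the difference between the two readings",
]
engineer_labels = [True, True, False, True, True]  # step 3 judged incorrect

def step_rewards(labels, correct=1.0, incorrect=-1.0):
    return [correct if ok else incorrect for ok in labels]

print(step_rewards(engineer_labels))  # [1.0, 1.0, -1.0, 1.0, 1.0]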
The second step is to apply reinforcement learning to the data-query statements generated by the AI.
For example, the first thinking step, "Find the supply-pipe temperature sensor on the first floor", needs to be converted into an accurate Cypher query:
MATCH (r:Sensor)
WHERE r.Name =~ '.*1F.*' AND r.type =~ '.*Water supply pipe temperature sensor.*'
RETURN r
For each action step (i.e., query statement) generated from each thinking step, engineers give a reward if it is correct and a penalty if it is wrong. Ultimately, action steps described in natural language are accurately converted into action steps expressed in code.
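
One simple way to score a generated query at this stage is to compare it with the engineer-approved reference query; the exact-match rule and helper below are a sketch (in practice, execution-based checks against the graph database could also be used):

# Sketch: rewarding a generated Cypher query by comparing it with the
# engineer-approved reference. Normalization and scoring are illustrative.
def normalize(query: str) -> str:
    return " ".join(query.split()).lower()

def action_reward(generated: str, reference: str) -> float:
    # +1 if the generated query matches the approved one, -1 otherwise.
    return 1.0 if normalize(generated) == normalize(reference) else -1.0

reference = "MATCH (r:Sensor) WHERE r.Name =~ '.*1F.*' AND r.type =~ '.*Water supply pipe temperature sensor.*' RETURN r"
generated = "MATCH (r:Sensor)  WHERE r.Name =~ '.*1F.*' AND r.type =~ '.*Water supply pipe temperature sensor.*' RETURN r"
print(action_reward(generated, reference))  # 1.0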
Final results
An AI agent with deep thinking capabilities can greatly improve the efficiency of O&M engineers.
1. AI intelligent early warning: electrical engineers receive fault-warning information directly on their mobile phones or PDAs.
2. AI automatic reasoning: based on the acquired sensor data and quantitative calculation, the AI automatically helps engineers narrow down the fault scope and find the cause of the fault.
3. AI solutions: the AI automatically provides engineers with solutions based on the identified fault cause.