Using a Process Reward Model to Realize the DeepSeek Moment for Enterprise-Level AI Agents

Gain an in-depth understanding of how AI agents achieve deep thinking and reasoning capabilities through process reward models.
Core content:
1. The basic concept and function of the process reward model (PRM)
2. The difference between PRM and the traditional outcome reward model (ORM)
3. The application advantages and practical cases of PRM in enterprise-level AI scenarios
In enterprise-level business scenarios, the requirements for answer accuracy are higher: not only must the results be accurate, but the thinking and action process must also be accurate and conform to the enterprise's actual business logic. Compared with ORM, PRM therefore better meets the demands that enterprise-level scenarios place on process accuracy.
First, the thinking process must be accurate: based on the user's question, the AI needs to produce an accurate thinking process. For example, if a user asks "What was the sales volume in East China last week?", the correct thinking process is:
Step 1: Determine the date range of last week.
Step 2: Determine the system ID of the East China region.
Step 3: Determine that the sales field is Sales and locate the specific wide table that stores it.
Step 4: Assemble the above into a complete SQL statement.
Second, the action process must be accurate: based on the thinking process, the AI needs to take the corresponding actions. In enterprise-level scenarios, the AI's most common actions are not dialogue but data queries or API calls. Continuing the thinking process above, the AI should act by generating an SQL query statement:
SELECT sale_date, SUM(quantity) AS total_sales
FROM sales
WHERE
    -- Filter out last week's records
    YEAR(sale_date) = YEAR(CURDATE() - INTERVAL 1 WEEK)
    AND WEEK(sale_date) = WEEK(CURDATE() - INTERVAL 1 WEEK)
GROUP BY sale_date
ORDER BY sale_date;
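As an illustration only, Step 4 of the thinking process (assembling the SQL from the resolved parameters) might look like the following sketch. The table and field names (sales, sale_date, quantity, region_id) and the region identifier are assumptions based on the example above, not a real schema; the region filter corresponds to Step 2.

# Illustrative sketch of assembling the SQL statement from the resolved parameters.
# Table/field names and the region identifier are assumptions, not a real schema.
def build_last_week_sales_sql(region_id: str) -> str:
    return f"""
SELECT sale_date, SUM(quantity) AS total_sales
FROM sales
WHERE region_id = '{region_id}'  -- Step 2: system ID of the East China region
  AND YEAR(sale_date) = YEAR(CURDATE() - INTERVAL 1 WEEK)   -- Step 1: last week's date range
  AND WEEK(sale_date) = WEEK(CURDATE() - INTERVAL 1 WEEK)
GROUP BY sale_date
ORDER BY sale_date;
"""

print(build_last_week_sales_sql("EAST_CHINA"))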
If both the thinking process and the action process are accurate, the right business results naturally follow.
3. PRM implementation process
Below, we take a customer service question-and-answer scenario as an example to briefly walk through the PRM implementation process.
1. Task decomposition and reward definition
We can decompose the customer service dialogue system into three steps: [understanding intent], [retrieving knowledge], and [generating answers], and define a reward signal and reward rules for each step; a simple sketch of these scoring rules follows the three items below.
Understanding intent: the reward signal is the accuracy of intent classification, calculated by comparing the predicted intent with the ground-truth intent label; a correct prediction scores +0.3 and an incorrect one scores -0.1.
Knowledge retrieval: the reward signal is the relevance of the retrieved results, which can be computed with BM25 or vector similarity; strong relevance scores 1 point and no relevance scores 0 points.
Answer generation: the answer is judged on fluency, completeness of information, and friendliness, for which a dedicated scoring model can be built. For example, train a binary classifier that takes the conversation history (such as the last three turns) as input and predicts whether the user will leave a positive comment.
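Before weighting, the three reward rules above can be written down as simple scoring functions. The sketch below is illustrative: the 0.7 relevance threshold and the positive-comment probability input are assumptions, not values from the original design.

# Minimal rule-based sketch of the per-step reward rules described above.
# The 0.7 relevance threshold is an assumed cut-off for "strong relevance".
def intent_reward(predicted_intent: str, true_intent: str) -> float:
    # +0.3 for a correct intent classification, -0.1 for an incorrect one
    return 0.3 if predicted_intent == true_intent else -0.1

def retrieval_reward(similarity: float, threshold: float = 0.7) -> float:
    # similarity can come from BM25 or vector (cosine) similarity
    return 1.0 if similarity >= threshold else 0.0

def generation_reward(positive_comment_prob: float) -> float:
    # probability of a positive user comment, predicted by a separately
    # trained binary classifier over the recent conversation history
    return positive_comment_prob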
At the same time, assign a weight to each step, say 0.3 for intent understanding, 0.5 for knowledge retrieval, and 0.2 for answer generation. Sample code is shown below:
import torch
from transformers import BertTokenizer, BertModel

class DialogueRewardModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.intent_head = torch.nn.Linear(768, 3)      # 3 intent classes
        self.retrieval_score = torch.nn.Linear(768, 1)  # optional learned retrieval head (unused in this simplified forward pass)
        self.fluency_head = torch.nn.Linear(768, 1)

    def forward(self, user_query, bot_response, retrieved_doc, true_intent_label):
        # Encode the user input
        query_embed = self.bert(**user_query).last_hidden_state.mean(dim=1)

        # Phase 1: intent-understanding reward (negated cross-entropy, so higher is better)
        intent_logits = self.intent_head(query_embed)
        intent_reward = -torch.nn.functional.cross_entropy(intent_logits, true_intent_label)

        # Phase 2: retrieval reward (query/document cosine similarity)
        doc_embed = self.bert(**retrieved_doc).last_hidden_state.mean(dim=1)
        retrieval_sim = torch.cosine_similarity(query_embed, doc_embed, dim=1)

        # Phase 3: generation reward (fluency score of the response)
        response_embed = self.bert(**bot_response).last_hidden_state.mean(dim=1)
        fluency_score = self.fluency_head(response_embed).squeeze(-1)

        # Comprehensive reward: weighted sum of the per-step rewards
        total_reward = 0.3 * intent_reward + 0.5 * retrieval_sim + 0.2 * fluency_score
        return total_reward
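A hypothetical usage sketch of the reward model above; the example texts, the intent label index, and the tokenization details are illustrative assumptions.

# Hypothetical usage: score a single dialogue turn with the reward model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
reward_model = DialogueRewardModel()

user_query = tokenizer("What was the sales volume in East China last week?", return_tensors="pt")
bot_response = tokenizer("Here is the sales summary for East China for last week.", return_tensors="pt")
retrieved_doc = tokenizer("Weekly sales report for the East China region.", return_tensors="pt")
true_intent_label = torch.tensor([0])  # assumed index of the "data query" intent

with torch.no_grad():
    reward = reward_model(user_query, bot_response, retrieved_doc, true_intent_label)
print(reward)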
2. Prepare training data
You can use historical customer service conversation data to build a trajectory dataset of (state, action, reward) triples: (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_n, a_n, r_n).
Here S denotes the state space, which includes the current user question, the conversation history, and the user sentiment score; A denotes the action space, i.e. the decision to generate answer text or to call a knowledge base API; and R denotes the reward, where r_n is the reward the AI obtains for taking action a_n in state s_n.
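A minimal sketch of such a trajectory dataset, assuming a simple list-of-dicts layout; the field names and values are illustrative, not a fixed schema.

# Minimal sketch of a trajectory dataset built from historical conversations.
# Field names and values are illustrative, not a fixed schema.
trajectories = [
    {
        "state": {                                 # s_n: question, history, sentiment
            "user_query": "How do I reset my password?",
            "history": ["Hi", "Hello, how can I help you?"],
            "sentiment": 0.2,
        },
        "action": "call_knowledge_base_api",       # a_n: generate text or call an API
        "reward": 0.8,                             # r_n: PRM score for this step
    },
    # ... more (state, action, reward) triples from historical conversations
]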
3. Perform reinforcement learning
Integrate the PRM into a reinforcement learning framework such as PPO, replacing the reward originally provided by the environment, as shown in the following code:
# Dialogue policy based on Hugging Face Transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

class DialoguePolicy:
    def __init__(self):
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
        self.model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

    def generate_response(self, state):
        input_text = f"User: {state['user_query']}\nBot:"
        inputs = self.tokenizer(input_text, return_tensors="pt")
        outputs = self.model.generate(**inputs, max_length=100)
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response.split("Bot:")[-1].strip()

# Training loop (simplified PPO framework)
def train_ppo(policy, reward_model, epochs=10, max_steps=20):
    optimizer = torch.optim.Adam(policy.model.parameters(), lr=1e-5)
    for epoch in range(epochs):
        state = env.reset()  # Assume env is a simulated conversation environment
        for t in range(max_steps):
            action = policy.generate_response(state)
            next_state, done = env.step(action)
            # The PRM replaces the environment reward; get_reward is a helper (not shown)
            # that tokenizes the step and calls the process reward model
            reward = get_reward(reward_model, state, action)
            # PPO update logic (simplified; calculate_advantage not shown)
            advantage = calculate_advantage(reward, policy, next_state)
            loss = -torch.mean(advantage)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            state = next_state
            if done:
                break
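A hypothetical invocation of the training loop, assuming the simulated environment env and the helpers get_reward and calculate_advantage are provided elsewhere:

# Hypothetical invocation; env, get_reward and calculate_advantage are assumed
# to be supplied by your simulation and PPO utilities.
policy = DialoguePolicy()
reward_model = DialogueRewardModel()
train_ppo(policy, reward_model, epochs=10)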
As another practical example, suppose the user asks for the temperature difference between the water supply pipe and the return pipe on the first floor. The correct thinking process is:
Step 1: Query the water supply pipe temperature sensor on the first floor.
Step 2: Query the value of the water supply pipe temperature sensor.
Step 3: Query the return pipe temperature sensor on the first floor.
Step 4: Query the value of the return pipe temperature sensor.
Step 5: Calculate the difference between the two values.
The corresponding action is to generate a graph database query, for example:
MATCH(r:Sensor) WHERE r.Name =~ '.*1F.*' AND r.type =~ '.*Water supply pipe temperature sensor.*' RETURN r
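As an illustration only, the full five-step process could be executed roughly as follows, assuming the sensors live in a Neo4j graph database; the connection URI, the credentials, and the "value" property name are assumptions about the deployment, not part of the original example.

# Hypothetical sketch: run the thinking steps above against a Neo4j graph.
# The URI, credentials, and the "value" property name are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def first_floor_sensor_value(session, sensor_type: str) -> float:
    query = (
        "MATCH (r:Sensor) WHERE r.Name =~ '.*1F.*' "
        f"AND r.type =~ '.*{sensor_type}.*' RETURN r.value AS value LIMIT 1"
    )
    return session.run(query).single()["value"]

with driver.session() as session:
    supply_temp = first_floor_sensor_value(session, "Water supply pipe temperature sensor")
    return_temp = first_floor_sensor_value(session, "Return pipe temperature sensor")
    print(f"Supply/return temperature difference: {supply_temp - return_temp}")

driver.close()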