Large Model RAG: A Chatbot Solution Based on Large Models

Written by
Audrey Miles
Updated on: June 29, 2025
Recommendation

Explore how large-model-based RAG can transform chatbot development, enabling efficient Q&A and intelligent customer service.

Core content:
1. The wide range of chatbot application scenarios and their technological evolution
2. Common chatbot functions and user-experience optimizations
3. Key steps and methods for implementing the technical solution with RAG


1. Chatbot Applications

    Chatbots, or intelligent customer service agents, are already widely used across many systems. For example, group bots in IM tools (DingTalk, Ruliu, etc.) can reply to keywords and mount custom services via webhooks; intelligent customer service bots answer users' common questions and save labor costs; and intelligent voice bots accept spoken input, performing speech recognition and semantic understanding.

    Chatbot implementations also have relatively mature technical solutions, evolving from fixed question lists, to keyword replies, to NLP plus deep learning. With the emergence of large models, we can use RAG to build chatbot applications quickly. In conventional chatbot business scenarios, we can achieve good question-answering accuracy and recall without investing heavily in model tuning.

2. Common Chatbot Functions

    Taking a typical question-answering bot as an example, the first requirement is a knowledge base. The typical form is Q&A data covering common questions (i.e., question -> answer). For example, the following is a typical question-answer pair:

Q: How long is the warranty period for model xx TV?

A: Hello, the warranty period for model xx TV is one year.

    The second requirement is input. The simplest bot may offer only fixed questions, letting users select a scenario or product model to narrow the candidates and then click a preset question. This requires only key-value storage and lookup, which is simple to build, but the experience is poor. Letting users type their own questions, and especially allowing voice input on mobile, comes much closer to a real customer service scenario and provides a better user experience.
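The fixed-question mode described above really is just key-value storage and lookup. A minimal sketch, assuming an illustrative product (`tv-xx`) and topic keys that are not from the original article:

```python
# Minimal sketch of a fixed-question bot backed by key-value storage.
# The product key "tv-xx" and the topic names are illustrative assumptions.

FAQ = {
    ("tv-xx", "warranty"): "Hello, the warranty period for model xx TV is one year.",
    ("tv-xx", "specs"): "Model xx TV: 55-inch placeholder spec sheet.",  # illustrative
}

def list_questions(product: str) -> list[str]:
    """Return the preset question topics available for a product."""
    return [topic for (p, topic) in FAQ if p == product]

def answer(product: str, topic: str) -> str:
    """Exact key-value lookup; no semantic understanding involved."""
    return FAQ.get((product, topic), "Sorry, please pick one of the preset questions.")
```

Because the lookup is exact, any input outside the preset keys fails, which is precisely why free-text input needs the semantic matching described below.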

    After receiving user input, the bot must understand its semantics and search the knowledge base for matching entries. If a matching question is found, its answer is retrieved and returned. If the user's question is vague, related recommended questions should be offered instead; for example, if the user asks about the price of a TV, the bot needs to ask for the brand, model, and size. In more realistic scenarios the interaction is not a single question and answer: after answering one question, the bot can suggest a next round of related questions. For example, after asking about the specifications of a certain appliance model, users tend to go on to ask about its price or similar products, so those questions can be surfaced as next-round recommendations to guide the user's follow-up.
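The next-round recommendation idea can be sketched simply if we assume we have logged which question users tend to ask after each question. The transition counts below are made-up illustrative data, not real statistics:

```python
# Minimal sketch of next-round question recommendation based on logged
# question-to-question transitions. All data here is illustrative.
from collections import Counter

FOLLOW_UPS = {
    "tv-xx specs": Counter({"tv-xx price": 40, "similar models": 25, "tv-xx warranty": 10}),
}

def recommend_next(question: str, k: int = 2) -> list[str]:
    """Suggest the k questions most often asked after `question`."""
    counts = FOLLOW_UPS.get(question, Counter())
    return [q for q, _ in counts.most_common(k)]
```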


3. Technical Solution

3.1 Speech Recognition

    To receive user voice input and convert it into text, you can use speech recognition services such as those from Baidu or Alibaba, or the DingTalk JS API.

3.2 Intent Identification and Distribution

    Intent recognition and distribution are well suited to implementation with large models. For relatively simple intent recognition scenarios, they can also be implemented with conventional NLP combined with slot configuration.

3.3 Question-Answer Pair Storage

    For simple key-value Q&A, we could store questions and answers in a relational database and maintain the mapping between them. However, as analyzed above, this is not very practical: users phrase questions according to their own habits, so storing only the original question text is not enough. Especially when there are many questions, comparing the user's input against every stored question for an exact match is far too inefficient. A common solution is to vectorize the text: each incoming question is also vectorized, and similar questions are found by computing vector distances. This is exactly the recall step in RAG, where reference material/context is retrieved.
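The distance-based recall step can be sketched as follows. A real system would call an embedding API or word2vec; here a toy bag-of-words vectorizer stands in so the similarity computation itself is runnable, and the stored question is an illustrative assumption:

```python
# Minimal sketch of vector recall over a question store. The embed()
# function is a toy bag-of-words stand-in for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

KNOWLEDGE = {
    "how long is the warranty for model xx tv": "The warranty period is one year.",
}

def recall(query: str) -> tuple[str, float]:
    """Return the best-matching stored question and its similarity score."""
    vec = embed(query)
    best = max(KNOWLEDGE, key=lambda q: cosine(vec, embed(q)))
    return best, cosine(vec, embed(best))
```

Note how a differently phrased query still recalls the stored question with a high score, which is exactly what exact-match key-value lookup cannot do.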

    Following the earlier articles in this series, we can use the Milvus or PgSQL databases discussed in "Big Model RAG: Vector Retrieval Based on PgSql". For text vectorization, you can use the embeddings API provided by a large model, or choose a pre-trained word2vec model.

3.4 Single-round Q&A and Multi-round/Task-based Q&A

    The typical one-question-one-answer model is relatively simple: it only needs to find the closest question by similarity matching, without considering conversational context. Multi-round or task-based question answering is much more complicated.

    Task-based QA (TaskQA) identifies the user's intent within a conversation and performs a specified task based on the key parameters the conversation contains; in the intelligent-dialogue field these key parameters are called slots. Task-based dialogue suits scenarios where a concrete task must be carried out based on the user's dialogue: placing orders and handling returns in intelligent customer service; travel applications and meeting room reservations in intelligent office settings; playing music and ordering takeout on consumer electronics.
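The slot-filling loop behind task-based QA can be sketched minimally. The task definition here (booking a meeting room with `room` and `time` slots) is an illustrative assumption matching the office scenario above:

```python
# Minimal sketch of slot filling for task-based QA: keep asking for the
# first missing slot, execute the task once all slots are filled.
# The "book a meeting room" task and its slots are illustrative.

REQUIRED_SLOTS = ["room", "time"]

def next_action(filled: dict[str, str]) -> str:
    """Ask for the first missing slot, or execute once all are filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in filled:
            return f"ask:{slot}"
    return f"execute:book {filled['room']} at {filled['time']}"
```

Each user turn updates `filled` (via the intent/slot recognition discussed in 3.2), and `next_action` decides whether the bot should prompt for more information or carry out the task.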

    In multi-round task-based question answering, the input in a single round may be very brief. If only that round's input is used to search for similar questions, the intent is likely to remain unclear for lack of information, or the input may be mismatched to a different question, and the answer will naturally be wrong.

    There are several solutions to this problem. One is to build an independent task-based Q&A library: once a user's question hits this library, subsequent turns are tracked within it until some round of input falls entirely outside the library. Another common solution is query rewriting: merge the user's current question with the conversation history (generally limited to the past 3-5 rounds) and rewrite them into a single, self-contained question. This requires the rewrite to be accurate enough, since rewrite errors directly hurt question recall. We will describe these two solutions in detail in subsequent articles.