Improving enterprise question-answering efficiency with RAG and Agent: My AI practice journey

An exploration of AI technology for enterprise question answering: a practical journey from RAG to Agent.
Core content:
1. Applying RAG to enterprise question-answering scenarios, and the challenges involved
2. Designing and optimizing an iterative knowledge-base search function
3. Moving from answering questions to solving problems with an Agent system
In a department that spans operations and maintenance, IT, and security, the team handles a large volume of business requests and repetitive consultations every day. With the rise of large models, using AI to optimize these scenarios has become my main research direction. This article shares my technical exploration from RAG to Agent, aiming to bring more efficient solutions to enterprise question-and-answer scenarios.
Pain points and starting points: the introduction of RAG
The company's project documents are incomplete, and project background and domain knowledge are rarely captured in writing. Our department handles many consulting tasks, the questions repeat frequently, and the users we serve have very different levels of computer literacy.
After the rise of large models, I saw the potential of RAG (Retrieval-Augmented Generation). At the demo stage, I tackled the consulting scenario first and implemented a company-level RAG organized along the company's organizational structure.
In the consulting scenario, by combining knowledge-base retrieval with generation, the AI can take over conversations in our IM software and answer questions automatically. To make this work, I built an HTTP interface to the company's IM tool and got an initial AI takeover of question answering running. However, because the documents were poorly organized, answer quality was unstable: sometimes long-winded, sometimes too terse, and occasionally factually off.
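As a rough illustration of that takeover flow, the sketch below shows a webhook endpoint that receives an IM message and replies with a RAG-generated answer. The route, the payload field names, and the rag_answer() helper are placeholders for illustration, not the real IM tool's API.

```python
# Minimal sketch of the IM takeover flow (hypothetical payload fields and
# rag_answer() helper; the real IM tool's webhook format will differ).
from flask import Flask, request, jsonify

app = Flask(__name__)

def rag_answer(question: str) -> str:
    """Placeholder for the retrieval + generation pipeline."""
    # 1) retrieve relevant chunks from the knowledge base
    # 2) build a prompt with the retrieved context
    # 3) call the LLM and return its answer
    return "..."

@app.route("/im/webhook", methods=["POST"])
def handle_message():
    payload = request.get_json(force=True)
    question = payload.get("text", "")          # assumed field name
    answer = rag_answer(question)
    # The IM tool reads this reply body and posts it back to the conversation.
    return jsonify({"reply": answer})

if __name__ == "__main__":
    app.run(port=8080)
```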
To improve the results, I designed an iterative search function for the knowledge base, similar to "DeepSearch". Its core idea is to approach the answer gradually by generating new queries over multiple retrieval rounds, instead of returning a possibly inaccurate result in a single pass. This made me realize that the keys to RAG are the quality of the knowledge base and the efficiency of retrieval.
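A minimal sketch of that iterative loop, assuming hypothetical search() and llm() helpers: each round either answers from the accumulated context or asks the model for a better follow-up query.

```python
# "DeepSearch"-style iterative retrieval (sketch only; search() and llm()
# stand in for the knowledge-base retriever and the LLM call).
def iterative_search(question: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    query = question
    for _ in range(max_rounds):
        hits = search(query, top_k=5)            # retrieve from the knowledge base
        context.extend(hits)
        # Ask the model whether the collected context is enough to answer.
        verdict = llm(
            f"Question: {question}\nContext:\n" + "\n".join(context) +
            "\nIf the context is sufficient, answer the question. "
            "Otherwise reply with NEED:<a better search query>."
        )
        if not verdict.startswith("NEED:"):
            return verdict                        # good enough, stop early
        query = verdict[len("NEED:"):].strip()    # refine the query and retry
    # Fall back to answering with whatever was gathered.
    return llm(f"Question: {question}\nContext:\n" + "\n".join(context))
```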
Optimization and iteration: from vector library to classification library
In the practice of RAG, I tried a variety of technical solutions:
v1. Vector database
Vectorizing documents speeds up retrieval, but semantic matching still needs tuning.
v2. Corpus generation
An LLM can automatically generate supplementary corpus to diversify the knowledge base; the same corpus can also be used for fine-tuning.
v3. LoRA fine-tuning
Adapt the model to enterprise-specific scenarios by adjusting a small number of parameters.
v4. Classification and sub-library mode
The knowledge base is treated as a set of structured books: each book has a catalog (managed by metadata.json) and is stored under a subject classification. With better classification accuracy and retrieval efficiency, answer quality improved noticeably (see the sketch after this list).
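Below is one way the "books with catalogs" layout could look. The directory names, the metadata.json fields, and the classify() helper are assumptions for illustration, not the exact schema I use.

```python
# Sketch of the classification + catalog layout described above, e.g.:
#
# knowledge_base/
#   networking/metadata.json  -> {"topic": "networking", "docs": [{"file": "vpn.md", "summary": "..."}]}
#   security/metadata.json
#   it_support/metadata.json
import json
from pathlib import Path

BASE = Path("knowledge_base")

def load_catalogs() -> dict[str, dict]:
    """Read every sub-library's metadata.json as its catalog."""
    return {
        p.parent.name: json.loads(p.read_text(encoding="utf-8"))
        for p in BASE.glob("*/metadata.json")
    }

def retrieve(question: str) -> list[str]:
    catalogs = load_catalogs()
    topic = classify(question, list(catalogs))   # assumed helper, e.g. one LLM call
    docs = catalogs[topic]["docs"]
    # Only the matched sub-library is searched, keeping retrieval small and precise.
    return [(BASE / topic / d["file"]).read_text(encoding="utf-8") for d in docs]
```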
Towards Agents: From Answering to Solving Problems
RAG handles basic question answering, but the AI still feels passive when faced with complex needs. So I began exploring Agents: systems driven by a large model that can perceive the environment, plan tasks, execute steps, and adjust dynamically. LangChain's "Plan-and-Execute" pattern inspired me: an Agent can break a user's question into tasks, execute them iteratively, and refine its path based on the feedback from the results.
My implementation has four core links:
Classification
Quickly locate the knowledge area the question belongs to.
Search
Retrieve relevant documents from the matched sub-library and evaluate them.
Plan
Generate a step-by-step solution from the highly relevant documents.
Execute
Carry out the plan step by step and output the results.
For example, after a user asks a question, the Agent retrieves documents, plans steps, and returns the answer once execution finishes; the whole process is observable and adjustable (a minimal sketch follows). Although the current execution logic is still "step by step" (taking one step at a time), it already shows the potential to move from "passive question answering" to "active problem solving".
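A minimal sketch of those four links chained together; classify(), search(), and llm() are assumed helpers rather than the actual implementation.

```python
# Classify -> search -> plan -> execute, one step at a time (sketch only).
def run_agent(question: str) -> str:
    topic = classify(question)                        # 1. locate the knowledge area
    docs = search(question, topic=topic, top_k=5)     # 2. retrieve from that sub-library
    plan = llm(
        "Given the question and documents, produce a numbered list of steps.\n"
        f"Question: {question}\nDocuments:\n" + "\n".join(docs)
    ).splitlines()                                    # 3. generate the step-by-step plan
    results: list[str] = []
    for step in plan:                                 # 4. execute each step in order
        outcome = llm(
            f"Carry out this step and report the result.\nStep: {step}\n"
            "Previous results:\n" + "\n".join(results)
        )
        results.append(outcome)                       # each result feeds the next step
    return results[-1] if results else "No plan was produced."
```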
Demo
To support this solution, I developed a UI client that integrates:
Knowledge base management: maintaining document structure and content.
LLM and embedding model management: automatic discovery of Ollama models (see the sketch after this list).
MCP management: support for SSE is planned to improve real-time performance.
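For the model-discovery part, a small sketch using Ollama's standard REST endpoint (GET /api/tags on the default port 11434); the printed model names are only examples.

```python
# List locally installed Ollama models via the /api/tags endpoint.
import requests

def list_ollama_models(host: str = "http://localhost:11434") -> list[str]:
    resp = requests.get(f"{host}/api/tags", timeout=5)
    resp.raise_for_status()
    # The response contains a "models" array; each entry carries a "name".
    return [m["name"] for m in resp.json().get("models", [])]

print(list_ollama_models())   # e.g. ['qwen2.5:7b', 'nomic-embed-text:latest']
```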
The screenshots already show the Agent's execution process. The MCP module has plenty of room to grow, for example integrating multimodal capabilities or more complex task scheduling.
CLI version execution screenshot:
Reflection and Outlook
RAG effectively relieved the efficiency bottleneck of repetitive consultations, while the Agent opened up new possibilities for automating complex tasks. For technology selection I relied mainly on Grok-2 as the core model; its reasoning ability and context understanding let me validate ideas quickly. I did not systematically evaluate cost, which is the focus of the next round of optimization. A single large model is convenient as a "one-size-fits-all" approach, but there is still room to balance performance and resource usage.
Going forward, I plan to adjust the technology stack and build a finer division of labor and collaboration:
Classifier fine-tuning
Fine-tune a lightweight model to improve the accuracy and speed of question classification and reduce the load on the main model.
Rerank
Deploy a small model with Ollama to rerank search results locally and further improve answer relevance (see the sketch after this list).
Multi-model Plan
Introduce models such as Grok-3, DeepSeek-R1, and Gemini-Thinking to compare multi-model reasoning in the planning stage and explore each model's strengths in logical reasoning.
Task execution separation
Split the execution modules into independent single Agents to support parallel processing and raise throughput on complex tasks.
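As a rough sketch of the planned rerank step, a small model served by Ollama could score each candidate document for relevance. The model name and the 0-10 scoring prompt are assumptions, not a fixed design.

```python
# Local rerank: ask a small Ollama-served model to score each candidate.
import requests

def score(question: str, doc: str, model: str = "qwen2.5:1.5b") -> float:
    prompt = (
        "Rate how relevant the document is to the question on a 0-10 scale. "
        "Reply with the number only.\n"
        f"Question: {question}\nDocument: {doc[:2000]}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60,
    )
    try:
        return float(resp.json()["response"].strip())
    except ValueError:
        return 0.0   # an unparsable score counts as irrelevant

def rerank(question: str, docs: list[str], keep: int = 3) -> list[str]:
    # Keep only the highest-scoring documents before building the final prompt.
    return sorted(docs, key=lambda d: score(question, d), reverse=True)[:keep]
```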
Many other features still need to be implemented, such as permission control, an endpoint API, user interaction, observability, and robustness. For tool integration, I plan to build on the Model Context Protocol (MCP): the client uses the stdio transport to keep local interactions low-latency, while the remote side uses SSE (Server-Sent Events) for real-time calls.
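The snippet below is not the MCP SDK; it only illustrates how a client could consume such a remote SSE stream, since SSE frames are plain "data:" lines terminated by a blank line. The endpoint URL is a placeholder.

```python
# Read server-sent events from a streaming HTTP response (illustrative only).
import json
import requests

def listen(url: str = "http://example.internal/mcp/events") -> None:
    with requests.get(url, stream=True, timeout=None) as resp:
        buffer: list[str] = []
        for raw in resp.iter_lines(decode_unicode=True):
            if raw:
                if raw.startswith("data:"):       # accumulate the event's data lines
                    buffer.append(raw[len("data:"):].strip())
            elif buffer:                          # blank line marks the end of an event
                event = json.loads("\n".join(buffer))
                print("tool event:", event)
                buffer = []
```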