How to quickly build a personalized RAG chatbot

Written by
Caleb Hayes
Updated on: July 9, 2025
Recommendation

Master RAG technology, build personalized chatbots, and improve conversational AI performance.

Core content:
1. The application value of RAG technology in conversational AI
2. Key technical components for building RAG chatbots
3. Detailed steps: from installing LangChain to setting up Fireworks AI models


In the AI field, Retrieval-Augmented Generation (RAG) has become a key technique for generative AI applications, especially conversational AI. It combines a pre-trained large language model (LLM) such as OpenAI's GPT with an external knowledge base (stored in a vector database such as Milvus or Zilliz Cloud) to generate more accurate, context-relevant responses while keeping the information up to date.

A complete RAG pipeline usually consists of four basic components: vector database, embedding model, LLM, and framework.

Today, we will walk you step by step through building a simple RAG chatbot in Python. If you are interested in AI technology, or are looking for ways to improve the performance of conversational AI, you should get a lot out of this article.

What key technical components will we use?

In this tutorial, we will use the following tools and techniques:

  1. LangChain
    LangChain helps you orchestrate the interactions between the LLM, vector store, embedding model, and other components, simplifying the assembly of the RAG pipeline.

  2. Milvus
    Milvus is an open-source vector database optimized for efficiently storing, indexing, and searching large-scale vector embeddings, making it ideal for RAG, semantic search, and recommendation systems. If you don't want to manage your own infrastructure, you can also choose Zilliz Cloud, a fully managed vector database service built on Milvus that offers a free tier supporting up to 1 million vectors.

  3. Fireworks AI Llama 3.1 8B Instruct
    This 8-billion-parameter model is good at following instructions and providing guidance through its reasoning capabilities. Whether for educational tools, virtual assistants, or interactive content generation, it produces coherent responses across many domains, making it well suited to scenarios that require personalized interaction.

  4. Cohere embed-multilingual-v2.0
    This embedding model focuses on generating high-quality multilingual embeddings for cross-language understanding and retrieval. Its strength is capturing semantic relationships across many languages, which makes it a good fit for multilingual search, recommendation systems, and global content analysis.

Step 1: Install and set up LangChain

First, we need to install the LangChain-related dependencies. Run the following command (the leading % is for notebooks; drop it if you are in a terminal):

%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph

Step 2: Install and set up Fireworks AI Llama 3.1 8B Instruct

Next, we install the dependencies of Fireworks AI. Execute the following code:

pip install -qU "langchain[fireworks]"

import getpass
import os

if not os.environ.get("FIREWORKS_API_KEY"):
    os.environ["FIREWORKS_API_KEY"] = getpass.getpass("Enter API key for Fireworks AI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("accounts/fireworks/models/llama-v3p1-8b-instruct", model_provider="fireworks")

Note: You need to obtain the Fireworks AI API key in advance!

If you prefer a free alternative, you can also use the SiliconFlow API, where Qwen 7B is free to call.

Step 3: Install and set up Cohere embed-multilingual-v2.0

Next, we install Cohere's embedding model dependencies. Run the following code:

pip install -qU langchain-cohere

import getpass
import os

if not os.environ.get("COHERE_API_KEY"):
    os.environ["COHERE_API_KEY"] = getpass.getpass("Enter API key for Cohere: ")

from langchain_cohere import CohereEmbeddings

embeddings = CohereEmbeddings(model="embed-multilingual-v2.0")

Step 4: Install and configure Milvus

Now, let's install the Milvus vector database. Execute the following code:

pip install -qU langchain-milvus

from langchain_milvus import Milvus

vector_store = Milvus(embedding_function=embeddings)
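By default this connects to a Milvus server at its standard local URI. If you don't have a server running yet, langchain-milvus can also store the data in a local Milvus Lite file. A minimal sketch (the file name is an arbitrary example):

from langchain_milvus import Milvus

# Minimal sketch: use a local Milvus Lite database file instead of a server.
# "./rag_demo.db" is an arbitrary example path.
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "./rag_demo.db"},
)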

Step 5: Build the RAG chatbot

Now that all components are ready, let's build the chatbot! We will use Milvus's introduction document as the private knowledge base. Of course, you can also swap in your own dataset to customize your own RAG chatbot (see the short sketch after the full code below).

The following is the complete code implementation:

import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and split the documentation page
loader = WebBaseLoader(
    web_paths=("https://milvus.io/docs/overview.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("doc-style doc-post-content")
        )
    ),
)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Index the split documents
_ = vector_store.add_documents(documents=all_splits)

# Pull the Q&A prompt template
prompt = hub.pull("rlm/rag-prompt")

# Define the application state
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Define the application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Compile the application
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
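As mentioned above, you can swap the Milvus docs page for your own data. A minimal sketch using a local text file instead of WebBaseLoader (the file name is a placeholder; any other LangChain document loader works the same way):

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load a local file instead of the Milvus docs page ("my_notes.txt" is a placeholder).
docs = TextLoader("my_notes.txt", encoding="utf-8").load()

# Split and index it exactly as in the walkthrough above.
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
vector_store.add_documents(documents=splits)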

Testing the chatbot

Ok, the chatbot is built! Let’s test it:

response = graph.invoke({"question": "What data types does Milvus support?"})
print(response["answer"])

Sample Output

Milvus supports multiple data types, including sparse vectors, binary vectors, JSON, and arrays. In addition, it can handle common numeric and character types, suitable for different data modeling needs. This allows users to efficiently manage unstructured or multimodal data.

Optimization

When building a RAG system, optimization is key to good performance and efficiency. Below are some optimization suggestions for each component to help you build a smarter, faster, and more responsive RAG application.

LangChain Optimization Tips

You can optimize LangChain by reducing redundant operations: design the structure of chains and agents carefully, and use caching to avoid repeated computation. Its modular design also lets you swap models or databases flexibly, so the system can scale quickly.
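One simple option is LangChain's built-in in-memory LLM cache, which returns cached responses for repeated identical prompts. A minimal sketch (a SQLite or Redis cache can be swapped in the same way):

from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

# Cache LLM responses so identical prompts are not re-sent to the model.
set_llm_cache(InMemoryCache())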

Milvus Optimization Tips

Milvus is an efficient vector database. Its performance can be optimized from the following aspects:

  • Use an HNSW (Hierarchical Navigable Small World) index to balance speed and accuracy (see the sketch after this list);
  • Partition data based on usage patterns to improve query performance;
  • Batch insert vectors to reduce database lock contention;
  • Experiment with the embedding dimensionality to find the best balance for your hardware and use case.
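For the first point, the langchain-milvus vector store accepts index parameters at construction time (check the signature of your installed version). A rough sketch; the M and efConstruction values are illustrative examples, not tuned recommendations:

from langchain_milvus import Milvus

# Rough sketch: ask Milvus to build an HNSW index for the collection.
# M and efConstruction are illustrative values, not tuned recommendations.
vector_store = Milvus(
    embedding_function=embeddings,
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
    },
)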

Fireworks AI Llama 3.1 8B Instruct Optimization Tips

This model is very cost-effective and suitable for medium-complexity RAG applications. You can optimize its performance by limiting the context length, adjusting the temperature parameter (0.1-0.3 is recommended), and caching high-frequency queries.
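For example, you can pass a temperature directly when initializing the model with init_chat_model; the 0.2 below is just one value inside the recommended range:

from langchain.chat_models import init_chat_model

# A lower temperature keeps answers closer to the retrieved context.
llm = init_chat_model(
    "accounts/fireworks/models/llama-v3p1-8b-instruct",
    model_provider="fireworks",
    temperature=0.2,
)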

Cohere embed-multilingual-v2.0 optimization tips

This multilingual embedding model is very suitable for cross-lingual RAG scenarios. You can improve efficiency by pre-processing text to remove noise, compressing embeddings, and batching operations.
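As a small illustration of the preprocessing idea, the sketch below strips extra whitespace before embedding; the cleanup rule is an arbitrary example, and embed_documents already sends the texts to Cohere in batches:

import re

def clean_text(text: str) -> str:
    # Arbitrary example cleanup: collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Clean the chunks before embedding them in one batched call.
cleaned = [clean_text(doc.page_content) for doc in all_splits]
vectors = embeddings.embed_documents(cleaned)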

- END -