Built from scratch and run 100% locally: Qwen 3 Local RAG Reasoning Agent

Written by Jasper Cole
Updated on: June 20, 2025

Recommendation

Build an efficient, fully local RAG system that combines document question answering with web search.

Core content:
1. Combines the Qwen 3 and Gemma 3 models to create a lightweight local RAG system
2. Core functions include document processing, vector search, web search, and privacy protection
3. Flexible configuration, with support for different models and adjustable similarity thresholds


Today, we will build a RAG system from scratch based on locally running Qwen 3 and Gemma 3 models, combining document processing, vector search, and web search to provide users with accurate, contextual answers. The project comes from a tutorial by Unwind AI; the open-source repository is linked in the original post. Let's walk through how the project is built and its technical highlights.

Project Overview

· Name: Qwen 3 Local RAG Reasoning Agent

· Objective: Build an efficient RAG system that supports document question answering, web content extraction, and web search, using a lightweight LLM and a vector database that run locally.

Core functions:

  1. Document processing: Upload PDF files or enter web page URLs; content is extracted and intelligently segmented.

  2. Vector search: The Qdrant vector database stores document embeddings for efficient similarity search.

  3. Web search: When document knowledge is insufficient, the Exa API can be used to run a web search that supplements the answer.

  4. Flexible mode: Supports both RAG mode (combining documents and search) and direct LLM interaction.

  5. Privacy protection: All processing is done locally, making the system suitable for sensitive data.

Technical Architecture

1. Language Model:

   Supports multiple local models: Qwen 3 (1.7B, 8B), Gemma 3 (1B, 4B), and DeepSeek (1.5B).

   Models run locally through the Ollama framework, reducing dependence on cloud services.
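   To make this concrete, here is a minimal sketch (not part of the project code) of chatting with a locally served model through the ollama Python client; it assumes the Ollama server is running and that qwen3:1.7b has been pulled:

# Minimal sketch: chat with a local model via the ollama Python client.
# Assumes `ollama serve` is running and `qwen3:1.7b` has been pulled.
import ollama

response = ollama.chat(
    model="qwen3:1.7b",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response["message"]["content"])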

2. Document processing:

   Use PyPDFLoader to process PDF files and WebBaseLoader to extract web page content.

   RecursiveCharacterTextSplitter splits documents into small chunks for embedding and retrieval.
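   A minimal sketch of this loading-and-splitting step (the file name is a hypothetical example; the chunk sizes mirror the project code below):

# Minimal sketch: load a PDF and split it into overlapping chunks.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = PyPDFLoader("example.pdf").load()  # one Document per page
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
)
chunks = splitter.split_documents(documents)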

3. Vector database:

   Use Qdrant to store document embedding vectors and support efficient similarity search.

   Embedding model: snowflake-arctic-embed, served through Ollama.
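   As an illustration, creating a collection sized for snowflake-arctic-embed might look like this (a sketch matching the 1024-dimension, cosine-distance setup in the project code below):

# Minimal sketch: create a Qdrant collection for 1024-dim embeddings.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="test-qwen-r1",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)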

4. Web search:

   Web search is implemented through the Exa API, with support for custom domain filtering.
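   A sketch of an Exa-backed search agent, mirroring get_web_search_agent() in the project code (the API key and domain list are placeholders):

# Minimal sketch: an Agno agent that searches the web via ExaTools.
from agno.agent import Agent
from agno.models.ollama import Ollama
from agno.tools.exa import ExaTools

web_agent = Agent(
    model=Ollama(id="qwen3:1.7b"),
    tools=[ExaTools(
        api_key="YOUR_EXA_API_KEY",
        include_domains=["arxiv.org", "wikipedia.org"],
        num_results=5,
    )],
)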

5. User Interface:

   Use Streamlit to build an interactive web interface that lets users upload files, enter URLs, and ask questions.
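   The skeleton of such an interface is simple; this illustrative sketch (not the project's full UI) shows the chat loop:

# Minimal sketch: a Streamlit chat skeleton.
import streamlit as st

st.title("Local RAG Agent")
prompt = st.chat_input("Ask about your documents...")
if prompt:
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        st.write("(model answer goes here)")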

Key Features

1. Document Q&A:

   When a user uploads a PDF or enters a URL, the system converts the content into embedding vectors and stores them in Qdrant.

   When a user asks a question, the system finds relevant document fragments through similarity search and generates an answer.
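   The retrieval step looks roughly like this (a sketch of the threshold-based search used in the app; vector_store is assumed to be an already-initialized QdrantVectorStore):

# Minimal sketch: threshold-based similarity retrieval.
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.7},
)
docs = retriever.invoke("What does the uploaded report conclude?")
context = "\n\n".join(d.page_content for d in docs)  # passed to the LLM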

2. Supplementary Internet search:

   If the documents do not contain enough information, the system triggers a web search, either automatically or manually (via a toggle), to obtain additional information.

   Search results clearly indicate their sources.

3. Flexible configuration:

   Choose between different models (such as Qwen 3 or Gemma 3).

   Adjust the similarity threshold to control how strict document retrieval is.

   Disable RAG mode to talk to the LLM directly.

4. Privacy and offline support:

   All models and processing run locally; no data is sent to the cloud.

   Suitable for scenarios that require data privacy or environments without network access.

How to use

1. Environment preparation:

   Install Ollama and Python 3.8+.

   Run the Qdrant vector database via Docker.

   Get an Exa API key (optional, for web searches).

2. Install dependencies:

   pip install -r requirements.txt

3. Pull model:

ollama pull qwen3:1.7b
ollama pull snowflake-arctic-embed

4. Run Qdrant:   

docker run -p 6333:6333 -p 6334:6334 -v "$(pwd)/qdrant_storage:/qdrant/storage:z" qdrant/qdrant

5. Start the application:

streamlit run qwen_local_rag_agent.py

6. Operation:

   Upload a PDF or enter a URL in the Streamlit interface.

   Adjust the model, RAG mode, or search settings.

   Enter a question and get an answer with sources.

Application Scenarios

Academic research: Quickly search uploaded papers or web page content, supplemented by online search for the latest information.

Enterprise document management: Process internal documents (such as manuals and reports) and provide intelligent Q&A.

Privacy-sensitive scenarios: Process sensitive documents, such as legal and medical files, locally to prevent data leakage.

Offline environments: Query knowledge from local models and documents when no network is available.

Project Advantages

Open source and free: The code is open and can be freely modified and deployed.

Localization: No reliance on cloud services, protecting data privacy.

Modularity: Supports multiple models and configurations, and is easy to extend.

User-friendliness: Streamlit's interface is simple and intuitive, suitable for non-technical users.

Summary

This project is a powerful and flexible local RAG system that combines local language models, vector databases, and web search, suitable for scenarios that require privacy protection, offline operation, or customized knowledge queries. With simple configuration, users can quickly build an intelligent question-answering assistant that processes document and web content while keeping data secure.

Source Code
For those who have difficulty accessing GitHub, here is the source code:
requirements.txt
agno
pypdf
exa
qdrant-client
langchain-qdrant
langchain-community
streamlit
ollama
qwen_local_rag_agent.py
import os
import tempfile
from datetime import datetime
from typing import List

import streamlit as st
import bs4
from agno.agent import Agent
from agno.models.ollama import Ollama
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_core.embeddings import Embeddings
from agno.tools.exa import ExaTools
from agno.embedder.ollama import OllamaEmbedder

class OllamaEmbedderr(Embeddings):
    def __init__(self, model_name="snowflake-arctic-embed"):
        """
        Initialize the OllamaEmbedderr with a specific model.

        Args:
            model_name (str): The name of the model to use for embedding.
        """
        self.embedder = OllamaEmbedder(id=model_name, dimensions=1024)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        return self.embedder.get_embedding(text)

# Constants
COLLECTION_NAME = "test-qwen-r1"

# Streamlit App Initialization
st.title("🤖 Qwen 3 Local RAG Reasoning Agent")

# --- Add Model Info Boxes ---
st.info("**Qwen3:** The latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models.")
st.info("**Gemma 3:** These models are multimodal—processing text and images—and feature a 128K context window with support for over 140 languages.")
# -------------------------

# Session State Initialization
if 'model_version' not in st.session_state:
    st.session_state.model_version = "qwen3:1.7b"  # Default to lighter model
if 'vector_store' not in st.session_state:
    st.session_state.vector_store = None
if 'processed_documents' not in st.session_state:
    st.session_state.processed_documents = []
if 'history' not in st.session_state:
    st.session_state.history = []
if 'exa_api_key' not in st.session_state:
    st.session_state.exa_api_key = ""
if 'use_web_search' not in st.session_state:
    st.session_state.use_web_search = False
if 'force_web_search' not in st.session_state:
    st.session_state.force_web_search = False
if 'similarity_threshold' not in st.session_state:
    st.session_state.similarity_threshold = 0.7
if 'rag_enabled' not in st.session_state:
    st.session_state.rag_enabled = True  # RAG is enabled by default

# Sidebar Configuration
st.sidebar.header("⚙️ Settings")

# Model Selection
st.sidebar.header("🧠 Model Choice")
model_help = """
- qwen3:1.7b: Lighter model (MoE)
- gemma3:1b: More capable but requires better GPU/RAM (32k context window)
- gemma3:4b: More capable and MultiModal (Vision) (128k context window)
- deepseek-r1:1.5b
- qwen3:8b: More capable but requires better GPU/RAM

Choose based on your hardware capabilities.
"""
st.session_state.model_version = st.sidebar.radio(
    "Select Model Version",
    options=["qwen3:1.7b", "gemma3:1b", "gemma3:4b", "deepseek-r1:1.5b", "qwen3:8b"],
    help=model_help
)
st.sidebar.info("Run ollama pull qwen3:1.7b")

# RAG Mode Toggle
st.sidebar.header("📚 RAG Mode")
st.session_state.rag_enabled = st.sidebar.toggle("Enable RAG", value=st.session_state.rag_enabled)

# Clear Chat Button
if st.sidebar.button("✨ Clear Chat"):
    st.session_state.history = []
    st.rerun()

# Show API Configuration only if RAG is enabled
if st.session_state.rag_enabled:
    st.sidebar.header("🎯 Search Tuning")
    st.session_state.similarity_threshold = st.sidebar.slider(
        "Similarity Threshold",
        min_value=0.0,
        max_value=1.0,
        value=0.7,
        help="Lower values will return more documents but might be less relevant. Higher values are more strict."
    )

# Add in the sidebar configuration section, after the existing API inputs
st.sidebar.header("🌐 Web Search")
st.session_state.use_web_search = st.sidebar.checkbox("Enable Web Search Fallback", value=st.session_state.use_web_search)

if st.session_state.use_web_search:
    exa_api_key = st.sidebar.text_input(
        "Exa AI API Key",
        type="password",
        value=st.session_state.exa_api_key,
        help="Required for web search fallback when no relevant documents are found"
    )
    st.session_state.exa_api_key = exa_api_key

    # Optional domain filtering
    default_domains = ["arxiv.org", "wikipedia.org", "github.com", "medium.com"]
    custom_domains = st.sidebar.text_input(
        "Custom domains (comma-separated)",
        value=",".join(default_domains),
        help="Enter domains to search from, e.g.: arxiv.org,wikipedia.org"
    )
    search_domains = [d.strip() for d in custom_domains.split(",") if d.strip()]
# Utility Functions
def init_qdrant() -> QdrantClient | None:
    """Initialize Qdrant client with local Docker setup.

    Returns:
        QdrantClient: The initialized Qdrant client if successful.
        None: If the initialization fails.
    """
    try:
        return QdrantClient(url="http://localhost:6333")
    except Exception as e:
        st.error(f"🔴 Qdrant connection failed: {str(e)}")
        return None

# Document Processing Functions
def process_pdf(file) -> List:
    """Process PDF file and add source metadata."""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
            tmp_file.write(file.getvalue())
            loader = PyPDFLoader(tmp_file.name)
            documents = loader.load()

            # Add source metadata
            for doc in documents:
                doc.metadata.update({
                    "source_type": "pdf",
                    "file_name": file.name,
                    "timestamp": datetime.now().isoformat()
                })

            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000,
                chunk_overlap=200
            )
            return text_splitter.split_documents(documents)
    except Exception as e:
        st.error(f"📄 PDF processing error: {str(e)}")
        return []

def process_web(url: str) -> List:
    """Process web URL and add source metadata."""
    try:
        loader = WebBaseLoader(
            web_paths=(url,),
            bs_kwargs=dict(
                parse_only=bs4.SoupStrainer(
                    class_=("post-content", "post-title", "post-header", "content", "main")
                )
            )
        )
        documents = loader.load()

        # Add source metadata
        for doc in documents:
            doc.metadata.update({
                "source_type": "url",
                "url": url,
                "timestamp": datetime.now().isoformat()
            })

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        return text_splitter.split_documents(documents)
    except Exception as e:
        st.error(f"🌐 Web processing error: {str(e)}")
        return []

# Vector Store Management
def create_vector_store(client, texts):
    """Create and initialize vector store with documents."""
    try:
        # Create collection if needed
        try:
            client.create_collection(
                collection_name=COLLECTION_NAME,
                vectors_config=VectorParams(
                    size=1024,
                    distance=Distance.COSINE
                )
            )
            st.success(f"📦 Created new collection: {COLLECTION_NAME}")
        except Exception as e:
            if "already exists" not in str(e).lower():
                raise e

        # Initialize vector store
        vector_store = QdrantVectorStore(
            client=client,
            collection_name=COLLECTION_NAME,
            embedding=OllamaEmbedderr()
        )

        # Add documents
        with st.spinner('📤 Uploading documents to Qdrant...'):
            vector_store.add_documents(texts)
            st.success("✅ Documents stored successfully!")
            return vector_store

    except Exception as e:
        st.error(f"🔴 Vector store error: {str(e)}")
        return None

def get_web_search_agent() -> Agent:
    """Initialize a web search agent."""
    return Agent(
        name="Web Search Agent",
        model=Ollama(id="llama3.2"),
        tools=[ExaTools(
            api_key=st.session_state.exa_api_key,
            include_domains=search_domains,
            num_results=5
        )],
        instructions="""You are a web search expert. Your task is to:
        1. Search the web for relevant information about the query
        2. Compile and summarize the most relevant information
        3. Include sources in your response
        """,
        show_tool_calls=True,
        markdown=True,
    )

def get_rag_agent() -> Agent:
    """Initialize the main RAG agent."""
    return Agent(
        name="Qwen 3 RAG Agent",
        model=Ollama(id=st.session_state.model_version),
        instructions="""You are an Intelligent Agent specializing in providing accurate answers.

        When asked a question:
        - Analyze the question and answer the question with what you know.

        When given context from documents:
        - Focus on information from the provided documents
        - Be precise and cite specific details

        When given web search results:
        - Clearly indicate that the information comes from web search
        - Synthesize the information clearly

        Always maintain high accuracy and clarity in your responses.
        """,
        show_tool_calls=True,
        markdown=True,
    )



def check_document_relevance(query: str, vector_store, threshold: float = 0.7) -> tuple[bool, List]:
    if not vector_store:
        return False, []

    retriever = vector_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 5, "score_threshold": threshold}
    )
    docs = retriever.invoke(query)
    return bool(docs), docs

chat_col, toggle_col = st.columns([0.9, 0.1])

with chat_col:
    prompt = st.chat_input("Ask about your documents..." if st.session_state.rag_enabled else "Ask me anything...")

with toggle_col:
    st.session_state.force_web_search = st.toggle('🌐', help="Force web search")

# Check if RAG is enabled
if st.session_state.rag_enabled:
    qdrant_client = init_qdrant()

    # --- Document Upload Section (Moved to Main Area) ---
    with st.expander("📁 Upload Documents or URLs for RAG", expanded=False):
        if not qdrant_client:
            st.warning("⚠️ Could not connect to Qdrant. Make sure the Qdrant container is running on localhost:6333.")
        else:
            uploaded_files = st.file_uploader(
                "Upload PDF files",
                accept_multiple_files=True,
                type='pdf'
            )
            url_input = st.text_input("Enter URL to scrape")

            if uploaded_files:
                st.write(f"Processing {len(uploaded_files)} PDF file(s)...")
                all_texts = []
                for file in uploaded_files:
                    if file.name not in st.session_state.processed_documents:
                        with st.spinner(f"Processing {file.name}..."):
                            texts = process_pdf(file)
                            if texts:
                                all_texts.extend(texts)
                                st.session_state.processed_documents.append(file.name)
                    else:
                        st.write(f"📄 {file.name} already processed.")

                if all_texts:
                    with st.spinner("Creating vector store..."):
                        st.session_state.vector_store = create_vector_store(qdrant_client, all_texts)

            if url_input:
                if url_input not in st.session_state.processed_documents:
                    with st.spinner(f"Scraping and processing {url_input}..."):
                        texts = process_web(url_input)
                        if texts:
                            st.session_state.vector_store = create_vector_store(qdrant_client, texts)
                            st.session_state.processed_documents.append(url_input)
                else:
                    st.write(f"🔗 {url_input} already processed.")

            if st.session_state.vector_store:
                st.success("Vector store is ready.")
            elif not uploaded_files and not url_input:
                st.info("Upload PDFs or enter a URL to populate the vector store.")

    # Display sources in sidebar
    if st.session_state.processed_documents:
        st.sidebar.header("📚 Processed Sources")
        for source in st.session_state.processed_documents:
            if source.endswith('.pdf'):
                st.sidebar.text(f"📄 {source}")
            else:
                st.sidebar.text(f"🔗 {source}")

if prompt:
    # Add user message to history
    st.session_state.history.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    if st.session_state.rag_enabled:
        # Existing RAG flow remains unchanged
        with st.spinner("🔍 Evaluating the Query..."):
            try:
                rewritten_query = prompt
                with st.expander("Evaluating the query"):
                    st.write(f"User's Prompt: {prompt}")
            except Exception as e:
                st.error(f"❌ Error rewriting query: {str(e)}")
                rewritten_query = prompt

        # Step 2: Choose search strategy based on force_web_search toggle
        context = ""
        docs = []
        if not st.session_state.force_web_search and st.session_state.vector_store:
            # Try document search first
            retriever = st.session_state.vector_store.as_retriever(
                search_type="similarity_score_threshold",
                search_kwargs={
                    "k": 5,
                    "score_threshold": st.session_state.similarity_threshold
                }
            )
            docs = retriever.invoke(rewritten_query)
            if docs:
                context = "\n\n".join([d.page_content for d in docs])
                st.info(f"📊 Found {len(docs)} relevant documents (similarity > {st.session_state.similarity_threshold})")
            elif st.session_state.use_web_search:
                st.info("🔍 No relevant documents found in database, falling back to web search...")

        # Step 3: Use web search if:
        # 1. Web search is forced ON via toggle, or
        # 2. No relevant documents found AND web search is enabled in settings
        if (st.session_state.force_web_search or not context) and st.session_state.use_web_search and st.session_state.exa_api_key:
            with st.spinner("🌐 Searching the web..."):
                try:
                    web_search_agent = get_web_search_agent()
                    web_results = web_search_agent.run(rewritten_query).content
                    if web_results:
                        context = f"Web Search Results:\n{web_results}"
                        if st.session_state.force_web_search:
                            st.info("ℹ️ Using web search as requested via toggle.")
                        else:
                            st.info("ℹ️ Using web search as fallback since no relevant documents were found.")
                except Exception as e:
                    st.error(f"❌ Web search error: {str(e)}")

        # Step 4: Generate response using the RAG agent
        with st.spinner("🤔 Thinking..."):
            try:
                rag_agent = get_rag_agent()

                if context:
                    full_prompt = f"""Context: {context}

Original Question: {prompt}

Please provide a comprehensive answer based on the available information."""
                else:
                    full_prompt = f"Original Question: {prompt}\n"
                    st.info("ℹ️ No relevant information found in documents or web search.")

                response = rag_agent.run(full_prompt)

                # Add assistant response to history
                st.session_state.history.append({
                    "role": "assistant",
                    "content": response.content
                })

                # Display assistant response
                with st.chat_message("assistant"):
                    st.write(response.content)

                    # Show sources if available
                    if not st.session_state.force_web_search and 'docs' in locals() and docs:
                        with st.expander("📄 See document sources"):
                            for i, doc in enumerate(docs, 1):
                                source_type = doc.metadata.get("source_type", "unknown")
                                source_icon = "📄" if source_type == "pdf" else "🔗"
                                source_name = doc.metadata.get("file_name" if source_type == "pdf" else "url", "unknown")
                                st.write(f"{source_icon} Source {i} from {source_name}:")
                                st.write(f"{doc.page_content[:200]}...")

            except Exception as e:
                st.error(f"❌ Error generating response: {str(e)}")
    else:
        # Simple mode without RAG
        with st.spinner("🤔 Thinking..."):
            try:
                rag_agent = get_rag_agent()
                web_search_agent = get_web_search_agent() if st.session_state.use_web_search else None

                # Handle web search if forced or enabled
                context = ""
                if st.session_state.force_web_search and web_search_agent:
                    with st.spinner("🌐 Searching the web..."):
                        try:
                            web_results = web_search_agent.run(prompt).content
                            if web_results:
                                context = f"Web Search Results:\n{web_results}"
                                st.info("ℹ️ Using web search as requested.")
                        except Exception as e:
                            st.error(f"❌ Web search error: {str(e)}")

                # Generate response
                if context:
                    full_prompt = f"""Context: {context}

Question: {prompt}

Please provide a comprehensive answer based on the available information."""
                else:
                    full_prompt = prompt

                response = rag_agent.run(full_prompt)
                response_content = response.content

                # Extract thinking process and final response
                import re
                think_pattern = r'<think>(.*?)</think>'
                think_match = re.search(think_pattern, response_content, re.DOTALL)

                if think_match:
                    thinking_process = think_match.group(1).strip()
                    final_response = re.sub(think_pattern, '', response_content, flags=re.DOTALL).strip()
                else:
                    thinking_process = None
                    final_response = response_content

                # Add assistant response to history (only the final response)
                st.session_state.history.append({
                    "role": "assistant",
                    "content": final_response
                })

                # Display assistant response
                with st.chat_message("assistant"):
                    if thinking_process:
                        with st.expander("🧠 See thinking process"):
                            st.markdown(thinking_process)
                    st.markdown(final_response)

            except Exception as e:
                st.error(f"❌ Error generating response: {str(e)}")
else:
    st.warning("You can directly talk to qwen and gemma models locally! Toggle the RAG mode to upload documents!")