Rapidly build and deploy RAGs: A step-by-step guide to saving time and maximizing efficiency

Written by Clara Bennett
Updated on: July 1, 2025

Most RAGs are built on this technology stack; why reinvent the wheel every time?

Photo by Warren on Unsplash

RAGs make LLMs useful.

Yes, before RAG, LLMs were little more than toys. There weren't many applications beyond trivial tasks like sentiment classification. The main reason is that LLMs can't learn on the fly: anything that depends on real-time or private data simply doesn't work with an LLM alone.

This changed when RAGs came into practice.

RAGs allow us to build applications using live data, and they help us build smart applications around our private data using LLMs.

But if you ask anyone building RAGs what their technology stack is, you'll hear the same answer over and over, like a broken record. The first few stages of every RAG pipeline look much the same, and there are few alternatives to the core technologies.

My go-to starter apps feature the following technologies: LangChain (LlamaIndex is the only comparable alternative), ChromaDB, and OpenAI for both the LLM and the embeddings. I usually develop inside Docker environments because the results are easily reproducible on other people's computers (on top of the many other benefits containers bring).
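
To make that concrete, here is a minimal sketch of the stack wired together end to end. It assumes a recent LangChain with the langchain-openai and langchain-chroma integration packages installed and an OPENAI_API_KEY in the environment; the model name and the sample document are only illustrative, not taken from the template.

# Minimal RAG wiring: embed documents into Chroma, retrieve, and answer with OpenAI.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

docs = [Document(page_content="LangChain glues the retrieval pieces together.")]

# Embed the documents and store them in a local Chroma collection.
vector_store = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vector_store.as_retriever()

# Retrieve context for a question and let the LLM answer against it.
llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice
question = "What does LangChain do?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
answer = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {question}")
print(answer.content)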

I rarely need to wrap these apps in a front end. When I do, I use Streamlit or Gradio. I used to use Django, which is a great web framework, but if you are a full-time data scientist you are better off choosing between Streamlit and Gradio.

Since I noticed that I always start projects with this tech stack, I created a project template so that whenever I have an idea, I don't waste time on the boring basics.

I'm sharing it in this post; maybe it can save you time too.

How to launch a RAG application in minutes

Before we delve into any details, let's get our basic RAG application up and running.

For this to work, you must have Docker and Git installed on your computer and have a valid OpenAI API key.

First clone the following repository.

git clone git@github.com:thuwarakeshm/ragbasics.git
cd ragbasics

Create a new .env file in the project directory and put your OpenAI API key in it.

OPENAI_API_KEY=sk-proj-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
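
The application reads this key from the environment at runtime. As a rough sketch of how that typically works (assuming python-dotenv is installed; the template's actual loading code may differ):

# Load the key from .env locally; hosted platforms can inject it as a normal env var.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory, if present
openai_api_key = os.environ["OPENAI_API_KEY"]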

You are now ready to build and run a Docker container that serves the RAG application. Here is how:

docker build -t ragbasics .
docker run -p 8000:7860 --env-file .env --name ragbasics-001 ragbasics

If this is your first time building the image, it will take a few minutes before the server is ready. Then open your browser and visit http://localhost:8000 to access the application.

RAG Starter Template — Image provided by the author.

You can upload any of your PDF documents and submit them. Once submitted, the application will chunk your document and create a vector store from those chunks. You can then ask questions about the document and the application will answer them.

The RAG Starter Template in action — Image provided by the author.
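
Under the hood, an app like this is essentially two event handlers: one that builds the vector store when a file is uploaded, and one that answers questions against it. The Gradio sketch below shows only the wiring; build_vector_store and answer_question are hypothetical stand-ins for the template's own chunking and retrieval code.

import gradio as gr

def build_vector_store(pdf_file):
    # Placeholder: chunk the uploaded PDF and index the chunks in a vector store.
    return "Document indexed."

def answer_question(question):
    # Placeholder: retrieve relevant chunks and ask the LLM.
    return f"(answer to: {question})"

with gr.Blocks() as demo:
    pdf = gr.File(label="Upload a PDF")
    status = gr.Textbox(label="Status")
    pdf.upload(build_vector_store, inputs=pdf, outputs=status)

    question = gr.Textbox(label="Ask a question about the document")
    response = gr.Textbox(label="Answer")
    question.submit(answer_question, inputs=question, outputs=response)

demo.launch(server_name="0.0.0.0", server_port=7860)  # the port mapped in docker run above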

What can you change immediately?

This application is far from perfect, but it covers the common core of most RAG applications.

Beyond that core, there are a few things you will usually change. The most important is the RAG prompt. The final response is generated through a single LLM call with the retrieved information as context, so this prompt template plays a crucial role in the quality of the responses.

I used a basic template. It just asks the LLM to answer the user's question based on the context. But you may have to modify it to get better quality responses, or to avoid responding to specific queries.

Because this change is so common, I moved the prompt into pyproject.toml: you can specify your prompt in the configuration file instead of going into the code base.
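
For example, the prompt can be read with Python's built-in TOML parser, using the [rag_prompt] section shown later in this post. This is only a sketch of the idea, not the template's exact loader; tomllib requires Python 3.11 or later.

import tomllib

# Read the prompt template from the configuration file.
with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

prompt_template = config["rag_prompt"]["prompt_template"]
prompt = prompt_template.format(context="...retrieved chunks...", question="What is RAG?")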

Another common thing I change is the chunking strategy.

Chunking is how we split a large document into smaller, possibly overlapping, chunks. Recursive character splitting and Markdown splitting are two of the most popular chunking strategies.

Recursive character splitting treats everything in your document as text and creates a moving window of chunks. If you set the chunk size to 1000 and the overlap to 200, the window puts the first 1,000 characters into the first chunk, characters 801 through 1,800 into the second chunk, and so on. If you use this strategy, you will usually want to tune the chunk size and overlap, and you can do that directly in the pyproject.toml file.
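
Here is a quick sketch of those settings in code, assuming LangChain's langchain-text-splitters package (the sample text is made up):

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Retrieval-augmented generation pairs an LLM with your own documents. " * 100
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(text)  # chunks of at most 1000 characters, overlapping by roughly 200
print(len(chunks), len(chunks[0]))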

If your documents are already in Markdown format, you can use that structure to create more informative and relevant chunks. All of this can be set in the configuration file. Here is an example.

[chunking]
strategy = "recursive_character_text_splitter"
chunk_size = 1000
chunk_overlap = 200

# strategy = "markdown_splitter"
# headers_to_split_on = ["##", "###", "####", "#####", "######"]
# return_each_line = true
# strip_headers = false

[rag_prompt]
prompt_template = """
Answer the question in the below context:
{context}

Question: {question}
"""

If you wish to add a new chunking technique to the application, you can do so by implementing the ChunkingStrategy abstract class and registering the new strategy in the Chunker class.

# Create a chunking strategy
# chunking/your_chunking_strategy.py

from typing import List

from chunking.chunking import ChunkingStrategy
from langchain.docstore.document import Document


class YourChunkingStrategy(ChunkingStrategy):

    def chunk(self, documents: List[Document]) -> List[Document]:
        # Implement your own chunking technique here
        pass


# Register the new chunking strategy
# chunking/__init__.py

from chunking.your_chunking_strategy import YourChunkingStrategy


class Chunker:
    def __init__(self):
        ...
        elif config["chunking"]["strategy"] == "your_chunking_strategy":
            self.chunking_strategy = YourChunkingStrategy()
        else:
            raise ValueError(f"Invalid chunking strategy: {config['chunking']['strategy']}")
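
Once the new strategy is registered, you point the config at it and the rest of the pipeline stays untouched. The usage below is hypothetical and based only on the excerpt above; the template's actual Chunker interface may differ slightly.

# In pyproject.toml, set strategy = "your_chunking_strategy" under [chunking].
from langchain.docstore.document import Document
from chunking import Chunker

documents = [Document(page_content="Some text to split.")]
chunker = Chunker()  # picks the strategy configured in pyproject.toml
chunks = chunker.chunking_strategy.chunk(documents)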

Engineers spend a lot of time on chunking, and no single approach is universally better than the others. I have previously documented some of the things I learned about chunking; it was an experimental process with a lot of hyperparameter tuning.

Having a template like this can help us get it done faster. It can also help us communicate the process to another person easily.

Deploy to Hugging Face

After building your application, the next important thing is to deploy it.

There are many options out there, but Hugging Face Spaces is simple and popular among data scientists, so we'll stick with Hugging Face.

Why Docker?

I primarily build my applications inside Docker. There are several reasons for this.

First, I can share it with colleagues and the application will almost always run the same way on their computers.

Second, I can easily add more technologies to the stack. For example, suppose I want to use Neo4j instead of a vector store to build a knowledge graph (as in this example). I can simply create a docker-compose file and add a Neo4j container to it.

Third, Docker and Hugging Face play well together: you can quickly deploy Docker containers to HF Spaces. The example we discussed uses Gradio, which HF Spaces also supports natively, but Docker is more flexible.

Go to HF Spaces and create a new space. If you follow this example, the Free CPU option should be enough. For more powerful applications, you will need more resources.

Creating HF Space — Screenshot by the author.

Make sure you have selected Docker for the Space SDK and a blank template.

Once created, you will see instructions for cloning and updating the Space repository. Clone it to your computer, copy your project files to this repository, and then push the changes back to the Space.

WARNING: Do not push the .env file directly to HF Spaces.

cd <project_directory>
git clone <hfspace_repository> hfspace
cp -r app.py Dockerfile pyproject.toml requirements.txt chunking hfspace
cd hfspace
git add .
git commit -m "Deploying basic RAG app"
git push

When you push your changes, HF Spaces will build and deploy your Docker container.

HF Space builds and deploys Docker containers — Author's image

But the application is not available yet. Remember that we did not push the .env file? We have to provide the environment variables securely through the HF Space settings.

You can find this setting in the Variables and Keys section of the Settings tab of your HF Space. Click the New Key button and add the OPENAI_API_KEY variable.

Adding environment variables in HF Space — Screenshot by the author.

After saving the key, the application will restart. You can now access the space and see your application running live.

Basic RAG deployed to a Hugging Face Space — Screenshot by the author.

Final Thoughts

Retrieval-augmented applications are one of the primary use cases for LLMs. Yet most RAG applications use the same technology stack, so engineers spend a lot of time reinventing the wheel.

I created a template project that helps me whenever I want to start a RAG app, and this article walks through that template. You can steal it to quickly build a RAG app and deploy it to your own Space, or use it as the basis for a similar app and never worry about the boilerplate again.