Guide to Local Deployment of Embedding and Reranking Models

Get an in-depth understanding of the application of RAG technology in knowledge base management and the local deployment process.
Core content:
1. The importance of RAG technology in large language model knowledge base management
2. The collaborative working mechanism of the embedding model and the reranking model
3. Detailed steps to deploy Xinference + bge-m3 / bge-reranker-v2-m3 on macOS
An important reason for deploying AI applications locally is to use RAG technology to manage the knowledge base.
In short, a large language model's context window is usually only 16K-128K tokens, roughly 30,000 to 120,000 words. That is clearly not enough for managing a knowledge base, and overly long inputs also dilute the model's attention. There are exceptions, of course, such as Google's Gemini and MiniMax (the company behind the Hailuo/Conch app), whose context windows reach one million and four million tokens respectively.
RAG technology first uses an embedding model to vectorize the text and do a coarse, similarity-based screening over the large corpus, then uses a rerank model to re-order the retrieved candidates, and finally passes the top results to the large language model for processing (a concrete sketch of the two calls follows the links below). For more background, see:
[[RAG Series (I): This article helps you understand RAG implementation from the basics to the depths]]
[[Why does RAG need to be reranked?]]
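Concretely, once the Xinference service set up later in this article is running, the two retrieval stages come down to two HTTP calls. The endpoint paths below follow the OpenAI-compatible /v1/embeddings convention and Xinference's rerank API; treat them as a sketch and verify against the docs of your Xinference version:
# Stage 1: embed the query (and, at indexing time, the documents) for coarse vector retrieval
curl http://localhost:9997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "How do I apply for annual leave?"}'
# Stage 2: rerank the retrieved candidates against the query before handing the best ones to the LLM
curl http://localhost:9997/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-reranker-v2-m3", "query": "How do I apply for annual leave?", "documents": ["Annual leave requests are submitted through the HR portal.", "The cafeteria opens at 8 am."]}'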
This article takes deploying Xinference + bge-m3 / bge-reranker-v2-m3 on macOS as its example:
1. Deploy Conda
1. Download the Miniconda installation script
Download the installation script for Apple Silicon Macs:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh
Note 1: If you see
zsh: command not found: wget
run
brew install wget
to install wget. You can also paste the error log into a large model such as Doubao for help; if you don't even have Homebrew, ask it how to install Homebrew first.
Note 2: Intel-chip Macs and PCs need to check the Miniconda official website for the corresponding installer and replace the script file name accordingly (see the example below).
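For reference, the Intel (x86_64) installer currently follows the same naming pattern; double-check the exact file name on the Miniconda download page before relying on it:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh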
2. Run the installation script
bash Miniconda3-latest-MacOSX-arm64.sh
3. Refresh the Shell environment
~/miniconda3/bin/conda init zsh
# Initialize conda for zsh; change ~/miniconda3 to your actual install path
source ~/.zshrc
# Reload the shell configuration (or open a new terminal) so the conda command becomes available
4. Verify the installation
conda --version
2. Install Xinference using Conda
1. Create and activate a virtual environment
conda create --name xinference_env310 python=3.10
#Create a virtual environment
conda activate xinference_env310
#Activate the virtual environment
2. Install necessary dependencies
pip install torch
pip install "transformers>=4.36.0"
pip install "sentence-transformers>=3.2.0"
3. Hardware acceleration (optional)
# Apple M Series
CMAKE_ARGS= "-DLLAMA_METAL=on" pip install llama-cpp-python
# NVIDIA graphics card
CMAKE_ARGS= "-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# AMD Graphics Cards
CMAKE_ARGS= "-DLLAMA_HIPBAS=on" pip install llama-cpp-python
4. Install Xinference
pip install xinference
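The plain xinference package is enough for the steps in this article; depending on the version, you may also want the extra that bundles the Transformers backend used by the bge models. The extra name below is my assumption based on the Xinference install docs, so fall back to the plain install if pip rejects it:
pip install "xinference[transformers]"
# Confirm the installation
pip show xinference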
3. Start Xinference Service
# Foreground operation
xinference-local --host 0.0.0.0 --port 9997
# Background operation
nohup bash -c 'xinference-local --host 0.0.0.0 --port 9997' > xinference.log 2>&1 &
Verification address: http://localhost:9997
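Besides opening the WebUI in a browser, you can check the service from the terminal. Xinference exposes an OpenAI-compatible API, so listing models should work; the path below is assumed from that convention:
curl http://localhost:9997/v1/models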
4. Model Installation
Download via the WebUI:
Recommended models: bge-m3, bge-reranker-v2-m3
Keep the terminal running while the models download (see the next section for running the service in the background).
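If you prefer the command line to the WebUI, Xinference also has a launch subcommand. The flags below follow its documented pattern, but double-check with xinference launch --help on your version; the first launch triggers the model download:
xinference launch --model-name bge-m3 --model-type embedding
xinference launch --model-name bge-reranker-v2-m3 --model-type rerank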
5. Create a background daemon service (macOS)
Open Automator → Create a new "Application"
Add the "Run Shell Script" component:
# Adjust the conda.sh path and environment name to match your installation
source ~/miniconda3/etc/profile.d/conda.sh
conda activate xinference_env310
nohup xinference-local -H 0.0.0.0 --port 9997 > /tmp/xinference.log 2>&1 &
Save it as XinferenceDaemon.app to /Applications
Add the app to System Preferences → Users & Groups → Login Items
View the logs:
tail -f /tmp/xinference.log
6. Dify call configuration
Install the OpenAI-API-Compatible plugin
Settings → Model Provider:
Rerank/Embedding module configuration
Fill in the corresponding Xinference interface information (such as host.docker.internal:9997)
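To confirm that Dify's Docker containers can actually reach Xinference, you can probe the endpoint from inside a container. The container name dify-api-1 is a placeholder for illustration; use docker ps to find the real one:
# host.docker.internal must resolve to the macOS host from inside the container
docker exec -it dify-api-1 curl http://host.docker.internal:9997/v1/models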