Guide to Local Deployment of Embedding and Reranking Models

Get an in-depth understanding of the application of RAG technology in knowledge base management and the local deployment process.
Core content:
1. The importance of RAG technology in large language model knowledge base management
2. The collaborative working mechanism of the embedding model and the reranking model
3. Detailed steps to deploy Xinference + bge-m3 / bge-reranker-v2-m3 on macOS
An important reason for deploying AI applications locally is to use RAG technology to manage the knowledge base.
In short, a large language model's context window is usually only 16K-128K tokens, roughly 30,000 to 120,000 words. That is clearly not enough for managing a knowledge base, and overly long inputs also dilute the model's attention. There are exceptions, of course, such as Google's Gemini and MiniMax (the company behind the Hailuo/Conch app), whose context windows reach one million and four million tokens respectively.
RAG technology first uses an embedding model to vectorize the text and do a coarse, similarity-based screening over the large corpus, then uses a rerank model to re-order the retrieved candidates, and finally passes the top results to the large language model for processing (a concrete sketch of the two calls follows the links below). For more background, see:
[[RAG Series (I): This article helps you understand RAG implementation from the basics to the depths]]
[[Why does RAG need to be reranked?]]
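Concretely, once the Xinference service set up later in this article is running, the two retrieval stages come down to two HTTP calls. The endpoint paths below follow the OpenAI-compatible /v1/embeddings convention and Xinference's rerank API; treat them as a sketch and verify against the docs of your Xinference version:
# Stage 1: embed the query (and, at indexing time, the documents) for coarse vector retrieval
curl http://localhost:9997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "How do I apply for annual leave?"}'
# Stage 2: rerank the retrieved candidates against the query before handing the best ones to the LLM
curl http://localhost:9997/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-reranker-v2-m3", "query": "How do I apply for annual leave?", "documents": ["Annual leave requests are submitted through the HR portal.", "The cafeteria opens at 8 am."]}'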
This article takes deploying Xinference + bge-m3 / bge-reranker-v2-m3 on macOS as its example:
1. Deploy Conda
1. Download the Miniconda installation script
Download the installation script for Apple Silicon Macs:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh
Note 1: If you see
zsh: command not found: wget
run
brew install wget
to install wget. You can also paste the error log into a large model such as Doubao for help; if you don't even have Homebrew, ask it how to install Homebrew first.
Note 2: Intel-chip Macs and PCs need to check the Miniconda official website for the corresponding installer and replace the script file name accordingly (see the example below).
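For reference, the Intel (x86_64) installer currently follows the same naming pattern; double-check the exact file name on the Miniconda download page before relying on it:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh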
2. Run the installation script
bash Miniconda3-latest-MacOSX-arm64.sh
3. Refresh the Shell environment
~/miniconda3/bin/conda init zsh
# Initialize conda for zsh; change ~/miniconda3 to your actual install path
source ~/.zshrc
# Reload the shell configuration (or open a new terminal) so the conda command becomes available
4. Verify the installation
conda --version
2. Install Xinference using Conda
1. Create and activate a virtual environment
conda create --name xinference_env310 python=3.10
#Create a virtual environment
conda activate xinference_env310
#Activate the virtual environment
2. Install necessary dependencies
pip install torch
pip install "transformers>=4.36.0"
pip install "sentence-transformers>=3.2.0"
3. Hardware acceleration (optional)
# Apple M Series
CMAKE_ARGS= "-DLLAMA_METAL=on" pip install llama-cpp-python
# NVIDIA graphics card
CMAKE_ARGS= "-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# AMD Graphics Cards
CMAKE_ARGS= "-DLLAMA_HIPBAS=on" pip install llama-cpp-python
4. Install Xinference
pip install xinference
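The plain xinference package is enough for the steps in this article; depending on the version, you may also want the extra that bundles the Transformers backend used by the bge models. The extra name below is my assumption based on the Xinference install docs, so fall back to the plain install if pip rejects it:
pip install "xinference[transformers]"
# Confirm the installation
pip show xinference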
3. Start Xinference Service
# Foreground operation
xinference-local --host 0.0.0.0 --port 9997
# Background operation
nohup bash -c 'xinference-local --host 0.0.0.0 --port 9997' > xinference.log 2>&1 &
Verification address: http://localhost:9997
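Besides opening the WebUI in a browser, you can check the service from the terminal. Xinference exposes an OpenAI-compatible API, so listing models should work; the path below is assumed from that convention:
curl http://localhost:9997/v1/models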
4. Model Installation
Download via the WebUI:
Recommended models: bge-m3, bge-reranker-v2-m3
Keep the terminal running while the models download (see the next section for running the service in the background).
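If you prefer the command line to the WebUI, Xinference also has a launch subcommand. The flags below follow its documented pattern, but double-check with xinference launch --help on your version; the first launch triggers the model download:
xinference launch --model-name bge-m3 --model-type embedding
xinference launch --model-name bge-reranker-v2-m3 --model-type rerank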
5. Create a background daemon service (macOS)
Open Automator → Create a new "Application"
Add the "Run Shell Script" component:
# Adjust the conda.sh path and environment name to match your installation
source ~/miniconda3/etc/profile.d/conda.sh
conda activate xinference_env310
nohup xinference-local -H 0.0.0.0 --port 9997 > /tmp/xinference.log 2>&1 &
Save it as XinferenceDaemon.app to /Applications
Add the app to System Preferences → Users & Groups → Login Items
View the logs:
tail -f /tmp/xinference.log
6. Dify call configuration
Install the OpenAI-API-Compatible plugin
Settings → Model Provider:
Rerank/Embedding module configuration
Fill in the corresponding Xinference interface information (such as host.docker.internal:9997)
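To confirm that Dify's Docker containers can actually reach Xinference, you can probe the endpoint from inside a container. The container name dify-api-1 is a placeholder for illustration; use docker ps to find the real one:
# host.docker.internal must resolve to the macOS host from inside the container
docker exec -it dify-api-1 curl http://host.docker.internal:9997/v1/models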