Xinference local deployment: full walkthrough and troubleshooting

Updated on: July 9, 2025
Recommendation
Master Xinference local deployment and resolve common deployment problems.
Core content:
1. Basic environment configuration: Docker and NVIDIA driver verification, CUDA toolchain configuration
2. Docker container deployment: image pulling, container startup, GPU acceleration mode
3. Windows system special configuration: network stack support issues and repair solutions
Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Expert (TVP)
1. Basic environment configuration
1. Docker and NVIDIA driver verification
Core steps:
Docker installation verification: docker --version # requires ≥ 24.0.5 (2025 compatibility baseline)
NVIDIA driver compatibility: check the driver version (needs ≥ 535.129.03): nvidia-smi | grep "Driver Version" # example output: Driver Version: 571.96.03
If the driver is missing or too old: sudo apt install -y nvidia-driver-570-server # enterprise-grade stable driver branch
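To make these checks repeatable, the two version gates can be combined into one script; a minimal sketch, using the 24.0.5 and 535.129.03 thresholds stated above:
docker --version || { echo "Docker is not installed"; exit 1; }
# Read the driver version directly instead of grepping the human-readable output
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
min="535.129.03"
# sort -V orders version strings; if the minimum does not sort first, the driver is too old
if [ "$(printf '%s\n' "$min" "$driver" | sort -V | head -n1)" != "$min" ]; then
  echo "Driver $driver is below the required $min"; exit 1
fi
echo "Driver $driver OK"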
2. CUDA toolchain configuration
Key operations:
# Rebuild the CUDA repository list (Ubuntu 24.04)
sudo tee /etc/apt/sources.list.d/cuda.list <<EOF
deb https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/ /
EOF
# Move the signing key to the keyring path expected by current APT key management
sudo mkdir -p /etc/apt/keyrings && sudo cp /etc/apt/trusted.gpg /etc/apt/keyrings/nvidia-cuda.gpg
sudo apt update
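Before moving on, it is worth confirming that APT actually resolves the new repository; a quick check, assuming the cuda-toolkit meta-package name used by NVIDIA's repositories:
apt-cache policy cuda-toolkit   # the candidate should come from developer.download.nvidia.cn
nvcc --version                  # only succeeds once a toolkit is installed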
2. Docker container deployment
1. Image pulling and container startup
GPU acceleration mode:
# Flag notes (bash does not allow trailing comments after "\" continuations):
#   -e XINFERENCE_MODEL_SRC=modelscope  -> model download source
#   -p 9998:9997                        -> port mapping (host:container)
#   --gpus all                          -> enable GPU passthrough
#   -v /host/cuda/libs:...              -> mount host driver libraries read-only
docker run -d --name xinference \
  -e XINFERENCE_MODEL_SRC=modelscope \
  -p 9998:9997 \
  --gpus all \
  -v /host/cuda/libs:/usr/lib/x86_64-linux-gnu:ro \
  xprobe/xinference:latest \
  xinference-local -H 0.0.0.0 --log-level debug
Verify GPU passthrough:
docker exec xinference nvidia-smi # output should match the host machine
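nvidia-smi inside the container proves GPU visibility, but not that the service answers; a quick liveness check on the host-mapped port (9998 per the run command above):
curl -s http://localhost:9998/v1/models   # a JSON model list means the API is up
docker logs xinference --tail 20          # inspect recent logs for startup errors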
2. Windows-specific configuration
Source of the problem:
Docker Desktop's Windows network stack has limited support for binding to 0.0.0.0, so the service must listen on 127.0.0.1 instead.
Fix:
# Adjusted container startup command (PowerShell). Trailing comments after the
# backtick continuation would break the command, so note here: C:\xinference is
# the Windows path mounted into the container at /xinference.
docker run -d --name xinference `
  -v C:\xinference:/xinference `
  -p 9997:9997 `
  --gpus all `
  xprobe/xinference:latest `
  xinference-local -H 127.0.0.1 --log-level debug
Firewall configuration:
netsh advfirewall firewall add rule name="Xinference" dir=in action=allow protocol=TCP localport=9997
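To confirm the rule was registered, the same netsh tool can list it back:
netsh advfirewall firewall show rule name="Xinference"   # should echo the rule just added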
3. Model loading and API calls
1. Model deployment process
Steps in detail:
Upload the model files: docker cp qwen2.5-instruct/ xinference:/xinference/models/ # host → container
Start the model service: xinference launch -n qwen2.5-instruct -f pytorch -s 0_5 # model name, format, and size in billions (0_5 = 0.5B)
Verify the model status: curl http://localhost:9997/v1/models # use your host-mapped port (9998 in the GPU example above); the model should be listed as running
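Once the model is listed, a smoke test through the OpenAI-compatible chat endpoint verifies end-to-end inference; a sketch assuming the model is addressable by the name used at launch (adjust the port to your host mapping):
curl -s http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-instruct", "messages": [{"role": "user", "content": "Hello"}]}'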
2. API Integration Examples
Python SDK call:
from xinference.client import Client

client = Client("http://localhost:9998")  # note: the host-mapped port, not the container port
# Launch the rerank model by type; the client returns a UID for later lookups
model_uid = client.launch_model(
    model_name="rerank-chinese",
    model_type="rerank",
)
# Rerank calls go through a model handle rather than the client itself
model = client.get_model(model_uid)
response = model.rerank(
    documents=["TensorFlow", "PyTorch", "Xinference"],
    query="deep learning framework",
)
for item in response["results"]:
    print(item["index"], item["relevance_score"])  # relevance score per document
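The same call can also be issued over plain HTTP; a sketch assuming Xinference's /v1/rerank REST route and the 9998 host mapping used above:
curl -s http://localhost:9998/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "rerank-chinese", "query": "deep learning framework", "documents": ["TensorFlow", "PyTorch", "Xinference"]}'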
4. Production environment optimization
1. Stability enhancement configuration
Container persistence:
# --restart unless-stopped restarts the container automatically; the named
# volume persists model data across container rebuilds
docker run -d --restart unless-stopped \
  -v xinference_data:/root/.xinference \
  xprobe/xinference:latest
Enterprise-level image source (domestic mirror acceleration):
sed -i 's|developer.download.nvidia.com|mirrors.aliyun.com/nvidia|g' /etc/apt/sources.list.d/cuda.list
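Both settings are easy to verify after the fact; a short check using docker inspect's Go-template output:
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' xinference   # expect: unless-stopped
docker volume inspect xinference_data                               # shows where the data lives on the host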
2. Performance monitoring and tuning
Real-time resource monitoring: watch -n 1 nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
Flame graph generation: xinference profile -m rerank-chinese -o profile.html # locate the inference bottleneck
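For longer runs, the same nvidia-smi query can log samples continuously instead of requiring watch; a minimal sketch that appends one CSV row per second:
# -l 1 makes nvidia-smi loop once per second; redirect to keep a history
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv,noheader -l 1 >> gpu_usage.csv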
5. Cross-platform compatibility overview

Platform | Key configuration | Verification command
Ubuntu 24.04 | -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:ro | ls /usr/lib/x86_64-linux-gnu/libcuda*
Windows 11 | -H 127.0.0.1, directory mount: -v C:\xinference:/xinference | docker logs xinference --tail 100