Xinference local deployment: full walkthrough and troubleshooting

Written by Audrey Miles
Updated on: July 9, 2025

Master Xinference local deployment and learn to resolve common deployment problems.

Core content:
1. Basic environment configuration: Docker and NVIDIA driver verification, CUDA toolchain setup
2. Docker container deployment: image pulling, container startup, GPU acceleration mode
3. Windows-specific configuration: network stack limitations and fixes


1. Basic environment configuration

1.  Docker and NVIDIA driver verification

Core steps:

  • Docker installation verification:
    docker --version   # Requires ≥ 24.0.5 (2025 compatibility requirement)
  • NVIDIA driver compatibility (GPU passthrough also needs the NVIDIA Container Toolkit; see the sketch after this list):
    • Check the driver version (must be ≥ 535.129.03):
      nvidia-smi | grep "Driver Version"   # Output example: Driver Version: 571.96.03
    • If the driver is missing or too old:
      sudo apt install -y nvidia-driver-570-server   # Enterprise-grade stable driver
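
Because the container is later started with --gpus all, Docker on the host must also be able to hand GPUs to containers. The article skips this step, so here is a minimal sketch assuming Ubuntu with NVIDIA's apt repository already configured:

# Install the NVIDIA Container Toolkit so Docker can expose GPUs to containers
sudo apt install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
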
2.  CUDA toolchain configuration

Key operations:

# Rebuild the CUDA repository (for Ubuntu 24.04)
sudo tee /etc/apt/sources.list.d/cuda.list <<EOF
deb https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/ /
EOF
# Move the signing key to the new canonical keyring path (apt-key is deprecated)
sudo mkdir -p /etc/apt/keyrings && sudo cp /etc/apt/trusted.gpg /etc/apt/keyrings/nvidia-cuda.gpg
sudo apt update
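
With the repository in place, the toolkit itself still needs to be installed. A minimal sketch follows; the package version 12-4 is an assumption, so pick the release that matches your driver:

sudo apt install -y cuda-toolkit-12-4   # Version is an assumption; match it to your driver
nvcc --version   # Verify the compiler (you may need /usr/local/cuda/bin on PATH)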

2. Docker container deployment

1.  Image pulling and container startup

GPU acceleration mode:

# -e XINFERENCE_MODEL_SRC=modelscope : specify the model download source
# -p 9998:9997                       : port mapping (host:container)
# --gpus all                         : enable GPU passthrough
# -v ...:ro                          : mount host driver libraries read-only
docker run -d --name xinference \
  -e XINFERENCE_MODEL_SRC=modelscope \
  -p 9998:9997 \
  --gpus all \
  -v /host/cuda/libs:/usr/lib/x86_64-linux-gnu:ro \
  xprobe/xinference:latest \
  xinference-local -H 0.0.0.0 --log-level debug

Verify GPU passthrough:

docker exec xinference nvidia-smi   # Output should match the host machine
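
It is also worth confirming that the API itself is up. A quick probe against the mapped host port (9998 in the run command above):

curl http://localhost:9998/v1/models   # Should return JSON (an empty list until a model is launched)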

2.  Windows-specific configuration

Source of the problem:

  • The Windows network stack has limited support for binding to 0.0.0.0; use 127.0.0.1 instead.

Fix:
# Container startup command adjustment (PowerShell)
# -v C:\xinference:/xinference : Windows path mount
docker run -d --name xinference `
  -v C:\xinference:/xinference `
  -p 9997:9997 `
  --gpus all `
  xprobe/xinference:latest `
  xinference-local -H 127.0.0.1 --log-level debug

Firewall configuration:

netsh advfirewall firewall add rule name="Xinference" dir=in action=allow protocol=TCP localport=9997
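
After adding the rule, confirm the port actually answers locally (curl.exe ships with current Windows builds):

curl.exe http://127.0.0.1:9997/v1/models   # Should return JSON rather than a connection error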

3. Model loading and API calls

1.  Model deployment process

Steps in detail:

  1. Upload the model files:
    docker cp qwen2.5-instruct/ xinference:/xinference/models/   # Host to container
  2. Start the model service:
    xinference launch -n qwen2.5-instruct -f pytorch -s 0_5   # Specify the model format and size in billions
  3. Verify the model status (see the end-to-end sketch after this list):
    curl http://localhost:9997/v1/models   # Check that the status is "Running"
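
Once the model is running, an end-to-end request can go through the OpenAI-compatible endpoint that Xinference exposes. A minimal sketch, with the port and model name taken from the launch command above:

curl http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-instruct", "messages": [{"role": "user", "content": "Hello"}]}'
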
2.  API Integration Examples

Python SDK call:

from xinference.client import Client

client = Client("http://localhost:9998")   # Note: the port mapped on the host machine
model_uid = client.launch_model(
    model_name="rerank-chinese",   # Custom rerank model from the deployment above
    model_type="rerank",
)
model = client.get_model(model_uid)
response = model.rerank(
    ["TensorFlow", "PyTorch", "Xinference"],   # Candidate documents
    "deep learning framework",                 # Query
)
print(response["results"])   # Each result carries a relevance_score
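
The same call is also available over plain HTTP. A sketch against the /v1/rerank endpoint, where <model_uid> is a placeholder for the UID returned by launch_model:

curl http://localhost:9998/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "<model_uid>", "query": "deep learning framework", "documents": ["TensorFlow", "PyTorch", "Xinference"]}'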

4. Production environment optimization

1.  Stability enhancement configuration
  • Container persistence (automatic restart plus a named data volume):
    docker run -d --restart unless-stopped \
      -v xinference_data:/root/.xinference \
      xprobe/xinference:latest
  • Enterprise-level mirror source (mainland China acceleration):
    sed -i 's|developer.download.nvidia.com|mirrors.aliyun.com/nvidia|g' /etc/apt/sources.list.d/cuda.list
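
To confirm the named volume really persists data across restarts, a quick check (names follow the commands above):

docker volume inspect xinference_data   # Shows the volume's mountpoint on the host
docker restart xinference && docker exec xinference ls /root/.xinference   # Contents should survive the restart
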
2.  Performance monitoring and tuning
  • Real-time resource monitoring:
    watch -n 1 nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
  • Flame graph generation:
    xinference profile -m rerank-chinese -o profile.html   # Locate inference bottlenecks
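
For longer test runs, nvidia-smi can log the same counters to a file by itself. A minimal sketch:

# Sample GPU utilization and memory once per second into a CSV log
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 >> gpu_usage.csv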

5. Cross-platform compatibility overview

  • Ubuntu 24.04
    • Key configuration: driver mount -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:ro
    • Verify command: ls /usr/lib/x86_64-linux-gnu/libcuda*
  • Windows 11
    • Key configuration: address binding -H 127.0.0.1; directory mount -v C:\xinference:/xinference
    • Verify command: docker logs xinference --tail 100