Deploying the large-model integration framework Xinference locally with Docker

Written by Audrey Miles
Updated on: June 24th, 2025
Recommendation

Xinference is a powerful, easy-to-use framework for integrating and deploying large models, and it can be run locally with Docker.

Core content:
1. Introduction to the Xinference framework and applicable scenarios
2. Preparations before deploying Xinference
3. How to obtain official images and build custom images

Xorbits Inference (Xinference) is a powerful and comprehensive distributed inference framework. It can be used for inference of various models such as large language models (LLM), speech recognition models, multimodal models, etc. With Xorbits Inference, you can easily deploy your own model or built-in cutting-edge open source models with one click. Whether you are a researcher, developer, or data scientist, you can explore more possibilities through Xorbits Inference and the most cutting-edge AI models.
In plain words: Xinference is a model deployment framework that can deploy the desired large open source model with one click.
What Xinference can do

Preparation

  • Xinference uses the GPU to accelerate inference, so this image must run on a machine with an NVIDIA GPU.

  • Runs on a CUDA-enabled machine.

  • Ensure that CUDA is correctly installed on your machine. You can run nvidia-smi to verify it works (see the commands after this list).

  • The CUDA version inside the image is 12.4. To avoid unexpected problems, upgrade the host's CUDA version to 12.4 or above and its NVIDIA driver to version 550 or above.
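For example, the following standard nvidia-smi commands (not specific to Xinference) report the driver status on the host:

# Show GPU status; the header of the output reports the driver and CUDA versions
nvidia-smi
# Print only the driver version (should be 550 or above for the CUDA 12.4 image)
nvidia-smi --query-gpu=driver_version --format=csv,noheader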

Docker images

Currently, you can pull the official Xinference image through two channels:

  1. From Docker Hub: xprobe/xinference

  2. From a copy kept in Alibaba Cloud's public image registry, for users who have trouble reaching Docker Hub. Pull command: docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:<tag>

The currently available tags are listed below, followed by example pull commands:

  • nightly-main: This image is built from the GitHub main branch every day and is not guaranteed to be stable or reliable.

  • v<release version> : This image is created every time Xinference is released and can generally be considered stable and reliable.

  • latest: This image points to the most recent release of Xinference.

  • For CPU-only versions, append the -cpu suffix, e.g. nightly-main-cpu.
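For example, to pull a concrete release (v1.5.0 is the version used in the deployment commands later in this article) from either channel:

docker pull xprobe/xinference:v1.5.0
# or, via the Alibaba Cloud mirror
docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.5.0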

Custom images

If you need to install additional dependencies, you can refer to xinference/deploy/docker/Dockerfile (https://inference.readthedocs.io/zh-cn/latest/getting_started/using_docker_image.html). Make sure to build with this Dockerfile from the root directory of the Xinference project. For example:


git clone https://github.com/xorbitsai/inference.git
cd inference
docker build --progress=plain -t test -f xinference/deploy/docker/Dockerfile .
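Alternatively, if you only need a few extra Python packages, a small derived image based on the official one is usually enough. The Dockerfile below is only a sketch: the package name is a placeholder, and it assumes pip is available in the official (Python-based) image.

# Dockerfile.custom -- minimal sketch of a derived image
FROM xprobe/xinference:latest
# Replace the placeholder with the dependencies you actually need
RUN pip install --no-cache-dir <your-extra-package>

Then build it with:

docker build -t xinference-custom -f Dockerfile.custom .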

Using the image

You can start Xinference in the container as follows: map container port 9997 to host port 9998, set the log level to DEBUG, and specify the required environment variables.

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug
Warning:
  • --gpus must be specified. As described above, the image must run on a machine with a GPU; otherwise an error will occur.

  • -H 0.0.0.0 must also be specified; otherwise you will not be able to reach the Xinference service from outside the container.

  • You can pass multiple -e options to set multiple environment variables (see the example after this list).
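For example, setting several environment variables at once (both variables appear elsewhere in this article; the values here are placeholders):

docker run -e XINFERENCE_MODEL_SRC=modelscope -e XINFERENCE_HOME=</on/the/container> -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug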

Of course, you can also start the container first, enter it, and launch Xinference manually, as sketched below.
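A minimal sketch of that interactive approach, assuming bash is available in the image and that the start command can be overridden (consistent with the commands above, which pass xinference-local explicitly):

docker run -it --gpus all -p 9998:9997 xprobe/xinference:v<your_version> bash
# then, inside the container:
xinference-local -H 0.0.0.0 --log-level debug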

Mount model directory

By default, the image does not contain any model files; models are downloaded inside the container at run time. If you want to reuse models you have already downloaded, you need to mount a host directory into the container. In that case, specify a local volume when running the container and configure the corresponding environment variable for Xinference.

docker run -v </on/your/host>:</on/the/container> -e XINFERENCE_HOME=</on/the/container> -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0

The command above mounts the specified host directory into the container and points the XINFERENCE_HOME environment variable at that path inside the container. All downloaded model files are then stored in the directory you specified on the host, so they are not lost when the Docker container stops, and the next time you run the container you can use the existing models without downloading them again.

If you have already downloaded models to the default paths on the host, note that the Xinference cache directory stores models via soft links, so you also need to mount the directories containing the original files into the container. For example, if you use Hugging Face and ModelScope as model repositories, you need to mount the two corresponding cache directories, which are usually <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. The command looks like this:

docker run \
  -v </your/home/path>/.xinference:/root/.xinference \
  -v </your/home/path>/.cache/huggingface:/root/.cache/huggingface \
  -v </your/home/path>/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  --gpus all \
  xprobe/xinference:v<your_version> \
  xinference-local -H 0.0.0.0

Start deployment:

mkdir -p /data/xinference && cd /data/xinference

docker run -d --privileged --gpus all --restart always \
  -v /data/xinference/.xinference:/root/.xinference \
  -v /data/xinference/.cache/huggingface:/root/.cache/huggingface \
  -v /data/xinference/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  xprobe/xinference:v1.5.0 \
  xinference-local -H 0.0.0.0

The same command as a single line:

docker run -d --privileged --gpus all --restart always -v /data/xinference/.xinference:/root/.xinference -v /data/xinference/.cache/huggingface:/root/.cache/huggingface -v /data/xinference/.cache/modelscope:/root/.cache/modelscope -p 9997:9997 xprobe/xinference:v1.5.0 xinference-local -H 0.0.0.0

At this point, Xinference has been deployed successfully and can be accessed at http://<ip>:9997.
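A quick way to check that the service is reachable, assuming the OpenAI-compatible API that Xinference exposes on the same port (replace <ip> with your host address):

# Should return a JSON list of models, empty until you launch one
curl http://<ip>:9997/v1/models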