Local Docker deployment of the large-model integration framework Xinference

Xinference is a powerful and easy-to-use framework for serving large models, and it can be deployed locally as a Docker container.
Core content:
1. Introduction to the Xinference framework and applicable scenarios
2. Preparations before deploying Xinference
3. How to obtain official images and build custom images
Preparation
Xinference uses the GPU to accelerate inference, so this image must run on a machine with an NVIDIA GPU and CUDA installed. Make sure CUDA is correctly installed on the host; you can verify it with nvidia-smi. The CUDA version inside the image is 12.4. To avoid unexpected problems, upgrade the host's CUDA version to 12.4 or above and the NVIDIA driver version to 550 or above.
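A quick way to check the host before pulling the image (a minimal sketch; the version thresholds are the ones quoted above):
# Show the GPUs, driver version, and the CUDA version supported by the driver
nvidia-smi
# Print just the driver version (should be 550 or above)
nvidia-smi --query-gpu=driver_version --format=csv,noheader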
Docker images
Currently, you can pull the official Xinference image from two channels:
1. Docker Hub: xprobe/xinference
2. A copy of the Docker Hub image is also pushed to Alibaba Cloud's public image registry for users who have difficulty reaching Docker Hub. Pull command: docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:<tag>
The currently available tags are:
- nightly-main: rebuilt from the GitHub main branch every day; not guaranteed to be stable or reliable.
- v<release version>: built for each Xinference release; generally considered stable and reliable.
- latest: points to the most recent release at the time Xinference is released.
For the CPU-only version, add the -cpu suffix, e.g. nightly-main-cpu.
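For example, pulling the latest stable image from Docker Hub and the CPU-only nightly build (using the tags listed above):
docker pull xprobe/xinference:latest
docker pull xprobe/xinference:nightly-main-cpu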
Custom images
If you need to install additional dependencies, you can refer to xinference/deploy/docker/Dockerfile (https://inference.readthedocs.io/zh-cn/latest/getting_started/using_docker_image.html). Make sure to run the build from the root directory of the Xinference project, using that Dockerfile. For example:
git clone https://github.com/xorbitsai/inference.git
cd inference
docker build --progress=plain -t test -f xinference/deploy/docker/Dockerfile .
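If you only need a few extra Python packages, a lighter option is to build on top of the official image instead of the project Dockerfile. This is a minimal sketch; the tag, the custom image name, and the sentence-transformers package are placeholders, not part of the official instructions:
# Write a one-off Dockerfile that extends the official image with extra dependencies
cat > Dockerfile.custom <<'EOF'
FROM xprobe/xinference:latest
RUN pip install --no-cache-dir sentence-transformers
EOF
# Build the custom image
docker build -t xinference-custom -f Dockerfile.custom .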
Using the image
You can start Xinference in the container as follows, map port 9997 to port 9998 on the host, set the log level to DEBUG, and specify the required environment variables.
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug
--gpus must be specified; as noted above, the image has to run on a machine with a GPU, otherwise an error will occur.
-H 0.0.0.0 must also be specified, otherwise you will not be able to reach the Xinference service from outside the container.
You can pass multiple -e options to set multiple environment variables.
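For example, combining the two environment variables used elsewhere in this article in a single run (the container path for XINFERENCE_HOME is a placeholder and normally goes together with a -v mount, as described in the Mount model directory section below):
docker run -e XINFERENCE_MODEL_SRC=modelscope -e XINFERENCE_HOME=</on/the/container> -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug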
Of course, you can also start the container first and then enter it to launch Xinference manually.
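A sketch of that interactive approach, assuming the image lets you override the default command with a shell:
# Start an interactive shell in the container (GPU and port mapping as before)
docker run -it --gpus all -p 9998:9997 xprobe/xinference:v<your_version> bash
# Then, inside the container, start Xinference manually:
xinference-local -H 0.0.0.0 --log-level debug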
Mount model directory
By default, the image does not contain any model files; models are downloaded inside the container as they are used. If you want to reuse models that have already been downloaded, you need to mount the host directory that contains them into the container. In this case, specify a local volume when running the container and configure the corresponding environment variable for Xinference.
docker run -v </on/your/host>:</on/the/container> -e XINFERENCE_HOME=</on/the/container> -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0
The command above mounts the specified host directory into the container and sets the XINFERENCE_HOME environment variable to point to that directory inside the container. All downloaded model files are then stored in the host directory you specified, so you don't need to worry about losing them when the Docker container stops; the next time you run the container, the existing models can be used directly without downloading them again.
If you have already downloaded models on the host using the default paths, note that the xinference cache directory stores models via soft links, so you must also mount the directories containing the original files into the container. For example, if you use Hugging Face and ModelScope as model repositories, you need to mount both corresponding cache directories, which are normally <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. The command looks like this:
docker run \
  -v </your/home/path>/.xinference:/root/.xinference \
  -v </your/home/path>/.cache/huggingface:/root/.cache/huggingface \
  -v </your/home/path>/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  --gpus all \
  xprobe/xinference:v<your_version> \
  xinference-local -H 0.0.0.0
Start deployment:
mkdir -p /data/xinference && cd /data/xinference
docker run -d --privileged --gpus all --restart always \
  -v /data/xinference/.xinference:/root/.xinference \
  -v /data/xinference/.cache/huggingface:/root/.cache/huggingface \
  -v /data/xinference/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  xprobe/xinference:v1.5.0 \
  xinference-local -H 0.0.0.0
At this point, Xinference is deployed successfully and can be accessed at http://<host_ip>:9997.
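As a quick sanity check (a sketch; Xinference exposes an OpenAI-compatible REST API, so listing the currently running models should return a JSON response):
curl http://<host_ip>:9997/v1/models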