vLLM+Qwen-32B+Open Web UI builds a local private large model

Written by Iris Vance
Updated on: June 29, 2025

Explore how to build and deploy a local, private large language model for high-performance AI inference.

Core content:
1. vLLM: A high-performance large language model inference engine developed by a team at the University of California, Berkeley
2. ModelScope: Alibaba Group's open source Model as a Service (MaaS) platform
3. NVIDIA GPU configuration and vLLM container operation guide


Table of contents

  1. Introduction to vLLM
  2. Introduction to ModelScope
  3. Enable and optimize NVIDIA GPU
  4. Running the vLLM container
  5. Open Web UI deployment
  6. Tokens and context

1. Introduction to vLLM

vLLM is a high-performance, scalable large language model (LLM) inference engine developed by a team at the University of California, Berkeley. It focuses on delivering high-throughput, low-latency, low-cost model serving through innovative memory management and compute optimizations. Its PagedAttention memory management technique significantly improves GPU memory utilization, and it supports distributed inference that makes efficient use of multi-node, multi-GPU resources. Whether for low-latency, high-throughput online services or resource-constrained edge deployments, vLLM delivers excellent performance.

  • Chinese site : https://vllm.hyper.ai/docs/
  • English site : https://docs.vllm.ai/en/latest/index.html

2. Introduction to ModelScope

ModelScope is an open-source Model-as-a-Service (MaaS) platform launched by Alibaba Group. It aims to simplify how models are applied and to give AI developers a flexible, easy-to-use, low-cost, one-stop model service. The platform brings together a wide range of state-of-the-art machine learning models covering natural language processing, computer vision, speech recognition, and other fields, and provides rich API interfaces and tools so that developers can easily integrate and use them.

  • Official website : https://modelscope.cn/models
Install ModelScope
pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple
Create a storage directory
mkdir -p /data/Qwen/models/Qwen-32B
Download the QwQ-32B model
modelscope download --local_dir /data/Qwen/models/Qwen-32B --model Qwen/QwQ-32B
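The download is large (a 32B bfloat16 checkpoint is on the order of 60+ GB), so it is worth confirming that the weights and tokenizer files actually landed in the target directory before moving on. A minimal sanity check, assuming the paths used above:
ls -lh /data/Qwen/models/Qwen-32B    # expect *.safetensors shards plus config.json and tokenizer files
du -sh /data/Qwen/models/Qwen-32B    # total size should roughly match the published model size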

3. Enable and optimize NVIDIA GPU
Update package lists and install the NVIDIA Container Toolkit
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
Configure the NVIDIA container runtime for Docker
sudo nvidia-ctk runtime configure --runtime=docker
Restart the Docker service
sudo systemctl daemon-reload && sudo systemctl restart docker
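Before starting the vLLM container, it is worth verifying that Docker can actually see the GPUs through the NVIDIA runtime. A quick sanity check, assuming the NVIDIA driver is already installed on the host:
nvidia-smi    # the host driver should list all GPUs
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi    # the same output from inside a container confirms GPU passthrough works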

4. Run the vLLM container
Pull the image
docker pull docker.1panel.live/vllm/vllm-openai
Tag the mirrored image with the name used in the run command below
docker tag docker.1panel.live/vllm/vllm-openai:latest vllm/vllm-openai:latest
Start the vLLM container
docker run -itd --restart=always --name Qwen-32B \
  -v /data/Qwen:/data \
  -p 18005:8000 \
  --gpus '"device=1,2,3,4"' \
  --ipc=host --shm-size=16g \
  vllm/vllm-openai:latest \
  --dtype bfloat16 \
  --served-model-name Qwen-32B \
  --model "/data/models/Qwen-32B" \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 81920 \
  --api-key token-abc123 \
  --enforce-eager
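Loading a 32B checkpoint across four GPUs can take several minutes. Before calling the API, it helps to follow the container logs and wait until vLLM reports that the server is up (this assumes the container name Qwen-32B from the command above):
docker logs -f Qwen-32B    # wait until the OpenAI-compatible server reports it is listening on port 8000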
Detailed explanation of the Docker command parameters
  • -i (interactive) : Keeps STDIN open so users can interact with the container even when it is not running in the foreground. The container's output log can be viewed with the docker logs or docker attach commands.

  • -t (tty) : Allocates a pseudo-TTY (virtual terminal) to the container, simulating a terminal environment.

  • -d (detach) : Runs the container in the background so it does not occupy the current terminal.

  • --restart=always : Sets the container to restart automatically after the host reboots or the container exits.

  • --name Qwen-32B : Specifies a unique name for the container.

  • -v /data/Qwen:/data : Mounts the host directory /data/Qwen into the container as /data, avoiding data loss when the container is restarted or deleted.
  • -p 18005:8000 : Maps the host's port 18005 to port 8000 in the container.
  • --gpus '"device=1,2,3,4"' : Specifies that the container uses GPU devices 1, 2, 3, and 4 on the host.
  • --ipc=host : Shares the host's IPC (inter-process communication) namespace, allowing the container to communicate with the host's processes.
  • --shm-size=16g : Sets the container's shared memory size to 16 GB, which multi-GPU communication inside the container relies on.
vLLM model startup parameters
  • --dtype bfloat16 : Specifies bfloat16 (Brain Floating Point 16) for model computation.

  • --served-model-name Qwen-32B : Sets the model's service name to "Qwen-32B", which is used to identify the model in API requests.

  • --model "/data/models/Qwen-32B" : Specifies the model path inside the container, /data/models/Qwen-32B.

  • --tensor-parallel-size 4 : Sets the tensor-parallel size to 4, corresponding to using 4 GPUs for model-parallel computation.

  • --gpu-memory-utilization 0.95 : Sets GPU memory usage to 95% and reserves 5% of memory to prevent crashes caused by memory overflow.

  • --max-model-len 81920 : Specifies a maximum context length of 81920 tokens; the total number of input and output tokens the model can process in a single inference does not exceed 81920.

  • --api-key token-abc123 : Sets the API access key to "token-abc123". This key must be provided in the request header when calling the API (see the example requests after this list).

  • --enforce-eager : Forces eager, layer-by-layer PyTorch execution instead of building CUDA graphs, which reduces extra GPU memory usage at some cost in inference speed.
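Once the model has finished loading, the service can be exercised directly through the OpenAI-compatible HTTP API. A minimal check, assuming the port mapping (18005), API key, and served model name from the command above; adjust them if you changed the run command:
curl http://localhost:18005/v1/models \
  -H "Authorization: Bearer token-abc123"    # lists the served models; should include Qwen-32B

curl http://localhost:18005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{"model": "Qwen-32B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 128}'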


5. Open Web UI deployment
Pull the open-webui image
docker pull ghcr.nju.edu.cn/open-webui/open-webui:main
Start Open Web UI
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v /data/open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.nju.edu.cn/open-webui/open-webui:main
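Open Web UI takes a short while to initialize on first start. Two quick ways to confirm it is ready, using the container name and port from the command above:
docker logs -f open-webui        # follow the logs until the web server reports it is listening
curl -I http://localhost:3000    # should return an HTTP 200 response once the UI is up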
Access the Web Interface
Browser access: http://localhost:3000
Administrator Panel -- External Connections -- New Model Connection
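When creating the connection, point Open Web UI at the vLLM container started earlier. Assuming both containers run on the same host, the settings would look roughly like this (host.docker.internal resolves to the host because of the --add-host flag in the run command above; adjust the address if vLLM runs on another machine):
API Base URL : http://host.docker.internal:18005/v1
API Key      : token-abc123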

Leave the Model ID blank; the model list is fetched automatically from the /v1/models endpoint. Open a new chat and select the Qwen-32B model served above.

Start a new conversation