vLLM+Qwen-32B+Open Web UI builds a local private large model

Written by Iris Vance
Updated on: June 29, 2025

Explore how to build and deploy a local, private large language model for high-performance AI inference.

Core content:
1. vLLM: A high-performance large language model inference engine developed by a team at the University of California, Berkeley
2. ModelScope: Alibaba Group's open source Model as a Service (MaaS) platform
3. NVIDIA GPU configuration and vLLM container operation guide


Table of contents

  1. Introduction to vLLM
  2. Introduction to ModelScope
  3. Enable and optimize NVIDIA GPU
  4. Running the vLLM container
  5. Open Web UI deployment
  6. Tokens and context

1. Introduction to vLLM

vLLM is a high-performance, scalable large language model (LLM) inference engine developed by a team at the University of California, Berkeley. It focuses on delivering high-throughput, low-latency, low-cost model serving through innovative memory management and compute optimizations. Its PagedAttention memory management technique significantly improves GPU memory utilization, and it supports distributed inference that makes efficient use of multi-node, multi-GPU resources. Whether for low-latency, high-throughput online services or resource-constrained edge deployments, vLLM delivers excellent performance.

  • Chinese site : https://vllm.hyper.ai/docs/
  • English site : https://docs.vllm.ai/en/latest/index.html

2. Introduction to ModelScope

ModelScope is an open-source Model-as-a-Service (MaaS) platform launched by Alibaba Group. It aims to simplify how models are applied and to give AI developers a flexible, easy-to-use, low-cost, one-stop model service. The platform brings together a wide range of state-of-the-art machine learning models covering natural language processing, computer vision, speech recognition, and other fields, and provides rich API interfaces and tools so that developers can easily integrate and use them.

  • Official website : https://modelscope.cn/models
Install ModelScope
pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple
Create a storage directory
mkdir -p /data/Qwen/models/Qwen-32B
Download the QwQ-32B model
modelscope download --local_dir /data/Qwen/models/Qwen-32B --model Qwen/QwQ-32B
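The download is large (a 32B bfloat16 checkpoint is on the order of 60+ GB), so it is worth confirming that the weights and tokenizer files actually landed in the target directory before moving on. A minimal sanity check, assuming the paths used above:
ls -lh /data/Qwen/models/Qwen-32B    # expect *.safetensors shards plus config.json and tokenizer files
du -sh /data/Qwen/models/Qwen-32B    # total size should roughly match the published model size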

3. Enable and optimize NVIDIA GPU
Update package lists and install the NVIDIA Container Toolkit
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
Configure the NVIDIA container runtime for Docker
sudo nvidia-ctk runtime configure --runtime=docker
Restart the Docker service
sudo systemctl daemon-reload && sudo systemctl restart docker
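Before starting the vLLM container, it is worth verifying that Docker can actually see the GPUs through the NVIDIA runtime. A quick sanity check, assuming the NVIDIA driver is already installed on the host:
nvidia-smi    # the host driver should list all GPUs
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi    # the same output from inside a container confirms GPU passthrough works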

4. Run the vLLM container
Pull the image
docker pull docker.1panel.live/vllm/vllm-openai
Tag the mirrored image with the name used in the run command below
docker tag docker.1panel.live/vllm/vllm-openai:latest vllm/vllm-openai:latest
Start the vLLM container
docker run -itd --restart=always --name Qwen-32B \
  -v /data/Qwen:/data \
  -p 18005:8000 \
  --gpus '"device=1,2,3,4"' \
  --ipc=host --shm-size=16g \
  vllm/vllm-openai:latest \
  --dtype bfloat16 \
  --served-model-name Qwen-32B \
  --model "/data/models/Qwen-32B" \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 81920 \
  --api-key token-abc123 \
  --enforce-eager
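Loading a 32B checkpoint across four GPUs can take several minutes. Before calling the API, it helps to follow the container logs and wait until vLLM reports that the server is up (this assumes the container name Qwen-32B from the command above):
docker logs -f Qwen-32B    # wait until the OpenAI-compatible server reports it is listening on port 8000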
Detailed explanation of the Docker command parameters
  • -i (interactive) : Keeps STDIN open so users can interact with the container even when it is not running in the foreground. The container's output log can be viewed with the docker logs or docker attach commands.

  • -t (tty) : Allocates a pseudo-TTY (virtual terminal) to the container, simulating a terminal environment.

  • -d (detach) : Runs the container in the background so it does not occupy the current terminal.

  • --restart=always : Sets the container to restart automatically after the host reboots or the container exits.

  • --name Qwen-32B : Specifies a unique name for the container.

  • -v /data/Qwen:/data : Mounts the host directory /data/Qwen into the container as /data, avoiding data loss when the container is restarted or deleted.
  • -p 18005:8000 : Maps the host's port 18005 to port 8000 in the container.
  • --gpus '"device=1,2,3,4"' : Specifies that the container uses GPU devices 1, 2, 3, and 4 on the host.
  • --ipc=host : Shares the host's IPC (inter-process communication) namespace, allowing the container to communicate with the host's processes.
  • --shm-size=16g : Sets the container's shared memory size to 16 GB, which multi-GPU communication inside the container relies on.
vLLM model startup parameters
  • --dtype bfloat16 : Specifies bfloat16 (Brain Floating Point 16) for model computation.

  • --served-model-name Qwen-32B : Sets the model's service name to "Qwen-32B", which is used to identify the model in API requests.

  • --model "/data/models/Qwen-32B" : Specifies the model path inside the container, /data/models/Qwen-32B.

  • --tensor-parallel-size 4 : Sets the tensor-parallel size to 4, corresponding to using 4 GPUs for model-parallel computation.

  • --gpu-memory-utilization 0.95 : Sets GPU memory usage to 95% and reserves 5% of memory to prevent crashes caused by memory overflow.

  • --max-model-len 81920 : Specifies a maximum context length of 81920 tokens; the total number of input and output tokens the model can process in a single inference does not exceed 81920.

  • --api-key token-abc123 : Sets the API access key to "token-abc123". This key must be provided in the request header when calling the API (see the example requests after this list).

  • --enforce-eager : Forces eager, layer-by-layer PyTorch execution instead of building CUDA graphs, which reduces extra GPU memory usage at some cost in inference speed.
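Once the model has finished loading, the service can be exercised directly through the OpenAI-compatible HTTP API. A minimal check, assuming the port mapping (18005), API key, and served model name from the command above; adjust them if you changed the run command:
curl http://localhost:18005/v1/models \
  -H "Authorization: Bearer token-abc123"    # lists the served models; should include Qwen-32B

curl http://localhost:18005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{"model": "Qwen-32B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 128}'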


5. Open Web UI deployment
Pull the open-webui image
docker pull ghcr.nju.edu.cn/open-webui/open-webui:main
Start Open Web UI
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v /data/open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.nju.edu.cn/open-webui/open-webui:main
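Open Web UI takes a short while to initialize on first start. Two quick ways to confirm it is ready, using the container name and port from the command above:
docker logs -f open-webui        # follow the logs until the web server reports it is listening
curl -I http://localhost:3000    # should return an HTTP 200 response once the UI is up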
Access the Web Interface
Browser access: http://localhost:3000
Administrator Panel -- External Connections -- New Model Connection
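When creating the connection, point Open Web UI at the vLLM container started earlier. Assuming both containers run on the same host, the settings would look roughly like this (host.docker.internal resolves to the host because of the --add-host flag in the run command above; adjust the address if vLLM runs on another machine):
API Base URL : http://host.docker.internal:18005/v1
API Key      : token-abc123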

Leave the Model ID blank; the model list is fetched automatically from the /v1/models endpoint. Open a new chat and select the Qwen-32B model served above.

Start a new conversation