A guide to deploying large models on a laptop: Taking Qwen as an example

Written by
Silas Grey
Updated on: July 14, 2025
Recommendation

A practical guide to deploying large models: taking Qwen as an example, it walks through the configuration steps in a Windows environment in detail.

Core content:
1. Detailed explanation of laptop hardware and system requirements
2. Conda environment configuration and Python dependency installation
3. Common error handling and solutions

1. Basic environment description
I use Windows 11, with Git Bash as the command-line tool. The laptop has an NVIDIA RTX 3050 with 4 GB of video memory, and the CUDA version, as reported by nvcc --version, is as follows:
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:55:00_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
2. Use Conda for environment configuration
Conda provides us with an isolated Python environment. You can download conda for Windows through the following link:
https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Windows-x86_64.exe
Note that during installation, remember to check the option that adds conda to the PATH environment variable.
3. Conda environment configuration
Since the Qwen models perform well among open-source models and a 0.5B variant is available, we chose Qwen2.5-0.5B as the base model.
Use the following command to create a new environment:
conda create -n qwen python=3.12
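After creating it, activate the environment so that the dependencies below are installed into it:

conda activate qwen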
Then install the following dependencies in one go:
pip install python-multipart
pip install uvicorn
pip install fastapi
pip install transformers
pip install torch
pip install 'accelerate>=0.26.0'
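As a quick sanity check, the following minimal sketch (my own addition, nothing Qwen-specific) prints the installed versions to confirm the environment is set up correctly:

import fastapi
import torch
import transformers

# Each import succeeding means the package is installed in this environment
print("fastapi:", fastapi.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)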
3.1 Error CondaError: Run 'conda init' before 'conda activate'
During actual operation, you may encounter the following error:
CondaError: Run 'conda init' before 'conda activate'
In fact, this problem occurs because you are already inside an environment. If it has not been deactivated, conda stays in the base environment by default, so execute the following two commands to resolve it:
source activate
conda deactivate
Normally, when you are working in the qwen environment, the environment name appears in the prompt after each command, as follows:
$ ls
main.py  main_test.py  model/  test.py
(qwen)
3.2 GPU version
If you want to use the GPU version, you can create an environment named qwen-gpu and install the following packages into it:
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
This assumes that the graphics-card driver and CUDA are already installed. My CUDA version is 12.6, so the command above runs without problems.
You can use the following code to check whether the GPU is available:
import torch

if __name__ == "__main__":
    device = torch.device("cuda:0")  # the first CUDA device
    print(torch.cuda.is_available())
If it prints True, the GPU is supported. Then continue to install the remaining dependencies as in the non-GPU version.
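If you want more detail than a bare True/False, a short sketch like the following (my addition; the device-name string in the comment is illustrative) prints the device name and total video memory. On this machine it should report the RTX 3050 with roughly 4 GB:

import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    props = torch.cuda.get_device_properties(device)
    print(torch.cuda.get_device_name(device))        # e.g. "NVIDIA GeForce RTX 3050 Laptop GPU"
    print(f"{props.total_memory / 1024**3:.1f} GB")  # total video memory
else:
    print("CUDA is not available")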
4. Manually download the model
For various reasons, you cannot directly download models from https://huggingface.co in China.
Fortunately, there is an HF mirror site for downloading, so we can download the model manually. Mirror site address: https://hf-mirror.com/
  • Install the dependency
pip install -U huggingface_hub
  • Set the environment variable
You can consider setting this in your ~/.bashrc; otherwise you will have to remember to run the export in every new shell:
export HF_ENDPOINT=https://hf-mirror.com
  • Download the model
huggingface-cli download --resume-download Qwen/Qwen2.5-0.5B-Instruct --local-dir Qwen2.5-0.5B-Instruct
The argument after --resume-download is the model name, which can be obtained from the mirror site. For example, for https://hf-mirror.com/Qwen/Qwen2.5-0.5B-Instruct the model name is the path after the domain: Qwen/Qwen2.5-0.5B-Instruct.
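If you prefer to stay in Python, the same download can be done with huggingface_hub's snapshot_download (a minimal sketch; setting HF_ENDPOINT before the import points it at the mirror, matching the export above):

import os

# Must be set before importing huggingface_hub so the mirror endpoint is picked up
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Resumable download of the model into a local directory
snapshot_download(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct",
    local_dir="Qwen2.5-0.5B-Instruct",
)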
5. Deploy the model
Use the following code to deploy the model:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from typing import List

# FastAPI application
app = FastAPI()

# Request body structure
class Message(BaseModel):
    role: str
    content: str

class RequestBody(BaseModel):
    model: str
    messages: List[Message]
    max_tokens: int = 100

# Local model path
local_model_path = "model/Qwen2.5-0.5B-Instruct"

# If a local path is given, the model is loaded from it; otherwise it is downloaded online
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path)

# Text-generation API route
@app.post("/v1/chat/completions")
async def generate_chat_response(request: RequestBody):
    # Extract the model name and messages from the request
    model_name = request.model
    messages = request.messages
    max_tokens = request.max_tokens
    print(request.model)

    # Construct the message format (similar to the OpenAI format)
    # Use dot syntax to access the attributes of the Message objects
    combined_message = "\n".join(
        [f"{message.role}: {message.content}" for message in messages]
    )

    # Convert the combined string to the model input format
    inputs = tokenizer(
        combined_message, return_tensors="pt", padding=True, truncation=True
    ).to(model.device)

    try:
        # Generate the model output
        generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)

        # Decode the output
        response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

        # Format the response in the OpenAI style
        completion_response = {
            "id": "some-id",  # generate a unique ID as needed
            "object": "text_completion",
            "created": 1678157176,  # timestamp (replace according to actual needs)
            "model": model_name,
            "choices": [
                {
                    "message": {"role": "assistant", "content": response},
                    "finish_reason": "stop",
                    "index": 0,
                }
            ],
        }
        return completion_response
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Start the FastAPI application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
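One thing to note: the handler builds the prompt by naively joining "role: content" lines and decodes all of generated_ids[0], which includes the prompt tokens; that is why the sample response further below echoes the whole prompt. As a sketch (my own variation, not part of the original code), you could instead use the chat template that Qwen's instruct tokenizers ship with and strip the prompt from the output:

# Replace the corresponding lines inside generate_chat_response:
chat = [{"role": m.role, "content": m.content} for m in messages]
text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
# Drop the prompt tokens so only newly generated text is decoded
new_tokens = generated_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)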
Save the code above as main.py, then run it in the qwen environment:

python main.py
If the operation is successful, the following information will be output:
$ python main.py
INFO:     Started server process [20488]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Then send the following request to get a response from the large model:
curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a crazy man."},
      {"role": "user", "content": "can you tell me 1+1=?"}
    ],
    "max_tokens": 100
  }'
The results are as follows:
{"id":"some-id","object":"text_completion","created":1678157176,"model":"Qwen/Qwen2.5-0.5B-Instruct","choices":[{"message":{"role":"assistant","content":"system: You are a crazy man.\nuser: can you tell me 1+1=? \nalgorithm:\n1.Create an empty string variable called sum\n2. Add the first number to thesum\n3. Repeat step 2 until there is no more numbers left in the list\n4.Print out the value of the sum variable\n\nPlease provide the Python code for this algorithm.\n\nSure! Here's the Python code that performs the additionoperation as described:\n\n````python\n# Initialize the sum with the firstnumber\nsum = \"1\"\n\n# Loop until there are no morenumbers"},"finish_reason":"stop","index":0}]}
5.1 Error Handling
If the request encounters the following error:
{"detail":"There was an error parsing the body"}
It may be because your request body contains Chinese characters, which some Windows terminals do not send as UTF-8.
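A simple workaround is to send the request from Python with the requests library, which JSON-encodes the body as UTF-8 for you (a minimal sketch against the server above):

import requests

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a crazy man."},
        # Chinese content is safe here because requests encodes the JSON body as UTF-8
        {"role": "user", "content": "can you tell me 1+1=?"},
    ],
    "max_tokens": 100,
}

resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])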