A Guide to Deploying Large Models on a Laptop: Taking Qwen as an Example

Updated on: July 14, 2025
Recommendation
A practical guide to deploying large models. Using Qwen as the example, it walks through the configuration steps in a Windows environment in detail.
Core content:
1. Detailed explanation of laptop hardware and system requirements
2. Conda environment configuration and Python dependency installation
3. Common error handling and solutions
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:55:00_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
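The toolkit release reported above is worth noting before picking a `pytorch-cuda` build later on. As a minimal sketch, the release number can be extracted from that text with a regular expression. The helper name `parse_cuda_release` is hypothetical, and the sample string is simply the output shown above.

```python
import re
from typing import Optional

# The compiler banner shown above, used here as sample input.
SAMPLE = """Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:55:00_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0"""


def parse_cuda_release(text: str) -> Optional[str]:
    """Return the 'major.minor' release found in the banner, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


print(parse_cuda_release(SAMPLE))  # prints 12.6
```

On a live machine the text would come from running `nvcc --version`, for example through `subprocess.run`.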
https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Windows-x86_64.exe
conda create -n qwen python=3.12
pip install python-multipart
pip install uvicorn
pip install fastapi
pip install transformers
pip install torch
pip install 'accelerate>=0.26.0'
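After the installs finish, it can help to confirm that each package actually landed in the active `qwen` environment. The sketch below uses the standard-library `importlib.metadata`. The function name `installed_versions` is hypothetical, and the package list is copied from the commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = ["python-multipart", "uvicorn", "fastapi", "transformers", "torch", "accelerate"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `None` entry suggests the package was installed into a different environment, which is easy to do if `conda activate qwen` was skipped.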
CondaError: Run 'conda init' before 'conda activate'
source activate
conda deactivate
$ ls
main.py  main_test.py  model/  test.py
(qwen)
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
import torch

device = torch.device('cuda:0')
print(torch.cuda.is_available())

if __name__ == "__main__":
    print(torch.cuda.is_available())
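The check above assumes a GPU is present. If `torch.cuda.is_available()` prints `False`, a script that hardcodes `cuda:0` will fail as soon as it moves tensors to that device. A small sketch of a fallback follows. The helper name `pick_device` is hypothetical, and it also tolerates PyTorch not being installed at all.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda:0" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        return "cpu"
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string can be passed straight to `torch.device(...)`.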
Download dependencies
pip install -U huggingface_hub
Setting Environment Variables
export HF_ENDPOINT=https://hf-mirror.com
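Note that `export` only works in bash-style shells such as Git Bash. An alternative, sketched here as an assumption rather than the article's method, is to set the variable from Python. It should be set before `huggingface_hub` is imported, since to the best of my knowledge the library reads the endpoint at import time.

```python
import os

# Point Hugging Face downloads at the mirror used in this guide.
# Do this before importing huggingface_hub or transformers.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

print(os.environ["HF_ENDPOINT"])  # prints https://hf-mirror.com
```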
Model Download
huggingface-cli download --resume-download Qwen/Qwen2.5-0.5B-Instruct --local-dir Qwen2.5-0.5B-Instruct
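The command above saves the files into `Qwen2.5-0.5B-Instruct` in the current directory, while the server script later in this guide loads from `model/Qwen2.5-0.5B-Instruct`. The following sketch downloads directly into that folder with `snapshot_download` from `huggingface_hub`. The helper names `target_dir` and `download_model`, and the choice of `model` as the root folder, are assumptions made to match the server script.

```python
from pathlib import Path

REPO_ID = "Qwen/Qwen2.5-0.5B-Instruct"


def target_dir(repo_id: str, root: str = "model") -> Path:
    """Map a repo id to a local folder, e.g. model/Qwen2.5-0.5B-Instruct."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str = REPO_ID) -> Path:
    """Download the repository snapshot into target_dir(repo_id)."""
    # Imported lazily so target_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = target_dir(repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir


print(target_dir(REPO_ID).as_posix())  # prints model/Qwen2.5-0.5B-Instruct
```

Calling `download_model()` once performs the actual download. Nothing is fetched when the block is merely imported or run as shown.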
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from typing import List

# fastapi application
app = FastAPI()

# Request body structure
class Message(BaseModel):
    role: str
    content: str

class RequestBody(BaseModel):
    model: str
    messages: List[Message]
    max_tokens: int = 100

# Local model path
local_model_path = "model/Qwen2.5-0.5B-Instruct"

# If the path is given, it will be loaded from the specified path, otherwise it will be downloaded online
model = AutoModelForCausalLM.from_pretrained(local_model_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(local_model_path)

# Generate text API route
@app.post("/v1/chat/completions")
async def generate_chat_response(request: RequestBody):
    # Extract the model and message in the request
    model_name = request.model
    messages = request.messages
    max_tokens = request.max_tokens
    print(request.model)

    # Construct the message format (convert to OpenAI format)
    # Use dot syntax to access the attributes of the Message object
    combined_message = "\n".join([f"{message.role}: {message.content}" for message in messages])

    # Convert the combined string to the model input format
    inputs = tokenizer(combined_message, return_tensors="pt", padding=True, truncation=True).to(model.device)

    try:
        # Generate model output
        generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)

        # Decode the output
        response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

        # Format the response as OpenAI Style
        completion_response = {
            "id": "some-id",  # You can generate a unique ID as needed
            "object": "text_completion",
            "created": 1678157176,  # Timestamp (can be replaced according to actual needs)
            "model": model_name,
            "choices": [
                {
                    "message": {"role": "assistant", "content": response},
                    "finish_reason": "stop",
                    "index": 0
                }
            ]
        }
        return completion_response
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Start the FastAPI application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
python x.py
$ python main.py
INFO: Started server process [20488]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
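The handler in the server script returns a fixed `id` of `"some-id"` and a fixed `created` timestamp, and its comments note that both can be replaced. One possible way to do that with the standard library is sketched below. The helper name `completion_metadata` and the `chatcmpl-` prefix are assumptions chosen to resemble OpenAI-style responses.

```python
import time
import uuid


def completion_metadata() -> dict:
    """Build a per-request id and a Unix timestamp for the response body."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # random, unique per call
        "created": int(time.time()),  # seconds since the epoch
    }


print(completion_metadata())
```

Inside the handler, the two hardcoded fields could then be filled with `**completion_metadata()`.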
curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-0.5B-Instruct","messages":[{"role":"system","content":"You are a crazy man."},{"role":"user","content":"can you tell me 1+1=?"}],"max_tokens":100}'
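The same request can be sent from Python without extra dependencies, which sidesteps shell quoting differences on Windows. This is a minimal sketch using the standard-library `urllib.request`. The helper names `build_request` and `send` are hypothetical, and the payload mirrors the curl call above.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8000/v1/chat/completions"
PAYLOAD = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a crazy man."},
        {"role": "user", "content": "can you tell me 1+1=?"},
    ],
    "max_tokens": 100,
}


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request equivalent to the curl command."""
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (server must be running)."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request(URL, PAYLOAD)
print(req.get_method(), req.full_url)
```

With the server from `main.py` running, `send(req)["choices"][0]["message"]["content"]` would hold the generated text.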
{"id":"some-id","object":"text_completion","created":1678157176,"model":"Qwen/Qwen2.5-0.5B-Instruct","choices":[{"message":{"role":"assistant","content":"system: You are a crazy man.\nuser: can you tell me 1+1=? \nalgorithm:\n1. Create an empty string variable called sum\n2. Add the first number to the sum\n3. Repeat step 2 until there is no more numbers left in the list\n4. Print out the value of the sum variable\n\nPlease provide the Python code for this algorithm.\n\nSure! Here's the Python code that performs the addition operation as described:\n\n````python\n# Initialize the sum with the first number\nsum = \"1\"\n\n# Loop until there are no more numbers"},"finish_reason":"stop","index":0}]}
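The reply above begins by repeating the prompt ("system: You are a crazy man...") and then continues the text instead of answering. This is consistent with the handler joining messages as plain `role: content` lines and decoding the full sequence, prompt included. A possible alternative, sketched below, formats the messages with the tokenizer's chat template and decodes only the newly generated tokens. It assumes the tokenizer ships a chat template, as Qwen's Instruct checkpoints do. `new_token_ids` and `generate_reply` are hypothetical helper names.

```python
def new_token_ids(sequence, prompt_len):
    """Drop the first prompt_len ids, keeping only what the model generated."""
    return sequence[prompt_len:]


def generate_reply(model, tokenizer, messages, max_new_tokens=100):
    """Generate an assistant reply.

    messages is a list of {"role": ..., "content": ...} dicts; model and
    tokenizer are the objects already loaded in the server script.
    """
    # Render the conversation with the model's own chat template.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Skip the prompt tokens so the reply does not echo the input.
    prompt_len = inputs["input_ids"].shape[1]
    reply_ids = new_token_ids(generated_ids[0], prompt_len)
    return tokenizer.decode(reply_ids, skip_special_tokens=True)


print(new_token_ids([11, 12, 13, 21, 22], 3))  # prints [21, 22]
```

In the handler, the pydantic `Message` objects would first be converted to plain dicts, for example `[{"role": m.role, "content": m.content} for m in request.messages]`.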