The QwQ-32B reasoning model supports Function Call: the era of deep-thinking agents has arrived!

Written by
Silas Grey
Updated on: July 13, 2025
Recommendation

The era of deep-thinking agents has arrived, and the QwQ-32B model ushers in a new stage of reasoning!

Core content:
1. Performance comparison between the QwQ-32B model and DeepSeek-R1
2. Building a chat API that supports Function Call
3. Generating function parameters and calling external functions

Yang Fangxian, Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)
00

Preface



Recently, Qwen released QwQ-32B, a reasoning model that performs comparably to DeepSeek-R1 on many benchmarks. QwQ integrates tool calling into the reasoning model, enabling it to think critically while using tools and to adjust its reasoning based on feedback. This makes QwQ well suited for agentic systems. This article describes how to serve QwQ-32B with vLLM or SGLang, build an OpenAI-compatible chat API, and combine it with external functions to extend the model's capabilities.


tools is an optional parameter of OpenAI's Chat Completions API used to provide function specifications. Its purpose is to let the model generate function arguments that conform to those specifications. Note that the API does not actually execute any function calls; developers must use the model output to call the functions themselves, as sketched below.
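For example, once a response comes back with a generated tool call, the developer inspects it, parses the JSON arguments, and runs the corresponding local function. A minimal sketch, assuming a chat_response object returned by the Chat Completions API and a locally defined get_current_weather function (both introduced later in this article):

import json

# The model only *proposes* the call; we execute it ourselves.
tool_call = chat_response.choices[0].message.tool_calls[0]
if tool_call.function.name == "get_current_weather":
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = get_current_weather(**args)             # run the real function locally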


Both vLLM and SGLang support the tools parameter of the OpenAI API. Given the tools parameter and the function-calling specifications, QwQ can decide when to call a function, which function to call, and how to call it.


Note: The test cases in this article are adapted from the OpenAI Cookbook: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models


This article mainly contains the following two parts:

  • Model deployment: use vLLM or SGLang to deploy QwQ behind a chat API that supports Function Call by setting the appropriate parameters.

  • Generating function parameters: specify a set of functions and use the API to generate arguments that conform to them.


01

Model deployment



Model file download

modelscope download --model=Qwen/QwQ-32B --local_dir ./QwQ-32B


Environment installation

pip install vllm
pip install "sglang[all]>=0.4.3.post2"


vLLM deployment command

vllm serve /ModelPath/QwQ-32B \
  --port 8000 \
  --reasoning-parser deepseek_r1 \
  --max_model_len 4096 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes


SGLang deployment command

python -m sglang.launch_server --model-path /ModelPath/QwQ-32B --port 3001 --host 0.0.0.0 --tool-call-parser qwen25
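Once either server is running, it is worth a quick sanity check that the OpenAI-compatible endpoint is reachable before wiring up tools. A minimal sketch, assuming the vLLM server started above on port 8000 (for SGLang, change the port to 3001):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# Should print the served model name, e.g. the local QwQ-32B path
print([m.id for m in client.models.list().data])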

Model call

Use OpenAI's API format to call the locally deployed QwQ model.

Single-turn conversation

from openai import OpenAI

# Set OpenAI's API key and API base URL to point at vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use streaming output (stream=True)
chat_response = client.chat.completions.create(
    model="path/to/QwQ-32B",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,  # Enable streaming response
)

# Process the streaming output
contents = []
for e in chat_response:
    # print(e.choices[0].delta.content, end="")
    contents.append(e.choices[0].delta.content or "")  # content can be None for reasoning chunks
print("".join(contents))


Multi-turn conversation

from openai import OpenAI
import os

# Initialize the OpenAI client
client = OpenAI(api_key="empty", base_url="http://localhost:8000/v1")

reasoning_content = ""   # The full thinking process
answer_content = ""      # The full reply
is_answering = False     # Whether thinking has finished and the reply has started

messages = []
conversation_idx = 1
while True:
    print("=" * 20 + f" Round {conversation_idx} of the conversation " + "=" * 20)
    conversation_idx += 1
    user_msg = {"role": "user", "content": input("Please enter your message: ")}
    messages.append(user_msg)
    # Reset the per-round buffers
    reasoning_content = ""
    answer_content = ""
    is_answering = False
    # Create a chat completion request
    completion = client.chat.completions.create(
        model="path/to/QwQ-32B",  # QwQ-32B is used as an example; change the model name as needed
        messages=messages,
        stream=True,
    )
    print("\n" + "=" * 20 + " Thinking process " + "=" * 20 + "\n")
    for chunk in completion:
        # If chunk.choices is empty, print the usage statistics
        if not chunk.choices:
            print("\nUsage:")
            print(chunk.usage)
        else:
            delta = chunk.choices[0].delta
            # Print the thinking process
            if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
                print(delta.reasoning_content, end="", flush=True)
                reasoning_content += delta.reasoning_content
            else:
                # The reply starts once the first non-empty content chunk arrives
                if delta.content != "" and is_answering is False:
                    print("\n" + "=" * 20 + " Complete reply " + "=" * 20 + "\n")
                    is_answering = True
                # Print the reply as it streams
                print(delta.content, end="", flush=True)
                answer_content += delta.content
    messages.append({"role": "assistant", "content": answer_content})
    print("\n")
    # print("=" * 20 + " Complete thinking process " + "=" * 20 + "\n")
    # print(reasoning_content)
    # print("=" * 20 + " Complete reply " + "=" * 20 + "\n")
    # print(answer_content)


02

Use the tools



First, define the model calling function

from openai import OpenAI

# Set OpenAI's API key and API base URL to point at vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
MODEL = "path/to/QwQ-32B"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

def chat_completion_request(messages, tools=None, tool_choice=None, model=MODEL):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        return response
    except Exception as e:
        print("Unable to generate ChatCompletion response")
        print(f"Exception: {e}")
        raise


Next, we define a utility for pretty-printing the conversation, which makes it easier to track and inspect the conversation state.

from termcolor import colored

def pretty_print_conversation(messages):
    role_to_color = {
        "system": "red",
        "user": "green",
        "assistant": "blue",
        "function": "magenta",
    }
    for message in messages:
        if message["role"] == "system":
            print(colored(f"system: {message['content']}\n", role_to_color[message["role"]]))
        elif message["role"] == "user":
            print(colored(f"user: {message['content']}\n", role_to_color[message["role"]]))
        elif message["role"] == "assistant" and message.get("function_call"):
            print(colored(f"assistant: {message['function_call']}\n", role_to_color[message["role"]]))
        elif message["role"] == "assistant" and not message.get("function_call"):
            print(colored(f"assistant: {message['content']}\n", role_to_color[message["role"]]))
        elif message["role"] == "function":
            print(colored(f"function ({message['name']}): {message['content']}\n", role_to_color[message["role"]]))


03

Tool Definition



Here we assume a weather API and define some function specifications for interacting with it. These specifications are passed to the Chat API so that the model can generate function arguments that conform to them.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the user's location.",
                    },
                },
                "required": ["location", "format"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_n_day_weather_forecast",
            "description": "Get an N-day weather forecast",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the user's location.",
                    },
                    "num_days": {
                        "type": "integer",
                        "description": "The number of days to forecast",
                    },
                },
                "required": ["location", "format", "num_days"],
            },
        },
    },
]
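The Chat API only generates arguments that match these specifications; the functions themselves have to exist in our own code. The stubs below are a minimal sketch with made-up return values (a real implementation would query an actual weather service); they are only here so that the generated arguments can be executed in the later examples:

def get_current_weather(location, format):
    # Placeholder implementation; a real version would call a weather service.
    return {"location": location, "temperature": 20, "unit": format, "condition": "cloudy"}

def get_n_day_weather_forecast(location, format, num_days):
    # Placeholder implementation returning one dummy entry per day.
    return [{"day": i + 1, "temperature": 18, "unit": format} for i in range(num_days)]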


If we ask the model about the current weather, it will ask a follow-up question to obtain the missing parameter information.

messages = []
messages.append({"role": "user", "content": "hi, can you tell me what's the weather like today"})
chat_response = chat_completion_request(messages, tools=tools)
print(chat_response)
assistant_message = chat_response.choices[0].message
messages.append(assistant_message)
assistant_message


Once we provide the missing parameter information through dialogue, the model will generate the appropriate function parameters for us.

messages.append({"role": "user", "content": "I'm in Glasgow, Scotland."})chat_response = chat_completion_request(messages, tools=tools)assistant_message = chat_response.choices[0].messagemessages.append(assistant_message)assistant_message


With different prompts, we can get the model to ask different questions in order to collect the function parameter information it needs.

messages = []
messages.append({"role": "user", "content": "can you tell me, what is the weather going to be like in Glasgow, Scotland in next x days"})
chat_response = chat_completion_request(messages, tools=tools)
assistant_message = chat_response.choices[0].message
messages.append(assistant_message)
assistant_message

messages.append({"role": "user", "content": "5 days"})
chat_response = chat_completion_request(messages, tools=tools)
chat_response.choices[0]


Parallel function calls

QwQ supports calling multiple functions in parallel within a single query.

messages = []
messages.append({"role": "user", "content": "what is the weather going to be like in San Francisco and Glasgow over the next 4 days"})
chat_response = chat_completion_request(messages, tools=tools, model=MODEL)

assistant_message = chat_response.choices[0].message.tool_calls
assistant_message
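Each entry in the returned tool_calls list can then be dispatched to the matching local function, for example with a simple name-to-function map. A sketch, again assuming the placeholder weather stubs defined earlier:

import json

available_functions = {
    "get_current_weather": get_current_weather,
    "get_n_day_weather_forecast": get_n_day_weather_forecast,
}

# Execute every generated call and print its result.
for tool_call in assistant_message:
    fn = available_functions[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    print(tool_call.function.name, fn(**args))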