tools is an optional parameter in OpenAI's Chat Completions API that is used to pass function specifications. Its purpose is to let the model generate function arguments that conform to those specifications. Note that the API does not actually execute any function: developers must take the model's output and perform the function call themselves.
Both vLLM and SGLang support the tools parameter of the OpenAI API. Given the tools parameter and the function-calling specifications, QwQ can decide when to call which function and with what arguments.
Note: The test cases in this article refer to OpenAI cookbook: https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models
This article mainly contains the following two parts:
Model deployment: use vLLM or SGLang with the appropriate parameters to deploy QwQ as a chat API endpoint that supports function calling.
Generate function parameters: Specify a set of functions and use the API to generate function parameters.
Model file download
modelscope download --model=Qwen/QwQ-32B --local_dir ./QwQ-32B
Environment installation
pip install vllm
pip install "sglang[all]>=0.4.3.post2"
vLLM deployment commands
vllm serve /ModelPath/QwQ-32B \
  --port 8000 \
  --reasoning-parser deepseek_r1 \
  --max_model_len 4096 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
sglang deployment command
python -m sglang.launch_server --model-path /ModelPath/QwQ-32B --port 3001 --host 0.0.0.0 --tool-call-parser qwen25
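Either command exposes an OpenAI-compatible HTTP endpoint. As a quick sanity check (assuming the ports used in the commands above), you can list the served models before moving on:
curl http://localhost:8000/v1/models    # vLLM
curl http://localhost:3001/v1/models    # SGLang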
Model call
Use OpenAI's API format to call the locally deployed QwQ model
Single-turn conversation
from openai import OpenAI

# Set OpenAI's API key and API base URL to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use streaming output (stream=True)
chat_response = client.chat.completions.create(
    model="path/to/QwQ-32B",
    messages=[{"role": "user", "content": "你好"}],
    stream=True,  # Enable streaming response
)

# Process the streaming output. With the reasoning parser enabled,
# thinking chunks arrive in delta.reasoning_content and delta.content may be None.
contents = []
for e in chat_response:
    # print(e.choices[0].delta.content, end="")
    if e.choices[0].delta.content:
        contents.append(e.choices[0].delta.content)
print("".join(contents))
Multi-turn conversation
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

messages = []
conversation_idx = 1

while True:
    print("=" * 20 + f" Round {conversation_idx} " + "=" * 20)
    conversation_idx += 1

    reasoning_content = ""  # Accumulates this turn's thinking process
    answer_content = ""     # Accumulates this turn's final reply
    is_answering = False    # Tracks whether the thinking phase has ended and the reply has started

    user_msg = {"role": "user", "content": input("Please enter your message: ")}
    messages.append(user_msg)

    # Create a chat completion request
    completion = client.chat.completions.create(
        model="path/to/QwQ-32B",  # QwQ-32B is used as an example; change the model name as needed
        messages=messages,
        stream=True,
    )

    print("\n" + "=" * 20 + " Thinking process " + "=" * 20 + "\n")
    for chunk in completion:
        # If chunk.choices is empty, print the usage statistics
        if not chunk.choices:
            print("\nUsage:")
            print(chunk.usage)
        else:
            delta = chunk.choices[0].delta
            # Print the thinking process
            if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
                print(delta.reasoning_content, end="", flush=True)
                reasoning_content += delta.reasoning_content
            else:
                # Start of the reply
                if delta.content and not is_answering:
                    print("\n" + "=" * 20 + " Complete reply " + "=" * 20 + "\n")
                    is_answering = True
                # Print the reply as it streams in
                if delta.content:
                    print(delta.content, end="", flush=True)
                    answer_content += delta.content

    messages.append({"role": "assistant", "content": answer_content})
    print("\n")
    # print("=" * 20 + "Complete thinking process" + "=" * 20 + "\n")
    # print(reasoning_content)
    # print("=" * 20 + "Complete reply" + "=" * 20 + "\n")
    # print(answer_content)
First, define the model calling function
from openai import OpenAI

# Set OpenAI's API key and API base URL to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
MODEL = "path/to/QwQ-32B"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

def chat_completion_request(messages, tools=None, tool_choice=None, model=MODEL):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice or "auto",  # default to "auto" when no explicit choice is given
        )
        return response
    except Exception as e:
        print("Unable to generate ChatCompletion response")
        print(f"Exception: {e}")
        raise
We then define a utility for pretty-printing the conversation messages, which makes it easier to track the conversation state.
from termcolor import colored

def pretty_print_conversation(messages):
    role_to_color = {
        "system": "red",
        "user": "green",
        "assistant": "blue",
        "function": "magenta",
    }
    for message in messages:
        if message["role"] == "system":
            print(colored(f"system: {message['content']}\n", role_to_color[message["role"]]))
        elif message["role"] == "user":
            print(colored(f"user: {message['content']}\n", role_to_color[message["role"]]))
        elif message["role"] == "assistant" and message.get("function_call"):
            print(colored(f"assistant: {message['function_call']}\n", role_to_color[message["role"]]))
        elif message["role"] == "assistant" and not message.get("function_call"):
            print(colored(f"assistant: {message['content']}\n", role_to_color[message["role"]]))
        elif message["role"] == "function":
            print(colored(f"function ({message['name']}): {message['content']}\n", role_to_color[message["role"]]))
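Note that pretty_print_conversation relies on the termcolor package for colored output; if it is not already in your environment, install it first:
pip install termcolor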
Here we assume a weather API and set up some function specifications to interact with it. These function specifications are passed to the Chat API so that the model can generate function parameters that meet the specifications.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the user's location.",
                    },
                },
                "required": ["location", "format"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_n_day_weather_forecast",
            "description": "Get an N-day weather forecast",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the user's location.",
                    },
                    "num_days": {
                        "type": "integer",
                        "description": "The number of days to forecast",
                    },
                },
                "required": ["location", "format", "num_days"],
            },
        },
    },
]
If we ask the model about the current weather without specifying a location, it will ask a follow-up question to collect the missing parameter information.
messages = []
messages.append({"role": "user", "content": "hi, can you tell me what's the weather like today"})
chat_response = chat_completion_request(messages, tools=tools)
print(chat_response)
assistant_message = chat_response.choices[0].message
messages.append(assistant_message)
assistant_message
Once we provide the missing parameter information through dialogue, the model will generate the appropriate function parameters for us.
messages.append({"role": "user", "content": "I'm in Glasgow, Scotland."})chat_response = chat_completion_request(messages, tools=tools)assistant_message = chat_response.choices[0].messagemessages.append(assistant_message)assistant_message
By phrasing the prompt differently, we can get the model to ask different follow-up questions to collect the function arguments it needs.
messages = []
messages.append({"role": "user", "content": "can you tell me, what is the weather going to be like in Glasgow, Scotland in next x days"})
chat_response = chat_completion_request(messages, tools=tools)
assistant_message = chat_response.choices[0].message
messages.append(assistant_message)
assistant_message
messages.append({"role": "user", "content": "5 days"})chat_response = chat_completion_request(messages, tools=tools)chat_response.choices[0]
Parallel function calls
QwQ supports calling multiple functions in parallel within a single query.
messages = []
messages.append({"role": "user", "content": "what is the weather going to be like in San Francisco and Glasgow over the next 4 days"})
chat_response = chat_completion_request(
messages, tools=tools, model=MODEL
)
tool_calls = chat_response.choices[0].message.tool_calls
tool_calls
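Each entry in tool_calls can then be executed separately (here, one get_n_day_weather_forecast call per city) and its result returned as an individual tool message. A minimal sketch, again using a hypothetical local stub rather than a real forecast service:
import json

# Keep the assistant turn, including its tool calls, in the conversation history.
messages.append(chat_response.choices[0].message)

# Hypothetical local stub for the forecast function declared in `tools`.
def get_n_day_weather_forecast(location, format, num_days):
    return f"{num_days}-day forecast for {location}: mild, reported in {format}."

for tool_call in tool_calls:
    args = json.loads(tool_call.function.arguments)
    result = get_n_day_weather_forecast(**args)
    # One tool message per call, matched to the request by tool_call_id.
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": result,
    })

# Ask the model to summarize the two forecasts in natural language.
chat_response = chat_completion_request(messages, tools=tools)
print(chat_response.choices[0].message.content)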