Ollama releases update to support streaming responses with tool calls

The Ollama v0.8 update brings streaming responses combined with tool calling.
Key points:
1. Streaming responses combined with real-time tool calling improve the fluidity of AI applications
2. A new smart incremental parser improves the accuracy and compatibility of tool calling
3. Developer-friendly integration, with support for multiple languages and models, simplifies implementation
Real-time interaction and instant responses are central to the AI application experience, but blocking tool calls often interrupt the flow of content, forcing users to wait while the model interacts with external tools. Ollama's recently released v0.8 update adds streaming responses with tool calling, so chat applications built by developers can call tools and display the results in real time, just like ordinary streamed text.
This update lets chat applications call external tools in real time while the model generates content, and smoothly present the whole process to users, including the model's thinking, its tool call instructions, and the final text reply. The feature is fully supported in Ollama's Python and JavaScript libraries as well as the REST API (cURL).
The key highlights of this update include:
Instant tool calls and content streaming: Applications no longer need to wait for the model to finish responding before handling tool calls. Model-generated content and tool call instructions are streamed together, chunk by chunk.
New smart incremental parser: Ollama built a new parser that understands the structure of tool calls rather than just looking for JSON (see the illustrative sketch after this list). This allows Ollama to provide:
Real-time separation: Tool call tokens are accurately detected, suppressed, and parsed while user-visible content keeps streaming.
Broad model compatibility: It works whether or not the model was trained with tool-specific tokens, can handle partial prefixes of model output, and can fall back to JSON parsing when necessary.
Improved accuracy: Prefix matching and state management make tool calls significantly more reliable, avoiding the duplicated or incorrectly parsed calls that could occur previously.
Wide model support: Including Qwen 3, Devstral, the Qwen 2.5 series, Llama 3.1, Llama 4, and many other models that support tool calling.
Developer-friendly integration: Clear cURL, Python, and JavaScript examples are provided to help you get started quickly.
Model Context Protocol (MCP) enhancements: Developers using MCP also benefit from streaming chat content and tool calls; the official recommendation is to use a larger context window (such as 32k) to further improve tool-calling performance and result quality.
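To make the parser idea concrete, here is a minimal Python sketch. It is not Ollama's actual implementation: the <tool_call>{...}</tool_call> tag format is a hypothetical example, and the real parser handles many model-specific formats and falls back to plain JSON parsing. The sketch only shows how prefix matching and a little state can separate user-visible text from tool-call payloads while tokens are still streaming:
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"

def parse_stream(chunks):
    """Yield ('content', text) or ('tool_call', dict) events from streamed chunks."""
    buf = ""
    in_call = False
    for chunk in chunks:
        buf += chunk
        while buf:
            if in_call:
                end = buf.find(CLOSE)
                if end == -1:
                    break  # payload not complete yet; keep buffering
                yield ("tool_call", json.loads(buf[:end]))
                buf, in_call = buf[end + len(CLOSE):], False
            else:
                start = buf.find(OPEN)
                if start != -1:
                    if start:
                        yield ("content", buf[:start])  # text before the tag is user content
                    buf, in_call = buf[start + len(OPEN):], True
                elif any(buf.endswith(OPEN[:i]) for i in range(1, len(OPEN))):
                    break  # buffer ends with a partial "<tool_call>" prefix; wait for more tokens
                else:
                    yield ("content", buf)
                    buf = ""
    if buf and not in_call:
        yield ("content", buf)  # flush any trailing text

# Tokens arrive in arbitrary pieces; the tag may be split across chunks.
tokens = ["I'll check the weather. <tool_",
          'call>{"name": "get_current_weather", "arguments": {"location": "Toronto"}}</tool_',
          "call> Done."]
for kind, payload in parse_stream(tokens):
    print(kind, payload)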
At the technical implementation level, developers can enable this feature in the following ways:
REST API (cURL): Set "stream": true in the /api/chat request and define the available tools in the tools array.
Python: Call ollama.chat() with stream=True and pass the tool definitions (which can be plain Python functions) to the tools parameter.
JavaScript: Call ollama.chat() with stream: true and pass the tool schema objects to the tools parameter.
Here is an example in Python (calling a custom math function):
from ollama import chat, ChatResponse

# Define the python function to expose as a tool
def add_two_numbers(a: int, b: int) -> int:
    """
    Add two numbers

    Args:
        a (int): The first number as an int
        b (int): The second number as an int

    Returns:
        int: The sum of the two numbers
    """
    return a + b

messages = [{'role': 'user', 'content': 'what is three plus one?'}]

response: ChatResponse = chat(
    model='qwen3',
    messages=messages,
    tools=[add_two_numbers],  # the Python SDK supports passing tools as functions
    stream=True,
)

for chunk in response:
    # Print streamed model content as it arrives
    print(chunk.message.content, end='', flush=True)
    # Print any tool call emitted in this chunk
    if chunk.message.tool_calls:
        print(chunk.message.tool_calls)
Expected output (an example; the exact result depends on the model's behavior and on whether the question matches the provided tool):
<think>
Okay, the user is asking ...
</think>
[ToolCall(function=Function(name='add_two_numbers', arguments={'a': 3, 'b': 1}))]
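In a real application, the loop above is typically extended so the collected tool calls are actually executed. Below is a minimal sketch of that variant of the loop; the available_functions mapping is a hypothetical helper, not part of the Ollama SDK. The result can then be sent back to the model as a role "tool" message to obtain a final natural-language answer (exact tool-message fields can vary between Ollama versions).
# Sketch: collect tool calls while streaming, then dispatch them to local functions.
available_functions = {'add_two_numbers': add_two_numbers}  # hypothetical registry

collected_calls = []
for chunk in response:
    print(chunk.message.content, end='', flush=True)   # stream text as before
    if chunk.message.tool_calls:
        collected_calls.extend(chunk.message.tool_calls)

for call in collected_calls:
    fn = available_functions.get(call.function.name)
    if fn is None:
        continue  # the model requested a tool that was not provided
    result = fn(**call.function.arguments)
    print(f'\n{call.function.name} -> {result}')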
cURL example (query weather):
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. celsius or fahrenheit",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'
Streaming output:
...
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:57.641643Z",
  "message": {
    "role": "assistant",
    "content": "celsius"
  },
  "done": false
}
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:57.673559Z",
  "message": {
    "role": "assistant",
    "content": "</think>"
  },
  "done": false
}
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": {
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false
}
...
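Each line of the stream is a standalone JSON object, so a client can consume it incrementally. Below is a minimal Python sketch using the requests library that issues the same weather request and prints content and tool calls as they arrive; the payload simply mirrors the cURL example above:
import json
import requests

# Same request body as the cURL example above.
payload = {
    "model": "qwen3",
    "messages": [{"role": "user", "content": "What is the weather today in Toronto?"}],
    "stream": True,
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The location to get the weather for, e.g. San Francisco, CA"},
                    "format": {"type": "string", "description": "celsius or fahrenheit", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location", "format"],
            },
        },
    }],
}

with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)          # each line is one JSON chunk
        message = chunk.get("message", {})
        print(message.get("content", ""), end="", flush=True)
        if message.get("tool_calls"):
            print(message["tool_calls"])  # the tool call arrives as a structured object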
The official guidance also notes that for scenarios requiring high-accuracy tool calling or complex interactions, you can increase the model's context window through the num_ctx option (for example, setting it to 32000), as shown below, though this will increase memory usage.
curl -X POST "http://localhost:11434/api/chat" -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'
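The same option can be set from the Python SDK through the options parameter of chat(); a brief sketch:
from ollama import chat

# Larger context windows improve complex tool calling but use more memory.
response = chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'why is the sky blue?'}],
    options={'num_ctx': 32000},
)
print(response.message.content)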