Ollama releases update to support streaming responses with tool calls

Written by Silas Grey
Updated on: June 13, 2025

Recommendation

The Ollama v0.8 update brings streaming responses combined with tool calling.
Core content:
1. Streaming responses paired with real-time tool calling, for more fluid AI applications
2. A new smart incremental parser that improves the accuracy and model compatibility of tool calling
3. Developer-friendly integration, with support for multiple languages and models and a simplified technical implementation

Recommended by Yang Fangxian, Founder of 53A and Most Valuable Expert of Tencent Cloud (TVP)

Real-time interaction and instant response are key to the AI application experience, but blocking tool calls often interrupt the flow of content, forcing users to wait while the model interacts with external tools. Ollama's recent v0.8 update introduces streaming responses with tool calling, so chat applications built by developers can call tools and display results in real time, just like streaming ordinary text output.

This update enables chat applications to call external tools in real time while the model generates content, smoothly displaying the entire process (the model's thinking, tool call instructions, and the final text reply) to users. The feature is fully supported in Ollama's Python and JavaScript libraries, as well as the REST API via cURL.

The key highlights of this update include:

  1. Instant tool calls with streamed content:  Applications no longer need to wait for the model's full response before handling tool calls. Model-generated content and tool call instructions are streamed synchronously, chunk by chunk.
  2. New smart incremental parser:  Ollama built a new parser that understands the structure of tool calls rather than just looking for JSON (a toy sketch of the idea follows this list). This allows Ollama to:
  • Separate streams in real time:  accurately detect, suppress, and parse tool-call-related tokens while streaming user-facing content.
  • Work with a wide range of models:  the parser is effective whether or not the model was trained with tool-specific tokens, can handle partial prefixes in model output, and falls back to JSON parsing when necessary.
  • Improve accuracy:  prefix matching and state management make tool calls significantly more reliable, avoiding the duplicated or incorrectly parsed calls that could occur in the past.
  3. Wide model support:  including Qwen 3, Devstral, the Qwen 2.5 series, Llama 3.1, Llama 4, and many other models that support tool calls.
  4. Developer-friendly integration:  clear cURL, Python, and JavaScript examples are provided to help you get started quickly.
  5. Model Context Protocol (MCP) enhancements:  developers using MCP now also benefit from streaming chat content and tool calls; the official recommendation is to use a larger context window (such as 32k) to further improve tool-calling performance and result quality.
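
To make the parser idea concrete, here is a toy sketch (not Ollama's actual implementation) of incremental prefix matching: text that could begin a tool-call tag is held back from the user-facing stream until it is either confirmed as a tool call or ruled out. The <tool_call> tag and the event format are illustrative assumptions.

import json

# Toy illustration only; the <tool_call> tag is an assumed format for
# demonstration, not Ollama's real tool-call tokens.
TOOL_OPEN, TOOL_CLOSE = "<tool_call>", "</tool_call>"

def stream_with_tool_detection(tokens):
  """Yield ('content', text) or ('tool_call', parsed) events from a token stream."""
  buf, in_tool = "", False
  for tok in tokens:
    buf += tok
    progress = True
    while progress:  # resolve any complete tags already sitting in the buffer
      progress = False
      if in_tool and TOOL_CLOSE in buf:
        body, _, buf = buf.partition(TOOL_CLOSE)
        yield ("tool_call", json.loads(body))  # parse the suppressed tool call
        in_tool, progress = False, True
      elif not in_tool and TOOL_OPEN in buf:
        text, _, buf = buf.partition(TOOL_OPEN)
        if text:
          yield ("content", text)
        in_tool, progress = True, True
    if not in_tool:
      # Flush everything except the longest suffix that could still begin TOOL_OPEN.
      keep = 0
      for i in range(1, min(len(buf), len(TOOL_OPEN) - 1) + 1):
        if TOOL_OPEN.startswith(buf[-i:]):
          keep = i
      if len(buf) > keep:
        yield ("content", buf[:len(buf) - keep])
        buf = buf[len(buf) - keep:]
  if buf and not in_tool:
    yield ("content", buf)  # trailing plain text

# The tag is split across tokens but is still detected and never shown to the user.
tokens = ["Checking", " the weather. <tool", '_call>{"name": "get_weather"}</tool_call>']
for kind, payload in stream_with_tool_detection(tokens):
  print(kind, payload)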
At the technical implementation level, developers can enable this feature in the following ways:

    • REST API (cURL):  set "stream": true in the /api/chat request body and define the available tools in the tools array.
    • Python:  call ollama.chat() with stream=True and pass the tool definitions (which can be plain function objects) via the tools parameter.
    • JavaScript:  call ollama.chat() with stream: true and pass the tool schema objects via the tools parameter.

    Here is an example in Python (calling a custom math function):

from ollama import chat, ChatResponse

# Define the python function
def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers

  Args:
    a (int): The first number as an int
    b (int): The second number as an int

  Returns:
    int: The sum of the two numbers
  """
  return a + b

messages = [{'role': 'user', 'content': 'what is three minus one?'}]

response: ChatResponse = chat(
  model='qwen3',
  messages=messages,
  tools=[add_two_numbers],  # Python SDK supports passing tools as functions
  stream=True,
)

for chunk in response:
  # Print model content
  print(chunk.message.content, end='', flush=True)
  # Print the tool call
  if chunk.message.tool_calls:
    print(chunk.message.tool_calls)

Expected output (example; the exact behavior depends on the model and on whether the user's question matches the available tool):

<think>
Okay, the user is asking ...
</think>

[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]
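
The example above only prints the tool call; in a real application you would execute the requested function and send the result back to the model so it can produce a final answer. Below is a minimal sketch of that second step, assuming the add_two_numbers function and messages list defined above (the 'tool'-role message format follows the Ollama Python SDK examples):

from ollama import chat

available = {'add_two_numbers': add_two_numbers}

# Accumulate the streamed assistant turn and any tool calls it contains.
content, tool_calls = '', []
for chunk in chat(model='qwen3', messages=messages,
                  tools=[add_two_numbers], stream=True):
  content += chunk.message.content or ''
  if chunk.message.tool_calls:
    tool_calls.extend(chunk.message.tool_calls)

if tool_calls:
  messages.append({'role': 'assistant', 'content': content})
  for call in tool_calls:
    fn = available.get(call.function.name)
    result = fn(**call.function.arguments) if fn else 'unknown tool'
    # Feed each result back as a 'tool' message.
    messages.append({'role': 'tool', 'name': call.function.name, 'content': str(result)})

  # Second round: the model turns the tool results into a final reply.
  for chunk in chat(model='qwen3', messages=messages, stream=True):
    print(chunk.message.content, end='', flush=True)
  print()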

cURL example (query weather):

curl http://localhost:11434/api/chat -d  '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, eg San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, eg '
celsius ' or ' fahrenheit '",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

Streaming output:

...
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:57.641643Z",
  "message": {
    "role": "assistant",
    "content": "celsius"
  },
  "done": false
}
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:57.673559Z",
  "message": {
    "role": "assistant",
    "content": "</think>"
  },
  "done": false
}
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": {
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false
}
...
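
Each line of the stream is a standalone JSON object like the ones above, so a client simply reads the response line by line. Here is a minimal sketch of consuming the same request from Python with the requests library (the tool schema is abbreviated from the cURL example above):

import json
import requests

tools = [{
  'type': 'function',
  'function': {
    'name': 'get_current_weather',
    'description': 'Get the current weather for a location',
    'parameters': {
      'type': 'object',
      'properties': {
        'location': {'type': 'string'},
        'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit']},
      },
      'required': ['location', 'format'],
    },
  },
}]

with requests.post('http://localhost:11434/api/chat', json={
  'model': 'qwen3',
  'messages': [{'role': 'user', 'content': 'What is the weather today in Toronto?'}],
  'stream': True,
  'tools': tools,
}, stream=True) as resp:
  for line in resp.iter_lines():
    if not line:
      continue
    chunk = json.loads(line)  # one JSON object per line
    msg = chunk.get('message', {})
    print(msg.get('content', ''), end='', flush=True)  # streamed text
    for call in msg.get('tool_calls', []):
      print('\nTOOL CALL:', call['function']['name'], call['function']['arguments'])
    if chunk.get('done'):
      break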

The official guidance also notes that, for the best results in scenarios requiring high-precision tool calls or complex interactions, you can increase the model's context window via the num_ctx field in options (for example, setting it to 32000), as shown below; note that a larger context window increases memory usage.

curl -X POST  "http://localhost:11434/api/chat"  -d  '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000 # Update context window here
  }
}'
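
The Python SDK exposes the same setting through the options parameter of chat(); a minimal sketch:

from ollama import chat

# A larger context window (num_ctx) helps with long tool schemas and complex
# interactions, at the cost of higher memory usage.
response = chat(
  model='llama3.2',
  messages=[{'role': 'user', 'content': 'why is the sky blue?'}],
  options={'num_ctx': 32000},
)
print(response.message.content)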