Ollama v0.8.0 is here: real-time streaming tool calls usher in a new era of intelligent dialogue

Written by
Iris Vance
Updated on: June 10, 2025
Recommendation

Ollama v0.8.0 leads a new revolution in intelligent dialogue technology: real-time streaming tool calls open a new era of interaction.

Core content:

1. Ollama v0.8.0 core updates: streaming responses + tool calls

2. A new real-time interactive experience: chat replies and tool calls are presented in sync

3. Compatibility with mainstream models, plus practical tool-call examples

 

In recent years, artificial intelligence dialogue technology has advanced rapidly: model capabilities keep improving, and demand for integrating intelligent tool calls has grown ever stronger. On May 28, 2025, Ollama officially released its major upgrade, v0.8.0. This update not only brings more powerful tool-calling support but also introduces a groundbreaking "streaming response + tool call" capability, greatly improving the user interaction experience and developer flexibility. This article analyzes the core updates, technical innovations, and application prospects of Ollama v0.8.0, giving you a comprehensive picture of how this intelligent dialogue engine opens a new era of chat and tool calls.

1. Ollama v0.8.0: Bigger and better tool support

As a leading engine for running large language models locally, Ollama is committed to empowering developers to build intelligent assistants and diverse interactive applications. The release of v0.8.0 is a major leap toward that vision, and its core improvements boil down to two points:

  1. Real-time streaming responses with tool-call support
  2. More accurate memory estimation and better log debugging

Together, these updates improve the system's overall responsiveness and the development and debugging experience, and they lay a solid foundation for "human + machine + tool" collaboration in complex scenarios.

2. Streaming responses + tool calls - unlocking a new real-time interactive experience

Previously, Ollama's tool calls required waiting for the model to generate its complete output before parsing it to determine whether tool-calling instructions were included. Although stable, this approach could not deliver the real-time feel of "calling tools while chatting", and response speed suffered. v0.8.0 is the first version to break this bottleneck, successfully implementing tool calls under streaming responses: while the model is still generating content, tool calls can be triggered and executed immediately. This design brings several important advantages (a minimal sketch follows the list below):

  • User experience upgrade: chat replies no longer wait for complete generation; content and tool-call results are presented in sync, making the interaction more natural and fluid.
  • Real-time tool-call feedback: tool execution connects seamlessly with content generation, so tool data arrives more accurately and promptly.
  • Greater development flexibility: multi-round calls and content display are supported in complex interactive scenarios, expanding the space for application innovation.
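
Here is a minimal sketch of that consumption loop using the official ollama Python client (assumptions: a local server on the default port, a pulled qwen3 model, and a hypothetical get_current_weather stub standing in for a real tool):

import ollama

def get_current_weather(location: str, format: str) -> str:
    """Hypothetical tool stub: a real version would call a weather API."""
    return f"22 degrees {format} in {location}"

# With stream=True, chunks arrive as they are generated, so tool calls
# can surface mid-stream instead of only after the full reply is done.
for chunk in ollama.chat(
    model="qwen3",
    messages=[{"role": "user", "content": "What is the weather today in Toronto?"}],
    tools=[get_current_weather],
    stream=True,
):
    if chunk.message.tool_calls:
        for call in chunk.message.tool_calls:
            print("\n[tool call]", call.function.name, call.function.arguments)
    if chunk.message.content:
        print(chunk.message.content, end="", flush=True)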

3. List of mainstream models that support tool calls

v0.8.0 is compatible with a variety of advanced models to meet different application needs, including but not limited to:

  • Qwen 3
  • Devstral
  • Qwen 2.5 and 2.5-coder
  • Llama 3.1
  • Llama 4

These models have been optimized and adapted to efficiently identify and parse tool call requests and achieve accurate collaboration.


4. A practical demonstration of tool calls

With curl, Python, and JavaScript examples, Ollama gives developers a complete and clear operating manual. As a simple example, a weather tool can be invoked as follows.

Check the weather today in Toronto using curl:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {"role": "user", "content": "What is the weather today in Toronto?"}
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"},
            "format": {"type": "string", "enum": ["celsius", "fahrenheit"]}
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

After running this, the model streams back the weather tool-call request immediately and then waits for the tool's data, enabling a smarter conversation experience. The Python and JavaScript libraries both support passing custom functions directly as tools, which opens up even more possibilities (see the sketch below).
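
For completeness, here is a hedged Python sketch of the full round trip: execute the requested tool locally, append its result as a "tool" message, and let the model fold it into a final answer (get_current_weather remains the hypothetical stub from the earlier sketch):

import ollama

def get_current_weather(location: str, format: str) -> str:
    """Hypothetical stub standing in for a real weather API call."""
    return f"22 degrees {format} in {location}"

messages = [{"role": "user", "content": "What is the weather today in Toronto?"}]

# Round 1: the model decides to call the tool.
response = ollama.chat(model="qwen3", messages=messages, tools=[get_current_weather])

for call in response.message.tool_calls or []:
    # Run the requested tool locally and feed the result back to the model.
    result = get_current_weather(**call.function.arguments)
    messages.append(response.message)
    messages.append({"role": "tool", "content": result, "name": call.function.name})

# Round 2: the model incorporates the tool output into its answer.
final = ollama.chat(model="qwen3", messages=messages)
print(final.message.content)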


5. Inside the new incremental tool-call parser

One of the most significant technical innovations in v0.8.0 is the redesigned incremental tool-call parser. Based on each model's template, it accurately recognizes the prefix that marks the start of a tool call and intelligently separates chat content from tool requests. Because it can parse while the model is still generating, it greatly improves streaming efficiency.

Compared with the old approach, which relied on parsing the complete JSON output, the new parser addresses several pain points:

  • Catches tool calls promptly: there is no need to wait for all text to be generated before parsing, which improves response speed.
  • Tolerates absent or partial prefixes: calls are recognized correctly even when the model does not strictly follow the tool-call format.
  • Eliminates redundant calls: avoids duplicate triggers caused by the model restating an earlier call.

This design makes the tool-call pipeline more robust, so developers no longer need to worry about parsing failures caused by irregular output formats. The sketch below illustrates the general idea.
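
Ollama has not published this parser as a standalone API, so the following is a purely illustrative sketch, not the actual implementation: buffer streamed text, hold back any suffix that could still grow into a tool-call prefix, and emit everything else as chat content immediately. The "<tool_call>" prefix is a hypothetical example; real prefixes come from each model's template.

# Illustrative sketch only - not Ollama's actual implementation.
PREFIX = "<tool_call>"  # hypothetical prefix; real ones come from the model template

class IncrementalToolCallParser:
    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str) -> str:
        """Return text that is safe to display as chat content now."""
        self.buffer += chunk
        if PREFIX in self.buffer:
            content, _, rest = self.buffer.partition(PREFIX)
            self.buffer = PREFIX + rest  # retain the call text for JSON parsing
            return content
        # Hold back the longest buffer suffix that could still become the prefix.
        for i in range(min(len(PREFIX) - 1, len(self.buffer)), 0, -1):
            if PREFIX.startswith(self.buffer[-i:]):
                safe, self.buffer = self.buffer[:-i], self.buffer[-i:]
                return safe
        safe, self.buffer = self.buffer, ""
        return safe

Feeding it "The weather " then "<tool" then "_call>{...}" emits the plain text right away while withholding the potential tool call until it can be confirmed and parsed.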


6. Innovative Model Context Protocol

To make the most of model capabilities, Ollama has introduced the "Model Context Protocol" (MCP), which supports ultra-long context windows (32k tokens and above), significantly improving tool-call accuracy and depth of context understanding. A longer context brings higher memory overhead, but it does help the model make more reasonable and accurate tool-call decisions. Developers can set the num_ctx field to adjust this per request, as shown below.
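
For example, with the ollama Python client, the context window can be raised per request through the options mapping (num_ctx is Ollama's standard option name; the 32768 value is only an illustration):

import ollama

# Request a 32k-token context window for this call. A larger window gives
# the model more room for tool definitions and prior turns, at the cost
# of higher memory usage.
response = ollama.chat(
    model="qwen3",
    messages=[{"role": "user", "content": "Summarize our conversation so far."}],
    options={"num_ctx": 32768},
)
print(response.message.content)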


7. Practical tips and best practices

  1. Prefer models that support tool calls
    The new version is compatible with multiple mainstream models; choosing one with tool-call support lets you take full advantage of the release.
  2. Configure a long context to improve results
    Set the maximum context length according to application needs so the model can draw on more complete information.
  3. Keep function and tool definitions clear and standardized
    To avoid call ambiguity, tool function definitions and parameter descriptions must be detailed and conform to what the model can recognize.
  4. Use streaming to optimize front-end interaction
    Combine WebSockets, event streams, and similar techniques to build a real-time chat interface with incremental content display and synchronized tool-call feedback.
  5. Watch the debug logs
    The new version's logs include better memory estimates, which helps diagnose model resource usage and speeds up tuning.

8. Widely applicable scenarios and future prospects

With v0.8.0, Ollama powers not only basic chatbots but also intelligent assistants, technical support, educational tutoring, online consultation, and even complex automated workflows. Going forward, more tool integrations, intelligent task coordination, and stronger custom interaction features are expected, making "human + machine + tool" collaboration smoother and more efficient.

In addition, with the continuous optimization of model training and context protocols, intelligent dialogue combined with streaming tool calls will become the core form of artificial intelligence applications in the new era, creating unprecedented value for enterprises and developers.


9. Summary

With its most cutting-edge changes, Ollama v0.8.0 combines real-time streaming responses with tool calls, pushing intelligent dialogue into a new dimension of efficiency and experience. Developers and end users alike will benefit, ushering in an era of smarter, more interactive, and richer chat applications.