Breaking Down Ollama's May Update: Multimodal, Tool-Using, and Smarter Than Ever

Ollama's May updates bring multimodality and tool calling, taking local AI another step forward!
Core content:
1. A new multimodal engine supports vision models with improved reliability and accuracy
2. Tool calling supports streaming responses for real-time interaction
3. Foundations laid for future voice and image generation
May was an exciting month for the Ollama community! This open-source project, dedicated to making it easy for everyone to run powerful AI models locally, shipped three major updates in just a few weeks, greatly expanding its functionality and user experience. If you follow local AI development or are already a loyal Ollama user, these new features are not to be missed. Let's take a look at the surprises Ollama brought us this May.
Major upgrade 1: A new multimodal engine lets your AI "see" the world (May 15)
First of all, Ollama has launched a new multimodal engine, officially announcing support for vision models! This means you can now use Ollama to run models locally that not only understand text but can also "see" and analyze images.
The first batch of supported models includes:
- Meta Llama 4: Special mention goes to Llama 4 Scout, a 109-billion-parameter mixture-of-experts (MoE) model that can perform detailed analysis of video frames and even answer location-based questions. For example, given a photo of the San Francisco Ferry Building, it can accurately describe the clock tower, the bay in the background, and the bridge in the distance, and it can also tell you how far it is from there to Stanford University and the best way to get there.
- Google Gemma 3: Demonstrated its ability to process multiple images and understand the relationships between them. Given four images that each contain an alpaca, it correctly identified the animal they have in common, and it could also analyze a funny picture of an alpaca boxing a dolphin.
- Qwen 2.5 VL (Tongyi Qianwen): Demonstrated powerful document scanning and character recognition, such as accurately reading the text on a check and even understanding and translating vertically written Chinese Spring Festival couplets.
- Mistral Small 3.1
- And more vision models.
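To get a feel for this in practice, here is a minimal sketch of asking a vision model about an image through the Ollama Python library; the model tag and image path are placeholders, so substitute whatever you have pulled locally:

```python
from ollama import chat

# Minimal sketch: ask a locally pulled vision model to describe an image.
# 'gemma3' and './ferry_building.jpg' are placeholders for this example.
response = chat(
    model='gemma3',
    messages=[{
        'role': 'user',
        'content': 'What do you see in this picture?',
        'images': ['./ferry_building.jpg'],  # local file path (base64 data also works)
    }],
)

print(response.message.content)
```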
Why do we need a new engine?
Ollama previously relied mainly on the ggml/llama.cpp project, focusing on ease of use and model portability. However, with the emergence of multimodal models, the original architecture ran into challenges supporting these more complex models. The new engine aims to:
- Improve reliability and accuracy: Through a modular design, each model is relatively self-contained, reducing cross-model interference and simplifying model integration. The new engine also processes the tokens generated by large images more accurately and optimizes image batching and positional information.
- Optimize memory management: Introduces an image caching mechanism, works with hardware manufacturers to improve memory estimation and usage, and tunes the KV cache for model-specific features (such as sliding-window attention in Gemma 3 and chunked attention in Llama 4), enabling longer context or higher concurrency on the same hardware.
- Lay the foundation for the future: Builds a solid base for future support of speech, image generation, video generation, longer context, and more complete tool support.
Major upgrade 2: Tool calling supports streaming responses, making real-time interaction smoother (May 28)
Following the release of the multimodal engine, Ollama shipped another extremely practical feature: streaming responses for tool calling.
This means that when the model needs to call external tools (such as querying the weather, executing code, or searching the web) to answer your question, it no longer has to wait for the tool to finish and return all of its results before responding. The model can now stream generated content to you in real time while inserting tool-call instructions at the appropriate points.
Models that support this feature include Qwen 3, Devstral, Llama 3.1, Llama 4, etc.
How does it work?
Ollama has developed a new incremental parser. Instead of simply waiting for complete JSON output, the parser can:
- Understand model templates: It refers directly to each model's template to recognize the prefix that marks a tool call.
- Process output intelligently: Even if the model's output does not strictly follow the expected tool-call format, the parser can handle partial prefixes or fall back to parsing JSON, accurately separating content from tool calls.
- Improve accuracy: It avoids the duplicate-call issues that could previously occur when a model referenced earlier tool calls in its reply.
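To make "handling partial prefixes" concrete, here is a purely illustrative Python sketch of the idea; it is not Ollama's actual parser (which lives in its Go server and is driven by each model's template), and the `<tool_call>` prefix is an assumption made up for this example:

```python
import json

# Illustrative only: assume a hypothetical '<tool_call>' prefix marks a JSON tool call.
TOOL_PREFIX = "<tool_call>"

def split_stream(buffer: str):
    """Split accumulated streamed text into (content, tool_call_or_None, remainder)."""
    idx = buffer.find(TOOL_PREFIX)
    if idx == -1:
        # No full prefix yet: hold back a partial prefix at the end of the buffer
        # so half of '<tool_call>' is never emitted as user-visible content.
        for k in range(len(TOOL_PREFIX) - 1, 0, -1):
            if buffer.endswith(TOOL_PREFIX[:k]):
                return buffer[:-k], None, buffer[-k:]
        return buffer, None, ""
    content, tail = buffer[:idx], buffer[idx + len(TOOL_PREFIX):]
    try:
        # Fall back to plain JSON parsing of whatever follows the prefix.
        return content, json.loads(tail), ""
    except json.JSONDecodeError:
        # JSON still incomplete: keep accumulating from the prefix onward.
        return content, None, buffer[idx:]
```

The real parser does more (it tracks model-specific templates and previously emitted calls), but the core idea is the same: emit content immediately, and only surface a tool call once its JSON can actually be parsed.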
Users can access this feature through cURL or the Python and JavaScript libraries. For example, you can define an addition function in Python and let the model call it to compute "3 + 1"; the model streams its "thinking" (if enabled) and then accurately calls the function you defined, along the lines of the sketch below.
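Here is a minimal sketch of that flow using the Ollama Python library; it assumes a tool-calling model such as Qwen 3 is pulled locally, and the exact response field names may vary slightly between library versions:

```python
from ollama import chat

def add_two_numbers(a: int, b: int) -> int:
    """Add two integers and return the result."""
    return a + b

messages = [{'role': 'user', 'content': 'What is 3 plus 1?'}]

# Stream the response: content arrives incrementally, and tool calls are
# surfaced as soon as the incremental parser has a complete call.
for chunk in chat(model='qwen3', messages=messages, tools=[add_two_numbers], stream=True):
    if chunk.message.content:
        print(chunk.message.content, end='', flush=True)
    if chunk.message.tool_calls:
        for call in chunk.message.tool_calls:
            # In a full loop you would run the matching function here and send its
            # result back to the model in a follow-up message with role 'tool'.
            print(f"\n[tool call] {call.function.name}({call.function.arguments})")
```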
Additionally, the update highlights how well this pairs with the Model Context Protocol (MCP) and recommends a context window of 32k or higher to improve the performance and quality of tool calls.
Major upgrade 3: Introducing the “thinking” process to make model decisions more transparent (May 30)
The last big gift of May is the model "thinking" feature. Users can now choose to enable or disable the model's "thinking" process.
When thinking is enabled, the model's output separates its reasoning process from its final answer. This is very helpful for understanding how the model reaches its conclusions step by step, and it lets developers build more interesting applications on top of it, such as showing a thought bubble before a game NPC speaks.
When thinking is disabled, the model outputs the answer directly, which is useful in scenarios where a quick response matters most.
Models that support this function include DeepSeek R1, Qwen 3, etc.
How to use it?
- CLI: Use the `--think` (enable) or `--think=false` (disable) flag. In an interactive session, use `/set think` or `/set nothink`. There is also a `--hidethinking` flag for when thinking is enabled but you only want the final answer displayed.
- API: The `/api/generate` and `/api/chat` endpoints accept a new `think` parameter (true/false).
- Python/JavaScript libraries: The corresponding libraries have been updated to support passing a `think` parameter when calling, as shown below.
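As a small sketch of the library route (using the Python library and assuming a locally pulled DeepSeek R1 model; field names follow recent library versions and may differ slightly in older ones):

```python
from ollama import chat

response = chat(
    model='deepseek-r1',
    messages=[{'role': 'user', 'content': 'How many letters are in the word "strawberry"?'}],
    think=True,  # set to False to skip the reasoning trace and answer directly
)

# With think=True the reasoning and the final answer come back in separate fields.
print('Thinking:', response.message.thinking)
print('Answer:  ', response.message.content)
```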
The official announcement demonstrates how the DeepSeek R1 model answers the same question with thinking enabled and disabled, which intuitively shows the value of the feature.
What does this mean for users?
Ollama's three major updates in May are undoubtedly a big step forward for local AI:
- More powerful capabilities: The multimodal engine means Ollama is no longer limited to text; it can now understand visual information, opening the door to richer AI applications.
- Smoother interaction: Streaming tool calls eliminate the long waits that external tool calls used to cause, making interaction with AI more real-time and natural.
- More transparent decision making: The "thinking" feature not only gives us a glimpse into the "inner world" of large models, it also offers a new angle for debugging and optimizing them.
- Continued ease of use: Despite increasingly powerful functionality, Ollama remains simple and easy to use, letting users experience the latest AI technology through simple commands and APIs.