Ollama 0.7 released: the new engine supports multimodal models

Written by
Iris Vance
Updated on: June 28, 2025

Ollama 0.7 is a major upgrade: its new multimodal model engine leads a new wave of local-inference technology.

Core content:
1. Ollama v0.7.0 is released, supporting multimodal large models
2. Supports Meta Llama, Google Gemma and other mainstream models
3. Upgrade guide and qwen2.5vl model download example


Introduction

Ollama has released major version v0.7.0, which includes a new model engine supporting multimodal large models:

  • Meta Llama 4 [1]
  • Google Gemma 3 [2]
  • Qwen 2.5 VL [3]
  • Mistral Small 3.1 [4]
  • Other vision models [5]
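All of these are available from the Ollama model library; for example, the following pulls and starts a chat with one of them (model names as listed on ollama.com/library at the time of writing):

ollama run gemma3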

Upgrade Ollama

Only Ollama v0.7.0 and above supports multimodal models. If your version is lower than v0.7.0, you need to upgrade Ollama first.
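You can check the currently installed version with:

ollama -v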

Take Linux as an example:

  1. Download the new release
wget https://github.com/ollama/ollama/releases/download/v0.7.0/ollama-linux-amd64.tgz
  2. Stop the Ollama server
sudo systemctl stop ollama
  3. Delete the old version's library files
sudo rm -rf /usr/lib/ollama
  4. Extract the new release into /usr
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
  5. Start Ollama
sudo systemctl start ollama
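After restarting, it is worth confirming that the service is running and reports the new version:

systemctl status ollama --no-pager
ollama -v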

Example

Take the qwen2.5vl model as an example. First, download the model:

ollama pull qwen2.5vl
pulling manifest
pulling a99b7f834d75: 100%
pulling a242d8dfdc8f: 100%
pulling 75357d685f23: 100%
pulling 832dd9e00a68: 100%
pulling 52d2a7aa3a38: 100%
pulling 83b9da835d9f: 100%
verifying sha256 digest
writing manifest
success
ollama run qwen2.5vl
>>> ./spring-couplets.jpg what is the content of this picture?
Added image './spring-couplets.jpg'
The image appears to be a set of Chinese New Year decorations, specifically red couplets and a central piece with the character "福" (fú), which means
"blessing" or "good fortune." The couplets and the central piece contain various phrases in Chinese, which are wishes for good fortune, happiness, and
prosperity. Here are some of the phrases:

- "家和国盛" (jiā hé guó shèng): "Family harmony and country prosperity."
- "和氣生财" (hé qì shēng cái): "Harmony brings wealth."
- "Peaceful society" (pái ān shè huì): "Peaceful society."
- "Life is comfortable for everyone"
- "新春" (chūn xīn): "New Year."
- "告别老迎新" (cí jiù yíng xīn): "Bid farewell to the old and welcome the new."

These phrases are commonly used in Chinese New Year decorations to express wishes for a prosperous and harmonious year ahead.
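The same model can also be queried non-interactively through Ollama's HTTP API, which accepts base64-encoded images in an images field. A minimal sketch, assuming the server is listening on the default port 11434 and GNU base64 (whose -w0 flag disables line wrapping):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5vl",
  "prompt": "What is in this picture?",
  "images": ["'"$(base64 -w0 ./spring-couplets.jpg)"'"],
  "stream": false
}'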

New multimodal engine

The problem

Until now, Ollama has relied heavily on the ggml-org/llama.cpp [6] project for model support, focusing on ease of use and model portability.

As major research labs release more and more multimodal models, supporting them the way Ollama intends has become increasingly challenging.

We decided to support a new engine that makes multimodal models first-class citizens and enables Ollama's partners to contribute more directly to the community, namely through the GGML tensor library.

With this, Ollama aims to improve the reliability and accuracy of local inference and lay the groundwork for future capabilities such as speech, image generation, video generation, and longer context support.

Model modularity

The goal is to ensure that each model's "sphere of influence" is limited to itself, which increases reliability and makes it easier for creators and developers to integrate new models.

ggml/llama.cpp provides first-class support for text-only models. In a multimodal system, however, the text decoder and vision encoder are split into separate models and executed independently. Passing image embeddings from the vision model to the text model therefore requires model-specific logic in the orchestration layer, which can break the implementation of a specific model.

How to improve it?

In Ollama, each model is fully self-contained and can expose its own projection layer, aligned with how the model was trained. This isolation lets model creators train and publish models without patching shared code or adding nested if statements to handle different cases (text and images); they can focus on their own model and its training.
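One user-visible effect of this self-containment is that a model's packaged capabilities can be inspected directly. For example, ollama show prints a model's architecture, parameters, and capabilities (the exact fields vary by version):

ollama show qwen2.5vl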

Accuracy

Large images produce a large number of tokens, which may exceed the batch size. Ollama attaches metadata while processing images to help improve accuracy.

Example problems and solutions:

  • Should attention be enabled or disabled?
  • Can image embeddings be processed in batches? If so, where should a batch be split, balancing output quality against inference time? Splitting in the wrong place degrades output quality.

Most local inference tools implement this differently; while they may achieve similar results, they do not follow how the models were designed and trained.

Memory Management

Image caching: Ollama caches processed images so that subsequent prompts are faster. The image stays in the cache while it is in use and is not discarded by memory cleanup.
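In practice, this means a follow-up question about the same image does not pay the full image-processing cost again. A sketch using the /api/chat endpoint (assuming a local server on the default port and GNU base64; the assistant line is a placeholder for the first response):

IMG="$(base64 -w0 ./spring-couplets.jpg)"
# follow-up turn: resend the history plus a new question; the cached
# image does not need to be reprocessed from scratch
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Describe this picture.", "images": ["'"$IMG"'"]},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Which phrases wish for wealth?"}
  ]
}'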

Memory estimation and KV cache optimization:

  • Ollama works with hardware manufacturers and OS partners to ensure that the correct hardware metadata is detected, so memory usage can be better estimated and optimized.
  • Ollama has added KV cache optimizations to improve memory efficiency (tunable as sketched below).
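As a concrete example, recent Ollama builds expose environment variables for flash attention and a quantized KV cache. Whether they help depends on the model and hardware, so treat the following as a sketch to experiment with:

# enable flash attention and use an 8-bit quantized KV cache
# (OLLAMA_KV_CACHE_TYPE accepts f16, q8_0, or q4_0)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve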

Future plans

  • Support for longer context sizes
  • Support for thinking/reasoning capabilities
  • Tool calling with streaming responses
  • Enabling computer use

Summary

Ollama 0.7.0 not only adds support for multimodal models; it also ships a new model engine and lays out a blueprint for future features.

Ollama is moving toward model standardization: open-source model providers and research institutions only need to publish their models, and Ollama handles the inference. This is why a separate engine was released.