Goodbye, llama.cpp! Ollama's self-developed engine sends local inference performance soaring

Ollama's new engine reworks local AI inference and delivers a breakthrough jump in performance.
Core content:
1. An architectural rebuild: from reliance on a traditional framework to a self-developed engine
2. KV cache optimization and memory management innovations that significantly boost inference speed
3. Deep co-optimization with mainstream hardware vendors to build a unified, optimized hardware ecosystem
1. From "framework dependency" to "independent engine": a complete reconstruction of the underlying architecture
Over the past year, the keyword in the large-model field has been "multimodal": Meta's Llama 4 supports joint reasoning over text and images, Google's Gemma 3 strengthens code generation, and Alibaba's Qwen 2.5 VL can even parse medical images. But this growing model complexity exposes the shortcomings of traditional inference frameworks: when users try to run a local AI task that requires handling image generation, text reasoning, and mathematical calculation at the same time, frameworks such as llama.cpp often crash because of uneven memory allocation or inefficient compute scheduling.
Ollama's answer was to "reinvent the wheel from scratch". The team stated plainly on Hacker News that the new engine is developed entirely in Go and has no direct connection to llama.cpp's C++ implementation. Behind this choice is a trade-off between performance and flexibility: Go's goroutine mechanism is better suited to parallel processing of multimodal tasks, while C++'s "hardcore" manual memory management is prone to compatibility issues. For example, when processing a high-resolution medical image, the Ollama engine first divides the image into multiple logical blocks, then attaches metadata (such as pixel coordinates and color mode) that marks how each block relates to the text tokens. This "classify first, then fuse" strategy lets the model accurately locate the lesion region in the image when generating a diagnostic report, avoiding the semantic breaks caused by blindly concatenating data in traditional frameworks.
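To make the "classify first, then fuse" idea concrete, here is a minimal Go sketch of splitting an image into logical blocks and tagging each with positional metadata. The types, names, and block layout are illustrative assumptions for this article, not Ollama's actual internals.

```go
package main

import "fmt"

// ImageBlock is a hypothetical representation of one logical tile of a
// high-resolution image, tagged with the metadata described above.
type ImageBlock struct {
	X, Y          int    // pixel coordinates of the block's top-left corner
	Width, Height int    // block dimensions in pixels
	ColorMode     string // e.g. "RGB", "grayscale"
	Pixels        []byte // raw pixel data for this block
}

// splitIntoBlocks divides a flat RGB pixel buffer into fixed-size logical
// blocks and attaches positional metadata so later stages can relate each
// block to text tokens.
func splitIntoBlocks(pixels []byte, width, height, blockSize int, colorMode string) []ImageBlock {
	const bytesPerPixel = 3 // assume RGB for this sketch
	var blocks []ImageBlock
	for y := 0; y < height; y += blockSize {
		for x := 0; x < width; x += blockSize {
			bw, bh := minInt(blockSize, width-x), minInt(blockSize, height-y)
			// Copy the block's rows out of the flat pixel buffer.
			data := make([]byte, 0, bw*bh*bytesPerPixel)
			for row := y; row < y+bh; row++ {
				start := (row*width + x) * bytesPerPixel
				data = append(data, pixels[start:start+bw*bytesPerPixel]...)
			}
			blocks = append(blocks, ImageBlock{
				X: x, Y: y, Width: bw, Height: bh,
				ColorMode: colorMode, Pixels: data,
			})
		}
	}
	return blocks
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	img := make([]byte, 1024*1024*3) // a dummy 1024x1024 RGB image
	blocks := splitIntoBlocks(img, 1024, 1024, 256, "RGB")
	fmt.Printf("split into %d blocks; first block at (%d,%d)\n", len(blocks), blocks[0].X, blocks[0].Y)
}
```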
2. The secret behind the performance surge: KV cache optimization and memory management "surgery"
One of the core pain points of local inference is the tug-of-war between GPU memory capacity and compute speed. Take Meta's Llama 4 Scout model as an example: its 109 billion parameters and mixture-of-experts (MoE) architecture require maintaining dozens of dynamic weight matrices at once, and traditional solutions are often slowed down by frequent reads and writes to GPU memory. Ollama's breakthrough is its "KV cache partition compression" technique: by analyzing the access frequency of key-value pairs in the transformer model, high-frequency data is kept in GPU memory while low-frequency data is dynamically migrated to system RAM or SSD. According to developer-community tests, this technique increases Llama 4 Scout's inference speed by 40% while GPU memory usage grows by only 12%.
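The Go sketch below illustrates the general idea of frequency-based KV cache tiering described above: hot key/value blocks stay on the GPU and cold ones are demoted. All type names, thresholds, and tier rules here are assumptions for illustration, not Ollama's actual implementation.

```go
package main

import "fmt"

// Tier labels where a cached key/value block currently lives.
type Tier int

const (
	GPU Tier = iota // fast, scarce GPU memory
	RAM             // host memory
	SSD             // slowest, largest
)

// KVEntry is a hypothetical cached key/value block for one span of tokens.
type KVEntry struct {
	Data     []float32
	Accesses int
	Where    Tier
}

// TieredKVCache is a minimal sketch of frequency-based placement: blocks that
// are read often stay on the GPU; cold blocks are demoted toward RAM or SSD.
type TieredKVCache struct {
	entries map[string]*KVEntry
	gpuCap  int // max number of GPU-resident entries in this sketch
}

func NewTieredKVCache(gpuCap int) *TieredKVCache {
	return &TieredKVCache{entries: map[string]*KVEntry{}, gpuCap: gpuCap}
}

// Get records an access and promotes hot entries back into GPU memory.
func (c *TieredKVCache) Get(key string) *KVEntry {
	e, ok := c.entries[key]
	if !ok {
		return nil
	}
	e.Accesses++
	if e.Where != GPU && e.Accesses > 3 && c.gpuCount() < c.gpuCap {
		e.Where = GPU // promote a frequently read block into GPU memory
	}
	return e
}

// Put inserts a new block, demoting the coldest GPU block if GPU memory is full.
func (c *TieredKVCache) Put(key string, data []float32) {
	if c.gpuCount() >= c.gpuCap {
		c.demoteColdest()
	}
	c.entries[key] = &KVEntry{Data: data, Where: GPU}
}

func (c *TieredKVCache) gpuCount() int {
	n := 0
	for _, e := range c.entries {
		if e.Where == GPU {
			n++
		}
	}
	return n
}

func (c *TieredKVCache) demoteColdest() {
	var coldest *KVEntry
	for _, e := range c.entries {
		if e.Where == GPU && (coldest == nil || e.Accesses < coldest.Accesses) {
			coldest = e
		}
	}
	if coldest != nil {
		coldest.Where = RAM // a real engine might spill further to SSD
	}
}

func main() {
	cache := NewTieredKVCache(2)
	cache.Put("layer0:tok0-127", make([]float32, 128))
	cache.Put("layer0:tok128-255", make([]float32, 128))
	cache.Put("layer1:tok0-127", make([]float32, 128)) // forces a demotion
	fmt.Println("GPU-resident blocks:", cache.gpuCount())
}
```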
Another breakthrough is the "image cache reuse" mechanism. In AI drawing scenarios, users often adjust the prompt several times to fine-tune the output. Traditional frameworks re-parse the original image each time, whereas Ollama caches the preprocessed image tensors in memory and associates them with a specific session ID. For example, when a user changes the prompt to "change the blue sky to dusk", the engine only needs to fetch the segmented sky-region data from the cache instead of decoding the whole image again. This optimization reduces the time to batch-process 100 images from 18 minutes to 7 minutes (data source: Ollama's official performance white paper).
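A minimal sketch of per-session tensor reuse might look like the following Go code; the cache key scheme, locking, and decode callback are assumptions for illustration rather than Ollama's real data structures.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// TensorCache is a minimal sketch of per-session reuse of preprocessed image
// tensors: the same session re-editing the same image skips re-decoding.
type TensorCache struct {
	mu    sync.Mutex
	store map[string][]float32 // key: sessionID + image hash
}

func NewTensorCache() *TensorCache {
	return &TensorCache{store: map[string][]float32{}}
}

func cacheKey(sessionID string, image []byte) string {
	sum := sha256.Sum256(image)
	return sessionID + ":" + hex.EncodeToString(sum[:])
}

// Preprocess returns the cached tensor when available; otherwise it runs the
// (expensive) decode step and stores the result for later prompt tweaks.
func (c *TensorCache) Preprocess(sessionID string, image []byte, decode func([]byte) []float32) []float32 {
	key := cacheKey(sessionID, image)
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.store[key]; ok {
		return t // cache hit: no repeated decoding of the whole image
	}
	t := decode(image)
	c.store[key] = t
	return t
}

func main() {
	cache := NewTensorCache()
	decodes := 0
	decode := func(img []byte) []float32 { decodes++; return make([]float32, len(img)) }

	img := []byte("fake image bytes")
	cache.Preprocess("session-42", img, decode) // first prompt: decodes
	cache.Preprocess("session-42", img, decode) // edited prompt: cache hit
	fmt.Println("decode calls:", decodes)       // prints 1
}
```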
3. The "United Front" of the Hardware Ecosystem: Deep Adaptation from Chip Instruction Set to Driver Layer
Another highlight of Ollama's new engine is joint optimization with hardware vendors such as NVIDIA, AMD, and Intel. Take GPU memory management as an example: traditional frameworks usually rely on generic CUDA or ROCm interfaces, whereas Ollama parses hardware metadata (such as the GPU's SM count and peak memory bandwidth) to adjust its task scheduling strategy dynamically. On an AMD Radeon RX 7900 XTX, for instance, the engine prioritizes asynchronous compute queues, assigning image preprocessing to the GPU's AI acceleration units while the graphics compute units handle text tokens. This "divide and conquer" strategy reduces performance variation of the same model across different hardware by 60%.
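As a rough illustration of metadata-driven scheduling, the Go sketch below picks queue assignments and batch size from a hypothetical hardware profile; the fields, thresholds, and queue names are invented for the example and are not Ollama's actual probe output.

```go
package main

import "fmt"

// GPUInfo holds the kind of hardware metadata the article mentions; the field
// set here is illustrative only.
type GPUInfo struct {
	Vendor          string
	SMCount         int     // streaming multiprocessors / compute units
	MemBandwidthGBs float64 // peak memory bandwidth in GB/s
	AsyncQueues     bool    // supports asynchronous compute queues
}

// SchedulePlan describes how multimodal work is split across the device.
type SchedulePlan struct {
	ImagePrepQueue string
	TextTokenQueue string
	BatchSize      int
}

// planSchedule is a hedged sketch of metadata-driven scheduling: parts with
// async queues get image preprocessing on a separate queue, and wide,
// high-bandwidth parts get larger batches.
func planSchedule(g GPUInfo) SchedulePlan {
	plan := SchedulePlan{ImagePrepQueue: "default", TextTokenQueue: "default", BatchSize: 8}
	if g.AsyncQueues {
		plan.ImagePrepQueue = "async-compute" // overlap image prep with token decode
		plan.TextTokenQueue = "graphics"
	}
	if g.SMCount >= 80 && g.MemBandwidthGBs >= 800 {
		plan.BatchSize = 32 // enough parallelism and bandwidth for larger batches
	}
	return plan
}

func main() {
	// Illustrative numbers for an RX 7900 XTX-class part.
	gpu := GPUInfo{Vendor: "AMD", SMCount: 96, MemBandwidthGBs: 960, AsyncQueues: true}
	fmt.Printf("%+v\n", planSchedule(gpu))
}
```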
Even more noteworthy is the support for mobile and edge devices. Through cooperation with Qualcomm, the Ollama engine can recognize the Hexagon DSP in Snapdragon chips and offload some matrix operations to the dedicated AI core. In an internal test, a phone with a Snapdragon 8 Gen 3 running the Qwen 2.5 VL model generated output 3 times faster than with a general-purpose framework, and the device temperature dropped by 11°C. This optimization relies not only on software-level instruction reordering but also on tuning the hardware cache-line prefetch strategy; Ollama even customized different data block sizes for different brands of LPDDR5X memory.
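The offload-plus-tuned-block-size idea can be sketched in a few lines of Go; the device profile, memory-type labels, and tile sizes below are purely hypothetical placeholders.

```go
package main

import "fmt"

// DeviceProfile is an illustrative description of a mobile SoC; the fields and
// values are assumptions for this sketch, not published Ollama data.
type DeviceProfile struct {
	HasHexagonDSP bool
	MemoryType    string // e.g. "LPDDR5X-A", "LPDDR5X-B" (hypothetical labels)
}

// matmulPlacement decides where a matrix multiplication runs and what tile
// size to use, mirroring the offload-plus-tuned-block-size idea in the text.
func matmulPlacement(d DeviceProfile) (target string, tileBytes int) {
	target = "cpu"
	if d.HasHexagonDSP {
		target = "dsp" // offload dense matmuls to the dedicated AI core
	}
	// Hypothetical per-memory tuning: different DRAM parts prefer different
	// transfer granularities, so the tile size is looked up per memory type.
	switch d.MemoryType {
	case "LPDDR5X-A":
		tileBytes = 64 * 1024
	case "LPDDR5X-B":
		tileBytes = 128 * 1024
	default:
		tileBytes = 32 * 1024
	}
	return target, tileBytes
}

func main() {
	target, tile := matmulPlacement(DeviceProfile{HasHexagonDSP: true, MemoryType: "LPDDR5X-A"})
	fmt.Println("run on:", target, "tile bytes:", tile)
}
```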
4. Real-world scenario testing: pushing large models past the "productivity" threshold
The value of the technical upgrade ultimately shows up in user experience. One developer used the new Ollama engine to test the code generation capabilities of the Mistral Small 3.1 model: given a photo of a hand-drawn sketch containing a class diagram plus the text description "Please generate Python code to implement this class structure", the model not only correctly identified the inheritance relationships in the diagram but also filled in the private methods that were not drawn. By contrast, the old engine often confused class names and function names because of image segmentation errors.
In the medical field, Ollama's early partner organizations have tried running customized pathology analysis models on it. When a 5000×5000-pixel mammogram is fed in, the engine splits the image into 64 blocks for parallel processing using its "block attention" technique and outputs diagnostic suggestions within 12 seconds (a traditional setup takes 29 seconds). More importantly, because the attached metadata records each block's coordinates, the model can mark the locations of suspicious calcifications directly in the report without calling a separate image annotation interface.
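The parallel-tiling part of this workflow can be approximated with goroutines, as in the Go sketch below: it splits a square image into an 8×8 grid and analyzes the 64 tiles concurrently while keeping each tile's coordinates. It is only a schematic stand-in for the actual block-attention implementation, and the types and callback are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// TileResult is a hypothetical per-block output that keeps the block's
// coordinates so findings can be reported by location.
type TileResult struct {
	Row, Col int
	Finding  string
}

// processTiles splits a square image into grid x grid blocks and analyzes them
// concurrently with goroutines.
func processTiles(imageSide, grid int, analyze func(row, col, x, y, side int) string) []TileResult {
	tileSide := imageSide / grid
	results := make([]TileResult, grid*grid)
	var wg sync.WaitGroup
	for r := 0; r < grid; r++ {
		for c := 0; c < grid; c++ {
			wg.Add(1)
			go func(r, c int) {
				defer wg.Done()
				finding := analyze(r, c, c*tileSide, r*tileSide, tileSide)
				results[r*grid+c] = TileResult{Row: r, Col: c, Finding: finding}
			}(r, c)
		}
	}
	wg.Wait()
	return results
}

func main() {
	// An 8x8 grid gives the 64 blocks mentioned in the text.
	res := processTiles(5000, 8, func(row, col, x, y, side int) string {
		return fmt.Sprintf("tile (%d,%d) at px (%d,%d) ok", row, col, x, y)
	})
	fmt.Println(len(res), "tiles analyzed; e.g.", res[0].Finding)
}
```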
5. Controversy and Challenges: The Open Source Community's "Boundary Dispute"
Although Ollama stresses that its engine was developed independently, doubts remain in the community. Georgi Gerganov, the creator of llama.cpp, has publicly noted that some of Ollama's optimization ideas (such as its implementation of 2D rotary embeddings) are "highly similar" to the design of the libmtmd library. The Ollama team responded that both follow the formulas of the original papers, and that the differences come only from the characteristics of the programming languages (Go's goroutines versus C++'s thread pools).
This debate points to a deeper question: in the competition between multimodal frameworks, how do you balance performance against compatibility and open-source norms? For example, although Ollama's image caching mechanism improves efficiency, its proprietary data format may reduce interoperability with other frameworks. If users want to import Ollama-processed image data into PyTorch for further training, additional format conversion steps may be needed, which conflicts with the "seamless collaboration" ethos the open-source community advocates.