New breakthrough for AI PC: first on-device support for a 128K context window, with a 2.2x inference speed-up

On-device AI PC technology has taken a revolutionary step forward: support for a 128K long-context window and a 2.2x improvement in inference efficiency open a new era of on-device AI.
Core content:
1. Mianbi Intelligence and Intel jointly release the MiniCPM 4.0 model, delivering system-level software-hardware sparsity innovation
2. MiniCPM 4.0 switches efficiently between long- and short-text attention modes, improving inference efficiency
3. Intel Core Ultra processors are fully adapted to MiniCPM 4.0 and deliver excellent performance
Mianbi Intelligence has officially released and open-sourced MiniCPM 4.0, the latest model in its "Mianbi Little Steel Cannon" on-device series, delivering system-level software-hardware sparsity innovation that can run efficiently on-device. Intel and Mianbi Intelligence have worked closely since the model-development stage, achieving several-fold inference-efficiency gains on both long and short texts, full Day-0 adaptation for on-device AI PCs, a 128K long-context window, and other breakthroughs.
The two parties carried out in-depth technical collaboration and customized the speculative-decoding configuration for Intel's hardware architecture. Through a hardware-aware draft-model optimization strategy, combined with the Intel Acceleration Kit and KV cache memory enhancements, end-to-end inference efficiency has been improved by 2.2x[1], bringing new model innovation and a new level of on-device performance to the industry.
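For readers unfamiliar with the technique, the sketch below shows the basic draft-and-verify loop of speculative decoding in plain PyTorch. It is a minimal illustration only: the model identifiers, the number of drafted tokens, and the greedy acceptance rule are assumptions for demonstration, not the customized configuration Intel and Mianbi deployed.

```python
# Minimal greedy speculative-decoding sketch (illustrative; not the tuned
# Intel/Mianbi configuration). A small draft model proposes a few tokens,
# and the large target model verifies them in a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "openbmb/MiniCPM4-0.5B"   # assumed draft model
TARGET_ID = "openbmb/MiniCPM4-8B"    # assumed target model

tok = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID, torch_dtype=torch.float16)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID, torch_dtype=torch.float16)

@torch.no_grad()
def speculative_generate(prompt, max_new_tokens=64, k=4):
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) Draft model proposes k tokens cheaply (greedy for simplicity).
        cand = ids
        for _ in range(k):
            nxt = draft(cand).logits[:, -1, :].argmax(-1, keepdim=True)
            cand = torch.cat([cand, nxt], dim=-1)
        proposed = cand[:, ids.shape[1]:]

        # 2) Target model scores all proposed tokens in ONE forward pass.
        logits = target(cand).logits
        verify = logits[:, ids.shape[1] - 1:-1, :].argmax(-1)  # target's choice at each drafted position

        # 3) Accept the longest agreeing prefix, then take one token from the target.
        n_accept = 0
        while n_accept < k and verify[0, n_accept] == proposed[0, n_accept]:
            n_accept += 1
        if n_accept == k:
            extra = logits[:, -1, :].argmax(-1, keepdim=True)   # bonus token after full acceptance
        else:
            extra = verify[:, n_accept:n_accept + 1]            # target's token at the first mismatch
        ids = torch.cat([ids, proposed[:, :n_accept], extra], dim=-1)
    return tok.decode(ids[0, prompt_len:], skip_special_tokens=True)
```

Because the target model validates up to k + 1 tokens per forward pass rather than one, the expensive 8B model runs far fewer times, which is where the end-to-end speed-up comes from.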
The MiniCPM 4.0 series of LLMs launched by Mianbi this time comes in two parameter sizes: 8B and 0.5B. To solve the problem that a single architecture struggles to serve both long- and short-text scenarios, MiniCPM 4.0-8B adopts an "efficient dual-frequency shifting" mechanism that automatically switches attention modes according to task characteristics: for difficult long-text and deep-thinking tasks, sparse attention is enabled to reduce computational cost, while short-text scenarios switch to dense attention to preserve accuracy, enabling efficient handling of both long and short texts.
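A rough sketch of this kind of dual-mode dispatch is shown below, assuming a simple sequence-length threshold and a sliding-window stand-in for the sparse path; MiniCPM 4.0's actual switching rule and sparse attention pattern are more sophisticated than this.

```python
import torch
import torch.nn.functional as F

SPARSE_THRESHOLD = 8192  # assumed switching point, not MiniCPM 4.0's actual rule

def dense_attention(q, k, v):
    # Standard scaled dot-product attention over the full key/value sequence.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def sliding_window_attention(q, k, v, window=1024):
    # Simple stand-in for the sparse path: each query sees only the last `window` keys.
    return dense_attention(q, k[..., -window:, :], v[..., -window:, :])

def attend(q, k, v):
    # Dual-mode dispatch: full dense attention for short contexts (accuracy),
    # a cheaper sparse pattern once the context grows long (cost).
    if k.shape[-2] <= SPARSE_THRESHOLD:
        return dense_attention(q, k, v)
    return sliding_window_attention(q, k, v)
```

The point is only that the attention kernel is chosen per request, so short prompts keep full-accuracy dense attention while long ones pay a much lower compute cost.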
Intel Core Ultra processors, with their three AI compute engines (CPU, GPU, and NPU), have already been adapted to the release and deliver optimized, excellent performance for the MiniCPM 4.0 series with the help of the OpenVINO™ toolkit. Intel once again provides Day-0 support on the NPU at model release, offering more diverse and targeted platform support for models of different parameter sizes and for different application scenarios.
* The above tests evaluate first-token latency and average throughput for a 1K-token input under int4 mixed-precision and fp16 precision settings. Each test is executed three times after a warm-up phase, and the average is reported. Performance results are based on the following SKU1 or SKU2 configurations.[2]
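For readers who want to try such a model on Core Ultra hardware, a minimal OpenVINO GenAI sketch looks roughly like the following; the model directory and device string are placeholders, and the export, quantization, and pipeline settings used in the benchmarks above are not reproduced here.

```python
# Minimal OpenVINO GenAI sketch: run an exported MiniCPM 4.0 model on one of
# Core Ultra's engines. Path and device are placeholders, not the tested SKUs.
import openvino_genai as ov_genai

MODEL_DIR = "./minicpm4-8b-int4-ov"  # assumed: a directory holding an OpenVINO-exported model
DEVICE = "NPU"                       # "CPU" and "GPU" are also valid targets on Core Ultra

pipe = ov_genai.LLMPipeline(MODEL_DIR, DEVICE)
print(pipe.generate("Summarize this document in three sentences.", max_new_tokens=256))
```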
Intel has also made new breakthroughs in long-context-window technology. Building on the block sparse attention mechanism, combined with deep operator fusion and hardware-driven algorithm optimization, it has sharply reduced the long-text KV cache footprint and further improved inference efficiency. While preserving output quality, the long context window has been extended to 128K for the first time on the Intel Arc™ Pro B60. Compared with the dense model, first-token latency is reduced by 38%[3] and token throughput is increased by up to 3.8x[3]. With this improvement, an entire Harry Potter novel of more than 300 pages can be read, analyzed, and summarized in about 90 seconds. This not only greatly improves the AI PC user experience but also lays a strong foundation for unlocking new on-device AI applications. Going forward, Intel will continue its in-depth cooperation and joint development with Mianbi to further improve the performance of long-context-window applications.
See the video demonstration for the effect of processing a 128K-token text input.
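The block-sparse idea behind this long-context gain can be sketched briefly: keys and values are grouped into fixed-size blocks, each block gets a cheap relevance score, and full attention is computed only against the top-scoring blocks. The block size, scoring rule, and top-k below are illustrative assumptions, not the fused kernels running on the Arc Pro B60.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=8):
    """Toy block-sparse attention for a single head (q, k, v: [seq, d]).
    Each query attends only to its top_k most relevant KV blocks, scored
    cheaply against each block's mean key, instead of to every key."""
    q_len, d = q.shape
    n_blocks = (k.shape[0] + block_size - 1) // block_size
    pad = n_blocks * block_size - k.shape[0]

    # Pad K/V so the sequence splits evenly into blocks, then reshape.
    # (Padding is not masked here for brevity; a real kernel would mask it.)
    k_blk = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    v_blk = F.pad(v, (0, 0, 0, pad)).view(n_blocks, block_size, d)

    # Cheap per-block relevance: query dot block-mean key.
    block_scores = q @ k_blk.mean(dim=1).T                          # [q_len, n_blocks]
    top = block_scores.topk(min(top_k, n_blocks), dim=-1).indices   # [q_len, top_k]

    out = torch.empty_like(q)
    for i in range(q_len):                                 # per-query loop for clarity, not speed
        k_sel = k_blk[top[i]].reshape(-1, d)               # gather only the selected blocks
        v_sel = v_blk[top[i]].reshape(-1, d)
        attn = F.softmax(q[i] @ k_sel.T / d ** 0.5, dim=-1)
        out[i] = attn @ v_sel
    return out
```

Because each query touches only top_k × block_size keys instead of the full 128K sequence, both the traffic on the KV cache and the attention compute shrink roughly in proportion, which is the source of the latency and throughput gains described above.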
In today's digital age, artificial intelligence technology is developing at an unprecedented pace. As a leading global technology company and the initiator and advocate of the AI PC, Intel has long been committed to promoting the innovative development of on-device AI models.
This cooperation not only demonstrates Intel's technical strength in AI but also reflects its firm commitment to the innovation ecosystem. By combining the technical advantages and resources of both parties, it lays a solid foundation for the wide application and deployment of the joint Intel platform and MiniCPM 4.0 solution, which is expected to play a key role in scenarios such as smart living and productivity.
Looking ahead, Intel will continue to work closely with Mianbi Intelligence, actively expanding the partnership to explore new frontiers of AI technology. Through continuous innovation, Intel is committed to promoting the adoption and development of artificial intelligence and to building a smarter, more efficient society.