Kimi launches MoBA: a breakthrough in achieving infinite context!

Kimi's MoBA technology breaks through conventional context-length limits and achieves nearly unlimited context processing.
Core content:
1. How Mixture of Experts and sparse attention make nearly unlimited context possible
2. The working mechanism and advantages of the MoBA model
3. The ability to seamlessly switch between sparse attention and full attention
Mixture of Experts and sparse attention make nearly unlimited context possible, enabling RAG pipelines and AI agents to ingest entire code bases and document collections without context restrictions.
The challenge of long-context attention
Transformers still face a heavy computational burden when sequences grow very long. Standard attention compares every token with every other token, so the cost is quadratic in sequence length: a one-million-token input implies on the order of 10^12 query-key comparisons. This overhead becomes a problem when reading entire code bases, multi-chapter documents, or large amounts of legal text.
MoBA
MoBA (Mixture of Block Attention) applies the Mixture of Experts idea to the attention mechanism. The model divides the input sequence into blocks, and a trainable gating function computes a relevance score between each query token and each block. Only the highest-scoring blocks take part in the attention computation, so a query never has to attend to every token in the full sequence.
Blocks are formed by splitting the sequence into equal-length spans. Each query token looks at an aggregated representation of the keys in each block (for example, their mean), ranks the blocks by relevance, and selects a few of them for detailed attention. The block containing the query is always selected, and causal masking ensures that tokens never see future information, preserving the left-to-right generation order.
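A minimal PyTorch sketch of this routing step, assuming equal-sized blocks, mean pooling, and single-head tensors; the function and argument names (route_queries_to_blocks, block_size, top_k) are illustrative rather than the official MoBA API:

import torch

def route_queries_to_blocks(q, k, block_size=512, top_k=3):
    # q, k: [seq_len, dim]; seq_len is assumed to be a multiple of block_size.
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size

    # One summary vector per block: mean-pool the keys inside each block.
    pooled_k = k.view(n_blocks, block_size, dim).mean(dim=1)      # [n_blocks, dim]

    # Gating scores: relevance of every block summary to every query.
    scores = q @ pooled_k.T / dim ** 0.5                          # [seq_len, n_blocks]

    # Causality: a query must not be routed to blocks that lie in its future.
    q_block = torch.arange(seq_len) // block_size                 # block index of each query
    future = torch.arange(n_blocks) > q_block.unsqueeze(1)        # [seq_len, n_blocks]
    scores = scores.masked_fill(future, float("-inf"))

    # The block containing the query is always selected.
    scores.scatter_(1, q_block.unsqueeze(1), float("inf"))

    # Keep the top-k highest-scoring blocks per query; drop slots that fell
    # on masked future blocks (possible for queries near the start).
    topk_scores, topk_idx = scores.topk(min(top_k, n_blocks), dim=-1)
    valid = topk_scores > float("-inf")
    return topk_idx, valid

# Example: route 2,048 queries over 256-token blocks, keeping 3 blocks per query.
# idx, valid = route_queries_to_blocks(torch.randn(2048, 64), torch.randn(2048, 64), 256, 3)

Forcing the query's own block and masking future blocks before the top-k keeps the routing strictly causal; the downstream attention step only needs to respect the returned validity mask.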
Seamlessly switch between sparse and full attention
MoBA replaces the standard attention mechanism without changing the number of parameters. It keeps the standard Transformer attention interface, so sparse and full attention can be swapped per layer or per training stage. Some layers may retain full attention for certain stages (such as supervised fine-tuning), while most layers use MoBA to reduce computational cost.
Because it is a drop-in replacement for the standard attention call, MoBA scales to larger Transformer stacks without architectural changes. The gating mechanism ensures that each query attends to only a small subset of blocks, and causality is handled by filtering out future blocks and applying a local causal mask within the query's own block.
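Because the interface is unchanged, mixing the two modes inside one stack is mostly a configuration choice. The sketch below shows this wiring under simplifying assumptions; SelfAttention, build_layers, and the stand-in attention functions are hypothetical names, and a real moba_attention would implement the block routing described above:

import torch.nn as nn
import torch.nn.functional as F

def full_attention(q, k, v):
    # Standard causal attention over the whole sequence.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def moba_attention(q, k, v):
    # Placeholder: a real implementation would perform the block routing and
    # sparse attention described above, behind the exact same signature.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

class SelfAttention(nn.Module):
    # Multi-head self-attention whose inner attention function is pluggable.
    def __init__(self, dim, n_heads, attn_fn=full_attention):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.n_heads = n_heads
        self.attn_fn = attn_fn

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                   for z in (q, k, v))
        out = self.attn_fn(q, k, v)                               # [b, heads, t, head_dim]
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

def build_layers(n_layers=24, dim=1024, n_heads=16, full_attention_last=3):
    # Most layers run MoBA; the last few keep full attention (e.g. during SFT).
    return nn.ModuleList(
        SelfAttention(dim, n_heads,
                      attn_fn=moba_attention if i < n_layers - full_attention_last
                      else full_attention)
        for i in range(n_layers)
    )

Switching a layer back to full attention is then just a matter of swapping attn_fn, with no change to the weights.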
In effect, each query is routed to the key/value pairs of only a few "expert" blocks instead of the entire sequence. The gating mechanism assigns each query to its most relevant blocks, reducing the complexity of the attention computation from quadratic to sub-quadratic.
Concretely, the gate computes a relevance score between each query and the condensed representation of each block, then selects the top-k highest-scoring blocks for that query, no matter how far away those blocks are in the sequence.
Since only a few blocks are processed per query, the computation remains sub-quadratic, yet the model can still reach tokens far from the current block whenever the gating score indicates high relevance.
PyTorch Implementation
The reference implementation partitions the keys and values into blocks, computes a mean-pooled representation of each block, and obtains the gating scores S by multiplying the queries Q with the pooled representations.
It then applies a causal mask so that queries cannot attend to future blocks, uses a top-k operator to select the most relevant blocks for each query, and reorganizes the data for efficient attention computation.
FlashAttention is applied separately to the self-attention over the query's current block and to the blocks selected by MoBA, and the two outputs are finally merged with an online softmax.
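The original post presents this implementation as a figure, so here is a hedged, self-contained re-sketch of the same steps in plain PyTorch. It substitutes torch.nn.functional.scaled_dot_product_attention for FlashAttention, builds a dense mask instead of gathering per-query blocks, and folds the two branches into one masked softmax, which is mathematically equivalent to merging them with an online softmax; names and defaults are illustrative.

import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=512, top_k=3):
    # Single-head sketch. q, k, v: [seq_len, dim]; seq_len divisible by block_size.
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    pos = torch.arange(seq_len)
    q_block = pos // block_size                                   # block index of every position

    # 1) Gating scores S = Q times the mean-pooled key of each block.
    pooled_k = k.view(n_blocks, block_size, dim).mean(dim=1)      # [n_blocks, dim]
    scores = q @ pooled_k.T / dim ** 0.5                          # [seq_len, n_blocks]

    # 2) Causal gating: never route a query to a future block; always keep its own block.
    scores = scores.masked_fill(torch.arange(n_blocks) > q_block.unsqueeze(1), float("-inf"))
    scores.scatter_(1, q_block.unsqueeze(1), float("inf"))

    # 3) Top-k block selection per query, as a boolean routing table.
    topk_scores, topk_idx = scores.topk(min(top_k, n_blocks), dim=-1)
    routed = torch.zeros_like(scores, dtype=torch.bool)           # [seq_len, n_blocks]
    routed.scatter_(1, topk_idx, topk_scores > float("-inf"))

    # 4) Expand block-level routing to a token-level mask and add the local
    #    causal mask inside the query's own block.
    attn_mask = routed[:, q_block]                                # [seq_len, seq_len]
    same_block = q_block.unsqueeze(1) == q_block.unsqueeze(0)
    causal = pos.unsqueeze(0) <= pos.unsqueeze(1)                 # key index <= query index
    attn_mask = attn_mask & (~same_block | causal)

    # 5) One masked attention over the union of the current block and the selected
    #    past blocks. This equals running FlashAttention on the two branches and
    #    merging with an online softmax; the kernel version just never materializes
    #    this seq x seq mask.
    out = F.scaled_dot_product_attention(q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0),
                                         attn_mask=attn_mask.unsqueeze(0))
    return out.squeeze(0)

For readability this sketch materializes the full seq x seq mask, so it demonstrates the logic rather than the memory savings; the implementation described in the post instead gathers the selected blocks' keys and values per query and runs variable-length FlashAttention on them.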
The end result is a sparse attention mechanism that preserves causal structure and captures long-range dependencies while avoiding the full quadratic computational cost of standard attention.
This code combines mixture-of-experts logic with sparse attention so that each query only focuses on a few blocks.
The gating mechanism scores each block against the query and selects the top k “experts”, thus reducing the number of key/value comparisons.
This keeps the computational cost of attention sub-quadratic, enabling extremely long inputs to be processed without a prohibitive compute or memory burden.
At the same time, the gating mechanism ensures that queries can still focus on distant tokens when necessary, thereby retaining the Transformer's ability to process global context.
This block- and gating-based strategy is how MoBA achieves nearly infinite context in LLMs.
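A rough back-of-the-envelope count makes the savings concrete. The numbers below are illustrative choices, not settings reported in the paper:

# Approximate query-key interaction counts for one attention layer (single head).
seq_len    = 1_000_000          # illustrative context length
block_size = 4_096              # illustrative MoBA block size
top_k      = 3                  # illustrative number of routed blocks per query

full_attention = seq_len * seq_len                    # every query scores every key
moba_attention = seq_len * top_k * block_size         # every query scores only top_k blocks
gating         = seq_len * (seq_len // block_size)    # query vs. pooled block summaries

print(f"full attention: {full_attention:.2e} interactions")
print(f"MoBA          : {moba_attention + gating:.2e} interactions")
print(f"reduction     : ~{full_attention / (moba_attention + gating):.0f}x fewer")

In this setting the interaction count drops by roughly two orders of magnitude; the measured end-to-end speedups quoted below are smaller because gating, memory traffic, and the rest of the model are unaffected.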
Experimental Observations
Models using MoBA are nearly on par with full attention in language-modeling loss and downstream task performance, and the results remain consistent at context lengths of hundreds of thousands or even millions of tokens. "Tail token" evaluations confirm that important long-range dependencies are still captured when the gate routes the query to the relevant blocks.
Scalability tests show that the cost curve is sub-quadratic; the researchers report speedups of up to roughly six times at one million tokens, with larger gains at even longer contexts.
MoBA also remains memory-friendly: it never materializes a full attention matrix and relies on standard GPU kernels for its block-based computations.
Final Observation
MoBA reduces attention overhead through a simple idea: let the query learn which blocks are important and ignore all other blocks.
It retains the standard softmax-based attention interface and avoids imposing rigid local attention patterns, so many large language models can integrate it in a plug-and-play manner.
This makes MoBA very attractive for workloads that need to handle extremely long contexts, such as scanning entire code bases or summarizing huge documents, without requiring major modifications to pre-trained weights or incurring large retraining overhead.