A concept for Full-Stack RL across the LLM model layer and application layer

Written by Silas Grey
Updated on: July 14, 2025
Recommendation

Explore the application prospects of full-stack reinforcement learning in large language models.

Core content:
1. Incorporating the application-layer workflow into the RL process
2. LLM context-length limits and optimization strategies
3. The possibilities and challenges of making attention-structure design part of RL optimization


The main text of this article is very short, but the specific scheme that follows it is not.


Everyone already knows that RL in the post-training stage learns directly from the reward. OpenAI o1 and DeepSeek R1 have both demonstrated this for us, so I won’t go into details here.

Moving up from the API to the application layer, the workflow itself can also be folded into the RL process, learning directly from the business reward obtained after multiple rounds. Of course, if multiple workflow nodes can share the same LLM, memory usage during RFT is smaller and there is no need to maintain several models.
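To make this loop concrete, here is a minimal, purely illustrative Python sketch: a two-node workflow in which both nodes call the same (stubbed) LLM, and only the end-of-workflow business reward drives an update. Every name here (shared_llm, business_reward, the toy "parameter") is an assumption for illustration, not a real RFT implementation.

```python
# Minimal sketch of folding a multi-node workflow into one RL episode.
# All names (shared_llm, business_reward, ...) are illustrative stubs,
# not an existing API; the point is the shape of the loop, not the details.
import random

def shared_llm(prompt: str, params: dict) -> str:
    """Stub for the single LLM shared by every workflow node during RFT."""
    temperature = params["temperature"]
    return f"response(t={temperature:.2f}) to: {prompt[:30]}"

def business_reward(final_output: str) -> float:
    """Stub for the delayed business-level reward observed after the last node."""
    return random.random()

def run_workflow_episode(user_query: str, params: dict) -> float:
    # Node 1: planning / query rewriting; Node 2: answering.
    # Both nodes call the *same* model, so only one set of weights is kept in memory.
    plan = shared_llm(f"Plan for: {user_query}", params)
    answer = shared_llm(f"Answer using plan: {plan}", params)
    return business_reward(answer)

def train(num_episodes: int = 100) -> dict:
    params = {"temperature": 1.0}  # stand-in for trainable weights
    baseline = 0.0
    for _ in range(num_episodes):
        reward = run_workflow_episode("example user query", params)
        # Stand-in for a policy-gradient / RFT update driven only by the
        # end-of-workflow reward; a real setup would update LLM weights here.
        baseline = 0.9 * baseline + 0.1 * reward
        params["temperature"] += 0.01 * (reward - baseline)
    return params

if __name__ == "__main__":
    print(train())
```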

Considering that most current LLM contexts are not particularly long, the tool-call and RAG processes seem to need optimization. For example, ODR products visit a large number of web pages, but are all of those pages really put into the context regardless of whether they are useful? o3's context should be more than 200k, but is it really used that way? It does not seem so. Is there a separate web-page filtering component that recalls the relevant content from the retrieved pages and puts only that into the context? How should such a component be implemented? It can be exposed as a tool, but the tool's implementation still needs optimizing, and it seems it too should be folded into the overall RL process.
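As one illustration of what such a filtering component could look like as a tool, here is a hedged sketch: fetched pages are split into passages, scored against the query with a deliberately trivial lexical-overlap scorer (a stand-in for the learned component that would itself sit inside the RL loop), and only the top passages within a word budget are sent into the context. The function names and the budget are assumptions.

```python
# Sketch of a hypothetical web-page filtering tool: instead of dumping every
# fetched page into the context, score passages against the query and keep
# only the top candidates within a budget. The overlap scorer below is a
# trivial stand-in for a learned retriever that would itself be optimized
# inside the full-stack RL loop.

def split_into_passages(page: str, size: int = 200) -> list[str]:
    words = page.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def filter_pages(query: str, pages: list[str], budget_words: int = 2000) -> str:
    candidates = []
    for page in pages:
        for passage in split_into_passages(page):
            candidates.append((score(query, passage), passage))
    candidates.sort(key=lambda x: x[0], reverse=True)

    kept, used = [], 0
    for _, passage in candidates:
        n = len(passage.split())
        if used + n > budget_words:
            break
        kept.append(passage)
        used += n
    return "\n\n".join(kept)  # goes into the LLM context instead of raw pages
```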

Conversely, consider whether RL can reach deeper into the model, rather than only optimizing the LLM's parameters as a whole.

In terms of model architecture, there are many ongoing attempts at sparse attention, linear attention, hybrid attention schemes, and so on; DeepSeek, MiniMax, and Moonshot have each submitted a different answer. However, the design of these attention structures still feels largely like gut-feel guesswork. Could the attention structure itself become part of what RL optimizes? There seems to be no reason why not. The catch is that RL currently lives in the post-training stage; it is not being done during pretraining, and the attention structure has to be fixed at pretraining time. Still, it seems possible to design an attention structure that is trainable in the pretraining stage, remains trainable in post-training, and can significantly change its recall strategy there. In that way it can be optimized directly for the final reward through the RL process and folded into the Full-Stack RL process.

The new idea of this article is to bring the structural optimization of attention into an RL process optimized for business rewards, so as to take the alchemy-like guesswork out of attention-structure design.

A brief introduction to the model layer solution

To show that the idea is indeed feasible, here is a more concrete design. Of course, LLM model-layer structural design is now a highly specialized field that has to balance model quality against hardware characteristics; a scheme sketched casually will not be fully workable as-is. So treat this only as a starting point, meant to offer some inspiration.

This design targets the following scenario: long context (>200k tokens) on an MoE architecture.

It is generally believed that the context an LLM relies on during decoding is largely local; that is, adjacent tokens will probably use similar context. Coarse-grained context-block recall takes advantage of this property, and DeepSeek's NSA (Native Sparse Attention) scheme includes such a coarse-grained block-recall mechanism.
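For intuition, here is a small numpy sketch of coarse-grained block recall in this spirit. The block size, pooling method, and top-k are placeholder choices, not NSA's actual configuration: each key block is pooled into one summary vector, summaries are scored against the current query, and fine-grained attention runs only inside the highest-scoring blocks.

```python
# Minimal numpy sketch of coarse-grained context-block recall: pool each key
# block into one summary vector, score summaries against the current query,
# and attend only inside the highest-scoring blocks.
import numpy as np

def block_recall(q, keys, values, block_size=64, top_blocks=4):
    n, d = keys.shape
    n_blocks = (n + block_size - 1) // block_size

    # One mean-pooled summary key per block (coarse granularity).
    summaries = np.stack([
        keys[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])
    block_scores = summaries @ q                      # (n_blocks,)
    chosen = np.argsort(block_scores)[-top_blocks:]   # indices of recalled blocks

    # Fine-grained attention restricted to the recalled blocks.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n)) for b in chosen
    ])
    scores = keys[idx] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out = block_recall(rng.normal(size=128), rng.normal(size=(1024, 128)),
                       rng.normal(size=(1024, 128)))
    print(out.shape)  # (128,)
```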

Now consider the MoE architecture. Currently the experts in MoE are not organized around "complete semantic units"; routing is done per token. That is, a small piece of complete semantics or capability may be split across multiple experts at a finer granularity, and different experts must be recalled at different token positions. From the perspective of human cognition, however, it seems more natural to aggregate the same capability into a few experts; this would strengthen the locality of the recalled expert candidate set during token generation and reduce expert switching. Such expert locality should be encouraged (optimized for) during training.
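One possible way to "motivate" this locality during training, offered as an assumption rather than a published recipe, is an auxiliary penalty on how much the routing distribution changes between adjacent tokens, added to the usual task and load-balancing losses:

```python
# A candidate expert-locality penalty (an assumption, not a published recipe):
# discourage the router from switching experts between adjacent tokens, so the
# recalled expert set stays local while the main task loss still drives
# routing quality.
import torch
import torch.nn.functional as F

def routing_locality_penalty(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (seq_len, num_experts) pre-softmax routing scores."""
    probs = F.softmax(router_logits, dim=-1)
    # L2 distance between the routing distributions of adjacent tokens.
    return ((probs[1:] - probs[:-1]) ** 2).sum(dim=-1).mean()

# Usage inside a training step (task_loss comes from the usual LM objective):
# loss = task_loss + locality_weight * routing_locality_penalty(router_logits)
```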

We then have a degree of locality during decoding at both the context-block level and the expert level, which should reduce the number of context recalls and expert recalls in the decoding process. Of course, as the NSA scheme reflects, more than one context-recall method is needed to cover different scenarios; this article only discusses how to recall the relevant parts of a very long context.

If we only consider RL, there are plenty of design options in the post-training stage. But this is hard to achieve without the pretraining stage, and the enormous training cost of pretraining does not allow us to casually adopt brute-force RL methods there.

Based on the above considerations, a two-level recall range can be designed for both context-block and expert recall. The first level of recall works the same as current practice and serves the token currently being decoded. A second level of recall range is then introduced, with the training goal of recalling the context blocks and experts that the next W tokens will need. During inference, both levels participate in the computation. If a candidate element in the second level scores above the threshold for entering the first level, it means the sequence being generated is drifting away from the current local neighbourhood, and a larger-range recall computation should be re-triggered when the next token is computed. With only two levels of recall, this re-computation is a global one.
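The control flow might look like the following sketch, where the set sizes, the admission threshold, and the scoring interface are all placeholder choices:

```python
# Sketch of the two-level recall control flow described above. Level 1 is the
# usual per-token top-k; level 2 is a wider candidate pool meant to cover the
# next W tokens; when a level-2 candidate scores high enough to belong in
# level 1, the next step triggers a full (global) re-recall.
import numpy as np

class TwoLevelRecall:
    def __init__(self, k1=8, k2=32, threshold=0.5):
        self.k1, self.k2, self.threshold = k1, k2, threshold
        self.level1 = None   # block indices used for the current token
        self.level2 = None   # wider pool expected to cover the next W tokens

    def global_recall(self, scores: np.ndarray) -> None:
        order = np.argsort(scores)[::-1]
        self.level1 = order[: self.k1]
        self.level2 = order[self.k1: self.k1 + self.k2]

    def step(self, scores: np.ndarray) -> tuple[np.ndarray, bool]:
        """scores: relevance of every context block to the current query."""
        if self.level1 is None:
            self.global_recall(scores)
        # Decoding is drifting from the local neighbourhood when a level-2
        # candidate now scores above the admission threshold for level 1.
        refresh = bool((scores[self.level2] > self.threshold).any())
        active = self.level1          # the current token still uses level 1
        if refresh:
            self.global_recall(scores)  # with only two levels, this is global
        return active, refresh
```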

The first-level recall range can be selected using the ideas of existing schemes such as NSA and MoE routing. The key question is how to dynamically select the newly introduced second-level recall range. This selection component must be parameterized, and its learning target can be the union of the elements that the next W tokens will need to recall. However, while computing any single token, the recalls at the other positions are not yet known, so a new step has to be added after each full pass over the sequence to aggregate the recalls at those positions into the fitting target of the second-level recall at the current position.
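That extra step could be as simple as the following sketch, which builds the level-2 target at each position as the union of the blocks actually recalled at level 1 by the next W tokens (W and the multi-hot encoding are assumptions):

```python
# Sketch of assembling the level-2 fitting target after a full pass over the
# sequence: the target at position t is the union of the blocks actually
# recalled at level 1 by the next W tokens, as a multi-label vector.
import numpy as np

def level2_targets(level1_recalls: list[np.ndarray], n_blocks: int, W: int = 16):
    """level1_recalls[t]: block indices recalled at level 1 for token t."""
    seq_len = len(level1_recalls)
    targets = np.zeros((seq_len, n_blocks), dtype=np.float32)
    for t in range(seq_len):
        future = level1_recalls[t + 1: t + 1 + W]
        for blocks in future:           # union over the next W tokens
            targets[t, blocks] = 1.0
    return targets  # multi-label target for the level-2 selector
```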

Although this adds a separate pass, it should still be possible to keep pretraining efficient. I will not attempt a specific estimate of the added overhead; that should be calculated properly by a model-layer team.

Although the pretraining process above has a bit of an RL flavour, it is still supervised learning and involves no optimization against delayed rewards. In the post-training and RFT stages, however, RL can be applied to the first-level recall selector, the second-level recall selector, and the component that decides when to trigger the next round of full recall, all against the target reward.
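As a sketch of what that post-training step might look like, the discrete recall and trigger decisions can be treated as stochastic policies and updated with a REINFORCE-style gradient on the delayed reward. The argument names and the single scalar baseline below are assumptions, not a reference implementation.

```python
# Sketch of the post-training step: the level-1 selector, the level-2 selector
# and the refresh trigger are stochastic policies; update them with a
# REINFORCE-style gradient on the delayed task reward.
import torch

def rl_update(episode_log_probs, reward, baseline, optimizer):
    """
    episode_log_probs: list of log-probability tensors, one for every discrete
        recall / trigger decision taken while generating one full response.
    reward: scalar business / task reward observed at the end of the episode.
    baseline: running scalar baseline used to reduce gradient variance.
    """
    advantage = reward - baseline
    loss = -advantage * torch.stack(episode_log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```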

In this way, besides its fixed parameters, the LLM gains components that can be optimized directly by RL, and the computation cost is reduced as well. These effects and benefits cannot be achieved by simply optimizing the LLM's parameters directly.

I hope this can provide readers with some inspiration on model structure design.

Related Materials

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

https://arxiv.org/abs/2502.11089