Interview with Cursor Core Members: Our Key Judgments on AI Programming

Written by
Silas Grey
Updated on: June 8, 2025

Recommendation
In-depth analysis of the latest progress and challenges in the field of AI programming.

Core content:

1. Key issues and challenges faced by AI programming agents

2. Programming model bottlenecks and feedback mechanism design

3. Application and difficulties of reinforcement learning in the field of programming

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

In the past two days, I listened to an episode of the Cursor team's podcast, which was very dense with information and well worth listening to.

Several core members of the team focused on the theme of "AI programming" and shared their frontline observations and thinking from both product and research, covering topics ranging from model training and toolchain design to feedback mechanisms, memory systems, and long-context handling. It touches on almost all of the key problems AI programming agents face today.

For me, one of the most inspiring points in this episode is their observation that the bottleneck for current programming models is not only model capability itself, but also the design of the "feedback mechanism." For example, evals and benchmark testing are mentioned all the time, but without real, effective feedback signals (e.g., did the user keep the code, did the user accept the model's suggestions), it's hard for the model to really learn what counts as a "good" change.

The other point is that they have a lot of new insights into how RL is used for programming. Unlike math or writing, programming tasks often involve multiple rounds of tool calls and code iterations, and rewards are sparse and hard to define. You can't just look at "did it pass the test"; you also have to consider code structure, readability, elegance, and even whether the model has "gamed" its way to a passing test.

The conversation covered a lot of technical detail, such as GRPO, NSA attention, and document-level context caching, but it was thorough and frank. After reading it, you can basically make sense of the key challenges the industry faces in AI programming today and the directions it may evolve in.

This conversation gave us a glimpse into the Cursor team's cutting-edge exploration and deep thinking in the field of AI programming.

 

The guests in the picture above, from left to right, are: Sualeh Asif, co-founder and CPO; Charlie Snell, researcher; Aman Sanger, co-founder; Federico Cassano, researcher; and Jacob Jackson, researcher.

Below is the full translation, which my team and I prepared with the help of translation tools. I hope it brings you some inspiration as well.

#01

What is the difference between training programming models with reinforcement learning and training other models?

Sualeh Asif (CPO): Let's start today with Reinforcement Learning (RL). Here's an interesting question: what's the difference between training programming models with RL and training them for other domains (like math, or even domains like writing, where the evaluation criteria are not as clear)? What exactly is different about programming models?

Federico Cassano (Researcher): First of all, at each step you can write all kinds of different statements and call different tools. Take math as an example: reasoning works well there because the final answer is short. The reasoning process can have many steps, but the final answer itself is only a few lines. Not so with programming, where the reasoning is effectively contained in the final answer.

Charlie Snell (Researcher): Yeah, and there's also the fact that, in order to get the answer, you need to call multiple tools. So it's not like math, where you "generate the reasoning process, generate the answer, get the reward," and that's the end of it. Programming has to "generate some code, call some tools, receive feedback from the tools," sometimes over multiple iterations. In other words, RL becomes a process of multi-step tool invocation and optimization.

Jacob Jackson (Researcher): And what's particularly interesting about RL for us is that the model produces results, and a lot of the time we have no way to tell whether it actually solved the user's problem or met the user's needs. For math or programming problems, you can compare against a standard answer. But in some scenarios the user won't explicitly tell us whether "the result is right or not."

Aman Sanger (Co-Founder): What do you think is going to happen in areas like writing? Is it not going to use RL at all, or will it have to rely on pre-training with the base model? Do you think RL has the potential to make writing models better?

Jacob Jackson (Researcher): Post-training of large language models nowadays usually makes the models write in a more rigid and unnatural way. But I don't think that's a limitation of the model itself, just because that's how it's taught during training.

Federico Cassano (Researcher): Yeah, why can't you train the model to predict the "next chapter" instead of just predicting the next word each time? You would change the training objective so the model starts predicting whole passages. For example, give it the current chapter of a book, have it predict the entire next chapter, and then use some kind of similarity metric to evaluate how closely the generated chapter resembles the real one.

Aman Sanger (Co-Founder): So this takes the original setup for writing, where the model only predicts "the next word," and changes it to having the model generate an entire chapter, then scoring the model by how similar the generated chapter is to the real one.

Charlie Snell (Researcher): It would be better to use "semantic rewards".

Federico Cassano (Researcher): Because the "next word prediction" objective, as it stands, doesn't quite express what we really want: we actually want the model to generate whole passages of high-quality content.

Aman Sanger (Co-Founder): There are actually two things here: one is letting the model "think" with more compute before generating the next word; the other is that it doesn't necessarily have to predict the next word exactly, but rather generate a whole passage that is similar to the target section.
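
As a concrete illustration of the "semantic reward" idea above, here is a minimal sketch that scores a generated chapter by its embedding similarity to the real next chapter. The sentence-transformers library and model name are illustrative choices for the sketch, not something the speakers say they use.

```python
# Minimal sketch of a "semantic reward" for chapter-level generation.
# Library and model choice are assumptions made for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_reward(generated_chapter: str, reference_chapter: str) -> float:
    """Reward a generated chapter by cosine similarity to the real next chapter."""
    gen, ref = embedder.encode([generated_chapter, reference_chapter])
    return float(np.dot(gen, ref) / (np.linalg.norm(gen) * np.linalg.norm(ref)))

# The scalar similarity (roughly in [-1, 1]) replaces per-token next-word
# likelihood as the training signal for whole-passage generation.
```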

Jacob Jackson (Researcher): The hard thing about writing is that whether the final content is good is largely a matter of "taste," rather than there being a standard answer as in programming. Programming is mostly about whether the code runs; in writing, even very experienced people can disagree about what is good.

Aman Sanger (Co-Founder): But don't you think there's actually an element of "taste" in programming too? For example, code can pass the tests yet still be hard to optimize or extend later.

Jacob Jackson (Researcher): Yes, code quality is really a challenge.

Federico Cassano (Researcher): But "passing the test" is also sometimes unreliable, because some models just game the test to make it pass and do something that has nothing to do with the task at all. That is, the model may pass the test in some completely unrelated way and still get a high reward.

Aman Sanger (Co-Founder): It really depends on how well you design your tests.

Jacob Jackson (Researcher): Regarding code quality, what we actually want is "elegant code": not redundant, as short as possible, describable as succinctly as possible. Like in math, the prettiest proofs are usually the most concise. It's not exactly the same, but there are similarities.

Aman Sanger (Co-Founder): When optimizing code, do you prefer to streamline and cut out unnecessary parts to make the code more concise?

Jacob Jackson (Researcher): It feels great when a code merge cuts out unnecessary code, removing 100 lines in one go without affecting the original functionality.

Aman Sanger (Co-Founder): Yeah, that's right.

#02

How do you reward models?

Sualeh Asif (CPO): So in general, what's a "good" reward? We're actually trying a lot of different rewards right now to train RL models.

Aman Sanger (Co-Founder): What are some of the ideas you all have? What are the advantages of using tests as rewards?

Charlie Snell (Researcher): Tests are actually pretty close to a "standard answer," and like I said earlier, when tests don't cover enough, the model may exploit loopholes without really solving the problem. But if the tests are well designed, they can serve as a near-ground-truth signal for whether the code works, you can optimize against that signal with RL for a long time, and the model learns interesting behaviors.

But not everything can be measured with tests, so we may have to consider other kinds of rewards. For example, you could use the code changes users actually make as a feedback signal when training the model. A new feature may have multiple valid implementations, so this isn't an absolute standard, but it can serve as supporting evidence for validating the model's output.

Aman Sanger (Co-Founder): Another problem with using tests as rewards is that the rewards are too sparse. You can run many attempts with only one pass, and the reward is just a binary "passed/failed" signal.

Charlie Snell (Researcher): Yeah, and that can make training expensive. Even if you have a lot of servers and let the model run many more times, the chance of ending up with useful feedback is still very small, and training like this costs both time and money.

Federico Cassano (Researcher): Because the model doesn't actually see the "reward" itself; it gets the "advantage," which is the reward relative to the other samples in the same batch.
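
To make the "advantage" idea concrete, here is a minimal GRPO-style sketch (illustrative only, not Cursor's training code): each rollout's reward is normalized against the other rollouts sampled for the same prompt.

```python
import numpy as np

def group_relative_advantage(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward against the mean
    and standard deviation of all rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 rollouts for one prompt, only one passes the tests (sparse reward).
print(group_relative_advantage([0, 0, 0, 1, 0, 0, 0, 0]))
# The single passing rollout gets a large positive advantage and the rest go
# slightly negative. If every rollout fails, all advantages are zero and the
# batch contributes no learning signal, which is exactly the sparsity problem.
```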

Aman Sanger (Co-Founder): That's interesting. If you run tests on the whole PR, the feedback is sparse and the task is hard; the model struggles to pass all the tests right now. Wouldn't it be an improvement if you could break the PR into smaller parts, run the tests individually, and get feedback faster?

Charlie Snell (Researcher): Yes, I think it would. If the task is so hard that the model only gets it right once every 1,000 samples, the reward signal is going to be extremely sparse. A success rate of 1/100 or higher is workable, but at something like 1/1000 it's time to think about splitting the task further.

Aman Sanger (Co-Founder): Do you think we're in this "rewards are too sparse" phase?

Jacob Jackson (Researcher): Yes, in some cases it really is necessary to reduce reward sparsity by splitting the task into smaller parts so the model can get each part right.

Essentially, what we want is for the model to make code changes that function exactly like the "standard answer" and are as clean as possible. The difficulty is that this is a hard goal in itself; even determining whether a candidate answer is up to par is almost as hard as completing the entire task.

So it's hard to do, but it might be better if there were a way to work in that direction.

#03

Equipping the model with tools

Sualeh Asif (CPO): What tools do you find most interesting? Different labs equip their models with different tools. For example, o3 has done a lot of optimization specifically for the terminal; it's basically the main tool that model uses, and it prefers the terminal for everything instead of other tools.

The Claude models may prefer "search" and "edit." What other interesting tools do you think are out there for equipping models, beyond these traditional ones?

Aman Sanger (Co-Founder): I think it's actually possible to do better than the "core toolset." The reason the terminal has an advantage is that it's simple: you don't need to set up a complex runtime environment for the agent, you just give it shell (command line) access, and it can do everything.

Simplicity is actually the biggest advantage. For example, a linter provides a lot of useful signal, but it's a pain to use because you have to get a language server running, and that's hard to do on arbitrary code.

But editors like Cursor are interesting because they come with extensions and language servers that users are willing to configure themselves, so you can get the linter's signals. We also have tools like semantic search. Personally, I don't think semantic search necessarily gets you more code than multi-hop grep search, but it gets there faster, uses fewer resources, and is more efficient.

Sualeh Asif (CPO): So maybe the question is: do you want to improve the quality of the tools, or do you value simplicity and ease of use more? You can go for minimalist tools, like the terminal, or you can gradually move up to more sophisticated tools. How do you think about this trade-off?

Federico Cassano (Researcher): There's actually an interesting approach: use tools to help models "manage" their own behavior. For example, we found that many reasoning models easily fall into overthinking; even when a task doesn't need complex reasoning at all, they insist on thinking for a while.

So you can give the model a "thinking tool," so that the model itself decides whether the current task needs reasoning, and if it does, it calls this tool.
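
Here is a minimal sketch of what such a "thinking tool" could look like in an agent's tool list; the schema, names, and wiring are assumptions made for illustration, not Cursor's actual definition. The point is that the model only pays the reasoning cost when it explicitly decides a task needs it.

```python
# Illustrative "think" tool in an OpenAI-style function-calling schema.
THINK_TOOL = {
    "type": "function",
    "function": {
        "name": "think",
        "description": (
            "Call this only when the task needs multi-step reasoning. "
            "Write your reasoning as scratch space; it is not shown to the user."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "thought": {"type": "string", "description": "Private reasoning notes."}
            },
            "required": ["thought"],
        },
    },
}

def handle_tool_call(name: str, args: dict) -> str:
    # The tool body is a no-op: its value is letting the model choose when to reason.
    if name == "think":
        return "ok"
    raise ValueError(f"unknown tool: {name}")
```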

Aman Sanger (Co-Founder): Yeah, I've always found the interaction between reasoning models and agent tools a bit strange. In the case of o3 (probably because I don't use it enough), I've found that it often "thinks" and calls a tool right after you send a message, before you've received any content.

Sualeh Asif (CPO): Do you mean that you think the model should "think" after every call to the tool?

Aman Sanger (Co-Founder): Not every time. Actually, what was the original reason people trained reasoning models? I think the earliest versions of o1 were trained on algorithm competitions and math problems.

The idea is to produce a good final answer, whether it's shown to the user or handed to the system for verification. Before that, you let the model spend a lot of tokens "thinking" on its own.

I'm curious: after the agent performs a sequence of operations, does the final output shown to the user need to be verified separately? Sometimes the user just wants the model to modify the code, such as editing a file, that is, changing it directly with a tool. Is a separate "reasoning process" even needed in that case? If the model can just edit directly, does training need to add a separate reasoning stage at all?

Jacob Jackson (Researcher): Another interesting tool we're thinking about right now is to go and analyze PRs (Pull Requests) and what team members have been doing lately.

You can think of the model as a very capable engineer who's only on his third day on the job: he spent the first two days getting a quick overview of the codebase, and on the third day he starts getting his hands dirty. In that situation, you would definitely start by looking at what code your coworkers are changing and why.

Currently, the model still mainly looks at large chunks of code and then finds relevant content, which matches the source of its training data and is an important part of it. But if we can get the model to look at PRs, we might get something new.

Sualeh Asif (CPO): What do you think about the relationship between code and long context? There's an argument now that long context is important. For models like Sonnet or GPT-4, 8K tokens is a bit too little; you probably need at least 50K to 60K tokens of context to make it work. Do you think that the longer the context, the better RL works? How do you understand the relationship between the two?

Jacob Jackson (Researcher): The trend is indeed toward longer and longer contexts, and the attention mechanism is very powerful at exploiting long contexts. But that is also getting more and more costly. Technically, an interesting direction is how to control cost with long contexts, for example how to make cached context reusable across multiple rounds of dialogue. These new models keep getting more capable, but without controlling context in a smart way, the cost can be very high.

When you're working in a specialized codebase, the amount of contextual information relevant to what you're trying to do is huge; code is quite particular in that respect. By contrast, for models like ChatGPT and Claude, ordinary users provide very little context, at most a simple question of maybe a hundred tokens.

In other words, the main task of those models is to learn human knowledge in advance and answer questions directly, rather than being designed to handle very long inputs, because ordinary users can't supply inputs that long.

Sualeh Asif (CPO): Do you think we will need 1 million, 10 million, or 100 million Token contexts in the future?

Jacob Jackson (Researcher): I think the longer the better, but there are diminishing returns. The common practice now is to have the model "retrieve" information when it needs it, rather than stuffing it all in at the beginning.

It's not the only way, but it works. In the future, a more reasonable direction may be a combination of mechanisms: for example, a mechanism that can digest 100 million tokens in one pass, where each token contributes little information but the model gets a general understanding of the codebase. Then, when you actually want to do something, the model remembers which content is relevant and goes back to focus on it and refresh its memory. That approach may be the most promising.

#04

Attention Mechanisms

Sualeh Asif (CPO): What do you think of these new architectures lately? It used to be sliding-window attention; now more sophisticated attention mechanisms, like those in Llama, are emerging.

Aman Sanger (Co-Founder): NSA (DeepSeek's Native Sparse Attention) is really elegant.

Sualeh Asif (CPO): Right? What do you think about the NSA?

Aman Sanger (Co-Founder): It might not actually be accurate to say elegant.

Sualeh Asif (CPO): For example, DeepSeek's attention mechanism has been publicly published.

Aman Sanger (Co-Founder): Hopefully, they will use it for their next model. This attention mechanism is very scalable and is said to work better than traditional attention.

The core idea is to split attention into three parts. One part is sliding-window attention, which focuses on the most recent content, such as the last 4,000 tokens. The remaining two parts are block-based attention.

It saves the keys and values of groups of tokens as "blocks," and the query attends over these blocks. Then it picks the top-K most relevant blocks and does full attention within those blocks. I think this approach is cool because it really works well for retrieval in very long contexts.
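
A rough sketch of the block-selection branch being described (a simplified illustration of the idea, not DeepSeek's actual NSA implementation): score blocks with a coarse query-block similarity, keep the top-K blocks, and run ordinary attention only over the tokens in those blocks.

```python
import numpy as np

def topk_block_attention(q, K, V, block_size=64, top_k=4):
    """Simplified top-K block-selection attention for a single query vector.
    q: (d,), K/V: (seq_len, d). Real NSA also has compressed and
    sliding-window branches and learned block summaries."""
    d = q.shape[-1]
    n_blocks = K.shape[0] // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Coarse scores: query vs. mean key of each block (a stand-in for the
    # learned compressed block representation).
    block_scores = Kb.mean(axis=1) @ q / np.sqrt(d)
    chosen = np.argsort(block_scores)[-top_k:]

    # Full attention, but only over the tokens of the selected blocks.
    Ksel, Vsel = Kb[chosen].reshape(-1, d), Vb[chosen].reshape(-1, d)
    logits = Ksel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ Vsel

# Cost scales with top_k * block_size rather than the full sequence length.
rng = np.random.default_rng(0)
out = topk_block_attention(rng.normal(size=64),
                           rng.normal(size=(4096, 64)),
                           rng.normal(size=(4096, 64)))
```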

Sualeh Asif (CPO): What do you think about this kind of approach? Internally, for example, we actually focus more on "file-level attention."

Aman Sanger (Co-Founder): What do you think?

Jacob Jackson (Researcher): I think this is essentially taking the Mixture-of-Experts (MoE) idea and applying it to attention. We have a standard recipe for introducing sparsity into gradient-descent training: compute the scores, take the top-K, then apply a softmax over the selected ones. That's how MoE is trained.

That way, even though there isn't a gradient everywhere, this sparsity mechanism lets the gating weights focus on the "experts" most relevant to the current sample, which in NSA means focusing on the most relevant context blocks. So, essentially, it extends the MoE approach to other domains.
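
The top-K-then-softmax gating pattern Jacob describes looks roughly like this (a minimal, framework-agnostic sketch):

```python
import numpy as np

def topk_softmax_gate(scores: np.ndarray, k: int) -> np.ndarray:
    """Keep the K highest-scoring experts (or context blocks), softmax only
    over those, and zero out the rest; gradients flow only through the winners."""
    idx = np.argsort(scores)[-k:]
    gate = np.zeros_like(scores)
    sel = np.exp(scores[idx] - scores[idx].max())
    gate[idx] = sel / sel.sum()
    return gate

# e.g. 8 experts, route to the top 2:
print(topk_softmax_gate(np.array([0.1, 2.0, -1.0, 0.5, 1.7, 0.0, 0.3, 0.9]), k=2))
```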

Sualeh Asif (CPO): Do you have any other favorite attention mechanisms?

Jacob Jackson (Researcher): Actually, the biggest problem with long-context mechanisms is how to judge them against a baseline, because the various mechanisms all more or less work. With sparse attention, for example, some heads focus locally and some globally, and you get something slightly worse but faster, or slightly better. So the evaluations have to be very rigorous. I don't have a particular paper I'd recommend either.

#05

Adding Memory Tools to Models

Sualeh Asif (CPO): For example, add a "memory tool" so the model can save some information during RL to use later. But the question is: how do you get the model to actually save what will be useful later? Do you think RL will let us build more complex "stateful tools"?

Aman Sanger (Co-Founder): That's interesting. The direction now is increasingly toward not putting all the information in the model, but instead having the model learn to find the information with a retrieval tool (e.g., semantic search).

Federico Cassano (Researcher): A "memory tool" actually consists of two steps. The first step is "store a certain interaction as a memory"; the second is "retrieve the memory when I need it."

The second step is easy to teach the model: give it a reward when retrieving a memory actually helps. But "storing the memory" is much more complicated, because the reward depends on the performance of a series of later actions, not on the current step. So training has to do a lot of sampling in completely different contexts to produce that signal.

Charlie Snell (Researcher): Yes, that's where you train both ends together, so the model learns to "write the memory," and then you do subsequent sampling where it "reads the memory" and send the reward back based on the outcome.
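
A minimal sketch of the two-step memory tool being discussed (the interface and the keyword-overlap retrieval are illustrative assumptions; Cursor's actual mechanism isn't specified): the agent calls a store operation during one session and a retrieve operation in later ones, and only the retrieval step gets a directly attributable reward.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy memory tool: store short notes, retrieve by naive keyword overlap.
    A real system would retrieve by embeddings; this is only an interface sketch."""
    notes: list[str] = field(default_factory=list)

    def store_memory(self, note: str) -> None:
        # Step 1: hard to reward directly; its value only shows up in later episodes.
        self.notes.append(note)

    def retrieve_memory(self, query: str, k: int = 3) -> list[str]:
        # Step 2: easy to reward; did the retrieved note help the current task?
        q = set(query.lower().split())
        ranked = sorted(self.notes,
                        key=lambda n: len(q & set(n.lower().split())),
                        reverse=True)
        return ranked[:k]

mem = MemoryStore()
mem.store_memory("The payments service tests require the FAKE_CLOCK env var.")
print(mem.retrieve_memory("why do the payments tests fail on CI"))
```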

Aman Sanger (Co-Founder): It's actually a lot easier if you generate and retrieve memories without training the whole thing end to end. As you've said before, it's more about trying things over and over with various rules and rubrics to see what works.

Sualeh Asif (CPO): How do you think we should evaluate the effectiveness of "memorization"?

Aman Sanger (Co-Founder): Yeah, what do you think? Or did Luke come up with that?

Jacob Jackson (Researcher): I think it was Luke who came up with it, because, as Federico Cassano said, it's hard to backpropagate through a memory-storage mechanism directly, and the best way is to run a benchmark.

For example, take 500 agents that are supposed to do the task, see how well they do, then experiment with different rules, heuristics, or prompts for when to store a memory and when to forget it, and compare the results directly. You can't do backpropagation with a small number of agents, otherwise it's easy to learn "reward hacking," but the rule-based approach can help you find the best setup.

Aman Sanger (Co-Founder): I'm also curious whether "short-term memory" will always be useful, or whether, as Jacob Jackson said, it will end up being a long-context mechanism where the model sees your entire conversation history and automatically "refreshes your memory."

Sualeh Asif (CPO): What do you guys think about "Historical Conversations" over "Historical PRs (Code Merge Requests)"?

Aman Sanger (Co-Founder): They're actually two different things. Historical conversations have one feature: you can see how the environment changed as you worked, such as the feedback from the code formatter and how the linter reacted. You don't see that in a PR, which is more like a finished demo. I think both are useful in their own way.

Charlie Snell (Researcher): In fact, historical PRs can also enable personalization. Within a given codebase, you can find each person's style and the order of their changes, for example which file they change first and which next, and the model can learn those habits.

#06

Long Context

Federico Cassano (Researcher): I'm very confident about long contexts. The new generation of GPUs, like the GB200 NVL72 architecture, makes "super long contexts" easy. For one thing, there are 72 GPUs interconnected, which enables massive tensor parallelism, lets you distribute attention heads across devices, and makes KV storage easier to handle. And the Grace CPU provides unified memory, which can hold more KV cache.

Aman Sanger (Co-Founder): What's cool is that quite a few people have tried this now: the KVs don't all have to be on the GPU. You just interleave computation with loading, which barely slows things down, and load them onto the GPU each time you compute attention.

Federico Cassano (Researcher): For example, while you're at layer 0, you can already load the K values you need for layer 1 from the CPU. You only have to put them on the GPU when you need them, so they don't occupy GPU memory the whole time.
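
A rough sketch of the layer-by-layer KV prefetch being described, in PyTorch-style code (illustrative only; it assumes a CUDA device, pinned host memory for the KV tensors, and a custom layer signature, none of which come from the speakers):

```python
import torch

def forward_with_kv_offload(x, layers, kv_cpu):
    """Run decoder layers while streaming each layer's KV cache from the CPU.
    kv_cpu[i] is a (K, V) pair kept in pinned host memory; layer i+1's KV is
    prefetched on a side stream while layer i is being computed."""
    copy_stream = torch.cuda.Stream()
    device = x.device

    def fetch(i):
        with torch.cuda.stream(copy_stream):
            k, v = kv_cpu[i]
            return k.to(device, non_blocking=True), v.to(device, non_blocking=True)

    next_kv = fetch(0)
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_stream(copy_stream)  # KV for layer i is ready
        k, v = next_kv
        if i + 1 < len(layers):
            next_kv = fetch(i + 1)   # overlap the next copy with this layer's compute
        x = layer(x, k, v)           # attention only needs this layer's KV on the GPU
    return x
```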

Aman Sanger (Co-Founder): But that only gets you to something like a million tokens. You still have to pay the quadratic computational cost; you can't rely on big memory alone.

Federico Cassano (Researcher): But with high parallelism it's theoretically 72 times cheaper, as long as the attention heads are well distributed.

Aman Sanger (Co-Founder): Yes, it would be 72x cheaper, but the n² blow-up is still there, and you may still need the various constant-factor optimizations such as sliding windows, chunking, NSA, and so on; they're large constant-factor savings, but still essentially constant terms.

Federico Cassano (Researcher): We really like "document-level attention".

Aman Sanger (Co-Founder): We call it "squid attention", like each tentacle of a squid is a document.

Jacob Jackson (Researcher): Why is that?

Aman Sanger (Co-Founder): What do you think?

Jacob Jackson (Researcher): I don't know.

Aman Sanger (Co-Founder): I don't know who came up with the name; it doesn't seem like a name Lucas would come up with. The nice property of this attention mechanism is that each document "attends to itself" independently, and only at the end does everything come together globally.

The nice thing is that you can cache the keys and values of multiple documents and swap them in at inference time without having to re-prefill them. This is useful for a variety of products, such as Tab, to produce content quickly, and it's especially handy when the agent does semantic search and reads documents.
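
A rough sketch of the document-level ("squid") attention idea: each document attends only within itself, which is what makes per-document KV caches reusable and swappable. The mask construction below is an illustrative simplification, not Cursor's implementation.

```python
import numpy as np

def document_block_mask(doc_ids: np.ndarray) -> np.ndarray:
    """Token i may attend to token j only if both are in the same document
    (plus causal ordering). Because no document attends to another, each
    document's KV cache can be computed once and swapped in or out freely."""
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((len(doc_ids), len(doc_ids)), dtype=bool))
    return same_doc & causal

# Three documents packed into one sequence: [A, A, A, B, B, C, C, C, C]
print(document_block_mask(np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])).astype(int))
```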

#07

The Best Reward Mechanisms

Sualeh Asif (CPO): We've said before that AI coding started out by "optimizing for test coverage." But do you have any crazier ideas? How can we make it more relevant to the real world and not just to test coverage?

Aman Sanger (Co-Founder): What do you mean by that?

Sualeh Asif (CPO): Actually, most RL (Reinforcement Learning) training nowadays is all about getting the model to pass a bunch of test cases.

But we don't really only care about the model passing tests. We want the model to perform better on real human needs, such as being able to add console logs to files, or doing many more "human-centered" things than just completing a tiny task and passing a bunch of tests. That's why benchmarks like SWE-Bench are a bit problematic, and Federico Cassano has a similar view.

Charlie Snell (Researcher): Yeah, if we want more "human-centered" rewards for the model, such as code quality or whether the output makes sense, it's really critical to get feedback signals from real users: for example, whether the user likes the changes made by the model, or using whether the user accepted the edit as a proxy for the feedback signal.

Aman Sanger (Co-Founder): Yeah, another way to think about it is to just look at what the user actually changed afterwards, and judge from that whether they liked the model's output. If the user doesn't think it's right, they'll change it themselves.

In fact, as long as we can let the model try several approaches, say have it try 3 or 4 different options in the background, using different models or a higher temperature, and then the user finally picks one to use, which one was selected is a very good reward signal that can be used to train a reward model.

Federico Cassano (Researcher): It's striking that Pass@K (the chance that at least one of K attempts succeeds) tends to be much higher than Pass@1 (the chance of succeeding on the first try).
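
For reference, the standard unbiased Pass@K estimator (from the Codex/HumanEval methodology), given n samples per task of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples,
    drawn from n total samples of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples of a task, 12 of them correct:
print(pass_at_k(100, 12, 1))  # ~0.12  (Pass@1)
print(pass_at_k(100, 12, 8))  # ~0.65  (Pass@8)
```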

Aman Sanger (Co-Founder): Even without perfect "oracles" or reward models, there are actually ways to increase the success rate.

Charlie Snell (Researcher): Yes, you can close the gap. For example, if you sample multiple times and use a voting method or some other method of selecting the best, you can improve the pass rate. We're working on a similar selection mechanism right now.

Aman Sanger (Co-Founder): Yes, if we have enough data on "which of the multiple options the user chose each time," then we can train reward models on that signal. Even better, the reward model can see the "standard answer," so it knows more than the original model.
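
A minimal sketch of turning "which option did the user pick" into reward-model training data, using a pairwise Bradley-Terry-style loss; the toy model and feature vectors are placeholders for illustration, not Cursor's pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a reward model: maps a feature vector of a code edit to a scalar.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def pairwise_reward_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the option the user picked should score
    higher than an option they did not pick."""
    return -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()

# Each background alternative the user did NOT pick becomes a "rejected" example
# paired against the option they kept ("chosen"). Random tensors stand in for
# real edit features here.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)
loss = pairwise_reward_loss(chosen, rejected)
loss.backward()
```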

Federico Cassano (Researcher): But you can't avoid "saturating" the reward model, can you? Usually you can't keep improving against a reward model past about 200 steps: the reward score is still going up, but the model itself stops improving.

Aman Sanger (Co-Founder): Why wouldn't a reward model that "sees the standard answer" saturate as quickly?

Federico Cassano (Researcher): Sooner or later, it will saturate, but my hypothesis is that because it's based on real-world scenarios, it will saturate much later.

Charlie Snell (Researcher): It may be that the reward score keeps going up while the "real reward" you actually care about goes down. It might be much better if we use rewards that are closer to the goals we care about, because then they reflect the decisions people actually make in reality.

Federico Cassano (Researcher): If you put users directly "in the loop"…

Aman Sanger (Co-Founder): So, do you think it would be worse? For example, which of the two is better: using only a very clean reward signal, or using a reward model that can see the standard answer?

Federico Cassano (Researcher): If you can keep retraining the reward model, then it's not really a big deal.

Aman Sanger (Co-founder): If we can keep retraining the reward model, for example, by making a reward model for pairwise comparisons, would that be better than just seeing the standard answer?

Federico Cassano (Researcher): We actually have almost an "infinite amount" of human feedback right now, so that's ideal.

Jacob Jackson (Researcher): Yeah, what's interesting about us right now is that for a lot of models, we are the interface between the model and the real world, at least for code-related applications. So it's our job to make the model as close as possible to the needs of real users.

Charlie Snell (Researcher): I think there's a trade-off here: if you can sample unlimited feedback from the real world, you can certainly always optimize, and the results will be great. But if sampling is expensive, you have to think of other ways to do more offline optimization, such as using "standard answer" based rewards.

Aman Sanger (Co-Founder): Do you think it's realistically feasible to have the model go online often and take reward signals directly from users? Are there any reasons why this can't be done? Jacob Jackson, what do you think?

Jacob Jackson (Researcher): I think it's totally doable.

Aman Sanger (Co-founder): You think we should do that?

Jacob Jackson (Researcher): Yes, my view is that the shorter the cycle for a new model to come online to the actual environment, to start interacting with the real world, the better the model will work.

Federico Cassano (Researcher): Personally, I'm more optimistic about reward models that are retrained more frequently.

Charlie Snell (Researcher): If you retrain every few days and have enough data, that's fine.

Jacob Jackson (Researcher): Have you guys seen the OpenAI blog? They have a post reflecting on the "sycophancy" phenomenon, saying models learn bad behavior because they're trained on signals like thumbs-up/thumbs-down.

Aman Sanger (Co-Founder): The thumbs-up/thumbs-down signal is really problematic, because it makes the model focus only on the distribution of users who frequently give thumbs-up and thumbs-down.

Federico Cassano (Researcher): I feel the same way.

Aman Sanger (Co-founder): It's really a question of evaluation.

Federico Cassano (Researcher): Do you think there are users who just give thumbs-up and thumbs-down at random, purely messing around?

Aman Sanger (Co-Founder): The sycophancy phenomenon shows that this kind of problem exists, but that doesn't mean the signal is completely useless.

Federico Cassano (Researcher): That's why it's important to align feedback with real user needs. You have to find scenarios where users are genuinely willing to give you feedback.

Charlie Snell (Researcher): Yes, it has to be as close as possible to the real needs.

Federico Cassano (Researcher): The ultimate ideal is to "satisfy the user and make them laugh."

Aman Sanger (Co-founder): Yes, that's the effect we want.

Federico Cassano (Researcher): I would think so, too.

Jacob Jackson (Researcher): That OpenAI article mentions that this could be one of the big reasons why models become sycophantic, but of course there are other factors as well.

Charlie Snell (Researcher): We could use a real user-behavior signal. For example, we have a "model selector": if the user switches to a different model, that can signal that they weren't satisfied with our model's results.

Aman Sanger (Co-Founder): No, I actually like the signal Jacob Jackson mentioned best. That's the one you're talking about, right?

Jacob Jackson (Researcher): Oh, I'm talking about "has the code been retained".

Aman Sanger (Co-Founder): Yeah, that's actually the signal you want most in the long term.

Jacob Jackson (Researcher): Or whether they stop using Cursor altogether; we don't want to reinforce behavior that leads to that. We want users to stay on the platform. Can we do something with that signal?

Aman Sanger (Co-Founder): Churn can actually be used as a reward signal; churn is really the "end goal" we're trying to optimize. Can we predict churn from short-term behaviors and use those predictions as reward signals?

Jacob Jackson (Researcher): It's all about the "bias-variance" trade-off.

Aman Sanger (Co-Founder): There will always be tradeoffs.

Sualeh Asif (CPO): A similar question: I feel like all of our discussion has centered on outcome-based rewards. A while back, people were also very keen on "process reward models."

Aman Sanger (Co-Founder): Yes.

Sualeh Asif (CPO): So, where are these process reward models going now, Charlie? What do you think?

Charlie Snell (Researcher): The problem with process reward models is that you throw the whole action trajectory at the model and score each step, and the model is actually not very accurate at scoring intermediate steps, especially for judgments like "will this action lead to the final correct outcome." When you put optimization pressure on such a reward model, you can only optimize a little before it breaks down, as we discussed earlier.

But if you have real "standard answer" signals, you can optimize all the way through, for example with the answers to math problems. DeepSeek's R1 does 10,000 RL steps, while most RLHF pipelines only do about a hundred. At 10,000 steps, the model learns genuinely new capabilities. So the key is how many steps you can keep optimizing against the reward signal: ground truth (the standard answer) can be optimized against indefinitely, whereas a process reward can only be optimized for a finite number of steps.

Federico Cassano (Researcher): And the more steps you have, the worse this problem gets. For example, with a 50-step tool-call trajectory, the value model becomes a bottleneck, which is why a lot of people use PPO variants like GRPO or RLOO: it's hard to accurately evaluate the whole process with a value model.

Charlie Snell (Researcher): For difficult problems, it's hard to expect the model to give an accurate "value estimate." So GRPO just brute-force samples more rollouts and averages the results, which gets you closer to the truth.

Aman Sanger (Co-Founder): I may have missed the first part, but it does explain the difference between "process reward models" and "outcome reward models". But how do the two compare?

Charlie Snell (Researcher): If you compare a process reward model to a reward model that only scores the final step, the extra intermediate scoring can actually be an advantage in some search settings. But it's the same old problem: both can only be optimized against for a finite number of steps.

Aman Sanger (Co-Founder): Should we still train the "process reward" model? For example, should we focus on iteratively retraining the reward model instead of the process reward?

Charlie Snell (Researcher): It's actually something to consider. If we want to maximize the use of each round of user data, the process reward model might have some gain.

#08

RL Infrastructure

Sualeh Asif (CPO): So speaking of infrastructure, many of you have been involved in building RL infrastructure. Any interesting thoughts? What makes a good RL infrastructure?

Federico Cassano (Researcher): One interesting aspect of our infrastructure is that it is naturally more complex than normal training, because it has to build on top of the usual forward and backward passes (as in SFT or pre-training) and then go further. Another point is that it also has to include inference, and that inference isn't about low latency for a user; it's about high throughput, running rollouts (sampling) at scale.

Algorithms like GRPO are even more interesting because you have to generate a very large number of completions for the same prompt and then backpropagate through them together. In math, many open-source projects don't worry much about this, because math prompts are usually very short and can be run forward and backward directly. But when we train agents, the prompt is very large, so we can't naively backpropagate through all the samples; we have to optimize it.

For example, when you get a prompt from the data loader, you can start pulling its KVs (key-value caches) right away. While the inference server is doing rollouts, the KVs have already been fetched. When the rollouts come back, you only need to run the forward pass against the prepared KVs, and backpropagation can also reuse them, avoiding repeated computation. There are a lot of underexploited optimizations here.

Charlie Snell (Researcher): You also have to be able to quickly synchronize the parameters of the training node and the inference node, which is also a big challenge.

Federico Cassano (Researcher): Also, when you actually do reinforcement-learning training, you run into different sampling schemes. For example, many people use "asynchronous sampling": while you're still backpropagating on the last batch of data, the inference side has already started generating the next batch with the old parameters. There's a little bit of lag, but overall training is faster.

When we need to bring the parameters on all machines up to date, we do a "global synchronization": everyone pauses and the latest parameters are synchronized to all machines as fast as possible. Commonly used methods are RDMA, InfiniBand, RoCE, and so on, which are all efficient server-to-server memory-transfer technologies.
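
A toy sketch of the asynchronous pattern described above, with a queue standing in for the fleet of inference servers (entirely illustrative): generation runs ahead with slightly stale weights, and a periodic "global sync" would broadcast fresh weights to the generators.

```python
import queue
import threading

rollout_queue: queue.Queue = queue.Queue(maxsize=2)  # generated batches wait here
weights = {"version": 0}                             # stand-in for model parameters
weights_lock = threading.Lock()

def inference_worker(num_batches: int) -> None:
    """Generates rollouts with whatever weights it currently has (possibly stale)."""
    for step in range(num_batches):
        with weights_lock:
            stale_version = weights["version"]
        rollout_queue.put({"rollouts": f"batch-{step}", "generated_with": stale_version})

def trainer(num_batches: int, sync_every: int = 4) -> None:
    for step in range(num_batches):
        batch = rollout_queue.get()      # may have been generated with old weights
        # ... compute advantages and backpropagate here ...
        with weights_lock:
            weights["version"] += 1      # local update after every step
        if (step + 1) % sync_every == 0:
            # "Global sync": in a real system this is an RDMA/NCCL-style broadcast
            # of the new parameters to every inference server.
            print(f"step {step + 1}: synced weights v{weights['version']} "
                  f"(batch was generated with v{batch['generated_with']})")

gen = threading.Thread(target=inference_worker, args=(12,))
gen.start()
trainer(12)
gen.join()
```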

Aman Sanger (Co-Founder): Regarding throughput, how else do you think it can be optimized?

Jacob Jackson (Researcher): Core architectures like DeepSeek's are optimized for throughput. Although tokens per second isn't high, the number of samples per GPU is very high. That architecture is actually very well suited to RL training as well.

Aman Sanger (Co-Founder): Used on the NVL 72 device.

Federico Cassano (Researcher): Also, PD separation (prefill/decode disaggregation) is important. You only need to prefill the prompt once; after that, all the decode workers can start working from it.

Aman Sanger (Co-Founder): There's actually an interesting way to do RL that on one hand simplifies the process but on the other makes things more complicated: take the data users actually generate when running inference with the model and treat it as the RL data. Jacob Jackson is doing something similar with Tab.

Jacob Jackson (Researcher): As long as you don't need to sample multiple completions for the same prompt, and you only care about what the model actually did and then whether to "reinforce" or "not reinforce" that action, you don't really need an extra inference component for RL training at all. Just look at what actually happened with the user.

That's a different set of trade-offs from "resample and compare with a reward model"; it relies more on being able to bring new policies online very quickly. The advantage is that the policy you optimize is exactly the same as the policy that actually generated the data.

We do this in the Tab scenario because we collect a lot of feedback per unit of time, for example, every time a Cursor user sees a Tab suggestion. So we have a huge amount of data and this scenario just makes sense.

Charlie Snell (Researcher): Gradient estimation in RL inherently has a lot of variance, so if you have a large batch, it's fine to reinforce individual rollout trajectories. But if you don't have a big batch, you have to reduce variance some other way; that's where GRPO comes in, or you can add a value-function baseline. A large batch is theoretically fine.

Jacob Jackson (Researcher): Big batches and short trajectories work well. With Tab, a trajectory is a few hundred tokens, whereas an Agent trajectory is 10,000 tokens.

Charlie Snell (Researcher): Agent has a very long rollout every time.

Federico Cassano (Researcher): How can we make Tab's action space bigger?

Jacob Jackson (Researcher): Let Tab give more different types of suggestions.

Federico Cassano (Researcher): Yes, if Tab only generates one line of code, you have to sample it many more times before you get a new suggestion. The way to make it more suitable for RL is to add more "actions" to it, like "jump" operations.

Jacob Jackson (Researcher): Yes, "jumps" give Tab more action. Without jumps, it often has to stop. If it can jump, it can continue to sample and give feedback based on whether you accept the jump or not.

Aman Sanger (Co-Founder): Yeah, and how does GRPO compare to other RL algorithms?

Charlie Snell (Researcher): The big difference between GRPO and PPO is that PPO has a value function. GRPO is good if you don't have a lot of memory, because you don't have to store extra value-function weights, but with GRPO you have to run more rollouts, which is a lot more computationally intensive.

You can still train a model with GRPO; it just takes longer. Especially for math and code tasks, the value function is usually not accurate, and the value computed by a forward pass isn't meaningful; it's better to take more samples and average them. That's also what we see in practice.

Federico Cassano (Researcher): There was actually a similar algorithm before GRPO, RLOO, but it didn't get much attention.

Charlie Snell (Researcher): Yeah, actually, GRPO has been out for a long time. It was used in DeepSeek's math paper, I think it was a year or so ago, around early 2024.

Federico Cassano (Researcher): I remember it was available in 2019 as well.

Charlie Snell (Researcher): Yeah, that's even earlier. RL got hot mainly after DeepSeek R1, and actually GRPO was around a year before R1.

Federico Cassano (Researcher): They were using a reward model then.

Charlie Snell (Researcher): Yes, they used RL then, with ground truth too, and tried doing RL against a process reward model. It's interesting to ask why it didn't have the same effect as R1 back then.

Aman Sanger (Co-Founder): Why do you think?

Charlie Snell (Researcher): There are actually related studies. It may come down to base-model capability: the pre-training data changed, or the model itself became good enough that even if only one sample in a hundred produces a reasonable backtrack, RL can amplify those good behaviors. Bigger, more capable models also have an impact.

Aman Sanger (Co-Founder): Can anyone reproduce R1-Zero or R1 now?

Charlie Snell (Researcher): R1-Zero has only been reproduced at a very small scale, mostly on toy tasks.

Aman Sanger (Co-founder): It's really hard.

Federico Cassano (Researcher): The difficulty is reproducing it on a large model like QwQ 32B: the infrastructure is too hard, and so is the data. DeepSeek had a lot of real data, while the open-source community only has datasets of 100,000 or 200,000 examples.

Aman Sanger (Co-Founder): A lot of the results now use only a thousand or so examples, loaded offline by the CPU.

#09

More Efficient Programming Agent

Sualeh Asif (CPO): Okay, last question, what do you think the future of Programming Agents will be?

Jacob Jackson (Researcher): It's going to use more and more tokens.

Charlie Snell (Researcher): We've been talking about input context, but output context matters too. An agent like o3, for example, will keep pulling in content until it has built the right context before solving the problem. I expect future models will call tools continuously for a long time before committing to a decision.

Aman Sanger (Co-Founder): But it feels like this is a bit wasteful, because you have to redo all that computation next time, and in most cases you don't really need that much reasoning each time... Can you reuse the previous reasoning process?

Charlie Snell (Researcher): Yes, you should be able to distill the previous reasoning. The agent looks at the trace history or the history of the codebase, extracts useful information, and stores it.

Aman Sanger (Co-Founder): Yeah, it's actually pretty undesirable to have to put up with something as slow and expensive as o3 if you want the best agent, or even just a good-enough agent.

Charlie Snell (Researcher): Yeah, you should be able to do the knowledge-building slowly in the background, and then use it quickly when you actually ask a question.

Aman Sanger (Co-Founder): I think "long context" or "codebase-specific models" will be important. As long as you can reuse accumulated knowledge and understand the structure of the code without having to re-derive it every time, the model will be a lot more efficient, and it will only need to output the key information when generating the answer.

Federico Cassano (Researcher): Another point is that the number of output tokens can be dramatically increased, which makes training more "sample efficient." Right now, only the output tokens provide signal during SFT.

Aman Sanger (Co-Founder): But this also has drawbacks: with very long outputs it becomes difficult to do credit assignment (deciding which tokens deserve the reward). And as with GRPO-style sampling, very long sequences are data-efficient but not compute-efficient.

Charlie Snell (Researcher): Yes, large language model training is now entering a phase where high-quality data is scarcer than compute. The best data is very limited, so how do you put the compute to use? Approaches that seem to "burn compute" may actually be the way forward.

Sualeh Asif (CPO): OK, that's it for today!