Recursive Language Models: Can a simple method unlock a new regime of inference-time scaling?
Training-time scaling laws were saturating. Then came reasoning models and their focus on test-time scaling. Now, here's a method that lets LLMs handle massive prompts without badly degrading performance
IN THE FALL OF 2024, there were concerns that neural scaling laws were saturating. Throwing more compute and data at training ever-bigger large language models (LLMs) was showing diminishing returns. And then, OpenAI released its o1 series of “reasoning” models, unlocking an entirely new way to improve the performance of LLMs. OpenAI had leveraged what's called test-time or inference-time compute. The other big players jumped on the bandwagon, and 2025 became the year of large reasoning models (LRMs). Even DeepSeek joined the party, jolting the world with its R1 series of open-source models. Their peer-reviewed paper provided one of the clearest examples of how to train a reasoning model.
And yet, even large reasoning models began to show their limitations, captured in an evocative phrase: “context rot.” LRMs depend on being able to “think” by generating long sequences of tokens, or chains-of-thought (CoTs), searching through them (either explicitly or implicitly) to pick the optimal CoT, and then answering. The process involves the model feeding itself the tokens it generates. So, the amount of information it has to process in one go keeps increasing as it “reasons,” especially for complex problems. There is, however, only so much a language model can ingest before its performance starts to degrade.
Context rot also afflicts techniques such as retrieval-augmented generation (RAG), which externally augment an LLM's prompt with seemingly relevant information in order to make inference more accurate.
Now, MIT researchers Alex Zhang, Tim Kraska and Omar Khattab have designed a scaffolding around an LLM that helps mitigate the problem. Importantly, the scaffolding can be used with any model, even frontier LLMs. The result is what they call recursive language models (RLMs) that, at first glance, outperform vanilla frontier models. Given that the compute being used to run an RLM is in addition to any compute that a reasoning model might use for its own operation (i.e., its inference-time compute), recursive language models might unlock a new scaling regime. Could 2026 be the year of RLMs, as the MIT team seems to think?1
Before we ponder that question, let's examine in more detail why RLMs were needed in the first place.
Stemming The Rot
If we go back a mere three or so years, GPT-3.5 Turbo boasted of a context window length of 16,385 tokens (where a token can be an entire word, or a part of a word). Context is the entirety of information in an LLM's prompt; context window length is the maximum amount of information that can be given to the model during one forward pass or inference pass. To put into perspective how far things have come, Google's Gemini Pro series of models can take in a million tokens at a time2. And Meta's Llama-4 Scout tops out at a whopping 10 million tokens3.
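For a concrete feel of what counts as a token, here is a small example, assuming OpenAI's tiktoken library is installed; other model families use their own tokenizers, so exact counts vary.

```python
# Count tokens with a byte-pair-encoding tokenizer; counts differ across model families.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-3.5/GPT-4-era models
text = "Context rot: performance grows increasingly unreliable as input length grows."
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])     # some words survive whole, others split into pieces
```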
The intent, of course, is to provide as much information in the context window for the model to reason effectively, including multimedia tokens. But increasing the context window length is not some magic sauce for making LLMs reason more accurately.
For one, the vanilla version of the attention mechanism of Transformers, which allows each token to “attend” to every other token, suffers from what's called quadratic computational complexity. Let's say that you queried an LLM with a prompt of 1,000 tokens. The LLM is going to generate its response one token at a time, feeding each generated token back to itself along with your initial prompt, a process called auto-regression. So, 1,001 tokens are fed back, then 1,002, 1,003, …, you get the picture. The problem is that as the input token length goes from n to 2n, the computation and memory required by the attention mechanism go up by a factor of four. If the input length goes to 10n, the attention mechanism requires 100 times the resources. And with reasoning models, it's very easy for auto-regression to overwhelm the computational resources of the LLM, as the context increases with each subsequent pass through the model.
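To see where the quadratic cost comes from, here is a minimal numpy sketch of naive self-attention (not an optimized kernel): doubling the number of tokens n quadruples the size of the n-by-n score matrix the mechanism has to build.

```python
# Illustrative only: naive self-attention materializes an n-by-n score matrix,
# which is why compute and memory grow quadratically with sequence length n.
import numpy as np

def naive_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # (n, n): the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n, d)

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
_ = naive_attention(rng.standard_normal((1_000, d)), Wq, Wk, Wv)  # fine at small n

for n in (1_000, 2_000, 10_000):
    print(f"{n} tokens -> {n * n:,} attention scores")  # 1,000,000 -> 4,000,000 -> 100,000,000
```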
Tinkering with the attention mechanism to make it more efficient has helped. Researchers developed techniques such as linear attention (where the computational needs increase linearly with input sequence length; for the curious, this is achieved by replacing the Softmax function used while computing self-attention with a kernel function) and flash attention4 (which focuses on reducing the amount of time spent on shuttling data between a GPU's high bandwidth memory and the faster SRAM).
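For a flavor of the kernel trick behind linear attention, here is a hedged numpy sketch; the feature map elu(x) + 1 is one common choice in the literature, and the point is that the (n, n) score matrix is never built, only a small (d, d) summary.

```python
# Sketch of kernelized (linear) attention: softmax(Q K^T) V is approximated by
# phi(Q) (phi(K)^T V), so cost grows linearly in sequence length n, not quadratically.
import numpy as np

def phi(x):
    # elu(x) + 1: a commonly used positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qp, Kp = phi(Q), phi(K)                  # (n, d)
    kv = Kp.T @ V                            # (d, d): independent of n
    normalizer = Qp @ Kp.sum(axis=0)         # (n,)
    return (Qp @ kv) / normalizer[:, None]   # (n, d), O(n * d^2) overall
```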
And while such optimizations do improve computational efficiency, thus allowing longer context window lengths given the same computing resources, they are not a panacea for context rot per se. (The phrase “context rot” was coined by researchers at Chroma5 to describe the observation that an LLM's “performance grows increasingly unreliable as input length grows.”)
One of the benchmarks used to evaluate the performance of LLMs on long context inputs is the so-called Needle in a Haystack (NIAH) test6, designed by Greg Kamradt7. The basic idea is to place a piece of text deep within a document, and then prompt an LLM with a question, the answer to which requires locating the embedded text.
The evaluation involves inserting the text (the needle) at varying depths inside a document, at 10% from the top, 25% from the top, 50%, and so on, and then querying the LLM to check its effectiveness at finding the information at each depth (tested by examining its answer). Kamradt's analysis showed that, for example, “GPT-4 retrieval accuracy started to degrade at large context lengths when the fact was placed between 10%-50% document depth.”8
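As a rough illustration of the mechanics (not Kamradt's actual harness), a needle-in-a-haystack evaluation can be sketched as below, where query_llm is a hypothetical stand-in for whatever model API is being tested.

```python
# Toy needle-in-a-haystack harness; query_llm is a hypothetical stand-in for a real model call.
NEEDLE = "The best thing to do in this city is eat a sandwich in Dolores Park."
QUESTION = "According to the document, what is the best thing to do in this city?"

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: wire this to an actual LLM API

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    cut = int(len(haystack) * depth)                 # depth of 0.1 = 10% from the top, etc.
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

def run_depth_sweep(haystack: str, depths=(0.1, 0.25, 0.5, 0.75, 0.9)):
    for depth in depths:
        prompt = insert_needle(haystack, NEEDLE, depth) + "\n\nQuestion: " + QUESTION
        answer = query_llm(prompt)
        print(depth, "correct" if "Dolores Park" in answer else "missed")
```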
The Chroma team modified the NIAH benchmark so that retrieving the needle depended not just on lexical pattern matching (as in the original test) but also on matching semantics. The team wrote: “Each task remains intentionally simple and is deliberately controlled to isolate the impact of context length alone. We demonstrate that even under these minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways. Real-world applications typically involve much greater complexity, implying that the influence of input length may be even more pronounced in practice.”
It's this context rot that the MIT team sought to fix, with their recursive language model. As they say in their paper, “Though we expect context lengths to steadily rise through improvements to training, architecture, and infrastructure, we are interested in whether it is possible to scale the context size of general-purpose LLMs by orders of magnitude. This is increasingly urgent as LLMs begin to be widely adopted for long-horizon tasks, in which they must routinely process tens if not hundreds of millions of tokens.”9
Keeping A Lid On Input Length
The MIT team, to reiterate, designed an external scaffolding around an LLM, to prevent the prompt, no matter how long it gets, from being fed directly to the model in its entirety. Instead, the scaffolding allows the language model to inspect the prompt, break it down into constituent parts, and feed these parts recursively to itself or to other language models. The LLM+scaffolding makes up the recursive language model (RLM).
The heart of the scaffolding is something called a Read-Eval-Print Loop (REPL) environment. In this case, it's simply a Python environment. To see how it's used, consider the example problem the team used to demonstrate a minimal implementation. The context is a massive piece of randomized text, inside which is inserted, at some random location, a 7-digit number. The QUERY to the LLM is: “I’m looking for a magic number. What is it?”
This is exactly the kind of Needle-in-a-Haystack problem that LLMs struggle with when the haystack gets bigger and bigger (even if their context window lengths allow for it).
The RLM, however, stores the context in a CONTEXT variable in the REPL environment. It then calls a language model, using a combination of the SYSTEM prompt (details below) and the QUERY as input (but not the entire context).
The SYSTEM prompt begins with these instructions to the language model: “You are tasked with answering a query with associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LLMs, which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide a final answer.”
The rest of the SYSTEM prompt contains details about the REPL environment, informing the language model that it has access to the environment, which has a context variable. There are specific instructions on how to use the REPL environment:
“You will only be able to see truncated outputs from the REPL environment, so you should use the query LLM function on variables you want to analyze. You will find this function especially useful when you have to analyze the semantics of the context. Use these variables as buffers to build up your final answer.
Make sure to explicitly look through the entire context in REPL before answering your query. An example strategy is to first look at the context and figure out a chunking strategy, then break up the context into smart chunks, and query an LLM per chunk with a particular question and save the answers to a buffer, then query an LLM with all the buffers to produce your final answer.
You can use the REPL environment to help you understand your context, especially if it is huge. Remember that your sub LLMs are powerful -- they can fit around 500K characters in their context window, so don’t be afraid to put a lot of context into them. For example, a viable strategy is to feed 10 documents per sub-LLM query. Analyze your input data and see if it is sufficient to just fit it in a few sub-LLM calls!”
The system prompt contains examples of how to generate code to break up the context into chunks, how to iteratively query a language model to answer the query, and how to indicate when it has found the FINAL answer. You can find the details in the GitHub repository.
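To make that strategy concrete, here is a minimal, hypothetical sketch of the kind of code the root model might write inside the REPL: chunk the context, query a sub-LLM per chunk, collect the findings in buffers, then ask one last sub-LLM call to combine them. The function name query_llm and the chunk size are assumptions for illustration, not the repository's actual API.

```python
# Illustrative sketch of the kind of code the root model might emit inside the REPL.
# `context` is assumed to already exist in the REPL namespace, pre-loaded by the scaffold;
# query_llm is a hypothetical sub-LLM call, not the repository's actual API.
def query_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real sub-LLM call

CHUNK_CHARS = 500_000  # the system prompt suggests sub-LLMs can handle roughly 500K characters

chunks = [context[i:i + CHUNK_CHARS] for i in range(0, len(context), CHUNK_CHARS)]

buffers = []
for chunk in chunks:
    finding = query_llm(
        "Does this text contain a 7-digit magic number? "
        "If so, return it; otherwise reply NONE.\n\n" + chunk
    )
    buffers.append(finding)

answer = query_llm(
    "Here are per-chunk findings. State the magic number.\n\n" + "\n".join(buffers)
)
print("FINAL:", answer)
```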
Essentially, the main loop of the recursive language model takes in the context and query from the user, initializes the CONTEXT variable in the REPL environment, and then passes on the SYSTEM prompt+QUERY to the language model. Inside the system prompt are instructions asking the language model to access the context variable by generating code that can run in the REPL environment, to break up the long context into smaller chunks, and recursively call itself on each chunk, until it can find an answer to the query. Of course, the devil is in the details, and there are plenty of details in the GitHub repository10.
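A heavily simplified sketch of such an outer loop, using the magic-number example, might look like the following. The names call_llm and run_repl, the abridged system prompt, and the FINAL: prefix convention are illustrative assumptions; the real implementation, with all its details, lives in the repository.

```python
# Highly simplified RLM outer loop (illustrative; see the rlm-minimal repository for the real thing).
import contextlib
import io
import random
import string

SYSTEM_PROMPT = "You are tasked with answering a query with associated context..."  # abridged

def call_llm(messages) -> str:
    raise NotImplementedError  # placeholder for a real chat-completion API call

def run_repl(code: str, namespace: dict) -> str:
    # Execute the model-written code in the shared namespace and capture its printed output.
    # A real implementation would sandbox this; the model only ever sees a truncated result.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)
    return buf.getvalue()[:2000]

# The magic-number demo: a huge random haystack with a 7-digit needle hidden inside.
haystack = "".join(random.choices(string.ascii_lowercase + " ", k=1_000_000))
needle = str(random.randint(1_000_000, 9_999_999))
pos = random.randrange(len(haystack))
context = haystack[:pos] + needle + haystack[pos:]

query = "I'm looking for a magic number. What is it?"
namespace = {"context": context}            # the REPL environment the model can write code against

messages = [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}]
while True:
    reply = call_llm(messages)              # note: the root model never sees `context` directly
    if reply.startswith("FINAL:"):          # the model signals it has found an answer
        print(reply)
        break
    output = run_repl(reply, namespace)     # run the model's code, feed truncated output back
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": output})
```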
Normally, if you had a piece of text a million tokens long and some query related to it, you'd have passed the entire context+query to the LLM. It's very likely the LLM would fail to answer the query.
But with the recursive scaffolding, the weight of the context window is lifted off the LLM. It can recursively deal with the context in smaller portions, and potentially succeed. How did RLMs do, then, on tasks that required more than looking for a needle in a haystack?
Beyond Needles In Haystacks
The NIAH benchmark, while seminal, has proven to be limited. Researchers have argued that the task requires focusing on a small portion of the overall (potentially very large) context, allowing the LLM to disregard almost all of the rest of context. In Nov 2025, Amanda Bertsch and colleagues from CMU designed OOLONG, a benchmark that requires the LLM, for example, to both classify the information in the context window into one of many types and to count the instances of each type, a problem of linear computational complexity.11
The MIT team used the OOLONG benchmark to test their recursive language model approach against vanilla frontier models such as GPT-5. They varied the context length from 8K tokens to 262K tokens. Vanilla GPT-5's score went from about 90% for a context length of 8K tokens, to about 50% for 33K tokens, to below 30% for 262K tokens; GPT-5 could not ingest any more tokens, given its maximum context window length of 272K tokens. Also, when the MIT team used an upgraded benchmark they call OOLONG-Pairs, where the LLM has to look at all possible pairs of chunks of tokens in order to answer a query (a problem whose computational complexity grows quadratically with context length), GPT-5's score dropped drastically, from about 80% at 8K tokens to a few percent at 33K tokens to near zero at 66K tokens or more.
On the other hand, the recursive language model (a scaffold around GPT-5) showed consistent, albeit slowly declining, performance for context lengths ranging from 8K tokens (a score of 75% on OOLONG-Pairs) all the way to 1 million tokens (about 50%).
There's plenty of other data in the paper that shows RLMs outperforming vanilla LLMs. What caught my eye were some of the negatives, which were revealing in themselves. For example, if the vanilla models that were scaffolded did not have good coding capabilities, the RLM approach didn't succeed (not surprising, since the SYSTEM prompt asks the LLM to generate code that can run in the REPL environment). Also, if the vanilla models had severe restrictions on the total length of their output tokens (a different constraint than the input context window length), this too impacted the performance of the RLM. Most importantly, inside the RLM, the vanilla LLM being invoked has to tag its answer as FINAL when it thinks it has found the answer. It turns out that, again not surprisingly, an LLM's claim to have found the final answer cannot always be trusted. All these shortcomings will need to be addressed.
While the paper has garnered a lot of attention, many have argued that this is simply a coding agent in a new avatar. I refer you to this thread, in which one of the co-authors of the RLM paper engages with questions. There are subtleties here that are worth exploring, in terms of what the author thinks is the difference between a coding agent and RLMs.
Maybe a bit too boldly, the authors envisage a new axis for scaling laws (not unlike how inference-time scaling added an additional dimension in the fall of 2024), using native RLMs. In their case, training a native RLM involved supervised fine-tuning using curated datasets (collected from RLM trajectories of larger models and distilled versions of smaller models of the same family), to create RLM-Qwen3-8B, the first natively recursive language model.
“We hope that training native RLMs can be treated as a new axis of scale to improve LM performance on general and long-horizon tasks,” they write.
I think it's too early to say this will unlock a new axis for scaling LLMs. But if nothing else, RLMs provide an elegant way to scale up the context window length of any LLM. That's no small achievement.
Footnotes
https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-pro
https://ai.meta.com/blog/llama-4-multimodal-intelligence/
https://arxiv.org/abs/2205.14135
https://research.trychroma.com/context-rot
https://github.com/gkamradt/LLMTest_NeedleInAHaystack
Watch Greg Kamradt explain his Needle In A Haystack evaluation
https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/img/GPT_4_testing.png
https://arxiv.org/abs/2512.24601
https://github.com/alexzhang13/rlm-minimal
https://arxiv.org/pdf/2511.02817