Recursive Language Models: The Next Frontier in Inference-Time Scaling

By Prahlad Menon 7 min read

There’s a pattern in AI progress that’s become almost predictable. We hit a wall. Someone finds a way around it. The wall becomes a footnote.

In 2024, the wall was scaling laws. Training ever-larger models was showing diminishing returns. Then OpenAI released o1, and suddenly we had a new dimension to scale: inference-time compute. Reasoning models dominated 2025.

Now we’re hitting a new wall. It’s called context rot — and a team at MIT thinks they’ve found the next paradigm shift.

The Context Rot Problem

Here’s something practitioners know but benchmarks don’t capture well: as conversations get longer, models get dumber.

It’s not just about context window limits. Gemini supports a million tokens. Llama-4 Scout handles 10 million. The problem is that even within these limits, performance degrades steeply as prompts grow. Anthropic’s engineering team describes it precisely: “as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases.”

The chart from MIT’s research tells the story. GPT-5, one of the most capable models available, shows severe degradation as input length grows from 8K to 256K tokens. By the time you’re working with complex reasoning tasks, the model is struggling at lengths well within its advertised context window.

Reasoning models made this worse, not better. They work by generating long chains of thought — feeding themselves tokens as they “think.” The more complex the problem, the more tokens they generate, the faster the context fills up. It’s a fundamental tension: the technique that makes models smarter also makes them more vulnerable to context rot.

The Recursive Insight

MIT researchers Alex Zhang, Tim Kraska, and Omar Khattab asked a simple question: what if we stopped treating prompts as inputs and started treating them as environments?

Traditional LLMs ingest tokens directly into the transformer. You stuff your context into the prompt, hit enter, and hope the attention mechanism figures out what matters. As context grows, this becomes increasingly hopeless.

Recursive Language Models flip this. An RLM:

  1. Treats the prompt as an external variable stored in a Python REPL environment
  2. Gives the model tools to peek into, decompose, and navigate that variable
  3. Allows the model to recursively call itself on programmatic snippets

The user sees the same interface — text in, text out. But under the hood, instead of cramming everything into one forward pass, the model writes code to examine the context strategically and spawns sub-calls to process manageable chunks.

Traditional LLM:
  [800K tokens] → Transformer → degraded output

Recursive LLM:
  [800K tokens] → stored as variable
  Root model → writes code to partition
  Sub-calls → process chunks recursively  
  Aggregation → final answer

The key insight is symbolic recursion. Prior approaches like context summarization or retrieval-augmented generation try to compress information or fetch relevant snippets. But they’re fundamentally limited — summarization loses details, retrieval can miss what matters. RLMs let the model decide how to decompose the problem programmatically, including spawning loops that process thousands of chunks if needed.

The Results Are Striking

The MIT team evaluated RLMs on four diverse benchmarks: deep research (BrowseComp+), information aggregation (OOLONG), code repository understanding (CodeQA), and a synthetic pairwise reasoning task (OOLONG-Pairs) designed to be impossible for vanilla models.

The findings:

Scale: RLMs successfully process inputs up to 10M+ tokens — two orders of magnitude beyond native context windows.

Accuracy: On OOLONG, the hardest long-context benchmark they tested, RLM using GPT-5-mini outperformed vanilla GPT-5 by more than double the correct answers. A smaller model with the right scaffolding beat a larger model with brute-force context.

No degradation: Unlike vanilla models that collapse as context grows, RLMs show minimal performance degradation even at extreme scales. The architecture sidesteps context rot by never overwhelming the underlying model’s attention.

Comparable cost: Despite the recursive overhead, average query costs remain comparable to vanilla LLM calls. The model learns to be efficient about decomposition.

Transfer learning: The team post-trained RLM-Qwen3-8B using just 1,000 samples from unrelated domains. The result: 28.3% average improvement over base Qwen3-8B, approaching GPT-5 quality on three benchmarks. The recursive capability transfers.

Why This Is Different From RAG and Agents

If you’re thinking “this sounds like retrieval” or “this is just another agent framework,” there’s a crucial distinction.

RAG retrieves snippets based on similarity, then stuffs them into context. It doesn’t solve context rot — it just selects which context to rot on. And it requires pre-indexing, which assumes you know what’s relevant before you ask.

Coding agents can write programs to interact with external data sources. But they still fill up their context window with retrieved snippets before breaking down. The bottleneck just moves.

Sub-agent delegation (like Claude’s computer use or multi-agent systems) lets models spawn helpers. But prior approaches verbalize sub-calls autoregressively — the root model has to say what it wants, limited by output length constraints.

RLMs combine the best of all these while avoiding their limitations:

  • The prompt itself becomes the environment (no pre-indexing)
  • Sub-calls are generated programmatically, not verbally (no output length bottleneck)
  • Recursion is native to the paradigm (arbitrary depth, arbitrary scale)
  • The model controls decomposition strategy (adaptive to the task)

The Paradigm Shift Case

Here’s why some researchers are calling this the most significant architectural shift of 2026:

It’s model-agnostic. RLMs are a scaffolding approach. You can wrap any LLM — GPT-5, Claude, open-source models — and get the benefits. This means the technique compounds with whatever improvements come to base models.

It’s already being adopted. DSPy added native RLM support in version 3.1.2. Google’s Agent Development Kit (ADK) includes industrial-grade RLM infrastructure with parallelism and cloud storage integration. The ecosystem is moving fast.

It unlocks a new scaling regime. Reasoning models showed that inference-time compute could substitute for training compute. RLMs show that how you use that compute matters as much as how much you use. Symbolic recursion is a qualitatively different capability.

It addresses real production pain. Anyone who’s built AI applications with long documents, code repositories, or extended conversations has hit context rot. RLMs offer a systematic solution, not just a workaround.

The Engineering Challenges

This isn’t all upside. The MIT paper and subsequent discussions have surfaced real challenges:

Cost variance: Average costs are comparable, but tail-end costs can spike if the model enters complex decomposition loops. You need monitoring and budgets.

Stopping conditions: When does more recursion stop helping? Knowing when to stop is an unsolved problem. Current implementations rely on depth limits and heuristics.

Error propagation: Recursive systems can “spread small mistakes across the system.” A bad decomposition early can cascade into garbage at aggregation time.

Latency: Recursion adds round trips. For real-time applications, this may be prohibitive. RLMs are best suited for complex, latency-tolerant tasks.

What This Means for Practitioners

If you’re building AI systems that deal with long context — code repositories, document analysis, research synthesis, extended conversations — RLMs are worth understanding now.

Short-term: Experiment with DSPy’s RLM implementation. It’s a drop-in replacement for standard LLM calls that may dramatically improve quality on long-context tasks.

import dspy

rlm = dspy.RLM('context, question -> answer: str')
result = rlm(context=massive_document, question="What are the key findings?")

Medium-term: Watch Google ADK and similar frameworks. Production-grade RLM infrastructure is coming, with parallelism, cost controls, and observability built in.

Long-term: Consider how recursive architectures change what’s possible. Tasks that were “too long” for LLMs — analyzing entire codebases, synthesizing thousands of documents, maintaining coherent context over days of conversation — may become tractable.

The Timeline

The progression is becoming clear:

  • 2024: Scaling laws saturate, inference-time compute emerges
  • 2025: Reasoning models (o1, R1, Claude thinking) dominate
  • 2026: Recursive Language Models extend the paradigm?

The MIT team is explicit about their ambition: “We think that RLMs trained explicitly to recursively reason are likely to represent the next milestone in general-purpose inference-time scaling after CoT-style reasoning models and ReAct-style agent models.”

Whether 2026 becomes “the year of RLMs” depends on how quickly the engineering challenges get solved and how broadly the frameworks get adopted. But the research is compelling, the implementations exist, and the problem it solves — context rot — is real and getting worse as we push models into longer-horizon tasks.

The wall is here. Someone found a way around it. Time to pay attention.


The Recursive Language Models paper (Zhang, Kraska, Khattab) is available at arXiv:2512.24601. Code is open-sourced at github.com/alexzhang13/rlm. A minimal implementation for experimentation is at github.com/alexzhang13/rlm-minimal.