Prompts can now be virtually infinite in length. This is possible thanks to a new approach called Recursive Language Models (RLM), which enables any language model to handle an unlimited context size. You no longer need to worry about your prompt size exceeding limits or performance deteriorating as your prompts get longer. This article explores how this open-source solution works and how you can put it into action.
The Core Challenges with Large Prompts
Before diving into the code and implementation, it’s crucial to understand the fundamental challenges that RLM addresses.
- Context Size Limitation: Language models have a fixed limit on their context size. For instance, a model like GPT-5 might have a maximum context window of 260,000 tokens. If your use case requires a longer prompt, the model simply cannot handle it. RLM provides a way to make this context length near-infinite.
- Performance Degradation: It’s a well-documented phenomenon that as prompts get longer, the performance of most language models declines. To maintain high performance, it’s often better to keep prompts smaller. However, this isn’t always feasible. When a long prompt is necessary, you typically have to accept a loss in performance.
A notable chart illustrates this issue clearly. When using a standard GPT-5 model, performance scores on reasoning and coding tasks decline as the prompt length increases, with a significant drop observed after the 33k token mark. In contrast, the same model augmented with the RLM approach not only handles prompts exceeding 1 million tokens but also maintains a stable, high-performance score without any degradation.
This means two primary challenges are solved:
- Prompts can be significantly longer.
- There is no deterioration in performance as prompts grow.
Introducing Recursive Language Models (RLM)
So, what exactly is RLM, and how does it enable this functionality on a model like GPT-5? The core idea of the Recursive Language Model approach is to treat the long prompt as part of an environment.
Think of it like environment variables in a Python script. The RLM approach loads the entire input prompt as a variable within a Python-like REPL environment. The model then writes code to inspect, compose, and recursively invoke itself on programmatic snippets of that variable.
In simple terms, the model interacts with the prompt through code instead of reading it all at once. The model generates code to parse, extract, or merge specific parts of the prompt. By having this code-based capability, the full text of a long prompt never needs to be loaded directly into the model’s limited context window.
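The environment-variable analogy can be sketched in a few lines of plain Python. This is a toy illustration of the core idea only; the variable and function names here are hypothetical, not the paper's actual API:

```python
# Toy illustration: the long prompt lives in the environment as a plain
# variable, and model-generated code inspects it instead of reading it whole.
context = "chapter 1 ...\n" * 1000  # stands in for a multi-million-token prompt

preview = context[:80]             # peek at the beginning
size = len(context)                # measure the whole thing
chunks = context.split("chapter")  # carve it into LLM-sized pieces

print(size, len(chunks))  # prints: 14000 1001
```

The model's context window only ever has to hold the short code it writes and the small outputs it prints, never `context` itself.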
How Does RLM Work? An Example
Let’s consider an initial prompt that is extremely long, such as one containing an entire book. The query might be: “You are reading an extremely long book. Can you list all items that were made before a certain grade level?”
The RLM approach doesn’t feed this entire prompt to the language model. Instead, it provides a system prompt that instructs the model on how to proceed. It informs the LLM that a long prompt exists and that it can interact with it through code.
Given the query, the language model might generate the following Python-like code:
# The LLM decides to split the prompt by the "chapter" keyword
chapters = context.split("chapter")

# It then processes each chapter individually
results = []
for chapter_text in chapters:
    # A recursive call to the LLM with a smaller chunk of the prompt
    sub_result = LLM_query(
        prompt=chapter_text,
        query="List all items made before the grade level."
    )
    results.append(sub_result)

# Finally, merge the results from all sub-calls
final_answer = merge_results(results)
print(final_answer)
In this scenario, the model intelligently decided to split the prompt into smaller parts based on the keyword “chapter.” It then calls the LLM separately on each chapter, so each call sees only a fraction of the original prompt. The final answer is constructed by merging the outputs from these multiple sub-calls.
This agentic approach allows for parsing the prompt using code. If a language model has a 10k output token limit, it can now produce millions of output tokens by merging the results of these sub-calls. The model itself determines the depth of recursion needed to answer the query.
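The split-then-merge pattern described above can be sketched as follows. This is a minimal sketch, not the library's implementation: `llm_query` is a stub standing in for a real recursive model call, and the chunk size is an arbitrary choice:

```python
def llm_query(prompt: str, query: str) -> str:
    # Stub: a real implementation would call the language model here.
    return f"partial answer from a {len(prompt)}-char chunk"

def answer_long_prompt(context: str, query: str, max_chunk: int = 20_000) -> str:
    # Split the oversized prompt into chunks that each fit the model's window
    chunks = [context[i:i + max_chunk] for i in range(0, len(context), max_chunk)]
    # One recursive sub-call per chunk
    partials = [llm_query(chunk, query) for chunk in chunks]
    # Merging the partial outputs lets the total result exceed any
    # single call's output-token limit
    return "\n".join(partials)

print(answer_long_prompt("x" * 50_000, "List all items."))
```

With a 50,000-character input and 20,000-character chunks, this produces three sub-calls whose outputs are concatenated into one answer.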
Performance and Cost Analysis
How does the performance of RLM compare to not using it? Benchmarks from the original paper, which tested models like GPT-5 and Qwen, provide a clear answer.
The models were tested on various tasks, including code generation, question answering, and browsing, with input prompt sizes ranging from 23,000 to over 11 million tokens. Standard models like GPT-5 or Qwen cannot handle such large prompts on their own.
The results consistently show that the RLM-augmented models achieve the highest performance scores across almost all tasks.
A Note on Cost: The cost can vary. In some cases, RLM can be cheaper or comparable to the base model. However, in other scenarios, it can be more expensive due to the recursive sub-calls. The cost truly depends on the intelligence of the underlying model. A smarter model might solve a query with just two or three sub-calls, whereas a less capable one might require many more. This is a key caveat: be cautious about the potential cost, especially with complex and deeply recursive tasks.
Despite this, the two main benefits—unlimited context size and no performance degradation—are the primary gains from this solution.
The “Magic” Behind RLM: The System Prompt
The critical question is: how does the LLM know to interact with the prompt via code without seeing the full text?
The key is the system prompt (or meta-prompt). Instead of sending the user’s long prompt directly, the RLM approach uses a default system prompt for all its calls. This prompt instructs the LLM on how to behave.
It might contain instructions like:
“You are tasked with answering a query with an associated context. You can access, transform, and analyze this context interactively in a REPL environment.”
The prompt then specifies the environment variables available to the model:
- context: A variable containing the extremely long and important information.
- LLM_query(prompt, query): A function that allows you to query an LLM.
- print(): A function to view the output of your REPL code.
By defining this environment, the language model can use an agentic approach to programmatically interact with the prompt. The system prompt also includes examples (meta-learning) to ensure the model fully understands how to operate in this environment.
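Put together, such a meta-prompt might look roughly like the string below. This is an illustrative paraphrase assembled from the description above; the exact wording shipped in the RLM repository differs:

```python
META_PROMPT = """\
You are tasked with answering a query with an associated context.
You can access, transform, and analyze this context interactively
in a REPL environment.

Your environment provides:
- context: a variable containing the extremely long input prompt
- LLM_query(prompt, query): a function to recursively query an LLM
- print(...): a function to view the output of your REPL code

Write code to inspect `context` and work toward the answer.
"""
print(META_PROMPT)
```

In practice this system prompt would also carry a few worked examples, which is the meta-learning part mentioned above.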
A Deeper Dive: Multi-Hop Query Example
To fully grasp the approach, let’s walk through a more complex example from the paper.
The Task: The agent must find the answer to a multi-hop query given a corpus of 100 unique documents, totaling 8 million tokens. The underlying model is GPT-5, whose context window of roughly 260,000 tokens is far too small to hold the corpus.
The documents contain information about a specific dish, its culture, roots, and ingredients.
Step 1: Initial Probing with Code
GPT-5 is given the meta-prompt, not the 8 million tokens. It first decides to probe the 100-document list with regex queries. It generates code to filter the prompt for specific keywords like “beauty pageant” and “festivals,” which are relevant to the user’s query.
# Example of code the LLM might generate
import re

# Locate keyword matches within the massive context
match_1 = re.search(r"beauty pageant", context)
match_2 = re.search(r"festivals", context)

# Extract a small window of text around each match as a candidate chunk
chunk_1 = context[max(0, match_1.start() - 2000) : match_1.end() + 2000]
chunk_2 = context[max(0, match_2.start() - 2000) : match_2.end() + 2000]
# ... and so on, creating smaller, relevant chunks
Step 2: Recursive Calls on Interesting Chunks
After running its regex queries, the root LLM finds an interesting snippet in a chunk at index 6. It then launches recursive RLM calls over this smaller snippet to look for information relevant to the original query. The RLM is able to store the intermediate result in a variable, answer_six, and also print it for the root LLM to see. The sub-LLM call finds that the answer is likely “Maria del Makio” and stores this information back into the root language model’s environment.
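A hypothetical continuation of the generated code for this step might look like the following. `LLM_query` is stubbed here so the sketch is self-contained; in the real REPL environment it triggers an actual recursive model call:

```python
def LLM_query(prompt: str, query: str) -> str:
    # Stub standing in for the real recursive LLM call.
    return "candidate answer found in this chunk"

# Pretend these chunks came out of the Step 1 regex probing
chunks = ["...irrelevant text..."] * 6 + ["...the interesting snippet..."]

# Recurse on the interesting chunk and keep the result in the environment
answer_six = LLM_query(prompt=chunks[6], query="Who does the query ask about?")
print(answer_six)  # the root LLM reads this printed output on its next turn
```

Because `answer_six` persists as a variable, the root model can reuse it in later code without re-reading the chunk.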
Step 3: Final Answer Synthesis
After reviewing the information from the sub-call, the root LLM reasons that it has enough information to answer the query. To be certain, it chooses to check the answer again with two additional recursive LLM calls for confirmation. Finally, the root LLM returns its final, correct answer.
This example demonstrates how the model generates multiple pieces of code to parse the prompt repeatedly, allowing it to arrive at the final answer regardless of the initial prompt’s size. And because each chunk was small (e.g., 10k–20k tokens), the model operated well within the sweet spot of its context window, thus maintaining peak performance.
Getting Started with the RLM Open-Source Project
The great news is that you don’t have to build this system from scratch. The researchers have released an open-source GitHub repository for RLM.
You can quickly install the package and get started.
Installation:
pip install rlm
Quick Start: Here is a simple Python example to get you started. You’ll need an OpenAI key.
import rlm

# Initialize the backend with your OpenAI key
# As of now, only OpenAI is supported
rlm.init_backend(
    "openai",
    api_key="YOUR_OPENAI_API_KEY"
)

# Define your model and a very long prompt
model_name = "gpt-5-nano"  # Example model
long_prompt = "..."  # Your prompt that can be millions of tokens long
query = "Your question about the long prompt."

# Run the RLM query
response = rlm.query(
    model_name=model_name,
    prompt=long_prompt,
    query=query
)
print(response)
The recursive approach is handled automatically on the backend as you use the rlm.query function.
Additionally, the project provides a nice option to visualize the traces of the recursive calls. You can enable this by initializing the visualizer, which will generate a JSON file that you can inspect to see how the RLM approach is recursively calling the language model to produce the final answer.