Recursive Language Models for Giant Session Trace Analysis
(and a minimal implementation)
Most “agent postmortems” fail because the trace is too large to fit in context. Recursive Language Models (RLMs) take a different approach: keep the trace as a variable outside the prompt, and let the model write code to search, slice, and summarize it—iteratively—until it can produce a coherent narrative.
The core idea
Instead of doing this:
# bad: stuff everything into the prompt
completion(query="Summarize this", prompt=HUGE_DOCUMENT)
RLMs do something closer to:
rlm = RLM(model="gpt-5-mini")
result = rlm.completion(
    query="Summarize this",
    context=huge_document  # stored as a variable / object, not pasted into the prompt
)
The model doesn’t receive the whole document. It receives a protocol and a tool: it can emit Python code to inspect `context`, execute it, see the result, and repeat.
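The inspect-execute-repeat loop described above can be sketched in a few lines. This is a hypothetical toy, not the actual protocol of either reference implementation: `model_step`, `_out`, and the scripted stand-in model are all illustrative names, and real engines sandbox the `exec` call.

```python
# Minimal sketch of an RLM inner loop (illustrative, not the real protocol).
# The model sees only the query plus the outputs of code it has run;
# `context` lives in the REPL namespace, never in the prompt.

def rlm_loop(query, context, model_step, max_steps=8):
    """Run model-emitted code against a namespace until FINAL is set."""
    namespace = {"context": context, "FINAL": None}
    history = []  # (code, output) pairs fed back to the model each step
    for _ in range(max_steps):
        code = model_step(query, history)   # model emits a Python snippet
        try:
            exec(code, namespace)           # real engines sandbox this
            output = namespace.get("_out")  # convention: snippets set _out
        except Exception as exc:
            output = f"error: {exc}"
        history.append((code, output))
        if namespace["FINAL"] is not None:
            return namespace["FINAL"]
    return namespace["FINAL"]

# A scripted stand-in for the model: peek at the context, then answer.
def scripted_model(query, history):
    if not history:
        return "_out = len(context)"          # step 1: inspect size
    return "FINAL = f'doc has {_out} chars'"  # step 2: commit an answer

result = rlm_loop("Summarize this", "x" * 10_000, scripted_model)
# result == "doc has 10000 chars"
```

The key design point is that `context` can be arbitrarily large: only the small outputs the model chooses to compute ever flow back into its prompt.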
Reference implementations
- alexzhang13/rlm — a larger RLM engine with pluggable providers and sandbox environments.
- ysz/recursive-llm — a minimal reference implementation built around a restricted REPL.
Background reading: blog post and arXiv preprint.
A minimal implementation for “gigantic session” analysis
We built a small prototype that treats a Clawdbot session transcript (JSONL) as the RLM context. The model writes code that calls helper functions like `search()`, `window()`, and `detect_failures()` to navigate the trace. When it’s ready, it sets `FINAL` to a structured analysis.
/home/debian/clawd/home/rlm-session-analyzer (CLI: rlm-analyze)
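To make the helper names above concrete, here is one plausible shape for them. The names (`search`, `window`, `detect_failures`, and a JSONL loader) come from the text; the bodies are illustrative sketches, not the prototype’s actual implementation.

```python
# Hypothetical shapes of the helpers exposed to the model over a JSONL trace.
import json
import re

def load_trace(lines):
    """Parse JSONL lines into a list of event dicts."""
    return [json.loads(line) for line in lines if line.strip()]

def search(trace, pattern):
    """Return (index, event) pairs whose serialized form matches the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(i, ev) for i, ev in enumerate(trace) if rx.search(json.dumps(ev))]

def window(trace, center, radius=3):
    """Slice the trace around one event index for local context."""
    return trace[max(0, center - radius): center + radius + 1]

def detect_failures(trace):
    """Flag events that look like errors (crude illustrative heuristic)."""
    return search(trace, r"error|traceback|timeout|killed")

lines = [
    '{"role": "assistant", "text": "running tests"}',
    '{"role": "tool", "text": "Traceback: ValueError"}',
    '{"role": "assistant", "text": "fixing the bug"}',
]
trace = load_trace(lines)
hits = detect_failures(trace)  # matches the Traceback event at index 1
```

Because the model only ever sees the (small) return values of these calls, the trace itself can be far larger than any context window.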
How to run it
cd /home/debian/clawd/home/rlm-session-analyzer
pip install -e .
# Run with an OpenAI-compatible endpoint
export OPENAI_API_KEY=...
export OPENAI_MODEL=gpt-5-mini
rlm-analyze /path/to/session.jsonl \
  --objective "Reconstruct phases/branches/failures of creating a research paper" \
  --out analysis.json
There’s also a no-LLM mode where you provide a deterministic program:
rlm-analyze /path/to/session.jsonl \
  --llm none \
  --program examples/paper_program.py
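A deterministic program for the no-LLM mode might look like the sketch below. This assumes the runner executes the file with the parsed `trace` in scope and reads `FINAL` back out; the real `examples/paper_program.py` may differ.

```python
# Sketch of a deterministic --program file (assumed interface: the runner
# injects `trace` as a list of event dicts and collects `FINAL` afterwards).
import json

def classify(event):
    """Bucket an event by keyword (illustrative heuristic)."""
    text = json.dumps(event).lower()
    if "error" in text or "traceback" in text:
        return "failure"
    if "git" in text or "commit" in text:
        return "checkpoint"
    return "work"

# Normally injected by the runner; stubbed here so the file runs standalone.
trace = globals().get("trace", [{"text": "Traceback"}, {"text": "git commit"}])

counts = {}
for ev in trace:
    kind = classify(ev)
    counts[kind] = counts.get(kind, 0) + 1

FINAL = {"phases": [], "event_counts": counts}
```

A deterministic program like this is useful for regression-testing the analyzer itself: the same trace always yields the same `FINAL`.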
What we want next
- Better on-disk indexing for huge traces (avoid loading everything into memory).
- More domain-specific detectors: “compile errors”, “dataset missing”, “timeout kill”, “bad assumptions”.
- Structured outputs: phases, branches, and counterfactual suggestions (“what should have happened”).
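For the first wishlist item, one simple approach is to index the byte offset of every JSONL line once, then seek to individual events on demand. This is a sketch of that idea, not part of the current prototype.

```python
# Byte-offset index for a JSONL trace: scan once, then seek to any event
# without holding the file in memory. Sketch only; not in the prototype.
import json
import os
import tempfile

def build_offset_index(path):
    """Record the starting byte offset of every line in a JSONL file."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_event(path, offsets, i):
    """Seek directly to event i and parse only that line."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline())

# Demo on a small temporary file.
path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")
with open(path, "w") as f:
    for n in range(1000):
        f.write(json.dumps({"n": n}) + "\n")

offsets = build_offset_index(path)
event = read_event(path, offsets, 500)  # parses just one line
```

The index itself is small (one integer per event), so even multi-gigabyte traces stay cheap to navigate.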
Internal references (Phorge/Phriction): codebases/rlm, codebases/recursive-llm, RLM blog (archived), RLM arXiv (archived).