Context Rot Is Real. Here’s How Practitioners Are Managing It.

TMLS Insights | Week of May 4, 2026

This is the first in a series of practitioner-focused posts by Graham Toppin

Graham is a Co-founder and Analyst at Peerlabs.ai, a subscriber-funded intelligence firm focused on primary research on emerging technology.

No news roundups. No vendor hype. We want to give you concrete advice you can test this week, grounded in what we’re hearing from practitioners and seeing in our own work.


Meta’s Justin Jeffress gave the failure mode a name at All Things AI (April 4, sourced via The Register, which we’re going to treat as conference colour, not survey data): “context rot.”

As agent interactions accumulate across a session, more context competes for the model’s attention window and output quality degrades. You’ve probably seen this: the agent is sharp for the first 20 minutes, then starts contradicting itself, forgetting constraints, or repeating work. This happens in interactive sessions, and in autonomous workflows.

This is not a model problem you can solve by upgrading to a bigger context window. It’s an information management problem. Research is formalizing it (MIT’s Recursive Language Models, LinkedIn’s Cognitive Memory Agent, ICLR 2026’s MemAgents workshop), but the practical solutions available today are simpler.

What you can do now:

Externalize state between sessions. Don’t rely on the context window to be your agent’s memory. Have agents emit structured artifacts — markdown files, JSON state objects, HTML reports — at checkpoints. These become the input for the next session rather than the entire conversation history.

For operational AI: A concrete pattern from Jerry Liu (LlamaIndex): agents write .md and .html artifacts to preserve context. Obsidian serves as a local viewer and search interface across the accumulated artifacts. This separates “what the agent knows” from “what’s in the current context window.”

For product AI: Create formalized, versioned repositories your agents can persist, share and modify. Expressly define workflows for these tasks, with Pre- and Post- commit hooks to manage the flow. This is the agent taxonomy / ontology allowing you to capture the important contexts for your agents.

Compress, don’t concatenate. When you need to carry context forward, summarize rather than append. If your agent has been working for 30 minutes and you’re about to hit a new subtask, ask it to produce a structured summary of decisions made, constraints discovered, and current state – then start a fresh session with that summary as input. This is manual today; some harnesses (e.g. Claude Code’s compaction, Opencode, and others) are starting to automate it. Harnesses like pi allow you to roll your own, and have an extensive plugin community.

Context is the aspirational moat for Frontier Labs. It’s a concrete moat for practitioners. Frontier model providers, like Anthropic, have fairly opaque and changeable contexts which will not be externalized to you without explicitly planning for them. This is a form of vendor lock in you should be aware of. Your entire project context is in ~/.claude, and the structure of the Claude Code context isn’t an agreed upon format or API for you. Providers like Opencode have transparent logging and context sharing; but their codebase can be volatile. Pi allows you to create your own context – logging, transparency, etc. – and provides a basis for not just operational control, but for production workflows.

Monitor context usage. LangChain shipped a Claude Code → LangSmith tracing plugin logging subagents, tool calls, compaction events, and token usage. If you’re running agents at any scale, you need observability into how context is being consumed. Without it, you’re debugging quality degradation blind. There are a lot of (better) alternatives to LangChain’s LangSmith:

  • Langfuse – tracing, prompt management, session replays, eval templates. Free for self-hosting, and not as an after thought.
  • Arize Phoenix – We haven’t used this one ourselves, but it has been recommended to us enough times from credible sources it deserves a mention.
  • OpenLLMetry from TraceLoop – not a platform, an SDK for instrumenting your flows, and sending OpenTelemetry traces to any compatible backend (Langfuse, Datadog, SigNoz, Grafana, etc.) Great for starting out, especially when you don’t want to commit to a platform immediately.

Scope sessions tightly. The practitioners reporting the best results with parallel agents (2-4 concurrent sessions seems to be the practical ceiling for most people) are running each agent on a tightly scoped subtask, not on an open-ended mandate. “Implement the authentication module per this spec” works. “Build the backend” doesn’t.

What’s coming (but isn’t ready yet):

Google’s Gemini Enterprise Agent Platform (April 22) includes Agent Gateway, Agent Observability, and Agent Identity as governance primitives — the first major vendor attempt to address context and state management top-down. Databricks published “Memory Scaling for AI Agents” identifying retrieval quality, not storage, as the main bottleneck. Mem0’s “State of AI Agent Memory 2026” report describes memory as a “first-class architectural component” with its own benchmark suite. Whether any of these mature into production-ready tooling this quarter is an open question.

For now, the advice is simple: treat context as a resource you actively manage, not a bucket you fill until it overflows.


If you’re running open-weight models in production, or if you’ve built internal tooling around agent context management, we’d like to hear from you. We’re collecting practitioner accounts (attributed or anonymous, your choice) for a more systematic write-up.

Reach out at [email protected]

TMLS Insights is produced by the TMLS Steering Committee and Peerlabs. We produce practitioner-focused analysis for the Toronto Machine Learning Society and MLOps World communities. Aspects of our research pipeline use AI; all claims are human-reviewed and sourced.

Table of Contents
Share This Post