Context as Compute: A Memory Hierarchy for Long-Horizon Agents
The reflex of the long-horizon agent builder is to ask for a longer context window, as though attention were free storage and the only problem were capacity. We argue this is the wrong default, and we make the argument quantitative. We model agent memory as a four-level hierarchy - resident context, key-value cache, retrieval store, cold archive - with capacity and access cost both rising as we descend, and we establish five results. First, with key-value-cached decoding every resident token is re-read at every subsequent step, so holding a set resident over a horizon costs , a recurring charge rather than a one-time price (Proposition 1). Second, the offline evict-farthest-future rule (Belady-MIN) minimizes misses (Proposition 2), and its online surrogate LRU is -competitive (Proposition 3). Third, choosing what to keep resident under a token budget is a 0/1 knapsack, solved optimally by greedy utility-density for divisible items and within a factor of two otherwise (Theorem 1). Fourth, effective access time falls linearly in hit rate, so prefetch that raises the hit rate pays off directly (Proposition 4). Each result is validated by a reproducible simulation, included and linked. The practical thesis follows: the context window is not a scratchpad. It is a register file, and it must be managed like one.
Ask any team running an agent past the hour mark what is wrong and you will hear the same wish: if only the context were longer. It is the wrong wish, and the reason it is wrong is the whole subject of this essay.
The previous essay argued that an autonomous agent is a process that outgrew its runtime, and that the abstractions it lacks are the ones operating systems formalized decades ago: scheduling, protection, durability, and a managed memory hierarchy. This essay takes the last of those and develops it on its own terms. The claim is sharp. The context window should be treated as the top level of a memory hierarchy - the register file - not as an infinite scratchpad onto which we append forever. Everything that follows is the consequence of taking that analogy seriously enough to quantify.
Hardware designers never had the luxury of pretending storage was free and uniform. They built hierarchies precisely because the fast tier is small and dear and the cheap tier is vast and slow, and the art is deciding what lives where.[10] The agent builder, handed a context window that feels capacious, has been pretending otherwise. The pretense holds in a demo for the same reason an unmanaged buffer holds on a small input: nothing has yet contended for the resource. At horizon, it always does.
Contributions. We (i) make explicit a four-tier memory hierarchy for agents and a model of items with utility, size, and a reference stream; (ii) prove that residency under key-value-cached decoding is a recurring cost proportional to the remaining horizon (Proposition 1); (iii) place agentic eviction on its classical footing - Belady optimality (Proposition 2) and LRU competitiveness (Proposition 3) - and show residency under a budget is a knapsack solved well by greedy utility-density (Theorem 1); (iv) give the effective-access-time argument that prefetch pays off linearly in hit rate (Proposition 4); and (v) validate every claim with a seeded simulation whose code is included and runnable.
The hierarchy, made explicit
Lay the agent’s memory out as four tiers. At the top, L0, is the resident context: the tokens actually present in the prompt the model attends to this step. It is tiny - on the order of tokens - the fastest to reach, and by far the most expensive per unit held, for reasons we make precise below. Beneath it, L1, is the key-value cache: the materialized attention state of tokens already processed, cheap to reuse on the next step but still resident in accelerator memory.1 Below that, L2, is the retrieval store - a vector index or document database holding perhaps items, reachable in microseconds-to-milliseconds by similarity search and folded back into L0 on demand.[6][7] At the bottom, L3, is cold archive: object storage of effectively unbounded size, where access is slow and deliberate.
The shape is the familiar one. Descending the hierarchy, capacity grows by orders of magnitude, latency-per-access grows, and the per-item cost of holding something falls. The inversion that trips people up is at the very top: L0 is the cheapest tier to read from but the most expensive tier to keep things in, because keeping a token resident is not a storage decision - it is a recurring compute decision, levied every step the token survives. Figure 1 lays out the tiers; the rest of the essay is the discipline that governs movement between them.
A model of agent memory
Fix the objects we will reason about. There is a universe of items - tokens, spans, tool results, retrieved documents - each item carrying a size in tokens and an estimated utility , its expected contribution to the next decisions. The agent emits a reference stream - the sequence of items it actually touches step by step - against which any residency policy is judged. L0 has a token budget ; the manager decides, each step, which items occupy it.
The working set is Denning’s, devised for exactly this question in virtual memory.[1] Its content, transplanted to agents, is that programs - and plans - exhibit locality of reference: at any moment a task touches a small, coherent subset of all the information it has ever seen. The current sub-goal, the file under edit, the last few tool results, the relevant constraints. Residency should track this working set, not the full history. Keeping the entire transcript resident is the agentic equivalent of pinning every page a program has ever touched: it confuses has been referenced with is being referenced. The window is the one knob - too short and you thrash; too long and the working set bloats back toward the whole history.
Attention is not free storage
The entire argument rests on one cost fact, so let us state it carefully. Self-attention over a sequence of length with model width forms all pairwise interactions, so its compute scales quadratically in the sequence length and linearly in the width:[4]
A naive reader concludes that the cost of a long context is paid once, at the prefill that processes the prompt. That is false for an agent, which generates autoregressively over a long horizon, and making the falsity precise is the first formal result.
This is the crux. Residency is not a one-time storage charge; it is a recurring charge whose magnitude is proportional to the horizon still ahead. A token you admit early and never evict is the most expensive thing in your system, because you pay for it on every one of the thousands of steps it outlives. IO-aware kernels such as FlashAttention lower the constant by avoiding materialization of the attention matrix,[5] but they do not change the recurrence: more resident tokens means more memory traffic per step, every step. This is why “just use a longer context” is the wrong default. It mistakes a recurring liability for a free amenity, and it scales the liability with exactly the quantity - horizon - that long-running agents have the most of.
Eviction: approximating the offline optimum
If residency is costly, the manager must decide what leaves when the working set will not fit the budget. The theoretical benchmark is Belady’s rule.
Belady’s rule is an offline optimum: it requires knowing the future reference string, which no online agent has. It is the oracle we measure ourselves against, not a policy we can run. Online, we approximate. The recency heuristic - LRU, evict the least-recently-used - assumes the recent past predicts the near future, which is exactly the locality assumption the working set rests on, and its theoretical standing is reassuring.
The bound is worst-case and loose on the locality-rich streams agents actually produce, where LRU runs much closer to the optimum than the factor of suggests. Frequency heuristics (LFU) trade recency for popularity and fare worse when the working set drifts. The agentic upgrade is to replace the heuristic with a learned relevance predictor: a small model that scores each resident item by its probability of being referenced again soon, given the current goal. This is Belady with an estimator standing in for the oracle - the closer the estimator, the closer we approach the offline optimum.[8] One subtlety worth respecting: not every resident token is a candidate for eviction. Certain anchor tokens act as attention sinks the model relies on, and evicting them degrades the residual computation badly; a competent manager pins them.[9]
Figure 2 makes the hierarchy of policies concrete on a synthetic reference trace with temporal locality - a slowly drifting working set plus a Zipfian tail of occasional far accesses. Belady-MIN is the offline ceiling; the utility-scored policy (a recency-and-frequency score, the practical stand-in for a learned predictor) sits just beneath it; LRU is competitive and converges to the scored policy as the budget grows; LFU, blind to drift, trails badly.
Residency as constrained optimization
We can be more precise than “evict the least useful.” At each step the manager faces a budget of resident tokens, and each candidate item has a size and utility from the relevance predictor. Choosing which items to keep resident is a knapsack:
The knapsack framing pays off immediately. We do not need an exact solver; the greedy rule that sorts by utility density and admits until the budget is exhausted is both cheap and near-optimal when a budget of tokens holds many small items. So the manager’s admission and eviction collapse to one rule: keep the items with the highest utility per token, drop the rest to L2. This is the correct generalization of LRU. LRU ranks by recency alone; utility density ranks by predicted value and charges for the space consumed, which is what we actually care about when the space is the expensive resource.
The manager’s loop is compact. A sketch:
# A memory manager over the L0 budget B (tokens).
# Admission and eviction are one rule: keep highest utility-density.
class MemoryManager:
def __init__(self, budget, predictor, store):
self.B = budget # L0 token budget
self.predict = predictor # learned relevance: item -> utility
self.store = store # L2 retrieval tier
self.resident = [] # current working set in L0
def admit(self, item, goal):
item.u = self.predict(item, goal) # utility given current goal
item.density = item.u / item.size # value per token
self.resident.append(item)
self._enforce_budget()
def _enforce_budget(self):
# greedy knapsack: drop lowest density until under B
used = sum(i.size for i in self.resident)
while used > self.B:
victim = min(self.resident, key=lambda i: i.density)
if victim.pinned: # attention sinks stay resident
break
self.store.demote(self.summarize(victim)) # evict to L2, lossily
self.resident.remove(victim)
used -= victim.size
def step(self, goal):
# prefetch likely-needed items from L2 before they are demanded
for cand in self.store.nearest(goal, k=8):
if self.predict(cand, goal) > self.threshold:
self.admit(cand, goal) # raise the hit rate h
A memory manager. Admission scores utility, the budget is enforced by greedy eviction on utility density, and step() prefetches predicted references from L2 to keep the working set resident before it is demanded.
The cost of context, and the value of prefetch
Proposition 1 turned residency into a per-step cost proportional to resident tokens; the budget converts that cost into a quality tradeoff. Holding everything resident maximizes the fraction of needed items present (quality one) but pays the full ever-growing residency bill every step. Holding only a managed working set caps the per-step cost at the budget while serving most references from L0. The two strategies trace a quality-per-cost frontier, and the managed working set dominates across the useful range.
The hierarchy only earns its keep if the fast tier serves most requests. Borrow the architect’s figure of merit, effective access time. If a fraction of references are served from the fast tier at cost and the rest miss down to the slow tier at cost , the average is:[10]
Because dwarfs - a retrieval round-trip costs a step of latency and a chunk of the budget against the desired answer - effective access time is dominated by the miss rate . Raising is the highest-leverage thing the manager does, and the lever is prefetch: predict the next references and pull them from L2 into L0 before they are needed, overlapping the retrieval latency with current work. This is the same bet a CPU prefetcher makes, and it succeeds for the same reason - reference streams are predictable when locality holds. An agent about to enter a sub-goal can prefetch that sub-goal’s artifacts; an agent editing a file can prefetch its neighbors. Retrieval-augmented generation, seen through this lens, is not a bolt-on knowledge feature - it is demand paging for the context window, and prefetch is what turns demand paging from a stall into a pipeline.[6]
Reproducibility
Both simulation figures are produced by a single seeded script with no inputs beyond its parameters and no network access. The reference trace is synthetic but structured to have the locality the argument assumes - a slowly drifting working set with a Zipfian tail - so that the policy ordering it induces is the one the theory predicts. The full script is available here; the core of the eviction experiment, including the offline Belady oracle, is:
# offline optimum: evict the item whose next use is farthest in future
def hit_rate_belady(refs, B):
n = len(refs)
nxt = np.full(n, n, dtype=np.int64) # next-use index per position
seen = {}
for t in range(n - 1, -1, -1):
nxt[t] = seen.get(refs[t], n)
seen[refs[t]] = t
cache, next_use, hits = set(), {}, 0
for t, r in enumerate(refs):
if r in cache:
hits += 1
elif len(cache) >= B:
victim = max(cache, key=lambda x: next_use[x]) # farthest future
cache.discard(victim)
cache.add(r); next_use[r] = nxt[t]
return hits / len(refs) # the ceiling in Figure 2
The Belady-MIN oracle (excerpt). Running the full script regenerates both figures deterministically.
Where the analogy bends
No analogy is free, and this one bends in an instructive place. RAM supports true random write: you can overwrite address in place and the old value is gone, cleanly. The context window does not. It is append-mostly; there is no operation that surgically replaces a stale fact in the middle of the transcript without re-emitting the surrounding tokens. “Writing” to agent memory therefore means re-summarization or compaction - collapsing a span of tokens into a shorter, denser representation - and that operation is lossy.2 The manager does not get to move bytes around for free; every eviction that summarizes is a small, irreversible commitment about what mattered. This raises the stakes on the relevance predictor: a bad eviction in RAM costs a cache miss and a reload; a bad eviction here can lose information that cannot be reconstructed from L2 because it was never stored verbatim. The working-set discipline is what keeps this honest - if residency tracks reference, the items we summarize are by construction the ones least likely to be needed verbatim again.
Related work
The working-set model and the locality it formalizes are Denning’s.[1] Belady’s study of replacement algorithms gives both the offline optimum and the namesake anomaly,[2] and Sleator and Tarjan placed online paging on a competitive-analysis footing.[3] The architectural figures of merit - hierarchy, effective access time, prefetch - are textbook.[10] On the model side, the quadratic attention cost is Vaswani et al.,[4] and the IO-aware kernels that lower its constant are Dao et al.[5] Retrieval-augmented generation[6] and retrieval-during-training[7] are, in our framing, demand paging and prefetch for L2; MemGPT[8] argues the operating-system analogy for LLM memory directly; and StreamingLLM[9] identifies the attention sinks a manager must pin. Our contribution is not a new mechanism but the unification: that residency is a recurring cost (Proposition 1), that eviction inherits Belady optimality and LRU competitiveness (Propositions 2-3), that residency under a token budget is precisely 0/1 knapsack with a greedy near-optimum (Theorem 1), and that prefetch’s payoff is the linear effective-access-time law (Proposition 4) - each validated on a controlled trace.
Limitations and threats to validity
The model is deliberately minimal, and its assumptions are where it can mislead. The traces are synthetic. We construct a reference stream with the locality the theory needs; real agent memory traces may have weaker or differently-shaped locality, and the policy ordering, while robust here, is only validated against this family. Utilities are assumed known. Theorem 1 takes as given, but estimating an item’s expected contribution to future decisions is itself hard, and a poor estimator turns the greedy optimum into greedy nonsense. Relevance prediction is the open problem. The whole eviction-and-prefetch apparatus reduces to predicting the near-future reference string; that prediction is exactly what is difficult, and our learned-predictor stand-in is a recency-frequency score, not a trained model. The cost model is simplified. Proposition 1 counts memory traffic as proportional to resident tokens and ignores batching, paging, and kernel-level constants; the recurrence it captures is real, but the constant is not measured. The simulations are of the model, not of language models. They confirm the mathematics is faithful to its own assumptions; they do not substitute for measuring hit rates, utilities, and miss costs on a real long-horizon agent, which is the natural next step and the place where these predictions should be falsified.
The manager is a kernel subsystem
Step back to the substrate. The memory manager developed here is not a standalone trick; it is one subsystem of the kernel the previous essay sketched, sitting beside the scheduler, the capability broker, and the durable log. That placement matters. Residency decisions interact with scheduling - a task about to be preempted should have its working set demoted, not pinned - and with durability, since the cold archive and the checkpoint log are the same kind of object. Treating context as compute, rather than as storage, is what lets the kernel reason about all of these together: every token is a recurring cost, every eviction is a bet, every prefetch is a scheduled IO.
But a managed memory hierarchy only guarantees that the right information is present at the right price. It says nothing about whether the agent then does the right thing with it - whether a step that succeeds is a step that is correct. The relevance predictor can be wrong; the summary can be confidently false; the prefetch can surface a plausible distractor. Managing memory well buys us the conditions for reliability without delivering reliability itself. The next essay takes up that problem directly: how a long-horizon agent earns trust through verification - turning a sequence of fallible steps into a trajectory whose correctness can be checked, bounded, and, when it fails, caught before it compounds.
Footnotes
- The KV cache stores, for every token already processed, the key and value vectors it contributes to attention, so they need not be recomputed when generating later tokens. It is what makes autoregressive decoding linear rather than quadratic per step - but it is also why a resident token carries a recurring cost: its cached K/V are read at every subsequent decode step, so the bill is paid once per remaining step, not once at admission. ↩
- This is where the RAM analogy genuinely breaks. RAM offers true random write - overwrite in place, old value gone. Context is append-mostly: there is no in-place edit of a token deep in the sequence. “Writing” therefore means re-summarization or compaction, which is lossy and irreversible. So unlike a hardware cache, demotion from L0 can destroy information rather than merely relocate it. ↩
References
- P. J. Denning. The working set model for program behavior. Communications of the ACM, 11(5), 1968.
- L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2), 1966.
- D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2), 1985.
- A. Vaswani et al. Attention is all you need. NeurIPS, 2017.
- T. Dao et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. NeurIPS, 2022.
- P. Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 2020.
- S. Borgeaud et al. Improving language models by retrieving from trillions of tokens. ICML, 2022.
- C. Packer et al. MemGPT: Towards LLMs as operating systems. 2023.
- G. Xiao et al. Efficient streaming language models with attention sinks. ICLR, 2024.
- J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 6th ed., 2017.