Context as Compute: A Memory Hierarchy for Long-Horizon Agents

Abstract

The reflex of the long-horizon agent builder is to ask for a longer context window, as though attention were free storage and the only problem were capacity. We argue this is the wrong default, and we make the argument quantitative. We model agent memory as a four-level hierarchy - resident context, key-value cache, retrieval store, cold archive - with capacity and access cost both rising as we descend, and we establish five results. First, with key-value-cached decoding every resident token is re-read at every subsequent step, so holding a set resident over a horizon $T$ costs $Θ (| S | T d)$ , a recurring charge rather than a one-time price (Proposition 1). Second, the offline evict-farthest-future rule (Belady-MIN) minimizes misses (Proposition 2), and its online surrogate LRU is $k$ -competitive (Proposition 3). Third, choosing what to keep resident under a token budget is a 0/1 knapsack, solved optimally by greedy utility-density for divisible items and within a factor of two otherwise (Theorem 1). Fourth, effective access time falls linearly in hit rate, so prefetch that raises the hit rate pays off directly (Proposition 4). Each result is validated by a reproducible simulation, included and linked. The practical thesis follows: the context window is not a scratchpad. It is a register file, and it must be managed like one.

Ask any team running an agent past the hour mark what is wrong and you will hear the same wish: if only the context were longer. It is the wrong wish, and the reason it is wrong is the whole subject of this essay.

The previous essay argued that an autonomous agent is a process that outgrew its runtime, and that the abstractions it lacks are the ones operating systems formalized decades ago: scheduling, protection, durability, and a managed memory hierarchy. This essay takes the last of those and develops it on its own terms. The claim is sharp. The context window should be treated as the top level of a memory hierarchy - the register file - not as an infinite scratchpad onto which we append forever. Everything that follows is the consequence of taking that analogy seriously enough to quantify.

Hardware designers never had the luxury of pretending storage was free and uniform. They built hierarchies precisely because the fast tier is small and dear and the cheap tier is vast and slow, and the art is deciding what lives where.[10] The agent builder, handed a context window that feels capacious, has been pretending otherwise. The pretense holds in a demo for the same reason an unmanaged buffer holds on a small input: nothing has yet contended for the resource. At horizon, it always does.

Contributions. We (i) make explicit a four-tier memory hierarchy for agents and a model of items with utility, size, and a reference stream; (ii) prove that residency under key-value-cached decoding is a recurring cost proportional to the remaining horizon (Proposition 1); (iii) place agentic eviction on its classical footing - Belady optimality (Proposition 2) and LRU competitiveness (Proposition 3) - and show residency under a budget is a knapsack solved well by greedy utility-density (Theorem 1); (iv) give the effective-access-time argument that prefetch pays off linearly in hit rate (Proposition 4); and (v) validate every claim with a seeded simulation whose code is included and runnable.

The hierarchy, made explicit

Lay the agent’s memory out as four tiers. At the top, L0, is the resident context: the tokens actually present in the prompt the model attends to this step. It is tiny - on the order of $10^{4}$ tokens - the fastest to reach, and by far the most expensive per unit held, for reasons we make precise below. Beneath it, L1, is the key-value cache: the materialized attention state of tokens already processed, cheap to reuse on the next step but still resident in accelerator memory.¹ Below that, L2, is the retrieval store - a vector index or document database holding perhaps $10^{9}$ items, reachable in microseconds-to-milliseconds by similarity search and folded back into L0 on demand.[6][7] At the bottom, L3, is cold archive: object storage of effectively unbounded size, where access is slow and deliberate.

The shape is the familiar one. Descending the hierarchy, capacity grows by orders of magnitude, latency-per-access grows, and the per-item cost of holding something falls. The inversion that trips people up is at the very top: L0 is the cheapest tier to read from but the most expensive tier to keep things in, because keeping a token resident is not a storage decision - it is a recurring compute decision, levied every step the token survives. Figure 1 lays out the tiers; the rest of the essay is the discipline that governs movement between them.

The agent memory hierarchy. The resident context (L0) is narrowest, fastest, and most expensive to occupy; capacity and access latency grow downward toward cold archive (L3). The prefetch path pulls likely-needed items upward; the evict path demotes or summarizes cold items downward.

A model of agent memory

Fix the objects we will reason about. There is a universe of items - tokens, spans, tool results, retrieved documents - each item $i$ carrying a size $s_{i}$ in tokens and an estimated utility $u_{i}$ , its expected contribution to the next decisions. The agent emits a reference stream - the sequence of items it actually touches step by step - against which any residency policy is judged. L0 has a token budget $B$ ; the manager decides, each step, which items occupy it.

Definition 1 (Working set). The working set

W (t, τ)

is the set of distinct items referenced in the trailing window of length

τ

ending at time

t

W (t, τ) = {r : r referenced in (t - τ, t]}

The working set is Denning’s, devised for exactly this question in virtual memory.[1] Its content, transplanted to agents, is that programs - and plans - exhibit locality of reference: at any moment a task touches a small, coherent subset of all the information it has ever seen. The current sub-goal, the file under edit, the last few tool results, the relevant constraints. Residency should track this working set, not the full history. Keeping the entire transcript resident is the agentic equivalent of pinning every page a program has ever touched: it confuses has been referenced with is being referenced. The window $τ$ is the one knob - too short and you thrash; too long and the working set bloats back toward the whole history.

Attention is not free storage

The entire argument rests on one cost fact, so let us state it carefully. Self-attention over a sequence of length $n$ with model width $d$ forms all pairwise interactions, so its compute scales quadratically in the sequence length and linearly in the width:[4]

F (n) = Θ (n^{2} d)

A naive reader concludes that the cost of a long context is paid once, at the prefill that processes the prompt. That is false for an agent, which generates autoregressively over a long horizon, and making the falsity precise is the first formal result.

Proposition 1 (Recurring residency cost). Under key-value-cached autoregressive decoding, keeping a set

S

of items resident across a horizon of

T

remaining decode steps costs

Θ (| S | T d)

, not the one-time

Θ (| S | d)

. Equivalently, the marginal cost of admitting one extra token, held until the horizon, is

c_{resident} = Θ (H d), H = n_{horizon} - t

The standard optimization caches the keys and values of every processed token so they are not recomputed; this removes the quadratic from generation. But at each decode step the model must read the cached keys and values of all resident tokens to compute the next one, an

Θ (| S | d)

read per step. A token admitted at step

t

and never evicted is touched at each of the remaining

H

steps; summing over

S

and over the horizon gives

Θ (| S | T d)

. □

This is the crux. Residency is not a one-time storage charge; it is a recurring charge whose magnitude is proportional to the horizon still ahead. A token you admit early and never evict is the most expensive thing in your system, because you pay for it on every one of the thousands of steps it outlives. IO-aware kernels such as FlashAttention lower the constant by avoiding materialization of the attention matrix,[5] but they do not change the recurrence: more resident tokens means more memory traffic per step, every step. This is why “just use a longer context” is the wrong default. It mistakes a recurring liability for a free amenity, and it scales the liability with exactly the quantity - horizon - that long-running agents have the most of.

Eviction: approximating the offline optimum

If residency is costly, the manager must decide what leaves when the working set will not fit the budget. The theoretical benchmark is Belady’s rule.

Proposition 2 (Belady optimality). For a cache of fixed capacity, the offline policy that on each miss evicts the item whose next reference is farthest in the future (MIN) incurs the minimum possible number of misses over the reference string.[2]

Exchange argument. Take any optimal policy

A

that on some miss evicts

x

while MIN would evict

y

, whose next use is later than

x

’s. Build

A^{'}

that agrees with

A

but evicts

y

here; couple their cache contents until they re-converge. Because

x

is needed first,

A^{'}

suffers no more misses than

A

. Repeating turns any optimum into MIN without increasing misses. □

Belady’s rule is an offline optimum: it requires knowing the future reference string, which no online agent has. It is the oracle we measure ourselves against, not a policy we can run. Online, we approximate. The recency heuristic - LRU, evict the least-recently-used - assumes the recent past predicts the near future, which is exactly the locality assumption the working set rests on, and its theoretical standing is reassuring.

Proposition 3 (LRU competitiveness). On any reference string, LRU with a cache of

k

pages incurs at most

k

times the faults of the offline optimum on a cache of the same size, up to an additive constant; that is, LRU is

k

-competitive, and no deterministic online policy achieves a smaller competitive ratio (Sleator and Tarjan).[3]

The bound is worst-case and loose on the locality-rich streams agents actually produce, where LRU runs much closer to the optimum than the factor of $k$ suggests. Frequency heuristics (LFU) trade recency for popularity and fare worse when the working set drifts. The agentic upgrade is to replace the heuristic with a learned relevance predictor: a small model that scores each resident item by its probability of being referenced again soon, given the current goal. This is Belady with an estimator standing in for the oracle - the closer the estimator, the closer we approach the offline optimum.[8] One subtlety worth respecting: not every resident token is a candidate for eviction. Certain anchor tokens act as attention sinks the model relies on, and evicting them degrades the residual computation badly; a competent manager pins them.[9]

Figure 2 makes the hierarchy of policies concrete on a synthetic reference trace with temporal locality - a slowly drifting working set plus a Zipfian tail of occasional far accesses. Belady-MIN is the offline ceiling; the utility-scored policy (a recency-and-frequency score, the practical stand-in for a learned predictor) sits just beneath it; LRU is competitive and converges to the scored policy as the budget grows; LFU, blind to drift, trails badly.

Eviction policy quality on a locality-rich reference trace. Belady-MIN (dashed) is the offline optimum and upper-bounds every online policy at every budget. The utility-scored policy approaches it; LRU is competitive and converges to the scored policy as

B

grows; LFU, insensitive to working-set drift, lags. The gap between LRU and Belady shrinks with budget, consistent with Proposition 3 being loose under locality.

Residency as constrained optimization

We can be more precise than “evict the least useful.” At each step the manager faces a budget of $B$ resident tokens, and each candidate item $i$ has a size $s_{i}$ and utility $u_{i}$ from the relevance predictor. Choosing which items to keep resident is a $0 / 1$ knapsack:

max_{x_{i} \in {0, 1}} \sum_{i} u_{i} x_{i} s.t. \sum_{i} s_{i} x_{i} \leq B

Theorem 1 (Residency is knapsack). Maximizing resident utility

\sum_{i} u_{i} x_{i}

subject to

\sum_{i} s_{i} x_{i} \leq B

is the 0/1 knapsack problem. Admitting items in decreasing order of utility density

u_{i} / s_{i}

is optimal for the fractional relaxation, and a 2-approximation for the integral problem; the gap vanishes as item sizes shrink relative to

B

The objective and the single linear capacity constraint with binary variables are the definition of 0/1 knapsack. For the fractional relaxation, an exchange argument shows any optimal solution fills capacity in nonincreasing density order, so greedy is exact. For the integral problem, compare greedy’s value to the better of (a) the greedy prefix that fits and (b) the single highest-utility item; one of these is at least half the fractional optimum, and the fractional optimum upper-bounds the integral one, giving the factor of two. When every

s_{i} ≪ B

the wasted tail is negligible and greedy is near-exact. □

The knapsack framing pays off immediately. We do not need an exact solver; the greedy rule that sorts by utility density and admits until the budget is exhausted is both cheap and near-optimal when a budget of $10^{4}$ tokens holds many small items. So the manager’s admission and eviction collapse to one rule: keep the items with the highest utility per token, drop the rest to L2. This is the correct generalization of LRU. LRU ranks by recency alone; utility density ranks by predicted value and charges for the space consumed, which is what we actually care about when the space is the expensive resource.

The manager’s loop is compact. A sketch:

# A memory manager over the L0 budget B (tokens).
# Admission and eviction are one rule: keep highest utility-density.

class MemoryManager:
    def __init__(self, budget, predictor, store):
        self.B = budget              # L0 token budget
        self.predict = predictor     # learned relevance: item -> utility
        self.store = store           # L2 retrieval tier
        self.resident = []           # current working set in L0

    def admit(self, item, goal):
        item.u = self.predict(item, goal)        # utility given current goal
        item.density = item.u / item.size     # value per token
        self.resident.append(item)
        self._enforce_budget()

    def _enforce_budget(self):
        # greedy knapsack: drop lowest density until under B
        used = sum(i.size for i in self.resident)
        while used > self.B:
            victim = min(self.resident, key=lambda i: i.density)
            if victim.pinned:                  # attention sinks stay resident
                break
            self.store.demote(self.summarize(victim))  # evict to L2, lossily
            self.resident.remove(victim)
            used -= victim.size

    def step(self, goal):
        # prefetch likely-needed items from L2 before they are demanded
        for cand in self.store.nearest(goal, k=8):
            if self.predict(cand, goal) > self.threshold:
                self.admit(cand, goal)         # raise the hit rate h

A memory manager. Admission scores utility, the budget is enforced by greedy eviction on utility density, and step() prefetches predicted references from L2 to keep the working set resident before it is demanded.

The cost of context, and the value of prefetch

Proposition 1 turned residency into a per-step cost proportional to resident tokens; the budget converts that cost into a quality tradeoff. Holding everything resident maximizes the fraction of needed items present (quality one) but pays the full ever-growing residency bill every step. Holding only a managed working set caps the per-step cost at the budget while serving most references from L0. The two strategies trace a quality-per-cost frontier, and the managed working set dominates across the useful range.

Left panel: task quality versus expected per-step cost in resident tokens. The managed working set traces a curve that rises steeply at low cost and reaches quality near 0.93 by B=64 resident tokens; keep-everything sits far to the right at cost 240 with quality 1.0. Right panel: effective access time versus hit rate h, a straight line falling from about 60 at h=0 to 1 at h=1, with Monte Carlo points on the line. — Cost of context. *Left:* the managed working set reaches near-full task quality at a small fraction of the per-step cost of keep-everything, which holds every distinct item ever seen (cost ~240 resident tokens here for one extra quality point over a budget of 64). *Right:* effective access time $EAT = h t_{fast} + (1 - h) t_{slow}$ falls linearly in hit rate; Monte Carlo points (40,000 trials each) lie on the line.

The hierarchy only earns its keep if the fast tier serves most requests. Borrow the architect’s figure of merit, effective access time. If a fraction $h$ of references are served from the fast tier at cost $t_{fast}$ and the rest miss down to the slow tier at cost $t_{slow}$ , the average is:[10]

EAT = h t_{fast} + (1 - h) t_{slow}

Proposition 4 (Value of prefetch). Effective access time is affine and strictly decreasing in the hit rate:

EAT (h) = t_{slow} - h (t_{slow} - t_{fast})

, so each unit increase in

h

reduces expected access cost by the fixed amount

t_{slow} - t_{fast} > 0

. Hence any mechanism that raises

h

pays off proportionally.

Collecting terms in the definition of effective access time, the coefficient of

h

- (t_{slow} - t_{fast})

, a negative constant. The slope is independent of

h

, so the payoff is linear (Figure 3, right). □

Because $t_{slow}$ dwarfs $t_{fast}$ - a retrieval round-trip costs a step of latency and a chunk of the budget against the desired answer - effective access time is dominated by the miss rate $1 - h$ . Raising $h$ is the highest-leverage thing the manager does, and the lever is prefetch: predict the next references and pull them from L2 into L0 before they are needed, overlapping the retrieval latency with current work. This is the same bet a CPU prefetcher makes, and it succeeds for the same reason - reference streams are predictable when locality holds. An agent about to enter a sub-goal can prefetch that sub-goal’s artifacts; an agent editing a file can prefetch its neighbors. Retrieval-augmented generation, seen through this lens, is not a bolt-on knowledge feature - it is demand paging for the context window, and prefetch is what turns demand paging from a stall into a pipeline.[6]

Reproducibility

Both simulation figures are produced by a single seeded script with no inputs beyond its parameters and no network access. The reference trace is synthetic but structured to have the locality the argument assumes - a slowly drifting working set with a Zipfian tail - so that the policy ordering it induces is the one the theory predicts. The full script is available here; the core of the eviction experiment, including the offline Belady oracle, is:

# offline optimum: evict the item whose next use is farthest in future
def hit_rate_belady(refs, B):
    n = len(refs)
    nxt = np.full(n, n, dtype=np.int64)          # next-use index per position
    seen = {}
    for t in range(n - 1, -1, -1):
        nxt[t] = seen.get(refs[t], n)
        seen[refs[t]] = t
    cache, next_use, hits = set(), {}, 0
    for t, r in enumerate(refs):
        if r in cache:
            hits += 1
        elif len(cache) >= B:
            victim = max(cache, key=lambda x: next_use[x])  # farthest future
            cache.discard(victim)
        cache.add(r); next_use[r] = nxt[t]
    return hits / len(refs)                       # the ceiling in Figure 2

The Belady-MIN oracle (excerpt). Running the full script regenerates both figures deterministically.

Where the analogy bends

No analogy is free, and this one bends in an instructive place. RAM supports true random write: you can overwrite address $k$ in place and the old value is gone, cleanly. The context window does not. It is append-mostly; there is no operation that surgically replaces a stale fact in the middle of the transcript without re-emitting the surrounding tokens. “Writing” to agent memory therefore means re-summarization or compaction - collapsing a span of tokens into a shorter, denser representation - and that operation is lossy.² The manager does not get to move bytes around for free; every eviction that summarizes is a small, irreversible commitment about what mattered. This raises the stakes on the relevance predictor: a bad eviction in RAM costs a cache miss and a reload; a bad eviction here can lose information that cannot be reconstructed from L2 because it was never stored verbatim. The working-set discipline is what keeps this honest - if residency tracks reference, the items we summarize are by construction the ones least likely to be needed verbatim again.

Related work

The working-set model and the locality it formalizes are Denning’s.[1] Belady’s study of replacement algorithms gives both the offline optimum and the namesake anomaly,[2] and Sleator and Tarjan placed online paging on a competitive-analysis footing.[3] The architectural figures of merit - hierarchy, effective access time, prefetch - are textbook.[10] On the model side, the quadratic attention cost is Vaswani et al.,[4] and the IO-aware kernels that lower its constant are Dao et al.[5] Retrieval-augmented generation[6] and retrieval-during-training[7] are, in our framing, demand paging and prefetch for L2; MemGPT[8] argues the operating-system analogy for LLM memory directly; and StreamingLLM[9] identifies the attention sinks a manager must pin. Our contribution is not a new mechanism but the unification: that residency is a recurring cost (Proposition 1), that eviction inherits Belady optimality and LRU competitiveness (Propositions 2-3), that residency under a token budget is precisely 0/1 knapsack with a greedy near-optimum (Theorem 1), and that prefetch’s payoff is the linear effective-access-time law (Proposition 4) - each validated on a controlled trace.

Limitations and threats to validity

The model is deliberately minimal, and its assumptions are where it can mislead. The traces are synthetic. We construct a reference stream with the locality the theory needs; real agent memory traces may have weaker or differently-shaped locality, and the policy ordering, while robust here, is only validated against this family. Utilities are assumed known. Theorem 1 takes $u_{i}$ as given, but estimating an item’s expected contribution to future decisions is itself hard, and a poor estimator turns the greedy optimum into greedy nonsense. Relevance prediction is the open problem. The whole eviction-and-prefetch apparatus reduces to predicting the near-future reference string; that prediction is exactly what is difficult, and our learned-predictor stand-in is a recency-frequency score, not a trained model. The cost model is simplified. Proposition 1 counts memory traffic as proportional to resident tokens and ignores batching, paging, and kernel-level constants; the recurrence it captures is real, but the constant is not measured. The simulations are of the model, not of language models. They confirm the mathematics is faithful to its own assumptions; they do not substitute for measuring hit rates, utilities, and miss costs on a real long-horizon agent, which is the natural next step and the place where these predictions should be falsified.

The manager is a kernel subsystem

Step back to the substrate. The memory manager developed here is not a standalone trick; it is one subsystem of the kernel the previous essay sketched, sitting beside the scheduler, the capability broker, and the durable log. That placement matters. Residency decisions interact with scheduling - a task about to be preempted should have its working set demoted, not pinned - and with durability, since the cold archive and the checkpoint log are the same kind of object. Treating context as compute, rather than as storage, is what lets the kernel reason about all of these together: every token is a recurring cost, every eviction is a bet, every prefetch is a scheduled IO.

But a managed memory hierarchy only guarantees that the right information is present at the right price. It says nothing about whether the agent then does the right thing with it - whether a step that succeeds is a step that is correct. The relevance predictor can be wrong; the summary can be confidently false; the prefetch can surface a plausible distractor. Managing memory well buys us the conditions for reliability without delivering reliability itself. The next essay takes up that problem directly: how a long-horizon agent earns trust through verification - turning a sequence of fallible steps into a trajectory whose correctness can be checked, bounded, and, when it fails, caught before it compounds.

The KV cache stores, for every token already processed, the key and value vectors it contributes to attention, so they need not be recomputed when generating later tokens. It is what makes autoregressive decoding linear rather than quadratic per step - but it is also why a resident token carries a recurring cost: its cached K/V are read at every subsequent decode step, so the bill is paid once per remaining step, not once at admission. ↩

This is where the RAM analogy genuinely breaks. RAM offers true random write - overwrite in place, old value gone. Context is append-mostly: there is no in-place edit of a token deep in the sequence. “Writing” therefore means re-summarization or compaction, which is lossy and irreversible. So unlike a hardware cache, demotion from L0 can destroy information rather than merely relocate it. ↩