Systems No. 01

The Orchestration Substrate: An Operating System for Autonomous Agents

Abstract

The dominant pattern in autonomous agents - a control loop that repeatedly prompts a language model with tool results - is, structurally, a single-threaded interpreter without a kernel. As agents move from minute-long demos to month-long deployments, the abstractions they lack are precisely the ones operating systems spent four decades formalizing: scheduling under contention, durable execution, and protection. We argue that these are not engineering niceties but the preconditions for long-horizon reliability, and we make the argument quantitative. We model agent execution as a dynamic task graph with a ready set, a per-task holding-cost rate, a service rate, and a per-step success probability, and we establish four results. First, the cμ rule - always serve the ready task maximizing the product of holding cost and service rate - minimizes expected weighted completion time, and greedy step-by-step control is provably suboptimal (Proposition 1). Second, sequential pipelines fail with probability that decays exponentially in the horizon (Proposition 2). Third, a durable substrate that checkpoints and retries each step converts that exponential failure into expected work that grows only linearly in the horizon (Theorem 1). Fourth, capabilities ordered as a lattice and composed by intersection bound a sub-agent's authority below its parent's, so blast radius is bounded by construction (Proposition 3). Each quantitative claim is validated by a reproducible Monte Carlo simulation, included and linked. The conclusion is a layered design: a thin substrate kernel mediating between userland agents and the resources they consume.

Every generation of systems builders rediscovers the operating system. We are living through the agentic rediscovery right now, and like all the others, it is happening one accidental, under-specified reimplementation at a time.

Consider what a modern agent framework actually is. A process holds a conversation buffer. It calls a model, receives a request to invoke a tool, executes the tool, appends the result, and calls the model again. This is a read-eval-print loop with a stochastic evaluator. It is also, unmistakably, a tiny operating system written by someone who did not realize they were writing one: the conversation buffer is unmanaged memory, the tool dispatch is an unprotected system-call interface, and the control loop is a scheduler with a quantum of one and a ready queue of size one. The framework works in a demo for the same reason a program with one thread and no memory management works on a small input - nothing has yet contended for a resource.

The thesis of this essay is narrow and, I think, consequential: the unit of progress in long-horizon autonomy is not a better policy but a better substrate. A model that is ten percent better at reasoning improves every step by ten percent; a substrate that makes a thousand steps composable, recoverable, and bounded changes what is reachable at all. The history of computing is not kind to the view that you can scale a single interpreter forever. We separated mechanism from policy, we virtualized resources, and we built kernels[1] - not because it was elegant, but because nothing else survived contact with concurrency and failure.1

Contributions. We (i) give a minimal but faithful model of agent execution as a dynamic task graph; (ii) recover the classical cμ index policy for the agent scheduler and show greedy control is suboptimal by an interchange argument; (iii) prove that durable checkpoint-and-retry linearizes the exponential failure of sequential pipelines; (iv) show that capability composition by intersection bounds authority and blast radius by construction; and (v) validate every quantitative claim with a seeded simulation whose code is included and runnable.

A model of the agent as a process

Let us be precise about what breaks. An agent executing a non-trivial objective does not perform a linear sequence of steps; it induces a dynamic task graph. A planning step spawns sub-goals. A sub-goal dispatches three tool calls that can run concurrently. One of those calls itself invokes a sub-agent. The graph is revealed incrementally - you do not know the third level until you have executed the second - and its nodes have wildly heterogeneous cost and latency profiles: a model call is seconds and cents, a web fetch is hundreds of milliseconds, a code-execution sandbox is variable and occasionally catastrophic.

Definition 1 (Agent execution model). An agent induces a directed acyclic task graph revealed online. At time t the ready set R(t) is the subset of tasks whose dependencies are satisfied. Each task i carries a holding-cost rate ci>0 (the rate at which delay on it hurts the objective), a random service time Si with service rate μi=1/𝔼[Si], and a per-step success probability pi(0,1]. A substrate executes ready tasks subject to a concurrency limit L.

A single-threaded control loop collapses this graph into a total order chosen greedily by the model.[8][9] Three failures follow immediately. First, no concurrency: independent tool calls that could overlap are serialized, so end-to-end latency is the sum rather than the max of branch latencies.[7] Second, no admission control: nothing bounds the number of in-flight operations, so a runaway plan can open unbounded sandboxes or spend an unbounded budget before any supervisor intervenes. Third, and most corrosively, no isolation of failure: a single tool exception unwinds the whole stack, discarding correct work performed earlier in the graph. These are not prompt-engineering problems. You cannot prompt your way to a scheduler. They are the problems a kernel exists to solve, and they reappear here with the same structure they had in 1965, merely wearing different clothes.

A layered agent operating system: userland agents, the substrate kernel, and resources. USERLAND KERNEL RESOURCES Planner Researcher Coder Critic spawn / await Substrate Kernel Scheduler cμ index policy Memory Mgr working set Cap Broker least authority Durable Log checkpoint models tools / apis sandboxes store dashed - the durable log feeds preemption & recovery decisions back to userland
The substrate kernel mediates every interaction between userland agents and the resources they consume. Agents issue spawn and await calls; the kernel owns scheduling, memory, authority, and durability. Mechanism lives below the line, policy above it.

Scheduling is the first-class problem

Once we admit that an agent induces a task graph with a ready set, the central question is no longer what should the model do next? but which ready task should the substrate run next, given finite concurrency and a budget? This is a scheduling problem, and scheduling theory has a sharp answer. Attach to each task its holding-cost rate ci (a critical-path task on which the whole plan waits has a high rate; a speculative branch has a low one) and its service rate μi. The classical result for minimizing expected weighted completion time on a single server is the cμ rule[4]: always serve the ready task that maximizes the product of holding cost and service rate.

Proposition 1 (cμ optimality). On a single server serving a fixed set of N tasks non-preemptively, the policy that always serves the available task of largest ciμi minimizes the expected weighted completion time 𝔼[iciCi], where Ci is the completion time of task i. Greedy control that ignores either factor is suboptimal.
i = arg max iR(t) ciμi , μi= 1𝔼[Si]
Consider any schedule and two tasks i, j served back to back, i before j, with the rest fixed. Swapping their order changes the objective by exactly cj𝔼[Si]-ci𝔼[Sj], since only the two tasks' completion times move and all later tasks are unaffected. Serving i first is therefore at least as good iff ci/𝔼[Si]cj/𝔼[Sj], that is ciμicjμj. Any schedule out of cμ order contains an adjacent inverted pair whose swap does not increase the objective; repeating bubbles the schedule into cμ order without ever increasing it.

The lesson generalizes past its assumptions. Greedy “do the next step the model proposes” ignores both terms: it does not weight by cost of delay, and it does not exploit fast tasks to unblock the graph. A substrate that schedules - even with a crude estimate of the two rates - reorders work to keep the critical path saturated and short-circuits cheap discriminating tests early. When durations are uncertain and tasks can be paused, the exact optimum becomes an index policy of Gittins type[5],2 but the cμ intuition is the one worth internalizing: priority is a product of how much waiting costs and how quickly the wait can end. Figure 2 quantifies the gain over the natural baselines on randomly generated instances.

Bar chart of mean weighted completion time over 4000 random 8-task instances under four policies. Random order is highest at about 243; shortest-processing-time-first and highest-cost-first are each about 185; the c-mu rule is lowest at about 160. Error bars are one standard error of the mean and are tiny.
Mean weighted completion time per scheduling policy, averaged over 4000 random instances of N=8 tasks with random holding costs and service times. The cμ rule is lowest; shortest-processing-time and highest-cost each capture only one factor and tie well above it; random order is worst. Error bars are ±1 SEM.

Concurrency, in turn, is governed by Little's law[6]: the mean number of in-flight tasks L equals the arrival rate λ times mean residence time W. The substrate, not the model, is the right place to set L: it is the only component with a global view of budget, rate limits, and the blast radius of an additional concurrent sandbox.

Why long horizons demand durability

Here is the argument that, more than any other, forces the substrate. Suppose a plan is a sequence of n steps, each succeeding independently with probability pip¯. If any step's failure aborts the run - the default for a stack-unwinding control loop - then end-to-end success is the product, and it decays exponentially.

Proposition 2 (Reliability decay). Under abort-on-failure, end-to-end success of an n-step plan is the product of the per-step success probabilities, bounded above by p¯n, which is exponentially small in n for any p¯<1.
Pe2e = i=1n pi p¯n
Steps are independent, so the joint success probability factors as the product of the marginals; bounding each factor by p¯ gives p¯n=enlnp¯, which decays exponentially because lnp¯<0.

Even at a per-step reliability of p¯=0.99, a 300-step trajectory succeeds barely five percent of the time. This is the mathematical reason long-horizon agents feel impossible: reliability decays geometrically, and no plausible improvement in per-step accuracy escapes an exponential. You cannot win this race by making the model better. You win it by changing the recurrence.

Theorem 1 (Durability linearizes cost). If the substrate checkpoints state after each step and retries a failed step in place rather than restarting the run, the number of attempts at step i is geometric with mean 1/pi, so the expected total work is the sum 𝔼[C]=i=1nwi/pinw¯/p̲, which is linear in n. By contrast, restart-on-failure repeats the whole run until all n steps succeed, so the expected number of full-run attempts is 1/p¯n and expected total work is n/p¯n, exponential in n.
𝔼[C] = i=1n wipi versus 𝔼restart[C] = np¯n
Under checkpointing, step i is retried until it first succeeds; the attempt count is geometric with success probability pi and mean 1/pi. Linearity of expectation over the n independent steps gives iwi/pi. Under restart, a full run succeeds with probability p¯n, so the number of runs to first success is geometric with mean p¯-n, each run costing n step-executions. The exponential in the failure model has become a linear cost in the durable model.

This single substitution - checkpoint and retry the step, never the run - is the difference between agents that survive a minute and agents that survive a month. It is also exactly what durable-execution engines provide for ordinary workflows;3 the agent substrate's job is to make the unit of durability a semantically meaningful step, not an arbitrary function call. Figure 3 shows both sides of the recurrence: the exponential collapse of naive success, and the exponential-versus-linear gap in expected work, with Monte Carlo estimates overlaid on the closed forms.

Two panels. Left: naive end-to-end success p-bar to the n, with p-bar = 0.99, collapsing from 1 toward 0 as the horizon n grows to 400; Monte Carlo points lie on the curve. Right, on a log vertical axis: expected total work versus horizon. The naive restart-on-failure curve rises steeply and exponentially past ten thousand step-executions, while the durable checkpoint-retry curve grows linearly to a few hundred; Monte Carlo points lie on both curves.
Left: naive end-to-end success p¯n (p¯=0.99) collapsing with horizon. Right (log scale): expected total work - naive restart-on-failure grows exponentially, durable checkpoint-and-retry grows linearly, validating Theorem 1. Lines are closed forms; points are Monte Carlo.
Per-step reliability versus end-to-end success, with and without a durable substrate. The right column is the expected-cost multiplier, which stays near 1 rather than diverging.
Horizon n Naive P(success), p̄=0.99 Durable expected-cost ×
1090.4%1.01×
5060.5%1.01×
10036.6%1.01×
3004.9%1.01×

Authority is a lattice, not a flag

The final abstraction the substrate owes us is protection. The prevailing posture - hand the agent broad credentials and hope the prompt restrains it - is the ambient-authority model that the security community spent decades arguing against.[2] The correct frame is capability-based[3]: authority is a token that confers a specific, attenuable right, and an agent can do exactly what its held capabilities permit, no more.4

Proposition 3 (Least authority bounds blast radius). Let rights form a lattice (,) ordered by “confers at most as much authority as,” with meet (intersection of conferred rights). If a parent holding capability a mints a child capability b=ar by attenuating with a request r, then ba: no sub-agent's authority exceeds its parent's, and the authority of any descendant is bounded by the meet along its spawn path. The blast radius of a compromised or confused agent is therefore bounded by construction.
By the defining property of the meet, ara for every r, so a single minting step is monotone non-increasing in authority. Composition along a spawn path of depth k gives bk=ar1rka by associativity and idempotence of the meet, so authority is monotone non-increasing along every chain. Since an agent can exercise only rights its held capability confers, the set of reachable effects is contained in those permitted by a - the blast radius is bounded above by the parent's authority independent of agent behavior.

The principle of least authority says a task should run at the greatest lower bound of the rights it actually needs. When the planner spawns a coder sub-agent to edit one repository, the substrate should mint a capability scoped to that repository and that operation - not pass down its own broader authority. Because capabilities compose by intersection, a sub-agent can never escalate beyond its parent; bounding is a structural property, not a matter of vigilance. This is the only known way to reason about safety in a system where the policy is a stochastic model you do not fully control. The substrate's userland interface can be small. A sketch:

# Userland sees a kernel, not a control loop.
# spawn() returns a handle; await() yields to the scheduler.

async def research(topic, cap):
    # least authority: a read-only web capability, nothing more
    web = cap.attenuate("net.read", domains=["*.arxiv.org"])

    # three independent fetches: the kernel runs them concurrently
    handles = [kernel.spawn(fetch, q, cap=web,
                            cost=3.0, est_ms=400)
               for q in subqueries(topic)]

    docs = await kernel.gather(handles)   # scheduled by cμ index
    return kernel.checkpoint(synthesize(docs))  # durable step boundary

A userland routine. Concurrency, priority, authority, and durability are all delegated to the substrate; the agent expresses only intent.

Reproducibility

Every quantitative figure is produced by a single seeded script with no inputs beyond its parameters; the closed forms above are overlaid on Monte Carlo estimates so the reader can see model and simulation agree. The full script is available here; the core of the scheduling experiment, which realizes Proposition 1 by comparing four orderings on each random instance, is:

# total weighted completion time for four policies on each random instance
c = rng.uniform(0.2, 5.0, N)            # holding-cost rates c_i
s = rng.uniform(0.2, 5.0, N)            # service times S_i (mu_i = 1/s)
spt   = idx[np.argsort(s)]              # shortest processing time first
hc    = idx[np.argsort(-c)]             # highest holding cost first
cmu   = idx[np.argsort(-(c / s))]       # c-mu: largest c_i * mu_i first
# weighted_completion serves in order, accumulating sum_i c_i * C_i
totals["c-mu rule"][m] = weighted_completion(cmu, c, s)

The scheduling simulation (excerpt). Running the full script regenerates both figures deterministically.

Beyond the figures, a runnable reference harness assembles these mechanisms into working code in the Hermes function-calling style: capability-scoped tools composed by intersection, the reason-act loop, verifier-gated output, a step budget, and a durable transcript that resumes after a crash rather than restarting the run. It is one standard-library file with a deterministic stub model, so it runs with no API key: research/agent-harness/harness.py.

Related work

The framing of a kernel that arbitrates contended resources is Dijkstra's, whose THE system separated mechanism from policy in layers.[1] Protection by least privilege and the case against ambient authority are Saltzer and Schroeder's,[2] made composable by Miller's object-capability model, which is the lattice of Proposition 3.[3] The cμ rule of Proposition 1 originates with Cox and Smith,[4] and its preemptive, uncertain-duration generalization is the Gittins index;[5] Little's law[6] fixes the concurrency budget. The exponential tail and the value of cutting it are the systems lesson of Dean and Barroso.[7] Durable execution descends from Lamport's logical clocks and replay over an ordered event log,[8] and the compensating-transaction view of long-running work is Garcia-Molina and Salem's sagas.[9] On the agent side, the greedy single-threaded loops we critique are ReAct[10] and Reflexion.[11] Our contribution is to assemble these classical results into a single substrate and to make three of its load-bearing claims quantitative and reproducible: that cμ beats greedy scheduling, that durability turns exponential failure into linear cost, and that capability composition bounds blast radius by construction.

Limitations and threats to validity

The model is deliberately minimal, and its assumptions are where it can mislead. The cμ result is single-server. Proposition 1 assumes one server, i.i.d. service times, and known rates; the agent setting violates all three, though the qualitative ordering is robust and the simulation confirms it on random instances. Failure independence is an idealization. Theorem 1 assumes step failures are independent and that a checkpoint perfectly restores state; correlated failures (a poisoned context, a wedged sandbox) defeat in-place retry, and the bound is then optimistic. Checkpoint overhead is unmodeled. Persisting and restoring state has a cost we set to zero; if it dominates the per-step work, the linear-versus-exponential comparison shifts in constant factors, though not in asymptotics. The capability model assumes an honest kernel. Proposition 3 bounds authority only if the kernel that mints and checks capabilities is itself uncompromised; it says nothing about a malicious substrate. The simulations are of the model, not of real agents. They confirm the mathematics is faithful to its own assumptions; they do not substitute for measuring service times, success rates, and contention on a deployed fleet, which is where these predictions should be falsified.

Conclusion

None of the four mechanisms - scheduling, durable execution, capability isolation, and a memory hierarchy - is novel in computer science. That is the point. Their novelty here is in recognizing that an autonomous agent is a process that has outgrown the toy runtime it was born in, and that the abstractions which domesticated concurrency and failure for ordinary software are exactly the ones that will domesticate it. The model is the policy; the substrate is the mechanism; progress comes from getting the separation right. The cost is that we must stop treating “the agent” as the whole system: the agent becomes userland - important, replaceable, and untrusted - running atop a small, boring, fiercely reliable kernel. The next essay takes one of these subsystems, the memory manager, and argues that the context window is best understood as the top of a managed hierarchy - context as compute - with everything that implies about eviction, prefetch, and the price of attention.

Footnotes

  1. The “OS” framing is deliberate over “runtime.” A runtime executes; a kernel arbitrates. The defining feature we need is arbitration of contended resources under an adversarial-by-default policy.
  2. When tasks are preemptable and durations are unknown but learnable, the problem becomes a multi-armed bandit over tasks, whose optimal policy is the Gittins index. The substrate is the natural home for such bookkeeping.
  3. Workflow engines such as Temporal popularized “durable execution” for microservices: deterministic replay over an append-only event log[8]. The agentic variant must additionally treat model nondeterminism as an explicit, logged input.
  4. Ambient authority - rights attached to identity rather than to unforgeable tokens - is the source of confused-deputy bugs. Object-capability systems eliminate the class.

References

  1. E. W. Dijkstra. The structure of the “THE”-multiprogramming system. Communications of the ACM, 11(5), 1968.
  2. J. H. Saltzer and M. D. Schroeder. The protection of information in computer systems. Proceedings of the IEEE, 63(9), 1975.
  3. M. S. Miller. Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control. PhD thesis, Johns Hopkins University, 2006.
  4. D. R. Cox and W. L. Smith. Queues. Methuen, 1961. (Origin of the rule.)
  5. J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society B, 41(2), 1979.
  6. J. D. C. Little. A proof for the queuing formula L = λW. Operations Research, 9(3), 1961.
  7. J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2), 2013.
  8. L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), 1978.
  9. H. Garcia-Molina and K. Salem. Sagas. Proc. ACM SIGMOD, 1987.
  10. S. Yao et al. ReAct: Synergizing reasoning and acting in language models. ICLR, 2023.
  11. N. Shinn et al. Reflexion: Language agents with verbal reinforcement learning. NeurIPS, 2023.