Reliability No. 03

The Verifier's Advantage: Asymmetric Verification as the Engine of Reliable Autonomy

Abstract

Autonomous systems built on stochastic policies fail at a rate that no feasible improvement in the policy can drive to zero. We argue that reliability at scale is bought not on the generation side but on the verification side, and we make the argument quantitative. We formalize generate-and-check as a sampling-and-acceptance process parameterized by a generator success rate, a verifier sensitivity, and a false-accept rate, and we establish three results. First, with a sound verifier, repeated sampling drives the probability of obtaining a correct candidate to one geometrically in the sample budget (Proposition 1). Second, stacking conditionally independent verifiers drives the precision of accepted outputs to one (Proposition 3), but under a shared-failure model with correlation ρ the achievable precision is capped by an explicit ceiling that no amount of stacking can exceed (Proposition 4). Third, the expected cost to obtain a verified-correct output admits a closed form from which we derive a threshold beyond which stacking cheap verifiers strictly dominates scaling the generator (Theorem 1). Each result is validated by a reproducible Monte Carlo simulation, included and linked. The practical thesis follows: decorrelated verification, not a better policy, is the cheapest path to reliable autonomy, and its limit is set by how independent your checks really are.

Checking an answer is usually easier than producing one. That asymmetry is the most underexploited fact in the design of autonomous systems, and this essay is an attempt to take it seriously enough to quantify.

The intuition is ancient in computer science. A satisfying assignment is hard to find and trivial to check; a proof is hard to discover and mechanical to verify. The entire theory of NP is built on the gap between finding and checking.[1] The same gap appears, less formally but no less usefully, throughout autonomous work: a plan is hard to construct and cheap to test against constraints, code is hard to write and cheap to run against a suite, a citation is hard to produce and cheap to look up. Wherever the gap exists, it can be turned into reliability.

The argument against relying on the policy alone is a single inequality. If a system executes a trajectory of $m$ steps, each correct independently with probability $\bar{p}$, end-to-end success decays as $\bar{p}^m$. At $\bar{p}=0.99$ a 300-step task succeeds about five percent of the time, and raising the per-step number to 0.999 only postpones the collapse: the same five percent arrives at about 3,000 steps. You cannot win an exponential race by nudging its base. You win by changing the process: sample, check, and keep only what passes. This essay asks how much that buys, and where it stops working.
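The compounding arithmetic is easy to check directly. The sketch below is an illustration of the inequality only, not part of the essay's simulation script; the helper name `end_to_end_success` is ours.

```python
def end_to_end_success(p_step: float, m: int) -> float:
    """Probability that m independent steps, each correct with probability p_step, all succeed."""
    return p_step ** m


# Compare two per-step rates at two trajectory lengths.
for p_step in (0.99, 0.999):
    for m in (300, 3000):
        print(f"p_step={p_step}, m={m}: success={end_to_end_success(p_step, m):.3f}")
```

A tenfold reduction in per-step error buys a tenfold longer trajectory at the same success rate and nothing more: the curve keeps its shape and only shifts.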

Contributions. We (i) give a minimal but faithful model of generate-and-check; (ii) prove a coverage result for repeated sampling and a precision result for stacked verification, and identify the precise way correlation among verifiers caps achievable precision; (iii) derive the expected cost of a verified-correct output and the regime in which verification dominates generation scaling; and (iv) validate every claim with a seeded simulation whose code is included and runnable.

A model of generate-and-check

Fix a task with a (possibly unknown) correctness predicate. A generator produces a candidate; a verifier accepts or rejects it. We model only the statistics that matter.

Definition 1 (Generator). A generator emits candidates i.i.d., each correct with probability $p \in (0,1)$, at marginal cost $c_g$ per candidate.
Definition 2 (Verifier). A verifier is a test with sensitivity $s = 1-\beta$ (probability it accepts a correct candidate) and false-accept rate $\alpha$ (probability it accepts an incorrect candidate), at marginal cost $c_v \ll c_g$. It is sound if $\alpha=0$ and complete if $\beta=0$.

The controller is a resample loop: draw a candidate, run the verifier stack, return it if accepted, otherwise draw again, up to a budget. The cost asymmetry $c_v \ll c_g$ is the entire reason the loop is interesting: it lets us spend many cheap checks to launder a weak generator into reliable output. Two questions decide whether that works: does sampling find a correct candidate (coverage), and is an accepted candidate actually correct (precision)?
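The controller described above fits in a few lines. This is a minimal sketch under our own naming (`resample_loop`, `guess`, `divides_91` are hypothetical and do not come from the essay's script), with a toy task chosen only because its check is trivially cheap.

```python
import random
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def resample_loop(
    generate: Callable[[], T],
    verifiers: Sequence[Callable[[T], bool]],
    budget: int,
) -> Optional[T]:
    """Draw up to `budget` candidates; return the first one every verifier accepts."""
    for _ in range(budget):
        candidate = generate()
        # all() stops at the first rejecting verifier, so later checks are skipped.
        if all(verify(candidate) for verify in verifiers):
            return candidate
    return None  # budget exhausted without an accepted candidate


# Toy task (hypothetical): find a nontrivial factor of 91 by blind guessing.
rng = random.Random(0)  # arbitrary seed so the toy run is repeatable


def guess() -> int:
    return rng.randint(2, 90)  # weak generator: right about 2% of the time


def divides_91(d: int) -> bool:
    return 91 % d == 0  # cheap check, sound and complete for this toy task


result = resample_loop(guess, [divides_91], budget=2000)
print(result)
```

Because `all()` short-circuits, this sketch spends slightly less on verification than a loop that runs every check on every candidate; a cost model that charges all k checks per draw is therefore an upper bound for it.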

Coverage: the value of sampling

Coverage is the easy half, and it is favorable. With a sound verifier, the loop succeeds as soon as any drawn candidate is correct.

Proposition 1 (Coverage). With a sound verifier and a budget of $n$ samples, the probability of returning a correct answer is
$$P_{\mathrm{cov}}(n) = 1 - (1 - sp)^{n} \;\longrightarrow\; 1 - (1-p)^{n} \quad \text{as } s \to 1.$$
Each sample is independently accepted-and-correct with probability $sp$; the loop fails only if all $n$ fail, with probability $(1-sp)^n$. Complement, then take $s \to 1$.

The convergence is geometric: each additional sample multiplies the failure probability by (1-sp). A generator that is right only a quarter of the time reaches 99% coverage in about sixteen samples (0.75^16 ≈ 0.010 with s=1). This is the regime that makes test-filtered code generation and self-consistency work in practice.[2][13] Figure 1 plots the closed form against a Monte Carlo estimate; they agree to within sampling error.

Figure 1. Coverage rises geometrically in the sample budget. Coverage probability 1-(1-p)^n is plotted against the number of samples for p = 0.10, 0.25, 0.50; all three curves rise quickly toward one. Solid lines are Proposition 1; points are Monte Carlo estimates (20,000 trials per point) and lie on the lines. Even a weak generator (p=0.10) exceeds 95% coverage within thirty samples.
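The agreement between the closed form and simulation can be reproduced in miniature. The following is a sketch rather than the essay's own script: the function names and the seed are our assumptions, and only the trial count (20,000) and the values of p are taken from the figure description.

```python
import numpy as np


def coverage_closed_form(p: float, n: int, s: float = 1.0) -> float:
    """Proposition 1: probability that at least one of n draws is correct and accepted."""
    return 1.0 - (1.0 - s * p) ** n


def coverage_monte_carlo(p: float, n: int, s: float = 1.0,
                         trials: int = 20_000, seed: int = 0) -> float:
    """Estimate coverage by simulating `trials` independent runs of an n-sample loop."""
    rng = np.random.default_rng(seed)        # seed is an arbitrary choice for this sketch
    correct = rng.random((trials, n)) < p    # which draws are correct
    accepted = rng.random((trials, n)) < s   # does the sound verifier accept a correct draw
    return float((correct & accepted).any(axis=1).mean())


for p in (0.10, 0.25, 0.50):
    for n in (8, 16, 30):
        print(f"p={p:.2f} n={n:2d}  closed={coverage_closed_form(p, n):.4f}  "
              f"mc={coverage_monte_carlo(p, n):.4f}")
```

The verifier is sound here, so an incorrect draw is never accepted and a run succeeds exactly when some draw is both correct and accepted.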

Precision: the hard half

Coverage says a correct candidate was drawn; it does not say the verifier returned it rather than a confident-looking wrong one. The quantity that ships is precision: the probability that an accepted candidate is correct. By Bayes' rule, for a single verifier,

$$\pi = \frac{p(1-\beta)}{p(1-\beta) + (1-p)\alpha}.$$

Precision depends on the verifier only through the ratio α/(1-β), and when sensitivity is anywhere near one it is the false-accept rate α that decides the outcome (see footnote 1). A verifier that occasionally rejects good answers mostly costs you extra samples; a verifier that occasionally accepts bad ones costs you correctness. The natural move is to demand that several verifiers agree.
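A numerical reading of the Bayes expression makes the two error channels concrete. This is a small sketch with illustrative values we chose ourselves (p = 0.5 and α = 0.35 echo the settings quoted for the precision figure; β = 0.2 is an assumption), and the function name `precision` is ours.

```python
def precision(p: float, alpha: float, beta: float = 0.0) -> float:
    """Bayes' rule: P(correct | accepted) for a single verifier."""
    true_accept = p * (1.0 - beta)     # candidate is correct and accepted
    false_accept = (1.0 - p) * alpha   # candidate is incorrect but accepted
    return true_accept / (true_accept + false_accept)


print(precision(0.5, alpha=0.35))            # baseline verifier
print(precision(0.5, alpha=0.35, beta=0.2))  # same verifier, rejects one good answer in five
print(precision(0.5, alpha=0.035))           # tenfold lower false-accept rate
```

With these numbers, cutting α tenfold lifts precision from about 0.74 to about 0.97, while losing a fifth of the sensitivity moves it only to about 0.70; a sound verifier (α = 0) gives precision one at any sensitivity.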

Proposition 2 (Conjunctive stacking). If $k$ verifiers must all accept, and their errors are conditionally independent given the candidate's correctness, the stack has sensitivity $s^k$ and false-accept rate $\alpha_{\mathrm{eff}} = \prod_{j=1}^{k} \alpha_j$.
Proposition 3 (Precision under independence). If each $\alpha_j \le \alpha < 1$ and verifiers are conditionally independent, then accepted-output precision $\pi_k \to 1$ as $k \to \infty$, provided $s > \alpha$.
Substituting the stack's rates into the Bayes expression gives $\pi_k \ge \frac{p s^k}{p s^k + (1-p)\alpha^k}$, with equality when every $\alpha_j = \alpha$. Divide through by $s^k$; the false term scales as $(\alpha/s)^k \to 0$ when $s > \alpha$.
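Propositions 2 and 3 can be traced numerically. The sketch below assumes a common sensitivity s for every verifier, as the propositions do; the function names and the illustrative values (p = 0.5, s = 0.9, α = 0.35) are ours.

```python
from math import prod
from typing import Sequence, Tuple


def stacked_rates(s: float, alphas: Sequence[float]) -> Tuple[float, float]:
    """Proposition 2: sensitivity and false-accept rate of an all-must-accept stack."""
    k = len(alphas)
    return s ** k, prod(alphas)


def stacked_precision(p: float, s: float, alphas: Sequence[float]) -> float:
    """Bayes precision of the conjunctive stack under conditional independence."""
    s_k, alpha_eff = stacked_rates(s, alphas)
    return p * s_k / (p * s_k + (1.0 - p) * alpha_eff)


# Precision as verifiers are added, each with false-accept rate 0.35.
for k in range(1, 9):
    print(k, round(stacked_precision(0.5, 0.9, [0.35] * k), 4))
```

The proviso s > α is not decorative: calling `stacked_precision(0.5, 0.3, [0.35] * 20)` sends precision toward zero instead, because each added check then filters out correct candidates faster than wrong ones.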

Proposition 3 is the optimistic story usually told about ensembling, and it is real - if the verifiers are independent. They rarely are. A generator and a learned verifier trained on overlapping data share blind spots; a wrong answer that looks right to one model-based check often looks right to the next. The honest model has to admit a common failure mode.

Proposition 4 (Precision ceiling under correlation). Suppose a wrong candidate is deceptive with probability $\rho$, in which case all verifiers accept or reject it together (accepting with probability $\alpha_0$), and otherwise their false-accepts are independent, each at rate $\alpha_0$. Then, writing $s=1$ to isolate the false-accept channel,
$$\alpha_{\mathrm{eff}}(k) = \rho\,\alpha_0 + (1-\rho)\,\alpha_0^{k}, \qquad \lim_{k\to\infty} \pi_k = \frac{p}{p + (1-p)\,\rho\,\alpha_0} < 1 \quad (\rho > 0).$$
With probability $\rho$ the conjunction cannot help, because all verifiers move together; the surviving false-accept mass $\rho\alpha_0$ is independent of $k$. The independent component $(1-\rho)\alpha_0^{k}$ vanishes. Substituting into the Bayes expression and taking $k \to \infty$ leaves the stated ceiling.

This is the load-bearing result of the essay. Stacking checks is not a free lunch that converges to certainty; it converges to a ceiling fixed by the correlation of your verifiers. Driving reliability up therefore means driving ρ down - decorrelating the checks by making them mechanistically different (a type checker, an executed test, an independent model, a proof) rather than adding more checks of the same kind. Figure 2 shows the gap: independent verifiers climb toward one, while even mild correlation (ρ=0.15) plateaus well short of it.

Figure 2. Accepted-output precision against the number of stacked verifiers k, from 1 to 8 ($p=0.5$, $\alpha_0=0.35$). The independent curve (rho = 0) climbs toward 1; the rho = 0.05 and rho = 0.15 curves plateau below 1 at the ceilings of Proposition 4 (dotted). Points are Monte Carlo estimates (60,000 trials per point) and lie on the analytic curves.
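The ceilings can be computed directly from Proposition 4. This is a sketch of the analytic side only, using the p and α0 quoted in the caption; the function names are ours.

```python
def alpha_eff(k: int, alpha0: float, rho: float) -> float:
    """Proposition 4: effective false-accept rate of k verifiers with shared-failure weight rho."""
    return rho * alpha0 + (1.0 - rho) * alpha0 ** k


def precision_correlated(k: int, p: float, alpha0: float, rho: float) -> float:
    """Accepted-output precision with s = 1, as in the precision figure."""
    return p / (p + (1.0 - p) * alpha_eff(k, alpha0, rho))


def precision_ceiling(p: float, alpha0: float, rho: float) -> float:
    """Limit of precision_correlated as k grows without bound."""
    return p / (p + (1.0 - p) * rho * alpha0)


p, alpha0 = 0.5, 0.35  # the values quoted in the Figure 2 caption
for rho in (0.0, 0.05, 0.15):
    at_k8 = precision_correlated(8, p, alpha0, rho)
    print(f"rho={rho:.2f}  k=8: {at_k8:.4f}  ceiling: {precision_ceiling(p, alpha0, rho):.4f}")
```

At these settings the ceiling is about 0.983 for ρ = 0.05 and about 0.950 for ρ = 0.15, so one wrong answer in twenty survives any number of additional checks of the same kind in the latter case.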

The economics of reliability

Coverage and precision are bought with samples and checks, both of which cost money. The resample loop returns one accepted candidate after, in expectation, $1/\bigl(p s^k + (1-p)\alpha_{\mathrm{eff}}\bigr)$ draws, each costing one generation and $k$ verifications. Charging only for outputs that are both accepted and correct gives the cost of reliability.

Theorem 1 (Cost of a verified-correct output). The expected cost to produce one accepted-and-correct output is
$$\mathbb{E}[C] = \frac{c_g + \sum_{j=1}^{k} c_{v,j}}{p\,s^{k}}.$$
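The closed form is simple enough to evaluate by hand, but a short function makes the role of each term visible. This is a sketch with assumed unit costs (generation 1.0, each check 0.05) and an assumed p = 0.25; the name `expected_cost` is ours.

```python
from typing import Sequence


def expected_cost(c_g: float, c_v: Sequence[float], p: float, s: float) -> float:
    """Theorem 1: expected spend per accepted-and-correct output.

    Each draw costs one generation plus every verifier in the stack, and a draw is
    accepted-and-correct with probability p * s**k, so the number of draws is geometric.
    """
    k = len(c_v)
    return (c_g + sum(c_v)) / (p * s ** k)


print(expected_cost(1.0, [0.05] * 3, p=0.25, s=1.0))  # perfect sensitivity
print(expected_cost(1.0, [0.05] * 3, p=0.25, s=0.9))  # sensitivity below one costs extra draws
```

The false-accept rate does not appear: it determines how many accepted outputs are wrong, while the price of a correct one is set by the per-draw cost and the chance that a draw is both correct and accepted.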

Holding the generator fixed, adding a conditionally independent verifier multiplies the false-accept rate by $\alpha<1$ at additive cost $c_v$, whereas achieving the same multiplicative reduction in error by improving the generator requires raising $p$ toward one at a cost that diverges. Hence:

Corollary 1 (Verification dominance). For any target precision below the correlation ceiling of Proposition 4, there is a cost threshold above which stacking verifiers attains the target more cheaply than scaling the generator. Below the threshold the order can reverse; the two strategies trace distinct cost-reliability frontiers.
Model generation cost as $c_g(p) = c_0/(1-p)$, capturing that each halving of a model's error rate doubles its cost, so that cost diverges as $p \to 1$. Reliability under generator-scaling improves only as fast as $1-p$ shrinks, at diverging marginal cost; under the verifier strategy precision improves by an α-factor per added check at constant marginal cost $c_v$. The two parametric curves cross. Figure 3 plots both.
Figure 3. Cost-reliability frontiers: reliability versus expected cost per verified-correct output. The 'scale generator' curve (raising p) rises slowly and runs to high cost; the 'add verifiers' curve, with k labeled from 1 to 7, stacks cheap verifiers on a weak generator and reaches the same reliability for substantially less, until the correlation ceiling bites. Derived from Theorem 1 with $c_v = 0.05\,c_0$.
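The two frontiers can be compared at a few target reliabilities. The sketch below rests on several assumptions of ours that the essay does not state: reliability on the generator-scaling curve is read as p itself, with cost per correct output c_g(p)/p (Theorem 1 with k = 0); the verifier curve uses a weak generator with p = 0.5, independent verifiers with α = 0.35 and s = 1, and ρ = 0. Only c_g(p) = c_0/(1-p) and c_v = 0.05 c_0 come from the text.

```python
from typing import Optional, Tuple


def generator_cost(p: float, c0: float = 1.0) -> float:
    """Cost model from the text: c_g(p) = c0 / (1 - p)."""
    return c0 / (1.0 - p)


def scale_generator(target: float, c0: float = 1.0) -> float:
    """Cost per correct output when reliability comes from p alone (no verifier)."""
    p = target  # assumption: an unverified output is correct with probability p
    return generator_cost(p, c0) / p


def add_verifiers(target: float, p: float = 0.5, alpha: float = 0.35,
                  c_v: float = 0.05, c0: float = 1.0,
                  k_max: int = 50) -> Optional[Tuple[int, float]]:
    """Smallest independent stack (s = 1, rho = 0) reaching `target` precision, and its cost."""
    for k in range(1, k_max + 1):
        precision = p / (p + (1.0 - p) * alpha ** k)
        if precision >= target:
            cost = (generator_cost(p, c0) + k * c_v * c0) / p  # Theorem 1 with s = 1
            return k, cost
    return None  # target not reachable within k_max checks


for target in (0.5, 0.9, 0.99, 0.999):
    print(target, round(scale_generator(target), 2), add_verifiers(target))
```

Under these assumed numbers, a target of 0.99 takes five checks at about 4.5 c_0 per verified-correct output, against roughly 101 c_0 for a generator scaled to p = 0.99. At a target of 0.5 the ordering reverses (4.0 against 4.1), which is the kind of crossing the corollary describes.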

Reproducibility

Every figure is produced by a single seeded script with no inputs beyond its parameters; the closed forms above are overlaid on Monte Carlo estimates so the reader can see model and simulation agree. The full script is available here; the core of the precision experiment, which realizes the shared-failure model of Proposition 4, is:

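# precision under k stacked verifiers, shared-failure correlation rho
import numpy as np

rng = np.random.default_rng(0)                      # seed value is an assumption; any fixed seed works
p, a0, rho, trials = 0.5, 0.35, 0.15, 60_000        # Figure 2 settings; rho is one of the plotted values
ks, mc = range(1, 9), []                            # stack sizes and their empirical precisions
for k in ks:
    correct   = rng.random(trials) < p              # candidate correct?
    shared    = rng.random(trials) < rho            # deceptive: all verifiers move together?
    indep_fa  = (rng.random((trials, k)) < a0).all(axis=1)   # all k independent checks fooled
    shared_fa = rng.random(trials) < a0             # one shared draw decides every verifier
    wrong_acc = np.where(shared, shared_fa, indep_fa)
    accept    = np.where(correct, True, wrong_acc)   # s = 1: always accept correct
    mc.append(correct[accept].mean())               # empirical precision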

The precision simulation (excerpt). Running the full script regenerates all three figures deterministically.

Related work

The generation-verification asymmetry is the practical face of the finding-versus-checking gap formalized by Cook.[1] Learned verifiers were shown to lift reasoning accuracy by Cobbe et al.,[2] and self-consistency[3] can be read as a weak verifier (a majority vote among samples), connecting to the Condorcet jury theorem[10] (see footnote 2). Process-level reward and step-wise checking[4][5] push verification inside the trajectory rather than only at the end. Repeated sampling with a verifier as a scaling axis was studied empirically by Brown et al.,[7] and test-filtered code generation at scale by Li et al.[13] Debate[6] and reward modeling[8][9] are verification strategies for the harder case where correctness is not directly checkable. Our contribution is orthogonal: rather than proposing a verifier, we characterize the limits of verifier composition - the correlation ceiling of Proposition 4 - and the economics of choosing verification over generation (Theorem 1). The ceiling formalizes, as a hard upper bound, the widely noted failure of correlated ensembles, and reframes the central engineering task as decorrelation rather than accumulation.

Limitations and threats to validity

The model is deliberately minimal, and its assumptions are where it can mislead.

Independence is the crux. Proposition 3 assumes conditionally independent verifiers; Proposition 4 shows how badly that fails under a single shared failure mode, but real correlation structure is richer than one parameter ρ, and estimating it is itself hard.

Verifiers can be gamed. When the generator optimizes against a fixed verifier, the false-accept rate is no longer a constant but a function of optimization pressure - a Goodhart effect the i.i.d. model omits, and one that argues for held-out or non-differentiable checks.

The cost model is a caricature. The form $c_g(p)=c_0/(1-p)$ is plausible but unmeasured; the threshold's location depends on it, though its existence does not.

The simulations are of the model, not of language models. They confirm the mathematics is faithful to its own assumptions; they do not substitute for measuring p, α, ρ on a real task, which is the natural next step and the place where these predictions should be falsified.

Conclusion

Reliability is a resource you purchase, and the verification side of the ledger is where it is cheap. Sampling buys coverage geometrically; stacking buys precision, but only up to a ceiling set by how correlated your checks are; and the cost arithmetic favors cheap independent checks over expensive generators across a wide regime. The engineering imperative is therefore not "build a better verifier" but "build verifiers that fail differently." The next essay scales this from one agent to many, and asks how a fleet that must verify and coordinate allocates scarce resources without a central planner.

Footnotes

  1. Sensitivity below one (s<1) enters precision only through the ratio α/s, so it costs little precision while α is small; its main effect is to raise the expected number of draws, and hence cost, through the $s^k$ term in Theorem 1. The precision figures fix s=1 precisely to isolate the false-accept channel.
  2. Majority voting among k independent better-than-chance judges approaches certainty as k grows - the Condorcet jury theorem. Self-consistency is the special case where the judges are the generator's own samples, which is exactly why correlation among those samples (the failure mode of Proposition 4) limits it.

References

  1. S. A. Cook. The complexity of theorem-proving procedures. Proc. 3rd ACM Symposium on Theory of Computing (STOC), 1971.
  2. K. Cobbe et al. Training verifiers to solve math word problems. arXiv:2110.14168, 2021.
  3. X. Wang et al. Self-consistency improves chain-of-thought reasoning in language models. ICLR, 2023.
  4. H. Lightman et al. Let's verify step by step. ICLR, 2024.
  5. J. Uesato et al. Solving math word problems with process- and outcome-based feedback. arXiv:2211.14275, 2022.
  6. G. Irving, P. Christiano, and D. Amodei. AI safety via debate. arXiv:1805.00899, 2018.
  7. B. Brown et al. Large language monkeys: scaling inference compute with repeated sampling. arXiv:2407.21787, 2024.
  8. J. Leike et al. Scalable agent alignment via reward modeling: a research direction. arXiv:1811.07871, 2018.
  9. N. Stiennon et al. Learning to summarize from human feedback. NeurIPS, 2020.
  10. Marquis de Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. 1785.
  11. S. Yao et al. ReAct: synergizing reasoning and acting in language models. ICLR, 2023.
  12. N. Shinn et al. Reflexion: language agents with verbal reinforcement learning. NeurIPS, 2023.
  13. Y. Li et al. Competition-level code generation with AlphaCode. Science, 378(6624), 2022.
  14. P. Christiano et al. Deep reinforcement learning from human preferences. NeurIPS, 2017.