Research · 2026-06-09
The Model–Harness Configuration as the Unit of Agentic Capability
Bert Colemont
Abstract
Progress in autonomous LLM agents is conventionally attributed to the model: "model M scores X% on benchmark Y." Yet the same model embedded in different runtimes — different context management, tool surfaces, orchestration, and verification — succeeds or fails on the same task, so model-level attribution obscures a large and controllable source of performance variance. This paper argues that the appropriate unit of analysis for agentic capability is not the bare model but the model–harness configuration: the model together with the runtime that governs what it perceives, what it may do, what state survives across turns, what it may not do, how its work is evaluated, and how it recovers from failure. We define the harness and distinguish capability-of-the-model from capability-of-the-configuration; give an operating-systems-grounded taxonomy of harness components, each with its OS analog and the design question it poses; and synthesize empirical evidence that scaffolding shifts end-to-end task success substantially with the model held fixed. We draw out implications for benchmarking (configuration-level reporting), for open-weight competitiveness (a "build to delete" account in which weaker models need more scaffolding and stronger models need less), and for governance, where a runtime-enforced, logged control is auditable evidence while a prompt instruction is only behavior.
The wrong unit of analysis
Statements of the form "model M scores X% on benchmark Y" are the default currency of progress reports in autonomous agents. They are convenient, comparable across leaderboards, and almost always misleading for the systems people actually deploy. A language model does not, by itself, resolve a GitHub issue, navigate a website, or hold a multi-turn conversation under a policy. To do those things it must be embedded in a runtime that decides what it perceives, what it may invoke, what survives between turns, what it is forbidden to do, and how its output is checked.
This paper names that runtime the harness and argues that the proper unit of analysis for agentic capability is the model–harness configuration — the pairing of a specific model with a specific instantiation of six governed axes: perception, action, state, permissions, evaluation, and recovery.
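As a rough illustration of what "a specific instantiation of six governed axes" could look like as data, the sketch below records a model name alongside one setting per axis and derives a fingerprint from all of them. The class name `Configuration`, the `fingerprint` helper, and every field value are hypothetical placeholders chosen for this example, not definitions taken from the paper.

```python
from dataclasses import asdict, dataclass, replace
import hashlib
import json


@dataclass(frozen=True)
class Configuration:
    """A model paired with one setting per governed axis (all values hypothetical)."""
    model: str
    perception: str   # what the model sees, e.g. a context compaction policy
    action: str       # which tools it may invoke
    state: str        # what persists across turns
    permissions: str  # what it is forbidden to do
    evaluation: str   # how its output is checked
    recovery: str     # how failures are handled

    def fingerprint(self) -> str:
        """Short stable identifier covering the model and all six axes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


base = Configuration(
    model="model-m",
    perception="sliding-window",
    action="shell+editor",
    state="none",
    permissions="read-only-fs",
    evaluation="unit-tests",
    recovery="retry-once",
)
# Same weights, different state axis: a different unit of analysis.
variant = replace(base, state="hierarchical-memory")

same_model = base.model == variant.model
same_config = base.fingerprint() == variant.fingerprint()
```

Here `same_model` is true while `same_config` is false, which is the distinction the paper draws between capability-of-the-model and capability-of-the-configuration.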
An operating-systems taxonomy
A bare language model resembles a processor: capable of useful work per cycle, but unable on its own to manage memory, persist state, reach devices, enforce protection, schedule work, or recover from faults. The paper organizes harness components by the operating-system service each one mirrors:
| Harness component | OS analog | Design question |
|---|---|---|
| Perception / Context | RAM, working set | What to retrieve, compact, or evict? |
| Action / Tools | Syscalls, devices | Which tools to expose at each step? |
| State / Memory | Disk, filesystem | What must persist across turns? |
| Permissions / Sandboxing | Kernel rings | What least privilege, enforced how? |
| Orchestration loop | Scheduler | How to sequence reason and act? |
| Observability / Evaluation | Logging, tracing | How to make work inspectable? |
| Recovery | Exceptions, rollback | How to detect failure and retry? |
| Governance boundary | Trusted kernel | Which controls are runtime-enforced? |
The paper presents the analogy explicitly as an organizing device, not as a claimed isomorphism.
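To make the rows of the table concrete, here is a minimal sketch of a harness loop in which each component appears as a few lines of runtime code. It is an assumption-laden toy, not the paper's implementation: `run_harness`, the `Policy` type, the scripted policy standing in for a model, and the tool names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a model: maps the visible context to (tool, argument).
Policy = Callable[[List[str]], Tuple[str, str]]


def run_harness(policy: Policy, tools: Dict[str, Callable[[str], str]],
                allowed: set, max_steps: int = 5, window: int = 3):
    context: List[str] = []      # perception: bounded working set
    memory: Dict[str, str] = {}  # state: survives across turns
    trace: List[str] = []        # observability: inspectable record
    for step in range(max_steps):          # orchestration loop
        tool, arg = policy(context[-window:])
        if tool == "finish":
            trace.append(f"{step}: finish {arg}")
            return arg, memory, trace
        if tool not in allowed:            # permissions: enforced by the runtime
            trace.append(f"{step}: denied {tool}")
            context.append(f"error: {tool} not permitted")
            continue
        try:
            result = tools[tool](arg)      # action: tool call
        except Exception as exc:           # recovery: surface the fault, keep going
            trace.append(f"{step}: failed {tool} ({exc})")
            context.append(f"error: {exc}")
            continue
        memory[arg] = result
        trace.append(f"{step}: ok {tool}")
        context.append(result)
    return None, memory, trace


def scripted_policy(visible: List[str]) -> Tuple[str, str]:
    """Deterministic toy policy: tries a forbidden tool, then recovers."""
    if not visible:
        return ("delete", "notes.txt")
    if visible[-1].startswith("error"):
        return ("read", "notes.txt")
    return ("finish", visible[-1])


tools = {"read": lambda path: f"contents of {path}",
         "delete": lambda path: "deleted"}
answer, memory, trace = run_harness(scripted_policy, tools, allowed={"read"})
```

In this run the first action is denied by the runtime, the policy sees the error in its context and switches to the permitted tool, and the trace records all three steps regardless of what the policy returns.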
The evidence, with the model held fixed
The synthesis assembles published results in which the weights do not change and the surrounding runtime does. Tree search lifts GPT-4 from 4% to 74% on a planning puzzle; sampling-and-voting adds 17.9 points on GSM8K; reason–act interleaving, self-refinement, and episodic reflection each add double-digit gains; and a coding agent moves from 42.0% to 48.6% on a SWE-Bench Pro subset purely through hierarchical memory and context management. Configuration-level benchmarking finds substantial variation in completion, process quality, efficiency, and failure behavior across model–harness pairings, and audits show that flaws in the evaluation harness itself can distort reported capability by up to 100% in relative terms.
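Sampling-and-voting is perhaps the simplest of these techniques to sketch: the runtime draws several answers from the same fixed model and returns the most common one. The example below is only a toy under stated assumptions. `sample_and_vote` and `make_stub_model` are hypothetical names, and the stub replays a fixed list of answers rather than calling a real model, so it says nothing about the size of the gain reported on GSM8K.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def sample_and_vote(sample: Callable[[], str], n: int) -> str:
    """Draw n answers from the same model and return the majority answer."""
    votes = Counter(sample() for _ in range(n))
    return votes.most_common(1)[0][0]


def make_stub_model() -> Callable[[], str]:
    """Hypothetical noisy model: replays a fixed sequence of final answers."""
    answers = cycle(["20", "18", "18", "17", "18"])
    return lambda: next(answers)


single = make_stub_model()()                     # one sample: the first answer
voted = sample_and_vote(make_stub_model(), n=5)  # five samples, majority vote
```

The stub's weights, so to speak, never change between the two calls; only the runtime's sampling policy does, and the returned answer differs (`single` is `"20"`, `voted` is `"18"`).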
Every evidence claim is labeled by strength — strong, moderate, illustrative, or position-only — and the paper carries an explicit Honest caveats subsection: the studies are heterogeneous, magnitudes are not comparable across papers, and clean single-variable harness ablations remain rare.
Three implications
Benchmarking. A number attached to a bare model name is under-specified. Capability should be reported at the configuration level, with the harness disclosed and, where feasible, released.
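A minimal sketch of what configuration-level reporting might look like follows, using the 42.0% and 48.6% SWE-Bench Pro subset figures quoted in the evidence section. The `Result` record, the `spread_by_model` helper, and the model and harness labels are hypothetical; the source does not name the model or prescribe a reporting schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Result:
    model: str        # placeholder label; the source does not name the model
    harness: str      # placeholder label for the disclosed runtime
    benchmark: str
    score_pct: float


results = [
    Result("coding-model", "baseline", "SWE-Bench Pro subset", 42.0),
    Result("coding-model", "hierarchical-memory", "SWE-Bench Pro subset", 48.6),
]


def spread_by_model(rows: List[Result]) -> Dict[Tuple[str, str], float]:
    """Score range, in percentage points, hidden behind each bare model name."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row.model, row.benchmark)].append(row.score_pct)
    return {key: round(max(vals) - min(vals), 1) for key, vals in grouped.items()}


spread = spread_by_model(results)
# Relative gain of the better harness over the baseline, in percent.
relative_gain_pct = round(100 * (48.6 - 42.0) / 42.0, 1)
```

Reporting only the model name would collapse both rows into one entry and hide a 6.6-point range (about 15.7% relative); keeping the harness as part of the key is what makes the number reproducible.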
Open weights and "build to delete." Each harness component encodes an assumption about a current model weakness, so the optimal quantity of scaffolding decreases as the model improves — and, run backwards, weaker and open-weight models can buy back capability with engineering. A fair open-versus-proprietary comparison is a comparison of configurations.
Governance. The harness is the boundary where controls become evidence. A constraint expressed in a prompt is behavior; a constraint enforced by the runtime is auditable independently of the model's output. The EU AI Act's human-oversight requirements — monitoring operation to detect anomalies, overriding output, interrupting the system — are architectural requirements on the deployed configuration, not properties of model weights.
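The difference between a prompt instruction and a runtime control can be sketched in a few lines: the gate below checks every tool call against an allow-list, supports an operator interrupt, and writes its own log whatever the model outputs. `AuditedGate`, its methods, and the tool names are hypothetical illustrations; nothing here is taken from the paper or from the text of the EU AI Act.

```python
import json
from typing import Callable, Dict, List


class AuditedGate:
    """Hypothetical runtime control: every tool call is checked and logged."""

    def __init__(self, tools: Dict[str, Callable[[str], str]], allowed: set):
        self._tools = tools
        self._allowed = frozenset(allowed)
        self._interrupted = False
        self.log: List[str] = []  # append-only JSON lines written by the runtime

    def interrupt(self) -> None:
        """Human-oversight hook: block all further actions."""
        self._interrupted = True
        self._record("operator", "interrupt", "applied")

    def call(self, tool: str, arg: str) -> str:
        if self._interrupted:
            self._record(tool, arg, "blocked:interrupted")
            raise PermissionError("system interrupted by operator")
        if tool not in self._allowed:
            self._record(tool, arg, "blocked:not-allowed")
            raise PermissionError(f"{tool} is not permitted")
        self._record(tool, arg, "allowed")
        return self._tools[tool](arg)

    def _record(self, tool: str, arg: str, decision: str) -> None:
        entry = {"seq": len(self.log), "tool": tool, "arg": arg, "decision": decision}
        self.log.append(json.dumps(entry))


gate = AuditedGate(
    {"read": lambda p: f"contents of {p}", "send_email": lambda p: "sent"},
    allowed={"read"},
)
first = gate.call("read", "report.txt")
try:
    gate.call("send_email", "everyone")
    blocked = False
except PermissionError:
    blocked = True
gate.interrupt()
decisions = [json.loads(line)["decision"] for line in gate.log]
```

An auditor can read `gate.log` without trusting anything the model said: the allowed call, the blocked call, and the operator interrupt all appear as entries produced by the runtime itself.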
Provenance
The paper was drafted and verified using the discipline it describes: a multi-agent research workflow with generation separated from evaluation, every citation checked against the live arXiv API, every numeric claim traced to its source abstract or full text, and the EU AI Act characterization verified clause-by-clause against the official EUR-Lex text. A non-technical companion essay, The model is the CPU, not the computer, develops the same argument for a general audience.
The full paper — definition, taxonomy, evidence synthesis, implications, limitations, and the complete reference list — is available as a PDF below.

