Engineering · 2026-06-09

New paper: the model–harness configuration as the unit of agentic capability

We turned this morning's essay into a research paper: a precise definition of the agent harness, an OS-grounded taxonomy, and a synthesis of the evidence that scaffolding moves capability with the model held fixed. Built — and verified — with the discipline it describes.

This morning we published an essay arguing that the model is the CPU, not the computer. This evening we are publishing the paper version, "The Model–Harness Configuration as the Unit of Agentic Capability": seventeen pages, forty-seven references, available as a PDF.

The essay and the paper make the same claim at different registers. The essay says: stop grading the processor and start grading the machine. The paper makes that precise. It defines the harness — the runtime that governs what a model perceives, what it may do, what state survives across turns, what it may not do, how its work is evaluated, and how it recovers from failure — and argues that the proper unit of analysis for agentic capability is the model–harness configuration, not the model name on the leaderboard.

What the paper adds over the essay

Three things an essay cannot carry.

A definition you can use. Capability-of-the-model is the per-step competence of the weights. Capability-of-the-configuration is end-to-end task success under a specific runtime. These are empirically different quantities, and most public numbers report the second while attributing it to the first.

A taxonomy with edges. Eight harness components, each mapped to the operating-system service it mirrors (context as working memory, tools as system calls, permissions as kernel rings, the orchestration loop as the scheduler) and paired with the design question it forces. The paper is explicit that the analogy is an organizing device, not a proven isomorphism.

Evidence, strength-labeled. The synthesis assembles the model-held-fixed results: tree search lifting the same model from 4% to 74% on a planning task; a coding agent gaining 6.6 points on SWE-Bench Pro from context management alone; configuration sweeps finding substantial variation across model–harness pairings; benchmark audits showing the evaluation harness itself can distort reported capability by up to 100% in relative terms. Every claim carries an honesty label — strong, moderate, illustrative, or position-only — and the limitations section is written to be agreed with, not skimmed.
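The figures above mix absolute gains (points) with relative changes (percent of the starting value), and the two are easy to conflate. The sketch below works through the arithmetic; the function names are ours, the 4% and 74% figures come from the paragraph above, and the 20-to-40 pair is a made-up illustration of what a 100% relative change looks like, not a result from the paper.

```python
def absolute_gain_points(before_pct: float, after_pct: float) -> float:
    """Difference between two scores, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting score."""
    if before_pct == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after_pct - before_pct) / before_pct * 100.0


# Tree-search result quoted above: same model, 4% -> 74%.
print(absolute_gain_points(4.0, 74.0))   # 70.0 points
print(relative_change_pct(4.0, 74.0))    # 1750.0 percent of the baseline

# Hypothetical pair (not from the paper): a score doubling from 20% to 40%
# is a 100% relative change but only a 20-point absolute one.
print(relative_change_pct(20.0, 40.0))   # 100.0
```

Reporting both numbers side by side makes it harder to read a relative distortion as an absolute one, or the reverse.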

The part we care most about

For us, the load-bearing section is the governance argument. A constraint expressed in a prompt is behavior: it may or may not be followed. A constraint enforced by the runtime is evidence: it is auditable independently of the model's output. The EU AI Act's human-oversight requirements (monitor operation, detect anomalies, override the output, stop the system) are requirements on the deployed configuration, not on the weights. That is the same boundary we enforce in Hermes, where an unclassified prompt is rejected rather than guessed at. The harness is where capability and accountability turn out to be the same engineering problem.
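To make the prompt-versus-runtime distinction concrete, here is a minimal sketch of a runtime gate. It is an illustration under our own assumptions, not Hermes code: `PromptGate`, `AuditRecord`, `UnclassifiedPromptError`, and the `classify` callback are all hypothetical names. The point it shows is that the rejection and the audit record are produced before the model runs, so neither depends on what the model would have said.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class UnclassifiedPromptError(Exception):
    """Raised when the gate cannot assign a prompt to a known class."""


@dataclass(frozen=True)
class AuditRecord:
    prompt: str
    label: Optional[str]   # None means the classifier declined to label it
    allowed: bool


@dataclass
class PromptGate:
    # Hypothetical classifier: returns a label, or None when unsure.
    classify: Callable[[str], Optional[str]]
    audit_log: List[AuditRecord] = field(default_factory=list)

    def run(self, prompt: str, model: Callable[[str], str]) -> str:
        label = self.classify(prompt)
        allowed = label is not None
        # The record is written before the model is called, so the audit
        # trail exists whether or not the model ever produces output.
        self.audit_log.append(AuditRecord(prompt, label, allowed))
        if not allowed:
            # Reject rather than guess: the model never sees the prompt.
            raise UnclassifiedPromptError(prompt)
        return model(prompt)
```

A caller would wrap every model invocation in `gate.run(prompt, model)` and treat `UnclassifiedPromptError` as a normal outcome to handle, while `gate.audit_log` gives a reviewer a record that can be checked without trusting the model.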

Built the way it argues

The paper practices its thesis. It was drafted by a multi-agent research workflow with generation separated from evaluation: literature scouts fanned out, an adversarial verification pass checked every candidate source against the live arXiv API, and every prose characterization was then audited against the source abstracts and — where abstracts were silent — full text, including a clause-by-clause check of the AI Act text on EUR-Lex. One unverifiable citation was dropped rather than trusted. The audit trail ships with the paper.
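The separation of generation from evaluation described above can be sketched in a few lines. This is a simplified stand-in for the idea, not the workflow we actually ran: `filter_sources`, `Verdict`, and the `verify` callback are hypothetical names, and the verifier here is whatever independent check the caller supplies (no real API is called in this sketch).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Verdict:
    source_id: str
    verified: bool


def filter_sources(
    candidates: Iterable[str],
    verify: Callable[[str], bool],
) -> Tuple[List[str], List[Verdict]]:
    """Keep only candidates an independent verifier confirms.

    Returns the kept sources plus a full audit trail, including the
    candidates that were dropped.
    """
    kept: List[str] = []
    trail: List[Verdict] = []
    for source_id in candidates:
        ok = verify(source_id)          # the generator has no say here
        trail.append(Verdict(source_id, ok))
        if ok:
            kept.append(source_id)      # unverifiable sources are dropped
    return kept, trail


# Hypothetical usage: the generator proposed three sources, and the
# verifier (here a simple set lookup) can confirm only two of them.
confirmed = {"src-a", "src-c"}
kept, trail = filter_sources(["src-a", "src-b", "src-c"], confirmed.__contains__)
```

The design choice worth noting is that the trail records every verdict, not only the survivors, so a reader can see what was rejected and why the final list is shorter than the proposed one.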

A generator that is not the only judge of its own work: the paper recommends it, because the paper needed it.

Read the paper here, or start with the essay if you prefer the view from the machine room.
