Engineering · 2026-06-09

The model is the CPU, not the computer

The popular account of AI agents gets the slogan right and the engineering wrong. The model is the CPU; the harness is the operating system — and in a regulated, European context, the harness is where the real decisions live.

There is a common story about AI agents that is worth correcting, because the mistake in it is expensive.

The story says: the models finally got good enough, so now the agents ship the software. It is a tidy narrative, and it quietly erases most of the engineering that actually made the difference. The teams shipping real agent work in 2026 did not win by waiting for a smarter model. They won by building the machine around the model.

The claim this piece defends is precise: for any agent doing serious work, the model is not the system, and treating it as the system is now a measurable mistake. The system is the model plus the harness — the instructions it receives, the tools it can call, the workspace it can mutate, the memory it can write, the evidence it can inspect, the permissions it cannot bypass, the evaluator that catches its failures, and the handoff that lets a human take responsibility for the result.

That is not a philosophical distinction. It is the difference between an agent as a clever autocomplete surface and an agent as a governed, executable system. And the gap between those two things is exactly where European teams — who have to answer for what their systems do, not just what they can do — should be spending their attention.

The cleanest way to see it is an old analogy, and it survives scrutiny better than most.

The model is the CPU, not the computer

A CPU executes instructions. It is necessary, it is powerful, and on its own it is nearly useless. A computer that does work also needs memory management, a file system, device drivers, a scheduler, a permission model, logs, recovery, and an interface an operator can actually trust. Strip those away and you have a fast processor that cannot be relied on for anything that matters.

Agents are the same shape. The model supplies reasoning and generation. The context window is the working set. Tools are devices and system calls. Durable files are disk. The orchestration loop is the scheduler. Guardrails and permissions are the protection ring. Tests, traces, and browser automation are the observability layer. The harness is the operating environment that converts raw model capability into governed execution.

Figure: a central chip labelled MODEL (the CPU) sits inside a boundary labelled THE HARNESS, surrounded by six components: context window as RAM, tools and APIs as devices, durable files as disk, orchestration as the scheduler, permissions as the kernel ring, and observability as tests, traces and the evaluator.

The analogy is not perfect, and the imperfections are instructive rather than fatal. A CPU is deterministic; a language model is not. A model also carries an enormous amount of learned world-knowledge that no processor has. But for anyone trying to understand why "we used the same model" can produce very different outcomes, the analogy holds. Two machines can share a processor and behave nothing alike if one runs a mature operating system and the other runs a pile of ad hoc scripts.

That is the whole argument in one image. The useful part is that we now have numbers behind it.

Takeaway: If your mental model of an agent stops at "the model," you are reasoning about a CPU and calling it a computer.

The slogans are outrunning the engineering

The field is currently compressing hard engineering into slogans, and the slogans are shaping how teams plan, budget, and govern — usually in the wrong direction.

Take the most-shared example. OpenAI published a February 2026 account of an internal beta built with zero manually written lines of code — roughly a million lines across application code, tests, CI, documentation, observability, and tooling, about 1,500 merged pull requests, driven first by three engineers and later seven. The popular reading is "the agents shipped production software while humans watched."

That is not what happened, and OpenAI's own framing is more interesting. Humans started from an empty repository, designed the environment, specified intent, built the feedback loops, exposed observability, and kept the codebase legible enough for Codex to operate inside it. They used a short AGENTS.md as a map rather than a monolithic rulebook, and pushed deeper knowledge into structured, verifiable documentation. The scarce resource they were optimising was not model capability. It was human attention. The headline is not "no humans." It is that humans moved up the stack — from writing code to engineering the environment that writes code.

The Sora for Android story is the same lesson in different clothes. OpenAI reports the app went from prototype to global launch in 28 days (8 October to 5 November 2025), with four engineers, a #1 Play Store launch day, and 99.9 percent crash-free reliability, on an early build of GPT‑5.1‑Codex. The number everyone quotes is the roughly 85 percent of the project written by Codex — and the number worth remembering is that it was 85, not 100. The engineers laid the architectural foundations themselves. The approach that failed was the one the slogan celebrates: the one-shot prompt, "build the Android app from the iOS code." It produced something. It did not produce the product.

So the lesson is the inverse of the slogan. Agent-heavy work does not remove engineering. It relocates it — out of the function body and into the harness.

Takeaway: When a result looks effortless, ask where the engineering went. It went into the environment.

What a harness actually is

A minimal agent is easy to describe: a model with instructions and tools. OpenAI's own Agents SDK defines an agent as an LLM configured with instructions, tools, and optional runtime behavior such as handoffs, guardrails, and structured outputs. That definition is correct and almost completely insufficient for production, because production harnesses grow well past it.

In infrastructure terms, the mapping looks like this.

Conventional system	Agent system	Why it matters
CPU	Model	Produces the next token, decision, or action.
RAM	Context window	Holds the active working set: task, code, observations, prior steps.
Disk	Files, commits, task state, memory stores	Lets useful state survive beyond one context window or session.
Devices	Tools, APIs, browser, shell, database access	Lets the system act on the world instead of only emitting text.
System calls	Typed tool calls	Defines how the model asks the outside world to do work.
Kernel permissions	Tool policy, approvals, sandboxing, secrets isolation	Prevents the model from crossing boundaries by persuasion or accident.
Scheduler	Orchestration loop	Decides when to call the model, which context to provide, and when to stop.
Observability	Logs, traces, screenshots, tests, evaluator output	Turns failure into evidence the agent and reviewers can inspect.
Recovery	Retries, rollbacks, resumable state, failed-action history	Lets the system continue after a bad action without pretending it did not happen.

These rows are not decorative. They are why a harness can move agent behavior as much as a model upgrade can. Concretely, a serious harness defines six things, and each one is a design surface a team owns.

First, what the model can see. Repository-local docs, architecture maps, issue context, prior plans, code search, traces, browser state, logs. The question is never "how much context can the model hold?" It is "which facts are legible to the model at the moment it acts?" Those are different questions, and confusing them is how teams ship agents that technically had the information and still got it wrong.

Second, what the model can do. Shell access, file edits, package installs, browser control, database queries, deploy actions — these are not morally equivalent. A serious harness treats tools as capability grants, not convenience functions.

Third, how state survives turns and sessions. Long-running agents lose coherence unless progress, plans, decisions, and failed attempts are externalized into durable artifacts. This is why progress files, structured task lists, git commits, and worktree-scoped state keep reappearing in the stronger harnesses. Memory is not a model feature here. It is an architectural choice.

Fourth, what cannot be delegated to the model at all. Permissions, policy, secrets handling, tool risk, tenant boundaries, deployment gates. A prompt that tells an agent to be careful is not a control. It is a suggestion with good intentions.

Fifth, how work is evaluated. Unit tests, type checks, linting, browser automation, screenshots, traces, diff review, security scans, human sign-off — all of these are part of the harness the moment they feed back into the loop rather than sitting in a wiki.

Sixth, how the system learns from failure. A harness that hides stack traces, discards failed actions, or lets the agent mark its own work complete is training the wrong behavior at inference time, every single run.

A harness, in other words, is not a prompt wrapper. It is the runtime that makes a probabilistic component usable as part of a deterministic-enough system.
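The second, fourth, and sixth points can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions: the `Grant`, `Harness`, and `ToolNotGranted` names, the risk labels, and the tool names are all hypothetical and do not come from any vendor's API. The point it shows is that the grant table lives in the harness, so a tool that was never granted cannot be reached however the request is phrased, and every failed attempt stays on the record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Grant:
    """A capability the harness exposes for one task (hypothetical shape)."""
    name: str
    risk: str  # illustrative labels: "read", "write", "irreversible"
    handler: Callable[[str], str]


class ToolNotGranted(Exception):
    """Raised by the harness, not by the model, when no grant exists."""


@dataclass
class Harness:
    grants: Dict[str, Grant]
    failed_actions: List[dict] = field(default_factory=list)

    def dispatch(self, tool: str, arg: str) -> str:
        grant = self.grants.get(tool)
        if grant is None:
            # Keep the refused attempt as evidence instead of discarding it.
            self.failed_actions.append({"tool": tool, "arg": arg, "error": "not granted"})
            raise ToolNotGranted(tool)
        try:
            return grant.handler(arg)
        except Exception as exc:
            # Failed actions stay in the record so the next step can see them.
            self.failed_actions.append({"tool": tool, "arg": arg, "error": repr(exc)})
            raise


# Usage: only read_file is granted, so a secret-store tool simply does not exist here.
harness = Harness(grants={
    "read_file": Grant("read_file", "read", lambda path: f"<contents of {path}>"),
})
print(harness.dispatch("read_file", "README.md"))
```

Nothing in this sketch asks the model to behave. A request for an ungranted tool fails in `dispatch`, and the refusal is appended to `failed_actions` where both the agent and a reviewer can read it.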

Takeaway: A prompt is one input. A harness decides what the model perceives, may do, remembers, cannot touch, is judged by, and learns from. That is an operating system, not a wrapper.

The evidence is no longer anecdotal

For a while, "the harness matters" was an assertion. It is now a measurement.

Terminal-Bench 2.0 is a deliberately hard benchmark: 89 tasks in real terminal environments, each with an isolated environment, a human-written solution, and tests for verification. At publication, frontier models and agents scored below 65 percent. That matters because it grades an agent operating in an environment, not a model answering static questions — the difference between a driving test and a written exam.

LangChain then showed one of the cleanest harness effects on record. Holding the model fixed at gpt-5.2-codex, they moved the same model from 52.8 percent to 66.5 percent on Terminal-Bench 2.0 — a 13.7-point improvement — purely by changing the harness: prompts, middleware hooks, tool implementations, and model-specific profiles. No new model. (LangChain has written this up across more than one post, including a companion piece on tuning agents to different models; the headline figure comes from the harness-engineering work.)

Bar chart: holding the model fixed at gpt-5.2-codex, a bare harness scores 52.8 percent on Terminal-Bench 2.0 while a tuned harness scores 66.5 percent — a gain of 13.7 points with no new model.

The newer Harness-Bench paper makes the structural argument directly: capability should be reported at the model-harness configuration level, because the harness governs context, tools, state, constraints, permissions, tracing, and recovery. Across 5,194 execution trajectories, the authors found substantial differences in completion, process quality, efficiency, and failure behaviour across model-harness pairings.

This should change how a leaderboard is read. A bare model score is becoming less like a benchmark for an application and more like a benchmark for an engine on a test stand — informative, and insufficient on its own to predict lap times. The chassis, transmission, tyres, route, telemetry, and brakes still decide the race.

More harness is not always better, either, and the same discipline says so. Vercel reported removing roughly 80 percent of the tools from an internal data agent and improving performance. On their small benchmark of five representative queries, the stripped-down, file-system-based agent completed all five, used fewer tokens, and ran faster; the older tool-heavy architecture's worst case used 100 steps and 145,000 tokens before failing. The lesson is not that tools are bad. It is that every tool is a decision the model must understand, select, and use correctly — and when the underlying data layer is well-structured, a small, boring action space (grep, cat, ls) can beat a large menu of specialised functions.

It is worth resisting the opposite over-correction too. "The model was never the constraint" is too strong; the model is often the constraint. The accurate formulation is narrower and more useful: the model alone is not the unit of performance. The unit is the model in a harness, under a task distribution, with a budget, a tool surface, and an evaluator.
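Reporting at the configuration level can be as simple as keying results by model and harness together. The sketch below uses only the two Terminal-Bench 2.0 figures quoted above; the `Config` type, the `harness_effect` helper, and the "bare" and "tuned" labels are illustrative names, not part of any benchmark tooling.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Config:
    """The unit of performance: a model inside a specific harness."""
    model: str
    harness: str


# Scores in percent, as quoted in the text for Terminal-Bench 2.0.
scores: Dict[Config, float] = {
    Config("gpt-5.2-codex", "bare"): 52.8,
    Config("gpt-5.2-codex", "tuned"): 66.5,
}


def harness_effect(results: Dict[Config, float], model: str,
                   baseline: str, candidate: str) -> float:
    """Point difference attributable to the harness, holding the model fixed."""
    delta = results[Config(model, candidate)] - results[Config(model, baseline)]
    return round(delta, 1)


print(harness_effect(scores, "gpt-5.2-codex", "bare", "tuned"))  # 13.7
```

A table keyed this way cannot record a score without also recording the harness that produced it, which is the discipline the paragraph above argues for.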

Takeaway: Same model, +13.7 points from the harness. Benchmarking a model in isolation measures the engine and leaves you guessing at the car.

What Anthropic learned about state, testing, and knowing when to delete

If OpenAI's posts are about legibility, Anthropic's are about the two problems that quietly kill long-running agents: state and self-grading.

In its earlier work on effective harnesses for long-running agents, Anthropic split the system into an initializer and a coding agent. The initializer sets up the world: an init.sh, a progress log, an initial commit, and a structured feature list. Each subsequent session reads the artifacts, picks one incomplete feature, implements it, tests it, commits, and leaves clean state for the next session.
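The initializer and session split described above might look roughly like this in Python. It is a sketch, not Anthropic's implementation: the file names `features.json` and `progress.log`, the field names, and the `implement_and_test` callback are assumptions made for illustration, and the `init.sh` script and git commits are left out to keep it short.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def initialize(workdir: Path, features: List[str]) -> None:
    """Initializer: write the durable artifacts once, before any coding session."""
    items = [{"id": i, "description": d, "done": False}
             for i, d in enumerate(features, start=1)]
    (workdir / "features.json").write_text(json.dumps(items, indent=2))
    (workdir / "progress.log").write_text("initialized\n")


def run_session(workdir: Path,
                implement_and_test: Callable[[dict], bool]) -> Optional[int]:
    """One session: read the artifacts, take one incomplete feature, leave clean state."""
    path = workdir / "features.json"
    items = json.loads(path.read_text())
    todo = next((f for f in items if not f["done"]), None)
    if todo is None:
        return None  # nothing left to do
    ok = implement_and_test(todo)
    todo["done"] = ok
    path.write_text(json.dumps(items, indent=2))
    with (workdir / "progress.log").open("a") as log:
        log.write(f"feature {todo['id']}: {'done' if ok else 'failed'}\n")
    return todo["id"]


# Usage: each call stands in for a fresh session with no memory of the last one.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    initialize(work, ["login form", "search page"])
    run_session(work, lambda feature: True)
    print((work / "progress.log").read_text())
```

Each call to `run_session` starts from the files alone, which is the property that lets a new context window pick up where the previous one stopped.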

One detail in that post is worth more than it looks. Anthropic stored the feature list as JSON rather than Markdown because, in their experiments, the model was less likely to casually rewrite or overwrite JSON. That is a harness-level finding, not a model-level one: the representation format changed the agent's behaviour. The substrate matters, down to the file type.

The second finding is the one every team should absorb: agents routinely marked features complete before any end-to-end verification. The fix was not a smarter model — it was requiring the agent to drive the actual product. When Claude was prompted to exercise a web app through Puppeteer like a user, it found bugs that source inspection alone never surfaced. Expect this to become standard practice: if the product has a user interface, the harness must make the agent use the product.
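One way to encode that rule is to make "complete" a state the agent cannot set without end-to-end evidence attached. The sketch below assumes a hypothetical `Evidence` record and an `"e2e:"` naming convention for checks that drove the real product; neither comes from Anthropic's post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    check: str    # hypothetical convention: end-to-end checks start with "e2e:"
    passed: bool


@dataclass
class Feature:
    name: str
    evidence: List[Evidence] = field(default_factory=list)
    complete: bool = False


class UnverifiedCompletion(Exception):
    """Raised when completion is claimed without passing end-to-end evidence."""


def mark_complete(feature: Feature) -> None:
    """Only the harness flips the flag, and only when e2e checks exist and all pass."""
    e2e = [e for e in feature.evidence if e.check.startswith("e2e:")]
    if not e2e or not all(e.passed for e in e2e):
        raise UnverifiedCompletion(feature.name)
    feature.complete = True


# Usage: unit tests alone are not enough to close the feature.
login = Feature("login", evidence=[Evidence("unit: password hashing", True)])
try:
    mark_complete(login)
except UnverifiedCompletion as err:
    print(f"refused: {err}")
```

The agent can still report that it believes the work is finished. What it cannot do is change the recorded status on its own say-so.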

The March 2026 follow-up, harness design for long-running application development, pushes further with a planner, generator, and evaluator. The planner expands a short prompt into a spec. The generator builds. The evaluator uses Playwright to exercise the running app and grade it against explicit criteria, with generator and evaluator effectively negotiating a contract for what "done" means before each sprint.

The cost comparison is the part worth keeping in view. On a retro arcade-game build, a solo run on Opus 4.5 cost about $9 and ran for 20 minutes — and produced a broken game where entities rendered but nothing responded to input, with the wiring between definitions and runtime silently severed. The full harness cost about $200 and ran for six hours — over 20× more expensive — and produced something a person could actually play. That is not an argument to spend $200 on every task. It is an argument that "cheap output" and "working product" are different categories that can look identical in a screenshot.

And then the twist that makes the whole philosophy honest. When the model improved to Opus 4.6, Anthropic simplified the harness. With a much larger context window, the sprint construct became unnecessary for some tasks; a slimmer harness built a browser-based DAW in about four hours for roughly $124, with the evaluator's role reduced to a single pass at the end. The evaluator still caught meaningful gaps, but it earned its keep conditionally rather than always.

Figure: as the model improves from Opus 4.5 to Opus 4.6, the harness shrinks from four layers (planner, generator, per-sprint evaluator and sprints; about six hours and $200) to just generator plus a single-pass evaluator (about four hours and $124); the planner and sprints are deleted. Run backwards, a weaker or open-weight model needs more harness, not less.

This is the strongest version of "build to delete." Every harness component encodes an assumption about what the current model cannot do reliably. When the model changes, every assumption must be retested — and the right move is sometimes to remove scaffolding, not add it. A team that only ever adds harness is as confused as a team that never adds any.

Takeaway: State must live outside the model, the agent must test through the real product, and every piece of scaffolding has an expiry date tied to model capability.

Context engineering is not prompt engineering

Manus uses different vocabulary and arrives at the same boundary. Their context-engineering write-up is about the unglamorous physics of production agent loops: act, observe, append, act again.

Figure: a clockwise four-stage loop. ACT issues a typed tool call, OBSERVE collects logs, traces and tests, APPEND writes the result to disk rather than only to RAM, and DECIDE chooses the next step before returning to ACT. The context grows append-only and failed actions are kept as evidence.

In that regime, the first-order concerns are not clever prompts. They are KV-cache hit rate (the single biggest lever on latency and cost), serialization stability, tool-definition stability, and recoverability. A changing timestamp near the front of a prompt can wreck prefix-cache reuse. Dynamically adding and removing tools can both confuse the model and invalidate cached context. A single huge observation can swamp the window even when the nominal context limit is generous.

Their answers are harness answers, every one. Keep prompt prefixes stable. Make context append-only where possible. Use the file system as durable, restorable memory rather than stuffing everything into the window. Keep a living todo list near the tail of the context so the current objective stays salient. Preserve failed actions and stack traces, because errors are evidence the model can use to avoid repeating itself — deleting them is deleting the agent's ability to learn within a run. And manage tool availability by masking, not by mutating the tool set mid-flight.
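Several of these rules fit in one small sketch: a stable prefix, an append-only event list that keeps errors, a todo at the tail, and masking instead of mutation. Everything named here (`Context`, `TOOL_DEFS`, `is_callable_now`, the tool names) is hypothetical, and the masking is approximated as a refusal at dispatch time, which simplifies whatever a real runtime does during decoding.

```python
import json
from typing import List, Set

TOOL_DEFS = ["browser_open", "file_read", "shell_run"]  # hypothetical names, fixed order


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading run, a rough proxy for reusable cached prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Context:
    def __init__(self, system_prompt: str) -> None:
        # Stable prefix: deterministic serialization, no timestamps, full tool list.
        self._prefix = json.dumps(
            {"system": system_prompt, "tools": TOOL_DEFS}, sort_keys=True)
        self._events: List[str] = []

    def append(self, event: dict) -> None:
        # Append-only: observations and errors are added, never edited or removed.
        self._events.append(json.dumps(event, sort_keys=True))

    def render(self, todo: str) -> str:
        # The todo goes last so the current objective stays near the tail.
        return "\n".join([self._prefix, *self._events, json.dumps({"todo": todo})])


def is_callable_now(tool: str, masked: Set[str]) -> bool:
    """Masking: the definition stays in the prefix; the harness refuses the call."""
    return tool in TOOL_DEFS and tool not in masked


ctx = Context("You are a coding agent.")
ctx.append({"tool": "shell_run", "error": "exit code 1"})  # failure kept as evidence
print(ctx.render("fix the failing test"))
```

Comparing two renders with `shared_prefix_len` makes the caching argument visible: a timestamp placed at the front of the prompt drops the shared prefix to a handful of characters, while the layout above shares everything up to the most recent event.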

This is why "prompt engineering" is too small a phrase for the work. The prompt is one part of an interface. Context engineering asks what the model may observe, where long-term state lives, which facts are restorable, how failures are represented, how the tool surface changes, and what the runtime does to keep attention on the task. For a chat turn, this is overkill. For an agent making fifty tool calls across hours, it is the job.

Takeaway: Prompt engineering tunes one message. Context engineering designs the entire information environment the agent lives inside, run after run.

Open weights, and where the gap actually goes

This reframing has a consequence the slogan tends to skip, and it is the most important one for anyone building in Europe: if the harness is doing this much of the work, the model underneath it no longer has to be a frontier API.

The evidence supports this directly. Alibaba's Qwen3.6‑27B, a dense, open-weight, self-hostable 27B model, reportedly scores around 59.3 on Terminal-Bench 2.0, in the neighbourhood of Opus‑4.5-class agentic coding, while outperforming the far larger Qwen3.5‑397B mixture-of-experts model that shipped before it. Open-weight families like Qwen3.5‑122B and its successors are now genuinely competitive on well-scoped agentic work. Pair one of them with a harness tuned specifically to it, and you can get most of the way to a closed frontier model.

Most of the way is the accurate phrase, and the earlier evidence explains why. A harness amplifies a model's capability; it does not manufacture it. It can address weaknesses in state, verification, and tool use — it cannot supply a reasoning step the model cannot take on a genuinely hard problem. So the gap narrows on the tasks a good harness is good at (bounded, verifiable, tool-heavy) and persists on the ones it is not (open-ended, long-horizon, hard reasoning). A single benchmark number understates this: averages hide tail behaviour, and tail behaviour — how a model fails when it fails — is where frontier models still tend to separate. It is the reason Harness-Bench argues for reporting at the configuration level rather than by a model score alone.

There is also a cost, and it falls in a predictable place. Recall the build-to-delete principle: a stronger model needs less harness. Run it the other way and you get the open-weight reality: a smaller or weaker model needs more harness to reach the same point. You do not escape the capability gap; you relocate it, out of the model's price tag and into your own engineering. More scaffolding, more tokens, more latency, more surface area to maintain, and tuning that is model-specific and does not transfer for free. "Tuned specifically to it" is not a given; it is a project.

For many teams that cost is worth paying, and for European ones it is often the better choice rather than a compromise. An open-weight model you can host yourself, wrapped in a harness you control, lets you close the remaining capability gap in exactly the layer where sovereignty and auditability already have to live. You trade a few points of peak capability for control over where state is stored, which tools may cross a jurisdictional boundary, and who may change the policy. Put that way, the harness is not only how you make a smaller model competitive. It is how you make a competitive model governable — which is the subject this argument has been circling all along.

Takeaway: A well-built harness lets an open-weight model approach frontier-class agent work — and you pay the remaining gap in engineering rather than in API fees. In a sovereignty context, that is frequently the trade you want.

The part that matters most in Europe: the harness is the governance boundary

That last point deserves its own argument, because it is the one most easily skipped.

If an agent can read files, call APIs, query databases, route prompts, write code, open pull requests, or trigger deployments, then the harness is a policy boundary, whether or not anyone designed it to be. It decides which capabilities exist, which are visible for a given task, which require approval, which are logged, and which are simply impossible. That boundary is being drawn either deliberately, by engineers, or accidentally, by omission.

The principle is firm: that boundary must not depend on the model agreeing to behave. A system prompt can ask an agent not to read secrets. A harness can make the secret store unavailable. The first is behaviour; the second is a control. Regulators, auditors, and incident reviewers care about the second, and they are right to.

Figure: two panels contrasting the same goal. On the left, a suggestion: a system prompt asks the model not to read the secret store, the request reaches the model, and it may or may not comply; the outcome depends on the model and produces a hope. On the right, a control: the secret store is present but not mounted, and a runtime barrier blocks the model from reaching it regardless of the model; the outcome produces evidence.

This is the same reason mature teams prefer explicit prompt classification in their LLM gateways over trusting the model to infer sensitivity. A model can guess that something is confidential. A gateway can enforce a declared policy and produce a record of having done so. One is inference; the other is evidence. When a decision is questioned eighteen months later, "the agent inferred it was fine" is not an answer to give a supervisory authority.

Agent harnesses need the same discipline, and it should be concrete:

Tool calls should carry identity, tenant, purpose, risk, and audit metadata.
Dangerous tools should be gated outside the model, not behind a polite instruction.
Generated code should land in reviewable diffs, never straight to production.
Browser actions, shell commands, test results, and rejected attempts should be traceable after the fact.
Irreversible boundaries — production deploys, external messages, financial actions, destructive data operations, policy exceptions — should require human approval as a property of the runtime, not the goodwill of the agent. The agent assists; a person decides and is accountable.
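The list above translates fairly directly into a tool-call envelope and a gate. The following Python sketch is illustrative only: the `ToolCall` fields mirror the metadata named in the list, while the `IRREVERSIBLE` set, the `execute` function, and the in-memory `audit_log` are assumptions standing in for whatever a real runtime and audit store would provide.

```python
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional

# Illustrative set of tools that always need a human decision.
IRREVERSIBLE = {"deploy_production", "send_external_message", "delete_data"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict
    identity: str   # which agent or user is acting
    tenant: str
    purpose: str
    risk: str       # illustrative labels: "read", "write", "irreversible"


class ApprovalRequired(Exception):
    """Raised by the runtime when a human has not approved an irreversible call."""


audit_log: List[dict] = []


def execute(call: ToolCall, run: Callable[[ToolCall], str],
            approved_by: Optional[str] = None) -> str:
    needs_human = call.tool in IRREVERSIBLE or call.risk == "irreversible"
    record = {**asdict(call), "approved_by": approved_by}
    if needs_human and approved_by is None:
        # Rejected attempts are logged too, so they are traceable after the fact.
        audit_log.append({**record, "outcome": "blocked"})
        raise ApprovalRequired(call.tool)
    result = run(call)
    audit_log.append({**record, "outcome": "executed"})
    return result


# Usage: the same call is blocked without a named approver and runs with one.
deploy = ToolCall("deploy_production", {"version": "1.2.3"},
                  identity="agent-7", tenant="tenant-a",
                  purpose="scheduled release", risk="irreversible")
try:
    execute(deploy, lambda c: "deployed")
except ApprovalRequired:
    print("blocked, waiting for a person")
print(execute(deploy, lambda c: "deployed", approved_by="alice"))
```

The approval is a parameter of the runtime call, not a sentence in the prompt, and both the blocked and the executed attempt leave a record that names who approved what.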

This is where harness engineering meets the European question of sovereign and jurisdictional AI. Sovereignty is not only about which model endpoint receives the prompt. It is about where state is stored, which tools can cross a jurisdictional boundary, what evidence exists for each decision, whether the runtime is inspectable, and — the question that outlives every model release — who may change the policy. A team that has answered "which model, hosted where" and left the rest to a system prompt has secured the most visible layer and left the rest undefined.

The model may execute. The harness must govern. If you can engineer only one of them carefully, engineer the harness.

Takeaway: A control the model can talk its way around is not a control. In a regulated context, the harness — not the model card — is where your compliance story actually lives.

What we would build first

The practical shape is less exotic than the terminology suggests. If you are starting today, build in this order.

Start with a short repository entry point. AGENTS.md, CLAUDE.md, or your local equivalent should be a map, not a manual — pointing to architecture, product, security, style, verification, and deployment docs that are themselves maintained and reviewable. A monolithic rule file rots, overflows context, and cannot be verified mechanically.

Make task state structured. Markdown is fine for prose; status the agent mutates repeatedly should be JSON, YAML, SQLite, or anything with clear fields and mechanical validation. The representation should make destructive edits obvious — and, as Anthropic found, harder to make by accident.
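Mechanical validation of task state can be very small. The sketch below assumes a hypothetical JSON task list with `id`, `title`, and `status` fields and an illustrative set of allowed statuses; it compares a proposed update against the previous state and reports anything that looks like a destructive edit.

```python
import json
from typing import List

ALLOWED_STATUSES = {"todo", "in_progress", "done", "failed"}  # illustrative
REQUIRED_FIELDS = {"id", "title", "status"}                    # hypothetical schema


def validate_update(old_json: str, new_json: str) -> List[str]:
    """Return a list of problems; an empty list means the update is acceptable."""
    old = {task["id"]: task for task in json.loads(old_json)}
    problems: List[str] = []
    new = {}
    for task in json.loads(new_json):
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            problems.append(f"task is missing fields: {sorted(missing)}")
            continue
        if task["status"] not in ALLOWED_STATUSES:
            problems.append(f"task {task['id']} has unknown status {task['status']!r}")
        new[task["id"]] = task
    for task_id, task in old.items():
        if task_id not in new:
            problems.append(f"task {task_id} was deleted")
        elif new[task_id]["title"] != task["title"]:
            problems.append(f"task {task_id} title was rewritten")
    return problems


before = json.dumps([
    {"id": 1, "title": "login form", "status": "todo"},
    {"id": 2, "title": "search page", "status": "todo"},
])
after = json.dumps([{"id": 1, "title": "login form", "status": "done"}])
print(validate_update(before, after))  # reports that task 2 was deleted
```

A check like this runs before the harness accepts the agent's write, so a dropped or rewritten task is rejected rather than discovered later.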

Keep the tool surface small and typed. Prefer tools whose inputs and outputs are explicit, testable, and boring. Add tools when traces show repeated failure or wasted work; remove them when they add selection ambiguity without improving success. Vercel's result is permission to delete.

Separate generation from evaluation when the task is expensive or ambiguous. The evaluator need not be another large system — a test suite, a browser script, a static analyzer, a skeptical second model, or a human with a checklist all qualify. The point is that the generator must not be the only judge of its own work.

Expose runtime evidence. Logs, metrics, traces, screenshots, DOM snapshots, terminal output, failing tests, validation artifacts — readable by both the agent and the reviewer. A harness without evidence produces confident narratives about what probably happened, which is the worst possible artifact in an audit.

Treat permissions as infrastructure. Scope tool access by task, role, environment, and risk. Put human approval at the irreversible edges.

Version the harness. Prompts, tools, evaluators, context policies, and guardrails are code. If harness changes alone can move a benchmark by 13.7 points, a harness diff deserves review at least as much as an application diff.
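A cheap first step is to give every harness configuration a fingerprint and store it next to each run. The sketch below is one possible approach, assuming the harness can be described as a plain dictionary; the `harness_fingerprint` name and the config keys are illustrative.

```python
import hashlib
import json


def harness_fingerprint(config: dict) -> str:
    """Short, order-independent hash of a harness configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


harness_v1 = {
    "system_prompt": "You are a coding agent.",
    "tools": [{"name": "file_read", "description": "Read a file"}],
    "evaluator": "pytest",
}
# Changing only a tool description yields a different harness version.
harness_v2 = {
    **harness_v1,
    "tools": [{"name": "file_read", "description": "Read a UTF-8 text file"}],
}

print(harness_fingerprint(harness_v1) != harness_fingerprint(harness_v2))  # True
```

Recording the fingerprint with every benchmark score or production trace makes it possible to say which harness produced a result, not only which model.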

Ablate deliberately. Disable the planner. Run without the evaluator. Remove a tool. Swap the model. Measure what happens to quality, cost, latency, and failure mode. If a component no longer earns its keep, delete it — and write down why, so the next model upgrade does not silently re-justify it.
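An ablation pass can be expressed as a loop over variants of one baseline configuration. This is a sketch under stated assumptions: `HarnessConfig`, `RunResult`, and the `run_suite` callback are hypothetical, and a real suite would also record latency and failure modes, which are omitted here for brevity.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class HarnessConfig:
    planner: bool = True
    evaluator: bool = True
    tools: Tuple[str, ...] = ("grep", "cat", "ls")


@dataclass(frozen=True)
class RunResult:
    success_rate: float
    cost_usd: float


def ablations(base: HarnessConfig) -> Dict[str, HarnessConfig]:
    """Baseline plus one variant per component switched off or removed."""
    variants = {
        "baseline": base,
        "no_planner": replace(base, planner=False),
        "no_evaluator": replace(base, evaluator=False),
    }
    for tool in base.tools:
        remaining = tuple(t for t in base.tools if t != tool)
        variants[f"no_tool:{tool}"] = replace(base, tools=remaining)
    return variants


def run_ablation(base: HarnessConfig,
                 run_suite: Callable[[HarnessConfig], RunResult]) -> Dict[str, RunResult]:
    return {name: run_suite(cfg) for name, cfg in ablations(base).items()}


def removal_candidates(results: Dict[str, RunResult],
                       tolerance: float = 0.0) -> List[str]:
    """Variants that did not hurt success and did not cost more than the baseline."""
    base = results["baseline"]
    return [name for name, r in results.items()
            if name != "baseline"
            and r.success_rate >= base.success_rate - tolerance
            and r.cost_usd <= base.cost_usd]
```

`run_suite` is wherever the real evaluation lives. Anything `removal_candidates` returns is a component to question, and the written reason for keeping or deleting it belongs in the same commit as the change.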

How we build this at Euraika

This is the principle we build on, not a thesis we admire from a distance.

Our approach starts from the task, not the model. For each application we ask what the agent must see, what it may do, what state has to survive, what can never be delegated to it, and how every output is verified before a person sees it. Then we choose a light, open-weight model fit for that specific job and tune it inside a harness built to measure — rather than reaching for a frontier API behind a thin prompt. The model is the smallest part of that design, by intention.

This is an engineering choice on the merits, not a budget compromise. A frontier model is a general-purpose engine; most real applications are narrow, repeatable, and verifiable — the regime where a well-harnessed smaller model can match or beat a larger one while running faster, at lower cost, on infrastructure the customer controls. Bigger is not the same as better-for-the-job. Frontier models are remarkable, and they are not always the right tool; "reach for the biggest model" is a habit, not an answer. It is the Archimedean way of working we hold to: reason from first principles, build the machine, and show the working.

Our inference engine, Hermes, is where this becomes concrete — EU-hosted, serving open models behind the same retrieval, claim-verification and citation discipline that runs in Aegis, so no data has to leave European, customer-controlled infrastructure. Because the model is light and the harness is ours, the properties European organisations must answer for are present by design rather than added afterwards: state stays where it belongs, tools cannot quietly cross a boundary, every decision leaves evidence, and the policy remains the customer's to set. The agent assists; a person decides. The harness is where we make a model competitive, and the same harness is where we make it governable — for us those have never been two separate projects. Several of the products that will sit on this foundation are still on the roadmap; the principle they are being built to is the one described here.

The honest conclusion

The popular version says: agents ship because the models got good enough.

The accurate version says: agents ship when a capable model is placed inside a harness that makes the work legible, constrained, stateful, testable, recoverable, and reviewable — and when someone keeps pruning that harness as the model improves.

Models still matter. A weak model in an elegant harness is still weak. But the public evidence now points to a more precise claim than the slogan: for long-running agent work, the model alone is the wrong unit of analysis. The model-harness configuration is.

So the right questions to ask about any impressive agent result are not about line counts. They are: How was the task specified? What did the agent know at the moment it acted? Which tools could it call? What state survived? What did it test, and how? What did the evaluator catch? Which controls lived outside the model? What was logged? What failed? And which parts of the harness were later removed because the model outgrew them?

Those questions separate a demo from an engineered system. And in Europe, where you have to answer for the system and not only admire it, they are also the difference between something that impresses in a screenshot and something you can put your name on.

The next wave of useful AI systems will not be built by asking a model to try harder. It will be built by making the environment around the model exact enough that trying harder is no longer the plan.

Sources

OpenAI — Harness engineering: leveraging Codex in an agent-first world
OpenAI — How we used Codex to build Sora for Android in 28 days
OpenAI — Agents SDK: agent definition
Terminal-Bench 2.0 — arXiv:2601.11868
LangChain — Improving Deep Agents with harness engineering and Tuning Deep Agents to work well with different models
Harness-Bench — arXiv:2605.27922
Vercel — We removed 80% of our agent's tools
Qwen — Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model and MarkTechPost coverage
Anthropic — Effective harnesses for long-running agents
Anthropic — Harness design for long-running application development
Manus — Context Engineering for AI Agents: Lessons from Building Manus
