Skip to main content

Your Harness Is Only As Smart As Your Spec

· 5 min read
Codalio Team
AI app builder team

Everyone upgraded the engine. Nobody checked the map.

The conversation in AI building has quietly flipped. A year ago, the question was "which model?" Now it's "which harness?" — the orchestration layer that takes one instruction and fans it out across dozens or hundreds of parallel agents, each chipping at a slice of the work.

The proof point everyone repeats is real and genuinely impressive: a 750,000-line codebase ported from one language to another at 99.8% test pass, in eleven days. The takeaway people drew from it was "the harness is the moat." The same model scores differently depending on the wrapper around it, so the wrapper is where the leverage lives.

Half right. The harness is leverage. But leverage multiplies whatever you point it at — and most founders are pointing it at a guess.


A faster way to be wrong is still wrong

Here is the part the demos skip. A harness that fans out 100 subagents against a fuzzy intent doesn't produce one good product. It produces 100 plausible-but-wrong diffs, faster than you can read them.

Orchestration doesn't add clarity. It assumes clarity already exists and spends compute acting on it. When the underlying instruction is "build me something like Stripe but for invoices," every agent fills the gaps differently — different assumptions about who the user is, what "done" means, which edge cases matter. You don't get convergence. You get 100 confident interpretations of a question nobody actually answered.

Orchestration without a spec is just a louder vibe-coding session. The agents aren't drifting because they're weak; they're drifting because nothing is pulling them toward a shared target.

That port everyone cites? It worked because the target was unambiguous. The spec was the existing codebase and its test suite — a precise, executable definition of "correct" that 100 agents could each measure themselves against. The harness fanned out, but the tests pulled it back in. Take away that ground truth and you don't have an engineering feat. You have a very expensive way to generate noise.


The spec is the thing being amplified

This is the founder truth underneath the hype: leverage in AI building comes from the spec, not the tooling. The harness amplifies. The spec is what gets amplified. Vibe coding ships prototypes; spec-driven work ships products — and the difference isn't how many agents you can summon, it's whether they can agree on what they're building.

You can see the market already pricing this in. The role early teams are now scrambling to hire ahead of their first salesperson isn't another implementer — it's the person who translates a fuzzy problem into a buildable spec and owns it through to deployment. The scarce skill stopped being "can write code." It became "can define exactly what correct looks like" so the machines can take it from there.

A real spec isn't a wish. It's the convergence target. At minimum it pins down:

  • Who the user is and the single job they're hiring the software to do

  • What "done" means — the observable behavior that proves the thing works

  • The edge cases that are in scope, and the ones you're deliberately ignoring for now

  • The constraints — data, integrations, rules the build cannot violate

  • The non-goals — what you are explicitly choosing not to build yet

Notice that none of these are technical. They're business decisions. Which is exactly why founders can't outsource them to the harness — and why skipping them shows up later as a rebuild, not a bug.


What you actually do before you build

The trap is treating the spec as paperwork you'll backfill once you "see something working." By then the agents have already chosen for you, and unwinding 100 wrong assumptions costs more than writing the right one would have.

So before you open the harness, do the unglamorous part:

Write down the one user and the one job. Not the platform vision — the single workflow that has to work first. Then define "done" as something you could watch someone do: a person uploads X, gets Y, and the result is correct when Z. If you can't describe the win condition in a sentence a stranger could verify, your agents can't either.

List the three edge cases that would actually embarrass you in front of a customer, and the ten you're consciously postponing. Ambiguity isn't removed by building — it's removed by deciding. Every decision you don't make becomes a decision an agent makes for you, badly.

Only then point the orchestration at it. Now the same fan-out that produced noise produces convergence, because every agent is measuring itself against the same target you wrote down.


Spec first, then summon the agents

This is the whole reason Codalio exists. We don't start by generating code — we start by turning your business logic into a precise, spec-driven definition of what "correct" means, then let the AI-powered build converge on it instead of guessing at it. The harness gets to be smart because the spec made it possible.

Try Codalio before you start building — walk through the spec-driven workflow at codalio.com and leave with a clear definition of what you're actually building, before a single agent runs. That's the difference between shipping a prototype and shipping a product.

If you want to do the groundwork by hand first, start with our Startup Software Requirements Gathering Worksheet — it's the same questions, on paper.

References

  • Agent HQ — GitHub: unifying agent orchestration under one governance layer

  • Dynamic workflows — Claude Code: fanning out parallel subagents in a single session

  • The 750k-line port at 99.8% test pass in 11 days — the harness-as-moat proof point

  • The Forward Deployed Engineer as a first-10-hires role for early teams