Vivek Haldar

Meta-Harness: Automating the Benchmaxing Loop

The central question in the agent space right now is: how do you get humans out of the loop? If you’re manually steering your agent, you have only partially automated the flow, and you the human are still the bottleneck. Two recent projects illustrate how to climb this ladder of abstraction, and the second one hits close to home for me.

AutoResearch

Andrej Karpathy’s AutoResearch showed a clean version of this idea. Take a program that trains a small GPT model and let a coding agent edit it directly — changing the architecture, hyperparameters, optimizer, anything — then run a training job and check whether a performance metric improved. If it did, keep the change. If not, git-reset back to the last good version. Greedy hill-climbing, running overnight, fully automatically.

A human ML researcher performs the same loop manually: set off a training run, check metrics, tweak the training logic, repeat. AutoResearch automates that entire cycle. The crucial enabler is a single, unambiguous metric to optimize. The agent does the hill-climbing all by itself.
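The loop is simple enough to sketch in a few lines. This is a toy stand-in, not Karpathy's code: `propose` simulates the agent's edit by perturbing a hyperparameter, `evaluate` simulates a training run with a synthetic metric, and the keep-or-discard step plays the role of the git reset.

```python
import math
import random

def propose(config: dict) -> dict:
    """Stand-in for the coding agent: perturb one hyperparameter.
    (In AutoResearch the agent edits train.py itself.)"""
    new = dict(config)
    new["lr"] = config["lr"] * random.choice([0.5, 2.0])
    return new

def evaluate(config: dict) -> float:
    """Stand-in for a full training run: higher is better.
    Toy metric, peaked at lr = 1e-3."""
    return -abs(math.log10(config["lr"]) + 3)

def hill_climb(config: dict, steps: int = 20) -> tuple[dict, float]:
    """Greedy hill-climbing: keep a change only if the metric improves."""
    best_score = evaluate(config)
    for _ in range(steps):
        candidate = propose(config)      # the agent edits the program
        score = evaluate(candidate)      # run training, read the metric
        if score > best_score:           # improved: keep the change
            config, best_score = candidate, score
        # else: discard the change (the "git reset" step)
    return config, best_score
```

Because the loop is greedy, the score can never get worse than the starting point; the cost is that it can get stuck at a local optimum, which is fine when you are running it overnight and checking in the morning.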

Meta-Harness

Meta-Harness takes this idea and applies it to model harnesses — what we used to call scaffolding. A harness is all the logic you wrap around a base model to make it accomplish a task: the reasoning loop, tool use, planning, prompt engineering. Claude Code is a harness, just a very general-purpose one. So is any agent framework you build for a specific purpose.
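To make "harness" concrete, here is a minimal sketch of the loop you wrap around a base model. Everything here is illustrative — the `model(messages)` interface, the action format, and the message shapes are assumptions for the sketch, not any real framework's API:

```python
def run_harness(model, tools, task: str, max_steps: int = 10):
    """Minimal harness: a reasoning/tool-use loop around a base model.
    `model(messages)` is a hypothetical chat call assumed to return either
    {"tool": name, "args": {...}} or {"answer": text}. Real harnesses
    (Claude Code, purpose-built agent frameworks) are far richer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(messages)
        if "answer" in action:           # model is done: return the result
            return action["answer"]
        # Otherwise run the requested tool and feed the result back.
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```

Every design decision in that loop — what tools to expose, how to format results, when to stop, what to put in the prompt — is a knob that a meta-harness can turn.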

Meta-Harness is trying to do for harnesses what AutoResearch did for training: can you automatically construct the optimal harness for a given task?

The architecture is simple, and mirrors what a human AI engineer does. A coding agent — Claude Code with Opus 4.6, in their experiments — acts as the proposer: it creates a new harness, which gets evaluated on a benchmark. The scores, the harness source code, and crucially all the execution traces (prompts, tool calls, model outputs) get stored back to the file system. Then the agent loops back.

There is no complex coordination mechanism. It is just the file system: folders full of harness code, benchmark results, and traces. The proposer reads everything from previous iterations via standard tools (grep, cat, ls), does failure analysis on the traces, and proposes something better. There’s a minimal domain-specific prompt that tells the agent where to write harnesses and what files it can modify, but beyond that, no special-purpose programmatic logic drives the improvement. The general-purpose proposer agent (Claude Code, in this case) is smart enough to poke around the folders and files from previous runs, analyze where it failed, propose improvements, and keep the loop going.
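The filesystem-as-coordination pattern can be sketched roughly like this. Everything here is an assumption for illustration — the directory layout, file names, and the `propose`/`evaluate` stand-ins are not Meta-Harness's actual structure:

```python
import json
from pathlib import Path

def run_meta_loop(root: Path, propose, evaluate, iterations: int = 3):
    """Sketch of a meta-harness loop coordinated purely through files.
    `propose(history)` stands in for the coding agent reading past runs;
    `evaluate(harness_src)` stands in for a benchmark run returning a
    score and execution traces. Layout and names are hypothetical."""
    for i in range(iterations):
        # Read everything from previous iterations: code, scores, traces.
        history = []
        for prev in sorted(root.glob("iter_*")):
            history.append({
                "harness": (prev / "harness.py").read_text(),
                "result": json.loads((prev / "result.json").read_text()),
            })
        # The proposer analyzes past failures and writes a new harness.
        harness_src = propose(history)
        score, traces = evaluate(harness_src)
        # Persist this iteration so the next pass can learn from it.
        it = root / f"iter_{i:03d}"
        it.mkdir(parents=True)
        (it / "harness.py").write_text(harness_src)
        (it / "result.json").write_text(json.dumps({"score": score}))
        (it / "traces.json").write_text(json.dumps(traces))
```

The point of the sketch is what is absent: there is no message bus, no database, no orchestration framework. Each iteration's artifacts just sit on disk where the next pass of the proposer can grep them.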

And it works. They tested across multiple domains — text classification, math reasoning (IMO-level problems), and agentic coding (TerminalBench-2) — and the meta-harness loop produced harnesses that beat the strongest handcrafted baselines in each.

This is very bitter-lesson-pilled: throw search and compute at the problem, and you’ll outperform bespoke human engineering.

AutoResearch operates on a narrow signal: it keeps only the last good train.py (reverting on failure) and a single scalar metric. Meta-Harness operates on a much richer information set: the source code, evaluation scores, and full execution traces from every previous attempt are preserved and accessible. That richer context gives the proposer far more to work with when diagnosing failures and proposing improvements.

I Did This By Hand (And Regretted It)

This resonates for me personally. I recently designed and open-sourced Proceda, a harness for automating standard operating procedures into agents, and spent days benchmaxing it on SOP-Bench. My loop looked exactly like Meta-Harness — run the benchmark, look at the score, ask Claude Code to analyze the failure traces, improve the harness, repeat — except I was the one closing the loop manually.

I got to SOTA on SOP-Bench doing this, but at the end I was kicking myself for not setting up an AutoResearch-style automated loop. Which is precisely what Meta-Harness formalizes.

The Takeaway

The main thrust of the entire agent field right now is removing humans from agentic loops. AutoResearch and Meta-Harness are two concrete illustrations of how to do it: define a clear metric, give the agent access to its own history, and let it rip.