Plato's Cave of AI Agent Evals
[The following is a Platonic ideal: a thought experiment, born of many recent conversations and experiences around building AI agents for enterprise customers.]
Imagine two parties: the company that needs an agent and the company that builds it. In this ideal world, they never have a meeting. They never speak. Their entire interaction is mediated through a single, shared Git repository.
This repository contains one thing: a comprehensive set of evaluation data—a benchmark.
The client company defines the task by creating this benchmark. It’s filled with input examples and the corresponding ideal outputs. The agent-building company’s only job is to build an agent that achieves a pre-agreed-upon score on this benchmark.
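To make the shared repository less abstract, here is one way it could be laid out. To be clear, this is a sketch, not a spec: the JSONL layout, the field names, and the helper functions below are all illustrative assumptions, not something the thought experiment prescribes.

```python
import json
from pathlib import Path

# Hypothetical repo layout (illustrative only):
#   benchmark/cases.jsonl  - one eval case per line, written by the client
#   benchmark/config.json  - the pre-agreed passing score, e.g. {"threshold": 0.95}
#
# A single case pairs an input with the ideal output, for example:
#   {"id": "refund-001",
#    "input": "Customer requests a refund 45 days after purchase; policy is 30 days.",
#    "expected": "Decline the refund, explain the 30-day policy, offer store credit."}

def load_cases(path: str | Path) -> list[dict]:
    """Read every eval case the client has committed to the shared repo."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_threshold(path: str | Path) -> float:
    """Read the pre-agreed-upon passing score from the shared config."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["threshold"]
```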
If the client wants to change the agent’s behavior? They don’t send an email or schedule a call. They add new data points to the benchmark that illustrate the desired change. Or they modify existing ones. The agent’s behavior is shaped only by the data in the eval set.
The builder’s job is simple: make the agent pass the test. The client’s job is also simple: create a test that accurately reflects their needs.
If this sounds familiar, it should. It’s essentially Test-Driven Development (TDD) for AI. In software engineering, TDD dictates that before you write a single line of feature code, you first write a test that defines what success looks like. The test fails, naturally. Then, you write just enough code to make it pass. Plato’s Cave of Evals applies the same philosophy. The benchmark is the failing test. The agent is the code you write to make the test pass.
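Taken literally, the benchmark can be run as an actual test suite. Here is a hedged sketch of what that might look like, assuming the helpers from the earlier snippet live in a hypothetical `benchmark_utils` module, the builder exposes a hypothetical `my_agent.run()` entry point, and both parties have settled on naive exact-match grading (real evals usually need a rubric or an LLM judge, which would itself live in the repo):

```python
# test_benchmark.py -- the benchmark as a failing test, TDD-style.
# `my_agent.run` and `benchmark_utils` are assumed names, not a real API;
# exact-match grading stands in for whatever grader both parties agree on.
from my_agent import run as run_agent                    # builder's agent entry point (assumed)
from benchmark_utils import load_cases, load_threshold   # helpers sketched above (assumed)

def score(cases: list[dict]) -> float:
    """Fraction of cases where the agent's output matches the client's ideal output."""
    passed = sum(run_agent(case["input"]).strip() == case["expected"].strip()
                 for case in cases)
    return passed / len(cases)

def test_agent_meets_agreed_threshold():
    cases = load_cases("benchmark/cases.jsonl")
    threshold = load_threshold("benchmark/config.json")
    # Fails until the agent is good enough: the red/green cycle of TDD.
    assert score(cases) >= threshold
```

Red, then green: the test fails until the agent clears the agreed threshold, and the only way the client moves the goalposts is by committing new cases.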
Of course, this is a hypothetical ideal. In the real world, the nuance of a business problem can’t always be perfectly captured in a static dataset. But Plato’s Cave of Evals captures the spirit of what we should be aiming for. It forces clarity. It makes the abstract concrete. It shifts the entire development process from subjective feedback loops based on vibes to objective, data-driven iteration.