Vivek Haldar

Human evals are outsourced vibe checks

Engineering has always been about measurement and evaluation. But when it comes to LLMs, traditional approaches to evaluation don’t work as cleanly as they do for other software systems.

Of course, evals and benchmarks are an important signal (perhaps the only signal) of raw capability. They show us improvement over time. The fact that many of the original benchmarks from a year or so ago are at or near saturation also tells us a lot.

But if LLMs are the new CPUs, there is an important difference between the way old-school CPU benchmarks (SPECInt, GeekBench, etc.) were interpreted, and how we should interpret LLM benchmarks.

The translation of CPU benchmarks to performance on real-world apps was objective. A higher SPECInt score, for example, translated to an objectively measurable and valuable metric like transaction throughput or request latency.

LLM benchmarks, by contrast, don’t always correlate neatly with real-world utility. Why? Because the success of LLM outputs is often inherently subjective.

Take a common LLM use case: summarization. In your particular application, summarization might mean something very specific. Maybe you want summaries that emphasize only some aspects of the input, or that always fall within a certain length range. The metric for success is itself fuzzy. This subjectivity isn’t just a quirk of summarization; it’s a pattern across many LLM applications, and it’s why evaluation so often leans on human judgment.

There are two common ways of measuring adherence to such criteria: LLM-as-a-judge, or human evals. When the developer of the application in question runs a few inputs and subjectively evaluates the output, it is called a “vibe check”. Human evals are vibe checks outsourced to multiple humans, at scale.
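To make the LLM-as-a-judge option concrete, here is a minimal sketch. It assumes the OpenAI Python client with an API key in the environment; the rubric, criteria, and model name are placeholders for whatever your application actually cares about.

```python
# LLM-as-a-judge sketch: grade one (input, output) pair against a written rubric.
# Assumes the OpenAI Python client; rubric and model name are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the output of a summarization system.

Criteria:
1. The summary covers only the main decisions and action items.
2. The summary is under 100 words.
3. The summary contains nothing that is absent from the input.

Input:
{source}

Candidate summary:
{summary}

Answer PASS or FAIL for each criterion, then give an overall PASS/FAIL verdict
on the last line."""


def judge(source: str, summary: str, model: str = "gpt-4o-mini") -> str:
    """Ask a judge model for a verdict on a single input/output pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}
        ],
    )
    return response.choices[0].message.content
```

The verdict is still a subjective call; it has just been delegated to a model instead of a person.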

Note that this is not true of all ML models. A fraud-detection or spam classification model might be probabilistic (just like LLMs) but it still has an objectively measurable success rate in terms of false and true positives. It’s just that the kinds of applications LLMs are currently used for anchor heavily on natural language understanding and interpretation, which is naturally resistant to clear-cut objective measures.
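For contrast, the sketch below shows the kind of objective scoring a spam or fraud classifier admits once you have labeled examples. It uses scikit-learn, and the labels are made up purely for illustration.

```python
# Objective evaluation of a binary classifier: given ground-truth labels,
# the score is a computation, not a judgment call. Labels are illustrative.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth (1 = spam/fraud)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision_score(y_true, y_pred):.2f} "
      f"recall={recall_score(y_true, y_pred):.2f}")
```

There is no equivalent computation for “did this summary capture what mattered?”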

If your company or organization is fortunate enough to afford regular human evals on a subset of your traffic, I recommend the following: pick a small subset of your human evals, roughly split between positive and negative judgments. Get your leads together in a room and go through each human judgment. Watch as they all disagree with each other and with the human evaluator, and as your naive notion that human evals were some sort of golden signal goes down in flames like the Hindenburg.

So what can we do? How can we judge fit-for-purpose and do internal hill climbing for LLM-driven applications?

Embrace vibe checks, but be structured about them. Develop a small, representative set of inputs for the particular problem your application is solving. The set doesn’t have to be large; even tens of samples get you pretty far. Then, rather than have golden answers (which are brittle and don’t accommodate change very well), document the properties you’d like to see in the outputs. This should give you a good feel for whether your prompt engineering, model changes, or agentic loops are giving you better answers.
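Here is a minimal sketch of what that structure could look like, with hypothetical property names and checks: keep the sample inputs in one place, write each desired property down as a named check (programmatic where possible, an LLM judge where not), and report pass rates per property for every prompt or model variant.

```python
# Structured vibe check: a small sample set plus documented output properties,
# instead of brittle golden answers. All names and checks here are illustrative.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Property:
    name: str
    check: Callable[[str, str], bool]  # (input_text, output_text) -> pass/fail


# Properties we want every output to have. Length is easy to check in code;
# fuzzier properties can be delegated to an LLM judge.
PROPERTIES = [
    Property("under_100_words", lambda inp, out: len(out.split()) <= 100),
    Property("mentions_action_items", lambda inp, out: "action item" in out.lower()),
]


def run_vibe_check(samples: list[str], generate: Callable[[str], str]) -> None:
    """Run a candidate system over the sample set and report per-property pass rates."""
    outputs = [(s, generate(s)) for s in samples]
    for prop in PROPERTIES:
        passed = sum(prop.check(inp, out) for inp, out in outputs)
        print(f"{prop.name}: {passed}/{len(outputs)} passed")


# Example: run_vibe_check(sample_inputs, my_summarizer)
```

Re-running this after every prompt or model change gives you a trend line, even though each individual check is still a vibe.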

Since traditional benchmarking doesn’t work well for LLM-driven applications, we have to lean into subjective evaluation. But instead of relying on gut feelings, we can make subjective evaluation a bit more systematic.