Natlang Code
It is now commonplace to write code in natural language.
This might seem like a controversial or nonsensical statement, but consider the following:
In Claude Code, you can define custom commands and workflows in plain text:
Custom slash commands allow you to define frequently-used prompts as Markdown files that Claude Code can execute.
(Gemini-CLI has a similar feature, with the slight syntactic difference that it uses TOML.)
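For example, a project-level command in Claude Code is just a Markdown file under .claude/commands/, invoked by its file name (the exact invocation syntax has varied across releases). The sketch below is hypothetical — say, a file at .claude/commands/fix-lint.md — with $ARGUMENTS standing in for whatever you type after the command:

```markdown
Run the project linter and fix every warning it reports.
Keep the diffs minimal and do not reformat files you did not touch.
Only operate on the paths given in $ARGUMENTS.
```

Typing /fix-lint src/ would then hand this prompt, with src/ substituted in, to the agent.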
These commands were initially used to automate common developer workflows (write, test, review, commit, push, bring up the dev server, etc.). They were “scripts”, but in natural language.
In a shell or Python script one could never say “look at the diff and formulate a succinct but descriptive commit message following our company style guide at <url>”. For current coding agents, such a semantic task is trivial.
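A natural-language “script” for that workflow might look like the following (a hypothetical .claude/commands/commit.md; the <url> placeholder is kept from the example above):

```markdown
1. Run the test suite and stop if anything fails.
2. Look at the staged diff and formulate a succinct but descriptive
   commit message following our company style guide at <url>.
3. Create the commit and push the branch.
```

Step 2 is the part no shell script could express; it is a judgment call handed to the model in plain English.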
But soon after, people started using these coding agents as everything-agents. See the long list of non-coding use cases in this X thread, or how one can run Claude Code in the home directory (not a git repo) and use it to automate most of what one does on a computer.
This uplevel of using CLI agents as everything-agents brings the same conundrum every developer confronts when facing a series of repetitive tasks: should I bite the bullet and write a script to automate this, or just go ahead and do it manually? Almost always, writing a script takes much more time than doing it manually once. Now, instead of writing a shell or Python script, one has to weigh whether to slow down and write out a text file with instructions and criteria, so that the agent can do it in the future.
Stuck in Chatland
While developers are automating entire workflows with coding agents, the rest of us are stuck in what I call “chatland”. Chatland is using ChatGPT to “ask questions” or to run one-shot prompts. And then what? Read the answer closely? Copy/paste it somewhere? Ask even more questions?
Chat is great for exploration; it is not meant for repeatability and workflow automation. It is a nice local maximum, certainly not a global one. One could, however, use a long conversation with the model to arrive at the process and pull out tacit knowledge, much like a newbie on the job does with a more experienced colleague.
A whole host of tools have cropped up to address the shortcomings of chat for automation: n8n, Gumloop, make.com, and many others. Their drag-and-drop, flowchart-style UIs seem like a natural way to express and then automate workflows, but they can be fiddly and break down with data-intensive flows. As Erik Meijer put it:
Most workflow tools, like UIPath, Workato, and Azure Logic Apps, primarily use sequential composition with limited control flow and repetition, validating our design choices. One complication in workflow tools is the need for a “language” of functions to manipulate data as it flows between actions. We replace this with pattern-matching to simplify and enhance the expressiveness of the base language without the need for an extra layer of ad-hoc functions.
(I covered that paper in a video here.)
More importantly, when businesses do document their workflows, they rarely use flowcharts (or the more enterprise-y BPMN). They write SOPs (standard operating procedures) in natural language. SOPs are the grown-up versions of custom slash-commands written in markdown.
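To make the analogy concrete, a fragment of such an SOP might read like this (an invented example, not drawn from any real company):

```markdown
SOP: Monthly vendor invoice processing (hypothetical excerpt)
1. Download all invoices received this month from the shared inbox.
2. Check each invoice against its purchase order; flag any mismatch
   above 5% for the finance lead instead of approving it.
3. Enter approved invoices into the accounting system and file the
   PDFs in the shared drive.
```

Read with agent eyes, this is already a prompt: steps, decision points, and escalation criteria in plain English.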
The pain of articulation
In a recent post titled “AI is a mirror”, I wrote:
The company that has a clear, pre-defined workflow that is even ready to be considered for agentification is extremely rare. They know they have inefficiencies. They feel the pain of manual, repetitive tasks. But they can’t articulate the precise sequence of steps, the decision points, and the logic that governs the work.
Ethan Mollick had already picked up on this theme. In a recent column, he spoke of the “garbage can” model of the organization and its processes:
One thing you learn studying (or working in) organizations is that they are all actually a bit of a mess. In fact, one classic organizational theory is actually called the Garbage Can Model. This views organizations as chaotic “garbage cans” where problems, solutions, and decision-makers are dumped in together, and decisions often happen when these elements collide randomly, rather than through a fully rational process… The Garbage Can represents a world where unwritten rules, bespoke knowledge, and complex and undocumented processes are critical. It is this situation that makes AI adoption in organizations difficult, because even though 43% of American workers have used AI at work, they are mostly doing it in informal ways, solving their own work problems. Scaling AI across the enterprise is hard because traditional automation requires clear rules and defined processes; the very things Garbage Can organizations lack.
He goes on to apply the bitter-lesson mindset to AI in enterprise work:
The Bitter Lesson suggests we might soon ignore how companies produce outputs and focus only on the outputs themselves. Define what a good sales report or customer interaction looks like, then train AI to produce it. The AI will find its own paths through the organizational chaos; paths that might be more efficient, if more opaque, than the semi-official routes humans evolved.
(Emphasis mine.)
I agree, and this is why I think we must encode knowledge of “what good looks like” for the output of a process into a rigorous eval/benchmark.
The client company defines the task by creating this benchmark. It’s filled with input examples and the corresponding ideal outputs. The agent-building company’s only job is to build an agent that achieves a pre-agreed-upon score on this benchmark.
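A minimal sketch of what that contract could look like in code (every name, example case, and the grading rule here is an illustrative assumption, not a prescribed format):

```python
# Hypothetical benchmark: input examples with ideal outputs, plus a
# pass threshold agreed between the client and the agent builder.
benchmark = {
    "threshold": 0.9,
    "cases": [
        {"input": "Invoice #1042 from Acme, PO-7781 attached",
         "ideal": "Approve: amounts match PO-7781"},
        {"input": "Invoice #1043 from Acme, 12% above PO-7790",
         "ideal": "Escalate to finance lead: exceeds 5% tolerance"},
    ],
}

def grade(agent_output: str, ideal: str) -> float:
    """Placeholder grader; a real benchmark would use a rubric,
    exact-match rules, or a judging model agreed on in advance."""
    return 1.0 if agent_output.strip() == ideal.strip() else 0.0

def evaluate(agent, benchmark) -> bool:
    """Run the agent on every case and check the pre-agreed score."""
    scores = [grade(agent(case["input"]), case["ideal"])
              for case in benchmark["cases"]]
    return sum(scores) / len(scores) >= benchmark["threshold"]
```

The point is the contract: the client encodes “what good looks like” as data, and the agent-building company is judged only on whether its agent clears the bar.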
Note that raw model intelligence will not help with workflow articulation. As Casper put it:
Recently, both OpenAI and Google models were on par with gold medallists in the International Mathematical Olympiad 2025 (IMO). At the same time it’s still difficult to make AI agents work for relatively simple enterprise use cases. Why is there such a disparity in model performance between problem domains? Why are models so much better at complex maths tasks that only few humans can complete, while struggling at simple every day tasks done by most humans? It’s because the bottleneck isn’t in intelligence, but in human tasks: specifying intent and context engineering.
(Emphasis mine.)
Developers writing elaborate workflows in plain-text files are speed-running the same journey that enterprises need to make to achieve the end-to-end benefits of AI. Enterprises have to look beyond helping individual contributors with granular tasks in "chatland AI" and make the leap to "workflow-automation AI".