LLMs are compilers
I’ve been thinking about a fruitful way to frame the act of writing code in the age of Copilot/Codex, and I don’t think “autocomplete on steroids” is it. In prompt-driven programming, the LLM is better thought of as a compiler. Just as the compilers we know today translate a high-level language like C++ or Java into machine code (more precisely, assembly language), you could view an LLM as a compiler that translates from English to a high-level language.
The hottest new programming language is English
— Andrej Karpathy (@karpathy) January 24, 2023
Programming in assembly language used to be a skill. That skill became irrelevant when good optimizing compilers could translate high-level languages to performant assembly (and when whatever tiny difference remained got swamped by gains in CPU speed). Nobody argues today that programming should be done in assembly.
Today we’re living through a similar transition to a higher level of abstraction. Code in high-level languages is becoming the output of an even higher-level language: the natural-language prompt to an LLM.
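To make the analogy concrete, here is a minimal sketch of an “English-to-Python compiler”: a thin wrapper that hands a natural-language spec to a code LLM and returns the generated source. It assumes the OpenAI Python client with an API key in the environment; the model name and prompt wording are placeholders, not anything Copilot or Codex actually do.

```python
# A minimal "English-to-Python compiler": the source language is a natural-
# language spec, the target is Python, and an LLM does the translation.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment;
# the model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def compile_english(spec: str, model: str = "gpt-4o-mini") -> str:
    """Translate a natural-language spec into Python source code."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a compiler from English to Python. "
                        "Reply with code only, no explanations."},
            {"role": "user", "content": spec},
        ],
    )
    text = resp.choices[0].message.content.strip()
    # Models often wrap replies in markdown fences; drop fence lines.
    return "\n".join(l for l in text.splitlines() if not l.lstrip().startswith("`"))

if __name__ == "__main__":
    print(compile_english("A function median(xs) that returns the median "
                          "of a non-empty list of numbers."))
```

The “source program” here is a sentence of English; the Python that comes back is the compiler output you then run.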
Here are a few examples to drive the point home:
Copilot easily passes “Intro to CS” courses: In this paper, the authors found that on a set of CS101-level programming problems, Copilot solved half of them right away, given nothing but the problem description, and solved 60% of the remainder after only changes to that description.
Other observations have shown that when the output from tools like Copilot is wrong, developers avoid debugging the generated code: they either fiddle with the prompt until it works or rewrite the code from scratch.
And here is Prof. Crista Lopes (1, 2, 3, 4) trying to grab the CS education community by the shoulders and shake them, after getting surprisingly good results with ChatGPT when implementing a lexer and parser for a toy programming language, a problem common in graduate-level compilers courses.
The neural network was able to understand the concrete symbolic reasoning of a Lox tokenizer purely by examples; it was able to generalize well beyond the examples; it was able to correct my specification mistakes; … bottom line: it is able to tokenize Lox programs without a single line of code being written.
Obviously, I am “programming” it by teaching it the rules using a combination of natural language and examples. But this is not programming as we know it. This is a blend of old ideals such as “do what I mean” and programming by example, but on steroids. While I am very familiar with these old ideals, I never thought I would live to see the day where they were a reality!
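To make “programming by example” concrete, here is a rough sketch, in Python, of the kind of prompt this involves: the tokenizer’s rules stated in English, pinned down by a handful of input-to-token examples, followed by the new source to tokenize. The rules, token names, and examples below are illustrative; they are not Prof. Lopes’s actual prompts.

```python
# An illustrative few-shot prompt for a Lox-like tokenizer: rules in English,
# behavior pinned down by examples. A sketch of the idea, not the prompts
# actually used in the ChatGPT experiments.
RULES = """You are a tokenizer for the Lox language.
Emit one token per line in the form TYPE lexeme.
Token types include: IDENTIFIER, NUMBER, STRING, EQUAL, PLUS, MINUS,
STAR, SLASH, SEMICOLON, LEFT_PAREN, RIGHT_PAREN, and keywords (VAR, PRINT)."""

EXAMPLES = [
    ("var x = 1;",
     "VAR var\nIDENTIFIER x\nEQUAL =\nNUMBER 1\nSEMICOLON ;"),
    ('print "hi";',
     'PRINT print\nSTRING "hi"\nSEMICOLON ;'),
]

def tokenizer_prompt(source: str) -> str:
    """Assemble the English rules, the worked examples, and the new input."""
    shots = "\n\n".join(f"Input:\n{i}\nTokens:\n{o}" for i, o in EXAMPLES)
    return f"{RULES}\n\n{shots}\n\nInput:\n{source}\nTokens:\n"

# The assembled prompt is what gets sent to the model in place of any
# hand-written tokenizer code.
print(tokenizer_prompt("var y = x + 2;"))
```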
Taking the compiler analogy even further are systems like Parsel, which accepts a spec that mirrors the way human programmers think: it decomposes the problem into smaller sub-problems, each with a function signature and a few input-output examples (i.e. unit tests), and produces code that solves the higher-level problem (a rough sketch of this loop follows the tweet below). The authors have recently made advances that let you omit the input-output examples, and even the function signatures, generating code from a natural-language breakdown of the problem alone!
Parsel🐍 update: you now don't need to write function names/args at all - it's literally just indented natural language!
Try or contribute here! https://t.co/vZ33rNFzuX
Last thread: https://t.co/SAJzbD9lKL
[Images show 4 lines of Parsel and the generated 114-line Python code]
— Eric Zelikman (@ericzelikman) February 14, 2023
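Below is a minimal sketch of that decompose-and-verify loop, not Parsel’s actual implementation: each sub-problem gets a signature, a one-line description, and a few input-output examples, and a generated candidate is accepted only if it passes them. The spec format, helper names, and model name are assumptions made for illustration; it again assumes the OpenAI Python client and an API key in the environment.

```python
# A sketch of a Parsel-style decompose-and-verify loop (illustrative only):
# each sub-function is specified by a signature, a description, and a few
# input-output examples; candidates are resampled until the examples pass.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

SPEC = {
    "def is_prime(n: int) -> bool": {
        "description": "Return True if n is a prime number, else False.",
        "examples": [((2,), True), ((9,), False), ((13,), True)],
    },
}

def generate_candidate(signature: str, description: str) -> str:
    """Ask the model for one complete function (model name is a placeholder)."""
    prompt = (f"Write a complete Python function.\n"
              f"Signature: {signature}\n"
              f"Behavior: {description}\n"
              f"Reply with code only.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content.strip()
    # Drop markdown fence lines if the model added them.
    return "\n".join(l for l in text.splitlines() if not l.lstrip().startswith("`"))

def passes_examples(code: str, signature: str, examples) -> bool:
    """Execute the candidate and check it against the spec's examples."""
    namespace = {}
    try:
        exec(code, namespace)
        fn = namespace[signature.split("def ")[1].split("(")[0]]
        return all(fn(*args) == expected for args, expected in examples)
    except Exception:
        return False

for signature, spec in SPEC.items():
    for attempt in range(3):  # resample a few times before giving up
        code = generate_candidate(signature, spec["description"])
        if passes_examples(code, signature, spec["examples"]):
            print(f"accepted on attempt {attempt + 1}:\n{code}")
            break
```

The unit tests play the role a type checker plays in a conventional compiler: they are the mechanical check that the translation from spec to code did what was asked.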
Of course, Copilot is not perfect. It still generates incorrect code, even though the hit rate is pretty good. The question is: do you think it will get better or worse from here? Programmers today never have to pull back the curtain and debug the assembly language generated by modern compilers (unless you’re a compiler writer). Sooner or later, code LLMs will reach the same level of reliability, truly making them compilers from English to working code.