read it again. he criticizes the hype built around 2025 as the Year X for agents. many were thinking that "we'll carry PCs in our pockets" when Windows Mobile-powered devices came out. many predicted 2003 as the Year X for what we now call smartphones.
a stellar piece, Cal, as always. short and straight to the point.
I believe that Codex and the like took off (in comparison to e.g. "AI" browsers) because the bottleneck there was not reasoning about code, it was typing and processing walls of text. for a human, the interface of e.g. Google Calendar is ± intuitive. for an LLM, any graphical interface is an absolute hellscape from a performance standpoint.
CLI tools, which LLMs love to use, output text and only text: no images, no audio, no video. LLMs excel at text, hence they are confined to what text can do. yes, multimodal is a thing, but you lose a lot of information and/or burn context window space and speed.
LLMs are a flawed technology for general, true agents. 99% of the time, outside code, you need eyes and ears. so far we have only created self-writing paper.
Codex and the like took off because there existed a "validator" of their work - a collection of pre-existing non-LLM software: compilers, linters, code analyzers etc. And the second factor is the very limited and well-defined grammar of programming languages. Under such constraints it was much easier to build a text generator that validates itself using external tools in a loop, until the generated stream makes sense.
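A minimal sketch of that loop, assuming a placeholder generate_code() standing in for the model and Python's own compiler as the external validator:

```python
# Minimal sketch of the generate-then-validate loop described above.
# generate_code() is a placeholder for any LLM call; the validator is a real
# external tool (here Python's own compiler, via py_compile).
import py_compile
import tempfile


def generate_code(prompt: str, feedback: str | None = None) -> str:
    """Placeholder for an LLM call; returns a candidate source string."""
    raise NotImplementedError  # plug in your model of choice


def validate(source: str) -> str | None:
    """Run the external validator; return an error message, or None if it passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return None
    except py_compile.PyCompileError as err:
        return str(err)


def generate_until_valid(prompt: str, max_rounds: int = 5) -> str:
    feedback = None
    for _ in range(max_rounds):
        candidate = generate_code(prompt, feedback)
        feedback = validate(candidate)
        if feedback is None:  # the external tool accepts the output
            return candidate
    raise RuntimeError("no valid candidate within the round budget")
```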
And the other "successful" industry being disrupted is the one where there is no need to validate the output, because errors are ok or irrelevant. Text without much factual content, like fiction or business lingo or spam. Or pictures, where it doesn't matter exactly what color a specific pixel is; a rough match will do just fine.
But outside of those two options, not many other industries can use an imprecise word or media generator at scale. Circular writing and parsing of business emails with no substance? Sure. Not much else.
This is the reasoning deficit. Models are very good at generating large quantities of truthy outputs, but are still too stupid to know when they've made a serious mistake. Or, when they are informed about a mistake, they sometimes don't "get it" and keep saying "you're absolutely right!" while doing nothing to fix the problem.
It's a matter of degree, not a qualitative difference. Humans have the exact same flaws, but amateur humans grow into expert humans with low error rates (or lose their job and go to work in KFC), whereas LLMs are yet to produce a true expert in anything because their error rates are unacceptably high.
Besides the ability to deal with text, I think there are several reasons why coding is an exceptionally good fit for LLMs.
Once LLMs gained access to tools like compilers, they started being able to iterate on code based on fast, precise and repeatable feedback on what works and what doesn't, be it failed tests or compiler errors. Compare this with tasks like composing a PowerPoint deck, where feedback to the LLM (when there is any) is slower and much less precise, and what's "good" is subjective at best.
Another example is how LLMs got very adept at reading and explaining existing code. That is an impressive and very useful ability, but code is one of the most precise ways we, as humans, can express our intent in instructions that can be followed millions of times in a nearly deterministic way (bugs aside). Our code is written in thoroughly documented languages with a very small vocabulary and much easier grammar than human languages. Compare this to taking notes in a zoom call in German and trying to make sense of inside jokes, interruptions and missing context.
But maybe most importantly, a developer must be the friendliest kind of human for an LLM. Breaking down tasks into smaller chunks, carefully managing and curating context to fit in "memory", orchestrating smaller agents with more specialized tasks, creating new protocols for them to talk to each other and to our tools... if it sounds like programming, it's because it is.
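A toy sketch of that orchestration pattern, with run_subagent() as a placeholder for whatever model call sits underneath:

```python
# Toy sketch of the pattern above: break a task into chunks, hand each chunk
# to a specialized sub-agent, and keep only a curated summary in "memory".
# run_subagent() is a placeholder, not any particular vendor's API.
def run_subagent(role: str, prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client of choice


def orchestrate(chunks: list[str]) -> list[str]:
    memory: list[str] = []  # the curated context, not the full transcript
    results = []
    for chunk in chunks:
        prompt = chunk + "\n\nRelevant context:\n" + "\n".join(memory[-3:])
        result = run_subagent(role="worker", prompt=prompt)
        memory.append(result[:500])  # curate: keep only a short excerpt
        results.append(result)
    return results
```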
I agree with that. For code, most of it was in a "public space" similar to driving down a street and training the model on trees and signs etc. The property is not yours but looking at it doesn't require ownership.
It was not a well-thought-out piece, and it discounts the agentic progress that has happened.
>The industry had reason to be optimistic that 2025 would prove pivotal. In previous years, AI agents like Claude Code and OpenAI’s Codex had become impressively adept at tackling multi-step computer programming problems.
It is easy to forget that Claude Code CAME OUT in 2025. The models and agents released in 2025 really DID prove how powerful and capable they are. The predictions were not really wrong. I AM using code agents in a literal fire and forget way.
Claude Code is a hugely capable agentic interface for solving almost any kind of problem or project you want to take on for personal use. I literally use it as the UX for many problems. It is essentially software that can modify itself on the fly.
Most people haven't really grasped the dramatic paradigm shift this creates. I haven't come up with a great analogy for it yet, but the term that I think best captures how it feels to work with Claude Code as a primary interface is "intelligence engine".
I'll use an example. I've created several systems harnessed around Claude Code, but the latest one I built is for stock portfolio management (this was primarily because it is a fun problem space and something I know a bit about). Essentially you just use Claude Code to build tools for itself in a domain. Let me show how this played out in this example.
Claude and I brainstorm a general flow for the process and the roles. Then we figure out what data each role would need and research which providers have that data at a reasonable price.
I purchase the API keys and Claude wires up tools (in this case Python scripts and documentation for the agents covering about 140 API endpoints), then builds the agents and also creates an initial version of the "skill" that will invoke the whole process.
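To give a flavor of what "wires up tools" means in practice, here is a made-up example in the same spirit; the provider, endpoint and key name are invented for illustration:

```python
# Hypothetical example of the kind of thin tool script the agent gets:
# one endpoint, one function, plain text out so the agent can read it.
# The provider URL, endpoint and env var name are made up for illustration.
import os
import sys

import requests

BASE_URL = "https://api.example-market-data.com/v1"  # placeholder provider


def fetch_daily_prices(ticker: str) -> str:
    resp = requests.get(
        f"{BASE_URL}/prices/daily",
        params={"symbol": ticker},
        headers={"Authorization": f"Bearer {os.environ['MARKET_DATA_API_KEY']}"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json().get("prices", [])
    # emit plain text, one row per line -- the format agents handle best
    return "\n".join(f"{r['date']} close={r['close']}" for r in rows)


if __name__ == "__main__":
    print(fetch_daily_prices(sys.argv[1]))
```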
Obviously it isn't 100% great on the first pass and I have to lean on expertise I have in building LLM applications, but now I have a Claude Code instance that can orchestrate this whole research process and also handle ad-hoc changes on the fly.
Now I have evolved this system through about 5 significant iterations, but I can do it "in the app". If I don't like how part of it is working, I just have the main agent rewire stuff on the fly. This is a completely new way of working on problems.
there is a very similar app with a much longer history and (obviously) a greater reputation: BuzzKill. [0] it's paid, available on Google Play, and has tons of features and then some.
also, I bet that the Android platform forbids you from requesting the internet permission if you use certain "dangerous" permissions, e.g. reading notifications.
> The bottleneck is still knowing what to build, not building.
shit, I'm stealing that quote! it's easier to seize an opportunity (i.e. build a tool that fixes problem X without causing annoying side effects Y and Z), but finding one is almost as hard as it has been since the beginning of the world wide web.
I trust Claude in Chrome a lot more, and I trust my own hands and eyes most.