More

mccoyb · 2026-02-07T18:55:28 1770490528

Effectively everyone is building the same tools with zero quantitative benchmarks or evidence behind the why / ideas … this entire space is a nightmare to navigate because of this. Who cares without proper science, seriously? I look through this website and it looks like a preview for a course I’m supposed to buy … when someone builds something with these sorts of claims attached, I assume that there is going to be some “real graphs” (“these are the number of times this model deviated from the spec before we added error correction …”)

What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.

I may be alone in this, but it drives me nuts.

Okay, so with that in mind, it amounts to heresay “these guys are doing something cool” — why not shut up or put up with either (a) an evaluation of the ideas in a rigorous, quantitative way or (b) apply the ideas to produce an “hard” artifact (analogous, e.g., to the Anthropic C compiler, the Cursor browser) with a reproducible pathway to generation.

The answer seems to be that (b) is impossible (as long as we’re on the teet of the frontier labs, which disallow the kind of access that would make (b) possible) and the answer for (a) is “we can’t wait we have to get our names out there first”

I’m disappointed to see these types of posts on HN. Where is the science?

simonw · 2026-02-07T19:06:38 1770491198

Honestly I've not found a huge amount of value from the "science".

There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.

Have you seen any papers that really elevated your understanding of LLM productivity with real-world engineering teams?

svara · 2026-02-07T20:30:17 1770496217

The writing on this website is giving strong web3 vibes to me / doesn't smell right.

The only reason I'm not dismissing it out of hand is basically because you said this team was worth taking a look at.

I'm not looking for a huge amount of statistical ceremony, but some detail would go a long way here.

What exactly was achieved for what effort and how?

FridgeSeal · 2026-02-08T01:51:37 1770515497

Nothing in this space “smells right” at the moment.

Half the “ai” vendors outside of frontier labs are trying to sell shovels to each other, every other bubbly new post is about this-weeks-new-ai-workflow, but very few instances of “shutting up and delivering”. Even the Anthropic C compiler was torn to pieces in the comments the other day.

At the moment everything feels a lot like the people meticulously organising desks and calendars and writing pretty titles on blank pages and booking lots of important sounding meetings, but not actually…doing any work?

cejast · 2026-02-07T23:20:10 1770506410

This was my reaction as well, a lot of hand-waving and invented jargon reminiscent of the web3 era - which is a shame, because I'd really like to understand what they've actually done in more detail.

simonw · 2026-02-07T22:00:01 1770501601

Yeah, they've not produced as much detail as I'd hoped - but there's still enough good stuff in there that it's a valuable set of information.

mccoyb · 2026-02-07T19:52:44 1770493964

No, I agree! But I don’t think that observation gives us license to avoid the problem.

Further, I’m not sure this elevates my understanding: I’ve read many posts on this space which could be viewed as analogous to this one (this one is more tempered, of course). Each one has this same flaw: someone is telling me I need to make a “organization” out of agents and positive things will follow.

Without a serious evaluation, how am I supposed to validate the author’s ontology?

Do you disagree with my assessment? Do you view the claims in this content as solid and reproducible?

My own view is that these are “soft ideas” (GasTown, Ralph fall into a similar category) without the rigorous justification.

What this amounts to is “synthetic biology” with billion dollar probability distributions — where the incentives are setup so that companies are incentivized to convey that they have the “secret sauce” … for massive amounts of money.

To that end, it’s difficult to trust a word out of anyone’s mouth — even if my empirical experiences match (along some projection).

simonw · 2026-02-07T20:25:51 1770495951

The multi-agent "swarm" thing (that seems to be the term that's bubbling to the top at the moment) is so new and frothy that is difficult to determine how useful it actually is.

StrongDM's implementation is the most impressive I've seen myself, but it's also incredibly expensive. Is it worth the cost?

Cursor's FastRender experiment was also interesting but also expensive for what was achieved.

I think my favorite current example at the moment was Anthropic's $20,000 C compiler from the other day. But they're an AI vendor, demos from non-vendors carry more weight.

I've seen enough to be convinced that there's something there, but I'm also confident we aren't close to figuring out the optimal way of putting this stuff to work yet.

voidhorse · 2026-02-07T22:41:08 1770504068

But the absence of papers is precisely the problem and why all this LLM stuff has become a new religion in the tech sphere.

Either you have faith and every post like this fills you with fervor and pious excitement for the latest miracles performed by machine gods.

Or you are a nonbeliever and each of these posts is yet another false miracle you can chalk up to baseless enthusiasm.

Without proper empirical method, we simply do not know.

What's even funnier about it is that large-scale empirical testing is actually necessary in the first place to verify that a stochastic processes is even doing what you want (at least on average). But the tech community has become such a brainless atmosphere totally absorbed by anecdata and marketing hype that no one simply seems to care anymore. It's quite literally devolved into the religious ceremony of performing the rain dance (use AI) because we said so.

One thing the papers help provide is basic understanding and consistent terminology, even when the models change. You may not find value in them but I assure you that the actual building of models and product improvements around them is highly dependent on the continual production of scientific research in machine learning, including experiments around applications of llms. The literature covers many prompting techniques well, and in a scientific fashion, and many of these have been adopted directly in products (chain of thought, to name one big example—part of the reason people integrate it is not because of some "fingers crossed guys, worked on my query" but because researchers have produced actual statistically significant results on benchmarks using the technique) To be a bit harsh, I find your very dismissal of the literature here in favor of hype-drenched blog posts soaked in ridiculous language and fantastical incantations to be precisely symptomatic of the brain rot the LLM craze has produced in the technical community.

simonw · 2026-02-07T22:59:13 1770505153

I do find value in papers. I have a series of posts where I dig into papers that I find noteworthy and try to translate them into more easily understood terms. I wish more people would do that - it frustrates me that paper authors themselves only occasionally post accompanying commentary that helps explain the paper outside of the confines of academic writing. https://simonwillison.net/tags/paper-review/

One challenge we have here is that there are a lot of people who are desperate for evidence that LLMs are a waste of time, and they will leap on any paper that supports that narrative. This leads to a slightly perverse incentive where publishing papers that are critical of AI is a great way to get a whole lot of attention on that paper.

In that way academic papers and blogging aren't as distinct as you might hope!

moregrist · 2026-02-08T00:42:52 1770511372

> There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.

This is a general problem with papers measuring productivity in any sense. It's often a hard thing to define what "productivity" means and to figure out how to measure it. But also in that any study with worthwhile results will:

1. Probably take some time (perhaps months or longer) to design, get funded, and get through an IRB.

2. Take months to conduct. You generally need to get enough people to say anything, and you may want to survey them over a few weeks or months.

3. Take months to analyze, write up, and get through peer review. That's kind of a best case; peer review can take years.

So I would view the studies as necessarily time-boxed snapshots due to the practical constraints of doing the work. And if LLM tools change every year, like they have, good studies will always lag and may always feel out of date.

It's totally valid to not find a lot of value in them. On the other hand, people all-in on AI have been touting dramatic productivity gains since ChatGPT first arrived. So it's reasonable to have some historical measurements to go with the historical hype.

At the very least, it gives our future agentic overlords something to talk about on their future AI-only social media.

mccoyb · 2026-02-04T16:06:16 1770221176

How does this model compare to the syndicated actor model of Tony Garnock-Jones?

(which, as far as I can tell, also supports capabilities and caveats for security)

Neat work!

davexunit · 2026-02-04T16:36:11 1770222971

The animation on the Syndicated Actors home page [0] does a pretty good job of showing the difference, I think. Goblins is much more similar to the classic actor model shown at the beginning of the animation. The "syndicated" part, as far as I understand, relates to things like eventually consistent state sync being built-in as primitives. In Goblins, we provide the actor model (actually the vat model [1] like the E language) which can be used to build eventually consistent constructs on top. Recently we prototyped this using multi-user chat as a familiar example. [2]

[0] https://syndicate-lang.org/

[1] https://files.spritely.institute/docs/guile-goblins/0.17.0/T...

[2] https://spritely.institute/news/composing-capability-securit...

mccoyb · 2026-02-04T16:37:48 1770223068

Thank you, very helpful!

mccoyb · 2026-02-04T16:37:17 1770223037

My 5 minute read is that the divergences are primarily in the communication model and in transactions:

- the SAM coordinates through the dataspace, whereas Goblins is focused on ("point-to-point") message passing

- SAM (as presented) doesn't contain a transactional semantics -- e.g. turns are atomic, and there's no rollback mechanism (I haven't been up to speed on recent work, I do wonder if this could be designed into SAM)

mccoyb · 2026-01-31T11:14:51 1769858091

a better term might be “feedback engineering” or “verification engineering” (what feedback loop do I need to construct to ensure that the output artifact from the agent matches my specification)

This includes standard testing strategies, but also much more general processes

I think of it as steering a probability distribution

At least to me, this makes it clear where “vibe coding” sits … someone who doesn’t know how to express precise verification or feedback loops is going to get “the mean of all software”

mccoyb · 2026-01-27T17:47:40 1769536060

It's not in the harness today, it's a special RL technique they discuss in https://www.kimi.com/blog/kimi-k2-5.html (see "2. Agent Swarm")

I looked through the harness and all I could find is a `Task` tool.

mccoyb · 2026-01-23T20:00:52 1769198452

Claude Code gets functionally worse every update. They need to get their shit together, hilarious to see Amodei at Davos talking big game about AGI and the latest update for a TUI application fucking changes observable behavior (like history scrolling with arrow keys), rendering random characters on the "newest" native version rendered in iTerm2, broken status line ... the list goes on and on.

This is the new status quo for software ... changing and breaking beneath your feet like sand.

direwolf20 · 2026-01-23T21:13:20 1769202800

Software changing and breaking beneath your feet is not new

mccoyb · 2026-01-19T00:59:16 1768784356

I think Yegge and Huntley are smart guys.

I don't think they're doing a good job incubating their ideas into being precise and clearly useful -- there is something to be said about being careful and methodical before showing your cards.

The message they are spreading feels inevitable, but the things they are showing now are ... for lack of better words, not clear or sharp. In a recent video at AI Engineer, Yegge comments on "the Luddites" - but even for advocates of the technology, it is nigh impossible to buy the story he's telling from his blog posts.

Show, don't tell -- my major complaint about this group is that they are proselytizing about vibe coding tools ... without serious software to show for it.

Let's see some serious fucking software. I'm looking for new compilers, browsers, OSes -- and they better work. Otherwise, what are we talking about? We're counting foxes before the hunt.

In any case, wouldn't trying to develop a serious piece of software like that _at the same time you're developing Gas Town or Loom_ make (what critics might call) the ~Emacs config tweaking for orchestration~ result driven?

mccoyb · 2026-01-19T01:11:48 1768785108

Here's a separate, optimistic comment about Yegge and Huntley: they are obviously on the right track.

In a recent video about Loom (Huntley's orchestration tool), Huntley comments:

"I've got a single goal and that is autonomous evolutionary software and figuring out what's needed to be there."

which is extremely interesting and sounds like great fun.

When you take these ideas seriously, if the agents get better (by hook and crook or RLVR) -- you can see the implications: "grad student descent" on whatever piece of software you want. RAG over ideas, A/B testing of anything, endless looping, moving software.

It's a nightmare for the model of software development and human organization which is "productive" today, but an extremely compelling vision for those dabbling in the alternative.

PKop · 2026-01-19T16:47:19 1768841239

> they are obviously on the right track

How can you just assert that? It's fine to say it looks like the right track to you. But in what way is it obvious?

3dsnano · 2026-01-19T14:59:24 1768834764

yes, and Yegge + Huntley are doing it in an fun and creative way, breaking rules that make folks really mad and huffy puffy. this is a renaissance to those who can see it, those who drink the koolaid willingly, because it makes you trip balls and come up with crazy ideas... just like Hypercard...

why do we drink it? because its awesome and makes software 100X more FUN than it used to be. what yegge + huntley are doing is intensely creative. they are having FUN. and i am have FUN!!!!!

skybrian · 2026-01-19T01:09:01 1768784941

It's a science project. I think the "I am so crazy" messaging is deliberate to scare most people away while attracting a few like-minded beta testers. He's telling you not to use it, which some people will take as a dare...

vessenes · 2026-01-19T13:30:07 1768829407

Counterpoint - you can go much faster if you get lots of people engaging with something and testing it. This is exploratory work, not some sort of ivory tower rationalism exercise, (if those even ever truly exist), there’s no compulsion involved, so everyone engaged does so for self-motivated reasons..

Don’t be mad!

Also, beads is genuinely useful. In my estimation, gas town, or a successor built on a similar architecture, will not only be useful, but likely be considered ‘state of the art’ for at least a month sometime in the future. We should be glad this stuff is developed in the open, in my opinion.

mccoyb · 2026-01-14T22:57:06 1768431426

Supposing agents and their organization improve, it seems like we’re approaching a point where the cost of a piece of software will be driven down to the cost of running the hardware, and the cost of the tokens required to replicate it.

The tokens were “expensive” from the minds of humans …

Daishiman · 2026-01-14T23:14:21 1768432461

It will be driven down to the cost of having a good project and product manager effectively understanding what the customer wants, which has been the main barrier to excellent software for a good long time.

galaxyLogic · 2026-01-14T23:34:35 1768433675

And not only understanding what the customer wants, but communicating that unambiguously to the AI. And note who is the "customer" here? Is it the end-users, or is it a client-company which contracts the project-manager for this task? But then the issue is still there, who in the client-company decides exactly what is needed and what the (potential) users want?

I think this situation emphasizes the importance of (something like) Agile. To produce something useful can only happen via experimentation and getting feedback from actual users, and re-iterating relentlessly.

mccoyb · 2026-01-09T20:22:48 1767990168

Who knew that these massive high-dimensional probability distributions would drive us insane

mccoyb · 2026-01-07T07:37:46 1767771466

It codes faster and with more abandon. For good results, mix Claude Code with Codex (preferably high or xhigh reasoning) for reviews.

copperx · 2026-01-07T18:35:07 1767810907

Thanks. The reason for my hesitancy is that I've heard that the $20 sub isn't enough for anything meaningful.

mccoyb · 2026-01-03T17:13:33 1767460413

My wishlist for 2026: Anthropic / OpenAI expose “how compaction is executed” to plugin authors for their CLI tools.

This technique should be something you could swap in for whatever Claude Code bakes in — but I don’t think the correct hooks or functionality is exposed.

rockwotj · 2026-01-03T20:48:49 1767473329

Isn’t codex open source and you can just go read what they do?

I have read the gemini source and it’s a pretty simple prompt to summarize everything when the context window is full

MillionOClock · 2026-01-03T21:30:36 1767475836

It should be noted that OpenAI now has a specific compaction API which returns opaque encrypted items. This is AFAICT different from deciding when to compact, and many open source tools should indeed be inspectable to that regard.

omneity · 2026-01-03T23:24:04 1767482644

It's likely to either be an approach like this [0] or something even less involved.

0: https://github.com/apple/ml-clara