Scaling model size, compute, and the dataset has hit a wall. If a model is too large, or needs too much compute, it becomes too expensive to use. And the dataset... we benefited in one go from multiple decades of content accumulating online, but since late 2022 it's only been 3 years, so organic text doesn't keep growing exponentially past this point; it only got us to 50T tokens or so.
> Things like number of stars on a repository, number of forks, number of issues answered, number of followers for an account. All these things are powerful indicators of quality
They're NOT! Lots of trashy AI projects have +50k stars.
It's much more interesting than that. They're using this document as part of the training process, presumably backed up by a huge set of benchmarks and evals and manual testing that helps them tweak the document to get the results they want.
"Use AI to fix AI" is not my interpretation of the technique. I may be overlooking it, but I don't see any hint that this soul doc is AI generated, AI tuned, or AI influenced.
Separately, I'm not sure Sam's word should be held as prophetic and unbreakable. It didn't work for his company, at some previous time, with their approaches. Sam's also been known to tell quite a few tall tales, usually about GPT's capabilities, but tall tales regardless.
If Sam said that, he is wrong. (Remember, he is not an AI researcher.) Anthropic have been using this kind of approach from the start, and it's fundamental to how they train their models. They have published a paper on it here: https://arxiv.org/abs/2212.08073
I'm impressed by this. You know, in the beginning I was like, hey, why doesn't this look like Counter-Strike? Yeah, I had the expectation that these things could one-shot an industry-leading computer game. Of course that's not yet possible.
But still, this is pretty damn impressive for me.
In a way, they really condensed perfectly a lot of what's silly currently around AI.
> Codex, Opus, Gemini try to build Counter Strike
Even though the prompt mentions Counter-Strike, it actually asks for the basics of a generic FPS, and after a few iterations it ends up with some sort of Minecraft-looking generic FPS, with code that would never make it to prod anywhere sane.
It's technically impressive. But functionally very dubious (and not at all anything remotely close to Counter-Strike besides "being an FPS").
i mean it's the most bare-bones implementation without any engineering considerations
it's not something that would ever work industrially
people with code generators they've made could do this just as fast as the AI, except their generators could have engineering considerations built into them as well, so it'd be even better
> people with code generators they've made could do this just as fast as the AI, except their generators could have engineering considerations built into them as well, so it'd be even better
I think they're referring to the project scaffolding features that are built into framework tooling these days (e.g. `ng generate <schematic>` or `dotnet scaffold`).
There's also the practice of using good ol' fashioned code-generation tools like T4 or Mustache/Liquid templates to generate program entity classes and data-access methods from a DB schema, for example. Furthermore, there's now pretty nifty compile-time code generation in C#, while languages like F# support build-time type generation.
...and these are all good tools IMO, but they really aren't comparable to an LLM.
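For anyone who hasn't used that style of tooling, here's a toy sketch of the template-driven approach. Everything here (the `SCHEMA` dict, the template, the generated class shape) is invented for illustration; real tools like T4 or a Mustache-based generator read the metadata from the database itself.

```python
from string import Template

# Toy template-driven codegen: emit an entity class plus a data-access helper
# from table metadata. SCHEMA is made up; real tools read the DB's own schema.
SCHEMA = {
    "table": "orders",
    "class_name": "Order",
    "columns": [("id", "int"), ("customer_id", "int"), ("total", "float")],
}

ENTITY_TEMPLATE = Template('''class $class_name:
    """Auto-generated entity for the '$table' table."""

    def __init__(self, $args):
$assigns

    @staticmethod
    def fetch_by_id(conn, id):
        row = conn.execute("SELECT $cols FROM $table WHERE id = ?", (id,)).fetchone()
        return $class_name(*row)
''')

def generate_entity(schema):
    cols = [name for name, _ in schema["columns"]]
    return ENTITY_TEMPLATE.substitute(
        class_name=schema["class_name"],
        table=schema["table"],
        args=", ".join(f"{n}: {t}" for n, t in schema["columns"]),
        assigns="\n".join(f"        self.{n} = {n}" for n in cols),
        cols=", ".join(cols),
    )

if __name__ == "__main__":
    print(generate_entity(SCHEMA))  # paste/commit the generated source
```

Deterministic, inspectable output every time, which is exactly the property the LLM version trades away.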
Lots of research shows post-training dumbs down the models, but no one listens, because people are too lazy to learn proper prompt programming and would rather have a model that already understands the concept of a conversation.
Some distributional collapse is good in terms of making these things reliable tools. The creativity and divergent thinking does take a hit, but humans are better at this anyhow so I view it as a net W.
This. A default LLM is "do whatever seems to fit the circumstances". An LLM that was RLVR'd heavily? "Do whatever seems to work in those circumstances".
Very much a must for many long term tasks and complex tasks.
You lob it the beginning of a document and let it toss back the rest.
That's all that the LLM itself does at the end of the day.
All the post-training to bias results, routing to different models, tool calling for command execution and text insertion, injected "system prompts" to shape user experience, etc are all just layers built on top of the "magic" of text completion.
And if your question was more practical: where made available, you get access to that underlying layer via an API or through a self-hosted model, making use of it with your own code or with a third-party site/software product.
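If it helps to see that concretely, here's a minimal sketch of hitting the raw completion layer directly, assuming a locally hosted OpenAI-compatible server (llama.cpp's server and vLLM both expose this endpoint shape); the URL, model name, and prompt are just placeholders:

```python
import requests

# Plain completion against the underlying text-completion layer:
# no chat template, no system prompt, just "continue this document".
API_URL = "http://localhost:8080/v1/completions"  # placeholder local server

prompt = (
    "The following is a technical FAQ about terminal emulators.\n\n"
    "Q: What does TERM=xterm-256color actually control?\nA:"
)

resp = requests.post(
    API_URL,
    json={
        "model": "local-model",   # whatever the server has loaded
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0.7,
        "stop": ["\nQ:"],         # stop before the model invents the next question
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```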
Better? I'm not sure. A parent comment [1] was suggesting better LLM performance using completion than using chat. UX-wise it's probably worse, except for power users.
Exactly. Even this paper shows how model creativity significantly drops and the models experience mode collapse like we saw in GANs, but the companies keep using RLHF...
Great management will do that automatically. Meetings will be set so the team has 3-4 uninterrupted, solid working days, with one day strictly dedicated to meetings and/or other interruptions.
> It's silly of them to say you need a "modern terminal emulator", it's wrong and drives people away. I'm using xfce4-terminal.
Good. I'd rather use a tool designed with focus on modern standards than something that has to keep supporting ancient ones every time they roll an update.
This is what I've been talking about for a few months now. The AI field seems to reinvent the wheel every few months, and because most people really don't know what they're talking about, they just jump on the hype and adopt the new so-called standards without really thinking about whether it's the right approach. It really annoys me, because I've been following some open source projects with genuinely novel ideas about AI agent design, and they're mostly ignored by the community. But as soon as a large company like Anthropic or OpenAI starts a trend, suddenly everyone adopts it.
Well, what are those projects? I don't speak for anyone else, but I'm generally fatigued by the endless parade of science fair projects at this point, and operate under the assumption that if an approach is good enough, openai/anthropic/google will fold useful ideas under their tools/products.
I cannot believe all these months and years people have been loading all of the tool JSON schemas upfront. This is such a waste of context window and something that was already solved three years ago.
What is the right pattern? Do you just send a list of tool names & descriptions, and just give the agent an "install" tool that adds a given tool to the schema on the next turn?
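Roughly something like this sketch, maybe? Everything here (the `call_model` helper, the registry, the message shapes, the `dispatch` stub) is a hypothetical placeholder, not any vendor's actual API:

```python
# Deferred tool loading: only names + one-line descriptions go in up front,
# and a tool's full JSON schema is pulled into context when the model asks.

TOOL_REGISTRY = {
    "search_issues": {
        "description": "Search the issue tracker.",
        "schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    "create_branch": {
        "description": "Create a git branch.",
        "schema": {"type": "object", "properties": {"name": {"type": "string"}}},
    },
    # ...dozens more that would blow up the context window if sent in full
}

LOAD_TOOL = {
    "name": "load_tool",
    "description": "Make a named tool callable on the next turn. Available: "
    + ", ".join(f"{k} ({v['description']})" for k, v in TOOL_REGISTRY.items()),
    "schema": {"type": "object", "properties": {"tool_name": {"type": "string"}}},
}

def dispatch(call):
    # Stub: a real agent would actually execute the named tool here.
    return f"(pretend output of {call['name']} with {call['arguments']})"

def run_agent(messages, call_model, max_turns=10):
    active_tools = [LOAD_TOOL]                 # start with only the meta-tool
    for _ in range(max_turns):
        reply = call_model(messages, tools=active_tools)
        messages.append(reply)
        if reply.get("tool_call") is None:
            return reply["content"]            # plain-text answer, we're done
        call = reply["tool_call"]
        if call["name"] == "load_tool":
            name = call["arguments"]["tool_name"]
            active_tools.append({"name": name, **TOOL_REGISTRY[name]})
            messages.append({"role": "tool", "content": f"{name} is now available."})
        else:
            messages.append({"role": "tool", "content": dispatch(call)})
```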
- Claude's tool search tool
- list of skills (markdown files) the agent can grep
- Claude Skills
- context compaction
- sub-agents
- plans
There is no one “right” pattern. But yes it all generalizes to context engineering.
With plans, for example, you write out potential distractions for later, to keep the context (both the AI's and the human's) focused on the task at hand.
That pattern solves a distinctly different use case than the skills folder, but plans can also refer to skills in specific ways.
Context engineering is evolving with overlapping, complementary patterns, and while certain vendors are branding those patterns, I think we will hopefully see tools converge.