A key difference is that the cost of providing a cab ride has largely stayed the same. Gas to get you from point A to point B is ~$5, and there's a floor on what you can pay the driver. If your ride costs $8 today, you know that's unsustainable; it'll eventually climb to $10 or $12.
But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.
>But inference costs are dropping dramatically over time,
Please prove this statement; so far there is no indication that it's actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable).
There is a reason the AI companies don't ever talk about their inference costs. They boast about everything they can find, but not about inference.
I believe OP's point is that for a given model quality, inference cost decreases dramatically over time. The article you linked talks about effective total inference costs which seem to be increasing.
Those are not contradictory: a company's inference costs can increase due to deploying more models (Sora), deploying larger models, doing more reasoning, and an increase in demand.
However, if we look purely at how much it costs to run inference on a fixed amount of requests for a fixed model quality, I am quite convinced that the inference costs are decreasing dramatically. Here's a model from late 2025 (see Model performance section) [1] with benchmarks comparing a 72B parameter model (Qwen2.5) from early 2025 to the late 2025 8B Qwen3 model.
The 9x smaller model outperforms the larger one from earlier the same year on 27 of the 40 benchmarks they were evaluated on, which is just astounding.
Anecdotally, I find you can tell if someone worked at a big AI provider or a small AI startup by proposing an AI project like this:
" First we'll train a custom trillion parameter LLM for HTML generation. Then we'll use it to render our homepage to our 10 million daily visitors. "
The startup people will be like "this is a bad idea because you don't have enough GPUs for training that LLM" and the AI lab folks will be like "How do you intend to scale inference if you're not Google?"
> But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
I'd like to see this statement plotted against current trends in hardware prices at iso-performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet is 3x the price.
I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies in this way if they thought all major hardware vendors were going to see their margins shrink to commodity levels, as you've implied.
"The energy consumed per text prompt for Gemini Apps has been reduced by 33x over the past 12 months."
My thinking is that if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive, in the realm of what we are paying for ChatGPT. Google has their own TPUs and company culture oriented towards optimizing the energy usage/hardware costs.
I tend to agree with the grandparent on this: LLMs will get cheaper at the level of intelligence we have now, and will get more expensive for SOTA models.
Google is a special case - ever since LLMs came out I've been pointing out that Google owns the entire vertical.
OpenAI, Anthropic, etc are in a race to the bottom, but because they don't own the vertical they are beholden to Nvidia (for chips), they obviously have less training data, they need a constant influx of cash just to stay in that race to the bottom, etc.
Google owns the entire stack - they don't need nvidia, they already have the data, they own the very important user-info via tracking, they have millions, if not billions, of emails on which to train, etc.
Google needs no one, not even VCs. Their costs must be a fraction of the costs of pure-LLM companies.
> OpenAI, Anthropic, etc are in a race to the bottom
There's a bit of nuance hiding in the "etc". Openai and anthropic are still in a race for the top results. Minimax and GLM are in the race to the bottom while chasing good results - M2.1 is 10x cheaper than Sonnet for example, but practically fairly close in capabilities.
> There's a bit of nuance hiding in the "etc". Openai and anthropic are still in a race for the top results.
That's not what is usually meant by "race to the bottom", is it?
To clarify, in this context I mean that they are all in a race to be the lowest margin provider.
They're at the bottom of the value chain - they sell tokens.
It's like being an electricity provider: if you buy $100 of electricity and produce 100 widgets, which you sell for $1k each, that margin isn't captured by the provider.
That's what being at the bottom of the value chain means.
I get what it means, but it doesn't look to me like they're trying that yet. They don't even care that people buy multiple highest level plans to rotate them every week, because they don't provide a high enough tier for the existing customers. I don't see any price war happening. We don't know what their real margins are, but I don't see the race there. What signs do you see that Anthropic and Openai are in the race to the bottom?
> I don't see any price war happening. What signs do you see that Anthropic and Openai are in the race to the bottom?
There don't need to be signs of a race (or a price war), only signs of commodification; all you need for something to turn into a commodity is a lack of differentiation between providers.
When you're buying a commodity, there's no big difference between getting your commodity delivered by $PROVIDER_1 and getting your commodity delivered by $PROVIDER_2.
The models are all converging quality-wise. Right now the number of people who swear by OpenAI models are about the same as the number of people who swear by Anthropic models, which are about the same as the number of people who swear by Google's models, etc.
When you're selling a commodity, the only differentiation is in the customer experience.
Right now, sure, there's no price war, but right now almost everyone who is interested is playing with multiple models anyway. IOW, the target consumers are already treating LLMs as a commodity.
Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value, but also an invasion of privacy, since information could possibly leak about individuals via the model.
> Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value
Google probably even has an advantage there: filter out everything except messages sent from one valid Gmail account to another. If you do that you drop most of the spam and marketing, and are left mostly with human-to-human interactions. On top of that, they have their spam filters.
I'd upgrade that "probably" leak to "will absolutely" leak, albeit with some loss of fidelity.
Imagine industrial espionage where someone is asking the model to roleplay a fictional email exchange between named corporate figures in a particular company.
> if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive
There is a recent piece by Linus Sebastian (LTT) talking about YouTube: it is almost impossible to cover the cost of building a competitor because it is astronomically expensive (vs. potential revenue).
I do not disagree that they will get cheaper, but I'm pointing out that none of this is being reflected in hardware pricing. You state LLMs are becoming more optimized (less expensive). I agree. This should have a knock-on effect on hardware prices, but it doesn't. Where is the disconnect? Are hardware prices a lagging indicator? Is Nvidia still a 5 trillion dollar company if we see another 33x improvement in "energy consumed per text prompt"?
Jevons paradox. As AI gets more efficient, its potential scope expands further and the hardware it runs on becomes even more valuable.
BTW, the absolute lowest "energy consumed per logical operation" is achieved with so-called 'neuromorphic' hardware that's dog slow in latency terms but more than compensates with extreme throughput. (A bit like an even more extreme version of current NPU/TPUs.) That's the kind of hardware we should be using for AI training once power use for that workload is measured in gigawatts. Gaming-focused GPUs are better than your average CPU, but they're absolutely not the optimum.
> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies in this way if they thought all major hardware vendors were going to see their margins shrink to commodity levels, as you've implied.
This isn't hard to see. A company's overall profits are influenced – but not determined – by the per-unit economics. For example, increasing volume (quantity sold) at the same per-unit profit leads to more profits.
> So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
The same task on the same LLM will cost $8 or less. But that's not what vendors will be selling, nor what users will be buying. They'll be buying the same task on a newer LLM. The results will be better, but the price will be higher than the same task on the original LLM.
It's not the hardware getting cheaper; it's that LLMs were developed when we really didn't understand how they worked, and there is still some room to improve the implementations, particularly to do more with less RAM. And that's everything from doing more with fewer weights to lower-precision formats like FP16, not to mention that if you can 2x the speed, you can get twice as much done with the same RAM and all the other parts.
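To make the "less RAM" point concrete, here's a rough back-of-the-envelope sketch (the model sizes and precisions are illustrative only, not any vendor's actual numbers): weight memory is roughly parameter count times bytes per parameter, so shrinking the model or dropping precision directly shrinks the hardware needed to serve it.

```python
# Rough, illustrative estimate of weight memory only (ignores KV cache,
# activations, and runtime overhead). All numbers are hypothetical examples.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate GiB needed just to hold the weights."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)

for params, label in [(72e9, "72B"), (8e9, "8B")]:
    for precision in ("fp16", "int4"):
        print(f"{label} @ {precision}: ~{weight_memory_gib(params, precision):.0f} GiB")

# 72B @ fp16 is on the order of ~134 GiB of weights, while 8B @ int4 is ~4 GiB:
# same class of serving hardware, wildly different cost per request.
```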
> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up.
Yeah, valuations for hardware vendors have nothing to do with costs. Valuations are a meaningless thing to fold into your thinking about something objective like whether the retail cost of inference will trend down (obviously yes).
What if we run out of GPUs? Out of RAM? Out of electricity?
AWS is already raising GPU prices, which never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things?
My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.
You can't copy/paste a new ASML, no matter how hard you try (short of open-sourcing all of their IP). Even if you did, by the time you copied one generation of machines, they'd be on a new generation and you'd still have the bottleneck in the same place.
Not to mention that with these monopolies they can just keep increasing prices ad infinitum.
The outcome may seem like magic, but the input is "simply" hard work and a big budget: billions of dollars and years of investment into tuning the parameters like droplet size, frequency, etc...
The interviews make it clear that the real reason ASML's machines are (currently) unique is that few people had the vision, patience, and money to fund what seemed at the time impossible. The real magic was that ASML managed to hang on by a fingernail and get a successful result before the money ran out.
Now that tin droplet EUV lasers have not only been demonstrated to be possible, but have become the essential component of a hugely profitable AI chip manufacturing industry, obtaining funding to develop a clone will be much easier.
> ASML's secret sauce is not that secret or uncopyable.
You must've watched a different video. They took a decade to get there and they're happy to show all the how-to's because they know the devil is in the details.
If the US is ready to start a war against Europe to invade Greenland, it's certainly because they need more sand and plastic? Of course by weight it's probably mostly sand and plastic, but the interesting bits probably need palladium, copper, boron, cobalt, tungsten, etc.
Greenland is Trump’s Ukraine. He’s jealous of Putin, that is all.
There is nothing in Greenland worth breaking up the alliances with Europe over.
Trump is too stupid to realise this, he just wants land like it’s a Civ game.
PS: An entire rack of the most expensive NVIDIA equipment millions of dollars can buy has maybe a few grams of precious or rare metals in it. The cost of those is maybe a dollar or two. They don't even use gold any more!
The expensive part is making it, not the raw ingredients.
That alliance costs money. It doesn't bring anything good in return: the USSR (which this alliance was created against) is long gone. Trump is a genius if he somehow manages to kill two birds with one stone: make the OTHER parties of the alliance want to disband it AND get a piece of land with a unique strategic position all to himself/the U.S.
I think it's Putin who is going to be jealous of Trump, not the other way around.
The parent said: "Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing."
SOTA improvements have been coming from additional inference due to reasoning tokens and not just increasing model size. Their comment makes plenty of sense.
Is it? Recent models tend to need fewer tokens to achieve the same outcome. The days of ultrathink are coming to an end; Opus is quite usable without it.
The only way to get a "perfect rating" is to go to your junior dev and bring another interruption (maybe the dev was 90% done!). So now he has been interrupted twice by two different managers, and you have contradicted your own boss in front of an employee. You just broke a cardinal rule of middle management: it's ok to tell your boss he is wrong, but not in front of someone else.
Additionally, you also need to tell him to f** off with his request to get the numbers (without even trying to understand whether the request was legitimate or not!), so that your precious sprint is saved. I don't see how he gets what he wants in your ideal handling. At best you seem to tell him you will "look into it" in two weeks.
A much better solution is to help your junior dev solve the problem so the interruption goes away as fast as possible and he can go back to contributing to the sprint. If the VP requires these numbers and went as far as back-channeling around you, there is probably a quite good reason for that. Maybe the last time he needed something you told him it was not possible because the sprint thingy is unmovable?
Once you have the result, you can go give those numbers to the VP yourself, highlight the work of the junior dev, and use this "I am giving you the very important data you asked for" as a foot in the door to show that you had to pull the dev from the other feature, that these interruptions also have a cost, and that you are more than happy to take care of them. He gets what he wants, the difficult conversation of "you did not do what you are supposed to do" happens behind closed doors, and you have a much better chance of getting results if he sees you as an ally for getting his important stuff done rather than a hindrance.
I agree that this is true 90% of the time, but once you include office politics in the equation, sometimes it is not.
If it is a deeply political institution, these are the initial questions I would start with:
What is the junior's relationship to the VP? What is the junior's relationship to you, the manager? What is your relationship, as the manager, to the VP? How respectful of boundaries is the VP? How likely is he to do this again, or to shove you out of the way next time? How much do you care about being sidelined, compared to the quality of the overall work? How many times has this happened before? How likely is the junior to bypass you anyway?
And as one can see, this is just too much to bother with. Sometimes it is easier to cry out that you need more money and/or time.
I would do the same, by the way. Make the distraction go away and try to put things back on the process track. If the process does not work and this keeps happening, there is no reason to tell the person who pays you that they are always wrong.
Eeeehhh, might be overestimating executives a bit =P
But yeah, my first instinct is also to tell Gary to fuck off. That said, I would default to a process reason, so the advice at the end wasn't totally useless for me.
And I'm with you—my default instinct is also to tell Gary to back off using a Process Reason (e.g. 'It's not in the sprint'). It feels safe because it's logical.
The 'Advice at the end' was just trying to highlight why that specific shield often cracks against a VP (because they think they own the process). Glad that specific breakdown was useful to read, even if the scenario felt a bit generous to the exec!
Again, it also depends on who "Gary" is in the real world!
This is a brilliant deconstruction. You’ve highlighted a flaw in my 'Correct' path: I optimized for Process Protection (Save the Sprint), but you are optimizing for Relationship Preservation (Save the VP's face).
You are absolutely right that contradicting the VP in front of the Junior Dev breaks the 'United Front' rule.
This highlights a key point: I built this to teach transferable heuristics (e.g., 'Protect the team'), not to be a rigid playbook. In real life, specific contexts (like 'Is the VP usually reasonable?') often override the default rule.
Your approach—facilitate the request to clear the distraction, then negotiate boundaries in private—is a more sophisticated heuristic than the one I initially coded. It trades short-term sprint purity for long-term political capital.
I love this. I’m going to add your 'Shield & Deliver' path as an alternative (and perhaps superior) winning state. This is exactly the nuance I wanted to surface.
>I love this. I’m going to add your 'Shield & Deliver' path as an alternative (and perhaps superior) winning state. This is exactly the nuance I wanted to surface.
I would be wary of this being the superior winning state, but definitely an alternative. I've done exactly this in my career as a tech lead only for it to burn me, and probably 2/3rds of the time the best thing for everyone is to simply "Save the Sprint" and not become mired in discussions that often are for personal empire building that strategic leadership would hate.
Maybe people have different experiences than me on this, feel free to speak up!
This is exactly why management is hard to unit test!
You are absolutely right. If you 'Shield & Deliver' every time, you risk becoming the 'Yes Man' who absorbs infinite scope creep for someone's vanity project (Empire Building).
The 'Correct' answer actually depends entirely on the Nature of the Request:
Legitimate Business Crisis? -> Shield & Deliver
Noise/Politics? -> Save the Sprint
Distinguishing between the two before you act is the master skill.
I think keeping both paths as valid strategies with different 'Trade-off' warnings or having 2 different contexts is the right move to reflect that ambiguity
When traffic was at its peak, I did use an LLM to polish my responses.
Later, I started replying without an LLM.
No, not AI agents at work. This is an individual at work!
> A much better solution is to help your junior dev solve the problem
Meanwhile there are five other subordinates and all the overhead that you're neglecting while you fiddle with your dev environment trying to get started on the task, as you've been away from direct engineering for a while.
>If the VP requires these numbers and went as far as back-channeling around you, there is probably a quite good reason for that.
This is good intuition, but in my experience people generally won't tell you whether their reason is good or silly/self-serving, and you can only really get them to surface that by comparing the request against the priority of other commitments and forcing them to deprioritize something.
I think the ratings might be a bit borked. There's a dialogue path that results in the A+ where you end up asking the VP directly whether the back-channel was worth another delay, and the VP says no. No junior gets interrupted.
Sometimes you don’t even need to surface it. You just force responsibility: “This will prevent us from being ready for Friday’s demo. If you’re cool with that I’ll run it by {project sponsor}.”
Now it’s between the VP and the project sponsor - as it rightfully should be.
You are comparing apples to oranges. Social media posts are static, don't watch you, etc. But the distribution platform does all these things.
With books it's exactly the same thing: do not believe for one second that the publishing industry does not watch engagement metrics (aka sales) and does not adapt to the taste of the market. It's also tuned to maximize outrage; see how popular unauthorized biographies of polarizing figures have become - who is next on Walter Isaacson's list? I am betting Trump must be somewhere on there, and it's gonna be a banger.
If these customers have set up an automated system to pay you $200 every single minute, that’s correct. If they haven’t and it was just a one-off sale, you are missing the “recurring” part of ARR.
> It seems like EU in general should be heavily invested in Mistral's development, but it doesn't seem like they are
The EU is extremely invested in Mistral's development: half of the effort is finding ways to tax them (hello Zucman tax), the other half is wondering how to regulate them (hello AI act)
The Zucman tax targets rich individuals (€100m+), not Mistral. The AI Act rules are not that difficult for GPAI model providers to comply with, as long as the model doesn't pose systemic risk... They have to spend a lot more time on PR and handshaking with French politicians than on AI compliance. They probably don't even have a single FTE for that... So that's just prejudice, I believe.
> Our first and default tool should be some form of lightweight automated testing
Manual verification isn't about skipping tests, it's about validating what to test in the first place.
You need to see the code work before you know what "working" even means. Does the screen render correctly? Does the API return sensible data? Does the flow make sense to users? Automated tests can only check what you tell them to check. If you haven't verified the behavior yourself first, you're just encoding your assumptions into test cases.
I'd take "no tests, but I verified it works end-to-end" over "full test coverage, but never checked if it solves the actual problem" every time. The first developer is focused on outcomes. The second is checking boxes.
Tests are crucial: they preserve known-good behavior, but you have to establish what "good" looks like first, and that requires human judgment. Automate the verification, not the discovery. So our first and default tool remains manual verification.
I suppose we could be talking in circles around each other, but I'd say much of what you've suggested as manual tests could be codified into automated tests just as easily.
Manual: `curl localhost:8080 | jq .` or whatever, brings value once.
Automated: `assert.ValidJSON(req.Body)` is basically identical, but can be repeated over and over again.
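For instance, here's a minimal sketch of what that codification could look like (the endpoint, port, and test framework are my own assumptions, not anything specific from this thread):

```python
# Minimal sketch: turn the one-off `curl localhost:8080 | jq .` check into a
# repeatable test. The endpoint and the expected shape are hypothetical.
import json
import unittest
import urllib.request

class HomepageApiTest(unittest.TestCase):
    def test_root_returns_valid_json(self):
        with urllib.request.urlopen("http://localhost:8080/") as resp:
            self.assertEqual(resp.status, 200)
            body = resp.read()
        # json.loads raises ValueError on invalid JSON, which is essentially
        # what piping the response through `jq .` was verifying by hand.
        payload = json.loads(body)
        self.assertIsInstance(payload, dict)

if __name__ == "__main__":
    unittest.main()
```

The manual check still has to happen once to decide what "correct" looks like; the test just freezes that decision so it keeps getting re-checked for free.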
I know a few countries that reject poisonous US social media in favor of better platforms that are safe for children, safe for news and information, and safe for society and for Democracy itself: the people's democracy of North Korea, the democratic republic of Iran, the not-at-all-authoritarian society of Russia, etc.
I see a tremendous correlation between restricting access to certain websites and straight-up dictatorships that pretend to protect their populations from the evils of foreign influence.
Or maybe it’s just strange classification. I see a lot of prompts on the internet looking like “act as a senior xxx expert with over 15 years of industry experience and answer the following: [insert simple question]”
I hope those are not classified as “roleplaying”; the “roleplay” here is just a trick to get a better answer from the model, often in a professional setting that has nothing to do with creative writing or NSFW stuff.
I can't be sure, but this sounds entirely possible to me.
There are many, many people, and websites, dedicated to roleplaying, and those people will often have conversations lasting thousands of messages with different characters. I know people whose personal 'roleplay AI' budget is $1,000/month, as they want the best quality AIs.
Would be good to look into those particular statistics, then. Seems like the category could include all sorts of stuff:
> This indicates that users turn to open models primarily for creative interactive dialogues (such as storytelling, character roleplay, and gaming scenarios) and for coding-related tasks. The dominance of roleplay (hovering at more than 50% of all OSS tokens) underscores a use case where open models have an edge: they can be utilized for creativity and are often less constrained by content filters, making them attractive for fantasy or entertainment applications. Roleplay tasks require flexible responses, context retention, and emotional nuance - attributes that open models can deliver effectively without being heavily restricted by commercial safety or moderation layers. This makes them particularly appealing for communities experimenting with character-driven experiences, fan fiction, interactive games, and simulation environments.
I could imagine that something like D&D, or other narrative adventures on demand, with a machine that never tires of exploring subplots or rewriting sections to be a bit different, is a pretty cool thing to have. Either that, or writing fiction, albeit hopefully not entire slop books that are sold, but something to draw inspiration from and do a back-and-forth with.
In regards to NSFW stuff, a while back people were clowning on OpenAI for suggesting that they'd provide adult writing content to adults, but it might as well be a bunch of money that's otherwise left on the table. Note: I'm all for personal freedom, though one also has to wonder about the longer term impact of those "AI girlfriend/boyfriend" trends, you sometimes see people making videos about those subreddits. Oh well, not my place to judge.
Edit: oh hey, there is more data there after all
> Among the highest-volume categories, roleplay stands out for its consistency and specialization. Nearly 60% of roleplay tokens fall under Games/Roleplaying Games, suggesting that users treat LLMs less as casual chatbots and more as structured roleplaying or character engines. This is further reinforced by the presence of Writers Resources (15.6%) and Adult content (15.4%), pointing to a blend of interactive fiction, scenario generation, and personal fantasy. Contrary to assumptions that roleplay is mostly informal dialogue, the data show a well-defined and replicable genre-based use case.
The way I think about it, the training data (i.e. the internet) has X% of people asking something like "explain it to me like I'm five years old" and Y% of people framing it like "I'm technical, explain this to me in detail". You use the "act as a senior XXX" when you want to bias the output towards something more detailed.