That SWE-bench chart with the mismatched bars (52.8% somehow appearing larger than 69.1%) was emblematic of the entire presentation - rushed and underwhelming. It's the kind of error that would get flagged in any internal review, yet here it is in a billion-dollar product launch. Combined with the Bernoulli effect demo confidently giving an incorrect explanation of how airplane wings work (the equal transit time fallacy that NASA explicitly debunks), it doesn't inspire confidence in either the model's capabilities or OpenAI's quality control.
The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.
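To put the quoted input prices in perspective, here's a back-of-the-envelope sketch; the 200k-token workload is just an assumed figure, and the per-million prices are the ones quoted above:

```python
# Rough cost sketch using the quoted input prices; the workload size is assumed.
def input_cost(tokens: int, price_per_million_usd: float) -> float:
    return tokens / 1_000_000 * price_per_million_usd

workload = 200_000  # hypothetical long agentic session, input tokens only
print(f"GPT-5 input:  ${input_cost(workload, 1.25):.2f}")   # $0.25
print(f"Claude input: ${input_cost(workload, 15.00):.2f}")  # $3.00
```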
The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.
You're sort of glossing over the part where this can now be leveraged as a cost-efficient agentic model that performs better than o3. Nobody used o3 for software-agent tasks due to cost and speed, and this now seems to both substantially improve on o3 AND be significantly cheaper than Claude.
o3's cost was cut by 80% a month or so ago, and it's also cheaper than Claude (its output is even cheaper than GPT-5's). It seems more cost-efficient, but not by much.
o3 is fantastic at coding tasks; until today it was the smartest model in existence. But it only works well in few-shot conversational scenarios, and it's not good in agentic harnesses.
it has to be released because it's not much better and OpenAI needs the team to stop working on it. They have serious competition now and can't afford to burn time / money on something that isn't shifting the dial.
The whole presentation was full of completely broken bar charts. Not even just the typical "let's show 10% of the y-axis so that a 5% increase looks like 5x" trick (illustrated in the sketch below), but stuff like the deception eval showing GPT-5 vs o3 as 50 vs 47 where the 47 bar is drawn about 3x as tall, and then right next to it we have 9 vs 87, more reasonably sized.
It's like no one looked at the charts, ever, and they just came straight from.. gpt2? I don't think even gpt3 would have fucked that up.
I don't know any of those people, but everyone who has been with OAI for longer than 2 years got 1.5M bonuses, and somehow they can't deliver a bar chart with sensible axes?
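For what it's worth, here's a minimal matplotlib sketch of that first trick, axis truncation. The 69.1/52.8 pair is from the SWE-bench slide mentioned upthread; the axis limits are illustrative:

```python
# Minimal sketch: the same two scores plotted honestly vs. with a truncated
# y-axis. 69.1 and 52.8 are the SWE-bench numbers mentioned upthread; the
# axis limits are illustrative.
import matplotlib.pyplot as plt

models = ["GPT-5", "o3"]
scores = [69.1, 52.8]

fig, (honest, truncated) = plt.subplots(1, 2, figsize=(8, 3))

honest.bar(models, scores)
honest.set_ylim(0, 100)           # baseline at zero: bars proportional to scores
honest.set_title("y-axis from 0")

truncated.bar(models, scores)
truncated.set_ylim(50, 70)        # baseline at 50: a ~16-point gap fills the plot
truncated.set_title("truncated y-axis")

plt.tight_layout()
plt.show()
```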
TBH Claude Code Max Pro's performance on coding has been abhorrent (bad at best). The core of the issue is that the plan it produces will more often than not use humans as verifiers (of correctness, optimality, and quality control). That's a fundamentally bad way to build systems that need to figure out whether their plan will work correctly, because an AI system needs to test many plans quickly in a principled manner (it should be optimal and cost-efficient); a rough sketch of what I mean is below.
So you might get that initial MVP out the door quickly, but when the complexity grows even just a little bit, you're forced to stop, look at the plan, and try to steer it with prompts like: "use Design agent to ultrathink about the dependencies of the current code change on other APIs and use TDD agent to make sure tests are correct in accordance with the requirements I stated", and then you find that even with all that thinking there are bugs you'll have to fix.
Source: I just tried Max Pro on two client Python projects and it was horrible after week 2.
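Here's roughly what I mean by principled verification, as a hypothetical sketch: let the test suite, not a human, decide whether a candidate plan/patch worked. None of this is Claude Code's actual API; pytest and git just stand in for whatever harness you use, and the function names are mine.

```python
# Hypothetical sketch of "test many plans quickly" instead of using a human
# as the verifier: apply each candidate patch, run the test suite, and keep
# the first one that passes. All names here are illustrative.
import subprocess

def run(cmd: list[str], cwd: str) -> bool:
    # Returns True if the command exits with status 0.
    return subprocess.run(cmd, cwd=cwd).returncode == 0

def plan_is_verified(repo: str, patch_file: str) -> bool:
    # Apply the candidate change, then let the test suite judge it.
    if not run(["git", "apply", patch_file], cwd=repo):
        return False
    ok = run(["pytest", "-q"], cwd=repo)
    run(["git", "checkout", "--", "."], cwd=repo)  # revert the working tree
    return ok

def pick_working_patch(repo: str, candidates: list[str]) -> str | None:
    for patch_file in candidates:
        if plan_is_verified(repo, patch_file):
            return patch_file
    return None
```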
What do you mean? A single data point cannot be exponential. What the blog post says is that the task-solving ability of all LLMs grows exponentially over time, and GPT-5 fits that curve.
Yes, but the jump in performance from o3 is well beyond marginal while also fitting an exponential trend, which undermines the parent's claim on two counts.
I suspect the vast majority of OpenAI's users are only using ChatGPT, and the vast majority of those ChatGPT users are only using the free tier.
For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models haven't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and their full capabilities have been behind a paywall.
I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people who are paying, but it'll be a step change in capability for everyone else.