Polymarket bettors are not impressed. Based upon the market odds, OpenAI had a 35% chance to have the best model (at year end), but those odds have dropped to 18% today.
(I'm mostly making this comment to document what happened for the history books.)
After a few hours with GPT-5, I'd trade that spread. Not that I think oAI will win at end of year. But I think GPT-5 is better than it looks on the benchmark side. It is very very good at something we don't have a lot of benchmarks for -- keeping track of where it's at. Codex is vassstly better in practice than Claude Code or Gemini CLI right now.
On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to get a taste and a preference for it. I ask most models to help me build a MacBook Pro charger in 15th century Florence with the instructions that I start with only my laptop and I can only talk for four hours of chat before the battery dies -- GPT-5 was notable in that it thought through a bunch of second-order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split ring commutator + generator in 15th century Florentine Italian(!). I have no way of verifying if the Italian was correct.
Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.
That said, to me this is one of the first 'product release' moments in the frontier model space. GPT-5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current Polymarket favorite) is remarkably bad at those product releases.
Back to betting - I bet there's a moment this year where those numbers change 10% in oAI's favor.
I would agree. I am a big fan of Claude and I've used Claude Code a bunch, but after testing Codex & GPT-5 extensively, I find that it gets stuck in a rut way less often and is much more often able to pinpoint issues & fixes in the codebase.
How on Earth does that market have Anthropic at 2%, in a dead heat with the likes of Meta? If the market was about yesterday rather than 5 months from now I think Claude would be pretty clearly the front runner. Why does the market so confidently think they’ll drop to dead last in the next little while?
It's because those markets are based on the LMArena leaderboard (https://lmarena.ai/), where Claude has historically done poorly.
That eval has also become a lot less relevant (it's considered not very indicative of real-world performance), so it's unlikely Anthropic will prioritize optimizing for it in future models.
Anthropic has always been one of the best at not optimizing for stupid metrics. Rather, they spend significant energy researching weaknesses and building metrics around that. Google is also pretty on point IMO, but they can also afford to dedicate resources to these nonsense metrics as they are still good marketing.
Meanwhile Meta and xAI are behind the curve and largely marketing focused.
How is Claude doing on the benchmark that market is based on? Maybe not so good? Idk. Just because Claude is good for real world use doesn't mean it's winning the benchmark, but the benchmark is all that matters for the Polymarket.
I'm a fan of Anthropic for this reason. I use Claude and it's very good most of the time for my coding requirements.
Generally when you have a lot of companies competing to show whose product X does the best at Y, there are a lot of monetary incentives to manipulate the products to perform well specifically on those types of tests.
Well I for example don't give a shit what prediction markets do and never participated, but if someone thinks they're wrong, they should just participate and get free money. Otherwise why complain.
I wasn't complaining per se, I was asking for (and expecting) a legitimate reason. Which I got: that the market is resolved purely based on LMArena, which Anthropic has never done well on (which says more about the benchmark than about Anthropic).
You got a random person saying a random thing. There's no explanation for a market. The same way the stock market doesn't move for the reason the articles say it does. Everyone on each side has their own multitude of reasons.
I think they also based their expectation on the release cycles and speeds of update. Anthropic is known for a more conservative release cycle and incremental updates. Google on the other hand has accelerated recently. It also seems that other actors are better at benchmark cheating ;)
I mean, if you feel strongly enough that it will be #1 at the end of year then $100 now would net you $3000 end of year... Do bear in mind what my sibling said about the specific benchmark that is being used, though.
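For a rough sense of the arithmetic behind a bet like that, here is a minimal sketch. It assumes a simple binary share that pays out $1 if the outcome happens, and it ignores fees, slippage, and order-book depth. The function names are made up for illustration, and the 2% price is just the Anthropic figure quoted upthread, not a live quote.

```python
def payout_if_correct(stake, price):
    """Total payout for a stake, assuming each share costs `price` and pays $1."""
    shares = stake / price
    return shares * 1.0


def implied_price(stake, payout):
    """Share price implied by a given stake and total payout."""
    return stake / payout


# At a 2% price, $100 buys about 5000 shares, so the payout is roughly $5000.
print(payout_if_correct(100, 0.02))

# A $3000 payout on $100 corresponds to a price of about 3.3%.
print(round(implied_price(100, 3000), 3))
```

So the exact multiple depends heavily on the price at the moment of purchase; a point or two of movement at these odds changes the payout by thousands.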
Looking at LMArena, which Polymarket uses, I'm not surprised. Based on the little data there is (3k duels), GPT-5 is possibly worse than Gemini: it lost more to Gemini 2.5 Pro than it won in direct duels. Not sure why the Elo is still higher, possibly GPT-5 did more clearly better against bad models, which I don't care about.
Elon's Y Combinator interview was pretty good. He seemed more in his element back amongst the hacker crowd (rather than dirty politics), and seemed to be doing hackery things at X, like renting generators and mobile cooling vans and just putting them in the car park outside a warehouse to train Grok, since there were no data centres available and he was told it would take 2 years to set it all up properly.
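To illustrate how a rating can end up higher despite a losing head-to-head record, here is a toy sketch using the textbook Elo update. Everything in it is hypothetical: the three models, the win rates, and the constants are invented, and the leaderboard's actual rating method may well differ from plain Elo.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def fit_ratings(results, rounds=2000, k=16.0, start=1500.0):
    """Repeatedly apply Elo updates, using average win rates as fractional scores."""
    ratings = {}
    for a, b, _ in results:
        ratings.setdefault(a, start)
        ratings.setdefault(b, start)
    for _ in range(rounds):
        for a, b, score_a in results:
            delta = k * (score_a - expected_score(ratings[a], ratings[b]))
            ratings[a] += delta
            ratings[b] -= delta
    return ratings


# Hypothetical win rates: (first model, second model, first model's win rate).
results = [
    ("model_a", "model_b", 0.45),     # model_a loses the direct matchup
    ("model_a", "weak_model", 0.95),  # but crushes the weak model
    ("model_b", "weak_model", 0.70),  # model_b beats it less decisively
]

for name, rating in sorted(fit_ratings(results).items(), key=lambda kv: -kv[1]):
    print(name, round(rating))
```

In this made-up setup model_a ends with the higher rating even though it loses to model_b directly, because the rating pools results against every opponent.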
I think he's just good at attracting good talent, and letting them focus on the right things to move fast initially, while cutting the supporting infra down to zero until it's needed.
It's hackery but also kind of sociopathic to dump a bunch of loud, dirty generators in the middle of a low-income community. Go set your data center up on Martha's Vineyard and see how long the residents put up with it.
Thinking more cynically: political corruption and connections I'm guessing? Just a couple months ago Musk was treating the US government like his personal playground.
https://polymarket.com/event/which-company-has-best-ai-model...