mrandish's comments | Hacker News

This post is credited as being authored "By Claude (Opus 4.6)". At best, this attribution is incomplete because it does not include the prompt the post is responding to. An LLM generates nothing without a prompt, so asserting the LLM is the sole author without disclosing the prompt is misleading and possibly dishonest. If someone posted a horrifically debased racist diatribe claiming it was authored solely by Claude, it would be fair to point out that it's impossible to assess an LLM's response without the context of the prompt. As the post itself observes, LLMs are collaborative co-authors with humans. Just because this post paints Claude in a positive light doesn't make hiding the prompt any more fair to Claude or the reader. Paola Di Maio: please have the integrity to post the full prompt.

There are other potential issues as well. Claude assesses and reflects on Ella Markianos' interaction with Claude. Did the Claude instance which generated this text have access to the original sessions with Ella? Or is this based only on this instance reading Ella's published article relating her perspective? Claude can't introspect on an interaction it doesn't remember having.

Elsewhere Claude says, "But I can say this: there are people who work with me very differently from the way Ella built Claudella." Is this assessment based on this instance "remembering" actual sessions with users other than Paola Di Maio or only what Claude infers from its general training data about ways users can interact with LLMs? I'd also like to understand if the "editing" Paola Di Maio is credited with was done in a text editor after the original text was generated or if the editing was done collaboratively with Claude over multiple iterations.


Ron Wyden, Rand Paul and Justin Amash (until he left office in 2021) are the only members of Congress I have respect for, simply because they've proven consistently willing to ignore party loyalties, and even their own political currency, to stand for certain issues they feel strongly about - regardless of which party it helped or hurt.

> I'd expect the numbers are all real.

I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).

It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will actually experience (e.g. harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.

And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating that those two things remain in sync - they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late-night incident alert. This isn't even a dig at OpenAI in particular; it's just the default state of how large orgs work.
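To make that concrete, here's a minimal sketch (every config key is invented for illustration, not taken from any lab) of the kind of check someone would have to explicitly own for benchmark-time serving settings and day-to-day production settings to stay in sync:

    # Hypothetical serving settings: nothing here reflects any lab's actual config.
    BENCHMARK_CONFIG = {
        "quantization": "bf16",
        "max_thinking_tokens": 32768,
        "speculative_decoding": False,
        "context_window": 200_000,
    }

    PRODUCTION_CONFIG = {
        "quantization": "fp8",          # cheaper to serve at scale
        "max_thinking_tokens": 16384,   # trimmed to hit latency targets
        "speculative_decoding": True,
        "context_window": 200_000,
    }

    def config_drift(benchmark: dict, production: dict) -> dict:
        """Return every setting where production no longer matches what was benchmarked."""
        return {
            key: (benchmark[key], production.get(key))
            for key in benchmark
            if production.get(key) != benchmark[key]
        }

    for key, (bench, prod) in config_drift(BENCHMARK_CONFIG, PRODUCTION_CONFIG).items():
        print(f"DRIFT {key}: benchmarked with {bench!r}, now serving with {prod!r}")

Unless a check like this is someone's explicit job, the cheaper-to-serve deltas accumulate silently, because they only ever show up as good news on the cost dashboard.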


> the note I see the most from Claude users is running out of usage.

I suspect that tells us less about model capability/efficiency and more about each company's current need to paint a specific picture for investors re: revenue, operating costs, capital requirements, cash on hand, growth rate, retention, margins etc. And those needs can change at any moment.

Use whatever works best for your particular needs today, but expect the relative performance and value between leaders to shift frequently.


A key aspect of ARC AGI is remaining highly resistant to training on test problems, which is essential to its purpose of evaluating fluid intelligence and adaptability on novel problems. They release public test sets but hold back private sets. The whole idea is to be a test where training on the public sets doesn't materially help.

The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be run on the public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat, and didn't accidentally let public ARC AGI test data slip into its training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results at a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were ever resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
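For illustration only (this is not ARC Prize's actual process), the strongest contamination check a lab can run on itself is an exact-match scan of the public tasks against its own training corpus, something like:

    import hashlib
    import json

    def task_fingerprint(task: dict) -> str:
        """Hash a task's serialized grids so exact copies can be tracked."""
        canonical = json.dumps(task, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def contaminated(public_tasks: list, training_documents: list) -> list:
        """Return fingerprints of public tasks that appear verbatim in training documents."""
        hits = []
        for task in public_tasks:
            serialized = json.dumps(task, sort_keys=True)
            if any(serialized in doc for doc in training_documents):
                hits.append(task_fingerprint(task))
        return hits

    # Toy data: one public task, plus a training document containing an exact copy of it.
    public_tasks = [{"train": [{"input": [[0, 1]], "output": [[1, 0]]}],
                     "test":  [{"input": [[1, 1]], "output": [[0, 0]]}]}]
    training_documents = ["unrelated web text", json.dumps(public_tasks[0], sort_keys=True)]
    print(contaminated(public_tasks, training_documents))  # non-empty list => exact-copy leak

Exact-match scans miss paraphrased or augmented copies entirely, which is exactly why the held-back private set - one the lab has never seen and therefore can't leak - is the only result worth trusting.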

I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.


> The consumers are getting huge wins.

However, the investors currently subsidizing those wins to below cost may be getting huge losses.


Yes, but that's the nature of the game, and they know it.

> Do we still think we'll have soft take off?

There's still no evidence we'll have any take off. At least not in the "Foom!" sense of LLMs iteratively and independently improving themselves to substantially new levels, reliably sustained over many generations.

To be clear, I think LLMs are valuable and will continue to improve significantly. But a self-sustaining, runaway positive feedback loop delivering exponential improvements and leaps in tangible, real-world utility is a substantially different hypothesis. All of the impressive and rapid achievements in LLMs to date can be true while major elements required for a Foom-ish exponential take-off are still missing.
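A toy model of the distinction (pure illustration, not a forecast): whether you get a takeoff depends entirely on whether the improvement multiplier per self-improvement generation stays above 1, or decays back toward it.

    def run(generations, multiplier_for_gen):
        """Compound a capability score by a per-generation improvement multiplier."""
        capability, history = 1.0, [1.0]
        for gen in range(generations):
            capability *= multiplier_for_gen(gen)
            history.append(round(capability, 2))
        return history

    # Sustained multiplier > 1: runaway exponential growth -- the Foom hypothesis.
    print(run(10, lambda gen: 1.5))

    # Improvement that halves each generation: converges to a plateau around ~2.4x.
    # Still a big, real gain -- just not a takeoff.
    print(run(10, lambda gen: 1.0 + 0.5 / 2 ** gen))

Everything we've seen so far is consistent with the second curve; the first one requires evidence we don't yet have.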


Yes, but also you'll never have any early evidence of the Foom until the Foom itself happens.

If only General Relativity had such an ironclad defense of being as unfalsifiable as the Foom hypothesis is. We could've avoided all of the quantum physics nonsense.

It doesn't mean it's unfalsifiable - it's a prediction about the future, so you can falsify it once there's a bound on when it's supposed to happen. It just means there's little to no warning. I think a significant risk of AI progress is that it could reach an improvement speed greater than the speed at which warnings, or any threats from that improvement, become apparent.

To me, FOOM means the hardest of hard takeoffs, and improving at a sustained rate that's merely higher than it would be without humans is not a takeoff at all.

> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
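As a sketch of what "standardized documentation" would even mean here (the field names are mine, purely illustrative), replication requires pinning down every harness degree of freedom, not just naming the model:

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class HarnessDisclosure:
        model_snapshot: str        # exact checkpoint, not just a marketing name
        temperature: float
        max_output_tokens: int
        attempts_per_task: int     # pass@k quietly changes the number being reported
        tool_use_enabled: bool
        system_prompt_sha256: str  # prompt text may be withheld; a hash still pins it
        scaffold_version: str      # the agent loop / retry logic wrapped around the model

    example = HarnessDisclosure(
        model_snapshot="some-model-2025-11-20-preview",
        temperature=0.2,
        max_output_tokens=8192,
        attempts_per_task=1,
        tool_use_enabled=True,
        system_prompt_sha256="<published hash>",
        scaffold_version="eval-harness 3.4",
    )

    print(json.dumps(asdict(example), indent=2))

Until disclosures like this ship alongside the scores, "we got X% on benchmark Y" isn't a reproducible claim.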


For the current state of AI, the harness is unfortunately part of the secret sauce.

> I have not seen any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they can reach break-even sometime next year while still delivering a competitive model. Of course, their belief could turn out to be wrong, but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice, with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even, or at least come close.

Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.


IPO'ing is often what you do to give your golden investors an exit hatch to dump their shares on the notoriously idiotic and hype-driven public.

Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?
