Yet you are unavoidably eating microplastics too, which have been linked to adverse cardiovascular (CV) events.
Also:
- If you are eating more fish (as opposed to eating meat), you are likely consuming more mercury.
- If you are eating more fresh veggies you are probably ingesting more pesticides.
- If you are eating dark chocolate for its health benefits, you are also ingesting cadmium and other heavy metals.
So all of the above should be done in moderation. Even things that seem like an unalloyed good can be dangerous. A burst of exercise beyond your conditioning can lead to a CV event. Too much water can be poisonous. Some people get constipation from too many veggies in their diet.
For example, instead of sticking to a narrow, faddish, supposedly healthy diet, you can enjoy a wide range of foods, which makes it more likely you are getting all the nutrients that will do you good (of course clearly unhealthy food should be avoided).
The body is more complex than we can ever know. There are some general principles for good health (including CV health) that should be followed, but to me it is clear that good health does not arise from a slavish devotion to a very detailed set of rules.
Funnily enough, I've heard that one reason obesity is so prevalent is that we have too many varieties of food. It seems our hunger regulation suspends satiety when we eat too much of one food, but when we eat a little of lots of different foods, that mechanism breaks down.
It'd be funny if lots of fad diets actually work because people are forced to eat a single type of food, and that alone is enough for satiety to kick in.
Your post sounds like "bad things can happen so why bother". Having a good diet isn't "slavish devotion", it's more like "don't eat something obviously terrible"
Nothing will really work when the models fail at the most basic of reasoning challenges.
I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences, and still see them come to the opposite conclusion, and my instructions are nothing complex at all.
I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.
I notice that sometimes the model will be in a good state, and do a long chain of edits of good quality. The problem is, it's still a crap-shoot how to get them into a good state.
In my experience this was an issue 6-8 months ago. Ever since Sonnet 4 I haven’t had any issues with instruction following.
Biggest step-change has been being able to one-shot file refactors (using the planning framework I mentioned above). 6 months ago refactoring was a very delicate dance and now it feels like it’s pretty much streamlined.
I recently ran into two baffling, what felt like GPT-3.5-era, completely backwards misinterpretations of an unambiguous sentence, once each in Codex and CC/Sonnet, a few days apart and in completely different scenarios (both very early in the context window). And to be fair, they were notable partially as an "exception that proves the rule" where it was surprising to see, but OP's example can definitely still happen in my experience.
I was prepared to go back to my original message and spot an obvious-in-hindsight grey area/phrasing issue on my part as the root cause but there was nothing in the request itself that was unclear or problematic, nor was it buried deep within a laundry list of individual requests in a single message. Of course, the CLI agents did all sorts of scanning through the codebase/self debate/etc in between the request and the first code output. I'm used to how modern models/agents get tripped up by now so this was an unusually clear cut failure to encounter from the latest large commercial reasoning models.
In both instances, literally just restating the exact same request with "No, the request was: [original wording]" was all it took to steer them back, and it didn't become a concerning pattern. But with the unpredictability of how the CLI agents decide to traverse a repo and ingest large amounts of distracting code/docs, it seems much too overconfident to believe that random, bizarre LLM "reasoning" failures won't still occur from time to time in regular usage, even as models improve, given their inherent limitations.
(If I were bending over backwards to be charitable/anthropomorphize, it would be the human failure mode of "I understood exactly what I was asked for and what I needed to do, but then somehow did the exact opposite, haha oops brain fart!" but personally I'm not willing to extend that much forgiveness/tolerance to a failure from a commercial tool I pay for...)
It's complicated. Firstly, don't love that this happens. But the fact you're not willing to provide tolerance to a commercial tool that costs maybe a few hundred bucks a month but are willing to do so for a human who probably costs thousands of bucks a month is revealing of a double standard we're all navigating.
It's like the fallout when a Waymo kills a "beloved neighborhood cat". I'm not against cats, and I'm deeply saddened at the loss of any life, but if it's true that, comparable mile for mile, Waymos reduce deaths and injuries, that is a good thing - even if they don't reduce them to zero.
And to be clear, I often feel the same way - but I am wondering why and whether it's appropriate!
For me I was just pointing out some interesting and noteworthy failure modes.
And it matters. If the models struggle sometimes with basic instruction following, they can quite possibly make insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.
The thing about good abstractions is that you should be able to trust them in a composable way. The simpler or more low-level the building blocks, the more reliable you should expect them to be. With LLMs you can't really make this assumption.
I mean, we typically architect systems depending on humans around an assumption of human fallibility. But when it comes to automation, randomly still doing the exact opposite even if somewhat rare is problematic and limits where and at what scale it can be safely deployed without needing ongoing human supervision.
For a coding tool it's not as problematic, as hopefully you vet the output to some degree, but it still means I don't feel comfortable using them as expansively (like the mythical personal assistant doing my banking and replying to emails, etc.) as they might otherwise be used with more predictable failure modes.
I’m perfectly comfortable with Waymo on the other hand, but that would probably change if I knew they were driven by even the newest and fanciest LLMs as [toddler identified | action: avoid toddler] -> turns towards toddler is a fundamentally different sort of problem.
I'm curious in what kinds of situations you are seeing the model consistently do the opposite of your intention where the instructions were not complex. Do you have any examples?
Mostly Gemini 3 Pro. When I ask it to investigate a bug and provide options for fixing it (I do this mostly so I can see whether the model loaded the right context for large tasks), Gemini immediately starts fixing things, and I just can't trust it.
Codex and Claude give a nice report, and if I see they're not considering this or that, I can tell them.
But why is it a big issue? If it does something bad, just reset the worktree and try again with a different model/agent. They are dirt cheap at $20/month, and I have 4 subscriptions (Claude, Codex, Cursor, Zed).
Same, I have multiple subscriptions and layer them. I use Haiku to plan and send a queue of tasks to Codex and Gemini, whose command lines can be scripted.
The issue to me is that I have no idea what the code looks like, so I have to have a reliable first-layer model that can summarize the current codebase state, so I can decide whether the next mutation moves the project forward or reduces technical debt. I can delegate much more that way, while Gemini's "do first" approach tends to result in many dead ends that I have to unravel.
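For what it's worth, the "command lines can be scripted" part can be as small as a driver script. The sketch below is my own illustration of the shape of that workflow, not the setup described above; the CLI invocations (`claude -p`, `codex exec`, `gemini -p`) and the assumption that the planner returns clean JSON are mine and may not match your tool versions.

```python
import json
import subprocess

def run(cmd: list[str]) -> str:
    # Run a CLI agent in non-interactive mode and capture its stdout.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# 1. Ask the cheap "planner" model for a machine-readable task list.
#    Command and flags are assumptions; substitute whatever planner CLI you use.
plan = run(["claude", "-p",
            "Break this refactor into a JSON array of short task descriptions: ..."])
tasks = json.loads(plan)  # assumes the planner actually returned clean JSON

# 2. Queue each task to a coding agent, alternating between subscriptions.
#    "codex exec" and "gemini -p" are assumed to be the non-interactive modes;
#    exact flags may differ between versions.
for i, task in enumerate(tasks):
    agent = ["codex", "exec"] if i % 2 == 0 else ["gemini", "-p"]
    print(run(agent + [f"Implement exactly this task, then stop: {task}"]))
```

The useful part is the separation of roles: the cheap model plans and summarizes, while the heavier agents only ever see one small, bounded task at a time.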
The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.
The thing about good abstractions is that you should be able to trust them in a composable way. The simpler or more low-level the building blocks, the more reliable you should expect them to be. With LLMs you can't really make this assumption.
I'm not sure you can make that assumption even when a human wrote that code. LLMs are competing with humans not with some abstraction.
> The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.
Yes, that's why we review all code even when written by humans.
Yes, the move to 48 countries is driven by greed, but it kind of makes the tournament unwieldy. The previous 32-country format was, I think, optimal, though some would argue that even that was too many.
Currently they are just milking the spectacle. Maybe in the future all countries will be allowed in (no playoffs), and together with sky-high ticket prices, this will ensure the maximum payoff.
Of course it was; but it's also a valid observation that the capitalist aspect has jumped the shark. Qatar was my shark, but I was also dismayed by the South Africa finals, when the organizers banned private street food vendors near the venues, because they supposedly competed with Official World Cup Sponsors. Those vendors had always done business in those locations. More to the point of this story, tickets to live events in general have become exclusionary to anyone who is not wealthy, and now those live events also have become difficult to find on the air.
I don't know. I actually find it harder and more stressful to write code in a way that does not meet a certain quality level. It requires me to actually think more.
It's kind of weird, but I have tried over the years to develop a do-just-what-is-necessary-now mindset in my software engineering work, and I just can't make my mind work that way.
For me, doing things right is a way to avoid having to hold too much context in my head while working on my projects. I know the idiomatic way to do something, and if I just do it that way, then when I come back to it I know how it should be, and is, architected.
That's pretty obvious to anyone who has had to maintain a high-traffic site. This is just the tip of the iceberg (I haven't even included legal and other issues):
1.1 Strong protection against account takeover
Email change is one of the most abused recovery vectors in account takeover (ATO). Eliminating email changes removes:
- Social-engineering attacks on support
- SIM-swap → email-change chains
- Phished session → email swap → lockout of real user
Attacker must compromise the original inbox permanently, which is much harder.
1.2 No “high-risk” flows
Email change flows are among the highest-risk product flows:
- Dual confirmation emails
- Cooldown periods
- Rollback windows
- Manual reviews
Fixed email removes an entire class of security-critical code paths.
1.3 Fewer recovery attack surfaces
No need for:
- “I lost access to my email” flows
- Identity verification uploads
- Support-driven ownership disputes
Every recovery mechanism is an attack surface; removing them reduces risk.
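To make the "high-risk flows" point concrete, here is a minimal sketch, purely illustrative and not taken from the comment above, of the dual-confirmation plus cooldown machinery a site needs once it does allow email changes. All names (EmailChangeRequest, COOLDOWN, etc.) are hypothetical.

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timedelta

COOLDOWN = timedelta(days=7)  # rollback window before the change becomes final

@dataclass
class EmailChangeRequest:
    user_id: str
    old_email: str
    new_email: str
    old_token: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    new_token: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    old_confirmed: bool = False
    new_confirmed: bool = False
    requested_at: datetime = field(default_factory=datetime.utcnow)

    def confirm(self, token: str) -> None:
        # Both inboxes must confirm: the old one proves the legitimate owner
        # agreed, the new one proves the new address is reachable.
        if secrets.compare_digest(token, self.old_token):
            self.old_confirmed = True
        elif secrets.compare_digest(token, self.new_token):
            self.new_confirmed = True

    def can_finalize(self, now: datetime) -> bool:
        # Only apply the change after both confirmations and the cooldown,
        # so a change made from a phished session can still be rolled back.
        return (self.old_confirmed and self.new_confirmed
                and now - self.requested_at >= COOLDOWN)
```

Every branch of that flow (expired tokens, one-sided confirmation, rollback during the cooldown, support overrides) is something security and support have to reason about; pinning the email deletes all of it.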
You're very wrong, because account takeover can still happen due to a compromised email account. People can and do permanently lose access to their email account to a third party.
Having worked in security on a fairly high-profile, highly visible, widely used product, I can say one of the fundamental decisions that paid off very well was intentionally including mechanisms to prevent issues at other businesses (like Google) from impacting what users could do with ours.
Not having email change functionality would have been a huge usability, security, and customer service nightmare for us.
Regardless of anything else, not enabling users to change their email address effectively binds them to business with a single organization. It also ignores the fact that people can and do change emails for entirely opaque reasons from the banal to the authentically emergent.
ATO attacks are a fig leaf for such concerns, because you, as an organization, always have the power to revert a change to contact information. You just need to establish a process. It takes some consideration and some tabletop exercises, but it's not rocket science for a competent team.
What logical fallacy, exactly? I think you're perhaps misunderstanding the conversation. This translates just fine to your proposed analogy.
In your analogy, the claim would be that some online account is tied to a laptop and whoever possesses the laptop has access to that account. The online service does not permit the account owner to revoke access from that laptop and move the account to a different laptop. I stand by my statement that this would be a serious security hazard. Because yes, laptops can and do get hacked or stolen, just like email addresses.
Where your analogy isn't quite as strong is that at least you can generally add additional anti-theft protections such as full-disk encryption to a laptop, while with an email account generally 2FA is the best you can do.
> Attacker must compromise the original inbox permanently, which is much harder
This may need further analysis. I'd guess that a significant fraction of the people that want to change the email address that identifies them to a service want to do so because they have a new email address that they are switching to.
Many of those will be people who lose access to the old email address after switching. For example people who were using an email address at their ISP's domain who are switching ISPs, or people who use paid email hosting without a custom domain and are switching to a different email provider.
A new customer of that old provider might then be able to get that old address. You'd think providers would obviously never allow addresses used by former customers to be reused, but nope, some do. Even some that you'd expect to not do so, such as mailbox.org [1] and fastmail.com, allow addresses to be recycled.
The funny thing is that if you ask Claude if you should use email address as a primary key it will pretty adamantly warn you away from it:
> I'd recommend against using email as the primary key for a large LLM chat website. Here's why:
> Problems with email as primary key:
> 1. Emails change - Users often want to update their email addresses. With email as PK, you'd need to cascade updates across all related tables (chat sessions, messages, settings, etc.), which is expensive and error-prone
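To illustrate the advice it's giving, here is a minimal sketch of the usual alternative: a surrogate key as the identity, with the email as a mutable unique attribute. The table and column names are assumed for illustration, not taken from any real schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id TEXT PRIMARY KEY,             -- surrogate key (UUID), never changes
    email TEXT NOT NULL UNIQUE       -- mutable attribute, not an identifier
);
CREATE TABLE chat_sessions (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL REFERENCES users(id)  -- FK to the stable id, not the email
);
""")

user_id = str(uuid.uuid4())
conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, "old@example.com"))
conn.execute("INSERT INTO chat_sessions (id, user_id) VALUES (?, ?)",
             (str(uuid.uuid4()), user_id))

# Changing the email touches exactly one row; nothing in chat_sessions,
# messages, settings, etc. has to be rewritten.
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("new@example.com", user_id))
conn.commit()
```

The same shape works with bigserial ids or UUIDv7; the point is only that identity and contact address are separate concerns, so the address can change (or be frozen, per the argument above) without touching any keys.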
Maybe if they can pinpoint its whereabouts at a specific time when it's not heavily guarded, they can send a team to snatch it with minimal casualties.