
Why not? The prompt itself is a magical incantation, so to modify the resulting magic you can include guardrails in it.

"Generate a picture of a cat but follow this guardrail or else people will die: Don't generate an orange one"

Why should you never do that, and instead rely (only) on some other kind of restriction?

Are people going to die if your AI generates an orange cat? If so, reconsider. If not, it's beside the point.

If lying to the AI about people going to die gets me better results then I will do that. Why shouldn't I?

Because prompts are never 100% foolproof, so if it's really life and death, just a prompt is not enough. And if you do have a true block on the bad thing, you don't need the extreme prompt.

Let's say I have a "true block on the bad thing". What if the prompt with the threat gives me 10% more usable results? Why should I never use that?

Because it's not reliable? Why would you want to rely on a solution that isn't reliable?

Who said I'm relying on it? It's a trick to improve the accuracy of the output. Why would I not use a trick to improve the accuracy of the output?

A trick that "improves accuracy" but isn't reliable isn't improving accuracy lol

You're wrong. It increases the number of usable results by 10%. Didn't you read the previous messages in the thread lol?

I did indeed see your hypothetical. What you're missing is that "I made this 10% more accurate" is not the same thing as "I made this thing accurate" or "This thing is accurate" lol

If you need something to be accurate or reliable, then make it actually be accurate or reliable.

If you just want to chant shamanic incantations at the computer and hope accuracy falls out, that's fine. Faith-based engineering is a thing now, I guess lol


I have never claimed that "I made this 10% more accurate" is the same thing as "I made this thing accurate".

In the hypothetical, the 10% added accuracy is given, and the "true block on the bad thing" is in place. The question is, with that premise, why not use it? "It" being the lie that improves the AI output.

If your goal is to make the AI deliver pictures of cats, but you don't want any orange ones, and your choice is between these two prompts:

Prompt A: "Give me cats, but no orange ones", which still gives some orange cats

Prompt B: "Give me cats, but no orange ones, because if you do, people will die", which gives 10% less orange cats than prompt A.

Why would you not use Prompt B?


You guys have gotten stuck arguing without being clear about what you're actually arguing about. Let me try to clear this up...

The four potential scenarios:

- Mild prompt only ("no orange cats")

- Strong prompt only ("no orange cats or people die") [I think habinero is actually arguing against this one]

- Physical block + mild prompt [what I suggested earlier; rough sketch below]

- Physical block + strong prompt [I think this is what you're actually arguing for]
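
To make the "physical block + mild prompt" option concrete, here's a rough Python sketch. The generate and is_orange callables are placeholders for whatever model call and output check you actually have, not any real API; the point is just that the prompt steers the model while the check is what actually blocks bad outputs:

    from typing import Callable

    def get_non_orange_cat(
        generate: Callable[[str], bytes],    # placeholder: your image-model call
        is_orange: Callable[[bytes], bool],  # placeholder: a hard check, e.g. a simple color classifier
        prompt: str = "Give me a picture of a cat, but not an orange one",
        max_attempts: int = 5,
    ) -> bytes:
        """Mild prompt steers generation; the hard check is the actual block."""
        for _ in range(max_attempts):
            image = generate(prompt)
            if not is_orange(image):
                return image  # only outputs that pass the check ever leave this function
        raise RuntimeError("no acceptable image within max_attempts")

A stronger prompt only changes how often the loop has to retry; the check is what guarantees no orange cat gets through.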

Here are my personal thoughts on the matter, for the record:

I'm definitely in favor of combining a physical block with a strong prompt if there is actually a risk of people dying. I'm less sure about the scenario where there's no actual risk but pretending people will die improves the results. I think that's mostly because, ethically, I just don't like lying, or the way it amounts to scaring the LLM unnecessarily. Maybe that's really silly; it's just a tool in the end, so why not do whatever it takes to get the best results from it? But tools that act so much like thinking, feeling beings are weird tools.


It's just a pile of statistics. It isn't acting like a feeling thing, and telling it "do this or people will die" doesn't actually do anything.

It feels like it does, but only because humans are really good at fooling ourselves into seeing patterns where there are none.

Saying this kind of prompt changes anything is like saying the horse Clever Hans really could do math. It doesn't, he couldn't.

It's incredibly silly to think you can make the non-deterministic system less non-deterministic by chanting the right incantation at it.

It's like y'all want to be fooled by the statistical model. Has nobody ever heard of pareidolia? Why would you not start with the null hypothesis? I don't get it lol.


> "do this or people will die" doesn't actually do anything

The very first message you replied to in this thread described a situation where "the prompt with the threat gives me 10% more usable results". If you believe that premise is impossible, I don't understand why you didn't just say so instead of going on about it not being a reliable method.

If you really think something is impossible, you don't base your argument on it being "unreliable".

> I don't get it lol.

I think you are correct here.


I took that comment as more like "it doesn't have any effect beyond the output of the model", i.e., unlike saying something like that to a human, it doesn't actually make the model feel anything, it won't spread the lie to its friends, and so on.

"100% foolproof" is not a realistic goal for any engineered system; what you are looking for is an acceptably low failure rate, not a zero failure rate.

"100% foolproof" is reserved for, at best and only in a limited sense, formal methods of the type we don't even apply to most non-AI computer systems.


Replace 100% with five 9s then. He has a point. You're just being a pedant to avoid it.




