Disclaimer: I am very well aware this is not a valid test or indicative of anything else. I just thought it was hilarious.
When I asked the usual "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that it's 2. It counts properly, and then keeps telling itself that can't be right.
Skynet sends Terminator to eradicate humanity, the Terminator uses this as its internal reasoning engine... "instructions unclear, dick caught in ceiling fan"
It's funny because this simple exercise shows all the problems I have with the reasoning models: they produce long reasoning that takes too much time to verify and still can't be trusted.
I may be looking at this too deeply, but I think this suggests that the reasoning is not always utilized when forming the final reply.
For example, IMMEDIATELY, in its first section of reasoning where it starts counting the letters:
> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.
1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)
2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence
3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning
To me that suggests an inherent weight (unintended pun) given to its "intrinsic" knowledge, as opposed to what is presented during the reasoning.
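For reference, a trivial Python check of what the letter-by-letter count actually gives (matching the positions 3, 8, and 9 it keeps finding):

```python
word = "strawberry"

# 1-based positions of every 'r' in the word
positions = [i + 1 for i, ch in enumerate(word) if ch.lower() == "r"]

print(positions)       # [3, 8, 9]
print(len(positions))  # 3
```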
Strawberry is "difficult" not because the reasoning is difficult, but because tokenization doesn't let the model reason at the level of characters. That's why it has to work so hard and doesn't trust its own conclusions.
Yeah, but it clearly breaks down the spelling correctly in its reasoning, e.g. a letter per line. So it gets past the tokenization barrier, but still gets hopelessly confused.
How excellent for a quantized 27GB model (the Q6_K_L GGUF quantization type uses 8 bits per weight in the embedding and output layers, since they're sensitive to quantization).
I wonder if the reason the models have problems with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in strawberry. They are fluent in English audio tokens, but not written tokens.
The way LLMs get it right by counting the letters, then change their answer at the last second, makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a Reddit thread) that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific Reddit username that would crash ChatGPT).
The amazing thing continues to be that they can ever answer these questions correctly.
It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.
Is that really true? Like, the data scientists making these tools are not confident why certain patterns are revealing themselves? That’s kind of wild.
Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.
I believe DeepSeek models do split numbers up into digits, and this provides a large boost to ability to do arithmetic. I would hope that it's the standard now.
Could be the case; I'm not familiar with their specific tokenizers. IIRC Llama 3 tokenizes numbers in chunks of three digits. That seems better than arbitrarily sized chunks with BPE, but still kind of odd: the embedding layer has to learn the semantics of 1000 different number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs 1.
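A toy sketch of the two schemes (not any real tokenizer's actual rules, just the idea):

```python
def digit_tokens(s: str) -> list[str]:
    # One token per digit (the scheme attributed to DeepSeek above).
    return list(s)

def chunked_tokens(s: str, size: int = 3) -> list[str]:
    # Groups of up to `size` digits (roughly the Llama-3-style scheme mentioned
    # above; real tokenizers differ in details like chunking direction).
    return [s[i:i + size] for i in range(0, len(s), size)]

print(digit_tokens("1234567"))    # ['1', '2', '3', '4', '5', '6', '7']
print(chunked_tokens("1234567"))  # ['123', '456', '7']

# With chunking, '001' and '1' end up as distinct tokens whose meanings overlap
# only sometimes, which the embedding layer has to learn separately.
print(chunked_tokens("001"), chunked_tokens("1"))  # ['001'] ['1']
```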
This was my first prompt after downloading too, and I got the same thing. Just spinning again and again on its gut instinct that there must be 2 R's in strawberry, despite the counting always being correct. It just won't accept that the word is spelled that way and that its logic is correct.
That gut-feeling approach is very human-like. You have a bias, and even when the facts say you are wrong, you think there must be a mistake, because your original bias is so strong.
Maybe we need a dozen LLMs with different biases. Let them try to convince the main reasoning LLM that it’s wrong in various ways.
Or just have an LLM that is trained on some kind of critical thinking dataset where instead of focusing on facts it focuses on identifying assumptions.
1/3 chance you picked the door with the car, 2/3 chance it's behind one of the other two doors.
These probabilities don't change just because one of the doors is subsequently opened.
So, Monty now opens one of the other 2 doors and the car isn't there, but there is still a 2/3 chance that it's behind ONE of those 2 other doors, and having eliminated one of them, this means there's a 2/3 chance it's behind the other one!!
So, do you stick with your initial 1/3 chance of being right, or go with the other closed door that you NOW know (new information!) has a 2/3 chance of being right?!
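If anyone wants to check this empirically, here's a minimal simulation sketch of the standard rules (the host knows where the car is and never opens it):

```python
import random

def play(num_doors: int = 3, switch: bool = True) -> bool:
    doors = range(num_doors)
    car = random.randrange(num_doors)
    pick = random.randrange(num_doors)

    # The host opens every door except your pick and one other,
    # never revealing the car. That one other closed door is the car
    # whenever your pick was a goat, otherwise a random goat.
    if pick == car:
        other = random.choice([d for d in doors if d != pick])
    else:
        other = car

    return (other if switch else pick) == car

trials = 100_000
for switch in (False, True):
    wins = sum(play(3, switch) for _ in range(trials))
    print(f"switch={switch}: {wins / trials:.3f}")  # ~0.333 staying, ~0.667 switching
```

Running the same sketch with num_doors=100 gives roughly 0.01 for staying and 0.99 for switching, which is the 100-door version discussed below.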
If you get to pick one and he opens 98 of the remaining ones, obviously you would switch to the remaining one you didn't pick, since 99/100 times the winning door will be in his set.
On the initial choice, yes. But on the second choice, that other door is a single door carrying the sum of the odds of the other 99 doors. So your second choice would be to keep the door you initially chose (1/100) or select the other door (99/100).
Remember, the host always knows which is the correct door, and if you selected incorrectly on the initial choice they will ALWAYS select the correct door for the second choice.
I thought it would be obvious that I’m not arguing the statistical facts, but the idea that “it is easier to think about” the 100 doors scenario. There is simply no straightforward explanation that works for laypeople.
I think the issue most laypeople have is accepting that the host opening a door changes the odds of winning, because he knows where the prize is.
I think the easiest way to demonstrate that this is true is to play the same game with two doors, except the host doesn't open the other door if it has the prize behind it. This makes it obvious that the act of opening the door changes the probability of winning, because if the host opens the other door, you now have 100% chance of winning if you don't switch. Similarly, if they don't open the other door, you have a 0% chance of winning, and should switch. It's the fact that the host knows and chooses that is important.
It's only once you get over that initial hurdle that the 100 door game becomes "obvious". You know from the two door example that the answer isn't 50/50, and so the only answer that makes sense is that the probability mass gets concentrated in the other door.
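A related way to make that concrete (a hedged sketch of the classic "ignorant host" variant, not the two-door game described above): compare a host who knowingly avoids the car with one who opens an unchosen door at random. Conditioned on a goat being revealed, the ignorant host leaves you at 50/50, while the knowing host leaves switching at 2/3.

```python
import random

def trial(host_knows: bool):
    car = random.randrange(3)
    pick = random.randrange(3)
    others = [d for d in range(3) if d != pick]

    if host_knows:
        # The knowing host always opens a goat door
        # (which goat he picks doesn't affect the result).
        opened = others[0] if others[0] != car else others[1]
    else:
        # The ignorant host opens a random unchosen door and may reveal the car.
        opened = random.choice(others)
        if opened == car:
            return None  # game spoiled; discard this round

    stay_wins = pick == car
    switch_wins = next(d for d in others if d != opened) == car
    return stay_wins, switch_wins

def rates(host_knows: bool, trials: int = 100_000):
    results = [r for r in (trial(host_knows) for _ in range(trials)) if r is not None]
    stay = sum(s for s, _ in results) / len(results)
    switch = sum(w for _, w in results) / len(results)
    return stay, switch

print(rates(host_knows=True))   # roughly (0.33, 0.67)
print(rates(host_knows=False))  # roughly (0.50, 0.50)
```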
It's probably easier for most people not to think of them as two remaining doors, but as two remaining sets. Originally, with one hundred doors and the goal object behind only one of them, there is a 1/100 probability it is behind the initially chosen door, which comprises one set, and a 99/100 probability it is behind one of the doors in the set of not-originally-chosen doors. If 98 of the 99 doors in that second set are then ruled out as hiding the goal object, this does not change the 99/100 probability that the goal object is behind a door in that set; it just means it isn't behind one of the opened doors.
Chasing this tangent a bit -- I have never been happy with the Monty Hall problem as posed.
To me the problem is that it is posed as a one-shot question. If you were in this actual situation, how do you know that Monty is not deliberately trying to make you lose? He could, for example, have just let you open the first door you picked, revealing the goat. But if he instead chose to ask you to switch, maybe that is a big hint that you picked the right door the first time?
If the game is just "you will pick a door, he will reveal another door, and then you can choose to switch", then clearly the "usual" answer is correct: always switch, because the only way switching loses is if you guessed correctly the first time (1/3).
But if the game is "try to find the car while the host tries to make you lose" then you should never switch. His ideal behavior is that if you pick the door with the goat then he gives you the goat; if you pick the door with the car then he tries to get you to switch.
If his desire is for the contestant to lose, then formally he can't do better than winning 2/3 of the time, which he gets by simply opening the door they chose. In practice, opening the contestant's door whenever it hides a goat and offering a switch whenever it hides the car can do slightly better than 2/3, because some contestants, unaware of his strategy and objectives, might choose to switch.
If his objective is more subtle -- increasing suspense or entertainment value or getting a kick out of people making a self-destructive choice or just deciding whether he likes a contestant -- then I'm not sure what the metrics are or what an optimal strategy would be in those cases.
Given that his motives are opaque and given no history of games upon which to even inductively reason, I don't think you can reach any conclusion about whether switching is preferable. Given the spread of possibilities I would tend to default to 50/50 for switch/no-switch, but I don't have a formal justification for this.
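Going back to the simple adversarial case above, here's a rough sketch under the stated assumption that the host opens your door immediately when you picked a goat and only offers a switch when you picked the car; it just confirms that against this host a switcher always loses, while a non-switcher keeps the baseline 1/3:

```python
import random

def adversarial_round(contestant_switches: bool) -> bool:
    car = random.randrange(3)
    pick = random.randrange(3)

    if pick != car:
        # The host simply opens the contestant's door: instant loss.
        return False

    # The contestant picked the car; the host opens a goat door and offers a switch.
    return not contestant_switches  # switching away from the car always loses

trials = 100_000
for switches in (False, True):
    wins = sum(adversarial_round(switches) for _ in range(trials))
    print(f"always switch={switches}: win rate {wins / trials:.3f}")
# never switch: ~0.333, always switch: 0.000
```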
I think it's great that you can see the actual chain of thought behind the model, not just the censored one from OpenAI.
It strikes me that it's both so far from getting it correct and also so close. I'm not an expert, but it feels like it could be just an iteration away from being able to reason through a problem like this. Which, if true, is an amazing step forward.
How long until we get to the point where models know that LLMs get this wrong, and that it is an LLM, and therefore answers wrong on purpose? Has this already happened?
(I doubt it has, but there ARE already cases where models know they are LLMs, and therefore make the plausible but wrong assumption that they are ChatGPT.)
My understanding is that the model does not "know" it is an LLM. It is prompted (in the app's system prompt) or trained during RLHF to answer that it is an LLM.
I think there is an inherent weight associated with the intrinsic knowledge as opposed to the reasoning steps, since intrinsic knowledge can override reasoning.
I feel like one round of RL could potentially fix "short circuits" like these. It seems to be convinced that a particular rule isn't "allowed," when it's totally fine. Wouldn't that mean that you just have to fine tune it a bit more on its reasoning path?
If I asked you, "Hey, how many Rs in strawberry?", you're going to tell me 2, because the likelihood is I'm asking about the ending Rs. That's at least how I'd interpret the question without the "LLM test" clouding my vision.
Same if I asked about "gullible": I'd say "it's a double L after the u".
This is from a small model. 32B and 70B answer this correctly. "Arrowroot" too. Interestingly, 32B's "thinking" is a lot shorter and it seems to be more "sure". Could be because it's based on Qwen rather than LLaMA.
My models are both 4 bit. But yeah, that could be - small models are much worse at tolerating quantization. That's why people use LoRA to recover the accuracy somewhat even if they don't need domain adaptation.
How would they build guardrails for this? In CFD (physical simulation with ML), they talk about using physics-informed models instead of purely statistical ones. How would they make language models that are informed by formal rules and concepts of English?
https://gist.github.com/IAmStoxe/1a1e010649d514a45bb86284b98...