
I wonder if the reason the models have trouble with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in "strawberry". They are fluent in English audio tokens, but not written tokens.
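A quick way to see the mismatch — a minimal sketch, assuming the tiktoken library and its cl100k_base encoding (the exact split varies by tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    # The model receives a handful of subword ids, not ten characters,
    # so the individual r's are never directly visible to it.
    print([enc.decode_single_token_bytes(t) for t in tokens])
    # something like [b'str', b'aw', b'berry']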



The way LLMs get it right by counting the letters, then change their answer at the last second, makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a reddit thread) that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific reddit username that would crash chatgpt).


Agree. We've given them a different alphabet than ours.

They speak a different language that captures the same meaning, but has different units.

Somehow they need to learn that their unit of thought is not the same as our unit of speech, so that they can map these questions onto a different alphabet.

That's my two cents.


Do they also find ARC AGI tough for the same reason? I've seen some examples where the input was an ASCII-art version of the actual image.


The amazing thing continues to be that they can ever answer these questions correctly.

It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.


Is that really true? Like, the data scientists making these tools aren't sure why certain patterns emerge? That's kind of wild.


Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.
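To make the number weirdness concrete, a minimal sketch with tiktoken (assuming the cl100k_base encoding; digit grouping differs across tokenizers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # Long numbers are broken into multi-digit chunks, so a digit's
    # place value depends on how the chunks happen to fall.
    for n in ["7", "1234567"]:
        parts = [enc.decode_single_token_bytes(t) for t in enc.encode(n)]
        print(n, "->", parts)
    # e.g. "1234567" -> [b'123', b'456', b'7'] under this encoding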


I believe DeepSeek models do split numbers up into individual digits, and this provides a large boost to their ability to do arithmetic. I would hope that it's the standard now.


Could be the case; I'm not familiar with their specific tokenizers. IIRC Llama 3 tokenizes numbers in chunks of three digits. That seems better than arbitrarily sized chunks with BPE, but it's still kind of odd: the embedding layer has to learn the semantics of 1000 different number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs 1.
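A toy sketch of the two schemes discussed above (plain Python, not any model's actual tokenizer):

    # Illustrative only -- not the real DeepSeek or Llama 3 tokenizer.
    def per_digit(num: str) -> list[str]:
        # DeepSeek-style: every digit becomes its own token.
        return list(num)

    def three_digit_chunks(num: str) -> list[str]:
        # Llama-3-style (roughly): greedy left-to-right groups of up to 3.
        return [num[i:i + 3] for i in range(0, len(num), 3)]

    print(per_digit("1001"))           # ['1', '0', '0', '1']
    print(three_digit_chunks("1001"))  # ['100', '1']
    print(three_digit_chunks("007"))   # ['007'] -- same value as the token '7'
    # Under chunking, '001' and '1' are distinct tokens whose meanings the
    # embedding layer has to learn separately, for ~1000 such chunks.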



