
I wonder if the reason the models have trouble with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in "strawberry". They are fluent in English audio tokens, but not written tokens.
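A quick way to see the mismatch — a minimal sketch, assuming the tiktoken library and its cl100k_base encoding (the exact split varies by tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    # The model receives a handful of subword ids, not ten characters,
    # so the individual r's are never directly visible to it.
    print([enc.decode_single_token_bytes(t) for t in tokens])
    # something like [b'str', b'aw', b'berry']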



The way LLMs get it right by counting the letters, then change their answer at the last second, makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a reddit thread) that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific reddit username that would crash chatgpt).


Agree. We've given them a different alphabet than ours.

They speak a different language that captures the same meaning, but has different units.

Somehow they need to learn that their unit of thought is not the same as our unit of speech, so that they can map these questions onto a different alphabet.

That's my two cents.


Do they also find ARC AGI tough for the same reason? I've seen some examples where the input was an ASCII-art version of the actual image.


The amazing thing continues to be that they can ever answer these questions correctly.

It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.


Is that really true? Like, the data scientists making these tools aren't sure why certain patterns emerge? That's kind of wild.


Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.
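To make the number weirdness concrete, a minimal sketch with tiktoken (assuming the cl100k_base encoding; digit grouping differs across tokenizers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # Long numbers are broken into multi-digit chunks, so a digit's
    # place value depends on how the chunks happen to fall.
    for n in ["7", "1234567"]:
        parts = [enc.decode_single_token_bytes(t) for t in enc.encode(n)]
        print(n, "->", parts)
    # e.g. "1234567" -> [b'123', b'456', b'7'] under this encoding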


I believe DeepSeek models do split numbers up into individual digits, and this provides a large boost to their ability to do arithmetic. I would hope that it's the standard now.


Could be the case; I'm not familiar with their specific tokenizers. IIRC Llama 3 tokenizes numbers in chunks of three digits. That seems better than arbitrarily sized chunks with BPE, but it's still kind of odd: the embedding layer has to learn the semantics of 1000 different number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs 1.
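A toy sketch of the two schemes discussed above (plain Python, not any model's actual tokenizer):

    # Illustrative only -- not the real DeepSeek or Llama 3 tokenizer.
    def per_digit(num: str) -> list[str]:
        # DeepSeek-style: every digit becomes its own token.
        return list(num)

    def three_digit_chunks(num: str) -> list[str]:
        # Llama-3-style (roughly): greedy left-to-right groups of up to 3.
        return [num[i:i + 3] for i in range(0, len(num), 3)]

    print(per_digit("1001"))           # ['1', '0', '0', '1']
    print(three_digit_chunks("1001"))  # ['100', '1']
    print(three_digit_chunks("007"))   # ['007'] -- same value as the token '7'
    # Under chunking, '001' and '1' are distinct tokens whose meanings the
    # embedding layer has to learn separately, for ~1000 such chunks.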



