
I don't doubt that training an LLM, and curating a training set, is a black art. Conventional wisdom was that up until a few years ago there were only a few dozen people in the world who knew all the tricks.

However, that is not what we were discussing.

You keep flip-flopping on how you think these successfully trained frontier models are working and managing to predict the character-level sequences represented by multi-character tokens: one minute you say it's due to having learnt from an enormous amount of data, and the next you say they must be using a split function (if that's the silver bullet, I wonder why you aren't using one yourself).

Near the top of this thread you opined that failure to count r's in strawberry is "Because they can't break down a token or have any concept of it". It's a bit like saying that birds can't fly because they don't know how to apply Bernoulli's principle. Wrong conclusion, irrelevant logic. At least now you seem to have progressed to (on occasion) admitting that they may learn to predict token -> character sequences given enough data.
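To make the disagreement concrete, here is a toy sketch (my own illustration, not any real model's tokenizer) of the situation being argued about: a BPE-style vocabulary maps multi-character chunks to opaque integer IDs, so the character count is not directly visible in the model's input, but nothing in principle stops a model from learning the token-to-spelling mapping.

```python
# Hypothetical vocabulary: these merges and IDs are made up for illustration.
vocab = {"straw": 412, "berry": 879}

def tokenize(word):
    # Greedy longest-match split, a simplification of real BPE merging.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError("no token for " + word[i])
    return tokens

print(tokenize("strawberry"))    # the model sees only these two IDs
print("strawberry".count("r"))   # the answer requires character access
```

The model's input is `[412, 879]`; whether it can answer "how many r's?" depends on whether it has learned, from training data, what characters those IDs stand for.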

If I happen into a few million dollars of spare cash, maybe I will try to train a frontier model, but frankly it seems a bit of an expensive way to verify that if done correctly it'd be able to spell "strawberry", even if using a penny-pinching tokenization scheme.




Nope, the right analogy is: "it's like saying a model will find it difficult to tell you what's inside a box because it can't see inside it". Shaking it, weighing it, measuring if it produces some magnetic field or whatever is what LLMs are currently doing, and often well.

The discussion was around the difficulty of doing it with current tokenization schemes vs. character-level ones. No one said it was impossible. It's possible to train an LLM to do arithmetic with decent-sized numbers - it's just difficult to do well.
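The arithmetic comparison can be sketched the same way (again a toy, hypothetical vocabulary, not a real tokenizer): BPE-style merges can split the same digits into differently aligned chunks, while character-level tokenization always exposes one digit per token.

```python
def greedy_tokenize(text, vocab):
    # Greedy longest-match split, a simplification of real BPE.
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # fall back to a single character
            i += 1
    return out

# Made-up merge set for illustration only.
bpe_like = {"123", "99", "9", "1", "2", "3"}
print(greedy_tokenize("1239", bpe_like))  # ['123', '9']
print(greedy_tokenize("9123", bpe_like))  # ['9', '123'] - same digits, different alignment
print(list("1239"))                       # ['1', '2', '3', '9'] - character level
```

The misaligned chunking is one intuition for why digit-wise operations are harder to learn over multi-character tokens, though, as the thread notes, harder is not impossible.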

You don't need to spend more than a few hundred dollars to train a model to figure something like this out. In fact, you don't need to spend any money at all. If you are willing to step through a small model layer by layer, it's obvious.


At the end of the day you're just wrong. You said models fail to count r's in strawberry because they can't "break" the tokens into letters (i.e. predict letters from tokens, given some examples to learn from), and seem entirely unfazed by the fact that they in fact can do this.

Maybe you should tell Altman to put his $500B datacenter plans on hold, because you've been looking at your toy model and figured AGI can't spell.


Maybe go back and read what I said rather than make up nonsense. 'often fail' isn't 'always fail'. And many models fail the strawberry example, that's why it's famous. I even lay out some training samples that are of the type that enable current models to succeed at spelling 'games' in a fragile way.

Problematic and fragile at spelling games compared to using character or byte level 'tokenization' isn't a giant deal. These are largely "gotchas" that don't reduce the value of the product materially. Everyone in the field is aware. Hyperbole isn't required.

Someone linked you to one of the relevant papers above... and you still contort yourself into a pretzel. If you can't intuitively grasp the difficulty posed by current tokenization, and how character/byte-level 'tokenization' would make those things trivial (albeit with a tradeoff that doesn't make it worth it), maybe you don't have the horsepower required for the field.


Did you actually read the CUTE paper?!

What character-level task does it say is no problem for multi-char token models?

What kinds of tasks does it say they do poorly at?

Seems they agree with me, not you.

But hey, if you tried spelling vs counting for yourself you already know that.

You should swap your brain out for GPT-1. It'd be an upgrade.


Conclusion:

""" While current LLMs with BPE vocabularies lack direct access to a token’s characters, they perform well on some tasks requiring this information, but perform poorly on others. The models seem to understand the composition of their tokens in direct probing, but mostly fail to understand the concept of orthographic similarity. Their performance on text manipulation tasks at the character level lags far behind their performance at the word level. LLM developers currently apply no methods which specifically address these issues (to our knowledge), and so we recommend more research to better master orthography. Character-level models are a promising direction. With instruction tuning, they might provide a solution to many of the shortcomings exposed by our CUTE benchmark """

That is "having problems with spelling 'games'" and "probably better to use character level models for such tasks". Maybe you don't understand what "spelling games" are, here: https://chatgpt.com/share/67928128-9064-8002-ba4d-7ebc5edf07...
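For concreteness, the kind of character-level text manipulation the quoted conclusion refers to (these examples are my own, not taken from the CUTE benchmark itself) is trivial in code but hard for a model that sees "strawberry" as a couple of opaque tokens:

```python
word = "strawberry"
print(word[::-1])              # reversal
print(word.replace("r", "l"))  # character substitution
print(word.count("r"))         # character counting
```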





