And have you measured the trade-off that could come with embracing such a large number of languages and alphabets? It would be interesting to note whether you are sacrificing some response quality, or if such supposed sacrifice is interestingly negligible, or if - even more interestingly - the quality increases with the added proficiency.
Yes we have measured the tradeoff. We don't see a drop of perplexity in English when introducing multilingual, and there is a slight drop in some English language-specific evals (~1%).
There are enough small model teams competing that I fell confident one of them will try this, and if it just sticking to english gives a large boost, the others will be forced to follow suite.
It would also kind of suck for non-english speakers, because it will just be another feather in the hat of "English eats the world".
Some numbers to try and make an idea: if I understand correctly, Gemma3 uses a fixed (in its versions by size) vocabulary 256k entries big; the smallest 1B version has ~300M embedding parameters and ~700M non-embedding parameters; the largest 27B version has ~5x embedding parameters and ~35x non-embedding parameters.
Multilingualism covering 140 languages is quite a big feat. Gemma3 apparently aims to be compact and efficient. The two goals and features put together raise questions. You wonder for example how much does such extensive multilingualism impact the above numbers, on a benchmark of similar results. It may e.g. be a general question to wonder how much multilingualism complicates an embedding space (owing e.g. to omographic collisions), and the question becomes more prominent when you crammed 140 languages in one model.
> non-english speakers
You would produce more specialized models (where it makes sense): Eng; Eng-Fra-Esp-Deu; Man-Can... For a billion weights per model it could probably be financially acceptable.