We never train at 128k, only 32k, changing the scaling factor at the end. We wan... | Hacker News

alekandreev 1 day ago | on: Gemma 3 Technical Report [pdf]

We never train at 128k, only at 32k, and change the scaling factor at the end. We wanted the long-context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5/1 local-to-global layer ratio is close to that of a fully-global-layer model at 32k. Individual attention layers are always dense.
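To make the memory comparison concrete, here is a rough back-of-the-envelope sketch in Python. It counts only cached key/value positions summed over layers, and ignores weights, activations, head counts, and dtype. The 48-layer depth and the 1024-token local window are assumptions chosen for illustration, not figures taken from this comment, and `kv_cache_tokens` is a hypothetical helper.

```python
def kv_cache_tokens(context_len, num_layers, local_per_global, window):
    """Count cached token positions summed over all layers.

    Local (sliding-window) layers cache at most `window` positions;
    global layers cache the full context. Pass local_per_global=0
    to model a network where every layer is global.
    """
    group = local_per_global + 1              # e.g. 5 local + 1 global
    num_global = num_layers // group
    num_local = num_layers - num_global
    local_tokens = num_local * min(context_len, window)
    global_tokens = num_global * context_len
    return local_tokens + global_tokens


# Assumed values for illustration only.
NUM_LAYERS = 48   # hypothetical depth
WINDOW = 1024     # assumed sliding-window size for local layers

interleaved_128k = kv_cache_tokens(128 * 1024, NUM_LAYERS, 5, WINDOW)
all_global_32k = kv_cache_tokens(32 * 1024, NUM_LAYERS, 0, WINDOW)

print(interleaved_128k, all_global_32k)
print(round(interleaved_128k / all_global_32k, 2))
```

Under these assumptions the interleaved model at 128k caches on the order of the same number of positions as an all-global model at 32k, which is consistent with the "close to" claim; the exact ratio depends on the real layer count and window size.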

sidkshatriya 1 day ago

Thanks for your answer! So in the 32k global layer, every token attends to each of the other 32k tokens?

[Edit: You answered the question when you said that individual attention layers are always dense.]
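One way to picture "dense" here is a small mask sketch in Python. It assumes a causal decoder, so each token attends to itself and earlier positions only; the function names and the tiny sizes are hypothetical and chosen for readability. In a global layer the span is the whole context; in a local layer it is a sliding window. In both cases every position inside the span is attended to, with no sparse pattern inside it.

```python
def global_mask(n):
    """Causal dense mask: token i attends to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, window):
    """Causal sliding-window mask: token i attends to the last `window`
    positions up to and including itself, with no gaps inside the window."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (attended) and '.' (masked)."""
    return "\n".join("".join("x" if m else "." for m in row) for row in mask)


print(show(global_mask(6)))
print()
print(show(local_mask(6, 3)))
```

The global mask fills the whole lower triangle, while the local mask is a dense band of width `window` along the diagonal.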
