
We never train at 128k, only at 32k, and change the scaling factor at the end.
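
For anyone wondering what "changing the scaling factor" might look like, here's a minimal sketch assuming it refers to RoPE position scaling (my assumption, not confirmed above); the 128k/32k = 4 factor and dim=128 are just illustrative:

    import numpy as np

    # Sketch only: assumes "scaling factor" means RoPE position scaling.
    # Train at 32k with scale=1.0, then raise the scale at the end so 128k
    # positions map back into the angle range seen in training (128k/32k = 4).
    def rope_angles(positions, dim, base=10000.0, scale=1.0):
        # One rotary angle per (position, frequency) pair; dim = head dim (even).
        inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
        return np.outer(positions / scale, inv_freq)

    train_angles = rope_angles(np.arange(32_768), dim=128, scale=1.0)
    infer_angles = rope_angles(np.arange(131_072), dim=128, scale=4.0)
    # With scale=4.0, inference position 131_071 lands near the angle range
    # that position ~32_767 occupied during training.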

We wanted the long-context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5/1 local/global ratio is close to that of a fully-global-layer model at 32k.
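
A rough back-of-the-envelope for that RAM claim, counting KV-cache entries (the 48-layer count and 1024-token sliding window below are my assumptions, purely illustrative): a 5/1 local/global stack at 128k ends up in the same ballpark as a fully-global stack at 32k.

    # Illustrative KV-cache comparison; layer count and window size are made up.
    def kv_cache_tokens(context, n_layers, local_per_global=5, window=1024):
        # Local (sliding-window) layers cache at most `window` tokens;
        # global layers cache the full context.
        n_global = n_layers // (local_per_global + 1)
        n_local = n_layers - n_global
        return n_local * min(window, context) + n_global * context

    n_layers = 48  # hypothetical
    mixed_128k = kv_cache_tokens(131_072, n_layers)   # 5/1 mix at 128k
    full_global_32k = 32_768 * n_layers                # all-global at 32k
    print(mixed_128k, full_global_32k)  # ~1.09M vs ~1.57M cached tokens per sequence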

Individual attention layers are always dense.
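
In other words (my reading of "dense"): a local layer still attends to every token inside its window, and a global layer to every previous token; nothing is sparsified within a layer. A toy mask sketch, with an illustrative window size:

    import numpy as np

    # Toy causal masks; window=4 is illustrative.
    def attention_mask(seq_len, window=None):
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        causal = j <= i
        if window is None:
            return causal                    # global layer: all previous tokens
        return causal & (i - j < window)     # local layer: dense within its window

    global_mask = attention_mask(8)           # every token sees all earlier tokens
    local_mask = attention_mask(8, window=4)  # still dense, just restricted to the window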

Thanks for your answer! So in the 32k global layer, does every token attend to each of the other 32k tokens?

[Edit: You answered the question when you said that individual attention layers are always dense.]
