
As per the technical report, every 5 layers there is a global attention layer. During training, the global attention layer can have a context length of as much as 128k (though I understand it is usually 32k).

Q. When training with a context length of 128k, is the attention in the global layers dense or sparse?

If dense, would the attention memory requirement here be O(n^2), where n is 128k, for each global layer?
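
To make the concern concrete, here is a rough back-of-the-envelope sketch of what naively materializing a dense 128k-by-128k score matrix would cost per layer. The head count and dtype are assumed values, and real implementations use FlashAttention-style kernels that never store the full matrix:

    # Rough cost of a naively materialized dense attention score matrix,
    # per layer and per batch element. Head count and dtype are assumptions.
    n = 128 * 1024           # sequence length
    num_heads = 16           # assumed head count
    bytes_per_elem = 2       # bf16
    scores = n * n * num_heads * bytes_per_elem
    print(f"{scores / 2**30:.0f} GiB of scores per layer")   # -> 512 GiB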

We never train at 128k, only at 32k, changing the scaling factor at the end.
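
In case it helps, a minimal sketch of what "changing the scaling factor" could look like, assuming it refers to something like RoPE position interpolation (the actual recipe and factor are not spelled out here, so all numbers below are assumptions):

    # Hedged sketch: extend context by scaling the positions fed into RoPE,
    # assuming a position-interpolation style recipe. All numbers are assumptions.
    import numpy as np

    def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
        # scale > 1 compresses positions so 128k inference reuses the
        # positional range seen during 32k training
        inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
        return np.outer(positions / scale, inv_freq)

    train = rope_angles(np.arange(32 * 1024))               # 32k training, no scaling
    infer = rope_angles(np.arange(128 * 1024), scale=4.0)   # 128k inference, 128k/32k = 4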

We wanted the long-context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that memory usage at 128k with the 5:1 pattern is close to the RAM usage of a fully-global-layer model at 32k.
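
A toy version of that KV-cache arithmetic, with the layer count and window size being illustrative assumptions rather than the actual model config:

    # Compare KV-cache sizes (in cached token positions) for an interleaved
    # 5-local:1-global model at 128k vs. an all-global model at 32k.
    # Layer count and window size are illustrative assumptions.
    def cached_positions(num_layers, global_every, context, window):
        total = 0
        for layer in range(num_layers):
            if (layer + 1) % global_every == 0:
                total += context                  # global layer caches the full context
            else:
                total += min(window, context)     # local layer caches only its window
        return total

    layers, window = 48, 1024
    interleaved_128k = cached_positions(layers, 6, 128 * 1024, window)   # ~1.1M positions
    all_global_32k = cached_positions(layers, 1, 32 * 1024, window)      # ~1.6M positions
    print(interleaved_128k, all_global_32k)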

Individual attention layers are always dense.
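
To illustrate what "dense" means for both layer types: the mask decides which tokens are visible, but within that range every query attends to every visible key. The window size below is an assumed value:

    # Dense attention in both layer types; no sparse patterns inside a layer.
    import numpy as np

    def causal_mask(n):
        # global layer: each token attends densely to every earlier token
        return np.tril(np.ones((n, n), dtype=bool))

    def sliding_window_mask(n, window=1024):   # window size is an assumption
        # local layer: each token attends densely to the previous `window` tokens
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        return (j <= i) & (j > i - window)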


Thanks for your answer! So in a global layer at 32k, every token attends to each of the other 32k tokens?

[Edit: You answered the question when you said that individual attention layers are always dense.]
