TLDR:
1. 1B is text-only; 4B, 12B, and 27B are vision + text. Trained on up to 14T tokens.
2. 128K context length, extended from 32K via further training; the 1B model stays at 32K.
3. Attention soft-capping removed, replaced with QK-norm (sketch below).
4. 5 local sliding-window attention layers for every 1 global attention layer (see the mask sketch after the list).
5. Sliding-window size of 1024 tokens.
6. RL post-training uses BOND, WARM, and WARP (weight-averaging sketch below).
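A minimal PyTorch sketch of the QK-norm idea, assuming RMSNorm applied to the projected queries and keys (Gemma applies it per head after splitting heads; this single-head version, and all names in it, are illustrative rather than the actual Gemma code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Single-head attention using QK-norm instead of logit soft-capping."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Normalizing q and k bounds the dot-product logits directly,
        # removing the need for a tanh soft-cap on attention scores.
        self.q_norm = nn.RMSNorm(dim, eps=eps)  # requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(dim, eps=eps)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        return F.softmax(scores, dim=-1) @ v
```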
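And a sketch of the 5:1 local/global layer pattern with the 1024-token window, expressed as boolean attention masks (the helper names and the exact placement of the global layers are assumptions for illustration):

```python
import torch

WINDOW = 1024   # sliding-window size
PATTERN = 6     # 5 local layers, then 1 global layer

def is_global_layer(layer_idx: int) -> bool:
    # Every 6th layer (indices 5, 11, 17, ...) attends globally.
    return (layer_idx + 1) % PATTERN == 0

def attention_mask(seq_len: int, layer_idx: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i
    if is_global_layer(layer_idx):
        return causal
    # Local layers see only the last WINDOW tokens (current token included).
    return causal & (i - j < WINDOW)
```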
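On the RL side: BOND is Best-of-N Distillation, WARM is Weight-Averaged Reward Models, and WARP is Weight-Averaged Rewarded Policies. WARM and WARP both center on merging the weights of multiple trained models; below is only the core uniform-averaging operation, a hedged sketch rather than either actual algorithm (which add EMA anchors, SLERP merging, best-of-N sampling, etc.):

```python
import copy
import torch
import torch.nn as nn

def average_weights(models: list[nn.Module]) -> nn.Module:
    """Uniformly average the parameters of same-architecture models."""
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            # Stack the matching parameter from every model and take the mean.
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return merged
```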