
If it helps anyone, I wrote a detailed analysis here: https://x.com/danielhanchen/status/1899735308180267176

TLDR:

1. 1B is text only; 4B, 12B and 27B are vision + text. Trained on 14T tokens.

2. 128K context length, extended by further training from 32K; the 1B model stays at 32K.

3. Attention logit softcapping removed, replaced with QK norm (rough sketch of points 3-5 after this list).

4. 5 sliding-window attention layers for every 1 global attention layer.

5. 1024-token sliding window for the local layers.

6. RL via BOND, WARM and WARP.
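To make points 3-5 concrete, here's a rough PyTorch sketch. It's mine, not Gemma's actual code: the head dim, the exact ordering of local vs global layers, and the norm details (no learned scale) are placeholders; only the 1024 window, the 5:1 local/global ratio, and QK norm in place of logit softcapping come from the list above.

    import torch
    import torch.nn.functional as F

    HEAD_DIM = 128            # placeholder, not the real Gemma 3 head dim
    SLIDING_WINDOW = 1024     # point 5
    LOCAL_PER_GLOBAL = 5      # point 4: 5 sliding-window layers per global layer

    def rms_norm(x, eps=1e-6):
        # QK norm: RMSNorm over the head dim (learned scale omitted for brevity)
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

    def attention(q, k, v, is_global, pos):
        # q, k, v: [batch, heads, seq, head_dim]; pos: [seq] token positions
        q, k = rms_norm(q), rms_norm(k)   # point 3: normalize q/k, no logit softcap
        causal = pos[None, :] <= pos[:, None]
        if is_global:
            mask = causal
        else:
            # local layers only see the last SLIDING_WINDOW tokens
            mask = causal & (pos[:, None] - pos[None, :] < SLIDING_WINDOW)
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

    def is_global_layer(i):
        # illustrative ordering: layers 0..4 local, layer 5 global, 6..10 local, ...
        return (i + 1) % (LOCAL_PER_GLOBAL + 1) == 0

    b, h, s = 1, 8, 2048
    q, k, v = (torch.randn(b, h, s, HEAD_DIM) for _ in range(3))
    pos = torch.arange(s)
    x = attention(q, k, v, is_global_layer(0), pos)   # layer 0 -> sliding window
    x = attention(q, k, v, is_global_layer(5), pos)   # layer 5 -> global
    print(x.shape)  # torch.Size([1, 8, 2048, 128])

The appeal of the 5:1 split plus the small window is that the KV cache for the local layers is capped at 1024 tokens, so only 1 in 6 layers pays the full long-context memory cost, which is what makes the 128K context more manageable.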





