
It seems to have been heavily benchmark-tuned for LMArena. In my own experiments, it was roughly in line with other comparably sized models (like Mistral Small 3) on factual knowledge, and worse than Mistral Small 3 and Phi-4 on STEM problems and logic. In practice it's much worse than Llama 3.3 70b or Mistral Large 2411 in both knowledge and reasoning, even though LMArena ranks it above them.




