
I agree. Public benchmarks aren't very useful for a bunch of reasons. Any company relying on LLMs for a critical function should have its own internal benchmark system. I maintain such a system for my job. If you can, use the same prompt every time, so results stay comparable across models. It's fun to be able to include models like the original Bard on our leaderboard.
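
For anyone wondering what the "same prompt every time" part looks like in practice, here is a rough sketch, not my actual system. Everything in it is made up for illustration: the prompt template, the two test cases, the exact-match scoring, and the stand-in model functions (a real setup would wrap whatever API client you use behind the same callable shape).

```python
from typing import Callable, Dict, List, Tuple

# Fixed prompt template, reused unchanged for every model and every run.
# (Hypothetical wording; the point is only that it never varies.)
PROMPT_TEMPLATE = "Answer with a single number only.\nQuestion: {question}"

# Hypothetical test cases: (question, expected answer).
CASES: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("What is 10 / 5?", "2"),
]


def score_model(model: Callable[[str], str]) -> float:
    """Return the fraction of cases the model answers exactly right."""
    correct = 0
    for question, expected in CASES:
        reply = model(PROMPT_TEMPLATE.format(question=question))
        if reply.strip() == expected:
            correct += 1
    return correct / len(CASES)


def leaderboard(models: Dict[str, Callable[[str], str]]) -> List[Tuple[str, float]]:
    """Score every model on the same prompts and sort best first."""
    scores = [(name, score_model(fn)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Stand-in models for illustration only; real ones would call an API.
def always_four(prompt: str) -> str:
    return "4"


def always_blank(prompt: str) -> str:
    return ""


if __name__ == "__main__":
    board = leaderboard({"always_four": always_four, "always_blank": always_blank})
    for name, score in board:
        print(f"{name}: {score:.2f}")
```

Because the prompt and cases are frozen, a score recorded for an old model stays comparable with one recorded for a new model later, which is what makes keeping retired models on the board meaningful.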

