I agree. Public benchmarks aren't very useful for a bunch of reasons. Any company relying on LLMs for a critical function should have its own internal benchmark system. I maintain such a system for my job. If you can, use the same prompts every time, so that scores stay comparable across models and over time. It's fun to still be able to include models like the original Bard on our leaderboard.
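
For anyone wondering what "same prompt every time" might look like in practice, here is a minimal sketch, not the system I described. Everything in it is an assumption made for illustration: the two test cases, the exact-match scoring, the stub models, and the names `CASES`, `prompt_set_id`, `score`, `record`, and `leaderboard`. The idea is to fingerprint the prompt set and file scores under that fingerprint, so a model scored long ago is only ever ranked against models that saw identical prompts.

```python
import hashlib
import json
from typing import Callable

# Hypothetical fixed test set of (prompt, expected answer) pairs.
# Editing a prompt changes the fingerprint, which starts a fresh leaderboard.
CASES = [
    ("What is 2 + 2? Answer with a number only.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]


def prompt_set_id(cases) -> str:
    """Fingerprint the prompt set so scores are only compared like for like."""
    blob = json.dumps(cases).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def score(model: Callable[[str], str], cases) -> float:
    """Fraction of cases where the reply matches the expected answer exactly."""
    hits = sum(model(p).strip().lower() == want.lower() for p, want in cases)
    return hits / len(cases)


def record(board: dict, name: str, model: Callable[[str], str], cases) -> None:
    """Store a model's score under the fingerprint of the prompts it saw."""
    board.setdefault(prompt_set_id(cases), {})[name] = score(model, cases)


def leaderboard(board: dict, cases) -> list:
    """Rank every model that was scored on exactly this prompt set."""
    rows = board.get(prompt_set_id(cases), {})
    return sorted(rows.items(), key=lambda kv: kv[1], reverse=True)


# Stub "models" standing in for real API calls (assumption for the demo).
def stub_strong(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Paris"


def stub_weak(prompt: str) -> str:
    return "4"


board: dict = {}
record(board, "stub-strong", stub_strong, CASES)
record(board, "stub-weak", stub_weak, CASES)
print(leaderboard(board, CASES))  # [('stub-strong', 1.0), ('stub-weak', 0.5)]
```

In a real setup the `board` dict would be persisted (a JSON file or a database table would do), and exact-match scoring would likely be replaced with whatever check suits your task. The part worth keeping is that a retired model's old scores remain valid as long as the prompt fingerprint has not changed.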