It seems more like they valued quantitative data, in the form of A/B testing, more highly than their "vibe checks". The point I took away from the paper is that, in the context of LLMs, quantitative A/B testing isn't necessarily better than a handful of experts giving anecdotal opinions on whether they like it.

In my experience, smart leaders tend to rely on data and hard numbers over qualitative and anecdotal evidence, and this paper explores an exception to that.

I'm disappointed they didn't address the paper about GPT integrating with ChatbotArena that was shared here on HN a couple of days ago.