We just evaluated it on Vectara's grounded hallucination leaderboard: it scores a 10.9% hallucination rate, better than Gemini-3, GPT-5.1-high, or Grok-4.
If you have built AI agents in the last 6-12 months, you know they fail a lot.
I built this repository to be a community-curated list of failure modes, mitigation techniques, and other resources, so that we can all learn from each other and build better agents.
Enterprise Deep Research is like "consumer" deep research, just pointed at your private data, and I think it may become the "killer app" of Agentic AI for business.
There are lots of valuable use cases: compliance monitoring, sales enablement, onboarding, legal, and many others.
One of the biggest challenges in RAG evaluation is the assumption that you can somehow obtain a "source of truth", specifically a set of "golden answers" (or golden chunks/documents).
In practice that is extremely difficult and does not scale.
Open-RAG-Eval is a new open-source project that aims to address this through reference-free evaluation, using metrics such as UMBRELA and AutoNuggetizer scores.
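To make the idea concrete, here is a minimal sketch (not the Open-RAG-Eval code itself) of UMBRELA-style reference-free retrieval scoring: an LLM judge grades each retrieved passage against the query on a 0-3 scale, so no golden answers are needed. The `call_llm` callable and the exact prompt wording are assumptions for illustration.

```python
# Sketch of reference-free retrieval scoring in the UMBRELA style.
# `call_llm` is a placeholder for whatever LLM client you use; it takes a
# prompt string and returns the model's text reply.

UMBRELA_PROMPT = """Given a query and a passage, grade how well the passage
answers the query on a 0-3 scale:
0 = unrelated, 1 = related but does not answer,
2 = partially answers, 3 = directly and fully answers.
Query: {query}
Passage: {passage}
Answer with a single digit."""

def judge_passage(call_llm, query: str, passage: str) -> int:
    """Return a 0-3 relevance grade for one retrieved passage."""
    reply = call_llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0

def retrieval_score(call_llm, query: str, passages: list[str]) -> float:
    """Average normalized grade over the retrieved set (higher is better)."""
    if not passages:
        return 0.0
    grades = [judge_passage(call_llm, query, p) for p in passages]
    return sum(grades) / (3 * len(passages))
```

Because the judge only needs the query and the retrieved text, the same scoring loop works for any retriever, with no human-labeled golden chunks required.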
Well, we expect AI to become AGI sometime in the future. Some say it's already here; others say it's 5 years or 50 years away.
So imagine, for the sake of argument, that AGI is already here, truly superintelligent, and able to act with agency. How do we need to treat "it"?
Throughout history, humans and societies have created mechanisms to overcome distrust, and our ability to collaborate is what helped us thrive. Should we think about our upcoming "relationship" with AI from that perspective as well?
RAG evaluation is difficult, primarily because it's hard to come up with "golden answers" (or golden chunks).
We built Open-RAG-Eval to solve this: a RAG evaluation framework that requires only the question, yet provides strong metrics for retrieval, generation, hallucination, and citations for any RAG setup.
This was done in collaboration with Jimmy Lin and his students at UWaterloo.
It has connectors for LangChain, LlamaIndex, and Vectara, and we hope others will contribute connectors for additional RAG systems.
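As a rough illustration of what a connector does (the class and method names below are hypothetical, not the actual Open-RAG-Eval interface): a connector runs a query through your RAG stack and hands back the retrieved chunks plus the generated, cited answer, which the evaluator then scores reference-free.

```python
# Hypothetical connector sketch; names are illustrative only.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class RAGResult:
    # Raw outputs of one RAG run: everything the scorers need.
    query: str
    retrieved_chunks: list[str]
    answer: str
    citations: list[int] = field(default_factory=list)  # indices into retrieved_chunks

class RAGConnector(Protocol):
    # Implemented once per RAG system (LangChain, LlamaIndex, Vectara, ...).
    def run(self, query: str) -> RAGResult: ...

def collect_results(connector: RAGConnector, queries: list[str]) -> list[RAGResult]:
    """Run each question through the RAG system; reference-free scorers are applied afterwards."""
    return [connector.run(q) for q in queries]
```

The point of the design is that only the questions are required up front; everything else is produced by the RAG system under test and scored after the fact.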
A great day for open source, and so glad to see Llama 4 out.
However, I'm a bit disappointed that Llama 4's hallucination rates are not as low as I would have liked (TL;DR: slightly higher than Llama 3).
https://github.com/vectara/hallucination-leaderboard