If your AI vendor can't give you RAGAS scores, they can't prove their system works. Here's what the four metrics mean and exactly what to ask for before you sign anything.
In enterprise AI procurement, a new question is appearing in RFPs: what are your RAGAS evaluation scores? Most vendors cannot answer it. The ones who can are building something categorically different.
RAGAS — Retrieval Augmented Generation Assessment — is a framework for evaluating RAG systems across four measurable dimensions. It has become the closest thing the AI engineering industry has to a standardised quality benchmark for knowledge retrieval systems.
1. Faithfulness
Does the answer contain only information present in the retrieved context? A faithfulness score of 1.0 means every claim in the answer can be traced to a specific passage in your documents. A score below 0.8 means roughly one claim in five cannot be traced to your knowledge base: the model is inventing information, which is the technical definition of hallucination.
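To make the idea concrete, here is a deliberately simplified sketch of what a faithfulness check does. Real RAGAS uses an LLM to decompose the answer into individual claims and verify each one against the retrieved context; this toy version only checks whether the content words of each answer sentence appear in the context, which is a crude stand-in for illustration purposes.

```python
import re


def toy_faithfulness(answer: str, contexts: list[str]) -> float:
    """Rough faithfulness proxy: the fraction of answer sentences whose
    content words all appear somewhere in the retrieved contexts.
    (Real RAGAS uses an LLM judge to extract and verify claims.)"""
    context_words = set(re.findall(r"[a-z0-9]+", " ".join(contexts).lower()))
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        1
        for s in sentences
        if (words := set(re.findall(r"[a-z0-9]+", s.lower()))) and words <= context_words
    )
    return supported / len(sentences)


contexts = ["The refund window is 30 days from purchase."]
print(toy_faithfulness("The refund window is 30 days.", contexts))            # fully supported
print(toy_faithfulness("The refund window is 90 days. Shipping is free.", contexts))  # invented claims
```

A faithful answer scores 1.0; an answer that invents a 90-day window and a free-shipping claim scores 0.0, because neither sentence can be traced to the context.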
2. Answer Relevance
Does the answer actually address the question asked? A system can retrieve the right documents and still produce an answer that is tangentially related rather than directly responsive.
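The gap between "related" and "responsive" can be sketched with a toy overlap score. Real RAGAS measures answer relevance by generating candidate questions back from the answer with an LLM and comparing embeddings against the original question; this word-overlap version is only an illustration of the intuition.

```python
import re


def toy_answer_relevance(question: str, answer: str) -> float:
    """Crude relevance proxy: what fraction of the question's content
    words does the answer engage with? (Real RAGAS uses LLM-generated
    reverse questions and embedding similarity.)"""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    q_words, a_words = tokenize(question), tokenize(answer)
    if not q_words:
        return 0.0
    return len(q_words & a_words) / len(q_words)


print(toy_answer_relevance("What is the refund window?", "The refund window is 30 days."))
print(toy_answer_relevance("What is the refund window?", "We ship worldwide."))
```

A direct answer scores high; a tangential answer drawn from the same product documentation scores near zero, which is exactly the failure mode this metric exists to catch.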
3. Context Precision
Of the document passages retrieved, how many were actually useful? Low context precision means the retrieval system is pulling irrelevant context — which dilutes answer quality and increases hallucination risk.
4. Context Recall
Did the retrieval system find all the information needed? High faithfulness with low recall means the system is honest about what it found but missed key information that existed in your knowledge base.
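Context precision and context recall are the classic precision/recall pair applied to retrieved passages. In real RAGAS both are computed with an LLM judging attribution against ground-truth answers; the sketch below assumes instead that passages carry known relevance labels, purely to show the underlying arithmetic.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Of the passages retrieved, what fraction was actually relevant?"""
    if not retrieved:
        return 0.0
    return sum(1 for p in retrieved if p in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Of all relevant passages in the knowledge base, what fraction
    did retrieval actually find?"""
    if not relevant:
        return 1.0
    return sum(1 for p in set(retrieved) if p in relevant) / len(relevant)


retrieved = ["doc1", "doc2", "doc3", "doc4"]   # hypothetical passage ids
relevant = {"doc1", "doc3", "doc5"}            # doc5 was never retrieved

print(context_precision(retrieved, relevant))  # 2 of 4 retrieved were useful
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were found
```

Here retrieval pulled two irrelevant passages (low precision dilutes the context) and missed doc5 entirely (low recall means the answer cannot cover it, no matter how faithful the model is).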
For a production RAG system handling business-critical knowledge, there should be an explicit quality floor, not a vague promise of accuracy. At Scaliq, that floor is 0.85 on all four metrics: no RAG system goes live below it. We run RAGAS evaluation pipelines continuously in production, so quality degradation is caught before users experience it.
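A go-live threshold like this is straightforward to enforce in code. The sketch below is a hypothetical gate function, not Scaliq's actual pipeline; the metric names and the 0.85 floor are assumptions taken from the text above.

```python
FLOOR = 0.85  # assumed minimum score per metric before go-live

REQUIRED_METRICS = (
    "faithfulness",
    "answer_relevance",
    "context_precision",
    "context_recall",
)


def passes_quality_gate(scores: dict[str, float], floor: float = FLOOR) -> tuple[bool, list[str]]:
    """Check a RAGAS score report against the go-live floor.
    Returns (passed, list of failing metrics). A missing metric
    counts as a failure rather than a pass."""
    failing = [m for m in REQUIRED_METRICS if scores.get(m, 0.0) < floor]
    return (not failing, failing)


report = {
    "faithfulness": 0.92,
    "answer_relevance": 0.88,
    "context_precision": 0.90,
    "context_recall": 0.71,  # retrieval is missing relevant passages
}
print(passes_quality_gate(report))  # blocked: context_recall is below the floor
```

Running a check like this on every evaluation cycle is what turns "our system is accurate" from a claim into a verifiable gate.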
Ask any AI vendor building you a knowledge system whether they can provide RAGAS evaluation scores on your data before deployment.
If the answer is that they do not use RAGAS but their system is accurate — that is not an answer. Accuracy without measurement is a claim, not a guarantee.
If the answer is that they will run evaluations post-deployment — that means you are paying to discover whether the system works after it is already live with your users.
The difference between a demo and a production AI system is measurability. RAGAS is how you measure.
Ready to deploy?
Free 30-minute technical scoping call. We scope your AI system live and give you a clear deployment plan.