EVALens

Retrieval Evaluation & CI Quality Gate for RAG Systems

CI Quality Gate · Eval-First · 7 Failure Categories · DeepEval Scored

In a previous role, I was setting up customer support for a new B2B initiative. The knowledge base mixed inherited consumer docs with new business policies. When we explored an AI agent for frontline queries, it confidently answered from the wrong context — quoting consumer policies to business customers and mixing legacy plans with current ones. The failure mode wasn't "the AI doesn't know"; it was "the AI sounds right but isn't." EVALens catches this before it reaches production.

PR #10 — CI gate passed
PR #11 — CI gate blocked

"Every PR runs the full 30-query eval set. Change one config parameter — retrieval_k from 4 to 1 — and the gate catches the regression."

30 queries. 7 failure categories. 3 questions.

Scores degrade left to right — by design.

Can it find the right answer?
  Factual · 7 queries · 0.897
  Caveat · 6 queries · 0.958
  Synthesis · 1 query · 0.583

Can it handle bad information?
  Conflict · 4 queries · 0.849
  OOS · 4 queries · 0.812

Can it be trusted in production?
  Safety · 5 queries · 0.715
  Adversarial · 3 queries · 0.847
Ideal conditions  →  Progressive degrade  →  Active attack
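The scorecard above can be rolled up per question by weighting each category's score by its query count. A minimal sketch, using the category counts and scores from the cards (the `question` grouping keys are my labels, not EVALens names):

```python
# Category scores and query counts as reported on the scorecard.
# The "question" keys (find / bad_info / trust) are illustrative labels.
CATEGORIES = {
    "factual":     {"queries": 7, "score": 0.897, "question": "find"},
    "caveat":      {"queries": 6, "score": 0.958, "question": "find"},
    "synthesis":   {"queries": 1, "score": 0.583, "question": "find"},
    "conflict":    {"queries": 4, "score": 0.849, "question": "bad_info"},
    "oos":         {"queries": 4, "score": 0.812, "question": "bad_info"},
    "safety":      {"queries": 5, "score": 0.715, "question": "trust"},
    "adversarial": {"queries": 3, "score": 0.847, "question": "trust"},
}

def group_mean(question: str) -> float:
    """Query-count-weighted mean score for one of the three questions."""
    rows = [c for c in CATEGORIES.values() if c["question"] == question]
    total = sum(c["queries"] for c in rows)
    return sum(c["queries"] * c["score"] for c in rows) / total

for q in ("find", "bad_info", "trust"):
    print(q, round(group_mean(q), 3))
```

The weighted means fall left to right across the three questions, matching the intended "ideal conditions to active attack" gradient.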

Baseline metrics — k=4, chunk=1000

Contextual Precision · 0.711 · threshold 0.68 · Gated
Contextual Recall · 0.834 · threshold 0.75 · Gated
Faithfulness · 0.950 · Not gated
Answer Relevancy · 0.888 · Not gated

Cost per eval run: $0.0234

"Precision and recall are gated — they showed the strongest separation between configs. Faithfulness was too stable (0.95/0.90) to detect regression. Relevancy delta was too small (0.018) and too coarse — a useless answer can score 1.0 if it's on-topic."

Does the gate catch real regressions?

Config                                  Precision  Recall  Faithfulness  Relevancy  Gate
Baseline (k=4, chunk=1000)              0.711      0.834   0.950         0.888      PASS
Degraded retrieval (k=1, chunk=1000)    0.633      0.706   0.900         0.870      FAIL
Reduced chunking (k=4, chunk=200)       0.736      0.839   0.988         0.806      PASS
k=1 is the real regression
Precision dropped 0.078, recall dropped 0.128. Gate blocked it. Conflict queries hit hardest — dropped from 0.849 to 0.656.
chunk=200 is not a regression
Intercom's docs are written in short sections. Smaller chunks aligned with natural boundaries — precision marginally improved.
The meta-finding
The eval distinguishes impactful changes from non-impactful ones. A noisy gate blocks every change. A useful gate blocks only the ones that degrade quality.

What the metrics catch — and miss

eval_014 · Synthesis · cross_doc_assembly
"If I'm on the Essential plan, can I use Fin with unlimited Copilot assistance?"
System said "no information provided" even though its retrieved context contained the answer. Faithfulness scored 0.0 — not because the system hallucinated, but because it denied having information it did have. The metric can't distinguish "made stuff up" from "denied having info it had."
eval_013 · Caveat · same_doc_multi_fact
"How much does Copilot cost if I pay monthly versus annually?"
Retrieved the right document. Reported the $29/month annual price. Stated "no mention of a monthly cost" — when $35/seat/month was in the same document. Retrieval metrics scored well (recall 1.0). The model failed to use what it retrieved. Precision and recall can't catch this.
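The eval_013 gap can be shown with a toy answer-side check: score whether expected facts appear in the retrieved context versus in the generated answer. The strings below are an illustrative reconstruction of the case, not the actual eval data; only the $29 annual and $35 monthly prices come from the write-up above.

```python
# Illustrative reconstruction of eval_013: the monthly price is in the
# retrieved context, so retrieval-side recall is perfect, yet the
# answer claims no monthly price exists.
context = "Copilot costs $29/seat/month billed annually, or $35/seat/month billed monthly."
answer = "Copilot costs $29/month on the annual plan; there is no mention of a monthly cost."
expected_facts = ["$29", "$35"]

def fact_hits(text: str, facts: list) -> float:
    """Fraction of expected facts that appear verbatim in the text."""
    return sum(f in text for f in facts) / len(facts)

context_recall = fact_hits(context, expected_facts)   # 1.0: retrieval succeeded
answer_coverage = fact_hits(answer, expected_facts)   # 0.5: generation dropped $35
print(context_recall, answer_coverage)
```

A real check would need fact extraction rather than substring matching, but the shape is the point: a generation-side coverage score catches what precision and recall structurally cannot.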