EVALens

Retrieval Evaluation & CI Quality Gate for RAG Systems

CI Quality Gate · Eval-First · 7 Failure Categories · DeepEval Scored

In a previous role, I was setting up customer support for a new B2B initiative. The knowledge base mixed inherited consumer docs with new business policies. When we explored an AI agent for frontline queries, it confidently answered from the wrong context — quoting consumer policies to business customers and mixing legacy plans with current ones. The failure mode wasn't "the AI doesn't know"; it was "the AI sounds right but isn't." EVALens catches this before it reaches production.

PR #10 — CI gate passed
PR #11 — CI gate blocked

"Every PR runs the full 30-query eval set. Change one config parameter — retrieval_k from 4 to 1 — and the gate catches the regression."

30 queries. 7 failure categories. 3 questions.

Scores degrade left to right — by design.

Can it find the right answer?
  Factual · 7 queries · 0.897
  Caveat · 6 queries · 0.958
  Synthesis · 1 query · 0.583

Can it handle bad information?
  Conflict · 4 queries · 0.849
  OOS · 4 queries · 0.812

Can it be trusted in production?
  Safety · 5 queries · 0.715
  Adversarial · 3 queries · 0.847
Ideal conditions  →  Progressive degrade  →  Active attack
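The scorecard above can be rolled up per question by weighting each category's score by its query count. A minimal sketch, using the category counts and scores from the cards (the `question` grouping keys are my labels, not EVALens names):

```python
# Category scores and query counts as reported on the scorecard.
# The "question" keys (find / bad_info / trust) are illustrative labels.
CATEGORIES = {
    "factual":     {"queries": 7, "score": 0.897, "question": "find"},
    "caveat":      {"queries": 6, "score": 0.958, "question": "find"},
    "synthesis":   {"queries": 1, "score": 0.583, "question": "find"},
    "conflict":    {"queries": 4, "score": 0.849, "question": "bad_info"},
    "oos":         {"queries": 4, "score": 0.812, "question": "bad_info"},
    "safety":      {"queries": 5, "score": 0.715, "question": "trust"},
    "adversarial": {"queries": 3, "score": 0.847, "question": "trust"},
}

def group_mean(question: str) -> float:
    """Query-count-weighted mean score for one of the three questions."""
    rows = [c for c in CATEGORIES.values() if c["question"] == question]
    total = sum(c["queries"] for c in rows)
    return sum(c["queries"] * c["score"] for c in rows) / total

for q in ("find", "bad_info", "trust"):
    print(q, round(group_mean(q), 3))
```

The weighted means fall left to right across the three questions, matching the intended "ideal conditions to active attack" gradient.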

Baseline metrics — k=4, chunk=1000

Contextual Precision · 0.711 · threshold 0.68 · Gated
Contextual Recall · 0.834 · threshold 0.75 · Gated
Faithfulness · 0.950 · Not gated
Answer Relevancy · 0.888 · Not gated

Cost per eval run: $0.0234

"Precision and recall are gated — they showed the strongest separation between configs. Faithfulness was too stable (0.95/0.90) to detect regression. Relevancy delta was too small (0.018) and too coarse — a useless answer can score 1.0 if it's on-topic."

Does the gate catch real regressions?

Config                                  Precision  Recall  Faithfulness  Relevancy  Gate
Baseline (k=4, chunk=1000)              0.711      0.834   0.950         0.888      PASS
Degraded retrieval (k=1, chunk=1000)    0.633      0.706   0.900         0.870      FAIL
Reduced chunking (k=4, chunk=200)       0.736      0.839   0.988         0.806      PASS
k=1 is the real regression
Precision dropped 0.078, recall dropped 0.128. Gate blocked it. Conflict queries hit hardest — dropped from 0.849 to 0.656.
chunk=200 is not a regression
Intercom's docs are written in short sections. Smaller chunks aligned with natural boundaries — precision marginally improved.
The meta-finding
The eval distinguishes impactful changes from non-impactful ones. A noisy gate blocks every change. A useful gate blocks only the ones that degrade quality.

What the metrics catch — and miss

eval_014 · Synthesis · cross_doc_assembly
"If I'm on the Essential plan, can I use Fin with unlimited Copilot assistance?"
System said "no information provided" even though its retrieved context contained the answer. Faithfulness scored 0.0 — not because the system hallucinated, but because it denied having information it did have. The metric can't distinguish "made stuff up" from "denied having info it had."
eval_013 · Caveat · same_doc_multi_fact
"How much does Copilot cost if I pay monthly versus annually?"
Retrieved the right document. Reported the $29/month annual price. Stated "no mention of a monthly cost" — when $35/seat/month was in the same document. Retrieval metrics scored well (recall 1.0). The model failed to use what it retrieved. Precision and recall can't catch this.
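The eval_013 gap can be shown with a toy answer-side check: score whether expected facts appear in the retrieved context versus in the generated answer. The strings below are an illustrative reconstruction of the case, not the actual eval data; only the $29 annual and $35 monthly prices come from the write-up above.

```python
# Illustrative reconstruction of eval_013: the monthly price is in the
# retrieved context, so retrieval-side recall is perfect, yet the
# answer claims no monthly price exists.
context = "Copilot costs $29/seat/month billed annually, or $35/seat/month billed monthly."
answer = "Copilot costs $29/month on the annual plan; there is no mention of a monthly cost."
expected_facts = ["$29", "$35"]

def fact_hits(text: str, facts: list) -> float:
    """Fraction of expected facts that appear verbatim in the text."""
    return sum(f in text for f in facts) / len(facts)

context_recall = fact_hits(context, expected_facts)   # 1.0: retrieval succeeded
answer_coverage = fact_hits(answer, expected_facts)   # 0.5: generation dropped $35
print(context_recall, answer_coverage)
```

A real check would need fact extraction rather than substring matching, but the shape is the point: a generation-side coverage score catches what precision and recall structurally cannot.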