# Retrieval Evaluation & CI Quality Gate for RAG Systems
At a previous role, I was setting up customer support for a new B2B initiative. The knowledge base mixed inherited consumer docs with new business policies. When we explored an AI agent for frontline queries, it confidently answered from the wrong context: quoting consumer policies to business customers and mixing legacy plans with current ones. The failure mode wasn't "the AI doesn't know"; it was "the AI sounds right but isn't." EVALens catches this before it reaches production.
> Every PR runs the full 30-query eval set. Change one config parameter (retrieval_k from 4 to 1) and the gate catches the regression.
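A minimal sketch of what such a PR gate can look like (the metric names, floor values, and results-file layout are illustrative assumptions, not EVALens's actual configuration):

```python
# Hypothetical gated metrics and floors; EVALens's actual thresholds aren't stated here.
THRESHOLDS = {"context_precision": 0.68, "context_recall": 0.78}

def gate_failures(metrics: dict, thresholds: dict) -> list:
    """Return one failure message per gated metric that falls below its floor."""
    return [
        f"{name}={metrics.get(name, 0.0):.3f} below floor {floor:.2f}"
        for name, floor in thresholds.items()
        if metrics.get(name, 0.0) < floor
    ]

# In CI this dict would be loaded from the eval run's JSON artifact.
metrics = {"context_precision": 0.633, "context_recall": 0.706}  # the k=1 run
failures = gate_failures(metrics, THRESHOLDS)
for msg in failures:
    print("GATE FAIL:", msg)
# A non-zero process exit (e.g. sys.exit(1) when failures is non-empty)
# is what actually fails the PR check.
```

The useful property is that the gate is a pure function of the metrics artifact, so the same check runs identically locally and in CI.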
*Scores degrade left to right, by design. Cost per eval run: $0.0234.*
> Precision and recall are gated because they showed the strongest separation between configs. Faithfulness was too stable (0.95 vs 0.90) to detect the regression. Relevancy's delta was too small (0.018), and the metric is too coarse: a useless answer can score 1.0 if it's merely on-topic.
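The metric-selection argument can be checked directly: compare the baseline row against the deliberately degraded row, metric by metric (values taken from the results table; the "pick the widest separation" rule is a sketch of the reasoning, not a stated EVALens algorithm):

```python
# Scores from the baseline (k=4) and degraded (k=1) configs.
baseline = {"precision": 0.711, "recall": 0.834, "faithfulness": 0.950, "relevancy": 0.888}
degraded = {"precision": 0.633, "recall": 0.706, "faithfulness": 0.900, "relevancy": 0.870}

# Separation between a healthy config and a deliberately broken one:
# a metric only makes a useful gate if this gap is wide enough to detect.
deltas = {m: round(baseline[m] - degraded[m], 3) for m in baseline}
print(deltas)  # precision and recall drop hardest; relevancy barely moves
```

Recall separates widest (0.128), precision next (0.078), while relevancy moves only 0.018, which is why the first two are gated and the last is not.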
| Config | retrieval_k | Chunk size | Precision | Recall | Faithfulness | Relevancy | Gate |
|---|---|---|---|---|---|---|---|
| Baseline | 4 | 1000 | 0.711 | 0.834 | 0.950 | 0.888 | PASS |
| Degraded retrieval | 1 | 1000 | 0.633 | 0.706 | 0.900 | 0.870 | FAIL |
| Reduced chunking | 4 | 200 | 0.736 | 0.839 | 0.988 | 0.806 | PASS |
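The Gate column above can be reproduced with fixed floors on the two gated metrics (the floor values here are hypothetical, chosen only so the three rows sort the same way; the document doesn't state the actual thresholds):

```python
# Hypothetical gate floors for the two gated metrics.
FLOORS = {"precision": 0.68, "recall": 0.78}

# Precision/recall scores from the three table rows.
CONFIGS = {
    "baseline_k4_chunk1000": {"precision": 0.711, "recall": 0.834},
    "degraded_k1_chunk1000": {"precision": 0.633, "recall": 0.706},
    "reduced_k4_chunk200":   {"precision": 0.736, "recall": 0.839},
}

def gate(scores: dict) -> str:
    """PASS only if every gated metric meets its floor."""
    ok = all(scores[m] >= floor for m, floor in FLOORS.items())
    return "PASS" if ok else "FAIL"

for name, scores in CONFIGS.items():
    print(name, gate(scores))
```

Note that the reduced-chunking config passes despite its relevancy dropping from 0.888 to 0.806: gating on relevancy would have blocked a change that improved every other metric.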