LLM Evaluation as a Production Decision Layer
The strategic shift is from "does this LLM output score well?" to "what should the system do with this output before a user sees it?"
Core Idea
Most LLM evaluation is framed as scoring. Production AI needs routing:
- ACCEPT: serve the response.
- REVIEW: route to a human or retry with a safer prompt.
- REJECT: block or regenerate because the output is unsafe or unsupported.
A useful eval system returns a structured payload, not just a number:
    {
      "decision": "ACCEPT|REVIEW|REJECT",
      "failure_type": "hallucination|weak_grounding|poor_retrieval|off_topic|uncertain|none",
      "next_action": "serve_response|retry|retrieve_more|regenerate|human_review",
      "reason": "plain-English explanation",
      "scores": {
        "attribution": 0.0,
        "specificity": 0.0,
        "relevance": 0.0,
        "context_quality": 0.0,
        "disagreement": 0.0
      }
    }
The valuable abstraction is decision + reason + action. Scores are diagnostics, not the product.
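A minimal sketch of how that payload could be typed in Python, so that downstream routing code works with named fields rather than loose dictionaries. The class names `Decision` and `EvalResult` and the `to_json` helper are illustrative assumptions, not part of any referenced implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict


class Decision(str, Enum):
    """The three routing outcomes described above."""
    ACCEPT = "ACCEPT"
    REVIEW = "REVIEW"
    REJECT = "REJECT"


@dataclass
class EvalResult:
    """Hypothetical container mirroring the JSON payload."""
    decision: Decision
    failure_type: str   # e.g. "hallucination", "weak_grounding", "none"
    next_action: str    # e.g. "serve_response", "regenerate", "human_review"
    reason: str         # plain-English explanation
    scores: Dict[str, float] = field(default_factory=dict)  # diagnostics only

    def to_json(self) -> str:
        payload = asdict(self)
        payload["decision"] = self.decision.value  # emit the plain string
        return json.dumps(payload)


result = EvalResult(
    decision=Decision.REJECT,
    failure_type="hallucination",
    next_action="regenerate",
    reason="Specific claims are not supported by the supplied context.",
    scores={"attribution": 0.1, "specificity": 0.9},
)
print(result.to_json())
```

Callers branch on `decision` and `next_action`; the `scores` dictionary is kept for logging and debugging.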
Attribution × Specificity Matrix
The most useful mental model from the TDS eval-layer article is splitting faithfulness into two independent signals:
- Attribution: is the answer supported by the supplied context?
- Specificity: is the answer concrete and detailed?
| Attribution | Specificity | Meaning | Action |
|---|---|---|---|
| high | high | grounded and useful | accept |
| high | low | true but thin | retry for detail |
| low | low | vague / weakly supported | review or improve retrieval |
| low | high | confident hallucination | reject/regenerate |
The dangerous production failure is low attribution + high specificity: a polished, specific, unsupported answer. This is what human skim-review misses and what single-score evals can accidentally pass.
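The matrix can be sketched as a small routing function. This assumes both signals are already scored in the range 0 to 1 and uses a single placeholder cut-off of 0.5 to separate "low" from "high"; the function name `route` and the threshold are assumptions that would need calibrating per domain.

```python
from typing import Tuple


def route(attribution: float, specificity: float,
          threshold: float = 0.5) -> Tuple[str, str]:
    """Map the two signals to (decision, next_action) per the matrix.

    The 0.5 threshold is a placeholder, not a recommended value.
    """
    grounded = attribution >= threshold
    specific = specificity >= threshold
    if grounded and specific:
        return "ACCEPT", "serve_response"   # grounded and useful
    if grounded:
        return "REVIEW", "retry"            # true but thin: retry for detail
    if specific:
        return "REJECT", "regenerate"       # confident hallucination
    return "REVIEW", "retrieve_more"        # vague / weakly supported


print(route(0.2, 0.9))
```

A single blended score such as the mean of the two signals would give 0.55 for this input and could pass a 0.5 gate, which is the failure the matrix is designed to catch.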
Why This Matters for Analytics Engineering
This is the LLM equivalent of data quality testing:
| Analytics Engineering | LLM System |
|---|---|
| dbt tests | eval cases |
| source freshness | context freshness |
| model contracts | answer contracts |
| CI failure on bad data | deploy block on quality regression |
| observability dashboards | accept/review/reject + failure taxonomy |
The career/consultancy framing: AI outputs are data products. They need tests, contracts, regression gates, and observability before they are trusted in business workflows.
This connects directly to [[Data-Quality-and-Governance]] and [[AE-Consultancy-Delivery]].
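To make the "deploy block on quality regression" row concrete, here is a hedged sketch of a regression gate: it compares the decisions a candidate system produces against labelled expectations and blocks the deploy if agreement drops too far below a stored baseline. The function names, the `max_drop` tolerance, and the sample labels are all illustrative assumptions.

```python
from typing import List


def pass_rate(expected: List[str], actual: List[str]) -> float:
    """Fraction of eval cases where the produced decision matches the label."""
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and equal length")
    return sum(e == a for e, a in zip(expected, actual)) / len(expected)


def allow_deploy(baseline_rate: float, candidate_rate: float,
                 max_drop: float = 0.02) -> bool:
    """Return False when the candidate regresses by more than max_drop."""
    return candidate_rate >= baseline_rate - max_drop


# Made-up labels for four eval cases, purely for illustration.
expected = ["ACCEPT", "REJECT", "REVIEW", "ACCEPT"]
candidate = ["ACCEPT", "REJECT", "ACCEPT", "ACCEPT"]

rate = pass_rate(expected, candidate)
print(rate, allow_deploy(baseline_rate=0.9, candidate_rate=rate))
```

In a CI job, a `False` result would typically translate to a non-zero exit code, in the same way a failing dbt test stops a pipeline run.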
Consultancy Offer: LLM Quality Gate Sprint
Positioning:
We do not just build a chatbot. We install the quality gates that stop unsupported AI outputs reaching users.
Discovery Questions
- What outputs would cause business harm if confidently wrong?
- What context should each answer be grounded in?
- What is the expected action for failures: retry, retrieve more, block, or human review?
- Do you have labelled examples of good, bad, and borderline answers?
- Which changes should trigger regression tests: prompt edits, model swaps, retriever changes, knowledge-base updates?
Delivery Pattern
- Define answer contract and failure taxonomy.
- Build 30–100 labelled eval cases.
- Implement cheap heuristic gates for deterministic first-pass checks.
- Add LLM-as-judge only for the uncertain middle band.
- Wire regression suite into CI/CD.
- Publish an eval dashboard: accept/review/reject distribution, failure types, regression deltas.
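The heuristic-gate and LLM-as-judge steps can be sketched as a two-stage cascade. This assumes the cheap heuristics have already been collapsed into one score between 0 and 1, and that `judge` is a caller-supplied function wrapping whatever LLM call is in use; the band edges `low` and `high` are placeholders, not tuned values.

```python
from typing import Callable


def decide(heuristic_score: float, judge: Callable[[], str],
           low: float = 0.3, high: float = 0.8) -> str:
    """Resolve clear cases deterministically; escalate only the middle band.

    `judge` is a hypothetical zero-argument callable that returns
    "ACCEPT", "REVIEW", or "REJECT". It is invoked only when the
    heuristic score falls in [low, high).
    """
    if heuristic_score >= high:
        return "ACCEPT"
    if heuristic_score < low:
        return "REJECT"
    return judge()


def stub_judge() -> str:
    # Stand-in for a real LLM-as-judge call.
    return "REVIEW"


for score in (0.95, 0.5, 0.1):
    print(score, decide(score, stub_judge))
```

Passing the judge as a callable keeps the expensive call lazy, so the cost of the judge scales with the size of the uncertain band rather than with total traffic.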
This is especially strong for clients moving from GenAI prototype to production deployment.
Hermes Application
Candidate Hermes surfaces:
- Tech/news briefings: verify summaries against source extracts before delivery.
- Morning briefings: stop unsupported claims from calendar/email/news context.
- Wiki promotion: ensure durable knowledge pages are grounded in source research.
- Cron summarisation: attach eval decision payloads to generated output logs.
Implementation principle: follow the deterministic-first pattern. Do not call an LLM judge for every output. Use source extraction + local checks first, escalate only for borderline cases.
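One possible shape for a local first-pass check, assuming summaries and source extracts are plain text: flag any summary sentence whose tokens mostly do not appear in the sources. The function names and the `min_overlap` value of 0.6 are illustrative assumptions, and simple token overlap is only a crude proxy for grounding.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(summary: str, sources: List[str],
                          min_overlap: float = 0.6) -> List[str]:
    """Return summary sentences with low token overlap against the sources."""
    source_tokens = set().union(*(tokens(s) for s in sources))
    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        sentence_tokens = tokens(sentence)
        if not sentence_tokens:
            continue
        overlap = len(sentence_tokens & source_tokens) / len(sentence_tokens)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


sources = ["Acme released version 2.0 of its database on Monday."]
summary = ("Acme released version 2.0 of its database. "
           "The release doubles query speed for all customers.")
print(unsupported_sentences(summary, sources))
```

Sentences flagged here are candidates for escalation, not automatic rejections. A check like this cannot see an answer that reuses the source's words while changing their meaning, so it belongs in front of a stronger check rather than in place of one.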
This links to [[Agentic-Engineering-Patterns]] and [[AI-Agents-in-Data-Engineering]].
Important Caveats
The referenced repo is best treated as a pattern, not a drop-in standard.
- Reported benchmark set is tiny; five cases are not enough to trust thresholds.
- Thresholds are domain-specific and must be calibrated per client/context.
- Local heuristics can miss semantic drift where the answer uses the same words but changes the meaning.
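- The inspected TF-IDF fallback implementation appears weak: it builds a separate vector for each text without a shared vocabulary, so the vector dimensions do not line up and the resulting similarity is hard to interpret. Use a proper embedding model or a vectorizer fitted on both texts.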
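To illustrate the shared-vocabulary point, here is a minimal pure-Python sketch, not the referenced repo's code: all texts are vectorised against one sorted vocabulary, so index `i` refers to the same word in every vector and cosine similarity is well defined. Whitespace tokenisation and the smoothed IDF formula are simplifying assumptions.

```python
import math
from collections import Counter
from typing import List


def tfidf_vectors(texts: List[str]) -> List[List[float]]:
    """Build TF-IDF vectors over a vocabulary shared by all texts."""
    docs = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))  # one vocabulary for every text
    n = len(docs)
    # Smoothed inverse document frequency; df = number of texts containing w.
    idf = {
        w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
        for w in vocab
    }
    return [[d[w] * idf[w] for w in vocab] for d in docs]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vecs = tfidf_vectors(["the cat sat", "the cat sat", "dogs bark loudly"])
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

If each text were vectorised against its own vocabulary instead, position `i` could mean a different word in each vector and the dot product would compare unrelated terms.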
Immediate Actions
- Prototype a lightweight eval decision wrapper on tech-news summaries.
- Define a reusable failure taxonomy for Hermes-generated outputs.
- Convert this into a consultancy package: LLM Quality Gate Sprint plus [[Agentic-Reliability-Consultancy-Offer]].
- Build a small labelled dataset from past good/bad Hermes briefings.
Related
- [[AE-Consultancy-Delivery]]
- [[Data-Quality-and-Governance]]
- [[Agentic-Engineering-Patterns]]
- [[AI-Agents-in-Data-Engineering]]
- [[Agentic-Analytics-Engineering]]