LLM Evaluation as Production Decision Layer

The strategic shift is from "does this LLM output score well?" to "what should the system do with this output before a user sees it?"

Core Idea

Most LLM evaluation is framed as scoring. Production AI needs routing:

ACCEPT — serve the response.
REVIEW — route to human or retry with a safer prompt.
REJECT — block/regenerate because the output is unsafe or unsupported.

A useful eval system returns a structured payload, not just a number:

{
  "decision": "ACCEPT|REVIEW|REJECT",
  "failure_type": "hallucination|weak_grounding|poor_retrieval|off_topic|uncertain|none",
  "next_action": "serve_response|retry|retrieve_more|regenerate|human_review",
  "reason": "plain-English explanation",
  "scores": {
    "attribution": 0.0,
    "specificity": 0.0,
    "relevance": 0.0,
    "context_quality": 0.0,
    "disagreement": 0.0
  }
}

The valuable abstraction is decision + reason + action. Scores are diagnostics, not the product.
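A minimal sketch of how that payload could be enforced in code, assuming Python and a dataclass named `EvalDecision` (the class name and the validation rules are illustrative, not taken from the referenced repo). It rejects values outside the enumerations listed in the payload above, so downstream routing never sees an unknown decision or action.

```python
import json
from dataclasses import dataclass, field, asdict

# Allowed values copied from the payload shown above.
DECISIONS = {"ACCEPT", "REVIEW", "REJECT"}
FAILURE_TYPES = {"hallucination", "weak_grounding", "poor_retrieval",
                 "off_topic", "uncertain", "none"}
NEXT_ACTIONS = {"serve_response", "retry", "retrieve_more",
                "regenerate", "human_review"}
SCORE_KEYS = ("attribution", "specificity", "relevance",
              "context_quality", "disagreement")


@dataclass(frozen=True)
class EvalDecision:
    """Hypothetical container for one eval result."""
    decision: str
    failure_type: str
    next_action: str
    reason: str
    scores: dict = field(default_factory=dict)

    def __post_init__(self):
        # Fail loudly on anything outside the agreed vocabulary.
        if self.decision not in DECISIONS:
            raise ValueError(f"unknown decision: {self.decision}")
        if self.failure_type not in FAILURE_TYPES:
            raise ValueError(f"unknown failure_type: {self.failure_type}")
        if self.next_action not in NEXT_ACTIONS:
            raise ValueError(f"unknown next_action: {self.next_action}")
        missing = [k for k in SCORE_KEYS if k not in self.scores]
        if missing:
            raise ValueError(f"missing scores: {missing}")

    def to_json(self) -> str:
        # Stable key order makes payloads easy to diff in logs.
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the vocabulary in one place means a new failure type is a deliberate schema change rather than a free-text string that dashboards silently ignore.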

Attribution × Specificity Matrix

The most useful mental model from the TDS eval-layer article is splitting faithfulness into two independent signals:

Attribution — is the answer supported by the supplied context?
Specificity — is the answer concrete and detailed?

Attribution	Specificity	Meaning	Action
high	high	grounded and useful	accept
high	low	true but thin	retry for detail
low	low	vague / weakly supported	review or improve retrieval
low	high	confident hallucination	reject/regenerate

The dangerous production failure is low attribution + high specificity: a polished, specific, unsupported answer. Human skim-review tends to miss it because the detail reads as authoritative, and a single blended score can pass it because high specificity offsets low attribution.
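The matrix can be sketched as a small routing function. This is an illustration only: the function name `route`, the 0.5 cut-offs, and the mapping of "retry for detail" and "review or improve retrieval" onto the REVIEW decision are assumptions, and real thresholds need calibrating per domain.

```python
def route(attribution: float, specificity: float,
          attr_threshold: float = 0.5,
          spec_threshold: float = 0.5) -> tuple[str, str]:
    """Map the two signals to (decision, next_action).

    Thresholds are placeholder values, not calibrated ones.
    """
    grounded = attribution >= attr_threshold
    specific = specificity >= spec_threshold
    if grounded and specific:
        return ("ACCEPT", "serve_response")   # grounded and useful
    if grounded:
        return ("REVIEW", "retry")            # true but thin
    if specific:
        return ("REJECT", "regenerate")       # confident hallucination
    return ("REVIEW", "retrieve_more")        # vague / weakly supported


# A blended score hides the dangerous quadrant:
attribution, specificity = 0.2, 0.95
blended = (attribution + specificity) / 2     # about 0.575, clears a 0.5 pass mark
print(route(attribution, specificity))        # ('REJECT', 'regenerate')
```

Keeping the two signals separate is the point: the averaged score would clear a 0.5 pass mark, while the two-signal route blocks the same answer.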

Why This Matters for Analytics Engineering

This is the LLM equivalent of data quality testing:

Analytics Engineering	LLM System
dbt tests	eval cases
source freshness	context freshness
model contracts	answer contracts
CI failure on bad data	deploy block on quality regression
observability dashboards	accept/review/reject + failure taxonomy

The career/consultancy framing: AI outputs are data products. They need tests, contracts, regression gates, and observability before they are trusted in business workflows.

This connects directly to [[Data-Quality-and-Governance]] and [[AE-Consultancy-Delivery]].

Consultancy Offer: LLM Quality Gate Sprint

Positioning:

We do not just build a chatbot. We install the quality gates that stop unsupported AI outputs reaching users.

Discovery Questions

What outputs would cause business harm if confidently wrong?
What context should each answer be grounded in?
What is the expected action for failures: retry, retrieve more, block, or human review?
Do you have labelled examples of good, bad, and borderline answers?
Which changes should trigger regression tests: prompt edits, model swaps, retriever changes, knowledge-base updates?

Delivery Pattern

Define answer contract and failure taxonomy.
Build 30–100 labelled eval cases.
Implement cheap heuristic gates for deterministic first-pass checks.
Add LLM-as-judge only for the uncertain middle band.
Wire regression suite into CI/CD.
Publish an eval dashboard: accept/review/reject distribution, failure types, regression deltas.
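The regression-suite and dashboard steps above can be sketched as two small helpers. The names `decision_distribution` and `regression_gate`, the agreement metric, and the 0.02 tolerance are assumptions for illustration; a real suite would load labelled cases and a stored baseline from files.

```python
from collections import Counter


def decision_distribution(decisions: list[str]) -> dict[str, float]:
    """Share of ACCEPT / REVIEW / REJECT, suitable for a dashboard tile."""
    if not decisions:
        raise ValueError("no decisions to summarise")
    counts = Counter(decisions)
    total = len(decisions)
    return {d: counts.get(d, 0) / total for d in ("ACCEPT", "REVIEW", "REJECT")}


def regression_gate(expected: list[str], actual: list[str],
                    baseline_agreement: float,
                    tolerance: float = 0.02) -> tuple[bool, float]:
    """Return (passed, agreement) for a labelled eval suite.

    agreement is the fraction of cases where the system's decision
    matches the human label. The gate fails when agreement drops more
    than `tolerance` below the stored baseline.
    """
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and aligned")
    agreement = sum(e == a for e, a in zip(expected, actual)) / len(expected)
    return agreement >= baseline_agreement - tolerance, agreement
```

In a CI job, the caller would exit non-zero when `passed` is false, which blocks the deploy the same way a failing dbt test blocks a model release.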

This offer is strongest for clients moving from a GenAI prototype to production deployment.

Hermes Application

Candidate Hermes surfaces:

Tech/news briefings — verify summaries against source extracts before delivery.
Morning briefings — stop unsupported claims from calendar/email/news context.
Wiki promotion — ensure durable knowledge pages are grounded in source research.
Cron summarisation — attach eval decision payloads to generated output logs.

Implementation principle: follow the deterministic-first pattern. Do not call an LLM judge for every output. Run source extraction and local checks first, and escalate only the borderline cases.
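A minimal sketch of the deterministic-first pattern, assuming a crude token-overlap check as the local heuristic and an injected `judge` callable standing in for the LLM judge. The function names and the 0.3 / 0.8 band edges are hypothetical and would need calibrating on labelled Hermes outputs.

```python
import re
from typing import Callable


def token_overlap(answer: str, source: str) -> float:
    """Fraction of distinct answer tokens that also appear in the source."""
    def tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(source)) / len(answer_tokens)


def gate(answer: str, source: str,
         judge: Callable[[str, str], str],
         low: float = 0.3, high: float = 0.8) -> tuple[str, str]:
    """Return (decision, decided_by).

    Clear cases are settled locally; only the uncertain middle band
    pays for a judge call.
    """
    score = token_overlap(answer, source)
    if score >= high:
        return ("ACCEPT", "heuristic")
    if score <= low:
        return ("REJECT", "heuristic")
    return (judge(answer, source), "judge")
```

Lexical overlap is only a first pass: an answer that swaps "Tuesday" for "Friday" still shares most of its tokens with the source, which is why the middle band goes to a judge rather than being accepted on overlap alone.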

This links to [[Agentic-Engineering-Patterns]] and [[AI-Agents-in-Data-Engineering]].

Important Caveats

The referenced repo is best treated as a pattern, not a drop-in standard.

The reported benchmark set is tiny; five cases are not enough to trust thresholds.
Thresholds are domain-specific and must be calibrated per client/context.
Local heuristics can miss semantic drift where the answer uses the same words but changes the meaning.
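The inspected TF-IDF fallback implementation appears weak because it compares per-text vectors without a shared vocabulary, so the same vector position can stand for different terms in each text and the similarity score loses meaning. Use a proper embedding model or a shared vectorizer.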
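For the TF-IDF caveat, a minimal pure-Python sketch of a shared-vocabulary comparison is shown here. It is not the repo's code: the function name `tfidf_cosine` and the smoothed IDF formula are assumptions chosen for illustration, and fitting IDF on only two texts is itself weak, so a real setup would fit on a larger corpus or use an embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of TF-IDF vectors built over ONE shared vocabulary."""
    docs = [Counter(tokenize(a)), Counter(tokenize(b))]
    # Shared vocabulary: position i means the same term in both vectors.
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    # Smoothed IDF so terms present in both texts keep a non-zero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }
    vectors = [[d[term] * idf[term] for term in vocab] for d in docs]
    dot = sum(x * y for x, y in zip(*vectors))
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if not all(norms):
        return 0.0  # at least one text had no tokens
    return dot / (norms[0] * norms[1])
```

Because both vectors are indexed by the same sorted vocabulary, disjoint texts score exactly zero and identical texts score one, which is the sanity check a per-text vectorizer cannot guarantee.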

Immediate Actions

Prototype a lightweight eval decision wrapper on tech-news summaries.
Define a reusable failure taxonomy for Hermes-generated outputs.
Convert this into a consultancy package: LLM Quality Gate Sprint plus [[Agentic-Reliability-Consultancy-Offer]].
Build a small labelled dataset from past good/bad Hermes briefings.

[[AE-Consultancy-Delivery]]
[[Data-Quality-and-Governance]]
[[Agentic-Engineering-Patterns]]
[[AI-Agents-in-Data-Engineering]]
[[Agentic-Analytics-Engineering]]