LLM Evaluation as a Production Decision Layer

The strategic shift is from "does this LLM output score well?" to "what should the system do with this output before a user sees it?"


Core Idea

Most LLM evaluation is framed as scoring. Production AI needs routing:

  • ACCEPT: serve the response.
  • REVIEW: route to a human or retry with a safer prompt.
  • REJECT: block or regenerate because the output is unsafe or unsupported.

A useful eval system returns a structured payload, not just a number:

{
  "decision": "ACCEPT|REVIEW|REJECT",
  "failure_type": "hallucination|weak_grounding|poor_retrieval|off_topic|uncertain|none",
  "next_action": "serve_response|retry|retrieve_more|regenerate|human_review",
  "reason": "plain-English explanation",
  "scores": {
    "attribution": 0.0,
    "specificity": 0.0,
    "relevance": 0.0,
    "context_quality": 0.0,
    "disagreement": 0.0
  }
}

The valuable abstraction is decision + reason + action. Scores are diagnostics, not the product.
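
A minimal sketch of how the payload above could be typed and validated in Python. The class and constant names (`EvalResult`, `Decision`, `FAILURE_TYPES`, `NEXT_ACTIONS`, `SCORE_KEYS`) are illustrative assumptions, not taken from any referenced repo; only the field names and allowed values come from the JSON shape above.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Decision(str, Enum):
    ACCEPT = "ACCEPT"
    REVIEW = "REVIEW"
    REJECT = "REJECT"


# Allowed values copied from the payload sketch; the constant names are hypothetical.
FAILURE_TYPES = {"hallucination", "weak_grounding", "poor_retrieval",
                 "off_topic", "uncertain", "none"}
NEXT_ACTIONS = {"serve_response", "retry", "retrieve_more",
                "regenerate", "human_review"}
SCORE_KEYS = ("attribution", "specificity", "relevance",
              "context_quality", "disagreement")


@dataclass
class EvalResult:
    decision: Decision
    failure_type: str
    next_action: str
    reason: str
    scores: dict = field(default_factory=dict)

    def __post_init__(self):
        # Coerce plain strings to the enum; raises ValueError on unknown decisions.
        self.decision = Decision(self.decision)
        if self.failure_type not in FAILURE_TYPES:
            raise ValueError(f"unknown failure_type: {self.failure_type!r}")
        if self.next_action not in NEXT_ACTIONS:
            raise ValueError(f"unknown next_action: {self.next_action!r}")
        missing = [k for k in SCORE_KEYS if k not in self.scores]
        if missing:
            raise ValueError(f"missing scores: {missing}")

    def to_json(self) -> str:
        payload = asdict(self)
        payload["decision"] = self.decision.value  # serialise the enum as a plain string
        return json.dumps(payload)
```

Validating at construction time means a malformed payload fails where it is produced, rather than in whatever downstream router consumes it.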


Attribution × Specificity Matrix

The most useful mental model from the TDS eval-layer article is splitting faithfulness into two independent signals:

  • Attribution: is the answer supported by the supplied context?
  • Specificity: is the answer concrete and detailed?

| Attribution | Specificity | Meaning                  | Action                       |
|-------------|-------------|--------------------------|------------------------------|
| high        | high        | grounded and useful      | accept                       |
| high        | low         | true but thin            | retry for detail             |
| low         | low         | vague / weakly supported | review or improve retrieval  |
| low         | high        | confident hallucination  | reject/regenerate            |

The dangerous production failure is low attribution + high specificity: a polished, specific, unsupported answer. This is what human skim-review misses and what single-score evals can accidentally pass.
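
The matrix can be read as a small routing function. The sketch below is one possible encoding, assuming both scores are in the range 0 to 1; the function name `route_by_matrix` and the single 0.5 cut-off are placeholders for illustration and would need calibrating per domain.

```python
def route_by_matrix(attribution: float, specificity: float,
                    threshold: float = 0.5) -> tuple[str, str]:
    """Map the two scores to (decision, next_action).

    The threshold is an assumed placeholder, not a recommended value.
    """
    high_attr = attribution >= threshold
    high_spec = specificity >= threshold

    if high_attr and high_spec:
        return ("ACCEPT", "serve_response")   # grounded and useful
    if high_attr:
        return ("REVIEW", "retry")            # true but thin: retry for detail
    if high_spec:
        return ("REJECT", "regenerate")       # confident hallucination
    return ("REVIEW", "retrieve_more")        # vague / weakly supported
```

Note that a single averaged score would treat `(0.2, 0.9)` and `(0.9, 0.2)` identically, while this function rejects the first and only asks for a retry on the second.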


Why This Matters for Analytics Engineering

This is the LLM equivalent of data quality testing:

| Analytics Engineering    | LLM System                              |
|--------------------------|-----------------------------------------|
| dbt tests                | eval cases                              |
| source freshness         | context freshness                       |
| model contracts          | answer contracts                        |
| CI failure on bad data   | deploy block on quality regression      |
| observability dashboards | accept/review/reject + failure taxonomy |

The career/consultancy framing: AI outputs are data products. They need tests, contracts, regression gates, and observability before they are trusted in business workflows.

This connects directly to [[Data-Quality-and-Governance]] and [[AE-Consultancy-Delivery]].


Consultancy Offer: LLM Quality Gate Sprint

Positioning:

We do not just build a chatbot. We install the quality gates that stop unsupported AI outputs reaching users.

Discovery Questions

  • What outputs would cause business harm if confidently wrong?
  • What context should each answer be grounded in?
  • What is the expected action for failures: retry, retrieve more, block, or human review?
  • Do you have labelled examples of good, bad, and borderline answers?
  • Which changes should trigger regression tests: prompt edits, model swaps, retriever changes, knowledge-base updates?

Delivery Pattern

  1. Define answer contract and failure taxonomy.
  2. Build 30–100 labelled eval cases.
  3. Implement cheap heuristic gates for deterministic first-pass checks.
  4. Add LLM-as-judge only for the uncertain middle band.
  5. Wire regression suite into CI/CD.
  6. Publish an eval dashboard: accept/review/reject distribution, failure types, regression deltas.

This is especially strong for clients moving from GenAI prototype to production deployment.
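
Steps 2, 5, and 6 of the delivery pattern could be sketched as a small regression gate. Everything here is an assumption for illustration: the case format (dicts with `answer`, `context`, and `expected` keys), the `run_regression` name, and the tolerance value. `evaluate` stands for whatever gate function the project actually uses.

```python
from collections import Counter


def run_regression(cases, evaluate, baseline_accuracy, tolerance=0.02):
    """Run labelled eval cases and compare accuracy against a stored baseline.

    cases: list of dicts with 'answer', 'context' and 'expected' keys (assumed format).
    evaluate: callable (answer, context) -> "ACCEPT" | "REVIEW" | "REJECT".
    """
    if not cases:
        raise ValueError("regression suite needs at least one labelled case")

    predictions = [evaluate(c["answer"], c["context"]) for c in cases]
    correct = sum(p == c["expected"] for p, c in zip(predictions, cases))
    accuracy = correct / len(cases)

    return {
        "accuracy": accuracy,
        # Decision distribution, the raw material for the eval dashboard.
        "distribution": dict(Counter(predictions)),
        # Regression delta relative to the last accepted run.
        "delta": accuracy - baseline_accuracy,
        "passed": accuracy >= baseline_accuracy - tolerance,
    }
```

In a CI job, the caller would exit non-zero when `passed` is false, which blocks the deploy in the same way a failing dbt test blocks a data release.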


Hermes Application

Candidate Hermes surfaces:

  • Tech/news briefings: verify summaries against source extracts before delivery.
  • Morning briefings: stop unsupported claims from calendar/email/news context.
  • Wiki promotion: ensure durable knowledge pages are grounded in source research.
  • Cron summarisation: attach eval decision payloads to generated output logs.

Implementation principle: follow the deterministic-first pattern. Do not call an LLM judge for every output. Use source extraction + local checks first, escalate only for borderline cases.
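
A rough sketch of the deterministic-first pattern: a cheap local check decides the clear cases, and an optional judge is called only for the middle band. The token-overlap score is a deliberately crude stand-in for a real attribution check, and the names (`token_overlap`, `gate`) and band edges (0.3 and 0.8) are assumptions, not calibrated values.

```python
import re


def token_overlap(answer: str, context: str) -> float:
    """Fraction of distinct answer tokens that also appear in the context."""
    answer_tokens = set(re.findall(r"[a-z0-9]+", answer.lower()))
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def gate(answer: str, context: str, judge=None,
         low: float = 0.3, high: float = 0.8) -> str:
    """Return ACCEPT / REVIEW / REJECT, escalating only borderline cases.

    judge: optional callable (answer, context) -> decision string,
    e.g. a wrapper around an LLM-as-judge call (hypothetical here).
    """
    score = token_overlap(answer, context)
    if score >= high:
        return "ACCEPT"   # clearly supported by the local check
    if score < low:
        return "REJECT"   # clearly unsupported
    if judge is None:
        return "REVIEW"   # no judge configured: hand off to a human
    return judge(answer, context)  # uncertain middle band only
```

Because the judge is injected as a callable, the same gate can run in tests with a stub and in production with a real model call, and judge cost scales with the size of the middle band rather than with total traffic.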

This links to [[Agentic-Engineering-Patterns]] and [[AI-Agents-in-Data-Engineering]].


Important Caveats

The referenced repo is best treated as a pattern, not a drop-in standard.

  • The reported benchmark set is tiny: five cases are not enough to trust any threshold.
  • Thresholds are domain-specific and must be calibrated per client/context.
  • Local heuristics can miss semantic drift where the answer uses the same words but changes the meaning.
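  • The inspected TF-IDF fallback implementation appears weak: it compares vectors built separately for each text, so the dimensions do not refer to the same terms and the similarity score is unreliable. Use a proper embedding model or a vectorizer fitted on a shared vocabulary.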
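
To illustrate the shared-vocabulary point in the last caveat, here is a standard-library sketch in which the answer and every context document are vectorised over one common term list, so each dimension means the same thing in every vector. This is not the repo's implementation; the function names and the smoothed IDF formula are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(answer: str, context_docs: list[str]) -> list[float]:
    """Cosine similarity between the answer and each context document."""
    docs = [tokenize(answer)] + [tokenize(d) for d in context_docs]

    # One vocabulary and one IDF table shared by every text.
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    def vectorize(doc):
        tf = Counter(doc)
        return [tf[t] * idf[t] for t in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    answer_vec = vectorize(docs[0])
    return [cosine(answer_vec, vectorize(d)) for d in docs[1:]]
```

With scikit-learn available, the equivalent approach is to fit a single `TfidfVectorizer` on the combined texts and transform all of them with it, rather than fitting one vectorizer per text. Either way, this still measures lexical overlap only, so the semantic-drift caveat above continues to apply.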

Immediate Actions

  • Prototype a lightweight eval decision wrapper on tech-news summaries.
  • Define a reusable failure taxonomy for Hermes-generated outputs.
  • Convert this into a consultancy package: LLM Quality Gate Sprint plus [[Agentic-Reliability-Consultancy-Offer]].
  • Build a small labelled dataset from past good/bad Hermes briefings.

  • [[AE-Consultancy-Delivery]]
  • [[Data-Quality-and-Governance]]
  • [[Agentic-Engineering-Patterns]]
  • [[AI-Agents-in-Data-Engineering]]
  • [[Agentic-Analytics-Engineering]]