Catching AI Hallucinations: A Five-Stage Quality Pipeline
Hallucinations aren't an LLM problem, they're a quality-control problem. Here's the 5-stage pipeline that catches, classifies, and contains bad outputs before they reach customers, and the decision rationale behind each stage.

When you're running a multi-agent platform in production, the moment you realize quality isn't optional arrives suddenly.
It was Tuesday morning. One of our agents had delivered a memo to a client. The memo cited a specific court ruling as precedent, very specific, very confident, very detailed. The client's lawyer read it. They wanted to verify the citation. It took about 15 minutes to realize the ruling didn't exist.
The agent had hallucinated case law. Confidently. Completely. And if the client's team had been in a hurry, they might have acted on it.
That's when we stopped asking "How do we trust LLMs?" and started building systems to ensure we never had to.
Why Quality Is The Hardest Problem in Agentic AI
Let's be honest: the difficulty of AI quality isn't the LLMs. LLMs are good at generating text that looks right. They're excellent at it.
The difficulty is that hallucinations look like truth.
An LLM can confidently invent a fact, and the output will be grammatical, coherent, contextually reasonable, and completely wrong. It will look professional. It will pass a casual skim. It will look so true that busy people will act on it.
And once it's in a client email, a financial forecast, or a legal memo, it's in the world. The damage compounds.
This is why agentic AI is fundamentally different from chat. In chat, a hallucination is merely an annoyance: the user notices immediately and corrects it. In agents running production workflows, hallucinations are catastrophic.
The solution isn't smarter models. Smarter models still hallucinate. The solution is systems that make hallucinations visible before they matter.
At BUCC, we call that the Five-Stage Quality Cascade.
The Five-Stage Cascade: Full Architecture
Stage 1: Presidio PII Scan
Every output from every agent gets scanned for personally identifiable information. This is a hard gate. No exceptions. No bypasses.
We use Microsoft Presidio, which is purpose-built for this. Most people think of PII as just email addresses and phone numbers. Presidio goes deeper:
- Identity documents: Social Security numbers, passport numbers, driver's license numbers
- Financial data: Bank account numbers, routing numbers, credit card numbers, tax IDs
- Biometric data: Fingerprints, facial recognition patterns
- Medical data: Health information, drug prescriptions
- Government IDs: Any jurisdiction-specific identifier
An agent generates a memo. Presidio scans it. If it finds PII, the output is immediately blocked. The agent gets feedback:
Output blocked: PII detected.
Entities: SSN (1), BankAccountNumber (2)
Please regenerate without exposing: [specific values]
The agent regenerates. Clean output passes through to Stage 2.
Why is this first? Because PII is non-negotiable. If an agent exposes a client's bank account number, all other quality metrics are irrelevant. We made the design choice: PII-free is a prerequisite, not a suggestion.
Latency: <50ms per output. Negligible.
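In production this stage calls Microsoft Presidio's analyzer. As a self-contained sketch of the gate's contract, here's a regex stand-in; the patterns, entity labels, and message format are simplified illustrations, not Presidio's actual API:

```python
import re
from dataclasses import dataclass

# Minimal stand-in for the Stage 1 gate. Production code would use
# Presidio's AnalyzerEngine; these two regexes just illustrate the
# block-and-feedback contract described above.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "BankAccountNumber": re.compile(r"\b\d{10,12}\b"),
}

@dataclass
class PiiResult:
    blocked: bool
    message: str

def pii_gate(text: str) -> PiiResult:
    # Collect every entity type that matched at least once
    found = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    found = {k: v for k, v in found.items() if v}
    if not found:
        return PiiResult(False, "clean")
    entities = ", ".join(f"{k} ({len(v)})" for k, v in found.items())
    return PiiResult(True, f"Output blocked: PII detected.\nEntities: {entities}")
```

A blocked result carries the entity counts back to the agent, which regenerates without the flagged values.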
Stage 2: Guardrails (Rule-Based)
Now we apply domain-specific rules. These aren't generic guardrails, they're task-specific logic.
Financial agents have guardrails like:
- No numerical claim without an associated confidence interval
- No percentage change without a basis point reference
- No market projection without a date range
Legal agents have guardrails like:
- Statute citations must match the statutory code format
- Case citations must include court + year
- Regulatory references must include CFR or state code
Product agents have guardrails like:
- Features must be checked against the current product matrix before claiming availability
- Pricing claims must be validated against current pricing tables
- Feature roadmap references must be marked as provisional
These are hand-coded checks. They're deterministic. They don't involve LLMs. They're fast and they fail loudly.
When a guardrail fails, it surfaces the specific rule:
Guardrail violation: FINANCIAL_CONFIDENCE_INTERVAL
Claim: "Our market share is 23%"
Requirement: Claims with ±1% variability must include confidence interval
Action: Regenerate with interval (e.g., "23% ±1.2%")
The agent sees exactly what they violated and why. They regenerate with the fix.
Latency: 50-100ms per check.
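A guardrail like FINANCIAL_CONFIDENCE_INTERVAL can be sketched as a deterministic check. The rule name comes from the example above; the regexes and message text are illustrative assumptions:

```python
import re
from typing import Optional

# Match bare percentage claims ("23%") and confidence intervals ("±1.2%")
PERCENT_CLAIM = re.compile(r"\b\d+(?:\.\d+)?%")
INTERVAL = re.compile(r"±\s*\d+(?:\.\d+)?%?")

def check_confidence_interval(text: str) -> Optional[str]:
    """Return a violation message, or None if the guardrail passes."""
    claims = PERCENT_CLAIM.findall(text)
    if claims and not INTERVAL.search(text):
        return (
            "Guardrail violation: FINANCIAL_CONFIDENCE_INTERVAL\n"
            f'Claim: "{claims[0]}"\n'
            'Action: Regenerate with interval (e.g., "23% ±1.2%")'
        )
    return None
```

No LLM in the loop: the check is a pure function of the text, so it's fast, deterministic, and fails loudly.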
Stage 3: Quality Score (Hybrid Rule + LLM)
Here's where we start sampling. Running a full LLM evaluation on every output would be expensive and slow. So we sample.
We evaluate 10% of outputs with a hybrid approach:
Rule-based checks:
- Coherence: Does the output make logical sense?
- Length: Is it within expected bounds?
- Format: Does it match the requested structure?
- Tokenization: Any weird encoding artifacts?
LLM-as-judge checks:
- Readability: Is the language clear and professional?
- Completeness: Does it address all requested elements?
- Tone: Does it match the expected register?
The quality score is 0-100. We track distributions per agent, per task type, per model. This runs asynchronously; it doesn't slow down the critical path.
What matters isn't individual scores. What matters is trends. If your financial agent's average quality score drops from 87 to 71 over a week, something changed. Maybe a new model deployment regressed. Maybe the task changed. Either way, the trend signals a problem.
We publish these to our observability system (Prometheus gauges). The CEO dashboard surfaces them.
Latency: 1-3 seconds per evaluation (async, doesn't block critical path).
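The rule-based half of the score can be sketched as a simple scorer. The bounds, penalty weights, and check names here are illustrative assumptions; the LLM-as-judge half is omitted:

```python
# Sketch of the deterministic Stage 3 sub-checks: start at 100 and
# deduct per failed check. Weights and bounds are illustrative.
def rule_based_quality(text: str, min_len: int = 50, max_len: int = 5000,
                       required_sections: tuple = ()) -> int:
    score = 100
    if not (min_len <= len(text) <= max_len):
        score -= 30      # Length: outside expected bounds
    if any(s not in text for s in required_sections):
        score -= 30      # Format: missing requested structure
    if "\ufffd" in text:
        score -= 40      # Tokenization: encoding artifacts present
    return max(score, 0)
```

Scores feed the per-agent distributions; the individual number matters less than its drift over time.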
Stage 4: Hallucination Detection (Entity Verification + Plausibility + Self-Contradiction)
This is the stage that catches the lies that look true. It's the core of the quality pipeline.
Hallucination detection runs three sub-checks in parallel:
Sub-check A: Entity Verification
Every fact-bearing claim gets decomposed into entities and relationships. We then check if those entities exist in our knowledge base.
Example:
- Output claim: "In Q4 2025, Apple's market cap reached $3.2T."
- Entity decomposition: Apple (Company), Q4 2025 (Time period), $3.2T (Value)
- Knowledge base lookup: Is there a verified fact for "Apple market cap Q4 2025"?
- Result: Yes, verified fact exists, value is $3.1T. MISMATCH DETECTED.
The entity is real. The time period is real. But the value is wrong by 3%. This gets flagged as unverified.
Sub-check B: Plausibility
Does the claim contradict established facts or domain knowledge?
Example:
- Output claim: "Arctic sea ice is increasing at 5% per year due to global warming."
- Domain knowledge: Global warming causes Arctic sea ice to decrease, not increase.
- Result: CONTRADICTION DETECTED.
Plausibility checks run against:
- Established scientific consensus (pulled from structured knowledge bases)
- Domain-specific constraints (e.g., a financial product can't have negative interest rates)
- Logical rules (e.g., "A > B and B > C implies A > C")
Sub-check C: Self-Contradiction
Does the output contradict itself?
Example output:
"Our Q1 revenue was $50M. This represents a 10% year-over-year increase. Last year's Q1 revenue was $55M."
Self-contradiction check:
- Claim 1: Q1 revenue = $50M
- Claim 2: 10% YoY increase
- Claim 3: Last year Q1 = $55M
- Inference: a 10% YoY increase implies last year's Q1 was $50M / 1.1 ≈ $45.5M, not $55M (against a $55M prior year, $50M is in fact a ~9% decrease)
CONTRADICTION DETECTED. The three claims are mutually inconsistent.
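The arithmetic behind Sub-check C can be sketched as a consistency test over the three extracted claims (claim extraction itself is assumed to have happened upstream; the function name and tolerance are illustrative):

```python
# Verify that (current revenue, growth %, prior revenue) are mutually
# consistent: current should equal prior * (1 + growth/100) within a
# small relative tolerance to absorb rounding in the source text.
def revenue_claims_consistent(current: float, growth_pct: float,
                              prior: float, tol: float = 0.01) -> bool:
    implied_prior = current / (1 + growth_pct / 100)
    return abs(implied_prior - prior) / prior <= tol

revenue_claims_consistent(current=50e6, growth_pct=10, prior=55e6)  # → False
```

The example from the memo fails: $50M at 10% growth implies a prior year of roughly $45.5M, not $55M.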
Integration & Tolerance
These three sub-checks produce:
- Entity verification score (0-100)
- Plausibility score (0-100)
- Self-contradiction score (0-100)
We take a weighted average: 40% entity + 35% plausibility + 25% contradiction.
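As a sketch of the aggregation (the weights come from the pipeline above; the inversion, treating sub-scores as verification confidence so the hallucination score is 100 minus the weighted average, is an assumption):

```python
# Weighted aggregation of the three sub-check scores (each 0-100,
# higher = more verified). The hallucination score is expressed as a
# percentage and compared against the 1% threshold.
WEIGHTS = {"entity": 0.40, "plausibility": 0.35, "contradiction": 0.25}
THRESHOLD = 1.0  # percent

def hallucination_score(scores: dict) -> float:
    verified = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return 100.0 - verified

def passes(scores: dict) -> bool:
    return hallucination_score(scores) < THRESHOLD
```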
If the final hallucination score is below 1% (i.e., minimal detectable hallucination), the output passes. At 1% or above, the output is blocked:
Hallucination detected (score: 3.2%, threshold: 1%)
Failed sub-checks:
- Entity verification: "$3.2T" unverified (actual: $3.1T)
- Self-contradiction: Revenue claims inconsistent
Please regenerate addressing:
1. Verify Apple market cap against latest public filings
2. Ensure Q1 revenue, YoY change, and prior-year revenue are consistent
The agent regenerates. This time they're more careful. They verify before claiming.
Why <1%? Because false negatives (missing a hallucination) are worse than false positives (blocking correct output). We accept a higher false positive rate to limit the damage when a hallucination does slip through.
Latency: 500-800ms per output (parallel sub-checks).
Stage 5: LLM Judge (Contextual, Gated to High-Stakes)
The final stage is contextual judgment of the finished output. We use two different models depending on stakes.
Standard LLM Judge: For routine outputs
- Does this memo make sense?
- Is it coherent?
- Does it match the requested format?
- Are there obvious errors?
Standard model: ~$0.002 per 1K tokens.
Premium LLM Judge: For high-stakes outputs
- Client deliverables
- Financial decisions
- Security assessments
- Legal opinions
Premium model: ~$0.01 per 1K tokens. 5x the price, but markedly better at nuanced reasoning.
This stage runs async. It doesn't block critical path. But for high-stakes outputs, we're paying for the best judgment available.
How do we decide stakes? Metadata on the task:
def is_high_stakes(task: Task) -> bool:
    return (
        task.output_classification == "CLIENT_FACING" or
        task.output_classification == "FINANCIAL" or
        task.output_classification == "LEGAL" or
        task.estimated_impact_value > 100_000
    )
High-stakes outputs get premium judge. Routine outputs get standard judge.
Latency: Premium judge adds 2-5 seconds (async, doesn't block).
Confidence Levels and Transparency
Here's something we do that most AI platforms don't: we track confidence alongside every output.
Each stage produces a confidence score:
- Presidio PII scan: 99.5% confidence (very good at detecting PII)
- Guardrails: 100% confidence (deterministic)
- Quality score: 85% confidence (sampling-based)
- Hallucination detection: 87% confidence (entity verification imperfect)
- LLM judge: 92% confidence (high-stakes) / 78% confidence (routine)
We synthesize these into an overall confidence level. Then:
- High confidence (>95%): Ship immediately
- Medium confidence (70-95%): Flag with a note ("This output was generated with standard guardrails. We're 82% confident in its quality.")
- Low confidence (<70%): Block and require regeneration
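The three tiers above can be sketched as a small routing function. The thresholds come straight from the tiers; the function name and return values are illustrative:

```python
# Map an overall confidence percentage to one of the three actions.
def route_by_confidence(confidence_pct: float) -> str:
    if confidence_pct > 95:
        return "ship"    # High confidence: ship immediately
    if confidence_pct >= 70:
        return "flag"    # Medium: ship with a quality note attached
    return "block"       # Low: block and require regeneration

route_by_confidence(82)  # → "flag"
```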
For client deliverables, we strip the confidence metadata before shipping. The client sees a clean output. They don't see our uncertainty. But internally, we know exactly how confident we are. We can trace any issue back to which stage had doubts.
This creates accountability. If a client finds an error and we ship it with 92% confidence, that's a signal that our confidence thresholds are miscalibrated. We can investigate.
The Ultra Instinct: Model Evolution Pipeline
Hallucinations don't appear randomly. They appear because:
- The model hasn't seen enough examples of that task
- The task is genuinely hard (requires reasoning the base model isn't built for)
- The model has conflicting training (one dataset says X, another says Y)
So instead of just blocking hallucinations, we learn from them.
We built a pipeline called Ultra Instinct that automatically improves models based on production hallucinations.
Stage 1: Harvest
Every time hallucination detection catches something, we log it:
{
  "timestamp": "2026-04-05T14:23:00Z",
  "agent_id": "the analyst agent",
  "task_type": "market_research",
  "claim": "Apple market cap reached $3.2T in Q4 2025",
  "detected_error": "Entity mismatch",
  "verified_value": "$3.1T",
  "confidence": 0.87,
  "priority": "medium"
}
We're not logging this as a failure. We're logging it as training data.
Stage 2: Curation
Our Sentinel agent (who owns the quality framework) reviews high-value catches monthly:
- Is this a systematic gap? (Does it happen repeatedly?)
- Is it worth fine-tuning on? (Would 5-10 examples fix it?)
- Is it task-specific or general? (Does it affect multiple agents?)
- What's the estimated impact if we don't fix it?
Sentinel prioritizes them:
- Tier 1 (high impact, high frequency): "Market cap claims frequently off by 0.5-2%"
- Tier 2 (medium impact): "Legal citations sometimes incomplete"
- Tier 3 (low impact): "Punctuation inconsistencies"
Stage 3: Mutation
We create synthetic training examples around each gap.
For "market cap claims frequently off," we might generate:
Input: "What was Apple's market cap in Q4 2025?"
Output: "As of Q4 2025, Apple's market capitalization was $3.1 trillion."
We create variations:
- Different companies
- Different time periods
- Different phrasing
- Different levels of detail
Mutation creates ~20-50 examples per gap. These become training data.
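The variation axes above can be sketched as a templated generator. The templates and entity lists here are illustrative; in practice each generated question must be paired with an answer from a verified source:

```python
import itertools

# Expand one gap ("market cap claims") into templated training variants
# by crossing question templates with companies and time periods.
COMPANIES = ["Apple", "Microsoft"]
PERIODS = ["Q4 2025", "Q1 2026"]
TEMPLATES = [
    "What was {company}'s market cap in {period}?",
    "Report {company}'s market capitalization for {period}.",
]

def mutate() -> list:
    return [
        t.format(company=c, period=p)
        for t, (c, p) in itertools.product(
            TEMPLATES, itertools.product(COMPANIES, PERIODS)
        )
    ]
```

Two templates crossed with two companies and two periods yields eight variants; scaling the lists gets you into the 20-50 range per gap.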
Stage 4: Evaluation
We fine-tune a copy of the base model on these synthetic examples. Then we test it:
- Task-specific tests: Does it handle market cap claims better?
- Regression tests: Did it get worse at other tasks?
- Hallucination re-check: Does it hallucinate on similar claims?
If it passes regression tests and improves on the target task, it's a candidate for deployment.
Stage 5: Rollback
If it regresses (e.g., fine-tuning on market cap claims makes it worse at legal citations), we don't deploy. We investigate:
- Was the training data too noisy?
- Did we overfit?
- Should we try a different approach?
The Quality-Governance Feedback Loop
This is where it all connects.
Hallucinations in Production
↓
Hallucination Detection (Stage 4)
↓
Block + Alert + Log
↓
Sentinel Reviews (Monthly)
↓
Identify High-Impact Gaps
↓
Create Training Data (Mutation)
↓
Fine-Tune Models (Evaluation)
↓
Deploy Improved Model
↓
Fewer Hallucinations in Production
↓
Cycle repeats
This is governance as a flywheel. Not punishment-based ("don't hallucinate"). Learning-based ("learn from hallucinations").
Lessons Learned: The Hard Conversations
False Positives Are Expensive
A 5% false positive rate doesn't sound bad until you do the math.
If 1,000 outputs come through per day, a 5% FP rate means 50 false positives. Each one has to be regenerated. That's 50 rounds of wasted tokens and 50 clients waiting longer than they should.
We spent months tuning our thresholds to get to ~2% FP. At that rate, 20 false positives per 1000 outputs is noticeable but manageable.
The hard conversation: we'd rather let a few hallucinations through (maybe 0.5%) than block correct outputs unnecessarily. This is an asymmetric choice. But it's the right one for production systems.
Latency Is A Feature
Stages 1-4 run in <500ms total. Stage 5 (LLM judge) adds latency but runs async.
This keeps the critical path responsive. An agent generates output, and the PII scan, guardrails, quality score, and hallucination check all complete within 500ms. The client gets their result fast. The LLM judge happens in the background.
If we made everything synchronous, client latency would double or triple. We'd ship fewer outputs per day. The quality improvements wouldn't be worth it.
Confidence Thresholds Need Calibration
When we first deployed hallucination detection, we set the threshold at 3% (allowing more hallucination through). We got customer complaints about errors.
We lowered it to 0.5%. Suddenly we were blocking too many outputs, clients waited longer, agents regenerated constantly.
We settled at 1%. This required ~100 days of production data to calibrate. You can't get this right in testing. You need real outputs, real mistakes, real feedback.
High-Stakes Outputs Justify Premium Models
We use the premium LLM judge for ~5% of outputs. It's expensive, about $0.01 per 1K tokens vs $0.002 for standard.
But for a client deliverable that influences a $500K decision, an extra 0.5 seconds of judgment and $0.005 of cost is trivial. It's insurance.
For a routine internal memo, standard judge is fine.
Implementation Notes
The quality pipeline is implemented as a FastAPI middleware. Every agent output passes through before being returned to the client.
async def quality_pipeline(output: AgentOutput) -> QualityResult:
    # Stage 1
    pii_result = await presidio_scan(output.text)
    if pii_result.has_pii:
        return block(f"PII detected: {pii_result.entities}")

    # Stage 2
    guardrail_result = await check_guardrails(output.task_type, output.text)
    if guardrail_result.failed:
        return block(guardrail_result.reason)

    # Stage 3 (async, non-blocking)
    asyncio.create_task(log_quality_score(output))

    # Stage 4
    hallucination_result = await hallucination_detection(output.text)
    if hallucination_result.score > HALLUCINATION_THRESHOLD:
        return block(hallucination_result.details)

    # Stage 5 (async, non-blocking for routine; sync for high-stakes)
    if is_high_stakes(output.task):
        judge_result = await premium_judge(output.text)
        if not judge_result.approved:
            return block(judge_result.reason)
    else:
        asyncio.create_task(standard_judge(output.text))

    return success(output)
The pipeline is instrumented with Prometheus metrics:
- quality_pipeline_stage_duration_seconds (histogram per stage)
- quality_pipeline_blocks_total (counter per reason)
- hallucination_score_distribution (histogram)
- false_positive_rate (gauge)
The Bigger Picture
Quality isn't a feature. It's the foundation of trustworthy agentic AI.
Without it, AI agents are expensive toys that generate credible-looking nonsense. With it, they're trustworthy systems that generate correct outputs with measurable confidence.
The five-stage cascade is our answer. It's not perfect, no system is. But it works. Hallucinations still happen. But they get caught before they reach production.
And the best part? Every hallucination we catch makes the system smarter.
What's Next
We're exploring:
- Knowledge base integration: Richer entity verification by connecting to live knowledge graphs (DBpedia, Wikidata)
- Causal reasoning checks: Not just "does this contradict facts" but "does this break causal relationships"
- Agent-specific models: Fine-tuning quality detection for domain-specific agents (financial agents, legal agents, product agents)
- Client feedback loops: Letting clients flag hallucinations we missed, feeding those back into Ultra Instinct
The quality pipeline is never done. It improves continuously because hallucinations are an adversarial problem. As models get better, they find new ways to be wrong.
That's fine. Our infrastructure will keep up.
This is Day 7 of the BUCC builder's journal. Next week: the complete system comes together as we move to governance, accountability, and the systems that make agents trustworthy at scale.
Further reading & standards
The choices in this post map directly onto published frameworks and regulations. If you're building against the same constraints, these are the primary sources:
- OWASP LLM09, Overreliance. Why an automated quality pipeline has to flag what humans should still check. (owasp.org/www-project-top-10-for-large-language-model-applications)
- NIST AI RMF, MEASURE function (MS-2, MS-3). Continuous measurement of control effectiveness, the spec behind scorecards, dashboards, and trendlines. (nist.gov/itl/ai-risk-management-framework)
- EU AI Act, Article 14 (human oversight). High-risk AI must be designed so humans can effectively prevent or minimise risks. (artificialintelligenceact.eu)
- EU AI Act, Article 15 (accuracy, robustness, cybersecurity). Technical solutions to address AI-specific vulnerabilities are mandatory for high-risk systems. (artificialintelligenceact.eu)
Read the rest of the series
- Day 1: Running 25 AI agents in production
- Day 2: Governance, not guardrails
- Day 3: Persistent agent memory
- Day 4: The Data Sanitization Proxy
- Day 5: The agent provisioning pipeline
- Day 6: Three-layer LLM routing
- Day 7: Catching AI hallucinations (you are here)
- Bonus: Agent ACL framework
- Bonus: Agent wallets & DAO governance
- Bonus: BlackOffice video pipeline
- Bonus: Control Debt Scoring