Catching AI Hallucinations: A Five-Stage Quality Pipeline
Hallucinations aren't an LLM problem, they're a quality-control problem. Here's the 5-stage pipeline that catches, classifies, and contains bad outputs before they reach customers, and the decision rationale behind each stage.

When you're running a multi-agent platform in production, the moment you realize quality isn't optional arrives suddenly.
It was Tuesday morning. One of our agents had delivered a memo to a client. The memo cited a specific court ruling as precedent, very specific, very confident, very detailed. The client's lawyer read it. They wanted to verify the citation. It took about 15 minutes to realize the ruling didn't exist.
The agent had hallucinated case law. Confidently. Completely. And if the client's team had been in a hurry, they might have acted on it.
That's when we stopped asking "How do we trust LLMs?" and started building systems to ensure we never had to.
Why Quality Is The Hardest Problem in Agentic AI
Let's be honest: the difficulty of AI quality isn't the LLMs. LLMs are good at generating text that looks right. They're excellent at it.
The difficulty is that hallucinations look like truth.
An LLM can confidently invent a fact, and the output will be grammatical, coherent, contextually reasonable, and completely wrong. It will look professional. It will pass a casual skim. It will look so true that busy people will act on it.
And once it's in a client email, a financial forecast, or a legal memo, it's in the world. The damage compounds.
This is why agentic AI is fundamentally different from chat. In chat, a hallucination is merely an annoyance: the user notices immediately and corrects it. In agents running production workflows, hallucinations are catastrophic.
The solution isn't smarter models. Smarter models still hallucinate. The solution is systems that make hallucinations visible before they matter.
At BUCC, we call that the Five-Stage Quality Cascade.
The Five-Stage Cascade: Full Architecture
Stage 1: Presidio PII Scan
Every output from every agent gets scanned for personally identifiable information. This is a hard gate. No exceptions. No bypasses.
We use Microsoft Presidio, which is purpose-built for this. Most people think of PII as just email addresses and phone numbers. Presidio goes deeper:
- Identity documents: Social Security numbers, passport numbers, driver's license numbers
- Financial data: Bank account numbers, routing numbers, credit card numbers, tax IDs
- Biometric data: Fingerprints, facial recognition patterns
- Medical data: Health information, drug prescriptions
- Government IDs: Any jurisdiction-specific identifier
An agent generates a memo. Presidio scans it. If it finds PII, the output is immediately blocked. The agent gets feedback:
Output blocked: PII detected.
Entities: SSN (1), BankAccountNumber (2)
Please regenerate without exposing: [specific values]
The agent regenerates. Clean output passes through to Stage 2.
Why is this first? Because PII is non-negotiable. If an agent exposes a client's bank account number, all other quality metrics are irrelevant. We made the design choice: PII-free is a prerequisite, not a suggestion.
Latency: <50ms per output. Negligible.
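In production this stage calls Microsoft Presidio's analyzer. As a self-contained sketch of the gate's contract, here's a regex stand-in; the patterns, entity labels, and message format are simplified illustrations, not Presidio's actual API:

```python
import re
from dataclasses import dataclass

# Minimal stand-in for the Stage 1 gate. Production code would use
# Presidio's AnalyzerEngine; these two regexes just illustrate the
# block-and-feedback contract described above.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "BankAccountNumber": re.compile(r"\b\d{10,12}\b"),
}

@dataclass
class PiiResult:
    blocked: bool
    message: str

def pii_gate(text: str) -> PiiResult:
    # Collect every entity type that matched at least once
    found = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    found = {k: v for k, v in found.items() if v}
    if not found:
        return PiiResult(False, "clean")
    entities = ", ".join(f"{k} ({len(v)})" for k, v in found.items())
    return PiiResult(True, f"Output blocked: PII detected.\nEntities: {entities}")
```

A blocked result carries the entity counts back to the agent, which regenerates without the flagged values.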
Stage 2: Guardrails (Rule-Based)
Now we apply domain-specific rules. These aren't generic guardrails, they're task-specific logic.
Financial agents have guardrails like:
- No numerical claim without an associated confidence interval
- No percentage change without a basis point reference
- No market projection without a date range
Legal agents have guardrails like:
- Statute citations must match the statutory code format
- Case citations must include court + year
- Regulatory references must include CFR or state code
Product agents have guardrails like:
- Features must be checked against the current product matrix before claiming availability
- Pricing claims must be validated against current pricing tables
- Feature roadmap references must be marked as provisional
These are hand-coded checks. They're deterministic. They don't involve LLMs. They're fast and they fail loudly.
When a guardrail fails, it surfaces the specific rule:
Guardrail violation: FINANCIAL_CONFIDENCE_INTERVAL
Claim: "Our market share is 23%"
Requirement: Claims with ±1% variability must include confidence interval
Action: Regenerate with interval (e.g., "23% ±1.2%")
The agent sees exactly what they violated and why. They regenerate with the fix.
Latency: 50-100ms per check.
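A guardrail like FINANCIAL_CONFIDENCE_INTERVAL can be sketched as a deterministic check. The rule name comes from the example above; the regexes and message text are illustrative assumptions:

```python
import re
from typing import Optional

# Match bare percentage claims ("23%") and confidence intervals ("±1.2%")
PERCENT_CLAIM = re.compile(r"\b\d+(?:\.\d+)?%")
INTERVAL = re.compile(r"±\s*\d+(?:\.\d+)?%?")

def check_confidence_interval(text: str) -> Optional[str]:
    """Return a violation message, or None if the guardrail passes."""
    claims = PERCENT_CLAIM.findall(text)
    if claims and not INTERVAL.search(text):
        return (
            "Guardrail violation: FINANCIAL_CONFIDENCE_INTERVAL\n"
            f'Claim: "{claims[0]}"\n'
            'Action: Regenerate with interval (e.g., "23% ±1.2%")'
        )
    return None
```

No LLM in the loop: the check is a pure function of the text, so it's fast, deterministic, and fails loudly.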
Stage 3: Quality Score (Hybrid Rule + LLM)
Here's where we start sampling. Running a full LLM evaluation on every output would be expensive and slow. So we sample.
We evaluate 10% of outputs with a hybrid approach:
Rule-based checks:
- Coherence: Does the output make logical sense?
- Length: Is it within expected bounds?
- Format: Does it match the requested structure?
- Tokenization: Any weird encoding artifacts?
LLM-as-judge checks:
- Readability: Is the language clear and professional?
- Completeness: Does it address all requested elements?
- Tone: Does it match the expected register?
The quality score is 0-100. We track distributions per agent, per task type, per model. This runs asynchronously; it doesn't slow down the critical path.
What matters isn't individual scores. What matters is trends. If your financial agent's average quality score drops from 87 to 71 over a week, something changed. Maybe a new model deployment regressed. Maybe the task changed. Either way, the trend signals a problem.
We publish these to our observability system (Prometheus gauges). The CEO dashboard surfaces them.
Latency: 1-3 seconds per evaluation (async, doesn't block critical path).
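The rule-based half of the score can be sketched as a simple scorer. The bounds, penalty weights, and check names here are illustrative assumptions; the LLM-as-judge half is omitted:

```python
# Sketch of the deterministic Stage 3 sub-checks: start at 100 and
# deduct per failed check. Weights and bounds are illustrative.
def rule_based_quality(text: str, min_len: int = 50, max_len: int = 5000,
                       required_sections: tuple = ()) -> int:
    score = 100
    if not (min_len <= len(text) <= max_len):
        score -= 30      # Length: outside expected bounds
    if any(s not in text for s in required_sections):
        score -= 30      # Format: missing requested structure
    if "\ufffd" in text:
        score -= 40      # Tokenization: encoding artifacts present
    return max(score, 0)
```

Scores feed the per-agent distributions; the individual number matters less than its drift over time.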
Stage 4: Hallucination Detection (Entity Verification + Plausibility + Self-Contradiction)
This is the stage that catches the lies that look true. It's the core of the quality pipeline.
Hallucination detection runs three sub-checks in parallel:
Sub-check A: Entity Verification
Every fact-bearing claim gets decomposed into entities and relationships. We then check if those entities exist in our knowledge base.
Example:
- Output claim: "In Q4 2025, Apple's market cap reached $3.2T."
- Entity decomposition: Apple (Company), Q4 2025 (Time period), $3.2T (Value)
- Knowledge base lookup: Is there a verified fact for "Apple market cap Q4 2025"?
- Result: Yes, verified fact exists, value is $3.1T. MISMATCH DETECTED.
The entity is real. The time period is real. But the value is wrong by 3%. This gets flagged as unverified.
Sub-check B: Plausibility
Does the claim contradict established facts or domain knowledge?
Example:
- Output claim: "Arctic sea ice is increasing at 5% per year due to global warming."
- Domain knowledge: Global warming causes Arctic sea ice to decrease, not increase.
- Result: CONTRADICTION DETECTED.
Plausibility checks run against:
- Established scientific consensus (pulled from structured knowledge bases)
- Domain-specific constraints (e.g., a financial product can't have negative interest rates)
- Logical rules (e.g., "A > B and B > C implies A > C")
Sub-check C: Self-Contradiction
Does the output contradict itself?
Example output:
"Our Q1 revenue was $50M. This represents a 10% year-over-year increase. Last year's Q1 revenue was $55M."
Self-contradiction check:
- Claim 1: Q1 revenue = $50M
- Claim 2: 10% YoY increase
- Claim 3: Last year Q1 = $55M
- Inference: a 10% YoY increase implies last year's Q1 was $50M / 1.1 ≈ $45.5M, not $55M (against a $55M prior year, $50M is in fact a ~9% decrease)
CONTRADICTION DETECTED. The three claims are mutually inconsistent.
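The arithmetic behind Sub-check C can be sketched as a consistency test over the three extracted claims (claim extraction itself is assumed to have happened upstream; the function name and tolerance are illustrative):

```python
# Verify that (current revenue, growth %, prior revenue) are mutually
# consistent: current should equal prior * (1 + growth/100) within a
# small relative tolerance to absorb rounding in the source text.
def revenue_claims_consistent(current: float, growth_pct: float,
                              prior: float, tol: float = 0.01) -> bool:
    implied_prior = current / (1 + growth_pct / 100)
    return abs(implied_prior - prior) / prior <= tol

revenue_claims_consistent(current=50e6, growth_pct=10, prior=55e6)  # → False
```

The example from the memo fails: $50M at 10% growth implies a prior year of roughly $45.5M, not $55M.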
Integration & Tolerance
These three sub-checks produce:
- Entity verification score (0-100)
- Plausibility score (0-100)
- Self-contradiction score (0-100)
We take a weighted average: 40% entity + 35% plausibility + 25% contradiction.
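As a sketch of the aggregation (the weights come from the pipeline above; the inversion, treating sub-scores as verification confidence so the hallucination score is 100 minus the weighted average, is an assumption):

```python
# Weighted aggregation of the three sub-check scores (each 0-100,
# higher = more verified). The hallucination score is expressed as a
# percentage and compared against the 1% threshold.
WEIGHTS = {"entity": 0.40, "plausibility": 0.35, "contradiction": 0.25}
THRESHOLD = 1.0  # percent

def hallucination_score(scores: dict) -> float:
    verified = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return 100.0 - verified

def passes(scores: dict) -> bool:
    return hallucination_score(scores) < THRESHOLD
```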
If the final hallucination score is below 1% (i.e., minimal detectable hallucination), the output passes. At 1% or above, the output is blocked:
Hallucination detected (score: 3.2%, threshold: 1%)
Failed sub-checks:
- Entity verification: "$3.2T" unverified (actual: $3.1T)
- Self-contradiction: Revenue claims inconsistent
Please regenerate addressing:
1. Verify Apple market cap against latest public filings
2. Ensure Q1 revenue, YoY change, and prior-year revenue are consistent
The agent regenerates. This time they're more careful. They verify before claiming.
Why <1%? Because false negatives (missing a hallucination) are worse than false positives (blocking correct output). We accept a higher false positive rate to limit the damage when a hallucination does slip through.
Latency: 500-800ms per output (parallel sub-checks).
Stage 5: LLM Judge (Contextual, Gated to High-Stakes)
The final stage is contextual judgment of the finished output. We use two different models depending on stakes.
Standard LLM Judge: For routine outputs
- Does this memo make sense?
- Is it coherent?
- Does it match the requested format?
- Are there obvious errors?
Standard model: ~$0.002 per 1K tokens.
Premium LLM Judge: For high-stakes outputs
- Client deliverables
- Financial decisions
- Security assessments
- Legal opinions
Premium model: ~$0.01 per 1K tokens. 5x the price, but markedly better at nuanced reasoning.
This stage runs async. It doesn't block critical path. But for high-stakes outputs, we're paying for the best judgment available.
How do we decide stakes? Metadata on the task:
def is_high_stakes(task: Task) -> bool:
    return (
        task.output_classification == "CLIENT_FACING" or
        task.output_classification == "FINANCIAL" or
        task.output_classification == "LEGAL" or
        task.estimated_impact_value > 100_000
    )
High-stakes outputs get premium judge. Routine outputs get standard judge.
Latency: Premium judge adds 2-5 seconds (async, doesn't block).
Confidence Levels and Transparency
Here's something we do that most AI platforms don't: we track confidence alongside every output.
Each stage produces a confidence score:
- Presidio PII scan: 99.5% confidence (very good at detecting PII)
- Guardrails: 100% confidence (deterministic)
- Quality score: 85% confidence (sampling-based)
- Hallucination detection: 87% confidence (entity verification imperfect)
- LLM judge: 92% confidence (high-stakes) / 78% confidence (routine)
We synthesize these into an overall confidence level. Then:
- High confidence (>95%): Ship immediately
- Medium confidence (70-95%): Flag with a note ("This output was generated with standard guardrails. We're 82% confident in its quality.")
- Low confidence (<70%): Block and require regeneration
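The three tiers above can be sketched as a small routing function. The thresholds come straight from the tiers; the function name and return values are illustrative:

```python
# Map an overall confidence percentage to one of the three actions.
def route_by_confidence(confidence_pct: float) -> str:
    if confidence_pct > 95:
        return "ship"    # High confidence: ship immediately
    if confidence_pct >= 70:
        return "flag"    # Medium: ship with a quality note attached
    return "block"       # Low: block and require regeneration

route_by_confidence(82)  # → "flag"
```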
For client deliverables, we strip the confidence metadata before shipping. The client sees a clean output. They don't see our uncertainty. But internally, we know exactly how confident we are. We can trace any issue back to which stage had doubts.
This creates accountability. If a client finds an error and we ship it with 92% confidence, that's a signal that our confidence thresholds are miscalibrated. We can investigate.
The Ultra Instinct: Model Evolution Pipeline
Hallucinations don't appear randomly. They appear because:
- The model hasn't seen enough examples of that task
- The task is genuinely hard (requires reasoning the base model isn't built for)
- The model has conflicting training (one dataset says X, another says Y)
So instead of just blocking hallucinations, we learn from them.
We built a pipeline called Ultra Instinct that automatically improves models based on production hallucinations.
Stage 1: Harvest
Every time hallucination detection catches something, we log it:
{
  "timestamp": "2026-04-05T14:23:00Z",
  "agent_id": "the analyst agent",
  "task_type": "market_research",
  "claim": "Apple market cap reached $3.2T in Q4 2025",
  "detected_error": "Entity mismatch",
  "verified_value": "$3.1T",
  "confidence": 0.87,
  "priority": "medium"
}
We're not logging this as a failure. We're logging it as training data.
Stage 2: Curation
Our Sentinel agent (who owns the quality framework) reviews high-value catches monthly:
- Is this a systematic gap? (Does it happen repeatedly?)
- Is it worth fine-tuning on? (Would 5-10 examples fix it?)
- Is it task-specific or general? (Does it affect multiple agents?)
- What's the estimated impact if we don't fix it?
Sentinel prioritizes them:
- Tier 1 (high impact, high frequency): "Market cap claims frequently off by 0.5-2%"
- Tier 2 (medium impact): "Legal citations sometimes incomplete"
- Tier 3 (low impact): "Punctuation inconsistencies"
Stage 3: Mutation
We create synthetic training examples around each gap.
For "market cap claims frequently off," we might generate:
Input: "What was Apple's market cap in Q4 2025?"
Output: "As of Q4 2025, Apple's market capitalization was $3.1 trillion."
We create variations:
- Different companies
- Different time periods
- Different phrasing
- Different levels of detail
Mutation creates ~20-50 examples per gap. These become training data.
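The variation axes above can be sketched as a templated generator. The templates and entity lists here are illustrative; in practice each generated question must be paired with an answer from a verified source:

```python
import itertools

# Expand one gap ("market cap claims") into templated training variants
# by crossing question templates with companies and time periods.
COMPANIES = ["Apple", "Microsoft"]
PERIODS = ["Q4 2025", "Q1 2026"]
TEMPLATES = [
    "What was {company}'s market cap in {period}?",
    "Report {company}'s market capitalization for {period}.",
]

def mutate() -> list:
    return [
        t.format(company=c, period=p)
        for t, (c, p) in itertools.product(
            TEMPLATES, itertools.product(COMPANIES, PERIODS)
        )
    ]
```

Two templates crossed with two companies and two periods yields eight variants; scaling the lists gets you into the 20-50 range per gap.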
Stage 4: Evaluation
We fine-tune a copy of the base model on these synthetic examples. Then we test it:
- Task-specific tests: Does it handle market cap claims better?
- Regression tests: Did it get worse at other tasks?
- Hallucination re-check: Does it hallucinate on similar claims?
If it passes regression tests and improves on the target task, it's a candidate for deployment.
Stage 5: Rollback
If it regresses (e.g., fine-tuning on market cap claims makes it worse at legal citations), we don't deploy. We investigate:
- Was the training data too noisy?
- Did we overfit?
- Should we try a different approach?
The Quality-Governance Feedback Loop
This is where it all connects.
Hallucinations in Production
↓
Hallucination Detection (Stage 4)
↓
Block + Alert + Log
↓
Sentinel Reviews (Monthly)
↓
Identify High-Impact Gaps
↓
Create Training Data (Mutation)
↓
Fine-Tune Models (Evaluation)
↓
Deploy Improved Model
↓
Fewer Hallucinations in Production
↓
Cycle repeats
This is governance as a flywheel. Not punishment-based ("don't hallucinate"). Learning-based ("learn from hallucinations").
Lessons Learned: The Hard Conversations
False Positives Are Expensive
A 5% false positive rate doesn't sound bad until you do the math.
If 1,000 outputs come through per day, a 5% FP rate means 50 false positives. Each one has to be regenerated. That's 50 rounds of wasted tokens and 50 clients waiting longer than they should.
We spent months tuning our thresholds to get to ~2% FP. At that rate, 20 false positives per 1000 outputs is noticeable but manageable.
The hard conversation: we'd rather let a few hallucinations through (maybe 0.5%) than block correct outputs unnecessarily. This is an asymmetric choice. But it's the right one for production systems.
Latency Is A Feature
Stages 1-4 run in <500ms total. Stage 5 (LLM judge) adds latency but runs async.
This keeps the critical path responsive. An agent generates output, and the PII scan, guardrails, quality score, and hallucination check all complete within 500ms. The client gets their result fast. The LLM judge happens in the background.
If we made everything synchronous, client latency would double or triple. We'd ship fewer outputs per day. The quality improvements wouldn't be worth it.
Confidence Thresholds Need Calibration
When we first deployed hallucination detection, we set the threshold at 3% (allowing more hallucination through). We got customer complaints about errors.
We lowered it to 0.5%. Suddenly we were blocking too many outputs, clients waited longer, agents regenerated constantly.
We settled at 1%. This required ~100 days of production data to calibrate. You can't get this right in testing. You need real outputs, real mistakes, real feedback.
High-Stakes Outputs Justify Premium Models
We use the premium LLM judge for ~5% of outputs. It's expensive, about $0.01 per 1K tokens vs $0.002 for standard.
But for a client deliverable that influences a $500K decision, an extra 0.5 seconds of judgment and $0.005 of cost is trivial. It's insurance.
For a routine internal memo, standard judge is fine.
Implementation Notes
The quality pipeline is implemented as a FastAPI middleware. Every agent output passes through before being returned to the client.
async def quality_pipeline(output: AgentOutput) -> QualityResult:
    # Stage 1
    pii_result = await presidio_scan(output.text)
    if pii_result.has_pii:
        return block(f"PII detected: {pii_result.entities}")

    # Stage 2
    guardrail_result = await check_guardrails(output.task_type, output.text)
    if guardrail_result.failed:
        return block(guardrail_result.reason)

    # Stage 3 (async, non-blocking)
    asyncio.create_task(log_quality_score(output))

    # Stage 4
    hallucination_result = await hallucination_detection(output.text)
    if hallucination_result.score > HALLUCINATION_THRESHOLD:
        return block(hallucination_result.details)

    # Stage 5 (async, non-blocking for routine; sync for high-stakes)
    if is_high_stakes(output.task):
        judge_result = await premium_judge(output.text)
        if not judge_result.approved:
            return block(judge_result.reason)
    else:
        asyncio.create_task(standard_judge(output.text))

    return success(output)
The pipeline is instrumented with Prometheus metrics:
- quality_pipeline_stage_duration_seconds (histogram per stage)
- quality_pipeline_blocks_total (counter per reason)
- hallucination_score_distribution (histogram)
- false_positive_rate (gauge)
The Bigger Picture
Quality isn't a feature. It's the foundation of trustworthy agentic AI.
Without it, AI agents are expensive toys that generate credible-looking nonsense. With it, they're trustworthy systems that generate correct outputs with measurable confidence.
The five-stage cascade is our answer. It's not perfect, no system is. But it works. Hallucinations still happen. But they get caught before they reach production.
And the best part? Every hallucination we catch makes the system smarter.
What's Next
We're exploring:
- Knowledge base integration: Richer entity verification by connecting to live knowledge graphs (DBpedia, Wikidata)
- Causal reasoning checks: Not just "does this contradict facts" but "does this break causal relationships"
- Agent-specific models: Fine-tuning quality detection for domain-specific agents (financial agents, legal agents, product agents)
- Client feedback loops: Letting clients flag hallucinations we missed, feeding those back into Ultra Instinct
The quality pipeline is never done. It improves continuously because hallucinations are an adversarial problem. As models get better, they find new ways to be wrong.
That's fine. Our infrastructure will keep up.
This is Day 7 of the BUCC builder's journal. Next week: the complete system comes together as we move to governance, accountability, and the systems that make agents trustworthy at scale.
Further reading & standards
The choices in this post map directly onto published frameworks and regulations. If you're building against the same constraints, these are the primary sources:
- OWASP LLM09, Overreliance. Why an automated quality pipeline has to flag what humans should still check. (owasp.org/www-project-top-10-for-large-language-model-applications)
- NIST AI RMF, MEASURE function (MS-2, MS-3). Continuous measurement of control effectiveness, the spec behind scorecards, dashboards, and trendlines. (nist.gov/itl/ai-risk-management-framework)
- EU AI Act, Article 14 (human oversight). High-risk AI must be designed so humans can effectively prevent or minimise risks. (artificialintelligenceact.eu)
- EU AI Act, Article 15 (accuracy, robustness, cybersecurity). Technical solutions to address AI-specific vulnerabilities are mandatory for high-risk systems. (artificialintelligenceact.eu)
Read the rest of the series
- Day 1: Running 25 AI agents in production
- Day 2: Governance, not guardrails
- Day 3: Persistent agent memory
- Day 4: The Data Sanitization Proxy
- Day 5: The agent provisioning pipeline
- Day 6: Three-layer LLM routing
- Day 7: Catching AI hallucinations (you are here)
- Bonus: Agent ACL framework
- Bonus: Agent wallets & DAO governance
- Bonus: BlackOffice video pipeline
- Bonus: Control Debt Scoring