Three-Layer LLM Routing: Cost, Privacy, and Performance Without Trade-offs
L1 local Ollama, L2 subscription APIs, L3 pay-per-token frontier models. The routing layer decides which tier handles each call based on sensitivity, complexity, and cost. Here's the architecture that keeps 25 agents running without burning through a cloud bill.

You're running autonomous agents that need to call language models constantly. Here's what keeps ops engineers up at night:
- Every inference costs money (GPU time or API fees). Your budget is finite.
- Some data is sensitive and can't leave your infrastructure. Privacy is non-negotiable.
- Different tasks need different models. A flagship model wastes resources on simple tasks. A lightweight model hallucinates on complex ones.
These constraints are in tension. Use only local inference? You lose access to frontier models and you bottleneck GPU resources. Use only cloud APIs? Your costs explode and sensitive data leaks. Use only a single model? You overpay for simple work and underwhelm on complex work.
At BUCC, we navigate these tensions with a 3-layer LLM routing system. Every inference request is routed based on data sensitivity, agent policy, and current capacity. The right request goes to the right model, every time.
This post explains how.
The LLM Routing Challenge
Let's ground this in concrete constraints.
Constraint 1: Budget
You have a limited AI budget. At BUCC, it's enough for:
- Running flagship and lightweight models locally on dedicated GPU servers (one-time capex, recurring electricity)
- Monthly quotas with 3-4 subscription API providers (recurring monthly cost, ~$10K/month)
- A small pay-per-token budget for frontier models (default $0, human-approved only)
Every inference costs money. You need a system that optimizes for cost without sacrificing the work that needs doing.
Constraint 2: Privacy
Financial transaction data, legal documents, and customer PII can't leave your infrastructure. Some of it might be covered under GDPR or HIPAA or internal policy. If that data touches a third-party LLM, you've violated compliance.
You need a system that blocks sensitive data from ever reaching cloud APIs.
Constraint 3: Model Fit
Different work needs different models:
- Summarizing a 10-page document? A lightweight model does fine (Qwen 3 14B). Why use the flagship?
- Implementing a complex algorithm? A code-specialized model is better (Devstral 24B). Why use a generalist?
- Generating creative marketing copy? A frontier model might be worth the cost. A lightweight model produces generic output.
You need a system that routes each request to the right model.
If you solve only one constraint, you fail. Solve for cost by using only cheap models, and you get poor-quality output. Solve for quality by using only frontier models, and you blow your budget. Solve for privacy by using only local inference, and you lose access to cutting-edge models.
We solved all three with a 3-layer routing system.
Layer 1: Local Inference (Ollama)
Characteristics:
- Free (GPU hardware already paid for, electricity cost is absorbed)
- Private (all data stays in-house)
- On-premise (no API latency, no third-party dependencies)
- Limited capacity (finite GPU VRAM)
Model Fleet:
- Flagship (Qwen 3.5 122B): Complex reasoning, deep analysis, nuanced writing. Handles the hardest 10% of work.
- Workhorse (Qwen 3 32B): General-purpose tasks, medium complexity. Handles the bulk (~60%) of work.
- Coder (Devstral 24B): Code generation, debugging, technical writing. Specialized for engineering work.
- Lightweight (Qwen 3 14B): High-volume simple work, summarization, classification, template filling. Fast, efficient.
All models run on dedicated GPU servers (Spark3 and Spark4 in our setup). We can run multiple models simultaneously, but there's a ceiling. The flagship model alone uses 50GB+ VRAM. You can't run 30 instances of it.
Capacity Management:
When demand exceeds capacity, requests queue. This creates a natural backpressure signal: "your agents are asking for more than the GPU can provide, either provision more hardware or reduce load."
In practice, we rarely hit this ceiling. Why? Because we have a fallback.
Use Cases:
Layer 1 is mandatory for:
- Financial data (transactions, accounts, budgets)
- Legal data (contracts, agreements, evidence)
- Customer PII (emails, addresses, payment info)
- Any HIGH or CONFIDENTIAL classification
Layer 1 is optimal for:
- Deep reasoning and analysis
- Complex writing and creative work
- Tasks where latency doesn't matter (batch processing, overnight analysis)
Cost Model:
$0 marginal cost per inference (hardware and electricity are sunk costs). Encourages heavy use.
Layer 2: Subscription APIs
Characteristics:
- Moderate cost (~$10K/month for our quotas)
- Third-party hosted (slight latency increase, but acceptable)
- Well-tested models with predictable performance
- Limited quota (monthly caps)
Providers:
- GLM (Zhipu AI's large models, solid reasoning)
- Kimi (long-context documents, analysis)
- MiniMax (cost-effective reasoning)
- Mistral Large (fast, reliable generalist)
We have standing monthly quotas with each provider. The quota resets on the 1st of each month. We track usage in real-time.
Routing Logic:
We use Layer 2 when:
- Data sensitivity is INTERNAL or PUBLIC (not CONFIDENTIAL or HIGH)
- Layer 1 capacity is exhausted or Layer 1 models aren't optimal for the task
- Remaining quota is available
If Layer 1 is available and data sensitivity allows, we prefer Layer 1 (free). If L1 is at capacity, we fall back to L2. If L2 quota is exhausted, we either queue the request or escalate to Layer 3 (with approval).
Quota Management:
Prometheus tracks quota usage with real-time gauges:
- 80% alert: "You've used 80% of this month's quota with 10 days left. Adjust consumption or you'll hit the cap."
- 95% critical alert: "You've used 95% of quota. Stop using Layer 2 until next month unless approved."
- Predicted overspend: "If consumption continues at this rate, you'll exceed budget by $X by month-end."
When we hit 95%, Layer 2 stops accepting new requests. Everything has to fit into Layer 1 (with a queue) or get escalated to Layer 3 (with human approval).
Use Cases:
Layer 2 is optimal for:
- General-purpose work that Layer 1 can do but is at capacity
- PUBLIC or INTERNAL data
- Time-sensitive requests (Layer 2 APIs have lower latency than Layer 1 queue)
- Tasks that need specific model strengths (e.g., Kimi's long-context advantage)
Cost Model:
Fixed monthly quotas. No surprise bills. We budget ~$10K/month and stick to it.
Layer 3: Pay-Per-Token APIs
Characteristics:
- Most expensive ($0.02-0.10 per 1K tokens)
- Latest frontier models (GPT-5, Claude 3.5, etc.)
- Human-approved only (default budget is $0)
- Last resort
Providers:
- OpenAI (GPT-5, most expensive)
- Anthropic (Claude 3.5, reasonable cost)
- Others as needed
Routing Logic:
Layer 3 is invoked when:
- Data is PUBLIC or INTERNAL (HIGH and CONFIDENTIAL are blocked)
- Layer 1 capacity is exhausted
- Layer 2 quota is exhausted
- Agent requests it explicitly (rare)
- Human approves the spend
An agent can't just decide to use Layer 3. It requires explicit approval: "Use GPT-5 for this task, estimated cost $50, approve Y/N?"
A human reviews and either approves (rare, only for truly hard problems) or rejects (most of the time).
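The estimated cost in the approval prompt can be computed from expected token counts before any spend happens. A sketch, with illustrative prices and helper names (real bills depend on actual usage):

```python
def estimate_l3_cost(prompt_tokens: int, expected_output_tokens: int,
                     price_per_1k: float) -> float:
    """Rough pre-approval estimate; actual billing follows real usage."""
    total = prompt_tokens + expected_output_tokens
    return total / 1_000 * price_per_1k

def approval_prompt(model: str, prompt_tokens: int,
                    expected_output_tokens: int, price_per_1k: float) -> str:
    cost = estimate_l3_cost(prompt_tokens, expected_output_tokens, price_per_1k)
    return (f"Use {model} for this task, "
            f"estimated cost ${cost:,.2f}, approve Y/N?")

# 2.4M prompt tokens + 100K expected output at $0.02/1K = $50
print(approval_prompt("GPT-5", 2_400_000, 100_000, 0.02))
```

The human sees a dollar figure, not a token count, which is what makes the approval decision fast.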
Use Cases:
Layer 3 is for edge cases:
- A frontier model is uniquely good at the task
- We need model capabilities beyond our local/subscription stack
- The value of the output justifies the cost
In practice, we use Layer 3 less than 5% of the time. Most work fits into Layer 1/2.
Cost Model:
Pay-per-token, metered. Expensive. This scarcity makes humans careful about approval.
Data Sensitivity and Routing Decisions
Here's the core insight: data sensitivity determines which layers are available.
We classify all data into 4 levels:
PUBLIC (blog posts, published research, marketing copy)
- Can use: All layers (L1, L2, L3)
- Routing strategy: Cost-optimized (prefer cheapest model that does the job)
INTERNAL (meeting notes, internal processes, team documentation)
- Can use: L1, L2
- Routing strategy: Privacy + cost (prefer L1, fall back to L2 if needed)
CONFIDENTIAL (contracts, customer data, strategic plans)
- Can use: L1, L2 (subscription providers we trust)
- Routing strategy: Privacy first (L1 preferred, L2 as fallback only)
HIGH (financial transactions, legal evidence, health data)
- Can use: L1 only
- Routing strategy: Absolute privacy (local inference only, no exceptions)
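The four levels above can be encoded as a default-deny lookup, where the list order doubles as routing preference. A minimal sketch (the dict shape is illustrative):

```python
# Data classification determines which layers are even eligible;
# list order encodes routing preference (first = preferred).
ALLOWED_LAYERS = {
    "HIGH":         ["L1"],              # local only, no exceptions
    "CONFIDENTIAL": ["L1", "L2"],        # L1 preferred, L2 fallback only
    "INTERNAL":     ["L1", "L2"],        # privacy + cost
    "PUBLIC":       ["L2", "L1", "L3"],  # cost-optimized ordering
}

def eligible_layers(classification: str) -> list[str]:
    # Default-deny: unknown classifications get no layers at all.
    return ALLOWED_LAYERS.get(classification, [])

print(eligible_layers("HIGH"))      # ['L1']
print(eligible_layers("UNKNOWN"))   # [] -- fail closed
```

Failing closed on unknown classifications matters: a mislabeled request should queue for review, not default to a cloud API.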
When an agent makes an inference request, it includes the data classification. The routing layer checks:
if data == HIGH:
    route_to = L1
    if L1_capacity_available:
        execute_on_L1()
    else:
        queue_request_with_priority()
elif data == CONFIDENTIAL:
    route_to = L1  # preferred
    if L1_capacity_available:
        execute_on_L1()
    elif L2_quota_available:
        execute_on_L2()
    else:
        queue_request_with_priority()
elif data == INTERNAL:
    route_to = [L1, L2]  # prefer L1
    if L1_capacity_available:
        execute_on_L1()
    elif L2_quota_available:
        execute_on_L2()
    else:
        queue_request_with_priority()
elif data == PUBLIC:
    route_to = [L1, L2, L3]  # optimize for cost
    if L2_quota_available:  # prefer cheap
        execute_on_L2()
    elif L1_capacity_available:
        execute_on_L1()
    elif L3_budget_approved:
        request_human_approval()
        if approved:
            execute_on_L3()
        else:
            queue_request()
This is deterministic. No guessing. No "hope the data doesn't leak." The data classification determines the layer.
Per-Agent Routing Policies
On top of data sensitivity, each agent has a routing policy set during provisioning (remember Day 5?).
Financial Agent Policy:
allowed_layers: [L1]
preferred_model: flagship (Qwen 3.5 122B)
fallback_model: none
max_budget_monthly: $0
data_classifications_allowed: HIGH, CONFIDENTIAL, INTERNAL
The financial agent is L1-only. No subscriptions. No APIs. All financial work happens locally. Why? Because financial data is HIGH sensitivity, and the policy enforces it.
Researcher Agent Policy:
allowed_layers: [L1, L2, L3]
preferred_model: flagship (L1)
fallback_model: Mistral Large (L2)
max_budget_monthly: $500
data_classifications_allowed: PUBLIC, INTERNAL
The researcher can use all layers but defaults to L1 (free). Falls back to L2 if L1 is busy. Can use L3 with approval up to $500/month. Data limit is PUBLIC/INTERNAL (no financial data).
Creative Agent Policy:
allowed_layers: [L1, L2]
preferred_model: Qwen 3 32B (L1)
fallback_model: GLM (L2)
max_budget_monthly: $2000
data_classifications_allowed: PUBLIC, INTERNAL
The creative agent prefers the Qwen 3 32B workhorse on Layer 1 and falls back to GLM on Layer 2 when L1 is busy. Budget cap is $2,000/month for subscriptions.
These policies are set once and immutable. An agent can't override its own policy. You can't accidentally give the financial agent L3 access.
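Immutability is cheap to get in code. A sketch of how a policy like the financial agent's might be represented and enforced (dataclass and field names are illustrative, not our provisioning schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: set at provisioning, never mutated
class AgentPolicy:
    allowed_layers: frozenset
    allowed_classifications: frozenset
    max_budget_monthly: float

FINANCIAL = AgentPolicy(
    allowed_layers=frozenset({"L1"}),
    allowed_classifications=frozenset({"HIGH", "CONFIDENTIAL", "INTERNAL"}),
    max_budget_monthly=0.0,
)

def check_request(policy: AgentPolicy, layer: str, classification: str) -> bool:
    """Both the layer and the data classification must be allowed."""
    return (layer in policy.allowed_layers
            and classification in policy.allowed_classifications)

# Attempting FINANCIAL.max_budget_monthly = 100 raises FrozenInstanceError
print(check_request(FINANCIAL, "L1", "HIGH"))   # True
print(check_request(FINANCIAL, "L2", "HIGH"))   # False -- cloud blocked
```

The frozen dataclass is the "immutable" guarantee in miniature: the agent holds a reference to its policy but cannot rewrite it.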
Provider Registry and Health Monitoring
We maintain a unified registry of all LLM providers (local and cloud):
providers:
  qwen-3.5-122b:
    layer: 1
    vram: 50GB
    latency_p50: 200ms
    latency_p99: 2000ms
    cost: 0
    capacity: 4 concurrent
    health: operational
  qwen-3-32b:
    layer: 1
    vram: 24GB
    latency_p50: 150ms
    latency_p99: 1500ms
    cost: 0
    capacity: 8 concurrent
    health: operational
  mistral-large:
    layer: 2
    quota_monthly: 1000k tokens
    quota_used_this_month: 750k tokens
    quota_remaining: 250k tokens
    cost_per_1k_tokens: $0.001
    latency_p50: 500ms
    latency_p99: 3000ms
    health: operational
  gpt-5-turbo:
    layer: 3
    quota_monthly: unlimited
    quota_used_this_month: $1200
    budget_cap: $5000
    cost_per_1k_tokens: $0.02
    latency_p50: 1000ms
    latency_p99: 5000ms
    health: operational
We monitor health metrics in real-time:
- Availability: Is the provider responding?
- Latency: P50, P99 response times
- Error rate: % of requests failing
- Quota: Used / remaining
If a provider's error rate spikes (e.g., Mistral API goes down), the routing layer automatically shifts requests to the next available layer. If all L2 providers are down, requests fall back to L1 (with queue).
Prometheus alerts notify on-call engineers if provider health degrades.
Priority Queue and Inference Scheduling
When Layer 1 capacity is tight, not all requests are equal. We prioritize:
- EMERGENCY (highest priority): User-facing requests, critical agent decisions, time-sensitive work
- AGENT_WORK: Autonomous agent tasks, routine work
- QUALITY_SAMPLING: Agent self-review, sanity checks, hallucination detection
- TRAINING_EVAL: Model fine-tuning, evals, research
An EMERGENCY request jumps to the front of the queue. A TRAINING_EVAL request waits.
If the flagship model is at capacity with 10 EMERGENCY requests queued and an agent tries to submit a TRAINING_EVAL request, the TRAINING_EVAL waits. When capacity frees, EMERGENCY requests get the slot.
This is exactly like Linux process scheduling. Each request has a priority. The kernel (our routing layer) respects that priority.
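Python's heapq gives you this scheduler in a dozen lines. A sketch with the four tiers mapped to numeric priorities, lower winning (class and tier-dict names are illustrative):

```python
import heapq
import itertools

# Lower number = higher priority, mirroring the tiers above
PRIORITY = {"EMERGENCY": 0, "AGENT_WORK": 1,
            "QUALITY_SAMPLING": 2, "TRAINING_EVAL": 3}

class InferenceQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tiebreak within a tier

    def submit(self, request_id: str, tier: str) -> None:
        heapq.heappush(self._heap,
                       (PRIORITY[tier], next(self._counter), request_id))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

q = InferenceQueue()
q.submit("eval-run", "TRAINING_EVAL")
q.submit("cfo-report", "EMERGENCY")
q.submit("routine-task", "AGENT_WORK")
print(q.next_request())  # cfo-report -- EMERGENCY jumps the queue
```

The monotonic counter matters: within a tier, requests drain first-in-first-out instead of in arbitrary heap order.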
Cost Tracking and Alerting
We track cost at multiple granularities:
Per-Provider:
- GLM quota used this month: 750k/1000k tokens
- Mistral quota used this month: 200k/500k tokens
- OpenAI spend this month: $1,200/$5,000 budget
Per-Agent:
- Financial agent: $0 (L1-only)
- Researcher: $150 (layer 2 subscriptions) + $0 (L1 is free)
- Creative: $200 (layer 2 subscriptions)
Per-Project:
- Q2 Marketing: $300 (L2 subscriptions)
- Security Research: $50 (mostly L1)
Per-Month:
- Total spend: $8,500
- Forecast: $9,200 (if consumption continues at this rate)
Prometheus dashboards visualize all of this. We get alerts:
- 80% threshold: "You've used 80% of Mistral quota with 10 days left."
- 95% threshold: "You've used 95% of Mistral quota. Layer 2 is now accepting emergency requests only."
- Predicted overspend: "At current consumption, you'll exceed monthly budget by $1,200 by month-end. Recommend reducing load."
When we hit 95%, the routing layer becomes conservative. Layer 2 stops accepting new requests except emergencies. Everything else queues on Layer 1. Only human-approved Layer 3 requests proceed.
This creates a hard stop. No surprise bills. No "oops, we spent $30K this month."
Real-World Routing Examples
Example 1: Financial Summary Report
An agent needs to summarize Q2 spending for the CFO.
Input: 500 transactions, account balances, vendor payments (HIGH sensitivity)
Requested by: Financial agent
Data classification: HIGH
Routing decision:
- HIGH data → Layer 1 only
- Check Layer 1 capacity: 2 of 4 flagship slots available
- Route to: Flagship model (Qwen 3.5 122B) on Layer 1
- Execution: 3 minutes
- Cost: $0
Result: Detailed, accurate summary. All data stayed local. Zero cost.
Example 2: Market Research Analysis
An agent needs to analyze public market trends to inform Q3 budgeting.
Input: 200 web articles, market reports (PUBLIC sensitivity)
Requested by: Researcher agent
Data classification: PUBLIC
Routing decision:
- PUBLIC data → can use any layer
- Cost-optimize: prefer Layer 2 (cheaper than L3, faster than L1 if L1 is queued)
- Check Layer 2 quota: Mistral has 100k remaining
- Route to: Mistral Large (Layer 2)
- Execution: 30 seconds
- Cost: ~$0.05 against quota (roughly 50k tokens)
Result: Fast analysis. Cheap. Used cloud API for speed without compromising on cost.
Example 3: Custom Legal Contract Review
An agent needs to review a 200-page contract against company policy templates.
Input: Contract (CONFIDENTIAL sensitivity)
Requested by: Legal specialist agent
Data classification: CONFIDENTIAL
Routing decision:
- CONFIDENTIAL data → Layer 1 or Layer 2
- Check Layer 1 capacity: 1 of 4 flagship slots available, and the flagship model is needed for nuance
- Route to: Flagship model (Layer 1)
- If L1 full, check Layer 2: Kimi has 50k remaining (good for long documents)
- Execution: 5 minutes
- Cost: $0 (L1)
Result: Thorough legal review. Data stayed private. No risk of accidental disclosure.
Example 4: Blog Post Editing (with Approval Needed)
A creative agent wants to refine a blog post but needs access to a frontier model for polish.
Input: Draft blog post (PUBLIC sensitivity)
Requested by: Creative agent (policy: L1/L2 allowed, max $2000/month)
Data classification: PUBLIC
Routing decision:
- PUBLIC data → can use any layer, but agent policy limits to L1/L2
- Check Layer 1: available, but creative agent prefers L2 for speed
- Check Layer 2 quota: Mistral at 95%, approaching limit
- Creative agent has $1,800 remaining budget
- Request to human: "Use frontier model (GPT-5) for blog polish, ~$25 cost, approved Y/N?"
- Result: Rejected (blog isn't critical, L1/L2 is sufficient)
Agent falls back to: L1 model for editing. Still good output, zero cost.
Alternatively, if the request had been approved, routing would go to Layer 3 and the $25 would be charged against the agent's monthly budget.
Lessons Learned: What Surprised Us
Lesson 1: Local Inference Capacity Matters More Than We Thought
We expected 70% of requests to use Layer 2 (subscriptions are reliable, fast). In practice, 70% use Layer 1 (local). Why? Because the queue rarely gets long, and agents are patient. Requests that queue for 30 seconds are acceptable. This changed our budget model dramatically: we're spending far less on subscriptions than expected.
Lesson 2: Data Sensitivity Classification is Hard
We started with 2 levels (sensitive, not sensitive). We now have 4 because subtle cases emerged. Is a meeting note INTERNAL or CONFIDENTIAL? Is a draft contract CONFIDENTIAL or HIGH? We needed legal input to nail the definitions. Now they're crystal clear and agents default-deny if unsure.
Lesson 3: Quota Alerts Create Behavioral Change
When we added the "80% quota" alert, agent behavior changed. Suddenly, teams started asking "do we really need this API call?" instead of assuming infinite quota. The alert created a budget mindset. This is good: it aligns incentives.
Lesson 4: Provider Failover is Critical
One month, Mistral API had a 2-hour outage. Because our routing layer automatically shifted to L1 (with queue), no work was lost. Agents just waited longer. But if we had relied on Mistral exclusively? Everything would have stopped. Redundancy matters.
Lesson 5: Per-Agent Policies Prevent Accidents
We haven't had a single case of an agent accessing a layer it shouldn't (e.g., a financial agent calling a cloud API). The policies enforce it automatically. This is the safety-first design paying off.
Conclusion: Three Layers, Three Problems Solved
By routing requests through 3 layers based on data sensitivity, agent policy, and current capacity, we achieve:
Cost Discipline: 70% of inferences happen free (Layer 1), 25% use subscriptions, 5% are APIs (human-approved). Monthly spend is predictable and under budget.
Privacy Control: HIGH-sensitivity data never leaves infrastructure. CONFIDENTIAL data uses local-first or approved subscription providers. Only PUBLIC data flows freely to third-party APIs.
Performance: Requests are routed to the right model for the job, not the most expensive model. Urgent work gets priority. Quality work gets the flagship model when needed.
Resilience: Provider failover is automatic. If one layer is unavailable, the next layer handles it. No single point of failure.
This is production-grade LLM ops. It's complex, but worth the complexity, because the alternative (a single LLM provider, no sensitivity classification, no quotas) creates bigger problems down the line.
BUCC is an ongoing builder's journal. We're learning as we build. If you're working on production AI infrastructure, we'd love to hear what you're learning too.
Further reading & standards
The choices in this post map directly onto published frameworks and regulations. If you're building against the same constraints, these are the primary sources:
- OWASP LLM10, Model Theft. One of the reasons L1-local routing is the default for sensitive data. (owasp.org/www-project-top-10-for-large-language-model-applications)
- NIST AI RMF, MEASURE function (MS-2, MS-3). Continuous measurement of control effectiveness, the spec behind scorecards, dashboards, and trendlines. (nist.gov/itl/ai-risk-management-framework)
- EU AI Act, Article 10 (data and data governance). Training and operational data must meet specific quality and governance standards. (artificialintelligenceact.eu)
Read the rest of the series
- Day 1: Running 25 AI agents in production
- Day 2: Governance, not guardrails
- Day 3: Persistent agent memory
- Day 4: The Data Sanitization Proxy
- Day 5: The agent provisioning pipeline
- Day 6: Three-layer LLM routing (you are here)
- Day 7: Catching AI hallucinations
- Bonus: Agent ACL framework
- Bonus: Agent wallets & DAO governance
- Bonus: BlackOffice video pipeline
- Bonus: Control Debt Scoring