Three-Layer LLM Routing: Cost, Privacy, and Performance Without Trade-offs
L1 local Ollama, L2 subscription APIs, L3 pay-per-token frontier models. The routing layer decides which tier handles each call based on sensitivity, complexity, and cost. Here's the architecture that keeps 25 agents running without burning through a cloud bill.

You're running autonomous agents that need to call language models constantly. Here's what keeps ops engineers up at night:
- Every inference costs money (GPU time or API fees). Your budget is finite.
- Some data is sensitive and can't leave your infrastructure. Privacy is non-negotiable.
- Different tasks need different models. A flagship model wastes resources on simple tasks. A lightweight model hallucinates on complex ones.
These constraints are in tension. Use only local inference? You lose access to frontier models and you bottleneck GPU resources. Use only cloud APIs? Your costs explode and sensitive data leaks. Use only a single model? You overpay for simple work and underwhelm on complex work.
At BUCC, we navigate these tensions with a 3-layer LLM routing system. Every inference request is routed based on data sensitivity, agent policy, and current capacity. The right request goes to the right model, every time.
This post explains how.
The LLM Routing Challenge
Let's ground this in concrete constraints.
Constraint 1: Budget
You have a limited AI budget. At BUCC, it's enough for:
- Running flagship and lightweight models locally on dedicated GPU servers (one-time capex, recurring electricity)
- Monthly quotas with 3-4 subscription API providers (recurring monthly cost, ~$10K/month)
- A small pay-per-token budget for frontier models (default $0, human-approved only)
Every inference costs money. You need a system that optimizes for cost without sacrificing the work that needs doing.
Constraint 2: Privacy
Financial transaction data, legal documents, and customer PII can't leave your infrastructure. Some of it might be covered under GDPR or HIPAA or internal policy. If that data touches a third-party LLM, you've violated compliance.
You need a system that blocks sensitive data from ever reaching cloud APIs.
Constraint 3: Model Fit
Different work needs different models:
- Summarizing a 10-page document? A lightweight model does fine (Qwen 3 14B). Why use the flagship?
- Implementing a complex algorithm? A code-specialized model is better (Devstral 24B). Why use a generalist?
- Generating creative marketing copy? A frontier model might be worth the cost. A lightweight model produces generic output.
You need a system that routes each request to the right model.
If you solve only one constraint, you fail. Solve for cost by using only cheap models, and you get poor-quality output. Solve for quality by using only frontier models, and you blow your budget. Solve for privacy by using only local inference, and you lose access to cutting-edge models.
We solved all three with a 3-layer routing system.
Layer 1: Local Inference (Ollama)
Characteristics:
- Free (GPU hardware already paid for, electricity cost is absorbed)
- Private (all data stays in-house)
- On-premise (no API latency, no third-party dependencies)
- Limited capacity (finite GPU VRAM)
Model Fleet:
- Flagship (Qwen 3.5 122B): Complex reasoning, deep analysis, nuanced writing. Handles the hardest 10% of work.
- Workhorse (Qwen 3 32B): General-purpose tasks, medium complexity. Handles the bulk (~60%) of work.
- Coder (Devstral 24B): Code generation, debugging, technical writing. Specialized for engineering work.
- Lightweight (Qwen 3 14B): High-volume simple work, summarization, classification, template filling. Fast, efficient.
All models run on dedicated GPU servers (Spark3 and Spark4 in our setup). We can run multiple models simultaneously, but there's a ceiling. The flagship model alone uses 50GB+ VRAM. You can't run 30 instances of it.
Capacity Management:
When demand exceeds capacity, requests queue. This creates a natural backpressure signal: "your agents are asking for more than the GPU can provide, either provision more hardware or reduce load."
In practice, we rarely hit this ceiling. Why? Because we have a fallback.
Use Cases:
Layer 1 is mandatory for:
- Financial data (transactions, accounts, budgets)
- Legal data (contracts, agreements, evidence)
- Customer PII (emails, addresses, payment info)
- Any HIGH or CONFIDENTIAL classification
Layer 1 is optimal for:
- Deep reasoning and analysis
- Complex writing and creative work
- Tasks where latency doesn't matter (batch processing, overnight analysis)
Cost Model:
$0 marginal cost per inference (hardware and electricity are sunk costs). Encourages heavy use.
Layer 2: Subscription APIs
Characteristics:
- Moderate cost (~$10K/month for our quotas)
- Third-party hosted (slight latency increase, but acceptable)
- Well-tested models with predictable performance
- Limited quota (monthly caps)
Providers:
- GLM (Zhipu AI's large models, solid reasoning)
- Kimi (long-context documents, analysis)
- MiniMax (cost-effective reasoning)
- Mistral Large (fast, reliable generalist)
We have standing monthly quotas with each provider. The quota resets on the 1st of each month. We track usage in real-time.
Routing Logic:
We use Layer 2 when:
- Data sensitivity is INTERNAL or PUBLIC (not CONFIDENTIAL or HIGH)
- Layer 1 capacity is exhausted or Layer 1 models aren't optimal for the task
- Remaining quota is available
If Layer 1 is available and data sensitivity allows, we prefer Layer 1 (free). If L1 is at capacity, we fall back to L2. If L2 quota is exhausted, we either queue the request or escalate to Layer 3 (with approval).
Quota Management:
Prometheus tracks quota usage with real-time gauges:
- 80% alert: "You've used 80% of this month's quota with 10 days left. Adjust consumption or you'll hit the cap."
- 95% critical alert: "You've used 95% of quota. Stop using Layer 2 until next month unless approved."
- Predicted overspend: "If consumption continues at this rate, you'll exceed budget by $X by month-end."
When we hit 95%, Layer 2 stops accepting new requests. Everything has to fit into Layer 1 (with a queue) or get escalated to Layer 3 (with human approval).
Use Cases:
Layer 2 is optimal for:
- General-purpose work that Layer 1 can do but is at capacity
- PUBLIC or INTERNAL data
- Time-sensitive requests (Layer 2 APIs have lower latency than Layer 1 queue)
- Tasks that need specific model strengths (e.g., Kimi's long-context advantage)
Cost Model:
Fixed monthly quotas. No surprise bills. We budget ~$10K/month and stick to it.
Layer 3: Pay-Per-Token APIs
Characteristics:
- Most expensive ($0.02-0.10 per 1K tokens)
- Latest frontier models (GPT-5, Claude 3.5, etc.)
- Human-approved only (default budget is $0)
- Last resort
Providers:
- OpenAI (GPT-5, most expensive)
- Anthropic (Claude 3.5, reasonable cost)
- Others as needed
Routing Logic:
Layer 3 is invoked when:
- Data is PUBLIC or INTERNAL (HIGH and CONFIDENTIAL are blocked)
- Layer 1 capacity is exhausted
- Layer 2 quota is exhausted
- Agent requests it explicitly (rare)
- Human approves the spend
An agent can't just decide to use Layer 3. It requires explicit approval: "Use GPT-5 for this task, estimated cost $50, approve Y/N?"
A human reviews and either approves (rare, only for truly hard problems) or rejects (most of the time).
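The estimated cost in the approval prompt can be computed from expected token counts before any spend happens. A sketch, with illustrative prices and helper names (real bills depend on actual usage):

```python
def estimate_l3_cost(prompt_tokens: int, expected_output_tokens: int,
                     price_per_1k: float) -> float:
    """Rough pre-approval estimate; actual billing follows real usage."""
    total = prompt_tokens + expected_output_tokens
    return total / 1_000 * price_per_1k

def approval_prompt(model: str, prompt_tokens: int,
                    expected_output_tokens: int, price_per_1k: float) -> str:
    cost = estimate_l3_cost(prompt_tokens, expected_output_tokens, price_per_1k)
    return (f"Use {model} for this task, "
            f"estimated cost ${cost:,.2f}, approve Y/N?")

# 2.4M prompt tokens + 100K expected output at $0.02/1K = $50
print(approval_prompt("GPT-5", 2_400_000, 100_000, 0.02))
```

The human sees a dollar figure, not a token count, which is what makes the approval decision fast.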
Use Cases:
Layer 3 is for edge cases:
- A frontier model is uniquely good at the task
- We need model capabilities beyond our local/subscription stack
- The value of the output justifies the cost
In practice, we use Layer 3 less than 5% of the time. Most work fits into Layer 1/2.
Cost Model:
Pay-per-token, metered. Expensive. This scarcity makes humans careful about approval.
Data Sensitivity and Routing Decisions
Here's the core insight: data sensitivity determines which layers are available.
We classify all data into 4 levels:
PUBLIC (blog posts, published research, marketing copy)
- Can use: All layers (L1, L2, L3)
- Routing strategy: Cost-optimized (prefer cheapest model that does the job)
INTERNAL (meeting notes, internal processes, team documentation)
- Can use: L1, L2
- Routing strategy: Privacy + cost (prefer L1, fall back to L2 if needed)
CONFIDENTIAL (contracts, customer data, strategic plans)
- Can use: L1, L2 (subscription providers we trust)
- Routing strategy: Privacy first (L1 preferred, L2 as fallback only)
HIGH (financial transactions, legal evidence, health data)
- Can use: L1 only
- Routing strategy: Absolute privacy (local inference only, no exceptions)
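The four levels above can be encoded as a default-deny lookup, where the list order doubles as routing preference. A minimal sketch (the dict shape is illustrative):

```python
# Data classification determines which layers are even eligible;
# list order encodes routing preference (first = preferred).
ALLOWED_LAYERS = {
    "HIGH":         ["L1"],              # local only, no exceptions
    "CONFIDENTIAL": ["L1", "L2"],        # L1 preferred, L2 fallback only
    "INTERNAL":     ["L1", "L2"],        # privacy + cost
    "PUBLIC":       ["L2", "L1", "L3"],  # cost-optimized ordering
}

def eligible_layers(classification: str) -> list[str]:
    # Default-deny: unknown classifications get no layers at all.
    return ALLOWED_LAYERS.get(classification, [])

print(eligible_layers("HIGH"))      # ['L1']
print(eligible_layers("UNKNOWN"))   # [] -- fail closed
```

Failing closed on unknown classifications matters: a mislabeled request should queue for review, not default to a cloud API.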
When an agent makes an inference request, it includes the data classification. The routing layer checks:
if data == HIGH:
    route_to = L1
    if L1_capacity_available:
        execute_on_L1()
    else:
        queue_request_with_priority()
elif data == CONFIDENTIAL:
    route_to = L1  # preferred
    if L1_capacity_available:
        execute_on_L1()
    elif L2_quota_available:
        execute_on_L2()
    else:
        queue_request_with_priority()
elif data == INTERNAL:
    route_to = [L1, L2]  # prefer L1
    if L1_capacity_available:
        execute_on_L1()
    elif L2_quota_available:
        execute_on_L2()
    else:
        queue_request_with_priority()
elif data == PUBLIC:
    route_to = [L1, L2, L3]  # optimize for cost
    if L2_quota_available:  # prefer cheap
        execute_on_L2()
    elif L1_capacity_available:
        execute_on_L1()
    elif L3_budget_approved:
        request_human_approval()
        if approved:
            execute_on_L3()
        else:
            queue_request()
This is deterministic. No guessing. No "hope the data doesn't leak." The data classification determines the layer.
Per-Agent Routing Policies
On top of data sensitivity, each agent has a routing policy set during provisioning (remember Day 5?).
Financial Agent Policy:
allowed_layers: [L1]
preferred_model: flagship (Qwen 3.5 122B)
fallback_model: none
max_budget_monthly: $0
data_classifications_allowed: HIGH, CONFIDENTIAL, INTERNAL
The financial agent is L1-only. No subscriptions. No APIs. All financial work happens locally. Why? Because financial data is HIGH sensitivity, and the policy enforces it.
Researcher Agent Policy:
allowed_layers: [L1, L2, L3]
preferred_model: flagship (L1)
fallback_model: Mistral Large (L2)
max_budget_monthly: $500
data_classifications_allowed: PUBLIC, INTERNAL
The researcher can use all layers but defaults to L1 (free). Falls back to L2 if L1 is busy. Can use L3 with approval up to $500/month. Data limit is PUBLIC/INTERNAL (no financial data).
Creative Agent Policy:
allowed_layers: [L1, L2]
preferred_model: Qwen 3 32B (L1)
fallback_model: GLM (L2)
max_budget_monthly: $2000
data_classifications_allowed: PUBLIC, INTERNAL
The creative agent prefers the Qwen 3 32B workhorse on Layer 1 and falls back to GLM on Layer 2 when L1 is busy. Budget cap is $2,000/month for subscriptions.
These policies are set once and immutable. An agent can't override its own policy. You can't accidentally give the financial agent L3 access.
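Immutability is cheap to get in code. A sketch of how a policy like the financial agent's might be represented and enforced (dataclass and field names are illustrative, not our provisioning schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: set at provisioning, never mutated
class AgentPolicy:
    allowed_layers: frozenset
    allowed_classifications: frozenset
    max_budget_monthly: float

FINANCIAL = AgentPolicy(
    allowed_layers=frozenset({"L1"}),
    allowed_classifications=frozenset({"HIGH", "CONFIDENTIAL", "INTERNAL"}),
    max_budget_monthly=0.0,
)

def check_request(policy: AgentPolicy, layer: str, classification: str) -> bool:
    """Both the layer and the data classification must be allowed."""
    return (layer in policy.allowed_layers
            and classification in policy.allowed_classifications)

# Attempting FINANCIAL.max_budget_monthly = 100 raises FrozenInstanceError
print(check_request(FINANCIAL, "L1", "HIGH"))   # True
print(check_request(FINANCIAL, "L2", "HIGH"))   # False -- cloud blocked
```

The frozen dataclass is the "immutable" guarantee in miniature: the agent holds a reference to its policy but cannot rewrite it.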
Provider Registry and Health Monitoring
We maintain a unified registry of all LLM providers (local and cloud):
providers:
  qwen-3.5-122b:
    layer: 1
    vram: 50GB
    latency_p50: 200ms
    latency_p99: 2000ms
    cost: 0
    capacity: 4 concurrent
    health: operational
  qwen-3-32b:
    layer: 1
    vram: 24GB
    latency_p50: 150ms
    latency_p99: 1500ms
    cost: 0
    capacity: 8 concurrent
    health: operational
  mistral-large:
    layer: 2
    quota_monthly: 1000k tokens
    quota_used_this_month: 750k tokens
    quota_remaining: 250k tokens
    cost_per_1k_tokens: $0.001
    latency_p50: 500ms
    latency_p99: 3000ms
    health: operational
  gpt-5-turbo:
    layer: 3
    quota_monthly: unlimited
    quota_used_this_month: $1200
    budget_cap: $5000
    cost_per_1k_tokens: $0.02
    latency_p50: 1000ms
    latency_p99: 5000ms
    health: operational
We monitor health metrics in real-time:
- Availability: Is the provider responding?
- Latency: P50, P99 response times
- Error rate: % of requests failing
- Quota: Used / remaining
If a provider's error rate spikes (e.g., Mistral API goes down), the routing layer automatically shifts requests to the next available layer. If all L2 providers are down, requests fall back to L1 (with queue).
Prometheus alerts notify on-call engineers if provider health degrades.
Priority Queue and Inference Scheduling
When Layer 1 capacity is tight, not all requests are equal. We prioritize:
- EMERGENCY (highest priority): User-facing requests, critical agent decisions, time-sensitive work
- AGENT_WORK: Autonomous agent tasks, routine work
- QUALITY_SAMPLING: Agent self-review, sanity checks, hallucination detection
- TRAINING_EVAL: Model fine-tuning, evals, research
An EMERGENCY request jumps to the front of the queue. A TRAINING_EVAL request waits.
If the flagship model is at capacity with 10 EMERGENCY requests queued and an agent tries to submit a TRAINING_EVAL request, the TRAINING_EVAL waits. When capacity frees, EMERGENCY requests get the slot.
This is exactly like Linux process scheduling. Each request has a priority. The kernel (our routing layer) respects that priority.
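Python's heapq gives you this scheduler in a dozen lines. A sketch with the four tiers mapped to numeric priorities, lower winning (class and tier-dict names are illustrative):

```python
import heapq
import itertools

# Lower number = higher priority, mirroring the tiers above
PRIORITY = {"EMERGENCY": 0, "AGENT_WORK": 1,
            "QUALITY_SAMPLING": 2, "TRAINING_EVAL": 3}

class InferenceQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tiebreak within a tier

    def submit(self, request_id: str, tier: str) -> None:
        heapq.heappush(self._heap,
                       (PRIORITY[tier], next(self._counter), request_id))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

q = InferenceQueue()
q.submit("eval-run", "TRAINING_EVAL")
q.submit("cfo-report", "EMERGENCY")
q.submit("routine-task", "AGENT_WORK")
print(q.next_request())  # cfo-report -- EMERGENCY jumps the queue
```

The monotonic counter matters: within a tier, requests drain first-in-first-out instead of in arbitrary heap order.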
Cost Tracking and Alerting
We track cost at multiple granularities:
Per-Provider:
- GLM quota used this month: 750k/1000k tokens
- Mistral quota used this month: 200k/500k tokens
- OpenAI spend this month: $1,200/$5,000 budget
Per-Agent:
- Financial agent: $0 (L1-only)
- Researcher: $150 (layer 2 subscriptions) + $0 (L1 is free)
- Creative: $200 (layer 2 subscriptions)
Per-Project:
- Q2 Marketing: $300 (L2 subscriptions)
- Security Research: $50 (mostly L1)
Per-Month:
- Total spend: $8,500
- Forecast: $9,200 (if consumption continues at this rate)
Prometheus dashboards visualize all of this. We get alerts:
- 80% threshold: "You've used 80% of Mistral quota with 10 days left."
- 95% threshold: "You've used 95% of Mistral quota. Layer 2 is now accepting emergency requests only."
- Predicted overspend: "At current consumption, you'll exceed monthly budget by $1,200 by month-end. Recommend reducing load."
When we hit 95%, the routing layer becomes conservative. Layer 2 stops accepting new requests except emergencies. Everything else queues on Layer 1. Only human-approved Layer 3 requests proceed.
This creates a hard stop. No surprise bills. No "oops, we spent $30K this month."
Real-World Routing Examples
Example 1: Financial Summary Report
An agent needs to summarize Q2 spending for the CFO.
Input: 500 transactions, account balances, vendor payments (HIGH sensitivity)
Requested by: Financial agent
Data classification: HIGH
Routing decision:
- HIGH data → Layer 1 only
- Check Layer 1 capacity: 2 of 4 flagship slots available
- Route to: Flagship model (Qwen 3.5 122B) on Layer 1
- Execution: 3 minutes
- Cost: $0
Result: Detailed, accurate summary. All data stayed local. Zero cost.
Example 2: Market Research Analysis
An agent needs to analyze public market trends to inform Q3 budgeting.
Input: 200 web articles, market reports (PUBLIC sensitivity)
Requested by: Researcher agent
Data classification: PUBLIC
Routing decision:
- PUBLIC data → can use any layer
- Cost-optimize: prefer Layer 2 (cheaper than L3, faster than L1 if L1 is queued)
- Check Layer 2 quota: Mistral has 100k remaining
- Route to: Mistral Large (Layer 2)
- Execution: 30 seconds
- Cost: ~$0.05 against quota (roughly 50k tokens)
Result: Fast analysis. Cheap. Used cloud API for speed without compromising on cost.
Example 3: Custom Legal Contract Review
An agent needs to review a 200-page contract against company policy templates.
Input: Contract (CONFIDENTIAL sensitivity)
Requested by: Legal specialist agent
Data classification: CONFIDENTIAL
Routing decision:
- CONFIDENTIAL data → Layer 1 or Layer 2
- Check Layer 1 capacity: 1 of 4 flagship slots available, and the flagship model is needed for nuance
- Route to: Flagship model (Layer 1)
- If L1 full, check Layer 2: Kimi has 50k remaining (good for long documents)
- Execution: 5 minutes
- Cost: $0 (L1)
Result: Thorough legal review. Data stayed private. No risk of accidental disclosure.
Example 4: Blog Post Editing (with Approval Needed)
A creative agent wants to refine a blog post but needs access to a frontier model for polish.
Input: Draft blog post (PUBLIC sensitivity)
Requested by: Creative agent (policy: L1/L2 allowed, max $2000/month)
Data classification: PUBLIC
Routing decision:
- PUBLIC data → can use any layer, but agent policy limits to L1/L2
- Check Layer 1: available, but creative agent prefers L2 for speed
- Check Layer 2 quota: Mistral at 95%, approaching limit
- Creative agent has $1,800 remaining budget
- Request to human: "Use frontier model (GPT-5) for blog polish, ~$25 cost, approved Y/N?"
- Result: Rejected (blog isn't critical, L1/L2 is sufficient)
Agent falls back to: L1 model for editing. Still good output, zero cost.
Alternatively, if the request had been approved, routing would go to Layer 3 and the $25 would be charged against the agent's monthly budget.
Lessons Learned: What Surprised Us
Lesson 1: Local Inference Capacity Matters More Than We Thought
We expected 70% of requests to use Layer 2 (subscriptions are reliable, fast). In practice, 70% use Layer 1 (local). Why? Because the queue rarely gets long, and agents are patient. Requests that queue for 30 seconds are acceptable. This changed our budget model dramatically: we're spending far less on subscriptions than expected.
Lesson 2: Data Sensitivity Classification is Hard
We started with 2 levels (sensitive, not sensitive). We now have 4 because subtle cases emerged. Is a meeting note INTERNAL or CONFIDENTIAL? Is a draft contract CONFIDENTIAL or HIGH? We needed legal input to nail the definitions. Now they're crystal clear and agents default-deny if unsure.
Lesson 3: Quota Alerts Create Behavioral Change
When we added the "80% quota" alert, agent behavior changed. Suddenly, teams started asking "do we really need this API call?" instead of assuming infinite quota. The alert created a budget mindset. This is good: it aligns incentives.
Lesson 4: Provider Failover is Critical
One month, Mistral API had a 2-hour outage. Because our routing layer automatically shifted to L1 (with queue), no work was lost. Agents just waited longer. But if we had relied on Mistral exclusively? Everything would have stopped. Redundancy matters.
Lesson 5: Per-Agent Policies Prevent Accidents
We haven't had a single case of an agent accessing a layer it shouldn't (e.g., a financial agent calling a cloud API). The policies enforce it automatically. This is the safety-first design paying off.
Conclusion: Three Layers, Three Problems Solved
By routing requests through 3 layers based on data sensitivity, agent policy, and current capacity, we achieve:
Cost Discipline: 70% of inferences happen free (Layer 1), 25% use subscriptions, 5% are APIs (human-approved). Monthly spend is predictable and under budget.
Privacy Control: HIGH-sensitivity data never leaves infrastructure. CONFIDENTIAL data uses local-first or approved subscription providers. Only PUBLIC data flows freely to third-party APIs.
Performance: Requests are routed to the right model for the job, not the most expensive model. Urgent work gets priority. Quality work gets the flagship model when needed.
Resilience: Provider failover is automatic. If one layer is unavailable, the next layer handles it. No single point of failure.
This is production-grade LLM ops. It's complex, but worth the complexity, because the alternative (a single LLM provider, no sensitivity classification, no quotas) creates bigger problems down the line.
BUCC is an ongoing builder's journal. We're learning as we build. If you're working on production AI infrastructure, we'd love to hear what you're learning too.
Further reading & standards
The choices in this post map directly onto published frameworks and regulations. If you're building against the same constraints, these are the primary sources:
- OWASP LLM10, Model Theft. One of the reasons L1-local routing is the default for sensitive data. (owasp.org/www-project-top-10-for-large-language-model-applications)
- NIST AI RMF, MEASURE function (MS-2, MS-3). Continuous measurement of control effectiveness, the spec behind scorecards, dashboards, and trendlines. (nist.gov/itl/ai-risk-management-framework)
- EU AI Act, Article 10 (data and data governance). Training and operational data must meet specific quality and governance standards. (artificialintelligenceact.eu)
Read the rest of the series
- Day 1: Running 25 AI agents in production
- Day 2: Governance, not guardrails
- Day 3: Persistent agent memory
- Day 4: The Data Sanitization Proxy
- Day 5: The agent provisioning pipeline
- Day 6: Three-layer LLM routing (you are here)
- Day 7: Catching AI hallucinations
- Bonus: Agent ACL framework
- Bonus: Agent wallets & DAO governance
- Bonus: BlackOffice video pipeline
- Bonus: Control Debt Scoring