In my previous post, I made the case that every enterprise scaling AI needs a centralized LLM Gateway. The architectural requirements are clear: unified API, automated fallback routing, protocol translation, streaming preservation, data sovereignty and cost governance.
That was the “why”. This is the “which”.
I evaluated six LLM Gateways against those requirements, specifically filtering for self-hosted deployment capability. My platform (VIA) handles product development workflows where some data is sensitive and some isn’t. I need to route to both cloud providers and local inference endpoints, with the gateway running entirely within my infrastructure. Managed SaaS platforms that require sending payloads through a third-party cloud are disqualified by default.
The six I evaluated: Bifrost (Maxim AI), LiteLLM, Portkey, Helicone, BricksLLM and Martian. Three made my shortlist and three didn’t.
What Didn’t Make the Cut
LiteLLM has the broadest provider support (100+ providers, 2,500+ models) and the largest community (27K+ GitHub stars). It’s also the most widely adopted. But it’s built in Python, and independent benchmarks show severe performance degradation at scale: P99 latency spikes to 30+ seconds at just 500 requests per second, with out-of-memory crashes at 1,000 RPS. For a platform that needs to handle many concurrent agentic workflows, that’s a structural risk I can’t accept. The recent supply chain incident (a compromised PyPI dependency in March 2026) reinforced my concern about Python dependency chains in critical infrastructure.
Portkey offers the deepest compliance feature set I’ve seen: 50+ automated guardrails, PII redaction and SOC2/ISO/HIPAA certifications. For heavily regulated enterprises, it’s compelling. But its full capabilities live behind a commercial SaaS control plane. Self-hosted deployment is technically possible but gated behind top-tier enterprise contracts. The 20-40ms of latency overhead from its guardrail processing is also noticeable in interactive applications. For my use case, the compliance features are overkill and the deployment model is too constrained.
Martian takes a fundamentally different approach: it’s an intelligent model router that dynamically selects the best model per request based on prompt analysis. Genuinely innovative, but it’s a closed commercial SaaS. You’re outsourcing routing decisions to their infrastructure. That contradicts the self-hosted requirement entirely.
1. Bifrost: For High-Throughput Agentic Workloads
Bifrost is the performance benchmark in this space. Built in Go, it adds 11 microseconds of overhead per request at 5,000 RPS. That’s not a typo — microseconds, not milliseconds. Under extreme load, it maintains a stable P99 latency of 1.68 seconds with a peak memory footprint of just 120MB. Where Python-based alternatives crash, Bifrost barely notices.
What puts it at the top of my list is the combination of performance with feature completeness. It’s the only gateway with a native MCP gateway built in. For agentic architectures where AI models call external tools, Bifrost centralizes tool execution, applies RBAC at the virtual key level and handles OAuth token refresh. That’s a significant operational simplification.
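To make "RBAC at the virtual key level" concrete, here is a minimal sketch of the idea, not Bifrost's actual implementation or configuration format. The key names, tool names and the `call_tool` helper are all hypothetical: the gateway holds a grant table and refuses any tool call that the calling key is not entitled to.

```python
# Hypothetical grant table: which tools each virtual key may invoke.
TOOL_GRANTS = {
    "vk-research": {"web_search", "read_file"},
    "vk-support": {"web_search"},
}


def call_tool(virtual_key, tool_name, handler, *args):
    """Run a tool handler only if the virtual key has been granted that tool."""
    if tool_name not in TOOL_GRANTS.get(virtual_key, set()):
        raise PermissionError(f"{virtual_key} may not call {tool_name}")
    return handler(*args)


def web_search(query):
    # Stand-in for a real tool; returns a canned string.
    return f"results for {query}"


def read_file(path):
    # Stand-in for a real tool; never reached by the support key below.
    return f"contents of {path}"


allowed = call_tool("vk-support", "web_search", web_search, "llm gateways")

try:
    call_tool("vk-support", "read_file", read_file, "notes.txt")
    denied = False
except PermissionError:
    denied = True
```

The point of centralizing this check is that the application never holds tool credentials itself; it only holds a virtual key, and the policy lives in one place.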
It also does semantic caching: recognizing queries that mean the same thing but are phrased differently, and returning the cached response. It matches on vector similarity, not just on exact strings. That directly reduces token spend without application-level changes.
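A minimal sketch of how similarity-based cache lookup works in general, under loud assumptions: the `embed` function below is a toy bag-of-words stand-in so the example runs without dependencies, whereas a real gateway would call a learned embedding model. The `SemanticCache` class and the 0.6 threshold are illustrative, not taken from any gateway's code.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (vector, cached response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.6)
cache.put("What is the capital of France?", "Paris")

hit = cache.get("Tell me the capital of France")   # different wording, similar vector
miss = cache.get("How do I reset my password?")    # unrelated, falls below threshold
```

The threshold is the design decision that matters: set it too low and users get answers to questions they did not ask, set it too high and the cache degrades to exact matching.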
Deployment is fully self-hosted (Docker, Kubernetes Helm charts or bare metal), air-gapped if needed. MIT licensed, free to self-host.
Why it’s #1: Unmatched performance headroom for concurrent agentic workflows, plus the only native MCP integration in the gateway space.
2. Helicone: For Observability-First Operations
Helicone started as an LLM Observability platform and evolved into a full LLM Gateway. The recent rewrite in Rust pays off: it adds 1-5ms of overhead per request, making it the second-fastest option I encountered.
Where Helicone differentiates is its dual integration model. It can run as a traditional proxy (gateway in the request path) or run asynchronously to log telemetry in the background while your application talks directly to the provider. This means you can add full observability (token analytics, cost forecasting, trace debugging) without the gateway ever touching the critical path. For workloads where even 1ms of added latency matters, that’s a unique architectural option.
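The asynchronous path can be sketched as follows. This is a generic illustration of the pattern, not Helicone's SDK: `call_provider` is a hypothetical stand-in for a direct provider call, and the worker simply appends to a list where a real one would ship events to the observability backend.

```python
import queue
import threading
import time

telemetry_queue = queue.Queue()
recorded = []  # stands in for the observability backend


def telemetry_worker() -> None:
    """Drain telemetry events in the background, away from the request path."""
    while True:
        event = telemetry_queue.get()
        recorded.append(event)
        telemetry_queue.task_done()


threading.Thread(target=telemetry_worker, daemon=True).start()


def call_provider(prompt: str) -> str:
    """Hypothetical stand-in for a direct call to the model provider."""
    return f"echo: {prompt}"


def complete(prompt: str) -> str:
    started = time.perf_counter()
    response = call_provider(prompt)  # direct call, no gateway hop
    telemetry_queue.put({             # enqueue only; logging happens later
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_s": time.perf_counter() - started,
    })
    return response


answer = complete("hello")
telemetry_queue.join()  # only so this example can inspect the log afterwards
```

The tradeoff is that anything enforced in the request path (routing, fallback, caching, rate limits) is unavailable in this mode; you get visibility without control.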
The observability suite itself is deep: granular token analytics, custom property tagging, a proprietary query language for trace debugging and both semantic and exact-match caching.
Self-hosted via Docker Compose or Kubernetes Helm charts. Apache 2.0 licensed.
Why it’s #2: The async integration path is architecturally unique. Full observability without being in the critical path is a compelling tradeoff for latency-sensitive workloads.
3. BricksLLM: For Strict Cost Control and Data Privacy
BricksLLM focuses on what matters most for internal platform teams: locking down who can spend what, and ensuring sensitive data never reaches external providers.
Also built in Go, it adds roughly 30ms of overhead per request, which by their own account is mostly a synchronous Redis round-trip for rate-limit enforcement. That is roughly 2,700 times Bifrost’s overhead (more than three orders of magnitude), but it’s the cost of strict, real-time budget and quota enforcement. It sustains 1,000 RPS without crashes or cascading errors. The provider list is focused rather than exhaustive (OpenAI, Anthropic, Azure, vLLM), but these are the providers that matter for most enterprise deployments.
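To put the two overhead figures side by side, here is the arithmetic as a short script. The 11µs and 30ms values are the ones quoted in this post; the 20 sequential calls per workflow is purely an assumption for illustration.

```python
import math

bifrost_overhead_s = 11e-6  # ~11 microseconds, as quoted above
bricks_overhead_s = 30e-3   # ~30 milliseconds, as quoted above

ratio = bricks_overhead_s / bifrost_overhead_s
orders_of_magnitude = math.log10(ratio)

calls_per_workflow = 20     # assumption for illustration only
bifrost_total_ms = calls_per_workflow * bifrost_overhead_s * 1000
bricks_total_ms = calls_per_workflow * bricks_overhead_s * 1000

print(f"ratio: {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
print(f"{calls_per_workflow} sequential calls: "
      f"{bifrost_total_ms:.2f} ms vs {bricks_total_ms:.0f} ms of gateway overhead")
```

Whether the larger figure matters depends on the workload: against model calls that take seconds, even the accumulated overhead of a long agentic chain stays a small fraction of total latency.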
The governance model is meticulous. Virtual API keys can be constrained with exact spend limits (down to $0.25) and precise rate limits (2 requests per minute if you want). Built-in PII detection and masking scrubs sensitive data from prompts before they reach external providers. For an internal platform where different teams have different budgets and different data sensitivity levels, this granularity matters.
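A minimal sketch of per-key budget and rate enforcement, using the $0.25 and 2-requests-per-minute figures from above. The `VirtualKey` class is hypothetical and keeps state in memory; it is not BricksLLM's API, and a real gateway would hold these counters in a shared store. Amounts are tracked in integer cents to avoid floating-point drift.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VirtualKey:
    spend_limit_cents: int
    requests_per_minute: int
    spent_cents: int = 0
    recent: deque = field(default_factory=deque)  # timestamps of admitted requests

    def admit(self, estimated_cost_cents: int, now: float) -> bool:
        # Forget admitted requests that have left the 60-second window.
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()
        if len(self.recent) >= self.requests_per_minute:
            return False  # rate limit reached
        if self.spent_cents + estimated_cost_cents > self.spend_limit_cents:
            return False  # request would exceed the budget
        self.recent.append(now)
        self.spent_cents += estimated_cost_cents
        return True


key = VirtualKey(spend_limit_cents=25, requests_per_minute=2)
decisions = [
    key.admit(10, now=0.0),   # admitted
    key.admit(10, now=1.0),   # admitted
    key.admit(1, now=2.0),    # rejected: third request within one minute
    key.admit(10, now=61.0),  # rejected: 30 cents would exceed the 25-cent budget
    key.admit(5, now=62.0),   # admitted: total spend lands exactly on the limit
]
```

Passing `now` in explicitly keeps the example deterministic; a live gateway would read the clock itself and reconcile the estimate against the actual token cost once the response arrives.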
Fully self-hosted, open-source, free.
Why it’s #3: When the primary concern is “make sure nobody accidentally sends patient data to OpenAI and nobody overspends their budget,” BricksLLM is purpose-built for that.
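For a sense of what prompt-side masking involves, here is a deliberately small sketch. The two regular expressions and the `mask_pii` helper are illustrative assumptions, not BricksLLM's detection logic; production PII detection needs far broader coverage than a pair of patterns.

```python
import re

# Illustrative patterns only: an email address and a US-style SSN.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_pii(prompt: str) -> str:
    """Replace every match with a bracketed label before the prompt leaves the network."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt


masked = mask_pii("Patient jane.doe@example.com, SSN 123-45-6789, reports chest pain.")
print(masked)
```

Doing this in the gateway rather than in each application means the rule is applied uniformly, including to teams that forgot to think about it.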
Quick Comparison
| | Bifrost | Helicone | BricksLLM |
|---|---|---|---|
| Language | Go | Rust | Go |
| Latency overhead | ~11µs | 1-5ms | ~30ms |
| Providers | 15+ (1000+ models) | All major + local | Core enterprise set |
| MCP support | Native built-in | Client-side | Client-side |
| Unique strength | Performance + MCP | Async observability | PII masking + budget caps |
| Caching | Semantic + exact | Semantic + exact | Basic |
| License | MIT | Apache 2.0 | Open source |
| Self-hosted | Full air-gap | Full air-gap | Full air-gap |
What I’m Testing Next
I haven’t deployed any of these yet. The evaluation so far is based on architecture review, documentation analysis, published benchmarks and community feedback. The next step is hands-on testing against my actual requirements:
- Protocol translation accuracy: Routing Claude-formatted requests through to OpenAI and back. Does tool calling survive the round trip without breaking?
- Streaming fidelity: Does the SSE translation maintain the real-time feel, or does it introduce perceptible buffering?
- Failover behavior: How gracefully does it handle a provider going down mid-stream?
- Operational overhead: How much effort does it take to deploy, configure, and maintain?
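The streaming-fidelity check in the list above could be measured along these lines. This is a sketch of one possible harness, not a finished test suite: `measure_stream` and `StreamTiming` are hypothetical names, and the fake clock and hard-coded chunks stand in for a real SSE response iterator from the gateway under test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamTiming:
    time_to_first_chunk: float  # seconds from request start to first chunk
    max_gap: float              # longest pause between consecutive chunks
    chunks: int


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    start = clock()
    previous = start
    first = None
    max_gap = 0.0
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now - start                      # first chunk: record time to first token
        else:
            max_gap = max(max_gap, now - previous)   # later chunks: track the longest pause
        previous = now
        count += 1
    return StreamTiming(first if first is not None else float("nan"), max_gap, count)


# Deterministic demo: a fake clock replays arrival times instead of reading real time.
ticks = iter([0.0, 0.25, 0.3125, 0.375, 1.375])
timing = measure_stream(["Hel", "lo", " wor", "ld"], clock=lambda: next(ticks))
```

Running the same measurement once directly against the provider and once through the gateway gives two `StreamTiming` values to compare; a large `max_gap` that appears only through the gateway would point to buffering in the translation layer.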
Bifrost is first in line. I’ll share what I find.
The Full Analysis
This post covers my shortlist and reasoning. The complete evaluation — all six LLM Gateways, detailed architectural breakdowns, performance benchmarks, the full comparison matrix and methodology — is available in the full report.
Download the full VIA Research report (PDF)
A Q1 2026 publication by VIA Research — Swiss engineering, AI-native with Human Insight.