A technical deep dive into Bifrost: an open-source, self-hostable Go LLM gateway
Gateway Overhead in Production LLM Systems
In most LLM systems, the gateway becomes a shared dependency: it affects tail latency, routing/failover behavior, retries, and cost attribution across providers. LiteLLM works well as a lightweight Python proxy, but in our production-like load tests we started seeing gateway overhead and operational complexity show up at higher concurrency. We moved to Bifrost for lower overhead and for first-class features like governance, cost semantics, and observability built into the gateway.
In our benchmark setup (with logging/retries enabled), LiteLLM added hundreds of microseconds of overhead per request. Results vary by deployment mode and configuration. When handling thousands of requests per second, this overhead compounds—infrastructure costs increase, tail latency suffers, and operational complexity grows.
Bifrost takes a different approach.
Enter Bifrost
Bifrost is an LLM gateway written in Go that adds approximately 11 microseconds of overhead per request in our test environment. That's roughly 40x faster than what we observed with LiteLLM in comparable configurations.
But the performance improvement is just one part of the story. Bifrost rethinks the control plane for LLM infrastructure—providing governance, cost attribution, and observability as first-class gateway features rather than requiring external tooling or application-level instrumentation.
Let me walk through the technical details.
maximhq/bifrost: Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
Bifrost: The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…
Setup and Deployment
Traditional gateway deployment often involves managing Python environments, dependency chains, and configuration files. Here's the Bifrost approach:
npx -y @maximhq/bifrost
This single command downloads a pre-compiled binary for your platform and starts a production-ready gateway on port 8080 with a web UI for configuration.
Compare this to typical Python gateway setup:
# Install Python (verify version compatibility)
pip install litellm
# Configure environment variables
# Set up configuration file
# Install additional dependencies for features
# Debug environment-specific issues
Under the hood, NPX simply fetches a pre-compiled binary for your platform. No Python interpreter required. No virtual environments. No dependency resolution. A single statically linked executable that runs immediately.
Why Go for Gateway Infrastructure
The choice of Go over Python has measurable impacts on production systems, particularly around concurrency, memory efficiency, and operational simplicity.
Concurrency Model
Python gateways typically scale through async event loops plus multiple worker processes. At high concurrency, the tradeoffs show up as higher memory per instance, coordination overhead between workers, and tail latency under bursts.
Go avoids most of these constraints. Goroutines are lightweight threads managed by the Go runtime and scheduled across all available CPU cores, so they can run truly in parallel. When a request arrives, Bifrost spawns a goroutine. When a thousand requests arrive simultaneously, Bifrost spawns a thousand goroutines, all running concurrently with minimal overhead.
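To see what that looks like from the client side, here is a rough sketch that fires a burst of concurrent requests through the gateway and reports latency percentiles. It assumes Bifrost is running locally on port 8080 with an OpenAI key configured, and it reuses the drop-in base URL shown later in this post:
import asyncio
import time

from openai import AsyncOpenAI

# Point the standard OpenAI SDK at the local Bifrost gateway (see the
# drop-in replacement section later in this post).
client = AsyncOpenAI(api_key="not-needed", base_url="http://localhost:8080/openai")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"ping {i}"}],
    )
    return time.perf_counter() - start

async def main() -> None:
    # Fire 100 requests at once; the gateway handles each on its own goroutine,
    # so the client simply awaits all the results.
    latencies = sorted(await asyncio.gather(*(one_request(i) for i in range(100))))
    print(f"p50={latencies[49]:.2f}s p99={latencies[98]:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())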
Memory Efficiency
A bare Python process typically requires 30-50MB of memory at startup. Add Flask or FastAPI, and baseline memory usage often reaches 100MB+ before handling any requests, though this varies with the specific setup and dependencies.
The entire Bifrost binary is approximately 20MB. In memory, a single Bifrost instance uses roughly 50MB under sustained load while handling thousands of requests per second.
Startup Time
Python applications need time to initialize: start the interpreter, import packages, load configuration. Startup commonly takes two to three seconds.
Bifrost starts in milliseconds. This matters for autoscaling, development iteration, and serverless deployments where cold starts impact user experience.
Benchmark Results
Here are measurements from a sustained load test on a t3.xlarge EC2 instance at 5,000 requests per second:
| Metric | LiteLLM | Bifrost | Improvement |
|---|---|---|---|
| Gateway Overhead | 440 µs | 11 µs | 40x faster |
| Memory Usage | ~500 MB | ~50 MB | 10x less |
| Gateway-level Failures | 11% | 0% | No failures observed |
| Queue Wait Time | 47 µs | 1.67 µs | 28x faster |
| Total Latency (with provider) | 2.12 s | 1.61 s | 24% faster |
These measurements come from sustained load over multiple hours rather than short synthetic bursts.
Beyond Performance: Control-Plane Features That Matter in Production
The main reason to move from LiteLLM to Bifrost isn't language; it's control-plane features. Bifrost adds governance (virtual keys, budgets, rate limits), consistent cost attribution, and production-oriented observability at the gateway layer, not scattered across application code.
This architectural choice centralizes concerns that would otherwise require external services or application-level instrumentation:
- Governance controls managed at the gateway rather than per-application
- Cost attribution with per-request tracking and aggregation
- Observability with structured logs, metrics, and request tracing built-in
- Failure isolation with circuit breakers and automatic failover
Let's examine these features in detail.
Production Features
Automatic Failover
When your primary provider hits rate limits or experiences downtime, requests should seamlessly move to backup providers without manual intervention.
Bifrost configuration:
{
"fallbacks": {
"enabled": true,
"order": [
"openai/gpt-4o-mini",
"anthropic/claude-sonnet-4",
"mistral/mistral-large-latest"
]
}
}
When OpenAI returns a rate limit error, Bifrost automatically retries with Anthropic. If that fails, it tries Mistral. Your application receives a successful response without implementing retry logic.
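Conceptually, the fallback loop works like the sketch below. This is an illustration of the behavior only, not Bifrost's internal code; call_provider and ProviderError are hypothetical placeholders standing in for real provider calls and errors:
# Illustrative sketch of gateway-side fallback semantics, not Bifrost's actual code.
FALLBACK_ORDER = [
    "openai/gpt-4o-mini",
    "anthropic/claude-sonnet-4",
    "mistral/mistral-large-latest",
]

class ProviderError(Exception):
    """Stand-in for a rate-limit or server error from a provider."""

def call_provider(model: str, messages: list[dict]) -> dict:
    # Hypothetical stand-in: pretend the OpenAI key is currently rate limited.
    if model.startswith("openai/"):
        raise ProviderError("429: rate limited")
    return {"model": model, "content": "ok"}

def complete_with_fallback(messages: list[dict]) -> dict:
    last_error: Exception | None = None
    for model in FALLBACK_ORDER:
        try:
            return call_provider(model, messages)  # first healthy provider wins
        except ProviderError as err:
            last_error = err                       # try the next model in the list
    raise RuntimeError("all providers failed") from last_error

print(complete_with_fallback([{"role": "user", "content": "Hello"}]))
# → served by anthropic/claude-sonnet-4 after the OpenAI attempt fails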
Load Balancing
Distributing load across multiple API keys prevents any single key from hitting rate limits:
{
"providers": {
"openai": {
"keys": [
{"name": "key-1", "value": "sk-...", "weight": 2.0},
{"name": "key-2", "value": "sk-...", "weight": 1.0},
{"name": "key-3", "value": "sk-...", "weight": 1.0}
]
}
}
}
The first key receives 50% of traffic, the other two receive 25% each. When one key approaches its rate limit, Bifrost automatically shifts load to healthy keys.
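The weight arithmetic is straightforward weighted random selection. A small sketch (illustrative only, not Bifrost's internals) reproduces the 50/25/25 split from the configuration above:
import random
from collections import Counter

# Weighted key selection: weight 2.0 gets ~50% of traffic, the two
# weight-1.0 keys get ~25% each.
keys = [("key-1", 2.0), ("key-2", 1.0), ("key-3", 1.0)]

def pick_key() -> str:
    names = [name for name, _ in keys]
    weights = [weight for _, weight in keys]
    return random.choices(names, weights=weights, k=1)[0]

counts = Counter(pick_key() for _ in range(100_000))
for name, count in sorted(counts.items()):
    print(f"{name}: {count / 100_000:.1%}")  # roughly 50% / 25% / 25%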
Semantic Caching
Semantic caching isn't a new concept; teams can build it externally, but Bifrost ships it as a first-class gateway feature, reducing moving parts.
Traditional caching requires exact string matches. But users rarely phrase questions identically:
- "What's the weather like?"
- "How's the weather today?"
- "Tell me about current weather conditions"
These are semantically equivalent. Bifrost uses vector embeddings to understand semantic similarity:
- Request arrives: "What is Python?"
- Bifrost generates an embedding using a fast model
- Checks vector store for similar embeddings
- Finds previous request: "Explain Python to me"
- Returns cached response (similarity score: 0.92)
Result: No LLM call required. The response returns in roughly 60 milliseconds (embedding plus vector lookup) instead of about 2 seconds. Cost: $0.00 instead of $0.0001.
Savings depend on cache hit rate and workload repetition. Over a million requests with a 60% cache hit rate, this saves approximately $60.
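The arithmetic behind that estimate, spelled out:
# Back-of-the-envelope savings from the numbers above.
requests = 1_000_000
hit_rate = 0.60
cost_per_llm_call = 0.0001  # dollars per call, per the example above

avoided_calls = requests * hit_rate
savings = avoided_calls * cost_per_llm_call
print(f"~${savings:,.0f} saved on {avoided_calls:,.0f} cached responses")  # ~$60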
Unified Interface
Every LLM provider has different API formats. OpenAI uses one schema. Anthropic uses another. Bedrock and Vertex AI each have their own specifications.
Bifrost provides a single API that works with all providers:
from openai import OpenAI
# Change only the base URL
client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8080/openai"
)
# Use ANY provider with the same code
response = client.chat.completions.create(
model="anthropic/claude-sonnet-4", # Not an OpenAI model
messages=[{"role": "user", "content": "Hello"}]
)
Your application code remains unchanged. Switch providers by modifying one line. No refactoring required. No rewriting integration tests.
Model Context Protocol (MCP)
MCP is an open protocol, introduced by Anthropic, that lets AI models call external tools such as web search, filesystem access, or database queries:
{
"mcp": {
"enabled": true,
"servers": {
"web-search": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-brave-search"]
},
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
}
}
}
}
This enables AI models to perform actions rather than only generating text responses.
Web UI and Observability
Most gateways provide configuration files and command-line tools. Bifrost includes a comprehensive web interface at http://localhost:8080:
Dashboard: Real-time metrics showing request counts, error rates, and costs per provider
Providers: Visual configuration for all providers with click-based key management
Logs: Complete request/response history with token usage, searchable and filterable
Settings: Configure caching, governance, and plugins without editing configuration files
All configuration, monitoring, and debugging can be performed through the web interface without SSH access to servers or manual log analysis.
Architecture Details
Request Flow
- Request arrives at Bifrost's HTTP server
- Request validation happens in microseconds
- Cache lookup checks semantic cache if enabled
- Cache hit? Return immediately (roughly 60ms total for a semantic hit, including embedding generation)
- Cache miss? Continue to provider selection
- Load balancer selects API key based on weights and health
- Concurrent request dispatched to provider (goroutine spawned)
- Response streaming begins immediately if enabled
- Cache storage happens asynchronously (non-blocking)
- Response returns to client with metadata
All operations are non-blocking where possible. Cache lookup doesn't block provider calls in no-store mode. Cache storage doesn't delay response delivery.
Concurrency Implementation
Bifrost uses Go's goroutines for concurrency:
Traditional Python Threading:
Request 1 → Thread 1 → Process (limited parallelism)
Request 2 → Thread 2 → Wait/Process (coordination overhead)
Request 3 → Thread 3 → Wait/Process (memory per thread)
Bifrost Goroutines:
Request 1 → Goroutine 1 ⟍
Request 2 → Goroutine 2 → All process in parallel → Responses
Request 3 → Goroutine 3 ⟋
Each goroutine starts with a stack of roughly 2KB, so millions can run concurrently.
Vector Store Integration
For semantic caching, Bifrost integrates with Weaviate:
- Request arrives with cache key: "user-session-123"
- Bifrost extracts message content
- Generates embedding using fast model (text-embedding-3-small)
- Searches Weaviate for similar embeddings (threshold: 0.8)
- Finds match with similarity 0.92
- Returns cached response with metadata
Embedding generation: approximately 50ms. Vector search: approximately 10ms. Total: 60ms compared to 2000ms for an actual LLM call.
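The heart of the semantic-match step is a similarity score compared against a threshold. Here is a minimal sketch of that idea using plain cosine similarity (in practice Bifrost delegates the search to the vector store):
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

THRESHOLD = 0.8  # same knob as the semantic_cache plugin's "threshold" setting

def lookup(query_embedding: list[float], cached: list[tuple[list[float], str]]):
    """Scan cached (embedding, response) pairs; return the best match above the threshold."""
    best_score, best_response = 0.0, None
    for embedding, response in cached:
        score = cosine_similarity(query_embedding, embedding)
        if score >= THRESHOLD and score > best_score:
            best_score, best_response = score, response
    return best_score, best_response

print(cosine_similarity([1.0, 0.0], [0.6, 0.8]))  # 0.6 → below the 0.8 threshold, a miss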
Setting Up Semantic Caching
Step 1: Install Weaviate
Use Docker for local development:
docker run -d \
--name weaviate \
-p 8081:8080 \
-e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
semitechnologies/weaviate:latest
Or use Weaviate Cloud (free tier available) at https://console.weaviate.cloud/
Step 2: Configure Bifrost
Update your config.json:
{
"providers": {
"openai": {
"keys": [{
"name": "main",
"value": "env.OPENAI_API_KEY",
"models": ["gpt-4o-mini"]
}]
}
},
"vector_store": {
"enabled": true,
"type": "weaviate",
"config": {
"host": "localhost:8081",
"scheme": "http"
}
},
"plugins": [{
"enabled": true,
"name": "semantic_cache",
"config": {
"provider": "openai",
"embedding_model": "text-embedding-3-small",
"ttl": "5m",
"threshold": 0.8
}
}]
}
Step 3: Test the Cache
# First request (cache miss, calls LLM)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-cache-key: user-123" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "What is Docker?"}]
}'
# Similar request (cache hit, fast response)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-cache-key: user-123" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Explain Docker to me"}]
}'
The second request returns in approximately 60ms instead of 2000ms.
Cache Hit Response
{
"choices": [...],
"extra_fields": {
"cache_debug": {
"cache_hit": true,
"hit_type": "semantic",
"similarity": 0.94,
"threshold": 0.8,
"provider_used": "openai",
"model_used": "text-embedding-3-small"
}
}
}
The response includes debugging information showing the similarity score and threshold. Adjust the threshold based on your accuracy requirements.
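To check this from code instead of curl, a short script (assuming the requests library and the configuration above) can send the two semantically similar questions and read the cache_debug block:
import requests

# Mirrors the curl calls above: ask the same question two ways and inspect
# the cache_debug metadata from the second response.
URL = "http://localhost:8080/v1/chat/completions"
HEADERS = {"Content-Type": "application/json", "x-bf-cache-key": "user-123"}

def ask(question: str) -> dict:
    payload = {
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": question}],
    }
    return requests.post(URL, headers=HEADERS, json=payload, timeout=30).json()

ask("What is Docker?")                # cache miss, goes to the provider
second = ask("Explain Docker to me")  # should be a semantic hit
debug = second.get("extra_fields", {}).get("cache_debug", {})
print(debug.get("cache_hit"), debug.get("similarity"))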
Drop-In Replacement
You can replace existing OpenAI or Anthropic SDK calls with Bifrost by changing one parameter.
Python Example
Before:
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}]
)
After:
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8080/openai" # Only change
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}]
)
Node.js Example
Before:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
After:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: 'not-needed',
baseURL: 'http://localhost:8080/openai' // Only change
});
Benefits
This approach enables:
- Adding Bifrost to existing applications without refactoring
- Testing in production with gradual rollout
- Access to all Bifrost features (caching, fallbacks, monitoring) immediately
- Easy rollback if needed
Monitoring and Observability
Bifrost exposes Prometheus metrics at /metrics:
# Request metrics
bifrost_requests_total{provider="openai",model="gpt-4o-mini"} 1543
bifrost_request_duration_seconds{provider="openai"} 1.234
# Cache metrics
bifrost_cache_hits_total{type="semantic"} 892
bifrost_cache_misses_total 651
# Error metrics
bifrost_errors_total{provider="openai",type="rate_limit"} 12
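These counters are easy to consume programmatically as well. Here is a small sketch using the prometheus_client parser to compute a cache hit rate from the endpoint; the metric names are taken from the sample output above and may differ across Bifrost versions:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Scrape the gateway's /metrics endpoint and sum the cache counters.
text = requests.get("http://localhost:8080/metrics", timeout=5).text

counters: dict[str, float] = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        counters[sample.name] = counters.get(sample.name, 0.0) + sample.value

hits = counters.get("bifrost_cache_hits_total", 0.0)
misses = counters.get("bifrost_cache_misses_total", 0.0)
if hits + misses:
    print(f"cache hit rate: {hits / (hits + misses):.1%}")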
Grafana Dashboard
Connect Prometheus to Grafana for visualization:
- Requests per second by provider
- Latency percentiles (p50, p95, p99)
- Cache hit rates over time
- Cost tracking per provider
- Error rates and types
Structured Logging
Bifrost logs to structured JSON:
{
"level": "info",
"time": "2024-01-15T10:30:00Z",
"msg": "request completed",
"provider": "openai",
"model": "gpt-4o-mini",
"duration_ms": 1234,
"tokens": 456,
"cache_hit": false
}
This format integrates with any log aggregation service (CloudWatch, Datadog, Elasticsearch).
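Because the format is one JSON object per line, ad-hoc analysis is simple. A short sketch that summarizes request counts and average latency per provider from a log stream (field names follow the sample line above):
import json
import sys

# Pipe the gateway's JSON log output into this script on stdin.
totals: dict[str, list[float]] = {}
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON lines
    if event.get("msg") != "request completed":
        continue
    totals.setdefault(event["provider"], []).append(event.get("duration_ms", 0))

for provider, durations in totals.items():
    avg = sum(durations) / len(durations)
    print(f"{provider}: {len(durations)} requests, avg {avg:.0f} ms")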
Common Configuration Issues
Issue 1: Missing Cache Key
Semantic caching requires the x-bf-cache-key header. Without it, every request is a cache miss.
Incorrect:
curl -X POST http://localhost:8080/v1/chat/completions -d '{...}'
Correct:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "x-bf-cache-key: user-session-123" \
-d '{...}'
Issue 2: Threshold Configuration
Start with a threshold of 0.8 and adjust based on cache hit rate and accuracy:
{
"threshold": 0.8 // Starting point
}
Monitor your cache hit rate. If below 30%, lower the threshold to 0.75. If you're getting incorrect cached results, raise it to 0.85.
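That tuning rule can be expressed as a small helper. This is only an illustration of the heuristic described above (the hit and miss counts could come from the Prometheus counters shown earlier); suggest_threshold is not a Bifrost API:
# Illustrative threshold-tuning heuristic, not part of Bifrost.
def suggest_threshold(current: float, hits: int, misses: int, bad_hits_reported: bool) -> float:
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    if bad_hits_reported:
        return round(min(current + 0.05, 0.95), 2)  # too loose: raise toward 0.85
    if hit_rate < 0.30:
        return round(max(current - 0.05, 0.70), 2)  # too strict: lower toward 0.75
    return current

print(suggest_threshold(0.80, hits=200, misses=800, bad_hits_reported=False))  # 0.75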
Issue 3: Config Store Requirement
Some plugins require a config store:
{
"config_store": {
"enabled": true,
"type": "sqlite",
"config": {"path": "./config.db"}
}
}
Issue 4: Weaviate Network Configuration
Ensure Weaviate is accessible from Bifrost. For Docker deployments:
{
"vector_store": {
"enabled": true,
"type": "weaviate",
"config": {
"host": "weaviate-container:8080", // Use correct hostname
"scheme": "http"
}
}
}
For Weaviate Cloud:
{
"vector_store": {
"enabled": true,
"type": "weaviate",
"config": {
"host": "<weaviate-host>.gcp.weaviate.cloud",
"scheme": "https",
"api_key": "<weaviate-api-key>"
}
}
}
When to Use Bifrost
Bifrost provides immediate value for:
- Production systems handling more than 1,000 requests per day
- Applications where tail latency impacts user experience
- Teams that need automatic failover without complex orchestration
- Organizations tracking LLM costs across multiple providers
- Systems requiring governance controls (rate limits, budgets, virtual keys)
- Deployments where operational simplicity reduces maintenance burden
Even for smaller projects, Bifrost's minimal overhead and built-in features provide a robust foundation that scales without requiring future refactoring.
Getting Started
- Run npx -y @maximhq/bifrost
- Open http://localhost:8080
- Add your API keys in the UI
- Point your application to http://localhost:8080/openai
- Monitor performance and costs through the dashboard
Resources
- GitHub: github.com/maximhq/bifrost
- Website: getbifrost.ai
- Documentation: docs.getbifrost.ai
Questions or feedback? Please leave a comment below. If you use Bifrost in production, I'd be interested to hear about your experience and any challenges you encounter.
Top comments (2)
Nice work! Quick question, how does Bifrost handle Anthropic-specific features like system prompts and extended thinking? Are these supported via the unified API, or do we still need to hit Anthropic directly?
Great question! Bifrost supports Anthropic-specific features through the unified API. You can use system prompts, extended thinking, and other Anthropic parameters directly when routing to Claude models.