Varshith V Hegde

Bifrost: The LLM Gateway That's 40x Faster Than LiteLLM

A technical deep dive into Bifrost: an open-source, self-hostable Go LLM gateway


Gateway Overhead in Production LLM Systems

Performance comparison chart showing Bifrost's superior latency metrics compared to traditional Python-based gateways

In most LLM systems, the gateway becomes a shared dependency: it affects tail latency, routing/failover behavior, retries, and cost attribution across providers. LiteLLM works well as a lightweight Python proxy, but in our production-like load tests we started seeing gateway overhead and operational complexity show up at higher concurrency. We moved to Bifrost for lower overhead and for first-class features like governance, cost semantics, and observability built into the gateway.

In our benchmark setup (with logging/retries enabled), LiteLLM added hundreds of microseconds of overhead per request. Results vary by deployment mode and configuration. When handling thousands of requests per second, this overhead compounds—infrastructure costs increase, tail latency suffers, and operational complexity grows.

Bifrost takes a different approach.


Enter Bifrost

Artistic representation of Bifrost bridge from Norse mythology, symbolizing the connection between different AI providers

Bifrost is an LLM gateway written in Go that adds approximately 11 microseconds of overhead per request in our test environment. That's roughly 40x faster than what we observed with LiteLLM in comparable configurations.

But the performance improvement is just one part of the story. Bifrost rethinks the control plane for LLM infrastructure—providing governance, cost attribution, and observability as first-class gateway features rather than requiring external tooling or application-level instrumentation.

Let me walk through the technical details.


GitHub: maximhq/bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…


Setup and Deployment

Traditional gateway deployment often involves managing Python environments, dependency chains, and configuration files. Here's the Bifrost approach:

npx -y @maximhq/bifrost

This single command downloads a pre-compiled binary for your platform and starts a production-ready gateway on port 8080 with a web UI for configuration.

Compare this to typical Python gateway setup:

# Install Python (verify version compatibility)
pip install litellm
# Configure environment variables
# Set up configuration file
# Install additional dependencies for features
# Debug environment-specific issues

Under the hood, the NPX wrapper simply fetches that pre-compiled, statically linked binary. No Python interpreter required. No virtual environments. No dependency resolution. A single executable that runs immediately.


Why Go for Gateway Infrastructure

The choice of Go over Python has measurable impacts on production systems, particularly around concurrency, memory efficiency, and operational simplicity.

Concurrency Model

Python gateways scale through async event loops and multiple worker processes. At high concurrency, the tradeoffs show up as higher memory per worker, coordination overhead between processes, and tail-latency spikes under bursty traffic.

Go sidesteps most of these constraints. Goroutines are lightweight, runtime-scheduled threads that are multiplexed across all available CPU cores, so work runs truly in parallel. When a request arrives, Bifrost spawns a goroutine. When a thousand requests arrive simultaneously, Bifrost spawns a thousand goroutines, all running concurrently with minimal overhead.

Animated visualization showing parallel goroutines processing multiple requests simultaneously versus sequential Python execution

Memory Efficiency

A Python process typically requires 30-50MB of memory at startup in most configurations. Add Flask or FastAPI, and baseline memory usage often reaches 100MB+ before handling any requests, though this varies based on the specific setup and dependencies.

The entire Bifrost binary is approximately 20MB. In memory, a single Bifrost instance uses roughly 50MB under sustained load while handling thousands of requests per second.

Memory usage comparison graph showing Bifrost's 10x improvement over Python-based solutions

Startup Time

Python applications require time to initialize—import packages, start the interpreter, load configurations. Typical startup time is 2-3 seconds minimum.

Bifrost starts in milliseconds. This matters for autoscaling, development iteration, and serverless deployments where cold starts impact user experience.

Benchmark Results

Here are measurements from a sustained load test on a t3.xlarge EC2 instance at 5,000 requests per second:

Metric                          LiteLLM    Bifrost    Improvement
Gateway Overhead                440 µs     11 µs      40x faster
Memory Usage                    ~500 MB    ~50 MB     10x less
Gateway-level Failures          11%        0%         No failures observed
Queue Wait Time                 47 µs      1.67 µs    28x faster
Total Latency (with provider)   2.12 s     1.61 s     24% faster

These numbers come from sustained load over several hours, not from a short synthetic burst.


Beyond Performance: Control-Plane Features That Matter in Production

The main reason to move from LiteLLM to Bifrost isn't language; it's control-plane features. Bifrost adds governance (virtual keys, budgets, rate limits), consistent cost attribution, and production-oriented observability at the gateway layer, not scattered across application code.

This architectural choice centralizes concerns that would otherwise require external services or application-level instrumentation:

  • Governance controls managed at the gateway rather than per-application
  • Cost attribution with per-request tracking and aggregation
  • Observability with structured logs, metrics, and request tracing built-in
  • Failure isolation with circuit breakers and automatic failover

Let's examine these features in detail.


Production Features

Automatic Failover

When your primary provider hits rate limits or experiences downtime, requests should seamlessly move to backup providers without manual intervention.

Bifrost configuration:

{
  "fallbacks": {
    "enabled": true,
    "order": [
      "openai/gpt-4o-mini",
      "anthropic/claude-sonnet-4",
      "mistral/mistral-large-latest"
    ]
  }
}

Configuration interface showing the automatic failover setup with multiple provider options

When OpenAI returns a rate limit error, Bifrost automatically retries with Anthropic. If that fails, it tries Mistral. Your application receives a successful response without implementing retry logic.

Load Balancing

Distributing load across multiple API keys prevents any single key from hitting rate limits:

{
  "providers": {
    "openai": {
      "keys": [
        {"name": "key-1", "value": "sk-...", "weight": 2.0},
        {"name": "key-2", "value": "sk-...", "weight": 1.0},
        {"name": "key-3", "value": "sk-...", "weight": 1.0}
      ]
    }
  }
}

Diagram illustrating weighted load balancing distribution across multiple API keys with traffic percentages

The first key receives 50% of traffic, the other two receive 25% each. When one key approaches its rate limit, Bifrost automatically shifts load to healthy keys.
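
Bifrost's balancer also accounts for key health and rate-limit state, but the static weights on their own behave like a weighted random pick. Here is a minimal Python sketch of that idea (illustrative only, not Bifrost's internals):

import random

# With weights 2.0 / 1.0 / 1.0, key-1 should receive about 50% of requests
# and key-2 / key-3 about 25% each over a large number of draws.
keys = [("key-1", 2.0), ("key-2", 1.0), ("key-3", 1.0)]

def pick_key(keys):
    names, weights = zip(*keys)
    return random.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name, _ in keys}
for _ in range(100_000):
    counts[pick_key(keys)] += 1

print({name: round(100 * n / 100_000, 1) for name, n in counts.items()})
# Converges on roughly {'key-1': 50.0, 'key-2': 25.0, 'key-3': 25.0}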

Semantic Caching

Semantic caching isn't a new concept; teams can build it externally, but Bifrost ships it as a first-class gateway feature, reducing moving parts.

Traditional caching requires exact string matches. But users rarely phrase questions identically:

  • "What's the weather like?"
  • "How's the weather today?"
  • "Tell me about current weather conditions"

These are semantically equivalent. Bifrost uses vector embeddings to understand semantic similarity:

Flowchart showing the semantic caching process from request to cache lookup to response delivery

  1. Request arrives: "What is Python?"
  2. Bifrost generates an embedding using a fast model
  3. Checks vector store for similar embeddings
  4. Finds previous request: "Explain Python to me"
  5. Returns cached response (similarity score: 0.92)

Dashboard showing semantic caching metrics with hit rates and similarity scores

Result: no LLM call required. The response comes back in tens of milliseconds (about 60 ms for a semantic match, dominated by the embedding lookup; see the Vector Store Integration section below) instead of roughly 2 seconds. Cost: effectively $0.00 instead of $0.0001.

Screenshot displaying successful cache hit with performance metrics and cost savings

Savings depend on cache hit rate and workload repetition. Over a million requests at a 60% hit rate, that works out to roughly 1,000,000 × 0.6 × $0.0001 ≈ $60 saved.

Unified Interface

Every LLM provider has different API formats. OpenAI uses one schema. Anthropic uses another. Bedrock and Vertex AI each have their own specifications.

Bifrost provides a single API that works with all providers:

from openai import OpenAI

# Change only the base URL
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/openai"
)

# Use ANY provider with the same code
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",  # Not an OpenAI model
    messages=[{"role": "user", "content": "Hello"}]
)

Your application code remains unchanged. Switch providers by modifying one line. No refactoring required. No rewriting integration tests.

Model Context Protocol (MCP)

MCP is an open protocol, introduced by Anthropic, that lets AI models call external tools such as web search, filesystem access, or database queries:

{
  "mcp": {
    "enabled": true,
    "servers": {
      "web-search": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-brave-search"]
      },
      "filesystem": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
      }
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

This enables AI models to perform actions rather than only generating text responses.


Web UI and Observability

Main dashboard of Bifrost's web interface showing real-time metrics, request analytics, and provider status

Most gateways provide configuration files and command-line tools. Bifrost includes a comprehensive web interface at http://localhost:8080:

Dashboard: Real-time metrics showing request counts, error rates, and costs per provider

Providers: Visual configuration for all providers with click-based key management

Provider management interface showing API key configuration with visual controls and status indicators

Logs: Complete request/response history with token usage, searchable and filterable

Detailed request log viewer with filtering options, showing request/response pairs and token usage

Settings: Configure caching, governance, and plugins without editing configuration files

Settings panel displaying caching configuration, governance rules, and plugin management options

All configuration, monitoring, and debugging can be performed through the web interface without SSH access to servers or manual log analysis.


Architecture Details

Request Flow

Architecture diagram showing the complete request lifecycle through Bifrost's processing pipeline

  1. Request arrives at Bifrost's HTTP server
  2. Request validation happens in microseconds
  3. Cache lookup checks semantic cache if enabled
  4. Cache hit? Return immediately (approximately 5ms total)
  5. Cache miss? Continue to provider selection
  6. Load balancer selects API key based on weights and health
  7. Concurrent request dispatched to provider (goroutine spawned)
  8. Response streaming begins immediately if enabled
  9. Cache storage happens asynchronously (non-blocking)
  10. Response returns to client with metadata

All operations are non-blocking where possible. Cache lookup doesn't block provider calls in no-store mode. Cache storage doesn't delay response delivery.
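
Bifrost implements this pipeline in Go; to make the non-blocking cache write concrete, here is the same pattern sketched in Python asyncio with stub objects (purely illustrative, not Bifrost's code):

import asyncio

class StubCache:
    """Toy in-memory cache used only to illustrate the non-blocking write."""
    def __init__(self):
        self._entries = {}

    async def lookup(self, prompt):
        return self._entries.get(prompt)

    async def store(self, prompt, response):
        await asyncio.sleep(0.05)              # pretend the vector write takes 50 ms
        self._entries[prompt] = response

async def call_provider(prompt):
    await asyncio.sleep(1.5)                   # pretend the LLM call takes 1.5 s
    return f"answer to: {prompt}"

async def handle_request(prompt, cache):
    cached = await cache.lookup(prompt)
    if cached is not None:
        return cached                          # hit: respond immediately
    response = await call_provider(prompt)
    # Fire-and-forget: the cache write runs in the background, so it never
    # delays the response on its way back to the client.
    asyncio.create_task(cache.store(prompt, response))
    return response

async def main():
    cache = StubCache()
    print(await handle_request("What is Docker?", cache))   # miss: ~1.5 s
    await asyncio.sleep(0.1)                                 # let the background write land
    print(await handle_request("What is Docker?", cache))   # hit: immediate

asyncio.run(main())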

Concurrency Implementation

Bifrost uses Go's goroutines for concurrency:

Traditional Python Threading:

Request 1 → Thread 1 → Process (limited parallelism)
Request 2 → Thread 2 → Wait/Process (coordination overhead)
Request 3 → Thread 3 → Wait/Process (memory per thread)

Bifrost Goroutines:

Request 1 → Goroutine 1 ⟍
Request 2 → Goroutine 2 ⟋→ All process in parallel → Responses
Request 3 → Goroutine 3 ⟋

Each goroutine uses approximately 2KB of memory. You can run millions concurrently.

Vector Store Integration

For semantic caching, Bifrost integrates with Weaviate:

  1. Request arrives with cache key: "user-session-123"
  2. Bifrost extracts message content
  3. Generates embedding using fast model (text-embedding-3-small)
  4. Searches Weaviate for similar embeddings (threshold: 0.8)
  5. Finds match with similarity 0.92
  6. Returns cached response with metadata

Embedding generation: approximately 50ms. Vector search: approximately 10ms. Total: 60ms compared to 2000ms for an actual LLM call.
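
The final comparison is a plain vector-similarity check against the configured threshold. A minimal sketch of that step in Python (toy four-dimensional vectors; real text-embedding-3-small embeddings have 1,536 dimensions, and this is not Bifrost's code):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.8

cached_embedding = [0.12, 0.48, 0.31, 0.77]    # embedding of the stored request
incoming_embedding = [0.10, 0.52, 0.29, 0.80]  # embedding of the new request

score = cosine_similarity(cached_embedding, incoming_embedding)
if score >= THRESHOLD:
    print(f"cache hit (similarity {score:.2f})")   # serve the stored response
else:
    print(f"cache miss (similarity {score:.2f})")  # forward to the provider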


Setting Up Semantic Caching

Step 1: Install Weaviate

Use Docker for local development:

docker run -d \
  --name weaviate \
  -p 8081:8080 \
  -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
  semitechnologies/weaviate:latest

Or use Weaviate Cloud (free tier available) at https://console.weaviate.cloud/

Step 2: Configure Bifrost

Update your config.json:

{
  "providers": {
    "openai": {
      "keys": [{
        "name": "main",
        "value": "env.OPENAI_API_KEY",
        "models": ["gpt-4o-mini"]
      }]
    }
  },
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "localhost:8081",
      "scheme": "http"
    }
  },
  "plugins": [{
    "enabled": true,
    "name": "semantic_cache",
    "config": {
      "provider": "openai",
      "embedding_model": "text-embedding-3-small",
      "ttl": "5m",
      "threshold": 0.8
    }
  }]
}

Step 3: Test the Cache

# First request (cache miss, calls LLM)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-cache-key: user-123" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is Docker?"}]
  }'

# Similar request (cache hit, fast response)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-cache-key: user-123" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Explain Docker to me"}]
  }'

The second request returns in approximately 60ms instead of 2000ms.
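
You can reproduce the comparison from the Python SDK instead of curl. The sketch below assumes the gateway is on its default port and passes the cache key via the OpenAI client's extra_headers option; exact timings will vary:

import time
from openai import OpenAI

# Point the client at the same endpoint the curl commands above use.
client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

def timed_ask(prompt):
    start = time.perf_counter()
    client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"x-bf-cache-key": "user-123"},   # required for caching
    )
    print(f"{(time.perf_counter() - start) * 1000:.0f} ms  {prompt}")

timed_ask("What is Docker?")        # first call: cache miss, full provider latency
timed_ask("Explain Docker to me")   # similar prompt: semantic hit, much faster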

Cache Hit Response

{
  "choices": [...],
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "semantic",
      "similarity": 0.94,
      "threshold": 0.8,
      "provider_used": "openai",
      "model_used": "text-embedding-3-small"
    }
  }
}

The response includes debugging information showing the similarity score and threshold. Adjust the threshold based on your accuracy requirements.


Drop-In Replacement

You can replace existing OpenAI or Anthropic SDK calls with Bifrost by changing one parameter.

Python Example

Before:

from openai import OpenAI

client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)

After:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/openai"  # Only change
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)

Node.js Example

Before:

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

After:

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: 'not-needed',
  baseURL: 'http://localhost:8080/openai'  // Only change
});

Benefits

This approach enables:

  • Adding Bifrost to existing applications without refactoring
  • Testing in production with gradual rollout
  • Access to all Bifrost features (caching, fallbacks, monitoring) immediately
  • Easy rollback if needed (see the environment-variable sketch below)
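
One low-risk way to stage the rollout, assuming your client is constructed in a single place, is to read the base URL from an environment variable (BIFROST_BASE_URL here is just an illustrative name) so switching between Bifrost and the provider is a configuration change rather than a code change:

import os
from openai import OpenAI

# BIFROST_BASE_URL unset -> talk to the provider directly (current behaviour)
# BIFROST_BASE_URL set   -> route through the local Bifrost gateway
base_url = os.environ.get("BIFROST_BASE_URL")   # e.g. "http://localhost:8080/openai"

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed"),
    base_url=base_url,  # None falls back to the default OpenAI endpoint
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)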

Monitoring and Observability

Bifrost exposes Prometheus metrics at /metrics:

# Request metrics
bifrost_requests_total{provider="openai",model="gpt-4o-mini"} 1543
bifrost_request_duration_seconds{provider="openai"} 1.234

# Cache metrics
bifrost_cache_hits_total{type="semantic"} 892
bifrost_cache_misses_total 651

# Error metrics
bifrost_errors_total{provider="openai",type="rate_limit"} 12
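
As a quick sanity check, you can pull /metrics and compute the cache hit rate yourself. The sketch below assumes the counter names shown above and the default port:

import re
import urllib.request

# Fetch the Prometheus text exposition from Bifrost and sum the cache counters
# across all label combinations.
text = urllib.request.urlopen("http://localhost:8080/metrics").read().decode()

def counter_total(name):
    pattern = rf"^{name}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)$"
    return sum(float(v) for v in re.findall(pattern, text, flags=re.MULTILINE))

hits = counter_total("bifrost_cache_hits_total")
misses = counter_total("bifrost_cache_misses_total")
if hits + misses:
    print(f"cache hit rate: {100 * hits / (hits + misses):.1f}%")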

Grafana Dashboard

Connect Prometheus to Grafana for visualization:

  • Requests per second by provider
  • Latency percentiles (p50, p95, p99)
  • Cache hit rates over time
  • Cost tracking per provider
  • Error rates and types

Structured Logging

Bifrost logs to structured JSON:

{
  "level": "info",
  "time": "2024-01-15T10:30:00Z",
  "msg": "request completed",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "duration_ms": 1234,
  "tokens": 456,
  "cache_hit": false
}

This format integrates with any log aggregation service (CloudWatch, Datadog, Elasticsearch).
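
Because each entry is a single JSON object per line, ad-hoc analysis takes only a few lines. A small sketch using the field names shown above, reading JSON-lines logs from stdin:

import json
import sys

# Report the average latency of completed requests that missed the cache.
durations = []
for line in sys.stdin:
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        continue                                # skip any non-JSON lines
    if entry.get("msg") == "request completed" and not entry.get("cache_hit"):
        durations.append(entry["duration_ms"])

if durations:
    avg = sum(durations) / len(durations)
    print(f"cache misses: {len(durations)}, average latency: {avg:.0f} ms")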


Common Configuration Issues

Issue 1: Missing Cache Key

Semantic caching requires the x-bf-cache-key header. Without it, every request is a cache miss.

Incorrect:

curl -X POST http://localhost:8080/v1/chat/completions -d '{...}'

Correct:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-cache-key: user-session-123" \
  -d '{...}'

Issue 2: Threshold Configuration

Start with a threshold of 0.8 and adjust based on cache hit rate and accuracy:

{
  "threshold": 0.8  // Starting point
}

Monitor your cache hit rate. If below 30%, lower the threshold to 0.75. If you're getting incorrect cached results, raise it to 0.85.

Issue 3: Config Store Requirement

Some plugins require a config store:

{
  "config_store": {
    "enabled": true,
    "type": "sqlite",
    "config": {"path": "./config.db"}
  }
}

Issue 4: Weaviate Network Configuration

Ensure Weaviate is accessible from Bifrost. For Docker deployments:

{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "weaviate-container:8080",  // Use correct hostname
      "scheme": "http"
    }
  }
}

For Weaviate Cloud:

{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "<weaviate-host>.gcp.weaviate.cloud",
      "scheme": "https",
      "api_key": "<weaviate-api-key>"
    }
  }
}

When to Use Bifrost

Bifrost provides immediate value for:

  • Production systems handling more than 1,000 requests per day
  • Applications where tail latency impacts user experience
  • Teams that need automatic failover without complex orchestration
  • Organizations tracking LLM costs across multiple providers
  • Systems requiring governance controls (rate limits, budgets, virtual keys)
  • Deployments where operational simplicity reduces maintenance burden

Even for smaller projects, Bifrost's minimal overhead and built-in features provide a robust foundation that scales without requiring future refactoring.

Getting Started

  1. Run npx -y @maximhq/bifrost
  2. Open http://localhost:8080
  3. Add your API keys in the UI
  4. Point your application to http://localhost:8080/openai
  5. Monitor performance and costs through the dashboard

Resources

  • Bifrost on GitHub: https://github.com/maximhq/bifrost

Questions or feedback? Please leave a comment below. If you use Bifrost in production, I'd be interested to hear about your experience and any challenges you encounter.

Top comments (2)

Shravan J Poojary

Nice work! Quick question, how does Bifrost handle Anthropic-specific features like system prompts and extended thinking? Are these supported via the unified API, or do we still need to hit Anthropic directly?

Varshith V Hegde

Great question! Bifrost supports Anthropic-specific features through the unified API. You can use system prompts, extended thinking, and other Anthropic parameters directly when routing to Claude models.