A technical deep dive into Bifrost: an open-source, self-hostable Go LLM gateway
Gateway Overhead in Production LLM Systems
In most LLM systems, the gateway becomes a shared dependency: it affects tail latency, routing/failover behavior, retries, and cost attribution across providers. LiteLLM works well as a lightweight Python proxy, but in our production-like load tests we started seeing gateway overhead and operational complexity show up at higher concurrency. We moved to Bifrost for lower overhead and for first-class features like governance, cost semantics, and observability built into the gateway.
In our benchmark setup (with logging/retries enabled), LiteLLM added hundreds of microseconds of overhead per request. Results vary by deployment mode and configuration. When handling thousands of requests per second, this overhead compounds—infrastructure costs increase, tail latency suffers, and operational complexity grows.
Bifrost takes a different approach.
Enter Bifrost
Bifrost is an LLM gateway written in Go that adds approximately 11 microseconds of overhead per request in our test environment. That's roughly 40x faster than what we observed with LiteLLM in comparable configurations.
But the performance improvement is just one part of the story. Bifrost rethinks the control plane for LLM infrastructure—providing governance, cost attribution, and observability as first-class gateway features rather than requiring external tooling or application-level instrumentation.
Let me walk through the technical details.
maximhq/bifrost: Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
Bifrost: The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…
Setup and Deployment
Traditional gateway deployment often involves managing Python environments, dependency chains, and configuration files. Here's the Bifrost approach:
npx -y @maximhq/bifrost
This single command downloads a pre-compiled binary for your platform and starts a production-ready gateway on port 8080 with a web UI for configuration.
Compare this to typical Python gateway setup:
# Install Python (verify version compatibility)
pip install litellm
# Configure environment variables
# Set up configuration file
# Install additional dependencies for features
# Debug environment-specific issues
Under the hood, NPX simply fetches a pre-compiled binary for your platform. No Python interpreter required. No virtual environments. No dependency resolution. A single statically linked executable that runs immediately.
Why Go for Gateway Infrastructure
The choice of Go over Python has measurable impacts on production systems, particularly around concurrency, memory efficiency, and operational simplicity.
Concurrency Model
Python gateways typically scale through async event loops plus multiple worker processes. At high concurrency, the tradeoffs show up as higher memory per instance, coordination overhead between workers, and tail latency under bursts.
Go avoids most of these constraints. Goroutines are lightweight threads managed by the Go runtime and scheduled across all available CPU cores, so they can run truly in parallel. When a request arrives, Bifrost spawns a goroutine. When a thousand requests arrive simultaneously, Bifrost spawns a thousand goroutines, all running concurrently with minimal overhead.
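To see what that looks like from the client side, here is a rough sketch that fires a burst of concurrent requests through the gateway and reports latency percentiles. It assumes Bifrost is running locally on port 8080 with an OpenAI key configured, and it reuses the drop-in base URL shown later in this post:
import asyncio
import time

from openai import AsyncOpenAI

# Point the standard OpenAI SDK at the local Bifrost gateway (see the
# drop-in replacement section later in this post).
client = AsyncOpenAI(api_key="not-needed", base_url="http://localhost:8080/openai")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"ping {i}"}],
    )
    return time.perf_counter() - start

async def main() -> None:
    # Fire 100 requests at once; the gateway handles each on its own goroutine,
    # so the client simply awaits all the results.
    latencies = sorted(await asyncio.gather(*(one_request(i) for i in range(100))))
    print(f"p50={latencies[49]:.2f}s p99={latencies[98]:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())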
Memory Efficiency
A bare Python process typically requires 30-50MB of memory at startup. Add Flask or FastAPI, and baseline memory usage often reaches 100MB+ before handling any requests, though this varies with the specific setup and dependencies.
The entire Bifrost binary is approximately 20MB. In memory, a single Bifrost instance uses roughly 50MB under sustained load while handling thousands of requests per second.
Startup Time
Python applications need time to initialize: start the interpreter, import packages, load configuration. Startup commonly takes two to three seconds.
Bifrost starts in milliseconds. This matters for autoscaling, development iteration, and serverless deployments where cold starts impact user experience.
Benchmark Results
Here are measurements from a sustained load test on a t3.xlarge EC2 instance at 5,000 requests per second:
| Metric | LiteLLM | Bifrost | Improvement |
|---|---|---|---|
| Gateway Overhead | 440 µs | 11 µs | 40x faster |
| Memory Usage | ~500 MB | ~50 MB | 10x less |
| Gateway-level Failures | 11% | 0% | No failures observed |
| Queue Wait Time | 47 µs | 1.67 µs | 28x faster |
| Total Latency (with provider) | 2.12 s | 1.61 s | 24% faster |
These measurements come from sustained load over multiple hours rather than short synthetic bursts.
Beyond Performance: Control-Plane Features That Matter in Production
The main reason to move from LiteLLM to Bifrost isn't language; it's control-plane features. Bifrost adds governance (virtual keys, budgets, rate limits), consistent cost attribution, and production-oriented observability at the gateway layer, not scattered across application code.
This architectural choice centralizes concerns that would otherwise require external services or application-level instrumentation:
- Governance controls managed at the gateway rather than per-application
- Cost attribution with per-request tracking and aggregation
- Observability with structured logs, metrics, and request tracing built-in
- Failure isolation with circuit breakers and automatic failover
Let's examine these features in detail.
Production Features
Automatic Failover
When your primary provider hits rate limits or experiences downtime, requests should seamlessly move to backup providers without manual intervention.
Bifrost configuration:
{
"fallbacks": {
"enabled": true,
"order": [
"openai/gpt-4o-mini",
"anthropic/claude-sonnet-4",
"mistral/mistral-large-latest"
]
}
}
When OpenAI returns a rate limit error, Bifrost automatically retries with Anthropic. If that fails, it tries Mistral. Your application receives a successful response without implementing retry logic.
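Conceptually, the fallback loop works like the sketch below. This is an illustration of the behavior only, not Bifrost's internal code; call_provider and ProviderError are hypothetical placeholders standing in for real provider calls and errors:
# Illustrative sketch of gateway-side fallback semantics, not Bifrost's actual code.
FALLBACK_ORDER = [
    "openai/gpt-4o-mini",
    "anthropic/claude-sonnet-4",
    "mistral/mistral-large-latest",
]

class ProviderError(Exception):
    """Stand-in for a rate-limit or server error from a provider."""

def call_provider(model: str, messages: list[dict]) -> dict:
    # Hypothetical stand-in: pretend the OpenAI key is currently rate limited.
    if model.startswith("openai/"):
        raise ProviderError("429: rate limited")
    return {"model": model, "content": "ok"}

def complete_with_fallback(messages: list[dict]) -> dict:
    last_error: Exception | None = None
    for model in FALLBACK_ORDER:
        try:
            return call_provider(model, messages)  # first healthy provider wins
        except ProviderError as err:
            last_error = err                       # try the next model in the list
    raise RuntimeError("all providers failed") from last_error

print(complete_with_fallback([{"role": "user", "content": "Hello"}]))
# → served by anthropic/claude-sonnet-4 after the OpenAI attempt fails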
Load Balancing
Distributing load across multiple API keys prevents any single key from hitting rate limits:
{
"providers": {
"openai": {
"keys": [
{"name": "key-1", "value": "sk-...", "weight": 2.0},
{"name": "key-2", "value": "sk-...", "weight": 1.0},
{"name": "key-3", "value": "sk-...", "weight": 1.0}
]
}
}
}
The first key receives 50% of traffic, the other two receive 25% each. When one key approaches its rate limit, Bifrost automatically shifts load to healthy keys.
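The weight arithmetic is straightforward weighted random selection. A small sketch (illustrative only, not Bifrost's internals) reproduces the 50/25/25 split from the configuration above:
import random
from collections import Counter

# Weighted key selection: weight 2.0 gets ~50% of traffic, the two
# weight-1.0 keys get ~25% each.
keys = [("key-1", 2.0), ("key-2", 1.0), ("key-3", 1.0)]

def pick_key() -> str:
    names = [name for name, _ in keys]
    weights = [weight for _, weight in keys]
    return random.choices(names, weights=weights, k=1)[0]

counts = Counter(pick_key() for _ in range(100_000))
for name, count in sorted(counts.items()):
    print(f"{name}: {count / 100_000:.1%}")  # roughly 50% / 25% / 25%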
Semantic Caching
Semantic caching isn't a new concept; teams can build it externally, but Bifrost ships it as a first-class gateway feature, reducing moving parts.
Traditional caching requires exact string matches. But users rarely phrase questions identically:
- "What's the weather like?"
- "How's the weather today?"
- "Tell me about current weather conditions"
These are semantically equivalent. Bifrost uses vector embeddings to understand semantic similarity:
- Request arrives: "What is Python?"
- Bifrost generates an embedding using a fast model
- Checks vector store for similar embeddings
- Finds previous request: "Explain Python to me"
- Returns cached response (similarity score: 0.92)
Result: No LLM call required. The response returns in roughly 60 milliseconds (embedding plus vector lookup) instead of about 2 seconds. Cost: $0.00 instead of $0.0001.
Savings depend on cache hit rate and workload repetition. Over a million requests with a 60% cache hit rate, this saves approximately $60.
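The arithmetic behind that estimate, spelled out:
# Back-of-the-envelope savings from the numbers above.
requests = 1_000_000
hit_rate = 0.60
cost_per_llm_call = 0.0001  # dollars per call, per the example above

avoided_calls = requests * hit_rate
savings = avoided_calls * cost_per_llm_call
print(f"~${savings:,.0f} saved on {avoided_calls:,.0f} cached responses")  # ~$60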
Unified Interface
Every LLM provider has different API formats. OpenAI uses one schema. Anthropic uses another. Bedrock and Vertex AI each have their own specifications.
Bifrost provides a single API that works with all providers:
from openai import OpenAI
# Change only the base URL
client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8080/openai"
)
# Use ANY provider with the same code
response = client.chat.completions.create(
model="anthropic/claude-sonnet-4", # Not an OpenAI model
messages=[{"role": "user", "content": "Hello"}]
)
Your application code remains unchanged. Switch providers by modifying one line. No refactoring required. No rewriting integration tests.
Model Context Protocol (MCP)
MCP is an open protocol, introduced by Anthropic, that lets AI models call external tools such as web search, filesystem access, or database queries:
{
"mcp": {
"enabled": true,
"servers": {
"web-search": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-brave-search"]
},
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
}
}
}
}
This enables AI models to perform actions rather than only generating text responses.
Web UI and Observability
Most gateways provide configuration files and command-line tools. Bifrost includes a comprehensive web interface at http://localhost:8080:
Dashboard: Real-time metrics showing request counts, error rates, and costs per provider
Providers: Visual configuration for all providers with click-based key management
Logs: Complete request/response history with token usage, searchable and filterable
Settings: Configure caching, governance, and plugins without editing configuration files
All configuration, monitoring, and debugging can be performed through the web interface without SSH access to servers or manual log analysis.
Architecture Details
Request Flow
- Request arrives at Bifrost's HTTP server
- Request validation happens in microseconds
- Cache lookup checks semantic cache if enabled
- Cache hit? Return immediately (roughly 60ms total for a semantic hit, including embedding generation)
- Cache miss? Continue to provider selection
- Load balancer selects API key based on weights and health
- Concurrent request dispatched to provider (goroutine spawned)
- Response streaming begins immediately if enabled
- Cache storage happens asynchronously (non-blocking)
- Response returns to client with metadata
All operations are non-blocking where possible. Cache lookup doesn't block provider calls in no-store mode. Cache storage doesn't delay response delivery.
Concurrency Implementation
Bifrost uses Go's goroutines for concurrency:
Traditional Python Threading:
Request 1 → Thread 1 → Process (limited parallelism)
Request 2 → Thread 2 → Wait/Process (coordination overhead)
Request 3 → Thread 3 → Wait/Process (memory per thread)
Bifrost Goroutines:
Request 1 → Goroutine 1 ⟍
Request 2 → Goroutine 2 → All process in parallel → Responses
Request 3 → Goroutine 3 ⟋
Each goroutine starts with a stack of roughly 2KB, so millions can run concurrently.
Vector Store Integration
For semantic caching, Bifrost integrates with Weaviate:
- Request arrives with cache key: "user-session-123"
- Bifrost extracts message content
- Generates embedding using fast model (text-embedding-3-small)
- Searches Weaviate for similar embeddings (threshold: 0.8)
- Finds match with similarity 0.92
- Returns cached response with metadata
Embedding generation: approximately 50ms. Vector search: approximately 10ms. Total: 60ms compared to 2000ms for an actual LLM call.
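The heart of the semantic-match step is a similarity score compared against a threshold. Here is a minimal sketch of that idea using plain cosine similarity (in practice Bifrost delegates the search to the vector store):
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

THRESHOLD = 0.8  # same knob as the semantic_cache plugin's "threshold" setting

def lookup(query_embedding: list[float], cached: list[tuple[list[float], str]]):
    """Scan cached (embedding, response) pairs; return the best match above the threshold."""
    best_score, best_response = 0.0, None
    for embedding, response in cached:
        score = cosine_similarity(query_embedding, embedding)
        if score >= THRESHOLD and score > best_score:
            best_score, best_response = score, response
    return best_score, best_response

print(cosine_similarity([1.0, 0.0], [0.6, 0.8]))  # 0.6 → below the 0.8 threshold, a miss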
Setting Up Semantic Caching
Step 1: Install Weaviate
Use Docker for local development:
docker run -d \
--name weaviate \
-p 8081:8080 \
-e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
semitechnologies/weaviate:latest
Or use Weaviate Cloud (free tier available) at https://console.weaviate.cloud/
Step 2: Configure Bifrost
Update your config.json:
{
"providers": {
"openai": {
"keys": [{
"name": "main",
"value": "env.OPENAI_API_KEY",
"models": ["gpt-4o-mini"]
}]
}
},
"vector_store": {
"enabled": true,
"type": "weaviate",
"config": {
"host": "localhost:8081",
"scheme": "http"
}
},
"plugins": [{
"enabled": true,
"name": "semantic_cache",
"config": {
"provider": "openai",
"embedding_model": "text-embedding-3-small",
"ttl": "5m",
"threshold": 0.8
}
}]
}
Step 3: Test the Cache
# First request (cache miss, calls LLM)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-cache-key: user-123" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "What is Docker?"}]
}'
# Similar request (cache hit, fast response)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-cache-key: user-123" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Explain Docker to me"}]
}'
The second request returns in approximately 60ms instead of 2000ms.
Cache Hit Response
{
"choices": [...],
"extra_fields": {
"cache_debug": {
"cache_hit": true,
"hit_type": "semantic",
"similarity": 0.94,
"threshold": 0.8,
"provider_used": "openai",
"model_used": "text-embedding-3-small"
}
}
}
The response includes debugging information showing the similarity score and threshold. Adjust the threshold based on your accuracy requirements.
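To check this from code instead of curl, a short script (assuming the requests library and the configuration above) can send the two semantically similar questions and read the cache_debug block:
import requests

# Mirrors the curl calls above: ask the same question two ways and inspect
# the cache_debug metadata from the second response.
URL = "http://localhost:8080/v1/chat/completions"
HEADERS = {"Content-Type": "application/json", "x-bf-cache-key": "user-123"}

def ask(question: str) -> dict:
    payload = {
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": question}],
    }
    return requests.post(URL, headers=HEADERS, json=payload, timeout=30).json()

ask("What is Docker?")                # cache miss, goes to the provider
second = ask("Explain Docker to me")  # should be a semantic hit
debug = second.get("extra_fields", {}).get("cache_debug", {})
print(debug.get("cache_hit"), debug.get("similarity"))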
Drop-In Replacement
You can replace existing OpenAI or Anthropic SDK calls with Bifrost by changing one parameter.
Python Example
Before:
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}]
)
After:
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8080/openai" # Only change
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}]
)
Node.js Example
Before:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
After:
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: 'not-needed',
baseURL: 'http://localhost:8080/openai' // Only change
});
Benefits
This approach enables:
- Adding Bifrost to existing applications without refactoring
- Testing in production with gradual rollout
- Access to all Bifrost features (caching, fallbacks, monitoring) immediately
- Easy rollback if needed
Monitoring and Observability
Bifrost exposes Prometheus metrics at /metrics:
# Request metrics
bifrost_requests_total{provider="openai",model="gpt-4o-mini"} 1543
bifrost_request_duration_seconds{provider="openai"} 1.234
# Cache metrics
bifrost_cache_hits_total{type="semantic"} 892
bifrost_cache_misses_total 651
# Error metrics
bifrost_errors_total{provider="openai",type="rate_limit"} 12
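These counters are easy to consume programmatically as well. Here is a small sketch using the prometheus_client parser to compute a cache hit rate from the endpoint; the metric names are taken from the sample output above and may differ across Bifrost versions:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Scrape the gateway's /metrics endpoint and sum the cache counters.
text = requests.get("http://localhost:8080/metrics", timeout=5).text

counters: dict[str, float] = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        counters[sample.name] = counters.get(sample.name, 0.0) + sample.value

hits = counters.get("bifrost_cache_hits_total", 0.0)
misses = counters.get("bifrost_cache_misses_total", 0.0)
if hits + misses:
    print(f"cache hit rate: {hits / (hits + misses):.1%}")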
Grafana Dashboard
Connect Prometheus to Grafana for visualization:
- Requests per second by provider
- Latency percentiles (p50, p95, p99)
- Cache hit rates over time
- Cost tracking per provider
- Error rates and types
Structured Logging
Bifrost logs to structured JSON:
{
"level": "info",
"time": "2024-01-15T10:30:00Z",
"msg": "request completed",
"provider": "openai",
"model": "gpt-4o-mini",
"duration_ms": 1234,
"tokens": 456,
"cache_hit": false
}
This format integrates with any log aggregation service (CloudWatch, Datadog, Elasticsearch).
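Because the format is one JSON object per line, ad-hoc analysis is simple. A short sketch that summarizes request counts and average latency per provider from a log stream (field names follow the sample line above):
import json
import sys

# Pipe the gateway's JSON log output into this script on stdin.
totals: dict[str, list[float]] = {}
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON lines
    if event.get("msg") != "request completed":
        continue
    totals.setdefault(event["provider"], []).append(event.get("duration_ms", 0))

for provider, durations in totals.items():
    avg = sum(durations) / len(durations)
    print(f"{provider}: {len(durations)} requests, avg {avg:.0f} ms")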
Common Configuration Issues
Issue 1: Missing Cache Key
Semantic caching requires the x-bf-cache-key header. Without it, every request is a cache miss.
Incorrect:
curl -X POST http://localhost:8080/v1/chat/completions -d '{...}'
Correct:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "x-bf-cache-key: user-session-123" \
-d '{...}'
Issue 2: Threshold Configuration
Start with a threshold of 0.8 and adjust based on cache hit rate and accuracy:
{
"threshold": 0.8 // Starting point
}
Monitor your cache hit rate. If below 30%, lower the threshold to 0.75. If you're getting incorrect cached results, raise it to 0.85.
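That tuning rule can be expressed as a small helper. This is only an illustration of the heuristic described above (the hit and miss counts could come from the Prometheus counters shown earlier); suggest_threshold is not a Bifrost API:
# Illustrative threshold-tuning heuristic, not part of Bifrost.
def suggest_threshold(current: float, hits: int, misses: int, bad_hits_reported: bool) -> float:
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    if bad_hits_reported:
        return round(min(current + 0.05, 0.95), 2)  # too loose: raise toward 0.85
    if hit_rate < 0.30:
        return round(max(current - 0.05, 0.70), 2)  # too strict: lower toward 0.75
    return current

print(suggest_threshold(0.80, hits=200, misses=800, bad_hits_reported=False))  # 0.75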
Issue 3: Config Store Requirement
Some plugins require a config store:
{
"config_store": {
"enabled": true,
"type": "sqlite",
"config": {"path": "./config.db"}
}
}
Issue 4: Weaviate Network Configuration
Ensure Weaviate is accessible from Bifrost. For Docker deployments:
{
"vector_store": {
"enabled": true,
"type": "weaviate",
"config": {
"host": "weaviate-container:8080", // Use correct hostname
"scheme": "http"
}
}
}
For Weaviate Cloud:
{
"vector_store": {
"enabled": true,
"type": "weaviate",
"config": {
"host": "<weaviate-host>.gcp.weaviate.cloud",
"scheme": "https",
"api_key": "<weaviate-api-key>"
}
}
}
When to Use Bifrost
Bifrost provides immediate value for:
- Production systems handling more than 1,000 requests per day
- Applications where tail latency impacts user experience
- Teams that need automatic failover without complex orchestration
- Organizations tracking LLM costs across multiple providers
- Systems requiring governance controls (rate limits, budgets, virtual keys)
- Deployments where operational simplicity reduces maintenance burden
Even for smaller projects, Bifrost's minimal overhead and built-in features provide a robust foundation that scales without requiring future refactoring.
Getting Started
- Run npx -y @maximhq/bifrost
- Open http://localhost:8080
- Add your API keys in the UI
- Point your application to http://localhost:8080/openai
- Monitor performance and costs through the dashboard
Resources
- GitHub: github.com/maximhq/bifrost
- Website: getbifrost.ai
- Documentation: docs.getbifrost.ai
Questions or feedback? Please leave a comment below. If you use Bifrost in production, I'd be interested to hear about your experience and any challenges you encounter.
Top comments (2)
Nice work! Quick question, how does Bifrost handle Anthropic-specific features like system prompts and extended thinking? Are these supported via the unified API, or do we still need to hit Anthropic directly?
Great question! Bifrost supports Anthropic-specific features through the unified API. You can use system prompts, extended thinking, and other Anthropic parameters directly when routing to Claude models.