I Benchmarked LLM APIs on Live BGP Streams. Here’s What Actually Matters.

Manas Mudbari

Most LLM benchmarks are polite.

They run clean prompts on static text, measure token speed, and declare a winner. That’s fine if you’re building a chatbot. It’s almost useless if you’re building a real-time system.

I wanted to see what happens when LLMs are exposed to something messier: live, high-velocity network telemetry.

So I wired multiple LLM APIs directly into a live BGP stream and measured how they behaved when the data never stopped.

This post is about what broke, what worked, and why “smartest model” is often the wrong question.


The Setup (Simple, No Tricks)

The data source was a live BGP feed from RIPE RIS:

WebSocket endpoint:

```
wss://ris-live.ripe.net/v1/ws/?client=turbomart-test
```

Subscription message:

```json
{
  "type": "ris_subscribe",
  "data": { "host": "rrc21" }
}
```

That gives you a continuous firehose of routing updates. No batching. No backpressure help.
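
If you want to reproduce the feed side, here is a minimal sketch using Python's `websockets` library. The endpoint and subscribe message are the ones above; the `ris_message` envelope and field names follow the RIS Live docs, and the printing is purely illustrative:

```python
import asyncio
import json

import websockets  # pip install websockets

RIS_URL = "wss://ris-live.ripe.net/v1/ws/?client=turbomart-test"

async def stream_bgp_updates():
    async with websockets.connect(RIS_URL) as ws:
        # Subscribe to a single route collector, as above.
        await ws.send(json.dumps({
            "type": "ris_subscribe",
            "data": {"host": "rrc21"},
        }))
        # The feed never stops; each message is one routing update.
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "ris_message":
                yield msg["data"]

async def main():
    async for update in stream_bgp_updates():
        print(update.get("peer_asn"), update.get("path"))

asyncio.run(main())
```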

Each update was sent to five LLM APIs:

  • OpenAI
  • Anthropic
  • Azure OpenAI
  • Gemini
  • Grok

Same prompts. Same parameters. No model-specific tuning.

System prompt:

```
You are an expert network engineer who analyzes BGP feeds for a living…
```

User prompt:

```
Summarize the following BGP update in under 140 characters for a real-time network alert. Include ASN owner, prefix, and region if known.
```

If a model failed, truncated its output, or rambled, that counted as part of the result.
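
For concreteness, here is a sketch of what one per-update call can look like, using the OpenAI Python SDK as the example provider. The model name and the glue code are placeholders, not the exact harness:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are an expert network engineer who analyzes BGP feeds for a living..."
USER_PROMPT = (
    "Summarize the following BGP update in under 140 characters for a "
    "real-time network alert. Include ASN owner, prefix, and region if known."
)

def summarize_update(update: dict) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: the post doesn't name the models used
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{USER_PROMPT}\n\n{json.dumps(update)}"},
        ],
        temperature=0,  # same parameters everywhere; no per-model tuning
    )
    return resp.choices[0].message.content
```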


What I Measured (The Stuff That Actually Hurts in Production)

I didn’t care about abstract “intelligence.” I cared about things that break pipelines:

  • Time to First Token (TTFT)
  • Total completion latency
  • Tokens in vs tokens out
  • Compression ratio (output tokens divided by input tokens)

These metrics determine:

  • how stale your alerts are
  • whether your buffers explode
  • whether you burn money on filler text
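
Here is a sketch of how those numbers can be collected from a streaming response, shown with the OpenAI SDK. Other providers need their own streaming hooks, and the timing logic is illustrative, not the original harness:

```python
import time

def measure(client, model: str, messages: list[dict]) -> dict:
    start = time.perf_counter()
    ttft = None
    chunks = []
    usage = None

    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True,
        stream_options={"include_usage": True},  # final chunk carries token counts
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # first visible token
            chunks.append(chunk.choices[0].delta.content)
        if chunk.usage:
            usage = chunk.usage

    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "total_s": total,
        "tokens_in": usage.prompt_tokens if usage else None,
        "tokens_out": usage.completion_tokens if usage else None,
        # Compression ratio: output tokens divided by input tokens.
        "compression": (usage.completion_tokens / usage.prompt_tokens) if usage else None,
        "text": "".join(chunks),
    }
```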

The Averages (Across All Samples)

Here’s what the numbers looked like when averaged per provider:

| Provider | TTFT | Total latency | Tokens out | Compression |
| --- | --- | --- | --- | --- |
| OpenAI | ~830 ms | ~1.8 s | ~45 | ~0.01 |
| Anthropic | ~2.1 s | ~6.3 s | ~137 | ~0.03 |
| Azure OpenAI | ~2.8 s | ~2.8 s | ~9,400 | ~1.85 |
| Gemini | ~3.0 s | ~3.4 s | ~9,600 | ~1.84 |
| Grok | ~19 s | ~19.7 s | ~33 | ~0.01 |

Even without context, some things should already look alarming.


Model-by-Model: What Actually Happened

OpenAI

OpenAI behaved exactly how you’d want in a streaming system:

  • fast first token
  • short, clean summaries
  • almost no wasted output

It followed the prompt closely and didn’t try to be clever. That’s a feature, not a bug.

If you’re building dashboards, alerts, or anything user-facing in real time, OpenAI was the most predictable option.


Anthropic

Anthropic did something different.

It didn’t just summarize updates. It tried to interpret them. Sometimes it flagged anomalies. Sometimes it suggested what might be happening.

That extra reasoning came at a cost:

  • slower responses
  • significantly more tokens
  • longer completions

This is not an alerting engine. It’s closer to a junior analyst reading the feed.

Great for offline analysis. Dangerous for live alerts.


Azure OpenAI

Azure OpenAI struggled in this setup.

It often behaved as if it only partially understood the incoming data. Output was verbose, repetitive, and sometimes ignored the summarization constraint entirely.

The compression ratio tells the story: output was often larger than input.

That’s a red flag in any streaming system.

I suspect this can be fixed with tighter controls, but out of the box it wasn’t stream-safe.
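
If you do need Azure in the loop, "tighter controls" probably starts with hard output caps rather than prompt-level instructions. A sketch of the idea, where the cap values are guesses rather than tested fixes:

```python
# Assumption: `client` is an AzureOpenAI client and `deployment` is your
# deployment name; the numbers are illustrative budgets, not tested values.
resp = client.chat.completions.create(
    model=deployment,  # Azure routes on the deployment name
    messages=messages,
    max_tokens=60,     # hard cap: ~140 chars of summary, no essays
    temperature=0,
)
text = (resp.choices[0].message.content or "").strip()
if len(text) > 140:
    # Enforce the alert budget even when the model ignores the prompt.
    text = text[:137] + "..."
```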


Gemini

Gemini responses were usually fast enough, but often incomplete.

Some outputs were truncated. Others were short but low-value. Many wasted tokens without adding useful signal.

It felt optimized for short Q&A, not for interpreting structured telemetry.

If you’re processing logs or metrics streams, Gemini isn’t there yet.


Grok

Grok was the strangest.

Responses were extremely slow to start, but very short once they arrived. Often it just signaled that something changed, without explaining what or why.

Think of it as a “delta detector,” not a summarizer.

If your use case is “ping me when anything changes,” maybe.
If you need explanation, no.


The Big Lesson

LLM APIs are not interchangeable components.

They encode assumptions about:

  • how fast answers should arrive
  • how verbose responses should be
  • how much reasoning is appropriate
  • how strictly prompts should be followed

In real-time systems:

  • latency beats intelligence
  • consistency beats creativity
  • token efficiency beats verbosity

An answer that arrives late is indistinguishable from noise.
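
One practical consequence: put a deadline on every call and drop whatever misses it. A sketch, reusing the hypothetical `summarize_update` from earlier; the 2-second budget is illustrative:

```python
import asyncio

async def summarize_or_drop(update: dict, budget_s: float = 2.0) -> str | None:
    try:
        # Run the blocking call in a thread and abandon it past the deadline.
        return await asyncio.wait_for(
            asyncio.to_thread(summarize_update, update), timeout=budget_s
        )
    except asyncio.TimeoutError:
        # The thread may still finish in the background; native async clients
        # are cleaner for real pipelines. Either way, the alert is dropped.
        return None
```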


If You’re Building a Streaming System

Based on this experiment:

  • Use OpenAI for real-time alerts and dashboards
  • Use Anthropic for offline analysis or investigations
  • Be cautious with Azure OpenAI unless you tightly constrain it
  • Avoid Gemini for structured stream summarization
  • Use Grok only if you care about “something changed,” not details
