In microservice architectures, config services are critical infrastructure. They store feature flags, API endpoints, and runtime settings that services query constantly: on startup, during requests, and when auto-scaling. Most are backed by a database with aggressive caching. Everything works beautifully, until your database goes down.
Here's the nightmare scenario: Your cache has a 5-minute TTL. Your database outage lasts 25+ minutes. At the 5-minute mark, cache entries start expiring. Services start failing. New instances can't bootstrap. Your availability drops to zero.
This is the story of building a cache that survives prolonged database outages by persisting stale data to disk and the hard lessons learned along the way.
The Problem Nobody Talks About
Everyone tells you to cache your database. "Just use Redis!" "Throw some Caffeine in there!" And they're right for normal operations.
But here's what the tutorials don't cover: What happens when your cache expires during a prolonged outage?
The failure sequence looks like this:
- T+0 min: Database goes down. Cache still serving traffic (100% hit rate).
- T+5 min: First cache entries expire. Cache misses start happening.
- T+6 min: Cache miss → try database → timeout. Service starts returning errors.
- T+10 min: Most cache entries expired. Availability plummets.
- T+15 min: Auto-scaling spins up new instances. They can't fetch configs. Immediate crash.
- T+25 min: Database finally recovers. You've been down for 20 minutes.
The traditional solution is replication: Aurora multi-region, DynamoDB global tables, all that good stuff. But replication has its own problems:
Cost: You're running duplicate infrastructure 24/7 for failure scenarios that happen 2-3 times per year.
Complexity: Cross-region replication, failover logic, data consistency concerns, network latency.
Partial protection: Regional outages still take you down. Replication lag can be seconds to minutes.
There had to be a simpler approach.
The Core Insight: Stale Data Beats No Data
Here's the controversial take that changed everything: For read-heavy config services, serving 10-minute-old data during an outage is infinitely better than serving nothing.
Think about what your config service actually stores:
- Feature flags: Don't change every second
- Service endpoints: Relatively stable
- API rate limits: Rarely updated mid-incident
- Routing rules: Can tolerate brief staleness
Sure, you might serve a feature flag that was disabled 5 minutes ago. But that's better than taking down your entire service because the config is unreachable.
The question became: How do I serve stale data when my cache is empty and my database is unavailable?
The answer: Persist cache evictions to local disk.
Architecture: The Three-Tier Survival Strategy
I built what I call a "tier cache"—three layers of defense against database failures:
Normal Operation Flow:
- Request comes in → check L1 (memory)
- Cache hit (99% of the time) → return immediately in ~2.5μs
- Cache miss → fetch from L2 (database)
- Write to L1 for fast access
- Asynchronously write to L3 (disk) for outage protection
Outage Operation Flow:
- Request comes in → check L1 (memory)
- Cache miss → try L2 (database) → connection timeout
- Fall back to L3 (disk) → serve stale data
- Service stays alive with degraded data
The key innovation: Every cache eviction gets persisted to disk. When the database is unreachable, we serve from this stale disk cache. It's not perfect data, but it keeps services running.
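To make that fallback chain concrete, here's a minimal sketch of the read path. The cache, database, and diskStore collaborators (and the exception type) are illustrative, not the repository's exact API; L3 is populated by the eviction listener described later in the post.

public ConfigValue get(String key) {
    // L1: in-memory Caffeine cache
    ConfigValue hit = cache.getIfPresent(key);
    if (hit != null) {
        return hit; // ~2.5 microseconds, the common case
    }
    try {
        // L2: the database is the source of truth
        ConfigValue fresh = database.fetch(key);
        cache.put(key, fresh); // hot data stays in memory
        return fresh;
    } catch (DatabaseUnavailableException e) {
        // L3: database unreachable -> fall back to stale data on disk
        ConfigValue stale = diskStore.load(key, ConfigValue.class);
        if (stale != null) {
            return stale; // degraded but alive
        }
        throw e; // never cached anywhere: nothing to serve
    }
}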
Why RocksDB?
My first instinct was simple file serialization. Why not just dump everything to JSON?
File cacheFile = new File("cache-backup.json");
objectMapper.writeValue(cacheFile, cacheData);
This worked great for 100 entries in my test. Then I tried 10,000 realistic config objects:
- File size: 45MB of verbose JSON
- Write time: 280ms (blocking the cache)
- Read time: 380ms (sequential scan to find one key)
Completely unusable.
I needed something that could:
- Read individual keys fast without scanning the entire file
- Compress data since config JSON is highly repetitive
- Handle writes efficiently without blocking cache operations
- Survive crashes without losing all data
After researching embedded databases, RocksDB emerged as the clear winner:
Compression: My 45MB JSON dump compressed to ~8MB with LZ4 (5.6x reduction). Real-world compression varies by data patterns, typically in the 2-4x range.
Fast random reads: Log-Structured Merge (LSM) tree design optimized for key-value lookups. 10-50μs to fetch any key.
Write-optimized: Writes go to memory first, then flush to disk in batches. No blocking on individual writes.
Battle-tested: Powers production systems at Facebook, LinkedIn, Netflix. If it's good enough for them, it's good enough for my config service.
Crash safety: Write-Ahead Logging (WAL) ensures durability even if the process crashes.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDBDiskStore implements AutoCloseable {
    private final RocksDB db;
    private final ObjectMapper mapper;

    public RocksDBDiskStore(String path) throws RocksDBException {
        RocksDB.loadLibrary();
        Options options = new Options()
            .setCreateIfMissing(true)
            .setCompressionType(CompressionType.LZ4_COMPRESSION)
            .setMaxOpenFiles(256)
            .setWriteBufferSize(8 * 1024 * 1024); // 8MB in-memory write buffer
        this.db = RocksDB.open(options, path);
        this.mapper = new ObjectMapper();
    }

    @Override
    public void close() {
        db.close();
    }
}
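The store itself only needs simple put/get semantics on top of that handle. Here's a minimal sketch of those methods, assuming entries are serialized as JSON byte arrays with Jackson (plus java.io.IOException and java.nio.charset.StandardCharsets imports); the repository's actual serialization and error handling may differ.

// Hypothetical save/load methods for RocksDBDiskStore (sketch, not the repo's exact code)
public void save(String key, Object value) {
    try {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] valueBytes = mapper.writeValueAsBytes(value); // Jackson -> compact JSON bytes
        db.put(keyBytes, valueBytes);                        // buffered by RocksDB, flushed in batches
    } catch (RocksDBException | IOException e) {
        // Best-effort persistence: a failed disk write must never break the cache path.
    }
}

public <T> T load(String key, Class<T> type) {
    try {
        byte[] valueBytes = db.get(key.getBytes(StandardCharsets.UTF_8));
        return valueBytes == null ? null : mapper.readValue(valueBytes, type);
    } catch (RocksDBException | IOException e) {
        return null; // treat disk-read errors as a cache miss
    }
}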
Disk Management Built-In
Implementation: the RocksDBDiskStore runs a configurable background cleanup thread:
// From RocksDBDiskStore.java
if (cleanupDuration > 0) {
    this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread thread = new Thread(r, "RocksDB-Cleanup");
        thread.setDaemon(true);
        return thread;
    });
    this.scheduler.scheduleAtFixedRate(
        this::cleanup,
        cleanupDuration,
        cleanupDuration,
        unit
    );
}
This daemon thread runs periodic cleanup to prevent unbounded disk growth. You configure the cleanup frequency when initializing the disk store, ensuring L3 doesn't consume all server disk space over time.
Cache Eviction: The Secret Sauce
The clever part is when data gets written to RocksDB. I don't persist every cache write—that would be wasteful. Instead, I persist on cache eviction.
Caffeine's eviction listener is the key:
this.cache = Caffeine.newBuilder()
    .maximumSize(maxSize)
    .expireAfterWrite(ttl)
    .evictionListener((key, value, cause) -> {
        this.diskStore.save(key, value); // write to RocksDB
    })
    .build();
When does eviction happen?
- Time-based expiry: Entry was written more than X minutes ago → TTL (expireAfterWrite) elapses → eviction
- Size-based eviction: Cache hits 10,000 entries → least recently used gets evicted
Why this approach is efficient:
Hot data stays in memory: Frequently accessed configs never touch disk.
Cold data gets archived: When a config entry expires from L1, it gets persisted to L3 for outage scenarios.
Eviction-triggered persistence: Data is written to disk when evicted from memory, not on every cache operation.
During normal operations: L3 is write-mostly, read-rarely. The database is healthy, so cache misses go to L2, not L3.
During outages: L3 becomes read-heavy. Cache misses can't reach L2 (database down), so they fall back to L3 for stale data.
This design means your disk isn't constantly thrashing with writes—it only persists data that's already being evicted from memory anyway.
Benchmarking: Does This Actually Work?
I built a test harness to simulate realistic failure scenarios. Here are the results that convinced me this approach works:
Test 1: Long Outage Resilience (25-min database failure)
Setup: 10K cache entries, 5-min TTL, simulated database outage at T+0. Values below are availability (% of requests served).
| Time Elapsed | Tier Cache | EhCache (disk) | Caffeine Only |
|---|---|---|---|
| 3 minutes | 100% | 100% | 100% |
| 5 minutes | 100% | 0% | 0% |
| 7 minutes | 100% | 0% | 0% |
| 10 minutes | 100% | 0% | 0% |
| 25 minutes | 100% | 0% | 0% |
Key finding: Tier cache maintained availability for previously-cached keys by serving from L3 (RocksDB) after L1 expired. This assumes every requested key was previously cached; newly added configs or never-requested keys won't be in L3 and will fail. That warm-key assumption is close to typical production traffic, where reads concentrate on a stable set of hot config keys.
Why did EhCache fail? Its disk persistence is designed for overflow, not outage recovery. When the cache expires, it tries to fetch from the database (which is down) rather than serving stale disk data.
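For context, the outage itself can be simulated with a stub that stands in for the database client and can be flipped into a failing state mid-test. This is a sketch; the repository's actual harness may differ.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a toggleable fake database used to simulate the outage timeline.
class SimulatedConfigDatabase {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private volatile boolean down = false;

    void failDatabase()    { down = true;  } // flip at T+0 of the simulated outage
    void recoverDatabase() { down = false; } // flip at T+25 min

    String fetch(String key) {
        if (down) {
            throw new IllegalStateException("simulated database outage");
        }
        return data.get(key);
    }
}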
Test 2: Normal Operation Performance
Setup: Database healthy, measuring latency for cache operations
| Operation | Tier Cache | EhCache | Caffeine |
|---|---|---|---|
| Cache hit (memory) | 2.50 μs | 6.31 μs | 2.74 μs |
| Cache miss (DB up) | 1.2 ms | 1.3 ms | 1.1 ms |
| Disk fallback | 19.11 μs | N/A | N/A |
Important clarification: The "cache miss" numbers include network round-trip (mocked) to the database. The "disk fallback" is what happens when the DB is down—we serve from RocksDB instead.
During normal operations, tier cache performs nearly identically to vanilla Caffeine. The disk layer only matters during outages.
Test 3: Write Throughput Under Memory Pressure
Setup: 50K writes with 10K cache size limit (heavy eviction)
| Strategy | Total Time | Throughput | vs Baseline |
|---|---|---|---|
| Caffeine Only | 37 ms | 1,351,351/s | 100% |
| Tier Cache | 140 ms | 357,143/s | 26% |
| EhCache | 201 ms | 248,756/s | 18% |
This is the cost. Async disk persistence reduces write throughput by ~74%. Every eviction triggers a disk write, and under heavy churn, this adds up.
What I Got Wrong
This is a learning project, not production-ready code. Here are the real limitations you need to understand:
1. The Cold Start Problem
New instances start with empty RocksDB. During an outage, they have no stale data to serve.
What happens: Auto-scaling spins up a new pod → L1 empty → L2 down → L3 empty → requests fail.
My benchmarks showed 100% availability, but that assumed warm caches. Real-world availability during outages depends on whether instances have previously cached the requested keys.
2. Single Node Limitation
Each instance maintains its own local RocksDB. In a distributed deployment with multiple instances, each has different stale data based on what it personally cached. Request routing becomes non-deterministic—the same config key might return different values depending on which instance handles the request.
This isn't a bug to fix; it's a fundamental architectural choice. Local disk persistence trades consistency for simplicity. Solving this requires either accepting eventual consistency or moving to distributed storage like Redis, which defeats the "simple local cache" design goal.
When Should You Actually Use This?
This project demonstrates caching patterns and outage resilience strategies. Based on the architecture:
Appropriate for:
- Single-node applications
- Systems where eventual consistency across instances is acceptable
Not appropriate for:
- Multi-instance production deployments requiring consistency
- Applications needing strong consistency guarantees
Try It Yourself
The full implementation is on GitHub: github.com/SivagurunathanV/tier-cache
Quick start:
git clone https://github.com/SivagurunathanV/tier-cache
cd tier-cache
./gradlew test # Run test suite
./gradlew run # Interactive demo
What's Next?
If you're building something similar:
- Start simple (JSON files) and profile before over-engineering
- Measure your actual outage frequency and duration
- Calculate the real cost of downtime vs. infrastructure
- Test with realistic failure scenarios, not just happy paths
Key improvements for production:
- Implement write coalescing (batch evictions); see the sketch after this list
- Add circuit breakers and error handling
- Build comprehensive observability
- Test cold start and multi-instance scenarios
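As an example of write coalescing, evicted entries could be buffered in memory and flushed to RocksDB as a single batch on a timer instead of issuing one put per eviction. The sketch below uses org.rocksdb.WriteBatch and WriteOptions; the queue, method names, and scheduling are illustrative, not part of the current repository.

// Buffer evictions in memory, then flush them as one batched RocksDB write.
private final ConcurrentLinkedQueue<Map.Entry<byte[], byte[]>> pending = new ConcurrentLinkedQueue<>();

void onEviction(byte[] key, byte[] value) {
    pending.add(new AbstractMap.SimpleEntry<>(key, value)); // cheap, non-blocking enqueue
}

void flushPending() throws RocksDBException {   // run from a scheduler, e.g. every 100 ms
    try (WriteBatch batch = new WriteBatch();
         WriteOptions opts = new WriteOptions()) {
        Map.Entry<byte[], byte[]> e;
        while ((e = pending.poll()) != null) {
            batch.put(e.getKey(), e.getValue());
        }
        if (batch.count() > 0) {
            db.write(opts, batch);              // one disk write for many evictions
        }
    }
}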
I'd love to hear about your failure survival strategies. What patterns have kept your services alive during database outages? What trade-offs have you made?