Intro to Chaos Engineering for QA. Learn how to test resilience by injecting failures with Docker and Playwright.
We are obsessed with the "Happy Path".
In traditional QA, we verify that the application works when everything is perfect:
- The network is stable.
- The database responds in 5ms.
- Third-party APIs are online.
But in production, nothing is perfect. Pods crash, networks lag, and databases lock up.
When these things happen, a standard Selenium/Playwright test just says: Failed. It doesn't tell you how the application failed. Did it show a graceful error message? Or did it crash with a white screen and a raw stack trace?
This is where Chaos Engineering comes in.
From QA to Resilience Engineering
Chaos Engineering isn't just for Site Reliability Engineers (SREs). As modern QAs, we need to stop asking "Does it work?" and start asking "What happens when it breaks?"
Today, I’ll show you how to write a Chaos Test using Python, Playwright, and the Docker SDK.
The Goal
We aren't going to wait for the database to fail. We are going to kill it intentionally in the middle of a test and verify that our frontend handles it gracefully.
The Stack
- Python (Test logic)
- Playwright (UI Interaction)
- Docker SDK (The Chaos Injector; a quick setup sketch follows this list)
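A note on setup before the script: I'm assuming pytest with the pytest-playwright plugin (that's where the page fixture comes from), the docker package, and a locally running Docker daemon. The container name and the auto-cleanup fixture below are illustrative additions of my own, a safety net so a failed chaos test doesn't leave your database down:

# conftest.py (hypothetical setup sketch; adjust names to your stack)
# Install: pip install pytest pytest-playwright docker && playwright install
import docker
import pytest

DB_CONTAINER_NAME = "postgres-prod"  # assumed name, match your own environment


@pytest.fixture(autouse=True)
def restore_database():
    """Safety net: ensure the database container is running again after
    every test, even if a chaos test failed halfway through."""
    yield
    client = docker.from_env()
    try:
        container = client.containers.get(DB_CONTAINER_NAME)
        if container.status != "running":
            container.start()
    except docker.errors.NotFound:
        pass  # nothing to restore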
The Code 🐍
Here is the complete script. It connects to your local Docker daemon, finds the Postgres container, and kills it while the user is trying to work.
import docker
import time
from playwright.sync_api import Page, expect


def test_database_failure_resilience(page: Page):
    # 1. Setup: Connect to Docker
    # We use the Docker SDK for Python (the docker package) to control the infrastructure
    client = docker.from_env()

    # Target your specific database container
    try:
        db_container = client.containers.get("postgres-prod")
    except docker.errors.NotFound:
        raise Exception("Database container not found! Is Docker running?")

    # 2. Happy Path: Verify the app loads normally
    print("✅ Step 1: Loading Dashboard...")
    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # 🧨 CHAOS TIME: Kill the Database
    print("🔥 Step 2: Injecting Chaos (Stopping DB)...")
    db_container.stop()

    # 3. Resilience Assertion
    # The app should NOT show a white screen or crash.
    # It SHOULD show a friendly "Connection Lost" toast or retry button.
    print("👀 Step 3: Verifying graceful degradation...")

    # Trigger an action that requires the DB
    page.reload()

    # Assert the UI handles the error
    expect(page.locator(".error-toast")).to_contain_text("Connection lost")
    expect(page.locator(".retry-button")).to_be_visible()

    # 🩹 RECOVERY: Bring the Database back
    print("🩹 Step 4: Healing the infrastructure...")
    db_container.start()

    # Give the database a moment to accept connections again, then trigger a manual retry
    time.sleep(3)
    page.locator(".retry-button").click()

    # 4. Self-Healing Assertion
    # The app should recover without requiring a full page refresh
    expect(page.locator(".user-balance")).to_be_visible(timeout=15_000)

    print("✅ Test Passed: System is resilient.")
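A hard stop is the bluntest failure you can inject. If you want to simulate a database that hangs rather than dies, the Docker SDK also exposes pause() and unpause(), which freeze the container's processes so connections stall instead of failing fast. Here is a sketch of that variation, under the same assumptions as above (a container named postgres-prod, a frontend that still serves its shell while API calls hang, and illustrative selectors):

import docker
from playwright.sync_api import Page, expect


def test_database_hang_resilience(page: Page):
    client = docker.from_env()
    db_container = client.containers.get("postgres-prod")  # assumed name

    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # Freeze the container instead of stopping it: queries now hang
    db_container.pause()
    try:
        page.reload()
        # The UI should surface a timeout state, not spin forever.
        # Generous timeout because we are deliberately waiting on the app's own timeout.
        expect(page.locator(".error-toast")).to_be_visible(timeout=20_000)
    finally:
        # Always unfreeze, even if the assertion fails
        db_container.unpause()

This exercises a different failure mode: a paused database trips your timeouts and loading states, while a stopped one trips your error handlers. Resilient systems need both.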
Why this matters
If you run this test and your application shows a 500 Server Error page, you have found a bug. Not a functional bug, but an architectural bug.
By adding "Chaos Tests" to your regression suite, you verify that your product doesn't just work; it survives.
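One practical tip if you adopt this: chaos tests deliberately break shared infrastructure, so you probably don't want them running inside every ordinary regression job. A simple way to keep them opt-in is a dedicated pytest marker; the marker name below is just a convention I'm assuming, nothing the tooling mandates:

# Register the marker once, e.g. in pytest.ini:
#   [pytest]
#   markers =
#       chaos: tests that deliberately break infrastructure
import pytest


@pytest.mark.chaos
def test_database_failure_resilience(page):
    ...  # the chaos test from above


# Everyday regression run, chaos excluded:  pytest -m "not chaos"
# Dedicated resilience stage:               pytest -m chaos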
👋 Want more Chaos?
I write The 5-Minute QA—a daily newsletter for Senior QAs and SDETs. Every morning, I send one actionable tip on Chaos Engineering.

Top comments (2)
This resonates a lot.
What I really like here is the reframing: you’re not “testing failure”, you’re testing system behavior under violated assumptions. That’s an architectural concern, not just a QA one.
Killing the database mid-flow forces an implicit contract to surface:
– What guarantees does the frontend actually rely on?
– Where do retries live?
– Is failure a state or just an unhandled exception?
Most teams unknowingly test success invariants and then act surprised when production violates them. Chaos tests like this turn resilience from an abstract SRE concept into something executable and reviewable.
I also appreciate that this isn’t random chaos for chaos’ sake — it’s scoped, intentional, and tied to observable user behavior. That’s the difference between “breaking things” and engineering robustness.
Great post. This kind of testing uncovers architectural bugs long before they become incidents.
This comment is fantastic—honestly, I might steal your phrasing for my next talk! 😄
'Testing system behavior under violated assumptions' is a much more precise definition than just 'Chaos Engineering.' You are spot on about the implicit contract. Often, the frontend assumes API = Always Available, and when that contract is violated by reality, the UI has no plan B.
Thanks for adding such a thoughtful perspective on 'random breaking' vs. 'engineering robustness.' That distinction is key.