Intro to Chaos Engineering for QA. Learn how to test resilience by injecting failures with Docker and Playwright.
We are obsessed with the "Happy Path".
In traditional QA, we verify that the application works when everything is perfect:
- The network is stable.
- The database responds in 5ms.
- Third-party APIs are online.
But in production, nothing is perfect. Pods crash, networks lag, and databases lock up.
When these things happen, a standard Selenium/Playwright test just says: Failed. It doesn't tell you how the application failed. Did it show a graceful error message? Or did it crash with a white screen and a raw stack trace?
This is where Chaos Engineering comes in.
From QA to Resilience Engineering
Chaos Engineering isn't just for Site Reliability Engineers (SREs). As modern QAs, we need to stop asking "Does it work?" and start asking "What happens when it breaks?"
Today, I’ll show you how to write a Chaos Test using Python, Playwright, and the Docker SDK.
The Goal
We aren't going to wait for the database to fail. We are going to kill it intentionally in the middle of a test and verify that our frontend handles it gracefully.
The Stack
- Python (Test logic)
- Playwright (UI Interaction)
- Docker SDK (The Chaos Injector; a quick setup sketch follows this list)
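A note on setup before the script: I'm assuming pytest with the pytest-playwright plugin (that's where the page fixture comes from), the docker package, and a locally running Docker daemon. The container name and the auto-cleanup fixture below are illustrative additions of my own, a safety net so a failed chaos test doesn't leave your database down:

# conftest.py (hypothetical setup sketch; adjust names to your stack)
# Install: pip install pytest pytest-playwright docker && playwright install
import docker
import pytest

DB_CONTAINER_NAME = "postgres-prod"  # assumed name, match your own environment


@pytest.fixture(autouse=True)
def restore_database():
    """Safety net: ensure the database container is running again after
    every test, even if a chaos test failed halfway through."""
    yield
    client = docker.from_env()
    try:
        container = client.containers.get(DB_CONTAINER_NAME)
        if container.status != "running":
            container.start()
    except docker.errors.NotFound:
        pass  # nothing to restore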
The Code 🐍
Here is the complete script. It connects to your local Docker daemon, finds the Postgres container, and kills it while the user is trying to work.
import docker
import time
from playwright.sync_api import Page, expect


def test_database_failure_resilience(page: Page):
    # 1. Setup: Connect to Docker
    # We use the Docker SDK for Python (the docker package) to control the infrastructure
    client = docker.from_env()

    # Target your specific database container
    try:
        db_container = client.containers.get("postgres-prod")
    except docker.errors.NotFound:
        raise Exception("Database container not found! Is Docker running?")

    # 2. Happy Path: Verify the app loads normally
    print("✅ Step 1: Loading Dashboard...")
    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # 🧨 CHAOS TIME: Kill the Database
    print("🔥 Step 2: Injecting Chaos (Stopping DB)...")
    db_container.stop()

    # 3. Resilience Assertion
    # The app should NOT show a white screen or crash.
    # It SHOULD show a friendly "Connection Lost" toast or retry button.
    print("👀 Step 3: Verifying graceful degradation...")

    # Trigger an action that requires the DB
    page.reload()

    # Assert the UI handles the error
    expect(page.locator(".error-toast")).to_contain_text("Connection lost")
    expect(page.locator(".retry-button")).to_be_visible()

    # 🩹 RECOVERY: Bring the Database back
    print("🩹 Step 4: Healing the infrastructure...")
    db_container.start()

    # Give the database a moment to accept connections again, then trigger a manual retry
    time.sleep(3)
    page.locator(".retry-button").click()

    # 4. Self-Healing Assertion
    # The app should recover without requiring a full page refresh
    expect(page.locator(".user-balance")).to_be_visible(timeout=15_000)

    print("✅ Test Passed: System is resilient.")
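A hard stop is the bluntest failure you can inject. If you want to simulate a database that hangs rather than dies, the Docker SDK also exposes pause() and unpause(), which freeze the container's processes so connections stall instead of failing fast. Here is a sketch of that variation, under the same assumptions as above (a container named postgres-prod, a frontend that still serves its shell while API calls hang, and illustrative selectors):

import docker
from playwright.sync_api import Page, expect


def test_database_hang_resilience(page: Page):
    client = docker.from_env()
    db_container = client.containers.get("postgres-prod")  # assumed name

    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # Freeze the container instead of stopping it: queries now hang
    db_container.pause()
    try:
        page.reload()
        # The UI should surface a timeout state, not spin forever.
        # Generous timeout because we are deliberately waiting on the app's own timeout.
        expect(page.locator(".error-toast")).to_be_visible(timeout=20_000)
    finally:
        # Always unfreeze, even if the assertion fails
        db_container.unpause()

This exercises a different failure mode: a paused database trips your timeouts and loading states, while a stopped one trips your error handlers. Resilient systems need both.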
Why this matters
If you run this test and your application shows a 500 Server Error page, you have found a bug. Not a functional bug, but an architectural bug.
By adding "Chaos Tests" to your regression suite, you verify that your product doesn't just work; it survives.
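One practical tip if you adopt this: chaos tests deliberately break shared infrastructure, so you probably don't want them running inside every ordinary regression job. A simple way to keep them opt-in is a dedicated pytest marker; the marker name below is just a convention I'm assuming, nothing the tooling mandates:

# Register the marker once, e.g. in pytest.ini:
#   [pytest]
#   markers =
#       chaos: tests that deliberately break infrastructure
import pytest


@pytest.mark.chaos
def test_database_failure_resilience(page):
    ...  # the chaos test from above


# Everyday regression run, chaos excluded:  pytest -m "not chaos"
# Dedicated resilience stage:               pytest -m chaos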
👋 Want more Chaos?
I write The 5-Minute QA—a daily newsletter for Senior QAs and SDETs. Every morning, I send one actionable tip on Chaos Engineering.

Top comments (2)
This resonates a lot.
What I really like here is the reframing: you’re not “testing failure”, you’re testing system behavior under violated assumptions. That’s an architectural concern, not just a QA one.
Killing the database mid-flow forces an implicit contract to surface:
– What guarantees does the frontend actually rely on?
– Where do retries live?
– Is failure a state or just an unhandled exception?
Most teams unknowingly test success invariants and then act surprised when production violates them. Chaos tests like this turn resilience from an abstract SRE concept into something executable and reviewable.
I also appreciate that this isn’t random chaos for chaos’ sake — it’s scoped, intentional, and tied to observable user behavior. That’s the difference between “breaking things” and engineering robustness.
Great post. This kind of testing uncovers architectural bugs long before they become incidents.
This comment is fantastic—honestly, I might steal your phrasing for my next talk! 😄
'Testing system behavior under violated assumptions' is a much more precise definition than just 'Chaos Engineering.' You are spot on about the implicit contract. Often, the frontend assumes API = Always Available, and when that contract is violated by reality, the UI has no plan B.
Thanks for adding such a thoughtful perspective on 'random breaking' vs. 'engineering robustness.' That distinction is key.