Saurav Jha

Dual write problem in distributed systems

The dual write problem is a classic issue that arises in distributed systems when a single logical operation needs to update two (or more) separate systems or data stores — for example, writing to a database and sending an event/message to a message broker (like Kafka).

Because these systems are independent, ensuring atomicity (all-or-nothing behavior) across them is extremely difficult without a distributed transaction protocol. Let’s break it down clearly:

Distributed transaction protocol:
A distributed transaction protocol is a mechanism that ensures atomicity (all-or-nothing behavior) for a transaction that spans multiple independent systems, such as multiple databases, services, or message brokers.
In other words, it makes several different systems behave as if they were performing a single, unified transaction.
Consensus-based protocols (Paxos/Raft) guarantee strong consistency per replicated state machine and are widely used in distributed databases and configuration stores.

TrueTime + Paxos provides globally consistent ACID transactions, as used in Google Spanner.
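
The classic example of such a protocol is two-phase commit (2PC). As a toy sketch of the idea only (the participant objects and their prepare/commit/rollback methods are illustrative, not a real implementation):

```python
def two_phase_commit(participants) -> bool:
    """Toy coordinator: commit everywhere only if every participant votes yes."""
    # Phase 1 (prepare): ask each participant whether it can commit
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): everyone voted yes, so commit everywhere
        for p in participants:
            p.commit()
        return True
    # At least one participant voted no: abort everywhere
    for p in participants:
        p.rollback()
    return False
```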

Microservices typically use:
Saga Pattern for business-level distributed workflows (a minimal sketch follows this list)
Transactional Outbox Pattern for local atomicity
Idempotency + retries
These approaches trade strict ACID for eventual consistency and resilience.
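
To give a flavour of the Saga Pattern mentioned above, here is a minimal sketch; the step names are purely illustrative stubs, and a real saga would run each step as a local transaction in a separate service:

```python
# Illustrative local steps and their compensating actions (stubs for the sketch)
def reserve_inventory(): print("inventory reserved")
def release_inventory(): print("inventory released")
def charge_payment(): print("payment charged")
def refund_payment(): print("payment refunded")
def create_shipment(): print("shipment created")
def cancel_shipment(): print("shipment cancelled")

def create_order_saga():
    """Run each step in order; on failure, compensate completed steps in reverse."""
    steps = [
        (reserve_inventory, release_inventory),
        (charge_payment, refund_payment),
        (create_shipment, cancel_shipment),
    ]
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise
```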

Example Scenario

Suppose you have a user service that:
Stores user data in PostgreSQL.
Publishes a “UserCreated” event to Kafka.

The naive (and common) approach:

BEGIN;
INSERT INTO users (id, name) VALUES (...);
SEND "UserCreated" EVENT TO Kafka;  -- not covered by the database transaction
COMMIT;

If the database insert succeeds but the Kafka send fails (or vice versa), your systems become inconsistent — one reflects the change, the other doesn’t.
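
To make the failure window concrete, here is a rough Python sketch of the naive approach (the table, the topic name, and the kafka-python client are assumptions for illustration, not part of the original post):

```python
import json
import uuid

from kafka import KafkaProducer  # kafka-python client, used here for illustration
from psycopg2 import connect

conn = connect(...)  # connection parameters omitted
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

user_id = str(uuid.uuid4())

with conn.cursor() as cur:
    cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user_id, "Alice"))
conn.commit()  # ✅ the database now has the user...

# ...but if the process crashes or Kafka is unreachable right here,
# the event below is never sent and downstream systems never hear about the user.
producer.send("user-events", {"type": "UserCreated", "user_id": user_id})
producer.flush()
```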

The Core Problem

This is the dual write problem — trying to atomically update two systems that don’t share a transaction coordinator.

There’s no global transaction manager ensuring both operations succeed or fail together.

Failures (network issues, process crashes, retries) can easily cause partial updates.

Retrying can lead to duplicates or out-of-order events.

Consequences

| Failure scenario | Result |
| --- | --- |
| DB write succeeds, message send fails | State exists in the DB but no event is emitted; downstream systems never learn of it. |
| Message send succeeds, DB write fails | An event is emitted for data that doesn't exist; consumers act on invalid state. |
| Retry logic applied incorrectly | Duplicate events or multiple DB inserts. |

Common Solutions / Patterns

1. Transactional Outbox Pattern

Write the event into the same database as the business data, in the same transaction.

A background process (or CDC tool like Debezium) later reads the “outbox” table and publishes to Kafka.

Guarantees consistency between the DB and the emitted events.

✅ Pros:

Strong consistency between DB and message.

Simple to implement if you control both DB and messaging.

🚫 Cons:

Adds complexity and operational overhead.

Requires deduplication on the consumer side.

2. Change Data Capture (CDC)

Instead of manually writing to Kafka, rely on a CDC tool (e.g., Debezium, Oracle GoldenGate).

It monitors database changes and automatically publishes events when rows change.

✅ Pros:

No dual write logic in app.

Strong consistency if CDC is reliable.

🚫 Cons:

Possible event lag.

Requires reliable CDC infra and schema stability.
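
For a feel of what consuming CDC output could look like, here is a rough sketch; the topic name and the envelope fields follow Debezium's default change-event format, but treat them as assumptions rather than a reference:

```python
import json

from kafka import KafkaConsumer  # kafka-python client, used here for illustration

# Debezium typically writes change events to a topic named <server>.<schema>.<table>;
# the name below is an assumption for this sketch.
consumer = KafkaConsumer(
    "dbserver1.public.users",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    change = message.value
    if change is None:  # tombstone records have no value
        continue
    # Debezium envelopes carry the operation ("c" create, "u" update, "d" delete)
    # and the new row state under "after".
    payload = change.get("payload", change)
    if payload.get("op") == "c":
        print("New user row captured:", payload["after"])
```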

3. Idempotent & Retry-Safe Design

Make operations idempotent (safe to retry).

Use unique request IDs and deduplication to avoid inconsistent states even if writes are repeated (see the sketch after the cons below).

✅ Pros:

Works across many systems.

🚫 Cons:

Still requires careful design.

Doesn’t solve ordering issues.
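
One way to make a consumer retry-safe is to record every processed event ID and skip duplicates. A minimal sketch, assuming a processed_events table with event_id as its primary key (both the table and the handler are illustrative):

```python
from psycopg2 import connect

conn = connect(...)  # connection parameters omitted

def handle_event(event: dict) -> None:
    """Apply an event's business change at most once, even if it is redelivered."""
    with conn.cursor() as cur:
        # Claim the event ID first; a duplicate delivery hits the primary-key
        # conflict, inserts nothing, and is skipped.
        cur.execute(
            "INSERT INTO processed_events (event_id) VALUES (%s) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            conn.rollback()
            return  # already processed; the retry is a no-op

        # ... apply the actual business change here, in the same transaction ...
        conn.commit()
```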

The Transactional Outbox Solution

Instead of writing directly to Kafka, you:

Write both the business data and the event in the same database transaction.

A background Outbox Processor (or CDC tool) later reads the outbox table and safely publishes events to Kafka.

CREATE TABLE orders (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    total NUMERIC NOT NULL,
    created_at TIMESTAMP DEFAULT now()
);

CREATE TABLE outbox (
    id UUID PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMP DEFAULT now(),
    published BOOLEAN DEFAULT FALSE
);

Application Transaction (Atomic Write)

import uuid
import json
from psycopg2 import connect

conn = connect(...)  # connection parameters omitted
cur = conn.cursor()
order_id = uuid.uuid4()
customer_id = uuid.uuid4()  # the orders schema expects a UUID customer id

try:
    # Both writes happen in one local DB transaction
    cur.execute(
        """
        INSERT INTO orders (id, customer_id, total)
        VALUES (%s, %s, %s)
        """,
        (str(order_id), str(customer_id), 100.0),
    )

    event = {
        "event_id": str(uuid.uuid4()),
        "type": "OrderCreated",
        "order_id": str(order_id),
    }

    cur.execute(
        """
        INSERT INTO outbox (id, event_type, payload)
        VALUES (%s, %s, %s)
        """,
        (event["event_id"], event["type"], json.dumps(event)),
    )

    conn.commit()  # ✅ both rows are saved atomically
except Exception:
    conn.rollback()
    raise

At this point:
The order exists in the DB.
The event is stored, but not yet published to Kafka.
No dual write risk, because both were done in one atomic transaction.

Outbox Processor (Async Publisher)

A small background worker or CDC tool continuously scans for new events. If Kafka or your app crashes mid-publish, the event remains in the outbox and will be retried safely.
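
As a rough illustration of such a worker, assuming the outbox table above, a local Kafka broker, and the kafka-python client (the topic name and poll interval are arbitrary choices for this sketch):

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client, used here for illustration
from psycopg2 import connect

conn = connect(...)  # same database that holds the outbox table
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    with conn.cursor() as cur:
        # Grab a batch of unpublished events, oldest first; SKIP LOCKED lets
        # several workers poll the table without stepping on each other.
        cur.execute(
            """
            SELECT id, payload FROM outbox
            WHERE published = FALSE
            ORDER BY created_at
            LIMIT 100
            FOR UPDATE SKIP LOCKED
            """
        )
        for event_id, payload in cur.fetchall():
            # Publish first, then mark as published. If the worker crashes in
            # between, the event is re-sent on the next pass (at-least-once).
            producer.send("order-events", payload)
            cur.execute(
                "UPDATE outbox SET published = TRUE WHERE id = %s",
                (event_id,),
            )
        producer.flush()
    conn.commit()
    time.sleep(1)  # simple polling interval
```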

Goal: Prevent inconsistency when writing to a database and publishing events to a message broker (fixes the dual write problem).
