May 18, 2026

The write path is your architecture

How data gets into your system decides latency, durability, retries, consistency, and every read model that follows.

Most architecture diagrams lie by omission.

They show boxes. API, database, queue, worker, cache, search. Sometimes they show arrows.

The useful diagram is narrower.

What happens after a user clicks submit?

That is the write path. The write path decides what the user waits for, what can be retried, what can be lost, what needs to be repaired, and which read models are even possible later.

If the write path is vague, the architecture is vague.

Start with the acknowledgement boundary

The first question:

What must be true before we tell the user the write succeeded?

That line is the acknowledgement boundary.

For a blog comment, success might mean "the canonical comment row exists." Notifications, search indexing, abuse scanning, and feed fanout can happen later.

For a bank transfer, success might mean "the ledger entries are durably committed, balanced, and visible to the account owner." Sending an email receipt can happen later.

For a file upload, success might mean "the object bytes and metadata are durable." Thumbnail generation can happen later.

If the boundary is too far downstream, users wait on work they do not care about.

If the boundary is too early, you say "success" for work that can still disappear.

The naive write path

This is the shape most systems start with:

def create_post(user_id, content):
    post = db.insert_post(user_id, content)
    cache.invalidate(f"feed:{user_id}")
    search.index_post(post)
    notifications.send_to_followers(user_id, post)
    analytics.track("post.created", user_id=user_id, post_id=post.id)
    return post

It is easy to read.

It is also a latency trap. The user waits for the database, cache, search, notifications, analytics, and every network hop behind those calls.

It is a failure trap too:

The post insert succeeds.
Search indexing times out.
The API returns 500.
The user retries.
The second request creates a duplicate post.

Now the code needs cleanup logic, dedupe logic, and product support.
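The failure sequence above can be reproduced with in-memory stand-ins. This is only a sketch: `FakeDb` and `FlakySearch` are hypothetical test doubles, not real clients, and the search double is rigged to time out on its first call.

```python
class FakeDb:
    """In-memory stand-in for the posts table (hypothetical)."""

    def __init__(self):
        self.posts = []

    def insert_post(self, user_id, content):
        post = {"id": len(self.posts) + 1, "user_id": user_id, "content": content}
        self.posts.append(post)
        return post


class FlakySearch:
    """Hypothetical search client that times out on its first call only."""

    def __init__(self):
        self.calls = 0

    def index_post(self, post):
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("search indexing timed out")


def create_post(db, search, user_id, content):
    post = db.insert_post(user_id, content)  # canonical write is already committed
    search.index_post(post)                  # a failure here fails the whole request
    return post


db, search = FakeDb(), FlakySearch()
try:
    create_post(db, search, user_id=1, content="hello")
except TimeoutError:
    # The API would return 500 here, so the client retries the same request.
    create_post(db, search, user_id=1, content="hello")

duplicate_count = len(db.posts)  # two rows for one logical command
```

The first insert was never rolled back, so the retry adds a second row with the same content.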

The problem is not that synchronous writes are always wrong. The problem is that this function has no explicit boundary between the canonical write and the derived side effects.

One team adds analytics. Another adds notifications. Another adds search. Another invalidates a cache. Another writes audit logs. The endpoint still looks like one function, but the product contract has changed.

The user thinks they are creating a post.

The system is doing six things.

When p99 latency spikes, the database gets blamed by default because it is the obvious stateful component. But the real issue is that the request path became an integration path.

The fix is to name the critical path.

Split canonical writes from derived work

A more scalable write path is usually:

request
  -> validate input
  -> write canonical fact
  -> enqueue durable event
  -> return

worker
  -> consume event
  -> update read models
  -> send notifications
  -> index search
  -> update analytics

The user waits for the fact. Workers handle the effects.

def create_post(user_id, content, idempotency_key):
    post_id = generate_id()

    with db.transaction() as tx:
        existing = tx.get_idempotency_result(
            user_id=user_id,
            key=idempotency_key,
        )
        if existing:
            return existing

        post = tx.insert_post(
            id=post_id,
            user_id=user_id,
            content=content,
            status="published",
        )
        tx.insert_outbox_event(
            event_id=generate_id(),
            event_type="post.created",
            aggregate_id=post_id,
            payload={
                "post_id": post_id,
                "user_id": user_id,
            },
        )
        tx.save_idempotency_result(
            user_id=user_id,
            key=idempotency_key,
            response={"id": post_id, "status": "published"},
        )

    return {"id": post_id, "status": "published"}

There are three important details here.

First, the canonical row and the event are written in the same database transaction, so a post can never exist without its event.

Second, the idempotency result is stored in the same transaction, so a retry sees either the complete result or nothing at all.

Third, nothing calls search, notifications, or analytics before returning to the user.

This keeps the request path small without pretending derived work is optional.
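To make the retry behavior concrete, here is a runnable sketch of the same shape using Python's built-in `sqlite3` with an in-memory database. The table layouts are simplified assumptions for illustration, and `uuid4` stands in for `generate_id()`.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id TEXT PRIMARY KEY, user_id INTEGER, content TEXT);
CREATE TABLE outbox_events (id TEXT PRIMARY KEY, event_type TEXT, payload TEXT);
CREATE TABLE idempotency_keys (
  user_id INTEGER, key TEXT, response TEXT, PRIMARY KEY (user_id, key)
);
""")


def create_post(user_id, content, idempotency_key):
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE user_id = ? AND key = ?",
            (user_id, idempotency_key),
        ).fetchone()
        if row:
            return json.loads(row[0])  # replay the stored response

        post_id = uuid.uuid4().hex
        response = {"id": post_id, "status": "published"}
        conn.execute("INSERT INTO posts VALUES (?, ?, ?)", (post_id, user_id, content))
        conn.execute(
            "INSERT INTO outbox_events VALUES (?, ?, ?)",
            (uuid.uuid4().hex, "post.created", json.dumps({"post_id": post_id})),
        )
        conn.execute(
            "INSERT INTO idempotency_keys VALUES (?, ?, ?)",
            (user_id, idempotency_key, json.dumps(response)),
        )
        return response


first = create_post(1, "hello", "req-123")
retry = create_post(1, "hello", "req-123")  # same key: no second post, same response
```

The retry returns the stored response, and the `posts` and `outbox_events` tables each still hold exactly one row.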

The outbox pattern

The awkward problem with queues is the gap between the database commit and the queue publish.

This is unsafe:

def create_post(user_id, content):
    post = db.insert_post(user_id, content)
    queue.publish("post.created", {"post_id": post.id})
    return post

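If the database write succeeds and the queue publish fails, the post exists but no worker hears about it. Reversing the order does not help: if the publish succeeds and the database write fails, workers hear about a post that does not exist.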

The outbox pattern removes that gap by writing the event into the same database transaction as the canonical fact.

CREATE TABLE posts (
  id bigint PRIMARY KEY,
  user_id bigint NOT NULL,
  content text NOT NULL,
  status text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE outbox_events (
  id bigint PRIMARY KEY,
  event_type text NOT NULL,
  aggregate_id bigint NOT NULL,
  payload jsonb NOT NULL,
  published_at timestamptz,
  created_at timestamptz NOT NULL DEFAULT now()
);

Then a relay publishes unpublished events:

def publish_outbox_batch(limit=100):
    events = db.fetch_unpublished_outbox_events(limit=limit)

    for event in events:
        queue.publish(
            event.event_type,
            key=str(event.aggregate_id),
            payload=event.payload,
        )
        db.mark_outbox_event_published(event.id)

This relay can crash after publishing but before marking the event as published. That means workers may receive duplicates.

That is normal.

The system must be designed around at-least-once delivery unless you have a very specific reason and infrastructure support for something stricter.
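The crash window is easy to simulate. In this sketch the outbox is a plain list, `delivered` records what the queue accepted, and `crash_before_mark` is a hypothetical switch that kills the relay between the publish and the mark.

```python
class RelayCrashed(Exception):
    """Stands in for the relay process dying mid-batch."""


outbox = [{"id": 1, "event_type": "post.created", "published": False}]
delivered = []  # event ids the queue has accepted, in order


def publish_outbox_batch(crash_before_mark=False):
    for event in outbox:
        if event["published"]:
            continue
        delivered.append(event["id"])   # the publish succeeded
        if crash_before_mark:
            raise RelayCrashed()         # process dies before the mark
        event["published"] = True        # the mark-as-published step


try:
    publish_outbox_batch(crash_before_mark=True)
except RelayCrashed:
    pass

publish_outbox_batch()  # the restarted relay finds the row still unpublished
```

After the restart, `delivered` holds the same event id twice. Every consumer downstream has to tolerate that.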

Idempotency is the retry contract

Retries are not optional.

Clients retry after timeouts. Load balancers retry. Workers retry. Operators replay events after bugs. You replay outbox rows after deploys. A mobile app sends the same request twice because the connection died before it saw the response.

If retries are unsafe, the system is unsafe.

For external API writes, require an idempotency key:

CREATE TABLE idempotency_keys (
  user_id bigint NOT NULL,
  key text NOT NULL,
  request_hash text NOT NULL,
  response jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, key)
);

The key should mean: "For this user, this logical command may run once."

The request hash matters because clients can accidentally reuse keys for different payloads. If the same key arrives with a different request body, return a conflict instead of guessing.
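A minimal sketch of that check, assuming an in-memory dict in place of the `idempotency_keys` table. The names `run_once`, `request_hash`, and `IdempotencyConflict` are hypothetical, and hashing canonical JSON is just one reasonable way to fingerprint a request body.

```python
import hashlib
import json

idempotency_store = {}  # (user_id, key) -> {"request_hash": ..., "response": ...}


class IdempotencyConflict(Exception):
    """Same key, different payload: the API should answer with a conflict."""


def request_hash(body):
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_once(user_id, key, body, command):
    digest = request_hash(body)
    saved = idempotency_store.get((user_id, key))
    if saved is not None:
        if saved["request_hash"] != digest:
            raise IdempotencyConflict(key)
        return saved["response"]  # replay instead of re-running the command
    response = command(body)
    idempotency_store[(user_id, key)] = {"request_hash": digest, "response": response}
    return response


calls = []


def fake_create(body):
    calls.append(body)
    return {"id": len(calls), "status": "published"}


a = run_once(1, "k1", {"content": "hi"}, fake_create)
b = run_once(1, "k1", {"content": "hi"}, fake_create)  # replayed, not re-run
```

In a real system the lookup, the command, and the save would share one database transaction, as in `create_post` above.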

For worker handlers, make each effect idempotent too:

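def handle_post_created(event):
    with db.transaction() as tx:
        # Check inside the transaction. A unique constraint on
        # (event_id, handler) is what makes a concurrent duplicate
        # fail and roll back instead of applying the effect twice.
        if tx.event_already_processed(event.id, handler="feed_projection"):
            return

        tx.upsert_feed_item(
            user_id=event.payload["user_id"],
            post_id=event.payload["post_id"],
        )
        tx.mark_event_processed(
            event_id=event.id,
            handler="feed_projection",
        )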

The important part is the handler name. Search indexing, feed projection, notifications, and analytics are different effects. Each needs its own dedupe boundary.
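One way to get a per-handler dedupe boundary is a primary key on `(event_id, handler)`. The sketch below uses `sqlite3` in memory; `handle_once` and `project_feed` are hypothetical helpers, and the table layouts are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processed_events (
  event_id TEXT NOT NULL,
  handler TEXT NOT NULL,
  PRIMARY KEY (event_id, handler)
);
CREATE TABLE feed_items (
  user_id INTEGER NOT NULL,
  post_id INTEGER NOT NULL,
  PRIMARY KEY (user_id, post_id)
);
""")


def handle_once(event_id, handler, effect):
    """Run effect at most once per (event_id, handler). Return True if it ran."""
    with conn:  # the claim and the effect commit or roll back together
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id, handler) VALUES (?, ?)",
                (event_id, handler),
            )
        except sqlite3.IntegrityError:
            return False  # duplicate delivery for this handler
        effect(conn)
    return True


def project_feed(c):
    c.execute("INSERT OR IGNORE INTO feed_items VALUES (?, ?)", (7, 42))


ran_first = handle_once("evt-1", "feed_projection", project_feed)
ran_again = handle_once("evt-1", "feed_projection", project_feed)  # skipped
ran_other = handle_once("evt-1", "search_index", lambda c: None)   # separate boundary
```

The same event is skipped for `feed_projection` on redelivery but still runs for `search_index`. If the effect raises, the claim is rolled back with it, so a later redelivery can try again.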

A queue is not a magic durability machine

Queues help when work can happen later.

They do not remove decisions.

You still need to decide:

Question	Why it matters
Is enqueue durable?	Returning after an in-memory enqueue is not a durable success.
Can events be duplicated?	Assume yes unless proven otherwise.
Can events arrive out of order?	Assume yes across partitions, retries, and separate topics.
What is the retry policy?	Infinite hot retries can take down dependencies.
Where do poison messages go?	Bad payloads need a dead-letter path.
What is the max acceptable lag?	Async work still has a product freshness budget.
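Two of those rows, retry policy and dead-letter path, fit in a short sketch. The defaults here (base delay, cap, attempt count) are arbitrary assumptions, and `dead_letters` is a plain list standing in for a real dead-letter queue.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=60.0):
    """Capped exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def process_with_retries(message, handler, dead_letters, max_attempts=5, sleep=time.sleep):
    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = exc
            sleep(backoff_delay(attempt))  # wait instead of retrying hot
    # Out of attempts: park the message so it stops blocking the queue.
    dead_letters.append({"message": message, "error": repr(last_error)})
    return False


def bad_handler(message):
    raise ValueError("bad payload")


dead = []
ok = process_with_retries({"id": 1}, bad_handler, dead, max_attempts=3, sleep=lambda s: None)
```

Passing a no-op `sleep` keeps the example instant. A real worker would keep the default so a failing dependency gets room to recover.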

The queue turns one failure mode into another.

Without a queue, a spike makes users wait.

With a queue, a spike makes lag grow.

That is often a good trade, but only if lag is visible and bounded.

Backpressure

Backpressure is the system saying: "I cannot accept work at this rate and still keep my promises."

Ignoring backpressure creates dishonest success.

Imagine this:

The API accepts 20,000 writes per second.
Workers can process 5,000 events per second.
The queue grows by 15,000 events per second.
Notification lag reaches 45 minutes.
Users complain that the product is broken even though the API is returning 200.
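The numbers above also say how fast this goes wrong. A back-of-the-envelope check, assuming FIFO processing and constant rates, so that the lag of a newly enqueued event is roughly the backlog divided by the processing rate:

```python
accept_rate = 20_000   # writes accepted per second
process_rate = 5_000   # events processed per second
backlog_growth = accept_rate - process_rate  # 15_000 events per second

target_lag_seconds = 45 * 60

# Backlog at which a newly enqueued event waits 45 minutes.
backlog_at_target = target_lag_seconds * process_rate  # 13_500_000 events

# Time spent at this imbalance before the backlog gets that large.
seconds_to_reach = backlog_at_target / backlog_growth  # 900.0 seconds
```

Under these assumptions, fifteen minutes of sustained imbalance is enough to produce 45 minutes of notification lag.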

The write path needs a policy before this happens.

Possible policies:

Return 429 or 503 when queue lag crosses a threshold.
Accept the write but mark derived effects as delayed.
Drop low-value effects such as analytics while preserving canonical writes.
Degrade expensive fanout into smaller batches.
Route large tenants through separate partitions.

The right policy depends on the product.

For payments, slow down intake before losing ledger correctness.

For social notifications, preserve the post and delay the notification.

For analytics, sample or drop events before impacting the user-facing write.
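Those three cases can be written down as an explicit policy function. This is a sketch only: the effect names, the return values, and both thresholds are hypothetical, and real limits would come from the product's freshness budget.

```python
def backpressure_decision(effect, lag_seconds, soft_limit=60, hard_limit=600):
    """Return 'accept', 'delay', 'drop', or 'reject' for one unit of work."""
    if lag_seconds < soft_limit:
        return "accept"
    if effect == "analytics":
        return "drop"    # low-value effect: shed it first
    if effect == "notification":
        return "delay"   # keep the post, tell the product the effect is late
    if effect == "payment":
        # Slow intake (for example with a 429 or 503) rather than risk the ledger.
        return "reject" if lag_seconds >= hard_limit else "accept"
    return "delay"


decisions = {
    "analytics": backpressure_decision("analytics", lag_seconds=120),
    "notification": backpressure_decision("notification", lag_seconds=120),
    "payment": backpressure_decision("payment", lag_seconds=900),
}
```

The value of writing it down is that the behavior under load becomes a reviewed decision instead of an accident.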

Streams vs queues vs direct writes

These words get overloaded.

The simpler distinction:

Pattern	Use it when	Watch out for
Direct write	The work is part of the success contract and must complete now	Latency, cascading failures, duplicate side effects
Queue	Each item should be processed once, by one worker in a pool, as background work	Poison messages, retries, visibility timeouts, lag
Stream/log	Many consumers need the same ordered history of facts	Retention, replay safety, partition keys, schema evolution

Examples:

Direct write:
  create ledger entry before returning payment success

Queue:
  resize uploaded image
  send email receipt
  run abuse scan

Stream/log:
  post.created feeds search, ranking, notifications, analytics, audit

Use a queue when you want work distribution.

Use a stream when you want a durable history that multiple consumers can independently read and replay.

Use a direct write when the user-facing command is not true until that write completes.

Ordering comes from partition keys

Ordering is not global by default.

Most scalable logs order records only within a partition, not across partitions. That means the key matters.

For posts, a reasonable key may be post_id if all events for one post need order:

post.created
post.updated
post.deleted

For account balances, the key is usually account_id because operations for one account need a single sequence.

For notifications, the key may be recipient_user_id so one user's notification timeline stays ordered.

The wrong key creates subtle bugs.

If all events use one global key, one partition becomes hot.

If events use random keys, related updates can be processed out of order.

Pick the key based on the invariant you need to preserve.
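A small sketch of why the key preserves per-entity order. `partition_for` is a hypothetical helper; real brokers use their own partitioners, but the principle of hashing the key and taking it modulo the partition count is the same.

```python
import zlib


def partition_for(key, num_partitions):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("post-1", "post.created"),
    ("post-2", "post.created"),
    ("post-1", "post.updated"),
    ("post-1", "post.deleted"),
]

# Append each event to the partition chosen by its key.
partitions = {}
for key, event_type in events:
    partitions.setdefault(partition_for(key, 4), []).append((key, event_type))

# Every event for post-1 lands in one partition, in publish order.
post_1_order = [
    event_type
    for key, event_type in partitions[partition_for("post-1", 4)]
    if key == "post-1"
]
```

Because all `post-1` events share a key, they share a partition, and a single consumer of that partition sees created, updated, deleted in that order.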

Update-in-place vs append-only

A normal table stores current state:

UPDATE posts
SET content = $2,
    updated_at = now()
WHERE id = $1;

An append-only log stores facts:

INSERT INTO post_events (
  id,
  post_id,
  event_type,
  payload,
  created_at
) VALUES (
  $1,
  $2,
  'post.updated',
  $3,
  now()
);

Append-only is useful when history matters, replay matters, or multiple projections need to be rebuilt.

It is not automatically better.

Current-state tables are easier to query. Logs are easier to audit and replay. Many systems need both:

post_events     -> source of historical facts
posts           -> current state projection
feed_items      -> read model
search_index    -> external projection

This is event sourcing in the small. You do not need to redesign the whole company around it. You can use append-only facts where replay and audit justify the cost.
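Rebuilding a current-state projection from the log is a fold over ordered events. A minimal sketch, assuming each event is a dict with `post_id`, `event_type`, and `payload`; the `post.created` and `post.deleted` payload shapes are illustrative assumptions.

```python
def project_posts(events):
    """Fold an ordered list of post events into current state."""
    posts = {}
    for event in events:
        post_id = event["post_id"]
        kind = event["event_type"]
        if kind == "post.created":
            posts[post_id] = {"content": event["payload"]["content"], "deleted": False}
        elif kind == "post.updated":
            posts[post_id]["content"] = event["payload"]["content"]
        elif kind == "post.deleted":
            posts[post_id]["deleted"] = True
    return posts


events = [
    {"post_id": 1, "event_type": "post.created", "payload": {"content": "v1"}},
    {"post_id": 2, "event_type": "post.created", "payload": {"content": "bye"}},
    {"post_id": 1, "event_type": "post.updated", "payload": {"content": "v2"}},
    {"post_id": 2, "event_type": "post.deleted", "payload": {}},
]

rebuilt = project_posts(events)
```

Replaying the same events always yields the same projection, which is what makes a rebuild a safe repair path.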

A practical decision matrix

Before designing a write path, fill this out:

Requirement	Direct DB write	DB + outbox	Queue first	Stream first
User needs read-after-write	Strong fit	Strong fit	Weak unless status is pending	Depends on projection lag
Derived work is slow	Weak	Strong	Strong	Strong
Many consumers need the event	Weak	Strong	Medium	Strong
Must not lose accepted writes	Strong	Strong	Depends on queue durability	Strong if durable
Must absorb bursts	Weak	Medium	Strong	Strong
Simple operational model	Strong	Medium	Medium	Weak

My default for product systems is DB + outbox.

It preserves a simple source of truth, gives workers a durable event stream, and avoids coupling the request path to every side effect.

Queue first can be correct when the command itself is naturally asynchronous:

import a CSV
train a model
transcode a video
send a campaign

In those cases the user should get a job id and a status endpoint:

def start_import(user_id, file_id):
    job_id = generate_id()
    queue.publish("import.requested", {
        "job_id": job_id,
        "user_id": user_id,
        "file_id": file_id,
    })
    return {"job_id": job_id, "status": "queued"}

But be honest in the product language. If the work is queued, call it queued. Do not call it done.

Failure states are product states

Async writes create intermediate states.

Name them.

CREATE TABLE import_jobs (
  id bigint PRIMARY KEY,
  user_id bigint NOT NULL,
  status text NOT NULL CHECK (
    status IN ('queued', 'processing', 'completed', 'failed')
  ),
  error_message text,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

The status model is not UX polish. It is part of correctness.

If the user can start a job, refresh the page, and see nothing, the write path is incomplete. The request returned before the real work finished, so the product needs a way to represent the pending work.
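Naming the states also means naming the legal moves between them. The transition table below is an assumption layered on the four statuses in the schema (for instance, that a failed job is terminal rather than retried in place).

```python
ALLOWED_TRANSITIONS = {
    "queued": {"processing", "failed"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def transition(job, new_status, error_message=None):
    """Return a copy of job in new_status, or raise if the move is illegal."""
    current = job["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move import job from {current} to {new_status}")
    return {**job, "status": new_status, "error_message": error_message}


job = {"id": 1, "user_id": 7, "status": "queued", "error_message": None}
job = transition(job, "processing")
job = transition(job, "failed", error_message="row 12: invalid date")
```

A worker that tries to mark a failed job as completed now gets an error instead of silently overwriting the failure.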

What I think

The useful mental model is:

command -> fact -> effects -> projections

A command is what the user asks for.

A fact is what the system durably records.

Effects are things caused by the fact.

Projections are read models created from facts.

Most messy architectures mix these together. A request handler validates a command, writes a fact, mutates three projections, calls two external systems, and returns success based on whatever happened last.

That is how a write path becomes untestable.

Separate the pieces. Then decide where the acknowledgement boundary belongs.

Tutorial checklist

For any important write path, fill this out:

Question	Example answer
User command	Create post
Canonical fact	Row in `posts`
Ack boundary	Return after `posts` row and `outbox_events` row commit
Idempotency key	`(user_id, client_request_id)`
Derived effects	Feed projection, search indexing, notifications, analytics
Delivery model	At least once
Dedupe table	`processed_events(event_id, handler)`
Ordering key	`post_id` for post lifecycle events
Backpressure signal	Outbox age over 60 seconds or queue lag over 50k events
Repair path	Replay `post.created` events from outbox or event log
User-visible states	Published immediately; side effects may lag

Then ask the harder question:

What happens if every step after the acknowledgement boundary fails?

If the answer is "we do not know," the write path is not designed yet.

Summary

The write path is the real architecture for user-facing state changes.
Start by defining the acknowledgement boundary.
Keep the request path limited to the smallest durable product promise.
Move derived effects behind queues, workers, or streams.
Use an outbox when a database write and event publish must move together.
Assume retries and duplicates; make commands and handlers idempotent.
Treat queue lag and backpressure as product concerns, not just ops metrics.
Pick partition keys based on the ordering invariant you need.
Async workflows need explicit pending, completed, and failed states.
