The write path is your architecture
How data gets into your system decides latency, durability, retries, consistency, and every read model that follows.
Most architecture diagrams lie by omission.
They show boxes. API, database, queue, worker, cache, search. Sometimes they show arrows.
The useful diagram is narrower.
What happens after a user clicks submit?
That is the write path. The write path decides what the user waits for, what can be retried, what can be lost, what needs to be repaired, and which read models are even possible later.
If the write path is vague, the architecture is vague.
Start with the acknowledgement boundary
The first question:
What must be true before we tell the user the write succeeded?
That line is the acknowledgement boundary.
For a blog comment, success might mean "the canonical comment row exists." Notifications, search indexing, abuse scanning, and feed fanout can happen later.
For a bank transfer, success might mean "the ledger entries are durably committed, balanced, and visible to the account owner." Sending an email receipt can happen later.
For a file upload, success might mean "the object bytes and metadata are durable." Thumbnail generation can happen later.
If the boundary is too far downstream, users wait on work they do not care about.
If the boundary is too early, you say "success" for work that can still disappear.
The naive write path
This is the shape most systems start with:
def create_post(user_id, content):
    post = db.insert_post(user_id, content)
    cache.invalidate(f"feed:{user_id}")
    search.index_post(post)
    notifications.send_to_followers(user_id, post)
    analytics.track("post.created", user_id=user_id, post_id=post.id)
    return post
It is easy to read.
It is also a latency trap. The user waits for the database, cache, search, notifications, analytics, and every network hop behind those calls.
It is a failure trap too:
- The post insert succeeds.
- Search indexing times out.
- The API returns 500.
- The user retries.
- The second request creates a duplicate post.
Now the code needs cleanup logic, dedupe logic, and product support.
The problem is not that synchronous writes are always wrong. The problem is that this function has no explicit boundary between the canonical write and the derived side effects.
What I notice
The write path tends to grow quietly.
One team adds analytics. Another adds notifications. Another adds search. Another invalidates a cache. Another writes audit logs. The endpoint still looks like one function, but the product contract has changed.
The user thinks they are creating a post.
The system is doing six things.
When p99 latency spikes, the database is the default suspect because it is the obvious stateful component. The real issue is that the request path became an integration path.
The fix is to name the critical path.
Split canonical writes from derived work
A more scalable write path is usually:
request
-> validate input
-> write canonical fact
-> enqueue durable event
-> return
worker
-> consume event
-> update read models
-> send notifications
-> index search
-> update analytics
The user waits for the fact. Workers handle the effects.
def create_post(user_id, content, idempotency_key):
    post_id = generate_id()
    with db.transaction() as tx:
        existing = tx.get_idempotency_result(
            user_id=user_id,
            key=idempotency_key,
        )
        if existing:
            return existing
        post = tx.insert_post(
            id=post_id,
            user_id=user_id,
            content=content,
            status="published",
        )
        tx.insert_outbox_event(
            event_id=generate_id(),
            event_type="post.created",
            aggregate_id=post_id,
            payload={
                "post_id": post_id,
                "user_id": user_id,
            },
        )
        tx.save_idempotency_result(
            user_id=user_id,
            key=idempotency_key,
            response={"id": post_id, "status": "published"},
        )
    return {"id": post_id, "status": "published"}
There are three important details here.
First, the canonical row and the event are written in the same database transaction.
Second, the idempotency result is stored in the same transaction.
Third, nothing calls search, notifications, or analytics before returning to the user.
This keeps the request path small without pretending derived work is optional.
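To make the "same transaction" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The simplified schema, the string ids, and the fail_before_commit flag are my assumptions for illustration, not part of the design above. It shows two things: a retry with the same key replays the stored response, and a crash before commit leaves no post, no event, and no idempotency row.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id TEXT PRIMARY KEY, user_id INTEGER NOT NULL,
                        content TEXT NOT NULL);
    CREATE TABLE outbox_events (id TEXT PRIMARY KEY, event_type TEXT NOT NULL,
                                aggregate_id TEXT NOT NULL, payload TEXT NOT NULL);
    CREATE TABLE idempotency_keys (user_id INTEGER NOT NULL, key TEXT NOT NULL,
                                   response TEXT NOT NULL,
                                   PRIMARY KEY (user_id, key));
""")


def create_post(user_id, content, idempotency_key, fail_before_commit=False):
    post_id = str(uuid.uuid4())
    response = {"id": post_id, "status": "published"}
    # "with conn" commits if the body succeeds and rolls back if it raises.
    with conn:
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE user_id = ? AND key = ?",
            (user_id, idempotency_key),
        ).fetchone()
        if row:
            return json.loads(row[0])  # replay the stored response
        conn.execute("INSERT INTO posts VALUES (?, ?, ?)",
                     (post_id, user_id, content))
        conn.execute(
            "INSERT INTO outbox_events VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), "post.created", post_id,
             json.dumps({"post_id": post_id, "user_id": user_id})),
        )
        conn.execute("INSERT INTO idempotency_keys VALUES (?, ?, ?)",
                     (user_id, idempotency_key, json.dumps(response)))
        if fail_before_commit:  # hypothetical switch to simulate a crash
            raise RuntimeError("simulated crash before commit")
    return response


def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


first = create_post(1, "hello", "key-1")
retry = create_post(1, "hello", "key-1")  # same key: no second post
try:
    create_post(1, "lost", "key-2", fail_before_commit=True)
except RuntimeError:
    pass  # everything written for key-2 was rolled back together
```

After these calls each table holds exactly one row. The fact, the event, and the retry record either all exist or none do.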
The outbox pattern
The awkward problem with queues is the gap between the database commit and the queue publish.
This is unsafe:
def create_post(user_id, content):
    post = db.insert_post(user_id, content)
    queue.publish("post.created", {"post_id": post.id})
    return post
If the database write succeeds and the queue publish fails, the post exists but no worker hears about it.
The outbox pattern removes that gap by writing the event into the same database transaction as the canonical fact.
CREATE TABLE posts (
    id bigint PRIMARY KEY,
    user_id bigint NOT NULL,
    content text NOT NULL,
    status text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE outbox_events (
    id bigint PRIMARY KEY,
    event_type text NOT NULL,
    aggregate_id bigint NOT NULL,
    payload jsonb NOT NULL,
    published_at timestamptz,
    created_at timestamptz NOT NULL DEFAULT now()
);
Then a relay publishes unpublished events:
def publish_outbox_batch(limit=100):
    events = db.fetch_unpublished_outbox_events(limit=limit)
    for event in events:
        queue.publish(
            event.event_type,
            key=str(event.aggregate_id),
            payload=event.payload,
        )
        db.mark_outbox_event_published(event.id)
This relay can crash after publishing but before marking the event as published. That means workers may receive duplicates.
That is normal.
The system must be designed around at-least-once delivery unless you have a very specific reason and infrastructure support for something stricter.
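The duplicate is easy to reproduce. This is a small in-memory simulation, not a real relay: the outbox list, the delivered list, and the crash_after_event_id parameter are all made up for the demo. The relay publishes event 1, dies before marking it, and then restarts.

```python
class CrashAfterPublish(Exception):
    """Simulated relay crash between publish and mark-as-published."""


outbox = [
    {"id": 1, "event_type": "post.created", "published": False},
    {"id": 2, "event_type": "post.created", "published": False},
]
delivered = []  # event ids the queue (and therefore workers) receive


def publish_outbox_batch(crash_after_event_id=None):
    for event in outbox:
        if event["published"]:
            continue
        delivered.append(event["id"])      # step 1: publish to the queue
        if event["id"] == crash_after_event_id:
            raise CrashAfterPublish()      # relay dies before step 2
        event["published"] = True          # step 2: mark the row as published


try:
    publish_outbox_batch(crash_after_event_id=1)
except CrashAfterPublish:
    pass

publish_outbox_batch()  # the relay restarts and re-reads unpublished rows
print(delivered)
```

Event 1 is delivered twice and event 2 once. Nothing was lost, which is the guarantee the outbox buys. Deduplication is the consumer's job.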
Idempotency is the retry contract
Retries are not optional.
Clients retry after timeouts. Load balancers retry. Workers retry. Operators replay events after bugs. You replay outbox rows after deploys. A mobile app sends the same request twice because the connection died before it saw the response.
If retries are unsafe, the system is unsafe.
For external API writes, require an idempotency key:
CREATE TABLE idempotency_keys (
    user_id bigint NOT NULL,
    key text NOT NULL,
    request_hash text NOT NULL,
    response jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, key)
);
The key should mean: "For this user, this logical command may run once."
The request hash matters because clients can accidentally reuse keys for different payloads. If the same key arrives with a different request body, return a conflict instead of guessing.
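A rough sketch of that check, with a dict standing in for the idempotency_keys table. The names run_once, request_hash, and IdempotencyConflict are hypothetical, and hashing a canonical JSON encoding of the body is one reasonable choice among several. In a real system the lookup and the save would sit inside the same database transaction as the write.

```python
import hashlib
import json

stored = {}  # (user_id, key) -> (request_hash, response)


class IdempotencyConflict(Exception):
    """Same key, different request body."""


def request_hash(body):
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_once(user_id, key, body, command):
    digest = request_hash(body)
    previous = stored.get((user_id, key))
    if previous is not None:
        saved_hash, saved_response = previous
        if saved_hash != digest:
            raise IdempotencyConflict("key reused with a different body")
        return saved_response  # replay, do not re-run the command
    response = command(body)
    stored[(user_id, key)] = (digest, response)
    return response


calls = []


def create_post(body):
    calls.append(body)
    return {"id": len(calls), "status": "published"}


first = run_once(1, "key-1", {"content": "hello"}, create_post)
retry = run_once(1, "key-1", {"content": "hello"}, create_post)
try:
    run_once(1, "key-1", {"content": "different"}, create_post)
    conflict = False
except IdempotencyConflict:
    conflict = True
```

The retry returns the first response without running the command again. The reused key with a new body is rejected, which an API would typically surface as a conflict response.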
For worker handlers, make each effect idempotent too:
def handle_post_created(event):
    if db.event_already_processed(event.id, handler="feed_projection"):
        return
    with db.transaction() as tx:
        tx.upsert_feed_item(
            user_id=event.payload["user_id"],
            post_id=event.payload["post_id"],
        )
        tx.mark_event_processed(
            event_id=event.id,
            handler="feed_projection",
        )
The important part is the handler name. Search indexing, feed projection, notifications, and analytics are different effects. Each needs its own dedupe boundary.
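One way to get a per-handler dedupe boundary is a composite primary key on (event_id, handler), so the database itself rejects the second attempt. The sketch below uses sqlite3 from the standard library. The effect_log table is a stand-in I invented for the real effect, and handle_once is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (
        event_id INTEGER NOT NULL,
        handler TEXT NOT NULL,
        PRIMARY KEY (event_id, handler)
    );
    CREATE TABLE effect_log (handler TEXT NOT NULL, event_id INTEGER NOT NULL);
""")


def handle_once(event_id, handler):
    """Run the handler's effect at most once per (event_id, handler)."""
    try:
        with conn:  # marker and effect commit together or not at all
            conn.execute(
                "INSERT INTO processed_events (event_id, handler) VALUES (?, ?)",
                (event_id, handler),
            )
            conn.execute(
                "INSERT INTO effect_log (handler, event_id) VALUES (?, ?)",
                (handler, event_id),
            )
    except sqlite3.IntegrityError:
        return False  # this handler has already processed this event
    return True


ran_feed = handle_once(7, "feed_projection")
ran_feed_again = handle_once(7, "feed_projection")  # duplicate delivery
ran_search = handle_once(7, "search_index")         # different effect, same event
effect_count = conn.execute("SELECT COUNT(*) FROM effect_log").fetchone()[0]
```

The duplicate delivery is skipped for feed_projection, while search_index still gets its own turn at the same event. This only covers effects that live in the same database. An external call such as sending an email needs its own idempotency story.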
A queue is not a magic durability machine
Queues help when work can happen later.
They do not remove decisions.
You still need to decide:
| Question | Why it matters |
|---|---|
| Is enqueue durable? | Returning after an in-memory enqueue is not a durable success. |
| Can events be duplicated? | Assume yes unless proven otherwise. |
| Can events arrive out of order? | Assume yes across partitions, retries, and separate topics. |
| What is the retry policy? | Infinite hot retries can take down dependencies. |
| Where do poison messages go? | Bad payloads need a dead-letter path. |
| What is the max acceptable lag? | Async work still has a product freshness budget. |
The queue turns one failure mode into another.
Without a queue, a spike makes users wait.
With a queue, a spike makes lag grow.
That is often a good trade, but only if lag is visible and bounded.
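The retry and poison-message rows from the table can be sketched together: bounded attempts, capped exponential backoff, then a dead-letter list. Everything here is illustrative. The names process and backoff_delays, the attempt count, and the delays are assumptions, and the sleep parameter defaults to a no-op so the demo does not actually wait. Production code would usually add jitter and use the broker's own dead-letter mechanism.

```python
dead_letters = []  # stand-in for a dead-letter queue


def backoff_delays(max_attempts, base_seconds=1.0, cap_seconds=60.0):
    """Delay before each retry: doubles every time, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** i)
            for i in range(max_attempts - 1)]


def process(message, handler, max_attempts=4, sleep=lambda seconds: None):
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            handler(message)
            return "ok"
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of attempts: park the message instead of retrying forever.
                dead_letters.append({"message": message, "error": str(exc)})
                return "dead-lettered"
            sleep(delays[attempt])


attempts = {"count": 0}


def flaky(message):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("dependency timed out")


def poison(message):
    raise ValueError("bad payload")


slept = []
flaky_result = process({"id": 1}, flaky, sleep=slept.append)
poison_result = process({"id": 2}, poison, sleep=slept.append)
```

The flaky handler succeeds on its third attempt. The poison message uses all four attempts and lands in dead_letters instead of spinning against the dependency.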
Backpressure
Backpressure is the system saying: "I cannot accept work at this rate and still keep my promises."
Ignoring backpressure creates dishonest success.
Imagine this:
- The API accepts 20,000 writes per second.
- Workers can process 5,000 events per second.
- The queue grows by 15,000 events per second.
- Notification lag reaches 45 minutes.
- Users complain that the product is broken even though the API is returning 200.
The write path needs a policy before this happens.
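It helps to run the numbers from this scenario. The sketch assumes a FIFO queue and constant rates, and defines lag as the time a newly enqueued event waits at the current drain rate. The function names are mine.

```python
accept_rate = 20_000  # writes accepted per second
drain_rate = 5_000    # events workers process per second
growth_rate = accept_rate - drain_rate  # backlog grows by 15,000 events/s


def lag_seconds(backlog_events):
    """How long a newly enqueued event waits behind the current backlog."""
    return backlog_events / drain_rate


def seconds_until_lag(target_lag_seconds):
    """How long the spike must last before lag reaches the target."""
    backlog_needed = target_lag_seconds * drain_rate
    return backlog_needed / growth_rate


forty_five_minutes = 45 * 60
backlog_at_target = forty_five_minutes * drain_rate       # 13,500,000 events
spike_duration = seconds_until_lag(forty_five_minutes)    # 900 s = 15 minutes
```

At these rates, 45 minutes of lag means a backlog of 13.5 million events, and it takes only 15 minutes of sustained spike to get there. Lag grows three times faster than wall-clock time, which is why the threshold has to exist before the spike does.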
Possible policies:
- Return 429 or 503 when queue lag crosses a threshold.
- Accept the write but mark derived effects as delayed.
- Drop low-value effects such as analytics while preserving canonical writes.
- Degrade expensive fanout into smaller batches.
- Route large tenants through separate partitions.
The right policy depends on the product.
For payments, slow down intake before losing ledger correctness.
For social notifications, preserve the post and delay the notification.
For analytics, sample or drop events before impacting the user-facing write.
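Written as code, a policy like this can be a small pure function that the request path consults. The thresholds, the Decision fields, and the choice of 429 are placeholders for whatever the product decides. This version fits something like the social case: preserve the canonical write as long as possible and shed low-value work first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    accept: bool         # do we take the canonical write?
    http_status: int
    run_analytics: bool  # low-value effect, first to be dropped
    note: str


def admit(queue_lag_seconds, soft_limit=60, hard_limit=600):
    """Hypothetical admission policy keyed on queue lag."""
    if queue_lag_seconds >= hard_limit:
        return Decision(False, 429, False, "rejected: lag past hard limit")
    if queue_lag_seconds >= soft_limit:
        return Decision(True, 200, False, "accepted: derived effects delayed")
    return Decision(True, 200, True, "accepted")


healthy = admit(5)
degraded = admit(120)
overloaded = admit(2700)
```

The useful property is that the policy is explicit and testable, rather than being an emergent behaviour of whichever dependency falls over first.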
Streams vs queues vs direct writes
These words get overloaded.
The simpler distinction:
| Pattern | Use it when | Watch out for |
|---|---|---|
| Direct write | The work is part of the success contract and must complete now | Latency, cascading failures, duplicate side effects |
| Queue | Each item should be processed by one consumer group for background work | Poison messages, retries, visibility timeouts, lag |
| Stream/log | Many consumers need the same ordered history of facts | Retention, replay safety, partition keys, schema evolution |
Examples:
Direct write:
    create ledger entry before returning payment success
Queue:
    resize uploaded image
    send email receipt
    run abuse scan
Stream/log:
    post.created feeds search, ranking, notifications, analytics, audit
Use a queue when you want work distribution.
Use a stream when you want a durable history that multiple consumers can independently read and replay.
Use a direct write when the user-facing command is not true until that write completes.
Ordering comes from partition keys
Ordering is not global by default.
Most scalable logs order records within a partition. That means the key matters.
For posts, a reasonable key may be post_id if all events for one post need order:
post.created
post.updated
post.deleted
For account balances, the key is usually account_id because operations for one account need a single sequence.
For notifications, the key may be recipient_user_id so one user's notification timeline stays ordered.
The wrong key creates subtle bugs.
If all events use one global key, one partition becomes hot.
If events use random keys, related updates can be processed out of order.
Pick the key based on the invariant you need to preserve.
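The mechanics are simple enough to sketch: hash the key, take it modulo the partition count. partition_for is a hypothetical helper. Real brokers use their own hash functions, so the specific partition numbers will differ, but the property is the same.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (not Python's hash())."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


lifecycle = ["post.created", "post.updated", "post.deleted"]

# Keyed by post: every lifecycle event for post 42 shares one partition,
# so a single consumer sees them in order.
post_partitions = {partition_for("post:42", 8) for _ in lifecycle}

# One global key: all traffic lands on one partition, which becomes hot.
global_partitions = {partition_for("global", 8) for _ in range(1000)}

# Many distinct keys: spread across partitions, ordered only within each key.
spread = {partition_for(f"post:{i}", 8) for i in range(1000)}
```

Both post_partitions and global_partitions contain a single partition. The difference is how much traffic that partition carries.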
Update-in-place vs append-only
A normal table stores current state:
UPDATE posts
SET content = $2,
    updated_at = now()
WHERE id = $1;
An append-only log stores facts:
INSERT INTO post_events (
    id,
    post_id,
    event_type,
    payload,
    created_at
) VALUES (
    $1,
    $2,
    'post.updated',
    $3,
    now()
);
Append-only is useful when history matters, replay matters, or multiple projections need to be rebuilt.
It is not automatically better.
Current-state tables are easier to query. Logs are easier to audit and replay. Many systems need both:
post_events -> source of historical facts
posts -> current state projection
feed_items -> read model
search_index -> external projection
This is event sourcing in the small. You do not need to redesign the whole company around it. You can use append-only facts where replay and audit justify the cost.
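Rebuilding a projection is just a fold over the log. In this sketch the event dictionaries and their payload shape are assumptions, and project_posts is a hypothetical name. It replays post events, in per-post order, into a current-state view.

```python
def project_posts(events):
    """Fold an ordered list of post events into current state."""
    posts = {}
    for event in events:  # must be ordered per post_id
        post_id = event["post_id"]
        kind = event["event_type"]
        if kind == "post.created":
            posts[post_id] = {"content": event["payload"]["content"],
                              "deleted": False}
        elif kind == "post.updated" and post_id in posts:
            posts[post_id]["content"] = event["payload"]["content"]
        elif kind == "post.deleted" and post_id in posts:
            posts[post_id]["deleted"] = True
    return posts


post_events = [
    {"post_id": 1, "event_type": "post.created", "payload": {"content": "draft"}},
    {"post_id": 2, "event_type": "post.created", "payload": {"content": "hello"}},
    {"post_id": 1, "event_type": "post.updated", "payload": {"content": "final"}},
    {"post_id": 2, "event_type": "post.deleted", "payload": {}},
]

current = project_posts(post_events)
rebuilt = project_posts(post_events)  # replaying the log gives the same state
```

Because the function only reads the log, a projection bug can be fixed by correcting the fold and replaying. An update-in-place table cannot give you that.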
A practical decision matrix
Before designing a write path, fill this out:
| Requirement | Direct DB write | DB + outbox | Queue first | Stream first |
|---|---|---|---|---|
| User needs read-after-write | Strong fit | Strong fit | Weak unless status is pending | Depends on projection lag |
| Derived work is slow | Weak | Strong | Strong | Strong |
| Many consumers need the event | Weak | Strong | Medium | Strong |
| Must not lose accepted writes | Strong | Strong | Depends on queue durability | Strong if durable |
| Must absorb bursts | Weak | Medium | Strong | Strong |
| Simple operational model | Strong | Medium | Medium | Weak |
My default for product systems is DB + outbox.
It preserves a simple source of truth, gives workers a durable event stream, and avoids coupling the request path to every side effect.
Queue first can be correct when the command itself is naturally asynchronous:
import a CSV
train a model
transcode a video
send a campaign
In those cases the user should get a job id and a status endpoint:
def start_import(user_id, file_id):
    job_id = generate_id()
    queue.publish("import.requested", {
        "job_id": job_id,
        "user_id": user_id,
        "file_id": file_id,
    })
    return {"job_id": job_id, "status": "queued"}
But be honest in the product language. If the work is queued, call it queued. Do not call it done.
Failure states are product states
Async writes create intermediate states.
Name them.
CREATE TABLE import_jobs (
    id bigint PRIMARY KEY,
    user_id bigint NOT NULL,
    status text NOT NULL CHECK (
        status IN ('queued', 'processing', 'completed', 'failed')
    ),
    error_message text,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);
The status model is not UX polish. It is part of correctness.
If the user can start a job, refresh the page, and see nothing, the write path is incomplete. The request returned before the real work finished, so the product needs a way to represent the pending work.
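The CHECK constraint limits which states exist. It does not limit how a job moves between them. A minimal sketch of that second half follows. The transition table is my assumption (completed and failed are treated as terminal, and a retry feature would add an edge from failed back to queued), and transition is a hypothetical helper.

```python
ALLOWED_TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),  # terminal in this sketch
    "failed": set(),     # terminal in this sketch
}


class InvalidTransition(Exception):
    pass


def transition(job, new_status, error_message=None):
    """Return an updated copy of the job, or raise on an illegal move."""
    current = job["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current} -> {new_status}")
    updated = dict(job, status=new_status)
    # Only failed jobs carry an error message.
    updated["error_message"] = error_message if new_status == "failed" else None
    return updated


job = {"id": 1, "status": "queued", "error_message": None}
job = transition(job, "processing")
job = transition(job, "failed", error_message="row 17: invalid date")

try:
    transition(job, "completed")  # a failed job cannot silently complete
    rejected = False
except InvalidTransition:
    rejected = True
```

A worker that crashes and redelivers cannot flip a failed job to completed by accident, and the user always sees a state the product has a name for.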
What I think
The useful mental model is:
command -> fact -> effects -> projections
A command is what the user asks for.
A fact is what the system durably records.
Effects are things caused by the fact.
Projections are read models created from facts.
Most messy architectures mix these together. A request handler validates a command, writes a fact, mutates three projections, calls two external systems, and returns success based on whatever happened last.
That is how a write path becomes untestable.
Separate the pieces. Then decide where the acknowledgement boundary belongs.
Tutorial checklist
For any important write path, fill this out:
| Question | Example answer |
|---|---|
| User command | Create post |
| Canonical fact | Row in posts |
| Ack boundary | Return after posts row and outbox_events row commit |
| Idempotency key | (user_id, client_request_id) |
| Derived effects | Feed projection, search indexing, notifications, analytics |
| Delivery model | At least once |
| Dedupe table | processed_events(event_id, handler) |
| Ordering key | post_id for post lifecycle events |
| Backpressure signal | Outbox age over 60 seconds or queue lag over 50k |
| Repair path | Replay post.created events from outbox or event log |
| User-visible states | Published immediately; side effects may lag |
Then ask the harder question:
What happens if every step after the acknowledgement boundary fails?
If the answer is "we do not know," the write path is not designed yet.
Summary
- The write path is the real architecture for user-facing state changes.
- Start by defining the acknowledgement boundary.
- Keep the request path limited to the smallest durable product promise.
- Move derived effects behind queues, workers, or streams.
- Use an outbox when a database write and event publish must move together.
- Assume retries and duplicates; make commands and handlers idempotent.
- Treat queue lag and backpressure as product concerns, not just ops metrics.
- Pick partition keys based on the ordering invariant you need.
- Async workflows need explicit pending, completed, and failed states.