May 11, 2026

Approaching the problem of scaling mathematically

Quantitative analysis and arithmetic as the foundation for scaling decisions and architectural choices.

Scaling tends to feel overwhelming before it is measured, and quite ordinary afterward. Most of the worry dissolves once the question is turned into arithmetic. So that is where we start: with four numbers that quietly inform the architecture choices that follow. They are requests per second, data volume, concurrent connections, and latency targets. Hold onto those four. Everything else in this post is just a way of getting to know them better.

Translating user count to QPS

Load is not really about how many people exist in a database. It comes from activity, from what those people actually do and how often. Converting daily active users (DAU) into requests per second is a small piece of arithmetic:

average QPS = (DAU x requests_per_user_per_day) / 86,400
peak QPS    = average QPS x peak_multiplier (typically 10x)

Example: 1M DAU, 1 request per user per day:

1,000,000 / 86,400 = ~12 average QPS
peak at 10x        = ~116 QPS (10 x 11.57, multiplied before rounding)

At 10 requests per user per day, the same million users settle at around 1,157 peak QPS.

A reference table for a few common cases, worth glancing at slowly:

Daily users	Requests/user/day	Average QPS	10x peak QPS
1M	1	12	116
1M	10	116	1,157
10M	10	1,157	11,574
100M	10	11,574	115,741

If you would rather let the numbers run themselves, here is a short script to do the same work:

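SECONDS_PER_DAY = 86_400


def estimate_average_qps(daily_users, avg_requests_per_user):
    """Average requests per second, assuming traffic is spread evenly over the day."""
    return (daily_users * avg_requests_per_user) / SECONDS_PER_DAY


def estimate_peak_qps(
    daily_users,
    avg_requests_per_user,
    peak_multiplier=10,
):
    """Average QPS scaled by a peak multiplier (10x is a rule of thumb, not a measurement)."""
    return estimate_average_qps(
        daily_users,
        avg_requests_per_user,
    ) * peak_multiplier


# Reproduce the peak column of the table for 10 requests per user per day.
for users in [1_000_000, 10_000_000, 100_000_000]:
    peak = estimate_peak_qps(users, avg_requests_per_user=10)
    print(f"{users:>11,} DAU -> {peak:,.0f} peak QPS")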

The quiet lesson here is that load follows frequency, not headcount. A thousand users refreshing a feed every ten seconds each issue six requests per minute, which sustains roughly 100 QPS continuously. A million users checking email once a day spread that million requests across all 86,400 seconds and land near 12 average QPS. The smaller crowd creates the heavier load. It is a good thing to remember the next time a large user number feels alarming on its own.
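The comparison in that paragraph can be checked in a few lines. This is a minimal sketch; the helper name `sustained_qps` is made up for illustration, and it assumes each user's requests are spread evenly over time.

```python
SECONDS_PER_DAY = 86_400


def sustained_qps(users, seconds_between_requests):
    """Average QPS when every user issues one request every N seconds."""
    return users / seconds_between_requests


# 1,000 users refreshing a feed every 10 seconds.
feed_qps = sustained_qps(1_000, 10)

# 1,000,000 users checking email once per day.
email_qps = sustained_qps(1_000_000, SECONDS_PER_DAY)

print(f"feed:  {feed_qps:.1f} QPS")
print(f"email: {email_qps:.1f} QPS")
```

The crowd that is a thousand times smaller produces nearly nine times the sustained load.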

Four metrics for scaling decisions

These four numbers are the foundation everything else rests on:

QPS - how much work arrives per second
Data volume - how much state the system stores and scans
Concurrent connections - how many things stay open at once
Latency percentiles - p50, p95, p99

Percentiles are worth pausing on, because they show the whole distribution rather than a comfortable summary. With p50 at 40ms and p99 at 4 seconds, roughly 1 in 100 requests takes 4 seconds or longer, a hundred times the median. The average alone keeps that tail politely out of view, which is exactly why averages can mislead.
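To see how an average can hide a tail, here is a small sketch using a made-up sample. The latency values and the `percentile` helper (a plain nearest-rank implementation) are assumptions for illustration, not measurements from any real system.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 980 requests at 40 ms and 20 slow ones at 4,000 ms.
latencies_ms = [40] * 980 + [4_000] * 20

mean_ms = statistics.mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.1f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

In this sample the mean lands near 119 ms, a latency that no request actually experienced, while p50 and p95 sit at 40 ms and p99 at 4 seconds.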

These four pressures move independently of one another. A system can run at low QPS while holding an enormous dataset, run at high QPS over a tiny one, or carry light traffic with heavy concurrency simply because every client is holding a socket open. None of them stands in for the others, and each one asks for its own response.

Common bottlenecks

Before reaching for anything distributed, it helps to know where the early trouble actually tends to live. More often than not it is at the code or query level, close to home:

N+1 queries
Missing database indexes
Hot cache keys
Oversized response payloads (e.g., 4 MB responses)
Background jobs retrying without jitter
Exhausted database connection pools
CPU-intensive work per request
Missing pagination on list endpoints
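The first item on that list, the N+1 query, is easy to reproduce. The sketch below uses an in-memory SQLite database with a hypothetical `authors`/`posts` schema invented for this example, and counts round trips for the looped version against a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(i, f"author-{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, i, f"post-{i}") for i in range(1, 51)])


def titles_n_plus_one(conn):
    """One query for the list, then one more query per row."""
    queries = 1
    posts = conn.execute("SELECT author_id, title FROM posts ORDER BY id").fetchall()
    result = []
    for author_id, title in posts:
        name = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()[0]
        queries += 1
        result.append((title, name))
    return result, queries


def titles_joined(conn):
    """The same data fetched with a single join."""
    rows = conn.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()
    return rows, 1


slow_rows, slow_queries = titles_n_plus_one(conn)
fast_rows, fast_queries = titles_joined(conn)
print(slow_queries, "queries vs", fast_queries)
```

Both functions return identical rows; the difference is 51 database round trips against one for a 50-item page.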

The thing to notice is that per-request overhead multiplies by the request rate. At 50 QPS that overhead sits quietly in the background. At 500 QPS it begins to show up in the latency profile. At 5,000 QPS it accounts for most of the work the system does. Nothing about the overhead changed; only the rate it was multiplied by did.
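As a rough illustration of that multiplication, assume a hypothetical 2 ms of avoidable CPU overhead per request (the figure is invented for the example) and see how many cores it keeps busy at each rate.

```python
def overhead_cores(qps, overhead_ms_per_request):
    """CPU-seconds consumed per second by overhead, i.e. cores kept fully busy."""
    return qps * overhead_ms_per_request / 1000


for qps in (50, 500, 5_000):
    cores = overhead_cores(qps, overhead_ms_per_request=2)
    print(f"{qps:>5} QPS -> {cores:.1f} cores spent on overhead")
```

The overhead per request never changes, yet the cost grows from a tenth of a core to ten cores.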

This is also why splitting into microservices does not make such overhead disappear. The work simply travels with you, copied into each service that runs it. Moving a problem is not the same as resolving it.

Single server capacity

It is easy to underestimate a single machine. A single modern server carries far more headroom than most architecture diagrams assume:

SINGLE_POSTGRES_RULES_OF_THUMB = {
    "connections": "100-500 active connections before connection pooling becomes critical",
    "storage": "terabytes if indexes and maintenance are designed",
    "read_qps": "10k-50k simple indexed reads in favorable conditions",
    "write_qps": "depends on indexes, fsync, row size, constraints, and contention",
}

Single app servers: thousands of QPS for typical workloads
Cached reads: 10k+ QPS per server (as of 2026, some highly efficient backends, such as Actix in Rust, can roughly double this)
Postgres: hundreds of millions of rows with proper indexes and partitioning

Database scalability is not a single number but a product of several things together: database type, hardware specifications, schema design, query patterns, latency percentiles, and write patterns. When someone asks how much a database can handle, the honest answer usually begins with which of these they mean.

Common resource limits

When a system does finally reach a wall, it is usually one of a small, familiar set. Knowing them in advance takes a lot of the surprise out of the moment it happens:

Resource	What happens
Database connections	Postgres is typically configured for a few hundred (the default max_connections is 100). Once app workers exhaust the pool, requests queue, latency rises, and retries add further load, which compounds the problem
File descriptors	Every socket, file, pipe, and connection consumes one. WebSockets use them quickly. The box still has CPU and memory but can no longer open handles
Memory	Caches and queues grow, JSON bodies get copied, and a leak invisible in staging surfaces under production traffic, eventually as an OOM
Locks	A single hot row serializes every transaction that touches it, no matter how many cores the database has. A shared mutex can turn 32 cores into something close to one
Thread/worker exhaustion	All workers are blocked on slow I/O, leaving no capacity for new requests
Network egress	Large payloads and media reach bandwidth limits before CPU or memory does
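The connection-pool row lends itself to a quick estimate using Little's law (average work in flight equals arrival rate times average time held). The pool size, request rate, and hold times below are hypothetical, and the result is an average, so real peaks will sit above it.

```python
def connections_in_use(qps, held_ms_per_request):
    """Little's law: average connections in use = arrival rate x average time held."""
    return qps * held_ms_per_request / 1000


# Hypothetical service at 2,000 QPS sharing a pool of 100 connections.
POOL_SIZE = 100
for held_ms in (20, 50, 250):
    needed = connections_in_use(2_000, held_ms)
    status = "fits" if needed <= POOL_SIZE else "queues"
    print(f"{held_ms:>3} ms held -> {needed:.0f} connections ({status})")
```

This is the compounding effect from the table in miniature: when queries slow from 20 ms to 250 ms, the same traffic needs 500 connections instead of 40.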

Where vertical scaling reaches its limit

A bigger machine answers many problems, until it doesn't. The point where it stops helping is usually clear once you know what to look for:

Vertical scaling reaches its limit when:

CPU-bound: workload doesn't parallelize (global locks, single-threaded sections)
Memory-bound: hot data no longer fits in RAM, so a growing share of reads miss the cache and fall through to disk
Disk-bound: durable writes hit fsync/compaction/checkpoint limits
Network-bound: payload volume exceeds NIC throughput
Operationally-bound: backups take too long, schema changes lock tables, restores are impractical (this can force distribution before hardware limits do)
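The CPU-bound case can be made concrete with Amdahl's law, which caps the speedup from extra cores by the fraction of work that must run serially. This is a sketch; the parallel fractions are illustrative values rather than measurements.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can use extra cores."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / cores)


for parallel_fraction in (1.0, 0.95, 0.5):
    speedup = amdahl_speedup(parallel_fraction, cores=32)
    print(f"{parallel_fraction:.0%} parallel on 32 cores -> {speedup:.1f}x")
```

With half the work behind a global lock, 32 cores deliver less than a 2x speedup, which is why a bigger machine stops helping.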

Scaling decision framework

What follows is less a set of rules than a set of conditions worth checking honestly against your own system. Each move makes sense only when its conditions are genuinely met.

Stay monolith when:

One team can understand the codebase
One database holds the working set
Peak QPS fits on a few app nodes
Background work can be queued without changing the product contract

Scale vertically when:

The bottleneck resource (CPU/memory/disk/network) is identified and measured
The next machine size buys meaningful headroom
Operational tasks still fit maintenance windows

Route reads to replicas when:

Reads make up the large majority of traffic
Some paths can serve data that is a few seconds stale

Add queues when:

User-facing requests wait on deferrable work
Spikes are brief but expensive
Retries need backoff control
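For the retry condition, one common approach is exponential backoff with full jitter, so that failed jobs do not all retry at the same instant. The sketch below is one way to do it; the function name, `base`, and `cap` values are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: a uniform random delay between 0 and min(cap, base * 2**attempt) seconds."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


# Seeded generator so the example is repeatable; production code would not seed.
rng = random.Random(7)
for attempt in range(6):
    print(f"attempt {attempt}: wait {backoff_delay(attempt, rng=rng):.2f}s")
```

The ceiling doubles with each attempt until it reaches the cap, and the random draw beneath it spreads retries out instead of stacking them.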

Split into services when:

Teams require independent deploy cycles
Each component scales on its own profile
Clear data ownership boundaries exist

Shard/partition when:

One database can't hold the data or sustain the write rate
Queries naturally include a partition key
Cross-partition operations are rare
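A minimal sketch of routing by partition key follows, assuming a simple hash-modulo scheme with a hypothetical `shard_for` helper. It uses `hashlib` rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process and would route the same key differently after a restart.

```python
import hashlib
from collections import Counter


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical user IDs spread over 8 shards.
placement = Counter(shard_for(f"user-{i}", 8) for i in range(10_000))
for shard, count in sorted(placement.items()):
    print(f"shard {shard}: {count} keys")
```

A query that carries the key touches exactly one shard. Note that plain modulo routing moves most keys when the shard count changes, which is one of the coordination costs that distribution brings.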

Distribution is always a trade, never a free upgrade. You buy headroom, and you pay for it in coordination cost, in new failure modes, and in a wider operational surface to look after. Knowing the price in advance makes the decision a calm one rather than a regret.

From 1M to 100M DAU

Across the span from 1M to 100M DAU, what changes is not only the size of the numbers but the operating model itself. It is worth seeing the two ends side by side.

At 1M DAU, vertical scaling carries the load comfortably. The familiar moves are enough: larger database instances, connection pooling, a CDN, read replicas, and background job queues. A missing index will slow things down, but the situation stays recoverable, and that recoverability is part of why this stage feels forgiving.

At 100M DAU, the arithmetic asks more of the design. Ten requests per user per day works out to roughly 116k peak QPS. An endpoint doing 20ms of CPU work per request then consumes about 2,300 CPU-seconds per second at peak, which is to say roughly 2,300 cores kept fully busy. And if each user writes even one new row per day, 100M new rows arrive daily, which pulls retention policies, index strategies, and backfill infrastructure forward into the design, rather than leaving them as things to handle later.
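The figures in that paragraph follow directly from the earlier formula. As a quick check, here is the arithmetic using the same inputs, with the yearly row count added as a simple extrapolation of 100M new rows per day:

```python
SECONDS_PER_DAY = 86_400

dau = 100_000_000
requests_per_user_per_day = 10
peak_multiplier = 10
cpu_ms_per_request = 20
new_rows_per_day = 100_000_000

peak_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY * peak_multiplier
cpu_seconds_per_second = peak_qps * cpu_ms_per_request / 1000
rows_per_year = new_rows_per_day * 365

print(f"peak QPS:               {peak_qps:,.0f}")
print(f"CPU-seconds per second: {cpu_seconds_per_second:,.0f}")
print(f"rows after one year:    {rows_per_year:,}")
```

At that rate the table passes 36 billion rows within a year, which is why retention and indexing can no longer wait.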

Distribution becomes warranted once the arithmetic actually shows single-server limits have been reached, and not really before that. The numbers, not the anxiety, are what give permission.

Summary

If you carry away nothing else, let it be that scaling becomes calm once it becomes arithmetic:

Convert user counts to QPS, and let that number, not the user count, guide the architecture
Establish the four metrics: QPS, data volume, concurrency, and latency percentiles
Address code-level bottlenecks (N+1 queries, missing indexes, oversized payloads) before reaching for distribution
Trust that single servers and single databases hold substantial capacity once they are reasonably tuned
Scale vertically until measured resource limits genuinely point toward distribution
Remember that each new component brings new failure modes, so distribute only when the arithmetic supports it
