Approaching the problem of scaling mathematically
Quantitative analysis and arithmetic as the foundation for scaling decisions and architectural choices.
Scaling decisions should start with arithmetic analysis. Four fundamental metrics inform architecture choices: requests per second, data volume, concurrent connections, and latency targets.
Translating user count to QPS
User count is not load. Activity is load. To convert daily active users (DAU) into requests per second:
average QPS = (DAU x requests_per_user_per_day) / 86,400
peak QPS = average QPS x peak_multiplier (typically 10x)
Example: 1M DAU, 1 request per user per day:
1,000,000 / 86,400 = ~11.6 average QPS
peak at 10x = ~116 QPS
At 10 requests per user per day: ~1,157 peak QPS.
Here's the reference table:
| Daily users | Requests/user/day | Average QPS | 10x peak QPS |
|---|---|---|---|
| 1M | 1 | 12 | 116 |
| 1M | 10 | 116 | 1,157 |
| 10M | 10 | 1,157 | 11,574 |
| 100M | 10 | 11,574 | 115,741 |
Quick estimation script:
SECONDS_PER_DAY = 86_400

def estimate_average_qps(daily_users, avg_requests_per_user):
    return (daily_users * avg_requests_per_user) / SECONDS_PER_DAY

def estimate_peak_qps(
    daily_users,
    avg_requests_per_user,
    peak_multiplier=10,
):
    return estimate_average_qps(
        daily_users,
        avg_requests_per_user,
    ) * peak_multiplier

for users in [1_000_000, 10_000_000, 100_000_000]:
    print(users, estimate_peak_qps(users, avg_requests_per_user=10))
Load is determined by request frequency, not user count. 1,000 users refreshing a feed every 10 seconds generate 100 QPS, which is more load than 1M users checking email once a day (~12 QPS).
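The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch; `qps_from_interval` is a hypothetical helper name, and it assumes every user sends exactly one request per interval.

```python
SECONDS_PER_DAY = 86_400

def qps_from_interval(users, seconds_between_requests):
    """Average QPS when every user sends one request per interval."""
    return users / seconds_between_requests

# 1,000 users polling a feed every 10 seconds.
feed_qps = qps_from_interval(1_000, 10)

# 1M users checking email once a day.
email_qps = qps_from_interval(1_000_000, SECONDS_PER_DAY)

print(f"feed pollers:   {feed_qps:.1f} QPS")
print(f"daily checkers: {email_qps:.1f} QPS")
```

The smaller user base produces roughly nine times the request rate, which is why activity, not head count, should drive the estimate.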
Four metrics for scaling decisions
These metrics establish the quantitative foundation for architecture decisions:
- QPS - how much work arrives per second
- Data volume - how much state the system stores and scans
- Concurrent connections - how many things stay open simultaneously
- Latency percentiles - p50, p95, p99 (not averages)
Average latency hides problems. If p50 is 40ms but p99 is 4 seconds, at least 1 in 100 requests is painfully slow, and a user who makes many requests per session is likely to hit one of them.
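To see how an average can mask a slow tail, here is a small sketch on synthetic data. The `percentile` helper uses the nearest-rank method, which is one of several common definitions, and the sample values are invented for illustration rather than measured.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies: 980 fast requests at 40 ms and 20 slow ones at 4,000 ms.
latencies_ms = [40] * 980 + [4000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.1f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

With this data the mean is about 119 ms, which looks healthy, while p99 is 4 seconds. Only the percentile view exposes the slow tail.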
These four pressures are independent - a system can have low QPS but huge data, or high QPS but tiny data, or low traffic but high concurrency because every client holds a socket open. Each requires a different scaling strategy.
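One way to see why QPS and concurrency are separate pressures is Little's Law: the average number of requests or connections in flight equals the arrival rate times the average time each one stays open. The sketch below applies it to two invented workloads; the numbers are assumptions chosen for illustration.

```python
def concurrent_in_flight(arrival_rate_per_s, avg_hold_time_s):
    """Little's Law: average number in the system = arrival rate x average time in system."""
    return arrival_rate_per_s * avg_hold_time_s

# High QPS, short requests: 5,000 requests/s that each take 20 ms.
api_concurrency = concurrent_in_flight(5_000, 0.020)

# Low QPS, long-lived sockets: 50 new connections/s, each held open for 30 minutes.
socket_concurrency = concurrent_in_flight(50, 30 * 60)

print(f"API requests in flight: {api_concurrency:.0f}")
print(f"open sockets:           {socket_concurrency:,}")
```

The API handles 100 times the arrival rate yet holds only about 100 requests in flight, while the socket workload keeps 90,000 connections open.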
Common bottlenecks
Initial bottlenecks typically occur at the code or query level:
- N+1 queries
- Missing database indexes
- Hot cache keys
- Oversized response payloads (e.g., 4 MB responses)
- Background jobs retrying without jitter
- Exhausted database connection pools
- CPU-intensive work per request
- Missing pagination on list endpoints
The cost of inefficient code grows at least linearly with load: at 50 QPS the overhead is barely noticeable, at 500 QPS it shows up as extra hardware cost, and at 5,000 QPS it degrades performance significantly.
Distributing into microservices replicates code-level inefficiencies across multiple services.
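As an illustration of the first item in the list, the sketch below counts queries for an N+1 access pattern versus a single join. It uses an in-memory SQLite database from the Python standard library; the `authors` and `posts` tables and the `run` helper are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(i, f"author-{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, i, f"post-{i}") for i in range(1, 51)])

query_count = 0

def run(sql, params=()):
    """Execute a query and count it, so the two access patterns can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

# N+1 pattern: one query for the list, then one more query per row.
query_count = 0
n_plus_one = []
for post_id, author_id, title in run("SELECT id, author_id, title FROM posts ORDER BY id"):
    (name,) = run("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
    n_plus_one.append((title, name))
n_plus_one_queries = query_count

# Join: the same result in a single round trip.
query_count = 0
joined = run(
    "SELECT p.title, a.name FROM posts p "
    "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
)
join_queries = query_count

print(n_plus_one_queries, join_queries)
```

For 50 posts the first pattern issues 51 queries and the join issues 1. Against a networked database, each extra query also pays a round trip.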
Single server capacity
Typical capacity for a single modern server:
SINGLE_POSTGRES_RULES_OF_THUMB = {
    "connections": "100-500 active connections before connection pooling becomes critical",
    "storage": "terabytes if indexes and maintenance are designed",
    "read_qps": "10k-50k simple indexed reads in favorable conditions",
    "write_qps": "depends on indexes, fsync, row size, constraints, and contention",
}
- Single app servers: thousands of QPS for typical workloads
- Cached reads: 10k+ QPS per server (as of 2026, some highly efficient backends, such as Actix in Rust, can roughly double this)
- Postgres: hundreds of millions of rows with proper indexes and partitioning
Database scalability depends on: database type, hardware specifications, schema design, query patterns, latency percentiles, and write patterns.
Common resource limits
Resource exhaustion failure modes:
| Resource | What happens |
|---|---|
| Database connections | Postgres allows a few hundred. Too many app workers exhaust the pool, requests queue, latency spikes, retries pile on → death spiral |
| File descriptors | Every socket/file/pipe/connection consumes one. WebSockets burn through these fast. Box has CPU and memory left but can't open more handles |
| Memory | Caches and queues grow. JSON bodies get copied. A leak invisible in staging ramps under production traffic → OOM |
| Locks | A single hot row serializes work across the whole database. A shared mutex turns 32 cores into 1 |
| Thread/worker exhaustion | All workers blocked on slow I/O, no capacity left for new requests |
| Network egress | Large payloads and media hit bandwidth limits before CPU or memory |
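The database-connections row is easy to check before it becomes an incident. The sketch below multiplies out the connections an app fleet could open; the fleet sizes and the 300-connection limit are assumed values for illustration, not recommendations.

```python
def total_db_connections(app_instances, workers_per_instance, pool_size_per_worker):
    """Upper bound on connections the app tier can open against one database."""
    return app_instances * workers_per_instance * pool_size_per_worker

DB_MAX_CONNECTIONS = 300  # assumed limit, in the "few hundred" range from the table

# Assumed fleet: 20 instances, 8 workers each, 5 pooled connections per worker.
demand = total_db_connections(20, 8, 5)
over_by = demand - DB_MAX_CONNECTIONS

print(f"potential connections: {demand}, limit: {DB_MAX_CONNECTIONS}, over by: {over_by}")
```

Here the app tier can ask for 800 connections against a limit of 300, so the pool is exhausted well before CPU or memory become a concern.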
When to scale vertically vs. distribute
Vertical scaling stops when:
- CPU-bound: workload doesn't parallelize (global locks, single-threaded sections)
- Memory-bound: hot data no longer fits in RAM, everything becomes a cache miss
- Disk-bound: durable writes hit fsync/compaction/checkpoint limits
- Network-bound: payload volume exceeds NIC throughput
- Operationally-bound: backups take too long, schema changes lock tables, restores are impractical (this can force distribution before hardware limits do)
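The network-bound case lends itself to a quick estimate. The sketch below assumes a 10 Gbps NIC and the 4 MB response size mentioned earlier (taken as 4,000,000 bytes), and it ignores protocol overhead, so real limits would be somewhat lower.

```python
def egress_gbps(qps, payload_bytes):
    """Outbound bandwidth in gigabits per second for a given request rate and payload size."""
    return qps * payload_bytes * 8 / 1e9

NIC_GBPS = 10              # assumed NIC capacity
PAYLOAD_BYTES = 4_000_000  # 4 MB response, decimal megabytes

# Request rate at which the payload alone saturates the NIC.
max_qps = NIC_GBPS * 1e9 / (PAYLOAD_BYTES * 8)

print(f"NIC saturates at ~{max_qps:.1f} QPS")
print(f"500 QPS would need {egress_gbps(500, PAYLOAD_BYTES):.0f} Gbps")
```

Under these assumptions the link saturates at about 312 QPS, far below what the CPU of the same server could handle.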
Scaling decision framework
Stay monolith when:
- One team can understand the codebase
- One database holds the working set
- Peak QPS fits on a few app nodes
- Background work can be queued without changing the product contract
Scale vertically when:
- The bottleneck resource (CPU/memory/disk/network) is identified and measured
- The next machine size buys meaningful headroom
- Operational tasks still fit maintenance windows
Split reads from writes when:
- Read traffic dominates write traffic
- Stale reads are acceptable for some paths
Add queues when:
- User-facing requests wait on deferrable work
- Spikes are brief but expensive
- Retries need backoff control
Split into services when:
- Teams require independent deploy cycles
- Components have different scaling profiles
- Clear data ownership boundaries exist
Shard/partition when:
- One database can't hold the data or sustain the write rate
- Queries naturally include a partition key
- Cross-partition operations are rare
Distribution is a trade: you buy headroom with coordination cost, more failure modes, and more operational surface area.
The 1M → 100M gap
The jump from 1M to 100M DAU changes the operating model, not just the scale.
At 1M DAU: Vertical scaling remains effective. Standard approaches include larger database instances, connection pooling, CDN integration, read replicas, and background job queues. Missing indexes degrade performance but remain recoverable.
At 100M DAU: 10 requests/user/day is ~115k peak QPS at a 10x peak multiplier. An endpoint with 20ms of CPU work per request then consumes about 2,300 CPU-seconds per second at peak, the equivalent of roughly 2,300 fully busy cores. Writing 100M new rows per day forces explicit decisions about retention policies, index strategies, and backfill infrastructure.
Distribution becomes necessary when quantitative analysis indicates single-server limits have been reached.
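The 100M DAU figures above can be reproduced with the same arithmetic used earlier in the post. The storage part of this sketch goes beyond the source: the 200-byte row size and 365-day retention are assumptions picked only to show the shape of the calculation.

```python
SECONDS_PER_DAY = 86_400

dau = 100_000_000
requests_per_user_per_day = 10
peak_multiplier = 10
cpu_seconds_per_request = 0.020  # 20 ms of CPU work per request

peak_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY * peak_multiplier
# CPU-seconds consumed per wall-clock second equals the number of fully busy cores.
busy_cores = peak_qps * cpu_seconds_per_request

# Storage growth, under assumed row size and retention.
rows_per_day = 100_000_000
bytes_per_row = 200   # assumption
retention_days = 365  # assumption
table_tb = rows_per_day * bytes_per_row * retention_days / 1e12

print(f"peak QPS:   {peak_qps:,.0f}")
print(f"busy cores: {busy_cores:,.0f}")
print(f"table size: {table_tb:.1f} TB before indexes")
```

Both results point the same way: about 2,300 busy cores and several terabytes per year for one table are beyond what a single server comfortably provides.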
Summary
- Convert user counts to QPS for architecture decisions
- Establish four metrics: QPS, data volume, concurrency, latency percentiles
- Address code-level bottlenecks (N+1 queries, missing indexes, oversized payloads) before distributing
- Single servers and single databases have substantial capacity with proper optimization
- Scale vertically until measured resource limits indicate distribution is required
- Each additional component introduces new failure modes - distribute when quantitative analysis supports it