Back to blog

Approaching the problem of scaling mathematically

Quantitative analysis and arithmetic as the foundation for scaling decisions and architectural choices.

Scaling decisions should start with arithmetic analysis. Four fundamental metrics inform architecture choices: requests per second, data volume, concurrent connections, and latency targets.

Orders of magnitude scale comparisonAnimated bars compare user-count milestones against a single-server limit line, showing how much room exists before distribution is required.Actual load vs assumed scaleuser growth follows logarithmic, not linear, load patternsSingle server limit1K users10K users100K users1M users10M users

Translating user count to QPS

User count is not load. Activity is load. To convert daily active users (DAU) into requests per second:

average QPS = (DAU x requests_per_user_per_day) / 86,400
peak QPS    = average QPS x peak_multiplier (typically 10x)

Example: 1M DAU, 1 request per user per day:

1,000,000 / 86,400 = ~12 average QPS
peak at 10x        = ~120 QPS

At 10 requests per user per day: ~1,157 peak QPS.

Here's the reference table:

Daily usersRequests/user/dayAverage QPS10x peak QPS
1M112116
1M101161,157
10M101,15711,574
100M1011,574115,741

Quick estimation script:

SECONDS_PER_DAY = 86_400


def estimate_average_qps(daily_users, avg_requests_per_user):
    return (daily_users * avg_requests_per_user) / SECONDS_PER_DAY


def estimate_peak_qps(
    daily_users,
    avg_requests_per_user,
    peak_multiplier=10,
):
    return estimate_average_qps(
        daily_users,
        avg_requests_per_user,
    ) * peak_multiplier


for users in [1_000_000, 10_000_000, 100_000_000]:
    print(users, estimate_peak_qps(users, avg_requests_per_user=10))

Load is determined by request frequency, not user count. 1,000 users refreshing a feed every 10 seconds generates more load than 1M users checking email once a day.

Four metrics for scaling decisions

These metrics establish the quantitative foundation for architecture decisions:

  1. QPS - how much work arrives per second
  2. Data volume - how much state the system stores and scans
  3. Concurrent connections - how many things stay open simultaneously
  4. Latency percentiles - p50, p95, p99 (not averages)

Average latency hides problems. If p50 is 40ms but p99 is 4 seconds, the system is slow for 1 in 100 users.

These four pressures are independent - a system can have low QPS but huge data, or high QPS but tiny data, or low traffic but high concurrency because every client holds a socket open. Each requires a different scaling strategy.

Common bottlenecks

Initial bottlenecks typically occur at the code or query level:

  • N+1 queries
  • Missing database indexes
  • Hot cache keys
  • Oversized response payloads (e.g., 4 MB responses)
  • Background jobs retrying without jitter
  • Exhausted database connection pools
  • CPU-intensive work per request
  • Missing pagination on list endpoints

Inefficient code scales linearly with load: at 50 QPS, overhead is minimal; at 500 QPS, costs increase; at 5,000 QPS, performance degrades significantly.

Distributing into microservices replicates code-level inefficiencies across multiple services.

Single server capacity

Typical capacity for a single modern server:

SINGLE_POSTGRES_RULES_OF_THUMB = {
    "connections": "100-500 active connections before connection pooling becomes critical",
    "storage": "terabytes if indexes and maintenance are designed",
    "read_qps": "10k-50k simple indexed reads in favorable conditions",
    "write_qps": "depends on indexes, fsync, row size, constraints, and contention",
}
  • Single app servers: thousands of QPS for typical workloads
  • Cached reads: 10k+ QPS per server (Note that some hyper efficient backends can 2x this such as Actix written in Rust, as of 2026)
  • Postgres: hundreds of millions of rows with proper indexes and partitioning

Database scalability depends on: database type, hardware specifications, schema design, query patterns, latency percentiles, and write patterns.

Common resource limits

Resource exhaustion failure modes:

ResourceWhat happens
Database connectionsPostgres allows a few hundred. Too many app workers exhaust the pool, requests queue, latency spikes, retries pile on → death spiral
File descriptorsEvery socket/file/pipe/connection consumes one. WebSockets burn through these fast. Box has CPU and memory left but can't open more handles
MemoryCaches and queues grow. JSON bodies get copied. A leak invisible in staging ramps under production traffic → OOM
LocksA single hot row serializes work across the whole database. A shared mutex turns 32 cores into 1
Thread/worker exhaustionAll workers blocked on slow I/O, no capacity left for new requests
Network egressLarge payloads and media hit bandwidth limits before CPU or memory

When to scale vertically vs. distribute

Vertical scaling stops when:

  • CPU-bound: workload doesn't parallelize (global locks, single-threaded sections)
  • Memory-bound: hot data no longer fits in RAM, everything becomes a cache miss
  • Disk-bound: durable writes hit fsync/compaction/checkpoint limits
  • Network-bound: payload volume exceeds NIC throughput
  • Operationally-bound: backups take too long, schema changes lock tables, restores are impractical (this can force distribution before hardware limits do)

Scaling decision framework

Stay monolith when:

  • One team can understand the codebase
  • One database holds the working set
  • Peak QPS fits on a few app nodes
  • Background work can be queued without changing the product contract

Scale vertically when:

  • The bottleneck resource (CPU/memory/disk/network) is identified and measured
  • The next machine size buys meaningful headroom
  • Operational tasks still fit maintenance windows

Split reads from writes when:

  • Read traffic dominates write traffic
  • Stale reads are acceptable for some paths

Add queues when:

  • User-facing requests wait on deferrable work
  • Spikes are brief but expensive
  • Retries need backoff control

Split into services when:

  • Teams require independent deploy cycles
  • Components have different scaling profiles
  • Clear data ownership boundaries exist

Shard/partition when:

  • One database can't hold the data or sustain the write rate
  • Queries naturally include a partition key
  • Cross-partition operations are rare

Distribution is a trade: you buy headroom with coordination cost, more failure modes, and more operational surface area.

The 1M → 100M gap

The jump from 1M to 100M DAU changes the operating model, not just the scale.

At 1M DAU: Vertical scaling remains effective. Standard approaches include larger database instances, connection pooling, CDN integration, read replicas, and background job queues. Missing indexes degrade performance but remain recoverable.

At 100M DAU: 10 requests/user/day = ~115k peak QPS. An endpoint with 20ms CPU work per request consumes 2,300 CPU-seconds per second at peak. 100M new rows per day requires architectural consideration for retention policies, index strategies, and backfill infrastructure.

Distribution becomes necessary when quantitative analysis indicates single-server limits have been reached.

Summary

  1. Convert user counts to QPS for architecture decisions
  2. Establish four metrics: QPS, data volume, concurrency, latency percentiles
  3. Address code-level bottlenecks (N+1 queries, missing indexes, oversized payloads) before distributing
  4. Single servers and single databases have substantial capacity with proper optimization
  5. Scale vertically until measured resource limits indicate distribution is required
  6. Each additional component introduces new failure modes - distribute when quantitative analysis supports it

Pop quiz

Interactive quiz

Scaling arithmetic check

A randomized review of the quantitative scaling ideas from this post.

4of 11 questions
Question 1 of 425%
You have 5M daily active users, each making 5 requests per day. What is the estimated peak QPS using a 10x multiplier?