May 19, 2026

APIs that don't fall over

Rate limits, timeouts, retries, circuit breakers, and the controls that keep APIs available under pressure.

Most API outages start the same way. A dependency gets slow, requests pile up, clients retry, workers block, and connection pools fill. The database still has CPU. The app still has memory. Dashboards look half-normal until the queueing delay becomes the user-visible behavior.

Adding more servers at this point often makes things worse. More servers mean more concurrent calls to the slow dependency, more retries, and more database connections.

This post covers the controls that prevent this pattern.

The failure is usually queueing

An API does not only fail when code throws. It also fails when useful work waits behind work that will not complete in time.

Slow dependency calls occupy request workers. Those workers hold memory, sockets, and database connections. New requests arrive and wait. Clients time out and retry, and the retry arrives while the first request is still running. A small increase in latency can become an outage through this mechanism.

normal:
  500 requests/sec
  50ms service time
  25 concurrent in-flight requests

dependency slows:
  500 requests/sec
  2s service time
  1000 concurrent in-flight requests

Same arrival rate, different service time, forty times more in-flight work.
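The numbers above follow from Little's Law: average in-flight work equals arrival rate multiplied by average time in the system. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def in_flight(arrival_rate_per_sec, service_time_sec):
    """Little's Law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_sec * service_time_sec


normal = in_flight(500, 0.050)  # about 25 requests in flight
slow = in_flight(500, 2.0)      # about 1000 requests in flight
growth = slow / normal          # about 40x, with no change in traffic
```

The arrival rate never changes in this example. Only the service time does, which is why traffic graphs can look normal while workers run out.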

This is why adding servers does not always help. More servers increase the number of concurrent calls hitting the dependency that is already slow.

Rate limiting is admission control

Rate limiting decides which requests enter the system. It answers:

Should this request be allowed to enter the system right now?

A simple token bucket gives each caller a burst allowance and a refill rate.

import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = time.monotonic()

    def consume(self, cost=1):
        now = time.monotonic()
        elapsed = max(0, now - self.updated_at)

        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_per_second,
        )
        self.updated_at = now

        if self.tokens < cost:
            return False

        self.tokens -= cost
        return True

This is a teaching version. In production, the bucket needs to be shared across app instances, which usually means Redis with an atomic script, a gateway, or an edge proxy.

The shape of the configuration:

capacity = burst size
refill   = sustained rate
cost     = request weight

A cheap read might cost 1 token. A report export might cost 20. A login attempt might use a separate security limit. A tenant-level write limit can protect the database from one large customer.

Pick the limit key carefully

The limit key decides who is isolated. Common keys:

Key	Protects against	Risk
IP address	Anonymous abuse and simple scraping	NATs and mobile networks can group many users
User id	One user overwhelming their own quota	Does not protect multi-user tenants
API key	External integrations and partner apps	One customer can create many keys unless controlled
Tenant id	One organization dominating shared resources	Large tenants may need purchased capacity
Endpoint	One expensive route overwhelming internals	Does not distinguish good callers from bad callers
Global	Total system overload	Can let noisy callers crowd out important traffic

A common approach is layered limits:

global limit
  tenant limit
    user/API key limit
      endpoint cost limit

Requests should not all cost the same. A GET /status request and a POST /exports request should not spend the same budget.
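One way to sketch layered limits with per-route costs is to check every layer before charging any of them. The `admit` function, the `ENDPOINT_COST` table, and the injectable clock are assumptions made for illustration, and the buckets are in-process only, so this is not a shared production limiter:

```python
import time


class TokenBucket:
    """Same idea as the teaching bucket above, with an injectable clock."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.clock = clock
        self.updated_at = clock()

    def can_consume(self, cost):
        # Refill based on elapsed time, then report whether the cost fits.
        now = self.clock()
        elapsed = max(0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        return self.tokens >= cost

    def consume(self, cost):
        self.tokens -= cost


# Hypothetical per-route costs; real values come from measuring each route.
ENDPOINT_COST = {"GET /status": 1, "POST /exports": 20}


def admit(route, layers):
    """Admit only if every layer (global, tenant, user) can pay the cost.

    Checking all layers before charging any avoids spending tenant tokens
    on a request that the user-level bucket then rejects.
    """
    cost = ENDPOINT_COST.get(route, 1)
    if not all(bucket.can_consume(cost) for bucket in layers):
        return False
    for bucket in layers:
        bucket.consume(cost)
    return True


frozen = lambda: 0.0  # fixed clock so the example is deterministic
tenant_bucket = TokenBucket(100, 10, clock=frozen)
user_bucket = TokenBucket(30, 5, clock=frozen)
layers = [TokenBucket(1000, 500, clock=frozen), tenant_bucket, user_bucket]

first = admit("POST /exports", layers)   # admitted: user bucket drops to 10
second = admit("POST /exports", layers)  # rejected by the user bucket
```

The rejected second export leaves the tenant and global buckets untouched, so one user hitting a personal limit does not drain capacity shared with others.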

Fixed window, sliding window, token bucket, leaky bucket

The behavior matters more than the name.

Strategy	Behavior	Use it when
Fixed window	100 requests per minute resets on the minute	Simple counters and coarse limits
Sliding window	Counts recent requests across a moving interval	Fairer user-facing quotas
Token bucket	Allows bursts up to capacity, then sustained refill	Most API admission control
Leaky bucket	Smooths output at a steady rate	Protecting a downstream with strict throughput

Fixed windows are simple but have edge bursts. A caller can send 100 requests at 12:00:59 and 100 more at 12:01:00, which is 200 requests in about one second against a limit of 100 per minute.
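The edge burst is easy to reproduce with two small limiters driven by explicit timestamps. Both classes are illustrative sketches, not library APIs, and the sliding version keeps a full log of timestamps, which is the simplest variant rather than the most memory-efficient one:

```python
from collections import deque


class FixedWindow:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_id = None
        self.count = 0

    def allow(self, now):
        # The counter resets whenever the clock crosses a window boundary.
        window_id = int(now // self.window_seconds)
        if window_id != self.window_id:
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that have left the moving interval.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


def accepted(limiter):
    # 100 requests at second 59, then 100 more at second 60.
    times = [59.0] * 100 + [60.0] * 100
    return sum(limiter.allow(t) for t in times)


fixed_accepted = accepted(FixedWindow(100, 60))         # 200
sliding_accepted = accepted(SlidingWindowLog(100, 60))  # 100
```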

Token buckets are a reasonable default because real traffic is bursty. They allow short bursts without allowing unbounded pressure.

Timeouts are part of the contract

Every network call needs a timeout. This includes the public API, internal HTTP calls, database queries, Redis calls, search calls, queue publish calls, and object storage calls. Anything that waits on another system needs a deadline.

Without a timeout, the dependency decides how long the worker lives.

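import time


def get_profile(user_id, request_deadline):
    # request_deadline is an absolute time.monotonic() value set at the edge.
    remaining = request_deadline - time.monotonic()

    if remaining <= 0:
        raise TimeoutError("request deadline exceeded")

    # `http` stands in for the service's HTTP client. The call gets the
    # smaller of its own 150ms cap and whatever the request has left.
    return http.get(
        f"https://profiles.internal/users/{user_id}",
        timeout=min(0.150, remaining),
    )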

A timeout budget makes the constraint explicit:

public request deadline: 800ms
  auth:                  50ms
  database:             150ms
  profile service:      150ms
  recommendation call:  200ms
  rendering/marshal:     50ms
  slack:                200ms

If the outer request times out at 800ms, an inner call waiting 2 seconds cannot help the user.
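One way to keep every inner call inside the outer budget is to pass a single deadline object down the call chain and derive each timeout from it. The `Deadline` class and its method names are hypothetical, and the injectable clock exists only to make the example deterministic:

```python
import time


class Deadline:
    """An absolute deadline that every downstream call derives its timeout from."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return self.expires_at - self.clock()

    def timeout_for(self, cap_seconds):
        """Return min(per-call cap, time left); raise if nothing is left."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("request deadline exceeded")
        return min(cap_seconds, remaining)


now = [0.0]  # fake clock, advanced by hand
deadline = Deadline(0.800, clock=lambda: now[0])

db_timeout = deadline.timeout_for(0.150)    # full 150ms cap is available
now[0] = 0.700                              # 700ms of the budget is gone
late_timeout = deadline.timeout_for(0.200)  # only about 100ms is left
```

A call that starts late gets a shorter timeout than its nominal cap, and a call that starts after the deadline is never made.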

Retries are load multipliers

Retries are useful when the failure is temporary and the operation is safe to repeat. Otherwise they add load without adding reliability.

One retry at each of three layers multiplies load:

client: up to 2 attempts
  gateway: up to 2 attempts
    service: up to 2 attempts

one user request can become 2 x 2 x 2 = 8 downstream attempts

This is called a retry storm.
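The worst case is the product of the attempt counts at each layer, so it grows quickly when every layer is a little more generous. A small sketch of the arithmetic, with an illustrative function name:

```python
import math


def worst_case_attempts(attempts_per_layer):
    """Each layer repeats everything beneath it, so the counts multiply."""
    return math.prod(attempts_per_layer)


one_retry_each = worst_case_attempts([2, 2, 2])    # 8
two_retries_each = worst_case_attempts([3, 3, 3])  # 27
```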

A per-call retry budget contains it by capping both the number of attempts and the total time spent:

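import random
import time


class TemporaryError(Exception):
    """Placeholder for whatever transient failure the client library raises."""


def call_with_retries(fn, deadline, max_attempts=3, base_delay=0.025):
    last_error = None

    for attempt in range(max_attempts):
        # Never start an attempt after the request deadline has passed.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded") from last_error

        try:
            return fn(timeout=remaining)
        except TemporaryError as error:
            # Only transient failures are retried; anything else propagates.
            last_error = error

        if attempt == max_attempts - 1:
            break

        # Exponential backoff capped at 250ms, with full jitter,
        # and never sleeping past the deadline.
        delay = min(base_delay * (2 ** attempt), 0.250)
        jitter = random.uniform(0, delay)
        time.sleep(min(jitter, max(0, deadline - time.monotonic())))

    raise last_error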

Synchronized retries are themselves a burst. Backoff spreads retries out over time, and jitter randomizes each client's delay so that retries from many clients do not arrive at the same instant, which gives the dependency room to recover.
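Capping attempts per call still lets retries grow in step with traffic. A budget shared across calls can cap retries as a fraction of first attempts instead. The sketch below is one possible shape, with a hypothetical `RetryBudget` class that is in-process only and not thread-safe:

```python
class RetryBudget:
    """Allow retries only as a fraction of first-attempt traffic.

    Counts in integer "percent units" to avoid float drift: each first
    attempt deposits `retry_percent` units and each retry costs 100.
    """

    def __init__(self, retry_percent=10, max_banked_retries=10):
        self.retry_percent = retry_percent
        self.max_units = max_banked_retries * 100
        self.units = 0

    def record_request(self):
        # Called once per first attempt, whether or not it succeeds.
        self.units = min(self.max_units, self.units + self.retry_percent)

    def try_spend_retry(self):
        # Called before each retry; a False result means fail without retrying.
        if self.units < 100:
            return False
        self.units -= 100
        return True


budget = RetryBudget(retry_percent=10)
for _ in range(50):
    budget.record_request()  # 50 first attempts bank 5 retries

granted = sum(budget.try_spend_retry() for _ in range(20))  # 5
```

With this shape, a dependency that fails every call sees at most about 10% extra load from retries, instead of a multiple of the original traffic.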

Circuit breakers fail fast

A circuit breaker stops calling a dependency that is already failing. It has three states:

closed:    calls flow normally
open:      calls fail fast
half-open: a few probe calls test recovery

Simplified:

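import time


class ServiceUnavailable(Exception):
    """Raised when a call is rejected without reaching the dependency."""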


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after_seconds):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed < self.reset_after_seconds:
                raise ServiceUnavailable("circuit open")

            return self._half_open_call(fn)

        try:
            result = fn()
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise

    def _half_open_call(self, fn):
        try:
            result = fn()
            self.failure_count = 0
            self.opened_at = None
            return result
        except Exception:
            self.opened_at = time.monotonic()
            raise

This code is intentionally small. Real breakers need rolling windows, error rates, probe limits, and metrics. The principle is the same: once a dependency is unhealthy, stop spending request threads on it.
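A sketch of what a rolling window, an error-rate threshold, and a single-probe limit might look like together. The class name, parameters, and defaults are assumptions for illustration, the code is not thread-safe, and it records no metrics:

```python
import time
from collections import deque


class CircuitOpen(Exception):
    pass


class RollingWindowBreaker:
    def __init__(self, window_seconds=10.0, min_calls=20, max_error_rate=0.5,
                 reset_after_seconds=5.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.max_error_rate = max_error_rate
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.outcomes = deque()  # (timestamp, succeeded) pairs
        self.opened_at = None
        self.probe_in_flight = False

    def _error_rate(self, now):
        # Forget outcomes older than the window, then compute the rate.
        while self.outcomes and self.outcomes[0][0] <= now - self.window_seconds:
            self.outcomes.popleft()
        if len(self.outcomes) < self.min_calls:
            return 0.0  # too little data to judge
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def call(self, fn):
        now = self.clock()
        probing = False
        if self.opened_at is not None:
            still_cooling = now - self.opened_at < self.reset_after_seconds
            if still_cooling or self.probe_in_flight:
                raise CircuitOpen("circuit open")
            probing = True  # half-open: exactly one probe at a time
            self.probe_in_flight = True
        try:
            result = fn()
        except Exception:
            self._record(now, False, probing)
            raise
        self._record(now, True, probing)
        return result

    def _record(self, now, ok, probing):
        if probing:
            self.probe_in_flight = False
            if ok:
                self.opened_at = None  # probe succeeded: close and start fresh
                self.outcomes.clear()
            else:
                self.opened_at = self.clock()  # probe failed: restart cooldown
            return
        self.outcomes.append((now, ok))
        if self._error_rate(now) > self.max_error_rate:
            self.opened_at = self.clock()
```

The `min_calls` floor keeps one failure out of two calls from opening the circuit during quiet periods.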

Choosing the numbers

The patterns are well known. The harder part is choosing specific numbers and deciding what the product does when those numbers take effect:

What status code is returned?
What does the user see?
Which endpoint is protected first?
Which customer is allowed to burst?
Which dependency is optional?
Which write is safe to retry?

These are architecture decisions, not implementation details.

Bulkheads and concurrency limits

Rate limits control arrival. Concurrency limits control in-flight work. Both are needed.

If an endpoint can spend 2 seconds waiting on a dependency, even a modest request rate can consume every worker. A concurrency limit caps the number of concurrent calls.

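from threading import Semaphore


# ServiceUnavailable is the same exception the circuit breaker raises.
class ConcurrencyLimiter: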
    def __init__(self, max_in_flight):
        self.semaphore = Semaphore(max_in_flight)

    def call(self, fn):
        if not self.semaphore.acquire(blocking=False):
            raise ServiceUnavailable("too many in-flight requests")

        try:
            return fn()
        finally:
            self.semaphore.release()

Use separate pools for separate classes of work:

checkout pool:       small, protected, high priority
search pool:         medium, user-facing
export pool:         small, background
analytics pool:      best effort
admin report pool:   isolated

This is a bulkhead. One flooded endpoint should not consume the capacity the rest of the product depends on.
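A minimal sketch of separate pools, assuming one non-blocking semaphore per class of work. The `Bulkheads` class, the `PoolFull` exception, and the pool sizes are illustrative:

```python
from threading import BoundedSemaphore


class PoolFull(Exception):
    pass


class Bulkheads:
    def __init__(self, limits):
        # One independent semaphore per class of work.
        self.pools = {name: BoundedSemaphore(limit) for name, limit in limits.items()}

    def run(self, pool_name, fn):
        pool = self.pools[pool_name]
        if not pool.acquire(blocking=False):
            raise PoolFull(f"{pool_name} pool is full")
        try:
            return fn()
        finally:
            pool.release()


bulkheads = Bulkheads({"checkout": 2, "export": 1})


def during_export():
    # Runs while the only export slot is held.
    try:
        bulkheads.run("export", lambda: "second export")
        second_export = "ran"
    except PoolFull:
        second_export = "rejected"
    # Checkout has its own pool, so it is unaffected by the full export pool.
    checkout = bulkheads.run("checkout", lambda: "order placed")
    return second_export, checkout


result = bulkheads.run("export", during_export)  # ("rejected", "order placed")
```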

Graceful degradation

Not every dependency needs to take the API down with it.

If recommendations are unavailable, return the page without recommendations.
If the personalization service is slow, return the default ranking.
If a profile badge service fails, omit the badge.
If a search cluster is unhealthy, return cached results with a stale marker.

The goal is to preserve the core product path when optional pieces fail.

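def get_home_feed(user_id, deadline):
    # The feed is the core path: a failure here is allowed to propagate.
    feed = db.get_recent_feed_items(user_id, timeout=0.120)

    recommendations = []
    remaining = deadline - time.monotonic()
    if remaining > 0:
        # Recommendations are optional: cap the wait and fall back to empty.
        try:
            recommendations = recommender.get_items(
                user_id,
                timeout=min(0.080, remaining),
            )
        except (TimeoutError, ServiceUnavailable):
            recommendations = []

    return {
        "feed": feed,
        "recommendations": recommendations,
    }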
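The search case from the list above, cached results with a stale marker, might be sketched like this. `StaleCache`, `SearchUnavailable`, and the response fields are assumptions for illustration, and the cache is an unbounded in-memory dict with no eviction:

```python
import time


class SearchUnavailable(Exception):
    pass


class StaleCache:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # query -> (stored_at, results)

    def put(self, query, results):
        self.entries[query] = (self.clock(), results)

    def get(self, query):
        return self.entries.get(query)


def search(query, backend, cache):
    try:
        results = backend(query)
    except (TimeoutError, SearchUnavailable):
        cached = cache.get(query)
        if cached is None:
            raise  # nothing to degrade to, so surface the failure
        stored_at, results = cached
        age = cache.clock() - stored_at
        return {"results": results, "stale": True, "age_seconds": age}
    cache.put(query, results)  # refresh the fallback on every success
    return {"results": results, "stale": False, "age_seconds": 0.0}
```

Returning the age alongside the stale marker lets the client decide whether old results are still worth showing.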

The design question to ask:

What can we remove and still serve something useful?

Load shedding

Load shedding is rejecting work on purpose to keep the system available for the rest of the traffic.

Good load shedding rejects early:

Before expensive authentication fanout if possible.
Before opening database transactions.
Before calling slow dependencies.
Before accepting background jobs that cannot be processed.

Bad load shedding rejects after most of the work has been done.

When the system is overloaded, a fast 429 or 503 with Retry-After is more useful than a 30-second timeout that prompts the client to retry blindly.

HTTP/1.1 429 Too Many Requests
Retry-After: 3
Content-Type: application/json

{"error":"rate_limited","retry_after_seconds":3}
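A sketch of rejecting early, assuming the handler can read a current in-flight count before it does anything else. The function names, the tuple response shape, and the choice of 503 for system-wide overload are illustrative:

```python
import json


def shed_if_overloaded(in_flight, max_in_flight, retry_after_seconds=3):
    """Return a (status, headers, body) rejection, or None to admit."""
    if in_flight < max_in_flight:
        return None
    body = json.dumps(
        {"error": "overloaded", "retry_after_seconds": retry_after_seconds},
        separators=(",", ":"),
    )
    headers = {
        "Retry-After": str(retry_after_seconds),
        "Content-Type": "application/json",
    }
    return 503, headers, body


def handle(request, in_flight, max_in_flight, do_work):
    # The overload check runs first: no auth fanout, no transaction,
    # and no dependency call happens for a request that will be rejected.
    rejection = shed_if_overloaded(in_flight, max_in_flight)
    if rejection is not None:
        return rejection
    return 200, {"Content-Type": "application/json"}, do_work(request)
```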

Status codes are control signals

Clients need clear, distinct signals to behave well.

Code	Meaning in this context	Client behavior
`408`	Server timed out waiting for the complete request	Usually safe to retry only if operation is idempotent
`409`	Conflicting command or reused idempotency key with different body	Do not blindly retry
`425`	Server will not risk processing a request that might be replayed (TLS early data)	Retry after the TLS handshake completes
`429`	Caller exceeded a limit	Wait according to `Retry-After`
`500`	Unexpected server error	Retry only with budget and idempotency
`502`	Bad upstream response	Retry cautiously
`503`	Service unavailable or circuit open	Retry later with backoff
`504`	Gateway timeout	Retry cautiously; original work may still be running

Ambiguous errors lead to poorly behaved clients, which create extra load. Clear errors are part of reliability.
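On the client side, the table can be turned into a small decision function. This is a sketch under assumptions: the function name and return shape are hypothetical, `Retry-After` is handled only in its delta-seconds form, and `425` is left out because the right reaction depends on the transport:

```python
def retry_decision(status, idempotent, retry_after=None):
    """Return (should_retry, wait_seconds).

    wait_seconds is None when the server gave no hint and the client
    should fall back to its own backoff with jitter.
    """
    if status in (429, 503):
        # The server asked for a pause; honor its hint when present.
        wait = float(retry_after) if retry_after is not None else None
        return True, wait
    if status in (408, 500, 502, 504):
        # The original work may have run, so retry only idempotent operations.
        return idempotent, None
    # 409 and everything else: retrying the same request will not help.
    return False, None
```

A retry allowed by this function still needs to fit inside the caller's deadline and retry budget.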

The retry matrix

Before adding a retry, decide if it is safe:

Operation	Retry?	Why
`GET /profile/123`	Usually yes	Read is naturally safe if bounded
`POST /payments` without idempotency key	No	Duplicate charge risk
`POST /payments` with idempotency key	Yes, carefully	Server can dedupe the command
Queue publish after DB commit without outbox	Not by simple retry alone	Can create lost or duplicate effects
Search index update from event handler	Yes	Handler should be idempotent
Email send	Usually no direct retry without dedupe	Duplicate emails hurt trust

Retry policy depends on operation semantics, not the HTTP method. PUT can be unsafe if implemented poorly, and POST can be safe with an idempotency key.
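A sketch of how an idempotency key can make a `POST` safe to retry: the server remembers the first response for each key and replays it instead of executing the command again. The class names are hypothetical and the store is an in-memory dict with no expiry or locking. A real implementation would need a durable store with an atomic insert:

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same key, different body: corresponds to 409 in the status table."""


class IdempotentHandler:
    def __init__(self, execute):
        self.execute = execute
        self.seen = {}  # key -> (body_fingerprint, response)

    @staticmethod
    def _fingerprint(body):
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def handle(self, idempotency_key, body):
        fingerprint = self._fingerprint(body)
        if idempotency_key in self.seen:
            stored_fingerprint, response = self.seen[idempotency_key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(idempotency_key)
            return response  # replay: the command is not executed again
        response = self.execute(body)
        self.seen[idempotency_key] = (fingerprint, response)
        return response
```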

Metrics that help

The metrics worth putting on the first dashboard for an API:

Metric	Why it matters
Request rate by endpoint and caller	Shows who is creating load
In-flight requests	Exposes queueing before CPU does
p50, p95, p99 latency	Shows tail pain
Timeout count by dependency	Shows where budget is spent
Retry attempts by caller and dependency	Finds retry storms
Rate-limit rejects	Shows protected pressure
Circuit breaker state	Shows fast-fail behavior
Queue lag and age	Shows async backlog
Saturation by pool	Shows bulkhead pressure
Error rate by status code	Separates overload, bugs, dependency failure, and client misuse

CPU alone is not enough. An API can be unresponsive at low CPU if every worker is waiting on I/O.
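In-flight work is cheap to measure. A minimal sketch of a gauge that wraps each request, assuming a threaded server and leaving the export to a metrics system out of scope. `InFlightGauge` is an illustrative name:

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    @contextmanager
    def track(self):
        # Increment on entry and always decrement on exit, even on errors.
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            with self._lock:
                self.current -= 1
```

Wrapping each handler in `with gauge.track():` makes rising concurrency visible while CPU still looks idle.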

Practical sequence

A workable order to apply these controls:

Define the core product path.
Put deadlines around every downstream call.
Limit in-flight work per route and dependency.
Add caller-aware rate limits.
Retry only idempotent operations, inside a deadline, with jitter.
Use circuit breakers for dependencies that fail slowly.
Decide which pieces can degrade.
Return clear status codes that teach clients what to do.
Watch in-flight work, tail latency, retries, and rejects.

A service mesh diagram is not the place to start. The starting questions are simpler:

How many requests are allowed in?
How long can they run?
How many can wait?
What happens when dependency X is slow?
What should the client do next?

If those answers are missing, the API does not yet have a reliability story.

Endpoint checklist

For any important API endpoint, fill in:

Question	Example answer
Endpoint	`POST /exports`
Caller key	`tenant_id` and `user_id`
Cost	20 tokens per export request
Burst limit	10 requests per tenant
Sustained limit	2 requests per minute per tenant
Public deadline	1 second for enqueue response
Downstream timeouts	DB 150ms, queue publish 150ms
Retry policy	Queue publish retry 2x with jitter inside deadline
Idempotency	Required `Idempotency-Key` header
Concurrency limit	5 active export starts per tenant
Degraded response	Return existing export if duplicate key is reused
Overload response	`429` with `Retry-After`
Dashboard	rate, rejects, in-flight, p99, queue lag, duplicate keys

Then ask:

If this endpoint gets 10x traffic for five minutes, what protects the rest of the system?

If the only answer is autoscaling, the checklist is incomplete. Autoscaling adds capacity; it is not admission control.

Summary

API failures are often queueing failures before they are code failures.
Rate limits control which work enters the system.
Concurrency limits control how much work can be in flight.
Every downstream call needs a timeout and should respect the request deadline.
Retries multiply load; use budgets, backoff, jitter, and idempotency.
Circuit breakers fail fast when a dependency is already unhealthy.
Graceful degradation preserves the core product path.
Load shedding is better than letting every request time out slowly.
Status codes and Retry-After headers are part of the reliability contract.
Watch in-flight work, p99 latency, retry attempts, rate-limit rejects, and queue lag.
