APIs that don't fall over

Rate limits, timeouts, retries, circuit breakers, and the controls that keep APIs available under pressure.

Most API outages start the same way. A dependency gets slow, requests pile up, clients retry, workers block, and connection pools fill. The database still has CPU. The app still has memory. Dashboards look half-normal until the queueing delay becomes the user-visible behavior.

Adding more servers at this point often makes things worse. More servers mean more concurrent calls to the slow dependency, more retries, and more database connections.

This post covers the controls that prevent this pattern.

[Figure: API reliability control chain: admit less, bound work, fail fast, and preserve the core product path. Incoming requests pass through (1) Admission: caller key + cost, 429/503 early; (2) Deadline: 800ms total budget, timeout every hop; (3) Retry budget: idempotent only, backoff + jitter; (4) Bulkheads: max in-flight work, pool per route, shed before work. Inside one request deadline: the core path uses required deps (DB 150ms), optional deps degrade cleanly, and an open circuit breaker fails fast so there is no slow dependency pileup. First dashboard: in-flight, p99, timeouts, retries, rejects, circuit, queue.]

The failure is usually queueing

An API does not only fail when code throws. It also fails when useful work waits behind work that will not complete in time.

Slow dependency calls occupy request workers. Those workers hold memory, sockets, and database connections. New requests arrive and wait. Clients time out and retry, and the retry arrives while the first request is still running. A small increase in latency can become an outage through this mechanism.

normal:
  500 requests/sec
  50ms service time
  25 concurrent in-flight requests

dependency slows:
  500 requests/sec
  2s service time
  1000 concurrent in-flight requests

Same arrival rate, different service time, forty times more in-flight work.

This is why adding servers does not always help. More servers increase the number of concurrent calls hitting the dependency that is already slow.
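The in-flight numbers above follow Little's law: average concurrency equals arrival rate multiplied by average time in the system. A minimal sketch of that arithmetic; the function name is illustrative, not taken from any library.

```python
def in_flight(requests_per_second, service_time_seconds):
    """Little's law: average concurrency = arrival rate x average time in system."""
    return requests_per_second * service_time_seconds


normal = in_flight(500, 0.050)  # about 25 concurrent requests
slow = in_flight(500, 2.0)      # about 1000 concurrent requests
growth = slow / normal          # about 40x more in-flight work
```

The arrival rate never changed. Only the time each request spends in the system did, which is why worker and connection pools sized for the normal case run out so quickly.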

Rate limiting is admission control

Rate limiting is a decision made at the door, before any work starts. It answers one question:

Should this request be allowed to enter the system right now?

A simple token bucket gives each caller a burst allowance and a refill rate.

import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = time.monotonic()

    def consume(self, cost=1):
        now = time.monotonic()
        elapsed = max(0, now - self.updated_at)

        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_per_second,
        )
        self.updated_at = now

        if self.tokens < cost:
            return False

        self.tokens -= cost
        return True

This is a teaching version: it keeps its state in one process and is not thread-safe. In production, the bucket needs to be shared across app instances, which usually means Redis with an atomic script, a gateway, or an edge proxy.

The shape of the configuration:

capacity = burst size
refill   = sustained rate
cost     = request weight

A cheap read might cost 1 token. A report export might cost 20. A login attempt might use a separate security limit. A tenant-level write limit can protect the database from one large customer.

Pick the limit key carefully

The limit key decides who is isolated. Common keys:

| Key | Protects against | Risk |
| --- | --- | --- |
| IP address | Anonymous abuse and simple scraping | NATs and mobile networks can group many users |
| User id | One user overwhelming their own quota | Does not protect multi-user tenants |
| API key | External integrations and partner apps | One customer can create many keys unless controlled |
| Tenant id | One organization dominating shared resources | Large tenants may need purchased capacity |
| Endpoint | One expensive route overwhelming internals | Does not distinguish good callers from bad callers |
| Global | Total system overload | Can let noisy callers crowd out important traffic |

A common approach is layered limits:

global limit
  tenant limit
    user/API key limit
      endpoint cost limit

Requests should not all cost the same. A GET /status request and a POST /exports request should not spend the same budget.
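One way to combine layered limits with weighted costs is to check every layer before charging any of them, so a rejection at an inner layer does not burn budget at an outer one. The sketch below is an assumption about how that might look, not a production limiter: `Bucket` is a compact variant of the token bucket above with an injectable clock, and `ENDPOINT_COST` and `admit` are hypothetical names. It is single-process and not thread-safe.

```python
import time


class Bucket:
    """Token bucket with an injectable clock so the sketch is easy to test."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def can_consume(self, cost):
        # Refill based on elapsed time, then report whether the cost fits.
        now = self.clock()
        elapsed = max(0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        return self.tokens >= cost

    def consume(self, cost):
        self.tokens -= cost


# Hypothetical per-endpoint weights; unknown endpoints cost 1 token.
ENDPOINT_COST = {"GET /status": 1, "POST /exports": 20}


def admit(layers, endpoint):
    """Admit only if every layer (global, tenant, caller) can pay the cost."""
    cost = ENDPOINT_COST.get(endpoint, 1)
    if not all(bucket.can_consume(cost) for bucket in layers):
        return False  # nothing has been charged yet
    for bucket in layers:
        bucket.consume(cost)
    return True
```

A caller would build the layer list per request, for example `admit([global_bucket, tenant_bucket, user_bucket], "POST /exports")`, and return a 429 when it comes back `False`.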

Fixed window, sliding window, token bucket, leaky bucket

The behavior matters more than the name.

| Strategy | Behavior | Use it when |
| --- | --- | --- |
| Fixed window | 100 requests per minute resets on the minute | Simple counters and coarse limits |
| Sliding window | Counts recent requests across a moving interval | Fairer user-facing quotas |
| Token bucket | Allows bursts up to capacity, then sustained refill | Most API admission control |
| Leaky bucket | Smooths output at a steady rate | Protecting a downstream with strict throughput |

Fixed windows are simple but have edge bursts. A caller can send 100 requests at 12:00:59 and 100 more at 12:01:00.

Token buckets are a reasonable default because real traffic is bursty. They allow short bursts without allowing unbounded pressure.
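The fixed-window edge burst described above is easy to reproduce. This is a small sketch with timestamps passed in as plain seconds so the behavior is deterministic; `FixedWindowCounter` is an illustrative name.

```python
class FixedWindowCounter:
    """Counts requests per aligned window; the caller supplies the timestamp."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0

    def allow(self, now):
        window_start = now - (now % self.window_seconds)
        if window_start != self.window_start:
            # A new window begins, so the counter resets to zero.
            self.window_start = window_start
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 more in the first second of the next.
admitted = sum(limiter.allow(now=59) for _ in range(100))
admitted += sum(limiter.allow(now=60) for _ in range(100))
```

All 200 requests are admitted within about one second, twice the nominal per-minute limit. A token bucket with capacity 100 would have admitted the first 100 and then only what the refill rate allows.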

Timeouts are part of the contract

Every network call needs a timeout. This includes the public API, internal HTTP calls, database queries, Redis calls, search calls, queue publish calls, and object storage calls. Anything that waits on another system needs a deadline.

Without a timeout, the dependency decides how long the worker lives.

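import time


def get_profile(user_id, request_deadline):
    # request_deadline is an absolute time.monotonic() value.
    remaining = request_deadline - time.monotonic()

    if remaining <= 0:
        raise TimeoutError("request deadline exceeded")

    # `http` stands in for whatever HTTP client the service uses.
    # Cap this hop at 150ms, or at whatever is left of the request budget.
    return http.get(
        f"https://profiles.internal/users/{user_id}",
        timeout=min(0.150, remaining),
    )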

A timeout budget makes the constraint explicit:

public request deadline: 800ms
  auth:                  50ms
  database:             150ms
  profile service:      150ms
  recommendation call:  200ms
  rendering/marshal:     50ms
  slack:                200ms

If the outer request times out at 800ms, an inner call waiting 2 seconds cannot help the user.
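One way to carry the budget above through a request is a small deadline object that every hop asks for its timeout. This is a sketch under assumptions: the `Deadline` class and its method names are hypothetical, and the clock is injectable only to make the behavior easy to test.

```python
import time


class Deadline:
    """Absolute request deadline that hands out per-hop timeouts."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return self.expires_at - self.clock()

    def timeout_for(self, hop_cap_seconds):
        """Return min(hop cap, time left); raise if the budget is already spent."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("request deadline exceeded")
        return min(hop_cap_seconds, remaining)


deadline = Deadline(0.800)                   # public request deadline
auth_timeout = deadline.timeout_for(0.050)   # at most 50ms for auth
db_timeout = deadline.timeout_for(0.150)     # at most 150ms for the database
```

Each hop gets its own cap early in the request and a shrinking share late in it, so no inner call can outlive the outer deadline.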

Retries are load multipliers

Retries are useful when the failure is temporary and the operation is safe to repeat. Otherwise they add load without adding reliability.

A single retry at each of three layers doubles the attempts at every layer, and the doublings multiply:

client retries 2x
  gateway retries 2x
    service retries 2x

one user request can become 8 downstream attempts

This is called a retry storm.

A retry budget contains it:

import random
import time


class TemporaryError(Exception):
    """Stand-in for whatever transient failure the client library raises."""


def call_with_retries(fn, deadline, max_attempts=3, base_delay=0.025):
    last_error = None

    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded") from last_error

        try:
            # Each attempt gets only what is left of the request deadline.
            return fn(timeout=remaining)
        except TemporaryError as error:
            last_error = error

        if attempt == max_attempts - 1:
            break

        # Exponential backoff capped at 250ms, with full jitter.
        delay = min(base_delay * (2 ** attempt), 0.250)
        jitter = random.uniform(0, delay)
        # Never sleep past the deadline.
        time.sleep(min(jitter, max(0, deadline - time.monotonic())))

    raise last_error

Synchronized retries are themselves a burst. Backoff spreads retries out over time, and jitter makes that spread uneven so the dependency has room to recover.
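The attempt cap in `call_with_retries` bounds a single request. A process-wide budget can additionally bound retries as a share of total traffic, so that retries stop when most calls are failing. The sketch below is one possible shape, not a standard API: `RetryBudget` and its parameters are hypothetical, it uses integer credits to avoid floating-point drift, and it is not thread-safe.

```python
class RetryBudget:
    """Allows roughly one retry per `requests_per_retry` first attempts.

    Every first attempt earns 1 credit; every retry costs `requests_per_retry`
    credits. Credits are capped so an idle period cannot bank unlimited retries.
    """

    def __init__(self, requests_per_retry=10, max_banked_retries=10):
        self.cost = requests_per_retry
        self.cap = requests_per_retry * max_banked_retries
        self.credits = 0

    def record_request(self):
        # Call once per first attempt (not per retry).
        self.credits = min(self.cap, self.credits + 1)

    def try_spend_retry(self):
        # Call before each retry; skip the retry when this returns False.
        if self.credits < self.cost:
            return False
        self.credits -= self.cost
        return True
```

With the defaults, retries are held to about 10% of request volume. During a broad outage the budget drains quickly and callers fail fast instead of amplifying load.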

Circuit breakers fail fast

A circuit breaker stops calling a dependency that is already failing. It has three states:

closed:    calls flow normally
open:      calls fail fast
half-open: a few probe calls test recovery

Simplified:

import time


class ServiceUnavailable(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after_seconds):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0  # consecutive failures while closed
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed < self.reset_after_seconds:
                # Open: fail fast without touching the dependency.
                raise ServiceUnavailable("circuit open")

            # Half-open: the reset interval has passed, so let a probe through.
            return self._half_open_call(fn)

        try:
            result = fn()
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise

    def _half_open_call(self, fn):
        try:
            result = fn()
            # Probe succeeded: close the circuit.
            self.failure_count = 0
            self.opened_at = None
            return result
        except Exception:
            # Probe failed: reopen and restart the reset timer.
            self.opened_at = time.monotonic()
            raise

This code is intentionally small. Real breakers need rolling windows, error rates, probe limits, and metrics. The principle is the same: once a dependency is unhealthy, stop spending request threads on it.
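As an illustration of the "rolling windows and error rates" point, the consecutive-failure counter could be replaced with a tracker over the most recent calls. This sketch uses a count-based window for simplicity (a time-based window is another common choice); `RollingErrorRate` and its defaults are assumptions for illustration.

```python
from collections import deque


class RollingErrorRate:
    """Tracks the outcomes of the last `window` calls."""

    def __init__(self, window=20, min_calls=10, threshold=0.5):
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.min_calls = min_calls
        self.threshold = threshold

    def record(self, failed):
        # The deque drops the oldest outcome once the window is full.
        self.outcomes.append(bool(failed))

    def should_open(self):
        if len(self.outcomes) < self.min_calls:
            return False  # too few samples to judge
        return sum(self.outcomes) / len(self.outcomes) >= self.threshold
```

A breaker built on this opens when at least half of the recent calls failed, which tolerates an occasional error but still reacts to intermittent failure patterns that never reach a consecutive-failure threshold.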

Choosing the numbers

The patterns are well known. The harder part is choosing specific numbers and deciding what the product does when those numbers take effect:

  • What status code is returned?
  • What does the user see?
  • Which endpoint is protected first?
  • Which customer is allowed to burst?
  • Which dependency is optional?
  • Which write is safe to retry?

These are architecture decisions, not implementation details.

Bulkheads and concurrency limits

Rate limits control arrival. Concurrency limits control in-flight work. Both are needed.

If an endpoint can spend 2 seconds waiting on a dependency, even a modest request rate can consume every worker. A concurrency limit caps the number of concurrent calls.

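from threading import Semaphore


class ConcurrencyLimiter:
    def __init__(self, max_in_flight):
        self.semaphore = Semaphore(max_in_flight)

    def call(self, fn):
        # Non-blocking acquire: reject immediately instead of queueing.
        # ServiceUnavailable is the same exception the circuit breaker raises.
        if not self.semaphore.acquire(blocking=False):
            raise ServiceUnavailable("too many in-flight requests")

        try:
            return fn()
        finally:
            self.semaphore.release()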

Use separate pools for separate classes of work:

checkout pool:       small, protected, high priority
search pool:         medium, user-facing
export pool:         small, background
analytics pool:      best effort
admin report pool:   isolated

This is a bulkhead. One flooded endpoint should not consume the capacity the rest of the product depends on.
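The separate pools above can be sketched as one non-blocking semaphore per class of work. The `Bulkheads` class, the pool names, and the sizes below are illustrative assumptions, and `ServiceUnavailable` is redefined only to keep the example self-contained.

```python
import threading


class ServiceUnavailable(Exception):
    """Raised when a pool has no free slot."""


class Bulkheads:
    """One non-blocking semaphore per named pool."""

    def __init__(self, limits):
        self._pools = {
            name: threading.BoundedSemaphore(size) for name, size in limits.items()
        }

    def run(self, pool, fn):
        semaphore = self._pools[pool]
        if not semaphore.acquire(blocking=False):
            # This pool is full; other pools are unaffected.
            raise ServiceUnavailable(f"{pool} pool is full")
        try:
            return fn()
        finally:
            semaphore.release()


# Hypothetical sizes for illustration only.
bulkheads = Bulkheads({"checkout": 20, "search": 50, "export": 5})
```

When the export pool is saturated, `bulkheads.run("export", ...)` fails fast while `bulkheads.run("checkout", ...)` still has its full capacity.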

Graceful degradation

Not every dependency needs to take the API down with it.

  • If recommendations are unavailable, return the page without recommendations.
  • If the personalization service is slow, return the default ranking.
  • If a profile badge service fails, omit the badge.
  • If a search cluster is unhealthy, return cached results with a stale marker.

The goal is to preserve the core product path when optional pieces fail.

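def get_home_feed(user_id, deadline):
    # Required dependency: if this fails, the request fails.
    # It is still bounded by what is left of the request deadline.
    feed = db.get_recent_feed_items(
        user_id,
        timeout=min(0.120, deadline - time.monotonic()),
    )

    try:
        # Optional dependency: a failure here degrades the response.
        recommendations = recommender.get_items(
            user_id,
            timeout=min(0.080, deadline - time.monotonic()),
        )
    except (TimeoutError, ServiceUnavailable):
        recommendations = []

    return {
        "feed": feed,
        "recommendations": recommendations,
    }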

The design question to ask:

What can we remove and still serve something useful?
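The "cached results with a stale marker" case from the list above might look like the following. This is a sketch under assumptions: `search_with_fallback` is a hypothetical helper, `search_fn` stands in for the real search client, and the cache is a plain dict rather than a shared cache with expiry.

```python
class ServiceUnavailable(Exception):
    """Raised when a dependency is shedding load or its circuit is open."""


def search_with_fallback(query, search_fn, cache):
    """Return fresh results when possible, otherwise cached results marked stale."""
    try:
        results = search_fn(query)
    except (TimeoutError, ServiceUnavailable):
        if query in cache:
            # Degraded path: serve the last good answer and say so.
            return {"results": cache[query], "stale": True}
        raise  # nothing useful to serve, so let the caller decide
    cache[query] = results
    return {"results": results, "stale": False}
```

The explicit `stale` flag matters: it lets the client show a notice or refresh later instead of treating old data as current.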

Load shedding

Load shedding is rejecting work on purpose to keep the system available for the rest of the traffic.

Good load shedding rejects early:

  1. Before expensive authentication fanout if possible.
  2. Before opening database transactions.
  3. Before calling slow dependencies.
  4. Before accepting background jobs that cannot be processed.

Bad load shedding rejects after most of the work has been done.

When the system is overloaded, a fast 429 or 503 with Retry-After is more useful than a 30-second timeout that prompts the client to retry blindly.

HTTP/1.1 429 Too Many Requests
Retry-After: 3
Content-Type: application/json

{"error":"rate_limited","retry_after_seconds":3}
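A small helper can keep shed responses consistent across endpoints. The sketch below builds the status, headers, and body for the response shown above; the function name and the `kind` values are assumptions, and it returns plain Python values rather than tying itself to a web framework.

```python
import json


def overload_response(kind, retry_after_seconds):
    """Build (status, headers, body) for a request that is being shed.

    kind="rate_limited" -> 429: this caller exceeded a limit.
    kind="overloaded"   -> 503: the service as a whole is shedding load.
    """
    status = {"rate_limited": 429, "overloaded": 503}[kind]
    seconds = int(retry_after_seconds)
    headers = {
        "Retry-After": str(seconds),
        "Content-Type": "application/json",
    }
    body = json.dumps(
        {"error": kind, "retry_after_seconds": seconds},
        separators=(",", ":"),
    )
    return status, headers, body


status, headers, body = overload_response("rate_limited", 3)
```

Building this response involves no database or dependency calls, which is what makes it cheap enough to return before any expensive work starts.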

Status codes are control signals

Clients need clear, distinct signals to behave well.

| Code | Meaning in this context | Client behavior |
| --- | --- | --- |
| 408 | Request timed out before completion | Usually safe to retry only if operation is idempotent |
| 409 | Conflicting command or reused idempotency key with different body | Do not blindly retry |
| 425 | Too early for unsafe replay | Retry later if protocol supports it |
| 429 | Caller exceeded a limit | Wait according to Retry-After |
| 500 | Unexpected server error | Retry only with budget and idempotency |
| 502 | Bad upstream response | Retry cautiously |
| 503 | Service unavailable or circuit open | Retry later with backoff |
| 504 | Gateway timeout | Retry cautiously; original work may still be running |

Ambiguous errors lead to poorly behaved clients, which create extra load. Clear errors are part of reliability.
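On the client side, the table above can be turned into a small decision function. This is one conservative reading of it, not a definitive policy: `retry_decision` is a hypothetical name, "retry cautiously" is interpreted as "only if the operation is idempotent", and 425 is left as no retry because it depends on protocol support.

```python
def retry_decision(status, idempotent, retry_after=None):
    """Return (should_retry, wait_seconds).

    wait_seconds is the server's Retry-After value when one was given,
    or None, meaning the caller should use its own backoff with jitter.
    """
    if status in (429, 503):
        # The request was rejected, not partially processed: wait, then retry.
        return True, retry_after
    if status in (408, 500, 502, 504):
        # The original work may have run or may still be running.
        return idempotent, None
    # 409, 425, and anything else: do not retry blindly.
    return False, None
```

Any retry this function allows should still go through the attempt cap and the request deadline described earlier.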

The retry matrix

Before adding a retry, decide if it is safe:

| Operation | Retry? | Why |
| --- | --- | --- |
| GET /profile/123 | Usually yes | Read is naturally safe if bounded |
| POST /payments without idempotency key | No | Duplicate charge risk |
| POST /payments with idempotency key | Yes, carefully | Server can dedupe the command |
| Queue publish after DB commit without outbox | No simple retry is enough | Can create lost or duplicate effects |
| Search index update from event handler | Yes | Handler should be idempotent |
| Email send | Usually no direct retry without dedupe | Duplicate emails hurt trust |

Retry policy depends on operation semantics, not the HTTP method. PUT can be unsafe if implemented poorly, and POST can be safe with an idempotency key.
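To make the idempotency-key rows concrete, here is a minimal in-memory sketch of server-side deduplication. The class and exception names are hypothetical, and a real store would need to be shared across instances, durable, expire old keys, and handle two requests with the same key arriving at the same time.

```python
import hashlib


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (maps to HTTP 409)."""


class IdempotencyStore:
    """Remembers the response for each idempotency key."""

    def __init__(self):
        self._seen = {}  # key -> (body_hash, response)

    def execute(self, key, body, handler):
        body_hash = hashlib.sha256(body).hexdigest()
        if key in self._seen:
            stored_hash, response = self._seen[key]
            if stored_hash != body_hash:
                raise IdempotencyConflict(key)
            return response  # replay: the handler does not run again
        response = handler(body)
        self._seen[key] = (body_hash, response)
        return response
```

With this in place a client can safely retry `POST /payments` after a timeout: the second attempt with the same key and body returns the stored response instead of charging twice.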

Metrics that help

The metrics worth putting on the first dashboard for an API:

| Metric | Why it matters |
| --- | --- |
| Request rate by endpoint and caller | Shows who is creating load |
| In-flight requests | Exposes queueing before CPU does |
| p50, p95, p99 latency | Shows tail pain |
| Timeout count by dependency | Shows where budget is spent |
| Retry attempts by caller and dependency | Finds retry storms |
| Rate-limit rejects | Shows protected pressure |
| Circuit breaker state | Shows fast-fail behavior |
| Queue lag and age | Shows async backlog |
| Saturation by pool | Shows bulkhead pressure |
| Error rate by status code | Separates overload, bugs, dependency failure, and client misuse |

CPU alone is not enough. An API can be unresponsive at low CPU if every worker is waiting on I/O.
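In-flight requests are simple to measure even without a metrics library. The sketch below is an assumption about one way to do it: `InFlightGauge` is a hypothetical name, and in practice the values would be exported to whatever metrics system the service already uses.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Tracks how many requests are currently being handled."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    @contextmanager
    def track(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            # Runs on success and on error, so the gauge cannot leak upward.
            with self._lock:
                self.current -= 1
```

Wrapping each handler in `with gauge.track():` gives a number that rises as soon as a dependency slows down, usually well before CPU or memory graphs move.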

Practical sequence

A workable order to apply these controls:

  1. Define the core product path.
  2. Put deadlines around every downstream call.
  3. Limit in-flight work per route and dependency.
  4. Add caller-aware rate limits.
  5. Retry only idempotent operations, inside a deadline, with jitter.
  6. Use circuit breakers for dependencies that fail slowly.
  7. Decide which pieces can degrade.
  8. Return clear status codes that teach clients what to do.
  9. Watch in-flight work, tail latency, retries, and rejects.

A service mesh diagram is not the place to start. The starting questions are simpler:

How many requests are allowed in?
How long can they run?
How many can wait?
What happens when dependency X is slow?
What should the client do next?

If those answers are missing, the API does not yet have a reliability story.

Endpoint checklist

For any important API endpoint, fill in:

| Question | Example answer |
| --- | --- |
| Endpoint | POST /exports |
| Caller key | tenant_id and user_id |
| Cost | 20 tokens per export request |
| Burst limit | 10 requests per tenant |
| Sustained limit | 2 requests per minute per tenant |
| Public deadline | 1 second for enqueue response |
| Downstream timeouts | DB 150ms, queue publish 150ms |
| Retry policy | Queue publish retry 2x with jitter inside deadline |
| Idempotency | Required Idempotency-Key header |
| Concurrency limit | 5 active export starts per tenant |
| Degraded response | Return existing export if duplicate key is reused |
| Overload response | 429 with Retry-After |
| Dashboard | rate, rejects, in-flight, p99, queue lag, duplicate keys |

Then ask:

If this endpoint gets 10x traffic for five minutes, what protects the rest of the system?

If the only answer is autoscaling, the checklist is incomplete. Autoscaling adds capacity; it is not admission control.
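Some teams keep the checklist next to the code as data so that gaps can be caught in review or in a test. The sketch below is one hypothetical shape for that: the `EndpointPolicy` fields and the checks in `problems` are illustrative, and the timeout check assumes the downstream calls run one after another.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EndpointPolicy:
    """A subset of the checklist answers, as data. Field names are illustrative."""

    endpoint: str
    cost_tokens: int
    public_deadline_s: float
    downstream_timeouts_s: dict = field(default_factory=dict)
    max_in_flight: int = 1
    overload_status: int = 429

    def problems(self):
        """Return a list of gaps found; an empty list means none were found."""
        issues = []
        # Assumes sequential downstream calls, so their timeouts add up.
        if sum(self.downstream_timeouts_s.values()) > self.public_deadline_s:
            issues.append("downstream timeouts exceed the public deadline")
        if self.overload_status not in (429, 503):
            issues.append("overload response should be 429 or 503")
        if self.max_in_flight < 1:
            issues.append("concurrency limit must be at least 1")
        return issues


exports = EndpointPolicy(
    endpoint="POST /exports",
    cost_tokens=20,
    public_deadline_s=1.0,
    downstream_timeouts_s={"db": 0.150, "queue_publish": 0.150},
    max_in_flight=5,
)
```

Calling `exports.problems()` on the example values returns an empty list. A policy whose downstream timeouts add up to more than its public deadline would be flagged before it ships.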

Summary

  1. API failures are often queueing failures before they are code failures.
  2. Rate limits control which work enters the system.
  3. Concurrency limits control how much work can be in flight.
  4. Every downstream call needs a timeout and should respect the request deadline.
  5. Retries multiply load; use budgets, backoff, jitter, and idempotency.
  6. Circuit breakers fail fast when a dependency is already unhealthy.
  7. Graceful degradation preserves the core product path.
  8. Load shedding is better than letting every request time out slowly.
  9. Status codes and Retry-After headers are part of the reliability contract.
  10. Watch in-flight work, p99 latency, retry attempts, rate-limit rejects, and queue lag.
