APIs that don't fall over
Rate limits, timeouts, retries, circuit breakers, and the controls that keep APIs available under pressure.
Most API outages start the same way. A dependency gets slow, requests pile up, clients retry, workers block, and connection pools fill. The database still has CPU. The app still has memory. Dashboards look half-normal until the queueing delay becomes the user-visible behavior.
Adding more servers at this point often makes things worse. More servers mean more concurrent calls to the slow dependency, more retries, and more database connections.
This post covers the controls that prevent this pattern.
The failure is usually queueing
An API does not only fail when code throws. It also fails when useful work waits behind work that will not complete in time.
Slow dependency calls occupy request workers. Those workers hold memory, sockets, and database connections. New requests arrive and wait. Clients time out and retry, and the retry arrives while the first request is still running. A small increase in latency can become an outage through this mechanism.
```text
normal:
  500 requests/sec
  50ms service time
  25 concurrent in-flight requests

dependency slows:
  500 requests/sec
  2s service time
  1000 concurrent in-flight requests
```
Same arrival rate, different service time, forty times more in-flight work.
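The figures above follow from Little's Law: average in-flight work equals arrival rate multiplied by average time in the system. A minimal sketch of that arithmetic (the function name `in_flight` is illustrative, not from any library):

```python
def in_flight(arrival_rate_per_sec: float, service_time_sec: float) -> float:
    """Little's Law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_sec * service_time_sec


normal = in_flight(500, 0.050)   # 50 ms service time
degraded = in_flight(500, 2.0)   # dependency slows to 2 s

print(f"normal:   {normal:.0f} in flight")
print(f"degraded: {degraded:.0f} in flight ({degraded / normal:.0f}x)")
```

The same formula works in reverse for capacity planning: a worker pool of a given size divided by the expected service time gives a rough ceiling on the request rate it can sustain.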
This is why adding servers does not always help. More servers increase the number of concurrent calls hitting the dependency that is already slow.
Rate limiting is admission control
Rate limiting decides which requests enter the system. It answers:
Should this request be allowed to enter the system right now?
A simple token bucket gives each caller a burst allowance and a refill rate.
```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = time.monotonic()

    def consume(self, cost=1):
        # Refill for the time elapsed since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = max(0, now - self.updated_at)
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_per_second,
        )
        self.updated_at = now

        # Reject without spending if the bucket cannot cover the cost.
        if self.tokens < cost:
            return False

        self.tokens -= cost
        return True
```
This is a teaching version. In production, the bucket needs to be shared across app instances, which usually means Redis with an atomic script, a gateway, or an edge proxy.
The shape of the configuration:
```text
capacity = burst size
refill   = sustained rate
cost     = request weight
```
A cheap read might cost 1 token. A report export might cost 20. A login attempt might use a separate security limit. A tenant-level write limit can protect the database from one large customer.
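One way to combine per-caller buckets with per-route weights is sketched below. It is an in-memory, single-process illustration only; `ROUTE_COSTS`, `KeyedTokenBuckets`, and `admit` are hypothetical names, and the weights are the example values from the paragraph above. The `clock` parameter is there so the behavior can be tested without sleeping.

```python
import time

# Hypothetical per-route weights; unknown routes default to 1 token.
ROUTE_COSTS = {"GET /status": 1, "POST /exports": 20}


class KeyedTokenBuckets:
    """One in-memory token bucket per caller key (single process only)."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.state = {}  # key -> (tokens, updated_at)

    def consume(self, key, cost=1):
        now = self.clock()
        # A caller seen for the first time starts with a full bucket.
        tokens, updated_at = self.state.get(key, (self.capacity, now))
        tokens = min(
            self.capacity,
            tokens + (now - updated_at) * self.refill_per_second,
        )
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self.state[key] = (tokens, now)
        return allowed


def admit(buckets, tenant_id, route):
    """Charge the tenant's bucket according to the route's weight."""
    return buckets.consume(tenant_id, ROUTE_COSTS.get(route, 1))
```

With a capacity of 40, a tenant can start two exports back to back and is then limited to the refill rate, while other tenants are unaffected. This sketch never evicts idle keys, which a long-running service would need to do.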
Pick the limit key carefully
The limit key decides who is isolated. Common keys:
| Key | Protects against | Risk |
|---|---|---|
| IP address | Anonymous abuse and simple scraping | NATs and mobile networks can group many users |
| User id | One user overwhelming their own quota | Does not protect multi-user tenants |
| API key | External integrations and partner apps | One customer can create many keys unless controlled |
| Tenant id | One organization dominating shared resources | Large tenants may need purchased capacity |
| Endpoint | One expensive route overwhelming internals | Does not distinguish good callers from bad callers |
| Global | Total system overload | Can let noisy callers crowd out important traffic |
A common approach is layered limits:
```text
global limit
  tenant limit
    user/API key limit
      endpoint cost limit
```
Requests should not all cost the same. A GET /status request and a POST /exports request should not spend the same budget.
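A layered check can be sketched as "every layer must be able to afford the request, and tokens are spent from all layers or none". The `Bucket` class and `admit` function below are illustrative assumptions, not a specific library's API, and the sketch is not thread-safe.

```python
import time


class Bucket:
    """Small token bucket that separates checking from spending."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def can_afford(self, cost):
        now = self.clock()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.updated_at) * self.refill_per_second,
        )
        self.updated_at = now
        return self.tokens >= cost

    def spend(self, cost):
        self.tokens -= cost


def admit(layers, cost):
    """Admit only if every layer can afford the cost.

    `layers` is a list of (name, Bucket) pairs, outermost first.
    Returns (allowed, name_of_rejecting_layer_or_None).
    """
    for name, bucket in layers:
        if not bucket.can_afford(cost):
            return False, name  # nothing has been spent yet
    for _, bucket in layers:
        bucket.spend(cost)
    return True, None
```

Checking before spending means a request rejected at the user layer does not drain the tenant or global budget. Returning the name of the rejecting layer also makes it possible to report which limit was hit.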
Fixed window, sliding window, token bucket, leaky bucket
The behavior matters more than the name.
| Strategy | Behavior | Use it when |
|---|---|---|
| Fixed window | 100 requests per minute resets on the minute | Simple counters and coarse limits |
| Sliding window | Counts recent requests across a moving interval | Fairer user-facing quotas |
| Token bucket | Allows bursts up to capacity, then sustained refill | Most API admission control |
| Leaky bucket | Smooths output at a steady rate | Protecting a downstream with strict throughput |
Fixed windows are simple but have edge bursts. A caller can send 100 requests at 12:00:59 and 100 more at 12:01:00.
Token buckets are a reasonable default because real traffic is bursty. They allow short bursts without allowing unbounded pressure.
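The fixed-window edge burst can be reproduced in a few lines. The sketch below compares a fixed-window counter with a sliding-window log on the same traffic; both classes are simplified illustrations that take the current time as an argument so the example is deterministic.

```python
from collections import deque


class FixedWindow:
    """Counter that resets at each window boundary."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window_seconds)
        if window_id != self.window_id:
            self.window_id, self.count = window_id, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLog:
    """Keeps a timestamp per admitted request inside the moving interval."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


def admitted(limiter, times):
    return sum(limiter.allow(t) for t in times)


# 100 requests in the last second of one minute, 100 in the first second of the next.
burst = [59.0] * 100 + [60.0] * 100
fixed_count = admitted(FixedWindow(100, 60), burst)
sliding_count = admitted(SlidingWindowLog(100, 60), burst)
print(fixed_count, sliding_count)  # 200 100
```

The sliding log costs memory proportional to the limit per caller, which is one reason counters and token buckets are more common for high limits.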
Timeouts are part of the contract
Every network call needs a timeout. This includes the public API, internal HTTP calls, database queries, Redis calls, search calls, queue publish calls, and object storage calls. Anything that waits on another system needs a deadline.
Without a timeout, the dependency decides how long the worker lives.
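```python
import time


# `http` stands in for an HTTP client whose get() accepts a timeout in seconds.
def get_profile(user_id, request_deadline):
    remaining = request_deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("request deadline exceeded")

    # Never wait longer than the per-call cap or the time left on the request.
    return http.get(
        f"https://profiles.internal/users/{user_id}",
        timeout=min(0.150, remaining),
    )
```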
A timeout budget makes the constraint explicit:
```text
public request deadline: 800ms
  auth:                   50ms
  database:              150ms
  profile service:       150ms
  recommendation call:   200ms
  rendering/marshal:      50ms
  slack:                 200ms
```
If the outer request times out at 800ms, an inner call waiting 2 seconds cannot help the user.
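One way to keep inner calls inside the outer deadline is to pass a single deadline object through the request and derive each call's timeout from it. The `Deadline` class below is a hypothetical helper, sketched under the assumption that every downstream client accepts a timeout in seconds.

```python
import time


class Deadline:
    """Absolute deadline shared by every call made for one request."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return self.expires_at - self.clock()

    def timeout_for(self, cap_seconds):
        """Per-call timeout: the call's own cap, or less if time is nearly up."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("request deadline exceeded")
        return min(cap_seconds, remaining)


deadline = Deadline(0.800)                  # public request deadline: 800 ms
auth_timeout = deadline.timeout_for(0.050)  # capped by the auth budget
db_timeout = deadline.timeout_for(0.150)    # capped by the database budget
```

An absolute deadline is used rather than a duration so that time already spent in earlier calls is subtracted automatically.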
Retries are load multipliers
Retries are useful when the failure is temporary and the operation is safe to repeat. Otherwise they add load without adding reliability.
One retry at three layers multiplies load:
```text
client:  up to 2 attempts
gateway: up to 2 attempts
service: up to 2 attempts

one user request can become 2 x 2 x 2 = 8 downstream attempts
```
This is called a retry storm.
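The worst-case amplification is the product of the attempt counts at each layer. A small sketch of that arithmetic (the function name is illustrative):

```python
from math import prod


def worst_case_attempts(attempts_per_layer):
    """Upper bound on downstream attempts caused by one user request."""
    return prod(attempts_per_layer)


two_each = worst_case_attempts([2, 2, 2])    # one retry at each of three layers
three_each = worst_case_attempts([3, 3, 3])  # two retries at each layer
print(two_each, three_each)  # 8 27
```

Because the growth is multiplicative, retrying at a single layer and disabling retries at the others is often the simplest way to bound it.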
A per-call retry budget (bounded attempts inside the request deadline, with jittered backoff) contains it:
```python
import random
import time


class TemporaryError(Exception):
    """Placeholder for errors worth retrying, such as a dropped connection."""


def call_with_retries(fn, deadline, max_attempts=3, base_delay=0.025):
    last_error = None

    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded") from last_error

        try:
            return fn(timeout=remaining)
        except TemporaryError as error:
            last_error = error
            if attempt == max_attempts - 1:
                break

            # Exponential backoff capped at 250ms, with full jitter.
            delay = min(base_delay * (2 ** attempt), 0.250)
            jitter = random.uniform(0, delay)
            # Never sleep past the deadline.
            time.sleep(min(jitter, max(0, deadline - time.monotonic())))

    raise last_error
```
Synchronized retries are themselves a burst. Backoff spreads retries out over time, and jitter makes that spread uneven so the dependency has room to recover.
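Per-call limits can be complemented with a budget shared across calls, so that retries stay a small fraction of normal traffic even when many requests fail at once. The sketch below is one possible shape, using integer credits to avoid floating-point drift; `RetryBudget` and its parameters are assumptions for illustration, and it is not thread-safe.

```python
class RetryBudget:
    """Allow roughly one retry per `requests_per_retry` first attempts."""

    def __init__(self, requests_per_retry=10, max_banked_retries=10):
        self.cost = requests_per_retry
        # Cap the bank so a long quiet period cannot fund a retry storm later.
        self.max_credits = requests_per_retry * max_banked_retries
        self.credits = 0

    def record_request(self):
        """Call once per first attempt; each one earns one credit."""
        self.credits = min(self.max_credits, self.credits + 1)

    def try_spend_retry(self):
        """Call before each retry; returns False when the budget is exhausted."""
        if self.credits < self.cost:
            return False
        self.credits -= self.cost
        return True
```

A caller would invoke `record_request()` on every first attempt and skip the retry whenever `try_spend_retry()` returns False. With the defaults, retries add at most about 10% extra load to the dependency over time.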
Circuit breakers fail fast
A circuit breaker stops calling a dependency that is already failing. It has three states:
```text
closed:    calls flow normally
open:      calls fail fast
half-open: a few probe calls test recovery
```
Simplified:
```python
import time


class ServiceUnavailable(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after_seconds):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed < self.reset_after_seconds:
                # Open: fail fast without touching the dependency.
                raise ServiceUnavailable("circuit open")
            # Half-open: the reset interval has passed, so probe.
            return self._half_open_call(fn)

        try:
            result = fn()
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise

    def _half_open_call(self, fn):
        try:
            result = fn()
            # Probe succeeded: close the circuit.
            self.failure_count = 0
            self.opened_at = None
            return result
        except Exception:
            # Probe failed: restart the open interval.
            self.opened_at = time.monotonic()
            raise
```
This code is intentionally small. Real breakers need rolling windows, error rates, probe limits, and metrics. The principle is the same: once a dependency is unhealthy, stop spending request threads on it.
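As one illustration of the "rolling windows, error rates" point, the consecutive-failure counter could be replaced by a failure ratio over the most recent calls. The class below is a sketch under that assumption; it uses a count-based window rather than a time-based one, and the names and default thresholds are hypothetical.

```python
from collections import deque


class RollingErrorRate:
    """Tracks the failure ratio over the most recent `window_size` calls."""

    def __init__(self, window_size=20, min_calls=10, trip_ratio=0.5):
        self.outcomes = deque(maxlen=window_size)  # True means the call failed
        self.min_calls = min_calls
        self.trip_ratio = trip_ratio

    def record(self, failed):
        # The deque discards the oldest outcome once the window is full.
        self.outcomes.append(bool(failed))

    def should_open(self):
        if len(self.outcomes) < self.min_calls:
            return False  # too few samples to judge
        return sum(self.outcomes) / len(self.outcomes) >= self.trip_ratio
```

The minimum sample size keeps a single failure on a quiet route from opening the circuit, and the ratio tolerates occasional errors that a consecutive-failure counter would either ignore or overreact to.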
Choosing the numbers
The patterns are well known. The harder part is choosing specific numbers and deciding what the product does when those numbers take effect:
- What status code is returned?
- What does the user see?
- Which endpoint is protected first?
- Which customer is allowed to burst?
- Which dependency is optional?
- Which write is safe to retry?
These are architecture decisions, not implementation details.
Bulkheads and concurrency limits
Rate limits control arrival. Concurrency limits control in-flight work. Both are needed.
If an endpoint can spend 2 seconds waiting on a dependency, even a modest request rate can consume every worker. A concurrency limit caps the number of concurrent calls.
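```python
from threading import Semaphore


class ServiceUnavailable(Exception):
    """Raised when a call is rejected because the limit is reached."""


class ConcurrencyLimiter:
    def __init__(self, max_in_flight):
        self.semaphore = Semaphore(max_in_flight)

    def call(self, fn):
        # Reject immediately instead of queueing behind in-flight work.
        if not self.semaphore.acquire(blocking=False):
            raise ServiceUnavailable("too many in-flight requests")

        try:
            return fn()
        finally:
            self.semaphore.release()
```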
Use separate pools for separate classes of work:
```text
checkout pool:     small, protected, high priority
search pool:       medium, user-facing
export pool:       small, background
analytics pool:    best effort
admin report pool: isolated
```
This is a bulkhead. One flooded endpoint should not consume the capacity the rest of the product depends on.
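A bulkhead can be sketched as one concurrency limit per named pool, so that exhausting one pool leaves the others untouched. The `Bulkheads` class, the `PoolFull` exception, and the pool sizes in the usage note are illustrative assumptions.

```python
from threading import BoundedSemaphore


class PoolFull(Exception):
    """Raised when a pool has no free slots."""


class Bulkheads:
    """One independent concurrency limit per named pool."""

    def __init__(self, limits):
        self.pools = {name: BoundedSemaphore(n) for name, n in limits.items()}

    def run(self, pool, fn):
        semaphore = self.pools[pool]
        # Fail fast rather than wait: a full pool must not block other work.
        if not semaphore.acquire(blocking=False):
            raise PoolFull(pool)
        try:
            return fn()
        finally:
            semaphore.release()
```

A service might construct `Bulkheads({"checkout": 50, "search": 100, "export": 5})` and route each handler through `run` with its pool name. A flood of export requests then raises `PoolFull` for exports only, while checkout keeps its full capacity.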
Graceful degradation
Not every dependency needs to take the API down with it.
- If recommendations are unavailable, return the page without recommendations.
- If the personalization service is slow, return the default ranking.
- If a profile badge service fails, omit the badge.
- If a search cluster is unhealthy, return cached results with a stale marker.
The goal is to preserve the core product path when optional pieces fail.
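```python
import time


# `db` and `recommender` stand in for clients that accept a timeout in seconds.
def get_home_feed(user_id, deadline):
    # Core path: the feed itself. A failure here fails the request.
    feed = db.get_recent_feed_items(user_id, timeout=0.120)

    # Optional path: skip it if the deadline is spent, and degrade on failure.
    recommendations = []
    remaining = deadline - time.monotonic()
    if remaining > 0:
        try:
            recommendations = recommender.get_items(
                user_id,
                timeout=min(0.080, remaining),
            )
        except (TimeoutError, ServiceUnavailable):
            pass  # keep the empty default

    return {
        "feed": feed,
        "recommendations": recommendations,
    }
```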
The design question to ask:
What can we remove and still serve something useful?
Load shedding
Load shedding is rejecting work on purpose to keep the system available for the rest of the traffic.
Good load shedding rejects early:
- Before expensive authentication fanout if possible.
- Before opening database transactions.
- Before calling slow dependencies.
- Before accepting background jobs that cannot be processed.
Bad load shedding rejects after most of the work has been done.
When the system is overloaded, a fast 429 or 503 with Retry-After is more useful than a 30-second timeout that prompts the client to retry blindly.
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 3
Content-Type: application/json

{"error":"rate_limited","retry_after_seconds":3}
```
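Early rejection can be sketched as a wrapper that checks capacity before the handler runs and returns a fast 503 with Retry-After when the system is full. `LoadShedder` is a hypothetical helper, and the `(status, headers, body)` tuple is an assumed response shape rather than any framework's API.

```python
import json
import threading


class LoadShedder:
    """Rejects requests before any work is done once in-flight work hits a cap."""

    def __init__(self, max_in_flight, retry_after_seconds=3):
        self.max_in_flight = max_in_flight
        self.retry_after_seconds = retry_after_seconds
        self.in_flight = 0
        self.lock = threading.Lock()

    def handle(self, handler):
        """Run handler() if there is capacity; otherwise reject immediately."""
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return self._rejection()
            self.in_flight += 1
        try:
            return handler()
        finally:
            with self.lock:
                self.in_flight -= 1

    def _rejection(self):
        body = json.dumps({
            "error": "overloaded",
            "retry_after_seconds": self.retry_after_seconds,
        })
        headers = {
            "Retry-After": str(self.retry_after_seconds),
            "Content-Type": "application/json",
        }
        return 503, headers, body
```

The check happens before the handler is invoked, so a rejected request costs a lock acquisition and a small JSON body rather than a database transaction or a dependency call.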
Status codes are control signals
Clients need clear, distinct signals to behave well.
| Code | Meaning in this context | Client behavior |
|---|---|---|
| 408 | Request timed out before completion | Usually safe to retry only if operation is idempotent |
| 409 | Conflicting command or reused idempotency key with different body | Do not blindly retry |
| 425 | Too early for unsafe replay | Retry later if protocol supports it |
| 429 | Caller exceeded a limit | Wait according to Retry-After |
| 500 | Unexpected server error | Retry only with budget and idempotency |
| 502 | Bad upstream response | Retry cautiously |
| 503 | Service unavailable or circuit open | Retry later with backoff |
| 504 | Gateway timeout | Retry cautiously; original work may still be running |
Ambiguous errors lead to poorly behaved clients, which create extra load. Clear errors are part of reliability.
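On the client side, the table can be turned into a small decision function. The sketch below is one reading of it, with assumptions stated in the comments: "retry cautiously" is interpreted as "retry only if the operation is idempotent", and the one-second fallback wait for a 429 without Retry-After is an arbitrary illustrative default.

```python
# Assumption: "cautious" retries are allowed only for idempotent operations.
RETRY_IF_IDEMPOTENT = {408, 500, 502, 504}
RETRY_LATER = {425, 503}


def next_action(status, idempotent, retry_after_seconds=None):
    """Return (action, wait_seconds) for a failed response.

    action is "wait" (honor the server's delay), "backoff" (retry with the
    client's own backoff and budget), or "stop" (do not retry).
    """
    if status == 429:
        # Fallback of 1.0s is an illustrative default, not a standard value.
        wait = retry_after_seconds if retry_after_seconds is not None else 1.0
        return ("wait", wait)
    if status in RETRY_LATER:
        return ("backoff", retry_after_seconds)
    if status in RETRY_IF_IDEMPOTENT and idempotent:
        return ("backoff", None)
    return ("stop", None)
```

Anything not listed, including 409, falls through to "stop", so an unrecognized error never triggers an automatic retry.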
The retry matrix
Before adding a retry, decide if it is safe:
| Operation | Retry? | Why |
|---|---|---|
| GET /profile/123 | Usually yes | Read is naturally safe if bounded |
| POST /payments without idempotency key | No | Duplicate charge risk |
| POST /payments with idempotency key | Yes, carefully | Server can dedupe the command |
| Queue publish after DB commit without outbox | No simple retry is enough | Can create lost or duplicate effects |
| Search index update from event handler | Yes | Handler should be idempotent |
| Email send | Usually no direct retry without dedupe | Duplicate emails hurt trust |
Retry policy depends on operation semantics, not the HTTP method. PUT can be unsafe if implemented poorly, and POST can be safe with an idempotency key.
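Server-side deduplication with an idempotency key can be sketched as follows. This is an in-memory illustration with hypothetical names; it ignores concurrent first requests and key expiry, and a real system would typically store the key in the same transaction as the side effect.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (maps to a 409)."""


class IdempotentHandler:
    """Runs the wrapped handler at most once per idempotency key."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = {}  # key -> (body_fingerprint, stored_response)

    def handle(self, idempotency_key, body):
        # Fingerprint the body so a reused key with different content is caught.
        fingerprint = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()

        if idempotency_key in self.seen:
            stored_fingerprint, response = self.seen[idempotency_key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(idempotency_key)
            return response  # replay: no second side effect

        response = self.handler(body)
        self.seen[idempotency_key] = (fingerprint, response)
        return response
```

A retried `POST /payments` with the same key and body then returns the stored response instead of charging again, which is what makes the retry safe.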
Metrics that help
The metrics worth putting on the first dashboard for an API:
| Metric | Why it matters |
|---|---|
| Request rate by endpoint and caller | Shows who is creating load |
| In-flight requests | Exposes queueing before CPU does |
| p50, p95, p99 latency | Shows tail pain |
| Timeout count by dependency | Shows where budget is spent |
| Retry attempts by caller and dependency | Finds retry storms |
| Rate-limit rejects | Shows protected pressure |
| Circuit breaker state | Shows fast-fail behavior |
| Queue lag and age | Shows async backlog |
| Saturation by pool | Shows bulkhead pressure |
| Error rate by status code | Separates overload, bugs, dependency failure, and client misuse |
CPU alone is not enough. An API can be unresponsive at low CPU if every worker is waiting on I/O.
Practical sequence
A workable order to apply these controls:
1. Define the core product path.
2. Put deadlines around every downstream call.
3. Limit in-flight work per route and dependency.
4. Add caller-aware rate limits.
5. Retry only idempotent operations, inside a deadline, with jitter.
6. Use circuit breakers for dependencies that fail slowly.
7. Decide which pieces can degrade.
8. Return clear status codes that teach clients what to do.
9. Watch in-flight work, tail latency, retries, and rejects.
A service mesh diagram is not the place to start. The starting questions are simpler:
How many requests are allowed in?
How long can they run?
How many can wait?
What happens when dependency X is slow?
What should the client do next?
If those answers are missing, the API does not yet have a reliability story.
Endpoint checklist
For any important API endpoint, fill in:
| Question | Example answer |
|---|---|
| Endpoint | POST /exports |
| Caller key | tenant_id and user_id |
| Cost | 20 tokens per export request |
| Burst limit | 10 requests per tenant |
| Sustained limit | 2 requests per minute per tenant |
| Public deadline | 1 second for enqueue response |
| Downstream timeouts | DB 150ms, queue publish 150ms |
| Retry policy | Queue publish retry 2x with jitter inside deadline |
| Idempotency | Required Idempotency-Key header |
| Concurrency limit | 5 active export starts per tenant |
| Degraded response | Return existing export if duplicate key is reused |
| Overload response | 429 with Retry-After |
| Dashboard | rate, rejects, in-flight, p99, queue lag, duplicate keys |
Then ask:
If this endpoint gets 10x traffic for five minutes, what protects the rest of the system?
If the only answer is autoscaling, the checklist is incomplete. Autoscaling adds capacity; it is not admission control.
Summary
- API failures are often queueing failures before they are code failures.
- Rate limits control which work enters the system.
- Concurrency limits control how much work can be in flight.
- Every downstream call needs a timeout and should respect the request deadline.
- Retries multiply load; use budgets, backoff, jitter, and idempotency.
- Circuit breakers fail fast when a dependency is already unhealthy.
- Graceful degradation preserves the core product path.
- Load shedding is better than letting every request time out slowly.
- Status codes and Retry-After headers are part of the reliability contract.
- Watch in-flight work, p99 latency, retry attempts, rate-limit rejects, and queue lag.