The outage is over. The retry storm is starting.
The orchestrator sees a dependency recover. Scheduled jobs leave retry state. Workers refill the Trino queue before production traffic has caught up.
In data platforms, retries become load.
Retries are not bad. But retry: 3 is a spending policy hiding inside YAML.
The serious mistake is treating every retry solution as a nicer version of the same knob. Backoff, idempotency, retry budgets, dead-letter queues, Airflow policy hooks, queue isolation, and circuit breakers do not solve the same problem. They sit at different layers. Pick the wrong one and you get a system that is technically more “reliable” while it quietly burns the cluster down.
The problem is not one problem
A single failure raises at least five separate questions:
- Should this failure be retried at all?
- When should the next attempt happen?
- How many attempts can the platform afford?
- Can another attempt safely repeat the write?
- Where should failed work wait while healthy work continues?
Most teams answer all five with one integer.
That is how you get a retry storm.
AWS has the cleanest math: five layers with three retries each can create 243x load amplification at the dependency layer. Google’s SRE book gives the same shape from another angle: if the browser, frontend, and backend all retry, one user action can create 64 database attempts.
Those are not pathological edge cases. They are what happens when a local reliability default becomes a global traffic policy.
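The arithmetic behind both figures is plain exponentiation. A minimal sketch, assuming every layer multiplies the work of the layer below it by the same factor; the function name is hypothetical. The 243x figure corresponds to a factor of three across five layers, and the 64 figure to four attempts (one original plus three retries) across three layers.

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Worst-case calls reaching the bottom dependency for one top-level request.

    Each retrying layer multiplies the calls made by the layer below it,
    so the total is attempts_per_layer raised to the number of layers.
    """
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("attempts_per_layer must be >= 1 and layers >= 0")
    return attempts_per_layer ** layers


# Five layers, a factor of three per layer: the 243x figure.
print(worst_case_attempts(3, 5))  # 243
# Three layers, four attempts each (one original plus three retries).
print(worst_case_attempts(4, 3))  # 64
```

Nothing in either number requires a bug. Each layer only has to apply a sensible-looking local default.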
Compare the controls by what they actually control
Failure classification
- Best for: deciding whether a failure deserves another attempt.
- Protects against: compute waste on deterministic failures.
- Does not solve: overload after many valid retries wake up.
Backoff and jitter
- Best for: spreading retries over time.
- Protects against: thundering herds from synchronized clients.
- Does not solve: total retry volume.
Retry budget
- Best for: capping retry volume.
- Protects against: platform-wide amplification.
- Does not solve: unsafe duplicate writes.
Idempotency
- Best for: making repeated writes safe.
- Protects against: duplicate side effects and ambiguous commits.
- Does not solve: capacity pressure.
Queue isolation and priority
- Best for: separating recovery traffic from production.
- Protects against: noisy-neighbor collapse.
- Does not solve: bad retry decisions inside a lane.
Circuit breaker / admission control
- Best for: stopping doomed calls while a dependency is unhealthy.
- Protects against: self-inflicted overload during an active incident.
- Does not solve: recovery ordering after the breaker closes again.
Dead-letter queue / quarantine
- Best for: moving bad work out of the hot path.
- Protects against: queue starvation from poison records.
- Does not solve: root-cause repair.
Backfill / replay controller
- Best for: rerunning a bounded slice of history.
- Protects against: uncontrolled recovery after corruption or late fixes.
- Does not solve: online transient failures.
That comparison is the post. If your team can place its retry tooling on it, you can compare solutions honestly. If it cannot, you are probably comparing marketing names.
Failure classification: do not retry the deterministic
The first decision is not delay. It is eligibility.
Transient failures can deserve another attempt: queue pressure, temporary transport failure, cluster capacity trouble, short-lived broker unavailability, ambiguous commit state. Deterministic failures usually do not: bad SQL, invalid credentials, missing permissions, malformed input, schema mismatch, broken code.
This is where Airflow’s accepted AIP-105: Pluggable Retry Policies is interesting. The headline is the optional LLM retry policy, but the durable feature is more boring and more useful: a task-level policy can inspect the exception and choose RETRY, FAIL, or default behavior with an optional delay. That is the right abstraction. Not because LLMs are magic, but because retries=3 never knew the difference between “API timed out” and “API key is invalid.”
The same idea exists outside orchestration. Google Cloud Storage says retry safety depends on both response type and idempotency. Kafka producer retries are documented as a mechanism for transient errors, not a license to resend business commands forever.
Classification is the cheapest win because it prevents stupid spend. Deterministic failures should fail once, loudly, not three times politely.
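Here is what an eligibility check can look like in code. This is a minimal sketch, not Airflow's policy interface: the `Decision` enum, the `classify` function, and the three custom exception types are hypothetical stand-ins for whatever your clients actually raise.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    FAIL = "fail"


# Hypothetical exception types, for illustration only.
class RateLimited(Exception): ...
class InvalidCredentials(Exception): ...
class SchemaMismatch(Exception): ...


# Failures that time may fix.
TRANSIENT = (TimeoutError, ConnectionError, RateLimited)
# Failures that another attempt cannot fix.
DETERMINISTIC = (PermissionError, InvalidCredentials, SchemaMismatch, ValueError)


def classify(exc: BaseException) -> Decision:
    """Decide eligibility before any delay or budget is considered."""
    if isinstance(exc, DETERMINISTIC):
        return Decision.FAIL
    if isinstance(exc, TRANSIENT):
        return Decision.RETRY
    # Unknown failures fail fast in this sketch; that default is a choice.
    return Decision.FAIL
```

The default branch is the design decision worth arguing about. Failing fast on unknown errors keeps spend low, at the price of sometimes giving up on a failure that would have cleared.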
Backoff and jitter: spread the pain, do not pretend it disappeared
Backoff answers “when?”
If a service is briefly overloaded, retrying immediately is usually self-harm. Microsoft’s Retry Storm antipattern calls this out directly: immediate retries can create a thundering herd and stop the service from recovering. AWS recommends backoff with jitter because synchronized retry clients can all wake up on the same schedule.
Backoff is necessary. It is not sufficient.
If 10,000 expensive jobs are waiting to retry, exponential backoff makes the wave less sharp. It does not decide whether all 10,000 attempts are worth buying. It also does not make a duplicate write safe. Backoff is a timing control, not a correctness control and not a budget.
Use it. Just do not confuse “later” with “cheaper.”
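For reference, a minimal sketch of capped exponential backoff with full jitter. The function name, the one-second base, and the 300-second cap are assumptions chosen for illustration, not values recommended by any of the sources above.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based), with full jitter.

    The exponential term raises the ceiling; the uniform draw spreads
    clients across the whole interval so they do not wake up together.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt numbers cannot overflow.
    ceiling = min(cap, base * 2 ** min(attempt - 1, 32))
    return source.uniform(0.0, ceiling)
```

Note what the function does not take as input: the number of other jobs waiting, the cost of the attempt, or whether the write is safe to repeat. It only answers "when?"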
Retry budgets: make retries compete for a limited pool
Retry budgets answer “how much?”
Google explicitly recommends server-wide retry budgets for cascading-failure prevention. AWS SDK retry behavior also includes quota-style retry limiting in modern modes. The principle is simple: retries consume scarce recovery capacity, so they need a pool, not just a per-task counter.
This matters more in data platforms than in ordinary API clients because the unit of retry can be expensive. A cheap metadata request and a multi-terabyte transformation should not inherit the same retry philosophy. One more attempt may mean a few milliseconds of CPU, or it may mean another full scan, another shuffle, another table write, and another bill nobody wanted.
A useful retry budget has at least four dimensions:
- Cost class. Cheap API calls and heavyweight jobs get different pools.
- Failure class. Rate limits, queue pressure, and ambiguous commits do not share one bucket.
- Workload priority. P0 recovery work should not wait behind low-value scheduled noise.
- Time window. A daily budget is useless if the whole budget can burn in five minutes after a dependency recovers.
The hard part is political, not technical. A retry budget says some work will not get another attempt right now. That is the point. Unlimited retries are just unlimited spend with a reliability costume.
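One way to express a budget is a token bucket per pool. The sketch below is illustrative: `RetryBudget`, the pool keys, and the capacities are hypothetical, and a real platform would keep this state somewhere shared rather than in process memory.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RetryBudget:
    """Token bucket: at most `capacity` banked retries, refilled at `refill_per_sec`."""
    capacity: float
    refill_per_sec: float
    clock: Callable[[], float] = time.monotonic
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = self.clock()

    def try_spend(self, cost: float = 1.0) -> bool:
        """Return True and deduct tokens if the retry is affordable right now."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical pools: one bucket per (cost class, failure class).
budgets = {
    ("cheap_api", "transport"): RetryBudget(capacity=100, refill_per_sec=10),
    ("heavy_job", "capacity"): RetryBudget(capacity=5, refill_per_sec=5 / 3600),
}
```

The capacity is what addresses the time-window problem: it bounds how many retries can fire in a burst, however long the pool sat idle before the dependency recovered. Workload priority could be added as a third key or as a larger `cost` for low-priority work.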
Idempotency: make another attempt safe
Idempotency answers “can repeating this corrupt state?”
Stripe’s idempotent request design is the mainstream example: clients send an idempotency key, and repeated requests with the same key get the same result instead of creating duplicate side effects. AWS makes the same argument in Making retries safe with idempotent APIs: retrying a mutating operation assumes the operation can be repeated without creating a second business action.
In data systems, the equivalent is not just “use an idempotency key.” It is a write contract:
- a stable run ID or operation ID
- deterministic output paths or commit identifiers
- partition replacement instead of append-and-hope
- commit protocols that distinguish “failed before commit” from “commit may have happened”
- reconciliation for ambiguous writes
Idempotency is what turns retries from chaos into a tool. But it does not reduce load by itself. A perfectly idempotent retry can still DDoS your shared Trino cluster if every failed job wakes up together.
Idempotency is correctness insurance. Capacity management still needs a separate control.
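A minimal sketch of that write contract, using an in-memory stand-in for a table. `PartitionStore` and `operation_id` are hypothetical names, and a real store would need the partition swap and the commit marker to land atomically, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionStore:
    """In-memory stand-in for a table with partition-replace semantics."""
    partitions: dict = field(default_factory=dict)  # partition -> rows
    committed: dict = field(default_factory=dict)   # operation_id -> partition

    def commit(self, operation_id: str, partition: str, rows: list) -> str:
        """Replace one partition at most once per operation_id."""
        if operation_id in self.committed:
            # A retry after a lost response lands here and changes nothing.
            return "already_committed"
        self.partitions[partition] = list(rows)  # replace, never append
        self.committed[operation_id] = partition
        return "committed"


def operation_id(dag: str, task: str, logical_date: str) -> str:
    """Stable across attempts: derived from what the run is, not when it ran."""
    return f"{dag}/{task}/{logical_date}"
```

The commit marker is also what reconciliation reads: if a worker dies without knowing whether its write landed, the presence or absence of the operation ID answers the question.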
Queue isolation and priority: recovery traffic is still production traffic
Queueing answers “where does retry work wait?”
This was the practical Wix lesson behind the original post. A dependency fails. Scheduled jobs queue for retry. The dependency recovers. The queued jobs wake up together and hit Trino. Now recovery traffic fights production on the same cluster. The symptoms are visible: retry backlog size rises, query queue depth climbs, and recovery work consumes slots that interactive or critical jobs expected to use.
The retry was locally correct. The recovery wave was globally dumb.
The fix is not one magic knob. It is lane design:
- isolate ad-hoc and scheduled workloads
- give critical jobs a separate lane
- cap retry concurrency before the shared engine
- route by DAG, workload, tenant, or priority instead of one flat backlog
- autoscale behind admission control, not instead of it
The priority detail matters. During recovery, FIFO is often the wrong moral philosophy. If a few upstream jobs unblock a large dependency graph, run those first. The goal is not neatness. The goal is restoring the graph without waking the whole backlog at once.
Queue isolation is the difference between “the platform is recovering” and “the platform has started a second incident.”
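A lane can be sketched as a priority queue with its own concurrency cap. `RetryLane` and the `unblocks` score (how many downstream jobs a retry would release) are hypothetical; the point is only that admission is capped per lane and ordered by graph impact rather than arrival time.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class RetryLane:
    """One lane with its own concurrency cap, drained by priority, not FIFO."""
    max_running: int
    running: int = 0
    _heap: list = field(default_factory=list)
    _seq: int = 0

    def enqueue(self, job_id: str, unblocks: int) -> None:
        # heapq is a min-heap, so negate: jobs that unblock more run first.
        # The sequence number keeps ties in arrival order.
        heapq.heappush(self._heap, (-unblocks, self._seq, job_id))
        self._seq += 1

    def admit(self) -> list:
        """Pop jobs until the lane's concurrency cap is reached."""
        admitted = []
        while self._heap and self.running < self.max_running:
            _, _, job_id = heapq.heappop(self._heap)
            self.running += 1
            admitted.append(job_id)
        return admitted

    def finish(self) -> None:
        """Release one slot when an admitted job completes."""
        self.running = max(0, self.running - 1)
```

With `max_running=2` and three waiting jobs, the two with the highest `unblocks` score are admitted and the third waits for a slot, no matter which one failed first.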
Circuit breakers and admission control: the missing solution
The solution most retry posts underplay is the circuit breaker.
Retries ask “should I try again?” A circuit breaker asks “should I stop sending this class of calls for a while because the dependency is probably still unhealthy?”
That distinction matters. Backoff still schedules another attempt. A retry budget still allows attempts until the budget is gone. A circuit breaker changes the system state: closed, open, half-open. When open, it fails fast instead of feeding more load into a dependency that is unlikely to recover under pressure.
For data platforms, I would not use a circuit breaker as a blind kill switch around everything. That can make partial outages worse and hide useful work. Use it as admission control around retry traffic:
- let first attempts for high-priority work continue if the dependency can handle them
- stop low-priority retries while the dependency is clearly failing
- allow half-open probes before closing the breaker and reopening the lane
- combine breaker state with queue priority so recovery starts with unblockers, not the largest backlog
This is the missing middle between “retry with backoff” and “turn the system off.” It is the control that says: no, we are not buying another thousand doomed attempts just because the retry delay expired. Holy shit, that sentence should not be controversial, but many platforms still need to hear it.
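A minimal sketch of a breaker used as admission control for retries. `RetryBreaker`, its thresholds, and the two-flag `allow` signature are assumptions made for illustration; it is single-threaded and treats any recorded success as proof of recovery, which a production breaker would want to be stricter about.

```python
import time
from typing import Callable, Optional


class RetryBreaker:
    """Opens after `threshold` consecutive failures; half-open after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed
        self.probe_in_flight = False

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half_open"
        return "open"

    def allow(self, is_retry: bool, high_priority: bool) -> bool:
        state = self.state()
        if state == "closed":
            return True
        if state == "half_open" and not self.probe_in_flight:
            self.probe_in_flight = True  # admit exactly one probe
            return True
        # Open, or half-open with a probe pending: only high-priority
        # first attempts pass; every retry fails fast.
        return high_priority and not is_retry

    def record(self, success: bool) -> None:
        self.probe_in_flight = False
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or restart the cooldown
```

While the breaker is open, an expired retry delay is not enough to get a job admitted. That is the behavior backoff alone cannot provide.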
DLQs and quarantine: remove poison from the hot path
Some failures should not sit at the front of a queue.
Uber’s Kafka reprocessing write-up is useful here because it separates transient retry from non-blocking reprocessing and dead-letter handling. A poison message should not be re-consumed forever while healthy traffic waits behind it. Move it aside with enough context to repair it later: original payload, offset, error class, retry count, schema version, owner, and timestamps.
This is not just a streaming concern. Data pipelines need the same shape:
- invalid source records go to quarantine
- ambiguous writes go to manual reconciliation
- deterministic task failures fail fast with a useful reason
- replayable slices get bounded recovery plans
DLQs are not a garbage bin. They are a control plane for failed work. If nobody owns them, they become a second data lake with worse governance.
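The context fields listed above can be pinned down as a record type, so a quarantined entry is repairable without guessing where it came from. A minimal sketch; the field names and the `quarantine` helper are hypothetical, and the JSON string stands in for whatever topic or table actually holds quarantined work.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QuarantinedRecord:
    payload: str          # original payload, unmodified
    source: str           # e.g. topic and partition, or upstream table
    offset: int
    error_class: str
    error_message: str
    retry_count: int
    schema_version: str
    owner: str            # team responsible for triage
    first_failed_at: str  # ISO 8601
    quarantined_at: str   # ISO 8601, UTC


def quarantine(payload: str, source: str, offset: int, exc: BaseException,
               retry_count: int, schema_version: str, owner: str,
               first_failed_at: datetime) -> str:
    """Serialize a failed record with enough context to repair it later."""
    record = QuarantinedRecord(
        payload=payload,
        source=source,
        offset=offset,
        error_class=type(exc).__name__,
        error_message=str(exc),
        retry_count=retry_count,
        schema_version=schema_version,
        owner=owner,
        first_failed_at=first_failed_at.isoformat(),
        quarantined_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record))
```

Making `owner` a required field is deliberate: an entry nobody can be paged about is exactly how the second data lake starts.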
Backfill and replay controls: recovery is a workload
Retries handle online failure. Backfills handle history.
Do not mix them up.
Apache Airflow’s backfill controls make this distinction visible: reprocessing behavior, max active runs, and dry runs are separate knobs. Uber’s Kappa architecture write-up made the cost side explicit: multi-day replay can overwhelm shared infrastructure and downstream consumers unless the generated data is rate-limited.
This is where teams get bitten after silent corruption. The failure was not loud. Bad data propagated. Now the question is not “should we retry?” It is:
- which partitions are poisoned
- which downstream tables inherited the bad state
- what can be recomputed safely
- what order keeps consumers consistent
- how much recovery traffic the platform can afford today
At that point, the unit of recovery is no longer one task attempt. It is a bounded slice of history with ordering, throttling, and downstream consistency rules.
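A bounded slice can be written down as a plan before anything runs. The sketch below assumes daily partitions and a known upstream-first table order; `ReplayPlan` and its fields are hypothetical and not tied to Airflow's backfill interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class ReplayPlan:
    start: date               # first poisoned partition, inclusive
    end: date                 # last poisoned partition, inclusive
    max_active: int           # partitions allowed to run at once
    tables: Tuple[str, ...]   # upstream-first order

    def batches(self) -> Iterator[List[Tuple[str, date]]]:
        """Yield (table, partition) batches: upstream first, oldest first."""
        if self.max_active < 1 or self.end < self.start:
            raise ValueError("need max_active >= 1 and start <= end")
        span = (self.end - self.start).days + 1
        days = [self.start + timedelta(days=i) for i in range(span)]
        for table in self.tables:
            # A batch never mixes tables, so a downstream table starts
            # only after every upstream partition has been scheduled.
            for i in range(0, len(days), self.max_active):
                yield [(table, d) for d in days[i:i + self.max_active]]
```

Listing the batches without executing them gives the same effect as a dry run: the plan can be reviewed for scope and order before it consumes any capacity.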
How to choose between the solutions
Start with the failure mode, not the tool.
Malformed input, permission error, bad SQL, invalid config
- Prefer: fail-fast classification.
- Because: another attempt buys nothing.
Transient network or broker trouble
- Prefer: bounded retry with backoff and jitter.
- Because: time may fix it; synchronization makes it worse.
Rate limiting
- Prefer: delayed retry that honors the server signal.
- Because: immediate retry attacks the limiter.
Overload across a shared dependency
- Prefer: retry budget plus circuit breaker.
- Because: the system needs fewer attempts, not just later attempts.
Ambiguous write or lost response after a write
- Prefer: idempotency and reconciliation.
- Because: correctness is the risk, not timing.
Poison message or permanently bad record
- Prefer: DLQ or quarantine.
- Because: healthy work should keep moving.
Many failed jobs waking after recovery
- Prefer: queue isolation and priority.
- Because: recovery traffic must not fight production.
Silent data corruption found late
- Prefer: bounded backfill/replay controller.
- Because: the unit of recovery is a slice of history, not one task attempt.
The useful comparison question is: which scarce resource does this solution protect?
Backoff protects time. Retry budgets protect capacity. Idempotency protects state. Queue isolation protects neighbors. Circuit breakers protect dependencies. DLQs protect the hot path. Backfill controllers protect recovery from becoming an unbounded historical rewrite.
Once you say it that way, a lot of retry debates get less mystical.
Audit your retry stack
For each retry path, write down:
- Who starts the retry: client, SDK, scheduler, worker, broker, query engine, or human.
- What failure class it retries: transport, capacity, rate limit, ambiguous commit, bad input, permission, schema, code bug.
- What budget limits it: attempts, elapsed time, retry tokens, queue slots, cost class, or operator approval.
- Whether the write is idempotent: run ID, operation ID, deterministic output path, commit protocol, or reconciliation path.
- Where failed work waits: same queue, priority lane, DLQ, quarantine table, replay controller, or incident runbook.
- What proves recovery is healthy: retry backlog size, retry attempts by failure class, retry concurrency, query queue depth, production-vs-recovery slot share, duplicate-write reconciliation count, DLQ age, and backfill throttle rate.
The obvious objection is that not every system needs all of this. True. A small API client can often survive with bounded retries, backoff, and idempotency. The moment retries fan out through a scheduler, shared query engine, broker, or write path, the simple model stops being enough.
The real end of an outage
An outage is not over when the dependency comes back.
It is over when retry traffic is admitted slowly enough, critical jobs have caught up, poison work has left the hot path, ambiguous writes have been reconciled, and the platform has not created a second incident through recovery traffic.
Retries are reliability tools. They are also cost tools.
Treat them like both, or the retry policy will spend the money for you.