postmortem

The four hours we spent chasing a missing log line.

One async consumer, no structured logs, a queue silently filling up — and the logging pattern we now require for every async job.

Platform team·May 02, 2026·2 min read

Summary

On May 2nd at 18:14 BRT, a Celery consumer on the analysis-worker service silently stopped acknowledging tasks. The queue grew from ~30 messages to ~38,000 before paging triggered. Total customer impact: 4h12m of delayed analyses, no data loss, no failed payments.

This document is the kind of postmortem we publish openly: what broke, why we didn't see it sooner, and what we changed.

Timeline

  • 18:14 — A code deploy lands a new pre-fetch step in the worker.
  • 18:17 — The first task hits the new code path, raises a KeyError, and the worker's outer try/except swallows it. The task is not re-queued.
  • 18:21 — Queue length crosses 1,000.
  • 20:09 — On-call gets paged on queue.depth > 5000.
  • 22:26 — Root cause identified, hotfix deployed, drain begins.

Why the page came late

We had a queue-depth alert. It triggered. But by the time it triggered, the queue had been growing for almost two hours. The right alert was not "queue deep" — it was "consumer not consuming."

We didn't have that alert because we couldn't write it: the worker had no log line at the boundary of the new code path. The exception was swallowed silently. There was nothing for Datadog to count.

What we changed

Three durable changes:

  1. Every async handler now emits a task.started and task.finished log line, structured (structlog), with the task name and id. This gives us a "consumer alive" metric we can alert on independently of queue depth.
  2. Outer try/except blocks in workers must log the exception and re-raise. We removed the silent swallow.
  3. A Horion rule now flags any handler that catches Exception without a log.exception(...) call.

What we did not change

We didn't blame the deploy or the engineer who wrote the code path. The bug was a one-character typo. The system failure was that a typo could silently take down a consumer for four hours.

If a typo can stall production for four hours, the system is wrong, not the typist.

The new rule plus the boundary logs mean the next typo gets caught at the first task, not at task 38,000.