Java Version: Dead Letter Queues That Actually Work
Most teams’ Dead Letter Queues end up as the place where messages go to die unnoticed. This is a guide to building a DLQ that does what it’s supposed to do.
What a DLQ is for
A Dead Letter Queue is a place to send messages that cannot be processed. The key word is “cannot.” A DLQ is not for transient failures.
Transient failures (network blip, downstream service temporarily down): retry with exponential backoff. Don’t DLQ.
Permanent failures (malformed payload, business rule violation, missing required field): DLQ immediately. No retries.
The first mistake teams make is conflating these. A consumer that DLQs on every error pollutes the DLQ with transient garbage — making the actual permanent-failure messages impossible to find.
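One way to keep the two categories apart is to classify the exception before deciding what happens to the message. The sketch below uses only the JDK. The class name FailureClassifier, the choice of which exception types count as transient, and the backoff parameters are all assumptions made for illustration; a real consumer would tune the list to its own downstream dependencies.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: decides whether a failure is worth retrying.
final class FailureClassifier {

    enum Disposition { RETRY, DEAD_LETTER }

    // Assumption: I/O problems and timeouts are transient; everything else
    // (validation errors, malformed payloads, ...) is treated as permanent.
    static Disposition classify(Throwable error) {
        if (error instanceof IOException || error instanceof TimeoutException) {
            return Disposition.RETRY;
        }
        return Disposition.DEAD_LETTER;
    }

    // Exponential backoff: baseMillis * 2^attempt, capped at maxMillis.
    // attempt is zero-based (0 = first retry).
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Cap the shift so the delay stays within long range for realistic base values.
        int shift = Math.min(attempt, 30);
        long delay = baseMillis * (1L << shift);
        return Math.min(delay, maxMillis);
    }
}
```

With a base of 100 ms and a cap of 10 seconds, the delays run 100, 200, 400, 800 ms and so on until they hit the cap. Anything classified as DEAD_LETTER skips this path entirely.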
What a DLQ message must preserve
When a message is DLQ’d, you need to be able to answer three questions:
What was the original message?
Why did it fail?
Can I safely reprocess it after fixing the bug?
To answer those, the DLQ message must contain:
{
"original_message": { ... },
"original_topic": "user-events",
"original_partition": 4,
"original_offset": 8472938,
"consumer_id": "user-event-processor-7f3a",
"failure_reason": "ValidationException: required field 'user_id' missing",
"stack_trace": "...",
"timestamp": "2025-12-09T14:32:18Z",
"retry_count": 0
}
The original_topic, original_partition, and original_offset fields are the key ones. They let you trace back to exactly where in the source stream the message came from, which helps diagnose a producer bug and not just the consumer’s failure.
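On the Java side, the envelope above could be modeled as a small immutable type. This is a sketch under assumptions: the record name DlqEnvelope and the factory method fromFailure are hypothetical, the payload is kept as a plain string, and serialization to JSON is left out.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.time.Instant;

// Hypothetical envelope mirroring the JSON fields shown above.
record DlqEnvelope(
        String originalMessage,
        String originalTopic,
        int originalPartition,
        long originalOffset,
        String consumerId,
        String failureReason,
        String stackTrace,
        Instant timestamp,
        int retryCount) {

    // Builds an envelope from the failed message's coordinates and the exception.
    static DlqEnvelope fromFailure(String payload, String topic, int partition,
                                   long offset, String consumerId,
                                   Throwable error, int retryCount, Instant now) {
        // Capture the full stack trace as text so it survives serialization.
        StringWriter trace = new StringWriter();
        error.printStackTrace(new PrintWriter(trace, true));
        String reason = error.getClass().getSimpleName() + ": " + error.getMessage();
        return new DlqEnvelope(payload, topic, partition, offset, consumerId,
                reason, trace.toString(), now, retryCount);
    }
}
```

Passing the clock value in as a parameter, instead of calling Instant.now() inside the factory, keeps the envelope easy to test.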
Operating a DLQ
A DLQ that nobody looks at is worse than no DLQ — it gives the team false confidence that “we have one.”
Alert on growth, not absolute count
A DLQ holding 1000 messages from 6 months ago that hasn’t grown since is fine. A DLQ that gained 50 messages in the last hour is a fire.
Alert on rate of arrival, not on size:
10 messages/minute → page on-call
100 messages in last hour → wake up the team
DLQ has any messages older than 7 days → ticket the team
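The thresholds above can be expressed as a small rate check. The sketch below is an in-memory illustration only: the class DlqArrivalMonitor and its Action enum are hypothetical, it assumes one recordArrival call per message written to the DLQ, and it never removes entries. In production these numbers would more likely come from your metrics system than from a list held in the consumer.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory tracker of DLQ arrival times.
final class DlqArrivalMonitor {

    enum Action { NONE, TICKET, NOTIFY_TEAM, PAGE_ON_CALL }

    private final Deque<Instant> arrivals = new ArrayDeque<>();

    // Call once per message written to the DLQ.
    void recordArrival(Instant when) {
        arrivals.addLast(when);
    }

    // Number of arrivals in the window (now - window, now].
    long countWithin(Duration window, Instant now) {
        Instant cutoff = now.minus(window);
        return arrivals.stream()
                .filter(t -> t.isAfter(cutoff) && !t.isAfter(now))
                .count();
    }

    // Applies the thresholds from the list above, most urgent first.
    Action evaluate(Instant now) {
        if (countWithin(Duration.ofMinutes(1), now) >= 10) {
            return Action.PAGE_ON_CALL;
        }
        if (countWithin(Duration.ofHours(1), now) >= 100) {
            return Action.NOTIFY_TEAM;
        }
        Instant staleCutoff = now.minus(Duration.ofDays(7));
        boolean hasStale = arrivals.stream().anyMatch(t -> t.isBefore(staleCutoff));
        return hasStale ? Action.TICKET : Action.NONE;
    }
}
```

Note that a DLQ holding only old messages produces at most a ticket, while a burst of fresh arrivals escalates immediately, which is the point of alerting on rate.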
Build a reprocessing tool
When you fix the bug that caused messages to DLQ, you need to replay them. This needs:
A way to filter by failure_reason (so you only replay messages affected by the bug you fixed)
A “dry run” mode that shows what would happen
A way to replay back to the original topic, not the DLQ
Logging of every replay attempt
If you don’t have this tool, you’ll never replay messages. They’ll sit in the DLQ forever. The DLQ becomes a graveyard.
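The core of such a tool can be quite small. The following is a minimal sketch covering the four requirements, with everything broker-related abstracted away: DlqReplayer, DeadLetter, and the publisher callback (topic, payload) are hypothetical names, and in a real tool the callback would wrap your producer while the audit log would go to your logging system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical replay helper. DeadLetter holds only the fields replay needs.
final class DlqReplayer {

    record DeadLetter(String originalTopic, String payload, String failureReason) {}

    // Assumed callback that publishes (topic, payload) to the broker.
    private final BiConsumer<String, String> publisher;
    private final List<String> auditLog = new ArrayList<>();

    DlqReplayer(BiConsumer<String, String> publisher) {
        this.publisher = publisher;
    }

    // Replays messages whose failure reason contains reasonFilter.
    // In dry-run mode nothing is published; the audit log shows what would be.
    // Returns the number of messages that matched the filter.
    int replay(List<DeadLetter> messages, String reasonFilter, boolean dryRun) {
        int matched = 0;
        for (DeadLetter m : messages) {
            if (!m.failureReason().contains(reasonFilter)) {
                continue;
            }
            matched++;
            auditLog.add((dryRun ? "DRY-RUN " : "REPLAY ") + m.originalTopic()
                    + " reason=" + m.failureReason());
            if (!dryRun) {
                // Publish to the original topic, never back to the DLQ.
                publisher.accept(m.originalTopic(), m.payload());
            }
        }
        return matched;
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

A typical session would run replay with dryRun set to true, check the matched count and audit log, then run it again with dryRun set to false.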
Schedule a monthly DLQ review
Once a month, the team that owns each consumer reviews their DLQ:
What categories of failures showed up?
Are any persistent (suggesting a bug we haven’t fixed)?
Should any be auto-archived (after N days, move to cold storage)?
This sounds like overkill until the third time you discover a 6-month-old bug because the DLQ surfaced it.
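For the review itself, a quick breakdown by failure category answers the first question. The sketch below assumes failure_reason follows the "ExceptionName: detail" shape from the earlier example; the class DlqReviewSummary and its methods are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical review helper: counts DLQ messages per failure category.
final class DlqReviewSummary {

    // Category = text before the first ':' in failure_reason,
    // e.g. "ValidationException: ..." -> "ValidationException".
    // A reason without a colon is used whole.
    static String categoryOf(String failureReason) {
        int colon = failureReason.indexOf(':');
        String head = colon < 0 ? failureReason : failureReason.substring(0, colon);
        return head.trim();
    }

    // Returns category -> count, sorted by category name for stable output.
    static Map<String, Integer> countByCategory(List<String> failureReasons) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String reason : failureReasons) {
            counts.merge(categoryOf(reason), 1, Integer::sum);
        }
        return counts;
    }
}
```

A category that shows up month after month is a good candidate for the "bug we haven’t fixed" question.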
Three patterns for poison-pill protection
A “poison pill” is a single message that breaks every consumer that processes it. Without protection, it can stop your entire partition.
Pattern 1: Per-message retry counter
Track retry count in the message header. Increment on each failed processing attempt. Skip and DLQ when the count reaches the threshold:
int retries = record.headers().lastHeader("x-retry-count")...;
if (retries >= MAX_RETRIES) {
sendToDLQ(record, "Retry limit exceeded");
return;
}
try {
process(record);
} catch (Exception e) {
sendToRetryTopic(record, retries + 1);
}
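The header parsing elided in the snippet above deserves some care, because Kafka header values are raw bytes and lastHeader returns null when the header is absent. Here is one possible sketch of that step, kept free of Kafka types so it runs on its own. The class RetryCountHeader is hypothetical, and storing the count as a UTF-8 decimal string is an assumption, not a Kafka convention.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for the "x-retry-count" header value.
final class RetryCountHeader {

    // Assumes the count is stored as a UTF-8 decimal string.
    // A missing, negative, or unparsable value is treated as zero retries.
    static int parse(byte[] rawValue) {
        if (rawValue == null) {
            return 0;
        }
        try {
            int n = Integer.parseInt(new String(rawValue, StandardCharsets.UTF_8).trim());
            return Math.max(n, 0);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Encodes the count in the same format parse expects.
    static byte[] encode(int retries) {
        return Integer.toString(retries).getBytes(StandardCharsets.UTF_8);
    }

    // True once the message has used up its retry budget.
    static boolean exhausted(int retries, int maxRetries) {
        return retries >= maxRetries;
    }
}
```

Treating a corrupt header as zero is a design choice: it favors giving the message another chance. If a corrupt header is itself a sign of a poison pill in your system, treating it as exhausted may be the safer default.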
Pattern 2: Schema validation at consumer entry
Reject obviously malformed messages immediately, before they enter your processing logic:
if (!schemaValidator.isValid(record.value())) {
sendToDLQ(record, "Schema validation failed");
return;
}
This catches producer bugs early. Don’t waste cycles on messages that can’t possibly succeed.
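What schemaValidator does depends on your schema tooling. As a rough stand-in, the sketch below checks only that required fields are present in an already-decoded message. The class RequiredFieldValidator and the Map-based message shape are assumptions for illustration; a real validator would also check types and formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for schemaValidator: checks required fields only.
final class RequiredFieldValidator {

    private final List<String> requiredFields;

    RequiredFieldValidator(List<String> requiredFields) {
        this.requiredFields = List.copyOf(requiredFields);
    }

    // Returns the names of required fields that are absent or null.
    List<String> missingFields(Map<String, Object> message) {
        List<String> missing = new ArrayList<>();
        for (String field : requiredFields) {
            if (message.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    boolean isValid(Map<String, Object> message) {
        return missingFields(message).isEmpty();
    }
}
```

Returning the list of missing fields, and not just a boolean, gives you something specific to put in failure_reason.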
Pattern 3: Fail-open with sampling
For non-critical processing, sometimes “skip and log” is the right call. Don’t DLQ every event from a noisy producer — sample, alert on rate, and skip:
if (failureRate.getRate() > 0.1) {
samplingFilter.recordFailure();
if (samplingFilter.shouldSample()) {
sendToDLQ(record, "High failure rate, sampled");
}
return;
}
This is appropriate for telemetry, not financial transactions. Pick the pattern based on the cost of dropping a message.
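The snippet above leaves samplingFilter undefined. One simple way to fill that role is a counter that lets through one failure out of every N, sketched below. The class FailureSampler and its single combined method are hypothetical, and counter-based sampling is just one option; random sampling would also work.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sampler: lets through 1 out of every `oneIn` failures.
final class FailureSampler {

    private final int oneIn;
    private final AtomicLong failures = new AtomicLong();

    FailureSampler(int oneIn) {
        if (oneIn < 1) {
            throw new IllegalArgumentException("oneIn must be >= 1");
        }
        this.oneIn = oneIn;
    }

    // Records a failure and reports whether this one should go to the DLQ.
    // The 1st, (oneIn + 1)th, (2 * oneIn + 1)th, ... failures are sampled.
    boolean recordAndShouldSample() {
        long n = failures.getAndIncrement();
        return n % oneIn == 0;
    }

    // Total failures seen, sampled or not; useful for rate alerting.
    long totalFailures() {
        return failures.get();
    }
}
```

Combining the record and sample steps into one call avoids a race between two threads that each record a failure before either asks whether to sample.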
GitHub Link:
https://github.com/sysdr/sdc-java/tree/main/day36/dlq-log-system
What most teams get wrong
In order of frequency:
No alerting on DLQ growth. The DLQ exists. Nobody knows when it’s growing.
Original message context not preserved. “Failed to parse” with no source info → impossible to debug.
No reprocessing tool. Engineers do it ad-hoc with one-off scripts every time.
Treating all errors as DLQ candidates. Including transient ones, which floods the DLQ with noise.
Single DLQ for all consumers. Should be per-consumer (or at least per-topic), so alerting and review can be team-specific.
A reference architecture
Producer → Topic A → Consumer A
↓ on permanent failure
DLQ-A (per consumer)
↓ daily
Cold archive (S3 + TTL)
↓ on demand
Reprocessing tool → Topic A (replay)
Each consumer has its own DLQ. Cold archive prevents the DLQ from growing forever. The reprocessing tool can replay back to the source topic with a filter, so fixed bugs replay only the affected messages.
Closing thought
A DLQ is a confession that your processing pipeline isn’t perfect. That’s fine — no pipeline is. The point of a DLQ is to make imperfection visible, diagnosable, and recoverable.
If your DLQ doesn’t do all three, it’s a leak in disguise.
More like this at SDCourse — a hands-on distributed systems course covering building production-grade systems in Java + Spring Boot + Kafka + Python.
