The Dead Letter Queue: How We Handle Build Failures

Building a microblogging platform means dealing with a reality that most side projects ignore: things fail.

A Lambda function crashes. A network request times out. A database is temporarily unavailable. An upload to Cloudflare R2 fails halfway through.

When you're building a service that thousands of people depend on, you can't just let those failures disappear. You need a system that detects them, retries them, and alerts you when they actually need human intervention.

That's where the Dead Letter Queue comes in. It's one of the less glamorous parts of infrastructure, but it's one of the most important.

What's a Dead Letter Queue?

A Dead Letter Queue (DLQ) is a safety net for messages that fail repeatedly.

Here's the flow in Jottings:

  1. You click "Publish"
  2. The API sends a message to SQS (Simple Queue Service): "Rebuild this site"
  3. A Lambda worker picks up the message and tries to build your site
  4. If the build fails, SQS automatically retries it
  5. If it fails multiple times (we set it to 3 attempts), the message is sent to the Dead Letter Queue
  6. Messages in the DLQ sit there until we investigate and fix them manually
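
In code, step 2 is a single SendMessage call from the API. Here's a minimal sketch using the AWS SDK for JavaScript v3 (the function name and environment variable are illustrative; the message shape matches the one shown in the architecture section below):

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Called by the API when a user hits "Publish".
// QUEUE_URL points at jottings-build-queue.
export async function enqueueBuild(siteId: string, userId: string): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.QUEUE_URL,
      MessageBody: JSON.stringify({ siteId, userId, priority: "high" }),
    })
  );
}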

The key word is automatic. We don't have to be awake when your build fails. We don't have to manually retry it. The system just keeps trying.

But more importantly, the DLQ gives us visibility. It's like an error inbox. Every message in the DLQ is a real problem that we can investigate and debug.

Why This Matters

Let's say your site has 500 jots, and you click "Publish." The Lambda function starts building your static HTML files.

Halfway through—after rendering 300 pages—the function runs out of memory. Or a network request to Cloudflare times out. Or AWS is having infrastructure issues in our region.

Without a DLQ, here's what happens: Your site isn't updated. You don't know why. You refresh the dashboard and see a spinning loading indicator. We don't know there's a problem until you email support.

With a DLQ, here's what happens: The message fails. SQS retries it automatically. Maybe the second attempt succeeds (the temporary outage is resolved). Maybe it fails again. On the third failure, it goes to the DLQ. We get an automated alert. We check the DLQ immediately, see the error, and start debugging.

The difference is measurable: minutes instead of hours before we know about the problem.

How Jottings Uses the DLQ

We have a specific setup for build failures. Here's the architecture:

1. Primary Build Queue

When you publish a jot, the API sends a message to the build queue:

Queue: jottings-build-queue
Message: { siteId: "123", userId: "456", priority: "high" }
Lambda: build-processor
Timeout: 15 minutes
Retries: 3 attempts

The Lambda function has 15 minutes to complete the build. If it takes longer, it times out.

It gets 3 attempts. If all 3 fail, the message is sent to the DLQ.
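
The worker side is an SQS-triggered Lambda. A stripped-down sketch of the handler (buildSite is a placeholder for the real rendering step, which isn't shown here):

import type { SQSEvent } from "aws-lambda";

// Placeholder for the actual static-site build.
declare function buildSite(siteId: string, userId: string): Promise<void>;

// build-processor: triggered by jottings-build-queue.
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    const { siteId, userId } = JSON.parse(record.body);

    // If this throws, the invocation fails, the messages stay on the
    // queue, and SQS re-delivers them after the visibility timeout,
    // up to the configured number of attempts before the DLQ.
    await buildSite(siteId, userId);
  }
}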

2. Dead Letter Queue

Every failed message goes here:

DLQ: jottings-build-queue-dlq
MessageRetentionPeriod: 14 days (1209600 seconds, the SQS maximum)

Messages sit in the DLQ for up to 2 weeks. That gives us time to investigate.

3. CloudWatch Alarms

We have an alarm that fires whenever any message appears in the DLQ:

Alarm: "Build Queue DLQ has messages"
Threshold: > 0 messages for 1 minute
Action: Send SNS email notification

The second a build fails 3 times, we get an email. No delays. No batching. Real-time visibility.
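
The alarm is defined with the rest of the stack; expressed with the AWS CDK for readability, the same idea looks roughly like this (a sketch, not our literal config):

import { Duration, Stack } from "aws-cdk-lib";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as sns from "aws-cdk-lib/aws-sns";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as cwActions from "aws-cdk-lib/aws-cloudwatch-actions";

// Fire whenever anything at all is visible in the DLQ.
function addDlqAlarm(stack: Stack, dlq: sqs.Queue, alertTopic: sns.Topic): void {
  const alarm = new cloudwatch.Alarm(stack, "BuildDlqHasMessages", {
    metric: dlq.metricApproximateNumberOfMessagesVisible({
      period: Duration.minutes(1),
    }),
    threshold: 0,
    comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    evaluationPeriods: 1,
  });

  // The email notification goes out through an SNS topic.
  alarm.addAlarmAction(new cwActions.SnsAction(alertTopic));
}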

4. Admin Tooling

We have a CLI tool to investigate DLQ messages:

jottings-prod debug failed-builds

This shows us:

  • What site failed to build
  • What user owns the site
  • What error message we got
  • When it failed
  • How many retries we attempted
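
Under the hood, a tool like this is mostly reading messages off the DLQ without deleting them (the error text comes from the build logs rather than the message itself). A stripped-down sketch, with illustrative names:

import { SQSClient, ReceiveMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Peek at failed builds. ApproximateReceiveCount shows how many
// delivery attempts a message went through before landing here.
export async function listFailedBuilds(dlqUrl: string): Promise<void> {
  const { Messages = [] } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      AttributeNames: ["ApproximateReceiveCount", "SentTimestamp"],
    })
  );

  for (const msg of Messages) {
    const { siteId, userId } = JSON.parse(msg.Body ?? "{}");
    console.log({ siteId, userId, ...msg.Attributes });
  }
}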

Real Example: What Happened Last Month

Here's a real failure we caught with the DLQ:

A user published a jot with a very large image. Our build system tried to process it, but the Lambda didn't have enough memory to resize the image.

Build attempt 1: Out of memory → Fail
Build attempt 2: Out of memory → Fail
Build attempt 3: Out of memory → Fail
Result: Message in DLQ

We got an alert. We checked the DLQ. We saw the user's site ID and the error. We:

  1. Increased the Lambda memory allocation (512MB → 1024MB)
  2. Manually retried the message from the DLQ
  3. The build succeeded
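
Step 2 deserves a footnote: "retrying from the DLQ" just means moving the message back onto the source queue. The SQS console now has a built-in redrive feature for this; doing it by hand looks roughly like this sketch:

import {
  SQSClient,
  ReceiveMessageCommand,
  SendMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Move one message from the DLQ back onto the build queue.
export async function redriveOne(dlqUrl: string, queueUrl: string): Promise<void> {
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({ QueueUrl: dlqUrl, MaxNumberOfMessages: 1 })
  );
  if (!Messages?.length) return;

  const msg = Messages[0];
  // Re-enqueue first; delete from the DLQ only once the send succeeds,
  // so a crash in between can't lose the message.
  await sqs.send(new SendMessageCommand({ QueueUrl: queueUrl, MessageBody: msg.Body }));
  await sqs.send(
    new DeleteMessageCommand({ QueueUrl: dlqUrl, ReceiptHandle: msg.ReceiptHandle })
  );
}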

Without the DLQ, we wouldn't have known about this for hours (if the user even reported it). The user would have been confused why their site didn't update. We would have been debugging blindly.

With the DLQ, we fixed it in minutes.

What About Retries?

Not every failure needs a DLQ. Some are temporary and resolve themselves.

That's why SQS retries automatically: a failed message isn't deleted from the queue, it just becomes visible again once its visibility timeout expires, and the next worker picks it up. Here's how the timing plays out for us:

  • Attempt 1: Message picked up immediately
  • Attempt 2: After 2 minutes (temporary network hiccup? Probably recovered)
  • Attempt 3: After 4 minutes (still failing? Probably a real problem)
  • DLQ: Give up and park it here for manual review

This logic is baked into SQS as a redrive policy on the source queue, which we set in our serverless.yml:

RedrivePolicy:
  deadLetterTargetArn: (the ARN of jottings-build-queue-dlq)
  maxReceiveCount: 3

Once a message has been received 3 times and still failed, SQS moves it to the DLQ. At that point it's probably not a temporary issue. It's time for humans to look.

The Bigger Picture

The DLQ is part of a larger resilience pattern we use throughout Jottings:

Asynchronous Processing: Don't block the user waiting for slow operations. Queue the work and let them move on. (This is why publishing is so fast.)

Automatic Retries: Most failures are temporary. Give the system a chance to recover.

Visibility: If something fails repeatedly, make sure someone notices. No silent failures.

Manual Intervention: When automation isn't enough, the DLQ is a clear signal that a human needs to step in.

This pattern isn't unique to Jottings. It's how most modern systems handle failures. But it's not obvious if you've only built simple apps where everything is synchronous.

Monitoring Beyond the DLQ

The DLQ is just one piece. We also monitor:

  • Build duration: Are builds getting slower? That's a leading indicator of trouble.
  • API error rates: Are publish requests failing? How many?
  • Lambda duration vs. timeout: Are we getting close to the 15-minute limit?
  • Database throttles: Is DynamoDB struggling to keep up?

All of these feed into CloudWatch dashboards and alarms. The DLQ catches the catastrophic failures. The other monitoring catches the slow degradation.

Together, they give us confidence that the system is working.

Why This Matters for Users

Here's the thing: most users never think about any of this.

You click "Publish." Your site updates in 10 seconds. Life is good.

You click "Publish" at exactly the moment our database is having issues. The first build attempt fails behind the scenes, SQS retries it, and your site updates a minute later than usual. You never knew there was a failure.

The DLQ, the retries, the monitoring—they all exist so you don't have to think about it. The system handles failures gracefully. Your content doesn't disappear. Your site keeps working.

That's the goal. Infrastructure should be invisible. You should only notice it when it breaks.

If You're Building Something

If you're building any kind of service—especially one that processes things asynchronously—consider adding a Dead Letter Queue:

  • Visibility: You'll actually know when things fail
  • Resilience: Temporary failures don't lose data
  • Debugging: You have concrete error messages to investigate
  • Alerting: You don't have to manually check for failures

AWS SQS makes this trivial to set up, and most messaging systems have an equivalent: RabbitMQ has dead-letter exchanges, and Kafka pipelines typically route repeatedly failing records to a dead-letter topic.

It's one of those things that feels optional until you need it. Then it becomes indispensable.


Building reliable systems is hard. Invisible reliability is even harder. If you're interested in how we keep Jottings stable, write a few jots and see what happens under the hood. You can find more about our architecture on GitHub or reach out on Twitter.