Hands On System Design - Distributed Systems Implementation : Hands On Production-grade distributed log platform — Build LogStream (Java/Spring boot)

Day 66: Implement Log Redaction for Compliance

System Design Course — Tue, 07 Jul 2026 08:31:06 GMT

What We’re Building Today

Today we implement a production-grade log redaction system that automatically sanitizes sensitive data before storage and transmission:

Pattern-based redaction engine detecting PII, credentials, and regulated data types
Streaming redaction pipeline processing logs at ingestion with zero storage of raw sensitive data
Configurable redaction policies supporting GDPR, HIPAA, PCI-DSS compliance frameworks
Audit-preserving redaction maintaining investigative capability while protecting privacy

Why This Matters: The Compliance-Performance Paradox

When Uber expanded to Europe, they faced a critical challenge: their centralized logging infrastructure inadvertently stored millions of user phone numbers, email addresses, and location data across multiple regions. Under GDPR, each instance represented potential €20M fines. Their solution? Real-time redaction at ingestion, processing 2M events/second while guaranteeing zero sensitive data persistence.

The core tension in compliance-focused systems is this: regulations demand you never store sensitive data, yet debugging production incidents requires detailed logs. Traditional approaches—post-hoc scrubbing or field-level encryption—either fail audits (data existed, even temporarily) or make logs unusable for troubleshooting. Production systems need redaction that’s both immediate and intelligent, removing genuine risks while preserving investigative context.

At scale, this becomes a distributed systems problem. Netflix’s approach processes logs through three independent redaction layers before storage, accepting 15% performance overhead to guarantee compliance. Each layer uses different detection strategies (regex, ML-based entity recognition, schema validation) to catch what others miss. The architectural insight: compliance isn’t a feature you add—it’s a constraint that shapes your entire data flow.

System Design Deep Dive

Preparing for a distributed systems interview?

→Download the free Interview Pack

→ Subscribe now to access source code repository - 200 + coding lessons

Day 65: Field-Level Encryption for Sensitive Log Data

System Design Course — Mon, 29 Jun 2026 05:26:06 GMT

What We’re Building Today

AES-256-GCM encryption service for PII fields (email, SSN, credit cards) in log events
Key rotation system with versioned encryption keys stored in distributed cache
Transparent decryption layer with role-based access to encrypted fields
Performance-optimized encryption pipeline handling 50,000+ events/second

Why This Matters

When logs contain personally identifiable information (PII), compliance frameworks like GDPR, HIPAA, and PCI-DSS mandate encryption at rest and in transit. However, encrypting entire log streams creates operational nightmares: debugging becomes impossible, pattern analysis breaks, and search performance tanks. Field-level encryption solves this by selectively protecting sensitive fields while keeping the rest of the log searchable and analyzable.

Netflix processes billions of log events daily, many containing user IDs, IP addresses, and session tokens. Their approach: encrypt only the 5-10% of fields containing PII, keeping 90% of log data in plaintext for operational visibility. This hybrid strategy reduces encryption overhead by 85% while maintaining compliance. The challenge isn’t just encryption—it’s building a system that remains performant, debuggable, and operationally practical at scale.

System Design Deep Dive

Preparing for a distributed systems interview?

→A free welcome gift: Download the free Interview Pack

→52 FAANG Questions Drill cards Cheatsheets Vault - Get it here

→ Subscribe now to access source code repository - 200 + coding lessons

1. Encryption Strategy: Field-Level vs Full Payload

The Trade-off: Encrypting entire log events is simple but operationally devastating. Field-level encryption adds complexity but preserves observability.

Full Payload Encryption (the naive approach):

Pros: Simple implementation, guaranteed protection
Cons: Logs become opaque binary blobs, regex/grep fails, storage increases 40%, search requires full decryption

Field-Level Encryption (production approach):

Pros: Selective protection, searchable metadata, 10x faster queries
Cons: Complex key management, schema awareness required, potential for data leakage in related fields

Production Pattern: Use a field classification system. Tag fields as PUBLIC, INTERNAL, or PII during schema definition. Only PII fields get encrypted. This requires schema evolution support—when you add a new PII field, historical logs remain unencrypted for that field, so the decryption layer must handle missing encrypted data gracefully.

2. Key Management: Rotation Without Downtime

The CAP Theorem Reality: You cannot have consistent key distribution, always-available encryption, and partition-tolerant key storage simultaneously.

Most systems choose AP (Available + Partition-Tolerant) for encryption keys:

Keys cached in Redis with 5-minute TTL
Key rotation happens asynchronously
Old keys remain valid for 24 hours during rotation
Accept eventual consistency: 0.01% of events might use stale keys during rotation windows

Anti-pattern: Storing keys in a central database and fetching on every encryption. This creates a single point of failure and bottlenecks throughput at ~5,000 ops/sec. Instead, use versioned key caching:

Key Structure: 
{
  "keyId": "encryption-key-v47",
  "algorithm": "AES-256-GCM",
  "key": "base64-encoded-key",
  "validFrom": "2025-02-01T00:00:00Z",
  "validUntil": "2025-03-01T00:00:00Z"
}

Each encrypted field stores its keyId version. Decryption services fetch the appropriate key from cache, falling back to the key management service (AWS KMS, HashiCorp Vault) only on cache miss.

3. Encryption Performance: Batching and Async Processing

The Bottleneck: AES-256-GCM encryption on a single core processes ~200MB/sec. For a system handling 50,000 events/sec with 5 PII fields each (250,000 encryption ops/sec), naive synchronous encryption saturates at 12,000 events/sec on a 16-core machine.

Solution: Async encryption pipeline with batching:

Kafka consumer reads log events at 50K/sec
Classification stage identifies PII fields (CPU-bound, parallelizable)
Batch accumulator groups 100 events (reduces per-event overhead)
Parallel encryption workers (16 threads) encrypt batches
Kafka producer publishes encrypted events

This architecture achieves 50,000 events/sec because:

Batching reduces context switching overhead by 60%
Parallel workers saturate all CPU cores
Async processing prevents backpressure to upstream services

Trade-off: Batching adds 50-100ms latency. For real-time alerting on PII exposure, maintain a separate fast-path pipeline that processes high-priority events synchronously.

4. Decryption Access Control: Role-Based Field Visibility

The Problem: Not all engineers should decrypt all PII. Support engineers need email addresses for customer lookup but shouldn’t see SSNs. Compliance teams need audit trails of who decrypted what.

Pattern: Attribute-Based Access Control (ABAC) at the field level:

Access Policy:
{
  "role": "support-engineer",
  "allowedFields": ["user.email", "user.name"],
  "deniedFields": ["user.ssn", "payment.cardNumber"],
  "auditRequired": true
}

When a user queries logs:

Parse query for requested fields
Check user role against field-level policies
Decrypt only allowed encrypted fields
Redact denied fields (replace with [REDACTED-SSN])
Log decryption event to audit trail

Failure Mode: Policy changes don’t propagate instantly. Use a policy versioning system where each decryption request includes a policyVersion timestamp. Services cache policies for 60 seconds, then check for updates. During policy updates, some requests may use stale policies—accept this as a trade-off for availability.

5. Storage Optimization: Encrypted Field Indexing

The Problem: Encrypted fields can’t be indexed or searched. How do you find “all logs with email=user@example.com” when emails are encrypted?

Solution: Deterministic encryption for indexable fields:

Use HMAC-SHA256 to generate a searchable hash: hmac(email) = "abc123"
Store both the hash and AES-encrypted value
Queries search by hash, retrieve encrypted value, decrypt for display

Trade-off: Deterministic encryption is vulnerable to rainbow table attacks if the keyspace is small (e.g., email domains). Mitigate with domain-specific salts: use different HMAC keys for different email domains.

Implementation Walkthrough

GitHub Link:

https://github.com/sysdr/sdc-java-p/tree/main/day65/log-encryption-system

Step 1: Encryption Service with Key Versioning

The EncryptionService handles field-level encryption with automatic key rotation:

@Service
public class EncryptionService {
    private final RedisTemplate keyCache;
    private final KeyManagementClient kmsClient;
    
    public EncryptedField encrypt(String fieldName, String value) {
        EncryptionKey key = getCurrentKey(fieldName);
        
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        byte[] iv = generateIV(); // 12 bytes for GCM
        cipher.init(Cipher.ENCRYPT_MODE, key.getSecretKey(), new GCMParameterSpec(128, iv));
        
        byte[] encrypted = cipher.doFinal(value.getBytes(UTF_8));
        
        return new EncryptedField(
            fieldName,
            Base64.encode(encrypted),
            key.getKeyId(),
            Base64.encode(iv)
        );
    }
}

Architectural Decision: Store the initialization vector (IV) with each encrypted field. GCM mode requires unique IVs per encryption to maintain security. This increases storage by 16 bytes per field but prevents IV reuse attacks.

Step 2: Kafka Consumer with Async Encryption

The consumer processes logs in batches, encrypting PII fields asynchronously:

@Service
public class LogEncryptionConsumer {
    private final ExecutorService encryptionPool = Executors.newFixedThreadPool(16);
    
    @KafkaListener(topics = "raw-logs")
    public void processLogs(List events) {
        List> futures = events.stream()
            .map(event -> CompletableFuture.supplyAsync(
                () -> encryptSensitiveFields(event), 
                encryptionPool
            ))
            .toList();
        
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
            .thenAccept(v -> publishEncryptedLogs(
                futures.stream().map(CompletableFuture::join).toList()
            ));
    }
}

Why Async? Blocking encryption on 16 threads would cap throughput at 12K events/sec. Async processing with CompletableFuture allows the consumer thread to batch-fetch the next 100 events while encryption workers process the current batch, achieving 50K+ events/sec.

Step 3: Role-Based Decryption Layer

The query service decrypts fields based on user roles:

@Service
public class LogQueryService {
    public List queryLogs(LogQuery query, UserContext user) {
        List results = logRepository.findByQuery(query);
        
        return results.stream()
            .map(event -> decryptAllowedFields(event, user))
            .toList();
    }
    
    private LogEvent decryptAllowedFields(LogEvent event, UserContext user) {
        Set allowedFields = policyService.getAllowedFields(user.getRole());
        
        event.getEncryptedFields().forEach((fieldName, encryptedValue) -> {
            if (allowedFields.contains(fieldName)) {
                String decrypted = encryptionService.decrypt(encryptedValue);
                event.setField(fieldName, decrypted);
                auditService.logDecryption(user, fieldName, event.getId());
            } else {
                event.setField(fieldName, "[REDACTED]");
            }
        });
        
        return event;
    }
}

Key Insight: Decryption happens lazily at query time, not during storage. This allows policy changes to take effect immediately—revoking SSN access from a role instantly prevents future decryptions without re-encrypting stored data.

Production Considerations

Performance: Encryption adds 2-5ms per event. Under load, batch 100 events and parallelize encryption across 16 workers to maintain 50K events/sec throughput. Monitor encryption_latency_p99 metric—if it exceeds 10ms, scale worker pools horizontally.

Monitoring: Track key rotation failures with key_rotation_error_rate. A spike indicates KMS availability issues. Set up alerts for decryption_access_denied_rate—sudden increases suggest policy misconfigurations or potential security incidents.

Failure Scenarios:

KMS unavailable: Encryption falls back to last-known cached keys (valid for 24h). Logs continue flowing; alert ops team.
Key rotation mid-flight: Events encrypted with old keys remain decryptable. Decryption service tries current key first, falls back to previous 3 versions.
Corrupted encrypted data: Add HMAC signatures to detect tampering. Reject events with invalid signatures; alert security team.

Working Demo Link

Scale Connection

Uber’s approach: Encrypt rider/driver PII in trip logs but keep geolocation data plaintext. This enables fraud detection (pattern analysis on routes) while protecting identities. They use field-level encryption with geographic key sharding—US data uses US KMS keys, EU data uses EU KMS keys, ensuring GDPR compliance.

Amazon’s pattern: Encrypt customer PII in CloudWatch logs but maintain separate encrypted search indexes using deterministic encryption. Queries search the index by hashed email, retrieve encrypted log IDs, then batch-decrypt full logs. This hybrid approach achieves sub-second search on billions of encrypted logs.

Next Steps

Tomorrow: Implement log redaction for compliance—automatically detect and redact SSNs, credit cards, and API keys in unstructured log text using regex patterns and ML-based PII detection.

Day 64: Implement Role-Based Access Control for Log Data

System Design Course — Tue, 23 Jun 2026 08:30:30 GMT

What We’re Building Today

JWT-based authentication service with refresh token rotation
Fine-grained authorization middleware controlling log access by team/service
Audit logging system tracking all data access attempts
Multi-tenant isolation ensuring teams only see their logs

Why This Matters

In production log processing systems at scale, access control isn’t optional—it’s a regulatory requirement. When your system ingests millions of events per second from hundreds of microservices, you’re inevitably capturing sensitive data: user IDs, payment details, health information, internal API keys. A single unauthorized query could expose customer PII, violate GDPR/HIPAA compliance, or leak competitive intelligence.

Netflix processes 500+ billion events daily across 1,000+ microservices. Without RBAC, any engineer could query any service’s logs—a security nightmare. Uber’s data breach in 2016 partially stemmed from inadequate access controls. Modern distributed systems require defense-in-depth: authentication proves identity, authorization enforces least-privilege access, and audit trails ensure accountability.

The challenge: implementing security without sacrificing the millisecond-latency queries that make logs useful for incident response. Today’s RBAC system must scale horizontally, cache authorization decisions, and fail closed (deny access on errors) while maintaining the sub-100ms query performance engineers expect.

System Design Deep Dive

Preparing for a distributed systems interview?

→Download the free Interview Pack

→ Subscribe now to access source code repository - 200 + coding lessons

Week 2 : Network-Based Log Collection: From TCP Ingestion to Throughput Measurement

sysdai — Sat, 20 Jun 2026 08:31:01 GMT

1. Introduction

Most production systems do not keep logs on the machine that generated them. Applications emit events continuously; operators need those events centralized, searchable, and durable. That requirement pushes you toward network-based log collection: agents and receivers that accept log streams over TCP or UDP, buffer them when downstream systems lag, batch and compress them to save bandwidth, encrypt them in transit, and finally land them in storage or a message bus like Kafka.
This integrated Week 2 project (week_2_integrated_project) walks that arc in seven incremental layers—Days 8 through 14—inside a single Spring Boot application (com.example.week2.Week2IntegratedApplication). Each day remains a distinct package (com.example.week2.day8 … day14), while shared persistence, configuration, and observability live under com.example.week2.common. You can run the full stack locally on one JVM (HTTP on 8080, Day 8 Netty TCP on 9080, Day 9 TCP receiver on 9091, Day 10 UDP on 9876) and optionally attach Prometheus and Grafana for dashboards.
This article is written for a mid-level engineer who knows Spring Boot and wants to treat log shipping as an infrastructure problem, not a one-off socket demo. You will see how transport choice, buffering, batching, compression, TLS readiness, and load testing interact—and how to keep each concern in a module that can be understood and tested on its own.

2. From Fundamentals to a Unified System

Day 8 — TCP Server and Bounded In-Memory Buffering

Day 8 implements a Netty-based TCP log server (Day8TcpLogServer) that listens on port 9080 (configured via week2.day8.tcp.port in application.yml). Incoming lines are framed as newline-delimited JSON and handled by Day8LogMessageHandler, which validates timestamp, level, and message before handing events to Day8LogBufferService. The buffer is a bounded LinkedBlockingQueue; when it fills, events are dropped and counted in Day8LogMetrics.

The engineering concern is backpressure at the edge: a naive server that writes synchronously to a database will collapse under burst traffic. Day 8 decouples network I/O from persistence with a scheduled flush (@Scheduled(fixedDelay = 5000)) and batch writes through LogPersistenceService, guarded by a Resilience4j circuit breaker named database. REST queries are exposed at /api/week2/day8/logs/* via Day8LogQueryController.

Day 9 — Log Shipping Client and Kafka Handoff

Day 9 adds a TCP shipping pipeline: HTTP clients post to Day9LogIngestionController (POST /api/week2/day9/logs), Day9TcpLogShipperService buffers JSON lines and flushes them every 100ms to Day9TcpServerService on port 9091, and the receiver publishes each line to the Kafka topic logs (Day9KafkaLogConsumer persists via LogPersistenceService and caches ERROR logs in Redis through RedisLogCacheService).

The concern is reliable forwarding across process boundaries. The shipper uses a circuit breaker (logShipper) and reconnect logic; when app.kafka.enabled is false (default local profile), logs are stored directly so the API remains testable without Kafka.

Day 10 — UDP for High-Throughput, Loss-Tolerant Paths

Day 10 introduces UDP transport via Day10UdpLogShipperService and Day10UdpServerService (port 9876). Datagrams carry an 8-byte sequence number prefix plus JSON payload. The shipper tracks in-flight messages, retries on timeout, and exposes POST /api/week2/day10/logs/ship. The UDP server uses a NIO Selector loop and forwards to Kafka topic logs.udp.ingress; Day10KafkaConsumerService deduplicates and persists.

The concern is throughput versus delivery guarantees: UDP avoids connection overhead but requires application-level sequencing and dedupe—patterns you see in statsd-style agents and some observability collectors.

Day 11 — Batching to Optimize Network and Broker Usage

Day 11 centralizes batching in Day11LogBatchingService: events land in an ArrayBlockingQueue, flush when the batch reaches week2.day11.batching.max-batch-size (1000) or on a time trigger (@Scheduled every 5000ms). A single Kafka message carries a JSON array on topic log-events-batched. Day11KafkaConsumerService deserializes the array and calls saveAllFromEvents. Optional synthetic traffic comes from Day11LogGeneratorService when week2.day11.generator.enabled=true.

The concern is amortizing fixed costs—fewer TCP segments, fewer Kafka produce requests, better broker compression ratios—at the cost of latency until the batch fills or the timer fires.

Day 12 — Application-Level Compression

Day 12 compresses payloads before Kafka in Day12CompressionService (NONE, GZIP, SNAPPY, LZ4) with algorithm selection based on payload size. Day12KafkaProducerService attaches Kafka headers compression-algorithm and original-size; Day12KafkaConsumerService decompresses inside a Resilience4j decompression circuit breaker. Ingest is via POST /api/week2/day12/logs (Day12LogController).

The concern is bandwidth and broker load when JSON log lines are repetitive or large. Note that broker-level compression (compression.type) and application-level compression solve different problems; this project demonstrates the latter explicitly.

Day 13 — TLS Readiness and Production Resilience

Day 13 hardens the Kafka produce path with Day13KafkaProducerService (@CircuitBreaker(name = "kafka-producer")) and Day13GatewayController (@CircuitBreaker(name = "producer-service")) on POST /api/week2/day13/logs. Day13CertificateHealthIndicator (bean name day13CertificateHealth) reports Actuator health based on JKS certificate expiry when week2.day13.ssl.keystore-path is set; locally it reports UP with certificateCheck: skipped. WARN/ERROR logs are cached in Redis in Day13KafkaConsumerService.

The concern is secure, observable pipelines: encryption in transit (TLS scaffolding), failure isolation (circuit breakers), and operator visibility (health indicators)—without conflating “TLS configured” with “TLS required in dev.”

Day 14 — Load Generation and Throughput Measurement

Day 14 adds sustained and burst load through Day14LoadGeneratorService (@Scheduled steady rate when week2.day14.load-generator.enabled=true) and POST /api/week2/day14/load/burst. Day14KafkaProducerService records Micrometer timers; Day14KafkaConsumerService runs with concurrency 3 and publishes day14.log.processing.time percentiles. Operators inspect throughput via Day14MetricsController (/api/week2/day14/metrics/summary) and Grafana panels fed from /actuator/prometheus.

The concern is capacity planning: you cannot claim a pipeline handles 10k logs/sec without a generator, metrics, and a dashboard that prove it under your hardware and configuration.

3. Architecture Overview

At runtime, the system is one Spring Boot process whose context is rooted at Week2IntegratedApplication with @EnableScheduling and @EnableAsync for periodic flushes and load generation.

Configuration layer (com.example.week2.common.config):

AppConfig — ObjectMapper, CircuitBreakerRegistry, RestTemplate, WebClient.Builder
KafkaConfig — conditional on app.kafka.enabled; provides KafkaTemplate and byte-array templates for Day 12
SchedulingConfig — enables Spring’s scheduler for @Scheduled methods across day modules
application.yml — ports, topics, batch sizes, and profile-specific datasource (H2 for local, PostgreSQL for docker)

Shared domain layer (com.example.week2.common):

LogEvent — canonical DTO
LogEntryEntity / LogEntryRepository — JPA persistence with module column distinguishing day pipelines
LogPersistenceService — single write path for map, event, and batch inserts
RedisLogCacheService — optional dedupe and error caching
HomeController / Week2HealthController — discovery and aggregate health

Per-day transport and pipeline modules sit beside common code: Netty (day8), blocking TCP + Kafka (day9), UDP + Kafka (day10), batching (day11), compression (day12), resilient gateway (day13), load + metrics (day14).

External boundaries:

Log producers (curl, nc, agents) on HTTP 8080 and network ports 9080/9091/9876
Kafka cluster (when enabled) on spring.kafka.bootstrap-servers
Redis (optional) for cache and dedupe
Prometheus/Grafana (Docker Compose) scraping /actuator/prometheus on the host

Day 63: Chaos Engineering - Building Resilience Through Controlled Failure

System Design Course — Tue, 16 Jun 2026 09:01:03 GMT

What We’re Building Today

Chaos testing framework that randomly introduces network failures, service crashes, and resource exhaustion
Automated failure injection across Kafka, Redis, PostgreSQL, and API services
Resilience verification suite that validates circuit breakers, retry logic, and graceful degradation
Production-grade monitoring to observe system behavior under failure conditions

Why This Matters

After implementing backpressure mechanisms on Day 62, your log processing system can handle traffic spikes. But real production failures aren’t predictable—they’re messy, cascading, and often occur at the worst possible moment. Netflix’s Chaos Monkey famously shuts down production instances randomly, not because they want outages, but because controlled chaos builds antifragile systems.
In distributed systems, failure is not an edge case—it’s the default state. Network partitions, disk failures, OOM kills, and cascade failures happen constantly at scale. The difference between a robust system and a brittle one is whether you’ve tested these scenarios before your users experience them. Chaos engineering shifts failure discovery from 3 AM production incidents to controlled daytime experiments.
Your log processing system processes thousands of events per second across multiple services. When Kafka becomes unavailable, does your producer gracefully degrade or crash the entire pipeline? When PostgreSQL experiences latency spikes, does the consumer exhaust its thread pool? These aren’t hypothetical questions—they’re Tuesday morning at any company running distributed systems.

System Design Deep Dive

Preparing for a distributed systems interview?

→Download the free Interview Pack

→ Subscribe now to access source code repository - 200 + coding lessons

Day 62: Implement Backpressure Mechanisms for Load Management

System Design Course — Tue, 09 Jun 2026 09:01:08 GMT

What We’re Building Today

Today, we’re implementing production-grade backpressure mechanisms that protect our distributed log processing system from overload:

Adaptive rate limiting with token bucket algorithms at the API gateway
Kafka consumer lag monitoring with automatic throughput throttling
Circuit breaker integration that triggers backpressure when downstream services degrade
Reactive backpressure using Spring WebFlux for non-blocking flow control

Why This Matters: The Cost of Ignoring Backpressure

In 2020, a major e-commerce platform experienced a cascading failure during Black Friday when their log aggregation system accepted more events than it could process. Without backpressure, their Kafka consumers fell 2 hours behind, Redis caches expired prematurely, and the entire observability stack collapsed—leaving teams blind during their highest-traffic day.
Backpressure isn’t about rejecting work; it’s about controlled degradation. When Netflix’s API gateway detects rising latencies, it doesn’t crash—it applies backpressure to preserve core functionality. Uber’s real-time dispatch system uses backpressure to ensure that even during surge demand, critical ride matching continues while less critical analytics can lag.
The fundamental challenge: distributed systems have no single point of control. A producer publishing 100K logs/second doesn’t know that downstream consumers can only handle 20K. Without backpressure, queues grow unbounded, memory exhausts, and the system fails catastrophically.

System Design Deep Dive: Backpressure Patterns

Preparing for a distributed systems interview?

→Download the free Interview Pack

→ Subscribe now to access source code repository - 200 + coding lessons

Java Version : Dead Letter Queues That Actually Work

System Design Course — Mon, 08 Jun 2026 05:42:20 GMT

Most teams’ Dead Letter Queues are accidentally where messages go to die unnoticed. This is a guide to building a DLQ that does what it’s supposed to do.

What a DLQ is for

A Dead Letter Queue is a place to send messages that cannot be processed. The key word is “cannot.” A DLQ is not for transient failures.

Transient failures (network blip, downstream service temporarily down): retry with exponential backoff. Don’t DLQ.

Permanent failures (malformed payload, business rule violation, missing required field): DLQ immediately. No retries.

The first mistake teams make is conflating these. A consumer that DLQs on every error pollutes the DLQ with transient garbage — making the actual permanent-failure messages impossible to find.

What a DLQ message must preserve

When a message is DLQ’d, you need to be able to answer three questions:

What was the original message?
Why did it fail?
Can I safely reprocess it after fixing the bug?

To answer those, the DLQ message must contain:

{
  "original_message": { ... },
  "original_topic": "user-events",
  "original_partition": 4,
  "original_offset": 8472938,
  "consumer_id": "user-event-processor-7f3a",
  "failure_reason": "ValidationException: required field 'user_id' missing",
  "stack_trace": "...",
  "timestamp": "2025-12-09T14:32:18Z",
  "retry_count": 0
}

The original_topic, partition, and offset are the keys. They let you trace back to exactly where in the source stream the message came from — useful for diagnosing a producer bug, not just the consumer’s failure.

Operating a DLQ

A DLQ that nobody looks at is worse than no DLQ — it gives the team false confidence that “we have one.”

Alert on growth, not absolute count

A DLQ with 1000 messages from 6 months ago that haven’t grown is fine. A DLQ with 50 messages added in the last hour is a fire.

Alert on rate of arrival, not on size:

10 messages/minute → page on-call
100 messages in last hour → wake up the team
DLQ has any messages older than 7 days → ticket the team

Build a reprocessing tool

When you fix the bug that caused messages to DLQ, you need to replay them. This needs:

A way to filter by failure_reason (so you only replay messages affected by the bug you fixed)
A “dry run” mode that shows what would happen
A way to replay back to the original topic, not the DLQ
Logging of every replay attempt

If you don’t have this tool, you’ll never replay messages. They’ll sit in the DLQ forever. The DLQ becomes a graveyard.

Schedule a monthly DLQ review

Once a month, the team that owns each consumer reviews their DLQ:

What categories of failures showed up?
Are any persistent (suggesting a bug we haven’t fixed)?
Should any be auto-archived (after N days, move to cold storage)?

This sounds like overkill until the third time you discover a 6-month-old bug because the DLQ surfaced it.

Three patterns for poison-pill protection

A “poison pill” is a single message that breaks every consumer that processes it. Without protection, it can stop your entire partition.

Pattern 1: Per-message retry counter

Track retry count in the message header. Increment on each failed processing attempt. Skip and DLQ when count exceeds threshold:

int retries = record.headers().lastHeader("x-retry-count")...;
if (retries >= MAX_RETRIES) {
    sendToDLQ(record, "Retry limit exceeded");
    return;
}
try {
    process(record);
} catch (Exception e) {
    sendToRetryTopic(record, retries + 1);
}

Pattern 2: Schema validation at consumer entry

Reject obviously malformed messages immediately, before they enter your processing logic:

if (!schemaValidator.isValid(record.value())) {
    sendToDLQ(record, "Schema validation failed");
    return;
}

This catches producer bugs early. Don’t waste cycles on messages that can’t possibly succeed.

Pattern 3: Fail-open with sampling

For non-critical processing, sometimes “skip and log” is the right call. Don’t DLQ every event from a noisy producer — sample, alert on rate, and skip:

if (failureRate.getRate() > 0.1) {
    samplingFilter.recordFailure();
    if (samplingFilter.shouldSample()) {
        sendToDLQ(record, "High failure rate, sampled");
    }
    return;
}

This is appropriate for telemetry, not financial transactions. Pick the pattern based on the cost of dropping a message.

GitHub Link:

https://github.com/sysdr/sdc-java/tree/main/day36/dlq-log-system

What most teams get wrong

In order of frequency:

No alerting on DLQ growth. The DLQ exists. Nobody knows when it’s growing.
Original message context not preserved. “Failed to parse” with no source info → impossible to debug.
No reprocessing tool. Engineers do it ad-hoc with one-off scripts every time.
Treating all errors as DLQ candidates. Including transient ones, which floods the DLQ with noise.
Single DLQ for all consumers. Should be per-consumer (or at least per-topic), so alerting and review can be team-specific.

Working Code Demo:

A reference architecture

Producer → Topic A → Consumer A
                       ↓ on permanent failure
                     DLQ-A (per consumer)
                       ↓ daily
                     Cold archive (S3 + TTL)
                       ↓ on demand
                     Reprocessing tool → Topic A (replay)

Each consumer has its own DLQ. Cold archive prevents the DLQ from growing forever. The reprocessing tool can replay back to the source topic with a filter, so fixed bugs replay only the affected messages.

Closing thought

A DLQ is a confession that your processing pipeline isn’t perfect. That’s fine — no pipeline is. The point of a DLQ is to make imperfection visible, diagnosable, and recoverable.

If your DLQ doesn’t do all three, it’s a leak in disguise.

More like this at SDCourse — a hands-on distributed systems course covering building production-grade systems in Java + Spring Boot + Kafka + Python.

Day 61: Circuit Breakers for Handling Component Failures

sdr11 — Tue, 02 Jun 2026 09:00:14 GMT

What We’re Building Today

Yesterday we wired multi-region replication across our log pipeline. Today we harden that pipeline against the inevitable: components go down, networks partition, downstream services time out. We add circuit breakers — the single most important resilience pattern in any distributed system — directly into our log processing flow.

Circuit breaker wrappers around every outbound service call (Kafka produce, Redis cache-write, PostgreSQL persist)
State-machine-driven failure detection with configurable thresholds, timeouts, and half-open probing
Fallback strategies that keep the pipeline moving when a downstream is dead
Real-time health dashboards in Grafana that surface breaker trips before your on-call page fires

Why This Matters

Multi-region replication (Day 60) gave us durability. But durability without availability is just a well-preserved corpse. A single slow PostgreSQL replica can hold up every consumer thread if we let synchronous calls block indefinitely. In production at scale, this is not a hypothetical — it is a certainty. Netflix’s Hystrix documentation famously documented how a single downstream latency spike in their recommendation service cascaded into a full outage because every thread in the calling service was stuck waiting.
Circuit breakers are the distributed-systems equivalent of an electrical fuse: they detect abnormal current (errors or latency), trip open to stop the cascade, and periodically probe to see if the downstream has recovered. The pattern converts a hard failure in one service into a graceful degradation across the entire system. Every FAANG-scale platform runs this pattern on every single outbound network call — no exceptions. Today we wire it into our log pipeline so that a failing PostgreSQL node, a saturated Kafka broker, or a flapping Redis instance cannot bring down our ingestion throughput.

System Design Deep Dive

Preparing for a distributed systems interview?

→Download the free Interview Pack

→ Subscribe now to access source code repository - 200 + coding lessons

Week 1 Java: Setting Up the Infrastructure

sysdai — Sat, 30 May 2026 08:30:55 GMT

1. Introduction

Task scheduling is one of those “boring” infrastructure concerns that quietly determines whether a backend behaves like a system or like a pile of endpoints. Real services don’t only react to inbound HTTP. They poll upstream systems, rotate files, refresh caches, emit periodic heartbeats, and enforce time-based policies. When those time-based behaviors are unreliable, everything else degrades: logs pile up, caches go stale, disk fills, and operators lose observability exactly when they need it most.

This integrated Week 1 project demonstrates scheduling as a concrete, inspectable part of a local log processing pipeline. It starts with simple time-driven behaviors (generation rate accounting and periodic stats), then connects them to more “operational” concerns like log parsing and file rotation. In code terms, this system is intentionally small, but it uses the same primitives you’d use in production Spring services: @Scheduled methods, a shared scheduler (ThreadPoolTaskScheduler), and a lightweight registry that tracks task state across runs.

This is written for a mid-level engineer who already knows Spring Boot basics and wants to treat scheduling as a first-class engineering concern. You’ll learn how scheduling interacts with thread pools and failure handling, how to keep periodic jobs observable, and how to integrate scheduled work into a pipeline without turning it into a ball of side effects.

2. From Fundamentals to a Unified System

Day 1 — Environment + a stable place to “run the system”

Day 1 is about getting a repeatable development environment and repository structure so infrastructure can actually be exercised. In this integrated app, that shows up as the ability to run “the system” with one entry point (com.example.week1.Week1IntegratedApplication) and a predictable set of endpoints (see com.example.week1.common.web.HomeController and SystemOverviewController). The engineering concern is reproducibility: scheduled systems are hard to debug when you can’t reliably start, stop, and observe them.

Day 2 — Configurable log generator and time-driven behavior

Day 2 introduces a log generator that can produce sample events at a configurable rate. In the integrated project, the generator runs in com.example.week1.day2.service.Day2LogGenerationService, which uses an internal thread pool for generation and a scheduled method for per-second rate calculations:

Day2LogGenerationService.updateMetrics() is @Scheduled(fixedRate = 1000) and also annotated with @TrackedTask(id = "day2.updateMetrics", ...).

This maps directly to a common production pattern: time-window metrics (or control loops) that run regardless of inbound traffic. You can expose a generator as an HTTP API, but its behavior is only correct if the periodic bookkeeping is correct and thread-safe.

Day 3 — File-based collector and periodic reporting

Day 3 adds a log collector that reads local files, maintains offsets, and de-duplicates entries. In the integrated app, com.example.week1.day3.service.Day3LogCollectorService watches directories and periodically emits statistics:

Day3LogCollectorService.logStatistics() is @Scheduled(fixedRate = 60000) and @TrackedTask(id = "day3.logStatistics", ...).

The engineering concern here is “background work as a pipeline.” Watching a filesystem isn’t request/response; it’s event-driven plus periodic health/reporting. When you’re building infrastructure, the first step toward reliability is making the background work visible.

Day 4 — Parsing as a service, not a regex in the hot path

Day 4 implements parsing to extract structured data from common log formats. In the integrated project, parsing lives in com.example.week1.day4.service.Day4LogParsingService, and the raw-to-structured transformation is represented by com.example.week1.day4.service.Day4RawLogProcessor.

The scheduling tie-in is subtle but real: parsing is often invoked by scheduled or asynchronous pipelines (batching, periodic scans, backfills). Even when parsing is triggered by file changes or Kafka, the system still needs predictable timing around reporting and policy enforcement.

Day 5 — Storage, rotation policies, and operational “tick” jobs

Day 5 introduces a storage mechanism using flat files with rotation. In the integrated project, rotation is a scheduled policy evaluation:

com.example.week1.day5.service.Day5RotationPolicyService.evaluateRotationPolicies() runs on @Scheduled(fixedRate = 300000) and is annotated with @TrackedTask(id = "day5.rotateColdFiles", ...).

This is exactly what a backend does in production: continuously enforce time/size policies. It’s not “feature code,” but it’s the code that prevents outages. Rotation jobs need predictable scheduling, bounded runtime, and strong observability when they fail.

Day 6 — Querying and filtering as an operational interface

Day 6 adds query APIs that filter stored logs with cache behavior and circuit breakers (com.example.week1.day6.service.Day6LogQueryService). This day is about operator ergonomics: scheduled pipelines need query surfaces so you can validate that the system is behaving over time, not just at a single moment.

The engineering concern is feedback loops. Scheduled systems are prone to silent failure. Query endpoints (plus metrics) are how you close the loop.

Day 7 — Integration into a local pipeline

Day 7 is the integration day: connect generation, collection, parsing, and storage into a pipeline that can run locally. In this integrated project, the pipeline is mediated via a dispatch layer (com.example.week1.common.kafka.LogEventDispatchService) that can either publish to Kafka (when enabled) or persist directly (when Kafka is disabled).

The scheduling concern is orchestration without tight coupling. The system can run local-only and still exercise scheduled tasks like metrics rollups and rotation. When Kafka is enabled, the same scheduled tasks still work; they just become part of a larger asynchronous pipeline.

3. Architecture Overview

At runtime, the application is a single Spring Boot process with a few key internal layers that matter specifically for scheduling:

The outer boundary is the Spring application context (Week1IntegratedApplication). Inside that context, configuration beans provide core infrastructure:

com.example.week1.common.config.SchedulingConfig defines a shared ThreadPoolTaskScheduler used by Spring’s scheduling subsystem.
com.example.week1.common.config.JacksonConfig and RedisConfig provide serialization and Redis clients that scheduled tasks use indirectly (e.g., rate limiter windows, file offset tracking).

Scheduled task beans live in per-day modules:

Day 2 metrics tick: Day2LogGenerationService.updateMetrics()
Day 3 periodic collector stats: Day3LogCollectorService.logStatistics()
Day 5 rotation policy evaluation: Day5RotationPolicyService.evaluateRotationPolicies()

To keep “background” behavior observable, the project adds a minimal task registry and tracking aspect:

com.example.week1.common.scheduling.TaskRegistry stores TaskState objects keyed by a stable task id.
com.example.week1.common.scheduling.TaskTrackingAspect wraps methods annotated with @TrackedTask and updates state on start/success/failure.
com.example.week1.common.web.TaskRegistryController exposes the registry at /api/week1/scheduler/tasks.

Finally, the external boundary is the JVM/OS clock, which triggers the scheduler’s timing wheel. In production you’d also model this boundary in terms of deployment topology and clock skew; locally it’s enough to treat it as “time ticks happen.”

4. Data / Control Flow

Even though the project ingests and stores logs, the scheduling lifecycle is the piece that makes the system “keep moving” without humans poking it.

When the application starts, Spring initializes beans, then scheduling infrastructure scans for @Scheduled methods. Those methods are scheduled onto the shared scheduler defined in SchedulingConfig and begin waiting for time-based triggers.

On each trigger cycle, the runtime flow looks like this:

Startup → scheduler initialized → @Scheduled methods registered → WAIT for time → clock fires → scheduler picks a worker thread → the scheduled method executes → it emits logs/metrics and may update storage → the method returns → the thread goes back to the pool → the scheduler waits for the next trigger.

The task tracking aspect turns that flow into observable state transitions. When a tracked scheduled method begins, it is marked RUNNING in the TaskRegistry. If the method returns normally it is marked SUCCESS; if an exception propagates it becomes FAILED. In both cases the registry returns the task to WAITING so the next trigger can proceed.

5. State Management

A scheduled job isn’t just “a method that runs sometimes.” It has state that matters operationally: when it last ran, whether it is failing, and whether it is making progress. The integrated project tracks this state in-memory via TaskRegistry and TaskState:

TaskState.status records REGISTERED, WAITING, RUNNING, SUCCESS, FAILED, DISABLED.
TaskState.lastStartedAt and lastFinishedAt capture timing.
lastSuccessAt, lastFailureAt, and lastErrorMessage capture health.
totalRuns, totalSuccess, totalFailures capture a small history without needing a full timeseries DB.

The mechanism is intentionally small and transparent: methods annotated with @TrackedTask are intercepted by TaskTrackingAspect. This keeps the task beans themselves focused on their domain logic, while the cross-cutting concerns (timing, success/failure bookkeeping) live in one place.

You can view the current state at runtime via:

GET /api/week1/scheduler/tasks (implemented by TaskRegistryController)

This is not meant to replace Prometheus/Grafana; it’s a debugging surface that makes scheduling correctness visible immediately during local development.

6. Step-by-Step Implementation Guide

Github Link :

https://github.com/sysdr/sdc-java-p/tree/main/week_1_integrated_project/week_1_integrated_project

Start from the single application entry point in com.example.week1.Week1IntegratedApplication. This is where @EnableScheduling is activated and where component scanning is rooted.
Configure a real scheduler rather than relying on defaults. com.example.week1.common.config.SchedulingConfig provides a ThreadPoolTaskScheduler with a bounded pool and predictable thread names (week1-scheduler-*). This makes debugging and performance reasoning possible.
Define a minimal “task identity” abstraction with com.example.week1.common.scheduling.TrackedTask. In the real world, observability needs stable identifiers; method names are not enough once refactors begin.
Implement state storage with com.example.week1.common.scheduling.TaskRegistry and TaskState. Keep it thread-safe by construction (ConcurrentHashMap for registry, volatile fields on state).
Add cross-cutting tracking using com.example.week1.common.scheduling.TaskTrackingAspect. It wraps tracked scheduled methods and records RUNNING/SUCCESS/FAILED transitions along with timestamps and error summaries.
Annotate real scheduled methods so the registry reflects actual system behavior. In this project those methods live in:
- Day2LogGenerationService.updateMetrics()
  - Day3LogCollectorService.logStatistics()
    - Day5RotationPolicyService.evaluateRotationPolicies()
Expose the scheduler’s state as an API via com.example.week1.common.web.TaskRegistryController at /api/week1/scheduler/tasks. This endpoint is a local operator interface for verifying that tasks are alive and healthy.
Validate the end-to-end system using the existing operational endpoints and flows:
- POST /api/week1/day2/generator/start to create ongoing activity.
  - POST /api/week1/day1/logs to inject events.
    - GET /api/week1/scheduler/tasks to confirm periodic tasks are running and succeeding.

7. Key Engineering Insights

Thread safety isn’t optional in scheduled systems because there’s no user request boundary to “hide” behind. Day2LogGenerationService uses AtomicLong/AtomicInteger for counters because the scheduled tick (updateMetrics) is racing with generator threads that increment counters. If those updates weren’t atomic you’d get negative or inconsistent rates, which is the kind of bug that looks like “the system is flaky” rather than “the math is wrong.”

Fixed-rate and fixed-delay scheduling are not interchangeable. This project uses fixedRate for periodic accounting and rotation evaluation. In a production service you’d choose fixedDelay when work duration matters more than wall-clock alignment (e.g., “run 5 minutes after the last successful completion”), and fixedRate when you want alignment to time windows (e.g., “every second, compute a rolling rate”). Even locally, those differences show up when tasks get slow or the system is under load.

Cron expressions are powerful but brittle when you need dynamic behavior. This week’s code uses @Scheduled with explicit intervals, which keeps behavior testable and easy to reason about. In later weeks, adding cron schedules (or Quartz) makes sense when you need calendar-aligned jobs, but you still need the same observability and state tracking; cron does not remove the need for task health.

Testability improves when scheduling is separated from work. The tracking layer in TaskTrackingAspect makes task state independent from the business logic. This means you can unit test the domain work (e.g., Day4LogParsingService.parseLogEntry) without time, and then separately exercise scheduling behavior via integration tests that assert /api/week1/scheduler/tasks transitions.

8. Success Metrics

Modularity

You should be able to point to scheduling-related code without digging through business logic. Concretely:

SchedulingConfig contains scheduler configuration only.
Task tracking lives only in common/scheduling/*.
Day modules (day2, day3, day5) keep scheduled logic in their own services.

Readability

A reader should be able to answer “what tasks are scheduled?” quickly:

The scheduled methods are visible via @Scheduled on:
- Day2LogGenerationService.updateMetrics()
  - Day3LogCollectorService.logStatistics()
    - Day5RotationPolicyService.evaluateRotationPolicies()
Each of those methods has a stable id and description via @TrackedTask.

Correctness

Scheduling correctness isn’t “it compiles.” It’s observable behavior:

GET /api/week1/scheduler/tasks returns entries for day2.updateMetrics, day3.logStatistics, and day5.rotateColdFiles.
Each task transitions through RUNNING → SUCCESS repeatedly under normal operation.
When a task fails, lastErrorMessage is populated and totalFailures increases.

Extensibility

Adding a new scheduled task should not require copy/pasting tracking code:

A new task is created by adding @Scheduled + @TrackedTask to a method on any bean.
The registry and endpoint automatically reflect the new task id.
The scheduler pool size and behavior can be tuned in one place (SchedulingConfig).

9. Conclusion

After building and running this integrated project, you should see scheduling as a system concern rather than an annotation. The core idea is that time-based work needs the same discipline as request-driven code: bounded concurrency (ThreadPoolTaskScheduler), clear ownership (per-day modules), and observability (task state plus metrics).

From here, the natural next step is taking scheduling beyond a single JVM. Once you have multiple instances, “every instance runs the cron” becomes a correctness bug. That’s where distributed scheduling patterns come in: leader election, database-backed locks, Quartz clustered mode, or dedicated schedulers. The mechanics change, but the fundamentals you practiced here—explicit task identity, clear state, and careful concurrency—carry straight through.

Working Code Demo :

Day 60: Multi-Region Replication for Log Data

sdr11 — Tue, 26 May 2026 03:54:36 GMT

What We’re Building Today

Yesterday you implemented active-passive failover so critical components recover automatically. Today, you’re taking that foundation and turning it into a globally replicated log pipeline — the backbone of every major cloud-native observability stack in production.
By the end of this lesson you will have:

A multi-region Kafka replication topology that mirrors log events across two simulated geographic regions using MirrorMaker 2
A conflict-resolution strategy for out-of-order and duplicate events that arrive from separate regions during a split-brain scenario
Region-aware routing in the API Gateway that directs producers to the nearest region while keeping consumers globally consistent
End-to-end monitoring that surfaces replication lag, cross-region throughput, and divergence alerts in Grafana

Why This Matters

A single-region log system is an availability bet. The moment your datacenter loses power, your network partition hits, or your cloud provider has an outage affecting one zone, every log event generated during that window is either lost or delayed — and in observability, delayed logs are nearly as dangerous as missing logs because they break correlation windows used for incident diagnosis.
Multi-region replication is how Netflix kept its recommendation engine’s telemetry flowing during the 2012 US East outage, how Uber surfaces driver GPS traces in the nearest region while guaranteeing global query consistency, and how Amazon’s Route 53 health checks keep log ingestion alive across continents. The pattern is deceptively simple in marketing materials — “we just replicate” — but the engineering trade-offs around consistency, ordering, and deduplication are where real system designers earn their credibility. Today you’ll wrestle with those trade-offs hands-on.

System Design Deep Dive

Preparing for a distributed systems interview?

→Download the free Interview Pack

→ Subscribe now to access source code repository - 200 + coding lessons

1. Replication Topology: Active-Active vs. Active-Passive

Day 59 gave you active-passive failover within a single region. Multi-region replication forces a harder question: do both regions write simultaneously, or does one region always own the write path?

Active-passive replication keeps things simple. Region A is the primary; Region B is a hot standby that receives replicated data but does not accept producer writes under normal conditions. Failover promotes Region B to primary. The cost: Region B’s logs are always behind by the replication lag, so during a failover you accept a small window of potential data loss (RPO).

Active-active replication lets both regions accept writes simultaneously and cross-replicate. Throughput doubles and RPO drops to near-zero, but you immediately introduce the split-brain problem: if the replication link between regions fails, both regions continue accepting writes independently. When the link recovers, you have two divergent histories and must merge them.

For a log processing system, active-active is almost always the right call. Logs are append-only, high-volume, and latency-sensitive. The deduplication cost of merging after a split is far less than the cost of dropping events during an outage. Today’s implementation uses active-active with idempotency-key-based deduplication at the consumer layer to resolve conflicts.

Architectural Insight: Kafka’s MirrorMaker 2 was designed for active-active. It uses topic-offset translation to prevent replication loops — Region A’s copy of Region B’s topic is tagged with a source prefix, so MirrorMaker in Region B does not re-replicate it back. Understanding this loop-prevention mechanism is what separates a working multi-region Kafka setup from one that spirals into infinite replication.

2. Ordering Guarantees Across Regions

Kafka guarantees ordering within a partition on a single broker. Cross-region replication shatters that guarantee. An event produced in Region A at T=100ms may arrive in Region B at T=250ms, but an event produced in Region B at T=180ms is already sitting in Region B’s partition at T=180ms. A consumer reading Region B’s merged topic sees events out of order relative to their original production time.

The standard mitigation is a watermark-based reorder buffer at the consumer. Each consumer maintains a sliding window (configurable, typically 500ms–2s) and holds events until it’s confident no earlier event will arrive from the other region. Events older than the watermark are flushed in order; events that arrive after the watermark has passed are flagged as late arrivals and routed to a dead-letter partition for reconciliation.

This is exactly the pattern used in Google’s Dataflow and Flink’s event-time processing. The trade-off is clear: lower watermark = lower latency, higher risk of out-of-order emissions. Higher watermark = better ordering, higher end-to-end latency. Today’s implementation exposes the watermark as a tunable configuration property so you can observe the effect directly in your load tests.

3. Deduplication Strategy

When both regions accept writes and the replication link recovers after a split, some events will exist in both regions’ topics. A naive consumer will process them twice — which, for a log indexing system, means duplicate entries in your search backend and inflated metrics.

The production pattern is server-side idempotency with a distributed cache. Each log event carries an eventId (a UUID generated by the producer at ingestion time). The consumer checks a Redis-backed bloom filter before processing. If the eventId is already present, the event is silently discarded. Bloom filters give you probabilistic deduplication with near-zero memory overhead — the false-positive rate (roughly 0.01% at our configuration) is an acceptable trade-off for the memory savings versus a full hash set.

Uber’s log platform uses a similar two-tier approach: a fast bloom filter for the hot path, backed by a slower exact-match store for audit trails. You’ll implement both tiers today.

4. Region-Aware Routing

The API Gateway needs to know which region a producer request should target. In a real deployment this is DNS-based (GeoDNS routes to the nearest region). In our simulated environment, the gateway reads a X-Region header and routes accordingly, with a fallback to the local region if the target is unhealthy.

The key design decision here is sticky routing: once a producer is assigned to a region for a given session or logical group (e.g., all logs from a single microservice instance), it should stay on that region until a health event forces migration. This minimizes cross-region replication pressure and keeps event ordering more predictable within a single producer’s stream.

5. Monitoring Replication Health

Replication lag is the single most important operational metric for a multi-region system. If Region B’s consumer offset falls more than 10 seconds behind Region A’s producer offset, you’re approaching your RPO budget. Prometheus scrapes both regions’ Kafka exporters every 15 seconds; Grafana alerts fire when lag exceeds configurable thresholds.

Beyond lag, you need to track replication throughput (bytes/sec crossing the region boundary), split-brain duration (time between network partition and recovery), and deduplication hit rate (percentage of events filtered by the bloom filter). All four metrics are wired into today’s Grafana dashboard.

Implementation Walkthrough

GitHub Link :

https://github.com/sysdr/sdc-java-p/tree/main/day60/day60-multi-region-log-replication

Step 1: MirrorMaker 2 Configuration

The docker-compose.yml spins up two Kafka clusters (kafka-region-a, kafka-region-b) and two MirrorMaker 2 instances. Each MirrorMaker is configured with a source.cluster and target.cluster, and a replication factor of 2. The topic replication policy uses a regex filter (log-events-.*) so only your log topics are mirrored — not internal Kafka topics like __consumer_offsets.

The critical configuration property is replication.lag.max.bytes. Set this too high and your failover RPO balloons. Set it too low and MirrorMaker becomes a bottleneck under peak load. Start at 1MB and tune based on your load test results.

Step 2: Region-Aware Producer

KafkaProducerService now accepts a targetRegion parameter. Under the hood it maintains two KafkaProducer instances — one per region — and selects the appropriate one based on the routing decision from the gateway. If the target region’s producer fails to send within the configured timeout, the service falls back to the local region and logs a REGION_FALLBACK event for monitoring.

Step 3: Watermark Consumer

ReplicationAwareConsumer extends the base consumer with a ReorderBuffer. On each poll cycle, the buffer checks whether the current event’s eventTimestamp is within the watermark window. Events inside the window are held; events outside are emitted to the downstream processor in timestamp order. Late arrivals (events with timestamps older than the current watermark) are sent to the log-events-late dead-letter topic.

Step 4: Bloom Filter Deduplication

DeduplicationService wraps a Guava BloomFilter backed by a Redis HyperLogLog for cross-instance coordination. Each consumer instance maintains its own local bloom filter for speed, and periodically syncs its state to Redis so that if a consumer instance restarts, it can rebuild from the shared state. The sync interval is configurable — 30 seconds is the default, balancing consistency with network overhead.

Step 5: Integration and Load Testing

The load-test.sh script fires 10,000 events split evenly across both simulated regions over 60 seconds, then pauses replication for 5 seconds (simulating a split) and resumes. The integration test suite validates that after recovery, the total unique events processed across both regions equals 10,000 — confirming that deduplication caught all duplicates and the reorder buffer caught all late arrivals.

Working Demo Link :

Production Considerations

Latency budget: MirrorMaker 2 adds 50–200ms of replication latency depending on network distance. If your log consumers have SLAs tighter than that, you need region-local consumption with eventual cross-region consistency, not strong consistency.

Failure mode — MirrorMaker crash: If MirrorMaker goes down, replication stops but both clusters continue accepting writes independently. This is by design. When MirrorMaker restarts, it resumes from the last committed offset. No data is lost; lag temporarily spikes.

Capacity planning: Each region needs enough Kafka broker capacity to handle its own write load plus the full replication load from the other region. In practice, this means sizing each region for roughly 1.8x its expected standalone throughput to account for replication overhead and burst traffic.

Scale Connection

Netflix’s EVIPro (Event Processing Infrastructure) uses a topology nearly identical to what you built today: active-active Kafka clusters across three regions, MirrorMaker 2 for cross-region replication, and a custom deduplication layer at the consumer. Uber’s logging pipeline (documented in their 2021 engineering blog) applies the same watermark-based reorder buffer pattern for GPS event streams that must be globally consistent for driver tracking. Amazon’s CloudWatch Logs uses a similar region-aware routing strategy to keep log ingestion latency under 100ms globally.

This is just the starting point.

The paid content includes advanced lessons, frameworks, templates, practical projects, and deeper explanations to accelerate your learning journey.

Next Steps

Tomorrow: circuit breakers with Resilience4j — how to keep your system healthy when downstream components start failing.

Day 59: Implement Active-Passive Failover for Critical Components

sdr11 — Tue, 19 May 2026 09:00:50 GMT

What We’re Building Today

Active-passive failover mechanism for Kafka consumers with automatic leader election
Health monitoring system with heartbeat detection and failure recovery
Stateful component migration with zero data loss guarantees
Failover orchestration demonstrating sub-second recovery times

Why This Matters

High availability isn’t optional at scale—it’s the difference between 99.9% and 99.99% uptime, which translates to 8.76 hours versus 52 minutes of downtime per year. When Netflix streams to 200M+ subscribers or Uber coordinates millions of rides, a single point of failure can cascade into millions in lost revenue and eroded user trust.
Active-passive failover solves the fundamental distributed systems challenge: how do you ensure critical components continue operating when infrastructure fails? Unlike active-active architectures that require complex conflict resolution, active-passive provides simpler consistency guarantees while maintaining high availability. This pattern appears in everything from database replication (PostgreSQL streaming replication) to message broker cluster management (Kafka controller election) to API gateway redundancy.
Today’s implementation tackles the hardest part of failover: coordinating state migration without data loss while minimizing downtime. You’ll see why “just restart the service” isn’t sufficient at scale, and how distributed consensus enables automatic recovery.

System Design Deep Dive

Preparing for a distributed systems interview?

→ Download the free Interview Pack

→Subscribe now to access source code repository - 200 + coding lessons

Day 58: Build a Search API for Programmatic Access

sdr11 — Tue, 12 May 2026 11:19:19 GMT

What We’re Building Today

Today we’re implementing a production-grade RESTful search API that exposes our distributed log analytics system to programmatic clients. You’ll build:

RESTful API Gateway with versioned endpoints, pagination, and rate limiting
Query DSL Parser that translates HTTP requests into optimized Elasticsearch queries
Response Streaming for large result sets with cursor-based pagination
Authentication & Authorization layer with API key management and quota enforcement

Why This Matters

The gap between powerful analytics capabilities and programmatic access is where most distributed systems fail to deliver business value. Netflix processes 1 trillion events daily, but their value comes from exposing this data through APIs that power recommendations, fraud detection, and A/B testing. Your search infrastructure might index petabytes of logs, but without a well-designed API, teams resort to manual queries, CSV exports, and one-off scripts.
The challenge isn’t just exposing Elasticsearch through HTTP—it’s designing an API contract that remains stable as your backend evolves, scales to thousands of requests per second without crushing your cluster, and provides query flexibility while preventing abuse. Twitter’s Firehose API demonstrates this: they expose real-time tweet streams to thousands of clients while protecting cluster stability through sophisticated rate limiting and quota management. This lesson bridges analytics infrastructure and API design, teaching you to build interfaces that unlock data access without compromising system integrity.

System Design Deep Dive

Day 57: Full-Text Search with Relevance Scoring

sdr11 — Fri, 08 May 2026 07:53:28 GMT

What We’re Building Today

Elasticsearch integration for distributed full-text search across log streams
BM25 ranking algorithm implementation with custom scoring functions
Multi-field search supporting structured and unstructured log data
Real-time search API with sub-100ms query latency for millions of logs

Why This Matters

When Netflix processes 500 billion events daily, engineers need to find specific error patterns across distributed services in seconds, not hours. Traditional database queries fall apart at this scale—a LIKE '%error%' query on 100M log entries will bring PostgreSQL to its knees. Full-text search engines like Elasticsearch solve this by inverting the problem: instead of scanning documents for terms, they build inverted indices that map terms to documents instantly.
The challenge isn’t just speed—it’s relevance. When searching “authentication timeout user service,” you want logs matching all terms ranked higher than partial matches, recent errors prioritized over old ones, and critical severity logs surfaced first. This requires sophisticated scoring algorithms that understand both textual similarity and domain-specific importance signals.

System Design Deep Dive

Day 56: Real-Time Indexing of Incoming Logs

sdr11 — Mon, 04 May 2026 10:02:09 GMT

What We’re Building Today

Near-real-time indexing pipeline that indexes logs within 100ms of arrival
Distributed inverted index with LSM-tree optimization for high write throughput
Index coordination layer managing shard distribution and replication across nodes
Query API with millisecond-latency searches over freshly indexed data

Why This Matters

Search latency directly impacts incident response time. When a production outage occurs, every second engineers spend waiting for logs to become searchable costs money and customer trust. Netflix processes 1 trillion events daily—if their indexing pipeline had even a 10-second delay, teams would be flying blind during critical incidents. Real-time indexing transforms logs from historical artifacts into actionable intelligence.
The challenge isn’t just speed—it’s maintaining search quality while ingesting 50,000+ events per second. Traditional batch indexing offers perfect consistency but unacceptable latency. Stream-based indexing delivers speed but introduces complexity around partial updates, segment management, and query consistency. Today’s architecture balances these trade-offs using proven patterns from Elasticsearch, Splunk, and Datadog.

System Design Deep Dive

Day 55: Faceted Search for Multi-Dimensional Log Analytics

sdr11 — Thu, 30 Apr 2026 10:41:06 GMT

What We’re Building Today

Today we implement a production-grade faceted search system that enables simultaneous filtering across multiple dimensions of log data:

Distributed faceted index with Elasticsearch supporting 10,000+ queries/sec
Real-time facet aggregation computing filter counts across billions of log entries
Multi-dimensional filtering enabling complex queries like “ERROR logs from service=auth AND host=prod-* in last 15 minutes”
Intelligent caching layer with Redis reducing repeated aggregation overhead by 80%

Why This Matters: The Search at Scale Challenge

When Datadog ingests 1 trillion log events daily, users don’t just search for text—they filter by 20+ dimensions simultaneously: service name, environment, log level, HTTP status code, error type, region, host, container ID, and custom tags. Each facet must show real-time counts: “42,385 ERROR logs, 8,234 from auth-service, 3,102 in us-east-1.”
The naive approach—scanning all logs for each dimension—would require petabytes of I/O per second. Instead, production systems use inverted indexes, bitmap intersection, and distributed aggregation to deliver sub-second faceted search across massive datasets. Splunk’s faceted search handles 100,000+ concurrent users filtering terabytes of logs, while Elastic’s distributed aggregation framework powers real-time analytics for half of Fortune 500 companies.
The architectural challenge: how do you maintain accurate facet counts while ingesting 500,000 events/second, support complex boolean queries with 10+ filters, and cache strategically when facet distributions change every second? Today’s lesson implements the patterns that make billion-event search systems feel instantaneous.

System Design Deep Dive

1. Inverted Indexes: The Foundation of Faceted Search

Traditional log storage organizes by time: 2024-01-28 10:00:00 | ERROR | auth-service | Login failed. Faceted search requires the inverse: given level=ERROR, which logs match? Elasticsearch and Solr build inverted indexes mapping each facet value to document IDs:

service:auth-service → [doc1, doc5, doc89, doc234, ...]
level:ERROR → [doc1, doc12, doc89, doc156, ...]
region:us-east-1 → [doc5, doc12, doc234, ...]

Trade-off Analysis: Inverted indexes consume 30-40% additional storage but reduce faceted query time from O(n) full scans to O(k) where k is the result set size. For 1 billion logs, this transforms a 10-minute scan into a 50ms index lookup. The cost: write amplification. Each log ingestion updates multiple indexes—one per facet. At scale, Datadog uses LSM-tree indexes (like Lucene) that batch writes and compact asynchronously, trading slightly stale facets (1-2 second lag) for 10x better write throughput.

2. Bitmap Intersection: Combining Multiple Facets

When users query service=auth AND level=ERROR AND region=us-east-1, the system must intersect three result sets. The naive approach iterates through each set sequentially—O(n1 + n2 + n3). Production systems use compressed bitmaps (Roaring Bitmaps) for bitwise AND operations completing in microseconds.

For 1 million matching document IDs, a sorted integer array consumes 4MB. A compressed bitmap uses 50KB with O(1) intersection. Twitter’s search infrastructure reduced facet query latency by 90% switching from hash-set intersection to roaring bitmaps. The pattern: represent document IDs as bit positions in a bitmap, use hardware-accelerated SIMD instructions for parallel intersection.

Failure Mode: Sparse bitmaps (few set bits in large ranges) compress poorly and waste memory. Pinterest’s search team dynamically chooses between bitmaps and skip lists based on result set density—bitmaps for >1% density, skip lists otherwise.

3. Distributed Aggregation: Computing Facet Counts

The hardest part of faceted search isn’t finding matching documents—it’s computing counts for each facet value in real-time. “Show me all ERROR logs AND tell me how many are from each service” requires scanning matching docs and aggregating dimensions.

Elasticsearch’s distributed aggregation uses a scatter-gather pattern:

Scatter: Send query to all shards (partitioned by time or hash)
Local Aggregation: Each shard computes top-K facet values with counts
Gather: Coordinator merges results, re-ranks, and returns final counts

For a query matching 10 million logs across 50 shards, each shard processes 200K docs locally, returns top 1000 facet values. The coordinator merges 50 × 1000 = 50K facet values into final top 1000. This reduces network transfer from 10 million to 50K results—a 200× reduction.

Scalability Bottleneck: Aggregation memory scales with cardinality (distinct facet values), not document count. A field with 1 million unique values requires 1M × 8 bytes = 8MB per shard just for facet counts. LinkedIn’s search infrastructure uses HyperLogLog for approximate counts when exact values aren’t critical—1.5% error with 99% less memory.

4. Intelligent Caching: Reducing Repeated Aggregation

Facet distributions change slowly—the count of ERROR logs fluctuates, but relative proportions (auth-service: 30%, api-service: 25%) remain stable over minutes. Uber’s log search system caches facet aggregations in Redis with 60-second TTLs:

cache_key = "facets:service:level=ERROR:last_15min:20240128T1000"
cache_value = {
  "auth-service": 8234,
  "api-service": 6012,
  "payment-service": 4829
}

The cache hit rate reaches 80% for dashboard queries where users repeatedly filter similar time ranges. Cache invalidation uses time-bucketed keys—cache expires naturally when time window shifts.

Anti-Pattern: Caching final search results per user query. With 20 facets and 10+ values each, the query space explodes to billions of combinations. Cache aggregations at the dimension level (counts per facet), not query level.

5. Query Optimization: Choosing Execution Paths

Not all faceted queries are equal. level=ERROR AND service=auth (matching 0.1% of logs) should execute differently than level!=ERROR AND service!=auth (matching 99.9%). Production query planners use statistics to choose optimal execution:

Strategy 1 - Filter First: For selective filters (matches <5% of docs), retrieve matching doc IDs, then aggregate facets. Used when result set is small.

Strategy 2 - Aggregate First: For broad filters (matches >50%), compute all facet distributions, then filter. Avoids repeated aggregation for exploratory queries.

Elasticsearch’s query planner estimates cardinality using index statistics and chooses execution strategy automatically. Netflix’s search infrastructure reduced P99 latency from 8 seconds to 400ms by implementing cardinality-aware query planning—pre-computing aggregations for high-cardinality fields, streaming results for low-cardinality fields.

Implementation Walkthrough

GitHub Link :

https://github.com/sysdr/sdc-java/tree/main/day55/day55-faceted-search-system

Our system implements faceted search with four services:

1. Faceted Search Service (Spring Boot + Elasticsearch)

The search service maintains Elasticsearch indexes with faceted fields. Each log document includes dimensions as analyzed fields:

@Document(indexName = "logs")
public class LogDocument {
    private String id;
    private Instant timestamp;
    private String level;      // Facet: ERROR, WARN, INFO
    private String service;    // Facet: auth-service, api-service
    private String environment; // Facet: prod, staging
    private String host;       // Facet: prod-01, prod-02
    private String message;    // Full-text searchable
}

The search API accepts multi-dimensional filters and returns both matching logs and facet counts:

@PostMapping("/search")
public FacetedSearchResponse search(@RequestBody SearchRequest request) {
    // Build Elasticsearch query with multiple filters
    BoolQueryBuilder query = QueryBuilders.boolQuery();
    request.getFilters().forEach((field, value) -> 
        query.filter(QueryBuilders.termQuery(field, value))
    );
    
    // Add facet aggregations for each dimension
    SearchSourceBuilder searchBuilder = new SearchSourceBuilder()
        .query(query)
        .size(request.getLimit())
        .aggregation(AggregationBuilders.terms("by_level").field("level"))
        .aggregation(AggregationBuilders.terms("by_service").field("service"))
        .aggregation(AggregationBuilders.terms("by_environment").field("environment"));
    
    // Execute distributed query across all shards
    SearchResponse response = elasticsearchClient.search(searchBuilder);
    
    return buildFacetedResponse(response);
}

2. Aggregation Service (Kafka Streams)

Real-time facet counts flow through a Kafka Streams aggregation pipeline. As logs ingress, the service maintains materialized views of facet distributions:

@Bean
public KStream buildFacetAggregations(StreamsBuilder builder) {
    return builder.stream("logs")
        .groupBy((key, log) -> log.getService())
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
        .count(Materialized.as("service-counts"))
        .toStream()
        .to("facet-aggregations");
}

These pre-computed aggregations seed the cache, reducing query-time aggregation overhead.

3. Cache Layer (Redis)

Facet aggregations cache in Redis with composite keys capturing query dimensions:

public Map getCachedFacets(String field, Map filters) {
    String cacheKey = buildFacetCacheKey(field, filters);
    Map cached = redisTemplate.opsForHash().entries(cacheKey);
    
    if (!cached.isEmpty()) {
        return cached.entrySet().stream()
            .collect(Collectors.toMap(
                e -> e.getKey().toString(),
                e -> Long.parseLong(e.getValue().toString())
            ));
    }
    
    // Cache miss - compute from Elasticsearch
    Map facets = computeFacetsFromElasticsearch(field, filters);
    redisTemplate.opsForHash().putAll(cacheKey, facets);
    redisTemplate.expire(cacheKey, Duration.ofMinutes(1));
    return facets;
}

4. Query Planning

Before executing, the system estimates result set size using Elasticsearch count API and chooses execution strategy:

public SearchStrategy chooseStrategy(SearchRequest request) {
    long estimatedMatches = elasticsearchClient.count(buildQuery(request));
    long totalDocs = elasticsearchClient.count(matchAllQuery());
    double selectivity = (double) estimatedMatches / totalDocs;
    
    // <5% of docs: filter-first strategy (retrieve docs, then aggregate)
    if (selectivity < 0.05) {
        return SearchStrategy.FILTER_FIRST;
    }
    // >50% of docs: aggregate-first strategy (compute facets, then filter)
    else if (selectivity > 0.50) {
        return SearchStrategy.AGGREGATE_FIRST;
    }
    // Medium selectivity: standard scatter-gather
    return SearchStrategy.SCATTER_GATHER;
}

Working Demo Link:

Production Considerations

Performance Optimization: Elasticsearch heap size must accommodate aggregation memory. For 100M unique facet values per shard, allocate 4GB heap minimum. Monitor indices.fielddata.memory_size and implement circuit breakers to prevent OOM during high-cardinality aggregations.

Facet Explosion: Users creating custom tags can introduce unbounded cardinality. Implement dynamic field limits—Splunk restricts to 10,000 unique values per facet per day. Beyond that, treat as full-text searchable but not faceted.

Failure Scenarios: When Elasticsearch shards become unavailable, return partial results with a “incomplete data” warning rather than failing completely. Netflix’s search UI shows “Search results from 45/50 datacenters” during partial outages, maintaining user trust through transparency.

Monitoring: Track facet cache hit rates (target >70%), P99 aggregation latency (target <500ms), and Elasticsearch heap usage (alert above 85%). Set up alerts for abnormal cardinality growth indicating potential facet explosion attacks or bugs.

Scale Connection: FAANG-Level Faceted Search

Datadog processes 100,000 faceted search queries per second across 1 trillion daily log events. Their distributed search cluster uses 500+ Elasticsearch nodes, each maintaining sharded time-series indexes. Facet aggregations execute in parallel across shards with results merged in <200ms. They cache aggregations in a distributed Redis cluster with 100+ nodes, achieving 85% cache hit rates for common queries.

Elastic Cloud powers faceted search for companies like Tinder, Slack, and Walmart. Their most demanding workload—Walmart’s inventory search—handles 50,000 queries/sec with 15+ simultaneous facets across 500 billion documents. The system uses Elasticsearch’s frozen tier to archive old indexes while maintaining query support, reducing storage costs by 90%.

Next Steps

Tomorrow we implement real-time indexing with minimal latency, building sub-second indexing pipelines that make logs searchable within 100ms of generation—the final piece for production-grade observability.

Day 54: Implement a Query Language for Complex Log Searches

sdr11 — Sat, 25 Apr 2026 23:38:57 GMT

Stop reading about architecture and start building it. Theory only gets you so far. Commit to a daily coding habit where you implement distributed systems from the ground up. No abstractions, no shortcuts—just pure engineering. Start your first lesson today. Upgrade to Lifetime Access subscription & Get one year free subscription to our hands on course portal : “systemdrd.com” that offers wide variety of hands on courses on covering multiple high demanding technologies.

What We’re Building Today

Today we’re implementing a production-grade SQL-like query language for distributed log analytics. By the end of this lesson, you’ll have:

Query Parser & Lexer: ANTLR-based parser supporting SELECT, WHERE, GROUP BY, ORDER BY, and aggregation functions
Distributed Query Executor: Coordinator that pushes query execution to indexed nodes from Day 53’s distributed index
Query Optimizer: Cost-based optimizer that rewrites queries to leverage existing indexes and minimize network traffic
Result Aggregator: Merge-sort implementation that combines partial results from multiple nodes while maintaining memory efficiency

Why This Matters

Every log analytics platform at scale needs a query interface that abstracts complexity while maintaining performance. Elasticsearch powers queries at Uber processing 10+ billion logs daily. Splunk’s Search Processing Language handles petabytes at major financial institutions. Google’s Dremel (BigQuery) executes SQL queries across distributed columnar storage.

The challenge isn’t just parsing syntax—it’s distributing execution intelligently. A naive query that scans all logs across 100 nodes will timeout. The key architectural insight: push computation to data, not data to computation. Query coordinators must rewrite queries into node-local operations, leverage distributed indexes, and aggregate results incrementally.

This lesson bridges your Day 53 distributed indexing with practical querying. Without a sophisticated query layer, your indexes are useless. Users need SQL familiarity, not custom APIs. Production query engines must handle complex predicates, joins across time windows, and real-time aggregations while maintaining sub-second latency.

System Design Deep Dive

Day 53: Distributed Indexing Across Multiple Nodes

sdr11 — Wed, 22 Apr 2026 01:20:35 GMT

What We’re Building Today

Partitioned search index spanning 3+ independent index nodes using consistent hashing
Scatter-gather query coordinator that parallelizes searches across shards
Index replication layer ensuring fault tolerance with primary/replica topology
Shard routing service that maps log entries to appropriate index partitions based on tenant ID

Why This Matters

Yesterday’s inverted index hit a wall at ~500GB of log data. Single-node indexes face three critical bottlenecks: index size exceeds available RAM forcing disk swaps, write throughput caps at disk I/O limits (~10K writes/sec), and query latency degrades linearly with index growth. Elasticsearch solves this by distributing indexes across hundreds of nodes, processing 10M+ documents per second. LinkedIn’s distributed search infrastructure handles 15TB of indexed data across 200+ nodes, serving queries in <50ms p99. Today we’re implementing the core pattern that powers these systems: horizontal index partitioning with coordinated scatter-gather queries.
The architectural insight: distributed indexes trade consistency for availability and partition tolerance. When you shard an index, you’re choosing AP over CA in the CAP theorem—accepting eventual consistency in exchange for horizontal scalability and fault tolerance. Understanding this trade-off is fundamental to system design at scale.

System Design Deep Dive

Day 52: Implement a Simple Inverted Index for Log Searching

sdr11 — Sat, 18 Apr 2026 04:17:13 GMT

Looking for Professional Growth ?

The difference between a "design interview" and a "production system" is massive. Close that gap today with the Hands on Distributed Log System Building course. Get 40% off for a limited time: https://sdcourse.substack.com/fbbab0d8

What We’re Building Today

Real-time inverted index that tokenizes and indexes log messages as they arrive via Kafka
Search API with relevance scoring and ranked results for natural language queries
Index persistence layer using Redis for hot data and PostgreSQL for cold storage
Query processing engine supporting boolean operators and phrase matching

Why This Matters

Every major observability platform—Splunk, Datadog, Elastic—runs on inverted indices. When you search “ERROR user authentication failed” across billions of log entries and get results in milliseconds, you’re querying an inverted index. This data structure powers everything from application monitoring to security incident response.

Without inverted indices, log search would require scanning every log entry linearly—O(n) complexity that becomes impossible at scale. An inverted index transforms this into O(k) lookups where k is the number of query terms, enabling sub-second searches across terabytes of logs. Understanding inverted indices is fundamental to building search infrastructure that scales from thousands to trillions of documents.

Today’s implementation bridges the gap between local prototypes and production search engines, showing how the same architectural patterns scale from single-node deployments to distributed clusters processing petabytes daily.

System Design Deep Dive

Day 51: Build Dashboards for Visualizing Analytics Results

sdr11 — Tue, 14 Apr 2026 11:30:53 GMT

What We’re Building Today

Real-time analytics dashboard consuming aggregated metrics from Kafka streams
WebSocket-based push architecture delivering sub-second metric updates to browsers
Multi-dimensional visualization service supporting time-series, histograms, and geographic heatmaps
Query optimization layer with Redis caching and PostgreSQL time-series partitioning

Why This Matters

At scale, the gap between generating metrics and making them actionable determines your incident response time. Netflix processes 500 billion events daily, but their dashboard systems compress this into 200ms query responses because engineers can’t wait 30 seconds to see if a deployment broke something. When Uber’s surge pricing algorithms trigger, dashboard systems must surface the decision rationale within 100ms or drivers can’t understand why rates changed.
The architectural challenge isn’t building charts—it’s designing systems that maintain query responsiveness as data volume grows exponentially. Your dashboard becomes the bottleneck between detecting problems and fixing them. Poor dashboard architecture means your monitoring system generates alerts 5 minutes before your engineers can see the underlying data.

System Design Deep Dive

Day 50: Alert Generation Based on Log Patterns

sdr11 — Fri, 10 Apr 2026 04:30:54 GMT

Upgrade to get one month free subscription to our hands on course systemdrd.com that offer wide variety of hands on courses on covering various technologies.

Subscribe to our portal for systemdrd.com & get lifetime access to “Hands On System Design with “Distributed Systems Implementation with python and javascript” and this “Distributed Log Implementation With Java & Spring Boot”

What We’re Building Today

A production-grade distributed alerting system that monitors log patterns in real-time and triggers intelligent notifications:

Real-time alert rule engine processing 50,000+ events/second with Kafka Streams
Smart alert manager with deduplication, correlation, and escalation logic
Multi-channel notification service supporting email, Slack, and PagerDuty integration
Alert configuration API for dynamic rule management without system restarts

Why This Matters

Alert generation is where distributed log processing transitions from passive observation to active operational response. At scale, naive alerting becomes your biggest operational burden—Netflix processes 2 billion alerts daily but only acts on 0.01% of them. The challenge isn’t detecting problems; it’s preventing alert fatigue while ensuring critical issues never slip through.
Poor alerting architectures create alert storms during outages (compounding incident response), suffer from flapping alerts that erode trust, generate excessive false positives that train teams to ignore notifications, and fail during the very incidents they’re designed to detect. Production alerting requires sophisticated state management, intelligent suppression, and fault-tolerant delivery mechanisms that work when your primary systems are degraded.