Day 5: Distributed Log Storage with Intelligent Rotation Policies
What We’re Building Today
Today we’ll architect and implement a production-grade distributed log storage system that goes far beyond simple flat files. You’ll build:
• Multi-tier storage architecture with hot/warm/cold data policies and intelligent rotation based on access patterns and retention requirements
• Distributed log ingestion gateway that handles millions of events per second with backpressure control and graceful degradation
• Real-time stream processing pipeline using Kafka for immediate log analysis while maintaining persistent storage for historical queries
• Scalable query engine with Redis caching and PostgreSQL indexing that can search terabytes of log data in milliseconds
Why This Matters: The Scale Reality Check
When Netflix processes 500 billion log events daily or Uber handles 100TB of log data per day, they don’t just dump everything to files and hope for the best. The challenge isn’t just storing logs—it’s building systems that remain performant and cost-effective as data volume grows exponentially.
The fundamental distributed systems challenge here is the CAP theorem in practice: you want immediate availability for real-time monitoring (A), consistency for audit trails (C), and partition tolerance when network issues split your storage cluster (P), yet the theorem guarantees that during a partition you must sacrifice either availability or consistency. The architectural decisions you make today about rotation policies, replication strategies, and query patterns will determine whether your system scales to enterprise levels or collapses under its own weight.
Most engineers think log storage is solved by “just use Elasticsearch” or “dump everything to S3,” but the real complexity emerges when you need sub-second query performance on historical data while maintaining cost-effective storage for compliance retention. The patterns we implement today form the foundation of every major observability platform.
System Design Deep Dive: Five Critical Distributed Patterns
1. Hierarchical Storage Management (HSM) Pattern
The core insight here is that not all log data has equal value over time. Recent logs need millisecond access times for incident response, while month-old logs can tolerate minute-level retrieval for compliance queries.
Our implementation uses a three-tier strategy:
• Hot Storage (Redis): Last 24 hours, sub-millisecond access, expensive but critical for real-time monitoring
• Warm Storage (PostgreSQL): 30-day window, optimized for complex queries with sub-second response times
• Cold Storage (File-based with compression): Long-term retention, cost-optimized with acceptable retrieval delays
The rotation trigger logic considers both time and access patterns. A log entry that’s frequently queried stays in warm storage longer, while rarely-accessed data moves to cold storage ahead of schedule. This adaptive approach reduces storage costs by 60-70% compared to naive time-based rotation.
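Here's a minimal sketch of that adaptive decision, assuming a per-entry query-rate signal is available; the class name, thresholds, and metric are illustrative, not the project's actual code:

import java.time.Duration;
import java.time.Instant;

public class TierSelector {

    enum Tier { HOT, WARM, COLD }

    // Illustrative thresholds mirroring the three tiers above
    private static final Duration HOT_WINDOW = Duration.ofHours(24);
    private static final Duration WARM_WINDOW = Duration.ofDays(30);
    private static final double FREQUENT_QUERIES_PER_DAY = 10.0; // "hot" access
    private static final double IDLE_QUERIES_PER_DAY = 1.0;      // demote early below this

    Tier selectTier(Instant loggedAt, double queriesPerDay) {
        Duration age = Duration.between(loggedAt, Instant.now());
        if (age.compareTo(HOT_WINDOW) <= 0) {
            return Tier.HOT;              // last 24 hours live in Redis
        }
        if (queriesPerDay >= FREQUENT_QUERIES_PER_DAY) {
            return Tier.WARM;             // frequent queries pin data in warm storage
        }
        if (age.compareTo(WARM_WINDOW) <= 0 && queriesPerDay >= IDLE_QUERIES_PER_DAY) {
            return Tier.WARM;             // normal 30-day warm window
        }
        return Tier.COLD;                 // idle or aged out: compress and archive
    }
}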
Trade-off Analysis: This complexity requires sophisticated lifecycle management and monitoring. The benefit is dramatic cost reduction and maintained performance, but you’re now managing three distinct query paths and data consistency between tiers.
2. Backpressure and Circuit Breaker Pattern
When log ingestion rates spike—think Black Friday for e-commerce or incident storms—your system needs graceful degradation rather than cascading failures.
Our implementation uses Resilience4j circuit breakers with three states:
• Closed: Normal operation, full processing
• Open: System overloaded, reject new logs with HTTP 503 but preserve system stability
• Half-Open: Gradual recovery testing, slowly increasing throughput
The backpressure mechanism implements token bucket rate limiting combined with priority queuing. Critical logs (errors, security events) get preferential treatment over debug logs during high-load periods.
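As a sketch, wiring those three states with Resilience4j might look like this; the breaker name and thresholds are assumptions, and the token-bucket and priority-queue layer is omitted for brevity:

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

public class IngestionBreaker {

    private final CircuitBreaker breaker = CircuitBreaker.of(
        "log-ingestion",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                        // Closed -> Open at 50% failures
            .waitDurationInOpenState(Duration.ofSeconds(30)) // cool-off before Half-Open
            .permittedNumberOfCallsInHalfOpenState(10)       // gradual recovery probes
            .slidingWindowSize(100)                          // failure rate over last 100 calls
            .build());

    // Returns false when the breaker is open; the caller maps that to HTTP 503
    public boolean tryIngest(Runnable storeEvent) {
        try {
            breaker.executeRunnable(storeEvent);
            return true;
        } catch (CallNotPermittedException rejected) {
            return false;   // overloaded: shed load instead of cascading failure
        }
    }
}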
Architectural Insight: The key realization is that losing some debug logs during peak load is acceptable, but losing error logs is catastrophic. This priority-based approach maintains system observability when you need it most.
3. Write-Ahead Log (WAL) with Async Replication
For compliance and audit requirements, log data cannot be lost. Our implementation uses a Write-Ahead Log pattern where incoming logs are immediately persisted to a durable commit log before processing.
The replication strategy employs async replication to maintain throughput while ensuring durability:
• Primary node writes to local WAL and confirms receipt
• Async replication to secondary nodes happens in background
• Query routing favors primary for latest data, secondaries for historical queries
Trade-off: We choose eventual consistency over strong consistency to maintain write throughput. For most log processing use cases, a few seconds of replication lag is acceptable compared to the performance hit of synchronous replication.
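A stripped-down sketch of that write path, assuming one append-only file per node; the replication transport is deliberately left as a stub:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WriteAheadLog {

    private final FileOutputStream out;
    private final ExecutorService replicator = Executors.newSingleThreadExecutor();

    public WriteAheadLog(String path) throws IOException {
        this.out = new FileOutputStream(path, true);       // append-only commit log
    }

    // Durable append: returns only after the entry reaches local disk
    public synchronized void append(String entry) throws IOException {
        out.write((entry + "\n").getBytes(StandardCharsets.UTF_8));
        out.getFD().sync();                                // fsync before confirming receipt
        replicator.submit(() -> shipToSecondaries(entry)); // async, off the hot path
    }

    private void shipToSecondaries(String entry) {
        // Hypothetical transport to secondary nodes; the lag introduced here
        // is the "few seconds" the trade-off below accepts
    }
}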
4. Distributed Query Federation
When your log data spans multiple storage tiers and potentially multiple data centers, query execution becomes a distributed systems problem. Our federation layer implements:
• Query planning: Analyzes time ranges and predicates to determine which storage tiers to query
• Parallel execution: Distributes query workload across available storage nodes
• Result merging: Combines results from multiple sources while maintaining chronological ordering
• Intelligent caching: Stores frequently-accessed query results in Redis for sub-second repeated access
The query planner uses cost-based optimization, considering network latency, storage tier performance characteristics, and current system load to generate optimal execution plans.
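At its core, the tier-selection part of that plan is a time-range intersection. A sketch, with cost-based weighting and load awareness omitted and tier boundaries mirroring the rotation windows above:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class QueryPlanner {

    enum Tier { HOT, WARM, COLD }

    // Which tiers can hold data inside the requested [from, to] range?
    List<Tier> plan(Instant from, Instant to) {
        Instant hotBoundary = Instant.now().minus(Duration.ofHours(24));
        Instant warmBoundary = Instant.now().minus(Duration.ofDays(30));
        List<Tier> tiers = new ArrayList<>();
        if (to.isAfter(hotBoundary)) tiers.add(Tier.HOT);
        if (from.isBefore(hotBoundary) && to.isAfter(warmBoundary)) tiers.add(Tier.WARM);
        if (from.isBefore(warmBoundary)) tiers.add(Tier.COLD);
        return tiers;   // each selected tier is queried in parallel, then merged
    }
}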
5. Observable System Design
Production log processing systems need comprehensive observability—you’re building the system that monitors other systems, so it must monitor itself effectively.
Our implementation includes:
• Distributed tracing: Every log processing request gets a trace ID that follows the data through ingestion, storage, and querying
• Custom metrics: Storage tier utilization, rotation policy effectiveness, query performance percentiles
• Health check endpoints: Deep health checks that verify not just service availability but data consistency across storage tiers
• Alerting on SLA violations: Response time degradation, storage tier migration delays, replication lag
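As one example, a deep health check can be expressed through Spring Boot Actuator's HealthIndicator contract; the tier-consistency probe below is a hypothetical placeholder:

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class StorageTierHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        // Deeper than liveness: verify the same recent entry is visible
        // in both hot (Redis) and warm (PostgreSQL) storage
        if (!probeTierConsistency()) {
            return Health.down()
                    .withDetail("tiers", "hot/warm divergence detected")
                    .build();
        }
        return Health.up().build();
    }

    private boolean probeTierConsistency() {
        // Hypothetical probe: write a marker entry, read it back from each tier
        return true;
    }
}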
Production Reality: The monitoring system often generates more log data than the applications it’s monitoring. This recursive complexity requires careful design to avoid monitoring loops and resource exhaustion.
Implementation Walkthrough: Building for Scale
Our architecture implements a microservices pattern with three core services that can be independently scaled and deployed.
Log Producer Service: The Ingestion Gateway
The producer service handles the critical path—accepting log data from applications and ensuring it’s durably stored without blocking the caller. The key architectural decision is using async processing with immediate acknowledgment:
@RestController
public class LogIngestionController {

    @PostMapping("/logs")
    public ResponseEntity<String> ingestLog(@RequestBody LogEvent event) {
        // Immediate WAL write for durability: the event is on disk
        // before we acknowledge the caller
        walService.append(event);
        // Async hand-off for performance: the HTTP thread never blocks on Kafka
        kafkaProducer.send("log-events", event);
        // 202 Accepted signals "durably received, processing deferred"
        return ResponseEntity.accepted().body("Log accepted");
    }
}
The service implements circuit breaker patterns to handle downstream failures gracefully. When Kafka is unavailable, logs accumulate in the WAL and replay automatically when connectivity resumes.
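The replay path might look like the sketch below, continuing the controller above; readUnshipped, markShipped, and kafkaHealthy are hypothetical helpers, while @Scheduled is Spring's standard scheduling annotation:

// Runs inside the same service as the controller above
@Scheduled(fixedDelay = 5000)                // probe every 5 seconds
public void replayPendingEntries() {
    if (!kafkaHealthy()) {
        return;                              // still down: keep accumulating in the WAL
    }
    for (LogEvent event : walService.readUnshipped()) {  // hypothetical cursor
        kafkaProducer.send("log-events", event);
        walService.markShipped(event);       // advance the cursor, never replay twice
    }
}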
Log Consumer Service: The Processing Engine
The consumer service handles the complex logic of storage tier management and rotation policies. It processes the Kafka stream and makes intelligent decisions about where each log should be stored based on content analysis and access patterns.
The rotation policy engine evaluates multiple criteria:
• Time-based policies (configurable by log source)
• Size-based policies (prevents any single storage tier from becoming a bottleneck)
• Access pattern analysis (frequently queried logs stay in faster tiers longer)
• Cost optimization (automatic migration to cheaper storage as data ages)
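Tying this back to the pipeline, the consumer's routing step might look like the sketch below. Spring Kafka's @KafkaListener is real; the tier selector (sketched earlier), the LogEvent accessor, and the three store clients are illustrative stand-ins:

// Consumes the Kafka stream and routes each event to a storage tier
@KafkaListener(topics = "log-events")
public void onLogEvent(LogEvent event) {
    // New data has no access history yet, so the decision is purely time-based
    Tier tier = tierSelector.selectTier(event.timestamp(), 0.0);
    switch (tier) {
        case HOT  -> redisStore.save(event);     // last 24 hours
        case WARM -> postgresStore.save(event);  // 30-day query window
        case COLD -> coldStore.append(event);    // compressed long-term retention
    }
}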
API Gateway: The Query Federation Layer
The gateway service provides a unified query interface that transparently spans all storage tiers. It implements sophisticated query planning to minimize response times while controlling costs.
The query planner analyzes incoming requests and generates execution plans that may involve:
• Cache-first lookups for recently executed queries
• Parallel queries across multiple storage tiers
• Intelligent result streaming for large datasets
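A compact sketch of that execution strategy follows; the cache and tier-client interfaces are illustrative, and result streaming is reduced to an in-memory merge:

import java.util.List;
import java.util.concurrent.CompletableFuture;

public class FederatedQueryExecutor {

    interface ResultCache { List<String> get(String key); void put(String key, List<String> rows); }
    interface TierClient  { List<String> query(String q); }

    private final ResultCache cache;        // e.g. backed by Redis
    private final List<TierClient> tiers;   // the tiers selected by the planner

    FederatedQueryExecutor(ResultCache cache, List<TierClient> tiers) {
        this.cache = cache;
        this.tiers = tiers;
    }

    List<String> execute(String query) {
        List<String> cached = cache.get(query);  // cache-first lookup
        if (cached != null) {
            return cached;
        }
        // Fan out to the selected tiers in parallel
        List<CompletableFuture<List<String>>> futures = tiers.stream()
                .map(t -> CompletableFuture.supplyAsync(() -> t.query(query)))
                .toList();
        // Merge, keeping chronological order (rows assumed timestamp-prefixed)
        List<String> merged = futures.stream()
                .flatMap(f -> f.join().stream())
                .sorted()
                .toList();
        cache.put(query, merged);                // warm the cache for repeat queries
        return merged;
    }
}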