Week 4: Integrated - Build a Distributed Log Storage Cluster: Replication, Quorums, and Repair

May 30, 2026

What we build today

By the end of this lesson you will run a three-node log storage cluster that:

Partitions logs by source or time
Places replicas with consistent hashing
Commits writes with a write quorum (W=2) and serves reads with a read quorum (R=2)
Queries ERROR logs across partitions
Runs anti-entropy (Merkle diff + repair)
Shows live metrics on a dashboard; click Run Demo to add more data

Success criteria: the dashboard's ingested and repairs_completed counters increase after each Run Demo, and last_update in /metrics changes every second.

Where this sits in the full system

Week 4 is the storage spine of our 254-day log platform. Upstream weeks ingest and serialize logs; downstream weeks add Kafka, search, and APIs. Today you own where bytes live, how many copies exist, and how the cluster heals drift. These are the same problems that Cassandra, Dynamo-style key-value stores, and log shard managers solve at billion-event scale.

Core concepts (non-obvious insights)

Partitioning ≠ sharding keys. Partitioning decides which logical slice owns a log for queries; consistent hashing decides which physical nodes hold that slice's replicas. Confusing the two causes “balanced partitions but hot nodes.”
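
To make the two steps concrete, here is a minimal sketch that keeps them separate. The names partition_for, HashRing, and the node labels are hypothetical, not taken from the course repository, and the virtual-node count and partition count are arbitrary assumptions.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def partition_for(source: str, num_partitions: int = 8) -> int:
    """Logical step: which partition owns logs from this source."""
    return stable_hash(source) % num_partitions


class HashRing:
    """Physical step: which nodes hold the replicas of a partition."""

    def __init__(self, nodes, vnodes=64):
        # Each node appears at many points on the ring to even out load.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, partition: int, n: int = 3):
        """Walk clockwise from the partition's hash, collecting n distinct nodes."""
        n = min(n, self._node_count)
        i = bisect.bisect_right(self._points, stable_hash(f"partition-{partition}"))
        chosen = []
        while len(chosen) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            i += 1
        return chosen


ring = HashRing(["node-a", "node-b", "node-c"])
partition = partition_for("payments-service")
print(partition, ring.replicas(partition))
```

With three nodes and three replicas every node holds every partition, so only the order (which node is tried first) varies. The split starts to matter once the cluster grows beyond the replication factor.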

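Quorum is a latency contract. With W=2 of 3 replicas, a write still commits when one replica is slow or down, but you must design read repair or anti-entropy so the lagging copy catches up. With R=2 as well, R + W > N, so every read quorum overlaps the latest committed write on at least one replica. Production systems expose tunable R/W levels, like Cassandra’s ONE / QUORUM / ALL.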
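
The quorum contract can be sketched with in-memory replicas. This is an illustration under stated assumptions, not the lesson's implementation: Replica, quorum_write, and quorum_read are hypothetical names, versions are plain integers supplied by the caller, and the read path contacts every reachable replica rather than stopping at R.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    up: bool = True
    store: dict = field(default_factory=dict)  # key -> (version, value)

    def write(self, key, version, value) -> bool:
        """Acknowledge the write if reachable; keep only the newest version."""
        if not self.up:
            return False
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def read(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.store.get(key)


def quorum_write(replicas, key, version, value, w=2) -> bool:
    """Commit only if at least w replicas acknowledge (no rollback in this sketch)."""
    acks = sum(replica.write(key, version, value) for replica in replicas)
    return acks >= w


def quorum_read(replicas, key, r=2):
    """Return the newest value seen by at least r replicas, repairing stale copies."""
    responses = []
    for replica in replicas:
        try:
            responses.append((replica, replica.read(key)))
        except ConnectionError:
            continue
    if len(responses) < r:
        raise RuntimeError("read quorum not reached")
    found = [entry for _, entry in responses if entry is not None]
    if not found:
        return None
    newest = max(found, key=lambda entry: entry[0])
    for replica, entry in responses:
        if entry != newest:
            replica.write(key, *newest)  # read repair for lagging copies
    return newest[1]


nodes = [Replica("a"), Replica("b"), Replica("c")]
nodes[2].up = False                                   # one replica is down
committed = quorum_write(nodes, "log-1", 1, "ERROR disk full")
nodes[2].up = True                                    # it comes back empty
value = quorum_read(nodes, "log-1")                   # read repair fills it in
```

After the read, the replica that missed the write holds the same version as the other two, which is the catch-up behavior the paragraph above calls for.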

Anti-entropy is not a luxury. At LinkedIn-scale ingestion, bit rot, partial writes, and restarts leave silent divergence. Merkle trees let replicas compare digests cheaply and ship only the ranges that differ instead of full datasets, the same technique Dynamo uses for background replica synchronization when hinted handoff alone is not enough.
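
A Merkle diff and repair pass can be sketched as follows. This is a simplified illustration with assumptions stated up front: replicas are plain dicts, keys are bucketed into a fixed power-of-two number of leaves, and repair is one-way (the source copy is treated as authoritative and extra keys on the target are not deleted). The function names are hypothetical.

```python
import hashlib


def _digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str, leaves: int) -> int:
    """Assign a key to a leaf bucket using a stable hash."""
    return int.from_bytes(_digest(key.encode())[:4], "big") % leaves


def build_tree(store: dict, leaves: int = 8):
    """Return tree levels: levels[0] holds leaf digests, levels[-1] holds the root."""
    assert leaves > 0 and leaves & (leaves - 1) == 0, "leaves must be a power of two"
    buckets = [[] for _ in range(leaves)]
    for key in sorted(store):
        buckets[bucket_of(key, leaves)].append(f"{key}={store[key]}")
    level = [_digest("\n".join(bucket).encode()) for bucket in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_buckets(a, b):
    """Descend from the root, following only subtrees whose digests differ."""
    if a[-1] == b[-1]:
        return []
    suspects = [0]
    for depth in range(len(a) - 2, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if a[depth][child] != b[depth][child]
        ]
    return suspects


def repair(source: dict, target: dict, leaves: int = 8) -> int:
    """Copy entries from source into target, but only for buckets that differ."""
    bad = set(differing_buckets(build_tree(source, leaves), build_tree(target, leaves)))
    repaired = 0
    for key, value in source.items():
        if bucket_of(key, leaves) in bad and target.get(key) != value:
            target[key] = value
            repaired += 1
    return repaired


healthy = {"log-1": "ERROR a", "log-2": "INFO b", "log-3": "ERROR c"}
drifted = {"log-1": "ERROR a", "log-3": "stale"}   # missing one entry, one stale
repairs_done = repair(healthy, drifted)
```

When the roots match, the comparison stops after one digest; otherwise only the differing buckets are scanned, which is what keeps the check cheap relative to shipping the full dataset.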

LogStream — Build Distributed Systems