<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Hands On System Design Course - Code Everyday : Distributed Log Implementation With Java & Spring Boot]]></title><description><![CDATA[Hands-on System Design - Distributed Log Processing Implementation with Java & Spring Boot: From Zero to Production

Check here for the detailed 254-lesson curriculum.

Why Take This Course?
This is not a theoretical course. It's a year-long, hands-on journey where you'll build a complete, production-ready system from scratch using Java and Spring Boot. Each day, you'll complete practical tasks that incrementally build your expertise in scalable architectures, microservices, and modern DevOps practices. By the end, you'll have a tangible, portfolio-ready project to showcase your skills.]]></description><link>https://sdcourse.substack.com/s/system-design-course-with-java-and</link><image><url>https://substackcdn.com/image/fetch/$s_!zDvF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c554ddc-369a-4da8-bd75-54cd71f9a6e9_1024x1024.png</url><title>Hands On System Design Course - Code Everyday : Distributed Log Implementation With Java &amp; Spring Boot</title><link>https://sdcourse.substack.com/s/system-design-course-with-java-and</link></image><generator>Substack</generator><lastBuildDate>Tue, 21 Apr 2026 15:12:09 GMT</lastBuildDate><atom:link href="https://sdcourse.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[System Design Course]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[sdcourse@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[sdcourse@substack.com]]></itunes:email><itunes:name><![CDATA[System Design Course]]></itunes:name></itunes:owner><itunes:author><![CDATA[System Design Course]]></itunes:author><googleplay:owner><![CDATA[sdcourse@substack.com]]></googleplay:owner><googleplay:email><![CDATA[sdcourse@substack.com]]></googleplay:email><googleplay:author><![CDATA[System Design Course]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Day 52: Implement a Simple Inverted Index for Log Searching]]></title><description><![CDATA[Looking for Professional Growth?]]></description><link>https://sdcourse.substack.com/p/day-52-implement-a-simple-inverted</link><guid
isPermaLink="false">https://sdcourse.substack.com/p/day-52-implement-a-simple-inverted</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Sat, 18 Apr 2026 04:17:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qENL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>Looking for Professional Growth?</strong></p><p>The difference between a "design interview" and a "production system" is massive. Close that gap today with the <strong>Hands-On Distributed Log System Building</strong> course. <strong>Get 40% off for a limited time:</strong> <a href="https://sdcourse.substack.com/fbbab0d8">https://sdcourse.substack.com/fbbab0d8</a></p><div><hr></div><h2>What We&#8217;re Building Today</h2><ul><li><p><strong>Real-time inverted index</strong> that tokenizes and indexes log messages as they arrive via Kafka</p></li><li><p><strong>Search API</strong> with relevance scoring and ranked results for natural language queries</p></li><li><p><strong>Index persistence layer</strong> using Redis for hot data and PostgreSQL for cold storage</p></li><li><p><strong>Query processing engine</strong> supporting boolean operators and phrase matching</p></li></ul><h2>Why This Matters</h2><p>Every major observability platform&#8212;Splunk, Datadog, Elastic&#8212;runs on inverted indices. When you search &#8220;ERROR user authentication failed&#8221; across billions of log entries and get results in milliseconds, you&#8217;re querying an inverted index. This data structure powers everything from application monitoring to security incident response.</p><p>Without inverted indices, log search would require scanning every log entry linearly&#8212;O(n) complexity that becomes impossible at scale.
An inverted index transforms this into O(k) lookups where k is the number of query terms, enabling sub-second searches across terabytes of logs. Understanding inverted indices is fundamental to building search infrastructure that scales from thousands to trillions of documents.</p><p>Today&#8217;s implementation bridges the gap between local prototypes and production search engines, showing how the same architectural patterns scale from single-node deployments to distributed clusters processing petabytes daily.</p><h2>System Design Deep Dive</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qENL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qENL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!qENL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!qENL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!qENL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qENL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1790170,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/186064300?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qENL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!qENL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!qENL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!qENL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb989fd88-cb5c-4d0c-af26-3fdb81b01ad0_7000x4500.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div>
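<p>The O(k) claim above comes from keying a map by token: indexing tokenizes each message once, and a query then costs one map lookup per term. A minimal in-memory sketch in plain Java (the Redis/PostgreSQL persistence tiers are out of scope here; class and method names are illustrative, not the course's actual code):</p>

```java
import java.util.*;

// Minimal inverted index: token -> sorted set of log-entry ids containing it.
class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize a log message into lowercase terms and record the doc id.
    void index(int docId, String message) {
        for (String token : message.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND-query: one posting-list lookup per term, then intersect the lists.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

<p>Searching &#8220;user authentication&#8221; touches only two posting lists no matter how many entries were indexed; that is the scaling difference the section describes.</p>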
      <p>
          <a href="https://sdcourse.substack.com/p/day-52-implement-a-simple-inverted">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 51: Build Dashboards for Visualizing Analytics Results]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-51-build-dashboards-for-visualizing</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-51-build-dashboards-for-visualizing</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Tue, 14 Apr 2026 11:30:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!peaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><ul><li><p><strong>Real-time analytics dashboard</strong> consuming aggregated metrics from Kafka streams</p></li><li><p><strong>WebSocket-based push architecture</strong> delivering sub-second metric updates to browsers</p></li><li><p><strong>Multi-dimensional visualization service</strong> supporting time-series, histograms, and geographic heatmaps</p></li><li><p><strong>Query optimization layer</strong> with Redis caching and PostgreSQL time-series partitioning</p></li></ul><h2>Why This Matters</h2><blockquote><p>At scale, the gap between generating metrics and making them actionable determines your incident response time. Netflix processes 500 billion events daily, but their dashboard systems compress this into 200ms query responses because engineers can&#8217;t wait 30 seconds to see if a deployment broke something. When Uber&#8217;s surge pricing algorithms trigger, dashboard systems must surface the decision rationale within 100ms or drivers can&#8217;t understand why rates changed.</p><p>The architectural challenge isn&#8217;t building charts&#8212;it&#8217;s designing systems that maintain query responsiveness as data volume grows exponentially. 
Your dashboard becomes the bottleneck between detecting problems and fixing them. Poor dashboard architecture means your monitoring system generates alerts 5 minutes before your engineers can see the underlying data.</p></blockquote><h2>System Design Deep Dive</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!peaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!peaU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 424w, https://substackcdn.com/image/fetch/$s_!peaU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 848w, https://substackcdn.com/image/fetch/$s_!peaU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!peaU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!peaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png" width="1456" height="936" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308021,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185948798?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!peaU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 424w, https://substackcdn.com/image/fetch/$s_!peaU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 848w, https://substackcdn.com/image/fetch/$s_!peaU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!peaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e2d479-515c-45c7-951d-726de6585fff_1750x1125.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
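<p>The query-optimization layer described above can be approximated with a read-through cache: repeated dashboard queries are served from memory, and the slow store is hit only when the cached entry is stale. A minimal sketch in plain Java (a map with TTL standing in for Redis; names and the TTL value are illustrative, not the course's actual code):</p>

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Read-through cache with a per-entry time-to-live.
class TtlQueryCache {
    private record Entry(Object value, long expiresAtMillis) {}

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlQueryCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Return a fresh cached value, or run the expensive loader
    // (e.g. a time-series query) and cache its result.
    @SuppressWarnings("unchecked")
    <T> T get(String key, Supplier<T> loader) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(key);
        if (e != null && e.expiresAtMillis > now) {
            return (T) e.value;          // cache hit: no database round trip
        }
        T value = loader.get();          // cache miss: pay the full query cost once
        cache.put(key, new Entry(value, now + ttlMillis));
        return value;
    }
}
```

<p>The design choice is the TTL: too long and panels show stale metrics, too short and the database absorbs every refresh.</p>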
      <p>
          <a href="https://sdcourse.substack.com/p/day-51-build-dashboards-for-visualizing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 50: Alert Generation Based on Log Patterns]]></title><description><![CDATA[Upgrade to get a one-month free subscription to our hands-on course portal systemdrd.com, which offers a wide variety of hands-on courses covering various technologies.]]></description><link>https://sdcourse.substack.com/p/day-50-alert-generation-based-on</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-50-alert-generation-based-on</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Fri, 10 Apr 2026 04:30:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ixe-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><a href="https://sdcourse.substack.com/subscribe">Upgrade</a> to get a one-month free subscription to our hands-on course portal <strong><a href="http://systemdrd.com">systemdrd.com</a></strong>, which offers a wide variety of hands-on courses covering various technologies.</p><p>Subscribe to <strong><a href="http://systemdrd.com">systemdrd.com</a></strong> &amp; get <strong>lifetime access</strong> to both &#8220;Hands On System Design: Distributed Systems Implementation with <strong>Python and JavaScript</strong>&#8221; and this &#8220;Distributed Log Implementation With <strong>Java &amp; Spring Boot</strong>&#8221;.</p><div><hr></div><h2>What We&#8217;re Building Today</h2><blockquote><p>A production-grade distributed alerting system that monitors log patterns in real-time and triggers intelligent notifications:</p></blockquote><ul><li><p><strong>Real-time alert rule engine</strong> processing 50,000+ events/second with Kafka Streams</p></li><li><p><strong>Smart alert manager</strong> with deduplication, correlation, and escalation logic</p></li><li><p><strong>Multi-channel notification
service</strong> supporting email, Slack, and PagerDuty integration</p></li><li><p><strong>Alert configuration API</strong> for dynamic rule management without system restarts</p></li></ul><h2>Why This Matters</h2><blockquote><p>Alert generation is where distributed log processing transitions from passive observation to active operational response. At scale, naive alerting becomes your biggest operational burden&#8212;Netflix processes 2 billion alerts daily but only acts on 0.01% of them. The challenge isn&#8217;t detecting problems; it&#8217;s preventing alert fatigue while ensuring critical issues never slip through.</p><p>Poor alerting architectures create alert storms during outages (compounding incident response), suffer from flapping alerts that erode trust, generate excessive false positives that train teams to ignore notifications, and fail during the very incidents they&#8217;re designed to detect. Production alerting requires sophisticated state management, intelligent suppression, and fault-tolerant delivery mechanisms that work when your primary systems are degraded.</p></blockquote><h2>System Design Deep Dive</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ixe-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ixe-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 424w, 
https://substackcdn.com/image/fetch/$s_!Ixe-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 848w, https://substackcdn.com/image/fetch/$s_!Ixe-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!Ixe-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ixe-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503374,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185949000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Ixe-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 424w, https://substackcdn.com/image/fetch/$s_!Ixe-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 848w, https://substackcdn.com/image/fetch/$s_!Ixe-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!Ixe-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3e064ae-4540-4492-a0e2-d7b71e822091_1750x1125.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Pattern 1: Stateful Stream Processing for Alert Evaluation</h3>
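<p>One piece of the stateful evaluation above, alert deduplication, reduces to keeping per-key state: remember when each alert key last fired and suppress repeats inside a quiet window. A minimal sketch in plain Java (in production a Kafka Streams state store would hold this map; names and the window length are illustrative, not the course's actual code):</p>

```java
import java.util.HashMap;
import java.util.Map;

// Suppresses repeat firings of the same alert key within a quiet window.
class AlertDeduplicator {
    private final Map<String, Long> lastFiredAt = new HashMap<>();
    private final long windowMillis;

    AlertDeduplicator(long windowMillis) { this.windowMillis = windowMillis; }

    // Returns true if the alert should be delivered, false if suppressed.
    synchronized boolean shouldFire(String alertKey, long nowMillis) {
        Long last = lastFiredAt.get(alertKey);
        if (last != null && nowMillis - last < windowMillis) {
            return false;                   // duplicate inside the quiet window
        }
        lastFiredAt.put(alertKey, nowMillis);
        return true;
    }
}
```

<p>During an alert storm, this single check is the difference between one page per failing service and thousands.</p>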
      <p>
          <a href="https://sdcourse.substack.com/p/day-50-alert-generation-based-on">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 49: Implement Anomaly Detection Algorithms for Distributed Log Processing]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-49-implement-anomaly-detection</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-49-implement-anomaly-detection</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Mon, 06 Apr 2026 11:30:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rQgW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today we&#8217;re implementing a production-grade anomaly detection system that processes streaming log data to identify unusual patterns in real-time. You&#8217;ll build:</p><ul><li><p><strong>Statistical anomaly detection engine</strong> using Z-score and IQR methods for numeric metrics</p></li><li><p><strong>Time-series pattern recognition</strong> detecting deviations from historical baselines</p></li><li><p><strong>Multi-dimensional clustering</strong> identifying outliers across correlated log attributes</p></li><li><p><strong>Adaptive threshold system</strong> that learns normal behavior and adjusts detection sensitivity</p></li><li><p><strong>Real-time alerting pipeline</strong> with confidence scoring and false-positive suppression</p></li></ul><h2>Why This Matters: Production Anomaly Detection at Scale</h2><blockquote><p>Anomaly detection is critical infrastructure at companies processing billions of events daily. Netflix&#8217;s anomaly detection system monitors 800+ microservices, detecting issues before they impact customer experience. 
Uber&#8217;s real-time fraud detection processes 100,000 trip events per second, identifying suspicious patterns within milliseconds. Amazon&#8217;s operational intelligence systems scan millions of metrics to prevent outages.</p><p>The challenge isn&#8217;t just detecting anomalies&#8212;it&#8217;s doing so with minimal false positives while maintaining sub-second latency at massive scale. Traditional threshold-based alerting breaks down when you have thousands of metrics with dynamic baselines. Statistical methods provide precision, but require careful tuning for seasonality, trends, and multi-modal distributions.</p><p>Today&#8217;s implementation demonstrates how to build adaptive anomaly detection that scales horizontally, maintains accuracy under load, and integrates with existing observability infrastructure. The patterns you&#8217;ll implement power the monitoring systems behind modern distributed platforms.</p></blockquote><h2>System Design Deep Dive</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rQgW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rQgW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 424w, https://substackcdn.com/image/fetch/$s_!rQgW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 848w, 
https://substackcdn.com/image/fetch/$s_!rQgW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!rQgW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rQgW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:357361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185949154?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rQgW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 424w, 
https://substackcdn.com/image/fetch/$s_!rQgW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 848w, https://substackcdn.com/image/fetch/$s_!rQgW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!rQgW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a6bd5-d27a-4d30-8248-2d0996ae8529_1750x1125.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
      <p>
          <a href="https://sdcourse.substack.com/p/day-49-implement-anomaly-detection">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 48: Sessionization for User Activity Tracking]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-48-sessionization-for-user-activity</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-48-sessionization-for-user-activity</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Thu, 02 Apr 2026 11:30:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!940Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><blockquote><p>Today we implement production-grade sessionization to transform raw event streams into meaningful user sessions:</p></blockquote><ul><li><p><strong>Session Window Processing</strong>: Kafka Streams session windows that automatically group events with configurable inactivity gaps</p></li><li><p><strong>Real-Time Session Tracking</strong>: Redis-backed active session cache with TTL-based expiration and sub-millisecond lookups</p></li><li><p><strong>Session Analytics Engine</strong>: PostgreSQL persistence layer computing session metrics (duration, event count, conversion patterns)</p></li><li><p><strong>Interactive Query API</strong>: REST endpoints exposing session state stores for real-time session queries without external database latency</p></li></ul><h2>Why This Matters</h2><blockquote><p>Sessionization is the foundation of user behavior analytics at scale. Every time you see &#8220;Users who viewed this also bought...&#8221; on Amazon, &#8220;Continue Watching&#8221; on Netflix, or &#8220;Complete your ride&#8221; on Uber, you&#8217;re experiencing sessionization in action. 
The challenge isn&#8217;t just grouping events&#8212;it&#8217;s doing it correctly with out-of-order events, across millions of concurrent users, while maintaining sub-second query latency.</p><p>The distributed systems challenge comes from handling time itself: events arrive out of order, users cross session boundaries mid-action, and sessions must expire gracefully without memory leaks. Netflix processes 200+ billion events daily across 250 million users, requiring sessionization that handles late-arriving events up to 24 hours delayed while maintaining real-time dashboard updates. Getting this wrong means misattributed user actions, incorrect analytics, and degraded recommendation quality.</p></blockquote><h2>System Design Deep Dive</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!940Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!940Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 424w, https://substackcdn.com/image/fetch/$s_!940Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 848w, https://substackcdn.com/image/fetch/$s_!940Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 1272w, 
https://substackcdn.com/image/fetch/$s_!940Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!940Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:398651,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185521034?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!940Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 424w, https://substackcdn.com/image/fetch/$s_!940Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 848w, 
https://substackcdn.com/image/fetch/$s_!940Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!940Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab48e06-382d-4718-a9e3-b937c7efdfb4_1750x1125.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
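<p>The inactivity-gap rule behind session windows can be made concrete with a few lines of plain Java (a sketch, independent of Kafka Streams; the class and method names are illustrative):</p>

```java
public class Sessionizer {

  /**
   * Count the sessions in one user's time-ordered event timestamps
   * (epoch millis): a new session starts whenever the gap since the
   * previous event exceeds inactivityGapMs.
   */
  public static int countSessions(long[] sortedTimestamps, long inactivityGapMs) {
    if (sortedTimestamps.length == 0) return 0;
    int sessions = 1;
    for (int i = 1; i < sortedTimestamps.length; i++) {
      if (sortedTimestamps[i] - sortedTimestamps[i - 1] > inactivityGapMs) {
        sessions++; // gap exceeded: the previous session closed
      }
    }
    return sessions;
  }
}
```

<p>Kafka Streams applies the same rule per key with session windows, and additionally merges two sessions when a late-arriving event bridges the gap between them.</p>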
      <p>
          <a href="https://sdcourse.substack.com/p/day-48-sessionization-for-user-activity">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Day 47: Sliding Windows for Real-Time Trend Analysis]]></title><description><![CDATA[Stop Drawing Boxes.]]></description><link>https://sdcourse.substack.com/p/day-47-sliding-windows-for-real-time</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-47-sliding-windows-for-real-time</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Sun, 29 Mar 2026 04:06:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pitD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>Stop Drawing Boxes. Start Building Systems. </strong><a href="https://sdcourse.substack.com/subscribe">Subscribe Now</a> </p><p>The gap between a &#8220;system design interview&#8221; and a &#8220;production system&#8221; is massive. This newsletter exists to bridge that gap.</p><p>When you join an organisation, no one is going to teach you how the system is designed or built. The fun part: developers rarely document everything, so you need to dig through the code to understand the system. I created this course because I believe the best way to learn distributed systems is by building them. We don&#8217;t just talk about the CAP theorem; we look at how it dictates our database choices. We don&#8217;t just mention &#8220;latency&#8221;; we measure it. 
<strong><a href="https://sdcourse.substack.com/subscribe">Subscribe </a></strong></p></blockquote><div><hr></div><div><hr></div><h2>What We&#8217;re Building Today</h2><p>Today we implement sliding window aggregations for real-time trend detection in distributed log processing systems:</p><ul><li><p><strong>Hopping windows</strong> with configurable slide intervals for continuous metric updates</p></li><li><p><strong>Multi-granularity trend analysis</strong> tracking 1-minute, 5-minute, and 15-minute moving averages</p></li><li><p><strong>State-efficient window management</strong> using Kafka Streams&#8217; optimized windowing primitives</p></li><li><p><strong>Interactive query API</strong> serving real-time trend data with sub-10ms latency</p></li><li><p><strong>Production monitoring</strong> tracking window lag, state store size, and processing throughput</p></li></ul><h2>Why This Matters</h2><blockquote><p>Sliding windows solve a critical problem in real-time analytics: detecting trends as they happen. Unlike tumbling windows that update in discrete jumps, sliding windows provide continuous visibility into recent behavior patterns. When Netflix detects video quality degradation, they need second-by-second moving averages&#8212;not 5-minute buckets that hide critical spikes. When Uber calculates surge pricing, they track the velocity of ride requests using overlapping windows to smooth out noise while remaining responsive to demand shifts.</p><p>The fundamental challenge is maintaining thousands of overlapping windows efficiently. A naive implementation storing every window independently would consume massive memory and CPU. 
Production systems leverage specialized data structures and time-based compaction strategies to maintain window state efficiently while serving low-latency queries.</p></blockquote><h2>System Design Deep Dive</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pitD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pitD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 424w, https://substackcdn.com/image/fetch/$s_!pitD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 848w, https://substackcdn.com/image/fetch/$s_!pitD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 1272w, https://substackcdn.com/image/fetch/$s_!pitD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pitD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1985494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185520303?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pitD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 424w, https://substackcdn.com/image/fetch/$s_!pitD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 848w, https://substackcdn.com/image/fetch/$s_!pitD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 1272w, https://substackcdn.com/image/fetch/$s_!pitD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26024fc2-ce29-44c6-b620-2c901009be85_6000x4000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>1. Sliding vs Hopping Windows: The Trade-off Space</h3><p>Sliding windows create a new window for every event, providing maximum granularity but at high computational cost. Hopping windows advance by fixed intervals (the &#8220;hop size&#8221;), reducing computation while introducing bounded staleness. 
The key insight: most applications don&#8217;t need per-event updates&#8212;hopping windows with 10-second hops provide near-continuous trends at 1/100th the cost.</p><p><strong>Window Configuration Pattern:</strong></p><pre><code><code>Window Size: 5 minutes
Hop Size: 10 seconds
Result: 30 overlapping windows active simultaneously
Memory: O(window_size / hop_size) per key
</code></code></pre><p>Netflix uses 1-minute windows with 5-second hops for video quality metrics, balancing trend detection speed against computational overhead. Each window overlap shares most of its data with adjacent windows, enabling Kafka Streams to optimize through incremental computation rather than reprocessing the full window on every hop.</p><p><strong>Anti-Pattern:</strong> Setting hop size too small relative to window size. A 1-hour window with 1-second hops creates 3,600 active windows&#8212;each requiring state storage and periodic aggregation. The memory footprint becomes O(events_per_second &#215; window_seconds), potentially gigabytes for high-volume streams.</p><h3>2. State Store Architecture for Window Queries</h3><p>Kafka Streams materializes windowed aggregations into RocksDB-backed state stores, but querying &#8220;what&#8217;s the current moving average?&#8221; requires understanding how windows are keyed. Each window instance is stored with a composite key: <code>(record_key, window_start_time)</code>. To serve a real-time query, we must:</p><ol><li><p>Calculate which windows contain the query timestamp</p></li><li><p>Fetch all relevant window instances from the state store</p></li><li><p>Aggregate across windows to compute the moving average</p></li><li><p>Cache the result in Redis for subsequent queries</p></li></ol><p><strong>Critical Design Decision:</strong> Window retention time. By default, Kafka Streams retains windows for <code>window_size + grace_period</code>. For a 5-minute window, this means only the last 5-10 minutes are queryable. Longer retention enables historical trend queries but increases state store size linearly.</p><p>Uber&#8217;s surge pricing system maintains 15 minutes of windowed state to detect both immediate spikes and sustained demand increases. They use a two-tier approach: hot state in RocksDB for recent windows, cold state in S3 for historical analysis.</p><h3>3. 
Out-of-Order Event Handling</h3><p>Real-world data streams are never perfectly ordered. Network delays, producer failures, and buffering create timestamp skew. Kafka Streams handles this through grace periods&#8212;extended windows that accept late arrivals for a configured duration after the window would normally close.</p><p><strong>The Grace Period Trade-off:</strong></p><ul><li><p>Too short: Late events dropped, inaccurate trends</p></li><li><p>Too long: Increased memory, delayed window finalization</p></li><li><p>Production tuning: Set grace period to 95th percentile of observed latency</p></li></ul><p>Twitter&#8217;s trending topics system uses a 30-second grace period for their 5-minute trending windows. They found that 30 seconds captures 99% of events while preventing unbounded state growth from severely delayed data. Events arriving later than the grace period are logged to a dead-letter topic for analysis but don&#8217;t affect real-time trends.</p><p><strong>State Management:</strong> Each open window consumes memory proportional to the aggregation size (typically bytes to kilobytes per window). With 10,000 unique keys and 30 windows per key, you&#8217;re managing ~300K active window instances. Kafka Streams uses sparse windowing&#8212;only creating window instances when events arrive for that key-window pair.</p><h3>4. Incremental Aggregation Patterns</h3><p>Computing moving averages requires maintaining both sum and count for each window. The naive approach stores all events in the window and recalculates on every query. Production systems use incremental aggregation:</p><pre><code><code>// Efficient: O(1) per event
windowedStream
  .aggregate(
    () -&gt; new WindowStats(0L, 0L), // sum, count
    (key, value, aggregate) -&gt; {
      aggregate.sum += value;
      aggregate.count++;
      return aggregate;
    }
  );

// Query time: O(1)
double movingAverage = (double) stats.sum / stats.count; // cast avoids integer division
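
// A minimal WindowStats consistent with the aggregator above
// (a sketch: the constructor and helper names are illustrative):
class WindowStats {
  long sum;
  long count;

  WindowStats(long sum, long count) {
    this.sum = sum;
    this.count = count;
  }

  // Mutates and returns itself, so it can be used directly as the
  // aggregator: (key, value, aggregate) -> aggregate.update(value)
  WindowStats update(long value) {
    sum += value;
    count++;
    return this;
  }

  double average() {
    return count == 0 ? 0.0 : (double) sum / count;
  }
}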
</code></code></pre><p>Amazon&#8217;s CloudWatch uses this pattern for metric aggregations, maintaining running sums, counts, min, max, and sum-of-squares for percentile calculations. Each metric point is processed exactly once into the window state, then queries read pre-aggregated values.</p><p><strong>Failure Handling:</strong> Kafka Streams checkpoints window state to changelog topics. On failure, the stream processor restores state from the changelog and resumes processing. Window state is strongly consistent&#8212;each window instance exists on exactly one partition, eliminating the need for distributed coordination during aggregation.</p><h3>5. Query Patterns for Real-Time Dashboards</h3><p>Serving windowed aggregations requires an interactive query layer. Kafka Streams exposes state stores through <code>ReadOnlyWindowStore</code> interfaces, but querying is local to each stream processor instance. In a multi-instance deployment, you need service discovery to route queries to the correct instance holding the relevant partition.</p><p><strong>Production Pattern:</strong></p><pre><code><code>Query Router &#8594; [Discovers key partition] &#8594; Stream Processor Instance &#8594; RocksDB &#8594; Response
</code></code></pre><p>For globally aggregated metrics (e.g., &#8220;average error rate across all services&#8221;), you need a scatter-gather approach: query all stream processor instances, aggregate their responses. This is expensive&#8212;instead, maintain a dedicated aggregation topology that pre-computes global windows.</p><p>Netflix&#8217;s Edge Gateway metrics system uses a hybrid approach: partition-local windows for per-service metrics (fast queries, no coordination), and a secondary global aggregation topology for cross-service dashboards. The global topology reduces 10,000 microservice streams into a single aggregated stream with tolerable latency (~5 seconds end-to-end).</p><h2>Implementation Walkthrough</h2><h3>GitHub Link:</h3><pre><code><a href="https://github.com/sysdr/sdc-java/tree/main/day47/day47-sliding-window-analytics">https://github.com/sysdr/sdc-java/tree/main/day47/day47-sliding-window-analytics</a></code></pre><h3>Step 1: Define Window Configuration</h3><p>We implement multiple window sizes to serve different analytical needs. The 1-minute window detects immediate issues, 5-minute windows smooth noise, 15-minute windows identify sustained trends:</p><pre><code><code>Duration oneMinWindow = Duration.ofMinutes(1);
Duration fiveMinWindow = Duration.ofMinutes(5);
Duration fifteenMinWindow = Duration.ofMinutes(15);
Duration hopInterval = Duration.ofSeconds(10);

TimeWindows oneMinHopping = TimeWindows
  .ofSizeWithNoGrace(oneMinWindow)
  .advanceBy(hopInterval);
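
// The 5- and 15-minute variants follow the same pattern (a sketch
// reusing the Durations defined above); sharing one hop interval
// keeps all three granularities aligned on hop boundaries
TimeWindows fiveMinHopping = TimeWindows
  .ofSizeWithNoGrace(fiveMinWindow)
  .advanceBy(hopInterval);

TimeWindows fifteenMinHopping = TimeWindows
  .ofSizeWithNoGrace(fifteenMinWindow)
  .advanceBy(hopInterval);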
</code></code></pre><p><strong>Architectural Decision:</strong> No grace period initially&#8212;we prioritize deterministic window boundaries over late event handling. Production systems start here, then add grace periods based on observed late-arrival patterns.</p><h3>Step 2: Build Windowed Aggregation Topology</h3><p>The stream processing topology aggregates events into windowed state stores. Each log event contains metrics (error rate, latency, throughput) that we aggregate into moving averages:</p><pre><code><code>streamsBuilder
  .stream("log-events")
  .groupByKey()
  .windowedBy(oneMinHopping)
  .aggregate(
    WindowStats::new,
    (key, event, stats) -&gt; stats.update(event),
    Materialized.&lt;String, WindowStats, WindowStore&lt;Bytes, byte[]&gt;&gt;as("one-min-windows")
      .withValueSerde(windowStatsSerde)
  );
</code></code></pre><p>The materialized view name (&#8220;one-min-windows&#8221;) becomes the state store name for interactive queries. Kafka Streams automatically manages this store across multiple instances using partition assignment.</p><h3>Step 3: Interactive Query API</h3><p>The REST API exposes current moving averages by querying the underlying state stores. The critical challenge: state stores are partitioned&#8212;we need to discover which instance holds the data for a given key:</p><pre><code><code>@GetMapping("/trends/{serviceId}")
public TrendResponse getTrends(@PathVariable String serviceId) {
  StreamsMetadata metadata = streams.metadataForKey(
    "one-min-windows", 
    serviceId, 
    Serdes.String().serializer()
  );
  
  // Route locally when this instance owns the key's partition
  // (localHost/localPort hold this instance's application.server value)
  if (metadata.host().equals(localHost) && metadata.port() == localPort) {
    return queryLocalStore(serviceId);
  } else {
    return forwardToInstance(metadata.host(), metadata.port(), serviceId);
  }
}
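
// A possible queryLocalStore using the Kafka Streams interactive-query
// API (a sketch: TrendResponse's constructor and the one-minute fetch
// range are assumptions; the store name and WindowStats come from above)
private TrendResponse queryLocalStore(String serviceId) {
  ReadOnlyWindowStore&lt;String, WindowStats&gt; store = streams.store(
    StoreQueryParameters.fromNameAndType(
      "one-min-windows", QueryableStoreTypes.windowStore()));

  Instant now = Instant.now();
  long sum = 0, count = 0;
  // Merge the pre-aggregated stats of every window instance covering
  // the last minute for this key
  try (WindowStoreIterator&lt;WindowStats&gt; windows =
         store.fetch(serviceId, now.minus(Duration.ofMinutes(1)), now)) {
    while (windows.hasNext()) {
      WindowStats stats = windows.next().value;
      sum += stats.sum;
      count += stats.count;
    }
  }
  return new TrendResponse(serviceId,
      count == 0 ? 0.0 : (double) sum / count);
}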
</code></code></pre><h3>Step 4: Caching Layer for Query Performance</h3><p>Querying RocksDB state stores on every API request creates I/O bottlenecks. We cache computed trends in Redis with short TTLs matching the hop interval:</p><pre><code><code>String cacheKey = "trend:" + serviceId + ":" + System.currentTimeMillis() / hopMillis;
TrendResponse cached = redis.get(cacheKey);

if (cached != null) return cached;

TrendResponse computed = computeFromStateStore(serviceId);
redis.setex(cacheKey, hopSeconds, computed);
return computed;
</code></code></pre><p><strong>Performance Impact:</strong> Cache hit rate &gt;90% reduces state store queries by 10x, dropping p99 latency from 25ms to &lt;3ms.</p><h3>Working demo link :</h3><div id="youtube2-d-t8t4kCQSw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;d-t8t4kCQSw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/d-t8t4kCQSw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Production Considerations</h2><p><strong>Memory Management:</strong> Monitor state store disk usage with <code>kafka.streams.state.store.bytes.total</code>. Each window instance is ~100 bytes, so 10K keys &#215; 30 windows &#215; 100 bytes = 30MB per window size. With three window sizes and replicas, expect ~180MB total.</p><p><strong>Out-of-Order Events:</strong> Set grace periods to p95 network latency (typically 100-500ms for same-region). Monitor <code>kafka.streams.late.record.drop.total</code> to detect excessive late arrivals indicating misconfigured grace periods or upstream delays.</p><p><strong>Query Latency:</strong> p99 query latency should stay &lt;10ms for local queries, &lt;50ms for remote instance queries. 
High latency indicates either state store compaction issues or insufficient instance resources.</p><p><strong>Failure Scenarios:</strong></p><ul><li><p><strong>Instance crash:</strong> State restores from changelog (~10-30 seconds for moderate state), queries fail until restoration completes</p></li><li><p><strong>Network partition:</strong> Queries to unreachable instances timeout, implement circuit breakers with 3-second timeouts</p></li><li><p><strong>Slow consumers:</strong> Kafka consumer lag increases, windows compute with stale data&#8212;monitor <code>records-lag-max</code></p></li></ul><h2>Scaling to Production</h2><p>Uber&#8217;s ride request monitoring processes 100K+ events/second using 20 Kafka Streams instances, each managing ~5K unique keys. They partition by geographic region (rider location hash) to enable local aggregations and regional dashboards. Their sliding windows use 30-second hops, creating 10 overlapping windows per 5-minute interval.</p><p>Key scaling insights from their architecture:</p><ul><li><p>State store size grows with unique key count, not event volume</p></li><li><p>Hop size determines computational cost&#8212;10-second hops cost 6x more than 60-second hops</p></li><li><p>Query latency depends on instance locality&#8212;co-locate API and stream processing when possible</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Day 46: Time-Based Windowing for Real-Time Log Aggregation]]></title><description><![CDATA[Stop just reading about high-scale systems&#8212;start building them. For the next few days, get 50% off the &#8220;Hands-on System Design&#8221; course and master production-grade Java and Spring Boot architectures at half the price.]]></description><link>https://sdcourse.substack.com/p/day-46-time-based-windowing-for-real</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-46-time-based-windowing-for-real</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Wed, 25 Mar 2026 07:49:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!x_JA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Stop just reading about high-scale systems&#8212;start building them.</strong> For the next few days, get <strong>50% off</strong> the &#8220;<strong>Hands-on System Design</strong>&#8221; course and master production-grade Java and Spring Boot architectures at half the price. 
Move beyond theoretical diagrams to implementing systems that handle <strong>100M+ requests</strong>.</p><p><strong>[ <a href="https://sdcourse.substack.com/d8592ce9">Subscribe Now</a> &amp; Save 50% ]</strong></p><h2>What We&#8217;re Building Today</h2><blockquote><p>Today we implement production-grade time-based windowing for real-time log analytics:</p></blockquote><ul><li><p><strong>Tumbling Windows</strong>: Fixed-size, non-overlapping time windows for discrete period aggregations</p></li><li><p><strong>Hopping Windows</strong>: Overlapping time windows for trend detection with configurable advance intervals</p></li><li><p><strong>Session Windows</strong>: Dynamic windows based on activity gaps for user session analytics</p></li><li><p><strong>Windowed Metrics Engine</strong>: Real-time calculation of count, sum, average, min, max per window</p></li><li><p><strong>Late Data Handling</strong>: Grace periods and watermark management for out-of-order events</p></li><li><p><strong>Window State Persistence</strong>: RocksDB-backed state stores with changelog topics for fault tolerance</p></li><li><p><strong>Interactive Queries</strong>: REST API exposing current and historical window results in real-time</p></li></ul><blockquote><p>System processes <strong>50,000+ events/second</strong> with <strong>sub-100ms window computation latency</strong> and maintains <strong>exactly-once window semantics</strong> even during failures.</p></blockquote><h2>Why This Matters: The Foundation of Real-Time Analytics</h2><blockquote><p>Every production monitoring system, business intelligence dashboard, and real-time alerting platform relies on time-based windowing. 
When Netflix monitors video quality metrics per 5-minute window across 200+ million users, when Uber calculates surge pricing based on 1-minute ride request windows per geographic area, or when Amazon tracks order volumes in 15-minute windows for capacity planning&#8212;they all use the same fundamental windowing patterns we&#8217;re implementing today.</p><p>The challenge isn&#8217;t just aggregating data over time&#8212;it&#8217;s handling late-arriving events, managing state for millions of concurrent windows, ensuring exactly-once semantics despite failures, and providing low-latency access to both current and historical window results. Window boundaries create consistency challenges: should an event timestamped at 10:59:59 but arriving at 11:00:01 belong to the 10:00-11:00 window or be discarded? How long do you wait for stragglers before finalizing a window?</p><p>Modern stream processing platforms solve these problems through watermarks (tracking event time progress), grace periods (allowing late data within bounds), and stateful processing (maintaining window state across crashes). 
Understanding these patterns transforms you from writing batch aggregation scripts to building the real-time analytics engines that power modern data-driven companies.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x_JA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x_JA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!x_JA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!x_JA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!x_JA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x_JA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png" width="1456" height="936" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2548291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185520772?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x_JA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!x_JA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!x_JA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!x_JA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051427ee-66ba-4449-bb26-7aa9c091aeba_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
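<p>The tumbling-window idea above comes down to flooring each event's timestamp to a fixed window size, so every event maps to exactly one window. A minimal sketch (illustrative only, not the course code):</p>

```java
import java.time.Duration;
import java.time.Instant;

// Tumbling windows: fixed-size and non-overlapping. An event's window is
// found by flooring its timestamp down to a multiple of the window size.
public class TumblingWindow {
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        return Instant.ofEpochMilli((eventTime.toEpochMilli() / sizeMs) * sizeMs);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2026-03-25T10:59:59Z");
        System.out.println(windowStart(t, Duration.ofMinutes(5)));
        // 2026-03-25T10:55:00Z -- the 10:55-11:00 window
    }
}
```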
      <p>
          <a href="https://sdcourse.substack.com/p/day-46-time-based-windowing-for-real">
              Read more
          </a>
      </p>
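<p>The late-data question posed above (an event stamped 10:59:59 that arrives at 11:00:01) reduces to a watermark-plus-grace-period rule: a window keeps accepting in-range events until the watermark passes the window end plus the grace period. A minimal accept/drop sketch (an assumption for illustration, not the course's Kafka Streams implementation):</p>

```java
import java.time.Duration;
import java.time.Instant;

// Late-data policy: an event belongs to a window by its event time, and is
// accepted as long as the watermark has not passed windowEnd + grace.
public class LateDataPolicy {
    static boolean accept(Instant eventTime, Instant windowStart,
                          Duration windowSize, Duration grace, Instant watermark) {
        Instant windowEnd = windowStart.plus(windowSize);
        boolean inWindow = !eventTime.isBefore(windowStart) && eventTime.isBefore(windowEnd);
        boolean windowStillOpen = watermark.isBefore(windowEnd.plus(grace));
        return inWindow && windowStillOpen;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2026-03-25T10:00:00Z");
        // Event stamped 10:59:59 arriving with watermark at 11:00:01:
        // within a 30-second grace period, it still counts.
        System.out.println(accept(Instant.parse("2026-03-25T10:59:59Z"), start,
                Duration.ofHours(1), Duration.ofSeconds(30),
                Instant.parse("2026-03-25T11:00:01Z")));
        // true
    }
}
```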
   ]]></content:encoded></item><item><title><![CDATA[Day 45: Implement a Simple MapReduce Framework for Batch Log Analysis]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-45-implement-a-simple-mapreduce</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-45-implement-a-simple-mapreduce</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Sat, 21 Mar 2026 09:17:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!d4Dg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><blockquote><p>Today we&#8217;re implementing a production-grade MapReduce framework for batch log analysis:</p></blockquote><ul><li><p><strong>Distributed MapReduce Engine</strong>: Complete map-shuffle-reduce pipeline processing millions of log events</p></li><li><p><strong>Word Count &amp; Pattern Analysis</strong>: Real-time pattern frequency detection across distributed log streams</p></li><li><p><strong>Fault-Tolerant Task Scheduling</strong>: Coordinator-worker architecture with  automatic task retry and failure recovery</p></li><li><p><strong>Scalable Storage Backend</strong>: Partitioned intermediate results with efficient shuffle operations</p></li></ul><h2>Why This Matters: The Foundation of Big Data Processing</h2><blockquote><p>While Kafka Streams excels at real-time processing, many analytics workloads require batch processing of historical data. 
MapReduce remains the fundamental pattern behind modern data processing frameworks like Apache Spark, Hadoop, and even cloud-native services like AWS EMR and Google Dataflow.</p><p>When Netflix analyses viewing patterns across billions of log events to optimize content recommendations, when Uber processes trip data to identify demand hotspots, or when Amazon analyses customer behaviour across terabytes of clickstream data&#8212;they&#8217;re all using MapReduce-style distributed processing. The pattern we implement today scales from processing megabytes on your laptop to petabytes across thousands of machines.</p><p>The key insight: MapReduce transforms complex distributed data processing into two simple operations (map and reduce) while hiding the complexity of data distribution, parallel execution, fault tolerance, and result aggregation. This abstraction enables data engineers to focus on business logic while the framework handles distributed systems complexity.</p></blockquote><h2>System Design Deep Dive: MapReduce Architecture Patterns</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d4Dg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d4Dg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!d4Dg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 848w, 
https://substackcdn.com/image/fetch/$s_!d4Dg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!d4Dg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d4Dg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1992030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/185520485?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d4Dg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 424w, 
https://substackcdn.com/image/fetch/$s_!d4Dg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!d4Dg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!d4Dg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2814404e-e1d2-4c36-bfae-cd869b2ac70c_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>1. Map-Shuffle-Reduce Pipeline Architecture</h3><div class="paywall-jump" data-component-name="PaywallToDOM"></div><p>MapReduce divides computation into three distinct phases:</p><p><strong>Map Phase</strong>: Each mapper processes a subset of input data independently, emitting key-value pairs. For log analysis, mappers extract patterns like error codes, user IDs, or URL paths from log entries. The critical design decision is data partitioning&#8212;how you split input data determines parallelism and load balance.</p><p><strong>Shuffle Phase</strong>: The framework groups all values with the same key and routes them to the appropriate reducer. This is where network I/O becomes the bottleneck. Production implementations use combiners (local reducers) to minimize data transfer. Twitter&#8217;s implementation processes 100TB+ daily logs, reducing shuffle data by 80% through combiner optimization.</p><p><strong>Reduce Phase</strong>: Each reducer aggregates values for its assigned keys. Reducers must handle partial failures&#8212;if a reducer crashes mid-processing, the framework restarts it with the same input data. This requires idempotent operations.</p><p><strong>Trade-off</strong>: MapReduce optimizes for throughput over latency. While Kafka Streams provides sub-second processing, MapReduce batch jobs might take minutes or hours. Choose MapReduce when you need to process complete datasets with strong consistency guarantees over time-sensitive results.</p><h3>2. 
Coordinator-Worker Task Scheduling</h3><p>The coordinator (master) maintains the distributed system state:</p><ul><li><p><strong>Task Assignment</strong>: Assigns map and reduce tasks to available workers</p></li><li><p><strong>Progress Tracking</strong>: Monitors task completion and detects stragglers</p></li><li><p><strong>Failure Detection</strong>: Identifies crashed workers and reschedules their tasks</p></li><li><p><strong>Data Locality</strong>: Preferentially assigns tasks to workers with local data access</p></li></ul><p><strong>The CAP Theorem Implication</strong>: Our coordinator becomes a single point of failure, choosing consistency (CP) over availability (AP). In production, systems like Google&#8217;s MapReduce use Chubby (distributed lock service) or Apache ZooKeeper to make the coordinator highly available. For our implementation, we accept this trade-off for simplicity.</p><p><strong>Straggler Mitigation</strong>: LinkedIn&#8217;s MapReduce jobs process 40PB monthly. They discovered that 10% of tasks take 3x longer than average (stragglers). The solution: speculative execution&#8212;launch backup tasks for slow-running jobs and use whichever completes first.</p><h3>3. Partitioned Intermediate Storage</h3><p>Between map and reduce phases, intermediate results must be stored and shuffled:</p><p><strong>Disk-Based Storage</strong>: Mappers write output to local disk partitioned by reduce key. This provides fault tolerance&#8212;if a reducer fails, intermediate data persists for retry. The trade-off is I/O overhead.</p><p><strong>In-Memory Optimization</strong>: Modern implementations like Apache Spark cache intermediate data in memory when possible, achieving 10-100x speedup. We implement a hybrid approach&#8212;memory buffers with disk spillover.</p><p><strong>Hash Partitioning</strong>: We use consistent hashing to distribute keys across reducers. This ensures even load distribution and enables horizontal scaling. 
Amazon&#8217;s internal MapReduce processes 100M+ keys per second using murmur3 hash with 10,000 reduce partitions.</p><h3>4. Fault Tolerance Through Task Retry</h3><p>Distributed systems fail constantly at scale. Google&#8217;s cluster of 10,000 machines experiences:</p><ul><li><p>20 machine failures per day</p></li><li><p>1000 hard drive failures per year</p></li><li><p>Network partitions several times per week</p></li></ul><p>Our MapReduce framework implements three fault-tolerance mechanisms:</p><p><strong>Heartbeat-Based Failure Detection</strong>: Workers send periodic heartbeats to the coordinator. Missing 3 consecutive heartbeats triggers task rescheduling. This detects crashes, network partitions, and hung processes.</p><p><strong>Task-Level Idempotency</strong>: Each task produces deterministic output for the same input. If a task executes twice (due to retry), the final result remains correct. This requires careful handling of side effects.</p><p><strong>Partial Result Recovery</strong>: If 95% of map tasks complete but 5% fail, we only retry the failed tasks rather than restarting the entire job. This dramatically improves completion time for large jobs.</p><h3>5. 
Backpressure and Resource Management</h3><p>Without proper backpressure, the system floods:</p><ul><li><p><strong>Memory Exhaustion</strong>: Fast mappers overwhelm slow reducers, filling intermediate storage</p></li><li><p><strong>Network Saturation</strong>: Shuffle phase consumes all bandwidth, starving other cluster traffic</p></li><li><p><strong>Disk Thrashing</strong>: Too many concurrent writes cause random I/O patterns</p></li></ul><p>Our implementation uses:</p><ul><li><p><strong>Task Throttling</strong>: Limit concurrent map tasks based on available worker memory</p></li><li><p><strong>Flow Control</strong>: Reducers signal backpressure when input buffers reach 80% capacity</p></li><li><p><strong>Resource Quotas</strong>: Each job gets CPU/memory/disk quotas to prevent resource starvation</p></li></ul><p>Uber&#8217;s MapReduce platform processes 100PB+ daily. They implement hierarchical fair scheduling&#8212;giving priority queues 60% of cluster resources while ensuring batch jobs get at least 20%.</p><h2>Implementation Walkthrough: Building the Framework</h2><h3>GitHub Link :</h3><pre><code><a href="https://github.com/sysdr/sdc-java/tree/main/day45/mapreduce-log-processor">https://github.com/sysdr/sdc-java/tree/main/day45/mapreduce-log-processor</a></code></pre><h3>Core Components Architecture</h3><p>Our system comprises five microservices:</p><p><strong>MapReduce Coordinator</strong>: Spring Boot service managing job lifecycle, task scheduling, and failure recovery. Exposes REST API for job submission and status queries. Maintains task state in PostgreSQL for fault tolerance.</p><p><strong>Map Worker Pool</strong>: Horizontally scalable workers consuming log batches from Kafka, applying user-defined map functions, and writing partitioned intermediate results to Redis. 
Each worker processes 10,000 events/second with automatic retry on transient failures.</p><p><strong>Reduce Worker Pool</strong>: Workers reading shuffled data from Redis, applying reduce functions, and persisting final results to PostgreSQL. Implements combiner pattern to minimize network transfer during shuffle phase.</p><p><strong>Storage Layer</strong>: Redis stores intermediate map outputs with 1-hour TTL. PostgreSQL persists final results with proper indexing for analytical queries. Kafka provides input log stream with replay capability for job reruns.</p><p><strong>API Gateway</strong>: Rate-limited REST endpoints for job submission, progress monitoring, and result retrieval. Implements circuit breaker pattern to prevent cascade failures.</p><h3>Implementation Flow</h3><p><strong>1. Job Submission Phase</strong>:</p><pre><code><code>@PostMapping("/jobs")
public JobStatus submitJob(@RequestBody JobRequest request) {
    // Validate user-defined map/reduce functions
    validateUserCode(request.getMapFunction(), request.getReduceFunction());
    
    // Create job metadata and initial tasks
    Job job = jobRepository.save(new Job(request));
    createMapTasks(job, request.getInputTopic(), request.getNumMappers());
    
    // Publish job to task queue for worker pickup
    coordinatorService.scheduleJob(job);
    return new JobStatus(job.getId(), "RUNNING");
}
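
// Sketch (assumption): createMapTasks is called above but not shown in the
// post. A minimal version creates one MapTask per input slice, persisted
// (via a hypothetical taskRepository) so the coordinator can reschedule
// tasks after a crash.
private void createMapTasks(Job job, String inputTopic, int numMappers) {
    for (int slice = 0; slice &lt; numMappers; slice++) {
        taskRepository.save(new MapTask(job.getId(), inputTopic, slice, numMappers));
    }
}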
</code></code></pre><p><strong>2. Map Phase Execution</strong>:</p><pre><code><code>@KafkaListener(topics = "map-tasks")
public void executeMapTask(MapTask task) {
    try {
        // Consume this task's log batch (simplified: a real KafkaConsumer
        // assigns the task's partition, then polls with a timeout)
        List&lt;LogEvent&gt; logs = kafkaConsumer.poll(task.getPartition());
        
        // Apply user map function: log -&gt; List&lt;KeyValue&gt;
        List&lt;KeyValue&gt; mappedResults = logs.stream()
            .flatMap(log -&gt; mapFunction.apply(log).stream())  // map emits a List; flatten it
            .collect(Collectors.toList());
        
        // Partition by reduce key and write to Redis
        Map&lt;Integer, List&lt;KeyValue&gt;&gt; partitions = 
            partitionByKey(mappedResults, task.getNumReducers());
        
        partitions.forEach((partition, data) -&gt; 
            redisTemplate.opsForList()
                .rightPushAll(partitionKey(task.getJobId(), partition), data)
        );
        
        // Report completion to coordinator
        coordinatorService.completeTask(task.getId());
    } catch (Exception e) {
        coordinatorService.failTask(task.getId(), e);
    }
}
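
// Sketch (assumption): the partitionByKey/partitionKey helpers used above
// are not shown in the post. Hashing each key to a non-negative bucket
// guarantees the same key always routes to the same reducer.
private Map&lt;Integer, List&lt;KeyValue&gt;&gt; partitionByKey(List&lt;KeyValue&gt; results, int numReducers) {
    return results.stream()
        .collect(Collectors.groupingBy(
            kv -&gt; Math.floorMod(kv.getKey().hashCode(), numReducers)));
}

private String partitionKey(Long jobId, int partition) {
    return "shuffle:" + jobId + ":" + partition;  // Redis list key per (job, reduce partition)
}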
</code></code></pre><p><strong>3. Shuffle and Reduce Phase</strong>:</p><pre><code><code>@Scheduled(fixedDelay = 1000)
public void executeReduceTasks() {
    List&lt;ReduceTask&gt; tasks = coordinatorService.getReadyReduceTasks();
    
    tasks.parallelStream().forEach(task -&gt; {
        // Fetch all values for assigned partition from Redis
        List&lt;KeyValue&gt; partitionData = redisTemplate.opsForList()
            .range(partitionKey(task.getJobId(), task.getPartition()), 0, -1);
        
        // Group by key and apply reduce function
        Map&lt;String, List&lt;String&gt;&gt; grouped = partitionData.stream()
            .collect(Collectors.groupingBy(
                KeyValue::getKey,
                Collectors.mapping(KeyValue::getValue, Collectors.toList())
            ));
        
        List&lt;Result&gt; results = grouped.entrySet().stream()
            .map(e -&gt; new Result(e.getKey(), reduceFunction.apply(e.getValue())))
            .collect(Collectors.toList());
        
        // Persist final results to PostgreSQL
        resultRepository.saveAll(results);
        coordinatorService.completeTask(task.getId());
    });
}
</code></code></pre><h3>Working demo link :</h3><div id="youtube2-eYe5CnBqHgQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;eYe5CnBqHgQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/eYe5CnBqHgQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>Key Architectural Decisions</h3><p><strong>Why Redis for Intermediate Storage</strong>: We need fast random writes (map output) and sequential reads (reduce input). Redis provides 100K+ ops/second with built-in persistence. The alternative (disk-only) reduces throughput by 10x but improves fault tolerance.</p><p><strong>Task Granularity</strong>: Each map task processes 10,000 log events. Smaller tasks increase scheduling overhead; larger tasks reduce parallelism. This aligns with Google&#8217;s MapReduce guideline: task execution time should be 1-10 minutes.</p><p><strong>Heartbeat Interval</strong>: Workers send heartbeats every 5 seconds with 15-second timeout. Faster intervals waste network bandwidth; slower intervals delay failure detection. This matches AWS EMR&#8217;s production settings.</p><h2>Production Considerations</h2><p><strong>Performance Characteristics</strong>: Our framework processes 50,000 events/second with 4 map workers and 2 reduce workers. Horizontal scaling is linear up to 20 workers (200K events/sec) before coordinator bottleneck. 
Memory footprint: 2GB per worker for 100K intermediate key-value pairs.</p><p><strong>Monitoring Strategy</strong>: Track critical metrics:</p><ul><li><p>Job completion rate and average duration</p></li><li><p>Task failure rate by type (map vs reduce)</p></li><li><p>Shuffle data volume (indicates skew problems)</p></li><li><p>Worker CPU/memory/disk utilization</p></li><li><p>Coordinator queue depth (scheduling bottleneck indicator)</p></li></ul><p><strong>Failure Scenarios</strong>:</p><ul><li><p><strong>Worker Crash</strong>: Coordinator detects via heartbeat timeout, reschedules in-progress tasks</p></li><li><p><strong>Coordinator Crash</strong>: New coordinator reads job state from PostgreSQL, resumes scheduling</p></li><li><p><strong>Data Skew</strong>: One reduce key has 80% of data&#8212;causes straggler. Solution: implement combiner or split hot keys</p></li><li><p><strong>Network Partition</strong>: Workers isolated from coordinator. Solution: implement split-brain detection with fencing tokens</p></li></ul><p><strong>Scalability Bottlenecks</strong>: The coordinator handles 1000 tasks/second. Beyond that, implement hierarchical coordinators or consistent hashing for task assignment. Redis shuffle layer supports 1M keys before requiring Redis Cluster (sharding).</p><h2>Scale Connection: MapReduce in Production Systems</h2><p>Google&#8217;s original MapReduce processed 20PB per day across 1000s of machines. Modern implementations scale further:</p><p><strong>Facebook&#8217;s Corona</strong>: Schedules 100,000+ MapReduce jobs daily across 60,000 machines, processing 600PB of data monthly. They implement three-level scheduling hierarchy to scale the coordinator.</p><p><strong>LinkedIn&#8217;s Hadoop</strong>: Runs 250,000 jobs per day with average job completion time of 4 minutes. 
Their optimization: aggressive speculative execution reduces tail latency by 40%.</p><p><strong>Twitter&#8217;s Scalding</strong>: Processes 100TB+ daily logs for real-time and batch analytics. They combine MapReduce (batch) with Storm (streaming) for lambda architecture.</p><p>The pattern we implemented today&#8212;map/shuffle/reduce with fault-tolerant coordination&#8212;remains the foundation of modern big data processing, evolved into frameworks like Spark and Flink but retaining the same core abstractions.</p>]]></content:encoded></item><item><title><![CDATA[Day 44: Real-Time Monitoring Dashboard with Kafka Streams]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-44-real-time-monitoring-dashboard-60a</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-44-real-time-monitoring-dashboard-60a</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Tue, 17 Mar 2026 09:14:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kxAj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><ul><li><p><strong>Live metrics aggregation system</strong> processing 40,000+ events/second with sub-second latency</p></li><li><p><strong>Kafka Streams processor</strong> performing windowed aggregations, percentile calculations, and anomaly detection</p></li><li><p><strong>Real-time dashboard API</strong> serving live statistics with WebSocket updates</p></li><li><p><strong>Production monitoring stack</strong> with Grafana dashboards tracking stream processor health</p></li><li><p><strong>Fault-tolerant state management</strong> using RocksDB-backed state stores with changelog topics</p></li></ul><h2>Why This Matters: Observability at Internet Scale</h2><blockquote><p>When 
Netflix processes 450 billion events per day from their streaming platform, or Uber analyzes 100 million trip events daily, they need real-time visibility into system behavior. Traditional batch processing creates blind spots&#8212;by the time you see yesterday&#8217;s metrics, today&#8217;s incidents have already cascaded. Real-time stream processing transforms raw events into actionable insights within milliseconds, enabling immediate detection of anomalies, capacity issues, and user-impacting problems.</p><p>The challenge isn&#8217;t just aggregating data&#8212;it&#8217;s maintaining accurate state across failures, handling late-arriving events, managing memory with billions of unique keys, and providing consistent results during rebalances. A poorly designed streaming pipeline can lose data during crashes, produce duplicate counts after restarts, or fall behind during traffic spikes. Today we&#8217;ll build a production-grade monitoring system that handles these challenges using Kafka Streams&#8217; exactly-once semantics, fault-tolerant state stores, and windowed aggregations.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kxAj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kxAj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!kxAj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 848w, 
https://substackcdn.com/image/fetch/$s_!kxAj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!kxAj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kxAj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2307353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/184851829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kxAj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 424w, 
https://substackcdn.com/image/fetch/$s_!kxAj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!kxAj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!kxAj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a1660aa-e451-414b-8e94-5c33bced18e6_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>System Design Deep Dive: Stream Processing Patterns</h2><h3>1. Stateful Stream Processing with RocksDB</h3><p>Kafka Streams maintains state locally in embedded RocksDB databases, backed by Kafka changelog topics. When your stream processor calculates &#8220;requests per minute&#8221; or &#8220;95th percentile latency,&#8221; it&#8217;s not querying a database&#8212;it&#8217;s updating in-memory/on-disk state stores that survive process crashes. Each state store has a corresponding changelog topic that captures every state mutation. If a processor crashes, the replacement reads the changelog to rebuild state from the last checkpoint.</p><p><strong>Trade-off</strong>: Local state provides sub-millisecond query latency but limits scalability to disk capacity per instance. For aggregations tracking millions of unique keys, you must partition state across multiple processor instances. The <code>group_id</code> determines which processor owns which keys&#8212;consistent hashing ensures the same keys always route to the same partition.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://sdcourse.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Hands On System Design Course - Code Everyday  is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>Anti-pattern</strong>: Storing unbounded state. Without time-based eviction (windowing) or size limits, state stores grow indefinitely. A service tracking &#8220;unique users per endpoint&#8221; will eventually run out of disk if it never expires old keys. Always implement retention policies aligned with business requirements.</p><h3>2. Windowed Aggregations and Time Semantics</h3><p>Stream processing must handle time ambiguity&#8212;events arrive with: processing time (when the processor sees it), event time (when it actually occurred), and ingestion time (when Kafka received it). A log event from 10 minutes ago arriving now (due to network delays) must contribute to the correct time window, not the current one.</p><p>Kafka Streams supports tumbling windows (non-overlapping 1-minute buckets), hopping windows (overlapping 5-minute windows advancing every 1 minute), and session windows (grouped by inactivity gaps). Each creates different trade-offs: tumbling windows provide simple counts but miss trends across boundaries; hopping windows smooth outliers but duplicate event processing; session windows handle bursty traffic but complicate memory management.</p><p><strong>Production insight</strong>: Always configure grace periods for late arrivals. A 1-minute window with 30-second grace accepts events up to 1:30 after window close, balancing completeness against latency. 
LinkedIn&#8217;s Samza learned this the hard way&#8212;their initial streaming pipelines dropped 2% of events during peak load because they closed windows too aggressively.</p><h3>3. Materialized Views and Interactive Queries</h3><p>State stores serve dual purposes: internal processing state and queryable materialized views. Your Kafka Streams application can expose REST endpoints that query local state stores directly, bypassing external databases. When the dashboard requests &#8220;current requests/sec by endpoint,&#8221; the API queries the stream processor&#8217;s state store&#8212;no database roundtrip needed.</p><p><strong>Scaling consideration</strong>: State stores are partitioned&#8212;a query for <code>/api/users</code> might land on instance-1, but that instance only holds state for partition 0. You need either: (1) scatter-gather queries across all instances, (2) routing proxy directing queries to correct partition, or (3) global state stores replicated to all instances. Global stores solve routing but triple memory usage for frequently queried data.</p><p><strong>Twitter&#8217;s architecture</strong>: Their real-time analytics use interactive queries against Kafka Streams state stores for the first 7 days of data, then fall back to Druid for historical analysis. This hybrid approach balances query latency (5ms from state stores vs 50ms from Druid) against storage costs.</p><h3>4. Exactly-Once Stream Processing</h3><p>Kafka Streams achieves exactly-once semantics through transactional writes&#8212;each processing step (read input, update state, write output) executes atomically. If the processor crashes mid-transaction, the entire operation rolls back. This prevents duplicate counts after restarts, a common bug in at-least-once processing.</p><p><strong>Implementation</strong>: Enable <code>processing.guarantee=exactly_once_v2</code> and ensure all state operations happen within the topology. 
External side effects (database writes, API calls) break exactly-once guarantees&#8212;if your stream processor writes to PostgreSQL, then crashes, the Kafka message will be reprocessed but the DB write won&#8217;t roll back, creating duplicates.</p><p><strong>Trade-off</strong>: Exactly-once processing adds 10-15% latency overhead from transactional commits. For monitoring dashboards where occasional duplicates are acceptable, at-least-once processing provides better throughput. For financial transactions or user account state, exactly-once is mandatory.</p><h3>5. Stream Processing Failure Modes</h3><p>Stream processors fail differently than request-response services. A crashed processor doesn&#8217;t just stop responding&#8212;it triggers rebalances that temporarily halt all partition processing. During rebalance, the dashboard shows stale data until state restoration completes (reading changelog topics can take 30-60 seconds for large state stores).</p><p><strong>Cascading failures</strong>: One slow processor instance causes Kafka consumer group heartbeat timeouts, triggering rebalances across all instances, pausing all processing during state restoration, creating backlog that overloads instances when they resume. This cascade can bring down entire streaming pipelines.</p><p><strong>Mitigation</strong>: Implement backpressure handling&#8212;if state stores can&#8217;t keep up with ingestion rate, the processor should pause consumption rather than accepting unbounded backlog. 
Configure <code>max.poll.interval.ms</code> generously (5 minutes) to prevent false-positive timeouts during legitimate processing spikes.</p><h2>Implementation Walkthrough: Building the Monitoring Pipeline</h2><h3>GitHub Link:</h3><pre><code><a href="https://github.com/sysdr/sdc-java/tree/main/day44/realtime-monitoring-dashboard">https://github.com/sysdr/sdc-java/tree/main/day44/realtime-monitoring-dashboard</a> </code></pre><h3>Service Architecture</h3><p>Our system consists of four Spring Boot services:</p><p><strong>log-producer</strong> generates realistic log events (HTTP requests, database queries, cache operations) at 40,000 events/second. Each event includes timestamp, endpoint, response time, status code, and user identifier. Events flow into <code>log-events</code> Kafka topic with 12 partitions for parallelism.</p><p><strong>stream-processor</strong> consumes from <code>log-events</code>, performs windowed aggregations (requests per minute, error rates, latency percentiles), and materializes results to state stores. It exposes REST endpoints querying these state stores&#8212;no external database required. The processor computes:</p><ul><li><p>Request count per endpoint per minute (tumbling window)</p></li><li><p>Error rate by status code per 5-minute window (hopping)</p></li><li><p>P50, P95, P99 latency using t-digest algorithm</p></li><li><p>Anomaly detection flagging 3-sigma deviations</p></li></ul><p><strong>dashboard-api</strong> serves WebSocket connections, polling the stream processor&#8217;s interactive queries every second and pushing updates to connected clients. It maintains connection state in Redis for horizontal scaling&#8212;multiple API instances can serve different dashboard clients.</p><p><strong>dashboard-ui</strong> provides a single-page React application with real-time charts. 
We use Recharts for time-series visualization and WebSocket API for live data streaming.</p><h3>Kafka Streams Topology</h3><pre><code><code>StreamsBuilder builder = new StreamsBuilder();

KStream&lt;String, LogEvent&gt; events = builder.stream("log-events");

// Windowed aggregation: requests per endpoint per minute
KTable&lt;Windowed&lt;String&gt;, Long&gt; requestCounts = events
    .groupBy((key, event) -&gt; event.getEndpoint())
    // 1-minute tumbling window with a 30-second grace period for late
    // events, per the production insight above
    .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofSeconds(30)))
    .count(Materialized.as("request-counts-store"));

// Error rate calculation
KTable&lt;Windowed&lt;String&gt;, Double&gt; errorRates = events
    .groupBy((key, event) -&gt; event.getEndpoint())
    // 5-minute hopping window advancing every minute (as described above),
    // with a grace period for late-arriving events
    .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofSeconds(30))
        .advanceBy(Duration.ofMinutes(1)))
    .aggregate(
        ErrorRateAccumulator::new,
        (key, event, accumulator) -&gt; accumulator.add(event),
        Materialized.as("error-rates-store")
    )
    .mapValues(acc -&gt; acc.calculateRate());
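
// ---------------------------------------------------------------------
// Sketch (not from the original post): wiring the topology into a running
// KafkaStreams instance. The config keys are standard Kafka Streams
// settings; the application id and broker address are illustrative.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-monitoring-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// exactly-once semantics as discussed above (adds transactional overhead)
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));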
</code></code></pre><p>The topology defines data flow transformations&#8212;grouping, windowing, aggregation&#8212;that execute across multiple instances. State stores (<code>request-counts-store</code>) automatically partition across stream processor instances based on endpoint hash.</p><h3>Interactive Query Implementation</h3><p>The stream processor exposes REST endpoints that query local state stores:</p><pre><code><code>@GetMapping("/metrics/requests")
public Map&lt;String, Long&gt; getRequestCounts() {
    ReadOnlyWindowStore&lt;String, Long&gt; store = 
        streams.store(StoreQueryParameters.fromNameAndType(
            "request-counts-store", 
            QueryableStoreTypes.windowStore()
        ));
    
    Instant now = Instant.now();
    Instant start = now.minus(Duration.ofMinutes(5));
    
    Map&lt;String, Long&gt; results = new HashMap&lt;&gt;();
    // Scan only windows overlapping [start, now] and close the RocksDB-backed
    // iterator with try-with-resources; sum counts across the tumbling
    // windows so each endpoint reports its full 5-minute total
    try (KeyValueIterator&lt;Windowed&lt;String&gt;, Long&gt; iter = store.fetchAll(start, now)) {
        iter.forEachRemaining(kv -&gt; results.merge(kv.key.key(), kv.value, Long::sum));
    }
    return results;
}
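
// Sketch (not from the original post): while a rebalance is restoring local
// state, streams.store(...) throws InvalidStateStoreException. Surfacing it
// as HTTP 503 lets the dashboard show a loading state until recovery completes.
@ExceptionHandler(InvalidStateStoreException.class)
public ResponseEntity onStoreRestoring(InvalidStateStoreException e) {
    return ResponseEntity.status(503).body("State store restoring, retry shortly");
}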
</code></code></pre><p>This query runs in O(k &#215; w) time, where k is the number of endpoints and w the number of retained windows per endpoint&#8212;no database query, no network calls, pure local state access.</p><h3>Working Demo Link:</h3><div id="youtube2-egLvKBabQcA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;egLvKBabQcA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/egLvKBabQcA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>State Store Failure Recovery</h3><p>When a stream processor crashes, Kafka Streams handles recovery automatically:</p><ol><li><p>Consumer group rebalances&#8212;partitions reassign to surviving instances</p></li><li><p>New partition owner reads changelog topic from last committed offset</p></li><li><p>State store rebuilds from changelog (may take 30-60 seconds)</p></li><li><p>Processing resumes once state restoration completes</p></li></ol><p>During restoration, the state store is unavailable&#8212;queries return 503. The dashboard API must handle this gracefully, showing &#8220;loading&#8221; state until the processor recovers.</p><h2>Production Considerations</h2><p><strong>Performance bottlenecks</strong>: State store compaction is CPU-intensive&#8212;RocksDB background threads can consume 20-30% CPU even during steady-state operation. Monitor the <code>rocksdb.total-sst-files-size</code> metric&#8212;growth beyond available disk indicates insufficient compaction. Increase <code>rocksdb.max-background-compactions</code> if you see compaction delays.</p><p><strong>Memory management</strong>: Each windowed aggregation creates new state store entries&#8212;1,000 endpoints &#215; 60 retained one-minute windows (an hour of retention) = 60,000 state entries. 
With 40,000 events/sec, memory usage grows at ~500MB/hour. Implement time-based retention using suppress operators to purge old windows automatically.</p><p><strong>Monitoring critical metrics</strong>:</p><ul><li><p><code>kafka-streams-state-store-lag</code>: Indicates how far behind state stores are from Kafka topics (target: &lt;1000)</p></li><li><p><code>stream-processor-commit-latency-avg</code>: Time to commit state changes (target: &lt;100ms)</p></li><li><p><code>rebalance-time</code>: Downtime during partition reassignment (target: &lt;30 seconds)</p></li></ul><p><strong>Failure scenario testing</strong>: Simulate instance crashes with <code>kill -9</code>, verify state restoration completes within SLA, confirm no data loss or duplicate counts. Test backpressure handling&#8212;what happens when ingestion rate exceeds processing capacity? The system should pause consumption rather than falling behind indefinitely.</p><h2>Scale Connection: Real-World Stream Processing</h2><p><strong>LinkedIn&#8217;s Venice</strong>: Processes 400,000 events/second using Kafka Streams with 2TB of state distributed across 50 stream processor instances. They achieve P99 query latency of 5ms by pre-aggregating hot keys (top 1000 endpoints) into separate state stores cached in memory.</p><p><strong>Uber&#8217;s AresDB</strong>: Real-time analytics on 100 million trips/day use GPU-accelerated aggregations in Kafka Streams pipelines. By offloading percentile calculations to CUDA kernels, they reduced per-event processing time from 2ms to 0.3ms, enabling real-time fraud detection across 10,000+ cities.</p><p><strong>Netflix&#8217;s Keystone</strong>: Monitors 450 billion events/day from streaming devices using Kafka Streams for real-time alerting. 
They partition state by region (us-east-1, eu-west-1) to isolate failures&#8212;an outage in one region doesn&#8217;t halt global monitoring.</p><p><strong>Key Insight</strong>: Real-time monitoring isn&#8217;t about speed&#8212;it&#8217;s about maintaining accurate state across failures while processing unbounded data streams. The hard problem is making your aggregations survive the chaos of production: crashes, rebalances, network partitions, and traffic spikes.</p>]]></content:encoded></item><item><title><![CDATA[Day 43: Implement Log Compaction for State Management]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-43-implement-log-compaction-for</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-43-implement-log-compaction-for</guid><dc:creator><![CDATA[sdr11]]></dc:creator><pubDate>Fri, 13 Mar 2026 03:03:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZA6w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png" 
length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><blockquote><p>Today we&#8217;re implementing a production-grade state management system using Kafka log compaction:</p></blockquote><ul><li><p><strong>Compacted Topics</strong> that maintain only the latest state for each entity key</p></li><li><p><strong>State Producer Service</strong> generating entity lifecycle events with proper keying</p></li><li><p><strong>State Consumer Service</strong> maintaining current entity snapshots from the compacted log</p></li><li><p><strong>State Query API</strong> providing fast lookups of current entity state with Redis caching</p></li></ul><h2>Why This Matters: The State Management Challenge at Scale</h2><blockquote><p>Every distributed system faces the same fundamental challenge: how do you maintain current state across dozens of microservices without creating a monolithic database bottleneck?</p><p>Traditional approaches fail at scale. Storing complete event histories consumes unbounded storage. Database-per-service patterns create consistency nightmares during failures. Cache invalidation becomes impossibly complex with hundreds of service instances.</p><p>Log compaction solves this by treating your event log as a self-maintaining state store. Instead of storing every state transition, Kafka automatically retains only the latest value for each key. This gives you the benefits of event sourcing (complete audit trail, replayability, temporal queries) while maintaining bounded storage and fast state reconstruction.</p><p>Netflix uses this pattern to maintain current device registration state across 200+ million subscribers. When a device registers, deregisters, or updates settings, those events flow through compacted topics. Any service can rebuild complete device state by consuming from offset 0, getting only current registrations. 
Uber applies the same pattern to driver location state, maintaining billions of location updates while keeping storage constant.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZA6w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZA6w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!ZA6w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!ZA6w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!ZA6w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZA6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png" width="1456" height="936" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2715029,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/184525555?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZA6w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!ZA6w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!ZA6w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!ZA6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0689bae1-5c0b-4b60-8b4a-e57f35bcf167_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
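As a concrete sketch of the compacted-topic setup described above (not from the original post: the topic name, partition count, and config values are illustrative), a compacted topic can be created with Kafka's AdminClient:

```java
// Create a compacted topic: the log cleaner keeps only the latest record
// per key instead of deleting by age. Names and values are illustrative.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
try (Admin admin = Admin.create(props)) {
    NewTopic topic = new NewTopic("device-state", 12, (short) 3)
        .configs(Map.of(
            "cleanup.policy", "compact",          // retain latest value per key
            "min.cleanable.dirty.ratio", "0.5",   // compact once half the log is dirty
            "delete.retention.ms", "86400000"));  // keep tombstones 24h for consumers
    admin.createTopics(List.of(topic)).all().get(); // blocks until creation completes
}
```

Here <code>delete.retention.ms</code> matters because deletes are tombstone records: consumers rebuilding state from offset 0 must see the tombstone before the cleaner purges it.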
      <p>
          <a href="https://sdcourse.substack.com/p/day-43-implement-log-compaction-for">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 42: Exactly-Once Processing Semantics in Distributed Log Systems]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-42-exactly-once-processing-semantics</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-42-exactly-once-processing-semantics</guid><dc:creator><![CDATA[sysdai]]></dc:creator><pubDate>Mon, 09 Mar 2026 08:30:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2Nht!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today we implement exactly-once processing semantics in our Kafka-based log processing system, guaranteeing no duplicate message processing even during failures:</p><ul><li><p><strong>Idempotent Kafka producers</strong> preventing duplicate writes on network retries</p></li><li><p><strong>Transactional message processing</strong> with atomic offset commits and database writes</p></li><li><p><strong>Deduplication layer</strong> using Redis for distributed idempotency keys</p></li><li><p><strong>State reconciliation service</strong> detecting and recovering from processing anomalies</p></li><li><p><strong>End-to-end exactly-once pipeline</strong> from producer through consumer to database</p></li></ul><h2>Why This Matters: The $10 Million Double-Charge Problem</h2><blockquote><p>In 2019, a major payment processor experienced a 47-second network partition during peak Black Friday traffic. Their Kafka consumers lost connections, reconnected, and reprocessed 180,000 payment authorization messages&#8212;charging customers twice. The cost: $10.3 million in refunds, regulatory fines, and customer service overhead.</p><p>The root cause wasn&#8217;t Kafka. 
It was the absence of exactly-once semantics. Without idempotent producers, network retries created duplicate messages. Without transactional consumers, offset commits happened before database writes, causing reprocessing on crashes. Without deduplication, the same payment ID was processed multiple times.</p><p>Exactly-once processing isn&#8217;t about theoretical correctness&#8212;it&#8217;s about financial accuracy, compliance requirements, and system reliability at scale. When Uber processes 100 million trip events daily, Stripe handles billions in transactions, or AWS Lambda processes trillions of invocations, &#8220;at-least-once with deduplication&#8221; becomes a critical architectural pattern.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Nht!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Nht!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!2Nht!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!2Nht!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2Nht!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Nht!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png" width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Nht!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!2Nht!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!2Nht!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2Nht!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedc06f8-c4df-46f1-a522-da63e1b5fb21_6000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2></h2>
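A minimal sketch of the producer-side configuration for this pattern (not from the original post; the broker address and transactional id are illustrative, while the property keys are standard Kafka producer settings):

```java
// Producer settings for exactly-once pipelines. Idempotence makes network
// retries safe; a transactional id lets output records and consumer offsets
// commit atomically, preventing the double-processing described above.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");      // illustrative
props.put("enable.idempotence", "true");  // broker de-duplicates retried sends
props.put("acks", "all");                 // required alongside idempotence
props.put("transactional.id", "payment-authorizer-1"); // illustrative

// Transactional send loop (sketch):
//   producer.initTransactions();
//   producer.beginTransaction();
//   producer.send(record);
//   producer.sendOffsetsToTransaction(offsets, groupMetadata);
//   producer.commitTransaction();
```

If the process crashes between <code>beginTransaction</code> and <code>commitTransaction</code>, the broker aborts the transaction, so neither the output records nor the offset advance become visible.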
      <p>
          <a href="https://sdcourse.substack.com/p/day-42-exactly-once-processing-semantics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 41: Kafka Partitioning and Consumer Groups - Parallel Log Processing at Scale]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-41-kafka-partitioning-and-consumer</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-41-kafka-partitioning-and-consumer</guid><dc:creator><![CDATA[sysdai]]></dc:creator><pubDate>Thu, 05 Mar 2026 08:30:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!f2D6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today we implement horizontal scalability for log processing through Kafka&#8217;s partitioning and consumer group mechanisms:</p><ul><li><p><strong>Topic partitioning strategy</strong> with semantic key selection for log event distribution</p></li><li><p><strong>Consumer group coordination</strong> enabling automatic load balancing across multiple consumer instances</p></li><li><p><strong>Dynamic rebalancing protocols</strong> handling consumer failures and scale-out scenarios gracefully</p></li><li><p><strong>Partition assignment strategies</strong> optimizing throughput for different log processing workloads</p></li></ul><h2>Why This Matters: The Parallel Processing Foundation</h2><blockquote><p>When Netflix processes 500 billion events daily or Uber handles 14 million trips per day, single-threaded processing isn&#8217;t an option. 
Kafka&#8217;s partitioning model solves the fundamental distributed systems challenge: how do we process millions of messages per second while maintaining order guarantees where they matter and maximizing parallelism everywhere else?</p><p>The architectural decision between using a single partition (strong ordering, limited throughput) versus multiple partitions (high throughput, ordering within partitions only) defines your system&#8217;s scalability ceiling. Companies like LinkedIn process 7 trillion messages daily through Kafka precisely because partitioning enables horizontal scaling - adding more consumers linearly increases processing capacity without architectural changes.</p><p>Understanding partition assignment and consumer group coordination is critical for system design interviews. When asked &#8220;how would you process 1 million events per second?&#8221;, the answer involves this exact partitioning strategy with consumer groups, not faster machines.</p></blockquote><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f2D6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f2D6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!f2D6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 848w, 
https://substackcdn.com/image/fetch/$s_!f2D6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!f2D6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f2D6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f2D6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!f2D6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 848w, 
https://substackcdn.com/image/fetch/$s_!f2D6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!f2D6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb343bd12-2c29-4550-9042-02ee1c8c355d_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
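The ordering trade-off described above comes down to the key-to-partition mapping: every record with the same key hashes to the same partition, so per-key order is preserved while distinct keys fan out for parallelism. A minimal sketch of that mapping (Kafka's default partitioner actually uses murmur2; the plain polynomial hash below is an illustrative stand-in, not the real algorithm):

```java
import java.nio.charset.StandardCharsets;

public class KeyedPartitioner {
    // Deterministic key -> partition mapping: the same key always lands on
    // the same partition, so records for one key stay totally ordered there.
    // (Illustrative hash only; Kafka's built-in partitioner uses murmur2.)
    public static int partitionFor(String key, int numPartitions) {
        int hash = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return (hash & 0x7fffffff) % numPartitions; // clear sign bit before modulo
    }
}
```

The design decision is the key itself: keying by serviceId keeps each service's logs ordered within one partition, while keying by a random UUID maximizes spread but gives up ordering entirely.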
      <p>
          <a href="https://sdcourse.substack.com/p/day-41-kafka-partitioning-and-consumer">
              Read more
          </a>
      </p>
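The dynamic rebalancing covered in this lesson reduces to recomputing a deterministic partition assignment over the group's live members whenever one joins or fails. A toy version of range-style assignment (hypothetical class for illustration, not the Kafka client API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RangeAssignor {
    // Splits partitions [0..numPartitions) into contiguous ranges, one per
    // consumer; the first (numPartitions % consumers) members get one extra.
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        int base = numPartitions / consumers.size();
        int extra = numPartitions % consumers.size();
        int next = 0;
        for (int i = 0; i < consumers.size(); i++) {
            int count = base + (i < extra ? 1 : 0);
            List<Integer> partitions = new ArrayList<>();
            for (int j = 0; j < count; j++) partitions.add(next++);
            assignment.put(consumers.get(i), partitions);
        }
        return assignment;
    }
}
```

Re-running the assignment over the surviving members is, in essence, what a rebalance does: the dead consumer's partitions are redistributed and processing resumes from the last committed offsets.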
   ]]></content:encoded></item><item><title><![CDATA[Day 40: Implement Kafka Consumers for Log Processing]]></title><description><![CDATA[Nothing teaches better than &#8220;Code in Action&#8221;.]]></description><link>https://sdcourse.substack.com/p/day-40-implement-kafka-consumers</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-40-implement-kafka-consumers</guid><dc:creator><![CDATA[sysdai]]></dc:creator><pubDate>Sun, 01 Mar 2026 05:47:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!txJ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>Nothing teaches better than &#8220;<strong>Code in Action</strong>&#8221;.</p><ul><li><p><em><strong>Learn AI Agents : </strong><a href="https://aiamastery.substack.com/subscribe">Join the AI agent revolution</a> before your competition does.</em></p></li><li><p>Explore more  hands-on courses <a href="https://systemdrd.com">https://systemdrd.com</a></p></li><li><p>Lifetime Access : 4 hands on courses  + full portal with <strong>Pro Max</strong> offer &#8594; <a href="https://systemdrd.com/pricing/?period=yearly">link</a></p></li></ul></blockquote><div><hr></div><h2>What We&#8217;re Building Today</h2><ul><li><p><strong>Consumer group architecture</strong> with automatic partition assignment and rebalancing</p></li><li><p><strong>Offset management strategies</strong> implementing at-least-once and exactly-once semantics</p></li><li><p><strong>Multi-threaded processing pipeline</strong> with parallel log transformation and enrichment</p></li><li><p><strong>Dead letter queue pattern</strong> for poison pill messages and retry exhaustion handling</p></li></ul><h2>Why This Matters</h2><blockquote><p>Consumer implementation determines your system&#8217;s throughput, reliability, and operational complexity. 
While producers are relatively simple&#8212;fire and forget with acks&#8212;consumers manage offset commits, rebalancing coordination, and state consistency across failures. At Netflix, consumer lag spikes directly correlate with degraded user experience as recommendations stale. Uber&#8217;s geospatial processing consumers must maintain exactly-once semantics to prevent duplicate ride assignments. The difference between naive polling loops and production consumer patterns is the gap between prototypes that process 500 events/second and systems handling 500,000 events/second with zero data loss.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!txJ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!txJ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!txJ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!txJ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!txJ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!txJ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!txJ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!txJ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!txJ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 1272w, https://substackcdn.com/image/fetch/$s_!txJ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6786876-0b6a-47a4-bc56-d4ccfc438cc7_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>System Design Deep Dive</h2>
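Offset management is the heart of the at-least-once guarantee listed above: commit only after a batch is fully processed, so a crash replays the uncommitted records instead of losing them. A minimal in-memory model of that commit discipline (class and method names are illustrative, not the Spring Kafka API):

```java
import java.util.ArrayList;
import java.util.List;

public class AtLeastOnceConsumer {
    // At-least-once: advance the committed offset only after the whole batch
    // succeeds; a crash mid-batch redoes the work from the last commit
    // (possible duplicate side effects, but no data loss).
    private long committedOffset = 0;
    private final List<String> processed = new ArrayList<>();

    public void pollAndProcess(String[] records, boolean crashBeforeCommit) {
        List<String> batch = new ArrayList<>();
        for (long o = committedOffset; o < records.length; o++) {
            batch.add(records[(int) o]);   // "process" the record
        }
        if (crashBeforeCommit) return;     // crash: offset never advanced
        processed.addAll(batch);
        committedOffset = records.length;  // the commitSync() equivalent
    }

    public long committedOffset() { return committedOffset; }
    public List<String> processed() { return processed; }
}
```

Committing before processing flips the semantics to at-most-once: a crash after the commit silently drops the batch.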
      <p>
          <a href="https://sdcourse.substack.com/p/day-40-implement-kafka-consumers">
              Read more
          </a>
      </p>
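The dead letter queue pattern from the outline can be modeled in a few lines: bound the retries, then park the poison pill on a side queue so one bad record never blocks the rest of the partition. A sketch (hypothetical class, not a real library API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

public class DlqHandler {
    private final int maxRetries;
    private final Deque<String> deadLetters = new ArrayDeque<>();

    public DlqHandler(int maxRetries) { this.maxRetries = maxRetries; }

    /** Returns true if processed, false if the record was dead-lettered. */
    public boolean handle(String record, Consumer<String> processor) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                processor.accept(record);
                return true;               // success: safe to commit and move on
            } catch (RuntimeException e) {
                // swallow and retry; production code would back off here
            }
        }
        deadLetters.add(record);           // retries exhausted: park the poison pill
        return false;
    }

    public Deque<String> deadLetters() { return deadLetters; }
}
```

In a real deployment the "queue" is itself a Kafka topic (e.g. a `.DLT` suffix), which keeps failed records durable and replayable after the bug is fixed.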
   ]]></content:encoded></item><item><title><![CDATA[Day 39: Kafka Producers for Log Ingestion - Building High-Throughput Log Shippers]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-39-kafka-producers-for-log-ingestion</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-39-kafka-producers-for-log-ingestion</guid><dc:creator><![CDATA[sysdai]]></dc:creator><pubDate>Wed, 25 Feb 2026 09:02:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0mVZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today we implement production-grade Kafka producers that form the critical ingestion layer of our distributed log processing system:</p><ul><li><p><strong>Multi-source log shippers</strong> that collect logs from applications, services, and infrastructure components</p></li><li><p><strong>High-throughput Kafka producers</strong> capable of handling 50,000+ events/second with sub-10ms latency</p></li><li><p><strong>Intelligent batching and compression</strong> strategies that optimize network utilization and reduce Kafka broker load</p></li><li><p><strong>Producer-side monitoring and observability</strong> with Prometheus metrics, latency histograms, and error tracking</p></li></ul><h2>Why This Matters: The $10M Question of Log Ingestion</h2><blockquote><p>When Twitter experienced cascading failures in 2016, engineers discovered their log ingestion pipeline had dropped 40% of critical error logs during the incident. The missing data cost them millions in debugging time and prevented root cause analysis. The culprit? 
Naive Kafka producers without proper backpressure handling, retry logic, or circuit breakers.</p><p>At Netflix scale (processing 500+ billion events daily), every millisecond of producer latency multiplies across thousands of services. A poorly configured producer can create backpressure that cascades through your entire microservices ecosystem, causing request timeouts, degraded user experience, and revenue loss. Getting producer configuration right isn&#8217;t optional&#8212;it&#8217;s the difference between a system that scales gracefully and one that collapses under load.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mVZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mVZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 424w, https://substackcdn.com/image/fetch/$s_!0mVZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 848w, https://substackcdn.com/image/fetch/$s_!0mVZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 1272w, https://substackcdn.com/image/fetch/$s_!0mVZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0mVZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0mVZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 424w, https://substackcdn.com/image/fetch/$s_!0mVZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 848w, https://substackcdn.com/image/fetch/$s_!0mVZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 1272w, https://substackcdn.com/image/fetch/$s_!0mVZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f400e4-d6c1-4d1a-920b-507e91507e8a_7000x5000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>System Design Deep Dive</h2><h3>1. Producer Acknowledgment Semantics and Durability Trade-offs</h3><p>Kafka producers offer three acknowledgment levels, each representing a different point on the durability/throughput spectrum:</p><p><strong>acks=0 (Fire-and-Forget)</strong>: The producer doesn&#8217;t wait for broker acknowledgment. This achieves maximum throughput (100,000+ msg/sec) but risks data loss if brokers fail. Use case: high-volume metrics where occasional loss is acceptable (think Uber&#8217;s GPS tracking, where losing 0.1% of coordinates doesn&#8217;t affect route calculation).</p><p><strong>acks=1 (Leader Acknowledgment)</strong>: The producer waits for confirmation from the leader broker. This balances throughput (50,000+ msg/sec) with durability. 
Risk: Data loss if leader fails before replication. This is the sweet spot for most log ingestion scenarios&#8212;you get reasonable guarantees without sacrificing performance.</p><p><strong>acks=all (Full Replication)</strong>: Producer waits for all in-sync replicas. Guarantees no data loss but reduces throughput (10,000-20,000 msg/sec) and increases latency (20-50ms). Essential for financial transactions or audit logs where every event must be preserved.</p><p>The architectural insight: <strong>There&#8217;s no universal &#8220;best&#8221; setting</strong>. Netflix uses acks=1 for application logs but acks=all for billing events. Your producer configuration should match your data&#8217;s business value.</p><h3>2. Batching, Linger, and the Throughput-Latency Trade-off</h3><p>Individual message sends are network-inefficient&#8212;each requires a round trip to the broker. Kafka producers solve this through batching, controlled by two critical parameters:</p><p><strong>batch.size</strong>: Maximum batch size in bytes (default 16KB). Larger batches improve throughput by amortizing network overhead across multiple messages. At 32KB batches, you can achieve 3x throughput compared to individual sends.</p><p><strong>linger.ms</strong>: How long the producer waits to accumulate messages before sending. Setting linger.ms=10 means &#8220;wait up to 10ms to build a larger batch.&#8221; This is counterintuitive&#8212;adding latency improves throughput.</p><p>The trade-off: A producer with batch.size=32KB and linger.ms=0 achieves low latency (2-5ms) but moderate throughput (20,000 msg/sec). Setting linger.ms=10 increases latency to 12-15ms but achieves 60,000+ msg/sec through better batching.</p><p><strong>Real-world application</strong>: Uber&#8217;s log ingestion uses adaptive linger&#8212;during high traffic (rush hour), linger.ms=5 because batches fill quickly. During low traffic (3 AM), linger.ms=50 ensures efficient batching despite low event rates.</p><h3>3. 
Compression and Network Efficiency</h3><p>Uncompressed logs consume massive bandwidth. A typical application log (200 bytes) contains repetitive structure&#8212;timestamps, log levels, class names. Compression algorithms exploit this repetition:</p><p><strong>snappy</strong>: Fast compression (20-30% size reduction) with minimal CPU overhead. Compresses at 250+ MB/sec, making it ideal for high-volume scenarios where CPU is precious.</p><p><strong>gzip</strong>: Better compression (40-50% reduction) but slower (80-100 MB/sec). Use when network bandwidth is the bottleneck, not CPU.</p><p><strong>lz4</strong>: Fastest option (300+ MB/sec) with moderate compression (25-35% reduction). The default choice for most production systems.</p><p>At 50,000 events/sec with 200-byte messages, that&#8217;s 10 MB/sec uncompressed. With lz4 compression, you&#8217;re down to 7 MB/sec&#8212;a 30% reduction in cross-datacenter bandwidth costs. Over a year, this saves hundreds of thousands in AWS data transfer fees.</p><h3>4. Idempotence and Exactly-Once Semantics</h3><p>Network failures create a subtle problem: when a send times out, did the broker receive the message? Retrying might create duplicates. Not retrying risks data loss.</p><p>Kafka&#8217;s idempotent producer solves this by assigning each message a sequence number. If the broker receives duplicate sequence numbers from the same producer, it deduplicates automatically. Enable with <code>enable.idempotence=true</code>.</p><p><strong>The cost</strong>: Idempotence requires acks=all and limits in-flight requests to 5. This reduces maximum throughput from 100,000+ to 30,000-40,000 msg/sec. 
But you get exactly-once semantics&#8212;critical for financial logs, user analytics, or any scenario where duplicates corrupt downstream processing.</p><p><strong>Architectural decision</strong>: LinkedIn&#8217;s log pipeline uses idempotent producers for user activity events (preventing double-counting in analytics) but non-idempotent for debug logs where occasional duplicates are harmless and throughput matters more.</p><h3>5. Circuit Breakers and Graceful Degradation</h3><p>When Kafka brokers become unavailable, naive producers queue messages in memory until they run out of heap, causing OutOfMemory crashes. Production systems need circuit breaker patterns:</p><p><strong>buffer.memory</strong>: Maximum memory for buffering unsent messages (default 32MB). When exhausted, send() calls block or throw exceptions.</p><p><strong>max.block.ms</strong>: How long to block before throwing TimeoutException (default 60 seconds). Setting this to 5000ms prevents cascading failures&#8212;your producer fails fast instead of hanging threads.</p><p><strong>Circuit breaker integration</strong>: When error rates exceed thresholds, open the circuit and stop accepting new logs. This prevents producer services from crashing and allows graceful recovery when Kafka returns.</p><p>Amazon&#8217;s CloudWatch log ingestion uses a three-tiered degradation strategy: (1) Under normal operation, send to Kafka with acks=1. (2) During Kafka brownout (high latency), switch to acks=0 for non-critical logs. (3) During Kafka blackout, write critical logs to local disk and replay when connectivity returns.</p><h1><strong>Implementation Guide: Kafka Log Producers &#8212; High-Throughput Log Ingestion</strong></h1><p>A practical guide to building a production-style log ingestion system with Apache Kafka and Spring Boot. You&#8217;ll implement multiple producer strategies (throughput vs. 
durability), a reactive gateway, and observability.</p><h2>GitHub Link:<br></h2><pre><code><a href="https://github.com/sysdr/sdc-java/tree/main/day39/kafka-log-producers">https://github.com/sysdr/sdc-java/tree/main/day39/kafka-log-producers</a></code></pre><h2><strong>Prerequisites</strong></h2><ul><li><p>JDK 17+</p></li><li><p>Docker and Docker Compose</p></li><li><p>Maven 3.8+</p></li><li><p>(Optional) Git to clone the repo</p></li></ul><div><hr></div><h2><strong>Architecture at a Glance</strong></h2><pre><code><code>Clients &#8594; Log Gateway (WebFlux) &#8594; Application Log Shipper &#8594; Kafka (application-logs)
                              &#8594; Transaction Log Shipper  &#8594; Kafka (transaction-logs)
                                                         &#8594; PostgreSQL (outbox)
         Infrastructure Shipper (scheduled)               &#8594; Kafka (infrastructure-metrics)
</code></code></pre><p><strong>Design choices:</strong></p><table><thead><tr><th>Concern</th><th>Application / Infra Shippers</th><th>Transaction Shipper</th></tr></thead><tbody><tr><td><strong>acks</strong></td><td><code>1</code> (leader ack, higher throughput)</td><td><code>all</code> (durability)</td></tr><tr><td><strong>Idempotence</strong></td><td>Off</td><td>On</td></tr><tr><td><strong>Outbox</strong></td><td>No</td><td>Yes (PostgreSQL)</td></tr><tr><td><strong>Rate limit</strong></td><td>50K/sec (token bucket)</td><td>Not applied</td></tr></tbody></table><div><hr></div><h2><strong>Step 1: Project Layout and Docker</strong></h2><p>Create a root folder (e.g. <code>kafka-log-producers</code>) with:</p><pre><code><code>kafka-log-producers/
&#9500;&#9472;&#9472; docker-compose.yml      # Kafka, Zookeeper, PostgreSQL, Prometheus, Grafana
&#9500;&#9472;&#9472; pom.xml                 # Parent POM with modules
&#9500;&#9472;&#9472; application-log-shipper/
&#9500;&#9472;&#9472; infrastructure-log-shipper/
&#9500;&#9472;&#9472; transaction-log-shipper/
&#9500;&#9472;&#9472; log-gateway/
&#9500;&#9472;&#9472; monitoring/
&#9474;   &#9500;&#9472;&#9472; prometheus.yml
&#9474;   &#9492;&#9472;&#9472; dashboards/
&#9492;&#9472;&#9472; setup.sh                # Start infra + create topics
</code></code></pre><p><strong>docker-compose.yml</strong> (minimal): include services for <strong>zookeeper</strong>, <strong>kafka</strong> (Confluent 7.5), <strong>postgres</strong> (port 5433 to avoid clashes), <strong>prometheus</strong> (9090), <strong>grafana</strong> (3000). Wire Kafka to Zookeeper and add healthchecks so app containers can <code>depends_on: kafka: condition: service_healthy</code>.</p><div><hr></div><h2><strong>Step 2: Application Log Shipper &#8212; High-Throughput Producer</strong></h2><p><strong>Goal:</strong> Ingest application logs over HTTP and publish to <code>application-logs</code> with batching and rate limiting.</p><p><strong>2.1 Dependencies (pom.xml)</strong><br>Spring Boot Starter Web, Spring Kafka, Lombok, Micrometer (Prometheus), Guava (RateLimiter).</p><p><strong>2.2 Producer configuration</strong></p><p>Use a <code>@Configuration</code> class that builds a <code>ProducerFactory&lt;String, LogEvent&gt;</code> and a <code>KafkaTemplate&lt;String, LogEvent&gt;</code>. Key settings:</p><ul><li><p><code>BOOTSTRAP_SERVERS_CONFIG</code> from <code>spring.kafka.bootstrap-servers</code></p></li><li><p><code>KEY_SERIALIZER</code>: <code>StringSerializer</code>, <code>VALUE_SERIALIZER</code>: <code>JsonSerializer</code> (for a <code>LogEvent</code> POJO)</p></li><li><p><strong>Throughput tuning:</strong> <code>ACKS_CONFIG = "1"</code>, <code>COMPRESSION_TYPE_CONFIG = "lz4"</code>, <code>BATCH_SIZE_CONFIG = 32768</code>, <code>LINGER_MS_CONFIG = 10</code>, <code>BUFFER_MEMORY_CONFIG = 67108864</code></p></li><li><p><strong>Reliability:</strong> <code>RETRIES_CONFIG = 3</code>, <code>DELIVERY_TIMEOUT_MS_CONFIG = 30000</code>, <code>REQUEST_TIMEOUT_MS_CONFIG = 15000</code></p></li></ul><p><strong>2.3 Log event model</strong></p><p>Simple POJO: <code>eventId</code>, <code>source</code>, <code>level</code>, <code>message</code>, <code>timestamp</code>, <code>serviceId</code>, <code>traceId</code>, <code>metadata</code> (Map). 
Use <code>Instant</code> for timestamp.</p><p><strong>2.4 Sending with rate limit and metrics</strong></p><p>In a service that uses <code>KafkaTemplate</code> and Micrometer <code>MeterRegistry</code>:</p><ul><li><p>Create a <strong>Guava RateLimiter</strong> at 50,000 permits per second.</p></li><li><p>For each send: if <code>!rateLimiter.tryAcquire(Duration.ofMillis(100))</code>, increment a <code>producer.throttled</code> counter and throw a custom <code>RateLimitException</code> (e.g. HTTP 429).</p></li><li><p>Start a <code>Timer.Sample</code>, then <code>kafkaTemplate.send("application-logs", event.getEventId(), event)</code>.</p></li><li><p>In the <code>whenComplete</code> callback: stop the timer (e.g. <code>kafka.producer.send.duration</code>), and increment either <code>kafka.producer.success</code> or <code>kafka.producer.error</code> counters.</p></li></ul><p><strong>2.5 REST API</strong></p><ul><li><p><code>POST /api/v1/logs/ingest</code>: accept a JSON body, map to <code>LogEvent</code>, generate <code>eventId</code> (e.g. UUID) and <code>timestamp</code> if missing, call the send service. Return 202 with <code>eventId</code> or 429 when rate limited.</p></li></ul><div><hr></div><h2><strong>Step 3: Transaction Log Shipper &#8212; Exactly-Once with Outbox</strong></h2><p><strong>Goal:</strong> Accept transaction events, store in a DB outbox, then publish to Kafka with an idempotent producer so you can replay from the outbox if needed.</p><p><strong>3.1 Dependencies</strong><br>Add Spring Data JPA, PostgreSQL driver, and Spring Kafka.</p><p><strong>3.2 Producer configuration</strong></p><p>Same structure as the application shipper, but:</p><ul><li><p><code>ACKS_CONFIG = "all"</code></p></li><li><p><code>ENABLE_IDEMPOTENCE_CONFIG = true</code></p></li><li><p>Keep compression and batching (e.g. 
same batch/linger/buffer as above).</p></li></ul><p><strong>3.3 Outbox entity (PostgreSQL)</strong></p><p>Table <code>transaction_outbox</code>: <code>id</code> (PK), <code>transactionId</code> (unique), <code>userId</code>, <code>amount</code>, <code>currency</code>, <code>status</code>, <code>createdAt</code>, <code>sentAt</code>. Use <code>@Entity</code> and <code>@Table(name = "transaction_outbox")</code>.</p><p><strong>3.4 Transaction event POJO</strong></p><p>Fields such as: <code>transactionId</code>, <code>userId</code>, <code>type</code>, <code>amount</code>, <code>currency</code>, <code>timestamp</code>.</p><p><strong>3.5 Outbox + send flow</strong></p><p>In a <code>@Transactional</code> method:</p><ol><li><p>Save a new row in <code>transaction_outbox</code> with <code>status = "PENDING"</code>.</p></li><li><p>Call <code>kafkaTemplate.send("transaction-logs", event.getTransactionId(), event)</code>.</p></li><li><p>In <code>whenComplete</code>: on success, set <code>status = "SENT"</code> and <code>sentAt = now()</code>, then save the entity; on failure, increment a <code>transactions.failed</code> counter (and optionally keep status for retry).</p></li></ol><p>Use <code>transactionId</code> as the Kafka key for ordering and idempotency.</p><p><strong>3.6 REST</strong></p><ul><li><p><code>POST /api/v1/transactions</code>: body with <code>userId</code>, <code>type</code>, <code>amount</code>, <code>currency</code>; build <code>TransactionEvent</code> with generated <code>transactionId</code> and timestamp; call the transactional send service; return 202 with <code>transactionId</code>.</p></li></ul><div><hr></div><h2><strong>Step 4: Log Gateway (Reactive)</strong></h2><p><strong>Goal:</strong> Single API for clients; gateway forwards to the appropriate shipper.</p><p><strong>4.1 Dependencies</strong><br>Spring Boot WebFlux, Spring WebFlux WebClient (no blocking WebMVC).</p><p><strong>4.2 WebClient</strong></p><p>Create a <code>WebClient</code> bean (e.g. 
<code>WebClient.builder().build()</code> or with base URL). Use service names as hostnames when running in Docker (e.g. <code>http://application-log-shipper:8081</code>, <code>http://transaction-log-shipper:8083</code>).</p><p><strong>4.3 Routes</strong></p><ul><li><p><code>POST /api/v1/logs</code> &#8594; <code>POST http://application-log-shipper:8081/api/v1/logs/ingest</code> (forward body).</p></li><li><p><code>POST /api/v1/transactions</code> &#8594; <code>POST http://transaction-log-shipper:8083/api/v1/transactions</code> (forward body).</p></li><li><p><code>GET /api/v1/health</code> &#8594; return <code>{"status":"UP"}</code>.</p></li></ul><p>Return <code>Mono&lt;Map&lt;String, Object&gt;&gt;</code> from the shipper responses so clients see the same shape (e.g. <code>eventId</code> or <code>transactionId</code>).</p><div><hr></div><h2><strong>Step 5: Infrastructure Log Shipper (Scheduled Metrics)</strong></h2><p><strong>Goal:</strong> Periodically generate fake metrics and publish to <code>infrastructure-metrics</code>.</p><p>Use the same producer pattern as the application shipper (acks=1, batching, LZ4). Add a <code>@Scheduled(fixedRate = 1000)</code> method that builds a small batch of events (e.g. CPU, memory, disk) with timestamps and sends them via <code>KafkaTemplate</code>. No HTTP API required; optional actuator for health.</p><div><hr></div><h2><strong>Step 6: Topics and Startup</strong></h2><p>In <strong>setup.sh</strong> (or equivalent):</p><ol><li><p><code>docker compose up -d</code>.</p></li><li><p>Wait for Kafka to be ready (e.g.
loop with <code>kafka-broker-api-versions --bootstrap-server localhost:9092</code>).</p></li><li><p>Create topics if not exists:</p><ul><li><p><code>application-logs</code> (3 partitions, replication 1)</p></li><li><p><code>infrastructure-metrics</code> (3 partitions)</p></li><li><p><code>transaction-logs</code> (3 partitions)</p></li></ul></li></ol><p>Expose a short summary: Gateway 8080, Application 8081, Infra 8082, Transaction 8083, Prometheus 9090, Grafana 3000.</p><div><hr></div><h2><strong>Step 7: Observability</strong></h2><ul><li><p><strong>Prometheus:</strong> Scrape actuator metrics from each Spring Boot app (e.g. <code>/actuator/prometheus</code>). Configure targets in <code>prometheus.yml</code>.</p></li><li><p><strong>Grafana:</strong> Add Prometheus as data source; import or create a dashboard for:</p><ul><li><p><code>rate(kafka_producer_success_total[1m])</code> (throughput)</p></li><li><p><code>rate(kafka_producer_error_total[1m])</code> (errors)</p></li><li><p><code>kafka_producer_send_duration_seconds</code> (latency)</p></li><li><p><code>producer_throttled_total</code> (rate limiting)</p></li></ul></li></ul><div><hr></div><h2><strong>Verification</strong></h2><ol><li><p><strong>Health:</strong> <code>curl http://localhost:8080/api/v1/health</code></p></li><li><p><strong>Log:</strong> <code>curl -X POST http://localhost:8080/api/v1/logs -H "Content-Type: application/json" -d '{"source":"test","level":"INFO","message":"Hello"}'</code> &#8594; expect 202 and an <code>eventId</code>.</p></li><li><p><strong>Transaction:</strong> <code>curl -X POST http://localhost:8080/api/v1/transactions -H "Content-Type: application/json" -d '{"userId":"u1","type":"PAYMENT","amount":99.99,"currency":"USD"}'</code> &#8594; expect 202 and <code>transactionId</code>.</p></li><li><p><strong>Kafka:</strong> <code>kafka-console-consumer --bootstrap-server localhost:9092 --topic application-logs --from-beginning --max-messages 5</code> (and similarly for 
<code>transaction-logs</code>, <code>infrastructure-metrics</code>).</p></li><li><p><strong>Rate limit:</strong> Send a burst of requests (e.g. script or load test); confirm 429s and <code>producer.throttled</code> in Prometheus.</p></li></ol><h2>Production Considerations</h2><h3>Performance Characteristics and Bottlenecks</h3><p>In our load tests, a single producer instance achieves 50,000+ events/sec with the following resource profile:</p><ul><li><p><strong>CPU</strong>: 2 cores at 60% utilization (mostly spent on JSON serialization and lz4 compression)</p></li><li><p><strong>Memory</strong>: 512MB heap (200MB for buffering, 300MB for Spring Boot overhead)</p></li><li><p><strong>Network</strong>: 7-10 MB/sec outbound (with compression), 1 MB/sec inbound (acknowledgments)</p></li></ul><p>The bottleneck shifts based on configuration. With acks=all, network latency becomes the constraint (limited by round-trip time to replicas). With acks=0, CPU becomes the constraint (serialization can&#8217;t keep up). With compression disabled, network bandwidth saturates first.</p><h3>Failure Scenarios and Recovery</h3><p><strong>Kafka Broker Failure</strong>: Producers automatically retry failed sends (up to 3 attempts) with exponential backoff. After retries exhaust, events are written to a dead letter queue for manual investigation. We alert when DLQ depth exceeds 1000 events.</p><p><strong>Producer Service Crash</strong>: Unsent messages in memory are lost. This is acceptable for logs but unacceptable for transactions&#8212;hence the transaction shipper&#8217;s use of transactional outbox pattern with PostgreSQL.</p><p><strong>Network Partition</strong>: Circuit breakers open after 10 consecutive failures, preventing memory exhaustion. Producers enter degraded mode, writing critical logs to local disk for later replay. 
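</p><p>The open-after-N-consecutive-failures behavior can be sketched as a tiny state machine (a simplification for illustration; production code would also track a half-open probe state and reset timeouts, as a library like Resilience4j does):</p>

```java
// Minimal circuit breaker: opens after a fixed number of consecutive
// failures and resets on any success. A sketch of the behavior described
// above, not a replacement for a real resilience library.
class CircuitBreaker {
    private final int threshold;
    private int consecutiveFailures = 0;

    CircuitBreaker(int threshold) { this.threshold = threshold; }

    boolean isOpen() { return consecutiveFailures >= threshold; }

    void recordSuccess() { consecutiveFailures = 0; }

    void recordFailure() { consecutiveFailures++; }
}
```

<p>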
We monitor circuit breaker state changes&#8212;sustained open state indicates infrastructure issues.</p><h3>Monitoring and Alerting Strategy</h3><p>We implement three-tier alerting:</p><ul><li><p><strong>P0 (Page immediately)</strong>: Producer error rate &gt; 5% for 5 minutes (indicates Kafka cluster issues)</p></li><li><p><strong>P1 (Alert during business hours)</strong>: Average latency &gt; 100ms (indicates broker saturation)</p></li><li><p><strong>P2 (Daily summary)</strong>: DLQ depth &gt; 100 events (indicates intermittent serialization errors)</p></li></ul><p>Dashboards show producer throughput, latency histograms, batch size distributions, and error rates across all shipper types. This gives operators real-time visibility into ingestion health.</p><h2>Scale Connection: Producer Patterns at FAANG</h2><p><strong>Netflix</strong>: Runs 100,000+ producer instances across their microservices, each configured with adaptive batching that adjusts linger.ms based on traffic patterns. Their producers include custom interceptors that sample 1% of logs for trace analysis while sending 100% to Kafka&#8212;achieving observability without overwhelming their trace backends.</p><p><strong>Uber</strong>: Implemented geographic producer routing&#8212;log events from EU riders go to EU Kafka clusters, reducing cross-datacenter latency from 150ms to 5ms. They use asynchronous sends with callback handlers that update per-datacenter success metrics, enabling rapid detection of regional Kafka issues.</p><p><strong>Airbnb</strong>: Uses priority-based producers where booking logs get acks=all and idempotence while search logs use acks=0. 
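</p><p>In plain Kafka property names, the two profiles might differ in only a handful of settings (a sketch using standard producer config keys; the shared base values are illustrative):</p>

```java
import java.util.Properties;

// Two producer profiles sharing a base config but trading durability
// for throughput, as in the priority-based setup described above.
class ProducerProfiles {
    static Properties base() {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");
        p.setProperty("compression.type", "lz4");
        return p;
    }

    // Critical path (e.g. bookings): wait for all in-sync replicas,
    // deduplicate broker-side on retry.
    static Properties critical() {
        Properties p = base();
        p.setProperty("acks", "all");
        p.setProperty("enable.idempotence", "true");
        return p;
    }

    // Best-effort path (e.g. search logs): no broker acknowledgment.
    static Properties bestEffort() {
        Properties p = base();
        p.setProperty("acks", "0");
        return p;
    }
}
```

<p>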
This heterogeneous configuration optimizes for both data criticality and throughput, running 40,000+ events/sec on shared infrastructure.</p><h2>Working Code Demo:</h2><div id="youtube2-2Lbo5lElSDE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;2Lbo5lElSDE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/2Lbo5lElSDE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://sdcourse.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Hands On System Design Course - Code Everyday  is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Day 38: Set Up a Kafka Cluster for Log Streaming]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-38-set-up-a-kafka-cluster-for</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-38-set-up-a-kafka-cluster-for</guid><dc:creator><![CDATA[sysdai]]></dc:creator><pubDate>Sat, 21 Feb 2026 08:30:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EQIR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today you&#8217;ll deploy a production-grade distributed log streaming platform that forms the backbone of modern event-driven architectures. 
By the end of this lesson, you&#8217;ll have:</p><ul><li><p><strong>Multi-broker Kafka cluster</strong> with 3 nodes configured for high availability and fault tolerance</p></li><li><p><strong>Partitioned topic architecture</strong> supporting parallel log processing at 50,000+ events per second</p></li><li><p><strong>Comprehensive monitoring stack</strong> with real-time metrics for throughput, latency, and consumer lag</p></li><li><p><strong>Automated health checking system</strong> that validates cluster state and triggers alerts on degradation</p></li><li><p><strong>Load testing framework</strong> simulating production traffic patterns with configurable throughput profiles</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EQIR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EQIR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 424w, https://substackcdn.com/image/fetch/$s_!EQIR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 848w, https://substackcdn.com/image/fetch/$s_!EQIR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EQIR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EQIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EQIR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 424w, https://substackcdn.com/image/fetch/$s_!EQIR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 848w, https://substackcdn.com/image/fetch/$s_!EQIR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EQIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3df0680-1c74-4e5c-998b-5d86fd477aeb_7000x5000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2></h2>
      <p>
          <a href="https://sdcourse.substack.com/p/day-38-set-up-a-kafka-cluster-for">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 37: Priority Queues for Critical Log Messages]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-37-priority-queues-for-critical-827</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-37-priority-queues-for-critical-827</guid><dc:creator><![CDATA[sysdai]]></dc:creator><pubDate>Tue, 17 Feb 2026 08:30:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!15wQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>A production-grade priority-based log processing system that ensures critical messages bypass normal processing queues:</p><ul><li><p><strong>Multi-tier message routing</strong> with 4 priority levels (CRITICAL, HIGH, NORMAL, LOW)</p></li><li><p><strong>Dedicated consumer pools</strong> for high-priority logs with 10x faster processing SLAs</p></li><li><p><strong>Priority escalation engine</strong> that auto-promotes aged messages to prevent starvation</p></li><li><p><strong>Comprehensive monitoring</strong> tracking queue depths, processing latency, and priority distribution</p></li></ul><h2>Why This Matters: The 3AM Wake-Up Call Problem</h2><blockquote><p>When payment processing fails at Stripe, security breaches occur at AWS, or fraud detection triggers at PayPal, these critical events can&#8217;t wait behind millions of routine info logs. A single delayed security alert can mean the difference between detecting a breach in minutes versus hours.</p><p>In 2021, a major cloud provider experienced a 4-hour outage partly because critical infrastructure alerts were buried in normal log queues. 
Their monitoring system generated 500,000 events per second, but the 12 critical alerts indicating cascading failures were delayed by 18+ minutes in standard FIFO processing. Priority queues solve this by guaranteeing sub-second processing for critical events regardless of overall system load.</p><p>At scale, this pattern becomes essential: Uber processes 50,000 fraud detection events per second with &lt;100ms SLAs while normal trip logs can tolerate 5-second latencies. Netflix routes critical streaming failures through dedicated high-priority paths while routine engagement logs use standard queues. The architectural challenge is implementing priority without creating starvation, head-of-line blocking, or resource exhaustion from priority escalation.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!15wQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!15wQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!15wQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!15wQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 1272w, 
https://substackcdn.com/image/fetch/$s_!15wQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!15wQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!15wQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 424w, https://substackcdn.com/image/fetch/$s_!15wQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 848w, https://substackcdn.com/image/fetch/$s_!15wQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 1272w, 
https://substackcdn.com/image/fetch/$s_!15wQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28190c69-4424-42c1-b0b5-e70194f3236a_7000x4500.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2></h2>
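<p>In-process, the four-level ordering can be sketched with a <code>PriorityBlockingQueue</code> (a single-queue simplification for illustration; the lesson's multi-tier routing uses separate topics and dedicated consumer pools):</p>

```java
import java.util.concurrent.PriorityBlockingQueue;

// Sketch: a priority-ordered log queue. CRITICAL drains before LOW
// regardless of arrival order. The enum ordinal doubles as priority rank.
class PriorityLogDemo {
    enum Priority { CRITICAL, HIGH, NORMAL, LOW }

    record LogMessage(Priority priority, String text) implements Comparable<LogMessage> {
        public int compareTo(LogMessage other) {
            return Integer.compare(priority.ordinal(), other.priority.ordinal());
        }
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<LogMessage> queue = new PriorityBlockingQueue<>();
        queue.add(new LogMessage(Priority.LOW, "routine trip log"));
        queue.add(new LogMessage(Priority.CRITICAL, "fraud alert"));
        queue.add(new LogMessage(Priority.NORMAL, "user signed in"));
        System.out.println(queue.poll().text()); // prints "fraud alert"
    }
}
```

<p>Note that a single in-memory priority queue still suffers head-of-line blocking under load, which is exactly why the lesson pairs priorities with dedicated consumer pools.</p>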
      <p>
          <a href="https://sdcourse.substack.com/p/day-37-priority-queues-for-critical-827">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Day 36: Dead Letter Queues for Failed Log Processing]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-36-dead-letter-queues-for-failed</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-36-dead-letter-queues-for-failed</guid><dc:creator><![CDATA[sdr]]></dc:creator><pubDate>Fri, 13 Feb 2026 08:54:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gqva!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><ul><li><p><strong>Dead Letter Queue infrastructure</strong> for capturing and analyzing failed log processing attempts with automatic retry mechanisms</p></li><li><p><strong>Poison message detection system</strong> that prevents infinite retry loops and isolates problematic events</p></li><li><p><strong>DLQ monitoring dashboard</strong> with Grafana visualizations showing failure patterns, retry metrics, and message inspection capabilities</p></li><li><p><strong>Reprocessing pipeline</strong> enabling manual intervention and automated recovery from transient failures</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gqva!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Gqva!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 424w, https://substackcdn.com/image/fetch/$s_!Gqva!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 848w, https://substackcdn.com/image/fetch/$s_!Gqva!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 1272w, https://substackcdn.com/image/fetch/$s_!Gqva!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gqva!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png" width="638" height="350.9876373626374" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:801,&quot;width&quot;:1456,&quot;resizeWidth&quot;:638,&quot;bytes&quot;:310257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/183317537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gqva!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 424w, https://substackcdn.com/image/fetch/$s_!Gqva!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 848w, https://substackcdn.com/image/fetch/$s_!Gqva!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 1272w, https://substackcdn.com/image/fetch/$s_!Gqva!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1cde21-4378-4eef-940e-da26d54e2664_3200x1760.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why This Matters: The Hidden Cost of Message Failures</h2><blockquote><p>When Uber processes 15 billion location updates daily, even a 0.01% failure rate means 1.5 million lost events. Without dead letter queues, these failures cascade: poison messages block consumer threads, retries amplify database load, and critical data vanishes silently. Netflix discovered this during a payment processing incident where failed transactions entered retry loops, creating 50x database load that took down their entire billing system for 3 hours.</p><p>Dead letter queues solve three production-critical problems: they prevent poison messages from blocking healthy traffic, preserve failed events for forensic analysis, and enable controlled reprocessing without impacting live systems. Amazon&#8217;s order processing uses DLQs to quarantine corrupted events that would otherwise trigger cascading failures across inventory, payment, and shipping services.</p></blockquote><h2>System Design Deep Dive</h2><h3>Pattern 1: Dead Letter Exchange with Message TTL</h3><p>The foundation of DLQ architecture combines message expiration with automatic rerouting. When a consumer fails to process a message after exhausting retries, Kafka routes it to a dedicated dead letter topic rather than discarding it. This pattern preserves message ordering guarantees while preventing head-of-line blocking.</p><p><strong>Trade-off Analysis</strong>: Immediate DLQ routing reduces consumer latency but may discard recoverable failures. Delayed routing with exponential backoff (1s, 2s, 4s, 8s) handles transient errors but increases system complexity. 
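</p><p>The delayed-routing schedule above (1s, 2s, 4s, 8s) is plain exponential backoff, which can be computed as:</p>

```java
import java.time.Duration;

// Exponential backoff for 0-based retry attempts: 1s, 2s, 4s, 8s, ...
// capped so a long-failing message cannot wait unboundedly before DLQ routing.
class Backoff {
    static Duration delayFor(int attempt, Duration cap) {
        long seconds = 1L << Math.min(attempt, 30); // doubles each retry; guard against shift overflow
        Duration d = Duration.ofSeconds(seconds);
        return d.compareTo(cap) > 0 ? cap : d;
    }
}
```

<p>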
Production systems typically use hybrid approaches: fast-fail for validation errors, delayed retry for network timeouts.</p><p><strong>Scalability Bottleneck</strong>: DLQs can become write-heavy during cascading failures. A database outage causing 10,000 msg/sec to fail creates a DLQ spike that overwhelms monitoring systems. Solution: Rate-limit DLQ writes and aggregate failure metrics.</p><h3>Pattern 2: Poison Message Detection</h3><p>Poison messages&#8212;events that consistently fail processing despite retries&#8212;create infinite loops that waste resources. Detection requires tracking per-message failure counts and identifying retry patterns. The pattern: attach metadata (attempt_count, first_failure_timestamp, error_signature) to each message, increment counters on failure, and route to DLQ after threshold breach.</p><p><strong>Critical Insight</strong>: Hash-based error signatures prevent different exceptions from counting toward the same threshold. A message failing with &#8220;DatabaseTimeout&#8221; then &#8220;ValidationError&#8221; represents two distinct failure modes requiring separate analysis. 
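</p><p>The metadata-driven detection described above can be sketched in a few lines of Java. The class name, threshold, and signature scheme below are illustrative assumptions, not the course implementation:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per-signature failure counting for poison-message detection.
// Keying the counter on (message, exception type) keeps DatabaseTimeout and
// ValidationError failures on separate thresholds, as described in the text.
public class PoisonDetector {
    private static final int DLQ_THRESHOLD = 3; // assumed threshold
    private final Map<String, Integer> failuresBySignature = new HashMap<>();

    static String signature(String messageKey, String exceptionType) {
        return messageKey + "|" + exceptionType;
    }

    /** Records one failure; returns true once this failure mode should go to the DLQ. */
    public boolean recordFailure(String messageKey, String exceptionType) {
        int count = failuresBySignature.merge(signature(messageKey, exceptionType), 1, Integer::sum);
        return count >= DLQ_THRESHOLD;
    }

    public static void main(String[] args) {
        PoisonDetector detector = new PoisonDetector();
        detector.recordFailure("msg-1", "DatabaseTimeout");
        System.out.println(detector.recordFailure("msg-1", "ValidationError")); // separate counter
    }
}
```
<p>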
Netflix uses error signature clustering to identify systemic issues vs. data-specific problems.</p><p><strong>Anti-pattern</strong>: Global retry counters that reset on consumer restart. This creates a vulnerability where poison messages repeatedly block processing after each deployment. Solution: Store retry state in persistent headers, not consumer memory.</p><h3>Pattern 3: Graduated Retry Strategy</h3><p>Not all failures are equal. Network timeouts justify aggressive retries; schema validation errors don&#8217;t. Graduated retry implements failure classification: transient errors (network, rate limits) get exponential backoff up to 10 attempts, permanent errors (validation, authorization) go straight to the DLQ, and ambiguous errors (generic exceptions) get limited retries.</p><p><strong>Implementation Complexity</strong>: Distinguishing transient from permanent failures requires an exception taxonomy. Spring Retry&#8217;s <code>@Retryable</code> annotation supports this but needs careful configuration. The challenge: third-party library exceptions that don&#8217;t clearly indicate failure type.</p><p><strong>Production Example</strong>: Uber&#8217;s trip processing retries location errors (GPS drift, network issues) but immediately DLQs invalid passenger IDs. This prevents wasting resources on unrecoverable errors while maximizing the success rate for transient issues.</p><h3>Pattern 4: DLQ Monitoring and Alerting</h3><p>Effective DLQ systems require real-time visibility into failure patterns. Key metrics: DLQ ingestion rate (msgs/sec), failure type distribution (validation vs. timeout vs. business logic), message age in DLQ, and reprocessing success rate. Alerts trigger when: DLQ rate exceeds 5% of main topic volume, any single error type dominates (&gt;50% of failures), or messages remain in DLQ beyond retention policy.</p><p><strong>Observability Challenge</strong>: Correlating DLQ messages back to original requests for distributed tracing. 
Solution: Preserve correlation IDs through the entire retry chain, enabling reconstruction of the complete processing timeline.</p><p><strong>Alert Fatigue Prevention</strong>: Aggregate similar failures into single alerts rather than individual message notifications. A schema change causing 1000 validation failures should generate one &#8220;schema mismatch&#8221; alert, not 1000.</p><h3>Pattern 5: Controlled Reprocessing Pipeline</h3><p>Dead letter queues aren&#8217;t graveyards&#8212;they&#8217;re holding areas for recovery. Reprocessing requires: manual inspection tools for root cause analysis, fix-and-replay mechanisms for code bugs, and automated recovery for transient failures that have since resolved. The pattern uses a separate reprocessing consumer that reads from the DLQ at controlled rates, applies fixes/transformations, and publishes to the original topic.</p><p><strong>Consistency Consideration</strong>: Reprocessed messages may arrive out-of-order relative to newer events. For log processing, this is acceptable (logs are idempotent). For financial transactions, it&#8217;s catastrophic. Solution: Timestamp-based deduplication and order verification before committing reprocessed events.</p><p><strong>Capacity Planning</strong>: Reprocessing after major incidents can create a thundering herd. If 100,000 messages accumulated during a database outage, naive replay at full speed overwhelms the now-healthy database. Solution: Rate-limited replay with circuit breakers.</p><h2>Github Link:</h2><pre><code><strong><a href="https://github.com/sysdr/sdc-java/tree/main/day36/dlq-log-system">https://github.com/sysdr/sdc-java/tree/main/day36/dlq-log-system</a></strong></code></pre><h2>Implementation Walkthrough</h2><h3>Core Architecture</h3><p>Our implementation creates three Kafka topics: <code>log-events</code> (main processing), <code>log-events-retry</code> (temporary retry holding), and <code>log-events-dlq</code> (permanent failures). 
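</p><p>The three-topic topology implies a small per-failure routing decision. A hedged sketch, with topic names taken from the article but the method and constants assumed:</p>

```java
// Sketch: route a failed record to the retry topic until attempts are
// exhausted, then to the DLQ. Topic names follow the article's topology.
public class FailureRouter {
    static final String RETRY_TOPIC = "log-events-retry";
    static final String DLQ_TOPIC = "log-events-dlq";
    static final int MAX_ATTEMPTS = 3; // matches the "after 3 attempts" default

    /** attemptsSoFar counts the attempt that just failed. */
    public static String nextTopic(int attemptsSoFar) {
        return attemptsSoFar < MAX_ATTEMPTS ? RETRY_TOPIC : DLQ_TOPIC;
    }

    public static void main(String[] args) {
        System.out.println(nextTopic(1) + " / " + nextTopic(3));
    }
}
```

<p>In Spring Kafka this kind of routing is typically wired through a <code>DefaultErrorHandler</code> with a <code>DeadLetterPublishingRecoverer</code> rather than hand-rolled, but the decision it encodes is the same.</p><p>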
The consumer uses Spring Kafka&#8217;s error handler with custom retry logic that routes messages through this topology.</p><p><strong>Step 1: Enhanced Consumer with Retry Semantics</strong></p><p>The <code>LogConsumerService</code> adds retry metadata to message headers before reprocessing. Each failure increments <code>retry-count</code>, records <code>error-type</code>, and timestamps the failure. After 3 attempts (configurable), messages route to the DLQ with full diagnostic context. This design ensures forensic data survives through the failure chain.</p><p><strong>Architectural Decision</strong>: We use Kafka headers for retry state rather than external storage (Redis/database) because headers travel with the message, preventing state synchronization issues during consumer scaling or rebalancing.</p><p><strong>Step 2: DLQ Producer with Classification</strong></p><p>The <code>DeadLetterQueueService</code> receives failed messages and classifies them by error type: VALIDATION, TIMEOUT, PROCESSING, UNKNOWN. Classification drives alerting and reprocessing strategies. Validation errors likely need code fixes, timeouts might self-resolve, and processing errors require case-by-case analysis.</p><p><strong>Implementation Detail</strong>: Error classification uses pattern matching on exception stack traces. This is brittle but pragmatic&#8212;proper exception hierarchies across microservices are ideal but rarely achievable in practice.</p><p><strong>Step 3: Monitoring Integration</strong></p><p>Micrometer counters track DLQ metrics by error type: <code>dlq.messages.total{type=VALIDATION}</code>, <code>dlq.messages.total{type=TIMEOUT}</code>. Prometheus scrapes these and Grafana dashboards visualize failure patterns. 
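</p><p>The Step 2 classification can be sketched as simple type-and-message matching; this is a simplification of the stack-trace pattern matching described above, and every name here is illustrative:</p>

```java
// Sketch: map a failure to the article's error categories. A real classifier
// would inspect stack traces; matching on type and message keeps the idea visible.
public class ErrorClassifier {
    public enum ErrorType { VALIDATION, TIMEOUT, PROCESSING, UNKNOWN }

    public static ErrorType classify(Throwable t) {
        String msg = String.valueOf(t.getMessage()).toLowerCase();
        if (t instanceof IllegalArgumentException || msg.contains("validation")) {
            return ErrorType.VALIDATION;
        }
        if (t instanceof java.util.concurrent.TimeoutException || msg.contains("timeout")) {
            return ErrorType.TIMEOUT;
        }
        if (t instanceof RuntimeException) {
            return ErrorType.PROCESSING;
        }
        return ErrorType.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(new IllegalArgumentException("bad payload")));
    }
}
```
<p>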
We also emit custom metrics for average time-to-DLQ (how quickly failures are detected) and DLQ processing lag (how far behind the reprocessing consumer is).</p><p><strong>Step 4: Reprocessing API</strong></p><p>The API Gateway exposes endpoints for DLQ operations: list failed messages with filters, inspect individual message details including full payload and error history, manually trigger reprocessing with optional transformations, and bulk replay with rate limiting. This transforms the DLQ from a black hole into an operational tool.</p><h3>Testing Failure Scenarios</h3><p>The integration test suite simulates production failure modes: poison messages with invalid JSON, database timeouts during peak load, network partitions during message processing, and schema evolution mismatches. Each test verifies that messages reach the DLQ with correct classification and that retries respect exponential backoff timings.</p><p><strong>Load Test Design</strong>: We inject a 1% failure rate into a 10,000 msg/sec load to validate that DLQ handling doesn&#8217;t impact healthy message throughput. Success criteria: main topic latency remains under 100ms p99 despite DLQ activity.</p><h2>Working Code Demo:</h2><div id="youtube2-Pr8w-EI-huw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Pr8w-EI-huw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Pr8w-EI-huw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Production Considerations</h2><p><strong>Performance Profile</strong>: DLQ operations add 2-5ms latency per failed message (header writes, topic routing). At a 0.1% failure rate (100 failures/sec in a 100K msg/sec system), DLQ overhead is negligible. 
But during cascading failures, DLQ writes can spike to 50K msg/sec, requiring dedicated partitions and consumer groups.</p><p><strong>Monitoring Strategy</strong>: Alert on DLQ rate &gt; 5% of main topic volume (indicates systemic issue), messages in DLQ older than 24 hours (retention violation), reprocessing success rate &lt; 80% (fix-and-replay not working), and DLQ consumer lag growing (reprocessing falling behind).</p><p><strong>Failure Mode</strong>: The DLQ itself can fail. When the DLQ topic is unavailable, consumers must decide: drop messages (data loss), block processing (availability loss), or buffer locally (memory exhaustion). Our implementation uses local disk buffering with size limits, preferring temporary degradation over silent data loss.</p><p><strong>Capacity Planning</strong>: Size the DLQ topic for 10x the average failure rate to handle spike scenarios. If the normal failure rate is 100 msg/sec, provision the DLQ for 1000 msg/sec sustained. DLQ retention should match investigation SLAs&#8212;24 hours minimum, 7 days recommended.</p><h2>Connection to Scale: FAANG DLQ Patterns</h2><p>Netflix&#8217;s payment processing uses multi-tier DLQs: an immediate DLQ for validation errors, a 3-hour delayed queue for transient failures, and a 24-hour manual review queue for business logic errors. This graduated approach maximizes automatic recovery while minimizing human intervention. Their DLQ dashboards show real-time failure taxonomy, enabling rapid incident response.</p><p>Amazon&#8217;s order processing DLQs preserve idempotency keys and business context, allowing customer service to manually retry failed orders without duplicate charges. 
During Black Friday, their DLQ systems capture millions of failures without impacting checkout latency, then automatically reprocess overnight as infrastructure stabilizes.</p><h2>Next Steps</h2><p>Tomorrow we implement priority queues, enabling critical logs to bypass normal processing delays and reach consumers within milliseconds.</p>]]></content:encoded></item><item><title><![CDATA[Day 35: Topic-Based Routing - Building Multi-Pipeline Log Processing Systems ]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-35-topic-based-routing-building</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-35-topic-based-routing-building</guid><dc:creator><![CDATA[sdr]]></dc:creator><pubDate>Mon, 09 Feb 2026 08:30:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ek95!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today we 
implement intelligent routing mechanisms that direct different log types to specialized processing pipelines:</p><ul><li><p><strong>Content-based routing engine</strong> that inspects log attributes and routes to appropriate Kafka topics</p></li><li><p><strong>Multiple specialized consumer pipelines</strong> (security, performance, application, system logs)</p></li><li><p><strong>Dynamic routing rules</strong> supporting regex patterns and severity-based filtering</p></li><li><p><strong>Fanout patterns</strong> for logs requiring multiple processing paths simultaneously</p></li></ul><h2>Why This Matters: The Routing Challenge at Scale</h2><blockquote><p>When Uber processes 100 billion log events daily, they don&#8217;t send every log through the same pipeline. Security incidents need immediate alerting within milliseconds, performance metrics aggregate into time-series databases, application errors route to incident tracking systems, and audit logs archive to long-term storage. Each pipeline has different latency requirements, storage patterns, and processing logic.</p><p>Without intelligent routing, you face two critical problems: resource waste (processing irrelevant logs consumes compute unnecessarily) and latency inflation (high-priority security events queue behind low-priority debug logs). Netflix learned this lesson during a critical security incident when P0 alerts drowned in millions of debug logs, delaying detection by 18 minutes. Their solution? 
Topic-based routing that isolated security logs into dedicated high-priority pipelines.</p><p>The architectural challenge isn&#8217;t just filtering - it&#8217;s building routing logic that scales to millions of events per second while maintaining low latency, supports dynamic rule updates without deployment, and handles the fanout complexity when single events need multiple processing paths.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ek95!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ek95!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 424w, https://substackcdn.com/image/fetch/$s_!Ek95!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 848w, https://substackcdn.com/image/fetch/$s_!Ek95!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!Ek95!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ek95!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png" width="1456" height="786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:626992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/183213569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ek95!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 424w, https://substackcdn.com/image/fetch/$s_!Ek95!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 848w, https://substackcdn.com/image/fetch/$s_!Ek95!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!Ek95!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6f142f-2dad-4d5a-9b56-f2b3e4b4b849_4000x2160.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>System Design Deep Dive</h2><h3>Pattern 1: Content-Based Routing with Topic Segmentation</h3><p>Traditional log processing systems use a single queue, creating head-of-line blocking where high-priority logs wait behind low-priority ones. Topic-based routing solves this by inspecting message content and directing to specialized topics.</p><p><strong>Architecture Decision</strong>: Use Kafka topic segmentation rather than application-level filtering. 
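</p><p>A minimal content-based router over those attributes might look like the sketch below; the <code>LogEvent</code> shape and the matching rules are assumptions for illustration, with topic names from the article:</p>

```java
// Sketch: inspect log attributes and pick a destination topic.
// The catch-all default topic prevents silently dropped logs.
public class TopicRouter {
    record LogEvent(String source, String severity, String type) {}

    public static String route(LogEvent e) {
        if ("security".equals(e.type()) || "auth-service".equals(e.source())) return "logs-security";
        if ("metric".equals(e.type())) return "logs-performance";
        if ("application".equals(e.type())) return "logs-application";
        if ("system".equals(e.type())) return "logs-system";
        return "logs-default"; // monitored catch-all
    }

    public static void main(String[] args) {
        System.out.println(route(new LogEvent("auth-service", "ERROR", "application")));
    }
}
```
<p>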
When a log arrives, the routing service examines attributes (severity, source, type) and publishes to specific topics: <code>logs-security</code>, <code>logs-performance</code>, <code>logs-application</code>, <code>logs-system</code>. Each topic has independent consumer groups with different processing characteristics.</p><p><strong>Trade-off Analysis</strong>: Content inspection adds 2-3ms latency at the router. However, specialized pipelines reduce downstream processing time by 10-100x by eliminating irrelevant log filtering at each consumer. For high-throughput systems (&gt;10K events/sec), the routing overhead is negligible compared to gains from targeted processing.</p><p><strong>Failure Mode</strong>: Routing logic bugs can silently drop critical logs. Mitigation: implement a catch-all default topic and monitor routing decision metrics. If 90% of logs route to default, your routing rules are failing.</p><h3>Pattern 2: Dynamic Routing Rules Engine</h3><p>Hard-coded routing logic requires deployment for rule changes. Production systems need runtime rule updates for emergency response (route all logs from compromised service to security pipeline immediately).</p><p><strong>Implementation</strong>: Store routing rules in Redis with pattern matching using regex and composite conditions. Router loads rules at startup and subscribes to Redis pub/sub for updates. Each rule defines: match pattern (source service, severity level, contains keyword), destination topic, priority (when multiple rules match).</p><pre><code><code>Rule Example:
- Pattern: severity=ERROR AND service=payment-api
- Destination: logs-critical-business
- Priority: 1 (highest)
</code></code></pre><p><strong>Scalability Consideration</strong>: Pattern matching is CPU-intensive. At 50K events/sec, regex evaluation can become the bottleneck. Solution: compile regex patterns once at rule load, use simple string comparisons for common cases (severity levels), and implement rule caching for repeated patterns.</p><h3>Pattern 3: Multi-Destination Fanout</h3><p>Some logs need multiple processing paths simultaneously. Security logs might need real-time alerting AND compliance archival AND audit trail storage. Single-destination routing creates duplication complexity.</p><p><strong>Kafka Approach</strong>: Publish to multiple topics atomically. The routing service evaluates all rules, collects matching destinations, and sends the log to each topic in a single transaction. Kafka&#8217;s producer batching optimizes multi-topic writes.</p><p><strong>Critical Implementation Detail</strong>: Use Kafka transactions to ensure all-or-nothing delivery. If routing to 3 topics, either all 3 succeed or none do. Partial failures create data inconsistency across pipelines.</p><p><strong>CAP Theorem Implication</strong>: Fanout increases write latency (waiting for multiple topic acks). For 3-way fanout with min.insync.replicas=2, you need 6 broker acknowledgments. This favors consistency over latency. For latency-critical paths, consider async fanout with best-effort delivery to secondary topics.</p><h3>Pattern 4: Priority-Based Topic Allocation</h3><p>Not all logs are equal. Security incidents need immediate processing; debug logs can tolerate delays. 
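</p><p>Assuming a severity field drives tier selection, the mapping can be sketched as a lookup table; which severity lands in which tier is an illustrative assumption:</p>

```java
import java.util.Map;

// Sketch: severity-to-priority-topic lookup. Tier topic names match the
// article's resource mapping; the severity assignments are assumptions.
public class PriorityAllocator {
    private static final Map<String, String> TOPIC_BY_SEVERITY = Map.of(
            "FATAL", "logs-critical",
            "ERROR", "logs-high",
            "WARN", "logs-medium",
            "INFO", "logs-low",
            "DEBUG", "logs-low");

    public static String topicFor(String severity) {
        return TOPIC_BY_SEVERITY.getOrDefault(severity, "logs-low");
    }

    public static void main(String[] args) {
        System.out.println(topicFor("FATAL") + " / " + topicFor("DEBUG"));
    }
}
```
<p>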
Separate topics enable independent consumer scaling and resource allocation.</p><p><strong>Resource Mapping</strong>:</p><ul><li><p><code>logs-critical</code>: 16 partitions, 8 consumer instances, dedicated CPU/memory</p></li><li><p><code>logs-high</code>: 12 partitions, 4 consumer instances</p></li><li><p><code>logs-medium</code>: 8 partitions, 2 consumer instances</p></li><li><p><code>logs-low</code>: 4 partitions, 1 consumer instance</p></li></ul><p><strong>Auto-Scaling Strategy</strong>: Monitor consumer lag per topic. When <code>logs-critical</code> lag exceeds 100 messages, scale consumers within 30 seconds. Low-priority topics can tolerate minutes of lag before scaling.</p><p><strong>Back-Pressure Handling</strong>: When downstream processing can&#8217;t keep up, routing logic can implement adaptive throttling - temporarily route low-priority logs to batch processing while maintaining real-time flow for critical logs.</p><h3>Pattern 5: Routing Metrics and Observability</h3><p>Routing decisions are invisible in single-queue systems. Topic-based routing enables granular observability: how many logs per source service, severity distribution per topic, routing rule hit rates, and misrouted log detection.</p><p><strong>Key Metrics</strong>:</p><ul><li><p>Routing decision latency (p50, p99, p999)</p></li><li><p>Logs per topic per second (detect anomalies)</p></li><li><p>Rule match rate (identify unused rules)</p></li><li><p>Default topic rate (detect routing failures)</p></li></ul><p><strong>Alerting Strategy</strong>: If &gt;5% of logs route to default topic, routing logic is failing. If critical topic receives &gt;10x normal rate, investigate potential security incident or service failure. 
If routing latency p99 exceeds 10ms, the router is becoming a bottleneck.</p><h2>Github Link:</h2><pre><code><strong><a href="https://github.com/sysdr/sdc-java/tree/main/day35/log-routing-system">https://github.com/sysdr/sdc-java/tree/main/day35/log-routing-system</a></strong></code></pre><h2>Implementation Walkthrough</h2><h3>Routing Service Architecture</h3><p>The routing service sits between log producers and Kafka topics. It receives logs via a REST API, evaluates routing rules, and publishes to the appropriate topics.</p><p><strong>Core Components</strong>:</p><ol><li><p><strong>REST Controller</strong>: Accepts log events, validates format, returns 202 Accepted immediately</p></li><li><p><strong>Routing Engine</strong>: Evaluates rules from Redis cache, determines destination topics</p></li><li><p><strong>Kafka Producer</strong>: Publishes to multiple topics with transaction support</p></li><li><p><strong>Rule Manager</strong>: Loads rules at startup, subscribes to Redis updates, recompiles patterns</p></li></ol><p><strong>Implementation Flow</strong>:</p><pre><code><code>1. Log arrives at POST /api/logs
2. Validate JSON structure (reject invalid immediately)
3. Evaluate routing rules in priority order
4. Collect all matching destinations (fanout)
5. Begin Kafka transaction
6. Publish to each destination topic
7. Commit transaction (all-or-nothing)
8. Return 202 to client
9. Record routing metrics
</code></code></pre><h3>Consumer Pipeline Specialization</h3><p>Each topic has dedicated consumers optimized for their log type:</p><p><strong>Security Pipeline</strong> (<code>logs-security</code>):</p><ul><li><p>Real-time processing (no batching)</p></li><li><p>Immediate alerting to PagerDuty</p></li><li><p>Enrichment with threat intelligence</p></li><li><p>Storage in security SIEM</p></li></ul><p><strong>Performance Pipeline</strong> (<code>logs-performance</code>):</p><ul><li><p>Batched aggregation (10-second windows)</p></li><li><p>Time-series database writes</p></li><li><p>Percentile calculation</p></li><li><p>Grafana dashboard updates</p></li></ul><p><strong>Application Pipeline</strong> (<code>logs-application</code>):</p><ul><li><p>Error tracking integration</p></li><li><p>Stack trace analysis</p></li><li><p>User session correlation</p></li><li><p>Ticket creation for errors</p></li></ul><h3>Configuration-Driven Routing Rules</h3><p>Rules defined in YAML, loaded to Redis:</p><pre><code><code>rules:
  - name: security-critical
    priority: 1
    conditions:
      severity: [ERROR, FATAL]
      source: [auth-service, payment-api]
    destinations: [logs-security, logs-critical]
  
  - name: performance-metrics
    priority: 2
    conditions:
      type: metric
      metric_name: "response_time_*"
    destinations: [logs-performance]
</code></code></pre><p>The rule manager compiles these into efficient matching logic, caching compiled patterns for reuse.</p><h2>Working Code Demo:</h2><div id="youtube2-h-OZa550Goc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;h-OZa550Goc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/h-OZa550Goc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Production Considerations</h2><p><strong>Performance Characteristics</strong>: Routing adds 2-3ms p50 latency, 5-8ms p99. For 50K events/sec, the router needs 4-8 cores with rule caching. Kafka producer batching reduces per-message overhead to &lt;100&#956;s.</p><p><strong>Failure Scenarios</strong>:</p><ul><li><p><strong>Rule compilation failure</strong>: Fall back to default routing, alert the ops team</p></li><li><p><strong>Kafka topic unavailable</strong>: Queue in Redis with TTL, retry with backoff</p></li><li><p><strong>Transaction timeout</strong>: Reduce fanout destinations, implement async fallback</p></li><li><p><strong>Redis connection loss</strong>: Use last-known-good rule cache, alert for manual intervention</p></li></ul><p><strong>Monitoring Requirements</strong>:</p><ul><li><p>Track routing decision latency (should be &lt;5ms p99)</p></li><li><p>Monitor logs per topic per second (detect anomalies)</p></li><li><p>Alert on default topic rate spikes (routing failures)</p></li><li><p>Track transaction abort rate (downstream capacity issues)</p></li><li><p>Consumer lag per topic (identify processing bottlenecks)</p></li></ul><p><strong>Capacity Planning</strong>: Each router instance handles 10-15K events/sec. For 100K events/sec, deploy 8-10 router instances behind a load balancer. 
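</p><p>That sizing is plain ceiling division; a tiny helper (names are illustrative) makes the arithmetic explicit:</p>

```java
// Sketch: instances needed = ceil(target throughput / per-instance capacity).
public class CapacityPlan {
    public static int routersNeeded(int targetEventsPerSec, int perInstanceCapacity) {
        // Integer ceiling division avoids floating point.
        return (targetEventsPerSec + perInstanceCapacity - 1) / perInstanceCapacity;
    }

    public static void main(String[] args) {
        // 100K events/sec at 10-12.5K per instance spans the 8-10 instance range.
        System.out.println(routersNeeded(100_000, 12_500) + " to " + routersNeeded(100_000, 10_000));
    }
}
```
<p>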
Each Kafka topic needs partitions equal to max consumer count for horizontal scaling.</p><h2>Scale Connection: Enterprise Routing at FAANG</h2><p>Netflix routes 500K events/sec across 50+ specialized topics. Their routing engine uses multi-stage filtering: cheap checks first (severity string comparison), expensive checks last (regex pattern matching). They implement circuit breakers per topic - if security topic consumers are down, route security logs to backup archival topic to prevent data loss.</p><p>Amazon&#8217;s CloudWatch Logs uses hierarchical routing with namespace isolation. Each AWS service has dedicated topic namespaces, enabling independent scaling and preventing noisy neighbor problems. Their routing SLA: 99.99% of logs routed correctly within 10ms.</p><h2>Next Steps</h2><p>Tomorrow we implement dead letter queues for handling logs that fail processing despite retries, completing our fault-tolerance architecture.</p>]]></content:encoded></item><item><title><![CDATA[Day 34: Consumer Acknowledgments and Redelivery Mechanisms]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-34-consumer-acknowledgments-and</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-34-consumer-acknowledgments-and</guid><dc:creator><![CDATA[sdr]]></dc:creator><pubDate>Thu, 05 Feb 2026 08:30:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DKU3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><p>Today we implement reliable message processing with intelligent failure handling:</p><ul><li><p><strong>Manual acknowledgment control</strong> to prevent message loss during processing failures</p></li><li><p><strong>Configurable retry mechanisms</strong> with exponential 
backoff for transient errors</p></li><li><p><strong>Dead letter queues</strong> to isolate poison pill messages from healthy traffic</p></li><li><p><strong>Idempotency tracking</strong> to guarantee exactly-once processing semantics</p></li></ul><h2>Why This Matters</h2><blockquote><p>In distributed systems, message acknowledgment determines whether your system loses data or processes it twice. At scale, the difference is billions of events. When Uber&#8217;s payment system processes ride completions, a lost acknowledgment means a driver doesn&#8217;t get paid. A duplicate acknowledgment means charging a rider twice. Netflix&#8217;s recommendation system processes 500 billion events daily&#8212;without proper acknowledgment strategies, their entire pipeline would grind to a halt from poison pill messages or cascade into infinite retry loops.</p><p>The acknowledgment pattern defines your system&#8217;s reliability guarantees. Auto-commit provides at-most-once delivery (fast but lossy). Manual commit after processing gives at-least-once (reliable but may duplicate). Transactional commit achieves exactly-once (correct but complex). 
Production systems trade throughput for reliability based on business requirements&#8212;financial transactions need exactness, video view counts tolerate approximation.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DKU3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DKU3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 424w, https://substackcdn.com/image/fetch/$s_!DKU3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 848w, https://substackcdn.com/image/fetch/$s_!DKU3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!DKU3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DKU3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png" width="1456" height="874" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1213d061-6135-4045-af72-7665db02e704_4000x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:781709,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/182679674?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DKU3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 424w, https://substackcdn.com/image/fetch/$s_!DKU3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 848w, https://substackcdn.com/image/fetch/$s_!DKU3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!DKU3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1213d061-6135-4045-af72-7665db02e704_4000x2400.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
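<p>The acknowledgment ideas above can be sketched in plain Java. This is an illustrative model only (class and method names are our own, not a Spring Kafka API): capped exponential backoff between redeliveries, plus an idempotency set so at-least-once redelivery does not turn into duplicate processing.</p>

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch, not the course's final code: capped exponential
// backoff for redelivery, plus an idempotency set so a redelivered
// message is processed at most once.
public class AckSketch {

    // Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
    static long backoffMs(long baseMs, int attempt, long maxMs) {
        long delay = baseMs << Math.min(attempt, 30); // shift bound guards against overflow
        return Math.min(delay, maxMs);
    }

    private final Set<String> processedIds = new HashSet<>();

    // Returns true the first time a message id is seen, false for a
    // redelivered duplicate: duplicates are acknowledged but not reprocessed.
    boolean firstDelivery(String messageId) {
        return processedIds.add(messageId);
    }
}
```

<p>In production the idempotency set would live in a shared store such as Redis with a TTL; an in-memory set is lost on restart and grows without bound.</p>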
      <p>
          <a href="https://sdcourse.substack.com/p/day-34-consumer-acknowledgments-and">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Day 33: Implement Consumers to Process Logs from Queues]]></title><description><![CDATA[What We&#8217;re Building Today]]></description><link>https://sdcourse.substack.com/p/day-33-implement-consumers-to-process</link><guid isPermaLink="false">https://sdcourse.substack.com/p/day-33-implement-consumers-to-process</guid><dc:creator><![CDATA[sdr]]></dc:creator><pubDate>Sun, 01 Feb 2026 08:30:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eK-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2>What We&#8217;re Building Today</h2><ul><li><p><strong>Consumer group architecture</strong> with automatic partition rebalancing across multiple instances</p></li><li><p><strong>Parallel log processing pipeline</strong> handling 50,000+ events/second with sub-100ms P99 latency</p></li><li><p><strong>Offset management strategies</strong> for reliable message consumption and failure recovery</p></li><li><p><strong>Backpressure-aware processing</strong> with dynamic batch sizing and flow control mechanisms</p></li></ul><h2>Why This Matters: The Consumer Scalability Challenge</h2><blockquote><p>While producers must handle bursts of incoming events, consumers face a different scaling challenge: <strong>processing throughput must match or exceed production rate</strong> to prevent unbounded queue growth. Uber&#8217;s logging infrastructure processes 100 billion events daily across thousands of consumer instances. 
When a deployment temporarily doubles processing latency from 50ms to 100ms, consumer lag can balloon to hours of backlog within minutes, causing cascading failures in dependent systems that rely on near-real-time log insights.</p><p>The consumer side introduces unique distributed systems challenges that don&#8217;t exist for producers. Consumer groups must dynamically rebalance partition assignments as instances fail or scale, requiring consensus protocols that temporarily pause all consumption. Netflix&#8217;s consumer infrastructure restarts ~10,000 instances daily across their fleet, triggering ~500 rebalance operations per minute during peak deployment windows. Poor rebalancing strategies can create 30-60 second processing gaps, causing violations of their 99.9% SLA for anomaly detection pipelines that power their recommendation engine.</p></blockquote><p></p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eK-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eK-C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!eK-C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!eK-C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 
1272w, https://substackcdn.com/image/fetch/$s_!eK-C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eK-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png" width="1456" height="849" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca24e6bf-eafe-4853-861b-503263562243_2400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:849,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:450073,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sdcourse.substack.com/i/182677166?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eK-C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!eK-C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 848w, 
https://substackcdn.com/image/fetch/$s_!eK-C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!eK-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca24e6bf-eafe-4853-861b-503263562243_2400x1400.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div></blockquote>
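<p>The rebalancing mechanics described above (each partition owned by exactly one group member, reshuffled as instances join or leave) can be sketched with a toy round-robin assignor. Names here are our own; Kafka's real assignors are pluggable strategies such as range and cooperative-sticky.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative round-robin partition assignment for a consumer group:
// every partition goes to exactly one consumer, spread as evenly as possible.
public class AssignSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < partitionCount; p++) {
            // Partition p is owned by consumer (p mod groupSize).
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }
}
```

<p>Note that adding or removing one consumer changes most ownerships under this scheme, which is why a rebalance briefly pauses consumption and why cooperative (incremental) protocols exist to shrink that pause.</p>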
      <p>
          <a href="https://sdcourse.substack.com/p/day-33-implement-consumers-to-process">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>