
Linh Truong  ·  MA (Harvard), MBA  ·  LinhTruong.com  ·  Linh@Alumni.Harvard.edu

System Architecture

I designed this reference architecture to map the full structural anatomy of large-scale distributed systems — from client interfaces and edge security through ingress, Kubernetes & Istio service mesh, event streaming, polyglot data persistence, analytics & BI, observability, DevOps, security, and AI/ML infrastructure. Twelve layers, each with its own detailed diagram.

01 Master diagram

System Architecture — 10 Billion+ Users

Multi-Region · Cloud-Native · Zero Trust · Auto-Scaling · Production-Grade · 2025

10B+ Users · 99.999% SLA · <50ms P99 Latency · 5M+ Req/sec · 50+ Regions · 10 EB Data · Zero Trust Security · Auto-Scale Elastic Infra
🌎 US-EAST 🌎 US-WEST 🌍 EU-WEST 🌍 EU-CENTRAL 🌏 AP-SE 🌏 AP-NE 🌏 AP-SOUTH 🌍 ME-SOUTH 🌎 SA-EAST 🌍 AF-SOUTH
① Client Layer
🌐Web Browser React / Next.js · PWA SPA + SSR
📱iOS App Swift / SwiftUI Offline-First
🤖Android App Kotlin / Jetpack Offline-First
🖥️Desktop Electron / Tauri Cross-Platform
📡IoT / Devices MQTT · CoAP · BLE Billions
🔌3rd-Party API REST · GraphQL · SDK Partners
🥽AR / VR WebXR · Vision Pro Spatial
Wearables Watch · Smart Glass Health
HTTPS / WSS / gRPC / MQTT over TLS 1.3 + QUIC
② Edge & Security Layer
🛡️DDoS Shield Cloudflare / AWS Shield 10 Tbps+
CDN / Edge Cloudflare · 300+ PoPs Global Cache
🌍GeoDNS Route 53 · Anycast Latency-Based
🔥WAF OWASP Rules · Bot Filter L7 Firewall
🗺️Edge Workers CF Workers · V8 Isolates A/B · Personalize
🔐TLS Termination TLS 1.3 / HTTP/3 QUIC mTLS
🕵️Bot Management ML Fingerprint · CAPTCHA Real-Time
📦Static Cache Assets · Images · Fonts 95% Hit Rate
Filtered & Authenticated Traffic
③ Ingress Layer
⚖️Global Load Balancer L4/L7 · Health Checks Active-Active
🚪API Gateway Kong · Route · Transform REST / GraphQL
🔑Auth / IAM OAuth 2.0 · JWT · MFA Zero Trust
🚦Rate Limiter Token Bucket · Redis Per-User/IP
📡WebSocket GW Real-Time · Pub/Sub Bi-Directional
🔄GraphQL Federation Apollo Gateway · Stitching Unified Graph
🛑Circuit Breaker Hystrix · Bulkhead Fault Tolerance
📊gRPC Proxy Envoy Sidecar · Protobuf Service Mesh
Routed via Service Mesh (Istio / Linkerd) with mTLS
📡 Observability
📈Prometheus + Grafana Metrics · SLO · Alerts
🔍Distributed Tracing Jaeger · OpenTelemetry
📋Log Aggregation ELK / Loki · Structured
🚨Incident Mgmt PagerDuty · Runbooks
🚀 DevOps
🐳Kubernetes EKS/GKE · HPA · VPA
🔁CI/CD GitHub Actions · ArgoCD
🏗️IaC Terraform · Helm
④ Microservices Layer — Kubernetes · Istio Service Mesh
👤User Service Profile · Prefs · CQRS Go
📝Content Service CRUD · Rich Media Java
🔔Notification Push · Email · SMS Node.js
💳Payment Stripe · Idempotent Java · PCI-DSS
💬Messaging WebSocket · E2E Encrypt Erlang
🔍Search Elasticsearch · NLP Python
🗂️File Service Upload · Virus Scan Go
🌐Geo / Map Location · Geofencing C++
🎯Recommendation Collab Filter · Deep ML Python
📊Analytics Event Track · Funnel Python
🛒Order / Cart Checkout · Saga Java
⚙️Config / Flags LaunchDarkly · etcd Go
Async Events
⑤ Event Streaming & Async Processing
🌊Apache Kafka Event Streaming · 100M msg/s Partitioned
Apache Flink Stream Processing · CEP Real-Time
🐰RabbitMQ Task Queues · Dead Letter AMQP
🕐Temporal Workflow · Long-Running Durable Sagas
📬AWS SQS / SNS Fan-out · FIFO · DLQ Managed
🧮Spark Batch ETL · MapReduce Petabyte+
🔒 Security
🗝️Secret Manager Vault · AWS KMS · Rotate
🛡️SIEM / SOC Splunk · Threat Detection
🔏Encryption AES-256 · TLS 1.3
📜Compliance GDPR · SOC2 · PCI-DSS
🕸️Zero Trust Net mTLS · SPIFFE · BeyondCorp
🧠 AI / ML
🤖LLM Inference Claude · GPT-4 · Llama
🔬Feature Store Feast · Online + Offline
🚂ML Pipeline Kubeflow · MLflow
Cache-Aside · Write-Through · Read-Through — Redis L1 · CDN L2
⑥ Data Layer — Polyglot Persistence
⚡ Cache
Redis Cluster Session · Hot · Pub/Sub 1M ops/s
🗄️Memcached Object Cache · LRU Cache-Aside
🏦 Relational (OLTP)
🐘PostgreSQL Primary + Replicas · Citus ACID
🌍CockroachDB Geo-Distributed SQL NewSQL
🐬Aurora MySQL Multi-AZ · Serverless v2 Managed
📦 NoSQL
🏛️Cassandra Write-Heavy · Time-Series Wide-Col
DynamoDB Single-Digit ms · Global Serverless
🍃MongoDB Flexible Schema · Atlas Document
🔎 Search / Graph / TS
🕸️Neptune / Neo4j Social Graph · GQL Graph
⏱️InfluxDB Metrics · IoT · TTL Time-Series
☁️ Storage & Warehouse
☁️Object Storage S3 / GCS · Exabytes Versioned
❄️Snowflake / BigQuery OLAP · Petabytes Data Warehouse
🏗️Lakehouse Delta Lake · Iceberg ACID on S3
Analytics Pipeline → ETL / ELT → Business Intelligence
⑦ Data Platform, Analytics & BI
🔄ETL / ELT Fivetran · dbt Automated
📊BI Dashboards Looker · Metabase Self-Service
🌊Real-Time OLAP Apache Druid · Pinot Sub-second
🎲A/B Testing Experiment Platform Stats Sig
🧬Data Catalog DataHub · Lineage Governance
📡CDP Segment · Identity Unified ID
🛡️Data Privacy PII Mask · GDPR Compliance
💾Backup & DR Cross-Region · RTO <1min RPO = 0

Horizontal scaling · Multi-region active-active · Event-driven architecture · CQRS + Event Sourcing · Polyglot persistence · Zero-downtime deploys · Chaos Engineering · SRE practices  |  2025 Production-Grade Design

02 Client layer

💻
Client Layer — Detailed Architecture
System Architecture · 10 Billion+ Users · All Client Platforms · 2025
① Client Layer
8 Client Types · 10B+ End Users · 6 Protocols · <3s TTI Target · Offline-First Design · E2E Encrypted · A/B Feature Flags · WCAG 2.2 Accessibility
Client Types & Technology Stacks
🌐
Web Browser
Desktop & Mobile Web
SPA + SSR
Framework & Rendering
React 19 · Next.js 15 · Vue / Nuxt · Svelte / SvelteKit
State & Data Fetching
Redux Toolkit · React Query · Zustand · SWR
Capabilities
PWA · Service Worker · Offline Cache
WebAssembly (WASM) for compute
Code splitting · Lazy loading · ISR
WebGL / Canvas 2D rendering
Web Workers for background tasks
Build & Deploy
Vite / Turbopack · Vercel / Cloudflare Pages
Protocols
HTTPS · WebSocket · GraphQL · HTTP/3
📱
iOS App
iPhone · iPad · Apple Silicon
Native Swift
Framework & Language
Swift 6 · SwiftUI · UIKit (Legacy) · Combine / async/await
Architecture
MVVM · TCA (Composable) · Clean Architecture
Capabilities
Core Data + CloudKit sync
Background fetch & push notifications
Face ID / Touch ID · Secure Enclave
ARKit · Core ML on-device inference
WidgetKit · Live Activities · SharePlay
Offline & Sync
Realm · SQLite · Core Data
Protocols
HTTPS · gRPC · WebSocket · APNs
🤖
Android App
Phone · Tablet · Foldable
Native Kotlin
Framework & Language
Kotlin · Jetpack Compose · Coroutines / Flow
Architecture
MVVM + Clean · MVI (Orbit) · Hilt DI
Capabilities
Room DB + WorkManager offline jobs
Biometric API · Keystore secure storage
ML Kit on-device · TensorFlow Lite
Adaptive layouts for foldables
DataStore · Paging 3 · Navigation
Toolchain
Gradle KTS · R8 · Play Store
Protocols
HTTPS · gRPC · FCM Push · WebSocket
🖥️
Desktop App
Windows · macOS · Linux
Cross-Platform
Framework
Tauri 2.0 (Rust) · Electron · Flutter Desktop · Qt / .NET MAUI
UI Layer
React / Svelte · Native OS APIs
Capabilities
Native OS notifications & tray icon
File system access · IPC bridge
Auto-update via Sparkle / Squirrel
Offline-first with local DB (SQLite)
Hardware GPU acceleration
Distribution
Mac App Store · MS Store · Snap
Protocols
HTTPS · WebSocket · gRPC · GraphQL
📡
IoT & Embedded
Sensors · Actuators · Edge
Billions of Devices
Platforms & Runtime
FreeRTOS · Zephyr RTOS · ESP-IDF / Arduino · Raspberry Pi / Linux
Edge Computing
AWS Greengrass · Azure IoT Edge · TF Lite Micro
Capabilities
Local edge inference (TF Lite Micro)
Store-and-forward when offline
OTA firmware updates (delta)
Hardware security module (HSM)
Device twin / shadow state sync
Connectivity
WiFi 6 · BLE 5.3 · LoRa · NB-IoT
Protocols
MQTT · CoAP · AMQP · HTTPS
🔌
3rd-Party / Partner API
B2B · Integrations · SDK
Partner Access
Integration Types
REST SDK · GraphQL API · Webhook Events · gRPC Streaming
Auth & Security
OAuth 2.0 Client Creds · API Key + HMAC · mTLS Certs
Capabilities
Developer portal & sandbox env
OpenAPI 3.1 spec & Postman collection
Rate limit: 10k–1M req/min by tier
Webhook retry with exponential backoff
SDK: Python, Node, Java, Go, Ruby
Tooling
Swagger · Stoplight · Postman
Protocols
REST · GraphQL · gRPC · Webhooks
🥽
AR / VR / Spatial
Vision Pro · Quest · HoloLens
Emerging
Platforms & SDK
visionOS / RealityKit · WebXR · Unity / Unreal Engine · Meta SDK (Quest)
Rendering
Metal / Vulkan · Three.js / Babylon.js
Capabilities
Spatial anchors & world-scale AR
Hand tracking & eye gaze input
Real-time 3D object streaming (USD)
Spatial audio with HRTF rendering
Shared AR sessions (multiplayer)
Compute
On-device ML · Cloud render offload
Protocols
WebRTC · WebSocket · HTTPS
Wearables
Watch · Smart Glass · Ring
Health & Fitness
Platforms
watchOS / WatchKit · Wear OS · Tizen · Garmin SDK
Sensors
HR · SpO2 · ECG · Accelerometer · GPS · Skin Temp · EDA
Capabilities
Continuous passive health monitoring
Companion app BLE data bridge
On-device ML for anomaly detection
Haptic feedback & notifications
HealthKit / Health Connect sync
Connectivity
BLE 5.3 · WiFi 6 · LTE (standalone)
Protocols
BLE GATT · HTTPS · MQTT
Request Flow — Client to Backend
👤
User Action
Tap · Click · Voice
🧩
UI Component
Event Handler
🔄
State Manager
Redux · TCA · ViewModel
📡
API Client
Axios · URLSession · OkHttp
🗃️
Local Cache
Check First (offline)
🔐
Auth Token
JWT · Bearer · mTLS
🛡️
TLS 1.3 / QUIC
Certificate Pinning
CDN / Edge
Nearest PoP · Cache Hit
🚪
API Gateway
Route · Rate Limit
⚙️
Microservice
Business Logic
Cross-Cutting Concerns — Shared Across All Clients
🔐Authentication & Auth
OAuth 2.0 + PKCE flow
JWT access + refresh tokens
Passkeys (WebAuthn / FIDO2)
Biometric: Face ID / Fingerprint
MFA: TOTP · SMS · Hardware key
Token rotation + silent refresh
Certificate pinning (mobile)
📊Analytics & Telemetry
Session recording (Hotjar / FullStory)
Funnel & conversion tracking
Core Web Vitals monitoring
Custom event tracking (Amplitude)
Crash reporting (Sentry / Firebase)
Real User Monitoring (RUM)
Privacy-safe (GDPR consent)
🚩Feature Flags & A/B
LaunchDarkly / Unleash SDK
Gradual rollout (1% → 100%)
Kill switch for instant rollback
Multivariate A/B experiments
Targeting by user segment
Client-side evaluation (local)
SSE / streaming flag updates
🐛Error Reporting
Sentry · Bugsnag · Firebase Crashlytics
Source map upload for JS/TS
dSYM symbols for iOS crashes
ProGuard mapping for Android
Breadcrumb trail for debugging
PII scrubbing before upload
Alert on error-rate spike
🌍i18n & Localization
100+ languages · RTL support
ICU message format (plurals)
Lazy-load locale bundles
Currency · Date · Number format
Over-the-Air translation updates
Crowdin / Lokalise workflow
WCAG 2.2 AA accessibility
🔔Push & Notifications
APNs (iOS) · FCM (Android)
Web Push API (browser)
Rich push: image, action buttons
Scheduled & silent pushes
Opt-in/opt-out preference center
In-app message (non-intrusive)
Delivery receipt tracking
Communication Protocols
🌐
REST / HTTPS
JSON · OpenAPI 3.1
HTTP/2 multiplexing
Idempotency keys
Request–Response
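The idempotency-key item above can be sketched in a few lines. This is a hypothetical in-memory store (`IdempotencyStore` is an illustrative name); a production gateway would keep the keys in Redis with a TTL:

```python
import threading

class IdempotencyStore:
    """Replay protection sketch: the first execution for a key is stored,
    and every retry with the same key gets the stored response back."""
    def __init__(self):
        self._lock = threading.Lock()
        self._responses = {}

    def execute(self, key, handler):
        # Serve the stored response if this key was already processed.
        with self._lock:
            if key in self._responses:
                return self._responses[key]
        result = handler()
        with self._lock:
            # setdefault guards against a concurrent first execution.
            self._responses.setdefault(key, result)
            return self._responses[key]
```

A retried POST carrying the same `Idempotency-Key` header then returns the stored response instead of performing the side effect twice.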
WebSocket / SSE
Real-time push
Chat · Live feeds
Heartbeat + reconnect
Full-Duplex
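The heartbeat-and-reconnect behaviour above is usually implemented as exponential backoff with an upper cap (the Ingress section quotes 1s → 30s). A minimal sketch of the delay schedule, with optional full jitter:

```python
import random

def backoff_delays(attempts, base=1.0, cap=30.0, jitter=False):
    """Exponential backoff schedule: base*2^n, capped.
    With jitter=True each delay is drawn uniformly from [0, capped],
    which spreads reconnect storms after a server restart."""
    delays = []
    for attempt in range(attempts):
        capped = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, capped) if jitter else capped)
    return delays

# Without jitter: 1s, 2s, 4s, 8s, 16s, then capped at 30s.
```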
🔄
gRPC / HTTP/2
Protobuf binary
Bi-directional stream
Low latency & typed
Service-to-Service
📐
GraphQL
Flexible queries
Subscriptions · Batching
Apollo / Relay client
Query Language
📡
MQTT / CoAP
IoT-optimized
Pub/Sub · QoS 0/1/2
Low bandwidth devices
IoT Telemetry
🎥
WebRTC / QUIC
P2P video & audio
DTLS-SRTP encrypted
ICE / STUN / TURN
Peer-to-Peer
Client Architecture Patterns
🏗️Backend for Frontend (BFF)
Dedicated API per client type (web, mobile, IoT)
Aggregates multiple microservices in one call
Reduces over-fetching & under-fetching
Tailored response shape per platform
🧩Micro-Frontend Architecture
Module Federation (Webpack 5 / Rspack)
Independent team deployments per feature
Shared design system component library
Lazy-loaded at runtime via CDN
💾Offline-First & Sync
CRDTs for conflict-free distributed state
Optimistic UI with rollback on error
Background sync via Service Worker
IndexedDB / SQLite local persistence
🔒Client Security Model
Zero-trust: every request re-authenticated
Certificate pinning on mobile clients
Content Security Policy (CSP) headers
Subresource Integrity (SRI) for scripts
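As a concrete taste of the CRDT idea behind offline-first sync, here is a last-writer-wins register, one of the simplest CRDTs. Real offline stacks typically use richer document CRDTs (e.g. Automerge, Yjs); this sketch only shows why replicas converge without coordination:

```python
from dataclasses import dataclass

@dataclass
class LWWRegister:
    """Last-writer-wins register: each replica keeps the write with the
    highest (timestamp, node_id) pair; node_id breaks ties so concurrent
    merges are deterministic on every replica."""
    value: object = None
    timestamp: float = 0.0
    node_id: str = ""

    def set(self, value, timestamp, node_id):
        if (timestamp, node_id) > (self.timestamp, self.node_id):
            self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def merge(self, other):
        # Merging is just another set(); it is commutative and idempotent,
        # so replicas converge no matter the order syncs arrive in.
        self.set(other.value, other.timestamp, other.node_id)
```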

Client Layer sits at the outermost tier of the system — responsible for user experience, offline resilience, security bootstrapping, and efficient protocol selection per device class.  |  All clients connect through the Edge & Security Layer next.  |  2025 Production Design

03 Edge & security

🛡️
Edge & Security Layer — Detailed Architecture
System Architecture · 10 Billion+ Users · Global Edge Network · Zero Trust Security · 2025
② Edge & Security Layer
300+ Global PoPs · 10 Tbps+ DDoS Capacity · <5ms Edge Latency · 99.999% Uptime SLA · 95% Cache Hit Rate · TLS 1.3 Encryption · OWASP WAF Ruleset · Zero Trust Security Model
Edge & Security Components — Technology Stacks
🛡️
DDoS Protection
Volumetric · Protocol · Application
10 Tbps+
Providers
Cloudflare Magic Transit · AWS Shield Advanced · Akamai Prolexic · Azure DDoS Protection
Attack Absorption
Anycast routing — absorb traffic at nearest PoP
BGP Flowspec — null-route attack sources
Scrubbing centers — clean traffic forwarded
Adaptive rate limiting per source IP / ASN
Attack Types Covered
UDP Flood · SYN Flood · ICMP Flood · HTTP Flood · Slowloris · DNS Amplify
Mitigation Speed
Detection < 1 sec · Mitigation < 3 sec
Protocols Protected
TCP/UDP · HTTP/S · DNS · ICMP · GRE
CDN / Edge Network
Content Delivery · Acceleration
300+ PoPs
Providers
Cloudflare · Akamai · Fastly · AWS CloudFront
Cache Hierarchy
L1 — Shield PoP (regional aggregator)
L2 — Edge PoP (end-user facing)
Origin Shield — collapses requests to origin
Stale-while-revalidate for zero-miss penalty
Cache Strategies
Cache-Control headers · Surrogate keys (tags) · Instant purge API · Vary-aware caching
Acceleration
TCP/IP anycast — closest PoP routing
HTTP/2 & HTTP/3 (QUIC) push
Brotli / Zstandard compression
Protocols
HTTP/3 · QUIC · HTTP/2 · WebSocket
🌍
GeoDNS & Routing
Global Traffic Management
Anycast
Providers
AWS Route 53 · NS1 / IBM DNS · Cloudflare DNS · Google Cloud DNS
Routing Policies
Latency-based — route to lowest latency region
Geolocation — route by country / continent
Weighted round-robin for gradual failover
Health-check aware — auto-remove unhealthy
Failover (active–passive) with TTL 30s
DNSSEC & Security
DNSSEC signing · DNS-over-HTTPS · DNS-over-TLS · Response Rate Limit
Performance
TTL: 30s–300s · 100% Anycast coverage
Protocols
DNS/UDP · DoH · DoT · DNSSEC
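The latency-based, health-check-aware routing policies above reduce to a small selection rule. A sketch with hypothetical region names and probe latencies (real GeoDNS uses per-resolver latency measurements):

```python
def pick_region(probe_latency_ms, healthy_regions):
    """Latency-based answer selection: among regions currently passing
    health checks, return the one with the lowest measured latency."""
    candidates = {r: ms for r, ms in probe_latency_ms.items()
                  if r in healthy_regions}
    if not candidates:
        raise RuntimeError("no healthy region to route to")
    return min(candidates, key=candidates.get)
```

Health-check awareness falls out naturally: when a region is removed from the healthy set, the next-best region is answered instead.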
🔥
Web Application Firewall
OWASP · Custom Rules · L7
L7 Filter
Providers
Cloudflare WAF · AWS WAF v2 · ModSecurity (NGINX) · Imperva WAF
Rule Categories
OWASP Core Rule Set (CRS 3.3)
SQLi · XSS · RCE · SSRF · LFI
Geo-blocking (country allowlist)
IP reputation lists (threat intel feeds)
Modes & Actions
Block · Challenge (CAPTCHA) · JS Challenge · Log / Monitor
Tuning
Paranoia level 1–4 (false positive tradeoff)
Per-path exceptions for legacy endpoints
Protocols
HTTP/S · HTTP/2 · HTTP/3 · WebSocket
🗺️
Edge Workers / Compute
Serverless at the Edge
V8 Isolates
Runtimes
Cloudflare Workers · AWS Lambda@Edge · Vercel Edge Runtime · Fastly Compute@Edge
Use Cases
A/B testing — rewrite response at edge
Personalisation by user / geo / device
Auth token validation (no origin round-trip)
Request / response transformation
Image optimisation (resize, WebP, AVIF)
KV Storage at Edge
Cloudflare KV (global) · Durable Objects (strong) · R2 Object Store
Cold-Start
< 1ms (V8 isolate, no container boot)
Runtimes
JS / TS · Rust/WASM · Python · WinterCG
🔐
TLS / QUIC Termination
Encryption · Certificates · PKI
mTLS
TLS Versions
TLS 1.3 (primary) · QUIC (HTTP/3) · TLS 1.2 (legacy fallback)
Certificate Management
Auto-provisioned via Let's Encrypt / ACM
Wildcard certs · SAN multi-domain
ACME protocol — 90-day auto-renewal
Certificate Transparency (CT) logging
OCSP stapling — no client round-trip
Cipher Suites
TLS_AES_256_GCM_SHA384 · TLS_CHACHA20_POLY1305 · ECDHE (X25519)
mTLS (Service-to-Service)
SPIFFE / SPIRE identity framework
Short-lived SVID certs (1h TTL)
Standards
RFC 8446 · FIPS 140-2 · PCI-DSS · HSTS
🕵️
Bot Management
ML Detection · Fingerprinting
Real-Time ML
Providers
Cloudflare Bot Mgmt · DataDome · Kasada · Arkose Labs
Detection Signals
Browser fingerprint (JS challenge)
Mouse/scroll/keyboard behaviour heuristics
TLS JA3/JA3S fingerprint
HTTP/2 SETTINGS frame fingerprint
IP reputation + ASN / datacenter ranges
Bot Categories
Scrapers · Credential Stuffing · Carding Bots · Click Fraud · Scalper Bots · Good Bots (allow)
Actions
Block · CAPTCHA · Slow down (throttle) · Fake 200 (honeypot)
Signals
JS Agent · HTTP Headers · IP Geo · ML Score
📦
Static Asset Cache
CDN Cache · Object Storage
95% Hit Rate
Cached Asset Types
JS / CSS bundles · Images (WebP/AVIF) · Fonts (WOFF2) · HTML (ISR/SSG) · API responses (JSON)
Cache Control Strategy
Immutable assets: max-age=31536000 (1yr)
HTML pages: max-age=0, s-maxage=300
Stale-while-revalidate: background refresh
Surrogate key tags for instant bulk purge
Optimisations
Brotli / Zstd on-the-fly compression
Image format negotiation (Accept header)
HTTP Early Hints (103) for link preload
Protocols
HTTP/3 · QUIC · HTTP/2 Push · Early Hints
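The cache-control policy above maps each asset class to header values. A sketch, using file extensions as a stand-in for the real content-hash and content-type logic a CDN would apply:

```python
def cache_control(path: str) -> str:
    """Return the Cache-Control header for an asset class, mirroring the
    strategy above: immutable hashed assets for a year, HTML edge-cached
    briefly with stale-while-revalidate, short edge cache otherwise."""
    if path.endswith((".js", ".css", ".woff2", ".webp", ".avif")):
        # Content-hashed bundles never change in place: cache for a year.
        return "public, max-age=31536000, immutable"
    if path.endswith((".html", "/")):
        # Browser revalidates every time; the CDN keeps it 5 min and
        # refreshes in the background while serving stale.
        return "public, max-age=0, s-maxage=300, stale-while-revalidate=60"
    # Default: short shared (CDN) cache, no browser cache.
    return "public, max-age=0, s-maxage=60"
```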
Traffic Flow — Internet to Origin
🌐
Internet / Client
Any device worldwide
Anycast BGP
🛡️
DDoS Scrubbing
Attack traffic dropped
Clean traffic only
🌍
GeoDNS
Nearest PoP resolved
DNS A/AAAA record
🔐
TLS 1.3 Handshake
0-RTT resumption
Encrypted tunnel
🔥
WAF Inspection
OWASP + custom rules
Scored request
🕵️
Bot Score
ML fingerprint check
Human verified
📦
Cache Check
HIT → serve instantly
Cache MISS only
🗺️
Edge Worker
Transform / personalise
Forwarded upstream
🚪
API Gateway
Ingress Layer →
DDoS Defense Matrix
🌊Volumetric (L3/L4)
UDP Flood · ICMP · DNS Amplification · NTP Reflection
Anycast absorption — distribute globally
Blackhole routing / RTBH for sources
Scrubbing centres — clean & forward
BGP Flowspec — ISP-level filtering
10 Tbps+ capacity · Detection < 1s
Protocol (L4)
SYN Flood · ACK Flood · TCP Exhaustion · Slowloris
SYN cookies — validate before allocating state
TCP connection rate limits per source
Timeout tuning — kill idle connections fast
Load balancer health & back-pressure
Stateless defence · 1M PPS handled
🌐Application (L7)
HTTP Flood · Credential Stuffing · Web Scraping · API Abuse
JS Challenge — headless browser detection
Token bucket rate limiting per client
ML bot score threshold enforcement
CAPTCHA escalation on suspicious score
WAF + Bot Mgmt · 99.9% accuracy
WAF Rule Categories — OWASP Top 10 Coverage
💉Injection Attacks · CRITICAL
SQL Injection (UNION, DROP, '--)
NoSQL / ORM injection
LDAP / XPath injection
OS command injection
📜XSS & CSRF · HIGH
Reflected XSS pattern matching
Stored XSS via input sanitisation
DOM-based XSS (CSP enforcement)
CSRF token validation
📁Path Traversal & LFI · HIGH
../../ traversal sequence blocking
/etc/passwd, /proc pattern rules
URL-encoded bypass detection
Null byte injection blocking
🔗SSRF & RCE · CRITICAL
169.254.x.x metadata endpoint block
localhost / 127.0.0.1 request block
Serialisation gadget pattern rules
Log4Shell / Spring4Shell signatures
Certificate & PKI Lifecycle
1 🔑 Key Generation ECDSA P-256 / RSA-2048 — HSM-backed private key
2 📋 CSR & ACME Let's Encrypt / AWS ACM — automated DNS-01 challenge
3 ✅ Certificate Issued 90-day validity · Wildcard SAN · CT log recorded
4 🚀 Deploy to CDN Edge OCSP stapling enabled · pushed to all PoPs
5 🔄 Auto-Renew (day 75) Renewal triggered 15 days before expiry · zero downtime
6 📊 Monitor & Alert Prometheus cert-exporter · alert if expiry < 30 days
7 ⛔ Revocation (CRL / OCSP) Instant revocation via CA API on compromise
Zero Trust Architecture Pillars
🪪Identity Verification
Every request authenticated, no implicit trust
OAuth 2.0 + short-lived JWT tokens
MFA enforced for privileged actions
📱Device Posture
Device health check before access grant
MDM compliance (Jamf / Intune)
Jailbreak / root detection (mobile)
🕸️Network Micro-Seg.
mTLS between every service pair
SPIFFE SVID identity certificates
Istio policy enforces allowed traffic
📊Continuous Monitoring
All access logged to immutable audit trail
SIEM anomaly detection (ML baseline)
Automated threat response playbooks
🔒Least Privilege
Fine-grained RBAC per resource & action
Just-in-time (JIT) elevated access
Auto-expire access grants (4h max)
🌍Data Protection
AES-256 encryption at rest + in transit
Tokenisation for PII in transit
DLP rules — detect & block data exfil
Global Point-of-Presence (PoP) Distribution
🌎 North America · 80+ PoPs
🌍 Europe · 70+ PoPs
🌏 Asia Pacific · 65+ PoPs
🌏 Southeast Asia · 30+ PoPs
🌍 Middle East · 15+ PoPs
🌍 Africa · 12+ PoPs
🌎 South America · 18+ PoPs
🌏 Oceania · 10+ PoPs
Anycast Backbone · Tier-1 ISPs
🔗 Private Backbone · Argo / SmartRouting
3ms DNS Resolution · Anycast, 300+ resolvers
0-RTT TLS Resumption · Session tickets / PSK
95% Cache Hit Rate · Saved origin round-trips
<1s DDoS Detection · ML flow analysis
<3s DDoS Mitigation · Anycast blackhole
<1ms Edge Worker Start · V8 isolate, no cold boot
99.999% Availability SLA · 5 nines, <5 min/year down
10 Tbps Absorption Capacity · Largest attacks mitigated

Edge & Security Layer is the first line of defence — absorbing attacks, encrypting all traffic, accelerating content globally, and enforcing Zero Trust before any request reaches the Ingress Layer.  |  Compliance: SOC 2 Type II · ISO 27001 · PCI-DSS Level 1 · GDPR Article 32  |  2025 Production Design

04 Ingress layer

🚪
Ingress Layer — Detailed Architecture
System Architecture · 10 Billion+ Users · API Gateway · Auth · Rate Limiting · Service Mesh Entry · 2025
③ Ingress Layer
5M+ Req/Second · <5ms GW Overhead · 99.999% Availability · Zero Trust Auth Model · mTLS Service-to-Service · Active-Active LB Mode · Per-User Rate Limiting · OAuth 2.0 Auth Standard
Ingress Components — Technology Stacks
⚖️
Global Load Balancer
L4 / L7 · Health Checks · Failover
Active-Active
Technologies
AWS ALB / NLB · GCP Cloud LB · NGINX Plus · HAProxy
Algorithms
Weighted Round Robin — traffic distribution
Least Connections — route to least loaded
IP Hash — sticky sessions per client
Latency-based — route to fastest region
Health Checks
Active — HTTP /health every 5s, timeout 2s
Passive — detect 5xx errors, remove node
Graceful drain — 30s connection draining
Features
SSL Offload · Connection Pooling · HTTP/2 Upgrade · Anycast VIP
Protocols
TCP/UDP · HTTP/2 · HTTP/3 · WebSocket
🚪
API Gateway
Route · Transform · Aggregate
REST + GraphQL
Technologies
Kong Gateway · AWS API GW v2 · Apigee X · Envoy / Istio GW
Core Functions
Request routing — path, host, header matching
Request / response transformation (Lua plugins)
Protocol translation — REST ↔ gRPC ↔ WebSocket
API versioning — /v1, /v2 routing strategies
Request aggregation — fan-out and merge
Plugins / Middleware
Auth (JWT/OAuth) · CORS · Request Validator · Cache · Logging
Protocols
REST · GraphQL · gRPC · WebSocket
🔑
Auth & IAM Service
OAuth 2.0 · JWT · MFA · SSO
Zero Trust
Technologies
Keycloak · Auth0 · AWS Cognito · Okta / Ping Identity
Auth Methods
OAuth 2.0 + PKCE — public clients
OIDC — identity layer on top of OAuth
Passkeys (WebAuthn / FIDO2) — passwordless
SAML 2.0 — enterprise SSO federation
API Key + HMAC — machine-to-machine
Token Strategy
Access token: JWT, 15-min TTL, RS256 signed
Refresh token: opaque, 7-day, Redis-stored
Silent refresh — background token rotation
MFA Factors
TOTP (Authenticator) · SMS OTP · Hardware Key (YubiKey) · Biometric
Protocols
OAuth 2.0 · OIDC · SAML 2.0 · FIDO2
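Token issuance and gateway-side verification follow the JWT structure described above. This sketch uses HS256 with a shared secret so it stays stdlib-only; the gateway described here would use RS256 with public keys fetched from a JWKS endpoint, but the encode/verify structure is identical:

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims, secret: bytes, ttl=900):
    """Issue a compact HS256 JWT with a 15-minute expiry by default."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({**claims, "exp": int(time.time()) + ttl}).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"

def verify_jwt(token, secret: bytes):
    """Check signature and expiry, then return the claims."""
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```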
🚦
Rate Limiter
Token Bucket · Sliding Window
Per-User / IP
Technologies
Redis Cluster · Envoy Rate Limit · Kong Rate Limit · Nginx limit_req
Algorithms
Token bucket — burst allowance, steady refill
Sliding window log — exact precision, Redis sorted set
Fixed window counter — Redis INCR + EXPIRE
Leaky bucket — smooth output, queue overflow
Limit Dimensions
Per user ID · per API key · per IP address
Per endpoint path · per tenant (multi-tenant)
Global cluster limit via Redis Lua scripts
Response Headers
X-RateLimit-Remaining · X-RateLimit-Reset · Retry-After
Actions on Breach
HTTP 429 · Throttle · Queue · Block IP
📡
WebSocket Gateway
Real-Time · Pub/Sub · Bi-Directional
Full-Duplex
Technologies
Socket.IO / uWebSockets · AWS API GW WebSocket · Pusher / Ably · Fanout / Fastly Real-Time
Architecture
Sticky sessions — consistent hashing by user ID
Redis Pub/Sub — fan-out across gateway nodes
Heartbeat ping/pong — 30s interval keep-alive
Reconnect with exponential backoff (1s → 30s)
SSE fallback — for environments blocking WS
Message Routing
Room / channel-based topic routing
User-to-user direct message delivery
Protocols
WebSocket · SSE · Long Poll · STOMP
🔄
GraphQL Federation
Apollo Gateway · Unified Graph
Schema Stitching
Technologies
Apollo Gateway · Apollo Federation v2 · GraphQL Mesh · Hasura / StepZen
Federation Model
Supergraph — composed from subgraphs per service
@key directive — entity references across services
@external / @requires — cross-subgraph fields
Query planning — optimal subgraph execution
Performance
Persisted queries — hash instead of full query
DataLoader — batching + deduplication (N+1)
Response cache — Redis, TTL per type
Security
Query depth limit (10) · Cost analysis (query budget) · Disable introspection (prod)
Protocols
GraphQL/HTTP · Subscriptions · WebSocket · SSE
🛑
Circuit Breaker
Resilience · Bulkhead · Fallback
Fault Tolerant
Technologies
Resilience4j · Hystrix (Legacy) · Envoy Outlier Detection · Istio DestinationRule
Patterns
Circuit Breaker — trip on 50% error rate (10s)
Bulkhead — isolate thread pools per dependency
Retry — 3 attempts, exponential backoff + jitter
Timeout — per-request deadline propagation
Fallback — cached response or graceful degradation
CB Configuration
Failure threshold: 50% in sliding window (100 calls)
Half-open: 5 trial requests after 30s wait
Protocols
HTTP · gRPC · Async Queue · DB Pool
📊
gRPC Proxy
Envoy Sidecar · Service Mesh
Protobuf
Technologies
Envoy Proxy · Istio Ingress GW · gRPC-Gateway (REST↔gRPC) · Traefik / Linkerd
Features
gRPC-Web — browser clients via HTTP/1.1
Transcoding — REST JSON ↔ Protobuf
Server-side streaming (SSE equivalent)
Client streaming — large file uploads
Bi-directional streaming — real-time bidding
Observability
gRPC status codes → Prometheus metrics
Distributed trace injection (B3 / W3C)
Auth
mTLS per call · gRPC metadata JWT · ALTS (Google)
Protocols
HTTP/2 · Protobuf · gRPC-Web · mTLS
Request Lifecycle — Edge to Microservice
From CDN / Edge
Clean · TLS terminated
Anycast VIP
⚖️
Global LB
L7 routing, health check
Least-conn algorithm
🚪
API Gateway
Route + transform
Plugin chain runs
🔑
Auth Check
JWT verify · RBAC
Token validated
🚦
Rate Limiter
Quota check in Redis
Under quota
🛑
Circuit Breaker
CLOSED — pass through
Request forwarded
📊
gRPC Proxy
Envoy sidecar + mTLS
Service mesh
⚙️
Microservice
Business logic executes
Load Balancing Algorithms
🔁 Weighted Round Robin · Default
Distributes traffic proportionally by configured weight (e.g. Node A = 70%, B = 30%). Ideal for heterogeneous instance sizes. Zero overhead, deterministic.
Low overhead · Predictable
📊 Least Connections · Recommended
Routes each request to the upstream with the fewest active connections. Optimal for variable request durations (e.g. streaming, long polling). Adapts to hot spots.
Adaptive · Fair distribution
📌 IP Hash / Sticky · Stateful
Hashes the client IP so a client always reaches the same upstream. Required for WebSocket sessions and stateful workloads. A consistent hashing ring handles node add/remove.
Session affinity · WS / gRPC
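Static weighted round robin can be sketched by expanding each upstream by its weight and cycling; production balancers typically use smooth WRR so picks interleave rather than cluster, but the proportions come out the same:

```python
import itertools

def weighted_round_robin(weights):
    """Naive weighted round robin: repeat each upstream by its integer
    weight, then cycle through the expanded list forever."""
    expanded = [node for node, w in weights for _ in range(w)]
    return itertools.cycle(expanded)

# Node A gets 70% of picks and node B 30%, as in the example above.
picker = weighted_round_robin([("node-a", 7), ("node-b", 3)])
window = [next(picker) for _ in range(10)]
```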
OAuth 2.0 + PKCE Authentication Flow
1
Client initiates login
Generate code_verifier (random 43-128 chars) + code_challenge = SHA256(verifier)
PKCE
2
Authorization request
Redirect to /authorize?client_id=…&code_challenge=…&scope=openid profile
Browser redirect
3
User authenticates (+ MFA)
Password / Passkey / Biometric → optional TOTP / SMS / hardware key
MFA enforced
4
Authorization code returned
Short-lived code (10min) sent to redirect_uri callback
One-time code
5
Token exchange (back-channel)
POST /token {code, code_verifier, client_id} → access_token (JWT 15min) + refresh_token (7d)
Secure exchange
6
JWT verified at API Gateway
Signature verified via JWKS endpoint · claims extracted · scopes enforced
RS256 / ES256
7
Silent refresh before expiry
At 13min: use refresh_token to obtain new access_token silently in background
Auto-renew
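Steps 1 and 5 of the flow above hinge on the verifier/challenge pair. A sketch of how a client derives them (per RFC 7636: SHA-256 of the verifier, base64url-encoded without padding) and how the authorization server checks the token exchange:

```python
import base64, hashlib, secrets

def make_pkce_pair():
    """Client side (step 1): 32 random bytes yield a 43-char verifier,
    the minimum length RFC 7636 allows; the challenge is its S256 hash."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

def server_check(verifier, challenge):
    # Authorization server side (step 5): recompute the challenge from the
    # presented verifier and compare with the one sent in step 2.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode() == challenge
```

Because only the challenge travels through the browser redirect, an attacker who intercepts the authorization code still cannot redeem it without the verifier.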
Rate Limiting Strategies
🪣Token Bucket · Burst Allowed
Tokens added at fixed rate (e.g. 100 req/s)
Max bucket size = burst capacity (e.g. 200)
Request costs 1 token; rejected if bucket empty
Allows short bursts above steady-state rate
Use for: API consumers needing burst headroom
📊Sliding Window · Most Accurate
Redis sorted set: score = timestamp
ZREMRANGEBYSCORE removes expired entries
ZCARD counts requests in window
ZADD inserts current request timestamp
Use for: strict per-user quotas (billing tier)
🪟Fixed Window · Lightweight
Redis INCR key:{user}:{minute}
EXPIRE set to window end time
Atomic Lua script (INCR + check)
Boundary burst problem at window edge
Use for: high-volume, lower precision needs
🚰Leaky Bucket · Smooth Output
Queue incoming requests (bucket capacity)
Process at fixed outflow rate (drip)
Overflow → reject with HTTP 429
Prevents bursty load reaching backends
Use for: downstream protection, smooth traffic
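The token-bucket variant described above can be sketched in-process. The cluster-wide limiter mentioned earlier runs this same refill-and-spend logic atomically inside a Redis Lua script; here `now` is passed explicitly to keep the sketch deterministic:

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens accrue at `rate` per second
    up to `capacity`, and each allowed request spends one token."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity     # burst size
        self.tokens = float(capacity)
        self.last = 0.0              # production: initialise to time.time()

    def allow(self, now):
        # Refill based on elapsed time, then try to spend one token.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With rate=100 and capacity=200, a burst of 250 simultaneous requests admits exactly 200, and half a second later 50 tokens have refilled.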
Circuit Breaker State Machine
🟢
CLOSED
Normal operation. All requests pass through. Failure counter tracking active. Resets on success window.
Healthy · Pass-through
Failure rate > 50%
🔴
OPEN
All requests fail-fast with fallback response. No load on downstream. Reset timer: 30 seconds. Prevents cascade failures.
Fail-Fast · Cached fallback
Timer expires (30s wait)
🟡
HALF-OPEN
5 trial requests allowed. If majority succeed → CLOSED. If failures continue → back to OPEN. Gradual recovery probe.
Probing · 5 trial calls
Trials fail
🔁
Fallback Strategy
Return cached data, static response, or degraded feature. Bulkhead pattern isolates thread pools per dependency.
Cache hit · Graceful degrade
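The state machine above, with the thresholds quoted in this section (50% failure rate trips the breaker, 30s open, 5 half-open probes), can be sketched as:

```python
import collections

class CircuitBreaker:
    """CLOSED → OPEN → HALF-OPEN sketch. `now` is passed in explicitly;
    a real implementation would read a monotonic clock."""
    def __init__(self, window=10, trip_ratio=0.5, reset_after=30.0, probes=5):
        self.window = collections.deque(maxlen=window)  # recent outcomes
        self.trip_ratio = trip_ratio
        self.reset_after = reset_after
        self.probes = probes
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.probe_results = []

    def allow(self, now):
        # After the open timer expires, let a limited set of probes through.
        if self.state == "OPEN" and now - self.opened_at >= self.reset_after:
            self.state, self.probe_results = "HALF-OPEN", []
        return self.state != "OPEN"

    def record(self, success, now):
        if self.state == "HALF-OPEN":
            self.probe_results.append(success)
            if len(self.probe_results) == self.probes:
                if sum(self.probe_results) > self.probes // 2:
                    self.state = "CLOSED"      # majority succeeded
                else:
                    self.state, self.opened_at = "OPEN", now
            return
        self.window.append(success)
        full = len(self.window) == self.window.maxlen
        if full and self.window.count(False) / len(self.window) >= self.trip_ratio:
            self.state, self.opened_at = "OPEN", now
            self.window.clear()
```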
API Gateway Routing Table — Sample Rules
Method | Path Pattern | Target Service | Auth | Rate Limit | Cache | Notes
GET | /api/v2/users/{id} | user-service | JWT | 1000/min | 60s TTL | User profile, read-only
POST | /api/v2/orders | order-service | JWT+MFA | 100/min | No cache | Idempotency-Key required
GET | /api/v2/search?q=* | search-service | API Key | 500/min | 10s TTL | Elasticsearch fanout
WS | /ws/v2/chat | msg-service | JWT | Sticky | No cache | Redis Pub/Sub fan-out
POST | /api/v2/payments | payment-service | JWT+MFA | 20/min | No cache | PCI-DSS · Idempotent
POST | /graphql | apollo-gateway | JWT | 200/min | Per-type | Depth limit=10, cost limit
GET | /health, /metrics | internal | No auth | Unlimited | No cache | LB health checks only
5M+ Req/Second · Peak throughput handled
<5ms GW Overhead P99 · Auth + route + transform
<1ms JWT Validation · JWKS cached in memory
<2ms Rate Limit Check · Redis pipeline command
30s CB Open Duration · Before half-open probe
HTTP 429 Rate Exceeded · Retry-After header sent
15 min Access Token TTL · Refresh at 13 min
99.999% GW Availability · Active-active multi-AZ

Ingress Layer acts as the unified entry point for all traffic — enforcing auth, routing, rate limiting, and resilience patterns before passing requests into the Microservices Layer via the Service Mesh.  |  Standards: OAuth 2.0 RFC 6749 · PKCE RFC 7636 · JWT RFC 7519 · FIDO2/WebAuthn W3C · OpenAPI 3.1  |  2025 Production Design

05 Microservices & service mesh

🐳
Microservices Layer — Kubernetes · Istio Service Mesh
12 Domain Services · Container Orchestration · Service Mesh · Autoscaling · GitOps · 2025
④ Microservices Layer
12 Microservices · 1000+ Pods (peak) · 50+ K8s Nodes · mTLS All Svc Traffic · HPA + VPA Autoscaling · GitOps Deployment Model · <30s Deploy Time · 99.99% Uptime SLA
Infrastructure Foundation — Kubernetes Control Plane · Istio Service Mesh
🐳
Kubernetes (EKS / GKE / AKS) — Container Orchestration
Manages container lifecycle, scheduling, scaling, and self-healing across all microservices
🧠 Control Plane
kube-apiserver — REST API, auth, admission etcd — distributed KV config store kube-scheduler — bin-pack pod placement controller-manager — reconcile loop cloud-controller — node/LB provisioning
⚙️ Worker Nodes
kubelet — node agent, pod lifecycle kube-proxy — iptables / eBPF networking Container runtime — containerd / CRI-O Karpenter — node auto-provisioner Node OS — bottlerocket / flatcar
📦 Workload Resources
Deployment + ReplicaSet — stateless pods StatefulSet — ordered, stable network ID DaemonSet — one pod per node (logging) CronJob — scheduled batch tasks HPA · VPA · KEDA — autoscaling
🔧 Config & Storage
  • ConfigMap — environment config (non-secret)
  • Secret — base64 encoded (Vault injector)
  • PersistentVolume — EBS / GCS / NFS
  • Namespace isolation — per team/env
  • RBAC — least-privilege service accounts
🕸️
Istio Service Mesh — Traffic, Security, Observability
Envoy sidecar proxy injected into every pod — handles all in-cluster communication transparently
🔀 Traffic Management
  • VirtualService — traffic routing rules per host
  • DestinationRule — circuit breaker, retries, mTLS
  • Gateway — L4/L7 ingress / egress control
  • Canary — weighted splits (10/90, 50/50)
  • Header-based routing — A/B test by cookie
🔒 Security
  • mTLS — SPIFFE/SPIRE auto-cert per pod
  • AuthorizationPolicy — RBAC at L7 (path)
  • PeerAuthentication — enforce STRICT mode
  • RequestAuthentication — JWT verification
  • Egress control — allowlist external services
📡 Observability
  • Envoy metrics → Prometheus (RED signals)
  • Distributed traces — auto-injected headers
  • Access logs — all svc-to-svc calls logged
  • Kiali — live service map + health graph
  • Grafana dashboards — mesh golden signals
🛡️ Resilience Policies
  • Retries — 3 attempts on 5xx, 502, 503
  • Timeout — per-route deadline (2s default)
  • Circuit breaker — consecutive 5xx threshold
  • Outlier detection — eject unhealthy hosts
  • Fault injection — chaos testing in staging
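The circuit-breaker policy above reduces to a small state machine: trip OPEN after N consecutive 5xx responses, hold OPEN for a cooldown (the Ingress section quotes 30s), then allow a single half-open probe. A sketch with illustrative thresholds:

```python
class CircuitBreaker:
    """CLOSED -> OPEN after N consecutive failures; OPEN -> HALF_OPEN after cooldown."""
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.consecutive_failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        if self.state == "OPEN" and now - self.opened_at >= self.open_seconds:
            self.state = "HALF_OPEN"   # cooldown elapsed: let one probe through
        return self.state != "OPEN"

    def record(self, success: bool, now: float) -> None:
        if success:
            self.consecutive_failures = 0
            self.state = "CLOSED"
        else:
            self.consecutive_failures += 1
            if self.state == "HALF_OPEN" or self.consecutive_failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = now

cb = CircuitBreaker(failure_threshold=5, open_seconds=30.0)
for t in range(5):                      # five consecutive 5xx responses
    cb.record(success=False, now=float(t))
tripped = cb.state                      # "OPEN"
blocked = cb.allow_request(now=10.0)    # still inside the 30s window
probe_ok = cb.allow_request(now=35.0)   # cooldown elapsed -> half-open probe allowed
```

Envoy's outlier detection implements the same idea per upstream host, ejecting it from the load-balancing pool instead of tracking an explicit state enum.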
Domain Microservices — Technology, Responsibilities & Interfaces
👤
User Service
Profile · Preferences · CQRS
Go · 10 pods
Responsibilities
  • CRUD user profiles, avatars, preferences
  • CQRS — write model PostgreSQL, read model Redis
  • Event publish: UserCreated, UserUpdated (Kafka)
  • Privacy: GDPR delete / export API
Tech Stack
Go 1.23 PostgreSQL Redis gRPC
Scale
HPA: 5–50 pods · CPU 70%
🔐
Auth & IAM Service
Token · Session · RBAC
Rust · 8 pods
Responsibilities
  • Issue & validate JWT access/refresh tokens
  • Manage OAuth 2.0 client credentials
  • Role & permission resolution (RBAC/ABAC)
  • MFA enrollment and verification flow
Tech Stack
Rust Redis (sessions) PostgreSQL REST + gRPC
Scale
HPA: 4–30 pods · CPU 60%
📝
Content Service
CRUD · Versioning · Rich Media
Java · 15 pods
Responsibilities
  • Create/edit/publish content with versioning
  • Rich-text storage: Quill Delta / ProseMirror
  • Tag taxonomy, categories, slug routing
  • CDN cache invalidation on publish events
Tech Stack
Java 21 / Spring Boot PostgreSQL MongoDB
Scale
HPA: 5–60 pods · RPS-based
🔔
Notification Service
Push · Email · SMS · In-App
Node.js · 12 pods
Responsibilities
  • Multi-channel delivery: push, email, SMS, in-app
  • Template engine — personalised, localised (i18n)
  • Delivery tracking, bounce, open-rate metrics
  • User preference centre — opt-in/opt-out per channel
Tech Stack
Node.js / BullMQ SendGrid Twilio · Firebase FCM
Scale
KEDA: Kafka lag · 5–80 pods
💳
Payment Service
PCI-DSS · Idempotent · Saga
Java · 8 pods
Responsibilities
  • Process charges via Stripe / Adyen / PayPal
  • Idempotency — Idempotency-Key header enforced
  • Saga pattern — compensating tx on failure
  • PCI-DSS scope isolation — dedicated namespace
Tech Stack
Java 21 PostgreSQL Vault (card data)
Scale
HPA: 4–20 pods · Strict mTLS
💬
Messaging Service
Real-Time · E2E Encrypt · Rooms
Erlang · 30 pods
Responsibilities
  • WebSocket connections — 100k+ per pod
  • End-to-end encryption (Signal Protocol)
  • Message persistence: Cassandra time-series
  • Presence, typing indicators, read receipts
Tech Stack
Erlang/OTP Cassandra Redis Pub/Sub
Scale
KEDA: WS conns · 10–100 pods
🗂️
File Service
Upload · CDN · Virus Scan
Go · 8 pods
Responsibilities
  • Presigned S3 URLs — direct client upload
  • Virus scan (ClamAV) on upload completion
  • Image processing — resize, WebP/AVIF convert
  • CDN URL generation with signed expiry
Tech Stack
Go AWS S3 / GCS ClamAV · Imgproxy
Scale
HPA: 4–30 pods · CPU 70%
🌐
Geo / Map Service
Location · Geofencing · Routing
C++ · 12 pods
Responsibilities
  • Real-time location updates via WebSocket
  • Geofencing — enter/exit event triggers
  • Nearest-entity query (PostGIS / Redis Geo)
  • Route calculation — Dijkstra / A* graph
Tech Stack
C++ / gRPC PostGIS Redis Geo
Scale
HPA: 6–40 pods · Lat-based
🎯
Recommendation Service
ML · Collaborative · Real-Time
Python · 20 pods
Responsibilities
  • Collaborative filtering (Matrix Factorisation)
  • Two-tower neural network for retrieval
  • Online serving via Feature Store (Feast)
  • A/B experiment routing per user segment
Tech Stack
Python / TF Serving Feast Redis (online)
Scale
GPU nodes: 8–40 · GPU util 80%
🛒
Order / Cart Service
Checkout · Saga · Inventory
Java · 15 pods
Responsibilities
  • Shopping cart — Redis-backed session state
  • Checkout — Saga orchestration across services
  • Inventory reservation with optimistic locking
  • Order state machine: pending → paid → shipped
Tech Stack
Java / Temporal PostgreSQL Redis (cart)
Scale
HPA: 8–50 pods · RPS + CPU
⚙️
Config & Feature Flag
LaunchDarkly · etcd · Rollout
Go · 6 pods
Responsibilities
  • Dynamic config updates without redeploy
  • Feature flags — gradual rollout, kill switch
  • SSE push — services receive flag changes < 1s
  • Audit log — who changed what flag, when
Tech Stack
Go etcd LaunchDarkly SDK
Scale
Fixed: 3–6 pods · HA quorum
Deployment Strategies
🔄Rolling Update — Default
  • maxSurge: 25% extra pods during rollout
  • maxUnavailable: 0 — zero downtime
  • Readiness gate before traffic shifts
  • Auto rollback on failed health checks
No downtime · Slow rollout
🔵Blue / Green — Major releases
  • Run v1 (blue) + v2 (green) simultaneously
  • Switch VirtualService weight 0→100 instantly
  • Instant rollback: flip weight back to blue
  • Double resource cost during transition
Instant cutover · Safe
🐦Canary Release — Risk control
  • Route 1% → 5% → 20% → 100% traffic
  • Istio VirtualService weighted routing
  • Auto-promote on SLO gate pass (Flagger)
  • Auto-rollback on error budget breach
Progressive · Flagger
Service Communication Patterns
🔄Sync — gRPC / REST
  • Request-response with mTLS via Envoy
  • Protobuf — compact binary, typed schema
  • HTTP/2 multiplexing — concurrent streams
  • Retries with backoff at mesh level
Low latency
🌊Async — Kafka Events
  • Domain events: CloudEvents spec
  • Exactly-once via idempotent producers
  • Consumer group per service — independent
  • Schema Registry (Avro) — contract safety
Decoupled
📡Pub/Sub — Redis
  • Fan-out notifications to WebSocket nodes
  • Ephemeral — no persistence, fire and forget
  • Channels per user / room / tenant
  • Keyspace events for cache invalidation
Real-time
🔗Saga — Temporal
  • Long-running distributed transactions
  • Durable workflow state — survives crashes
  • Compensating transactions on failure
  • Order checkout: reserve → charge → fulfill
Consistent
Autoscaling Policies
📈HPA — Horizontal Pod
  • Scale out/in pods based on custom metrics
  • CPU threshold: scale at 70% utilization
  • Custom: RPS, p99 latency, queue depth
  • Scale-down stabilization: 5 min cooldown
  • Metrics server + Prometheus adapter
Pods: 5–500
📊VPA — Vertical Pod
  • Right-size CPU & memory requests/limits
  • Off — recommendation only (no auto-apply)
  • Initial — apply on pod creation only
  • Prevents OOMKill and CPU throttling
  • Goldilocks dashboard — view recommendations
Resource-aware
KEDA — Event-Driven
  • Scale on Kafka consumer lag (msg count)
  • Scale on SQS queue depth or age
  • Scale to zero — no traffic = 0 pods
  • ScaledObject CR per workload / trigger
  • Scale-up latency: < 15 seconds
Scale-to-zero
Istio Service Mesh — Traffic Rules
Service | Retries | Timeout | Circuit Breaker | Traffic Split | mTLS
User Service | 3× on 5xx | 2s | 5 consec. 5xx | Canary 10% | STRICT
Payment Service | 1× (idempotent) | 5s | 2 consec. 5xx | Stable only | STRICT
Search Service | 3× on 5xx | 1s | 10 consec. 5xx | Canary 20% | STRICT
Messaging | 0× (WS sticky) | 30s | 5 consec. 5xx | Hash-based | STRICT
Notification | 3× on 5xx | 10s | 5 consec. 5xx | Canary 5% | STRICT
Recommendation | 2× on 5xx | 500ms | 3 consec. 5xx | Shadow 100% | STRICT
GitOps Pipeline — CI/CD Flow
💻Git Push — PR + review
🔁CI Pipeline — Build · Test · Lint
🐳Docker Build — Scan · Sign (Cosign)
📦Registry Push — ECR / GCR · tag SHA
📋Helm Chart PR — Update image tag
🔀ArgoCD Sync — Git = cluster state
🐦Flagger Canary — Progressive rollout
SLO Gate Pass — 100% traffic → prod
12
Microservices
Domain-driven bounded contexts
1000+
Pods Peak
Auto-scaled by HPA/KEDA
mTLS
All Traffic
SPIFFE identity per pod
< 30s
Deploy Time
ArgoCD sync + rollout
< 15s
Scale-Up
KEDA event-driven pods
GitOps
Deploy Model
ArgoCD · Flagger · Helm
100%
Canary Coverage
All services via Flagger
99.99%
Uptime SLA
Multi-AZ active-active

Microservices run inside Kubernetes with Istio injecting an Envoy sidecar into every pod — providing transparent mTLS, traffic shaping, retries, and distributed tracing with zero application code changes.  |  Standards: CNCF · OCI · CloudEvents · OpenTelemetry · SPIFFE · Helm 3 · ArgoCD GitOps  |  2025 Production Design

06Event streaming & async

⑤ Event Streaming & Async Processing

Distributed messaging · Stream processing · Workflow orchestration · 10B+ events/day

Apache Kafka Apache Pulsar Apache Flink Temporal.io NATS JetStream Schema Registry Redis Streams Exactly-Once Event Sourcing CQRS · Saga · Outbox
10B+
Events / Day
< 5ms
P99 Latency
2M+
Msgs / Second
99.99%
Durability SLA
3
Replication Factor
30 Days
Retention Policy
500+
Topics
EOS
Exactly-Once
Core Components
🟢
Apache Kafka — Distributed Commit Log
  • Kafka 3.7+ — KRaft mode (no ZooKeeper)
  • Partitions: 1024 per topic, key-based routing
  • Replication: RF=3, ISR ≥ 2 for durability
  • Producers: acks=all, idempotent + transactions
  • Consumers: consumer groups, parallel reads
  • Tiered storage: S3 offload for cold data
  • Strimzi: Kubernetes operator deployment
KRaft · ISR · Idempotent · Tiered Storage
🔵
Apache Pulsar — Multi-Tenant Messaging
  • Compute/storage separated: Brokers + BookKeeper
  • Multi-tenancy: namespaces, quotas, auth per tenant
  • Geo-replication: async cross-region sync
  • Pulsar Functions: serverless stream processing
  • Subscription types: Exclusive, Shared, Key_Shared
  • Schema evolution: built-in registry + backward compat
  • Offload: S3/GCS/HDFS for long retention
BookKeeper · Geo-Replication · Functions · Multi-Tenant
Apache Flink — Stateful Stream Processing
  • Event-time: watermarks, out-of-order handling
  • Windows: tumbling, sliding, session, global
  • State backends: RocksDB for large state, FS for small
  • Checkpointing: distributed snapshots, exactly-once
  • CEP: complex event pattern matching
  • Flink SQL: streaming SQL on Kafka sources
  • K8s Operator: native Kubernetes deployment
Watermarks · RocksDB · CEP · Flink SQL
🔮
Temporal.io — Durable Workflow Engine
  • Durable execution: code survives crashes, restarts
  • Saga orchestration: compensating transactions, rollback
  • Timers: durable delays up to years
  • Signal / Query: external interactions mid-workflow
  • Versioning: safe workflow code migration
  • Visibility: search attributes, workflow history
  • Namespaces: multi-tenant isolation per domain
Saga · Durable · Compensate · Timers
🚀
NATS JetStream — Cloud-Native Messaging
  • Core NATS: < 1ms pub/sub, fire-and-forget
  • JetStream: persistence, replay, exactly-once
  • Work queues: load-balanced consumer groups
  • KV / Object store: built-in key-value store
  • Leaf nodes: edge connectivity, low-bandwidth sites
  • Clustering: Raft consensus, no external deps
  • IoT use case: MQTT bridge, lightweight clients
JetStream · Raft · KV Store · IoT/Edge
📋
Schema Registry — Confluent / AWS Glue
  • Avro / Protobuf / JSON Schema support
  • Schema evolution: backward, forward, full compat
  • Producer validation: reject non-conforming msgs
  • Consumer deserialization: auto schema fetch by ID
  • Schema ID embedded: 4-byte magic header in payload
  • Governance: subjects, versioning, soft delete
  • Data catalog integration: Apache Atlas / DataHub
Avro · Protobuf · Backward Compat · Schema ID
Redis Streams — Low-Latency Event Bus
  • XADD / XREAD: append-only log, O(1) writes
  • Consumer groups: parallel processing with ack
  • XPENDING / XCLAIM: ownership transfer on failure
  • MAXLEN ~: approximate trimming, O(1) memory cap
  • Use cases: activity feeds, real-time notifications
  • Latency: sub-millisecond for small payloads
  • Cluster mode: hash slot partitioning
XADD · Consumer Groups · Sub-ms · Notifications
🔁
DLQ & Error Handling — Retry / Poison Pill
  • Dead Letter Queue: isolate unprocessable msgs
  • Retry policy: exponential backoff + jitter
  • Max retries: configurable per consumer group
  • Poison pill: schema violation → DLQ + alert
  • DLQ inspector: manual review + replay UI
  • Alerting: DLQ depth threshold triggers PagerDuty
  • Archival: S3 long-term storage for forensics
Backoff · Replay · Poison Pill · Alert
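The "exponential backoff + jitter" retry policy above is commonly implemented as full jitter: each retry sleeps a uniform random time in [0, min(cap, base · 2^attempt)], which spreads retry storms out instead of synchronising them. A sketch with illustrative base and cap values:

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform sample from [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

rng = random.Random(7)  # seeded so the sketch is reproducible
delays = [backoff_delay(a, rng=rng) for a in range(10)]
# attempt 0 waits at most 0.1s; later attempts are capped at 30s
```

After the configured max retries, the consumer stops retrying and routes the message to the DLQ topic instead of blocking the partition.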
End-to-End Event Pipeline
Producers (services · IoT · mobile apps) → Schema Registry (validate & serialize) → Kafka Brokers (partition + replicate) → Flink Processing (transform · aggregate) → Temporal Workflows (orchestrate sagas) → Consumer Groups (microservices · workers) → DLQ Handler (retry / replay) → Data Sinks (DB · cache · warehouse)
🟢 Kafka Architecture Deep-Dive — Internals & Configuration
Cluster Configuration
Mode: KRaft (no ZooKeeper)
Brokers: 9 (3 per AZ × 3 AZs)
Replication Factor: 3 (ISR ≥ 2)
Min In-Sync Replicas: 2
Default Partitions: 24 per topic
Max Partitions: 1024 per topic
Retention (Hot): 7 days on NVMe
Retention (Cold): 30 days on S3
Compression: zstd (best ratio)
Max Message Size: 10 MB
Strimzi Operator · Tiered Storage · Cruise Control · MirrorMaker 2
Producer & Consumer Config
PRODUCER
acks: all (strongest durability)
idempotent: true (dedup)
transactional.id: per-service ID
batch.size: 64 KB
linger.ms: 5ms (micro-batching)
compression: zstd
CONSUMER
isolation.level: read_committed
auto.offset.reset: earliest
enable.auto.commit: false (manual)
max.poll.records: 500
session.timeout.ms: 45 000
Partition Layout & Consumer Groups
Topic: user-events (8 partitions, RF=3)
P0 Leader — Broker-1 · P1 Leader — Broker-2 · P2 Leader — Broker-3 · P3 Leader — Broker-1
P4 Replica — Broker-2 · P5 Replica — Broker-3 · P6 Replica — Broker-1 · P7 Replica — Broker-2
Consumer Groups
analytics-cg: 8 consumers
search-indexer-cg: 4 consumers
notification-cg: 4 consumers
audit-log-cg: 2 consumers
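Key-based routing onto the partitions above is hash(key) mod partition-count, so one key always maps to one partition and per-key ordering is preserved. Kafka's default partitioner uses murmur2; this sketch substitutes a stdlib hash purely to show the mechanism:

```python
import hashlib

def partition_for(key: str, num_partitions: int = 8) -> int:
    """Deterministic key -> partition mapping (stand-in for Kafka's murmur2 partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p_first = partition_for("user-1234")
p_again = partition_for("user-1234")   # same key -> same partition: per-key ordering holds
used = {partition_for(f"user-{i}") for i in range(200)}  # keys spread over partitions
```

This is also why increasing the partition count reshuffles key placement: existing keys may hash to different partitions afterwards, which breaks ordering assumptions unless handled deliberately.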
Event Patterns · Delivery Guarantees · Stream Processing · Monitoring
Event-Driven Patterns
📥 Event Sourcing
All state changes stored as immutable events. Current state derived by replaying event log. Enables time-travel queries and audit.
↔ CQRS
Separate Read (Query) from Write (Command) models. Write to Kafka → project into read-optimized stores (Elasticsearch, Redis).
📤 Transactional Outbox
Write event to DB outbox table + business data atomically. Debezium CDC polls outbox → publishes to Kafka. Eliminates dual-write.
🔄 Saga Pattern
Choreography (event-driven, decoupled) or Orchestration (Temporal, explicit control). Compensating transactions on failure.
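The Transactional Outbox pattern above can be sketched with sqlite3 standing in for PostgreSQL: the business row and the event row commit in a single transaction, and a relay (Debezium's role in this architecture) later reads unpublished rows and hands them to Kafka, eliminating the dual-write gap:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def create_order(order_id: str, total: float) -> None:
    # Business write and event write share ONE transaction: no dual-write gap.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order-events", json.dumps({"type": "OrderCreated", "id": order_id})),
        )

def relay_once(publish) -> int:
    # Debezium-style relay: read unpublished rows, publish, mark done.
    rows = conn.execute("SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

published = []
create_order("o-1", 42.0)
count = relay_once(lambda topic, payload: published.append((topic, payload)))
```

If the process dies between the commit and the publish, the event is still in the outbox and the next relay pass picks it up, so delivery is at-least-once and consumers dedupe downstream.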
Delivery Guarantees
Semantic | Producer | Consumer | Use Case
At-Most-Once | acks=0 | Auto-commit before process | Metrics, telemetry
At-Least-Once | acks=all | Commit after process | Most services (idempotent)
Exactly-Once | Idempotent + Txn | read_committed + txn | Payments, inventory
Exactly-Once Semantics (EOS)
  • Producer: transactional.id + initTransactions()
  • Broker: PID epoch deduplication across restarts
  • Consumer: isolation.level=read_committed
  • Flink: 2-phase commit sink connector
  • Idempotency: dedupe key in payload + Redis TTL
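The last bullet (dedupe key in payload + Redis TTL) is the consumer-side half of exactly-once: under at-least-once redelivery, a processed-IDs set with a TTL makes the handler idempotent. A sketch with an in-memory store standing in for Redis SET NX EX:

```python
class DedupeStore:
    """In-memory stand-in for Redis `SET key 1 NX EX ttl`."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen: dict[str, float] = {}

    def first_time(self, key: str, now: float) -> bool:
        expiry = self.seen.get(key)
        if expiry is not None and expiry > now:
            return False          # duplicate within the TTL window
        self.seen[key] = now + self.ttl
        return True

processed = []

def handle(event: dict, store: DedupeStore, now: float) -> bool:
    if not store.first_time(event["event_id"], now):
        return False              # at-least-once redelivery: skip side effects
    processed.append(event["payload"])
    return True

store = DedupeStore(ttl_seconds=3600)
first = handle({"event_id": "e-1", "payload": "charge $10"}, store, now=0.0)
dup = handle({"event_id": "e-1", "payload": "charge $10"}, store, now=5.0)
```

The TTL bounds memory: it only needs to outlive the longest plausible redelivery window, not the full history of events.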
Flink Stream Processing
Windowing Strategies
  • Tumbling: fixed non-overlapping (1-min aggregates)
  • Sliding: overlapping windows (5-min/1-min slide)
  • Session: inactivity gap-based (user sessions)
  • Global: unbounded, trigger-controlled
Watermarks & Late Data
  • Event time: embedded timestamp in payload
  • Watermark: max-event-time − 5s allowed lateness
  • Side output: late events routed to DLQ topic
State & Checkpointing
  • State backend: RocksDB for TB-scale state
  • Checkpoint: every 30s → S3 (incremental)
  • Savepoint: manual for upgrades, migrations
  • Recovery: restart from last checkpoint
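The windowing, watermark, and late-data rules above can be shown without Flink: count events into 1-minute tumbling windows keyed by event time, advance the watermark to max event time minus 5s of allowed lateness, and route events whose window has already closed to a side output. A self-contained sketch:

```python
from collections import defaultdict

def tumbling_aggregate(events, window_ms=60_000, allowed_lateness_ms=5_000):
    """Count events per tumbling event-time window; too-late events go to a side output."""
    windows: dict[int, int] = defaultdict(int)
    side_output = []          # late events (would go to a DLQ topic)
    max_event_time = 0
    for ts, value in events:  # events arrive in processing order; ts is event time
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness_ms
        window_start = (ts // window_ms) * window_ms
        if window_start + window_ms <= watermark:
            side_output.append((ts, value))   # window already closed by the watermark
        else:
            windows[window_start] += 1
    return dict(windows), side_output

events = [
    (10_000, "a"), (20_000, "b"),   # window [0, 60s)
    (70_000, "c"),                  # advances watermark to 65s, closing [0, 60s)
    (30_000, "d"),                  # late: its window ended at 60s <= watermark 65s
]
windows, late = tumbling_aggregate(events)
```

Flink adds durable state and checkpointed recovery around exactly this logic; the decision of "count it or side-output it" is the same comparison of window end against the watermark.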
Observability & Monitoring
Key Kafka Metrics
Consumer Lag (P99): 1 200 msgs
Throughput: 1.8M msg/s
Produce Latency P99: 3.2ms
Disk Usage: 68%
Under-Replicated Partitions: 0
DLQ Depth: 42
Tooling Stack
Prometheus JMX · Grafana · Kafka UI · Cruise Control · Burrow Lag · OpenTelemetry
Temporal Saga — Order Fulfillment Workflow
1
CreateOrder — persist to DB, emit OrderCreated event to Kafka
2
ReserveInventory — call Inventory Service, reserve SKU quantity
3
ChargePayment — Stripe/Braintree API, idempotency key = orderId
4
AllocateShipping — assign carrier, generate tracking number
5
SendConfirmation — email/SMS notification via Notification Service
⟳ Compensating Transactions (on failure)
C3
RefundPayment — void/refund Stripe charge
C2
ReleaseInventory — restore reserved quantity
C1
CancelOrder — mark CANCELLED, emit OrderCancelled event
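The forward steps and compensations above pair up as a stack: each completed step pushes its compensator, and a failure unwinds them in reverse order (C3 → C2 → C1). A minimal orchestration sketch; in production Temporal would run each step as a durable activity, and the step names here are illustrative:

```python
def run_saga(steps):
    """Run (name, action, compensate) steps; on failure, compensate in reverse order."""
    log = []
    compensations = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            compensations.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for comp_name, comp in reversed(compensations):
                comp()                         # undo side effects of completed steps
                log.append(f"compensated:{comp_name}")
            return False, log
    return True, log

def ok(): pass
def boom(): raise RuntimeError("carrier unavailable")

success, log = run_saga([
    ("CreateOrder",      ok,   ok),   # compensator: CancelOrder
    ("ReserveInventory", ok,   ok),   # compensator: ReleaseInventory
    ("ChargePayment",    ok,   ok),   # compensator: RefundPayment
    ("AllocateShipping", boom, ok),   # fails -> unwind the three steps above
])
```

Because compensators can themselves be retried, they must be idempotent, which is why the payment step carries an idempotency key equal to the order ID.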
Backpressure & Auto-Scaling
Consumer Lag-Based Scaling (KEDA)
  • Trigger: Kafka consumer group lag > 10 000 msgs
  • ScaledObject: min=2, max=50 consumer pods
  • Scale-up: lag × 0.001 = target replicas
  • Cooldown: 60s before scale-down to prevent thrash
  • KEDA Scaler: kafka-topic-lag-percentage
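The scale-up rule above (target = lag × 0.001, i.e. one consumer pod per 1 000 lagging messages, clamped to the ScaledObject's min/max) is simple arithmetic:

```python
import math

def target_replicas(consumer_lag: int, msgs_per_replica: int = 1_000,
                    min_replicas: int = 2, max_replicas: int = 50) -> int:
    """KEDA-style sizing: one replica per msgs_per_replica of lag, clamped to [min, max]."""
    desired = math.ceil(consumer_lag / msgs_per_replica)
    return max(min_replicas, min(max_replicas, desired))

quiet = target_replicas(0)          # floored at min=2
busy = target_replicas(12_500)      # ceil(12.5) -> 13 replicas
storm = target_replicas(500_000)    # capped at max=50
```

The max clamp matters because consumer parallelism is also bounded by partition count: more consumer pods than partitions in a group simply sit idle.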
Backpressure Strategies
  • Producer throttle: broker quota enforcement per client
  • Rate limit topics: quota.producer.byte.rate per user
  • Avoid unbounded lag: alert at lag > 100K + PagerDuty P2
  • Flink: credit-based flow control between operators
  • RocksDB state: async I/O prevents blocking pipeline
Geo-Replication (MirrorMaker 2)
  • Active-Active: us-east-1 ↔ eu-west-1 bidirectional
  • Active-Passive: ap-southeast-1 (DR, async)
  • Offset sync: translated offsets per cluster
  • Heartbeat topic: lag between mirror clusters tracked
Topic Topology
Critical Topics
user-events: P=64, RF=3
order-commands: P=32, RF=3
payment-events: P=16, RF=3
notification-fanout: P=24, RF=3
audit-log: P=8, RF=3
dlq.*.errors: P=4, RF=3
cdc.postgres.*: P=8, RF=3
CDC Pipeline
PostgreSQL WAL → Debezium Connect → Kafka Topic → Search / Analytics
Debezium · Outbox CDC · EOS

07Data layer

⑥ Data Layer — Polyglot Persistence

Right database for the right job · ACID · Eventual Consistency · CAP Theorem · 10B+ users · Multi-region replication

PostgreSQL · Citus Apache Cassandra Redis · Valkey S3 · Object Storage Elasticsearch · OpenSearch MongoDB · Atlas ClickHouse · OLAP TiDB · CockroachDB CAP Theorem CQRS · Event Sourcing
8+
Database Types
<1ms
Redis P99
<5ms
Postgres P99
10PB+
Total Data
99.999%
Durability
Geo-Replication
RPO 0s
Sync Replication
RTO <30s
Failover
Core Data Stores — Polyglot Selection
🐘
PostgreSQL — Primary OLTP · Citus · Aurora
  • Engine: PostgreSQL 16 + Citus sharding extension
  • Hosted: AWS Aurora Postgres / RDS Multi-AZ
  • ACID: full transactions, MVCC, row-level locking
  • Read replicas: up to 15 Aurora replicas, load-balanced
  • Sharding: Citus distributed tables by tenant/user_id
  • Extensions: pgvector (AI), TimescaleDB, pg_partman
  • Pooling: PgBouncer transaction mode, 10K connections
ACID · Citus · Aurora · pgvector
🔶
Apache Cassandra — Wide-Column · Geo-Distributed
  • Model: wide-column, partition key → clustering key
  • Consistency: tunable — LOCAL_QUORUM for balance
  • Write path: Memtable → CommitLog → SSTable flush
  • Replication: NetworkTopologyStrategy, RF=3 per DC
  • Use cases: time-series, activity feeds, IoT, audit logs
  • Compaction: TWCS for time-series, LCS for reads
  • Alternatives: ScyllaDB (10× Cassandra throughput, C++)
Tunable Cons. · RF=3 · ScyllaDB · TWCS
Redis / Valkey — In-Memory · Cache · Streams
  • Cache-aside: L1 local → L2 Redis → L3 DB fallback
  • Data structures: String, Hash, List, Set, ZSet, Stream
  • Leaderboards: ZRANGEBYSCORE on sorted sets
  • Rate limiting: sliding window with sorted sets + Lua
  • Sessions: 30-min TTL, hashed session token key
  • Pub/Sub: real-time notifications, channel fan-out
  • Cluster: Redis Cluster, 16K hash slots, 3 primary+replica
Cache-Aside · ZSet · Cluster · Valkey
🪣
Object Storage — S3 · GCS · Data Lake
  • AWS S3: 11 nines durability, versioning, MFA delete
  • Data lake: raw zone → curated zone → consumption zone
  • Formats: Parquet (columnar), ORC, Avro, Delta Lake
  • Lifecycle: S3 Standard → IA (30d) → Glacier (90d)
  • Query: Athena (serverless SQL), S3 Select, Iceberg
  • CDN: CloudFront pre-signed URLs for media delivery
  • Encryption: SSE-KMS per bucket, TLS in-transit
Parquet · Delta Lake · Iceberg · Athena
🔍
Elasticsearch — Search · Analytics · OpenSearch
  • Full-text: inverted index, BM25 relevance scoring
  • Ingestion: Debezium CDC → Kafka → Logstash → ES
  • Aggregations: faceted search, histogram, geo-distance
  • KNN search: dense vector ANN with HNSW index
  • Index lifecycle: hot→warm→cold→frozen→delete (ILM)
  • Sharding: 5 primary shards, 1 replica, per-index routing
  • Hybrid: BM25 + dense vector reciprocal rank fusion
BM25 · HNSW · ILM · OpenSearch
🍃
MongoDB — Document Store · Atlas
  • Model: flexible BSON documents, nested arrays/objects
  • Atlas: fully managed, multi-cloud, global clusters
  • Transactions: multi-document ACID since v4.0
  • Change streams: real-time CDC via oplog tailing
  • Aggregation pipeline: $lookup, $unwind, $group, $facet
  • Atlas Search: full-text built-in (Lucene), no ES needed
  • Use cases: product catalog, user profiles, CMS content
BSON Docs · Atlas · Change Streams · ACID
📊
ClickHouse — Columnar OLAP · Real-time Analytics
  • Columnar: MergeTree family, vectorised execution
  • Ingestion: Kafka engine reads directly from topics
  • Speed: 10B+ row scans in seconds, 100× faster than PG
  • Compression: LZ4 / ZSTD per column, 10:1 ratio typical
  • Materialized views: real-time aggregations on ingest
  • ReplicatedMergeTree: HA via ZooKeeper / ClickHouse Keeper
  • Use cases: dashboards, funnels, cohorts, product analytics
MergeTree · Vectorised · Kafka Engine · Mat. Views
🌐
NewSQL / TiDB — Distributed SQL · Global ACID
  • TiDB: MySQL-compatible, horizontal scale-out OLTP
  • CockroachDB: Postgres-compatible, serializable isolation
  • Raft consensus: strong consistency across geo nodes
  • Auto-sharding: TiKV range-based partitioning, auto-rebalance
  • HTAP: TiFlash columnar replica for analytics queries
  • Global txn: cross-region ACID without 2PC latency trap
  • Use cases: global inventory, financial ledger, multi-region write
TiDB · CockroachDB · Raft · HTAP
Data Routing — Query to Store
🔀 Polyglot Query Routing Layer
API / Service (business logic) → Data Access Layer (repository pattern) → PostgreSQL (OLTP · ACID · writes + reads) | Redis (L2 cache · sessions) | Cassandra (feeds · time-series) | Elasticsearch (full-text search) | MongoDB (catalog · profiles) | ClickHouse (OLAP analytics) | S3 (data lake · media)
Read path: Redis (L2 cache) → PostgreSQL read replica → Elasticsearch (search)  |  Write path: PostgreSQL (primary) + Kafka CDC → Cassandra / Elasticsearch / ClickHouse  |  Analytics: Kafka → ClickHouse / S3 (Data Lake)
Database Selection · Replication · Caching Strategy · Connection Management
Database Selection Matrix
Use Case | Store | Why
User accounts | PostgreSQL | ACID, relations
Activity feeds | Cassandra | Write-heavy, time-series
Session store | Redis | Sub-ms, TTL-native
Product catalog | MongoDB | Flexible schema
Full-text search | Elasticsearch | Inverted index, BM25
Analytics queries | ClickHouse | Columnar, vectorised
Media / backups | S3 | Infinite scale, 11×9s
Global write ACID | TiDB | Distributed SQL
Rate limiting | Redis | Atomic ZSet + Lua
AI embeddings | pgvector | HNSW in Postgres
Replication & Consistency Models
CAP Theorem Positioning
PostgreSQL: CP · strong
Cassandra: AP · eventual
Redis Cluster: AP-leaning · async replication, best-effort consistency
MongoDB: CP · tunable
Elasticsearch: CP · near-RT
TiDB: CP · serializable
Replication Targets
Postgres RPO: 0s (sync replica)
Cassandra RF: 3 per DC (3 DCs)
Redis AOF: fsync every second
S3 Durability: 99.999999999%
Multi-Tier Caching Strategy
Cache Hierarchy
L1
In-process — Caffeine/Guava (JVM heap), 500ms TTL, hot data only, ~10K items
L2
Redis Cluster — shared cache, <1ms P99, 30 min TTL, 100GB per node
L3
Read Replica — Postgres replica, <5ms P99, for cache miss fallback
Cache Patterns
  • Cache-Aside: app reads cache, on miss fetches DB + populates
  • Write-Through: write DB + cache atomically (sessions)
  • Write-Behind: write cache, async flush to DB (counters)
  • Read-Through: cache proxies DB transparently
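Cache-Aside, the first pattern above, in miniature: a read tries the cache, falls through to the database on a miss, and repopulates the cache for the next reader. Plain dicts stand in for Redis and PostgreSQL (TTL omitted for brevity):

```python
db = {"user:1": {"name": "Linh"}}      # stand-in for PostgreSQL
cache: dict[str, dict] = {}            # stand-in for Redis
stats = {"hits": 0, "misses": 0}

def get_user(key: str):
    if key in cache:                   # 1. try the cache first
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = db.get(key)                # 2. miss: fall through to the DB
    if value is not None:
        cache[key] = value             # 3. populate the cache for the next reader
    return value

first = get_user("user:1")    # miss -> DB read -> cache fill
second = get_user("user:1")   # hit
```

With a TTL on every key (as the Redis Key Design bullets require), stale entries age out even if an invalidation event is missed.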
Cache Invalidation
  • TTL-based: expire after N seconds (simple, safe)
  • Event-driven: Kafka → cache invalidation consumer
  • Versioned keys: key = resource:v{version}, bump on write
Connection Pooling & Backup / DR
Connection Pooling
PgBouncer mode: transaction pooling
Pool size: 100 per shard
Max connections: 10 000 (via PgBouncer)
Idle timeout: 60s
Redis pool: Lettuce async, 32 conns
Backup Strategy
  • Postgres: continuous WAL → S3 (Barman / pgBackRest)
  • Cassandra: Medusa snapshots → S3 every 6h
  • MongoDB: Atlas continuous backup, point-in-time restore
  • Redis: RDB snapshot hourly + AOF every second
  • Retention: 7 days hot, 30 days warm, 1 year cold (Glacier)
Failover
Postgres RTO: < 30s (Aurora)
Redis Sentinel: < 10s failover
Cassandra: zero (AP, no failover needed)
Data Partitioning & Sharding Strategies
PostgreSQL — Citus Sharding
1
Distribution column: tenant_id or user_id
2
Shard count: 32 shards per table, consistent hash
3
Co-location: same tenant rows on same shard node
4
Reference tables: replicated to all shards (lookup tables)
5
Rebalance: online shard rebalancer moves shards with zero downtime
Partitioning (native PG)
events table: range by month
audit_logs: range by day
orders: hash by order_id
notifications: list by region
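Citus-style distribution in miniature: hash the distribution column to one of 32 shards, so every row for a tenant lands on the same shard and single-tenant joins stay local. The hash here is an illustrative stand-in, not Citus's actual hash function:

```python
import hashlib

NUM_SHARDS = 32

def shard_for(tenant_id: str) -> int:
    """Hash the distribution column to a shard (illustrative stand-in for Citus)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Co-location: all rows keyed by the same tenant hash to the same shard,
# so single-tenant joins never cross shard boundaries.
rows = [("t-42", "order-1"), ("t-42", "order-2"), ("t-7", "order-3")]
placement = {order_id: shard_for(tenant) for tenant, order_id in rows}
```

Reference tables sidestep this routing entirely by being replicated to every shard, which is why small lookup tables join cheaply against any distributed table.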
Cassandra — Partition Design
  • Partition key: user_id (even distribution)
  • Clustering key: created_at DESC (latest-first reads)
  • Max partition: < 100MB, < 100K rows
  • Hotspot: avoid by adding bucket to key
  • Anti-pattern: no SELECT * without partition key
Redis Key Design
  • Namespace: app:entity:id:field
  • Max value: < 100KB per key (avoid big keys)
  • Scan: SCAN with MATCH, never KEYS * in prod
  • Expiry: all cache keys must have TTL set
CDC Pipeline & Cross-Store Sync
Change Data Capture Flow
1
PostgreSQL WAL logical replication slot enabled
2
Debezium connector tails WAL → publishes to Kafka topic per table
3
Kafka consumers: Elasticsearch indexer, ClickHouse sink, search cache invalidator
4
Schema registry validates Avro payload before downstream write
5
Exactly-once — Kafka transactions + idempotent consumer
Sync Latency SLAs
PG → Elasticsearch: <2s
PG → ClickHouse: <5s
PG → Redis (invalidate): <500ms
Kafka → S3 (archive): <60s
Debezium · Kafka CDC · Avro Schema · Exactly-Once
Performance & Observability
Query Performance
PG read P99: <5ms
Redis P99: <1ms
ES search P99: <20ms
CH analytics P99: <2s
Cache hit rate: 92%
Key Monitoring
  • PG: pg_stat_statements, slow query log >100ms
  • Redis: latency monitor, eviction rate, memory fragmentation
  • Cassandra: nodetool tablestats, read/write latency histograms
  • ES: index size, search latency, JVM heap pressure
pgBadger · RedisInsight · nodetool · Grafana

08Data platform & analytics

⑦ Data Platform, Analytics & BI

Lakehouse · ELT · dbt · Airflow · Spark · Real-time Analytics · Governance · Self-serve BI · 10PB+ scale

Snowflake · BigQuery Apache Spark · Delta dbt · SQL Transforms Apache Airflow Superset · Looker · Tableau DataHub · OpenMetadata Apache Druid · Pinot Great Expectations · MonteCarlo Medallion Architecture Data Mesh · Governance
10PB+
Data Lake Size
500+
dbt Models
1000+
DAGs / Airflow
<1s
Druid Query P99
200+
BI Dashboards
99.9%
Pipeline SLA
Real-time
Streaming Analytics
Data Mesh
Domain Ownership
Core Platform Components
🏔
Data Warehouse — Snowflake · BigQuery · Redshift
  • Snowflake: virtual warehouses, auto-suspend/resume, zero-copy clone
  • BigQuery: serverless, slot-based, BI Engine for sub-second
  • Separation: compute/storage decoupled, pay-per-query
  • Data sharing: Snowflake Marketplace, secure live shares
  • Time travel: query historical data up to 90 days
  • Iceberg tables: open format, interop with Spark/Trino
  • Clustering: micro-partition pruning on cluster keys
Snowflake · BigQuery · Iceberg · Time Travel
Apache Spark — Batch · Streaming · Delta Lake
  • Spark 3.5: Adaptive Query Execution (AQE), dynamic pruning
  • Delta Lake: ACID on object storage, schema enforcement
  • Structured Streaming: micro-batch, watermarks, exactly-once
  • Photon engine: Databricks native vectorised C++ execution
  • Kubernetes: Spark Operator, dynamic resource allocation
  • Unity Catalog: fine-grained access, column-level masking
  • Optimization: Z-ordering, liquid clustering, auto-compact
Spark 3.5 · Delta Lake · AQE · Photon
🔧
dbt (Data Build Tool) — ELT · SQL Transforms · Lineage
  • Models: SQL SELECT statements, auto-generated DDL
  • Materializations: view, table, incremental, ephemeral
  • Incremental: merge strategy — only process new/changed rows
  • Tests: not_null, unique, accepted_values, relationships
  • Lineage graph: DAG from sources → staging → marts
  • dbt Metrics: semantic layer — single source of metric truth
  • CI: dbt test + dbt build in GitHub Actions on every PR
Incremental · Semantic Layer · Lineage DAG · Tests
🌀
Apache Airflow — Orchestration · DAG Scheduling
  • Airflow 2.8: TaskFlow API, dynamic task mapping
  • Executors: KubernetesExecutor — each task = isolated pod
  • DAG versioning: git-sync sidecar pulls DAGs from repo
  • Sensors: S3KeySensor, ExternalTaskSensor, event-driven
  • Dataset-aware: trigger DAGs on upstream dataset updates
  • SLA alerts: PagerDuty on missed SLA, auto-retry 3×
  • Secrets: Airflow Connections backed by HashiCorp Vault
TaskFlow · K8s Executor · Dataset-aware · Vault
📈
BI & Dashboards — Superset · Looker · Tableau
  • Apache Superset: open-source, 40+ chart types, SQL Lab
  • Looker: LookML semantic model, embedded analytics
  • Tableau: executive dashboards, Tableau Server SSO
  • Metabase: self-serve for non-technical stakeholders
  • Cube.dev: headless BI, pre-aggregations, API layer
  • Caching: Superset result cache (Redis), query dedup
  • Embedding: iframe + JWT signed for product analytics
Superset · Looker · Cube.dev · Embedded
🗂
Data Catalog & Governance — DataHub · OpenMetadata · Atlas
  • DataHub: auto-discovery, lineage, ownership metadata
  • OpenMetadata: unified catalog, profiling, quality scores
  • Column lineage: field-level tracing from source to dashboard
  • PII tagging: auto-classify sensitive columns, masking policies
  • Data contracts: schema agreements between producers/consumers
  • RBAC: row/column-level security per team/role in Snowflake
  • GDPR: right-to-erasure automation, PII inventory registry
DataHub · Lineage · PII Masking · GDPR
Real-time Analytics — Apache Druid · Pinot · ClickHouse
  • Apache Druid: sub-second OLAP, Kafka-native ingestion
  • Apache Pinot: user-facing analytics, <10ms P99 at scale
  • ClickHouse: columnar, 10B rows/s scan, materialized views
  • Star-tree index: Pinot pre-aggregation for low-latency rollups
  • Lambda: batch (Spark) + speed (Druid) layers unified via API
  • Kappa: single Flink/Kafka stream, no batch layer
  • Use cases: product funnels, anomaly detection, user cohorts
Druid · Pinot · Star-Tree · Sub-second
Data Quality & Observability — Great Expectations · Monte Carlo
  • Great Expectations: expectation suites, data docs, checkpoints
  • Monte Carlo: ML-based anomaly detection, circuit breaker
  • dbt tests: freshness, uniqueness, referential integrity checks
  • Soda Core: SQL-based checks, SodaCL YAML config
  • Freshness: alert if table not updated within SLA window
  • Volume: row count deviation >20% triggers incident
  • Schema drift: auto-detect column add/remove/type change
Great Expect. · Monte Carlo · Soda Core · Schema Drift
Modern Data Stack — End-to-End Pipeline
🔄 Lakehouse ELT Pipeline: Ingest → Transform → Serve
Sources (app DBs · APIs · SaaS) → Ingestion (Fivetran · Debezium · Airbyte) → Bronze — S3 / GCS (raw · Avro · Parquet · Delta) → Spark Processing (clean · join · deduplicate) → Silver — Delta Lake (validated · conformed) → dbt Models (aggregate · business logic) → Gold — Warehouse (Snowflake · BigQuery) → BI / APIs (Superset · Looker · Cube.dev)
Orchestration: Airflow DAGs trigger each stage, dataset-aware dependencies  |  Quality gates: Great Expectations checks at Bronze→Silver, dbt tests at Silver→Gold  |  Real-time fork: Kafka → Druid/Pinot for sub-second serving (bypass batch)
Medallion Architecture · dbt DAG · Data Mesh · Quality SLAs
Medallion Architecture — Bronze · Silver · Gold
🟤 Bronze — Raw Zone
Exact copy of source. No transformations. Schema-on-read. Append-only. Partitioned by ingestion date. Retention: 2 years. Format: Parquet / Avro / JSON.
Raw Ingest · Append-only · S3
🔵 Silver — Cleansed Zone
Cleaned, deduplicated, validated, type-cast. Schema-on-write. Delta Lake with ACID. Conformed dimensions. Spark jobs scheduled by Airflow.
Delta ACID · Dedup · Validated
🟢 Gold — Business Zone
Business-ready aggregates, star schema, wide tables. dbt models materialised into Snowflake. Optimised for BI query patterns, role-based access.
Star Schema · dbt · Snowflake
dbt Model Lineage — DAG Example
Order Analytics Domain
src_postgres__orders → stg_orders
src_postgres__users → stg_users
stg_orders → int_order_items_enriched
int_order_items_enriched → fct_orders
stg_users → dim_customers
fct_orders → mart_revenue_daily
Model Config
fct_orders: incremental + unique_key
dim_customers: SCD Type 2 snapshot
mart_revenue: table + cluster_by date
stg_* models: view (no cost)
Data Mesh — Domain Ownership
4 Data Mesh Principles
1. Domain ownership: each business domain owns its data products (Orders, Users, Payments)
2. Data as product: domains publish discoverable, SLA-backed datasets via catalog
3. Self-serve infra: platform team provides templates, tooling, compute — domains self-onboard
4. Federated governance: global policies (PII, retention) + domain-level control
Domain Data Products
Orders domain: fct_orders · mart_revenue
Users domain: dim_customers · churn_risk
Payments domain: fct_transactions · fraud
Marketing domain: attribution · funnels
Data Product · SLA Contract · Federated Gov.
Data Quality SLAs & Observability
Quality Dimensions
Completeness: >99%
Freshness (Gold): <1 hr
Uniqueness: >99.9%
Volume anomaly: ±20% alert
Schema drift alerts: 0 silent
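The ±20% volume rule above reduces to a one-line deviation check; the function name and thresholds are illustrative:

```python
def volume_anomaly(row_count: int, baseline: int, threshold: float = 0.20) -> bool:
    """True when today's row count deviates more than ±threshold from baseline."""
    if baseline == 0:
        # No baseline yet: any rows at all counts as a deviation to inspect
        return row_count != 0
    return abs(row_count - baseline) / baseline > threshold

assert volume_anomaly(1_300_000, 1_000_000)      # +30% → trigger incident
assert not volume_anomaly(1_100_000, 1_000_000)  # +10% → within tolerance
```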
Pipeline SLA Monitoring
  • Airflow SLA Miss: PagerDuty P2, auto-retry 3× with backoff
  • dbt test failure: block downstream DAGs, Slack alert
  • Monte Carlo circuit breaker: halt dashboard refresh on anomaly
  • Data contracts: producer schema change requires consumer sign-off
Cost Controls
  • Snowflake: resource monitors, query timeout 60s, auto-suspend 5 min
  • Spark: spot instances 70%, auto-termination on idle
Ingestion Tools & ELT Patterns
Managed Connectors
Fivetran: 200+ SaaS connectors
Airbyte: open-source ELT
Debezium: CDC from PG/MySQL
Kafka Connect: streaming ingest
dlt (data load tool): Python-native ELT
Ingestion Patterns
1. Full refresh: truncate + reload (small tables, daily)
2. Incremental: updated_at watermark, append new rows
3. CDC: WAL → Kafka → Bronze (true real-time)
4. Log-based: event stream → Kinesis → S3 raw zone
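Pattern 2 (incremental watermark) can be sketched end-to-end: fetch only rows modified since the last run, then advance the watermark. The in-memory SQLite table stands in for the real source DB; names are illustrative:

```python
import sqlite3

# Source table with an updated_at column
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2025-01-01T00:00:00"),
    (2, 20.0, "2025-01-02T00:00:00"),
    (3, 30.0, "2025-01-03T00:00:00"),
])

def incremental_pull(conn, watermark: str):
    """Fetch rows newer than the watermark; return the rows and the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at", (watermark,)
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Last run stopped at noon on Jan 1; only rows 2 and 3 are pulled
rows, wm = incremental_pull(src, "2025-01-01T12:00:00")
```

The watermark would normally be persisted (e.g. in a state table) so each run appends only the delta to the Bronze zone.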
ELT vs ETL Choice
Aspect | ELT | ETL
Transform where | In warehouse | In pipeline
Raw data kept | Yes ✓ | Often no
Reprocessing | Easy (replay) | Hard
Tool | dbt | Spark / Flink
Best for | SQL analytics | Complex logic
Airflow Schedule
Bronze load: every 15 min
Silver transform: hourly
Gold dbt run: every 4 hours
BI cache refresh: every 30 min
Real-time Analytics Architecture
Lambda Architecture
⚡ Speed Layer
Kafka → Flink → Druid/Pinot
Sub-second, approximate
Recent data (7 days)
Low latency user queries
🏋 Batch Layer
S3 → Spark → Snowflake
Accurate, complete
Historical all-time
Complex aggregations
Apache Druid Config
Ingestion: Kafka native, real-time
Query latency: <1s P99 (10B rows)
Roll-up: pre-aggregation at ingest time
Segments: time-partitioned, 500 MB
Retention: 90 days hot, S3 deep storage
Apache Pinot — User-Facing
  • Star-tree index: pre-aggregated for common rollup queries
  • Sorted index: range filters on timestamp, user_id
  • Upsert support: latest value via primary key dedup
  • Multi-stage: distributed join engine for complex queries
Semantic Layer & Metrics Store
dbt Semantic Layer
  • Metrics: revenue, DAU, churn defined once in dbt
  • Single truth: all tools query same metric definition
  • Dimensions: date, region, product, user_segment
  • MetricFlow: compile metric queries to SQL per DW
Cube.dev (Headless BI)
  • Pre-aggregations: materialise rollups for <100ms queries
  • REST + GraphQL API: serve metrics to any frontend
  • Access control: row-level security per user role
Key Business Metrics
DAU / MAU: Pinot real-time
Revenue: dbt fct_orders
Churn rate: ML model → BQ
Funnel conversion: Druid + Superset
MetricFlow · Cube.dev · dbt Metrics · Single Source

09 · Observability

📡
Observability Platform — Detailed Architecture
Metrics · Logs · Traces · Alerting · SLO · Incident Response · 2025
Observability
3 Pillars
M · L · T
1M+
Metrics / min
100TB+
Logs / day
99.9%
Trace Coverage
< 1 min
Alert Trigger
< 5 min
MTTA
99.95%
SLO Target
30 days
Retention (hot)
Observability Components — Technology Stacks
📈
Prometheus + Grafana
Metrics Collection · Dashboards · Alerting
Metrics Pillar
Stack
Prometheus 3.x · Grafana 11 · Thanos (long-term) · VictoriaMetrics
Collection
  • Pull model — scrape /metrics every 15s
  • Pushgateway — batch jobs, short-lived pods
  • Remote Write — fan-out to Thanos / Mimir
  • Service discovery via Kubernetes SD
Key Metric Types
Counter · Gauge · Histogram · Summary
Grafana Features
  • Unified dashboards — metrics + logs + traces
  • Alerting — multi-dimensional, silencing
  • OnCall — built-in on-call scheduling
Query
PromQL · HTTP API · Remote Write
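What a Prometheus scrape actually pulls is a plain-text exposition document. Production services use a client library (e.g. prometheus_client) rather than building it by hand; this hand-rolled renderer is only a sketch of the format, with illustrative metric names:

```python
def render_metrics(metrics: list) -> str:
    """Render (name, type, labels, value) tuples into Prometheus text exposition format."""
    lines = []
    typed = set()
    for name, mtype, labels, value in metrics:
        if name not in typed:                 # one TYPE line per metric family
            lines.append(f"# TYPE {name} {mtype}")
            typed.add(name)
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics([
    ("http_requests_total", "counter", {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", "counter", {"method": "POST", "status": "500"}, 3),
    ("process_open_fds", "gauge", {}, 42),
])
print(body)
```

Each 15-second pull retrieves a document like this; Prometheus stores the samples and PromQL queries aggregate across the label dimensions.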
🔍
Distributed Tracing
Jaeger · Tempo · OpenTelemetry
Traces Pillar
Stack
Jaeger · Grafana Tempo · Zipkin · AWS X-Ray
Instrumentation
  • OpenTelemetry SDK — auto & manual spans
  • Context propagation — W3C TraceContext header
  • Baggage propagation — user-ID, tenant-ID
  • Sampling — head-based 1%, tail-based 100% on errors
Span Attributes
service.name · http.status_code · db.statement · span.kind
Storage
  • Tempo backend: object storage (S3 / GCS)
  • Parquet format — columnar, fast query
Protocols
OTLP/gRPC · OTLP/HTTP · Jaeger Thrift
📋
Log Aggregation
ELK · Loki · Structured Logging
Logs Pillar
Stack
Grafana Loki · Elasticsearch · Kibana · AWS OpenSearch
Collection Pipeline
  • Fluent Bit (DaemonSet) — collect pod logs
  • Logstash / Vector — parse, enrich, route
  • Kafka buffer — decouple collection from storage
  • Structured JSON logs — no regex parsing
Log Fields (Standard)
timestamp · level · trace_id · service · user_id · duration_ms
Retention
  • Hot (Elasticsearch / Loki): 30 days — fast query
  • Warm (S3 / GCS): 1 year — compressed Parquet
  • Cold (Glacier): 7 years — compliance archive
Query
LogQL · Lucene · KQL
🚨
Incident Management
PagerDuty · Opsgenie · Runbooks
SRE On-Call
Stack
PagerDuty · Opsgenie · FireHydrant · Incident.io
Process
  • Alert fires → routed by service ownership
  • On-call paged via SMS + push + phone call
  • Slack war room auto-created with context
  • Incident declared → severity P1–P4
  • Postmortem auto-template after resolution
Runbook Automation
  • Automated diagnostics on alert trigger
  • Self-healing scripts (pod restart, cache flush)
Channels
SMS · Push · Phone · Slack · Email
🎯
SLI / SLO / SLA Tracking
Error Budget · Burn Rate · Reliability
99.95% Target
Tools
Sloth (Prometheus) · Nobl9 · OpenSLO spec · Grafana SLO plugin
Key SLIs
  • Availability: % requests returning 2xx
  • Latency: % requests completing < 200ms
  • Error rate: % requests returning 5xx
  • Throughput: requests per second sustained
Error Budget
  • 99.95% SLO = 21.9 min/month error budget
  • Burn rate alert: 14× at 1h → page immediately
  • Budget exhausted → freeze non-critical releases
Query
PromQL · OpenSLO YAML
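The budget arithmetic above, as a quick check (using an average 30.44-day month, which yields the 21.9-minute figure; the function names are illustrative):

```python
# Average month in minutes: 365.25 days/year ÷ 12 months × 24h × 60min ≈ 43,830
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12

def error_budget_minutes(slo: float) -> float:
    """Minutes of allowed unavailability per month at a given SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_rate / (1 - slo)

budget = error_budget_minutes(0.9995)   # ≈ 21.9 min / month
rate = burn_rate(0.007, 0.9995)         # 0.7% errors at a 99.95% SLO ≈ 14× burn
```

A sustained 14× burn would exhaust the whole monthly budget in roughly 43,830 / 14 ≈ 3,130 minutes, which is why the 1-hour 14× window pages immediately.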
🤖
Synthetic Monitoring
Proactive · Outside-In · Canary
24/7 Probing
Tools
Grafana k6 · Datadog Synthetics · Checkly · Playwright Cloud
Check Types
  • HTTP checks — endpoint up, status, latency
  • Browser checks — full user journey (Playwright)
  • API checks — contract correctness assertions
  • Multi-step transactions — login → checkout
  • SSL cert expiry — alert at 30/14/7 days
Frequency & Locations
Every 60s · 20+ global probe locations
Protocols
HTTP(S) · gRPC · TCP/DNS
👥
Real User Monitoring
Core Web Vitals · Session · UX
Client-Side
Tools
Grafana Faro · Datadog RUM · Elastic RUM · Sentry Performance
Core Web Vitals
  • LCP — Largest Contentful Paint (< 2.5s target)
  • INP — Interaction to Next Paint (< 200ms)
  • CLS — Cumulative Layout Shift (< 0.1)
  • TTFB — Time to First Byte (< 800ms)
Session Data
  • Session replay — privacy-safe pixel replay
  • JS error rate — stack trace + breadcrumbs
  • Resource timing — waterfall breakdown
Protocols
Beacon API · XHR/Fetch · Web Vitals API
🔭
OpenTelemetry
Unified Instrumentation · Collector
CNCF Standard
Components
OTel SDK (all langs) · OTel Collector · Auto-instrumentation · Semantic Conventions
Collector Pipeline
  • Receivers — OTLP, Jaeger, Zipkin, Prometheus
  • Processors — batch, filter, attribute enrichment
  • Exporters — Tempo, Loki, Prometheus, S3
  • Tail sampling — 100% errors, 1% success
SDK Languages
Go · Java · Python · Node.js · Rust · .NET · Ruby · PHP · C++
Signals
Traces · Metrics · Logs · Profiles
The Three Pillars of Observability
📊
Metrics — What is happening?
Quantitative measurements over time — aggregated, cheap to store
  • Infrastructure: CPU, RAM, disk I/O, network bytes, pod restarts
  • Application: request rate, error rate, latency percentiles (p50/p95/p99)
  • Business: orders/sec, revenue/min, active users, conversion rate
  • SLO: availability ratio, error budget remaining, burn rate
  • Database: query latency, connection pool, cache hit ratio
  • Queue: consumer lag, message rate, partition count
Prometheus · PromQL · Grafana · Thanos
📄
Logs — Why did it happen?
Event records with rich context — searchable, verbose, structured JSON
  • Application logs: request/response details, business events, errors
  • Access logs: NGINX / Envoy — IP, path, status, duration, user-agent
  • Audit logs: who did what, when — immutable, tamper-proof (WORM)
  • Security logs: auth attempts, permission denials, anomalies
  • Slow query logs: DB queries exceeding 100ms threshold
  • Correlation: trace_id on every log line links log ↔ trace
Loki · Elasticsearch · Fluent Bit · LogQL
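A structured JSON log line carrying the standard fields (including the correlating trace_id) can be sketched with the stdlib logging module; the JsonFormatter class, service name, and field choices here are illustrative assumptions, not a specific library's API:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line — no regex parsing needed."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",                          # illustrative service name
            "trace_id": getattr(record, "trace_id", None),  # links log ↔ trace
            "message": record.getMessage(),
            "duration_ms": getattr(record, "duration_ms", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# extra= attaches trace_id / duration_ms as attributes on the LogRecord
logger.info("payment captured",
            extra={"trace_id": "4bf92f3577b34da6", "duration_ms": 38})
```

Because every line is valid JSON with a trace_id, a log query in Loki or Elasticsearch can pivot straight to the matching trace in Tempo or Jaeger.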
🔗
Traces — Where did it happen?
End-to-end request journey across services — causal, low-level, sampled
  • Trace: one root span + child spans forming a DAG across services
  • Span: single operation with start/end time, attributes, events, status
  • Context propagation: W3C traceparent header flows across all hops
  • Service map: auto-generated from trace data — shows dependencies
  • Flame graph: visual breakdown of where latency is spent per span
  • Error attribution: root-cause service identified from 5xx span
Jaeger · Tempo · OpenTelemetry · TraceQL
Observability Data Pipeline
⚙️
Services / Pods
OTel SDK auto-instrument
OTLP / Prometheus / log stdout
🔭
OTel Collector
Receive · Process · Export
Fan-out to backends
📦
Kafka Buffer
Decouple + replay
Stream consume
🗄️
Signal Stores
Prometheus · Loki · Tempo
Query API
📊
Grafana
Unified dashboards
Alert evaluation
🚨
Alertmanager
Route · Dedupe · Silence
Notify on-call
📱
PagerDuty
Escalation policy
Page SRE
👷
SRE On-Call
Investigate → resolve
Alert Severity Levels
P1 — Critical · ack 5 min
Total outage · data loss risk · security breach. Full on-call team + management notified. War room opened immediately in Slack.
P2 — High · ack 15 min
Partial outage · degraded performance. Primary on-call engineer paged. SLO burn rate > 14× in 1h window.
P3 — Medium · ack 1 hr
Non-critical degradation · elevated errors. Ticket created, handled next business hours. Error budget burn-rate alert.
P4 — Low · ack 24 hr
Warning · approaching threshold. Slack channel notification only. No immediate action required.
On-Call Escalation Policy
1. Primary On-Call Engineer (0–5 min)
Paged via mobile push + SMS. Expected to acknowledge within 5 min.
2. Secondary On-Call, if no ack (5–15 min)
Backup engineer paged. Parallel phone call initiated. Slack bridge created.
3. Engineering Manager (15–30 min)
Manager paged. Stakeholder comms drafted. Customer status page updated.
4. VP Engineering + Incident Commander (30+ min)
Executive bridge. External comms. Customer notifications sent. SLA clock running.
SLO Definitions — Production Services
Service | SLI | SLO Target | Error Budget/mo | Window | Status
API Gateway | Availability (2xx rate) | 99.99% | 4.3 min | 30-day rolling | ✓ Met
API Gateway | Latency p99 < 200ms | 99.5% | 3.6 hrs | 30-day rolling | ✓ Met
Payment Svc | Success rate (non-5xx) | 99.99% | 4.3 min | 30-day rolling | ✓ Met
User Service | Availability | 99.95% | 21.9 min | 30-day rolling | ⚠ 89%
Search Svc | Latency p95 < 500ms | 99.9% | 43.8 min | 30-day rolling | ✓ Met
Messaging | Message delivery < 1s | 99.9% | 43.8 min | 7-day rolling | ✓ Met
CDN / Edge | Cache hit rate > 90% | 99.0% | 7.2 hrs | 30-day rolling | ✓ Met
Key Grafana Dashboard Types
🏠
System Overview
All services health at a glance · RED metrics (Rate/Error/Duration) · SLO burn rate summary
🔬
Service Deep Dive
Per-service latency heatmap · Error breakdown by endpoint · Trace → log correlation links
🐳
Kubernetes Cluster
Node CPU / memory / disk · Pod restarts and OOMKills · HPA scaling events timeline
💰
Business Metrics
Orders/revenue per minute · Funnel conversion rates · Active users by region
1M+
Metrics / min
Prometheus scrape rate
100 TB
Logs / day
Fluent Bit → Loki
99.9%
Trace Coverage
All error paths traced
< 1 min
Alert Trigger
Prometheus eval interval
< 5 min
MTTA
Mean time to acknowledge
30 days
Hot Retention
Fast query tier
1 year
Warm Retention
S3 compressed Parquet
7 years
Cold Archive
Compliance / Glacier

Observability enables teams to understand system behaviour in production without manual debugging. The three pillars — Metrics, Logs, Traces — must be correlated via a shared trace_id to achieve true observability.  |  Standards: OpenTelemetry CNCF · OpenSLO · SRE Book (Google) · W3C TraceContext  |  2025 Production Design

10 · DevOps

⚙ DevOps & Platform Engineering

CI/CD · GitOps · IaC · Progressive Delivery · Developer Experience · DORA Elite Performance

GitHub Actions ArgoCD · Flux Terraform · Pulumi Harbor · Cosign Kubernetes · Karpenter LaunchDarkly · OpenFeature Flagger · Argo Rollouts Backstage IDP DORA Elite GitOps · SRE
< 1hr
Lead Time
50+/day
Deployments
< 5min
MTTR
< 1%
Change Failure
99.9%
Pipeline Uptime
3 min
Build Time P50
0-touch
Prod Deploy
ELITE
DORA Band
Core Platform Components
🔄
CI/CD Pipeline: GitHub Actions · Tekton
  • GitHub Actions: matrix builds, reusable workflows
  • Tekton Pipelines: K8s-native CI for heavy workloads
  • Build cache: layer + Bazel remote cache (80% hit)
  • Parallelism: unit, integration, security scans in parallel
  • Artifacts: SBOM, attestation, signed OCI image
  • Branch strategy: trunk-based development, short-lived PRs
  • Environments: dev → staging → prod with gates
Actions · Tekton · Bazel Cache · SBOM
🌿
GitOps: ArgoCD · Flux CD
  • ArgoCD: declarative GitOps, ApplicationSets
  • Flux CD: image reflector, multi-tenancy support
  • Helm: templated manifests, OCI chart registry
  • Kustomize: overlay-based environment patches
  • Drift detection: auto-sync on manifest divergence
  • Rollback: git revert = instant cluster rollback
  • App of Apps: hierarchical dependency management
ArgoCD · Flux · Helm · Kustomize
🏗
Infrastructure as Code: Terraform · Pulumi · Crossplane
  • Terraform: cloud infra, modular + remote state (S3+DynamoDB)
  • Pulumi: TypeScript/Python IaC for complex logic
  • Crossplane: K8s-native cloud resource provisioning
  • Atlantis: PR-based Terraform plan/apply automation
  • Drift detection: daily Terraform plan scheduled scan
  • Policy: OPA/Conftest enforces tagging, cost controls
  • Modules: internal registry for approved patterns
Terraform · Pulumi · Crossplane · Atlantis
📦
Container Registry: Harbor · ECR · Cosign
  • Harbor: self-hosted OCI registry with Helm charts
  • ECR / GCR: managed regional mirrors for low latency
  • Cosign (Sigstore): keyless image signing + verification
  • Trivy: vulnerability scan on every image push
  • SBOM: Syft generates CycloneDX SBOM per image
  • Admission: Kyverno blocks unsigned/critical-CVE images
  • Replication: geo-replicated for multi-region pull speed
Harbor · Cosign · Trivy · SLSA L3
Kubernetes Platform: EKS · GKE · Karpenter
  • Cluster API: declarative cluster lifecycle management
  • Karpenter: node autoprovisioning, right-sized instances
  • Multi-cluster: prod / staging / dev isolated clusters
  • Namespaces: per-team isolation with ResourceQuotas
  • OPA Gatekeeper: policy enforcement at admission
  • Velero: cluster backup + disaster recovery
  • Kubecost: per-team cost attribution and budgets
Cluster API · Karpenter · Gatekeeper · Velero
🚩
Feature Flags: LaunchDarkly · OpenFeature
  • LaunchDarkly: targeting rules, % rollout, segments
  • OpenFeature: vendor-neutral SDK standard (CNCF)
  • Experimentation: A/B test tied to flag + metrics
  • Kill switch: instant feature disable without deploy
  • Context-aware: user, region, plan-based targeting
  • Audit log: who changed what flag and when
  • SDK: server-side (Node/Go/Java) + client-side (React)
LaunchDarkly · OpenFeature · A/B Test · Kill Switch
🚀
Progressive Delivery: Flagger · Argo Rollouts
  • Flagger: automated canary analysis with Istio/Nginx
  • Argo Rollouts: BlueGreen + canary with analysis runs
  • SLO gate: promote only if error rate < 1% + p99 < 200ms
  • Traffic split: Istio VirtualService weight shifting (5%→25%→100%)
  • Auto-rollback: instant on metric breach, zero manual action
  • Shadow mode: mirror traffic to new version, no user impact
  • Header routing: internal users get canary via x-canary header
Flagger · Canary · BlueGreen · SLO Gate
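The SLO gate that drives each canary step can be sketched as a promotion function: advance the traffic weight only while error rate stays under 1% and p99 under 200 ms, otherwise signal rollback. The thresholds follow the bullets above; the function itself is illustrative, not the Flagger API:

```python
WEIGHTS = [5, 25, 50, 100]   # canary traffic percentages, in promotion order

def next_weight(current: int, error_rate: float, p99_ms: float):
    """Return the next traffic weight, or None to signal auto-rollback."""
    if error_rate >= 0.01 or p99_ms >= 200:
        return None                              # metric breach → roll back
    idx = WEIGHTS.index(current)
    return WEIGHTS[min(idx + 1, len(WEIGHTS) - 1)]

assert next_weight(5, 0.002, 120) == 25      # healthy → promote
assert next_weight(50, 0.030, 120) is None   # error-rate breach → rollback
assert next_weight(100, 0.002, 120) == 100   # fully promoted, stay there
```

Flagger evaluates metrics like these from Prometheus at each step interval, which is how a bad release never reaches more than the current weight's share of users.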
🧑‍💻
Developer Experience: Backstage · Dev Containers
  • Backstage IDP: service catalog, golden path templates
  • Dev Containers: reproducible local dev via devcontainer.json
  • Telepresence: local code ↔ K8s cluster live bridge
  • Self-service: scaffold new service in < 5 min via Backstage
  • Docs-as-code: TechDocs auto-generated from repo markdown
  • API catalog: OpenAPI/AsyncAPI registered in Backstage
  • Scorecards: automated ownership, SLO, security health checks
Backstage · DevContainer · Telepresence · TechDocs
End-to-End CI/CD Pipeline — Code to Production
🔄 Automated Pipeline Stages
① Code Push + PR: trunk-based dev · short-lived branch · conventional commits
② Build & Unit Test: Docker BuildKit · Bazel cache hit · matrix: OS × arch
③ Security Scan: Semgrep SAST · Trivy image scan · Snyk SCA deps
④ Sign & Attest: Cosign keyless sign · SBOM generation · SLSA provenance
⑤ Push to Registry: Harbor + ECR mirror · Kyverno admission · CVE gate blocks CRITICAL
⑥ ArgoCD Sync: update Helm values · Git PR merged · ArgoCD auto-sync
⑦ Canary (Flagger): 5%→25%→50%→100% · SLO analysis · auto-rollback
⑧ Observe & Verify: Grafana SLO · synthetic checks · Slack notify
DORA Metrics · GitOps Reconciliation · Release Strategies · Platform SRE
DORA Elite Performance Metrics
Metric | Our Target | Band
Deployment Frequency | 50+ / day | Elite
Lead Time for Changes | < 1 hour | Elite
Time to Restore (MTTR) | < 5 min | Elite
Change Failure Rate | < 1% | Elite
Reliability (SLO) | 99.99% | High
Pipeline Efficiency
Build Time P50: 3 min
Test Coverage: 87%
Cache Hit Rate: 82%
PR Merge Time: 42 min
Rollback Rate: 0.4%
GitOps Reconciliation Loop
1. Developer merges PR → Helm values updated in config repo
2. ArgoCD detects Git diff (polls every 3 min or webhook)
3. Diff computed between desired state (Git) and live state (K8s)
4. Kyverno admission validates manifests against policies
5. ArgoCD applies manifests → K8s API server
6. Flagger intercepts Deployment → starts canary analysis
7. Grafana SLO gate passes → promote to 100% traffic
Sync interval: 3 min (+ webhook)
Drift auto-heal: enabled
Rollback trigger: git revert + push
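Step 3 of the loop, the desired-vs-live diff, can be sketched as a dictionary comparison. The dict-based state model and resource names here are illustrative, not ArgoCD's actual data structures:

```python
def diff_states(desired: dict, live: dict) -> dict:
    """Compare manifests rendered from Git (desired) against the cluster (live)."""
    out_of_sync = {}
    for name, manifest in desired.items():
        if live.get(name) != manifest:
            out_of_sync[name] = {"desired": manifest, "live": live.get(name)}
    # Live resources absent from Git are candidates for pruning
    pruned = [name for name in live if name not in desired]
    return {"out_of_sync": out_of_sync, "prune": pruned}

desired = {"deploy/api": {"image": "api:v2", "replicas": 3}}
live    = {"deploy/api": {"image": "api:v1", "replicas": 3},
           "deploy/old": {"image": "old:v9", "replicas": 1}}

result = diff_states(desired, live)
# deploy/api is out of sync (image drift); deploy/old is a prune candidate
```

With drift auto-heal enabled, anything in `out_of_sync` is re-applied from Git, which is why a manual `kubectl edit` in the cluster is reverted on the next sync.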
Release Strategies
🔵 Rolling Update (Default)
Gradually replace old pods. maxSurge=25%, maxUnavailable=0. Zero downtime. K8s native.
🟢 Blue / Green
Full parallel stack. Instant cutover via Istio weight shift. Instant rollback — keep Blue warm 30 min post-deploy.
🟡 Canary (Flagger)
5%→25%→50%→100% over 20 min. Prometheus SLO analysis each step. Auto-rollback on breach.
👥 Shadow / Dark Launch
Mirror 100% of traffic to new version. No user impact. Validate correctness and performance before cutover.
Flagger · Argo Rollouts · Istio Weights · Header Route
Platform SRE Practices
Error Budgets
  • SLO: 99.99% availability = 52 min/year budget
  • Budget burn: alert at 5% consumed in 1hr (P1)
  • Freeze: no deploys when < 10% budget remains
On-Call Runbooks
  • Incident: PagerDuty → Slack war-room → Zoom bridge
  • Runbooks: Backstage TechDocs, auto-linked from alert
  • Post-mortem: blameless, 48hr SLA, published to wiki
Toil Elimination
  • Target: < 20% SRE time on toil (Google SRE book)
  • Automation: self-healing pods, auto-restart, node drain
  • Chaos Eng: Chaos Monkey + LitmusChaos in staging
Error Budget · Blameless PM · LitmusChaos · Toil <20%
IaC Workflow — Terraform GitOps with Atlantis
Terraform Module Structure
1. Engineer opens PR with Terraform changes
2. Atlantis runs terraform plan on the PR
3. OPA/Conftest validates policy (tagging, costs)
4. Cost estimate shown via Infracost comment on the PR
5. Platform team approves, then Atlantis apply
6. State stored in S3 with a DynamoDB lock
7. Drift scan runs nightly → alerts if out of sync
Golden Module Library
vpc: networking
eks-cluster: compute
rds-postgres: data
redis-cluster: cache
s3-secure: storage
alb-ingress: networking
iam-role: security
monitoring: observability
Crossplane (K8s-native)
Composition: XR abstractions
Provider: AWS / GCP / Azure
Claim: dev self-service
Backstage Internal Developer Portal
Golden Path — New Service Bootstrap (< 5 min)
1. Dev opens Backstage → Create from Template
2. Selects language, team, SLO tier, dependencies
3. Scaffolder creates GitHub repo, CI workflow, Helm chart, ArgoCD app
4. Service auto-registered in catalog with ownership metadata
5. Grafana dashboard + PagerDuty route provisioned
Catalog Features
Service catalog: 400+ services
API catalog: OpenAPI + AsyncAPI
TechDocs: docs-as-code, mkdocs
Scorecards: maturity per service
Cost plugin: Kubecost per team
Scaffolder · TechDocs · Catalog · Scorecards
Environment Topology
🟢 Production: 3 AZ · HA
🟡 Staging: mirror of prod
🔵 Preview: per-PR ephemeral
🟠 Dev: shared namespace
Promotion Gates
  • dev→staging: CI green + security scan pass
  • staging→prod: E2E tests + manual approval
  • prod canary: SLO gate each traffic step
Cost Controls
  • Preview envs: auto-deleted after 24hr inactivity
  • Karpenter: spot instances for non-prod (70% savings)
  • Kubecost: per-team budgets + Slack spend alerts
Ephemeral Preview · Spot Instances · Budget Alerts

11 · Security

🔒
Security Platform — Detailed Architecture
Zero Trust · Defense-in-Depth · Secrets · SIEM/SOC · Compliance · AppSec · Supply Chain · 2025
Security
Zero Trust
Security Model
7 Layers
Defense-in-Depth
AES-256
Encryption
FIPS 140-2
Key Standard
< 1 hr
P1 Patch SLA
SOC 2 II
Compliance
100%
mTLS Coverage
SBOM
Supply Chain
Security Components — Technology Stacks & Controls
🗝️
Secret Manager
Vault · AWS KMS · Auto-Rotation
FIPS 140-2
Technologies
HashiCorp Vault · AWS KMS / Secrets Manager · GCP Secret Manager · Azure Key Vault
Secret Types
  • Database credentials — dynamic, short-lived (1h TTL)
  • API keys — versioned, automated rotation
  • TLS certificates — PKCS#12, auto-renewed via ACME
  • Encryption keys — HSM-backed, never exported
  • SSH keys — just-in-time, one-use CA signing
Access Patterns
  • Vault Agent — sidecar injection into pods
  • Kubernetes ESO (External Secrets Operator)
  • AWS IAM role — workload identity binding
Audit
Every read logged · Alert on anomaly · Immutable trail
🛡️
SIEM / SOC
Splunk · Sentinel · SOAR · Threat Intel
24/7 SOC
Technologies
Splunk Enterprise Security · Microsoft Sentinel · Elastic SIEM · Chronicle (Google)
Detection Sources
  • WAF & network flow logs (VPC Flow Logs)
  • CloudTrail / audit logs — API call anomalies
  • EDR (CrowdStrike) — endpoint telemetry
  • Identity logs — impossible travel, new device
  • Threat intel feeds — STIX/TAXII, VirusTotal
SOAR Automation
  • Auto-block IP on confirmed attack (iptables)
  • Auto-revoke compromised credential token
  • Slack alert + PagerDuty ticket creation
Standards
MITRE ATT&CK · STIX 2.1 · TAXII
🔏
Data Encryption
At Rest · In Transit · In Use
AES-256 / TLS 1.3
At Rest
  • AES-256-GCM — all databases, object storage
  • Envelope encryption — DEK wrapped by KEK
  • Column-level encryption — PII fields (Vault Transit)
  • S3 SSE-KMS — per-object keys, auto-managed
  • Full-disk encryption — NVMe drives (LUKS/BitLocker)
In Transit
  • TLS 1.3 + QUIC — all external traffic
  • mTLS — all service-to-service (Istio/SPIFFE)
  • HSTS preload — max-age=31536000, includeSubDomains
In Use (Confidential Computing)
  • Intel SGX / AMD SEV — encrypted memory enclaves
  • Homomorphic encryption — compute on ciphertext
Standards
FIPS 140-2 · PCI-DSS 4.0 · NIST SP 800-57
📜
Compliance & Governance
GDPR · SOC 2 · PCI-DSS · ISO 27001
Multi-Framework
Frameworks
GDPR / DPDPA · SOC 2 Type II · PCI-DSS 4.0 · ISO 27001 · HIPAA · NIST CSF 2.0
Continuous Compliance
  • CSPM — Wiz / Prisma Cloud policy enforcement
  • OPA / Gatekeeper — Kubernetes policy-as-code
  • AWS Config rules — auto-remediate misconfig
  • Evidence collection — Drata / Vanta automation
Privacy Controls
  • Data residency — regional storage per country law
  • Right-to-erasure — automated GDPR delete API
  • Consent management — OneTrust CMP
Audits
Annual pen-test · Quarterly scan · 3rd-party audit
🕸️
Zero Trust Network
BeyondCorp · SPIFFE · Micro-Seg
Never Trust
Technologies
SPIFFE / SPIRE · Istio PeerAuth · Cloudflare Access · BeyondCorp Enterprise
Principles Applied
  • Assume breach — every request re-verified
  • Least privilege — scoped, time-limited access
  • Explicit verification — identity + device + context
  • Micro-segmentation — no lateral movement
  • Continuous evaluation — session risk re-scored
Network Controls
  • SPIFFE SVIDs — short-lived X.509 identity certs
  • Istio AuthzPolicy — L7 path-level RBAC
  • Network policies — deny-all default, allowlist
Standards
NIST SP 800-207 · SPIFFE RFC · BeyondCorp
🪪
Identity & Access (IAM/PAM)
MFA · SSO · RBAC · JIT Access
Least Privilege
Technologies
Okta / Entra ID · CyberArk PAM · AWS IAM Identity Center · HashiCorp Boundary
Access Controls
  • RBAC — role-based, attribute-based (ABAC)
  • JIT access — elevated rights expire in 4h max
  • PAM — session recording for privileged SSH/RDP
  • SSO — SAML 2.0 / OIDC for all internal tools
  • MFA enforcement — phishing-resistant FIDO2
Lifecycle
  • Auto-provision — SCIM sync from HR system
  • Access review — quarterly certification
  • Auto-deprovision — off-board within 1h
Standards
SCIM 2.0 · SAML 2.0 · FIDO2 · OAuth 2.0
🛡️
Application Security
SAST · DAST · IAST · OWASP
Shift Left
Testing Tools
Semgrep (SAST) · OWASP ZAP (DAST) · Contrast Security (IAST) · Burp Suite (Manual)
CI/CD Integration
  • SAST on every PR — block merge on critical finding
  • Secret scanning — detect leaked API keys in code
  • SCA — dependency CVE check (Snyk / Dependabot)
  • DAST — nightly against staging environment
  • Fuzz testing — continuous corpus with libFuzzer
OWASP Top 10 Coverage
  • Injection, Broken Auth, XSS, SSRF, IDOR — all tested
  • Security headers — CSP, CORP, COOP, Permissions-Policy
Standards
OWASP ASVS · NIST SSDF · CWE Top 25
🔗
Container & Supply Chain
SBOM · Image Signing · SCA
SLSA Level 3
Technologies
Sigstore / Cosign · Trivy / Grype · Syft (SBOM) · Snyk Container
Image Pipeline Controls
  • Base image — distroless / Wolfi (minimal attack surface)
  • Trivy scan — CVE check before push to registry
  • Cosign sign — keyless signing via OIDC + Rekor
  • OPA admission — reject unsigned / critical-CVE images
  • SBOM generated — every build, stored in registry
Dependency Controls
  • Snyk / Dependabot — auto-PR for CVE patches
  • Private registry — no direct internet pulls in prod
Standards
SLSA · SPDX SBOM · CycloneDX · Sigstore
Defense-in-Depth — 7 Security Layers
🏰 Defense-in-Depth Architecture — Concentric Security Rings · No Single Point of Failure Each layer independently blocks attack classes — a breach of one does not compromise all
🌐 Perimeter L1
DDoS absorption (10 Tbps) · Anycast routing · BGP Flowspec block · GeoDNS anycast
WAF · Shield
🔥 Network Edge L2
WAF — OWASP rules · Bot management · TLS 1.3 termination · Rate limiting
Cloudflare
🚪 API Gateway L3
OAuth 2.0 + JWT auth · Input validation · Quota enforcement · Circuit breaker
Kong · Envoy
🕸️ Service Mesh L4
mTLS all svc traffic · SPIFFE identity · L7 AuthzPolicy · Egress allowlist
Istio · SPIRE
📦 Application L5
SAST / DAST / IAST · Dep vulnerability scan · Runtime RASP · Signed containers
Semgrep · Trivy
🗄️ Data Layer L6
AES-256 at rest · Column-level encrypt · PII tokenisation · DB firewall rules
Vault Transit
👁️ Observability L7
SIEM — 24/7 SOC · SOAR auto-response · Anomaly detection · Immutable audit trail
Splunk · Sentinel
Security Operations Center (SOC) — Detect to Resolve
📡
Data Sources
Logs · Flows · EDR
Ingest / SIEM
🔍
Correlation
MITRE rules + ML
Alert fired
⚖️
Triage
False positive filter
Confirmed threat
🚨
Incident Declared
Severity P1–P4
SOAR playbook
🤖
Auto-Contain
Block IP · Revoke token
SRE engaged
👷
Investigation
Forensics · RCA
Remediate
🩹
Eradicate
Patch · Rotate · Harden
Closed
📋
Post-Mortem
Lessons · Rule update
Incident Response Playbook (NIST)
1. 🔍 Prepare (ongoing)
IRP documented, SOC staffed 24/7, SOAR runbooks ready, quarterly tabletop exercises, IR retainer with a vendor.
2. 🚨 Detect & Analyse (< 15 min)
SIEM alert triggered. Analyst confirms true positive. Severity assigned. Timeline documented in ticketing system.
3. 🔒 Contain (< 30 min)
Isolate affected pod/node. Block malicious IPs at WAF. Revoke compromised credentials. Snapshot for forensics.
4. 🩹 Eradicate & Recover (< 4 hrs for P1)
Root cause removed. Patch deployed via canary. Secrets rotated. Service restored with heightened monitoring.
5. 📋 Post-Incident Review (≤ 5 days)
5-whys RCA. Blameless postmortem published. Detection rules improved. NIST CSF mapping updated.
Vulnerability Management Lifecycle
1. Continuous Scanning (24h cycle)
Trivy · Wiz · Qualys — scan containers, hosts, IaC, and dependencies every 24h and on every CI build.
2. Severity Classification (automated)
CVSS v3.1 + EPSS — exploit probability adjusts priority. CVEs enriched with threat-intel feeds.
3. Patch SLAs (enforced)
Critical (CVSS 9+): 1h patch/mitigate · High (7–8.9): 24h · Medium (4–6.9): 7 days · Low: 30 days.
4. Auto-PR Remediation (automated)
Dependabot / Snyk auto-creates PRs for transitive dependency upgrades, merged via CI gating.
5. Virtual Patching (< 4 hrs)
WAF rule deployed within hours of a 0-day disclosure while the code patch is prepared, buying remediation time.
6. Penetration Testing (annual)
Annual full-scope pen test by a certified firm (CREST/OSCP). Quarterly internal red-team exercises.
7. Metrics & Reporting (weekly)
MTTR per severity tracked in Grafana. Board-level security dashboard: open vulns, SLA breach rate, coverage.
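The patch-SLA table in step 3 maps naturally onto a lookup function that scanners can use to attach a remediation deadline to each finding; the boundaries follow the table, while the function name is illustrative:

```python
from datetime import timedelta

def patch_sla(cvss: float) -> timedelta:
    """Remediation deadline for a finding, keyed on its CVSS v3.1 base score."""
    if cvss >= 9.0:
        return timedelta(hours=1)    # Critical: patch or mitigate within 1h
    if cvss >= 7.0:
        return timedelta(hours=24)   # High
    if cvss >= 4.0:
        return timedelta(days=7)     # Medium
    return timedelta(days=30)        # Low

assert patch_sla(9.8) == timedelta(hours=1)
assert patch_sla(7.5) == timedelta(hours=24)
assert patch_sla(5.0) == timedelta(days=7)
assert patch_sla(2.1) == timedelta(days=30)
```

Adding the deadline to each finding's detection timestamp gives the due date whose breach rate feeds the weekly report in step 7.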
Threat Landscape — Top Attack Vectors
🧨 Credential Theft — Critical
Phishing / AiTM attacks · Credential stuffing (bot) · Control: FIDO2 MFA + SIEM anomaly
💉 Injection / RCE — High
SQLi · Log4Shell · SSRF · Supply chain exploit · Control: WAF + SAST + SBOM
☁️ Cloud Misconfiguration — High
Open S3 buckets · Over-permissive IAM roles · Control: CSPM + OPA policy
🔑 Insider Threat — Medium
Data exfiltration by employee · Privilege abuse via PAM · Control: UEBA + DLP + JIT
Compliance Framework Coverage
GDPR / DPDPA — Data Privacy: Active
Consent management, right-to-erasure API, data residency, DPO appointed
SOC 2 Type II — Trust Service Criteria: Certified
Security, availability, confidentiality, privacy — annual 3rd-party audit (Drata automated)
PCI-DSS 4.0 — Payment Card Security: Level 1
Cardholder data in isolated namespace, tokenisation, QSA-assessed annually
ISO 27001 — Information Security ISMS: Certified
ISMS scope covers all production systems, risk register maintained, surveillance audit annually
NIST CSF 2.0 — Cyber Security Framework: Aligned
Govern · Identify · Protect · Detect · Respond · Recover — all 6 functions mapped
Secure Development Lifecycle (SDL) — Security at Every Phase
📐 Design: threat modelling (STRIDE) · security architecture review · data flow diagram + trust boundaries · privacy-by-design review
💻 Code: secure coding guidelines · pre-commit secret scanning · peer security code review · IDE security plugins (Semgrep)
🔨 Build: SAST blocks on critical/high · SCA CVE dependency check · container scan (Trivy) · Cosign image signing
🧪 Test: DAST (OWASP ZAP nightly) · IAST (Contrast in staging) · fuzzing (libFuzzer, continuous) · security regression tests
🚀 Deploy: OPA admission control · image signature verify · canary with error budget gate · least-privilege K8s SA
👁️ Operate: SIEM 24/7 monitoring · runtime RASP protection · weekly threat hunting · patch SLA enforcement
🔁 Feedback: bug bounty (HackerOne) · quarterly red team · security postmortems · NIST CSF maturity review
Zero Trust
Security Model
Never trust, always verify
100%
mTLS Coverage
All svc-to-svc traffic
AES-256
Encryption at Rest
FIPS 140-2 HSM keys
< 1 hr
Critical Patch SLA
CVSS 9.0+ response time
SLSA L3
Supply Chain
Signed SBOM every build
SOC 2 II
Compliance
Annual 3rd-party audit
4h
JIT Access Max
Auto-expire elevated access
24/7
SOC Coverage
SIEM + SOAR + on-call SRE

Security is embedded at every layer — from perimeter DDoS absorption to encrypted data at rest — implementing Defense-in-Depth so no single control failure results in a breach.  |  Frameworks: NIST CSF 2.0 · ISO 27001 · SOC 2 Type II · PCI-DSS 4.0 · GDPR · OWASP ASVS · MITRE ATT&CK · SLSA  |  2025 Production Design

12 · AI / ML

🤖 AI / ML Platform

Model Training · Feature Store · LLMOps · RAG · Vector Search · ML Observability · Real-Time Inference

PyTorch · Ray Train · MLflow · Kubeflow · Feast · Tecton · KServe · Triton · vLLM · Claude · GPT-4 · Pinecone · pgvector · Evidently · Arize · RLHF · Fine-Tuning · RAG · Embeddings · A/B · Shadow Mode
<50ms — Inference P99
10K+ — Req / Second
500+ — Models Served
1B+ — Daily Predictions
256 GPU — Training Cluster
10TB+ — Feature Store
99.9% — Serving SLA
LLMOps — RAG · Agents
Core Platform Components
🧠
Model Training — PyTorch · Ray Train · DeepSpeed
  • PyTorch 2.x: torch.compile, dynamo, inductor backend
  • Ray Train: distributed training across GPU clusters
  • DeepSpeed ZeRO-3: LLM fine-tuning on 256×A100
  • FSDP: fully sharded data parallel for 70B+ models
  • Mixed precision: bf16 training, fp8 on H100s
  • Spot GPU: fault-tolerant checkpointing every 10 min
  • Hyperparameter: Ray Tune + Optuna Bayesian search
PyTorch 2.x · Ray Train · DeepSpeed · FSDP
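The spot-GPU bullet above depends on periodic checkpointing: a preempted job must resume from its last saved state with no loss of correctness. A minimal sketch of that resume logic (the toy loop, loss decay, and in-memory "checkpoint" are illustrative — a real job would serialize optimizer and model state to S3 every 10 minutes):

```python
import copy

def train(steps, ckpt_every, state=None, fail_at=None):
    """Toy training loop with periodic checkpointing; resumes from `state`."""
    state = copy.deepcopy(state) if state else {"step": 0, "loss": 100.0}
    ckpt = copy.deepcopy(state)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            return ckpt, False           # spot preemption: only checkpoint survives
        state["step"] += 1
        state["loss"] *= 0.9             # pretend the loss decays each step
        if state["step"] % ckpt_every == 0:
            ckpt = copy.deepcopy(state)  # persist (in reality: upload to S3)
    return state, True
```

Resuming from the checkpoint after a simulated preemption reproduces the uninterrupted run exactly, which is the property fault-tolerant training relies on.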
📊
Experiment Tracking — MLflow · Weights & Biases
  • MLflow: runs, params, metrics, artifact logging
  • W&B: rich visualizations, sweep hyper-param search
  • Model Registry: staged promotion (dev→staging→prod)
  • Artifact store: S3-backed, versioned model checkpoints
  • Lineage: dataset → code → model traceability
  • Comparison: multi-run diff, metric plots, confusion matrix
  • Git integration: auto-log commit SHA per training run
MLflow · W&B · Model Registry · Lineage
🗄
Feature Store — Feast · Tecton · Redis
  • Feast: open-source, offline (S3/BigQuery) + online (Redis)
  • Tecton: managed, real-time streaming features
  • Point-in-time joins: prevent training/serving skew
  • Online store: Redis for <5ms feature retrieval at inference
  • Offline store: Parquet on S3, Spark batch materialization
  • Feature versioning: immutable, named feature views
  • Transformation: on-demand compute + pre-computed batch
Feast · Tecton · Point-in-Time · Skew-Free
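Point-in-time correctness is what keeps the offline training set consistent with what the online store would have served: each label must only see the latest feature value at or before its own timestamp, never one from the future. A pure-Python sketch of the join (Feast and Tecton implement this at warehouse scale; the tuple layout here is illustrative):

```python
from bisect import bisect_right

def point_in_time_join(labels, features):
    """For each (entity, label_ts), pick the latest feature value with
    feature_ts <= label_ts — never a value from the future (no leakage)."""
    by_entity = {}
    for entity, ts, value in sorted(features, key=lambda f: (f[0], f[1])):
        by_entity.setdefault(entity, ([], []))
        by_entity[entity][0].append(ts)
        by_entity[entity][1].append(value)
    rows = []
    for entity, label_ts in labels:
        ts_list, values = by_entity.get(entity, ([], []))
        i = bisect_right(ts_list, label_ts)       # rightmost ts <= label_ts
        rows.append((entity, label_ts, values[i - 1] if i else None))
    return rows
```

A label timestamped before any feature observation correctly gets `None` rather than a leaked future value — that gap is exactly the training/serving skew the join prevents.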
Model Serving — KServe · Triton · TorchServe
  • KServe: K8s-native serving, InferenceService CRD
  • Triton Inference Server: GPU batching, TensorRT, multi-framework
  • TorchServe: PyTorch native, custom handlers
  • ONNX Runtime: cross-framework acceleration, 2–4× speedup
  • Dynamic batching: accumulate requests, GPU utilisation 90%+
  • Canary: traffic-split for model A/B within KServe
  • Autoscaling: KEDA on queue depth + HPA on GPU utilisation
KServe · Triton · TensorRT · ONNX RT
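Dynamic batching — accumulating requests until the batch is full or the oldest request has waited too long — is the mechanism behind the 90%+ GPU utilisation figure above. A simplified, single-threaded sketch of the flush policy (Triton does this concurrently against a live request queue; the time units here are arbitrary):

```python
def batch_requests(arrivals, max_batch, max_wait):
    """Group (arrival_time, req_id) pairs into batches: flush when the batch
    is full or when the oldest queued request has waited >= max_wait."""
    batches, current, opened_at = [], [], None
    for t, req in arrivals:
        if current and t - opened_at >= max_wait:
            batches.append(current)          # timeout flush: bound latency
            current, opened_at = [], None
        if not current:
            opened_at = t
        current.append(req)
        if len(current) == max_batch:
            batches.append(current)          # size flush: maximize GPU work
            current, opened_at = [], None
    if current:
        batches.append(current)
    return batches
```

The two flush conditions trade off against each other: a larger `max_batch` raises throughput, a smaller `max_wait` caps tail latency.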
💬
LLM Platform — vLLM · Claude · GPT-4 · Gemini
  • vLLM: PagedAttention, continuous batching, 24× throughput
  • LiteLLM: unified proxy — route across Claude/GPT/Gemini
  • Prompt registry: versioned templates, A/B experiments
  • Fine-tuning: QLoRA / LoRA on domain-specific data
  • RLHF: human feedback loop with TRL + reward model
  • Guardrails: NeMo Guardrails — safety, topic, PII filter
  • Cost routing: smart model selection by task complexity
vLLM · LiteLLM · QLoRA · Guardrails
🔍
Vector Database — Pinecone · Weaviate · pgvector
  • Pinecone: managed, serverless, 1B+ vector scale
  • Weaviate: hybrid search (BM25 + dense), multi-modal
  • pgvector: Postgres extension, HNSW + IVFFlat indexes
  • Milvus: open-source, ANN on GPU, distributed shards
  • Embeddings: text-embedding-3-large, BGE-M3, E5-large
  • HNSW index: ef=200, M=16, recall@10=97%
  • Metadata filter: pre-filter by tenant, date, category
Pinecone · Weaviate · pgvector · HNSW
🔁
ML Orchestration — Kubeflow · Airflow · Metaflow
  • Kubeflow Pipelines: DAG-based ML workflows on K8s
  • Airflow: batch feature engineering, data ingestion DAGs
  • Metaflow: Netflix-style, versioned step caching
  • Ray Workflows: dynamic pipelines, sub-step parallelism
  • Retraining triggers: drift detected → auto-retrain pipeline
  • Caching: step-level output cache — skip unchanged stages
  • Notifications: Slack alert on pipeline success/failure
Kubeflow · Airflow · Metaflow · Ray WF
📡
AI Observability — Evidently · Arize · Prometheus
  • Evidently AI: data drift, target drift, data quality reports
  • Arize Phoenix: embedding drift, latent space monitoring
  • PSI / KL-divergence: feature distribution shift detection
  • LLM evals: G-Eval, RAGAS faithfulness, relevance scores
  • Prediction logging: all predictions → S3 + BigQuery
  • Feedback loop: human labels vs. model predictions comparison
  • Alert: drift > threshold → PagerDuty + retrain trigger
Evidently · Arize · RAGAS · Drift Alert
ML Lifecycle Flow & RAG Pipeline
🔄 End-to-End ML Lifecycle
Raw Data (Lake · DB) → Kafka CDC → Feature Eng. (Spark/Feast) → Materialize → Train/Eval (PyTorch · Ray Train) → MLflow Registry (Stage → Promote) → KServe Serve (Canary A/B split) → Monitor Drift (Evidently) → Retrain?
🔍 RAG (Retrieval-Augmented Generation) Pipeline
User Query (API GW + Auth) → Embed (text-embedding-3-large) → Vector Search (Pinecone, top-k=10) → Re-rank (Cohere CrossEncoder) → LLM Generate (Claude 4 / GPT-4o) → Response (Stream + Cache)
Training Infrastructure · LLMOps · Model Evaluation · Cost Optimization
Training Infrastructure
GPU Cluster
GPU Type — NVIDIA H100 80GB
Cluster Size — 256 GPUs (32 nodes × 8)
Interconnect — InfiniBand 400 Gb/s
Spot Fraction — 60% (fault-tolerant)
Checkpoint — every 10 min → S3
Distributed Strategies
Strategy — Best For
Data Parallel — standard models, large batch
FSDP / ZeRO-3 — 70B+ LLMs, memory saving
Pipeline Parallel — sequential layer splits
Tensor Parallel — transformer attention layers
H100 SXM5 · NVLink · Flash Attn 2 · bf16
LLMOps — Production LLM Management
Prompt Engineering
  • Prompt Registry: versioned templates in Git, reviewed
  • A/B prompts: LaunchDarkly feature flags per user segment
  • Few-shot: dynamic example selection from vector store
Fine-Tuning Pipeline
1. Curate domain dataset, dedup + quality filter
2. QLoRA fine-tune on 4-bit base model (3×A100)
3. RLHF — reward model trained on human pref data
4. Eval on held-out set (MMLU, domain benchmarks)
5. Shadow deploy → compare vs. base model online
Inference Optimisation
vLLM PagedAttention — 24× throughput
Speculative decoding — 2× generation speed
Quantization (AWQ) — 4-bit, <1% quality drop
KV Cache — Redis prefix caching
Model Evaluation Framework
Offline Evaluation
Classification — F1, AUC-ROC, PR curve
Ranking — NDCG@10, MRR
NLP — BLEU, ROUGE, BERTScore
LLM — G-Eval, MT-Bench
RAG Faithfulness — RAGAS > 0.85
Online A/B Evaluation
  • Traffic split: 10% challenger vs 90% champion
  • Business KPIs: CTR, conversion, revenue per session
  • Statistical sig.: p<0.05, min 10K samples/arm
  • Guardrail metric: latency must not regress >10%
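The significance check in the bullets above is a standard two-proportion z-test on conversion counts. A minimal stdlib implementation (the p<0.05 cutoff and the 10K-samples-per-arm minimum are enforced by the caller, not inside the test itself):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns (z, two-sided p-value) comparing
    challenger conversion rate (b) against champion (a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With 10K samples per arm, a lift from 10% to 12% conversion yields z ≈ 4.5 — comfortably significant — while identical rates give p = 1.0, so the champion is retained.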
Shadow Evaluation
  • Mirror traffic to candidate model, compare outputs
  • Zero user impact — purely observational phase
Cost Optimization
GPU Efficiency
GPU Utilisation — 88%
Spot Savings — 65%
Cache Hit (LLM) — 42%
Quant. Cost Save — 70%
Inference Cost Strategies
  • Tiered routing: small model → large model escalation
  • Semantic cache: GPTCache — reuse similar query results
  • Quantization: AWQ 4-bit, GGUF for CPU inference
  • Batching: continuous batching, 90%+ GPU utilisation
  • Spot instances: preemption-safe Kubeflow pipelines
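The semantic-cache bullet above (GPTCache-style) reuses an earlier answer when a new query embeds close enough to a cached one. A minimal sketch — the embedding function is injected, vectors are assumed unit-normalised, and the 0.9 threshold is an illustrative choice, not a GPTCache default:

```python
class SemanticCache:
    """Reuse a cached answer when cosine(query, cached) >= threshold."""
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold, self.entries = embed, threshold, []

    def get(self, query):
        q = self.embed(query)
        best, best_sim = None, 0.0
        for vec, answer in self.entries:             # linear scan; real systems
            sim = sum(a * b for a, b in zip(q, vec)) # use an ANN index here
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Tuning the threshold trades hit rate against the risk of serving a stale or mismatched answer — the 42% LLM cache-hit figure above implies a fairly permissive setting backed by output validation.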
Model Routing Logic
Simple Q&A — Haiku 4.5
Code / Reasoning — Sonnet 4.6
Complex Agents — Opus 4.7
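The routing table above can be sketched as a keyword heuristic. A production router would use a learned complexity classifier (or LiteLLM routing rules); the keyword lists below are illustrative, and the tier names follow the table:

```python
def route_model(prompt):
    """Tiered cost routing: escalate from the cheapest model only when
    the prompt looks like code/reasoning or agentic work."""
    text = prompt.lower()
    if any(k in text for k in ("plan", "multi-step", "orchestrate", "agent")):
        return "Opus 4.7"        # complex agentic work
    if any(k in text for k in ("code", "debug", "prove", "reason")):
        return "Sonnet 4.6"      # code / reasoning
    return "Haiku 4.5"           # simple Q&A default
```

Combined with the semantic cache and quantization strategies above, tiered routing is typically the largest single lever on per-request inference cost.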
AI Agent Architecture & RAG Deep-Dive
Agentic Patterns
1. ReAct: Reason + Act loop with tool calls
2. Tool use: search, code exec, DB query, API
3. Memory: short-term (context), long-term (vector DB)
4. Multi-agent: orchestrator + specialist sub-agents
5. Human-in-loop: Temporal + approval checkpoints
Frameworks
LangChain · LlamaIndex · Claude SDK · AutoGen · CrewAI
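Pattern 1 above — the ReAct loop — reduces to: ask the LLM for an action, execute the tool, feed the observation back, repeat until it answers. A framework-free sketch with a scripted stand-in for the LLM (the tuple protocol and the `scripted_llm`/`tools` stubs are illustrative, not any framework's API):

```python
def react_agent(question, llm, tools, max_steps=5):
    """Minimal ReAct loop: the LLM emits either ("act", tool_name, arg)
    or ("finish", answer); tool observations feed back into the transcript."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = llm(transcript)
        if step[0] == "finish":
            return step[1]
        _, tool, arg = step
        transcript.append(("observation", tools[tool](arg)))
    return None  # step budget exhausted — a real agent would escalate

# Stub LLM: looks the question up once, then answers from the observation.
def scripted_llm(transcript):
    if transcript[-1][0] == "observation":
        return ("finish", transcript[-1][1])
    return ("act", "search", transcript[0][1])

tools = {"search": lambda q: {"capital of france": "Paris"}.get(q.lower(),
                                                                "unknown")}
```

The `max_steps` cap is the loop's guardrail; production agents pair it with the human-in-loop checkpoints from pattern 5.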
RAG Configuration
Chunk size — 512 tokens
Overlap — 64 tokens
Embed model — text-embedding-3-large
Dimensions — 3072
Top-K retrieve — 10 chunks
Re-rank top-N — 3 (Cohere)
Faithfulness — RAGAS > 0.85
Hallucination — < 2% (eval)
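The chunk-size/overlap settings above describe a sliding window over the token stream: each 512-token chunk repeats the last 64 tokens of its predecessor so no sentence is split across a retrieval boundary. A minimal chunker (defaults match the configuration; the usage below uses small sizes so the overlap is visible):

```python
def chunk(tokens, size=512, overlap=64):
    """Sliding-window chunking: fixed-size chunks with fixed overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

For example, `chunk(list(range(20)), size=8, overlap=2)` yields three chunks, with each chunk's first two tokens repeating the previous chunk's last two.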
Hybrid Search
Dense (ANN) — 70% weight
Sparse (BM25) — 30% weight
Fusion — Reciprocal Rank
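Reciprocal Rank Fusion with the 70/30 dense/sparse weighting above can be sketched as follows (k=60 is the conventional RRF constant; the weighting applied to each list's contribution is this document's configuration, not part of standard RRF):

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: each list contributes weight / (k + rank) per doc;
    documents are returned in descending fused-score order."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks (not raw scores) enter the formula, RRF sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.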
ML Monitoring & Drift Detection
Drift Types & Detection
Drift Type — Method — Threshold
Data Drift (input feature shift) — PSI, KS-test, KL-div — PSI > 0.2 → alert
Concept Drift (label relationship shift) — ADWIN, Page-Hinkley — accuracy drop > 3%
Prediction Drift (output distribution shift) — Chi-square, Wasserstein — p < 0.05
Embedding Drift (latent space shift) — Arize cosine dist. — centroid drift > 0.1
Automated Retraining Triggers
  • Drift alert → Kubeflow pipeline auto-triggered
  • Scheduled: weekly retraining for all production models
  • Data volume: retrain when 10K new labeled samples arrive
  • Business metric: CTR drop > 5% over 24h rolling window
Evidently · Arize · PSI · ADWIN · Auto-retrain
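The PSI > 0.2 alert threshold in the table above compares a feature's distribution between a reference window and live traffic. A minimal stdlib implementation (the bin count and the epsilon that guards empty bins are illustrative choices; Evidently computes this per feature):

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between two samples over shared bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # small epsilon avoids log(0) for empty bins
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions score 0; a shifted live window scores well above the 0.2 alert line, which then fires the auto-retrain trigger described above.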
Model Governance & Safety
Responsible AI
  • Fairness: demographic parity, equal opportunity audit
  • Explainability: SHAP values, LIME for black-box models
  • Bias checks: Fairlearn, AIF360 pre-deploy
  • Data privacy: differential privacy in training (DP-SGD)
LLM Safety
  • NeMo Guardrails: topic, safety, PII filters
  • Prompt injection: detection + sanitisation layer
  • Output validation: schema + toxicity classifier
Model Cards
  • Every model: intended use, limitations, eval results
  • Lineage: dataset + code SHA + training config
SHAP · Fairlearn · DP-SGD · Model Cards