
Linh Truong  ·  MA (Harvard), MBA  ·  LinhTruong.com  ·  Linh@Alumni.Harvard.edu

System Architecture

I designed this reference architecture to map the full structural anatomy of large-scale distributed systems — from client interfaces and edge security through ingress, Kubernetes & Istio service mesh, event streaming, polyglot data persistence, analytics & BI, observability, DevOps, security, and AI/ML infrastructure. Twelve layers, each with its own detailed diagram.

01 Master diagram

System Architecture — 10 Billion+ Users

Multi-Region · Cloud-Native · Zero Trust · Auto-Scaling · Production-Grade · 2025

10B+ Users · 99.999% SLA · <50ms P99 Latency · 5M+ Req/sec · 50+ Regions · 10 EB Data · Zero Trust Security · Auto-Scale Elastic Infra
🌎 US-EAST 🌎 US-WEST 🌍 EU-WEST 🌍 EU-CENTRAL 🌏 AP-SE 🌏 AP-NE 🌏 AP-SOUTH 🌍 ME-SOUTH 🌎 SA-EAST 🌍 AF-SOUTH
① Client Layer
🌐Web Browser React / Next.js · PWA SPA + SSR
📱iOS App Swift / SwiftUI Offline-First
🤖Android App Kotlin / Jetpack Offline-First
🖥️Desktop Electron / Tauri Cross-Platform
📡IoT / Devices MQTT · CoAP · BLE Billions
🔌3rd-Party API REST · GraphQL · SDK Partners
🥽AR / VR WebXR · Vision Pro Spatial
Wearables Watch · Smart Glass Health
HTTPS / WSS / gRPC / MQTT over TLS 1.3 + QUIC
② Edge & Security Layer
🛡️DDoS Shield Cloudflare / AWS Shield 10 Tbps+
CDN / Edge Cloudflare · 300+ PoPs Global Cache
🌍GeoDNS Route 53 · Anycast Latency-Based
🔥WAF OWASP Rules · Bot Filter L7 Firewall
🗺️Edge Workers CF Workers · V8 Isolates A/B · Personalize
🔐TLS Termination TLS 1.3 / HTTP/3 QUIC mTLS
🕵️Bot Management ML Fingerprint · CAPTCHA Real-Time
📦Static Cache Assets · Images · Fonts 95% Hit Rate
Filtered & Authenticated Traffic
③ Ingress Layer
⚖️Global Load Balancer L4/L7 · Health Checks Active-Active
🚪API Gateway Kong · Route · Transform REST / GraphQL
🔑Auth / IAM OAuth 2.0 · JWT · MFA Zero Trust
🚦Rate Limiter Token Bucket · Redis Per-User/IP
📡WebSocket GW Real-Time · Pub/Sub Bi-Directional
🔄GraphQL Federation Apollo Gateway · Stitching Unified Graph
🛑Circuit Breaker Hystrix · Bulkhead Fault Tolerance
📊gRPC Proxy Envoy Sidecar · Protobuf Service Mesh
Routed via Service Mesh (Istio / Linkerd) with mTLS
📡 Observability
📈Prometheus + Grafana Metrics · SLO · Alerts
🔍Distributed Tracing Jaeger · OpenTelemetry
📋Log Aggregation ELK / Loki · Structured
🚨Incident Mgmt PagerDuty · Runbooks
🚀 DevOps
🐳Kubernetes EKS/GKE · HPA · VPA
🔁CI/CD GitHub Actions · ArgoCD
🏗️IaC Terraform · Helm
④ Microservices Layer — Kubernetes · Istio Service Mesh
👤User Service Profile · Prefs · CQRS Go
📝Content Service CRUD · Rich Media Java
🔔Notification Push · Email · SMS Node.js
💳Payment Stripe · Idempotent Java · PCI-DSS
💬Messaging WebSocket · E2E Encrypt Erlang
🔍Search Elasticsearch · NLP Python
🗂️File Service Upload · Virus Scan Go
🌐Geo / Map Location · Geofencing C++
🎯Recommendation Collab Filter · Deep ML Python
📊Analytics Event Track · Funnel Python
🛒Order / Cart Checkout · Saga Java
⚙️Config / Flags LaunchDarkly · etcd Go
Async Events
⑤ Event Streaming & Async Processing
🌊Apache Kafka Event Streaming · 100M msg/s Partitioned
Apache Flink Stream Processing · CEP Real-Time
🐰RabbitMQ Task Queues · Dead Letter AMQP
🕐Temporal Workflow · Long-Running Durable Sagas
📬AWS SQS / SNS Fan-out · FIFO · DLQ Managed
🧮Spark Batch ETL · MapReduce Petabyte+
🔒 Security
🗝️Secret Manager Vault · AWS KMS · Rotate
🛡️SIEM / SOC Splunk · Threat Detection
🔏Encryption AES-256 · TLS 1.3
📜Compliance GDPR · SOC2 · PCI-DSS
🕸️Zero Trust Net mTLS · SPIFFE · BeyondCorp
🧠 AI / ML
🤖LLM Inference Claude · GPT-4 · Llama
🔬Feature Store Feast · Online + Offline
🚂ML Pipeline Kubeflow · MLflow
Cache-Aside · Write-Through · Read-Through — Redis L1 · CDN L2
⑥ Data Layer — Polyglot Persistence
⚡ Cache
Redis Cluster Session · Hot · Pub/Sub 1M ops/s
🗄️Memcached Object Cache · LRU Cache-Aside
🏦 Relational (OLTP)
🐘PostgreSQL Primary + Replicas · Citus ACID
🌍CockroachDB Geo-Distributed SQL NewSQL
🐬Aurora MySQL Multi-AZ · Serverless v2 Managed
📦 NoSQL
🏛️Cassandra Write-Heavy · Time-Series Wide-Col
DynamoDB Single-Digit ms · Global Serverless
🍃MongoDB Flexible Schema · Atlas Document
🔎 Search / Graph / TS
🕸️Neptune / Neo4j Social Graph · GQL Graph
⏱️InfluxDB Metrics · IoT · TTL Time-Series
☁️ Storage & Warehouse
☁️Object Storage S3 / GCS · Exabytes Versioned
❄️Snowflake / BigQuery OLAP · Petabytes Data Warehouse
🏗️Lakehouse Delta Lake · Iceberg ACID on S3
Analytics Pipeline → ETL / ELT → Business Intelligence
⑦ Data Platform, Analytics & BI
🔄ETL / ELT Fivetran · dbt Automated
📊BI Dashboards Looker · Metabase Self-Service
🌊Real-Time OLAP Apache Druid · Pinot Sub-second
🎲A/B Testing Experiment Platform Stats Sig
🧬Data Catalog DataHub · Lineage Governance
📡CDP Segment · Identity Unified ID
🛡️Data Privacy PII Mask · GDPR Compliance
💾Backup & DR Cross-Region · RTO <1min RPO = 0

Horizontal scaling · Multi-region active-active · Event-driven architecture · CQRS + Event Sourcing · Polyglot persistence · Zero-downtime deploys · Chaos Engineering · SRE practices  |  2025 Production-Grade Design

02 Client layer

💻
Client Layer — Detailed Architecture
System Architecture · 10 Billion+ Users · All Client Platforms · 2025
① Client Layer
8 Client Types · 10B+ End Users · 6 Protocols · <3s TTI Target · Offline-First Design · E2E Encrypted · A/B Feature Flags · WCAG 2.2 Accessibility
Client Types & Technology Stacks
🌐
Web Browser
Desktop & Mobile Web
SPA + SSR
Framework & Rendering
React 19 · Next.js 15 · Vue / Nuxt · Svelte / SvelteKit
State & Data Fetching
Redux Toolkit · React Query · Zustand · SWR
Capabilities
PWA · Service Worker · Offline Cache
WebAssembly (WASM) for compute
Code splitting · Lazy loading · ISR
WebGL / Canvas 2D rendering
Web Workers for background tasks
Build & Deploy
Vite / Turbopack · Vercel / Cloudflare Pages
Protocols
HTTPS · WebSocket · GraphQL · HTTP/3
📱
iOS App
iPhone · iPad · Apple Silicon
Native Swift
Framework & Language
Swift 6 · SwiftUI · UIKit (Legacy) · Combine / async/await
Architecture
MVVM · TCA (Composable) · Clean Architecture
Capabilities
Core Data + CloudKit sync
Background fetch & push notifications
Face ID / Touch ID · Secure Enclave
ARKit · Core ML on-device inference
WidgetKit · Live Activities · SharePlay
Offline & Sync
Realm · SQLite · Core Data
Protocols
HTTPS · gRPC · WebSocket · APNs
🤖
Android App
Phone · Tablet · Foldable
Native Kotlin
Framework & Language
Kotlin · Jetpack Compose · Coroutines / Flow
Architecture
MVVM + Clean · MVI (Orbit) · Hilt DI
Capabilities
Room DB + WorkManager offline jobs
Biometric API · Keystore secure storage
ML Kit on-device · TensorFlow Lite
Adaptive layouts for foldables
DataStore · Paging 3 · Navigation
Toolchain
Gradle KTS · R8 · Play Store
Protocols
HTTPS · gRPC · FCM Push · WebSocket
🖥️
Desktop App
Windows · macOS · Linux
Cross-Platform
Framework
Tauri 2.0 (Rust) · Electron · Flutter Desktop · Qt / .NET MAUI
UI Layer
React / Svelte · Native OS APIs
Capabilities
Native OS notifications & tray icon
File system access · IPC bridge
Auto-update via Sparkle / Squirrel
Offline-first with local DB (SQLite)
Hardware GPU acceleration
Distribution
Mac App Store · MS Store · Snap
Protocols
HTTPS · WebSocket · gRPC · GraphQL
📡
IoT & Embedded
Sensors · Actuators · Edge
Billions of Devices
Platforms & Runtime
FreeRTOS · Zephyr RTOS · ESP-IDF / Arduino · Raspberry Pi / Linux
Edge Computing
AWS Greengrass · Azure IoT Edge · TF Lite Micro
Capabilities
Local edge inference (TF Lite Micro)
Store-and-forward when offline
OTA firmware updates (delta)
Hardware security module (HSM)
Device twin / shadow state sync
Connectivity
WiFi 6 · BLE 5.3 · LoRa · NB-IoT
Protocols
MQTT · CoAP · AMQP · HTTPS
🔌
3rd-Party / Partner API
B2B · Integrations · SDK
Partner Access
Integration Types
REST SDK · GraphQL API · Webhook Events · gRPC Streaming
Auth & Security
OAuth 2.0 Client Creds · API Key + HMAC · mTLS Certs
Capabilities
Developer portal & sandbox env
OpenAPI 3.1 spec & Postman collection
Rate limit: 10k–1M req/min by tier
Webhook retry with exponential backoff
SDK: Python, Node, Java, Go, Ruby
Tooling
Swagger · Stoplight · Postman
Protocols
REST · GraphQL · gRPC · Webhooks
🥽
AR / VR / Spatial
Vision Pro · Quest · HoloLens
Emerging
Platforms & SDK
visionOS / RealityKit · WebXR · Unity / Unreal Engine · Meta SDK (Quest)
Rendering
Metal / Vulkan · Three.js / Babylon.js
Capabilities
Spatial anchors & world-scale AR
Hand tracking & eye gaze input
Real-time 3D object streaming (USD)
Spatial audio with HRTF rendering
Shared AR sessions (multiplayer)
Compute
On-device ML · Cloud render offload
Protocols
WebRTC · WebSocket · HTTPS
Wearables
Watch · Smart Glass · Ring
Health & Fitness
Platforms
watchOS / WatchKit · Wear OS · Tizen · Garmin SDK
Sensors
HR · SpO2 · ECG · Accelerometer · GPS · Skin Temp · EDA
Capabilities
Continuous passive health monitoring
Companion app BLE data bridge
On-device ML for anomaly detection
Haptic feedback & notifications
HealthKit / Health Connect sync
Connectivity
BLE 5.3 · WiFi 6 · LTE (standalone)
Protocols
BLE GATT · HTTPS · MQTT
Request Flow — Client to Backend
👤
User Action
Tap · Click · Voice
🧩
UI Component
Event Handler
🔄
State Manager
Redux · TCA · ViewModel
📡
API Client
Axios · URLSession · OkHttp
🗃️
Local Cache
Check First (offline)
🔐
Auth Token
JWT · Bearer · mTLS
🛡️
TLS 1.3 / QUIC
Certificate Pinning
CDN / Edge
Nearest PoP · Cache Hit
🚪
API Gateway
Route · Rate Limit
⚙️
Microservice
Business Logic
Cross-Cutting Concerns — Shared Across All Clients
🔐Authentication & Auth
OAuth 2.0 + PKCE flow
JWT access + refresh tokens
Passkeys (WebAuthn / FIDO2)
Biometric: Face ID / Fingerprint
MFA: TOTP · SMS · Hardware key
Token rotation + silent refresh
Certificate pinning (mobile)
📊Analytics & Telemetry
Session recording (Hotjar / FullStory)
Funnel & conversion tracking
Core Web Vitals monitoring
Custom event tracking (Amplitude)
Crash reporting (Sentry / Firebase)
Real User Monitoring (RUM)
Privacy-safe (GDPR consent)
🚩Feature Flags & A/B
LaunchDarkly / Unleash SDK
Gradual rollout (1% → 100%)
Kill switch for instant rollback
Multivariate A/B experiments
Targeting by user segment
Client-side evaluation (local)
SSE / streaming flag updates
🐛Error Reporting
Sentry · Bugsnag · Firebase Crashlytics
Source map upload for JS/TS
dSYM symbols for iOS crashes
ProGuard mapping for Android
Breadcrumb trail for debugging
PII scrubbing before upload
Alert on error-rate spike
🌍i18n & Localization
100+ languages · RTL support
ICU message format (plurals)
Lazy-load locale bundles
Currency · Date · Number format
Over-the-Air translation updates
Crowdin / Lokalise workflow
WCAG 2.2 AA accessibility
🔔Push & Notifications
APNs (iOS) · FCM (Android)
Web Push API (browser)
Rich push: image, action buttons
Scheduled & silent pushes
Opt-in/opt-out preference center
In-app message (non-intrusive)
Delivery receipt tracking
Communication Protocols
🌐
REST / HTTPS
JSON · OpenAPI 3.1
HTTP/2 multiplexing
Idempotency keys
Request–Response
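The idempotency-key item above can be sketched in a few lines. This is a hypothetical in-memory store (`IdempotencyStore` is an illustrative name); a production gateway would keep the keys in Redis with a TTL:

```python
import threading

class IdempotencyStore:
    """Replay protection sketch: the first execution for a key is stored,
    and every retry with the same key gets the stored response back."""
    def __init__(self):
        self._lock = threading.Lock()
        self._responses = {}

    def execute(self, key, handler):
        # Serve the stored response if this key was already processed.
        with self._lock:
            if key in self._responses:
                return self._responses[key]
        result = handler()
        with self._lock:
            # setdefault guards against a concurrent first execution.
            self._responses.setdefault(key, result)
            return self._responses[key]
```

A retried POST carrying the same `Idempotency-Key` header then returns the stored response instead of performing the side effect twice.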
WebSocket / SSE
Real-time push
Chat · Live feeds
Heartbeat + reconnect
Full-Duplex
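The heartbeat-and-reconnect behaviour above is usually implemented as exponential backoff with an upper cap (the Ingress section quotes 1s → 30s). A minimal sketch of the delay schedule, with optional full jitter:

```python
import random

def backoff_delays(attempts, base=1.0, cap=30.0, jitter=False):
    """Exponential backoff schedule: base*2^n, capped.
    With jitter=True each delay is drawn uniformly from [0, capped],
    which spreads reconnect storms after a server restart."""
    delays = []
    for attempt in range(attempts):
        capped = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, capped) if jitter else capped)
    return delays

# Without jitter: 1s, 2s, 4s, 8s, 16s, then capped at 30s.
```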
🔄
gRPC / HTTP/2
Protobuf binary
Bi-directional stream
Low latency & typed
Service-to-Service
📐
GraphQL
Flexible queries
Subscriptions · Batching
Apollo / Relay client
Query Language
📡
MQTT / CoAP
IoT-optimized
Pub/Sub · QoS 0/1/2
Low bandwidth devices
IoT Telemetry
🎥
WebRTC / QUIC
P2P video & audio
DTLS-SRTP encrypted
ICE / STUN / TURN
Peer-to-Peer
Client Architecture Patterns
🏗️Backend for Frontend (BFF)
Dedicated API per client type (web, mobile, IoT)
Aggregates multiple microservices in one call
Reduces over-fetching & under-fetching
Tailored response shape per platform
🧩Micro-Frontend Architecture
Module Federation (Webpack 5 / Rspack)
Independent team deployments per feature
Shared design system component library
Lazy-loaded at runtime via CDN
💾Offline-First & Sync
CRDTs for conflict-free distributed state
Optimistic UI with rollback on error
Background sync via Service Worker
IndexedDB / SQLite local persistence
🔒Client Security Model
Zero-trust: every request re-authenticated
Certificate pinning on mobile clients
Content Security Policy (CSP) headers
Subresource Integrity (SRI) for scripts
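As a concrete taste of the CRDT idea behind offline-first sync, here is a last-writer-wins register, one of the simplest CRDTs. Real offline stacks typically use richer document CRDTs (e.g. Automerge, Yjs); this sketch only shows why replicas converge without coordination:

```python
from dataclasses import dataclass

@dataclass
class LWWRegister:
    """Last-writer-wins register: each replica keeps the write with the
    highest (timestamp, node_id) pair; node_id breaks ties so concurrent
    merges are deterministic on every replica."""
    value: object = None
    timestamp: float = 0.0
    node_id: str = ""

    def set(self, value, timestamp, node_id):
        if (timestamp, node_id) > (self.timestamp, self.node_id):
            self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def merge(self, other):
        # Merging is just another set(); it is commutative and idempotent,
        # so replicas converge no matter the order syncs arrive in.
        self.set(other.value, other.timestamp, other.node_id)
```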

Client Layer sits at the outermost tier of the system — responsible for user experience, offline resilience, security bootstrapping, and efficient protocol selection per device class.  |  All clients connect through the Edge & Security Layer next.  |  2025 Production Design

03 Edge & security

🛡️
Edge & Security Layer — Detailed Architecture
System Architecture · 10 Billion+ Users · Global Edge Network · Zero Trust Security · 2025
② Edge & Security Layer
300+ Global PoPs · 10 Tbps+ DDoS Capacity · <5ms Edge Latency · 99.999% Uptime SLA · 95% Cache Hit Rate · TLS 1.3 Encryption · OWASP WAF Ruleset · Zero Trust Security Model
Edge & Security Components — Technology Stacks
🛡️
DDoS Protection
Volumetric · Protocol · Application
10 Tbps+
Providers
Cloudflare Magic Transit · AWS Shield Advanced · Akamai Prolexic · Azure DDoS Protection
Attack Absorption
Anycast routing — absorb traffic at nearest PoP
BGP Flowspec — null-route attack sources
Scrubbing centers — clean traffic forwarded
Adaptive rate limiting per source IP / ASN
Attack Types Covered
UDP Flood · SYN Flood · ICMP Flood · HTTP Flood · Slowloris · DNS Amplify
Mitigation Speed
Detection < 1 sec · Mitigation < 3 sec
Protocols Protected
TCP/UDP · HTTP/S · DNS · ICMP · GRE
CDN / Edge Network
Content Delivery · Acceleration
300+ PoPs
Providers
Cloudflare · Akamai · Fastly · AWS CloudFront
Cache Hierarchy
L1 — Shield PoP (regional aggregator)
L2 — Edge PoP (end-user facing)
Origin Shield — collapses requests to origin
Stale-while-revalidate for zero-miss penalty
Cache Strategies
Cache-Control headers · Surrogate keys (tags) · Instant purge API · Vary-aware caching
Acceleration
TCP/IP anycast — closest PoP routing
HTTP/2 & HTTP/3 (QUIC) push
Brotli / Zstandard compression
Protocols
HTTP/3 · QUIC · HTTP/2 · WebSocket
🌍
GeoDNS & Routing
Global Traffic Management
Anycast
Providers
AWS Route 53 · NS1 / IBM DNS · Cloudflare DNS · Google Cloud DNS
Routing Policies
Latency-based — route to lowest latency region
Geolocation — route by country / continent
Weighted round-robin for gradual failover
Health-check aware — auto-remove unhealthy
Failover (active–passive) with TTL 30s
DNSSEC & Security
DNSSEC signing · DNS-over-HTTPS · DNS-over-TLS · Response Rate Limit
Performance
TTL: 30s–300s · 100% Anycast coverage
Protocols
DNS/UDP · DoH · DoT · DNSSEC
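The latency-based, health-check-aware routing policies above reduce to a small selection rule. A sketch with hypothetical region names and probe latencies (real GeoDNS uses per-resolver latency measurements):

```python
def pick_region(probe_latency_ms, healthy_regions):
    """Latency-based answer selection: among regions currently passing
    health checks, return the one with the lowest measured latency."""
    candidates = {r: ms for r, ms in probe_latency_ms.items()
                  if r in healthy_regions}
    if not candidates:
        raise RuntimeError("no healthy region to route to")
    return min(candidates, key=candidates.get)
```

Health-check awareness falls out naturally: when a region is removed from the healthy set, the next-best region is answered instead.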
🔥
Web Application Firewall
OWASP · Custom Rules · L7
L7 Filter
Providers
Cloudflare WAF · AWS WAF v2 · ModSecurity (NGINX) · Imperva WAF
Rule Categories
OWASP Core Rule Set (CRS 3.3)
SQLi · XSS · RCE · SSRF · LFI
Geo-blocking (country allowlist)
IP reputation lists (threat intel feeds)
Modes & Actions
Block · Challenge (CAPTCHA) · JS Challenge · Log / Monitor
Tuning
Paranoia level 1–4 (false positive tradeoff)
Per-path exceptions for legacy endpoints
Protocols
HTTP/S · HTTP/2 · HTTP/3 · WebSocket
🗺️
Edge Workers / Compute
Serverless at the Edge
V8 Isolates
Runtimes
Cloudflare Workers · AWS Lambda@Edge · Vercel Edge Runtime · Fastly Compute@Edge
Use Cases
A/B testing — rewrite response at edge
Personalisation by user / geo / device
Auth token validation (no origin round-trip)
Request / response transformation
Image optimisation (resize, WebP, AVIF)
KV Storage at Edge
Cloudflare KV (global) · Durable Objects (strong) · R2 Object Store
Cold-Start
< 1ms (V8 isolate, no container boot)
Runtimes
JS / TS · Rust/WASM · Python · WinterCG
🔐
TLS / QUIC Termination
Encryption · Certificates · PKI
mTLS
TLS Versions
TLS 1.3 (primary) · QUIC (HTTP/3) · TLS 1.2 (legacy fallback)
Certificate Management
Auto-provisioned via Let's Encrypt / ACM
Wildcard certs · SAN multi-domain
ACME protocol — 90-day auto-renewal
Certificate Transparency (CT) logging
OCSP stapling — no client round-trip
Cipher Suites
TLS_AES_256_GCM_SHA384 · TLS_CHACHA20_POLY1305 · ECDHE (X25519)
mTLS (Service-to-Service)
SPIFFE / SPIRE identity framework
Short-lived SVID certs (1h TTL)
Standards
RFC 8446 · FIPS 140-2 · PCI-DSS · HSTS
🕵️
Bot Management
ML Detection · Fingerprinting
Real-Time ML
Providers
Cloudflare Bot Mgmt · DataDome · Kasada · Arkose Labs
Detection Signals
Browser fingerprint (JS challenge)
Mouse/scroll/keyboard behaviour heuristics
TLS JA3/JA3S fingerprint
HTTP/2 SETTINGS frame fingerprint
IP reputation + ASN / datacenter ranges
Bot Categories
Scrapers · Credential Stuffing · Carding Bots · Click Fraud · Scalper Bots · Good Bots (allow)
Actions
Block · CAPTCHA · Slow down (throttle) · Fake 200 (honeypot)
Signals
JS Agent · HTTP Headers · IP Geo · ML Score
📦
Static Asset Cache
CDN Cache · Object Storage
95% Hit Rate
Cached Asset Types
JS / CSS bundles · Images (WebP/AVIF) · Fonts (WOFF2) · HTML (ISR/SSG) · API responses (JSON)
Cache Control Strategy
Immutable assets: max-age=31536000 (1yr)
HTML pages: max-age=0, s-maxage=300
Stale-while-revalidate: background refresh
Surrogate key tags for instant bulk purge
Optimisations
Brotli / Zstd on-the-fly compression
Image format negotiation (Accept header)
HTTP Early Hints (103) for link preload
Protocols
HTTP/3 · QUIC · HTTP/2 Push · Early Hints
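The cache-control policy above maps each asset class to header values. A sketch, using file extensions as a stand-in for the real content-hash and content-type logic a CDN would apply:

```python
def cache_control(path: str) -> str:
    """Return the Cache-Control header for an asset class, mirroring the
    strategy above: immutable hashed assets for a year, HTML edge-cached
    briefly with stale-while-revalidate, short edge cache otherwise."""
    if path.endswith((".js", ".css", ".woff2", ".webp", ".avif")):
        # Content-hashed bundles never change in place: cache for a year.
        return "public, max-age=31536000, immutable"
    if path.endswith((".html", "/")):
        # Browser revalidates every time; the CDN keeps it 5 min and
        # refreshes in the background while serving stale.
        return "public, max-age=0, s-maxage=300, stale-while-revalidate=60"
    # Default: short shared (CDN) cache, no browser cache.
    return "public, max-age=0, s-maxage=60"
```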
Traffic Flow — Internet to Origin
🌐
Internet / Client
Any device worldwide
Anycast BGP
🛡️
DDoS Scrubbing
Attack traffic dropped
Clean traffic only
🌍
GeoDNS
Nearest PoP resolved
DNS A/AAAA record
🔐
TLS 1.3 Handshake
0-RTT resumption
Encrypted tunnel
🔥
WAF Inspection
OWASP + custom rules
Scored request
🕵️
Bot Score
ML fingerprint check
Human verified
📦
Cache Check
HIT → serve instantly
Cache MISS only
🗺️
Edge Worker
Transform / personalise
Forwarded upstream
🚪
API Gateway
Ingress Layer →
DDoS Defense Matrix
🌊Volumetric (L3/L4)
UDP Flood · ICMP · DNS Amplification · NTP Reflection
Anycast absorption — distribute globally
Blackhole routing / RTBH for sources
Scrubbing centres — clean & forward
BGP Flowspec — ISP-level filtering
10 Tbps+ capacity · Detection < 1s
Protocol (L4)
SYN Flood · ACK Flood · TCP Exhaustion · Slowloris
SYN cookies — validate before allocating state
TCP connection rate limits per source
Timeout tuning — kill idle connections fast
Load balancer health & back-pressure
Stateless defence · 1M PPS handled
🌐Application (L7)
HTTP Flood · Credential Stuffing · Web Scraping · API Abuse
JS Challenge — headless browser detection
Token bucket rate limiting per client
ML bot score threshold enforcement
CAPTCHA escalation on suspicious score
WAF + Bot Mgmt · 99.9% accuracy
WAF Rule Categories — OWASP Top 10 Coverage
💉Injection Attacks · CRITICAL
SQL Injection (UNION, DROP, '--)
NoSQL / ORM injection
LDAP / XPath injection
OS command injection
📜XSS & CSRF · HIGH
Reflected XSS pattern matching
Stored XSS via input sanitisation
DOM-based XSS (CSP enforcement)
CSRF token validation
📁Path Traversal & LFI · HIGH
../../ traversal sequence blocking
/etc/passwd, /proc pattern rules
URL-encoded bypass detection
Null byte injection blocking
🔗SSRF & RCE · CRITICAL
169.254.x.x metadata endpoint block
localhost / 127.0.0.1 request block
Serialisation gadget pattern rules
Log4Shell / Spring4Shell signatures
Certificate & PKI Lifecycle
1 🔑 Key Generation ECDSA P-256 / RSA-2048 — HSM-backed private key
2 📋 CSR & ACME Let's Encrypt / AWS ACM — automated DNS-01 challenge
3 ✅ Certificate Issued 90-day validity · Wildcard SAN · CT log recorded
4 🚀 Deploy to CDN Edge OCSP stapling enabled · pushed to all PoPs
5 🔄 Auto-Renew (day 75) Renewal triggered 15 days before expiry · zero downtime
6 📊 Monitor & Alert Prometheus cert-exporter · alert if expiry < 30 days
7 ⛔ Revocation (CRL / OCSP) Instant revocation via CA API on compromise
Zero Trust Architecture Pillars
🪪Identity Verification
Every request authenticated, no implicit trust
OAuth 2.0 + short-lived JWT tokens
MFA enforced for privileged actions
📱Device Posture
Device health check before access grant
MDM compliance (Jamf / Intune)
Jailbreak / root detection (mobile)
🕸️Network Micro-Seg.
mTLS between every service pair
SPIFFE SVID identity certificates
Istio policy enforces allowed traffic
📊Continuous Monitoring
All access logged to immutable audit trail
SIEM anomaly detection (ML baseline)
Automated threat response playbooks
🔒Least Privilege
Fine-grained RBAC per resource & action
Just-in-time (JIT) elevated access
Auto-expire access grants (4h max)
🌍Data Protection
AES-256 encryption at rest + in transit
Tokenisation for PII in transit
DLP rules — detect & block data exfil
Global Point-of-Presence (PoP) Distribution
🌎 North America · 80+ PoPs
🌍 Europe · 70+ PoPs
🌏 Asia Pacific · 65+ PoPs
🌏 Southeast Asia · 30+ PoPs
🌍 Middle East · 15+ PoPs
🌍 Africa · 12+ PoPs
🌎 South America · 18+ PoPs
🌏 Oceania · 10+ PoPs
Anycast Backbone · Tier-1 ISPs
🔗 Private Backbone · Argo / SmartRouting
3ms DNS Resolution · Anycast, 300+ resolvers
0-RTT TLS Resumption · Session tickets / PSK
95% Cache Hit Rate · Saved origin round-trips
<1s DDoS Detection · ML flow analysis
<3s DDoS Mitigation · Anycast blackhole
<1ms Edge Worker Start · V8 isolate, no cold boot
99.999% Availability SLA · 5 nines, <5 min/year down
10 Tbps Absorption Capacity · Largest attacks mitigated

Edge & Security Layer is the first line of defence — absorbing attacks, encrypting all traffic, accelerating content globally, and enforcing Zero Trust before any request reaches the Ingress Layer.  |  Compliance: SOC 2 Type II · ISO 27001 · PCI-DSS Level 1 · GDPR Article 32  |  2025 Production Design

04 Ingress layer

🚪
Ingress Layer — Detailed Architecture
System Architecture · 10 Billion+ Users · API Gateway · Auth · Rate Limiting · Service Mesh Entry · 2025
③ Ingress Layer
5M+ Req/Second · <5ms GW Overhead · 99.999% Availability · Zero Trust Auth Model · mTLS Service-to-Service · Active-Active LB Mode · Per-User Rate Limiting · OAuth 2.0 Auth Standard
Ingress Components — Technology Stacks
⚖️
Global Load Balancer
L4 / L7 · Health Checks · Failover
Active-Active
Technologies
AWS ALB / NLB · GCP Cloud LB · NGINX Plus · HAProxy
Algorithms
Weighted Round Robin — traffic distribution
Least Connections — route to least loaded
IP Hash — sticky sessions per client
Latency-based — route to fastest region
Health Checks
Active — HTTP /health every 5s, timeout 2s
Passive — detect 5xx errors, remove node
Graceful drain — 30s connection draining
Features
SSL Offload · Connection Pooling · HTTP/2 Upgrade · Anycast VIP
Protocols
TCP/UDP · HTTP/2 · HTTP/3 · WebSocket
🚪
API Gateway
Route · Transform · Aggregate
REST + GraphQL
Technologies
Kong Gateway · AWS API GW v2 · Apigee X · Envoy / Istio GW
Core Functions
Request routing — path, host, header matching
Request / response transformation (Lua plugins)
Protocol translation — REST ↔ gRPC ↔ WebSocket
API versioning — /v1, /v2 routing strategies
Request aggregation — fan-out and merge
Plugins / Middleware
Auth (JWT/OAuth) · CORS · Request Validator · Cache · Logging
Protocols
REST · GraphQL · gRPC · WebSocket
🔑
Auth & IAM Service
OAuth 2.0 · JWT · MFA · SSO
Zero Trust
Technologies
Keycloak · Auth0 · AWS Cognito · Okta / Ping Identity
Auth Methods
OAuth 2.0 + PKCE — public clients
OIDC — identity layer on top of OAuth
Passkeys (WebAuthn / FIDO2) — passwordless
SAML 2.0 — enterprise SSO federation
API Key + HMAC — machine-to-machine
Token Strategy
Access token: JWT, 15-min TTL, RS256 signed
Refresh token: opaque, 7-day, Redis-stored
Silent refresh — background token rotation
MFA Factors
TOTP (Authenticator) · SMS OTP · Hardware Key (YubiKey) · Biometric
Protocols
OAuth 2.0 · OIDC · SAML 2.0 · FIDO2
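Token issuance and gateway-side verification follow the JWT structure described above. This sketch uses HS256 with a shared secret so it stays stdlib-only; the gateway described here would use RS256 with public keys fetched from a JWKS endpoint, but the encode/verify structure is identical:

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims, secret: bytes, ttl=900):
    """Issue a compact HS256 JWT with a 15-minute expiry by default."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({**claims, "exp": int(time.time()) + ttl}).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"

def verify_jwt(token, secret: bytes):
    """Check signature and expiry, then return the claims."""
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```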
🚦
Rate Limiter
Token Bucket · Sliding Window
Per-User / IP
Technologies
Redis Cluster · Envoy Rate Limit · Kong Rate Limit · Nginx limit_req
Algorithms
Token bucket — burst allowance, steady refill
Sliding window log — exact precision, Redis sorted set
Fixed window counter — Redis INCR + EXPIRE
Leaky bucket — smooth output, queue overflow
Limit Dimensions
Per user ID · per API key · per IP address
Per endpoint path · per tenant (multi-tenant)
Global cluster limit via Redis Lua scripts
Response Headers
X-RateLimit-Remaining · X-RateLimit-Reset · Retry-After
Actions on Breach
HTTP 429 · Throttle · Queue · Block IP
📡
WebSocket Gateway
Real-Time · Pub/Sub · Bi-Directional
Full-Duplex
Technologies
Socket.IO / uWebSockets · AWS API GW WebSocket · Pusher / Ably · Fanout / Fastly Real-Time
Architecture
Sticky sessions — consistent hashing by user ID
Redis Pub/Sub — fan-out across gateway nodes
Heartbeat ping/pong — 30s interval keep-alive
Reconnect with exponential backoff (1s → 30s)
SSE fallback — for environments blocking WS
Message Routing
Room / channel-based topic routing
User-to-user direct message delivery
Protocols
WebSocket · SSE · Long Poll · STOMP
🔄
GraphQL Federation
Apollo Gateway · Unified Graph
Schema Stitching
Technologies
Apollo Gateway · Apollo Federation v2 · GraphQL Mesh · Hasura / StepZen
Federation Model
Supergraph — composed from subgraphs per service
@key directive — entity references across services
@external / @requires — cross-subgraph fields
Query planning — optimal subgraph execution
Performance
Persisted queries — hash instead of full query
DataLoader — batching + deduplication (N+1)
Response cache — Redis, TTL per type
Security
Query depth limit (10) · Cost analysis (query budget) · Disable introspection (prod)
Protocols
GraphQL/HTTP · Subscriptions · WebSocket · SSE
🛑
Circuit Breaker
Resilience · Bulkhead · Fallback
Fault Tolerant
Technologies
Resilience4j · Hystrix (Legacy) · Envoy Outlier Detection · Istio DestinationRule
Patterns
Circuit Breaker — trip on 50% error rate (10s)
Bulkhead — isolate thread pools per dependency
Retry — 3 attempts, exponential backoff + jitter
Timeout — per-request deadline propagation
Fallback — cached response or graceful degradation
CB Configuration
Failure threshold: 50% in sliding window (100 calls)
Half-open: 5 trial requests after 30s wait
Protocols
HTTP · gRPC · Async Queue · DB Pool
📊
gRPC Proxy
Envoy Sidecar · Service Mesh
Protobuf
Technologies
Envoy Proxy · Istio Ingress GW · gRPC-Gateway (REST↔gRPC) · Traefik / Linkerd
Features
gRPC-Web — browser clients via HTTP/1.1
Transcoding — REST JSON ↔ Protobuf
Server-side streaming (SSE equivalent)
Client streaming — large file uploads
Bi-directional streaming — real-time bidding
Observability
gRPC status codes → Prometheus metrics
Distributed trace injection (B3 / W3C)
Auth
mTLS per call · gRPC metadata JWT · ALTS (Google)
Protocols
HTTP/2 · Protobuf · gRPC-Web · mTLS
Request Lifecycle — Edge to Microservice
From CDN / Edge
Clean · TLS terminated
Anycast VIP
⚖️
Global LB
L7 routing, health check
Least-conn algorithm
🚪
API Gateway
Route + transform
Plugin chain runs
🔑
Auth Check
JWT verify · RBAC
Token validated
🚦
Rate Limiter
Quota check in Redis
Under quota
🛑
Circuit Breaker
CLOSED — pass through
Request forwarded
📊
gRPC Proxy
Envoy sidecar + mTLS
Service mesh
⚙️
Microservice
Business logic executes
Load Balancing Algorithms
🔁 Weighted Round Robin · Default
Distributes traffic proportionally by configured weight (e.g. Node A = 70%, B = 30%). Ideal for heterogeneous instance sizes. Zero overhead, deterministic.
Low overhead · Predictable
📊 Least Connections · Recommended
Routes each request to the upstream with the fewest active connections. Optimal for variable request durations (e.g. streaming, long polling). Adapts to hot spots.
Adaptive · Fair distribution
📌 IP Hash / Sticky · Stateful
Hashes the client IP so a client always reaches the same upstream. Required for WebSocket sessions and stateful workloads. A consistent hashing ring handles node add/remove.
Session affinity · WS / gRPC
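Static weighted round robin can be sketched by expanding each upstream by its weight and cycling; production balancers typically use smooth WRR so picks interleave rather than cluster, but the proportions come out the same:

```python
import itertools

def weighted_round_robin(weights):
    """Naive weighted round robin: repeat each upstream by its integer
    weight, then cycle through the expanded list forever."""
    expanded = [node for node, w in weights for _ in range(w)]
    return itertools.cycle(expanded)

# Node A gets 70% of picks and node B 30%, as in the example above.
picker = weighted_round_robin([("node-a", 7), ("node-b", 3)])
window = [next(picker) for _ in range(10)]
```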
OAuth 2.0 + PKCE Authentication Flow
1
Client initiates login
Generate code_verifier (random 43-128 chars) + code_challenge = SHA256(verifier)
PKCE
2
Authorization request
Redirect to /authorize?client_id=…&code_challenge=…&scope=openid profile
Browser redirect
3
User authenticates (+ MFA)
Password / Passkey / Biometric → optional TOTP / SMS / hardware key
MFA enforced
4
Authorization code returned
Short-lived code (10min) sent to redirect_uri callback
One-time code
5
Token exchange (back-channel)
POST /token {code, code_verifier, client_id} → access_token (JWT 15min) + refresh_token (7d)
Secure exchange
6
JWT verified at API Gateway
Signature verified via JWKS endpoint · claims extracted · scopes enforced
RS256 / ES256
7
Silent refresh before expiry
At 13min: use refresh_token to obtain new access_token silently in background
Auto-renew
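Steps 1 and 5 of the flow above hinge on the verifier/challenge pair. A sketch of how a client derives them (per RFC 7636: SHA-256 of the verifier, base64url-encoded without padding) and how the authorization server checks the token exchange:

```python
import base64, hashlib, secrets

def make_pkce_pair():
    """Client side (step 1): 32 random bytes yield a 43-char verifier,
    the minimum length RFC 7636 allows; the challenge is its S256 hash."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

def server_check(verifier, challenge):
    # Authorization server side (step 5): recompute the challenge from the
    # presented verifier and compare with the one sent in step 2.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode() == challenge
```

Because only the challenge travels through the browser redirect, an attacker who intercepts the authorization code still cannot redeem it without the verifier.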
Rate Limiting Strategies
🪣Token Bucket · Burst Allowed
Tokens added at fixed rate (e.g. 100 req/s)
Max bucket size = burst capacity (e.g. 200)
Request costs 1 token; rejected if bucket empty
Allows short bursts above steady-state rate
Use for: API consumers needing burst headroom
📊Sliding Window · Most Accurate
Redis sorted set: score = timestamp
ZREMRANGEBYSCORE removes expired entries
ZCARD counts requests in window
ZADD inserts current request timestamp
Use for: strict per-user quotas (billing tier)
🪟Fixed Window · Lightweight
Redis INCR key:{user}:{minute}
EXPIRE set to window end time
Atomic Lua script (INCR + check)
Boundary burst problem at window edge
Use for: high-volume, lower precision needs
🚰Leaky Bucket · Smooth Output
Queue incoming requests (bucket capacity)
Process at fixed outflow rate (drip)
Overflow → reject with HTTP 429
Prevents bursty load reaching backends
Use for: downstream protection, smooth traffic
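The token-bucket variant described above can be sketched in-process. The cluster-wide limiter mentioned earlier runs this same refill-and-spend logic atomically inside a Redis Lua script; here `now` is passed explicitly to keep the sketch deterministic:

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens accrue at `rate` per second
    up to `capacity`, and each allowed request spends one token."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity     # burst size
        self.tokens = float(capacity)
        self.last = 0.0              # production: initialise to time.time()

    def allow(self, now):
        # Refill based on elapsed time, then try to spend one token.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With rate=100 and capacity=200, a burst of 250 simultaneous requests admits exactly 200, and half a second later 50 tokens have refilled.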
Circuit Breaker State Machine
🟢
CLOSED
Normal operation. All requests pass through. Failure counter tracking active. Resets on success window.
Healthy · Pass-through
Failure rate > 50%
🔴
OPEN
All requests fail-fast with fallback response. No load on downstream. Reset timer: 30 seconds. Prevents cascade failures.
Fail-Fast · Cached fallback
Timer expires (30s wait)
🟡
HALF-OPEN
5 trial requests allowed. If majority succeed → CLOSED. If failures continue → back to OPEN. Gradual recovery probe.
Probing · 5 trial calls
Trials fail
🔁
Fallback Strategy
Return cached data, static response, or degraded feature. Bulkhead pattern isolates thread pools per dependency.
Cache hit · Graceful degrade
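The state machine above, with the thresholds quoted in this section (50% failure rate trips the breaker, 30s open, 5 half-open probes), can be sketched as:

```python
import collections

class CircuitBreaker:
    """CLOSED → OPEN → HALF-OPEN sketch. `now` is passed in explicitly;
    a real implementation would read a monotonic clock."""
    def __init__(self, window=10, trip_ratio=0.5, reset_after=30.0, probes=5):
        self.window = collections.deque(maxlen=window)  # recent outcomes
        self.trip_ratio = trip_ratio
        self.reset_after = reset_after
        self.probes = probes
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.probe_results = []

    def allow(self, now):
        # After the open timer expires, let a limited set of probes through.
        if self.state == "OPEN" and now - self.opened_at >= self.reset_after:
            self.state, self.probe_results = "HALF-OPEN", []
        return self.state != "OPEN"

    def record(self, success, now):
        if self.state == "HALF-OPEN":
            self.probe_results.append(success)
            if len(self.probe_results) == self.probes:
                if sum(self.probe_results) > self.probes // 2:
                    self.state = "CLOSED"      # majority succeeded
                else:
                    self.state, self.opened_at = "OPEN", now
            return
        self.window.append(success)
        full = len(self.window) == self.window.maxlen
        if full and self.window.count(False) / len(self.window) >= self.trip_ratio:
            self.state, self.opened_at = "OPEN", now
            self.window.clear()
```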
API Gateway Routing Table — Sample Rules
Method | Path Pattern | Target Service | Auth | Rate Limit | Cache | Notes
GET | /api/v2/users/{id} | user-service | JWT | 1000/min | 60s TTL | User profile, read-only
POST | /api/v2/orders | order-service | JWT+MFA | 100/min | No cache | Idempotency-Key required
GET | /api/v2/search?q=* | search-service | API Key | 500/min | 10s TTL | Elasticsearch fanout
WS | /ws/v2/chat | msg-service | JWT | Sticky | No cache | Redis Pub/Sub fan-out
POST | /api/v2/payments | payment-service | JWT+MFA | 20/min | No cache | PCI-DSS · Idempotent
POST | /graphql | apollo-gateway | JWT | 200/min | Per-type | Depth limit=10, cost limit
GET | /health, /metrics | internal | No auth | Unlimited | No cache | LB health checks only
5M+ Req/Second · Peak throughput handled
<5ms GW Overhead P99 · Auth + route + transform
<1ms JWT Validation · JWKS cached in memory
<2ms Rate Limit Check · Redis pipeline command
30s CB Open Duration · Before half-open probe
HTTP 429 Rate Exceeded · Retry-After header sent
15 min Access Token TTL · Refresh at 13 min
99.999% GW Availability · Active-active multi-AZ

Ingress Layer acts as the unified entry point for all traffic — enforcing auth, routing, rate limiting, and resilience patterns before passing requests into the Microservices Layer via the Service Mesh.  |  Standards: OAuth 2.0 RFC 6749 · PKCE RFC 7636 · JWT RFC 7519 · FIDO2/WebAuthn W3C · OpenAPI 3.1  |  2025 Production Design

05 Microservices & service mesh

🐳
Microservices Layer — Kubernetes · Istio Service Mesh
12 Domain Services · Container Orchestration · Service Mesh · Autoscaling · GitOps · 2025
④ Microservices Layer
12 Microservices · 1000+ Pods (peak) · 50+ K8s Nodes · mTLS All Svc Traffic · HPA + VPA Autoscaling · GitOps Deployment Model · <30s Deploy Time · 99.99% Uptime SLA
Infrastructure Foundation — Kubernetes Control Plane · Istio Service Mesh
🐳
Kubernetes (EKS / GKE / AKS) — Container Orchestration
Manages container lifecycle, scheduling, scaling, and self-healing across all microservices
🧠 Control Plane
kube-apiserver — REST API, auth, admission etcd — distributed KV config store kube-scheduler — bin-pack pod placement controller-manager — reconcile loop cloud-controller — node/LB provisioning
⚙️ Worker Nodes
kubelet — node agent, pod lifecycle kube-proxy — iptables / eBPF networking Container runtime — containerd / CRI-O Karpenter — node auto-provisioner Node OS — bottlerocket / flatcar
📦 Workload Resources
Deployment + ReplicaSet — stateless pods StatefulSet — ordered, stable network ID DaemonSet — one pod per node (logging) CronJob — scheduled batch tasks HPA · VPA · KEDA — autoscaling
🔧 Config & Storage
  • ConfigMap — environment config (non-secret)
  • Secret — base64 encoded (Vault injector)
  • PersistentVolume — EBS / GCS / NFS
  • Namespace isolation — per team/env
  • RBAC — least-privilege service accounts
🕸️
Istio Service Mesh — Traffic, Security, Observability
Envoy sidecar proxy injected into every pod — handles all in-cluster communication transparently
🔀 Traffic Management
  • VirtualService — traffic routing rules per host
  • DestinationRule — circuit breaker, retries, mTLS
  • Gateway — L4/L7 ingress / egress control
  • Canary — weighted splits (10/90, 50/50)
  • Header-based routing — A/B test by cookie
🔒 Security
  • mTLS — SPIFFE/SPIRE auto-cert per pod
  • AuthorizationPolicy — RBAC at L7 (path)
  • PeerAuthentication — enforce STRICT mode
  • RequestAuthentication — JWT verification
  • Egress control — allowlist external services
📡 Observability
  • Envoy metrics → Prometheus (RED signals)
  • Distributed traces — auto-injected headers
  • Access logs — all svc-to-svc calls logged
  • Kiali — live service map + health graph
  • Grafana dashboards — mesh golden signals
🛡️ Resilience Policies
  • Retries — 3 attempts on 5xx, 502, 503
  • Timeout — per-route deadline (2s default)
  • Circuit breaker — consecutive 5xx threshold
  • Outlier detection — eject unhealthy hosts
  • Fault injection — chaos testing in staging
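The circuit-breaker policy above reduces to a small state machine: trip OPEN after N consecutive 5xx responses, hold OPEN for a cooldown (the Ingress section quotes 30s), then allow a single half-open probe. A sketch with illustrative thresholds:

```python
class CircuitBreaker:
    """CLOSED -> OPEN after N consecutive failures; OPEN -> HALF_OPEN after cooldown."""
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.consecutive_failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        if self.state == "OPEN" and now - self.opened_at >= self.open_seconds:
            self.state = "HALF_OPEN"   # cooldown elapsed: let one probe through
        return self.state != "OPEN"

    def record(self, success: bool, now: float) -> None:
        if success:
            self.consecutive_failures = 0
            self.state = "CLOSED"
        else:
            self.consecutive_failures += 1
            if self.state == "HALF_OPEN" or self.consecutive_failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = now

cb = CircuitBreaker(failure_threshold=5, open_seconds=30.0)
for t in range(5):                      # five consecutive 5xx responses
    cb.record(success=False, now=float(t))
tripped = cb.state                      # "OPEN"
blocked = cb.allow_request(now=10.0)    # still inside the 30s window
probe_ok = cb.allow_request(now=35.0)   # cooldown elapsed -> half-open probe allowed
```

Envoy's outlier detection implements the same idea per upstream host, ejecting it from the load-balancing pool instead of tracking an explicit state enum.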
Domain Microservices — Technology, Responsibilities & Interfaces
👤
User Service
Profile · Preferences · CQRS
Go · 10 pods
Responsibilities
  • CRUD user profiles, avatars, preferences
  • CQRS — write model PostgreSQL, read model Redis
  • Event publish: UserCreated, UserUpdated (Kafka)
  • Privacy: GDPR delete / export API
Tech Stack
Go 1.23 PostgreSQL Redis gRPC
Scale
HPA: 5–50 pods · CPU 70%
🔐
Auth & IAM Service
Token · Session · RBAC
Rust · 8 pods
Responsibilities
  • Issue & validate JWT access/refresh tokens
  • Manage OAuth 2.0 client credentials
  • Role & permission resolution (RBAC/ABAC)
  • MFA enrollment and verification flow
Tech Stack
Rust Redis (sessions) PostgreSQL REST + gRPC
Scale
HPA: 4–30 pods · CPU 60%
📝
Content Service
CRUD · Versioning · Rich Media
Java · 15 pods
Responsibilities
  • Create/edit/publish content with versioning
  • Rich-text storage: Quill Delta / ProseMirror
  • Tag taxonomy, categories, slug routing
  • CDN cache invalidation on publish events
Tech Stack
Java 21 / Spring Boot PostgreSQL MongoDB
Scale
HPA: 5–60 pods · RPS-based
🔔
Notification Service
Push · Email · SMS · In-App
Node.js · 12 pods
Responsibilities
  • Multi-channel delivery: push, email, SMS, in-app
  • Template engine — personalised, localised (i18n)
  • Delivery tracking, bounce, open-rate metrics
  • User preference centre — opt-in/opt-out per channel
Tech Stack
Node.js / BullMQ SendGrid Twilio · Firebase FCM
Scale
KEDA: Kafka lag · 5–80 pods
💳
Payment Service
PCI-DSS · Idempotent · Saga
Java · 8 pods
Responsibilities
  • Process charges via Stripe / Adyen / PayPal
  • Idempotency — Idempotency-Key header enforced
  • Saga pattern — compensating tx on failure
  • PCI-DSS scope isolation — dedicated namespace
Tech Stack
Java 21 PostgreSQL Vault (card data)
Scale
HPA: 4–20 pods · Strict mTLS
💬
Messaging Service
Real-Time · E2E Encrypt · Rooms
Erlang · 30 pods
Responsibilities
  • WebSocket connections — 100k+ per pod
  • End-to-end encryption (Signal Protocol)
  • Message persistence: Cassandra time-series
  • Presence, typing indicators, read receipts
Tech Stack
Erlang/OTP Cassandra Redis Pub/Sub
Scale
KEDA: WS conns · 10–100 pods
🗂️
File Service
Upload · CDN · Virus Scan
Go · 8 pods
Responsibilities
  • Presigned S3 URLs — direct client upload
  • Virus scan (ClamAV) on upload completion
  • Image processing — resize, WebP/AVIF convert
  • CDN URL generation with signed expiry
Tech Stack
Go AWS S3 / GCS ClamAV · Imgproxy
Scale
HPA: 4–30 pods · CPU 70%
🌐
Geo / Map Service
Location · Geofencing · Routing
C++ · 12 pods
Responsibilities
  • Real-time location updates via WebSocket
  • Geofencing — enter/exit event triggers
  • Nearest-entity query (PostGIS / Redis Geo)
  • Route calculation — Dijkstra / A* graph
Tech Stack
C++ / gRPC PostGIS Redis Geo
Scale
HPA: 6–40 pods · Lat-based
🎯
Recommendation Service
ML · Collaborative · Real-Time
Python · 20 pods
Responsibilities
  • Collaborative filtering (Matrix Factorisation)
  • Two-tower neural network for retrieval
  • Online serving via Feature Store (Feast)
  • A/B experiment routing per user segment
Tech Stack
Python / TF Serving Feast Redis (online)
Scale
GPU nodes: 8–40 · GPU util 80%
🛒
Order / Cart Service
Checkout · Saga · Inventory
Java · 15 pods
Responsibilities
  • Shopping cart — Redis-backed session state
  • Checkout — Saga orchestration across services
  • Inventory reservation with optimistic locking
  • Order state machine: pending → paid → shipped
Tech Stack
Java / Temporal PostgreSQL Redis (cart)
Scale
HPA: 8–50 pods · RPS + CPU
⚙️
Config & Feature Flag
LaunchDarkly · etcd · Rollout
Go · 6 pods
Responsibilities
  • Dynamic config updates without redeploy
  • Feature flags — gradual rollout, kill switch
  • SSE push — services receive flag changes < 1s
  • Audit log — who changed what flag, when
Tech Stack
Go etcd LaunchDarkly SDK
Scale
Fixed: 3–6 pods · HA quorum
Deployment Strategies
🔄Rolling Update — Default
  • maxSurge: 25% extra pods during rollout
  • maxUnavailable: 0 — zero downtime
  • Readiness gate before traffic shifts
  • Auto rollback on failed health checks
No downtime · Slow rollout
🔵Blue / Green — Major releases
  • Run v1 (blue) + v2 (green) simultaneously
  • Switch VirtualService weight 0→100 instantly
  • Instant rollback: flip weight back to blue
  • Double resource cost during transition
Instant cutover · Safe
🐦Canary Release — Risk control
  • Route 1% → 5% → 20% → 100% traffic
  • Istio VirtualService weighted routing
  • Auto-promote on SLO gate pass (Flagger)
  • Auto-rollback on error budget breach
Progressive · Flagger
Service Communication Patterns
🔄Sync — gRPC / REST
  • Request-response with mTLS via Envoy
  • Protobuf — compact binary, typed schema
  • HTTP/2 multiplexing — concurrent streams
  • Retries with backoff at mesh level
Low latency
🌊Async — Kafka Events
  • Domain events: CloudEvents spec
  • Exactly-once via idempotent producers
  • Consumer group per service — independent
  • Schema Registry (Avro) — contract safety
Decoupled
📡Pub/Sub — Redis
  • Fan-out notifications to WebSocket nodes
  • Ephemeral — no persistence, fire and forget
  • Channels per user / room / tenant
  • Keyspace events for cache invalidation
Real-time
🔗Saga — Temporal
  • Long-running distributed transactions
  • Durable workflow state — survives crashes
  • Compensating transactions on failure
  • Order checkout: reserve → charge → fulfill
Consistent
Autoscaling Policies
📈HPA — Horizontal Pod
  • Scale out/in pods based on custom metrics
  • CPU threshold: scale at 70% utilization
  • Custom: RPS, p99 latency, queue depth
  • Scale-down stabilization: 5 min cooldown
  • Metrics server + Prometheus adapter
Pods: 5–500
📊VPA — Vertical Pod
  • Right-size CPU & memory requests/limits
  • Off — recommendation only (no auto-apply)
  • Initial — apply on pod creation only
  • Prevents OOMKill and CPU throttling
  • Goldilocks dashboard — view recommendations
Resource-aware
KEDA — Event-Driven
  • Scale on Kafka consumer lag (msg count)
  • Scale on SQS queue depth or age
  • Scale to zero — no traffic = 0 pods
  • ScaledObject CR per workload / trigger
  • Scale-up latency: < 15 seconds
Scale-to-zero
Istio Service Mesh — Traffic Rules
Service | Retries | Timeout | Circuit Breaker | Traffic Split | mTLS
User Service | 3× on 5xx | 2s | 5 consec. 5xx | Canary 10% | STRICT
Payment Service | 1× (idempotent) | 5s | 2 consec. 5xx | Stable only | STRICT
Search Service | 3× on 5xx | 1s | 10 consec. 5xx | Canary 20% | STRICT
Messaging | 0× (WS sticky) | 30s | 5 consec. 5xx | Hash-based | STRICT
Notification | 3× on 5xx | 10s | 5 consec. 5xx | Canary 5% | STRICT
Recommendation | 2× on 5xx | 500ms | 3 consec. 5xx | Shadow 100% | STRICT
GitOps Pipeline — CI/CD Flow
💻Git Push — PR + review
🔁CI Pipeline — Build · Test · Lint
🐳Docker Build — Scan · Sign (Cosign)
📦Registry Push — ECR / GCR · tag SHA
📋Helm Chart PR — Update image tag
🔀ArgoCD Sync — Git = cluster state
🐦Flagger Canary — Progressive rollout
SLO Gate Pass — 100% traffic → prod
12
Microservices
Domain-driven bounded contexts
1000+
Pods Peak
Auto-scaled by HPA/KEDA
mTLS
All Traffic
SPIFFE identity per pod
< 30s
Deploy Time
ArgoCD sync + rollout
< 15s
Scale-Up
KEDA event-driven pods
GitOps
Deploy Model
ArgoCD · Flagger · Helm
100%
Canary Coverage
All services via Flagger
99.99%
Uptime SLA
Multi-AZ active-active

Microservices run inside Kubernetes with Istio injecting an Envoy sidecar into every pod — providing transparent mTLS, traffic shaping, retries, and distributed tracing with zero application code changes.  |  Standards: CNCF · OCI · CloudEvents · OpenTelemetry · SPIFFE · Helm 3 · ArgoCD GitOps  |  2025 Production Design

06Event streaming & async

⑤ Event Streaming & Async Processing

Distributed messaging · Stream processing · Workflow orchestration · 10B+ events/day

Apache Kafka Apache Pulsar Apache Flink Temporal.io NATS JetStream Schema Registry Redis Streams Exactly-Once Event Sourcing CQRS · Saga · Outbox
10B+
Events / Day
< 5ms
P99 Latency
2M+
Msgs / Second
99.99%
Durability SLA
3
Replication Factor
30 Days
Retention Policy
500+
Topics
EOS
Exactly-Once
Core Components
🟢
Apache Kafka — Distributed Commit Log
  • Kafka 3.7+ — KRaft mode (no ZooKeeper)
  • Partitions: 1024 per topic, key-based routing
  • Replication: RF=3, ISR ≥ 2 for durability
  • Producers: acks=all, idempotent + transactions
  • Consumers: consumer groups, parallel reads
  • Tiered storage: S3 offload for cold data
  • Strimzi: Kubernetes operator deployment
KRaft · ISR · Idempotent · Tiered Storage
🔵
Apache Pulsar — Multi-Tenant Messaging
  • Compute/storage separated: Brokers + BookKeeper
  • Multi-tenancy: namespaces, quotas, auth per tenant
  • Geo-replication: async cross-region sync
  • Pulsar Functions: serverless stream processing
  • Subscription types: Exclusive, Shared, Key_Shared
  • Schema evolution: built-in registry + backward compat
  • Offload: S3/GCS/HDFS for long retention
BookKeeper · Geo-Replication · Functions · Multi-Tenant
Apache Flink — Stateful Stream Processing
  • Event-time: watermarks, out-of-order handling
  • Windows: tumbling, sliding, session, global
  • State backends: RocksDB for large state, FS for small
  • Checkpointing: distributed snapshots, exactly-once
  • CEP: complex event pattern matching
  • Flink SQL: streaming SQL on Kafka sources
  • K8s Operator: native Kubernetes deployment
Watermarks · RocksDB · CEP · Flink SQL
🔮
Temporal.io — Durable Workflow Engine
  • Durable execution: code survives crashes, restarts
  • Saga orchestration: compensating transactions, rollback
  • Timers: durable delays up to years
  • Signal / Query: external interactions mid-workflow
  • Versioning: safe workflow code migration
  • Visibility: search attributes, workflow history
  • Namespaces: multi-tenant isolation per domain
Saga · Durable · Compensate · Timers
🚀
NATS JetStream — Cloud-Native Messaging
  • Core NATS: < 1ms pub/sub, fire-and-forget
  • JetStream: persistence, replay, exactly-once
  • Work queues: load-balanced consumer groups
  • KV / Object store: built-in key-value store
  • Leaf nodes: edge connectivity, low-bandwidth sites
  • Clustering: Raft consensus, no external deps
  • IoT use case: MQTT bridge, lightweight clients
JetStream · Raft · KV Store · IoT/Edge
📋
Schema Registry — Confluent / AWS Glue
  • Avro / Protobuf / JSON Schema support
  • Schema evolution: backward, forward, full compat
  • Producer validation: reject non-conforming msgs
  • Consumer deserialization: auto schema fetch by ID
  • Schema ID embedded: 4-byte magic header in payload
  • Governance: subjects, versioning, soft delete
  • Data catalog integration: Apache Atlas / DataHub
Avro · Protobuf · Backward Compat · Schema ID
Redis Streams — Low-Latency Event Bus
  • XADD / XREAD: append-only log, O(1) writes
  • Consumer groups: parallel processing with ack
  • XPENDING / XCLAIM: ownership transfer on failure
  • MAXLEN ~: approximate trimming, O(1) memory cap
  • Use cases: activity feeds, real-time notifications
  • Latency: sub-millisecond for small payloads
  • Cluster mode: hash slot partitioning
XADD · Consumer Groups · Sub-ms · Notifications
🔁
DLQ & Error Handling — Retry / Poison Pill
  • Dead Letter Queue: isolate unprocessable msgs
  • Retry policy: exponential backoff + jitter
  • Max retries: configurable per consumer group
  • Poison pill: schema violation → DLQ + alert
  • DLQ inspector: manual review + replay UI
  • Alerting: DLQ depth threshold triggers PagerDuty
  • Archival: S3 long-term storage for forensics
Backoff · Replay · Poison Pill · Alert
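The "exponential backoff + jitter" retry policy above is commonly implemented as full jitter: each retry sleeps a uniform random time in [0, min(cap, base · 2^attempt)], which spreads retry storms out instead of synchronising them. A sketch with illustrative base and cap values:

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform sample from [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

rng = random.Random(7)  # seeded so the sketch is reproducible
delays = [backoff_delay(a, rng=rng) for a in range(10)]
# attempt 0 waits at most 0.1s; later attempts are capped at 30s
```

After the configured max retries, the consumer stops retrying and routes the message to the DLQ topic instead of blocking the partition.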
End-to-End Event Pipeline
Producers (services · IoT · mobile apps) → Schema Registry (validate & serialize) → Kafka Brokers (partition + replicate) → Flink Processing (transform · aggregate) → Temporal Workflows (orchestrate sagas) → Consumer Groups (microservices · workers) → DLQ Handler (retry / replay) → Data Sinks (DB · cache · warehouse)
🟢 Kafka Architecture Deep-Dive — Internals & Configuration
Cluster Configuration
Mode: KRaft (no ZooKeeper)
Brokers: 9 (3 per AZ × 3 AZs)
Replication Factor: 3 (ISR ≥ 2)
Min In-Sync Replicas: 2
Default Partitions: 24 per topic
Max Partitions: 1024 per topic
Retention (Hot): 7 days on NVMe
Retention (Cold): 30 days on S3
Compression: zstd (best ratio)
Max Message Size: 10 MB
Strimzi Operator · Tiered Storage · Cruise Control · MirrorMaker 2
Producer & Consumer Config
PRODUCER
acks: all (strongest durability)
idempotent: true (dedup)
transactional.id: per-service ID
batch.size: 64 KB
linger.ms: 5ms (micro-batching)
compression: zstd
CONSUMER
isolation.level: read_committed
auto.offset.reset: earliest
enable.auto.commit: false (manual)
max.poll.records: 500
session.timeout.ms: 45 000
Partition Layout & Consumer Groups
Topic: user-events (8 partitions, RF=3)
P0 Leader — Broker-1 · P1 Leader — Broker-2 · P2 Leader — Broker-3 · P3 Leader — Broker-1
P4 Replica — Broker-2 · P5 Replica — Broker-3 · P6 Replica — Broker-1 · P7 Replica — Broker-2
Consumer Groups
analytics-cg: 8 consumers
search-indexer-cg: 4 consumers
notification-cg: 4 consumers
audit-log-cg: 2 consumers
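Key-based routing onto the partitions above is hash(key) mod partition-count, so one key always maps to one partition and per-key ordering is preserved. Kafka's default partitioner uses murmur2; this sketch substitutes a stdlib hash purely to show the mechanism:

```python
import hashlib

def partition_for(key: str, num_partitions: int = 8) -> int:
    """Deterministic key -> partition mapping (stand-in for Kafka's murmur2 partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p_first = partition_for("user-1234")
p_again = partition_for("user-1234")   # same key -> same partition: per-key ordering holds
used = {partition_for(f"user-{i}") for i in range(200)}  # keys spread over partitions
```

This is also why increasing the partition count reshuffles key placement: existing keys may hash to different partitions afterwards, which breaks ordering assumptions unless handled deliberately.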
Event Patterns · Delivery Guarantees · Stream Processing · Monitoring
Event-Driven Patterns
📥 Event Sourcing
All state changes stored as immutable events. Current state derived by replaying event log. Enables time-travel queries and audit.
↔ CQRS
Separate Read (Query) from Write (Command) models. Write to Kafka → project into read-optimized stores (Elasticsearch, Redis).
📤 Transactional Outbox
Write event to DB outbox table + business data atomically. Debezium CDC polls outbox → publishes to Kafka. Eliminates dual-write.
🔄 Saga Pattern
Choreography (event-driven, decoupled) or Orchestration (Temporal, explicit control). Compensating transactions on failure.
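The Transactional Outbox pattern above can be sketched with sqlite3 standing in for PostgreSQL: the business row and the event row commit in a single transaction, and a relay (Debezium's role in this architecture) later reads unpublished rows and hands them to Kafka, eliminating the dual-write gap:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def create_order(order_id: str, total: float) -> None:
    # Business write and event write share ONE transaction: no dual-write gap.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order-events", json.dumps({"type": "OrderCreated", "id": order_id})),
        )

def relay_once(publish) -> int:
    # Debezium-style relay: read unpublished rows, publish, mark done.
    rows = conn.execute("SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)

published = []
create_order("o-1", 42.0)
count = relay_once(lambda topic, payload: published.append((topic, payload)))
```

If the process dies between the commit and the publish, the event is still in the outbox and the next relay pass picks it up, so delivery is at-least-once and consumers dedupe downstream.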
Delivery Guarantees
Semantic | Producer | Consumer | Use Case
At-Most-Once | acks=0 | Auto-commit before process | Metrics, telemetry
At-Least-Once | acks=all | Commit after process | Most services (idempotent)
Exactly-Once | Idempotent + Txn | read_committed + txn | Payments, inventory
Exactly-Once Semantics (EOS)
  • Producer: transactional.id + initTransactions()
  • Broker: PID epoch deduplication across restarts
  • Consumer: isolation.level=read_committed
  • Flink: 2-phase commit sink connector
  • Idempotency: dedupe key in payload + Redis TTL
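The last bullet (dedupe key in payload + Redis TTL) is the consumer-side half of exactly-once: under at-least-once redelivery, a processed-IDs set with a TTL makes the handler idempotent. A sketch with an in-memory store standing in for Redis SET NX EX:

```python
class DedupeStore:
    """In-memory stand-in for Redis `SET key 1 NX EX ttl`."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen: dict[str, float] = {}

    def first_time(self, key: str, now: float) -> bool:
        expiry = self.seen.get(key)
        if expiry is not None and expiry > now:
            return False          # duplicate within the TTL window
        self.seen[key] = now + self.ttl
        return True

processed = []

def handle(event: dict, store: DedupeStore, now: float) -> bool:
    if not store.first_time(event["event_id"], now):
        return False              # at-least-once redelivery: skip side effects
    processed.append(event["payload"])
    return True

store = DedupeStore(ttl_seconds=3600)
first = handle({"event_id": "e-1", "payload": "charge $10"}, store, now=0.0)
dup = handle({"event_id": "e-1", "payload": "charge $10"}, store, now=5.0)
```

The TTL bounds memory: it only needs to outlive the longest plausible redelivery window, not the full history of events.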
Flink Stream Processing
Windowing Strategies
  • Tumbling: fixed non-overlapping (1-min aggregates)
  • Sliding: overlapping windows (5-min/1-min slide)
  • Session: inactivity gap-based (user sessions)
  • Global: unbounded, trigger-controlled
Watermarks & Late Data
  • Event time: embedded timestamp in payload
  • Watermark: max-event-time − 5s allowed lateness
  • Side output: late events routed to DLQ topic
State & Checkpointing
  • State backend: RocksDB for TB-scale state
  • Checkpoint: every 30s → S3 (incremental)
  • Savepoint: manual for upgrades, migrations
  • Recovery: restart from last checkpoint
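The windowing, watermark, and late-data rules above can be shown without Flink: count events into 1-minute tumbling windows keyed by event time, advance the watermark to max event time minus 5s of allowed lateness, and route events whose window has already closed to a side output. A self-contained sketch:

```python
from collections import defaultdict

def tumbling_aggregate(events, window_ms=60_000, allowed_lateness_ms=5_000):
    """Count events per tumbling event-time window; too-late events go to a side output."""
    windows: dict[int, int] = defaultdict(int)
    side_output = []          # late events (would go to a DLQ topic)
    max_event_time = 0
    for ts, value in events:  # events arrive in processing order; ts is event time
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness_ms
        window_start = (ts // window_ms) * window_ms
        if window_start + window_ms <= watermark:
            side_output.append((ts, value))   # window already closed by the watermark
        else:
            windows[window_start] += 1
    return dict(windows), side_output

events = [
    (10_000, "a"), (20_000, "b"),   # window [0, 60s)
    (70_000, "c"),                  # advances watermark to 65s, closing [0, 60s)
    (30_000, "d"),                  # late: its window ended at 60s <= watermark 65s
]
windows, late = tumbling_aggregate(events)
```

Flink adds durable state and checkpointed recovery around exactly this logic; the decision of "count it or side-output it" is the same comparison of window end against the watermark.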
Observability & Monitoring
Key Kafka Metrics
Consumer Lag (P99): 1 200 msgs
Throughput: 1.8M msg/s
Produce Latency P99: 3.2ms
Disk Usage: 68%
Under-Replicated Partitions: 0
DLQ Depth: 42
Tooling Stack
Prometheus JMX · Grafana · Kafka UI · Cruise Control · Burrow Lag · OpenTelemetry
Temporal Saga — Order Fulfillment Workflow
1
CreateOrder — persist to DB, emit OrderCreated event to Kafka
2
ReserveInventory — call Inventory Service, reserve SKU quantity
3
ChargePayment — Stripe/Braintree API, idempotency key = orderId
4
AllocateShipping — assign carrier, generate tracking number
5
SendConfirmation — email/SMS notification via Notification Service
⟳ Compensating Transactions (on failure)
C3
RefundPayment — void/refund Stripe charge
C2
ReleaseInventory — restore reserved quantity
C1
CancelOrder — mark CANCELLED, emit OrderCancelled event
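The forward steps and compensations above pair up as a stack: each completed step pushes its compensator, and a failure unwinds them in reverse order (C3 → C2 → C1). A minimal orchestration sketch; in production Temporal would run each step as a durable activity, and the step names here are illustrative:

```python
def run_saga(steps):
    """Run (name, action, compensate) steps; on failure, compensate in reverse order."""
    log = []
    compensations = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            compensations.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for comp_name, comp in reversed(compensations):
                comp()                         # undo side effects of completed steps
                log.append(f"compensated:{comp_name}")
            return False, log
    return True, log

def ok(): pass
def boom(): raise RuntimeError("carrier unavailable")

success, log = run_saga([
    ("CreateOrder",      ok,   ok),   # compensator: CancelOrder
    ("ReserveInventory", ok,   ok),   # compensator: ReleaseInventory
    ("ChargePayment",    ok,   ok),   # compensator: RefundPayment
    ("AllocateShipping", boom, ok),   # fails -> unwind the three steps above
])
```

Because compensators can themselves be retried, they must be idempotent, which is why the payment step carries an idempotency key equal to the order ID.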
Backpressure & Auto-Scaling
Consumer Lag-Based Scaling (KEDA)
  • Trigger: Kafka consumer group lag > 10 000 msgs
  • ScaledObject: min=2, max=50 consumer pods
  • Scale-up: lag × 0.001 = target replicas
  • Cooldown: 60s before scale-down to prevent thrash
  • KEDA Scaler: kafka-topic-lag-percentage
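The scale-up rule above (target = lag × 0.001, i.e. one consumer pod per 1 000 lagging messages, clamped to the ScaledObject's min/max) is simple arithmetic:

```python
import math

def target_replicas(consumer_lag: int, msgs_per_replica: int = 1_000,
                    min_replicas: int = 2, max_replicas: int = 50) -> int:
    """KEDA-style sizing: one replica per msgs_per_replica of lag, clamped to [min, max]."""
    desired = math.ceil(consumer_lag / msgs_per_replica)
    return max(min_replicas, min(max_replicas, desired))

quiet = target_replicas(0)          # floored at min=2
busy = target_replicas(12_500)      # ceil(12.5) -> 13 replicas
storm = target_replicas(500_000)    # capped at max=50
```

The max clamp matters because consumer parallelism is also bounded by partition count: more consumer pods than partitions in a group simply sit idle.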
Backpressure Strategies
  • Producer throttle: broker quota enforcement per client
  • Rate limit topics: quota.producer.byte.rate per user
  • Avoid unbounded lag: alert at lag > 100K + PagerDuty P2
  • Flink: credit-based flow control between operators
  • RocksDB state: async I/O prevents blocking pipeline
Geo-Replication (MirrorMaker 2)
  • Active-Active: us-east-1 ↔ eu-west-1 bidirectional
  • Active-Passive: ap-southeast-1 (DR, async)
  • Offset sync: translated offsets per cluster
  • Heartbeat topic: lag between mirror clusters tracked
Topic Topology
Critical Topics
user-events: P=64, RF=3
order-commands: P=32, RF=3
payment-events: P=16, RF=3
notification-fanout: P=24, RF=3
audit-log: P=8, RF=3
dlq.*.errors: P=4, RF=3
cdc.postgres.*: P=8, RF=3
CDC Pipeline
PostgreSQL WAL → Debezium Connect → Kafka Topic → Search / Analytics
Debezium · Outbox CDC · EOS

07Data layer

⑥ Data Layer — Polyglot Persistence

Right database for the right job · ACID · Eventual Consistency · CAP Theorem · 10B+ users · Multi-region replication

PostgreSQL · Citus Apache Cassandra Redis · Valkey S3 · Object Storage Elasticsearch · OpenSearch MongoDB · Atlas ClickHouse · OLAP TiDB · CockroachDB CAP Theorem CQRS · Event Sourcing
8+
Database Types
<1ms
Redis P99
<5ms
Postgres P99
10PB+
Total Data
99.999%
Durability
Geo-Replication
RPO 0s
Sync Replication
RTO <30s
Failover
Core Data Stores — Polyglot Selection
🐘
PostgreSQL — Primary OLTP · Citus · Aurora
  • Engine: PostgreSQL 16 + Citus sharding extension
  • Hosted: AWS Aurora Postgres / RDS Multi-AZ
  • ACID: full transactions, MVCC, row-level locking
  • Read replicas: up to 15 Aurora replicas, load-balanced
  • Sharding: Citus distributed tables by tenant/user_id
  • Extensions: pgvector (AI), TimescaleDB, pg_partman
  • Pooling: PgBouncer transaction mode, 10K connections
ACID · Citus · Aurora · pgvector
🔶
Apache Cassandra — Wide-Column · Geo-Distributed
  • Model: wide-column, partition key → clustering key
  • Consistency: tunable — LOCAL_QUORUM for balance
  • Write path: Memtable → CommitLog → SSTable flush
  • Replication: NetworkTopologyStrategy, RF=3 per DC
  • Use cases: time-series, activity feeds, IoT, audit logs
  • Compaction: TWCS for time-series, LCS for reads
  • Alternatives: ScyllaDB (10× Cassandra throughput, C++)
Tunable Cons. · RF=3 · ScyllaDB · TWCS
Redis / Valkey — In-Memory · Cache · Streams
  • Cache-aside: L1 local → L2 Redis → L3 DB fallback
  • Data structures: String, Hash, List, Set, ZSet, Stream
  • Leaderboards: ZRANGEBYSCORE on sorted sets
  • Rate limiting: sliding window with sorted sets + Lua
  • Sessions: 30-min TTL, hashed session token key
  • Pub/Sub: real-time notifications, channel fan-out
  • Cluster: Redis Cluster, 16K hash slots, 3 primary+replica
Cache-Aside · ZSet · Cluster · Valkey
🪣
Object Storage — S3 · GCS · Data Lake
  • AWS S3: 11 nines durability, versioning, MFA delete
  • Data lake: raw zone → curated zone → consumption zone
  • Formats: Parquet (columnar), ORC, Avro, Delta Lake
  • Lifecycle: S3 Standard → IA (30d) → Glacier (90d)
  • Query: Athena (serverless SQL), S3 Select, Iceberg
  • CDN: CloudFront pre-signed URLs for media delivery
  • Encryption: SSE-KMS per bucket, TLS in-transit
Parquet · Delta Lake · Iceberg · Athena
🔍
Elasticsearch — Search · Analytics · OpenSearch
  • Full-text: inverted index, BM25 relevance scoring
  • Ingestion: Debezium CDC → Kafka → Logstash → ES
  • Aggregations: faceted search, histogram, geo-distance
  • KNN search: dense vector ANN with HNSW index
  • Index lifecycle: hot→warm→cold→frozen→delete (ILM)
  • Sharding: 5 primary shards, 1 replica, per-index routing
  • Hybrid: BM25 + dense vector reciprocal rank fusion
BM25 · HNSW · ILM · OpenSearch
🍃
MongoDB — Document Store · Atlas
  • Model: flexible BSON documents, nested arrays/objects
  • Atlas: fully managed, multi-cloud, global clusters
  • Transactions: multi-document ACID since v4.0
  • Change streams: real-time CDC via oplog tailing
  • Aggregation pipeline: $lookup, $unwind, $group, $facet
  • Atlas Search: full-text built-in (Lucene), no ES needed
  • Use cases: product catalog, user profiles, CMS content
BSON Docs · Atlas · Change Streams · ACID
📊
ClickHouse — Columnar OLAP · Real-time Analytics
  • Columnar: MergeTree family, vectorised execution
  • Ingestion: Kafka engine reads directly from topics
  • Speed: 10B+ row scans in seconds, 100× faster than PG
  • Compression: LZ4 / ZSTD per column, 10:1 ratio typical
  • Materialized views: real-time aggregations on ingest
  • ReplicatedMergeTree: HA via ZooKeeper / ClickHouse Keeper
  • Use cases: dashboards, funnels, cohorts, product analytics
MergeTree · Vectorised · Kafka Engine · Mat. Views
🌐
NewSQL / TiDB — Distributed SQL · Global ACID
  • TiDB: MySQL-compatible, horizontal scale-out OLTP
  • CockroachDB: Postgres-compatible, serializable isolation
  • Raft consensus: strong consistency across geo nodes
  • Auto-sharding: TiKV range-based partitioning, auto-rebalance
  • HTAP: TiFlash columnar replica for analytics queries
  • Global txn: cross-region ACID without 2PC latency trap
  • Use cases: global inventory, financial ledger, multi-region write
TiDB · CockroachDB · Raft · HTAP
Data Routing — Query to Store
🔀 Polyglot Query Routing Layer
API / Service (business logic) → Data Access Layer (repository pattern) → PostgreSQL (OLTP · ACID · writes + reads) | Redis (L2 cache · sessions) | Cassandra (feeds · time-series) | Elasticsearch (full-text search) | MongoDB (catalog · profiles) | ClickHouse (OLAP analytics) | S3 (data lake · media)
Read path: Redis (L2 cache) → PostgreSQL read replica → Elasticsearch (search)  |  Write path: PostgreSQL (primary) + Kafka CDC → Cassandra / Elasticsearch / ClickHouse  |  Analytics: Kafka → ClickHouse / S3 (Data Lake)
Database Selection · Replication · Caching Strategy · Connection Management
Database Selection Matrix
Use Case | Store | Why
User accounts | PostgreSQL | ACID, relations
Activity feeds | Cassandra | Write-heavy, time-series
Session store | Redis | Sub-ms, TTL-native
Product catalog | MongoDB | Flexible schema
Full-text search | Elasticsearch | Inverted index, BM25
Analytics queries | ClickHouse | Columnar, vectorised
Media / backups | S3 | Infinite scale, 11×9s
Global write ACID | TiDB | Distributed SQL
Rate limiting | Redis | Atomic ZSet + Lua
AI embeddings | pgvector | HNSW in Postgres
Replication & Consistency Models
CAP Theorem Positioning
PostgreSQL: CP · strong
Cassandra: AP · eventual
Redis Cluster: AP-leaning · async replication, best-effort consistency
MongoDB: CP · tunable
Elasticsearch: CP · near-RT
TiDB: CP · serializable
Replication Targets
Postgres RPO: 0s (sync replica)
Cassandra RF: 3 per DC (3 DCs)
Redis AOF: fsync every second
S3 Durability: 99.999999999%
Multi-Tier Caching Strategy
Cache Hierarchy
L1
In-process — Caffeine/Guava (JVM heap), 500ms TTL, hot data only, ~10K items
L2
Redis Cluster — shared cache, <1ms P99, 30 min TTL, 100GB per node
L3
Read Replica — Postgres replica, <5ms P99, for cache miss fallback
Cache Patterns
  • Cache-Aside: app reads cache, on miss fetches DB + populates
  • Write-Through: write DB + cache atomically (sessions)
  • Write-Behind: write cache, async flush to DB (counters)
  • Read-Through: cache proxies DB transparently
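Cache-Aside, the first pattern above, in miniature: a read tries the cache, falls through to the database on a miss, and repopulates the cache for the next reader. Plain dicts stand in for Redis and PostgreSQL (TTL omitted for brevity):

```python
db = {"user:1": {"name": "Linh"}}      # stand-in for PostgreSQL
cache: dict[str, dict] = {}            # stand-in for Redis
stats = {"hits": 0, "misses": 0}

def get_user(key: str):
    if key in cache:                   # 1. try the cache first
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = db.get(key)                # 2. miss: fall through to the DB
    if value is not None:
        cache[key] = value             # 3. populate the cache for the next reader
    return value

first = get_user("user:1")    # miss -> DB read -> cache fill
second = get_user("user:1")   # hit
```

With a TTL on every key (as the Redis Key Design bullets require), stale entries age out even if an invalidation event is missed.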
Cache Invalidation
  • TTL-based: expire after N seconds (simple, safe)
  • Event-driven: Kafka → cache invalidation consumer
  • Versioned keys: key = resource:v{version}, bump on write
Connection Pooling & Backup / DR
Connection Pooling
PgBouncer mode: transaction pooling
Pool size: 100 per shard
Max connections: 10 000 (via PgBouncer)
Idle timeout: 60s
Redis pool: Lettuce async, 32 conns
Backup Strategy
  • Postgres: continuous WAL → S3 (Barman / pgBackRest)
  • Cassandra: Medusa snapshots → S3 every 6h
  • MongoDB: Atlas continuous backup, point-in-time restore
  • Redis: RDB snapshot hourly + AOF every second
  • Retention: 7 days hot, 30 days warm, 1 year cold (Glacier)
Failover
Postgres RTO: < 30s (Aurora)
Redis Sentinel: < 10s failover
Cassandra: zero (AP, no failover needed)
Data Partitioning & Sharding Strategies
PostgreSQL — Citus Sharding
1
Distribution column: tenant_id or user_id
2
Shard count: 32 shards per table, consistent hash
3
Co-location: same tenant rows on same shard node
4
Reference tables: replicated to all shards (lookup tables)
5
Rebalance: online shard rebalancer moves shards with zero downtime
Partitioning (native PG)
events table: range by month
audit_logs: range by day
orders: hash by order_id
notifications: list by region
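Citus-style distribution in miniature: hash the distribution column to one of 32 shards, so every row for a tenant lands on the same shard and single-tenant joins stay local. The hash here is an illustrative stand-in, not Citus's actual hash function:

```python
import hashlib

NUM_SHARDS = 32

def shard_for(tenant_id: str) -> int:
    """Hash the distribution column to a shard (illustrative stand-in for Citus)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Co-location: all rows keyed by the same tenant hash to the same shard,
# so single-tenant joins never cross shard boundaries.
rows = [("t-42", "order-1"), ("t-42", "order-2"), ("t-7", "order-3")]
placement = {order_id: shard_for(tenant) for tenant, order_id in rows}
```

Reference tables sidestep this routing entirely by being replicated to every shard, which is why small lookup tables join cheaply against any distributed table.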
Cassandra — Partition Design
  • Partition key: user_id (even distribution)
  • Clustering key: created_at DESC (latest-first reads)
  • Max partition: < 100MB, < 100K rows
  • Hotspot: avoid by adding bucket to key
  • Anti-pattern: no SELECT * without partition key
Redis Key Design
  • Namespace: app:entity:id:field
  • Max value: < 100KB per key (avoid big keys)
  • Scan: SCAN with MATCH, never KEYS * in prod
  • Expiry: all cache keys must have TTL set
CDC Pipeline & Cross-Store Sync
Change Data Capture Flow
1
PostgreSQL WAL logical replication slot enabled
2
Debezium connector tails WAL → publishes to Kafka topic per table
3
Kafka consumers: Elasticsearch indexer, ClickHouse sink, search cache invalidator
4
Schema registry validates Avro payload before downstream write
5
Exactly-once — Kafka transactions + idempotent consumer
Sync Latency SLAs
PG → Elasticsearch: <2s
PG → ClickHouse: <5s
PG → Redis (invalidate): <500ms
Kafka → S3 (archive): <60s
Debezium · Kafka CDC · Avro Schema · Exactly-Once
Performance & Observability
Query Performance
PG read P99: <5ms
Redis P99: <1ms
ES search P99: <20ms
CH analytics P99: <2s
Cache hit rate: 92%
Key Monitoring
  • PG: pg_stat_statements, slow query log >100ms
  • Redis: latency monitor, eviction rate, memory fragmentation
  • Cassandra: nodetool tablestats, read/write latency histograms
  • ES: index size, search latency, JVM heap pressure
pgBadger · RedisInsight · nodetool · Grafana

08Data platform & analytics

⑦ Data Platform, Analytics & BI

Lakehouse · ELT · dbt · Airflow · Spark · Real-time Analytics · Governance · Self-serve BI · 10PB+ scale

Snowflake · BigQuery Apache Spark · Delta dbt · SQL Transforms Apache Airflow Superset · Looker · Tableau DataHub · OpenMetadata Apache Druid · Pinot Great Expectations · MonteCarlo Medallion Architecture Data Mesh · Governance
10PB+
Data Lake Size
500+
dbt Models
1000+
DAGs / Airflow
<1s
Druid Query P99
200+
BI Dashboards
99.9%
Pipeline SLA
Real-time
Streaming Analytics
Data Mesh
Domain Ownership
Core Platform Components
🏔
Data Warehouse — Snowflake · BigQuery · Redshift
  • Snowflake: virtual warehouses, auto-suspend/resume, zero-copy clone
  • BigQuery: serverless, slot-based, BI Engine for sub-second
  • Separation: compute/storage decoupled, pay-per-query
  • Data sharing: Snowflake Marketplace, secure live shares
  • Time travel: query historical data up to 90 days
  • Iceberg tables: open format, interop with Spark/Trino
  • Clustering: micro-partition pruning on cluster keys
Snowflake · BigQuery · Iceberg · Time Travel
Apache Spark — Batch · Streaming · Delta Lake
  • Spark 3.5: Adaptive Query Execution (AQE), dynamic pruning
  • Delta Lake: ACID on object storage, schema enforcement
  • Structured Streaming: micro-batch, watermarks, exactly-once
  • Photon engine: Databricks native vectorised C++ execution
  • Kubernetes: Spark Operator, dynamic resource allocation
  • Unity Catalog: fine-grained access, column-level masking
  • Optimization: Z-ordering, liquid clustering, auto-compact
Spark 3.5 · Delta Lake · AQE · Photon
🔧
dbt (Data Build Tool) — ELT · SQL Transforms · Lineage
  • Models: SQL SELECT statements, auto-generated DDL
  • Materializations: view, table, incremental, ephemeral
  • Incremental: merge strategy — only process new/changed rows
  • Tests: not_null, unique, accepted_values, relationships
  • Lineage graph: DAG from sources → staging → marts
  • dbt Metrics: semantic layer — single source of metric truth
  • CI: dbt test + dbt build in GitHub Actions on every PR
Incremental · Semantic Layer · Lineage DAG · Tests
🌀
Apache Airflow — Orchestration · DAG Scheduling
  • Airflow 2.8: TaskFlow API, dynamic task mapping
  • Executors: KubernetesExecutor — each task = isolated pod
  • DAG versioning: git-sync sidecar pulls DAGs from repo
  • Sensors: S3KeySensor, ExternalTaskSensor, event-driven
  • Dataset-aware: trigger DAGs on upstream dataset updates
  • SLA alerts: PagerDuty on missed SLA, auto-retry 3×
  • Secrets: Airflow Connections backed by HashiCorp Vault
TaskFlow · K8s Executor · Dataset-aware · Vault
📈
BI & Dashboards — Superset · Looker · Tableau
  • Apache Superset: open-source, 40+ chart types, SQL Lab
  • Looker: LookML semantic model, embedded analytics
  • Tableau: executive dashboards, Tableau Server SSO
  • Metabase: self-serve for non-technical stakeholders
  • Cube.dev: headless BI, pre-aggregations, API layer
  • Caching: Superset result cache (Redis), query dedup
  • Embedding: iframe + JWT signed for product analytics
Superset · Looker · Cube.dev · Embedded
🗂
Data Catalog & Governance — DataHub · OpenMetadata · Atlas
  • DataHub: auto-discovery, lineage, ownership metadata
  • OpenMetadata: unified catalog, profiling, quality scores
  • Column lineage: field-level tracing from source to dashboard
  • PII tagging: auto-classify sensitive columns, masking policies
  • Data contracts: schema agreements between producers/consumers
  • RBAC: row/column-level security per team/role in Snowflake
  • GDPR: right-to-erasure automation, PII inventory registry
DataHub · Lineage · PII Masking · GDPR
Real-time Analytics — Apache Druid · Pinot · ClickHouse
  • Apache Druid: sub-second OLAP, Kafka-native ingestion
  • Apache Pinot: user-facing analytics, <10ms P99 at scale
  • ClickHouse: columnar, 10B rows/s scan, materialized views
  • Star-tree index: Pinot pre-aggregation for low-latency rollups
  • Lambda: batch (Spark) + speed (Druid) layers unified via API
  • Kappa: single Flink/Kafka stream, no batch layer
  • Use cases: product funnels, anomaly detection, user cohorts
Druid · Pinot · Star-Tree · Sub-second
Data Quality & Observability — Great Expectations · Monte Carlo
  • Great Expectations: expectation suites, data docs, checkpoints
  • Monte Carlo: ML-based anomaly detection, circuit breaker
  • dbt tests: freshness, uniqueness, referential integrity checks
  • Soda Core: SQL-based checks, SodaCL YAML config
  • Freshness: alert if table not updated within SLA window
  • Volume: row count deviation >20% triggers incident
  • Schema drift: auto-detect column add/remove/type change
Great Expect. · Monte Carlo · Soda Core · Schema Drift
Modern Data Stack — End-to-End Pipeline
🔄 Lakehouse ELT Pipeline: Ingest → Transform → Serve
Sources (app DBs · APIs · SaaS) → Ingestion (Fivetran · Debezium · Airbyte) → Bronze — S3 / GCS (raw · Avro · Parquet · Delta) → Spark Processing (clean · join · deduplicate) → Silver — Delta Lake (validated · conformed) → dbt Models (aggregate · business logic) → Gold — Warehouse (Snowflake · BigQuery) → BI / APIs (Superset · Looker · Cube.dev)
Orchestration: Airflow DAGs trigger each stage, dataset-aware dependencies  |  Quality gates: Great Expectations checks at Bronze→Silver, dbt tests at Silver→Gold  |  Real-time fork: Kafka → Druid/Pinot for sub-second serving (bypass batch)
Medallion Architecture · dbt DAG · Data Mesh · Quality SLAs
Medallion Architecture — Bronze · Silver · Gold
🟤 Bronze — Raw Zone
Exact copy of source. No transformations. Schema-on-read. Append-only. Partitioned by ingestion date. Retention: 2 years. Format: Parquet / Avro / JSON.
Raw Ingest · Append-only · S3
🔵 Silver — Cleansed Zone
Cleaned, deduplicated, validated, type-cast. Schema-on-write. Delta Lake with ACID. Conformed dimensions. Spark jobs scheduled by Airflow.
Delta ACID · Dedup · Validated
🟢 Gold — Business Zone
Business-ready aggregates, star schema, wide tables. dbt models materialised into Snowflake. Optimised for BI query patterns, role-based access.
Star Schema · dbt · Snowflake
dbt Model Lineage — DAG Example
Order Analytics Domain
src_postgres__orders → stg_orders
src_postgres__users → stg_users
stg_orders → int_order_items_enriched
int_order_items_enriched → fct_orders
stg_users → dim_customers
fct_orders → mart_revenue_daily
Model Config
fct_orders: incremental + unique_key
dim_customers: SCD Type 2 snapshot
mart_revenue: table + cluster_by date
stg_* models: view (no cost)
Data Mesh — Domain Ownership
4 Data Mesh Principles
1. Domain ownership: each business domain owns its data products (Orders, Users, Payments)
2. Data as product: domains publish discoverable, SLA-backed datasets via catalog
3. Self-serve infra: platform team provides templates, tooling, compute — domains self-onboard
4. Federated governance: global policies (PII, retention) + domain-level control
Domain Data Products
Orders domain: fct_orders · mart_revenue
Users domain: dim_customers · churn_risk
Payments domain: fct_transactions · fraud
Marketing domain: attribution · funnels
Data Product · SLA Contract · Federated Gov.
Data Quality SLAs & Observability
Quality Dimensions
Completeness: >99%
Freshness (Gold): <1 hr
Uniqueness: >99.9%
Volume anomaly: ±20% alert
Schema drift alerts: 0 silent
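The ±20% volume rule above reduces to a one-line deviation check; the function name and thresholds are illustrative:

```python
def volume_anomaly(row_count: int, baseline: int, threshold: float = 0.20) -> bool:
    """True when today's row count deviates more than ±threshold from baseline."""
    if baseline == 0:
        # No baseline yet: any rows at all counts as a deviation to inspect
        return row_count != 0
    return abs(row_count - baseline) / baseline > threshold

assert volume_anomaly(1_300_000, 1_000_000)      # +30% → trigger incident
assert not volume_anomaly(1_100_000, 1_000_000)  # +10% → within tolerance
```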
Pipeline SLA Monitoring
  • Airflow SLA Miss: PagerDuty P2, auto-retry 3× with backoff
  • dbt test failure: block downstream DAGs, Slack alert
  • Monte Carlo circuit breaker: halt dashboard refresh on anomaly
  • Data contracts: producer schema change requires consumer sign-off
Cost Controls
  • Snowflake: resource monitors, query timeout 60s, auto-suspend 5 min
  • Spark: spot instances 70%, auto-termination on idle
Ingestion Tools & ELT Patterns
Managed Connectors
Fivetran: 200+ SaaS connectors
Airbyte: open-source ELT
Debezium: CDC from PG/MySQL
Kafka Connect: streaming ingest
dlt (data load tool): Python-native ELT
Ingestion Patterns
1. Full refresh: truncate + reload (small tables, daily)
2. Incremental: updated_at watermark, append new rows
3. CDC: WAL → Kafka → Bronze (true real-time)
4. Log-based: event stream → Kinesis → S3 raw zone
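Pattern 2 (incremental watermark) can be sketched end-to-end: fetch only rows modified since the last run, then advance the watermark. The in-memory SQLite table stands in for the real source DB; names are illustrative:

```python
import sqlite3

# Source table with an updated_at column
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2025-01-01T00:00:00"),
    (2, 20.0, "2025-01-02T00:00:00"),
    (3, 30.0, "2025-01-03T00:00:00"),
])

def incremental_pull(conn, watermark: str):
    """Fetch rows newer than the watermark; return the rows and the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at", (watermark,)
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Last run stopped at noon on Jan 1; only rows 2 and 3 are pulled
rows, wm = incremental_pull(src, "2025-01-01T12:00:00")
```

The watermark would normally be persisted (e.g. in a state table) so each run appends only the delta to the Bronze zone.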
ELT vs ETL Choice
Aspect | ELT | ETL
Transform where | In warehouse | In pipeline
Raw data kept | Yes ✓ | Often no
Reprocessing | Easy (replay) | Hard
Tool | dbt | Spark / Flink
Best for | SQL analytics | Complex logic
Airflow Schedule
Bronze load: every 15 min
Silver transform: hourly
Gold dbt run: every 4 hours
BI cache refresh: every 30 min
Real-time Analytics Architecture
Lambda Architecture
⚡ Speed Layer
Kafka → Flink → Druid/Pinot
Sub-second, approximate
Recent data (7 days)
Low latency user queries
🏋 Batch Layer
S3 → Spark → Snowflake
Accurate, complete
Historical all-time
Complex aggregations
Apache Druid Config
Ingestion: Kafka native, real-time
Query latency: <1s P99 (10B rows)
Roll-up: pre-aggregation at ingest time
Segments: time-partitioned, 500 MB
Retention: 90 days hot, S3 deep storage
Apache Pinot — User-Facing
  • Star-tree index: pre-aggregated for common rollup queries
  • Sorted index: range filters on timestamp, user_id
  • Upsert support: latest value via primary key dedup
  • Multi-stage: distributed join engine for complex queries
Semantic Layer & Metrics Store
dbt Semantic Layer
  • Metrics: revenue, DAU, churn defined once in dbt
  • Single truth: all tools query same metric definition
  • Dimensions: date, region, product, user_segment
  • MetricFlow: compile metric queries to SQL per DW
Cube.dev (Headless BI)
  • Pre-aggregations: materialise rollups for <100ms queries
  • REST + GraphQL API: serve metrics to any frontend
  • Access control: row-level security per user role
Key Business Metrics
DAU / MAU: Pinot real-time
Revenue: dbt fct_orders
Churn rate: ML model → BQ
Funnel conversion: Druid + Superset
MetricFlow · Cube.dev · dbt Metrics · Single Source

09 · Observability

📡
Observability Platform — Detailed Architecture
Metrics · Logs · Traces · Alerting · SLO · Incident Response · 2025
Observability
3 Pillars
M · L · T
1M+
Metrics / min
100TB+
Logs / day
99.9%
Trace Coverage
< 1 min
Alert Trigger
< 5 min
MTTA
99.95%
SLO Target
30 days
Retention (hot)
Observability Components — Technology Stacks
📈
Prometheus + Grafana
Metrics Collection · Dashboards · Alerting
Metrics Pillar
Stack
Prometheus 3.x · Grafana 11 · Thanos (long-term) · VictoriaMetrics
Collection
  • Pull model — scrape /metrics every 15s
  • Pushgateway — batch jobs, short-lived pods
  • Remote Write — fan-out to Thanos / Mimir
  • Service discovery via Kubernetes SD
Key Metric Types
Counter · Gauge · Histogram · Summary
Grafana Features
  • Unified dashboards — metrics + logs + traces
  • Alerting — multi-dimensional, silencing
  • OnCall — built-in on-call scheduling
Query
PromQL · HTTP API · Remote Write
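What a Prometheus scrape actually pulls is a plain-text exposition document. Production services use a client library (e.g. prometheus_client) rather than building it by hand; this hand-rolled renderer is only a sketch of the format, with illustrative metric names:

```python
def render_metrics(metrics: list) -> str:
    """Render (name, type, labels, value) tuples into Prometheus text exposition format."""
    lines = []
    typed = set()
    for name, mtype, labels, value in metrics:
        if name not in typed:                 # one TYPE line per metric family
            lines.append(f"# TYPE {name} {mtype}")
            typed.add(name)
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics([
    ("http_requests_total", "counter", {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", "counter", {"method": "POST", "status": "500"}, 3),
    ("process_open_fds", "gauge", {}, 42),
])
print(body)
```

Each 15-second pull retrieves a document like this; Prometheus stores the samples and PromQL queries aggregate across the label dimensions.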
🔍
Distributed Tracing
Jaeger · Tempo · OpenTelemetry
Traces Pillar
Stack
Jaeger · Grafana Tempo · Zipkin · AWS X-Ray
Instrumentation
  • OpenTelemetry SDK — auto & manual spans
  • Context propagation — W3C TraceContext header
  • Baggage propagation — user-ID, tenant-ID
  • Sampling — head-based 1%, tail-based 100% on errors
Span Attributes
service.name · http.status_code · db.statement · span.kind
Storage
  • Tempo backend: object storage (S3 / GCS)
  • Parquet format — columnar, fast query
Protocols
OTLP/gRPC · OTLP/HTTP · Jaeger Thrift
📋
Log Aggregation
ELK · Loki · Structured Logging
Logs Pillar
Stack
Grafana Loki · Elasticsearch · Kibana · AWS OpenSearch
Collection Pipeline
  • Fluent Bit (DaemonSet) — collect pod logs
  • Logstash / Vector — parse, enrich, route
  • Kafka buffer — decouple collection from storage
  • Structured JSON logs — no regex parsing
Log Fields (Standard)
timestamp · level · trace_id · service · user_id · duration_ms
Retention
  • Hot (Elasticsearch / Loki): 30 days — fast query
  • Warm (S3 / GCS): 1 year — compressed Parquet
  • Cold (Glacier): 7 years — compliance archive
Query
LogQL · Lucene · KQL
🚨
Incident Management
PagerDuty · Opsgenie · Runbooks
SRE On-Call
Stack
PagerDuty · Opsgenie · FireHydrant · Incident.io
Process
  • Alert fires → routed by service ownership
  • On-call paged via SMS + push + phone call
  • Slack war room auto-created with context
  • Incident declared → severity P1–P4
  • Postmortem auto-template after resolution
Runbook Automation
  • Automated diagnostics on alert trigger
  • Self-healing scripts (pod restart, cache flush)
Channels
SMS · Push · Phone · Slack · Email
🎯
SLI / SLO / SLA Tracking
Error Budget · Burn Rate · Reliability
99.95% Target
Tools
Sloth (Prometheus) · Nobl9 · OpenSLO spec · Grafana SLO plugin
Key SLIs
  • Availability: % requests returning 2xx
  • Latency: % requests completing < 200ms
  • Error rate: % requests returning 5xx
  • Throughput: requests per second sustained
Error Budget
  • 99.95% SLO = 21.9 min/month error budget
  • Burn rate alert: 14× at 1h → page immediately
  • Budget exhausted → freeze non-critical releases
Query
PromQL · OpenSLO YAML
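The budget arithmetic above, as a quick check (using an average 30.44-day month, which yields the 21.9-minute figure; the function names are illustrative):

```python
# Average month in minutes: 365.25 days/year ÷ 12 months × 24h × 60min ≈ 43,830
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12

def error_budget_minutes(slo: float) -> float:
    """Minutes of allowed unavailability per month at a given SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_rate / (1 - slo)

budget = error_budget_minutes(0.9995)   # ≈ 21.9 min / month
rate = burn_rate(0.007, 0.9995)         # 0.7% errors at a 99.95% SLO ≈ 14× burn
```

A sustained 14× burn would exhaust the whole monthly budget in roughly 43,830 / 14 ≈ 3,130 minutes, which is why the 1-hour 14× window pages immediately.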
🤖
Synthetic Monitoring
Proactive · Outside-In · Canary
24/7 Probing
Tools
Grafana k6 · Datadog Synthetics · Checkly · Playwright Cloud
Check Types
  • HTTP checks — endpoint up, status, latency
  • Browser checks — full user journey (Playwright)
  • API checks — contract correctness assertions
  • Multi-step transactions — login → checkout
  • SSL cert expiry — alert at 30/14/7 days
Frequency & Locations
Every 60s · 20+ global probe locations
Protocols
HTTP(S) · gRPC · TCP/DNS
👥
Real User Monitoring
Core Web Vitals · Session · UX
Client-Side
Tools
Grafana Faro · Datadog RUM · Elastic RUM · Sentry Performance
Core Web Vitals
  • LCP — Largest Contentful Paint (< 2.5s target)
  • INP — Interaction to Next Paint (< 200ms)
  • CLS — Cumulative Layout Shift (< 0.1)
  • TTFB — Time to First Byte (< 800ms)
Session Data
  • Session replay — privacy-safe pixel replay
  • JS error rate — stack trace + breadcrumbs
  • Resource timing — waterfall breakdown
Protocols
Beacon API · XHR/Fetch · Web Vitals API
🔭
OpenTelemetry
Unified Instrumentation · Collector
CNCF Standard
Components
OTel SDK (all langs) · OTel Collector · Auto-instrumentation · Semantic Conventions
Collector Pipeline
  • Receivers — OTLP, Jaeger, Zipkin, Prometheus
  • Processors — batch, filter, attribute enrichment
  • Exporters — Tempo, Loki, Prometheus, S3
  • Tail sampling — 100% errors, 1% success
SDK Languages
Go · Java · Python · Node.js · Rust · .NET · Ruby · PHP · C++
Signals
Traces · Metrics · Logs · Profiles
The Three Pillars of Observability
📊
Metrics — What is happening?
Quantitative measurements over time — aggregated, cheap to store
  • Infrastructure: CPU, RAM, disk I/O, network bytes, pod restarts
  • Application: request rate, error rate, latency percentiles (p50/p95/p99)
  • Business: orders/sec, revenue/min, active users, conversion rate
  • SLO: availability ratio, error budget remaining, burn rate
  • Database: query latency, connection pool, cache hit ratio
  • Queue: consumer lag, message rate, partition count
Prometheus · PromQL · Grafana · Thanos
📄
Logs — Why did it happen?
Event records with rich context — searchable, verbose, structured JSON
  • Application logs: request/response details, business events, errors
  • Access logs: NGINX / Envoy — IP, path, status, duration, user-agent
  • Audit logs: who did what, when — immutable, tamper-proof (WORM)
  • Security logs: auth attempts, permission denials, anomalies
  • Slow query logs: DB queries exceeding 100ms threshold
  • Correlation: trace_id on every log line links log ↔ trace
Loki · Elasticsearch · Fluent Bit · LogQL
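A structured JSON log line carrying the standard fields (including the correlating trace_id) can be sketched with the stdlib logging module; the JsonFormatter class, service name, and field choices here are illustrative assumptions, not a specific library's API:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line — no regex parsing needed."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",                          # illustrative service name
            "trace_id": getattr(record, "trace_id", None),  # links log ↔ trace
            "message": record.getMessage(),
            "duration_ms": getattr(record, "duration_ms", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# extra= attaches trace_id / duration_ms as attributes on the LogRecord
logger.info("payment captured",
            extra={"trace_id": "4bf92f3577b34da6", "duration_ms": 38})
```

Because every line is valid JSON with a trace_id, a log query in Loki or Elasticsearch can pivot straight to the matching trace in Tempo or Jaeger.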
🔗
Traces — Where did it happen?
End-to-end request journey across services — causal, low-level, sampled
  • Trace: one root span + child spans forming a DAG across services
  • Span: single operation with start/end time, attributes, events, status
  • Context propagation: W3C traceparent header flows across all hops
  • Service map: auto-generated from trace data — shows dependencies
  • Flame graph: visual breakdown of where latency is spent per span
  • Error attribution: root-cause service identified from 5xx span
Jaeger · Tempo · OpenTelemetry · TraceQL
Observability Data Pipeline
⚙️
Services / Pods
OTel SDK auto-instrument
OTLP / Prometheus / log stdout
🔭
OTel Collector
Receive · Process · Export
Fan-out to backends
📦
Kafka Buffer
Decouple + replay
Stream consume
🗄️
Signal Stores
Prometheus · Loki · Tempo
Query API
📊
Grafana
Unified dashboards
Alert evaluation
🚨
Alertmanager
Route · Dedupe · Silence
Notify on-call
📱
PagerDuty
Escalation policy
Page SRE
👷
SRE On-Call
Investigate → resolve
Alert Severity Levels
P1 — Critical · ack 5 min
Total outage · data loss risk · security breach. Full on-call team + management notified. War room opened immediately in Slack.
P2 — High · ack 15 min
Partial outage · degraded performance. Primary on-call engineer paged. SLO burn rate > 14× in 1h window.
P3 — Medium · ack 1 hr
Non-critical degradation · elevated errors. Ticket created, handled next business hours. Error budget burn-rate alert.
P4 — Low · ack 24 hr
Warning · approaching threshold. Slack channel notification only. No immediate action required.
On-Call Escalation Policy
1. Primary On-Call Engineer (0–5 min)
Paged via mobile push + SMS. Expected to acknowledge within 5 min.
2. Secondary On-Call, if no ack (5–15 min)
Backup engineer paged. Parallel phone call initiated. Slack bridge created.
3. Engineering Manager (15–30 min)
Manager paged. Stakeholder comms drafted. Customer status page updated.
4. VP Engineering + Incident Commander (30+ min)
Executive bridge. External comms. Customer notifications sent. SLA clock running.
SLO Definitions — Production Services
Service | SLI | SLO Target | Error Budget/mo | Window | Status
API Gateway | Availability (2xx rate) | 99.99% | 4.3 min | 30-day rolling | ✓ Met
API Gateway | Latency p99 < 200ms | 99.5% | 3.6 hrs | 30-day rolling | ✓ Met
Payment Svc | Success rate (non-5xx) | 99.99% | 4.3 min | 30-day rolling | ✓ Met
User Service | Availability | 99.95% | 21.9 min | 30-day rolling | ⚠ 89%
Search Svc | Latency p95 < 500ms | 99.9% | 43.8 min | 30-day rolling | ✓ Met
Messaging | Message delivery < 1s | 99.9% | 43.8 min | 7-day rolling | ✓ Met
CDN / Edge | Cache hit rate > 90% | 99.0% | 7.2 hrs | 30-day rolling | ✓ Met
Key Grafana Dashboard Types
🏠
System Overview
All services health at a glance · RED metrics (Rate/Error/Duration) · SLO burn rate summary
🔬
Service Deep Dive
Per-service latency heatmap · Error breakdown by endpoint · Trace → log correlation links
🐳
Kubernetes Cluster
Node CPU / memory / disk · Pod restarts and OOMKills · HPA scaling events timeline
💰
Business Metrics
Orders/revenue per minute · Funnel conversion rates · Active users by region
1M+
Metrics / min
Prometheus scrape rate
100 TB
Logs / day
Fluent Bit → Loki
99.9%
Trace Coverage
All error paths traced
< 1 min
Alert Trigger
Prometheus eval interval
< 5 min
MTTA
Mean time to acknowledge
30 days
Hot Retention
Fast query tier
1 year
Warm Retention
S3 compressed Parquet
7 years
Cold Archive
Compliance / Glacier

Observability enables teams to understand system behaviour in production without manual debugging. The three pillars — Metrics, Logs, Traces — must be correlated via a shared trace_id to achieve true observability.  |  Standards: OpenTelemetry CNCF · OpenSLO · SRE Book (Google) · W3C TraceContext  |  2025 Production Design

10 · DevOps

⚙ DevOps & Platform Engineering

CI/CD · GitOps · IaC · Progressive Delivery · Developer Experience · DORA Elite Performance

GitHub Actions ArgoCD · Flux Terraform · Pulumi Harbor · Cosign Kubernetes · Karpenter LaunchDarkly · OpenFeature Flagger · Argo Rollouts Backstage IDP DORA Elite GitOps · SRE
< 1hr
Lead Time
50+/day
Deployments
< 5min
MTTR
< 1%
Change Failure
99.9%
Pipeline Uptime
3 min
Build Time P50
0-touch
Prod Deploy
ELITE
DORA Band
Core Platform Components
🔄
CI/CD Pipeline: GitHub Actions · Tekton
  • GitHub Actions: matrix builds, reusable workflows
  • Tekton Pipelines: K8s-native CI for heavy workloads
  • Build cache: layer + Bazel remote cache (80% hit)
  • Parallelism: unit, integration, security scans in parallel
  • Artifacts: SBOM, attestation, signed OCI image
  • Branch strategy: trunk-based development, short-lived PRs
  • Environments: dev → staging → prod with gates
Actions · Tekton · Bazel Cache · SBOM
🌿
GitOps: ArgoCD · Flux CD
  • ArgoCD: declarative GitOps, ApplicationSets
  • Flux CD: image reflector, multi-tenancy support
  • Helm: templated manifests, OCI chart registry
  • Kustomize: overlay-based environment patches
  • Drift detection: auto-sync on manifest divergence
  • Rollback: git revert = instant cluster rollback
  • App of Apps: hierarchical dependency management
ArgoCD · Flux · Helm · Kustomize
🏗
Infrastructure as Code: Terraform · Pulumi · Crossplane
  • Terraform: cloud infra, modular + remote state (S3+DynamoDB)
  • Pulumi: TypeScript/Python IaC for complex logic
  • Crossplane: K8s-native cloud resource provisioning
  • Atlantis: PR-based Terraform plan/apply automation
  • Drift detection: daily Terraform plan scheduled scan
  • Policy: OPA/Conftest enforces tagging, cost controls
  • Modules: internal registry for approved patterns
Terraform · Pulumi · Crossplane · Atlantis
📦
Container Registry: Harbor · ECR · Cosign
  • Harbor: self-hosted OCI registry with Helm charts
  • ECR / GCR: managed regional mirrors for low latency
  • Cosign (Sigstore): keyless image signing + verification
  • Trivy: vulnerability scan on every image push
  • SBOM: Syft generates CycloneDX SBOM per image
  • Admission: Kyverno blocks unsigned/critical-CVE images
  • Replication: geo-replicated for multi-region pull speed
Harbor · Cosign · Trivy · SLSA L3
Kubernetes Platform: EKS · GKE · Karpenter
  • Cluster API: declarative cluster lifecycle management
  • Karpenter: node autoprovisioning, right-sized instances
  • Multi-cluster: prod / staging / dev isolated clusters
  • Namespaces: per-team isolation with ResourceQuotas
  • OPA Gatekeeper: policy enforcement at admission
  • Velero: cluster backup + disaster recovery
  • Kubecost: per-team cost attribution and budgets
Cluster API · Karpenter · Gatekeeper · Velero
🚩
Feature Flags: LaunchDarkly · OpenFeature
  • LaunchDarkly: targeting rules, % rollout, segments
  • OpenFeature: vendor-neutral SDK standard (CNCF)
  • Experimentation: A/B test tied to flag + metrics
  • Kill switch: instant feature disable without deploy
  • Context-aware: user, region, plan-based targeting
  • Audit log: who changed what flag and when
  • SDK: server-side (Node/Go/Java) + client-side (React)
LaunchDarkly · OpenFeature · A/B Test · Kill Switch
🚀
Progressive Delivery: Flagger · Argo Rollouts
  • Flagger: automated canary analysis with Istio/Nginx
  • Argo Rollouts: BlueGreen + canary with analysis runs
  • SLO gate: promote only if error rate < 1% + p99 < 200ms
  • Traffic split: Istio VirtualService weight shifting (5%→25%→100%)
  • Auto-rollback: instant on metric breach, zero manual action
  • Shadow mode: mirror traffic to new version, no user impact
  • Header routing: internal users get canary via x-canary header
Flagger · Canary · BlueGreen · SLO Gate
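The SLO gate that drives each canary step can be sketched as a promotion function: advance the traffic weight only while error rate stays under 1% and p99 under 200 ms, otherwise signal rollback. The thresholds follow the bullets above; the function itself is illustrative, not the Flagger API:

```python
WEIGHTS = [5, 25, 50, 100]   # canary traffic percentages, in promotion order

def next_weight(current: int, error_rate: float, p99_ms: float):
    """Return the next traffic weight, or None to signal auto-rollback."""
    if error_rate >= 0.01 or p99_ms >= 200:
        return None                              # metric breach → roll back
    idx = WEIGHTS.index(current)
    return WEIGHTS[min(idx + 1, len(WEIGHTS) - 1)]

assert next_weight(5, 0.002, 120) == 25      # healthy → promote
assert next_weight(50, 0.030, 120) is None   # error-rate breach → rollback
assert next_weight(100, 0.002, 120) == 100   # fully promoted, stay there
```

Flagger evaluates metrics like these from Prometheus at each step interval, which is how a bad release never reaches more than the current weight's share of users.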
🧑‍💻
Developer Experience: Backstage · Dev Containers
  • Backstage IDP: service catalog, golden path templates
  • Dev Containers: reproducible local dev via devcontainer.json
  • Telepresence: local code ↔ K8s cluster live bridge
  • Self-service: scaffold new service in < 5 min via Backstage
  • Docs-as-code: TechDocs auto-generated from repo markdown
  • API catalog: OpenAPI/AsyncAPI registered in Backstage
  • Scorecards: automated ownership, SLO, security health checks
Backstage · DevContainer · Telepresence · TechDocs
End-to-End CI/CD Pipeline — Code to Production
🔄 Automated Pipeline Stages
① Code Push + PR: trunk-based dev · short-lived branch · conventional commits
② Build & Unit Test: Docker BuildKit · Bazel cache hit · matrix: OS × arch
③ Security Scan: Semgrep SAST · Trivy image scan · Snyk SCA deps
④ Sign & Attest: Cosign keyless sign · SBOM generation · SLSA provenance
⑤ Push to Registry: Harbor + ECR mirror · Kyverno admission · CVE gate blocks CRITICAL
⑥ ArgoCD Sync: update Helm values · Git PR merged · ArgoCD auto-sync
⑦ Canary (Flagger): 5%→25%→50%→100% · SLO analysis · auto-rollback
⑧ Observe & Verify: Grafana SLO · synthetic checks · Slack notify
DORA Metrics · GitOps Reconciliation · Release Strategies · Platform SRE
DORA Elite Performance Metrics
Metric | Our Target | Band
Deployment Frequency | 50+ / day | Elite
Lead Time for Changes | < 1 hour | Elite
Time to Restore (MTTR) | < 5 min | Elite
Change Failure Rate | < 1% | Elite
Reliability (SLO) | 99.99% | High
Pipeline Efficiency
Build Time P50: 3 min
Test Coverage: 87%
Cache Hit Rate: 82%
PR Merge Time: 42 min
Rollback Rate: 0.4%
GitOps Reconciliation Loop
1. Developer merges PR → Helm values updated in config repo
2. ArgoCD detects Git diff (polls every 3 min or webhook)
3. Diff computed between desired state (Git) and live state (K8s)
4. Kyverno admission validates manifests against policies
5. ArgoCD applies manifests → K8s API server
6. Flagger intercepts Deployment → starts canary analysis
7. Grafana SLO gate passes → promote to 100% traffic
Sync interval: 3 min (+ webhook)
Drift auto-heal: enabled
Rollback trigger: git revert + push
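Step 3 of the loop, the desired-vs-live diff, can be sketched as a dictionary comparison. The dict-based state model and resource names here are illustrative, not ArgoCD's actual data structures:

```python
def diff_states(desired: dict, live: dict) -> dict:
    """Compare manifests rendered from Git (desired) against the cluster (live)."""
    out_of_sync = {}
    for name, manifest in desired.items():
        if live.get(name) != manifest:
            out_of_sync[name] = {"desired": manifest, "live": live.get(name)}
    # Live resources absent from Git are candidates for pruning
    pruned = [name for name in live if name not in desired]
    return {"out_of_sync": out_of_sync, "prune": pruned}

desired = {"deploy/api": {"image": "api:v2", "replicas": 3}}
live    = {"deploy/api": {"image": "api:v1", "replicas": 3},
           "deploy/old": {"image": "old:v9", "replicas": 1}}

result = diff_states(desired, live)
# deploy/api is out of sync (image drift); deploy/old is a prune candidate
```

With drift auto-heal enabled, anything in `out_of_sync` is re-applied from Git, which is why a manual `kubectl edit` in the cluster is reverted on the next sync.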
Release Strategies
🔵 Rolling Update (Default)
Gradually replace old pods. maxSurge=25%, maxUnavailable=0. Zero downtime. K8s native.
🟢 Blue / Green
Full parallel stack. Instant cutover via Istio weight shift. Instant rollback — keep Blue warm 30 min post-deploy.
🟡 Canary (Flagger)
5%→25%→50%→100% over 20 min. Prometheus SLO analysis each step. Auto-rollback on breach.
👥 Shadow / Dark Launch
Mirror 100% of traffic to new version. No user impact. Validate correctness and performance before cutover.
Flagger · Argo Rollouts · Istio Weights · Header Route
Platform SRE Practices
Error Budgets
  • SLO: 99.99% availability = 52 min/year budget
  • Budget burn: alert at 5% consumed in 1hr (P1)
  • Freeze: no deploys when < 10% budget remains
On-Call Runbooks
  • Incident: PagerDuty → Slack war-room → Zoom bridge
  • Runbooks: Backstage TechDocs, auto-linked from alert
  • Post-mortem: blameless, 48hr SLA, published to wiki
Toil Elimination
  • Target: < 20% SRE time on toil (Google SRE book)
  • Automation: self-healing pods, auto-restart, node drain
  • Chaos Eng: Chaos Monkey + LitmusChaos in staging
Error Budget · Blameless PM · LitmusChaos · Toil <20%
IaC Workflow — Terraform GitOps with Atlantis
Terraform Module Structure
1. Engineer opens PR with Terraform changes
2. Atlantis runs terraform plan on the PR
3. OPA/Conftest validates policy (tagging, costs)
4. Cost estimate shown via Infracost comment on the PR
5. Platform team approves, then Atlantis apply
6. State stored in S3 with a DynamoDB lock
7. Drift scan runs nightly → alerts if out of sync
Golden Module Library
vpc: networking
eks-cluster: compute
rds-postgres: data
redis-cluster: cache
s3-secure: storage
alb-ingress: networking
iam-role: security
monitoring: observability
Crossplane (K8s-native)
Composition: XR abstractions
Provider: AWS / GCP / Azure
Claim: dev self-service
Backstage Internal Developer Portal
Golden Path — New Service Bootstrap (< 5 min)
1. Dev opens Backstage → Create from Template
2. Selects language, team, SLO tier, dependencies
3. Scaffolder creates GitHub repo, CI workflow, Helm chart, ArgoCD app
4. Service auto-registered in catalog with ownership metadata
5. Grafana dashboard + PagerDuty route provisioned
Catalog Features
Service catalog: 400+ services
API catalog: OpenAPI + AsyncAPI
TechDocs: docs-as-code, mkdocs
Scorecards: maturity per service
Cost plugin: Kubecost per team
Scaffolder · TechDocs · Catalog · Scorecards
Environment Topology
🟢 Production: 3 AZ · HA
🟡 Staging: mirror of prod
🔵 Preview: per-PR ephemeral
🟠 Dev: shared namespace
Promotion Gates
  • dev→staging: CI green + security scan pass
  • staging→prod: E2E tests + manual approval
  • prod canary: SLO gate each traffic step
Cost Controls
  • Preview envs: auto-deleted after 24hr inactivity
  • Karpenter: spot instances for non-prod (70% savings)
  • Kubecost: per-team budgets + Slack spend alerts
Ephemeral Preview · Spot Instances · Budget Alerts

11 · Security

🔒
Security Platform — Detailed Architecture
Zero Trust · Defense-in-Depth · Secrets · SIEM/SOC · Compliance · AppSec · Supply Chain · 2025
Security
Zero Trust
Security Model
7 Layers
Defense-in-Depth
AES-256
Encryption
FIPS 140-2
Key Standard
< 1 hr
P1 Patch SLA
SOC 2 II
Compliance
100%
mTLS Coverage
SBOM
Supply Chain
Security Components — Technology Stacks & Controls
🗝️
Secret Manager
Vault · AWS KMS · Auto-Rotation
FIPS 140-2
Technologies
HashiCorp Vault · AWS KMS / Secrets Manager · GCP Secret Manager · Azure Key Vault
Secret Types
  • Database credentials — dynamic, short-lived (1h TTL)
  • API keys — versioned, automated rotation
  • TLS certificates — PKCS#12, auto-renewed via ACME
  • Encryption keys — HSM-backed, never exported
  • SSH keys — just-in-time, one-use CA signing
Access Patterns
  • Vault Agent — sidecar injection into pods
  • Kubernetes ESO (External Secrets Operator)
  • AWS IAM role — workload identity binding
Audit
Every read logged · Alert on anomaly · Immutable trail
🛡️
SIEM / SOC
Splunk · Sentinel · SOAR · Threat Intel
24/7 SOC
Technologies
Splunk Enterprise Security · Microsoft Sentinel · Elastic SIEM · Chronicle (Google)
Detection Sources
  • WAF & network flow logs (VPC Flow Logs)
  • CloudTrail / audit logs — API call anomalies
  • EDR (CrowdStrike) — endpoint telemetry
  • Identity logs — impossible travel, new device
  • Threat intel feeds — STIX/TAXII, VirusTotal
SOAR Automation
  • Auto-block IP on confirmed attack (iptables)
  • Auto-revoke compromised credential token
  • Slack alert + PagerDuty ticket creation
Standards
MITRE ATT&CK · STIX 2.1 · TAXII
🔏
Data Encryption
At Rest · In Transit · In Use
AES-256 / TLS 1.3
At Rest
  • AES-256-GCM — all databases, object storage
  • Envelope encryption — DEK wrapped by KEK
  • Column-level encryption — PII fields (Vault Transit)
  • S3 SSE-KMS — per-object keys, auto-managed
  • Full-disk encryption — NVMe drives (LUKS/BitLocker)
In Transit
  • TLS 1.3 + QUIC — all external traffic
  • mTLS — all service-to-service (Istio/SPIFFE)
  • HSTS preload — max-age=31536000, includeSubDomains
In Use (Confidential Computing)
  • Intel SGX / AMD SEV — encrypted memory enclaves
  • Homomorphic encryption — compute on ciphertext
Standards
FIPS 140-2 · PCI-DSS 4.0 · NIST SP 800-57
📜
Compliance & Governance
GDPR · SOC 2 · PCI-DSS · ISO 27001
Multi-Framework
Frameworks
GDPR / DPDPA · SOC 2 Type II · PCI-DSS 4.0 · ISO 27001 · HIPAA · NIST CSF 2.0
Continuous Compliance
  • CSPM — Wiz / Prisma Cloud policy enforcement
  • OPA / Gatekeeper — Kubernetes policy-as-code
  • AWS Config rules — auto-remediate misconfig
  • Evidence collection — Drata / Vanta automation
Privacy Controls
  • Data residency — regional storage per country law
  • Right-to-erasure — automated GDPR delete API
  • Consent management — OneTrust CMP
Audits
Annual pen-test · Quarterly scan · 3rd-party audit
🕸️
Zero Trust Network
BeyondCorp · SPIFFE · Micro-Seg
Never Trust
Technologies
SPIFFE / SPIRE · Istio PeerAuth · Cloudflare Access · BeyondCorp Enterprise
Principles Applied
  • Assume breach — every request re-verified
  • Least privilege — scoped, time-limited access
  • Explicit verification — identity + device + context
  • Micro-segmentation — no lateral movement
  • Continuous evaluation — session risk re-scored
Network Controls
  • SPIFFE SVIDs — short-lived X.509 identity certs
  • Istio AuthzPolicy — L7 path-level RBAC
  • Network policies — deny-all default, allowlist
Standards
NIST SP 800-207 · SPIFFE RFC · BeyondCorp
🪪
Identity & Access (IAM/PAM)
MFA · SSO · RBAC · JIT Access
Least Privilege
Technologies
Okta / Entra ID · CyberArk PAM · AWS IAM Identity Center · HashiCorp Boundary
Access Controls
  • RBAC — role-based, attribute-based (ABAC)
  • JIT access — elevated rights expire in 4h max
  • PAM — session recording for privileged SSH/RDP
  • SSO — SAML 2.0 / OIDC for all internal tools
  • MFA enforcement — phishing-resistant FIDO2
Lifecycle
  • Auto-provision — SCIM sync from HR system
  • Access review — quarterly certification
  • Auto-deprovision — off-board within 1h
Standards
SCIM 2.0 · SAML 2.0 · FIDO2 · OAuth 2.0
🛡️
Application Security
SAST · DAST · IAST · OWASP
Shift Left
Testing Tools
Semgrep (SAST) · OWASP ZAP (DAST) · Contrast Security (IAST) · Burp Suite (Manual)
CI/CD Integration
  • SAST on every PR — block merge on critical finding
  • Secret scanning — detect leaked API keys in code
  • SCA — dependency CVE check (Snyk / Dependabot)
  • DAST — nightly against staging environment
  • Fuzz testing — continuous corpus with libFuzzer
OWASP Top 10 Coverage
  • Injection, Broken Auth, XSS, SSRF, IDOR — all tested
  • Security headers — CSP, CORP, COOP, Permissions-Policy
Standards
OWASP ASVS · NIST SSDF · CWE Top 25
🔗
Container & Supply Chain
SBOM · Image Signing · SCA
SLSA Level 3
Technologies
Sigstore / Cosign · Trivy / Grype · Syft (SBOM) · Snyk Container
Image Pipeline Controls
  • Base image — distroless / Wolfi (minimal attack surface)
  • Trivy scan — CVE check before push to registry
  • Cosign sign — keyless signing via OIDC + Rekor
  • OPA admission — reject unsigned / critical-CVE images
  • SBOM generated — every build, stored in registry
Dependency Controls
  • Snyk / Dependabot — auto-PR for CVE patches
  • Private registry — no direct internet pulls in prod
Standards
SLSA · SPDX SBOM · CycloneDX · Sigstore
Defense-in-Depth — 7 Security Layers
🏰 Defense-in-Depth Architecture — Concentric Security Rings · No Single Point of Failure Each layer independently blocks attack classes — a breach of one does not compromise all
🌐 Perimeter L1
DDoS absorption (10 Tbps) · Anycast routing · BGP Flowspec block · GeoDNS anycast
WAF · Shield
🔥 Network Edge L2
WAF — OWASP rules · Bot management · TLS 1.3 termination · Rate limiting
Cloudflare
🚪 API Gateway L3
OAuth 2.0 + JWT auth · Input validation · Quota enforcement · Circuit breaker
Kong · Envoy
🕸️ Service Mesh L4
mTLS all svc traffic · SPIFFE identity · L7 AuthzPolicy · Egress allowlist
Istio · SPIRE
📦 Application L5
SAST / DAST / IAST · Dep vulnerability scan · Runtime RASP · Signed containers
Semgrep · Trivy
🗄️ Data Layer L6
AES-256 at rest · Column-level encrypt · PII tokenisation · DB firewall rules
Vault Transit
👁️ Observability L7
SIEM — 24/7 SOC · SOAR auto-response · Anomaly detection · Immutable audit trail
Splunk · Sentinel
Security Operations Center (SOC) — Detect to Resolve
📡
Data Sources
Logs · Flows · EDR
Ingest / SIEM
🔍
Correlation
MITRE rules + ML
Alert fired
⚖️
Triage
False positive filter
Confirmed threat
🚨
Incident Declared
Severity P1–P4
SOAR playbook
🤖
Auto-Contain
Block IP · Revoke token
SRE engaged
👷
Investigation
Forensics · RCA
Remediate
🩹
Eradicate
Patch · Rotate · Harden
Closed
📋
Post-Mortem
Lessons · Rule update
Incident Response Playbook (NIST)
1. 🔍 Prepare (ongoing)
IRP documented, SOC staffed 24/7, SOAR runbooks ready, quarterly tabletop exercises, IR retainer with a vendor.
2. 🚨 Detect & Analyse (< 15 min)
SIEM alert triggered. Analyst confirms true positive. Severity assigned. Timeline documented in ticketing system.
3. 🔒 Contain (< 30 min)
Isolate affected pod/node. Block malicious IPs at WAF. Revoke compromised credentials. Snapshot for forensics.
4. 🩹 Eradicate & Recover (< 4 hrs for P1)
Root cause removed. Patch deployed via canary. Secrets rotated. Service restored with heightened monitoring.
5. 📋 Post-Incident Review (≤ 5 days)
5-whys RCA. Blameless postmortem published. Detection rules improved. NIST CSF mapping updated.
Vulnerability Management Lifecycle
1. Continuous Scanning (24h cycle)
Trivy · Wiz · Qualys — scan containers, hosts, IaC, and dependencies every 24h and on every CI build.
2. Severity Classification (automated)
CVSS v3.1 + EPSS — exploit probability adjusts priority. CVEs enriched with threat-intel feeds.
3. Patch SLAs (enforced)
Critical (CVSS 9+): 1h patch/mitigate · High (7–8.9): 24h · Medium (4–6.9): 7 days · Low: 30 days.
4. Auto-PR Remediation (automated)
Dependabot / Snyk auto-creates PRs for transitive dependency upgrades, merged via CI gating.
5. Virtual Patching (< 4 hrs)
WAF rule deployed within hours of a 0-day disclosure while the code patch is prepared, buying remediation time.
6. Penetration Testing (annual)
Annual full-scope pen test by a certified firm (CREST/OSCP). Quarterly internal red-team exercises.
7. Metrics & Reporting (weekly)
MTTR per severity tracked in Grafana. Board-level security dashboard: open vulns, SLA breach rate, coverage.
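The patch-SLA table in step 3 maps naturally onto a lookup function that scanners can use to attach a remediation deadline to each finding; the boundaries follow the table, while the function name is illustrative:

```python
from datetime import timedelta

def patch_sla(cvss: float) -> timedelta:
    """Remediation deadline for a finding, keyed on its CVSS v3.1 base score."""
    if cvss >= 9.0:
        return timedelta(hours=1)    # Critical: patch or mitigate within 1h
    if cvss >= 7.0:
        return timedelta(hours=24)   # High
    if cvss >= 4.0:
        return timedelta(days=7)     # Medium
    return timedelta(days=30)        # Low

assert patch_sla(9.8) == timedelta(hours=1)
assert patch_sla(7.5) == timedelta(hours=24)
assert patch_sla(5.0) == timedelta(days=7)
assert patch_sla(2.1) == timedelta(days=30)
```

Adding the deadline to each finding's detection timestamp gives the due date whose breach rate feeds the weekly report in step 7.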
Threat Landscape — Top Attack Vectors
🧨 Credential Theft — Critical
Phishing / AiTM attacks · Credential stuffing (bot) · Control: FIDO2 MFA + SIEM anomaly
💉 Injection / RCE — High
SQLi · Log4Shell · SSRF · Supply chain exploit · Control: WAF + SAST + SBOM
☁️ Cloud Misconfiguration — High
Open S3 buckets · Over-permissive IAM roles · Control: CSPM + OPA policy
🔑 Insider Threat — Medium
Data exfiltration by employee · Privilege abuse via PAM · Control: UEBA + DLP + JIT
Compliance Framework Coverage
GDPR / DPDPA — Data Privacy: Active
Consent management, right-to-erasure API, data residency, DPO appointed
SOC 2 Type II — Trust Service Criteria: Certified
Security, availability, confidentiality, privacy — annual 3rd-party audit (Drata automated)
PCI-DSS 4.0 — Payment Card Security: Level 1
Cardholder data in isolated namespace, tokenisation, QSA-assessed annually
ISO 27001 — Information Security ISMS: Certified
ISMS scope covers all production systems, risk register maintained, surveillance audit annually
NIST CSF 2.0 — Cyber Security Framework: Aligned
Govern · Identify · Protect · Detect · Respond · Recover — all 6 functions mapped
Secure Development Lifecycle (SDL) — Security at Every Phase
📐 Design: threat modelling (STRIDE) · security architecture review · data flow diagram + trust boundaries · privacy-by-design review
💻 Code: secure coding guidelines · pre-commit secret scanning · peer security code review · IDE security plugins (Semgrep)
🔨 Build: SAST blocks on critical/high · SCA CVE dependency check · container scan (Trivy) · Cosign image signing
🧪 Test: DAST (OWASP ZAP nightly) · IAST (Contrast in staging) · fuzzing (libFuzzer, continuous) · security regression tests
🚀 Deploy: OPA admission control · image signature verify · canary with error budget gate · least-privilege K8s SA
👁️ Operate: SIEM 24/7 monitoring · runtime RASP protection · weekly threat hunting · patch SLA enforcement
🔁 Feedback: bug bounty (HackerOne) · quarterly red team · security postmortems · NIST CSF maturity review
Zero Trust
Security Model
Never trust, always verify
100%
mTLS Coverage
All svc-to-svc traffic
AES-256
Encryption at Rest
FIPS 140-2 HSM keys
< 1 hr
Critical Patch SLA
CVSS 9.0+ response time
SLSA L3
Supply Chain
Signed SBOM every build
SOC 2 II
Compliance
Annual 3rd-party audit
4h
JIT Access Max
Auto-expire elevated access
24/7
SOC Coverage
SIEM + SOAR + on-call SRE

Security is embedded at every layer — from perimeter DDoS absorption to encrypted data at rest — implementing Defense-in-Depth so no single control failure results in a breach.  |  Frameworks: NIST CSF 2.0 · ISO 27001 · SOC 2 Type II · PCI-DSS 4.0 · GDPR · OWASP ASVS · MITRE ATT&CK · SLSA  |  2025 Production Design

12 · AI / ML

🤖 AI / ML Platform

Model Training · Feature Store · LLMOps · RAG · Vector Search · ML Observability · Real-Time Inference

PyTorch · Ray Train · MLflow · Kubeflow · Feast · Tecton · KServe · Triton · vLLM · Claude · GPT-4 · Pinecone · pgvector · Evidently · Arize · RLHF · Fine-Tuning · RAG · Embeddings · A/B · Shadow Mode
<50ms — Inference P99
10K+ — Req / Second
500+ — Models Served
1B+ — Daily Predictions
256 GPU — Training Cluster
10TB+ — Feature Store
99.9% — Serving SLA
LLMOps — RAG · Agents
Core Platform Components
🧠
Model Training — PyTorch · Ray Train · DeepSpeed
  • PyTorch 2.x: torch.compile, dynamo, inductor backend
  • Ray Train: distributed training across GPU clusters
  • DeepSpeed ZeRO-3: LLM fine-tuning on 256×A100
  • FSDP: fully sharded data parallel for 70B+ models
  • Mixed precision: bf16 training, fp8 on H100s
  • Spot GPU: fault-tolerant checkpointing every 10 min
  • Hyperparameter: Ray Tune + Optuna Bayesian search
PyTorch 2.x · Ray Train · DeepSpeed · FSDP
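The spot-GPU bullet above depends on periodic checkpointing: a preempted job must resume from its last saved state with no loss of correctness. A minimal sketch of that resume logic (the toy loop, loss decay, and in-memory "checkpoint" are illustrative — a real job would serialize optimizer and model state to S3 every 10 minutes):

```python
import copy

def train(steps, ckpt_every, state=None, fail_at=None):
    """Toy training loop with periodic checkpointing; resumes from `state`."""
    state = copy.deepcopy(state) if state else {"step": 0, "loss": 100.0}
    ckpt = copy.deepcopy(state)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            return ckpt, False           # spot preemption: only checkpoint survives
        state["step"] += 1
        state["loss"] *= 0.9             # pretend the loss decays each step
        if state["step"] % ckpt_every == 0:
            ckpt = copy.deepcopy(state)  # persist (in reality: upload to S3)
    return state, True
```

Resuming from the checkpoint after a simulated preemption reproduces the uninterrupted run exactly, which is the property fault-tolerant training relies on.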
📊
Experiment Tracking — MLflow · Weights & Biases
  • MLflow: runs, params, metrics, artifact logging
  • W&B: rich visualizations, sweep hyper-param search
  • Model Registry: staged promotion (dev→staging→prod)
  • Artifact store: S3-backed, versioned model checkpoints
  • Lineage: dataset → code → model traceability
  • Comparison: multi-run diff, metric plots, confusion matrix
  • Git integration: auto-log commit SHA per training run
MLflow · W&B · Model Registry · Lineage
🗄
Feature Store — Feast · Tecton · Redis
  • Feast: open-source, offline (S3/BigQuery) + online (Redis)
  • Tecton: managed, real-time streaming features
  • Point-in-time joins: prevent training/serving skew
  • Online store: Redis for <5ms feature retrieval at inference
  • Offline store: Parquet on S3, Spark batch materialization
  • Feature versioning: immutable, named feature views
  • Transformation: on-demand compute + pre-computed batch
Feast · Tecton · Point-in-Time · Skew-Free
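Point-in-time correctness is what keeps the offline training set consistent with what the online store would have served: each label must only see the latest feature value at or before its own timestamp, never one from the future. A pure-Python sketch of the join (Feast and Tecton implement this at warehouse scale; the tuple layout here is illustrative):

```python
from bisect import bisect_right

def point_in_time_join(labels, features):
    """For each (entity, label_ts), pick the latest feature value with
    feature_ts <= label_ts — never a value from the future (no leakage)."""
    by_entity = {}
    for entity, ts, value in sorted(features, key=lambda f: (f[0], f[1])):
        by_entity.setdefault(entity, ([], []))
        by_entity[entity][0].append(ts)
        by_entity[entity][1].append(value)
    rows = []
    for entity, label_ts in labels:
        ts_list, values = by_entity.get(entity, ([], []))
        i = bisect_right(ts_list, label_ts)       # rightmost ts <= label_ts
        rows.append((entity, label_ts, values[i - 1] if i else None))
    return rows
```

A label timestamped before any feature observation correctly gets `None` rather than a leaked future value — that gap is exactly the training/serving skew the join prevents.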
Model Serving — KServe · Triton · TorchServe
  • KServe: K8s-native serving, InferenceService CRD
  • Triton Inference Server: GPU batching, TensorRT, multi-framework
  • TorchServe: PyTorch native, custom handlers
  • ONNX Runtime: cross-framework acceleration, 2–4× speedup
  • Dynamic batching: accumulate requests, GPU utilisation 90%+
  • Canary: traffic-split for model A/B within KServe
  • Autoscaling: KEDA on queue depth + HPA on GPU utilisation
KServe · Triton · TensorRT · ONNX RT
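Dynamic batching — accumulating requests until the batch is full or the oldest request has waited too long — is the mechanism behind the 90%+ GPU utilisation figure above. A simplified, single-threaded sketch of the flush policy (Triton does this concurrently against a live request queue; the time units here are arbitrary):

```python
def batch_requests(arrivals, max_batch, max_wait):
    """Group (arrival_time, req_id) pairs into batches: flush when the batch
    is full or when the oldest queued request has waited >= max_wait."""
    batches, current, opened_at = [], [], None
    for t, req in arrivals:
        if current and t - opened_at >= max_wait:
            batches.append(current)          # timeout flush: bound latency
            current, opened_at = [], None
        if not current:
            opened_at = t
        current.append(req)
        if len(current) == max_batch:
            batches.append(current)          # size flush: maximize GPU work
            current, opened_at = [], None
    if current:
        batches.append(current)
    return batches
```

The two flush conditions trade off against each other: a larger `max_batch` raises throughput, a smaller `max_wait` caps tail latency.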
💬
LLM Platform — vLLM · Claude · GPT-4 · Gemini
  • vLLM: PagedAttention, continuous batching, 24× throughput
  • LiteLLM: unified proxy — route across Claude/GPT/Gemini
  • Prompt registry: versioned templates, A/B experiments
  • Fine-tuning: QLoRA / LoRA on domain-specific data
  • RLHF: human feedback loop with TRL + reward model
  • Guardrails: NeMo Guardrails — safety, topic, PII filter
  • Cost routing: smart model selection by task complexity
vLLM · LiteLLM · QLoRA · Guardrails
🔍
Vector Database — Pinecone · Weaviate · pgvector
  • Pinecone: managed, serverless, 1B+ vector scale
  • Weaviate: hybrid search (BM25 + dense), multi-modal
  • pgvector: Postgres extension, HNSW + IVFFlat indexes
  • Milvus: open-source, ANN on GPU, distributed shards
  • Embeddings: text-embedding-3-large, BGE-M3, E5-large
  • HNSW index: ef=200, M=16, recall@10=97%
  • Metadata filter: pre-filter by tenant, date, category
Pinecone · Weaviate · pgvector · HNSW
🔁
ML Orchestration — Kubeflow · Airflow · Metaflow
  • Kubeflow Pipelines: DAG-based ML workflows on K8s
  • Airflow: batch feature engineering, data ingestion DAGs
  • Metaflow: Netflix-style, versioned step caching
  • Ray Workflows: dynamic pipelines, sub-step parallelism
  • Retraining triggers: drift detected → auto-retrain pipeline
  • Caching: step-level output cache — skip unchanged stages
  • Notifications: Slack alert on pipeline success/failure
Kubeflow · Airflow · Metaflow · Ray WF
📡
AI Observability — Evidently · Arize · Prometheus
  • Evidently AI: data drift, target drift, data quality reports
  • Arize Phoenix: embedding drift, latent space monitoring
  • PSI / KL-divergence: feature distribution shift detection
  • LLM evals: G-Eval, RAGAS faithfulness, relevance scores
  • Prediction logging: all predictions → S3 + BigQuery
  • Feedback loop: human labels vs. model predictions comparison
  • Alert: drift > threshold → PagerDuty + retrain trigger
Evidently · Arize · RAGAS · Drift Alert
ML Lifecycle Flow & RAG Pipeline
🔄 End-to-End ML Lifecycle
Raw Data (Lake · DB) → Kafka CDC → Feature Eng. (Spark/Feast) → Materialize → Train/Eval (PyTorch · Ray Train) → MLflow Registry (Stage → Promote) → KServe Serve (Canary A/B split) → Monitor Drift (Evidently) → Retrain?
🔍 RAG (Retrieval-Augmented Generation) Pipeline
User Query (API GW + Auth) → Embed (text-embedding-3-large) → Vector Search (Pinecone, top-k=10) → Re-rank (Cohere CrossEncoder) → LLM Generate (Claude 4 / GPT-4o) → Response (Stream + Cache)
Training Infrastructure · LLMOps · Model Evaluation · Cost Optimization
Training Infrastructure
GPU Cluster
GPU Type — NVIDIA H100 80GB
Cluster Size — 256 GPUs (32 nodes × 8)
Interconnect — InfiniBand 400 Gb/s
Spot Fraction — 60% (fault-tolerant)
Checkpoint — every 10 min → S3
Distributed Strategies
Strategy — Best For
Data Parallel — standard models, large batch
FSDP / ZeRO-3 — 70B+ LLMs, memory saving
Pipeline Parallel — sequential layer splits
Tensor Parallel — transformer attention layers
H100 SXM5 · NVLink · Flash Attn 2 · bf16
LLMOps — Production LLM Management
Prompt Engineering
  • Prompt Registry: versioned templates in Git, reviewed
  • A/B prompts: LaunchDarkly feature flags per user segment
  • Few-shot: dynamic example selection from vector store
Fine-Tuning Pipeline
1. Curate domain dataset, dedup + quality filter
2. QLoRA fine-tune on 4-bit base model (3×A100)
3. RLHF — reward model trained on human pref data
4. Eval on held-out set (MMLU, domain benchmarks)
5. Shadow deploy → compare vs. base model online
Inference Optimisation
vLLM PagedAttention — 24× throughput
Speculative decoding — 2× generation speed
Quantization (AWQ) — 4-bit, <1% quality drop
KV Cache — Redis prefix caching
Model Evaluation Framework
Offline Evaluation
Classification — F1, AUC-ROC, PR curve
Ranking — NDCG@10, MRR
NLP — BLEU, ROUGE, BERTScore
LLM — G-Eval, MT-Bench
RAG Faithfulness — RAGAS > 0.85
Online A/B Evaluation
  • Traffic split: 10% challenger vs 90% champion
  • Business KPIs: CTR, conversion, revenue per session
  • Statistical sig.: p<0.05, min 10K samples/arm
  • Guardrail metric: latency must not regress >10%
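The significance check in the bullets above is a standard two-proportion z-test on conversion counts. A minimal stdlib implementation (the p<0.05 cutoff and the 10K-samples-per-arm minimum are enforced by the caller, not inside the test itself):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns (z, two-sided p-value) comparing
    challenger conversion rate (b) against champion (a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With 10K samples per arm, a lift from 10% to 12% conversion yields z ≈ 4.5 — comfortably significant — while identical rates give p = 1.0, so the champion is retained.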
Shadow Evaluation
  • Mirror traffic to candidate model, compare outputs
  • Zero user impact — purely observational phase
Cost Optimization
GPU Efficiency
GPU Utilisation — 88%
Spot Savings — 65%
Cache Hit (LLM) — 42%
Quant. Cost Save — 70%
Inference Cost Strategies
  • Tiered routing: small model → large model escalation
  • Semantic cache: GPTCache — reuse similar query results
  • Quantization: AWQ 4-bit, GGUF for CPU inference
  • Batching: continuous batching, 90%+ GPU utilisation
  • Spot instances: preemption-safe Kubeflow pipelines
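The semantic-cache bullet above (GPTCache-style) reuses an earlier answer when a new query embeds close enough to a cached one. A minimal sketch — the embedding function is injected, vectors are assumed unit-normalised, and the 0.9 threshold is an illustrative choice, not a GPTCache default:

```python
class SemanticCache:
    """Reuse a cached answer when cosine(query, cached) >= threshold."""
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold, self.entries = embed, threshold, []

    def get(self, query):
        q = self.embed(query)
        best, best_sim = None, 0.0
        for vec, answer in self.entries:             # linear scan; real systems
            sim = sum(a * b for a, b in zip(q, vec)) # use an ANN index here
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Tuning the threshold trades hit rate against the risk of serving a stale or mismatched answer — the 42% LLM cache-hit figure above implies a fairly permissive setting backed by output validation.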
Model Routing Logic
Simple Q&A — Haiku 4.5
Code / Reasoning — Sonnet 4.6
Complex Agents — Opus 4.7
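The routing table above can be sketched as a keyword heuristic. A production router would use a learned complexity classifier (or LiteLLM routing rules); the keyword lists below are illustrative, and the tier names follow the table:

```python
def route_model(prompt):
    """Tiered cost routing: escalate from the cheapest model only when
    the prompt looks like code/reasoning or agentic work."""
    text = prompt.lower()
    if any(k in text for k in ("plan", "multi-step", "orchestrate", "agent")):
        return "Opus 4.7"        # complex agentic work
    if any(k in text for k in ("code", "debug", "prove", "reason")):
        return "Sonnet 4.6"      # code / reasoning
    return "Haiku 4.5"           # simple Q&A default
```

Combined with the semantic cache and quantization strategies above, tiered routing is typically the largest single lever on per-request inference cost.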
AI Agent Architecture & RAG Deep-Dive
Agentic Patterns
1. ReAct: Reason + Act loop with tool calls
2. Tool use: search, code exec, DB query, API
3. Memory: short-term (context), long-term (vector DB)
4. Multi-agent: orchestrator + specialist sub-agents
5. Human-in-loop: Temporal + approval checkpoints
Frameworks
LangChain · LlamaIndex · Claude SDK · AutoGen · CrewAI
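Pattern 1 above — the ReAct loop — reduces to: ask the LLM for an action, execute the tool, feed the observation back, repeat until it answers. A framework-free sketch with a scripted stand-in for the LLM (the tuple protocol and the `scripted_llm`/`tools` stubs are illustrative, not any framework's API):

```python
def react_agent(question, llm, tools, max_steps=5):
    """Minimal ReAct loop: the LLM emits either ("act", tool_name, arg)
    or ("finish", answer); tool observations feed back into the transcript."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = llm(transcript)
        if step[0] == "finish":
            return step[1]
        _, tool, arg = step
        transcript.append(("observation", tools[tool](arg)))
    return None  # step budget exhausted — a real agent would escalate

# Stub LLM: looks the question up once, then answers from the observation.
def scripted_llm(transcript):
    if transcript[-1][0] == "observation":
        return ("finish", transcript[-1][1])
    return ("act", "search", transcript[0][1])

tools = {"search": lambda q: {"capital of france": "Paris"}.get(q.lower(),
                                                                "unknown")}
```

The `max_steps` cap is the loop's guardrail; production agents pair it with the human-in-loop checkpoints from pattern 5.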
RAG Configuration
Chunk size — 512 tokens
Overlap — 64 tokens
Embed model — text-embedding-3-large
Dimensions — 3072
Top-K retrieve — 10 chunks
Re-rank top-N — 3 (Cohere)
Faithfulness — RAGAS > 0.85
Hallucination — < 2% (eval)
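The chunk-size/overlap settings above describe a sliding window over the token stream: each 512-token chunk repeats the last 64 tokens of its predecessor so no sentence is split across a retrieval boundary. A minimal chunker (defaults match the configuration; the usage below uses small sizes so the overlap is visible):

```python
def chunk(tokens, size=512, overlap=64):
    """Sliding-window chunking: fixed-size chunks with fixed overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

For example, `chunk(list(range(20)), size=8, overlap=2)` yields three chunks, with each chunk's first two tokens repeating the previous chunk's last two.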
Hybrid Search
Dense (ANN) — 70% weight
Sparse (BM25) — 30% weight
Fusion — Reciprocal Rank
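Reciprocal Rank Fusion with the 70/30 dense/sparse weighting above can be sketched as follows (k=60 is the conventional RRF constant; the weighting applied to each list's contribution is this document's configuration, not part of standard RRF):

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: each list contributes weight / (k + rank) per doc;
    documents are returned in descending fused-score order."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks (not raw scores) enter the formula, RRF sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.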
ML Monitoring & Drift Detection
Drift Types & Detection
Drift Type — Method — Threshold
Data Drift (input feature shift) — PSI, KS-test, KL-div — PSI > 0.2 → alert
Concept Drift (label relationship shift) — ADWIN, Page-Hinkley — accuracy drop > 3%
Prediction Drift (output distribution shift) — Chi-square, Wasserstein — p < 0.05
Embedding Drift (latent space shift) — Arize cosine dist. — centroid drift > 0.1
Automated Retraining Triggers
  • Drift alert → Kubeflow pipeline auto-triggered
  • Scheduled: weekly retraining for all production models
  • Data volume: retrain when 10K new labeled samples arrive
  • Business metric: CTR drop > 5% over 24h rolling window
Evidently · Arize · PSI · ADWIN · Auto-retrain
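The PSI > 0.2 alert threshold in the table above compares a feature's distribution between a reference window and live traffic. A minimal stdlib implementation (the bin count and the epsilon that guards empty bins are illustrative choices; Evidently computes this per feature):

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between two samples over shared bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # small epsilon avoids log(0) for empty bins
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions score 0; a shifted live window scores well above the 0.2 alert line, which then fires the auto-retrain trigger described above.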
Model Governance & Safety
Responsible AI
  • Fairness: demographic parity, equal opportunity audit
  • Explainability: SHAP values, LIME for black-box models
  • Bias checks: Fairlearn, AIF360 pre-deploy
  • Data privacy: differential privacy in training (DP-SGD)
LLM Safety
  • NeMo Guardrails: topic, safety, PII filters
  • Prompt injection: detection + sanitisation layer
  • Output validation: schema + toxicity classifier
Model Cards
  • Every model: intended use, limitations, eval results
  • Lineage: dataset + code SHA + training config
SHAP · Fairlearn · DP-SGD · Model Cards