Personal notes · May 2026

System Design Trade-Offs

There are no best architectures — only architectures that are right for a set of constraints. I wrote this note to map the trade-off space I navigate in system design: the laws that bound the design, the axes where you spend your one budget, and the framework I use to choose deliberately and defend the choice.

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

The question behind this note: which trade-offs are we making on purpose — and which ones are we drifting into because no one wrote them down? What follows is how I think through that.

Covers: Distributed systems · data · scaling · resilience

Foundations

The Trade-Off Mental Model

Every architectural decision moves a slider. Moving it toward one virtue spends a budget that could have bought another. The skill I care about is not knowing the "right" answer — it is naming the budget you are spending and proving it is the cheapest one to spend for this system.

"There are no solutions, only trade-offs." — Thomas Sowell · Adapted to engineering by every architect who ever shipped at scale.

Principle 1

You cannot maximize everything

Latency, consistency, cost, simplicity, and flexibility pull against each other. Optimizing one past a point degrades another. Pick the two or three that matter for the business and let the rest be "good enough."

Principle 2

Constraints reveal the answer

The "best" design falls out of the requirements + constraints: read/write ratio, scale, consistency needs, team size, latency budget, and money. Quantify these first; the architecture is then mostly a derivation.

Principle 3

Reversible vs one-way doors

Decide fast on reversible choices; deliberate hard on one-way doors (data model, partition key, public API contract, sync↔async boundaries). Spend your analysis budget where the cost of being wrong is highest.

The four budgets you are always spending

The four budgets are performance, scale, cost, and simplicity, and you sit in the middle. Dragging the system toward any corner stretches the cords to the others: more performance often costs more money and simplicity; more scale stresses consistency.

Foundations

Laws & Theorems That Bound Every Design

These are not opinions — they are constraints. Knowing them stops you from promising the impossible (e.g., "strongly consistent, always available, across regions").

Distributed

CAP Theorem

During a network partition, you must choose Consistency or Availability. You cannot have both. When the network is healthy, the choice is moot — which is why PACELC matters more in practice.

Distributed

PACELC

If Partition → trade A vs C; Else → trade Latency vs Consistency. Captures the everyday cost: strong consistency adds round-trips even when nothing is broken.

Scaling

Amdahl's & Universal Scalability Law

Speedup is capped by the serial fraction (Amdahl). USL adds a coherency penalty: past a point, adding nodes makes throughput worse due to cross-node coordination. Contention is the enemy of scale.
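Both laws are easy to put numbers on. The sketch below is a minimal illustration, not a capacity model: the function names and the example coefficients (a 10% serial fraction, sigma = 0.05 for contention, kappa = 0.001 for coherency) are assumptions picked for the demo, and real values have to be fitted from load tests.

```python
import math


def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Amdahl's Law: speedup on n workers when serial_fraction cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def usl_throughput(n: int, sigma: float, kappa: float) -> float:
    """Universal Scalability Law: relative throughput on n nodes.

    sigma models contention (queueing on shared resources);
    kappa models coherency (pairwise cross-node coordination).
    """
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_nodes(sigma: float, kappa: float) -> float:
    """Node count at which USL throughput peaks; beyond it, adding nodes hurts."""
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        # With a 10% serial fraction the speedup approaches 1 / 0.10 = 10.
        print(n, round(amdahl_speedup(n, 0.10), 2))
    print(round(usl_peak_nodes(0.05, 0.001), 1))
    print(usl_throughput(31, 0.05, 0.001) > usl_throughput(100, 0.05, 0.001))
```

With these assumed coefficients the USL curve peaks at roughly 31 nodes, and 100 nodes deliver less throughput than 31 do.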

Queueing

Little's Law

L = λ × W. Concurrency = arrival rate × latency. Cut latency and you cut the in-flight work (threads, connections, memory) you must provision. The cheapest capacity is lower latency.
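As a quick worked example of the formula, here is a minimal sketch. The function name and the 500 req/s figure are hypothetical; the law relates long-run averages, so treat the result as a sizing floor rather than a precise pool size.

```python
def in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return arrival_rate_per_s * mean_latency_s


if __name__ == "__main__":
    # Hypothetical service receiving 500 req/s at peak.
    print(in_flight(500, 0.200))  # about 100 requests in flight at 200 ms
    print(in_flight(500, 0.050))  # about 25 after cutting latency to 50 ms
```

Cutting latency by 4× cuts the threads, connections, and memory you must hold open by the same factor.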

Queueing

Tail latency & the queueing wall

As utilization → 100%, queue time explodes non-linearly. Run hot (>~70–80%) and p99 latency detonates. Plan capacity against the tail, not the mean.
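One way to see the wall is the textbook M/M/1 queue, where mean time in the system is service time divided by (1 - utilization). This is a sketch under that simplifying assumption (single server, random arrivals, exponential service times). Real services differ, and the tail is worse than the mean shown here.

```python
def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in an M/M/1 queue (waiting + service) at a given utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # Assumed 10 ms service time; watch the mean grow as the server runs hotter.
    for rho in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
        ms = mm1_response_time(0.010, rho) * 1000
        print(f"utilization {rho:.0%}: mean response {ms:.0f} ms")
```

Under these assumptions a 10 ms operation averages 20 ms at 50% utilization, 100 ms at 90%, and about a second at 99%.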

Org

Conway's Law

Systems mirror the communication structure of the org that builds them. Want clean service boundaries? Shape the teams first. Microservices imposed on a monolithic org produce a distributed monolith.

How I end the debate: Quote the law that settles it. "We can't be both strongly consistent and multi-region available during a partition; CAP forbids it. Which do we sacrifice when us-east can't reach eu-west?" turns an opinion war into a requirements decision.

Foundations

Latency Numbers Every Engineer Should Know

Order-of-magnitude intuition is what separates a back-of-the-envelope estimate from hand-waving. Memorize the relative gaps, not the exact figures.

The gaps that matter: memory is ~100× slower than L1; SSD ~100× slower than memory; a network hop ~30× slower than SSD; crossing the planet ~300× a same-DC round trip. Design data locality accordingly.

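How to use these: Capacity math in 60 seconds. 1 M requests/day ≈ 12 req/s on average (10^6 / 86,400 s). A service with 200 ms latency holds ≈ 0.2 × peak_rps requests in flight (Little's Law), so provision at least that many concurrent workers; plugging in p99 instead of the mean gives a conservative bound. 1 KB × 1 M rows ≈ 1 GB. These let you sanity-check any design on a whiteboard.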

Core Axis · Distributed State

CAP & PACELC — Consistency vs Availability

The first fork in any distributed datastore decision. State which side you land on, and what happens to the other guarantee during a partition.

CAP only forces a choice during a partition. PACELC reminds you that even when healthy ("Else"), strong consistency costs latency via cross-node coordination.

System	PACELC class	Behavior	Use it for
PostgreSQL / MySQL (single primary)	PC/EC	Strong consistency; primary is a SPOF for writes	Transactions, money, relational integrity
Spanner / CockroachDB / YugabyteDB	PC/EC	Global strong consistency via consensus + synced clocks; pays latency	Global OLTP needing strong guarantees
DynamoDB / Cassandra (default)	PA/EL	Available, eventually consistent; tunable quorums	High-scale writes, carts, sessions, feeds
MongoDB (default)	PA/EC	Available under partition; consistent reads from primary when healthy	Flexible documents, mixed workloads
Redis (single)	—	In-memory, strong on one node; replication is async (can lose writes on failover)	Cache, ephemeral, leaderboards

Core Axis · Distributed State

Consistency Models — The Spectrum

"Consistency" is not binary. It is a ladder from strict to eventual; each rung trades latency and availability for stronger guarantees. Pick the weakest model that still meets correctness.

The spectrum of consistency guarantees. Most user-facing systems are happy at causal or read-your-writes — full linearizability is expensive and rarely required outside money and locks.

Strong

When you truly need it

Account balances, inventory decrements, unique username claims, distributed locks, "exactly one winner" auctions. Anywhere a stale read causes double-spend or oversell.

Eventual / Weak

When weak is plenty

Like counts, view counts, social feeds, product catalogs, recommendations, analytics, presence indicators. A few seconds of staleness is invisible to users and buys massive scale + availability.

Hidden cost: Eventual consistency pushes complexity to the client and the reconciliation layer: conflict resolution (LWW, vector clocks, CRDTs), idempotent retries, and "undo" UX for rejected writes. It is cheaper on the write path and more expensive everywhere else. Budget for it.

Core Axis · Scale

Vertical vs Horizontal Scaling

Buy a bigger box, or buy more boxes? Vertical is simpler and faster to reach for; horizontal is the only path past a single machine's ceiling — at the price of distributed-systems complexity.

Scale Up vs Scale Out · simplicity ⇄ ceiling

⬆ Vertical (scale up)

Bigger CPU / RAM / faster disks on one node
No code changes; no distribution problems
Strong consistency stays trivial
Hard ceiling + super-linear $ per unit near the top end
Single point of failure; disruptive upgrades

➡ Horizontal (scale out)

Many commodity nodes behind a balancer
Near-linear capacity & built-in redundancy
Requires statelessness, sharding, coordination
Introduces CAP, partial failure, data movement
Operational + cognitive overhead climbs

Decision rule: Scale up first — it is cheaper in engineering time until you hit the box ceiling or need HA. Scale out when a single node can't hold the load/data or you require fault tolerance. Make services stateless early so scaling out is later a config change, not a rewrite.

The prerequisite: statelessness

Statelessness is what makes horizontal scaling cheap. Externalize sessions and data so any node can serve any request; the load balancer then only needs to know which nodes are healthy, not which node holds which user.

Core Axis · Data Model

SQL vs NoSQL — and the Many NoSQLs

The most consequential one-way door in most systems. Choose for your access patterns and consistency needs, not for hype. "NoSQL" is four very different things.

Dimension	Relational (SQL)	Document	Key-Value	Wide-Column	Graph
Examples	Postgres, MySQL	MongoDB, Couchbase	Redis, DynamoDB	Cassandra, ScyllaDB, Bigtable	Neo4j, Neptune
Schema	Rigid, enforced	Flexible	None	Flexible columns	Nodes + edges
Best query	Joins, ad-hoc, aggregates	Whole-document fetch	Get/put by key	Wide rows by partition key	Relationship traversal
Consistency	Strong (ACID)	Tunable	Tunable	Tunable	Strong
Horizontal scale	Harder (sharding)	Good	Excellent	Excellent	Hard
Sweet spot	Transactions, reporting	Catalogs, profiles, CMS	Cache, sessions, counters	Time-series, events, feeds	Social, fraud, recommendations

Normalize vs Denormalize · write cost ⇄ read cost

Normalized

One source of truth, no duplication
Cheap, safe writes
Reads pay join cost
Default for OLTP / SQL

Denormalized

Pre-joined, read-optimized
Fast reads, no joins
Writes fan out; risk of drift
Default for NoSQL / read-heavy

Rule: Normalize until reads hurt, then denormalize the hot paths. In NoSQL you model the query first and duplicate freely.

ACID vs BASE · guarantee ⇄ scale

ACID

Atomic, Consistent, Isolated, Durable
Correctness by construction
Coordination limits scale

BASE

Basically Available, Soft state, Eventually consistent
Scales horizontally with ease
App handles conflicts/retries

Rule: ACID for money & integrity; BASE for scale-first, tolerant-to-staleness workloads. Many systems use both — ACID core, BASE edges.

OLTP vs OLAP — don't run analytics on your transactional store

	OLTP (transactional)	OLAP (analytical)
Pattern	Many small reads/writes by key	Few huge scans & aggregations
Storage	Row-oriented	Column-oriented (Parquet, Redshift, BigQuery, ClickHouse)
Goal	Low latency per op	High throughput per query
Move data via	—	CDC / ETL / ELT into a warehouse or lakehouse

Common failure: Heavy analytics queries on the production OLTP primary compete with user traffic for CPU, I/O, and buffer cache, can hold locks or long-running snapshots, and blow the latency budget for users. Replicate to a read replica or warehouse; isolate the workloads.

Core Axis · Scale

Replication & Partitioning

Replication buys availability and read scale (copies of the same data). Partitioning/sharding buys write scale and capacity (splits of different data). Most large systems do both.

Replication topologies

Single-leader

Leader–follower

All writes to one leader, async/sync to followers. Simple, strong-ish. Trade: leader is a write SPOF; failover + replication lag.

Multi-leader

Active–active

Writes accepted in multiple regions. Great for geo-latency & offline. Trade: write conflicts you must resolve (LWW / CRDT / app logic).

Leaderless

Quorum (Dynamo-style)

Write to any W and read from any R of the N replicas; tune W + R > N for consistency. Trade: tunable but complex; read-repair & anti-entropy.

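Quorum math: With N replicas, writing to W and reading from R: if W + R > N, every read quorum overlaps every write quorum, so a read reaches at least one replica holding the latest acknowledged write (sloppy quorums and concurrent writes can still weaken this). Example N=3, W=2, R=2: strong-ish with one node down. Lower R for fast reads (accept staleness); lower W for fast writes (accept lost-update risk).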
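The W + R > N rule is a pigeonhole argument, and a brute-force check makes it concrete. This is a small sketch with hypothetical function names; it only checks set overlap and says nothing about sloppy quorums, clocks, or concurrent writers.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The W + R > N rule: any write set and any read set share a replica."""
    return w + r > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Brute-force check of the same property, to show why the rule holds."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5)]:
        print((n, w, r), quorums_overlap(n, w, r), overlap_by_enumeration(n, w, r))
```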

Partitioning (sharding) strategies

Strategy	How	Pro	Con
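Hash / consistent hashing	shard = hash(key) mod N, or the next node clockwise on a hash ring	Even distribution; with a ring, minimal reshuffle on resize (plain mod N remaps most keys)	No efficient range scans; a range query must hit every shard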
Range	shard = key falls in [a,b)	Efficient range scans & ordering	Hotspots if keys are skewed (e.g. timestamps)
Directory / lookup	explicit map key→shard	Full flexibility, easy rebalancing	Lookup service is a dependency & SPOF
Geo	shard by region	Data locality, residency compliance	Cross-region queries are expensive
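To make the hash row concrete, here is a minimal consistent-hashing ring. It is a sketch under assumptions: the class name, the SHA-256 based hash, and the 100 virtual nodes per physical node are all illustrative choices, not taken from any particular datastore.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self._vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node owns many virtual points to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First point at or after the key's position, wrapping past the end.
        idx = bisect.bisect_left(self._points, (_hash(key),)) % len(self._points)
        return self._points[idx][1]


if __name__ == "__main__":
    ring = HashRing(["shard-a", "shard-b", "shard-c"])
    keys = [f"user:{i}" for i in range(2000)]
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node("shard-d")
    moved = [k for k in keys if ring.node_for(k) != before[k]]
    print(f"{len(moved) / len(keys):.0%} of keys moved, all to the new shard")
```

Adding a fourth shard should move roughly a quarter of the keys, and only onto the new shard. With plain `hash(key) mod N`, changing N remaps most keys instead.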

The shard key is a one-way door: Choose it for (1) even load, (2) the dominant query, and (3) avoiding cross-shard transactions. A bad shard key (e.g. customer_id when one customer is 40% of traffic) creates a permanent hotspot that is brutally expensive to fix later. Model the access pattern before you commit.

Core Axis · Performance

Caching — The Cheapest Latency Win and the Hardest Bug

Caching trades freshness and complexity for latency and load reduction. "There are only two hard things in CS: cache invalidation and naming things." (attributed to Phil Karlton). Plan invalidation before you add the cache.

Write/Read strategies

Pattern	How it works	Pro	Con / risk
Cache-aside (lazy)	App checks cache; on miss, reads DB and populates	Resilient, only caches what's used	First hit slow; stale until TTL; thundering herd on miss
Read-through	Cache library fetches from DB on miss	Transparent to app code	Cache becomes a hard dependency
Write-through	Write to cache & DB synchronously	Cache always fresh	Write latency = cache + DB; caches unread data
Write-back (write-behind)	Write to cache, flush to DB async	Fast writes, absorbs bursts	Data loss if cache dies before flush
Write-around	Write to DB only; cache fills on read	Avoids caching write-once data	Recently written data reads are slow misses
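The cache-aside row is the pattern most teams reach for first, so here is a minimal in-process sketch of it. The class name, the `load_from_db` callback, and the injectable `clock` are assumptions made for illustration; a real deployment would typically put Redis or Memcached behind the same shape.

```python
import time


class CacheAside:
    """Cache-aside (lazy loading) with a fixed TTL per entry."""

    def __init__(self, load_from_db, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                      # hit: serve from cache
        value = self._load(key)                  # miss: read the source of truth
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key) -> None:
        """Call after a write so the next read repopulates from the database."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    cache = CacheAside(lambda key: f"row-{key}", ttl_seconds=30.0)
    print(cache.get("42"))  # miss, loads and stores
    print(cache.get("42"))  # hit, no database call
```

Note where the staleness window lives: between a database write and either `invalidate` or TTL expiry, readers still see the old value.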

Eviction

When the cache is full

LRU — evict least recently used (good general default)
LFU — evict least frequently used (good for skewed popularity)
FIFO / TTL — simple, time-bounded freshness
W-TinyLFU — modern hybrid (Caffeine), high hit rates

Failure modes

Three ways a cache fails under load

Penetration — queries for keys that don't exist; cache never helps. Fix: cache negatives / bloom filter.
Avalanche — many keys expire at once → DB spike. Fix: jittered TTLs.
Stampede / dogpile — hot key expires, N requests rebuild it. Fix: single-flight lock / stale-while-revalidate.
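Two of these fixes fit in a few lines: jittered TTLs against avalanche and a single-flight guard against stampede. The sketch below uses only the standard library; `jittered_ttl`, `SingleFlight`, and the 10% spread are hypothetical names and values, and it works within one process only (a multi-node setup would need a distributed lock or stale-while-revalidate instead).

```python
import random
import threading


def jittered_ttl(base_seconds: float, spread: float = 0.1) -> float:
    """Spread expiries over +/- spread so keys written together do not expire together."""
    return base_seconds * (1.0 + random.uniform(-spread, spread))


class SingleFlight:
    """Collapse concurrent rebuilds of the same key into one call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> record of the rebuild in progress

    def do(self, key, rebuild):
        with self._lock:
            call = self._calls.get(key)
            is_leader = call is None
            if is_leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._calls[key] = call
        if not is_leader:
            call["done"].wait()              # followers wait for the leader's result
        else:
            try:
                call["value"] = rebuild()
            except Exception as exc:         # followers see the same failure
                call["error"] = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call["done"].set()
        if call["error"] is not None:
            raise call["error"]
        return call["value"]
```

A caller would wrap the expensive rebuild, for example `flight.do("product:42", lambda: load_product(42))`, where `load_product` stands in for whatever query repopulates the hot key.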

Where the cache lives (layers)

Each layer trades invalidation control for latency. CDN edge caching is the single biggest lever for global read-heavy workloads; in-memory local caches are fastest but per-node and incoherent.

Core Axis · Communication

Synchronous vs Asynchronous & the API Spectrum

How services talk is as consequential as how they store. Sync is simple and immediate but couples availability; async decouples and absorbs load but adds eventual consistency and operational surface.

Synchronous (request/response) vs Asynchronous (messaging) · simplicity ⇄ resilience

Synchronous · REST/gRPC

Immediate result, easy to reason about
Linear request flow, simple debugging
Temporal coupling: callee down ⇒ caller down
Latency adds up across the chain
Backpressure must be handled explicitly

Asynchronous · queues/events

Decoupled; producer ≠ consumer uptime
Absorbs spikes; natural load leveling
Eventual consistency & out-of-order delivery
Harder to trace; need idempotency + DLQs
Enables fan-out & replay

Rule: Use sync when the caller genuinely needs the answer to proceed (a read, a validation). Use async for work that can happen later (emails, thumbnails, indexing, fan-out) or to decouple availability. A user click should rarely block on five synchronous downstream calls.

API styles

Style	Transport / shape	Strengths	Weaknesses	Use for
REST	HTTP/JSON, resources	Ubiquitous, cacheable, simple	Over/under-fetching; chatty	Public APIs, CRUD, broad compatibility
gRPC	HTTP/2, protobuf, RPC	Fast, typed contracts, streaming	Binary, browser/edge friction	Internal service-to-service, low latency
GraphQL	HTTP, query language	Client picks fields; one round-trip	Caching & rate-limiting complexity; N+1	Aggregating many sources for varied UIs
Webhooks / SSE / WebSocket	Server-push	Real-time, server-initiated	Connection state, scaling fan-out	Notifications, live updates, chat

Queue vs Log (message broker vs event stream)

Message queue

RabbitMQ / SQS — work distribution

Message consumed once and deleted; competing consumers split a task queue. Great for job processing. Trade: no replay, ordering is limited.

Event log

Kafka / Kinesis / Pulsar — event streaming

Append-only, retained log; many consumers read independently and can replay. Great for event sourcing, analytics, fan-out. Trade: more ops, partition ordering only.

Always design for: idempotency (same message twice = same result; use idempotency keys), at-least-once delivery (duplicates happen, so dedupe), and a dead-letter queue (poison messages need somewhere to go). "Exactly once" is mostly a marketing term; engineer for at-least-once + idempotent consumers.
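Those three habits combine naturally in the consumer. Below is a minimal in-memory sketch: the class, the `idempotency_key` field, and the return strings are hypothetical, and a production version would keep the settled-key set in durable storage, updated atomically with the side effect.

```python
class IdempotentConsumer:
    """At-least-once consumer: dedupes by key and dead-letters poison messages."""

    def __init__(self, handler, max_attempts: int = 3):
        self._handler = handler
        self._max_attempts = max_attempts
        self._settled = set()    # keys already processed or dead-lettered
        self._failures = {}      # key -> failed attempts so far
        self.dead_letters = []

    def receive(self, message: dict) -> str:
        key = message["idempotency_key"]
        if key in self._settled:
            return "duplicate"               # redelivery: acknowledge, do nothing
        try:
            self._handler(message)
        except Exception:
            attempts = self._failures.get(key, 0) + 1
            self._failures[key] = attempts
            if attempts < self._max_attempts:
                return "retry"               # leave it for the broker to redeliver
            self.dead_letters.append(message)
            self._settled.add(key)
            return "dead-lettered"
        self._settled.add(key)
        return "processed"


if __name__ == "__main__":
    charged = []
    consumer = IdempotentConsumer(lambda m: charged.append(m["amount"]))
    msg = {"idempotency_key": "order-1", "amount": 25}
    print(consumer.receive(msg), consumer.receive(msg), charged)
```

Delivering the same message twice charges once; a message that keeps failing ends up in `dead_letters` instead of blocking the queue.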

Architecture

Monolith vs Microservices

The defining org-and-tech trade-off of the last decade. Microservices trade in-process simplicity for independent deployability — and buy a distributed system's worth of new problems. Most teams should start with a modular monolith.

The monolith keeps everything in one process and one transaction boundary. Microservices grant independent scaling and deploys per team, but every in-process call becomes a network call that can fail, and every cross-entity write becomes a distributed-transaction problem.

Dimension	Monolith	Microservices
Deployment	One artifact, simple	Independent per service, needs CI/CD maturity
Scaling	Scale the whole app	Scale hot services independently
Team autonomy	Coupled; coordination tax	Teams own & ship independently
Transactions	ACID, trivial	Distributed; sagas + eventual consistency
Debugging	Stack trace, one process	Distributed tracing required
Failure mode	All-or-nothing	Partial failure (good & bad)
Operational cost	Low	High (mesh, observability, infra)

The distributed monolith: The worst of both worlds, with services split by layer not by domain, sharing a database and deploying together. You pay the network/ops tax of microservices and keep the coupling of a monolith. Split by business capability and data ownership, or don't split. Conway's Law applies, so align services to teams.

Architecture

Patterns: CQRS, Event Sourcing, Saga, EDA

Powerful tools that each solve a real problem — and each add real complexity. Reach for them only when the constraint they address is actually present.

Read/write split

CQRS

Separate the write model from one or more read models, each optimized for its job.

Use when: read & write loads/shapes diverge sharply.

Cost: two models to keep in sync; eventual consistency between them.

Audit / replay

Event Sourcing

Store the sequence of events, derive state by replaying. The log is the source of truth.

Use when: you need a full audit trail, time-travel, or to rebuild projections.

Cost: schema evolution of events, snapshots, steep mental model.

Distributed txn

Saga

Replace a cross-service ACID transaction with a sequence of local transactions + compensating actions on failure.

Use when: a business process spans services (order → payment → inventory → ship).

Cost: you write the rollback logic; no isolation — intermediate states are visible.
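The core saga mechanic, local steps plus compensations run in reverse on failure, can be sketched in a few lines. This is an in-process illustration with a hypothetical `run_saga` helper; a real saga persists its progress and drives the steps through messages or an orchestrator so it can resume after a crash.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs in order; undo completed steps on failure.

    Returns True if every action succeeded, False if the saga was rolled back.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order; each undo should itself be idempotent.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


if __name__ == "__main__":
    log = []

    def reserve_stock():
        raise RuntimeError("out of stock")

    ok = run_saga([
        (lambda: log.append("create order"), lambda: log.append("cancel order")),
        (lambda: log.append("charge card"), lambda: log.append("refund card")),
        (reserve_stock, lambda: log.append("release stock")),
    ])
    print(ok, log)
```

In the demo the third step fails, so the card is refunded and the order cancelled, in that order. Between "charge card" and "refund card" the charge is visible to other readers, which is the missing isolation named above.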

Decoupling

Event-Driven Architecture

Services emit events; others react. Producers don't know consumers.

Use when: you need loose coupling, fan-out, extensibility without changing producers.

Cost: emergent behavior is hard to trace; eventual consistency everywhere.

Saga: choreography vs orchestration

	Choreography (events)	Orchestration (coordinator)
Control	Decentralized; each service reacts to events	Central orchestrator drives the steps
Pro	Loose coupling, no central bottleneck	Explicit flow, easy to see & debug
Con	Hard to follow the whole flow; cyclic risk	Orchestrator is coupling + a SPOF to manage
Use	Few steps, simple flows	Many steps, complex compensation logic

Architecture

Resilience & Reliability Patterns

At scale, failure is the steady state, not the exception. Reliability is bought with redundancy, isolation, and graceful degradation — each trading cost and complexity for uptime.

Stop the bleeding

Circuit Breaker

Trip open after repeated downstream failures; fail fast instead of piling on. Half-open to test recovery. Prevents cascading failure.
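The closed, open, and half-open states can be sketched as a small state machine. The class below is a minimal, single-threaded illustration: the names, the failure threshold of 3, and the 30 s reset timeout are assumptions, and production libraries add locking, rolling failure windows, and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast; downstream presumed unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half-open" or self._failures >= self._threshold:
                self._opened_at = self._clock()  # trip, or re-open after a failed trial
            raise
        self._failures = 0       # any success closes the circuit
        self._opened_at = None
        return result
```

Callers wrap each downstream call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as the cue to serve a fallback.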

Shed load

Rate Limiting & Throttling

Token/leaky bucket caps inbound load. Protects you from abuse & thundering herds. Trade: legitimate spikes get 429'd — tune carefully.
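A token bucket is short enough to sketch in full. This is a single-process illustration with hypothetical names; the refill rate and capacity are the two knobs you tune, and a shared limiter across nodes would need the counter in a store such as Redis.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should respond 429 or queue the request


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=5, capacity=10)
    print([bucket.allow() for _ in range(12)])  # the burst beyond capacity is refused
```

Capacity sets how large a legitimate spike you tolerate; rate sets the sustained ceiling. Tuning the pair is the "tune carefully" part.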

Don't amplify

Retries + Backoff + Jitter

Retry transient errors with exponential backoff and jitter. Naïve retries cause retry storms that turn a blip into an outage.
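Here is a minimal sketch of the pattern using "full jitter" (sleep a random amount between zero and the exponential cap). The function name, defaults, and the choice of which exceptions count as transient are assumptions for illustration; retries are only safe if the operation is idempotent.

```python
import random
import time


def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.1,
                       max_delay: float = 5.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)             # full jitter: uniform in [0, cap)


if __name__ == "__main__":
    calls = []

    def flaky():
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionError("transient")
        return "ok"

    print(retry_with_backoff(flaky), "after", len(calls), "attempts")
```

The jitter is what prevents synchronized clients from retrying in lockstep; backoff alone just schedules the next stampede.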

Contain blast radius

Bulkheads

Isolate resource pools (thread pools, connection pools, cells) so one failing dependency can't drown the whole service.

Degrade gracefully

Fallbacks & Timeouts

Every remote call needs a timeout. On failure, serve cached/default/partial data. A degraded page beats an error page.

Smooth the flow

Backpressure

Let slow consumers signal producers to slow down (bounded queues, reactive streams). Unbounded buffering just relocates the crash to OOM.
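The simplest form of backpressure is a bounded buffer that says no. The sketch below wraps the standard library's `queue.Queue`; the `BoundedInbox` name and its `offer`/`take` methods are hypothetical, and what the producer does on rejection (slow down, shed, or return 503) is a policy choice left to the caller.

```python
import queue


class BoundedInbox:
    """A bounded buffer that refuses work when full instead of growing without limit."""

    def __init__(self, max_pending: int):
        self._queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def offer(self, item) -> bool:
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1   # the signal to the producer: slow down or shed
            return False

    def take(self):
        """Remove the oldest item; raises queue.Empty if nothing is pending."""
        return self._queue.get_nowait()


if __name__ == "__main__":
    inbox = BoundedInbox(max_pending=2)
    print([inbox.offer(i) for i in range(4)], inbox.rejected)
```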

Redundancy & the cost of nines

Availability	Downtime / year	Typical cost driver
99% ("two nines")	~3.65 days	Single region, manual recovery
99.9% ("three nines")	~8.77 hours	Redundant instances, automated failover
99.99% ("four nines")	~52.6 minutes	Multi-AZ, no single points of failure
99.999% ("five nines")	~5.26 minutes	Multi-region active-active, heavy investment
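The downtime column is plain arithmetic, reproduced here so you can plug in your own SLO. The sketch assumes a 365.25-day year, which is what the figures in the table correspond to; the function name is illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(slo)
        print(f"{slo}%: {hours:.2f} h/year ({hours * 60:.1f} min)")
```

The same number divided by 12 or 52 gives the monthly or weekly error budget.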

Diminishing returns: Each extra nine roughly 10×'s the cost and complexity. Match the SLO to the business: a checkout flow and an internal admin dashboard do not deserve the same investment. Define SLOs and error budgets explicitly, then engineer to them, not beyond.

Failure-domain checklist

No single point of failure on critical paths (LB, DB, queue all redundant)
Health checks + automated failover, tested via game-days / chaos engineering
Timeouts on every network call; circuit breakers on every dependency
Idempotent writes so retries are safe
Graceful degradation paths defined for each dependency outage
Backups + restore tested (an untested backup is a hope, not a backup)

Decision Toolkit

Master Trade-Off Matrix

My one-screen reference. For each decision: what you gain, what you give up, and the signal that tells you which way to lean.

Decision	Lean A when…	Lean B when…	The cost you pay
Consistency vs Availability	Stale read causes harm (money, inventory)	Uptime > freshness (feeds, carts)	Latency & coordination, or staleness
Vertical vs Horizontal	Below the box ceiling; want simplicity	Past one machine; need HA	Distributed-systems complexity
SQL vs NoSQL	Relations, transactions, ad-hoc queries	Known access pattern, massive scale	Joins & flexibility, or write scale
Normalize vs Denormalize	Write-heavy, integrity-critical	Read-heavy, latency-critical	Read joins, or write fan-out + drift
Sync vs Async	Caller needs the answer now	Work can defer; decouple uptime	Coupling/latency, or eventual consistency
Monolith vs Microservices	Small team, early product, unclear domain boundaries	Many teams, clear bounded contexts	Coupling, or distributed ops tax
Cache vs no cache	Read-heavy, tolerant of slight staleness	Strong freshness, low read volume	Invalidation complexity & stale bugs
Strong vs eventual replication	Correctness on every read	Geo-latency & availability matter	Write latency, or conflict resolution
Build vs Buy	Core differentiator, special needs	Commodity capability (auth, email, search)	Maintenance burden, or vendor lock-in + $
Batch vs Stream	Periodic, high-throughput, simpler	Real-time needs, freshness matters	Latency, or operational complexity

Decision Toolkit

Decision Framework I Use

How I actually run a design decision in a room full of strong opinions. The output is not "the answer" — it is a defensible answer with the rejected alternatives written down.

1. Quantify requirements & constraints first. Read/write ratio, QPS (avg & peak), data size & growth, latency budget (p50/p99), consistency needs, availability SLO, budget, team size & skills, compliance/residency. Numbers, not adjectives. Most arguments dissolve once these are on the board.
2. Estimate on the back of an envelope. Use the latency & capacity numbers to find the binding constraint. Is this storage-bound, compute-bound, bandwidth-bound, or latency-bound? The bottleneck picks the architecture.
3. Identify the one-way doors. Mark which decisions are expensive to reverse (data model, shard key, sync/async boundaries, public contracts). Spend analysis time proportional to reversal cost.
4. Generate 2–3 candidate designs. Not one. A single option is a decision in disguise. Include the boring option (often a modular monolith + Postgres) as a baseline.
5. Score against the constraints, name the trade-off. For each candidate, state explicitly what it optimizes and what it sacrifices. There is no winner without a named loser.
6. Pick the weakest sufficient option. The simplest design that meets the constraints, not the most impressive. Complexity must be earned by a requirement.
7. Write the ADR. Record context, options, decision, and consequences. Future-you and new joiners need the why, especially the rejected paths.
8. Define the trigger to revisit. "We'll shard when the table exceeds X GB / p99 exceeds Y ms." Decisions have expiry conditions; name them so you evolve deliberately, not in a panic.

The question I keep asking: Junior engineers ask "what's the best technology?" Senior engineers ask "what are the requirements?" I ask "what do we have to be wrong about for this to be the wrong choice, and how would we know?" Design for the decision to be reviewable, not just right.

Decision Toolkit

Documenting & Communicating Decisions

A trade-off you can't explain is a trade-off you can't defend. The Architecture Decision Record (ADR) is the artifact I rely on most.

ADR template (keep it to one page)

Title	Short, e.g. "ADR-014: Use Kafka for order events"
Status	Proposed · Accepted · Superseded by ADR-NNN
Context	The forces at play: requirements, constraints, the problem. Numbers here.
Options considered	2–3 candidates, each with pros/cons. Show your work.
Decision	What we chose and the primary reason (the trade-off we accepted).
Consequences	What gets easier, what gets harder, what we now owe (new ops, risks).
Revisit when	The condition that should make us reopen this (scale, cost, SLA breach).

Communicating up vs down: For executives, frame trade-offs in business terms such as cost, time-to-market, risk, and customer impact. ("Strong global consistency adds ~80ms to checkout and ~$X/mo; eventual is invisible to users here.") For engineers, frame in mechanism: round-trips, failure modes, coupling. Same decision, two languages.

Decision Toolkit

Trade-Off Anti-Patterns

The expensive mistakes — usually made by optimizing a virtue no one asked for, or copying a hyperscaler's solution without their problem.

Resume-driven

Resume-driven development

Choosing tech for novelty/CV value, not fit. Kubernetes + microservices + Kafka for a 1000-user app is a tax on every future change.

Premature

Premature optimization & scaling

Architecting for Google scale at startup scale. You pay full complexity now for load that may never arrive. Build for 10× current, not 10000×.

Cargo cult

Cargo-culting hyperscalers

"Netflix does microservices, so we must." Netflix's constraints (thousands of engineers, planetary scale) are not yours. Copy the reasoning, not the architecture.

Golden hammer

One tool for everything

"We use Mongo/Postgres/Kafka for everything." Forcing every workload into one store ignores that storage choice is a trade-off per access pattern.

Distributed monolith

Microservices without boundaries

Services that share a DB and deploy together — all the cost, none of the benefit. Split by data ownership or not at all.

Ignoring ops

"It works on my machine" scale

Designing for the happy path only. No observability, no failure injection, no capacity headroom. The trade-off you forgot to make gets made for you at 3 a.m.

Decision Toolkit

Pre-Flight Checklist

I run this before signing off on any non-trivial system design. If I can't answer a line, that's where the risk is hiding.

Requirements & scale

Read/write ratio and peak QPS are quantified
Data volume + growth rate estimated (1yr / 3yr)
Latency budget set per path (p50/p99)
Availability SLO + error budget agreed with the business
Consistency requirement named per data type

Data & state

Storage chosen per access pattern, not by default
Shard/partition key justified for load + queries
Replication topology & failover path defined
Backup + tested restore exists
Schema/contract evolution strategy in place

Resilience

No single point of failure on critical paths
Timeouts, retries (w/ jitter), circuit breakers everywhere
Idempotency keys on all writes that can retry
Graceful degradation path per dependency
Backpressure / load shedding designed, not assumed

Operability & cost

Observability: logs, metrics, traces, alerts on SLOs
Cost estimated and matched to the SLO (no over-buying nines)
Rollout/rollback & migration plan exists
ADR written with rejected options & revisit trigger
Simplest design that meets the constraints — complexity earned

The one sentence I leave with: The goal is not the most sophisticated system; it is the simplest system that satisfies the constraints, with every trade-off named, measured, and written down so it can be revisited when the constraints change.

Sources

References & Sources

Annotated bibliography behind this system design trade-offs note — distributed-systems laws, latency intuition, data and scaling axes, communication patterns, resilience, and decision governance. Section tags (e.g. §04) show where each source informed the prose, tables, and diagrams. SVG figures, the master trade-off matrix, constraint diamond, and synthesis prose are my own unless noted.

Scope. Synthesis of textbooks, seminal papers, practitioner blogs, and industry patterns (May 2026). Latency bars, PACELC class labels, and availability-nines tables are teaching aids — re-measure against your workload, region, and SLO before committing to a one-way door (shard key, public API, sync↔async boundary).

Citations are numbered continuously [1]–[n] within this section.

Trade-off mental model & one-way doors (§01, §15, §18)

Sowell, T., A Conflict of Visions / The Vision of the Anointed. Basic Books, 1980s–1990s. "There are no solutions, only trade-offs" — §01 lead-rule quote and §14 matrix framing. — §01, §14, §15.
Richards, M., & Ford, N., Fundamentals of Software Architecture. O'Reilly, 2020. First Law ("everything is a trade-off") — hero lede and §01 budget-spending theme. — §01, hero, §15.
Bezos, J., "Day 1 / Type 1 vs Type 2 decisions." Amazon shareholder letters & internal memo (circulated 2011). One-way vs two-way doors — §01 Principle 3 and §15 step 3. — §01, §15, §08.
Truong, L. (synthesis). Constraint diamond SVG (performance · scale · cost · simplicity) — §01 four-budgets figure. LinhTruong.com — §01, §14.

Laws, theorems & org constraints (§02, §04, §11, §17)

Brewer, E. A., "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 2012. CAP during partition vs everyday latency trade-offs — §02 CAP card and §04 opening. — §02, §04.
Gilbert, S., & Lynch, N., "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, 2002. Formal CAP proof — background for §02 and §04 partition diagram. — §02, §04.
Abadi, D., "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Computer, 2012. PACELC (PA/EL vs PC/EC) — §02 PACELC card and §04 PACELC table. — §02, §04.
Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS, 1967. Serial fraction caps speedup — §02 Amdahl card. — §02, §06.
Gunther, N. J., Guerrilla Capacity Planning / USL papers. Universal Scalability Law coherency penalty — §02 USL card. — §02, §06.
Little, J. D. C., "A Proof for the Queuing Formula: L = λW." Operations Research, 1961. Little's Law — §02 queueing card and §03 capacity-math callout. — §02, §03, §15.
Dean, J., & Barroso, L. A., "The Tail at Scale." Communications of the ACM, 2013. Tail latency under high utilization — §02 queueing-wall card. — §02, §13, §18.
Conway, M. E., "How Do Committees Invent?" Datamation, 1968. Conway's Law — §02 org card and §11 distributed-monolith callout. — §02, §11, §17.

Latency numbers & back-of-envelope capacity (§03, §15)

Dean, J., "Numbers Everyone Should Know" (talk slides). Order-of-magnitude latency ladder — §03 bar chart (L1 → cross-continent RTT). — §03.
Bryant, R., & O'Hallaron, D., Computer Systems: A Programmer's Perspective. Pearson, 3rd ed. Memory hierarchy and I/O latency — §03 figcaption gaps. — §03.
High Scalability, "Latency Numbers Every Programmer Should Know" (curated table). Community-maintained latency reference — §03 relative gaps. highscalability.com — §03.
Kleppmann, M., Designing Data-Intensive Applications. O'Reilly, 2017. Envelope math, throughput, and latency budgets — §03 tip and §15 step 2. — §03, §15, §18.
Barroso, L. A., Clidaras, J., & Hölzle, U., The Datacenter as a Computer. Morgan & Claypool, 2013. Scale-out economics — background for §06 horizontal scaling. — §03, §06.

CAP, PACELC & datastore classes (§04, §05, §08, §14)

Kleppmann, Designing Data-Intensive Applications — Ch. 5–9. Replication, consistency, and partition behavior — §04–§05 and §08 replication topologies. — §04, §05, §08.
DeCandia, G., et al., "Dynamo: Amazon's Highly Available Key-value Store." SOSP, 2007. AP/quorum tunability — §04 DynamoDB/Cassandra row and §08 leaderless card. — §04, §08.
Corbett, J. C., et al., "Spanner: Google's Globally-Distributed Database." OSDI, 2012. Global strong consistency — §04 Spanner/Cockroach row. — §04, §05.
Herlihy, M. P., & Wing, J. M., "Linearizability: A Correctness Condition for Concurrent Objects." ACM TOPLAS, 1990. Strongest consistency rung — §05 linearizable tick. — §05.
Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 1978. Causal ordering — §05 causal consistency tick. — §05.
Shapiro, M., et al., "Conflict-Free Replicated Data Types." SSS, 2011. CRDTs for conflict resolution — §05 hidden-cost callout and §08 multi-leader card. — §05, §08.
Richardson, C., "Pattern: Database per service." microservices.io. Per-service data ownership — §11 microservices diagram caption. microservices.io — §11, §12.

Scaling, statelessness & replication/partitioning (§06, §08, §14, §18)

Kleppmann, DDIA — partitioning & replication chapters. Leader/follower, quorum, sharding strategies — §08 entire section. — §06, §08, §18.
Karger, D., et al., "Consistent Hashing and Random Trees." STOC, 1997. Hash/ring partitioning — §08 hash-sharding row. — §08.
Amazon ElastiCache / AWS Architecture Blog — horizontal scaling patterns. Stateless app tier + externalized state — §06 statelessness figure. aws.amazon.com/architecture — §06.
Google SRE Team, Site Reliability Engineering. O'Reilly, 2016. Capacity planning and utilization targets — §02 tail-latency theme and §13 nines table. sre.google — §02, §13, §18.
Truong, L. (synthesis). Vertical vs horizontal trade-off block and stateless-LB diagram — §06. — §06, §14.

Data models: SQL, NoSQL, OLTP/OLAP (§07, §14, §18)

Codd, E. F., "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 1970. Relational/SQL foundation — §07 SQL column. — §07.
Kleppmann, DDIA — data models chapter. Document, key-value, wide-column, graph comparison — §07 NoSQL table. — §07.
Pritchett, D., "BASE: An Acid Alternative." ACM Queue, 2008. BASE vs ACID framing — §07 ACID/BASE trade-off block. — §07.
Stonebraker, M., et al., "The End of an Architectural Era (It's Time for a Complete Rewrite)." VLDB, 2007. OLTP vs specialized stores — background for §07 OLTP/OLAP table. — §07.
Inmon, W. H., & Kimball, R. — data warehousing literature. Column-oriented OLAP and ETL/ELT — §07 OLAP row and analytics-isolation callout. — §07, §18.
Dehghani, Z., "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh." martinfowler.com, 2019. Analytics across services — background for §07 CDC/warehouse row. martinfowler.com — §07.

Caching strategies (§09, §14, §18)

Fitzpatrick, B., "Distributed Caching with Memcached." Linux Journal, 2004. Cache-aside at scale — §09 cache-layer stack. — §09.
AWS / Azure / Cloudflare CDN documentation (synthesized). Edge vs CDN vs app vs DB cache tiers — §09 caching-layer figure. — §09.
Kleppmann, DDIA — caching & materialized views. Staleness vs latency trade — §09 figcaption and §14 cache row. — §09, §14.
Facebook / Meta engineering posts on cache invalidation (synthesized). "Hard problem" of invalidation — §09 trade-off theme. — §09, §18.

Sync vs async, APIs & messaging (§10, §12, §14, §18)

Fielding, R. T., Architectural Styles and the Design of Network-based Software Architectures (REST). UC Irvine, 2000. REST constraints — §10 REST row. — §10.
gRPC / Protocol Buffers documentation. HTTP/2 RPC and typed contracts — §10 gRPC row. grpc.io — §10.
GraphQL specification & Facebook engineering notes. Client-driven field selection — §10 GraphQL row. graphql.org — §10.
Hohpe, G., & Woolf, B., Enterprise Integration Patterns. Addison-Wesley, 2003. Messaging, pub/sub, competing consumers — §10 queue vs log cards. — §10, §12.
Kreps, J., et al., "Kafka: A Distributed Messaging System for Log Processing." NetDB workshop, 2011. Event log / replay — §10 Kafka card and §12 EDA card. — §10, §12.
Narkhede, N., "Exactly-Once Semantics Are Possible: Here's How Kafka Does It." Confluent blog, 2017. At-least-once + idempotency reality — §10 idempotency callout. confluent.io — §10, §13, §18.
Richardson, C., "Pattern: Messaging." microservices.io. Sync vs async coupling — §10 trade-off block. microservices.io — §10, §14.

Monolith vs microservices (§11, §14, §17)

Lewis, J., & Fowler, M., "Microservices." martinfowler.com, 2014. Service boundaries and distributed costs — §11 table and §17 cargo-cult card. martinfowler.com — §11, §17.
Fowler, M., "MonolithFirst." martinfowler.com, 2015. Modular monolith baseline — §11 opening sub and §15 step 4. martinfowler.com — §11, §15, §17.
Newman, S., Building Microservices (2nd ed.). O'Reilly, 2021. Independent deployability vs ops tax — §11 comparison table. — §11, §14.
Evans, E., Domain-Driven Design. Addison-Wesley, 2003. Bounded contexts and modular monolith — §11 split-by-capability callout. — §11, §15.
Richardson, C., "Pattern: Shared database." microservices.io. Shared schema, commonly treated as an anti-pattern — §11 distributed-monolith callout and §17 distributed-monolith card. microservices.io — §11, §17.
Skelton, M., & Pais, M., Team Topologies. IT Revolution, 2019. Team–service alignment — §11 Conway callout. — §11, §17.

CQRS, event sourcing, saga & EDA (§12, §14, §18)

Young, G., "CQRS Documents" / CQRS pattern posts. Separate read/write models — §12 CQRS card. cqrs.files.wordpress.com — §12.
Fowler, M., "Event Sourcing." martinfowler.com. Event log as source of truth — §12 event-sourcing card. martinfowler.com — §12.
Richardson, C., "Pattern: Saga." microservices.io. Compensating transactions — §12 saga card and §11 diagram caption. microservices.io — §11, §12, §18.
Garcia-Molina, H., & Salem, K., "Sagas." ACM SIGMOD Record, 1987. Original saga concept — background for §12. — §12.
Richardson, C., Microservices Patterns. Manning, 2018. Choreography vs orchestration sagas — §12 saga table. — §12.
Evans, E., Domain-Driven Design Reference — domain events. Event-driven architecture — §12 EDA card. — §12.

Resilience, SLOs & failure modes (§13, §18)

Nygard, M., Release It! (2nd ed.). Pragmatic Bookshelf, 2018. Circuit breaker, bulkhead, timeout, stability patterns — §13 pattern cards. — §13, §18.
Netflix Hystrix / resilience4j documentation (synthesized). Circuit breaker states — §13 circuit-breaker card. — §13.
Google SRE — SLOs, error budgets, nines. Availability vs cost — §13 nines table and diminishing-returns callout. sre.google — §13, §18.
AWS Architecture Blog, "Exponential Backoff And Jitter." Retry-storm prevention — §13 retries card. aws.amazon.com — §13, §18.
Basiri, A., et al., "Chaos Engineering." IEEE Software / Netflix practice. Game-days and failure injection — §13 failure-domain checklist. — §13, §18.
Reactive Streams / backpressure specification. Bounded queues and flow control — §13 backpressure card. reactive-streams.org — §13.
Dean & Barroso, "The Tail at Scale." Cascading latency under load — §13 resilience context. — §13.

Decision framework, ADRs & anti-patterns (§14–§18)

Truong, L. (synthesis). Master trade-off matrix — §14 one-screen reference table. — §14.
Nygard, M., "Documenting Architecture Decisions." Cognitect, 2011. ADR template — §16 entire section and §18 operability checklist. cognitect.com — §15, §16, §18.
Thomson, J., "Architecture Decision Records." adr.github.io — §16 ADR template table. — §16, §18.
Beck, K., Extreme Programming Explained — YAGNI. Simplest sufficient design — §15 step 6 and §17 golden-hammer card. — §15, §17.
Knuth, D., "Structured Programming with go to Statements." ACM Computing Surveys, 1974. "Premature optimization" quote — §17 premature-scaling card. — §17.
Foote, B., & Yoder, J., "Big Ball of Mud." PLoP, 1997. Structureless systems — analogy for §17 resume-driven theme. — §17.
Netflix engineering culture posts (synthesized). Hyperscaler context — §17 cargo-cult card. — §17.
Truong, L., System Design Trade-Offs — personal working notes. May 2026. Constraint diamond, latency bar chart, CAP partition diagram, consistency spectrum, stateless scaling figure, monolith/microservices SVG, caching stack, trade-off blocks, pre-flight checklist, and synthesis prose. LinhTruong.com — all sections.

Before you cite externally. PACELC labels and datastore examples in §04 reflect typical defaults — vendors change tunables. Latency numbers in §03 drift with hardware generations; use them for ratios, not SLAs. "Exactly once" in messaging (§10) usually means idempotent at-least-once in practice. Match availability nines (§13) to business impact, not ego. Most teams should start with a modular monolith (Newman, Fowler) and earn microservices with team and load evidence. Re-measure every one-way door against your own constraints before production.