Personal notes · May 2026

The System Architecture Playbook

I wrote this note to capture how I think about architecture trade-offs: what to optimize, what to defer, and how to document decisions so they survive the next refactor. It collects the diagrams, patterns, and checklists I use when reviewing cloud-native and distributed systems.

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

The question behind this note: when a design choice is expensive to reverse, what forces am I trading — and can I defend that trade in writing? What follows is the framework I use to answer that.

Written: May 2026
Context: Cloud-native, distributed, data & AI-intensive systems
Covers: Scale · resilience · security · delivery · decisions
✍️ By: Linh Truong

00 Introduction — How I Use These Notes

Architecture is the set of design decisions that are expensive to change. I find the hardest part isn't writing code — it's picking the few decisions that determine whether a system still works five to ten years out.

I organized this note as a progression: mindset (how to think), quality attributes (what to optimize), concrete building blocks (styles, data, scale, resilience, security, observability, delivery), and the operating practices (estimation, decisions, anti-patterns, checklists) I run in design reviews.

▸ The idea I keep coming back to

There is no "best architecture." There is only the architecture that is least wrong for your constraints. Every section below is a menu of trade-offs, not a list of correct answers. The skill is matching the trade-off to the context — and being able to defend the choice in writing.

Sections covering the full design surface
∞ · Valid architectures — context decides
5–10 yr · Horizon I try to design for
Things that age you: state, scale, coupling

How I read it

Skim the diagrams first. Each section has a canonical diagram; I use them as a visual index when I'm orienting on a problem.
Use the trade-off tables as decision aids, not as verdicts. I add my own column for the constraints I'm actually under.
The checklists in §18 are the operational distillation. I print them and run them in design reviews.

01 The Architect's Mindset

Before any pattern, the disposition. For me, architecture is a thinking discipline applied under uncertainty, organizational politics, and incomplete information.

1.1 First principles over patterns

Patterns are compressed experience, but they encode the context in which they were discovered. Always decompose to the underlying forces — latency budgets, failure domains, data gravity, team topology — before reaching for a named solution. A pattern applied without its forces is cargo-culting.

1.2 The laws you cannot repeal

⚖ Conway's Law

Organizations design systems that mirror their communication structures. Corollary (the Inverse Conway Maneuver): shape teams to match the architecture you want, because the system will grow to match the org chart whether you plan it or not.

⚖ The CAP & PACELC trade-off

Under a network Partition you choose Consistency or Availability. PACELC extends it: Else (no partition) you still trade Latency vs. Consistency. There is no free lunch in distributed state — only a choice of which guarantee to sacrifice.

⚖ Amdahl & the Universal Scalability Law

Speedup is capped by the serial fraction (Amdahl). The USL adds a coherency penalty: beyond some point, adding nodes makes the system slower because of cross-talk and contention. Scaling has a peak, not an asymptote.
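To make the "peak, not an asymptote" point concrete, here is a minimal sketch of both laws. The parameter names (`serial_fraction`, `alpha` for contention, `beta` for coherency) and the example values are my own assumptions, not measurements from any real system.

```python
import math


def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup on n workers when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def usl_throughput(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: relative capacity at n nodes.

    alpha models contention (queueing on shared resources);
    beta models coherency cost (pairwise cross-talk between nodes).
    """
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_nodes(alpha: float, beta: float) -> float:
    """Node count at which USL throughput peaks: sqrt((1 - alpha) / beta)."""
    return math.sqrt((1.0 - alpha) / beta)


if __name__ == "__main__":
    # Illustrative coefficients only: 5% contention, 0.1% coherency penalty.
    alpha, beta = 0.05, 0.001
    print("peak at ~%.1f nodes" % usl_peak_nodes(alpha, beta))
    for n in (1, 10, 31, 100):
        print(n, round(usl_throughput(n, alpha, beta), 2))
```

With a non-zero `beta`, throughput at 100 nodes is lower than at about 31 nodes, which is the retrograde region the USL predicts. With `beta = 0` the curve flattens toward the Amdahl limit instead.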

Heuristic

Reversibility test

Classify every decision as a one-way door or a two-way door. Spend your scarce deliberation on one-way doors; move fast on reversible ones.

Heuristic

Last responsible moment

Defer irreversible decisions until the cost of deferring exceeds the cost of deciding. Keep options open while information is still arriving.

Heuristic

YAGNI vs. evolvability

Don't build for imagined futures, but do leave seams (interfaces, boundaries) where change is plausible. Evolvability is cheap; speculative generality is not.

1.3 What I try to deliver in an architecture role

Clarity under ambiguity — turning a vague mandate into bounded problems with explicit trade-offs.
Written, reviewable decisions — RFCs/ADRs that survive the author leaving.
Force multiplication — raising the design ceiling of every team I touch, not just shipping my own code.
Risk pricing — naming what could go catastrophically wrong and what it would cost to insure against it.

◆ Principle: optimize for the team's throughput, not the artifact's elegance

A "perfect" design no one can operate, debug, or extend is a liability. Design for the median engineer at 3 a.m. during an incident — not for the architecture diagram's beauty.

02 Quality Attributes — What You Are Actually Optimizing

Architecture exists to satisfy non-functional requirements (the "-ilities"). Functional requirements decide what the system does; quality attributes decide whether it is worth running. You cannot maximize all of them — they trade against each other.

flowchart TB
  A([Business Goals]) --> B{Architecturally<br/>Significant<br/>Requirements}
  B --> C[Operational<br/>availability · scalability · performance · reliability]
  B --> D[Structural<br/>modularity · maintainability · evolvability · testability]
  B --> E[Cross-cutting<br/>security · privacy · observability · cost · compliance]
  C --> F([Architecture Decisions<br/>+ Trade-off Records])
  D --> F
  E --> F
  F --> G([Fitness Functions<br/>continuously verify the -ilities])
  G -.feedback.-> F

Figure 2.1 — Business goals are refined into architecturally significant requirements, which drive decisions, which are continuously verified by fitness functions.

2.1 The attribute catalog & how to make each measurable

A quality attribute you cannot measure is a wish. Every attribute below must be expressed as a testable scenario: source → stimulus → environment → response → response measure.

Attribute	Question it answers	How to measure it	Primary tactics
Availability	What fraction of time is it usable?	SLO, e.g. 99.95% (≈4.4h/yr down)	Redundancy, failover, health checks, graceful degradation
Performance	How fast under load?	p50/p95/p99 latency, throughput	Caching, async, indexing, concurrency, CDN
Scalability	Does cost-per-unit stay flat as load grows?	Throughput vs. resources curve	Statelessness, sharding, horizontal scale, queues
Reliability	Does it produce correct results consistently?	MTBF, error rate, data-loss events	Idempotency, retries, transactions, validation
Resilience	Does it recover from failure?	MTTR, blast radius, RTO/RPO	Circuit breakers, bulkheads, chaos testing
Security	Can it resist misuse?	Attack surface, time-to-detect, audit pass	Zero trust, least privilege, encryption, defense-in-depth
Maintainability	How cheap is change?	Lead time for change, change-fail rate	Modularity, low coupling, tests, clear contracts
Evolvability	Can it absorb unforeseen change?	Coupling metrics, fitness functions	Bounded contexts, versioned APIs, anti-corruption layers
Observability	Can you ask new questions of prod?	Trace coverage, MTTD	Structured logs, metrics, traces, high-cardinality events
Cost efficiency	What does a unit of value cost to serve?	$ / request, $ / tenant, unit economics	Right-sizing, autoscaling, tiered storage, FinOps

⚠ The trade-off that catches everyone

Pushing consistency and strong durability up tends to push latency and availability down, and pushing security up tends to push usability and performance down. Name which attributes you are deliberately sacrificing — an unstated sacrifice is a future incident.

2.2 Availability math you should know cold

SLO ("nines")	Downtime / year	Downtime / month	Typical use
99% (two nines)	3.65 days	7.3 hours	Internal tools, batch
99.9% (three nines)	8.77 hours	43.8 min	Most SaaS features
99.95%	4.38 hours	21.9 min	Paid B2B, commerce
99.99% (four nines)	52.6 min	4.4 min	Payments, critical APIs
99.999% (five nines)	5.26 min	26 sec	Telco, core infra — very costly

Series vs. parallel: dependencies in series multiply (each hard dependency lowers your ceiling — 5 deps at 99.9% ≈ 99.5%). Redundant components in parallel add nines (two 99% replicas ≈ 99.99% if failures are independent). Architect for independent failure domains.
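The series and parallel rules above reduce to two one-line formulas. This is a small sketch for checking the arithmetic. It assumes independent failures, as the text does, and the function names are mine.

```python
from math import prod

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, the basis used in the table above


def series(*availabilities: float) -> float:
    """Hard dependencies in series: all must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant replicas: the system is down only if every replica is down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    print(series(*[0.999] * 5))            # five hard dependencies at 99.9%
    print(parallel(0.99, 0.99))            # two independent 99% replicas
    print(downtime_hours_per_year(0.9995))
```

Correlated failures (a shared zone, a shared deploy, a shared config push) break the independence assumption, so treat the parallel result as an upper bound.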

03 The Design Process — A Repeatable Method

When a design has to be legible to others — not just workable in my head — I follow a method like the one below. It scales from a one-pager to a platform decision.

flowchart LR
  S1["1 · Frame<br/>problem, goals,<br/>non-goals, constraints"] --> S2["2 · Requirements<br/>functional +<br/>quality attributes"]
  S2 --> S3["3 · Estimate<br/>scale, traffic,<br/>data, growth"]
  S3 --> S4["4 · Define<br/>APIs & data<br/>contracts"]
  S4 --> S5["5 · High-level<br/>design<br/>(components)"]
  S5 --> S6["6 · Deep-dive<br/>bottlenecks &<br/>failure modes"]
  S6 --> S7["7 · Trade-off<br/>analysis &<br/>ADR"]
  S7 --> S8["8 · Validate<br/>fitness functions,<br/>review, prototype"]
  S8 -.iterate.-> S5

Figure 3.1 — The eight-step design loop. Steps 5–8 iterate; the framing in step 1 is the highest-leverage and most-skipped step.

3.1 The steps, with the questions to ask at each

1 · Frame the problem. What user/business outcome? What are the explicit non-goals? What constraints are hard (compliance, budget, deadline, existing systems)? Most bad designs are answers to the wrong question.
2 · Requirements. Separate functional from quality attributes. Rank the top 3 -ilities; you will optimize for these and explicitly compromise the rest.
3 · Estimate scale (see §13). Reads vs. writes, peak vs. average, data size, growth rate, latency budget. Numbers convert opinions into engineering.
4 · Define contracts. APIs and data schemas first — they are the longest-lived artifacts and the hardest to change once clients depend on them.
5 · High-level design. Boxes and arrows: clients, gateways, services, stores, queues, caches. Keep it to one whiteboard.
6 · Deep dives. Attack the riskiest part: the hottest path, the largest table, the strongest consistency need, the worst failure mode.
7 · Trade-off analysis. For each significant decision, list ≥2 options and why you rejected the alternatives. Write the ADR (§15).
8 · Validate. Prototype the risky bit, write fitness functions, run a design review, and load-test assumptions before committing.

▸ The C4 model for communicating the design

Describe systems at four zoom levels: Context (system + users + external systems), Containers (apps, services, stores), Components (inside a container), and Code (only when needed). Match the diagram's altitude to the audience — executives want Context; engineers want Containers/Components.

04 Architecture Styles — The Macro Choice

The style is your top-level structural commitment. It is the most expensive decision to reverse, so it deserves the most rigor. Choose by team size, domain complexity, and required independence of deployment — not by fashion.

flowchart TB
  subgraph MONO["Modular Monolith — one deployable, internal modules"]
    direction LR
    m1[Orders mod] --- m2[Billing mod] --- m3[Inventory mod]
  end
  subgraph MICRO["Microservices — independent deployables"]
    direction LR
    s1[(Orders)] -.- s2[(Billing)] -.- s3[(Inventory)]
  end
  subgraph EVENT["Event-Driven — choreography via broker"]
    direction LR
    p1[Producer] --> BR{{Event Broker}}
    BR --> c1[Consumer A]
    BR --> c2[Consumer B]
  end
  MONO ==>|"split when team & domain<br/>complexity grow"| MICRO
  MICRO ==>|"decouple temporal &<br/>scaling concerns"| EVENT

Figure 4.1 — The common evolutionary path: start as a modular monolith, extract services along proven seams, decouple with events. Skipping straight to distributed micro-everything is the #1 self-inflicted wound.

4.1 The styles in brief

Default for most teams

Modular Monolith

Single deployable, strict internal module boundaries (own data, explicit interfaces). Simplest ops, in-process calls, easy transactions. Best when: roughly three teams or fewer, domain still being discovered. The boundaries become your future service seams.

Scale of org & load

Microservices

Independently deployable services around bounded contexts, each owning its data. Enables team autonomy and independent scaling — at the cost of distributed-systems complexity, network failure modes, and operational overhead. Earn it; don't start with it.

Decoupling in time

Event-Driven

Components react to events via a broker (Kafka, Pulsar, etc.). Excellent for decoupling, buffering load spikes, fan-out, and audit/replay. Cost: eventual consistency, harder debugging, requires strong observability and schema governance.

Spiky / event-glue

Serverless / FaaS

Functions + managed services; scale-to-zero, pay-per-use. Great for glue, spiky and unpredictable workloads, and small teams. Cost: cold starts, vendor coupling, execution limits, harder local dev and stateful flows.

Domain isolation

Cell-Based / Bulkheaded

Replicate the whole stack into isolated "cells," each serving a slice of users/tenants. Limits blast radius, enables zonal/regional isolation. Used at hyperscale for fault containment and predictable scaling units.

Layering discipline

Hexagonal / Clean / Ports & Adapters

An orthogonal pattern (works inside any style): isolate domain logic from I/O via ports and adapters. Maximizes testability and keeps infrastructure swappable. Pairs well with DDD.
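As an illustration of the ports-and-adapters idea in the last card, here is a minimal sketch. The names `Order`, `OrderRepository`, `InMemoryOrderRepository`, and `PlaceOrder` are hypothetical; the point is only that the use case depends on the port, so the storage adapter can be swapped without touching domain logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


class OrderRepository(Protocol):
    """Port: what the domain needs from storage, stated in domain terms."""

    def save(self, order: Order) -> None: ...

    def get(self, order_id: str) -> Optional[Order]: ...


class InMemoryOrderRepository:
    """Adapter: one implementation of the port. A SQL adapter would expose the same methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._rows[order.order_id] = order

    def get(self, order_id: str) -> Optional[Order]:
        return self._rows.get(order_id)


class PlaceOrder:
    """Domain use case: depends only on the port, never on a concrete database."""

    def __init__(self, repo: OrderRepository) -> None:
        self._repo = repo

    def execute(self, order_id: str, total_cents: int) -> Order:
        if total_cents <= 0:
            raise ValueError("order total must be positive")
        order = Order(order_id, total_cents)
        self._repo.save(order)
        return order
```

In tests the in-memory adapter stands in for the real store, which is where most of the testability benefit comes from.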

4.2 Style trade-off matrix

Dimension	Modular Monolith	Microservices	Event-Driven	Serverless
Operational complexity	Low	High	Med-High	Medium
Independent deploy/scale	No	Yes	Yes	Yes
Transactional simplicity	High	Low (sagas)	Low	Medium
Debuggability	Easy	Needs tracing	Hard	Medium
Team autonomy	Medium	High	High	High
Fit for spiky load	Medium	Medium	Good	Excellent
Best at team count	1–3	4+	4+	1–small

⚠ The distributed monolith — the worst of both worlds

Microservices that share a database, deploy together, or call each other synchronously in long chains give you distributed-systems pain without the independence benefits. If services can't be deployed independently, they aren't microservices — they're a monolith you have to debug over the network.

05 Communication & Integration

How components talk is as architecturally significant as how they're divided. The core axis: synchronous (request/response, temporal coupling) vs. asynchronous (messaging, temporal decoupling).

flowchart LR
  C([Client]) --> GW[API Gateway<br/>auth · rate-limit · routing]
  GW -->|REST / gRPC| SA[Service A]
  GW -->|GraphQL| BFF[BFF]
  SA -->|gRPC sync| SB[Service B]
  SA -->|publish event| Q{{Message Broker}}
  Q -->|subscribe| SC[Service C]
  Q -->|subscribe| SD[Service D]
  SB --> DB[(DB B)]
  SC --> DBC[(DB C)]

Figure 5.1 — A typical hybrid: synchronous edge (gateway → services) for user-facing latency, asynchronous events for side-effects, fan-out, and decoupling.

5.1 Choosing a protocol

Style	Sweet spot	Strengths	Watch out for
REST/JSON	Public APIs, CRUD, broad compatibility	Ubiquitous, cacheable, simple	Over/under-fetching, chatty, weak typing
gRPC / Protobuf	Internal service-to-service, low latency	Fast, strongly typed, streaming, codegen	Browser support, debuggability, binary
GraphQL	Aggregating many backends for varied clients	Client-shaped queries, one round-trip	Caching, N+1, query-cost abuse, complexity
Messaging / events	Decoupling, fan-out, buffering, async work	Resilient, scalable, replayable	Eventual consistency, ordering, duplicates
WebSocket / SSE	Realtime push, live updates	Low-latency bidirectional / server push	Connection state, scaling fan-out

5.2 Patterns that earn their keep

API Gateway — single entry point for auth, rate limiting, routing, TLS termination, request shaping. Keep business logic out of it.
Backend-for-Frontend (BFF) — a tailored aggregation layer per client type (web, mobile) to avoid one-size-fits-none APIs.
Service mesh (sidecar) — push mTLS, retries, timeouts, traffic shaping, and telemetry out of app code into the platform (e.g., Envoy-based meshes).
Outbox pattern — write the business change and the "to-publish" event in one local transaction, then relay — the cure for dual-write inconsistency.
Async request-reply — for long operations: return 202 Accepted + a status/polling URL or callback, rather than holding a connection.
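The outbox pattern from the list above is easier to see in code. This is a minimal sketch using an in-memory SQLite database; the table names, the `order.placed` topic, and the `publish` callback are assumptions for illustration, and a real relay would run as a separate poller or change-data-capture process.

```python
import json
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    """Write the business row and the to-publish event in ONE local transaction."""
    event = {"order_id": order_id, "total_cents": total_cents}
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps(event)),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in order; return how many were published."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    published = 0
    for seq, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        published += 1
    return published
```

Note the relay gives at-least-once delivery: a crash after `publish` but before the `UPDATE` republishes the event on the next pass, so consumers still need to be idempotent.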

◆ Principle: prefer asynchrony at boundaries you don't control

Every synchronous cross-service call is a coupling of availability — the caller is only as available as the callee. Where you can tolerate eventual results, an event/queue converts a hard dependency into a soft one and absorbs load spikes for free.

06 Data Architecture — Where Systems Truly Age

Stateless tiers are easy; state is where the hard, irreversible decisions live. Data has gravity: it is heavy to move, expensive to reshape, and the schema outlives the code. Spend disproportionate care here.

6.1 Choosing a store by access pattern, not by hype

Store type	Data shape / access	Examples	Trade-off
Relational (OLTP)	Structured, relational, strong consistency, transactions	PostgreSQL, MySQL	Vertical scaling limits; sharding adds complexity
Document	Flexible/nested, aggregate-oriented reads	MongoDB, DynamoDB	Weaker cross-doc joins/consistency
Key-Value	Simple lookups, caching, sessions	Redis, DynamoDB	No rich queries
Wide-column	Massive write throughput, time-series	Cassandra, Bigtable	Query patterns fixed at design time
Graph	Relationship-heavy traversal	Neo4j, Neptune	Niche; scaling traversals is hard
Search	Full-text, faceted, relevance	Elasticsearch, OpenSearch	Not a source of truth; eventual
Analytical (OLAP)	Aggregations over huge datasets	Snowflake, BigQuery, ClickHouse	Not for low-latency point writes
Vector	Similarity search / embeddings (AI)	pgvector, Pinecone, Milvus	Recall vs. latency tuning; freshness

◆ Principle: polyglot persistence, single source of truth

Use the right store for each access pattern, but designate one system of record per piece of data and treat all others (search indexes, caches, read replicas, OLAP) as derived, rebuildable projections. If you can't rebuild a derived store from the source, you have two sources of truth and a future divergence bug.

6.2 Scaling data: replication and partitioning

flowchart TB
  subgraph REP["Replication — copies for read scale & HA"]
    direction LR
    PR[(Primary<br/>writes)] -->|stream| R1[(Replica 1<br/>reads)]
    PR -->|stream| R2[(Replica 2<br/>reads)]
  end
  subgraph SHARD["Partitioning / Sharding — split for write scale"]
    direction LR
    RT{Router<br/>shard key} --> S1[(Shard A<br/>users 0-3)]
    RT --> S2[(Shard B<br/>users 4-7)]
    RT --> S3[(Shard C<br/>users 8-9)]
  end

Figure 6.1 — Replication scales reads and provides HA/failover. Partitioning (sharding) scales writes and dataset size. Large systems use both.

Replication (leader–follower) scales reads & gives failover, but introduces replication lag → read-your-writes anomalies. Mitigate with read-from-leader for the writer, or sticky sessions.
Sharding scales writes/storage. The hardest part is the shard key: it must spread load evenly and avoid hot spots and cross-shard transactions. Bad shard keys are nearly impossible to change later.
Consistent hashing minimizes data movement when nodes are added/removed — the standard for distributed caches and stores.
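To show why consistent hashing limits data movement, here is a small hash-ring sketch with virtual nodes. The class name, the choice of SHA-256, and the default of 100 virtual nodes per server are my assumptions; production systems tune these and often add replication on top.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: a key belongs to the first node point clockwise from its hash."""

    def __init__(self, nodes=(), vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points = []  # sorted (hash, node) pairs
        self._hashes = []  # parallel list of hashes, used for bisect
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _reindex(self) -> None:
        self._points.sort()
        self._hashes = [h for h, _ in self._points]

    def add(self, node: str) -> None:
        # Each node gets many points on the ring so load spreads evenly.
        self._points.extend((self._hash(f"{node}#{i}"), node) for i in range(self._vnodes))
        self._reindex()

    def remove(self, node: str) -> None:
        self._points = [(h, n) for h, n in self._points if n != node]
        self._reindex()

    def node_for(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._points)
        return self._points[idx][1]
```

When a fourth node joins a three-node ring, the only keys that change owner are the ones the new node takes over, roughly a quarter of them. With naive `hash(key) % n` placement, most keys would move.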

6.3 Advanced data patterns

CQRS

Separate the write model (normalized, transactional) from the read model (denormalized, query-optimized projections). Lets each side scale and evolve independently. Adds eventual consistency and projection-maintenance cost — use it where read and write shapes genuinely diverge.

Event Sourcing

Persist the sequence of state-changing events as the source of truth; current state is a fold over events. Gives a perfect audit log, time travel, and replay — at the cost of schema/versioning discipline and a steeper mental model.

Saga (distributed transactions)

Replace cross-service ACID with a sequence of local transactions plus compensating actions on failure. Orchestrated (central coordinator) or choreographed (events). The standard answer to "how do I do a transaction across microservices?" — the honest answer is "you don't; you do a saga."

Data Mesh / Lakehouse

Decentralize analytical data ownership to domains that publish data products with contracts and SLAs, over a self-serve platform. The lakehouse unifies the data lake's flexibility with warehouse-grade transactions/governance. For org-scale analytics.
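Of the patterns above, the saga is the one I find easiest to get wrong, so here is a minimal orchestrated sketch. The step names and the `run_saga` helper are hypothetical. A real orchestrator would also persist its progress so it can resume after a crash, and compensations themselves must be idempotent and retried.

```python
from typing import Callable, List, Tuple

# (name, action, compensation): each action is a local transaction in one service.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns True if every step succeeded, False if the saga was compensated.
    """
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensate))
    return True


if __name__ == "__main__":
    log: List[str] = []

    def charge_card() -> None:
        raise RuntimeError("card declined")  # simulated failure in step 2

    ok = run_saga([
        ("reserve-inventory", lambda: log.append("reserve"), lambda: log.append("release")),
        ("charge-payment", charge_card, lambda: log.append("refund")),
        ("create-shipment", lambda: log.append("ship"), lambda: log.append("cancel-shipment")),
    ])
    print(ok, log)
```

The failed step's own compensation is not run, because its local transaction never committed; only earlier, completed steps are undone.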

⚠ The shared database anti-pattern

Multiple services reading/writing the same tables couples them at the deepest level — a schema change for one breaks the others, and no one owns the data's invariants. Each service should own its data and expose it only through APIs/events.

07 Scalability Strategy

Scalability is keeping cost-per-unit roughly flat as load grows by orders of magnitude. The toolkit is small and well-understood; the art is applying it before — not during — the incident.

flowchart TB
  U([Users]) --> CDN[CDN / Edge<br/>cache static + near user]
  CDN --> LB[Load Balancer<br/>health-aware]
  LB --> A1[App N1]
  LB --> A2[App N2]
  LB --> A3[App N3 ...autoscale]
  A1 --> CA[(Cache<br/>Redis)]
  A2 --> CA
  A3 --> CA
  CA -.miss.-> DB[(Primary DB)]
  DB --> RR[(Read Replicas)]
  A1 --> MQ{{Queue}}
  MQ --> W[Async Workers]

Figure 7.1 — The canonical scalable web topology: edge caching, stateless autoscaled app tier, distributed cache fronting the DB, read replicas, and a queue offloading slow work to workers.

7.1 The scaling toolkit

Statelessness — push session/state to a shared store so any node can serve any request; the precondition for horizontal scale and autoscaling.
Horizontal > vertical — add commodity nodes (linear-ish, fault-tolerant) rather than bigger boxes (hard ceiling, single point of failure). Vertical buys time; horizontal buys headroom.
Caching at every layer (browser → CDN → app → distributed cache → DB buffer). The cheapest performance you'll ever buy — and the richest source of staleness bugs.
Load balancing — distribute work; remove unhealthy nodes; enable zero-downtime deploys.
Async offload — move anything not needed for the response (emails, thumbnails, indexing, analytics) onto a queue.
Read/write split — replicas for reads, primary for writes; CQRS when shapes diverge.

7.2 Caching strategy & the hard parts

Pattern	How it works	Use when
Cache-aside (lazy)	App checks cache, on miss loads DB + populates	Read-heavy, tolerant of slight staleness (most common)
Read-through	Cache loads from DB on miss transparently	Want caching logic centralized
Write-through	Write to cache + DB synchronously	Read-after-write consistency needed
Write-behind	Write to cache, async flush to DB	Write-heavy, can tolerate small loss window

⚠ The three caching demons

Invalidation (the famously hard problem — prefer TTLs + event-driven busting), stampede / thundering herd (many misses hit the DB at once on expiry — use request coalescing, jittered TTLs, locks), and hot keys (one key overwhelms a node — replicate or shard hot keys). Plan for all three before launch.
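A minimal sketch of cache-aside with jittered TTLs and explicit invalidation, two of the mitigations named above. The `CacheAside` class and its parameters are hypothetical, the store is a plain in-process dict rather than Redis, and request coalescing is deliberately left out because this sketch is single-threaded.

```python
import random
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Cache-aside: check the cache, on miss load from the source and populate."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        jitter: float = 0.1,
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._jitter = jitter
        self._clock = clock
        self._rng = rng
        self._store: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # hit
        value = self._loader(key)  # miss or expired: go to the source of truth
        # Spread expiry over [ttl*(1-jitter), ttl*(1+jitter)] so keys written
        # together do not all expire (and stampede the DB) at the same instant.
        ttl = self._ttl * (1.0 + self._jitter * (2.0 * self._rng() - 1.0))
        self._store[key] = (value, now + ttl)
        return value

    def invalidate(self, key: str) -> None:
        """Event-driven busting: call this when the source row changes."""
        self._store.pop(key, None)
```

The `clock` and `rng` parameters exist only so the behavior can be tested deterministically; the defaults use the real clock and real randomness.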

7.3 Back-pressure & load shedding

A scalable system must protect itself when demand exceeds capacity. Back-pressure propagates "slow down" upstream (bounded queues, flow control). Load shedding deliberately rejects low-priority work (with 429/503 + retry-after) to keep the core healthy. Rate limiting (token/leaky bucket) caps per-client demand. Failing fast and partial beats failing slow and total.
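For the rate-limiting piece, a token bucket is short enough to sketch in full. The class and parameter names are mine, and this version is per-process and not thread-safe; a shared limiter would keep the bucket state in something like Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: tokens refill at a steady rate up to a burst capacity."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits; otherwise reject it."""
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller that gets `False` would typically respond with 429 and a retry-after hint, which is the load-shedding behavior described above.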

08 Reliability & Resilience Engineering

At scale, failure is not an exception — it is the steady state. Disks die, networks partition, dependencies time out, deploys regress. Resilience is designing so that partial failure stays partial and the system degrades gracefully instead of collapsing.

flowchart LR
  CL([Caller]) --> CB{Circuit<br/>Breaker}
  CB -->|closed| TO[Timeout +<br/>bounded retry<br/>w/ backoff+jitter]
  TO --> BH[Bulkhead<br/>isolated pool]
  BH --> DEP[Dependency]
  CB -->|open| FB[Fallback /<br/>degraded response]
  DEP -.fails.-> FB

Figure 8.1 — The resilience call chain: a timeout bounds waiting, bounded retries with backoff+jitter handle transients, a bulkhead isolates the resource pool, and a circuit breaker stops hammering a failing dependency and serves a fallback.

8.1 The resilience pattern library

Pattern	Failure it addresses	Key rule
Timeout	Unbounded waits exhausting threads/connections	Every remote call has a timeout. Always.
Retry + backoff + jitter	Transient blips (network, brief overload)	Bound attempts; jitter to avoid synchronized retries
Circuit breaker	Repeated calls to a dead dependency	Open after threshold; half-open to probe recovery
Bulkhead	One slow dependency draining all resources	Isolate thread/connection pools per dependency
Fallback / degrade	Dependency unavailable	Serve cached/default/partial — never a hard error if avoidable
Idempotency	Duplicate delivery from retries	Idempotency keys; "at-least-once" + dedupe
Redundancy + failover	Node/zone/region loss	Independent failure domains; tested failover
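The circuit breaker row in the table above ("open after threshold; half-open to probe recovery") can be sketched as a small state machine. This is a single-threaded illustration with hypothetical names and defaults; real libraries also limit how many probes run concurrently in the half-open state and usually count failures over a sliding window.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """closed -> open after repeated failures; open -> half-open after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.state == "open":
            return fallback()  # fail fast: do not touch the dependency at all
        try:
            result = fn()  # closed, or a half-open probe
        except Exception:
            self._failures += 1
            # A failed probe, or too many failures while closed, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

Usage is `breaker.call(lambda: client.get(url), lambda: cached_value)`, where `client` and `cached_value` stand in for whatever dependency and degraded response apply.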

⚠ Retries are a double-edged sword

Naïve retries amplify load during an outage and can cause retry storms that turn a brief degradation into a full collapse. Always combine retries with: bounded attempts, exponential backoff, jitter, circuit breakers, and a retry budget (cap retries as a % of total traffic).
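Here is a minimal sketch of bounded retries with exponential backoff and full jitter, three of the safeguards listed above. The function name, defaults, and the set of retryable exceptions are assumptions; the circuit breaker and retry budget are out of scope for this sketch and would wrap it.

```python
import random
import time
from typing import Any, Callable, Tuple, Type


def retry(
    fn: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> Any:
    """Call fn, retrying only transient errors, with a hard cap on attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # bounded: give up and surface the error
            # Exponential backoff capped at max_delay ...
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # ... with full jitter, so clients do not retry in lockstep.
            sleep(rng(0.0, cap))
```

Errors outside `retry_on` (for example a validation error) propagate immediately, since retrying a request that can never succeed only adds load.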

8.2 RTO, RPO, and the spectrum of DR

RTO (Recovery Time Objective) = how long until service is restored. RPO (Recovery Point Objective) = how much data loss is acceptable. These business-set numbers dictate the (rising) cost of your DR strategy:

DR strategy	RTO	RPO	Relative cost
Backup & restore	Hours–days	Hours	$
Pilot light	Tens of min	Minutes	$$
Warm standby	Minutes	Seconds–min	$$$
Active-active (multi-region)	~Zero	~Zero	$$$$

8.3 Operating reliability: SRE practices

SLI / SLO / error budgets — measure the indicator, set the objective, and spend the budget: when you're within budget, ship fast; when you're burning it, freeze features and fix reliability. This aligns velocity and stability with one number.
Chaos engineering — deliberately inject failure (kill nodes, add latency, partition) in controlled experiments to validate resilience before reality does it for you.
Blameless postmortems — treat incidents as system failures, not human failures; the output is action items that change the system, not blame.
Game days — rehearse failover and incident response so muscle memory exists when it counts.

09 Distributed Systems Foundations

The moment your system spans more than one machine, a new physics applies. The "fallacies of distributed computing" — the network is reliable, latency is zero, bandwidth is infinite, topology is stable — are all false, and assuming otherwise produces the subtlest, most expensive bugs.

9.1 Consistency models — the spectrum, not the binary

Model	Guarantee	Cost	Typical use
Strong / linearizable	Reads see the latest write, globally ordered	Highest latency, lowest availability under partition	Balances, inventory, locks
Causal	Causally related ops seen in order	Moderate	Collaboration, comments, feeds
Read-your-writes	You see your own updates	Low (session pinning)	Profile edits, settings
Eventual	Replicas converge given no new writes	Lowest latency, highest availability	Caches, counts, DNS, search

◆ Principle: pick the weakest consistency the use case tolerates

Strong consistency is a tax on latency and availability. Most features (likes, view counts, recommendations, search) are perfectly happy with eventual consistency. Reserve strong consistency for the genuinely money-or-safety-critical paths and pay the tax only there.

9.2 Consensus, ordering, and time

Consensus (Raft, Paxos) — how a cluster agrees on a value/leader despite failures. You rarely implement it; you use systems built on it (etcd, ZooKeeper, consensus-backed DBs). Know that it requires a majority quorum and cannot make progress without one.
Logical clocks (Lamport, vector clocks) — order events without synchronized wall clocks. Never trust wall-clock ordering across machines — clock skew is real.
Quorums — with replication factor N, choosing read quorum R and write quorum W such that R + W > N guarantees that every read quorum overlaps every write quorum, so a read reaches at least one replica holding the latest acknowledged write; tuning R/W trades latency vs. consistency.
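The quorum rule in the last item is a pigeonhole argument, and it is small enough to verify by brute force. A sketch, with function names of my own choosing:

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """The rule of thumb: read and write quorums must intersect when R + W > N."""
    return r + w > n


def brute_force_overlap(n: int, r: int, w: int) -> bool:
    """Check every possible read set against every possible write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


if __name__ == "__main__":
    for r, w in [(1, 3), (2, 2), (1, 1)]:
        print(f"N=3 R={r} W={w} overlap={quorums_overlap(3, r, w)}")
```

Overlap alone covers the "read sees the latest acknowledged write" case; handling concurrent writers still needs versioning or a consensus layer on top.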

9.3 Delivery guarantees & idempotency

Messaging systems offer at-most-once (may lose), at-least-once (may duplicate), or exactly-once (expensive, often "effectively once" via dedupe). In practice, the robust pattern is: at-least-once delivery + idempotent consumers. Design every side-effecting operation to be safely repeatable — via idempotency keys, natural dedupe keys, or conditional writes. This single discipline prevents a huge class of distributed bugs.
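A minimal sketch of an idempotent consumer under at-least-once delivery. The message shape and the `idempotency_key` field name are assumptions. The in-memory set stands in for a durable dedupe store, which in a real system must be updated atomically with the side effect (for example in the same database transaction).

```python
from typing import Any, Callable, Dict, Set


class IdempotentConsumer:
    """Wraps a handler so that redelivered messages are applied only once."""

    def __init__(self, handler: Callable[[Dict[str, Any]], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # stand-in for a durable dedupe table

    def handle(self, message: Dict[str, Any]) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        key = message["idempotency_key"]
        if key in self._seen:
            return False  # duplicate delivery: acknowledge and skip
        self._handler(message)
        # Record the key only after the handler succeeds, so a failed attempt
        # can be redelivered and retried.
        self._seen.add(key)
        return True
```

With this wrapper, a broker redelivering the same payment message twice credits the account once, which is the "effectively once" behavior described above.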

10 Security Architecture

Security is a quality attribute that must be designed in, not bolted on. The modern posture is zero trust: never trust the network, always verify identity, and assume breach. Defense in depth means no single control failing should be catastrophic.

flowchart TB
  U([User / Service]) --> ID[Identity Provider<br/>AuthN: who are you?]
  ID --> PEP[Policy Enforcement<br/>AuthZ: what may you do?]
  PEP --> EDGE[Edge: WAF · TLS · rate limit · DDoS]
  EDGE --> APP[App: input validation ·<br/>output encoding · least privilege]
  APP --> SEC[Secrets manager<br/>rotation · no creds in code]
  APP --> DATA[(Data: encrypted at rest<br/>+ in transit · tokenized PII)]
  AUD[[Audit log · SIEM · anomaly detection]] -.observes.- PEP
  AUD -.observes.- APP
  AUD -.observes.- DATA

Figure 10.1 — Layered zero-trust controls: strong identity, fine-grained authorization, hardened edge, defensive app tier, managed secrets, encrypted data, and pervasive audit.

10.1 The pillars

Identity & Access

Strong authentication (OIDC/OAuth2, MFA, short-lived tokens, mTLS between services) and least-privilege authorization (RBAC/ABAC, scoped tokens). Identity is the new perimeter.

Data protection

Encrypt in transit (TLS everywhere) and at rest. Classify data; tokenize/encrypt PII; minimize what you collect and retain. Manage keys in a KMS with rotation.

Secrets management

No credentials in code, config, or images. Use a secrets manager / vault with rotation and dynamic, short-lived credentials. Prefer workload identity over long-lived keys.

Supply chain

Pin and scan dependencies, generate an SBOM, sign artifacts, verify provenance, and lock down CI/CD permissions. The build pipeline is now a top attack target.

10.2 Threat modeling — STRIDE

Systematically ask, per component and data flow, how it could be attacked:

Threat	Violates	Countermeasure
Spoofing	Authentication	Strong auth, MFA, mTLS
Tampering	Integrity	Signing, hashing, validation
Repudiation	Non-repudiation	Audit logs, signed actions
Information disclosure	Confidentiality	Encryption, access control
Denial of service	Availability	Rate limiting, autoscale, WAF
Elevation of privilege	Authorization	Least privilege, isolation

▸ Shift left, but also shift everywhere

Bake security into design (threat models), code (SAST, secure defaults), build (dependency & container scanning, SBOM), and runtime (WAF, anomaly detection, least-privilege workloads). Privacy-by-design and compliance (GDPR, SOC 2, etc.) are architectural constraints, not afterthoughts.

11 Observability

You cannot operate what you cannot see. Monitoring tells you whether the system is healthy against known questions; observability lets you ask new questions of production without shipping new code — essential for debugging novel failures in distributed systems.

flowchart LR
  APP[Instrumented Services<br/>OpenTelemetry SDK] --> COL[OTel Collector]
  COL --> M[(Metrics<br/>rates · errors · latency)]
  COL --> L[(Logs<br/>structured · correlated)]
  COL --> T[(Traces<br/>end-to-end spans)]
  M --> DASH[Dashboards + Alerts]
  L --> DASH
  T --> DASH
  DASH --> SLO{SLO / error budget}
  SLO --> ONCALL([On-call action])

Figure 11.1 — The three pillars unified through OpenTelemetry: correlated metrics, logs, and traces feed dashboards and SLO-based alerting that drive on-call action.

11.1 The three pillars (and the fourth)

Metrics — cheap, aggregatable numeric time-series. Track the RED method (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources.
Logs — structured (JSON), with a correlation/trace ID on every line. Unstructured logs don't scale to investigation.
Traces — follow a single request across every service hop; the only practical way to find latency and failure in a distributed call graph.
(Profiles / events) — continuous profiling and high-cardinality wide events round out modern observability.
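The logs bullet above (structured JSON, a trace ID on every line) can be sketched with the Python standard library alone. This is a minimal illustration, not an OpenTelemetry integration: `trace_id_var`, `JsonFormatter`, and `build_logger` are hypothetical names, and in a real service the trace ID would come from the incoming request context rather than being set by hand.

```python
import contextvars
import json
import logging

# Holds the trace/correlation ID for the current request (hypothetical name).
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the current trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        return json.dumps(payload)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    trace_id_var.set("4bf92f3577b34da6")  # normally taken from the request
    log.info("payment authorized")
```

Because the ID lives in a context variable, every log line emitted while handling the request picks it up without threading it through function arguments.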

◆ Principle: instrument with OpenTelemetry, alert on symptoms

Standardize instrumentation on a vendor-neutral standard (OpenTelemetry) so you're not locked in. Alert on user-facing symptoms (SLO burn, error rate, latency) — not on causes (high CPU). Cause-based alerts page you for non-problems and miss novel ones; symptom-based alerts catch what actually matters.

11.2 Good alerting hygiene

Every alert must be actionable and urgent; otherwise it's a dashboard, not a page. Alert fatigue is a reliability risk.
Use multi-window, multi-burn-rate SLO alerts to catch both fast and slow budget burns without noise.
Every page links to a runbook. The gap between seeing the metric and knowing the action is where recovery time (MTTR) balloons.
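The multi-window, multi-burn-rate idea can be sketched as a small function. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The two rules below follow the SRE Workbook example for a 30-day SLO window; the window labels, the `BurnRule` type, and the shape of the `error_ratios` input are my assumptions for illustration.

```python
from dataclasses import dataclass


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float


# Thresholds assume a 30-day SLO window.
PAGE_RULES = (
    BurnRule("1h", "5m", 14.4),  # about 2% of the monthly budget in 1 hour
    BurnRule("6h", "30m", 6.0),  # about 5% of the monthly budget in 6 hours
)


def should_page(error_ratios: dict[str, float], slo_target: float) -> bool:
    """Page only when both the long and the short window exceed the threshold."""
    for rule in PAGE_RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return True
    return False
```

The short window is what keeps noise down: once the error ratio recovers, the short-window burn rate drops and the alert stops firing even though the long window is still elevated.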

12 Delivery, Deployment & Platform

Architecture includes how software reaches production. The DORA research is clear: elite performers deploy frequently, with low lead time, low change-failure rate, and fast recovery — and these come from architecture that supports safe, small, independent deploys.

flowchart LR
  DEV([Commit]) --> CI[CI: build · unit ·<br/>SAST · dep scan]
  CI --> ART[Signed Artifact<br/>+ SBOM]
  ART --> STG[Deploy to Staging<br/>integration + e2e]
  STG --> GATE{Quality &<br/>policy gate}
  GATE -->|pass| CANARY[Canary / Progressive<br/>1% → 10% → 100%]
  CANARY --> METRICS{SLO healthy?}
  METRICS -->|yes| FULL[Full rollout]
  METRICS -->|no| RB[Auto rollback]

Figure 12.1 — A modern progressive-delivery pipeline: build → sign → test → policy gate → canary with automatic rollback on SLO regression.

12.1 Deployment strategies

Strategy	How	Best for
Rolling	Replace instances batch by batch	Default; simple, no extra capacity
Blue-Green	Stand up full new env, switch traffic atomically	Instant rollback; needs 2× capacity
Canary	Route a small % to new version, watch metrics	Risk reduction at scale; needs good observability
Feature flags	Decouple deploy from release; toggle per cohort	Trunk-based dev, experiments, kill switches
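The canary row can be sketched as a control loop: shift traffic in steps and roll back on the first failed health check. This is a simplified sketch, not a real deployment controller. `set_traffic_percent` and `slo_healthy` are hypothetical callbacks standing in for whatever the load balancer and the SLO query actually expose, and the soak period between steps is left out.

```python
from typing import Callable, Sequence


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],
    slo_healthy: Callable[[], bool],
    steps: Sequence[int] = (1, 10, 100),
) -> bool:
    """Shift traffic step by step; roll back to 0% on the first unhealthy check.

    Returns True if the new version reached the final step, False if rolled back.
    """
    for percent in steps:
        set_traffic_percent(percent)
        # A real controller would wait for a soak period before checking.
        if not slo_healthy():
            set_traffic_percent(0)
            return False
    return True
```

The health check is deliberately the same symptom-level signal used for alerting, which is why the table lists good observability as a prerequisite for canaries.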

12.2 Platform foundations

Containers + orchestration (Kubernetes) — package once, run anywhere; declarative scaling, self-healing, rollouts. Powerful but operationally heavy — adopt deliberately.
Infrastructure as Code (Terraform, etc.) — all infra versioned, reviewed, reproducible. No click-ops in production.
GitOps — Git as the single source of truth for desired state; a controller continuously reconciles reality to it. Auditable, revertible deployments.
Platform engineering / Internal Developer Platform — pave golden paths so product teams ship safely without re-deriving infra. Reduces cognitive load; this is where Conway's Law meets DevEx.
Service mesh — when service count grows, externalize mTLS, retries, traffic policy, and telemetry to the platform.

◆ Principle: small, frequent, reversible deploys beat big, rare, risky ones

Batch size is the hidden variable in deployment risk. A 1-line change is easy to review, easy to roll back, and easy to attribute when something breaks. Decouple deploy (shipping code, dark) from release (turning it on, via flags) so you can do both safely and independently.
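One way to picture the deploy/release split is a percentage flag evaluated at request time: the code is already deployed, and the flag decides who sees it. A minimal sketch, assuming a stable hash of flag name plus user ID is an acceptable way to assign cohorts; `bucket` and `is_enabled` are illustrative names, not a specific flag product's API.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(
    flag: str, user_id: str, rollout_percent: int, kill_switch: bool = False
) -> bool:
    """Release to a fixed share of users; the kill switch overrides everything."""
    if kill_switch:
        return False
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket is derived from a hash rather than stored, a user who is in the 10% cohort stays in it when the rollout widens to 50%, and flipping the kill switch needs no deploy.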

13 Capacity Planning & Back-of-the-Envelope Estimation

I try to turn hand-waving into numbers in minutes. Estimation isn't about precision — it's about getting the order of magnitude right so you choose the right architecture and don't over- or under-build by 100×.

13.1 Latency numbers every engineer should know (approx.)

Operation	≈ Latency	Mental model
L1 cache reference	~1 ns	Free
Main memory reference	~100 ns	100× slower than L1
Read 1 MB sequentially from RAM	~5–10 µs	Fast
SSD random read	~100 µs	1000× slower than RAM ref
Read 1 MB from SSD	~1 ms	—
Round trip within a datacenter	~0.5 ms	Cheap, but adds up in chains
Disk (HDD) seek	~10 ms	Avoid in hot paths
Round trip across continents	~150 ms	Physics — put data near users

Takeaway, fastest to slowest: memory ≫ SSD ≫ disk ≫ cross-region network. Every order of magnitude changes the design. Keep hot data in memory; keep users close to their data.
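To make "adds up in chains" concrete, here is a tiny sketch using the approximate figures from the table. The constants are the table's rough values, not measurements, and the function assumes the calls run strictly one after another.

```python
# Approximate latencies from the table, in milliseconds.
DC_ROUND_TRIP_MS = 0.5
CROSS_CONTINENT_ROUND_TRIP_MS = 150.0


def sequential_latency_ms(hops: int, per_hop_ms: float) -> float:
    """Total latency of calls made one after another, with no parallelism."""
    return hops * per_hop_ms


if __name__ == "__main__":
    # Twenty chained in-datacenter calls already cost about 10 ms.
    print(sequential_latency_ms(20, DC_ROUND_TRIP_MS))
    # One cross-continent round trip equals about 300 in-datacenter ones.
    print(CROSS_CONTINENT_ROUND_TRIP_MS / DC_ROUND_TRIP_MS)
```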

13.2 A worked estimation (template)

Assume: 100M daily active users, each does 10 reads + 1 write/day.

Reads/day  = 100M × 10 = 1B  → /86,400s ≈ 11,600 RPS avg
Writes/day = 100M × 1  = 100M → ≈ 1,160 WPS avg
Peak factor ≈ 3×  →  ~35k RPS read, ~3.5k WPS write at peak

Storage/write: ~1 KB per record
  100M writes/day × 1 KB = 100 GB/day  → ~36 TB/year (before replication)
  With 3× replication → ~110 TB/year

Cache: keep hot 20% of reads in memory
  working set ≈ (some GB) → size Redis cluster accordingly
Bandwidth (read): 35k RPS × ~1 KB ≈ 35 MB/s egress (plan CDN offload)

This 5-minute exercise immediately tells you: you need a distributed cache, read replicas or sharding, a CDN, and roughly how many app nodes — before writing any code.
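The arithmetic in the template can be reproduced with a short script, which makes it easy to change one assumption and see what moves. A sketch under the same inputs as above; it uses decimal units (1 KB = 1,000 bytes), and the function and key names are my own.

```python
SECONDS_PER_DAY = 86_400


def estimate(
    dau: int,
    reads_per_user: int,
    writes_per_user: int,
    peak_factor: float,
    record_bytes: int,
    replication: int,
) -> dict[str, float]:
    """Back-of-the-envelope traffic, storage, and bandwidth figures."""
    avg_read_rps = dau * reads_per_user / SECONDS_PER_DAY
    avg_write_rps = dau * writes_per_user / SECONDS_PER_DAY
    storage_gb_per_day = dau * writes_per_user * record_bytes / 1e9
    storage_tb_per_year = storage_gb_per_day * 365 / 1000
    peak_read_rps = avg_read_rps * peak_factor
    return {
        "avg_read_rps": avg_read_rps,
        "avg_write_rps": avg_write_rps,
        "peak_read_rps": peak_read_rps,
        "peak_write_rps": avg_write_rps * peak_factor,
        "storage_gb_per_day": storage_gb_per_day,
        "storage_tb_per_year": storage_tb_per_year,
        "replicated_tb_per_year": storage_tb_per_year * replication,
        "peak_read_egress_mb_s": peak_read_rps * record_bytes / 1e6,
    }


if __name__ == "__main__":
    result = estimate(
        dau=100_000_000,
        reads_per_user=10,
        writes_per_user=1,
        peak_factor=3,
        record_bytes=1_000,
        replication=3,
    )
    for name, value in result.items():
        print(f"{name}: {value:,.1f}")
```

The unrounded values land slightly under the rounded ones in the template (about 11,574 average read RPS and 109.5 TB/year replicated), which is well within the precision this exercise needs.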

13.3 Capacity headroom & cost

Plan for peak × growth × safety margin, not average. Target ~50–70% utilization at peak so spikes don't tip you over.
Autoscale on the right signal (queue depth, RPS, p99) — CPU is often a lagging proxy.
FinOps: track unit economics ($/request, $/tenant). The cheapest-to-build design and the cheapest-to-run design are rarely the same — make the trade explicit.
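A rough sizing sketch for the headroom and unit-economics bullets. Here the utilization target plays the role of the safety margin. The per-node throughput (2,000 RPS) and the monthly cost in the usage example are hypothetical placeholders to be replaced with load-test and billing data.

```python
import math


def nodes_needed(
    peak_rps: float,
    growth_factor: float,
    per_node_rps: float,
    target_utilization: float = 0.6,
) -> int:
    """Nodes required so that planned peak load sits at the target utilization."""
    planned_rps = peak_rps * growth_factor
    return math.ceil(planned_rps / (per_node_rps * target_utilization))


def cost_per_million_requests(monthly_cost_usd: float, requests_per_month: float) -> float:
    """Unit economics: dollars per one million requests."""
    return monthly_cost_usd / requests_per_month * 1_000_000


if __name__ == "__main__":
    # 35k peak RPS, 1.5x growth, hypothetical 2,000 RPS per node, 60% target.
    print(nodes_needed(35_000, 1.5, 2_000, 0.6))
    # Hypothetical $30,000/month bill over 3 billion requests.
    print(cost_per_million_requests(30_000, 3_000_000_000))
```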

14 Architecting for AI / ML & Modern Workloads

AI-intensive systems are now mainstream, and they bend several classic assumptions: non-deterministic outputs, GPU economics, large-payload latency, prompt/data-driven behavior, and new failure and safety modes. A current playbook must address them.

flowchart LR
  U([User query]) --> GW[Gateway<br/>auth · rate-limit]
  GW --> ORCH[Orchestrator]
  ORCH --> RET[Retriever]
  RET --> VDB[(Vector DB<br/>embeddings)]
  RET --> KB[(Source docs)]
  ORCH --> LLM[LLM / Model Service<br/>GPU-backed, autoscaled]
  LLM --> GUARD[Guardrails<br/>safety · validation]
  GUARD --> CACHE[(Semantic cache)]
  GUARD --> U
  OBS[[Eval · tracing · cost · drift monitoring]] -.observes.- ORCH
  OBS -.observes.- LLM

Figure 14.1 — A retrieval-augmented (RAG) inference architecture: orchestrate retrieval over a vector store, call a GPU-backed model, enforce guardrails, cache semantically, and observe quality, cost, and drift end-to-end.

14.1 What changes when AI is in the loop

Non-determinism & evals

Outputs vary; "correct" is fuzzy. Replace exact-match tests with evaluation suites (offline + online), human/auto graders, and regression gates. Treat prompts and model versions as deployable, versioned artifacts.

Retrieval & data freshness (RAG)

Ground models in your data via embeddings + a vector store to reduce hallucination and add recency. The architecture problem becomes retrieval quality, chunking, and freshness as much as the model.

Cost & latency economics

Inference is expensive and often slow. Use semantic caching, model routing (small model first, escalate), batching, streaming responses, and right-sized models. Token cost is now a unit-economics line item.
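Semantic caching can be sketched as "return a stored answer when a new query's embedding is close enough to an old one." This is a toy version: the `embed` function is supplied by the caller (a hypothetical stand-in for a real embedding model), the lookup is a linear scan rather than a vector index, and the 0.92 similarity threshold is an arbitrary assumption that would need tuning against evals.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Reuse an answer when a new query is semantically close to a cached one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self._embed = embed
        self._threshold = threshold
        self._entries: list[tuple[Vector, str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        q = self._embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None
```

A threshold set too low serves wrong answers confidently, so the cache hit path deserves the same evaluation coverage as the model path.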

Safety, guardrails & abuse

Validate inputs/outputs, constrain tools the model can call, sandbox actions, and add content/safety filters. Prompt injection and data exfiltration are real threat-model entries — treat model inputs as untrusted.

14.2 Inference vs. training paths

Serving/inference is the latency-critical, always-on path: GPU autoscaling (with cold-start mitigation), request batching, streaming, and aggressive caching dominate.
Training/fine-tuning is a batch, data-pipeline problem: feature stores, reproducible datasets, experiment tracking, and a model registry with lineage.
MLOps/LLMOps closes the loop: deploy models like code (CI/CD), monitor for drift and quality regression, and enable safe rollback of model/prompt versions.

◆ Principle: wrap the model behind a stable contract

Models, prompts, and providers change monthly. Put a provider-agnostic abstraction (your own gateway/orchestration layer) between your product and any specific model so you can swap models, route by cost/quality, A/B test, and enforce guardrails centrally — without rewriting the product.
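A minimal sketch of what such a contract might look like, assuming providers are tried cheapest first and an answer is escalated when it fails a central check. `ModelProvider`, `ModelGateway`, and `is_acceptable` are illustrative names; no vendor SDK is referenced, and real adapters would wrap one behind the `complete` method.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol, Sequence


class ModelProvider(Protocol):
    """The only surface product code is allowed to depend on."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class ModelGateway:
    """Routes a prompt through providers in order and enforces one central check."""

    providers: Sequence[ModelProvider]    # ordered cheapest first
    is_acceptable: Callable[[str], bool]  # guardrail / quality check

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (provider_name, answer) from the first acceptable provider."""
        last_error: Optional[Exception] = None
        for provider in self.providers:
            try:
                answer = provider.complete(prompt)
            except Exception as exc:  # provider outage: escalate to the next one
                last_error = exc
                continue
            if self.is_acceptable(answer):
                return provider.name, answer
        raise RuntimeError("no provider produced an acceptable answer") from last_error
```

Swapping a model then means adding or reordering an adapter in `providers`; the product code that calls `ModelGateway.complete` does not change.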

15 Making & Recording Decisions

The artifact I care most about is the written decision. An architecture no one can reconstruct the reasoning behind becomes legacy the day its author leaves. ADRs and RFCs make decisions durable, reviewable, and reversible-with-eyes-open.

15.1 The Architecture Decision Record (ADR)

A short, immutable, append-only document per significant decision. The key sections:

# ADR-0042: Use event-driven sagas for order fulfillment

## Status
Accepted  (supersedes ADR-0017)

## Context
Orders span inventory, payment, and shipping services. We need
"transaction-like" consistency without a distributed 2PC, and the
fulfillment flow must survive partial failures and be auditable.

## Decision
Use a choreographed saga over Kafka with compensating actions.
Each step is idempotent; the outbox pattern guarantees each event
is published at least once.

## Consequences
+ Services stay decoupled and independently deployable.
+ Full audit trail via the event log.
- Eventual consistency: UI must reflect "pending" states.
- Requires saga monitoring + dead-letter handling.

## Alternatives considered
- 2PC across services: rejected (availability + coupling cost).
- Synchronous orchestration service: rejected (single point of
  coupling, harder to scale steps independently).

▸ The structure forces good thinking

"Alternatives considered" is the most valuable section — it proves you weighed options and lets a future reader judge whether the context has changed enough to revisit the decision. Context + alternatives, not just the verdict.

15.2 A decision rubric

Is this a one-way or two-way door? Calibrate rigor to reversibility.
What are we optimizing, and what are we sacrificing? Name the losing -ilities.
What's the blast radius if we're wrong? Can we contain/roll it back?
What would change our mind? Write the falsifiable condition for revisiting.
Who must be in the room? The people who own the consequences.

◆ Principle: disagree-and-commit, in writing

Architecture is a team sport with real disagreement. Surface dissent in the RFC, decide explicitly, record the dissent, and commit. A documented decision that some disagreed with beats an undocumented consensus that no one can reconstruct.

16 Anti-Patterns & Failure Modes to Recognize

Knowing what not to do is half of architecture. These recur across companies and decades; recognizing them early is a superpower.

Anti-pattern	What it looks like	Why it hurts	Antidote
Distributed monolith	Services that must deploy together, share a DB, or call each other in long sync chains	Distributed pain, none of the independence	Real boundaries; async; own your data
Big Ball of Mud	No discernible structure; everything depends on everything	Change is unpredictable; fear-driven dev	Bounded contexts; enforce module boundaries
Premature microservices	10 services, 3 engineers, undiscovered domain	Ops overhead crushes velocity	Start modular monolith; extract on evidence
God service / object	One component everything routes through	Bottleneck + SPOF + change magnet	Decompose by responsibility
Chatty interfaces	Dozens of fine-grained calls per use case	Latency multiplies; failure surface grows	Coarse-grained APIs; batch; BFF
Shared mutable database	Many services writing the same tables	Hidden coupling; no invariant owner	One owner; integrate via API/events
Retry storms	Naïve retries with no backoff/budget	Amplifies outages into collapse	Backoff+jitter, circuit breaker, budget
Cache as source of truth	Data only lives in the cache	Eviction/restart = data loss	Cache is derived & rebuildable
Resume-driven design	Adopting tech for novelty, not need	Complexity with no business return	Boring tech; justify every new dependency
Golden hammer	One tool/pattern forced onto every problem	Poor fit; accidental complexity	Match tool to access pattern

⚠ The meta anti-pattern: complexity without a customer

Every box, queue, service, and database you add is a thing to operate, secure, monitor, and pay for, forever. The default answer to "should we add this component?" is no, until the requirement proves otherwise. Boring, well-understood technology is a feature, not a compromise.

17 A Reference Architecture (Putting It Together)

A concrete, modern reference for a high-scale, multi-region SaaS — annotated with which decisions matter and why. Treat it as a starting skeleton, not a mandate.

flowchart TB
  USERS([Global Users]) --> DNS[GeoDNS / Anycast]
  DNS --> CDN[CDN + Edge / WAF + DDoS]
  CDN --> LB[Regional Load Balancer]
  LB --> GW[API Gateway<br/>authN/Z · rate-limit]
  GW --> MESH{{Service Mesh}}
  MESH --> SVC1[Service: Identity]
  MESH --> SVC2[Service: Core Domain]
  MESH --> SVC3[Service: Billing]
  MESH --> BFF[BFF / GraphQL]
  SVC2 --> CACHE[(Distributed Cache)]
  SVC2 --> PDB[(Primary OLTP<br/>sharded + replicas)]
  SVC2 --> OUT[Outbox] --> BUS{{Event Bus / Kafka}}
  BUS --> WORK[Async Workers]
  BUS --> CDC[CDC → Lakehouse]
  CDC --> OLAP[(Analytics / OLAP)]
  SVC2 --> VDB[(Vector store - AI features)]
  OBS[[Observability: OTel · metrics · logs · traces · SLO]] -.-> MESH
  SEC[[Security: secrets · KMS · audit/SIEM]] -.-> MESH

Figure 17.1 — Reference: GeoDNS + CDN/WAF edge, regional gateway and mesh, bounded-context services owning their data, outbox→event-bus for decoupling, CDC into a lakehouse for analytics, with cross-cutting observability and security planes.

17.1 Why each piece is there

Component	Quality attribute served	If you skip it…
GeoDNS + CDN + WAF	Latency, availability, security	Slow far-away users; exposed origin
API Gateway	Security, governance	Auth/rate-limit logic scattered everywhere
Service mesh	Resilience, observability, security	Retries/mTLS/telemetry reimplemented per service
Per-service data ownership	Evolvability, autonomy	Shared-DB coupling; can't deploy independently
Outbox + event bus	Reliability, decoupling	Dual-write inconsistency; tight coupling
CDC → lakehouse	Analytics without OLTP load	Analysts hammer the prod DB
Observability plane	Operability	Blind during incidents; high MTTR
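The "Outbox + event bus" row is easiest to see in code: the state change and its event are written in one local transaction, and a separate relay publishes from the outbox table. A minimal sketch using in-process SQLite as a stand-in for the service's OLTP store; the table layout is my assumption and `publish` is a hypothetical callback where a real system would call its message-bus producer.

```python
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: int) -> None:
    """Write the state change and its event in one local transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish pending events in insertion order, marking each one afterwards.

    A crash between publish and the UPDATE causes a re-send, so delivery is
    at-least-once and consumers must be idempotent.
    """
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

There is no moment where the order exists without its event or the event without its order, which is exactly the dual-write inconsistency the table warns about.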

▸ Start smaller than this

This reference is the destination, not the starting point. A team of five should begin with a modular monolith + managed Postgres + a CDN, and grow toward this shape only as load and team size justify each addition. Architecture is earned incrementally.

18 The Playbook — Operational Checklists

The distilled, printable core. Run these in design reviews, readiness reviews, and incident retros.

Design Review Checklist

Problem, goals, and explicit non-goals are written down
Top 3 quality attributes ranked; sacrifices named
Scale estimated (RPS, data, growth, latency budget)
API & data contracts defined and versioned
Failure modes enumerated; blast radius bounded
Consistency model chosen per data flow
Each significant decision has an ADR with alternatives
Riskiest assumption prototyped or load-tested

Production Readiness Checklist

SLOs defined with error budgets and alerts on symptoms
Metrics, logs, traces wired (OTel); dashboards exist
Every remote call has timeout + bounded retry + breaker
Idempotency on all side-effecting operations
Autoscaling + load shedding + rate limits configured
Runbooks linked from every alert
RTO/RPO set; backup & failover tested (not assumed)
Rollback path verified; deploy is progressive
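For the "timeout + bounded retry + breaker" item, this is roughly what the breaker part looks like. A single-threaded sketch with assumed defaults (5 consecutive failures, 30-second reset); `CircuitBreaker` and `CircuitOpenError` are illustrative names, and a production version would need locking and per-dependency instances.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, op: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset period elapsed: half-open, let one trial call through.
        try:
            result = op()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```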

Security Checklist

Threat model (STRIDE) done for new data flows
AuthN/Z enforced; least privilege everywhere
Secrets in a vault, rotated; none in code/images
Data encrypted in transit & at rest; PII classified
Dependencies scanned; SBOM generated; artifacts signed
Audit logging + anomaly detection in place
Inputs validated; model/LLM inputs treated as untrusted

Scalability & Cost Checklist

App tier stateless; state in shared stores
Caching strategy chosen; stampede/hot-key handled
Read/write split or sharding plan; shard key validated
Slow work offloaded to async queues/workers
Back-pressure & graceful degradation defined
Unit economics tracked ($/request, $/tenant)
Headroom: runs at 50–70% at peak × growth

18.1 One-page heuristics I keep nearby

Make it work, make it right, make it fast — in that order. Premature optimization and premature distribution are twin sins.
Boring technology by default. Spend your innovation tokens where they create real differentiation.
Optimize the bottleneck, measure first. Intuition about performance is usually wrong; profile.
Design for failure, test for failure. Untested failover is a hope, not a plan.
Couple loosely, align tightly. Loose technical coupling, tight clarity on contracts and ownership.
Write the decision down. If it isn't an ADR, it didn't happen.
Reversibility first. Move fast on two-way doors; deliberate on one-way doors.
The simplest thing that could possibly work — and no simpler.

19 The Canon & Further Reading

Texts and bodies of knowledge I keep returning to. I read them for the reasoning, not the recipes. Annotated citations with DOIs and URLs: §20 References & Sources.

Foundational books

Designing Data-Intensive Applications — Kleppmann (the data & distributed-systems bible)
Fundamentals of Software Architecture & Software Architecture: The Hard Parts — Richards & Ford
Building Microservices — Newman
Domain-Driven Design — Evans; Implementing DDD — Vernon
Release It! — Nygard (stability & resilience patterns)
Site Reliability Engineering & The SRE Workbook — Google
Accelerate — Forsgren, Humble, Kim (the DORA research)
Team Topologies — Skelton & Pais (Conway's Law, in practice)

Papers, models & bodies of knowledge

Brewer — the CAP theorem; Abadi — PACELC
Lamport — time, clocks & the ordering of events
Raft & Paxos consensus papers
Amazon Dynamo paper (eventual consistency at scale)
The 12-Factor App methodology
C4 model for software architecture (Brown)
The Reactive Manifesto
OWASP Top 10 & NIST Zero Trust (SP 800-207)
Well-Architected Frameworks (cloud-vendor) — for pillar checklists

◆ What I expect to stay true

The technologies in these notes will change; the trade-offs will not. Distributed state, the speed of light, human and organizational limits, and the tension between consistency, availability, latency, cost, and complexity are permanent. When I understand the forces, every new tool is just another point in a design space I already recognize.

20 References & Sources

Annotated bibliography behind the mindset heuristics, quality attributes, design process, architecture styles, integration patterns, data and scale tactics, resilience and distributed-systems foundations, security, observability, delivery, estimation, AI/ML workloads, ADRs, anti-patterns, reference topology, checklists, and canon sections. Section tags (e.g. §04) show where each source is used. Mermaid diagrams, trade-off matrices, and synthesis checklists are my own unless noted.

Scope. Synthesis of textbooks, peer-reviewed papers, industry frameworks, and vendor-neutral standards (May 2026). Cloud-specific numbers, SKUs, and product names change frequently — verify against current vendor docs before production decisions. Security and compliance references are architectural starting points, not audit checklists.

Citations are numbered continuously [1]–[n] within this section.

Architecture mindset, trade-offs & organizational forces (§00, §01, §18)

Brooks, F. P., The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 1975 (rev. ed. 1995). Essential vs. accidental complexity; why late projects stay late — background for §00 irreversible-decision framing. — §00, §01.
Conway, M. E., "How Do Committees Invent?" Datamation, 14(4), 28–31, 1968. Conway's Law — §01 law callout and team–architecture coupling. melconway.com — §01.
Skelton, M., & Pais, M., Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press, 2019. Inverse Conway Maneuver and stream-aligned teams — §01 Conway corollary and §04 org-fit guidance. — §01, §04, §19.
Bezos, J., "2015 Letter to Shareholders." Amazon.com, 2016. "Type 1 vs. Type 2 decisions": the one-way vs. two-way door framing, since popularized in tech leadership writing — §01 reversibility heuristic. See also All Things Distributed essays by Werner Vogels — §01, §18.
Beck, K., et al., Extreme Programming Explained: Embrace Change (2nd ed.). Addison-Wesley, 2004. YAGNI and incremental design — §01 YAGNI vs. evolvability card. — §01.
Richards, M., & Ford, N., Fundamentals of Software Architecture: An Engineering Approach. O'Reilly, 2020. Architecture characteristics, trade-off analysis, and role of the architect — §00 progression and §02 attribute catalog. — §00, §02, §19.

Distributed-systems laws: CAP, PACELC, Amdahl & scalability limits (§01, §09)

Brewer, E. A., "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 45(2), 23–29, 2012. CAP theorem refinement — §01 CAP callout and §09 consistency trade-offs. DOI: 10.1109/MC.2012.37 — §01, §09.
Abadi, D., "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Computer, 45(2), 37–42, 2012. PACELC extension — §01 PACELC callout. DOI: 10.1109/MC.2012.33 — §01, §09.
Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS Spring Joint Computer Conference, 1967. Serial fraction limits parallel speedup — §01 Amdahl callout. — §01, §07.
Gunther, N. J., The Practical Performance Analyst / Universal Scalability Law. Guerrilla-capacity.com, 2008+. Coherency/contention penalties beyond Amdahl — §01 USL callout. perfdynamics.com — §01, §07.
Deutsch, P., et al., "Eight Fallacies of Distributed Computing." Sun Microsystems internal list, popularized 1990s. Network unreliability and latency assumptions — §09 distributed foundations lead. — §09.

Quality attributes, scenarios & fitness functions (§02)

ISO/IEC 25010:2023, Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE). ISO standard for quality characteristics (-ilities taxonomy) — §02 attribute catalog. — §02.
Bass, L., Clements, P., & Kazman, R., Software Architecture in Practice (4th ed.). Addison-Wesley, 2021. Quality attribute scenarios (stimulus–response–measure) — §02 measurability paragraph. — §02.
Ford, N., Parsons, R., & Kua, P., Building Evolutionary Architectures: Automated Governance for Software Teams (2nd ed.). O'Reilly, 2022. Fitness functions and continuous architectural verification — §02 Figure 2.1 feedback loop. — §02, §03.
Google SRE Team, Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. SLI/SLO/error-budget framing — §02 availability math and §08.3 SRE practices. sre.google — §02, §08, §11.
Google SRE Team, The Site Reliability Workbook. O'Reilly, 2018. Multi-window multi-burn-rate alerting — §11.2 alerting hygiene. sre.google/workbook — §08, §11.

Design process, diagrams & decision records (§03, §05, §15)

Brown, S., The C4 Model for Visualising Software Architecture. c4model.com, 2018+. Context/container/component/code hierarchy — §03 design-process communication. c4model.com — §03, §05, §19.
Nygard, M., "Documenting Architecture Decisions." Cognitect blog, 2011. Original ADR format — §15 ADR template and §15 lead. cognitect.com — §15.
Thomson, J., "ADR: Architecture Decision Records." GitHub / community templates, 2010s+. Widely adopted ADR practice — §15.1 sections (Status, Context, Decision, Consequences). adr.github.io — §15.
Richards, M., & Ford, N., Software Architecture: The Hard Parts. O'Reilly, 2021. Trade-off rubrics for distributed architecture decisions — §15.2 decision rubric and §09 consistency choices. — §09, §15, §19.

Architecture styles, DDD & evolutionary paths (§04, §16, §17)

Newman, S., Building Microservices: Designing Fine-Grained Systems (2nd ed.). O'Reilly, 2021. Microservices trade-offs, distributed monolith warning — §04 styles, §04.2 matrix, §16 anti-patterns. — §04, §16, §19.
Evans, E., Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. Bounded contexts and ubiquitous language — §04 modular monolith seams and §06 data ownership. — §04, §06, §19.
Vernon, V., Implementing Domain-Driven Design. Addison-Wesley, 2013. Tactical patterns and service boundaries — §04 hexagonal/DDD pairing. — §04, §19.
Cockburn, A., "Hexagonal Architecture." Alistair Cockburn, 2005. Ports and adapters — §04 hexagonal card. alistair.cockburn.us — §04.
Martin, R. C., Clean Architecture: A Craftsman's Guide to Software Structure and Design. Prentice Hall, 2017. Dependency rule and domain-centric layering — §04 clean/hexagonal card. — §04.
Reactive Manifesto. Lightbend et al., 2013. Responsive, resilient, elastic, message-driven systems — §04 event-driven style context. reactivemanifesto.org — §04, §19.
Vogels, W., "Building Scalable, Highly Concurrent, and Fault-Tolerant Applications." AWS re:Invent / All Things Distributed, 2018+. Cell-based architecture for blast-radius containment — §04 cell-based card and §17 reference topology. allthingsdistributed.com — §04, §17.
Heroku, The Twelve-Factor App. 2011+. Stateless processes, config, logs as streams — §12 platform foundations and §07 stateless scaling. 12factor.net — §07, §12, §19.

Integration, APIs & messaging patterns (§05)

Fielding, R. T., Architectural Styles and the Design of Network-based Software Architectures (REST dissertation). UC Irvine, 2000. REST constraints — §05 REST/JSON row. ics.uci.edu — §05.
Google, gRPC: A high performance, open source universal RPC framework. Protocol Buffers + HTTP/2 RPC — §05 gRPC row. grpc.io — §05.
Facebook / GraphQL Foundation, GraphQL Specification. Client-shaped queries — §05 GraphQL row. spec.graphql.org — §05.
Richardson, C., Microservices Patterns: With Examples in Java. Manning, 2018. API Gateway, BFF, messaging, outbox, saga patterns — §05.2 pattern list. — §05, §06.
Hohpe, G., & Woolf, B., Enterprise Integration Patterns. Addison-Wesley, 2003. Message routing, publish–subscribe, async integration — §05 sync vs. async axis. — §05.
Envoy Proxy / CNCF, service mesh architecture. Sidecar-based mTLS, retries, telemetry — §05 service mesh bullet. envoyproxy.io — §05, §12.
Debezium / transactional outbox pattern. Dual-write avoidance via outbox relay — §05 outbox pattern bullet. See also Richardson, Microservices Patterns, ch. 3 — §05, §06.

Data stores, replication, CQRS, event sourcing & sagas (§06)

Kleppmann, M., Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly, 2017. Replication, partitioning, consistency, stream processing — backbone for §06–§09. — §06, §07, §09, §19.
DeCandia, G. et al., "Dynamo: Amazon's Highly Available Key-value Store." ACM SIGOPS Operating Systems Review, 41(6), 205–220, 2007. Eventual consistency at scale — §06 store trade-offs and §09 eventual model. DOI: 10.1145/1323293.1294281 — §06, §09, §19.
Young, G., "CQRS Documents." CodeBetter / Greg Young, 2010. Command/query responsibility segregation — §06.3 CQRS card. — §06.
Young, G., "Event Sourcing." Domain-driven design community, 2010s. Event log as source of truth — §06.3 event sourcing card. — §06.
Richardson, C., "Pattern: Saga." microservices.io. Orchestrated/choreographed distributed transactions — §06.3 saga card. microservices.io — §06, §04.
Dehghani, Z., "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh." Martin Fowler blog, 2019; Data Mesh (O'Reilly, 2022). Domain-owned data products — §06.3 data mesh card. martinfowler.com — §06.
Karger, D., et al., "Consistent Hashing and Random Trees." STOC, 1997. Minimal remapping on node changes — §06.2 consistent hashing bullet. — §06, §07.

Scalability, caching & web topology (§07)

Memcached / Redis documentation and industry practice. Distributed cache tiers — §07 Figure 7.1 and §07.2 caching strategy. See also Kleppmann ch. 3 — §07.
CDN best practices (Cloudflare, Akamai, AWS CloudFront docs). Edge caching and origin offload — §07 CDN layer. Vendor docs vary; principle is widely documented — §07.
AWS Architecture Center, "Web Application Hosting." Reference scalable web tier patterns — §07 canonical topology. AWS Well-Architected — §07, §17.
Dean, J., & Barroso, L. A., "The Tail at Scale." Communications of the ACM, 56(2), 74–80, 2013. Latency outliers at scale — §07 back-pressure and §13 latency context. DOI: 10.1145/2408776.2408794 — §07, §13.

Reliability, resilience patterns & chaos engineering (§08)

Nygard, M. T., Release It! Design and Deploy Production-Ready Software (2nd ed.). Pragmatic Bookshelf, 2018. Circuit breaker, bulkhead, timeout, stability patterns — §08 resilience library and §08 Figure 8.1. — §08, §19.
Fowler, M., "CircuitBreaker." bliki, 2014. Circuit breaker state machine — §08 circuit breaker row. martinfowler.com — §08.
Netflix Technology Blog, Hystrix and resilience engineering posts. Bulkheads, fallbacks, latency tolerance — §08 pattern table. netflixtechblog.com — §08.
Beyer, B., et al., Site Reliability Engineering (Chaos Engineering chapter). Controlled failure injection — §08.3 chaos engineering bullet. — §08.
AWS, "Disaster Recovery of Workloads on AWS: Recovery in the Cloud." RTO/RPO strategies (backup, pilot light, warm standby, active-active) — §08.2 DR table. AWS DR whitepaper — §08.
Allspaw, J., "Blameless PostMortems and a Just Culture." Etsy / Code as Craft, 2012. Incident learning without blame — §08.3 postmortems. — §08.

Consensus, clocks, ordering & delivery semantics (§09)

Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 21(7), 558–565, 1978. Logical clocks — §09.2 ordering bullet. DOI: 10.1145/359545.359563 — §09, §19.
Lamport, L., "The Part-Time Parliament." ACM Transactions on Computer Systems, 16(2), 133–169, 1998. Paxos consensus — §09.2 consensus bullet. — §09, §19.
Ongaro, D., & Ousterhout, J., "In Search of an Understandable Consensus Algorithm (Raft)." USENIX ATC, 2014. Raft for replicated logs — §09.2 consensus bullet. raft.github.io — §09, §19.
Fischer, M. J., Lynch, N. A., & Paterson, M. S., "Impossibility of Distributed Consensus with One Faulty Process." Journal of the ACM, 32(2), 374–382, 1985. FLP impossibility — background for §09 quorum/consensus limits. — §09.
Apache Kafka documentation, delivery semantics. At-least-once, at-most-once, idempotent producers — §09.3 delivery guarantees. kafka.apache.org — §09.

Security architecture, zero trust & threat modeling (§10)

NIST, Zero Trust Architecture, SP 800-207. U.S. Department of Commerce, 2020. Never trust, always verify — §10 zero-trust lead and Figure 10.1. csrc.nist.gov — §10, §19.
OWASP Foundation, OWASP Top Ten. Web application risk categories — §10 shift-left callout and §18 security checklist. owasp.org — §10, §18, §19.
Microsoft, "The STRIDE Threat Model." Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege — §10.2 STRIDE table. learn.microsoft.com — §10.
Hardt, D., et al., RFC 6749: The OAuth 2.0 Authorization Framework. IETF, 2012. OAuth2 flows — §10 identity card. datatracker.ietf.org — §10.
OpenID Foundation, OpenID Connect Core 1.0. Identity layer on OAuth2 — §10 OIDC authentication. openid.net — §10.
CNCF, Software Supply Chain Best Practices / SLSA framework. Artifact signing, provenance, SBOM — §10 supply-chain card and §12 signed artifacts. slsa.dev — §10, §12, §18.
OWASP, Top 10 for Large Language Model Applications. Prompt injection and LLM abuse — §14.1 safety card and §18 LLM input bullet. owasp.org — §14, §18.

Observability: metrics, logs, traces & OpenTelemetry (§11)

Sigelman, B., et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." Google technical report, 2010. Distributed tracing lineage — §11 traces pillar. — §11.
Majors, C., Fong-Jones, L., & Miranda, G., Observability Engineering. O'Reilly, 2022. Observability vs. monitoring — §11 lead and high-cardinality investigation. — §11.
OpenTelemetry, specification & documentation. Vendor-neutral instrumentation — §11 Figure 11.1 and OTel principle callout. opentelemetry.io — §11, §18.
Jones, C., Wilkes, J., & Murphy, N. R., with Smith, C., "Service Level Objectives." Site Reliability Engineering (Google SRE Book), ch. 4. O'Reilly, 2016. SLI/SLO definitions — §11 SLO/error-budget path. — §08, §11.
Google SRE Workbook, "Alerting on SLOs." Multi-burn-rate alerting — §11.2 hygiene bullets. — §11.
Elasticsearch / OpenSearch documentation. Log aggregation at scale — §11 logs pillar (structured logging). Vendor-neutral principle in SRE literature — §11.

Delivery, DORA metrics & deployment strategies (§12)

Forsgren, N., Humble, J., & Kim, G., Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018. DORA four keys — §12 lead and delivery performance. — §12, §19.
DORA / Google Cloud, State of DevOps Reports. Ongoing research on deployment frequency, lead time, change failure rate, MTTR — §12 DORA reference. dora.dev — §12, §19.
Humble, J., & Farley, D., Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010. Pipeline design and release safety — §12 CI/CD diagram. — §12.
Hodgson, P., "Feature Toggles (aka Feature Flags)." Published on Martin Fowler's site. Feature flags decouple deploy from release — §12 progressive delivery. martinfowler.com — §12.
CNCF, Kubernetes documentation. Container orchestration platform layer — §12.2 platform foundations. kubernetes.io — §12, §17.
LaunchDarkly / industry practice on progressive delivery. Canary, blue/green, rolling deployments — §12.1 deployment strategies table. See also Humble & Farley — §12.

Capacity planning, back-of-envelope math & unit economics (§13)

Dean, J., "Numbers Everyone Should Know" (latency table). Google slides / Communications of the ACM follow-ons. ns–ms–s hierarchy — §13.1 latency numbers table. Widely reproduced; verify current hardware — §13.
Barroso, L. A., Dean, J., & Hölzle, U., "Web Search for a Planet: The Google Cluster Architecture." IEEE Micro, 23(2), 22–28, 2003. Hyperscale capacity thinking — §13 estimation context. — §13.
AWS / Google Cloud pricing calculators and Well-Architected cost optimization pillar. Unit economics and headroom — §13.3 capacity headroom and §18 cost checklist. Vendor-specific — §13, §18.

AI / ML & LLM system architecture (§14)

Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015. ML systems complexity beyond models — §14 MLOps loop. papers.nips.cc — §14.
Google, Machine Learning Engineering / TFX documentation. Training vs. serving paths, model registry — §14.2 inference vs. training. tensorflow.org/tfx — §14.
Chowdhery, A., et al., "PaLM: Scaling Language Modeling with Pathways." Journal of Machine Learning Research, 2023. Large-model serving economics background — §14.1 cost/latency card. — §14.
LangChain / semantic caching and model-routing patterns (industry practice). Provider abstraction gateway — §14.1 economics and §14 principle callout. Patterns evolving rapidly — §14.
Vector database vendors (pgvector, Pinecone, Milvus documentation). Embedding retrieval — §06 vector store row and §14 retrieval-augmented patterns. — §06, §14.

Anti-patterns, reference topology & operational checklists (§16, §17, §18)

Richardson, C., "Anti-pattern: Shared database." Cited in the §06 shared-database callout and §16 anti-patterns. microservices.io — §06, §16.
Richardson, C., "Anti-pattern: Distributed monolith." Cited in the §04 distributed monolith callout and §16. microservices.io — §04, §16.
AWS Well-Architected Framework. Operational excellence, security, reliability, performance, cost pillars — §17 reference architecture and §18 checklists. AWS; see also Azure and GCP equivalents — §17, §18, §19.
Google SRE Book & Workbook. Production readiness, alerting, incident response — §18 production-readiness and design-review checklists. — §18.
Knuth, D. E., "Structured Programming with go to Statements." ACM Computing Surveys, 6(4), 261–301, 1974. "Premature optimization is the root of all evil" — §18.1 heuristics. DOI: 10.1145/356635.356640 — §18.

Foundational canon cited in §19

Kleppmann, M., Designing Data-Intensive Applications. O'Reilly, 2017. — §19 foundational books list. — §19 (+ §06–§09).
Richards, M., & Ford, N., Fundamentals of Software Architecture. O'Reilly, 2020; Ford, N., Richards, M., Sadalage, P., & Dehghani, Z., Software Architecture: The Hard Parts. O'Reilly, 2021. — §19 list. — §19.
Newman, S., Building Microservices (2nd ed.). O'Reilly, 2021. — §19 list. — §19.
Evans, E., Domain-Driven Design. Addison-Wesley, 2003; Vernon, V., Implementing Domain-Driven Design. Addison-Wesley, 2013. — §19 list. — §19.
Nygard, M., Release It! (2nd ed.). Pragmatic Bookshelf, 2018. — §19 list. — §19.
Google SRE Team, Site Reliability Engineering & The Site Reliability Workbook. O'Reilly, 2016–2018. — §19 list. — §19.
Forsgren, N., Humble, J., & Kim, G., Accelerate. IT Revolution Press, 2018. — §19 list. — §19.
Skelton, M., & Pais, M., Team Topologies. IT Revolution, 2019. — §19 list. — §19.

Author synthesis

Truong, L., System Architecture Playbook — personal working notes. May 2026. Trade-off matrices, Mermaid diagrams, checklists, and synthesis prose. LinhTruong.com — all sections.

⚠ Before you cite externally

Latency tables, cloud pricing, and product capabilities go stale quickly — re-measure on your hardware and region. CAP/PACELC summaries simplify nuanced systems behavior; always map claims to the specific database or broker you use. DORA correlations describe population studies, not guarantees for your team. Security frameworks require threat modeling on your actual data flows. Verify DOIs, RFC numbers, and edition details against primary sources before academic citation.