The System Architecture Playbook
I wrote this note to capture how I think about architecture trade-offs: what to optimize, what to defer, and how to document decisions so they survive the next refactor. It collects the diagrams, patterns, and checklists I use when reviewing cloud-native and distributed systems.
00 Introduction: How I Use These Notes
Architecture is the set of design decisions that are expensive to change. I find the hardest part isn't writing code — it's picking the few decisions that determine whether a system still works five to ten years out.
I organized this note as a progression: mindset (how to think), quality attributes (what to optimize), concrete building blocks (styles, data, scale, resilience, security, observability, delivery), and the operating practices (estimation, decisions, anti-patterns, checklists) I run in design reviews.
There is no "best architecture." There is only the architecture that is least wrong for your constraints. Every section below is a menu of trade-offs, not a list of correct answers. The skill is matching the trade-off to the context — and being able to defend the choice in writing.
How I read it
- Skim the diagrams first. Each section has a canonical diagram; I use them as a visual index when I'm orienting on a problem.
- Use the trade-off tables as decision aids, not as verdicts. I add my own column for the constraints I'm actually under.
- The checklists in §18 are the operational distillation. I print them and run them in design reviews.
01 The Architect's Mindset
Before any pattern, the disposition. For me, architecture is a thinking discipline applied under uncertainty, organizational politics, and incomplete information.
1.1 First principles over patterns
Patterns are compressed experience, but they encode the context in which they were discovered. Always decompose to the underlying forces — latency budgets, failure domains, data gravity, team topology — before reaching for a named solution. A pattern applied without its forces is cargo-culting.
1.2 The laws you cannot repeal
Conway's Law: organizations design systems that mirror their communication structures. Corollary (the Inverse Conway Maneuver): shape teams to match the architecture you want, because the system will grow to match the org chart whether you plan it or not.
CAP and PACELC: under a network Partition you choose Consistency or Availability. PACELC extends CAP: Else (no partition) you still trade Latency vs. Consistency. There is no free lunch in distributed state, only a choice of which guarantee to sacrifice.
Amdahl's Law and the Universal Scalability Law (USL): speedup is capped by the serial fraction (Amdahl). The USL adds a coherency penalty: beyond some point, adding nodes makes the system slower because of cross-talk and contention. Scaling has a peak, not an asymptote.
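The two scaling laws are easier to feel with numbers. The sketch below uses the standard textbook forms of Amdahl's Law and the USL; the coefficient values (5% contention, 0.1% coherency) are made up for illustration and are not measurements of any real system.

```python
import math


def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Amdahl's Law: speedup on n workers when a fixed fraction of work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def usl_throughput(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: relative capacity at n nodes.

    alpha models contention (the serial part); beta models coherency
    (pairwise cross-talk). With beta = 0 this reduces to Amdahl's Law.
    """
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_nodes(alpha: float, beta: float) -> float:
    """Node count at which USL throughput peaks (requires beta > 0)."""
    return math.sqrt((1.0 - alpha) / beta)


if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # illustrative coefficients only
    print(f"Amdahl ceiling at 5% serial: {1 / alpha:.0f}x")
    print(f"USL peak near {usl_peak_nodes(alpha, beta):.0f} nodes")
    for n in (1, 8, 31, 64, 128):
        print(n, round(usl_throughput(n, alpha, beta), 2))
```

With these assumed coefficients, capacity rises until roughly 31 nodes and then falls, which is the "peak, not an asymptote" behavior in concrete form.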
Reversibility test
Classify every decision as a one-way door or a two-way door. Spend your scarce deliberation on one-way doors; move fast on reversible ones.
Last responsible moment
Defer irreversible decisions until the cost of deferring exceeds the cost of deciding. Keep options open while information is still arriving.
YAGNI vs. evolvability
Don't build for imagined futures, but do leave seams (interfaces, boundaries) where change is plausible. Evolvability is cheap; speculative generality is not.
1.3 What I try to deliver in an architecture role
- Clarity under ambiguity: turning a vague mandate into bounded problems with explicit trade-offs.
- Written, reviewable decisions: RFCs/ADRs that survive the author leaving.
- Force multiplication: raising the design ceiling of every team I touch, not just shipping my own code.
- Risk pricing: naming what could go catastrophically wrong and what it would cost to insure against it.
A "perfect" design no one can operate, debug, or extend is a liability. Design for the median engineer at 3 a.m. during an incident — not for the architecture diagram's beauty.
02 Quality Attributes: What You Are Actually Optimizing
Architecture exists to satisfy non-functional requirements (the "-ilities"). Functional requirements decide what the system does; quality attributes decide whether it is worth running. You cannot maximize all of them — they trade against each other.
[Diagram: significant requirements split into three groups: operational (availability · scalability · performance · reliability), structural (modularity · maintainability · evolvability · testability), and cross-cutting (security · privacy · observability · cost · compliance). All three feed architecture decisions and trade-off records, which fitness functions continuously verify, with feedback into the decisions.]
2.1 The attribute catalog & how to make each measurable
A quality attribute you cannot measure is a wish. Every attribute below must be expressed as a testable scenario: source → stimulus → environment → response → response measure.
| Attribute | Question it answers | How to measure it | Primary tactics |
|---|---|---|---|
| Availability | What fraction of time is it usable? | SLO, e.g. 99.95% (≈4.4h/yr down) | Redundancy, failover, health checks, graceful degradation |
| Performance | How fast under load? | p50/p95/p99 latency, throughput | Caching, async, indexing, concurrency, CDN |
| Scalability | Does cost-per-unit stay flat as load grows? | Throughput vs. resources curve | Statelessness, sharding, horizontal scale, queues |
| Reliability | Does it produce correct results consistently? | MTBF, error rate, data-loss events | Idempotency, retries, transactions, validation |
| Resilience | Does it recover from failure? | MTTR, blast radius, RTO/RPO | Circuit breakers, bulkheads, chaos testing |
| Security | Can it resist misuse? | Attack surface, time-to-detect, audit pass | Zero trust, least privilege, encryption, defense-in-depth |
| Maintainability | How cheap is change? | Lead time for change, change-fail rate | Modularity, low coupling, tests, clear contracts |
| Evolvability | Can it absorb unforeseen change? | Coupling metrics, fitness functions | Bounded contexts, versioned APIs, anti-corruption layers |
| Observability | Can you ask new questions of prod? | Trace coverage, MTTD | Structured logs, metrics, traces, high-cardinality events |
| Cost efficiency | What does a unit of value cost to serve? | $ / request, $ / tenant, unit economics | Right-sizing, autoscaling, tiered storage, FinOps |
Pushing consistency and strong durability up tends to push latency and availability down, and pushing security up tends to push usability and performance down. Name which attributes you are deliberately sacrificing — an unstated sacrifice is a future incident.
2.2 Availability math you should know cold
| SLO ("nines") | Downtime / year | Downtime / month | Typical use |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools, batch |
| 99.9% (three nines) | 8.77 hours | 43.8 min | Most SaaS features |
| 99.95% | 4.38 hours | 21.9 min | Paid B2B, commerce |
| 99.99% (four nines) | 52.6 min | 4.4 min | Payments, critical APIs |
| 99.999% (five nines) | 5.26 min | 26 sec | Telco, core infra — very costly |
Series vs. parallel: dependencies in series multiply (each hard dependency lowers your ceiling — 5 deps at 99.9% ≈ 99.5%). Redundant components in parallel add nines (two 99% replicas ≈ 99.99% if failures are independent). Architect for independent failure domains.
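The series and parallel rules can be checked in a few lines. This is a minimal sketch that assumes independent failures and a 365.25-day year; the function names are mine, not from any library.

```python
HOURS_PER_YEAR = 365.25 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime for an availability expressed as a fraction (0.999, not 99.9)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def series(*availabilities: float) -> float:
    """Hard dependencies in series: every one must be up, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Redundant replicas: the system is down only when all are down at once.

    Assumes failures are independent, which shared zones, deploys, or
    dependencies can easily violate.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    print(f"99.9% SLO: {downtime_hours_per_year(0.999):.2f} h/year")
    print(f"5 deps at 99.9% in series: {series(*[0.999] * 5):.4%}")
    print(f"2 replicas at 99% in parallel: {parallel(0.99, 0.99):.4%}")
```

The parallel figure is an upper bound: correlated failures eat into it quickly.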
03 The Design Process: A Repeatable Method
When a design has to be legible to others — not just workable in my head — I follow a method like the one below. It scales from a one-pager to a platform decision.
[Diagram: the eight-step design loop. 1 · Frame (problem, goals, non-goals, constraints) → 2 · Requirements (functional + quality attributes) → 3 · Estimate (scale, traffic, data, growth) → 4 · Define APIs & data contracts → 5 · High-level design (components) → 6 · Deep-dive (bottlenecks & failure modes) → 7 · Trade-off analysis & ADR → 8 · Validate (fitness functions, review, prototype), iterating back to step 5.]
3.1 The steps, with the questions to ask at each
- Frame the problem. What user/business outcome? What are the explicit non-goals? What constraints are hard (compliance, budget, deadline, existing systems)? Most bad designs are answers to the wrong question.
- Requirements. Separate functional from quality attributes. Rank the top 3 -ilities; you will optimize for these and explicitly compromise the rest.
- Estimate scale (see §13). Reads vs. writes, peak vs. average, data size, growth rate, latency budget. Numbers convert opinions into engineering.
- Define contracts. APIs and data schemas first — they are the longest-lived artifacts and the hardest to change once clients depend on them.
- High-level design. Boxes and arrows: clients, gateways, services, stores, queues, caches. Keep it to one whiteboard.
- Deep dives. Attack the riskiest part: the hottest path, the largest table, the strongest consistency need, the worst failure mode.
- Trade-off analysis. For each significant decision, list ≥2 options and why you rejected the alternatives. Write the ADR (§15).
- Validate. Prototype the risky bit, write fitness functions, run a design review, and load-test assumptions before committing.
Describe systems at four zoom levels: Context (system + users + external systems), Containers (apps, services, stores), Components (inside a container), and Code (only when needed). Match the diagram's altitude to the audience — executives want Context; engineers want Containers/Components.
04 Architecture Styles: The Macro Choice
The style is your top-level structural commitment. It is the most expensive decision to reverse, so it deserves the most rigor. Choose by team size, domain complexity, and required independence of deployment — not by fashion.
[Diagram: style progression. Growth in complexity leads to microservices; the need to decouple temporal and scaling concerns leads from microservices to event-driven.]
4.1 The styles in brief
Modular Monolith
Single deployable, strict internal module boundaries (own data, explicit interfaces). Simplest ops, in-process calls, easy transactions. Best when: up to about 3 teams and the domain is still being discovered. The boundaries become your future service seams.
Microservices
Independently deployable services around bounded contexts, each owning its data. Enables team autonomy and independent scaling — at the cost of distributed-systems complexity, network failure modes, and operational overhead. Earn it; don't start with it.
Event-Driven
Components react to events via a broker (Kafka, Pulsar, etc.). Excellent for decoupling, buffering load spikes, fan-out, and audit/replay. Cost: eventual consistency, harder debugging, requires strong observability and schema governance.
Serverless / FaaS
Functions + managed services; scale-to-zero, pay-per-use. Great for glue, spiky and unpredictable workloads, and small teams. Cost: cold starts, vendor coupling, execution limits, harder local dev and stateful flows.
Cell-Based / Bulkheaded
Replicate the whole stack into isolated "cells," each serving a slice of users/tenants. Limits blast radius, enables zonal/regional isolation. Used at hyperscale for fault containment and predictable scaling units.
Hexagonal / Clean / Ports & Adapters
An orthogonal pattern (works inside any style): isolate domain logic from I/O via ports and adapters. Maximizes testability and keeps infrastructure swappable. Pairs well with DDD.
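A minimal sketch of ports and adapters, assuming a toy order domain. `Order`, `OrderRepository`, `InMemoryOrderRepository`, and `PlaceOrder` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


class OrderRepository(Protocol):
    """Port: what the domain needs from storage, with no I/O details."""

    def save(self, order: Order) -> None: ...

    def get(self, order_id: str) -> Optional[Order]: ...


class InMemoryOrderRepository:
    """Adapter for tests; a SQL or HTTP adapter would expose the same two methods."""

    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order

    def get(self, order_id: str) -> Optional[Order]:
        return self._orders.get(order_id)


class PlaceOrder:
    """Domain use case: depends only on the port, never on a concrete store."""

    def __init__(self, orders: OrderRepository) -> None:
        self._orders = orders

    def execute(self, order_id: str, total_cents: int) -> Order:
        if total_cents <= 0:
            raise ValueError("order total must be positive")
        order = Order(order_id, total_cents)
        self._orders.save(order)
        return order
```

Because `PlaceOrder` only sees the port, the domain rule (positive totals) can be tested without a database, and the storage choice stays a two-way door.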
4.2 Style trade-off matrix
| Dimension | Modular Monolith | Microservices | Event-Driven | Serverless |
|---|---|---|---|---|
| Operational complexity | Low | High | Med-High | Medium |
| Independent deploy/scale | No | Yes | Yes | Yes |
| Transactional simplicity | High | Low (sagas) | Low | Medium |
| Debuggability | Easy | Needs tracing | Hard | Medium |
| Team autonomy | Medium | High | High | High |
| Fit for spiky load | Medium | Medium | Good | Excellent |
| Best at team count | 1–3 | 4+ | 4+ | 1–small |
Microservices that share a database, deploy together, or call each other synchronously in long chains give you distributed-systems pain without the independence benefits. If services can't be deployed independently, they aren't microservices — they're a monolith you have to debug over the network.
05 Communication & Integration
How components talk is as architecturally significant as how they're divided. The core axis: synchronous (request/response, temporal coupling) vs. asynchronous (messaging, temporal decoupling).
[Diagram: a gateway (auth · rate-limit · routing) forwards REST / gRPC traffic to Service A and GraphQL traffic to a BFF. Service A calls Service B synchronously over gRPC and publishes events to a message broker; Services C and D subscribe. Service B writes to DB B and Service C to DB C.]
5.1 Choosing a protocol
| Style | Sweet spot | Strengths | Watch out for |
|---|---|---|---|
| REST/JSON | Public APIs, CRUD, broad compatibility | Ubiquitous, cacheable, simple | Over/under-fetching, chatty, weak typing |
| gRPC / Protobuf | Internal service-to-service, low latency | Fast, strongly typed, streaming, codegen | Browser support, debuggability, binary |
| GraphQL | Aggregating many backends for varied clients | Client-shaped queries, one round-trip | Caching, N+1, query-cost abuse, complexity |
| Messaging / events | Decoupling, fan-out, buffering, async work | Resilient, scalable, replayable | Eventual consistency, ordering, duplicates |
| WebSocket / SSE | Realtime push, live updates | Low-latency bidirectional / server push | Connection state, scaling fan-out |
5.2 Patterns that earn their keep
- API Gateway: single entry point for auth, rate limiting, routing, TLS termination, request shaping. Keep business logic out of it.
- Backend-for-Frontend (BFF): a tailored aggregation layer per client type (web, mobile) to avoid one-size-fits-none APIs.
- Service mesh (sidecar): push mTLS, retries, timeouts, traffic shaping, and telemetry out of app code into the platform (e.g., Envoy-based meshes).
- Outbox pattern: write the business change and the "to-publish" event in one local transaction, then relay. This is the cure for dual-write inconsistency.
- Async request-reply: for long operations, return 202 Accepted plus a status/polling URL or callback, rather than holding a connection.
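The outbox pattern from the list above can be sketched with SQLite from the Python standard library. The table layout, the `order.placed` topic, and the `publish` callback are assumptions made for illustration; a real relay would hand events to a broker client.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    """Write the business row and the to-publish event in one local transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending events in insertion order, marking each row after it is sent.

    A crash between publish and the mark re-sends the event on the next run,
    so delivery is at-least-once and consumers must be idempotent.
    """
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The key property is that there is never an order without its event or an event without its order, because both rows commit or roll back together.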
Every synchronous cross-service call is a coupling of availability — the caller is only as available as the callee. Where you can tolerate eventual results, an event/queue converts a hard dependency into a soft one and absorbs load spikes for free.
06 Data Architecture: Where Systems Truly Age
Stateless tiers are easy; state is where the hard, irreversible decisions live. Data has gravity: it is heavy to move, expensive to reshape, and the schema outlives the code. Spend disproportionate care here.
6.1 Choosing a store by access pattern, not by hype
| Store type | Data shape / access | Examples | Trade-off |
|---|---|---|---|
| Relational (OLTP) | Structured, relational, strong consistency, transactions | PostgreSQL, MySQL | Vertical scaling limits; sharding adds complexity |
| Document | Flexible/nested, aggregate-oriented reads | MongoDB, DynamoDB | Weaker cross-doc joins/consistency |
| Key-Value | Simple lookups, caching, sessions | Redis, DynamoDB | No rich queries |
| Wide-column | Massive write throughput, time-series | Cassandra, Bigtable | Query patterns fixed at design time |
| Graph | Relationship-heavy traversal | Neo4j, Neptune | Niche; scaling traversals is hard |
| Search | Full-text, faceted, relevance | Elasticsearch, OpenSearch | Not a source of truth; eventual |
| Analytical (OLAP) | Aggregations over huge datasets | Snowflake, BigQuery, ClickHouse | Not for low-latency point writes |
| Vector | Similarity search / embeddings (AI) | pgvector, Pinecone, Milvus | Recall vs. latency tuning; freshness |
Use the right store for each access pattern, but designate one system of record per piece of data and treat all others (search indexes, caches, read replicas, OLAP) as derived, rebuildable projections. If you can't rebuild a derived store from the source, you have two sources of truth and a future divergence bug.
6.2 Scaling data: replication and partitioning
[Diagram: replication: the write node streams changes to Replica 1 and Replica 2, which serve reads. Partitioning / sharding (split for write scale): a router keyed on the shard key sends users 0-3 to Shard A, users 4-7 to Shard B, and users 8-9 to Shard C.]
- Replication (leader–follower) scales reads & gives failover, but introduces replication lag → read-your-writes anomalies. Mitigate with read-from-leader for the writer, or sticky sessions.
- Sharding scales writes/storage. The hardest part is the shard key: it must spread load evenly and avoid hot spots and cross-shard transactions. Bad shard keys are nearly impossible to change later.
- Consistent hashing minimizes data movement when nodes are added/removed — the standard for distributed caches and stores.
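To make the consistent-hashing bullet concrete, here is a small hash ring with virtual nodes. It is a teaching sketch, not a production ring: the class name, the 100 virtual nodes per server, and the SHA-256-based placement are my assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable across processes, unlike the built-in hash().
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns many small arcs, which evens out the load.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[index]]
```

Removing a node only reassigns the keys that node owned; every other key keeps its placement, which is the property that makes resizing cheap.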
6.3 Advanced data patterns
CQRS
Separate the write model (normalized, transactional) from the read model (denormalized, query-optimized projections). Lets each side scale and evolve independently. Adds eventual consistency and projection-maintenance cost — use it where read and write shapes genuinely diverge.
Event Sourcing
Persist the sequence of state-changing events as the source of truth; current state is a fold over events. Gives a perfect audit log, time travel, and replay — at the cost of schema/versioning discipline and a steeper mental model.
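"Current state is a fold over events" can be shown literally. The sketch assumes a toy account with two hypothetical event types, `Deposited` and `Withdrawn`.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Deposited:
    amount_cents: int


@dataclass(frozen=True)
class Withdrawn:
    amount_cents: int


def apply(balance_cents: int, event) -> int:
    """Pure transition function: previous state + one event -> next state."""
    if isinstance(event, Deposited):
        return balance_cents + event.amount_cents
    if isinstance(event, Withdrawn):
        return balance_cents - event.amount_cents
    raise TypeError(f"unknown event type: {type(event).__name__}")


def replay(events, initial: int = 0) -> int:
    """Rebuild state by folding the event log from the beginning."""
    return reduce(apply, events, initial)


if __name__ == "__main__":
    log = [Deposited(10_000), Withdrawn(2_500), Deposited(500)]
    print(replay(log))      # state now
    print(replay(log[:2]))  # "time travel": state after the second event
```

Replaying a prefix of the log is what gives the audit and time-travel properties; the unknown-event branch is where versioning discipline becomes necessary.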
Saga (distributed transactions)
Replace cross-service ACID with a sequence of local transactions plus compensating actions on failure. Orchestrated (central coordinator) or choreographed (events). The standard answer to "how do I do a transaction across microservices?" — the honest answer is "you don't; you do a saga."
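A minimal orchestrated saga, assuming each step is a pair of plain callables (the action and its compensation). The `run_saga` name and step shape are illustrative; a real orchestrator would also persist progress and retry compensations, which this sketch omits.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run local transactions in order; on failure, undo completed ones in reverse."""
    completed: List[Step] = []
    for step in steps:
        _name, action, _compensation = step
        try:
            action()
        except Exception:
            # The failed step did not complete, so only earlier steps are compensated.
            for _done_name, _done_action, compensation in reversed(completed):
                compensation()
            return False
        completed.append(step)
    return True
```

Compensations are new forward actions (a refund, a release), not rollbacks, so other services may observe the intermediate state before it is undone.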
Data Mesh / Lakehouse
Decentralize analytical data ownership to domains that publish data products with contracts and SLAs, over a self-serve platform. The lakehouse unifies the data lake's flexibility with warehouse-grade transactions/governance. For org-scale analytics.
Multiple services reading/writing the same tables couples them at the deepest level — a schema change for one breaks the others, and no one owns the data's invariants. Each service should own its data and expose it only through APIs/events.
07 Scalability Strategy
Scalability is keeping cost-per-unit roughly flat as load grows by orders of magnitude. The toolkit is small and well-understood; the art is applying it before — not during — the incident.
[Diagram: traffic reaches a CDN (cache static + near user), then a health-aware load balancer that spreads requests over autoscaled app nodes N1, N2, N3... The app nodes read through a Redis cache that falls back to the primary DB on a miss; the primary DB feeds read replicas. An app node also pushes work to a queue drained by async workers.]
7.1 The scaling toolkit
- Statelessness — push session/state to a shared store so any node can serve any request; the precondition for horizontal scale and autoscaling.
- Horizontal > vertical — add commodity nodes (linear-ish, fault-tolerant) rather than bigger boxes (hard ceiling, single point of failure). Vertical buys time; horizontal buys headroom.
- Caching at every layer (browser → CDN → app → distributed cache → DB buffer). The cheapest performance you'll ever buy — and the richest source of staleness bugs.
- Load balancing — distribute work; remove unhealthy nodes; enable zero-downtime deploys.
- Async offload — move anything not needed for the response (emails, thumbnails, indexing, analytics) onto a queue.
- Read/write split — replicas for reads, primary for writes; CQRS when shapes diverge.
7.2 Caching strategy & the hard parts
| Pattern | How it works | Use when |
|---|---|---|
| Cache-aside (lazy) | App checks cache, on miss loads DB + populates | Read-heavy, tolerant of slight staleness (most common) |
| Read-through | Cache loads from DB on miss transparently | Want caching logic centralized |
| Write-through | Write to cache + DB synchronously | Read-after-write consistency needed |
| Write-behind | Write to cache, async flush to DB | Write-heavy, can tolerate small loss window |
The three hard parts: invalidation (the famously hard problem; prefer TTLs + event-driven busting), stampede / thundering herd (many misses hit the DB at once on expiry; use request coalescing, jittered TTLs, locks), and hot keys (one key overwhelms a node; replicate or shard hot keys). Plan for all three before launch.
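A sketch of cache-aside with two of the stampede mitigations named above: jittered TTLs and per-key request coalescing. It is an in-process illustration only; `CacheAside`, its parameters, and the injectable `clock` are assumptions, and a shared cache such as Redis would need a distributed equivalent of the lock.

```python
import random
import threading
import time


class CacheAside:
    def __init__(self, loader, ttl_seconds=60.0, jitter=0.1, clock=time.monotonic):
        self._loader = loader    # function key -> value, e.g. a DB read
        self._ttl = ttl_seconds
        self._jitter = jitter    # +/- fraction applied to each TTL
        self._clock = clock
        self._entries = {}       # key -> (value, expires_at)
        self._locks = {}         # key -> lock used to coalesce loads
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):     # only one caller per key runs the loader
            entry = self._fresh(key)  # another thread may have filled it meanwhile
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            # Jitter spreads expiries so keys cached together do not expire together.
            ttl = self._ttl * (1 + random.uniform(-self._jitter, self._jitter))
            self._entries[key] = (value, self._clock() + ttl)
            return value

    def invalidate(self, key):
        """Event-driven busting: call this when the source of truth changes."""
        self._entries.pop(key, None)
```

Hot keys are not addressed here; they need replication or key sharding at the cache tier.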
7.3 Back-pressure & load shedding
A scalable system must protect itself when demand exceeds capacity. Back-pressure propagates "slow down" upstream (bounded queues, flow control). Load shedding deliberately rejects low-priority work (with 429/503 + retry-after) to keep the core healthy. Rate limiting (token/leaky bucket) caps per-client demand. Failing fast and partial beats failing slow and total.
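A token bucket is short enough to show in full. This sketch is single-threaded and uses an injectable clock so it can be tested; the `TokenBucket` and `admit` names are mine, and a real limiter would keep one bucket per client and add locking.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self._rate = rate_per_sec   # steady-state refill rate
        self._capacity = capacity   # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never beyond the burst capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def admit(bucket: TokenBucket) -> int:
    """Status a handler would return: 200 if admitted, 429 if the request is shed."""
    return 200 if bucket.allow() else 429
```

Rejecting with 429 costs almost nothing, which is the point: the expensive work is skipped before it starts.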
08 Reliability & Resilience Engineering
At scale, failure is not an exception — it is the steady state. Disks die, networks partition, dependencies time out, deploys regress. Resilience is designing so that partial failure stays partial and the system degrades gracefully instead of collapsing.
[Diagram: a call passes through a circuit breaker. When closed, it proceeds through a timeout with bounded retry (backoff + jitter) and a bulkhead (isolated pool) to the dependency. When open, or when the dependency fails, the caller gets a fallback / degraded response.]
8.1 The resilience pattern library
| Pattern | Failure it addresses | Key rule |
|---|---|---|
| Timeout | Unbounded waits exhausting threads/connections | Every remote call has a timeout. Always. |
| Retry + backoff + jitter | Transient blips (network, brief overload) | Bound attempts; jitter to avoid synchronized retries |
| Circuit breaker | Repeated calls to a dead dependency | Open after threshold; half-open to probe recovery |
| Bulkhead | One slow dependency draining all resources | Isolate thread/connection pools per dependency |
| Fallback / degrade | Dependency unavailable | Serve cached/default/partial — never a hard error if avoidable |
| Idempotency | Duplicate delivery from retries | Idempotency keys; "at-least-once" + dedupe |
| Redundancy + failover | Node/zone/region loss | Independent failure domains; tested failover |
Naïve retries amplify load during an outage and can cause retry storms that turn a brief degradation into a full collapse. Always combine retries with: bounded attempts, exponential backoff, jitter, circuit breakers, and a retry budget (cap retries as a % of total traffic).
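Bounded retries with exponential backoff and jitter fit in a dozen lines. The sketch uses "full jitter" (a random delay between zero and the capped backoff); `TransientError` is a hypothetical marker exception, and the circuit breaker and retry budget from the paragraph above are not shown.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, brief overload)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts are bounded: surface the failure to the caller
            # The backoff doubles per attempt and is capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random point in [0, cap) de-synchronizes clients.
            sleep(rng() * cap)
```

Only errors known to be transient are retried; anything else propagates immediately, and the operation itself must be idempotent for a retry to be safe.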
8.2 RTO, RPO, and the spectrum of DR
RTO (Recovery Time Objective) = how long until service is restored. RPO (Recovery Point Objective) = how much data loss is acceptable. These business-set numbers dictate the (rising) cost of your DR strategy:
| DR strategy | RTO | RPO | Relative cost |
|---|---|---|---|
| Backup & restore | Hours–days | Hours | $ |
| Pilot light | Tens of min | Minutes | $$ |
| Warm standby | Minutes | Seconds–min | $$$ |
| Active-active (multi-region) | ~Zero | ~Zero | $$$$ |
8.3 Operating reliability: SRE practices
- SLI / SLO / error budgets — measure the indicator, set the objective, and spend the budget: when you're within budget, ship fast; when you're burning it, freeze features and fix reliability. This aligns velocity and stability with one number.
- Chaos engineering — deliberately inject failure (kill nodes, add latency, partition) in controlled experiments to validate resilience before reality does it for you.
- Blameless postmortems — treat incidents as system failures, not human failures; the output is action items that change the system, not blame.
- Game days — rehearse failover and incident response so muscle memory exists when it counts.
09 Distributed Systems Foundations
The moment your system spans more than one machine, a new physics applies. The "fallacies of distributed computing" — the network is reliable, latency is zero, bandwidth is infinite, topology is stable — are all false, and assuming otherwise produces the subtlest, most expensive bugs.
9.1 Consistency models — the spectrum, not the binary
| Model | Guarantee | Cost | Typical use |
|---|---|---|---|
| Strong / linearizable | Reads see the latest write, globally ordered | Highest latency, lowest availability under partition | Balances, inventory, locks |
| Causal | Causally related ops seen in order | Moderate | Collaboration, comments, feeds |
| Read-your-writes | You see your own updates | Low (session pinning) | Profile edits, settings |
| Eventual | Replicas converge given no new writes | Lowest latency, highest availability | Caches, counts, DNS, search |
Strong consistency is a tax on latency and availability. Most features (likes, view counts, recommendations, search) are perfectly happy with eventual consistency. Reserve strong consistency for the genuinely money-or-safety-critical paths and pay the tax only there.
9.2 Consensus, ordering, and time
- Consensus (Raft, Paxos): how a cluster agrees on a value/leader despite failures. You rarely implement it; you use systems built on it (etcd, ZooKeeper, consensus-backed DBs). Know that it requires a majority quorum and cannot make progress without one.
- Logical clocks (Lamport, vector clocks): order events without synchronized wall clocks. Never trust wall-clock ordering across machines, because clock skew is real.
- Quorums: with replication factor N, choosing read quorum R and write quorum W such that R + W > N guarantees that every read quorum overlaps every write quorum, and thus strong reads; tuning R/W trades latency vs. consistency.
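The quorum overlap rule can be verified by brute force for small clusters. The function below enumerates every possible read and write quorum; it is a check of the pigeonhole argument, not how a database decides anything.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r shares a replica with every write quorum of size w."""
    replicas = range(n)
    return all(
        set(read) & set(write)
        for read in combinations(replicas, r)
        for write in combinations(replicas, w)
    )


if __name__ == "__main__":
    for r, w in [(2, 2), (1, 3), (1, 2), (1, 1)]:
        print(f"N=3 R={r} W={w}: overlap={quorums_always_overlap(3, r, w)}")
```

Overlap means at least one replica in any read set has seen the latest acknowledged write; picking the newest version among the replies is still the reader's job.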
9.3 Delivery guarantees & idempotency
Messaging systems offer at-most-once (may lose), at-least-once (may duplicate), or exactly-once (expensive, often "effectively once" via dedupe). In practice, the robust pattern is: at-least-once delivery + idempotent consumers. Design every side-effecting operation to be safely repeatable — via idempotency keys, natural dedupe keys, or conditional writes. This single discipline prevents a huge class of distributed bugs.
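An idempotent consumer under at-least-once delivery, sketched with an in-memory set of seen keys. The `PaymentConsumer` class and the message fields are hypothetical; in a real system the dedupe record and the side effect would have to commit in the same transaction.

```python
class PaymentConsumer:
    def __init__(self) -> None:
        self._processed = set()  # idempotency keys already applied
        self.balance_cents = 0

    def handle(self, message: dict) -> bool:
        """Apply the message once; return False for a duplicate delivery."""
        key = message["idempotency_key"]
        if key in self._processed:
            return False  # duplicate: acknowledge it, change nothing
        self.balance_cents += message["amount_cents"]
        self._processed.add(key)
        return True


if __name__ == "__main__":
    consumer = PaymentConsumer()
    msg = {"idempotency_key": "pay-42", "amount_cents": 500}
    consumer.handle(msg)
    consumer.handle(msg)  # redelivered by the broker after a missed ack
    print(consumer.balance_cents)
```

The broker is free to redeliver as often as it likes; the balance moves once per key.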
10 Security Architecture
Security is a quality attribute that must be designed in, not bolted on. The modern posture is zero trust: never trust the network, always verify identity, and assume breach. Defense in depth means no single control failing should be catastrophic.
[Diagram: a request passes identity (AuthN: who are you?), policy enforcement (AuthZ: what may you do?), the edge (WAF · TLS · rate limit · DDoS), and the app (input validation · output encoding · least privilege). The app uses a secrets manager (rotation · no creds in code) and a data layer (encrypted at rest + in transit · tokenized PII). Audit log · SIEM · anomaly detection observes policy enforcement, the app, and the data layer.]
10.1 The pillars
Identity & Access
Strong authentication (OIDC/OAuth2, MFA, short-lived tokens, mTLS between services) and least-privilege authorization (RBAC/ABAC, scoped tokens). Identity is the new perimeter.
Data protection
Encrypt in transit (TLS everywhere) and at rest. Classify data; tokenize/encrypt PII; minimize what you collect and retain. Manage keys in a KMS with rotation.
Secrets management
No credentials in code, config, or images. Use a secrets manager / vault with rotation and dynamic, short-lived credentials. Prefer workload identity over long-lived keys.
Supply chain
Pin and scan dependencies, generate an SBOM, sign artifacts, verify provenance, and lock down CI/CD permissions. The build pipeline is now a top attack target.
10.2 Threat modeling — STRIDE
Systematically ask, per component and data flow, how it could be attacked:
| Threat | Violates | Countermeasure |
|---|---|---|
| Spoofing | Authentication | Strong auth, MFA, mTLS |
| Tampering | Integrity | Signing, hashing, validation |
| Repudiation | Non-repudiation | Audit logs, signed actions |
| Information disclosure | Confidentiality | Encryption, access control |
| Denial of service | Availability | Rate limiting, autoscale, WAF |
| Elevation of privilege | Authorization | Least privilege, isolation |
Bake security into design (threat models), code (SAST, secure defaults), build (dependency & container scanning, SBOM), and runtime (WAF, anomaly detection, least-privilege workloads). Privacy-by-design and compliance (GDPR, SOC 2, etc.) are architectural constraints, not afterthoughts.
11 Observability
You cannot operate what you cannot see. Monitoring tells you whether the system is healthy against known questions; observability lets you ask new questions of production without shipping new code — essential for debugging novel failures in distributed systems.
[Diagram: services instrumented with the OpenTelemetry SDK send telemetry to an OTel Collector, which fans out to metrics (rates · errors · latency), logs (structured · correlated), and traces (end-to-end spans). All three feed dashboards + alerts, which are evaluated against the SLO / error budget and drive on-call action.]
11.1 The three pillars (and the fourth)
- Metrics — cheap, aggregatable numeric time-series. Track the RED method (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources.
- Logs — structured (JSON), with a correlation/trace ID on every line. Unstructured logs don't scale to investigation.
- Traces — follow a single request across every service hop; the only practical way to find latency and failure in a distributed call graph.
- (Profiles / events) — continuous profiling and high-cardinality wide events round out modern observability.
Standardize instrumentation on a vendor-neutral standard (OpenTelemetry) so you're not locked in. Alert on user-facing symptoms (SLO burn, error rate, latency) — not on causes (high CPU). Cause-based alerts page you for non-problems and miss novel ones; symptom-based alerts catch what actually matters.
11.2 Good alerting hygiene
- Every alert must be actionable and urgent; otherwise it's a dashboard, not a page. Alert fatigue is a reliability risk.
- Use multi-window, multi-burn-rate SLO alerts to catch both fast and slow budget burns without noise.
- Every page links to a runbook. The metric-to-action gap is where MTTR goes to die.
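Burn rate is the observed error ratio divided by the error budget (1 minus the SLO). The sketch below pages only when both a long and a short window are burning fast. The 14.4 threshold is an assumption borrowed from a commonly cited convention: sustained for one hour, it consumes 14.4 / 720 = 2% of a 30-day budget.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if the burn is both sustained (long window) and still happening (short window)."""
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999
    print(should_page(0.02, 0.02, slo))    # sustained and ongoing
    print(should_page(0.02, 0.0005, slo))  # already recovered in the short window
```

The short window stops the alert from paging long after the problem has ended; a second, slower pair of windows with a lower threshold would catch gradual burns.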
12 Delivery, Deployment & Platform
Architecture includes how software reaches production. The DORA research is clear: elite performers deploy frequently, with low lead time, low change-failure rate, and fast recovery — and these come from architecture that supports safe, small, independent deploys.
[Diagram: CI (including SAST · dep scan) produces a signed artifact + SBOM, which deploys to staging for integration + e2e tests. A quality & policy gate, on pass, starts a canary / progressive rollout (1% → 10% → 100%). If the SLO stays healthy the rollout completes; if not, it auto-rolls back.]
12.1 Deployment strategies
| Strategy | How | Best for |
|---|---|---|
| Rolling | Replace instances batch by batch | Default; simple, no extra capacity |
| Blue-Green | Stand up full new env, switch traffic atomically | Instant rollback; needs 2× capacity |
| Canary | Route a small % to new version, watch metrics | Risk reduction at scale; needs good observability |
| Feature flags | Decouple deploy from release; toggle per cohort | Trunk-based dev, experiments, kill switches |
12.2 Platform foundations
- Containers + orchestration (Kubernetes) — package once, run anywhere; declarative scaling, self-healing, rollouts. Powerful but operationally heavy — adopt deliberately.
- Infrastructure as Code (Terraform, etc.) — all infra versioned, reviewed, reproducible. No click-ops in production.
- GitOps — Git as the single source of truth for desired state; a controller continuously reconciles reality to it. Auditable, revertible deployments.
- Platform engineering / Internal Developer Platform — pave golden paths so product teams ship safely without re-deriving infra. Reduces cognitive load; this is where Conway's Law meets DevEx.
- Service mesh — when service count grows, externalize mTLS, retries, traffic policy, and telemetry to the platform.
Batch size is the hidden variable in deployment risk. A 1-line change is easy to review, easy to roll back, and easy to attribute when something breaks. Decouple deploy (shipping code, dark) from release (turning it on, via flags) so you can do both safely and independently.
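Releasing by flag usually means a deterministic percentage rollout. This sketch hashes the flag name and user ID into one of 10,000 buckets; the function name and the `new-checkout` flag are illustrative and not tied to any flag service.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically place a user inside or outside a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, stable per user and flag
    return bucket < percent * 100                        # 1% -> buckets 0..99


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 100):
        enabled = sum(in_rollout("new-checkout", u, percent) for u in users)
        print(f"{percent}% target: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag and the user, raising the percentage adds users without flipping anyone who already had the feature, and setting it to 0 is the kill switch.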
13Capacity Planning & Back-of-the-Envelope Estimation
I try to turn hand-waving into numbers in minutes. Estimation isn't about precision — it's about getting the order of magnitude right so you choose the right architecture and don't over- or under-build by 100×.
13.1 Latency numbers every engineer should know (approx.)
| Operation | ≈ Latency | Mental model |
|---|---|---|
| L1 cache reference | ~1 ns | Free |
| Main memory reference | ~100 ns | 100× slower than L1 |
| Read 1 MB sequentially from RAM | ~5–10 µs | Fast |
| SSD random read | ~100 µs | 1000× slower than RAM ref |
| Read 1 MB from SSD | ~1 ms | — |
| Round trip within a datacenter | ~0.5 ms | Cheap, but adds up in chains |
| Disk (HDD) seek | ~10 ms | Avoid in hot paths |
| Round trip across continents | ~150 ms | Physics — put data near users |
Takeaway: ordered by speed, memory ≫ SSD ≫ disk ≫ cross-region network. Each step down costs one or more orders of magnitude, and every order of magnitude changes the design. Keep hot data in memory; keep users close to their data.
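A quick way to feel "cheap, but adds up in chains" is to sum the table for a request path. This sketch uses the approximate numbers from the table above. The 20-call chain is a hypothetical request shape, not a measurement.

```python
# Approximate latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "l1_cache_ref": 1,
    "main_memory_ref": 100,
    "ssd_random_read": 100_000,
    "dc_round_trip": 500_000,
    "hdd_seek": 10_000_000,
    "cross_continent_round_trip": 150_000_000,
}


def chain_latency_ms(ops: list[str]) -> float:
    """Total latency of operations executed one after another, in milliseconds."""
    return sum(LATENCY_NS[op] for op in ops) / 1_000_000


# Hypothetical request that makes 20 sequential in-datacenter calls.
chatty = chain_latency_ms(["dc_round_trip"] * 20)
one_cross_region = chain_latency_ms(["cross_continent_round_trip"])
print(f"20 sequential DC round trips: {chatty} ms")
print(f"1 cross-continent round trip: {one_cross_region} ms")
```

Twenty sequential hops already cost about 10 ms before any work is done, and a single cross-continent hop costs as much as 300 of them.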
13.2 A worked estimation (template)
Assume: 100M daily active users, each does 10 reads + 1 write/day.
Reads/day = 100M × 10 = 1B → /86,400s ≈ 11,600 RPS avg
Writes/day = 100M × 1 = 100M → ≈ 1,160 WPS avg
Peak factor ≈ 3× → ~35k RPS read, ~3.5k WPS write at peak
Storage/write: ~1 KB per record
100M writes/day × 1 KB = 100 GB/day → ~36 TB/year (before replication)
With 3× replication → ~110 TB/year
Cache: keep hot 20% of reads in memory
working set ≈ (some GB) → size Redis cluster accordingly
Bandwidth (read): 35k RPS × ~1 KB ≈ 35 MB/s egress (plan CDN offload)
This 5-minute exercise immediately tells you that you need a distributed cache, read replicas or sharding, and a CDN, and (once you assume a per-node throughput) roughly how many app nodes, all before writing any code.
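The same template is easy to keep as a small script so the assumptions stay explicit and can be changed in one place. This is a sketch of the arithmetic above, using decimal units (1 KB = 1,000 bytes) to match the text. The function and key names are mine.

```python
SECONDS_PER_DAY = 86_400


def estimate(
    dau: int,
    reads_per_user: int,
    writes_per_user: int,
    peak_factor: float = 3.0,
    record_bytes: int = 1_000,  # ~1 KB per record, also used as the read payload size
    replication: int = 3,
) -> dict[str, float]:
    """Back-of-the-envelope load and storage numbers (decimal units)."""
    avg_rps = dau * reads_per_user / SECONDS_PER_DAY
    avg_wps = dau * writes_per_user / SECONDS_PER_DAY
    storage_per_day = dau * writes_per_user * record_bytes  # bytes
    storage_per_year = storage_per_day * 365
    return {
        "avg_rps": avg_rps,
        "avg_wps": avg_wps,
        "peak_rps": avg_rps * peak_factor,
        "peak_wps": avg_wps * peak_factor,
        "storage_gb_per_day": storage_per_day / 1e9,
        "storage_tb_per_year": storage_per_year / 1e12,
        "replicated_tb_per_year": storage_per_year * replication / 1e12,
        "peak_read_mb_per_s": avg_rps * peak_factor * record_bytes / 1e6,
    }


numbers = estimate(dau=100_000_000, reads_per_user=10, writes_per_user=1)
for key, value in numbers.items():
    print(f"{key:>24}: {value:,.1f}")
```

The unrounded results land on the rounded figures in the template: roughly 11.6k average and 35k peak reads per second, 100 GB/day, 36.5 TB/year, and about 110 TB/year after 3× replication.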
13.3 Capacity headroom & cost
- Plan for peak × growth × safety margin, not average. Target ~50–70% utilization at peak so spikes don't tip you over.
- Autoscale on the right signal (queue depth, RPS, p99) — CPU is often a lagging proxy.
- FinOps: track unit economics ($/request, $/tenant). The cheapest-to-build design and the cheapest-to-run design are rarely the same — make the trade explicit.
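The headroom rule turns into a one-line sizing formula. In this sketch the per-node throughput (1,000 RPS) and the 1.5× growth factor are placeholder assumptions. Measure your own per-node capacity under load before trusting the result.

```python
import math


def nodes_needed(
    peak_rps: float,
    growth_factor: float,
    per_node_rps: float,
    target_utilization: float,
) -> int:
    """Nodes required so that peak × growth lands at the target utilization."""
    required_rps = peak_rps * growth_factor
    usable_rps_per_node = per_node_rps * target_utilization
    return math.ceil(required_rps / usable_rps_per_node)


# 35k RPS peak from the worked estimate, with hypothetical growth and node capacity.
at_60 = nodes_needed(peak_rps=35_000, growth_factor=1.5, per_node_rps=1_000, target_utilization=0.6)
at_50 = nodes_needed(peak_rps=35_000, growth_factor=1.5, per_node_rps=1_000, target_utilization=0.5)
print(at_60, at_50)
```

Dropping the utilization target from 60% to 50% raises the fleet from 88 to 105 nodes under these assumptions, which is exactly the kind of build-versus-run trade the FinOps bullet asks you to make explicit.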
14Architecting for AI / ML & Modern Workloads
AI-intensive systems are now mainstream, and they bend several classic assumptions: non-deterministic outputs, GPU economics, large-payload latency, prompt/data-driven behavior, and new failure and safety modes. A current playbook must address them.
[Diagram: AI serving path. Gateway (auth · rate-limit) → Orchestrator → Retriever → Vector DB (embeddings) and Source docs; Orchestrator → LLM / Model Service (GPU-backed, autoscaled) → Guardrails (safety · validation) → Semantic cache, and back to the user; Eval · tracing · cost · drift monitoring observes the Orchestrator and the LLM.]
14.1 What changes when AI is in the loop
Non-determinism & evals
Outputs vary; "correct" is fuzzy. Replace exact-match tests with evaluation suites (offline + online), human/auto graders, and regression gates. Treat prompts and model versions as deployable, versioned artifacts.
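A regression gate over an evaluation suite can be as simple as comparing mean scores against a baseline with a tolerance. The sketch below assumes each grader returns a score between 0 and 1 per test case. The 0.02 tolerance is an arbitrary illustration.

```python
from statistics import mean
from typing import Sequence


def regression_gate(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,  # illustrative tolerance for grader noise
) -> bool:
    """Pass when the candidate's mean score is within `max_drop` of the baseline."""
    return mean(candidate_scores) >= mean(baseline_scores) - max_drop


baseline = [1.0, 1.0, 0.0, 1.0]  # current prompt/model version, mean 0.75
better = [1.0, 1.0, 1.0, 1.0]    # candidate that fixes a case
worse = [1.0, 0.0, 0.0, 1.0]     # candidate that breaks a case
print(regression_gate(baseline, better), regression_gate(baseline, worse))
```

In a pipeline this check sits where a unit-test step would, and a failing gate blocks the prompt or model version from being promoted.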
Retrieval & data freshness (RAG)
Ground models in your data via embeddings + a vector store to reduce hallucination and add recency. The architecture problem becomes retrieval quality, chunking, and freshness as much as the model.
Cost & latency economics
Inference is expensive and often slow. Use semantic caching, model routing (small model first, escalate), batching, streaming responses, and right-sized models. Token cost is now a unit-economics line item.
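Semantic caching means reusing an answer when a new prompt is close enough in embedding space to one already answered. Here is a minimal in-memory sketch. The `toy_embed` function is a stand-in for a real embedding model, the 0.95 threshold is an assumption to tune, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((self.embed(prompt), answer))

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached answer of the most similar prompt, if similar enough."""
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Toy embedding for the demo only: word counts over a tiny fixed vocabulary.
VOCAB = ("reset", "password", "refund", "invoice")


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password", "Use the account settings page.")
print(cache.get("reset password please"))   # similar prompt, served from cache
print(cache.get("refund for my invoice"))   # unrelated prompt, cache miss
```

The threshold is the trade-off knob: too low and users get answers to questions they did not ask, too high and the cache never hits.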
Safety, guardrails & abuse
Validate inputs/outputs, constrain tools the model can call, sandbox actions, and add content/safety filters. Prompt injection and data exfiltration are real threat-model entries — treat model inputs as untrusted.
14.2 Inference vs. training paths
- Serving/inference is the latency-critical, always-on path: GPU autoscaling (with cold-start mitigation), request batching, streaming, and aggressive caching dominate.
- Training/fine-tuning is a batch, data-pipeline problem: feature stores, reproducible datasets, experiment tracking, and a model registry with lineage.
- MLOps/LLMOps closes the loop: deploy models like code (CI/CD), monitor for drift and quality regression, and enable safe rollback of model/prompt versions.
Models, prompts, and providers change monthly. Put a provider-agnostic abstraction (your own gateway/orchestration layer) between your product and any specific model so you can swap models, route by cost/quality, A/B test, and enforce guardrails centrally — without rewriting the product.
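One way to sketch that abstraction is a small gateway that owns routing and guardrails while the product only ever calls `complete`. Every name here (`ModelProvider`, `ModelGateway`, `FakeProvider`, the `UNSURE` marker) is hypothetical, and the quality check is a placeholder for whatever signal you trust.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class ModelProvider(Protocol):
    """The only surface the product depends on; vendors plug in behind it."""
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class FakeProvider:
    """Canned-response provider standing in for a real model client."""
    name: str
    replies: dict[str, str] = field(default_factory=dict)
    calls: int = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return self.replies.get(prompt, "UNSURE")


class ModelGateway:
    """Routes small-model-first, escalates when needed, applies guardrails centrally."""

    def __init__(
        self,
        small: ModelProvider,
        large: ModelProvider,
        is_good_enough: Callable[[str], bool],
        guard: Callable[[str], str] = lambda text: text,
    ):
        self.small = small
        self.large = large
        self.is_good_enough = is_good_enough
        self.guard = guard

    def complete(self, prompt: str) -> str:
        draft = self.small.complete(prompt)
        answer = draft if self.is_good_enough(draft) else self.large.complete(prompt)
        return self.guard(answer)


small = FakeProvider("small", {"2+2?": "4"})
large = FakeProvider("large", {"2+2?": "4", "prove it": "By Peano arithmetic."})
gateway = ModelGateway(small, large, is_good_enough=lambda text: text != "UNSURE")
print(gateway.complete("2+2?"), "|", gateway.complete("prove it"))
print("large model calls:", large.calls)
```

Swapping a vendor then means writing one new class that satisfies `ModelProvider`. Routing, A/B tests, and guardrails stay in the gateway and the product code does not change.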
15Making & Recording Decisions
The artifact I care most about is the written decision. An architecture no one can reconstruct the reasoning behind becomes legacy the day its author leaves. ADRs and RFCs make decisions durable, reviewable, and reversible-with-eyes-open.
15.1 The Architecture Decision Record (ADR)
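A short document per significant decision. Each record is immutable once accepted and the log is append-only: a changed decision gets a new ADR that supersedes the old one. The key sections: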
# ADR-0042: Use event-driven sagas for order fulfillment
## Status
Accepted (supersedes ADR-0017)
## Context
Orders span inventory, payment, and shipping services. We need
"transaction-like" consistency without a distributed 2PC, and the
fulfillment flow must survive partial failures and be auditable.
## Decision
Use a choreographed saga over Kafka with compensating actions.
Each step is idempotent; the outbox pattern guarantees event publish.
## Consequences
+ Services stay decoupled and independently deployable.
+ Full audit trail via the event log.
- Eventual consistency: UI must reflect "pending" states.
- Requires saga monitoring + dead-letter handling.
## Alternatives considered
- 2PC across services: rejected (availability + coupling cost).
- Synchronous orchestration service: rejected (single point of
coupling, harder to scale steps independently).
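Because the sections are fixed, a lint step can check that no ADR merges without them. This is a small sketch assuming ADRs are Markdown files that use `## ` headings as in the example above. The function name is mine.

```python
REQUIRED_SECTIONS = (
    "Status",
    "Context",
    "Decision",
    "Consequences",
    "Alternatives considered",
)


def missing_sections(adr_text: str) -> list[str]:
    """Return the required `## ` headings that the ADR text does not contain."""
    headings = {
        line[3:].strip()
        for line in adr_text.splitlines()
        if line.startswith("## ")
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


draft = """# ADR-0043: Example draft
## Status
Proposed
## Context
Some context.
## Decision
Some decision.
"""
print(missing_sections(draft))
```

Run against the draft, the check reports the two sections that are still missing: Consequences and Alternatives considered.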
"Alternatives considered" is the most valuable section — it proves you weighed options and lets a future reader judge whether the context has changed enough to revisit the decision. Context + alternatives, not just the verdict.
15.2 A decision rubric
- Is this a one-way or two-way door? Calibrate rigor to reversibility.
- What are we optimizing, and what are we sacrificing? Name the losing -ilities.
- What's the blast radius if we're wrong? Can we contain/roll it back?
- What would change our mind? Write the falsifiable condition for revisiting.
- Who must be in the room? The people who own the consequences.
Architecture is a team sport with real disagreement. Surface dissent in the RFC, decide explicitly, record the dissent, and commit. A documented decision that some disagreed with beats an undocumented consensus that no one can reconstruct.
16Anti-Patterns & Failure Modes to Recognize
Knowing what not to do is half of architecture. These recur across companies and decades; recognizing them early is a superpower.
| Anti-pattern | What it looks like | Why it hurts | Antidote |
|---|---|---|---|
| Distributed monolith | Services that must deploy together, share a DB, or call each other in long sync chains | Distributed pain, none of the independence | Real boundaries; async; own your data |
| Big Ball of Mud | No discernible structure; everything depends on everything | Change is unpredictable; fear-driven dev | Bounded contexts; enforce module boundaries |
| Premature microservices | 10 services, 3 engineers, undiscovered domain | Ops overhead crushes velocity | Start modular monolith; extract on evidence |
| God service / object | One component everything routes through | Bottleneck + SPOF + change magnet | Decompose by responsibility |
| Chatty interfaces | Dozens of fine-grained calls per use case | Latency multiplies; failure surface grows | Coarse-grained APIs; batch; BFF |
| Shared mutable database | Many services writing the same tables | Hidden coupling; no invariant owner | One owner; integrate via API/events |
| Retry storms | Naïve retries with no backoff/budget | Amplifies outages into collapse | Backoff+jitter, circuit breaker, budget |
| Cache as source of truth | Data only lives in the cache | Eviction/restart = data loss | Cache is derived & rebuildable |
| Resume-driven design | Adopting tech for novelty, not need | Complexity with no business return | Boring tech; justify every new dependency |
| Golden hammer | One tool/pattern forced onto every problem | Poor fit; accidental complexity | Match tool to access pattern |
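The retry-storm antidote in the table combines three parts: exponential backoff, jitter, and a retry budget. The sketch below is illustrative only. `RetryBudget` and `call_with_retry` are hypothetical names, the delays are arbitrary, and production code should catch only errors known to be retryable instead of every `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudget:
    """Shared cap on retries so a failing dependency is not hammered by every caller."""

    def __init__(self, max_retries: int):
        self.remaining = max_retries

    def try_spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_retry(
    fn: Callable[[], T],
    budget: RetryBudget,
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `fn`, retrying with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise
            # Full jitter: wait a random fraction of the capped exponential delay.
            sleep(rng() * min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")


# Usage: a call that fails twice, then succeeds. Sleeps are recorded, not waited.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


recorded: list[float] = []
shared_budget = RetryBudget(max_retries=10)
print(call_with_retry(flaky, shared_budget, sleep=recorded.append), recorded)
```

The budget is what stops amplification: once it is spent, callers fail fast instead of multiplying load on a dependency that is already struggling.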
Every box, queue, service, and database you add is a thing to operate, secure, monitor, and pay for, forever. The default answer to "should we add this component?" is no, until the requirement proves otherwise. Boring, well-understood technology is a feature, not a compromise.
17A Reference Architecture (Putting It Together)
A concrete, modern reference for a high-scale, multi-region SaaS — annotated with which decisions matter and why. Treat it as a starting skeleton, not a mandate.
[Diagram: reference topology. API Gateway (authN/Z · rate-limit) → Service Mesh → Service: Identity, Service: Core Domain, Service: Billing, BFF / GraphQL. Core Domain → Distributed Cache; Core Domain → Primary OLTP (sharded + replicas); Core Domain → Outbox → Event Bus / Kafka → Async Workers and CDC → Lakehouse → Analytics / OLAP; Core Domain → Vector store (AI features). Observability (OTel · metrics · logs · traces · SLO) and Security (secrets · KMS · audit/SIEM) attach to the Service Mesh.]
17.1 Why each piece is there
| Component | Quality attribute served | If you skip it… |
|---|---|---|
| GeoDNS + CDN + WAF | Latency, availability, security | Slow far-away users; exposed origin |
| API Gateway | Security, governance | Auth/rate-limit logic scattered everywhere |
| Service mesh | Resilience, observability, security | Retries/mTLS/telemetry reimplemented per service |
| Per-service data ownership | Evolvability, autonomy | Shared-DB coupling; can't deploy independently |
| Outbox + event bus | Reliability, decoupling | Dual-write inconsistency; tight coupling |
| CDC → lakehouse | Analytics without OLTP load | Analysts hammer the prod DB |
| Observability plane | Operability | Blind during incidents; high MTTR |
This reference is the destination, not the starting point. A team of five should begin with a modular monolith + managed Postgres + a CDN, and grow toward this shape only as load and team size justify each addition. Architecture is earned incrementally.
18The Playbook — Operational Checklists
The distilled, printable core. Run these in design reviews, readiness reviews, and incident retros.
Design Review Checklist
- Problem, goals, and explicit non-goals are written down
- Top 3 quality attributes ranked; sacrifices named
- Scale estimated (RPS, data, growth, latency budget)
- API & data contracts defined and versioned
- Failure modes enumerated; blast radius bounded
- Consistency model chosen per data flow
- Each significant decision has an ADR with alternatives
- Riskiest assumption prototyped or load-tested
Production Readiness Checklist
- SLOs defined with error budgets and alerts on symptoms
- Metrics, logs, traces wired (OTel); dashboards exist
- Every remote call has timeout + bounded retry + breaker
- Idempotency on all side-effecting operations
- Autoscaling + load shedding + rate limits configured
- Runbooks linked from every alert
- RTO/RPO set; backup & failover tested (not assumed)
- Rollback path verified; deploy is progressive
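For the "timeout + bounded retry + breaker" item, the breaker is the piece teams most often leave out. A minimal single-threaded sketch follows. The class and its thresholds are illustrative assumptions, and a production breaker also needs locking and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage with a fake clock so the demo does not have to wait.
fake_now = [0.0]
demo = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: fake_now[0])


def broken() -> str:
    raise ValueError("dependency down")


for _ in range(3):
    try:
        demo.call(broken)
    except (ValueError, CircuitOpenError) as exc:
        print(type(exc).__name__)
fake_now[0] = 11.0
print(demo.call(lambda: "recovered"))
```

The third call never reaches the dependency, which is the behavior that protects a struggling downstream from more traffic while it recovers.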
Security Checklist
- Threat model (STRIDE) done for new data flows
- AuthN/Z enforced; least privilege everywhere
- Secrets in a vault, rotated; none in code/images
- Data encrypted in transit & at rest; PII classified
- Dependencies scanned; SBOM generated; artifacts signed
- Audit logging + anomaly detection in place
- Inputs validated; model/LLM inputs treated as untrusted
Scalability & Cost Checklist
- App tier stateless; state in shared stores
- Caching strategy chosen; stampede/hot-key handled
- Read/write split or sharding plan; shard key validated
- Slow work offloaded to async queues/workers
- Back-pressure & graceful degradation defined
- Unit economics tracked ($/request, $/tenant)
- Headroom: runs at 50–70% at peak × growth
18.1 One-page heuristics I keep nearby
- Make it work, make it right, make it fast — in that order. Premature optimization and premature distribution are twin sins.
- Boring technology by default. Spend your innovation tokens where they create real differentiation.
- Optimize the bottleneck, measure first. Intuition about performance is usually wrong; profile.
- Design for failure, test for failure. Untested failover is a hope, not a plan.
- Couple loosely, align tightly. Loose technical coupling, tight clarity on contracts and ownership.
- Write the decision down. If it isn't an ADR, it didn't happen.
- Reversibility first. Move fast on two-way doors; deliberate on one-way doors.
- The simplest thing that could possibly work — and no simpler.
19The Canon & Further Reading
Texts and bodies of knowledge I keep returning to. I read them for the reasoning, not the recipes. Annotated citations with DOIs and URLs: §20 References & Sources.
Foundational books
- Designing Data-Intensive Applications — Kleppmann (the data & distributed-systems bible)
- Fundamentals of Software Architecture & Software Architecture: The Hard Parts — Richards & Ford
- Building Microservices — Newman
- Domain-Driven Design — Evans; Implementing DDD — Vernon
- Release It! — Nygard (stability & resilience patterns)
- Site Reliability Engineering & The SRE Workbook — Google
- Accelerate — Forsgren, Humble, Kim (the DORA research)
- Team Topologies — Skelton & Pais (Conway's Law, in practice)
Papers, models & bodies of knowledge
- Brewer — the CAP theorem; Abadi — PACELC
- Lamport — time, clocks & the ordering of events
- Raft & Paxos consensus papers
- Amazon Dynamo paper (eventual consistency at scale)
- The 12-Factor App methodology
- C4 model for software architecture (Brown)
- The Reactive Manifesto
- OWASP Top 10 & NIST Zero Trust (SP 800-207)
- Well-Architected Frameworks (cloud-vendor) — for pillar checklists
The technologies in these notes will change; the trade-offs will not. Distributed state, the speed of light, human and organizational limits, and the tension between consistency, availability, latency, cost, and complexity are permanent. When I understand the forces, every new tool is just another point in a design space I already recognize.
20References & Sources
Annotated bibliography behind the mindset heuristics, quality attributes, design process, architecture styles, integration patterns, data and scale tactics, resilience and distributed-systems foundations, security, observability, delivery, estimation, AI/ML workloads, ADRs, anti-patterns, reference topology, checklists, and canon sections. Section tags (e.g. §04) show where each source is used. Mermaid diagrams, trade-off matrices, and synthesis checklists are my own unless noted.
Scope. Synthesis of textbooks, peer-reviewed papers, industry frameworks, and vendor-neutral standards (May 2026). Cloud-specific numbers, SKUs, and product names change frequently — verify against current vendor docs before production decisions. Security and compliance references are architectural starting points, not audit checklists.
Citations are numbered continuously [1]–[n] within this section.
Architecture mindset, trade-offs & organizational forces (§00, §01, §18)
- Brooks, F. P., The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 1975 (rev. ed. 1995). Essential vs. accidental complexity; why late projects stay late — background for §00 irreversible-decision framing. — §00, §01.
- Conway, M. E., "How Do Committees Invent?" Datamation, 14(4), 28–31, 1968. Conway's Law — §01 law callout and team–architecture coupling. melconway.com — §01.
- Skelton, M., & Pais, M., Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press, 2019. Inverse Conway Maneuver and stream-aligned teams — §01 Conway corollary and §04 org-fit guidance. — §01, §04, §19.
- Bezos, J., "2015 Letter to Shareholders." Amazon.com, 2016. Type 1 (one-way door) vs. Type 2 (two-way door) decisions; framing later popularized in tech leadership writing — §01 reversibility heuristic. See also All Things Distributed essays by Werner Vogels — §01, §18.
- Beck, K., et al., Extreme Programming Explained: Embrace Change (2nd ed.). Addison-Wesley, 2004. YAGNI and incremental design — §01 YAGNI vs. evolvability card. — §01.
- Richards, M., & Ford, N., Fundamentals of Software Architecture: An Engineering Approach. O'Reilly, 2020. Architecture characteristics, trade-off analysis, and role of the architect — §00 progression and §02 attribute catalog. — §00, §02, §19.
Distributed-systems laws: CAP, PACELC, Amdahl & scalability limits (§01, §09)
- Brewer, E. A., "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 45(2), 23–29, 2012. CAP theorem refinement — §01 CAP callout and §09 consistency trade-offs. DOI: 10.1109/MC.2012.37 — §01, §09.
- Abadi, D., "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Computer, 45(2), 37–42, 2012. PACELC extension — §01 PACELC callout. DOI: 10.1109/MC.2012.33 — §01, §09.
- Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS Spring Joint Computer Conference, 1967. Serial fraction limits parallel speedup — §01 Amdahl callout. — §01, §07.
- Gunther, N. J., The Practical Performance Analyst / Universal Scalability Law. Guerrilla-capacity.com, 2008+. Coherency/contention penalties beyond Amdahl — §01 USL callout. perfdynamics.com — §01, §07.
- Deutsch, P., et al., "Eight Fallacies of Distributed Computing." Sun Microsystems internal list, popularized 1990s. Network unreliability and latency assumptions — §09 distributed foundations lead. — §09.
Quality attributes, scenarios & fitness functions (§02)
- ISO/IEC 25010:2023, Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE). ISO standard for quality characteristics (-ilities taxonomy) — §02 attribute catalog. — §02.
- Bass, L., Clements, P., & Kazman, R., Software Architecture in Practice (4th ed.). Addison-Wesley, 2021. Quality attribute scenarios (stimulus–response–measure) — §02 measurability paragraph. — §02.
- Ford, N., Parsons, R., & Kua, P., Building Evolutionary Architectures: Automated Governance for Software Teams (2nd ed.). O'Reilly, 2022. Fitness functions and continuous architectural verification — §02 Figure 2.1 feedback loop. — §02, §03.
- Google SRE Team, Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. SLI/SLO/error-budget framing — §02 availability math and §08.3 SRE practices. sre.google — §02, §08, §11.
- Google SRE Team, The Site Reliability Workbook. O'Reilly, 2018. Multi-window multi-burn-rate alerting — §11.2 alerting hygiene. sre.google/workbook — §08, §11.
Design process, diagrams & decision records (§03, §05, §15)
- Brown, S., The C4 Model for Visualising Software Architecture. c4model.com, 2018+. Context/container/component/code hierarchy — §03 design-process communication. c4model.com — §03, §05, §19.
- Nygard, M., "Documenting Architecture Decisions." Cognitect blog, 2011. Original ADR format — §15 ADR template and §15 lead. cognitect.com — §15.
- Thomson, J., "ADR: Architecture Decision Records." GitHub / community templates, 2010s+. Widely adopted ADR practice — §15.1 sections (Status, Context, Decision, Consequences). adr.github.io — §15.
- Richards, M., & Ford, N., Software Architecture: The Hard Parts. O'Reilly, 2021. Trade-off rubrics for distributed architecture decisions — §15.2 decision rubric and §09 consistency choices. — §09, §15, §19.
Architecture styles, DDD & evolutionary paths (§04, §16, §17)
- Newman, S., Building Microservices: Designing Fine-Grained Systems (2nd ed.). O'Reilly, 2021. Microservices trade-offs, distributed monolith warning — §04 styles, §04.2 matrix, §16 anti-patterns. — §04, §16, §19.
- Evans, E., Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. Bounded contexts and ubiquitous language — §04 modular monolith seams and §06 data ownership. — §04, §06, §19.
- Vernon, V., Implementing Domain-Driven Design. Addison-Wesley, 2013. Tactical patterns and service boundaries — §04 hexagonal/DDD pairing. — §04, §19.
- Cockburn, A., "Hexagonal Architecture." Alistair Cockburn, 2005. Ports and adapters — §04 hexagonal card. alistair.cockburn.us — §04.
- Martin, R. C., Clean Architecture: A Craftsman's Guide to Software Structure and Design. Prentice Hall, 2017. Dependency rule and domain-centric layering — §04 clean/hexagonal card. — §04.
- Reactive Manifesto. Lightbend et al., 2013. Responsive, resilient, elastic, message-driven systems — §04 event-driven style context. reactivemanifesto.org — §04, §19.
- Vogels, W., "Building Scalable, Highly Concurrent, and Fault-Tolerant Applications." AWS re:Invent / All Things Distributed, 2018+. Cell-based architecture for blast-radius containment — §04 cell-based card and §17 reference topology. allthingsdistributed.com — §04, §17.
- Heroku, The Twelve-Factor App. 2011+. Stateless processes, config, logs as streams — §12 platform foundations and §07 stateless scaling. 12factor.net — §07, §12, §19.
Integration, APIs & messaging patterns (§05)
- Fielding, R. T., Architectural Styles and the Design of Network-based Software Architectures (REST dissertation). UC Irvine, 2000. REST constraints — §05 REST/JSON row. ics.uci.edu — §05.
- Google, gRPC: A high performance, open source universal RPC framework. Protocol Buffers + HTTP/2 RPC — §05 gRPC row. grpc.io — §05.
- Facebook / GraphQL Foundation, GraphQL Specification. Client-shaped queries — §05 GraphQL row. spec.graphql.org — §05.
- Richardson, C., Microservices Patterns: With Examples in Java. Manning, 2018. API Gateway, BFF, messaging, outbox, saga patterns — §05.2 pattern list. — §05, §06.
- Hohpe, G., & Woolf, B., Enterprise Integration Patterns. Addison-Wesley, 2003. Message routing, publish–subscribe, async integration — §05 sync vs. async axis. — §05.
- Envoy Proxy / CNCF, service mesh architecture. Sidecar-based mTLS, retries, telemetry — §05 service mesh bullet. envoyproxy.io — §05, §12.
- Debezium / transactional outbox pattern. Dual-write avoidance via outbox relay — §05 outbox pattern bullet. See also Richardson, Microservices Patterns, ch. 3 — §05, §06.
Data stores, replication, CQRS, event sourcing & sagas (§06)
- Kleppmann, M., Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly, 2017. Replication, partitioning, consistency, stream processing — backbone for §06–§09. — §06, §07, §09, §19.
- DeCandia, G. et al., "Dynamo: Amazon's Highly Available Key-value Store." ACM SIGOPS Operating Systems Review, 41(6), 205–220, 2007. Eventual consistency at scale — §06 store trade-offs and §09 eventual model. DOI: 10.1145/1323293.1294281 — §06, §09, §19.
- Young, G., "CQRS Documents." CodeBetter / Greg Young, 2010. Command/query responsibility segregation — §06.3 CQRS card. — §06.
- Young, G., "Event Sourcing." Domain-driven design community, 2010s. Event log as source of truth — §06.3 event sourcing card. — §06.
- Richardson, C., "Pattern: Saga." microservices.io. Orchestrated/choreographed distributed transactions — §06.3 saga card. microservices.io — §06, §04.
- Dehghani, Z., "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh." Martin Fowler blog, 2019; Data Mesh (O'Reilly, 2022). Domain-owned data products — §06.3 data mesh card. martinfowler.com — §06.
- Karger, D., et al., "Consistent Hashing and Random Trees." STOC, 1997. Minimal remapping on node changes — §06.2 consistent hashing bullet. — §06, §07.
Scalability, caching & web topology (§07)
- Memcached / Redis documentation and industry practice. Distributed cache tiers — §07 Figure 7.1 and §07.2 caching strategy. See also Kleppmann ch. 3 — §07.
- CDN best practices (Cloudflare, Akamai, AWS CloudFront docs). Edge caching and origin offload — §07 CDN layer. Vendor docs vary; principle is widely documented — §07.
- AWS Architecture Center, "Web Application Hosting." Reference scalable web tier patterns — §07 canonical topology. AWS Well-Architected — §07, §17.
- Dean, J., & Barroso, L. A., "The Tail at Scale." Communications of the ACM, 56(2), 74–80, 2013. Latency outliers at scale — §07 back-pressure and §13 latency context. DOI: 10.1145/2408776.2408794 — §07, §13.
Reliability, resilience patterns & chaos engineering (§08)
- Nygard, M. T., Release It! Design and Deploy Production-Ready Software (2nd ed.). Pragmatic Bookshelf, 2018. Circuit breaker, bulkhead, timeout, stability patterns — §08 resilience library and §08 Figure 8.1. — §08, §19.
- Fowler, M., "CircuitBreaker." bliki, 2014. Circuit breaker state machine — §08 circuit breaker row. martinfowler.com — §08.
- Netflix Technology Blog, Hystrix and resilience engineering posts. Bulkheads, fallbacks, latency tolerance — §08 pattern table. netflixtechblog.com — §08.
- Beyer, B., et al., Site Reliability Engineering (Chaos Engineering chapter). Controlled failure injection — §08.3 chaos engineering bullet. — §08.
- AWS, "Disaster Recovery of Workloads on AWS: Recovery in the Cloud." RTO/RPO strategies (backup, pilot light, warm standby, active-active) — §08.2 DR table. AWS DR whitepaper — §08.
- Allspaw, J., "Blameless PostMortems and a Just Culture." Etsy / Code as Craft, 2012. Incident learning without blame — §08.3 postmortems. — §08.
Consensus, clocks, ordering & delivery semantics (§09)
- Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 21(7), 558–565, 1978. Logical clocks — §09.2 ordering bullet. DOI: 10.1145/359545.359563 — §09, §19.
- Lamport, L., "The Part-Time Parliament." ACM Transactions on Computer Systems, 16(2), 133–169, 1998. Paxos consensus — §09.2 consensus bullet. — §09, §19.
- Ongaro, D., & Ousterhout, J., "In Search of an Understandable Consensus Algorithm (Raft)." USENIX ATC, 2014. Raft for replicated logs — §09.2 consensus bullet. raft.github.io — §09, §19.
- Fischer, M. J., Lynch, N. A., & Paterson, M. S., "Impossibility of Distributed Consensus with One Faulty Process." Journal of the ACM, 32(2), 374–382, 1985. FLP impossibility — background for §09 quorum/consensus limits. — §09.
- Apache Kafka documentation, delivery semantics. At-least-once, at-most-once, idempotent producers — §09.3 delivery guarantees. kafka.apache.org — §09.
Security architecture, zero trust & threat modeling (§10)
- NIST, Zero Trust Architecture, SP 800-207. U.S. Department of Commerce, 2020. Never trust, always verify — §10 zero-trust lead and Figure 10.1. csrc.nist.gov — §10, §19.
- OWASP Foundation, OWASP Top Ten. Web application risk categories — §10 shift-left callout and §18 security checklist. owasp.org — §10, §18, §19.
- Microsoft, "The STRIDE Threat Model." Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege — §10.2 STRIDE table. learn.microsoft.com — §10.
- Hardt, D., et al., RFC 6749: The OAuth 2.0 Authorization Framework. IETF, 2012. OAuth2 flows — §10 identity card. datatracker.ietf.org — §10.
- OpenID Foundation, OpenID Connect Core 1.0. Identity layer on OAuth2 — §10 OIDC authentication. openid.net — §10.
- CNCF, Software Supply Chain Best Practices / SLSA framework. Artifact signing, provenance, SBOM — §10 supply-chain card and §12 signed artifacts. slsa.dev — §10, §12, §18.
- OWASP, Top 10 for Large Language Model Applications. Prompt injection and LLM abuse — §14.1 safety card and §18 LLM input bullet. owasp.org — §14, §18.
Observability: metrics, logs, traces & OpenTelemetry (§11)
- Sigelman, B., et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." Google technical report, 2010. Distributed tracing lineage — §11 traces pillar. — §11.
- Majors, C., Fong-Jones, L., & Miranda, G., Observability Engineering. O'Reilly, 2022. Observability vs. monitoring — §11 lead and high-cardinality investigation. — §11.
- OpenTelemetry, specification & documentation. Vendor-neutral instrumentation — §11 Figure 11.1 and OTel principle callout. opentelemetry.io — §11, §18.
- Jones, C., Wilkes, J., & Murphy, N. R., with Smith, C., "Service Level Objectives." Site Reliability Engineering, ch. 4. O'Reilly, 2016. SLI/SLO definitions — §11 SLO/error-budget path. — §08, §11.
- Google SRE Workbook, "Alerting on SLOs." Multi-burn-rate alerting — §11.2 hygiene bullets. — §11.
- Elasticsearch / OpenSearch documentation. Log aggregation at scale — §11 logs pillar (structured logging). Vendor-neutral principle in SRE literature — §11.
Delivery, DORA metrics & deployment strategies (§12)
- Forsgren, N., Humble, J., & Kim, G., Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018. DORA four keys — §12 lead and delivery performance. — §12, §19.
- DORA / Google Cloud, State of DevOps Reports. Ongoing research on deployment frequency, lead time, change failure rate, MTTR — §12 DORA reference. dora.dev — §12, §19.
- Humble, J., & Farley, D., Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010. Pipeline design and release safety — §12 CI/CD diagram. — §12.
- Fowler, M., "Feature Toggles (aka Feature Flags)." Feature flags decouple deploy from release — §12 progressive delivery. martinfowler.com — §12.
- CNCF, Kubernetes documentation. Container orchestration platform layer — §12.2 platform foundations. kubernetes.io — §12, §17.
- LaunchDarkly / industry practice on progressive delivery. Canary, blue/green, rolling deployments — §12.1 deployment strategies table. See also Humble & Farley — §12.
Capacity planning, back-of-envelope math & unit economics (§13)
- Dean, J., "Numbers Everyone Should Know" (latency table). Google slides / Communications of the ACM follow-ons. ns–ms–s hierarchy — §13.1 latency numbers table. Widely reproduced; verify current hardware — §13.
- Barroso, L. A., Dean, J., & Hölzle, U., "Web Search for a Planet: The Google Cluster Architecture." IEEE Micro, 23(2), 22–28, 2003. Hyperscale capacity thinking — §13 estimation context. — §13.
- AWS / Google Cloud pricing calculators and Well-Architected cost optimization pillar. Unit economics and headroom — §13.3 capacity headroom and §18 cost checklist. Vendor-specific — §13, §18.
AI / ML & LLM system architecture (§14)
- Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015. ML systems complexity beyond models — §14 MLOps loop. papers.nips.cc — §14.
- Google, Machine Learning Engineering / TFX documentation. Training vs. serving paths, model registry — §14.2 inference vs. training. tensorflow.org/tfx — §14.
- Chowdhery, A., et al., "PaLM: Scaling Language Modeling with Pathways." Journal of Machine Learning Research, 2023. Large-model serving economics background — §14.1 cost/latency card. — §14.
- LangChain / semantic caching and model-routing patterns (industry practice). Provider abstraction gateway — §14.1 economics and §14 principle callout. Patterns evolving rapidly — §14.
- Vector database vendors (pgvector, Pinecone, Milvus documentation). Embedding retrieval — §06 vector store row and §14 retrieval-augmented patterns. — §06, §14.
Anti-patterns, reference topology & operational checklists (§16, §17, §18)
- Richardson, C., "Anti-pattern: Shared database." microservices.io — §06 shared-database callout and §16 anti-patterns. microservices.io — §06, §16.
- Richardson, C., "Anti-pattern: Distributed monolith." microservices.io — §04 distributed monolith callout and §16. microservices.io — §04, §16.
- AWS Well-Architected Framework. Operational excellence, security, reliability, performance, cost pillars — §17 reference architecture and §18 checklists. AWS; see also Azure and GCP equivalents — §17, §18, §19.
- Google SRE Book & Workbook. Production readiness, alerting, incident response — §18 production-readiness and design-review checklists. — §18.
- Knuth, D. E., "Structured Programming with go to Statements." ACM Computing Surveys, 6(4), 261–301, 1974. "Premature optimization is the root of all evil" — §18.1 heuristics. DOI: 10.1145/356635.356640 — §18.
Foundational canon cited in §19
- Kleppmann, M., Designing Data-Intensive Applications. O'Reilly, 2017. — §19 foundational books list. — §19 (+ §06–§09).
- Richards, M., & Ford, N., Fundamentals of Software Architecture & Software Architecture: The Hard Parts. O'Reilly, 2020–2021. — §19 list. — §19.
- Newman, S., Building Microservices (2nd ed.). O'Reilly, 2021. — §19 list. — §19.
- Evans, E., Domain-Driven Design; Vernon, V., Implementing Domain-Driven Design. — §19 list. — §19.
- Nygard, M., Release It! (2nd ed.). Pragmatic Bookshelf, 2018. — §19 list. — §19.
- Google SRE Team, Site Reliability Engineering & The Site Reliability Workbook. O'Reilly, 2016–2018. — §19 list. — §19.
- Forsgren, N., Humble, J., & Kim, G., Accelerate. IT Revolution, 2018. — §19 list. — §19.
- Skelton, M., & Pais, M., Team Topologies. IT Revolution, 2019. — §19 list. — §19.
Author synthesis
- Truong, L., System Architecture Playbook — personal working notes. May 2026. Trade-off matrices, Mermaid diagrams, checklists, and synthesis prose. LinhTruong.com — all sections.
Latency tables, cloud pricing, and product capabilities go stale quickly — re-measure on your hardware and region. CAP/PACELC summaries simplify nuanced systems behavior; always map claims to the specific database or broker you use. DORA correlations describe population studies, not guarantees for your team. Security frameworks require threat modeling on your actual data flows. Verify DOIs, RFC numbers, and edition details against primary sources before academic citation.