Source / canonical copy: LinhTruong.com. Share that link if you pass this file around so attribution stays with it.
Notes on the stack I reach for when software engineers have to ship
data- and knowledge-grounded AI: ingestion and lakehouse, vectors and graphs,
RAG through GraphRAG-style patterns, agents, evaluation, governance, and the LLMOps wiring that keeps quality from drifting.
Opinionated, refreshed for 2026, and slanted toward what holds up in audits and postmortems—not demo-day polish.
Most of the pain I see in production is data and knowledge, not which LLM badge is on the box. Models churn;
what lasts is information that is permissioned, curated, and structured enough to retrieve under scrutiny.
AI Data & Knowledge Engineering is the work of building and running that layer—not the slide version, the on-call version.
~80%: common share of time on data & knowledge work (rule of thumb)
3×: accuracy lift from hybrid + GraphRAG vs. naive RAG
10–100×: cost variance from poor retrieval design
< 5%: of orgs have production-grade AI data contracts
How I draw it on a whiteboard. Four beats: a lakehouse for raw + curated tables,
a semantic layer (ontology + graph) for meaning, vector + hybrid search for recall,
and an evaluation + governance spine so you can show provenance when something goes wrong.
The AI Data & Knowledge Engineer sits at the intersection. A software engineer moving into this role
trades "deploy a service" for "deploy a data product" — versioned, contract-checked, observable, and consumable
by both humans and models.
flowchart LR
subgraph DE["Data Engineering"]
A1["Ingestion: CDC, events"] --> A2["Lakehouse: Bronze/Silver/Gold"]
A2 --> A3[Feature Store]
end
subgraph KE["Knowledge Engineering"]
B1[Ontology / Taxonomy] --> B2[Entity Resolution]
B2 --> B3[Knowledge Graph]
A2 --> B2
end
subgraph MLE["AI / ML Engineering"]
C1[Embeddings] --> C2[Vector Index]
C2 --> C3[Hybrid + GraphRAG]
C3 --> C4[Agents / Apps]
A3 --> C1
B3 --> C3
end
style DE fill:#eef2ff,stroke:#3730a3
style KE fill:#f5f3ff,stroke:#5b21b6
style MLE fill:#ecfeff,stroke:#155e75
Figure 2.1 — The three pillars and where they meet.
3. Reference Architecture
The canonical end-to-end pipeline a Software Engineer should be able to draw on a whiteboard:
Late chunking / contextual embedding: embed each chunk in the context of its document (a 2024+ technique; sketched below).
Fixed-size sliding window — only as fallback.
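Here is the contextual-embedding idea from the first bullet as a minimal Python sketch; summarize_document and embed are hypothetical stand-ins for a cheap LLM call and your embedding endpoint, not any specific library's API:

```python
# Sketch of contextual embedding: each chunk is embedded together with a short
# document-level context so the vector carries "where this chunk came from".
# summarize_document() and embed() are hypothetical stand-ins.
from typing import Callable

def contextual_chunks(doc_text: str,
                      chunks: list[str],
                      summarize_document: Callable[[str], str],
                      embed: Callable[[str], list[float]]) -> list[dict]:
    doc_context = summarize_document(doc_text)  # 1-2 sentences, cached per doc
    records = []
    for i, chunk in enumerate(chunks):
        contextualized = f"Document context: {doc_context}\n\nChunk: {chunk}"
        records.append({
            "chunk_id": i,
            "text": chunk,                       # store the raw chunk for display
            "embedding": embed(contextualized),  # embed the contextualized text
        })
    return records
```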
6. Knowledge Engineering Pillar
6.1 The Knowledge Stack
flowchart TB
O["Ontology: classes, properties, axioms (OWL / SHACL / schema.org)"]
T["Taxonomy / controlled vocabulary (SKOS)"]
ER["Entity resolution (canonical IDs)"]
KG["Knowledge graph (typed nodes & edges)"]
SR["Semantic reasoner / rules (SHACL, SWRL, Datalog)"]
SL["Semantic / metrics layer (dbt-semantic, Cube, MetricFlow)"]
APP[Apps + LLMs]
O --> T
O --> ER
O --> KG
T --> KG
ER --> KG
KG --> SR
KG --> SL
SL --> APP
KG --> APP
SR --> APP
style O fill:#f5f3ff,stroke:#5b21b6
style KG fill:#ecfeff,stroke:#155e75
style SL fill:#ecfdf5,stroke:#047857
Figure 6.1 — Knowledge layer: ontology → taxonomy → ER → KG → reasoning → semantic layer.
6.2 Ontology Design — Rules of Thumb
Start from use cases, not the universe. Model what is queried.
Separate T-Box (schema) from A-Box (instances); version T-Box independently.
Use SHACL for validation, not just OWL for inference (see the sketch after this list).
Prefer property graphs (LPG) for app speed; RDF when you need standards / federation.
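A minimal sketch of "SHACL for validation" using rdflib and pySHACL (pip install pyshacl); the toy supplier class and shape are illustrative only, not a recommended schema:

```python
# T-Box shape validates A-Box instances; a violation fails loudly in CI.
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:acme a ex:Supplier .            # missing the required ex:legalName
""", format="turtle")

shapes = Graph().parse(data="""
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:SupplierShape a sh:NodeShape ;
    sh:targetClass ex:Supplier ;
    sh:property [ sh:path ex:legalName ;
                  sh:minCount 1 ;
                  sh:datatype xsd:string ] .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: the instance violates the shape
print(report)     # human-readable validation report
```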
6.3 Entity Resolution Pipeline
flowchart LR
R[Raw records] --> N["Normalize: casing, unicode, addresses"]
N --> B["Blocking: candidate pairs"]
B --> SC["Score: ML / rules / Fellegi-Sunter"]
SC --> CL["Cluster: connected components"]
CL --> RV["Human review: active learning"]
RV --> CN["Canonical IDs + survivorship"]
CN --> KG[Knowledge Graph]
style SC fill:#eef2ff,stroke:#3730a3
style RV fill:#fffbeb,stroke:#92400e
Figure 6.2 — ER is a loop, not a one-shot. Budget for review tooling.
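The loop in Figure 6.2, in miniature; difflib string similarity stands in for a real scoring model, networkx connected components for clustering, and block keys, thresholds, and record shapes are all illustrative:

```python
# Entity resolution in miniature: normalize -> block -> score -> cluster.
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher
import networkx as nx  # pip install networkx

def normalize(name: str) -> str:
    s = unicodedata.normalize("NFKC", name).casefold().strip()
    return " ".join(s.split())

def resolve(records: dict[str, str], threshold: float = 0.9) -> list[set[str]]:
    """records: record_id -> raw name. Returns clusters of matching ids."""
    norm = {rid: normalize(n) for rid, n in records.items()}
    # Blocking: only compare records sharing a cheap key (first 4 chars here),
    # which keeps scoring from being O(n^2) over the whole corpus.
    block = defaultdict(list)
    for rid, n in norm.items():
        block[n[:4]].append(rid)
    g = nx.Graph()
    g.add_nodes_from(records)
    for ids in block.values():
        for a, b in ((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]):
            score = SequenceMatcher(None, norm[a], norm[b]).ratio()
            if score >= threshold:      # placeholder for an ML / F-S model
                g.add_edge(a, b)
    # Clustering: connected components become candidate canonical entities.
    return [set(c) for c in nx.connected_components(g)]

print(resolve({"1": "ACME Corp.", "2": "acme corp", "3": "Globex LLC"}))
# [{'1', '2'}, {'3'}]  (component order may vary)
```

In production the threshold branch is where the human-review loop attaches: confident pairs merge automatically, uncertain ones go to review tooling.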
6.4 The Semantic Layer
A semantic layer defines metrics, dimensions, and joins once so BI, apps, and LLMs all
answer the same question the same way. This is the difference between "ChatGPT for our database" demos and
systems that survive an audit.
7.2 Hybrid Retrieval Architecture (L2 — production default)
flowchart LR
Q[User query] --> QR[Query rewriting + classification]
QR --> P1["Lexical: BM25 / SPLADE"]
QR --> P2[Dense vector kNN]
QR --> P3["Metadata filters: ACL, time, source"]
P1 --> F["Fusion: RRF / weighted"]
P2 --> F
P3 --> F
F --> RR[Cross-encoder reranker]
RR --> CTX["Context packer: dedup, diversify, cite"]
CTX --> LLM[LLM]
LLM --> ANS[Answer + citations]
ANS --> EV[Eval / trace]
style F fill:#eef2ff,stroke:#3730a3
style RR fill:#ecfeff,stroke:#155e75
style EV fill:#fffbeb,stroke:#92400e
Figure 7.1 — Hybrid retrieval is the production default. Always include a reranker and citations.
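A compressed sketch of the lexical + dense + fusion portion of Figure 7.1. rank_bm25 is a real package (pip install rank-bm25); embed is a hypothetical stand-in for your embedding model, and the output here would feed the cross-encoder reranker, which is omitted:

```python
# Hybrid retrieval: BM25 + dense scores fused with Reciprocal Rank Fusion.
import numpy as np
from rank_bm25 import BM25Okapi

def rrf(rankings: list[list[int]], k: int = 60) -> dict[int, float]:
    """RRF: score(d) = sum over rankers of 1 / (k + rank_of_d)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return scores

def hybrid_search(query: str, docs: list[str], embed, top_k: int = 10):
    bm25 = BM25Okapi([d.split() for d in docs])
    lex = np.argsort(-bm25.get_scores(query.split()))   # lexical ranking
    doc_vecs = np.array([embed(d) for d in docs])       # cache these in prod!
    q = np.array(embed(query))
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    dense = np.argsort(-sims)                           # dense ranking
    fused = rrf([lex.tolist(), dense.tolist()])
    # Top fused candidates go to the cross-encoder reranker, not straight to the LLM.
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```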
7.3 GraphRAG (L5)
GraphRAG augments retrieval with a knowledge graph: entities and relationships extracted from the corpus
become first-class retrieval keys. Strong for multi-hop questions ("Which suppliers of company X were affected
by event Y?") and for producing global summaries that vector search cannot.
flowchart TB
DOC[Corpus] --> EX["Entity + relation extraction: LLM + NER"]
EX --> KG[(Knowledge Graph)]
KG --> COM["Community detection: Leiden / Louvain"]
COM --> SUM[Community summaries]
Q[Query] --> QC[Query classifier]
QC -->|local| LK[Local: ego-network around entity]
QC -->|global| GS["Global: community summaries, map-reduce"]
LK --> CTX[Context]
GS --> CTX
CTX --> LLM
style KG fill:#f5f3ff,stroke:#5b21b6
style COM fill:#ecfeff,stroke:#155e75
Figure 7.2 — GraphRAG (Microsoft Research, 2024) flow: local vs global queries.
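The local branch of Figure 7.2, sketched with networkx and a three-edge toy graph standing in for a real KG store; entity and relation names are invented for illustration:

```python
# Local GraphRAG query: expand an ego-network around the matched entity and
# pack its typed edges into LLM context.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("CompanyX", "SupplierA", relation="supplied_by")
kg.add_edge("SupplierA", "EventY", relation="affected_by")
kg.add_edge("CompanyX", "SupplierB", relation="supplied_by")

def local_context(graph: nx.MultiDiGraph, entity: str, hops: int = 2) -> str:
    # Ego network: every node within `hops` of the entity, ignoring direction,
    # which is what multi-hop questions ("suppliers of X affected by Y") need.
    ego = nx.ego_graph(graph.to_undirected(), entity, radius=hops)
    facts = [f"{u} -[{d['relation']}]-> {v}"
             for u, v, d in graph.edges(data=True)
             if u in ego and v in ego]
    return "\n".join(facts)

print(local_context(kg, "CompanyX"))
# CompanyX -[supplied_by]-> SupplierA
# SupplierA -[affected_by]-> EventY
# CompanyX -[supplied_by]-> SupplierB
```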
The four retrieval failure modes — memorize these:
(1) Recall miss (right doc not retrieved), (2) Ranking miss (retrieved but buried),
(3) Context dilution (good chunk drowned by noise), (4) Grounding miss (LLM ignores context).
Each has different fixes; eval should measure each separately.
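A minimal sketch of metrics for the first two modes: recall@k catches recall misses, MRR catches ranking misses. Context dilution and grounding misses need context-precision and groundedness checks (usually judge-based), which I'm not sketching here:

```python
# Per-failure-mode retrieval metrics over a labeled eval set.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Failure mode 1: was the right doc retrieved at all (in the top k)?"""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Failure mode 2: retrieved but buried; rewards putting it near the top."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9"]
print(recall_at_k(retrieved, {"d2"}, k=3), mrr(retrieved, {"d2"}))  # 1.0 0.5
```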
8. Agentic Data Workflows
Agents make data & knowledge systems active: they plan, call tools, branch, and self-correct. The
software-engineer mindset: an agent is a state machine with tool-calls and a budget, not magic.
Figure 8.1 — Plan → tools → critic loop. Bounded retries and explicit budgets are non-negotiable.
8.1 Engineering Rules for Production Agents
Deterministic skeleton, LLM in the joints. Hard-code the graph; let the LLM decide branches.
Tools are typed APIs with schemas, idempotency keys, retries, and timeouts.
Budgets everywhere: max steps, max tokens, max latency, max $ — fail loud when exceeded (sketched after this list).
Durable execution (Temporal, Restate, LangGraph-checkpointed) for any agent running > 30s.
Human-in-the-loop at irreversible actions (writes, payments, deletes).
Trace everything — every prompt, tool call, retrieval, and response — to a queryable store.
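The "state machine with tool-calls and a budget" framing, sketched; plan_next_step and the tool registry are hypothetical placeholders, and a real system wraps this skeleton with tracing, checkpointing, and human-in-the-loop gates:

```python
# Bounded agent loop: every budget is checked before the LLM runs again.
import time
from dataclasses import dataclass, field

@dataclass
class Budget:
    max_steps: int = 8
    max_seconds: float = 60.0
    max_dollars: float = 0.50          # token budget omitted for brevity
    spent_dollars: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def check(self, step: int) -> None:
        if step >= self.max_steps:
            raise RuntimeError("budget exceeded: steps")      # fail loud
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("budget exceeded: latency")
        if self.spent_dollars > self.max_dollars:
            raise RuntimeError("budget exceeded: cost")

def run_agent(goal: str, tools: dict, plan_next_step, budget: Budget):
    state = {"goal": goal, "observations": []}
    step = 0
    while True:
        budget.check(step)
        action = plan_next_step(state)          # the LLM decides the branch
        if action["type"] == "finish":
            return action["answer"]
        tool = tools[action["tool"]]            # typed, schema-checked API
        state["observations"].append(tool(**action["args"]))
        step += 1
```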
8.2 Text-to-SQL on the Semantic Layer
Don't let LLMs write raw SQL against raw tables. Expose the semantic layer as the tool surface.
The LLM picks metrics + dimensions + filters; the semantic layer compiles deterministic SQL. In my experience this collapses error rates from roughly 50% to under 5% on real schemas.
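A sketch of that compile step under an invented two-metric semantic layer; the LLM's only job is to emit the request dict (e.g., via function calling), and anything outside the allowlist is rejected rather than guessed:

```python
# The LLM never emits SQL: it emits a constrained request that the semantic
# layer validates and compiles deterministically.
ALLOWED = {
    "metrics": {"revenue": "SUM(order_total)"},
    "dimensions": {"region": "region",
                   "month": "DATE_TRUNC('month', ordered_at)"},
}

def compile_query(request: dict, table: str = "gold.orders") -> str:
    metric = request["metric"]
    dims = request.get("dimensions", [])
    if metric not in ALLOWED["metrics"]:
        raise ValueError(f"unknown metric: {metric}")   # reject, don't guess
    bad = [d for d in dims if d not in ALLOWED["dimensions"]]
    if bad:
        raise ValueError(f"unknown dimensions: {bad}")
    select = [f"{ALLOWED['dimensions'][d]} AS {d}" for d in dims]
    select.append(f"{ALLOWED['metrics'][metric]} AS {metric}")
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if dims:
        sql += " GROUP BY " + ", ".join(str(i + 1) for i in range(len(dims)))
    return sql

print(compile_query({"metric": "revenue", "dimensions": ["region", "month"]}))
# SELECT region AS region, DATE_TRUNC('month', ordered_at) AS month,
#   SUM(order_total) AS revenue FROM gold.orders GROUP BY 1, 2
```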
9. Data & Knowledge Quality
9.1 Data Contracts
A data contract is a producer-signed schema + SLA + semantics. It is the single biggest cultural shift
required to make AI data reliable.
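A data contract sketched as code with pydantic (pip install pydantic); field names, SLA numbers, and owner are illustrative. The point is that the producer owns the file, CI validates samples against it, and changes are versioned:

```python
# Producer-signed schema + SLA + semantics, checked in CI.
from datetime import datetime
from pydantic import BaseModel, Field

class OrderEventV1(BaseModel):
    """Contract owned by the producer; consumers may rely on these fields."""
    order_id: str = Field(min_length=1)
    customer_id: str
    order_total: float = Field(ge=0)    # semantics: USD, tax included
    ordered_at: datetime

CONTRACT_SLA = {
    "freshness_minutes": 15,            # data lands within 15 minutes
    "completeness_pct": 99.5,           # share of rows passing validation
    "owner": "team-checkout",
}

# In CI: validate a sample from the producer; a breaking change fails the build.
OrderEventV1.model_validate(
    {"order_id": "o-1", "customer_id": "c-9",
     "order_total": 42.0, "ordered_at": "2026-01-05T12:00:00Z"}
)
```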
flowchart LR
LOG[Prod traces] --> CUR["Curate eval set: hard, edge, regression cases"]
CUR --> OFF["Offline eval: regression suite in CI"]
OFF --> CHG[Change prompt / index / model]
CHG --> SHADOW[Shadow / canary]
SHADOW --> AB[A/B test online]
AB --> WIN{wins?}
WIN -->|yes| ROLL[Rollout]
WIN -->|no| BACK[Roll back + add to eval]
BACK --> CUR
ROLL --> LOG
style OFF fill:#ecfdf5,stroke:#047857
style AB fill:#eef2ff,stroke:#3730a3
Figure 10.1 — The eval flywheel. Every prod incident becomes a regression test.
10.3 LLM-as-Judge — Use Wisely
Use rubrics, not vibes. Define what 1–5 means per criterion.
Pairwise beats absolute scoring. Use pairs against a baseline (see the sketch below).
Audit the judge: 100 human-labeled samples per quarter, track agreement.
Use a stronger model as judge than the one being judged when possible.
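A pairwise-judge sketch; call_llm is a hypothetical gateway call, and the rubric and verdict parsing are deliberately minimal:

```python
# Pairwise LLM-as-judge with an explicit rubric and a safe default.
RUBRIC = """You are grading two answers to the same question.
Criteria: factual grounding in the provided context, completeness, citations.
Reply with exactly one token: A, B, or TIE."""

def judge_pair(question: str, context: str, answer_a: str, answer_b: str,
               call_llm) -> str:
    # Randomizing the A/B order per call (not shown) avoids position bias.
    prompt = (f"{RUBRIC}\n\nQuestion: {question}\nContext: {context}\n\n"
              f"Answer A: {answer_a}\n\nAnswer B: {answer_b}\n\nVerdict:")
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fail safe

def win_rate(pairs: list[tuple], call_llm) -> float:
    """Share of pairs where the candidate (A) beats the baseline (B)."""
    wins = sum(judge_pair(*p, call_llm=call_llm) == "A" for p in pairs)
    return wins / max(len(pairs), 1)
```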
Vector stores leak. If a document contains PII, the chunk, the embedding, AND any metadata payload
must be treated as PII. ACL must be checked at query time against payload tags — not just at ingestion.
Tag PII at parse time; carry tags through chunking, embedding, and retrieval.
Enforce ACL before reranking, not after — saves cost and prevents side-channel leaks (sketched below).
Redact prompts at the LLM gateway; route sensitive prompts to private / on-prem models.
Honor right-to-erasure: every PII record must be re-findable by subject ID across lakehouse + vector + KG.
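A sketch of query-time ACL filtering ahead of the reranker; the payload shape is illustrative, and the deny-by-default stance is the part worth copying:

```python
# ACL at query time: every chunk carries the tags it inherited at parse time,
# and the filter runs on the candidate set before any reranker or LLM sees it.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    acl_groups: frozenset    # groups allowed to read the source document
    pii_tags: frozenset      # e.g. {"email", "ssn"}, carried from parsing

def acl_filter(candidates: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    # Deny by default: a chunk with no ACL payload is treated as restricted.
    return [c for c in candidates
            if c.acl_groups and c.acl_groups & user_groups]

chunks = [Chunk("q3 revenue...", frozenset({"finance"}), frozenset()),
          Chunk("ssn: ...", frozenset({"hr"}), frozenset({"ssn"}))]
print([c.text for c in acl_filter(chunks, {"finance"})])  # ['q3 revenue...']
```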
12.3 Compliance Map (2026)
Regime → what it forces on you:
GDPR / CCPA / CPRA: lawful basis, DSARs, deletion across all stores, including vectors
EU AI Act: risk classification, data-governance evidence, transparency, logging
SOC 2 / ISO 27001: access logs, change management, vendor management
HIPAA / PCI / FINRA: sector-specific encryption, retention, and audit rules
NIST AI RMF / ISO 42001: AI management system, evals, incident response
12.4 FinOps for AI Data
Cache aggressively: semantic cache for repeated questions; prompt cache for system prompts.
Route by complexity: small model first; escalate on uncertainty (router pattern).
Right-size embeddings: dimension reduction (Matryoshka) cuts storage 4–8× with marginal quality loss (see the sketch after this list).
Batch and async: never block UX on full agent runs — stream + progressively reveal.
Track $/query, $/user, $/business-outcome, not just $ total.
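The Matryoshka point as one function; this only works if the embedding model was trained for Matryoshka-style truncation, so check your model card:

```python
# Matryoshka right-sizing: keep a prefix of the vector and renormalize.
import numpy as np

def shrink(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    prefix = embedding[:dims]                 # keep the leading dimensions
    return prefix / np.linalg.norm(prefix)    # renormalize for cosine search

full = np.random.default_rng(0).normal(size=1024)
small = shrink(full, 256)                     # 4x less storage per vector
print(small.shape, round(float(np.linalg.norm(small)), 3))  # (256,) 1.0
```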
13. Implementation Strategy
13.1 Strategic Operating Model
flowchart LR
V[Business value hypothesis] --> UC["Pick 1–2 use cases: narrow, measurable"]
UC --> DATA[Inventory data & knowledge gaps]
DATA --> MVP["Thin-slice MVP: L2 hybrid RAG + contracts"]
MVP --> EV[Eval set + baselines]
EV --> SHIP[Ship behind flag]
SHIP --> MEAS[Measure business + tech KPIs]
MEAS --> ITER[Iterate or kill]
ITER -->|expand| PLAT[Promote shared components to platform]
style MVP fill:#ecfdf5,stroke:#047857
style PLAT fill:#eef2ff,stroke:#3730a3
Figure 13.1 — Use-case-led, platform-emergent. Do not build the platform first.
13.2 The 90-Day Plan for a Software Engineer Stepping In
Days 1–15 Discover
Map data sources, owners, SLAs. Pick 1 high-value use case with a measurable KPI. Stand up a catalog
(OpenMetadata/DataHub/Unity). Define data contracts for the top 5 tables. Choose lakehouse format & engine.
Days 16–35 Foundation
Build Bronze/Silver/Gold for the use-case data. Add CI with dbt/SQLMesh + Great Expectations / Soda. Stand up an
LLM gateway with logging, PII redaction, model routing. Pick embedding model + vector store; version everything.
Days 36–60 Retrieve & Reason
Ship L2 hybrid RAG (BM25 + dense + reranker) with citations. Build a labeled eval set (≥200 queries).
Add observability (Langfuse / Phoenix / LangSmith). If multi-hop matters, prototype GraphRAG over a small KG
built from Silver + extracted entities.
Days 61–80 Harden
Add ACL on retrieval; integrate semantic layer for any numeric questions; introduce shadow + canary; wire
feedback collection. Run red-team prompts; close PII gaps. Establish on-call & runbooks.
Days 81–90 Scale
Promote reusable pieces (chunker, embedder, gateway, eval harness) to a platform repo. Publish the data
contract and prompt registries. Write a postmortem on the first incident. Plan use case #2 — but reuse, don't rebuild.
flowchart LR
SE[Software Engineer] --> DE[Data Engineer]
SE --> MLE[ML Engineer]
DE --> AIDKE[AI Data & Knowledge Engineer]
MLE --> AIDKE
AIDKE --> SAIDKE[Staff AI Platform Engineer]
AIDKE --> ARCH[AI Solutions Architect]
SAIDKE --> PRIN[Principal / Distinguished]
style AIDKE fill:#eef2ff,stroke:#3730a3
Figure 15.1 — Typical career paths converging on the AI Data & Knowledge Engineer role.
16. References & Further Reading
Starter reading only; your vendor contracts and internal policies still win. Canonical HTML: LinhTruong.com — Linh Truong.
Foundational books
Kleppmann — Designing Data-Intensive Applications
Reis & Housley — Fundamentals of Data Engineering
Kimball — The Data Warehouse Toolkit
Allemang, Hendler & Gandon — Semantic Web for the Working Ontologist
Hogan et al. — Knowledge Graphs (open access)
Huyen — Designing Machine Learning Systems; AI Engineering (2024)
Specs & standards
Apache Iceberg / Delta Lake / Hudi specs
W3C: RDF 1.1, OWL 2, SHACL, SKOS, PROV-O
schema.org, FIBO, SNOMED CT, ISO 8000
OpenLineage, OpenTelemetry, MLflow Model Registry
NIST AI RMF, ISO/IEC 42001, EU AI Act
OWASP Top 10 for LLM Applications
Seminal / influential papers
Lewis et al. — Retrieval-Augmented Generation (2020)
Karpukhin et al. — Dense Passage Retrieval (2020)
Khattab & Zaharia — ColBERT / ColBERTv2
Microsoft Research — From Local to Global: GraphRAG (2024)
Anthropic — Contextual Retrieval (2024)
Wang et al. (Google) — Self-Consistency; Dhuliawala et al. (Meta) — Chain-of-Verification
Schick et al. (Meta) — Toolformer; Yao et al. — ReAct
Living resources
Data Engineering Weekly — newsletter
Chip Huyen's blog & AI Engineering newsletter
Eugene Yan — applied ML & LLM systems essays
Hamel Husain — LLM evals, AI Engineering
Jay Alammar / Cohere blog — retrieval & embeddings
Databricks, Snowflake, AWS, GCP engineering blogs
Closing bias. I describe the job in one line: make grounded answers fast, owned, and auditable—so the model is a part of the system you can reason about, not a black box you shrug at after a bad rollout.