Data & knowledge for AI · field notes · May 2026

AI Data & Knowledge Engineer
Grounding models in governed data, retrieval you can defend, and answers you can trace

Source / canonical copy: LinhTruong.com. Share that link if you pass this file around so attribution stays with it.

Notes on the stack I reach for when software engineers have to ship data- and knowledge-grounded AI: ingestion and lakehouse, vectors and graphs, RAG through GraphRAG-style patterns, agents, evaluation, governance, and the LLMOps wiring that keeps quality from drifting. Opinionated, refreshed for 2026, and slanted toward what holds up in audits and postmortems—not demo-day polish.

Lakehouse · RAG / GraphRAG · Knowledge Graphs · Vector + Hybrid Search · LLMOps / DataOps · Agentic AI · Governance & Lineage

1. Executive Summary

Most of the pain I see in production is data and knowledge, not which LLM badge is on the box. Models churn; what lasts is information that is permissioned, curated, and structured enough to retrieve under scrutiny. AI Data & Knowledge Engineering is the work of building and running that layer—not the slide version, the on-call version.

  • ~80% · common share of time spent on data & knowledge work (rule of thumb)
  • +15–50% · typical accuracy lift from hybrid + GraphRAG over naive RAG (see the ladder in §7.1)
  • 10–100× · cost variance from poor retrieval design
  • < 5% · of orgs have production-grade AI data contracts
How I draw it on a whiteboard. Four beats: a lakehouse for raw + curated tables, a semantic layer (ontology + graph) for meaning, vector + hybrid search for recall, and an evaluation + governance spine so you can show provenance when something goes wrong.

2. The Role & Responsibilities

Data Engineer (DE)

Pipes & Plumbing

Owns ingestion, ELT/ETL, lakehouse, streaming, schemas, SLAs.

  • Batch + streaming pipelines
  • Data modeling (3NF, dimensional, Data Vault)
  • Storage formats (Parquet, Iceberg, Delta, Hudi)
  • Orchestration & observability
Knowledge Engineer (KE)

Meaning & Structure

Owns ontologies, taxonomies, knowledge graphs, entity resolution.

  • Domain ontology design (RDF/OWL, SHACL, schema.org)
  • Entity resolution & canonicalization
  • Graph construction & reasoning
  • Semantic / metrics layers
AI / ML Engineer (MLE)

Models & Retrieval

Owns embeddings, retrieval, fine-tuning, agents, evaluation.

  • Embedding & reranker selection
  • Vector indexes (HNSW, IVF, ScaNN, DiskANN)
  • RAG / GraphRAG / agentic patterns
  • Online + offline eval, guardrails

The AI Data & Knowledge Engineer sits at the intersection. A software engineer moving into this role trades "deploy a service" for "deploy a data product" — versioned, contract-checked, observable, and consumable by both humans and models.

```mermaid
flowchart LR
  subgraph DE["Data Engineering"]
    A1[Ingestion<br/>CDC, Events] --> A2[Lakehouse<br/>Bronze/Silver/Gold]
    A2 --> A3[Feature Store]
  end
  subgraph KE["Knowledge Engineering"]
    B1[Ontology / Taxonomy] --> B2[Entity Resolution]
    B2 --> B3[Knowledge Graph]
    A2 --> B2
  end
  subgraph MLE["AI / ML Engineering"]
    C1[Embeddings] --> C2[Vector Index]
    C2 --> C3[Hybrid + GraphRAG]
    C3 --> C4[Agents / Apps]
    A3 --> C1
    B3 --> C3
  end
  style DE fill:#eef2ff,stroke:#3730a3
  style KE fill:#f5f3ff,stroke:#5b21b6
  style MLE fill:#ecfeff,stroke:#155e75
```
Figure 2.1 — The three pillars and where they meet.

3. Reference Architecture

The canonical end-to-end pipeline a Software Engineer should be able to draw on a whiteboard:

```mermaid
flowchart TB
  subgraph S["① Sources"]
    S1[OLTP DBs<br/>Postgres, MySQL]
    S2[SaaS APIs<br/>Salesforce, Stripe]
    S3[Event Streams<br/>Kafka, Kinesis]
    S4[Files / Docs<br/>PDF, HTML, S3]
    S5[Unstructured<br/>Images, Audio, Video]
  end
  subgraph I["② Ingest & Capture"]
    I1[CDC<br/>Debezium / Fivetran]
    I2[Stream Processors<br/>Flink / Spark Structured]
    I3[Doc Loaders & Parsers<br/>Unstructured.io / Docling]
  end
  subgraph L["③ Lakehouse"]
    L1[Bronze<br/>raw immutable]
    L2[Silver<br/>cleaned, conformed]
    L3[Gold<br/>business / serving]
  end
  subgraph K["④ Semantic & Knowledge"]
    K1[Metrics / Semantic Layer<br/>dbt Semantic, Cube]
    K2[Entity Resolution<br/>Zingg / Splink]
    K3[Knowledge Graph<br/>Neo4j / TigerGraph / RDF]
    K4[Ontology<br/>OWL / SHACL / SKOS]
  end
  subgraph A["⑤ AI Surfaces"]
    A1[Feature Store<br/>Feast / Tecton]
    A2[Vector Store<br/>pgvector / Qdrant / Milvus]
    A3[Reranker + Hybrid Search]
    A4[LLM Gateway<br/>routing, caching, PII]
    A5[Agents & Tools]
  end
  subgraph X["⑥ Applications"]
    X1[Copilots / Assistants]
    X2[Search & Q&A]
    X3[Decision / Recsys]
    X4[Analytics & BI]
  end
  subgraph G["Cross-cutting: Governance, Eval, Ops"]
    G1[Catalog + Lineage<br/>Unity / OpenMetadata / DataHub]
    G2[Quality & Contracts<br/>Great Expectations / Soda]
    G3[Eval & Observability<br/>LangSmith / Arize / Phoenix]
    G4[Security & Privacy<br/>RBAC, ABAC, PII, DLP]
    G5[CI/CD/CT & FinOps]
  end
  S1 --> I1 --> L1
  S2 --> I1
  S3 --> I2 --> L1
  S4 --> I3 --> L1
  S5 --> I3
  L1 --> L2 --> L3
  L3 --> K1
  L2 --> K2 --> K3
  K4 --> K3
  L3 --> A1
  L2 --> A2
  K3 --> A3
  A1 --> A5
  A2 --> A3 --> A5
  A4 --> A5
  A5 --> X1
  A3 --> X2
  A1 --> X3
  K1 --> X4
  G1 -.-> L1
  G1 -.-> K3
  G2 -.-> L2
  G3 -.-> A5
  G4 -.-> A4
  G5 -.-> A5
```
Figure 3.1 — Reference architecture for an AI Data & Knowledge platform.
Engineer's heuristic. If a box on this diagram has no owner, no SLO, and no contract, it is not a system — it is a future incident.

4. The 2026 Tech Stack (Picks & Trade-offs)

| Layer | Strong picks (2026) | When to choose | Watch-outs |
|---|---|---|---|
| Lakehouse storage | Apache Iceberg, Delta Lake, Hudi | Iceberg for openness/multi-engine; Delta if Databricks-heavy | Table maintenance (compaction, expiration) is mandatory |
| Query / compute | DuckDB, Trino, Spark, Snowflake, BigQuery | DuckDB for local dev & small data; Trino for federation; Spark/Snowflake at scale | Cost explodes with bad partitioning / scans |
| Streaming | Kafka, Redpanda, Flink, Materialize | Flink for stateful; Materialize for SQL-on-streams | Schema evolution + exactly-once need design |
| Transformations | dbt, SQLMesh, Spark, Polars | dbt for SQL teams; SQLMesh for column-level lineage & virtual envs | Macro sprawl; lacking tests |
| Orchestration | Dagster, Airflow 2.x, Prefect, Temporal | Dagster for asset-centric; Temporal for durable agent workflows | DAG-of-DAGs sprawl |
| Vector store | pgvector, Qdrant, Milvus, Weaviate, LanceDB | pgvector if Postgres-shop; Qdrant/Milvus for scale + filters | Index rebuild cost; payload schema drift |
| Embeddings | OpenAI text-embedding-3, Voyage, Cohere v3, BGE-M3, E5-mistral | Multilingual: BGE-M3 / Cohere; long-context: Voyage | Re-embedding is expensive — version aggressively |
| Rerankers | Cohere Rerank, Voyage Rerank, BGE-reranker, ColBERT-v2 | Always add a reranker; +10–30% nDCG typical | Latency budget; batch carefully |
| Knowledge graph | Neo4j, TigerGraph, ArangoDB, Memgraph, Apache Jena, GraphDB | RDF/OWL for ontology rigor; LPG for app speed | Don't model the world — model the use case |
| Entity resolution | Zingg, Splink, dedupe.io, Senzing | Splink for transparency; Zingg ML-first | Threshold tuning, manual review queues |
| Feature store | Feast, Tecton, Databricks FS | Feast for OSS; Tecton for streaming features | Online/offline skew is the #1 bug |
| LLM gateway | LiteLLM, Portkey, Kong AI, Cloudflare AI Gateway | Always put a gateway in front: routing, retries, caching, PII, $$ | One more hop — measure latency |
| Eval & observability | LangSmith, Arize Phoenix, Langfuse, Ragas, TruLens | Phoenix/Langfuse open-source; Arize/LangSmith managed | Eval datasets must be versioned like code |
| Catalog & lineage | Unity Catalog, OpenMetadata, DataHub, Apache Atlas | Unity if Databricks; OpenMetadata/DataHub otherwise | Lineage is only useful if it's column-level |
| Quality / contracts | Great Expectations, Soda, dbt tests, Monte Carlo, Acceldata | GE/Soda in CI; Monte Carlo for runtime anomaly detection | Tests rot — review quarterly |
Tool-chasing kills. Pick one tool per layer. Standardize. The bottleneck is rarely the tool — it's schema discipline and ownership.

5. Data Engineering Pillar

5.1 The Medallion Pattern (Bronze → Silver → Gold)

```mermaid
flowchart LR
  R[Raw sources] -->|append-only, schema-on-read| B[Bronze<br/>immutable, replayable]
  B -->|deduplicate, conform, type, mask PII| S[Silver<br/>cleaned, conformed]
  S -->|aggregate, model, semantic| G[Gold<br/>business-facing]
  G --> FS[Feature Store]
  G --> SEM[Semantic Layer]
  S --> VEC[Vector / Search Index]
  S --> KG[Knowledge Graph]
  style B fill:#fff7ed,stroke:#9a3412
  style S fill:#f1f5f9,stroke:#475569
  style G fill:#ecfdf5,stroke:#047857
```
Figure 5.1 — Medallion lakehouse layers and their downstream consumers.

5.2 Storage & Format Choices

5.3 Streaming vs Batch Decision

| Need | Choose | Why |
|---|---|---|
| Reporting, training data | Batch (hourly/daily) | Simpler, cheaper, easier to test |
| Feature freshness < 1 min | Streaming (Flink/Spark SS) | Latency-bound use cases |
| CDC from OLTP | Debezium → Kafka → Iceberg | Replayable, no source load |
| Document ingestion | Event-driven (S3 → queue → worker) | Bursty, idempotent |

5.4 The Modern Document Ingestion Pipeline (for RAG)

```mermaid
flowchart LR
  F[File arrives<br/>PDF / HTML / DOCX / Image] --> P[Parse + Layout<br/>Docling / Unstructured / Azure DI]
  P --> N[Normalize<br/>Markdown / JSON blocks]
  N --> E[Extract structure<br/>tables, headings, captions]
  E --> C[Chunk<br/>semantic + sliding + parent-doc]
  C --> M[Enrich<br/>title, summary, entities, ACL]
  M --> EM[Embed<br/>multi-vector / late-interaction]
  EM --> V[Vector store]
  M --> K[Knowledge graph]
  M --> S[Search index BM25]
  style P fill:#eef2ff,stroke:#3730a3
  style C fill:#ecfeff,stroke:#155e75
  style EM fill:#f5f3ff,stroke:#5b21b6
```
Figure 5.2 — Document → retrieval-ready pipeline. Skipping layout-aware parsing is the #1 RAG quality killer.

5.5 Chunking Strategies (ranked, 2026)

  1. Structure-aware (headings, sections, tables) — best default; see the sketch after this list.
  2. Semantic chunking (embedding-based boundary detection).
  3. Parent-document retrieval (embed small, return large).
  4. Late-chunking / contextual embedding (embed in context of doc; 2024+ technique).
  5. Fixed-size sliding window — only as fallback.
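A minimal sketch of strategy 1 with strategy 5 as the fallback, assuming Markdown input (what the normalize step in Figure 5.2 emits); `max_chars`, `overlap`, and the heading regex are illustrative defaults, not tuned values:

```python
import re
from typing import Iterator

HEADING = re.compile(r"^#{1,4} ", re.MULTILINE)

def chunk_markdown(doc: str, max_chars: int = 4000, overlap: int = 400) -> Iterator[dict]:
    """Structure-aware chunking: split at headings, window only oversized sections."""
    starts = [m.start() for m in HEADING.finditer(doc)]
    if not starts or starts[0] != 0:
        starts = [0] + starts                # keep any preamble before the first heading
    for lo, hi in zip(starts, starts[1:] + [len(doc)]):
        section = doc[lo:hi].strip()
        if not section:
            continue
        title = section.splitlines()[0]
        if len(section) <= max_chars:
            yield {"title": title, "text": section}
        else:
            # Fallback: fixed-size sliding window within the oversized section.
            step = max_chars - overlap
            for i in range(0, len(section), step):
                yield {"title": title, "text": section[i : i + max_chars]}
```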

6. Knowledge Engineering Pillar

6.1 The Knowledge Stack

```mermaid
flowchart TB
  O[Ontology<br/>classes, properties, axioms<br/>OWL / SHACL / schema.org]
  T[Taxonomy / Controlled Vocabulary<br/>SKOS]
  ER[Entity Resolution<br/>canonical IDs]
  KG[Knowledge Graph<br/>typed nodes & edges]
  SR[Semantic Reasoner / Rules<br/>SHACL, SWRL, Datalog]
  SL[Semantic / Metrics Layer<br/>dbt-semantic, Cube, MetricFlow]
  APP[Apps + LLMs]
  O --> T
  O --> ER
  O --> KG
  T --> KG
  ER --> KG
  KG --> SR
  KG --> SL
  SL --> APP
  KG --> APP
  SR --> APP
  style O fill:#f5f3ff,stroke:#5b21b6
  style KG fill:#ecfeff,stroke:#155e75
  style SL fill:#ecfdf5,stroke:#047857
```
Figure 6.1 — Knowledge layer: ontology → taxonomy → ER → KG → reasoning → semantic layer.

6.2 Ontology Design — Rules of Thumb

6.3 Entity Resolution Pipeline

```mermaid
flowchart LR
  R[Raw records] --> N[Normalize<br/>casing, unicode, addresses]
  N --> B[Blocking<br/>candidate pairs]
  B --> SC[Score<br/>ML / rules / Fellegi-Sunter]
  SC --> CL[Cluster<br/>connected components]
  CL --> RV[Human review<br/>active learning]
  RV --> CN[Canonical IDs + survivorship]
  CN --> KG[Knowledge Graph]
  style SC fill:#eef2ff,stroke:#3730a3
  style RV fill:#fffbeb,stroke:#92400e
```
Figure 6.2 — ER is a loop, not a one-shot. Budget for review tooling.

6.4 The Semantic Layer

A semantic layer defines metrics, dimensions, and joins once so BI, apps, and LLMs all answer the same question the same way. This is the difference between "ChatGPT for our database" demos and systems that survive an audit.

```yaml
# dbt Semantic Layer / MetricFlow style
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params: { time_granularity: day }
      - name: status
        type: categorical
    measures:
      - name: revenue
        agg: sum
        expr: gross_amount - discounts

metrics:
  - name: weekly_active_revenue
    type: simple
    type_params:
      measure: revenue
    filter: "{{ Dimension('orders__status') }} = 'completed'"
```

7. Retrieval & RAG Patterns

7.1 The RAG Maturity Ladder

| Level | Pattern | When good enough | Typical lift |
|---|---|---|---|
| L0 | Stuff context into prompt | POC, < 5 docs | — |
| L1 | Naive vector RAG | FAQ, narrow domain | baseline |
| L2 | Hybrid (BM25 + vector) + rerank | Most production cases | +15–30% nDCG |
| L3 | Query rewriting + HyDE + multi-query | Ambiguous queries | +5–15% |
| L4 | Parent-doc / contextual / late-chunking | Long docs, technical PDFs | +10–20% |
| L5 | GraphRAG / hybrid graph + vector | Multi-hop, entity-heavy | +20–50% on multi-hop |
| L6 | Agentic retrieval (planner + tools) | Cross-source reasoning | opens new capabilities |

7.2 Hybrid Retrieval Architecture (L2 — production default)

```mermaid
flowchart LR
  Q[User query] --> QR[Query rewriting<br/>+ classification]
  QR --> P1[Lexical BM25 / SPLADE]
  QR --> P2[Dense vector kNN]
  QR --> P3[Metadata filters<br/>ACL, time, source]
  P1 --> F[Fusion RRF / weighted]
  P2 --> F
  P3 --> F
  F --> RR[Cross-encoder reranker]
  RR --> CTX[Context packer<br/>dedup, diversify, cite]
  CTX --> LLM[LLM]
  LLM --> ANS[Answer + citations]
  ANS --> EV[Eval / trace]
  style F fill:#eef2ff,stroke:#3730a3
  style RR fill:#ecfeff,stroke:#155e75
  style EV fill:#fffbeb,stroke:#92400e
```
Figure 7.1 — Hybrid retrieval is the production default. Always include a reranker and citations.

7.3 GraphRAG (L5)

GraphRAG augments retrieval with a knowledge graph: entities and relationships extracted from the corpus become first-class retrieval keys. It is strong for multi-hop questions ("Which suppliers of company X were affected by event Y?") and for global, corpus-wide summaries that vector search alone cannot produce.

```mermaid
flowchart TB
  DOC[Corpus] --> EX[Entity + relation extraction<br/>LLM + NER]
  EX --> KG[(Knowledge Graph)]
  KG --> COM[Community detection<br/>Leiden / Louvain]
  COM --> SUM[Community summaries]
  Q[Query] --> QC[Query classifier]
  QC -->|local| LK[Local: ego-network around entity]
  QC -->|global| GS[Global: community summaries map-reduce]
  LK --> CTX[Context]
  GS --> CTX
  CTX --> LLM
  style KG fill:#f5f3ff,stroke:#5b21b6
  style COM fill:#ecfeff,stroke:#155e75
```
Figure 7.2 — GraphRAG (Microsoft Research, 2024) flow: local vs global queries.
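A minimal sketch of the local branch, assuming the extracted graph lives in networkx with entity names as node ids and a `type` attribute on edges; the radius and the triple format are illustrative choices:

```python
import networkx as nx

def local_context(kg: nx.MultiDiGraph, entity: str, radius: int = 2) -> str:
    """Expand the ego-network around an entity and linearize it for the prompt."""
    ego = nx.ego_graph(kg, entity, radius=radius, undirected=True)
    triples = [
        f"{u} -[{data.get('type', 'related_to')}]-> {v}"
        for u, v, data in kg.edges(data=True)
        if u in ego and v in ego
    ]
    return "\n".join(sorted(set(triples)))
```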

7.4 Practical RAG Code Skeleton

```python
from typing import List

def retrieve(query: str, user: str, k: int = 50) -> List[dict]:
    # 1. Rewrite + classify (multi-query)
    rewrites = llm_rewrite(query, n=3)
    # 2. Parallel lexical + dense, with ACL filters applied at query time
    lex   = bm25.search(rewrites, k=k)
    dense = vector.search(embed(rewrites), k=k, filter=acl(user))
    # 3. Reciprocal Rank Fusion (see the rrf sketch below)
    fused = rrf([lex, dense], k=k)
    # 4. Cross-encoder rerank the fused candidates, keep the best 8
    return reranker.rerank(query, fused[:k])[:8]

def answer(query: str, user: str) -> dict:
    ctx = retrieve(query, user)
    packed = pack_context(ctx, max_tokens=6_000, dedup=True)
    out = llm.chat(
        system=SYSTEM_WITH_CITATIONS,
        user=query,
        context=packed,
        tools=[search_kg, lookup_metric],   # optional agentic hop
    )
    log_trace(query, ctx, out)              # eval / observability
    return out
```
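The `rrf` call above is where the hybrid lists meet. A minimal reciprocal-rank-fusion sketch, assuming each result list is ordered best-first and every hit carries a stable `id`; the constant 60 is the conventional default from the original RRF paper:

```python
from collections import defaultdict

def rrf(result_lists: list[list[dict]], k: int = 50, c: int = 60) -> list[dict]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (c + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    by_id: dict[str, dict] = {}
    for results in result_lists:
        for rank, hit in enumerate(results, start=1):
            scores[hit["id"]] += 1.0 / (c + rank)
            by_id[hit["id"]] = hit
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [by_id[i] for i in top]
```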
The four retrieval failure modes — memorize these: (1) Recall miss (right doc not retrieved), (2) Ranking miss (retrieved but buried), (3) Context dilution (good chunk drowned by noise), (4) Grounding miss (LLM ignores context). Each has different fixes; eval should measure each separately.

8. Agentic Data Workflows

Agents make data & knowledge systems active: they plan, call tools, branch, and self-correct. The software-engineer mindset: an agent is a state machine with tool-calls and a budget, not magic.

```mermaid
flowchart TB
  U[User goal] --> PL[Planner / Router]
  PL -->|simple| RAG[RAG tool]
  PL -->|structured| SQL[Text-to-SQL on semantic layer]
  PL -->|graph| KG[KG query tool]
  PL -->|action| API[App API tool]
  RAG --> CR[Critic / Verifier]
  SQL --> CR
  KG --> CR
  API --> CR
  CR -->|ok| ANS[Answer + citations]
  CR -->|retry| PL
  ANS --> MEM[(Memory / Trace)]
  MEM -.-> PL
  style PL fill:#eef2ff,stroke:#3730a3
  style CR fill:#fffbeb,stroke:#92400e
```
Figure 8.1 — Plan → tools → critic loop. Bounded retries and explicit budgets are non-negotiable.
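A framework-free sketch of that state machine; `llm_plan` and `llm_verify` are assumed helpers (the plan returns a tool name plus args, or a finish action), and the budget numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_steps: int = 6
    max_cost_usd: float = 0.50
    spent_usd: float = 0.0

def run_agent(goal: str, tools: dict, budget: Budget | None = None) -> dict:
    budget = budget or Budget()
    trace: list[dict] = []
    for step in range(budget.max_steps):                 # hard step cap
        plan = llm_plan(goal, trace)                     # choose tool + args, or finish
        if plan["action"] == "finish":
            return {"answer": plan["answer"], "trace": trace}
        result = tools[plan["action"]](**plan["args"])
        budget.spent_usd += plan.get("cost_usd", 0.0)
        trace.append({"step": step, "action": plan["action"], "result": result})
        if budget.spent_usd > budget.max_cost_usd:       # hard dollar cap
            break
        if not llm_verify(goal, result):                 # critic step; loop replans
            trace.append({"step": step, "note": "verifier rejected, replanning"})
    return {"answer": None, "trace": trace, "error": "budget exhausted"}
```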

8.1 Engineering Rules for Production Agents

8.2 Text-to-SQL on the Semantic Layer

Don't let LLMs write raw SQL against raw tables. Expose the semantic layer as the tool surface. The LLM picks metrics + dimensions + filters; the semantic layer compiles deterministic SQL. This collapses error rates from ~50% to <5% on real schemas.
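A sketch of that tool surface using the metrics defined in §6.4; the JSON schema is what the LLM sees, and `semantic_layer.query` stands in for whatever compiler your metrics layer exposes (hypothetical names throughout):

```python
# Constrained tool schema: the LLM picks from enumerated metrics and dimensions.
METRIC_QUERY_TOOL = {
    "name": "query_metrics",
    "description": "Answer numeric questions via governed metrics. Never writes SQL.",
    "parameters": {
        "type": "object",
        "properties": {
            "metrics":    {"type": "array", "items": {"enum": ["revenue", "weekly_active_revenue"]}},
            "group_by":   {"type": "array", "items": {"enum": ["order_date", "status"]}},
            "filters":    {"type": "array", "items": {"type": "string"}},
            "time_grain": {"enum": ["day", "week", "month"]},
        },
        "required": ["metrics"],
    },
}

def query_metrics(metrics, group_by=(), filters=(), time_grain="day"):
    # The semantic layer, not the LLM, compiles deterministic SQL.
    return semantic_layer.query(metrics=list(metrics), group_by=list(group_by),
                                where=list(filters), grain=time_grain)
```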

9. Data & Knowledge Quality

9.1 Data Contracts

A data contract is a producer-signed schema + SLA + semantics. It is the single biggest cultural shift required to make AI data reliable.

```yaml
# contract: orders.v3
owner: team-commerce
schema:
  order_id: { type: uuid, pii: false, required: true }
  customer_id: { type: uuid, pii: false, required: true }
  amount_cents: { type: int64, required: true, range: [0, 10_000_000] }
  status: { type: enum, values: [pending, paid, refunded, cancelled] }
  created_at: { type: timestamp_utc, required: true }
freshness_sla: 15m
volume_sla: { min_rows_per_day: 1_000, max_change_pct: 30 }
quality_checks:
  - unique: [order_id]
  - not_null: [order_id, customer_id, amount_cents, created_at]
  - referential: { customer_id: customers.customer_id }
breaking_change_policy: deprecate_30d_then_remove
consumers: [feature_store.orders_v3, kg.order_node, dashboards.revenue]
```
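A minimal sketch of enforcing the schema portion of this contract in CI, assuming the YAML above sits on disk and the day's batch arrives as a pandas DataFrame; freshness and referential checks are left out for brevity:

```python
import pandas as pd
import yaml

def check_contract(df: pd.DataFrame, contract_path: str) -> list[str]:
    contract = yaml.safe_load(open(contract_path))
    errors: list[str] = []
    for col, spec in contract["schema"].items():
        if spec.get("required") and (col not in df or df[col].isna().any()):
            errors.append(f"{col}: required but missing/null")
        if "range" in spec and col in df:
            lo, hi = spec["range"]
            if not df[col].dropna().between(lo, hi).all():
                errors.append(f"{col}: out of range [{lo}, {hi}]")
        if "values" in spec and col in df:
            bad = set(df[col].dropna()) - set(spec["values"])
            if bad:
                errors.append(f"{col}: unexpected values {sorted(bad)}")
    for check in contract.get("quality_checks", []):
        for col in check.get("unique", []):
            if df[col].duplicated().any():
                errors.append(f"{col}: duplicates found")
    return errors  # a non-empty list fails the CI step
```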

9.2 The Test Pyramid for Data

```mermaid
flowchart TB
  A[Unit tests on transforms<br/>fast, mocked] --> B[Contract tests on inputs<br/>schema + ranges]
  B --> C[Integration tests on pipelines<br/>fixtures + assertions]
  C --> D[Property tests on invariants<br/>row counts, distributions]
  D --> E[Runtime anomaly detection<br/>volume, freshness, schema drift]
  style A fill:#ecfdf5,stroke:#047857
  style E fill:#fef2f2,stroke:#b91c1c
```
Figure 9.1 — Test pyramid. Don't skip the bottom in favor of dashboards at the top.
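One worked rung from the middle of the pyramid: a pytest-style property test on Silver orders. The `silver_orders` fixture and the 20% refund band are illustrative assumptions:

```python
import pandas as pd

def test_orders_invariants(silver_orders: pd.DataFrame):
    # Property tests: invariants that must hold on any day's data.
    assert silver_orders["order_id"].is_unique
    assert (silver_orders["amount_cents"] >= 0).all()
    # Distribution guard: refund share should stay in a sane band.
    refund_share = (silver_orders["status"] == "refunded").mean()
    assert refund_share < 0.20, f"refund share suspicious: {refund_share:.1%}"
```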

9.3 Knowledge Quality

10. Evaluation & Observability

10.1 Eval Strategy Matrix

| Layer | What you measure | How |
|---|---|---|
| Retrieval | Recall@k, MRR, nDCG, context precision | Labeled query set + LLM-as-judge for relevance (sketch below) |
| Generation | Faithfulness, answer relevance, citation correctness | RAGAS, TruLens, custom LLM judges with rubrics |
| Task | Task success rate, tool-call accuracy | Scenario suite + golden traces |
| Safety | PII leak, jailbreak resistance, toxicity | Red-team prompts + classifiers |
| Business | Deflection rate, CSAT, $ saved/earned | Online A/B + product analytics |
| Cost / latency | p50/p95 latency, $/query, cache hit rate | Gateway metrics + traces |
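A minimal sketch for the retrieval row, assuming a labeled set where each row carries the query, the acting user, and the set of relevant chunk ids, plus the `retrieve` function from §7.4:

```python
def recall_at_k(eval_set: list[dict], k: int = 8) -> float:
    """Macro-averaged recall@k over rows like
    {"query": str, "user": str, "relevant_ids": set[str]}."""
    per_query = []
    for row in eval_set:
        got = {hit["id"] for hit in retrieve(row["query"], row["user"])[:k]}
        rel = row["relevant_ids"]
        per_query.append(len(got & rel) / len(rel))
    return sum(per_query) / len(per_query)
```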

10.2 Offline → Online Loop

```mermaid
flowchart LR
  LOG[Prod traces] --> CUR[Curate eval set<br/>hard, edge, regression]
  CUR --> OFF[Offline eval<br/>regression suite in CI]
  OFF --> CHG[Change<br/>prompt / index / model]
  CHG --> SHADOW[Shadow / canary]
  SHADOW --> AB[A/B test online]
  AB --> WIN{wins?}
  WIN -->|yes| ROLL[Rollout]
  WIN -->|no| BACK[Roll back + add to eval]
  BACK --> CUR
  ROLL --> LOG
  style OFF fill:#ecfdf5,stroke:#047857
  style AB fill:#eef2ff,stroke:#3730a3
```
Figure 10.1 — The eval flywheel. Every prod incident becomes a regression test.

10.3 LLM-as-Judge — Use Wisely

11. MLOps / LLMOps

11.1 The Four Lifecycles

```mermaid
flowchart LR
  subgraph DATA["Data lifecycle"]
    D1[Ingest] --> D2[Transform] --> D3[Validate] --> D4[Publish]
  end
  subgraph INDEX["Index lifecycle"]
    I1[Parse + chunk] --> I2[Embed] --> I3[Build index] --> I4[Promote]
  end
  subgraph MODEL["Model lifecycle"]
    M1[Select / fine-tune] --> M2[Eval] --> M3[Register] --> M4[Deploy]
  end
  subgraph APP["App / agent lifecycle"]
    A1[Prompt / graph] --> A2[Eval] --> A3[Canary] --> A4[Release]
  end
  D4 --> I1
  D4 --> M1
  I4 --> A1
  M4 --> A1
```
Figure 11.1 — Four independent lifecycles. Each needs versioning, CI, and rollback.

11.2 CI / CD / CT (Continuous Training/Tuning)

11.3 Versioning Everything

| Artifact | How to version |
|---|---|
| Datasets | Iceberg/Delta snapshots, LakeFS, DVC |
| Schemas / contracts | Git + semver, registered in catalog |
| Embeddings | Model id + dim + normalization + chunker version |
| Indexes | Alias swap (blue/green); never mutate in place (sketch below) |
| Prompts | Git + prompt registry (LangSmith, Langfuse) |
| Agent graphs | Code + graph definition checksum |
| Eval sets | Versioned in object store with row-level lineage |
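A sketch of the index rule, with the alias pointer kept outside the index itself; a JSON file stands in for whatever alias mechanism your vector store or search engine provides (names are illustrative):

```python
import json
import pathlib

REGISTRY = pathlib.Path("index_aliases.json")

def promote(alias: str, new_index: str) -> str | None:
    """Blue/green: build e.g. docs-v42 offline, then atomically repoint the alias."""
    aliases = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    old = aliases.get(alias)
    aliases[alias] = new_index          # e.g. "docs" -> "docs-v42"
    REGISTRY.write_text(json.dumps(aliases, indent=2))
    return old                          # keep the old index around for rollback
```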

12. Governance, Security, and Cost

12.1 Data Governance Pillars

  • Catalog: searchable inventory of every asset, owner, freshness, contract.
  • Lineage: column-level upstream/downstream — essential for impact analysis.
  • Access: RBAC + ABAC + row/column policies; propagate to vector store payloads.
  • Privacy: PII tagging, masking, tokenization, DSAR support, retention rules.

12.2 PII & RAG — Specific Hazards

Vector stores leak. If a document contains PII, the chunk, the embedding, AND any metadata payload must be treated as PII. ACL must be checked at query time against payload tags — not just at ingestion.
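A minimal sketch of that query-time check, assuming ingestion stamped each chunk payload with `acl_groups` and a `pii` flag; enforcement lives in the retrieval path, not just at ingestion:

```python
def acl_filter(hits: list[dict], user_groups: set[str], allow_pii: bool = False) -> list[dict]:
    """Drop chunks the caller may not see: after retrieval, before the LLM."""
    visible = []
    for hit in hits:
        payload = hit["payload"]
        if not user_groups & set(payload.get("acl_groups", [])):
            continue                       # no overlapping group: invisible
        if payload.get("pii") and not allow_pii:
            continue                       # PII-tagged chunk needs an explicit grant
        visible.append(hit)
    return visible
```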

12.3 Compliance Map (2026)

| Regime | What it forces on you |
|---|---|
| GDPR / CCPA / CPRA | Lawful basis, DSAR, deletion across all stores including vectors |
| EU AI Act | Risk class, data governance evidence, transparency, logging |
| SOC 2 / ISO 27001 | Access logs, change management, vendor mgmt |
| HIPAA / PCI / FINRA | Sector-specific encryption, retention, audit |
| NIST AI RMF / ISO 42001 | AI management system, eval, incident response |

12.4 FinOps for AI Data

13. Implementation Strategy

13.1 Strategic Operating Model

```mermaid
flowchart LR
  V[Business value hypothesis] --> UC[Pick 1–2 use cases<br/>narrow, measurable]
  UC --> DATA[Inventory data & knowledge gaps]
  DATA --> MVP[Thin slice MVP<br/>L2 hybrid RAG + contracts]
  MVP --> EV[Eval set + baselines]
  EV --> SHIP[Ship behind flag]
  SHIP --> MEAS[Measure business + tech KPIs]
  MEAS --> ITER[Iterate or kill]
  ITER -->|expand| PLAT[Promote shared components to platform]
  style MVP fill:#ecfdf5,stroke:#047857
  style PLAT fill:#eef2ff,stroke:#3730a3
```
Figure 13.1 — Use-case-led, platform-emergent. Do not build the platform first.

13.2 The 90-Day Plan for a Software Engineer Stepping In

Days 1–15 · Discover. Map data sources, owners, SLAs. Pick 1 high-value use case with a measurable KPI. Stand up a catalog (OpenMetadata/DataHub/Unity). Define data contracts for the top 5 tables. Choose lakehouse format & engine.

Days 16–35 · Foundation. Build Bronze/Silver/Gold for the use-case data. Add CI with dbt/SQLMesh + Great Expectations / Soda. Stand up an LLM gateway with logging, PII redaction, model routing. Pick embedding model + vector store; version everything.

Days 36–60 · Retrieve & Reason. Ship L2 hybrid RAG (BM25 + dense + reranker) with citations. Build a labeled eval set (≥200 queries). Add observability (Langfuse / Phoenix / LangSmith). If multi-hop matters, prototype GraphRAG over a small KG built from Silver + extracted entities.

Days 61–80 · Harden. Add ACL on retrieval; integrate the semantic layer for any numeric questions; introduce shadow + canary; wire feedback collection. Run red-team prompts; close PII gaps. Establish on-call & runbooks.

Days 81–90 · Scale. Promote reusable pieces (chunker, embedder, gateway, eval harness) to a platform repo. Publish the data contract and prompt registries. Write a postmortem on the first incident. Plan use case #2 — but reuse, don't rebuild.

13.3 Decision Records You'll Need

14. Anti-patterns & Pitfalls

| Anti-pattern | Why it kills | Do instead |
|---|---|---|
| Naive RAG in prod | Recall + ranking are both bad; no citations | Hybrid + rerank + citations from day one |
| Re-embedding without versioning | Index mismatch; silent quality cliff | Embedding artifact = (model, dim, normalizer, chunker) tuple |
| One giant chunk size for all docs | Underfits long technical PDFs; wastes context on short ones | Structure-aware + parent-doc retrieval |
| LLM writes raw SQL on raw tables | Hallucinated joins, wrong metrics, no auditability | Expose semantic layer as the tool |
| Ontology-first boil-the-ocean | Months of modeling, zero shipped value | Use-case ontology slice; reuse standards |
| No data contracts | Producers break consumers silently | Contracts in Git, CI-enforced |
| Eval = vibes | Cannot detect regressions; can't justify changes | Versioned eval set + automated regression in CI |
| Agents with unlimited steps/$$ | Runaway cost, infinite loops | Hard budgets, durable execution, critic step |
| PII tags lost at chunk time | Vector store leaks; ACL bypass | Carry PII + ACL tags into payload; enforce at query |
| Single vendor, single model | Outage + price = existential risk | Gateway + multi-model routing + fallback |
| "Chat with your data" demos shipped as products | No grounding, no metrics, no governance | Narrow scope, semantic layer, citations, eval |

15. Skills Matrix & Career Path

15.1 Core Competencies (T-shape)

Foundations (must)

  • Python + typed SQL
  • Distributed systems basics
  • Parquet / columnar, partitioning
  • Git, CI/CD, Docker, IaC
  • Probability + stats reading-level

Data (must)

  • Lakehouse formats (Iceberg/Delta)
  • dbt / SQLMesh
  • Spark / DuckDB / Polars
  • Kafka + stream processing
  • Great Expectations / Soda

Knowledge (deepen)

  • RDF/OWL or LPG modeling
  • SHACL / Cypher / SPARQL
  • Entity resolution (Splink/Zingg)
  • Semantic / metrics layers

AI / Retrieval (deepen)

  • Embeddings + reranking
  • Hybrid + GraphRAG patterns
  • Prompting + structured outputs
  • Agent frameworks (LangGraph, LlamaIndex, Haystack)
  • Eval (RAGAS, Phoenix, Langfuse)

Ops & Platform

  • Orchestration (Dagster/Airflow)
  • Feature stores (Feast/Tecton)
  • Observability + tracing
  • FinOps for AI workloads

Governance & Soft

  • Data contracts & ADRs
  • Privacy: GDPR / AI Act
  • Threat modeling (LLM-specific: OWASP LLM Top 10)
  • Stakeholder + product partnering

15.2 Career Ladder

```mermaid
flowchart LR
  SE[Software Engineer] --> DE[Data Engineer]
  SE --> MLE[ML Engineer]
  DE --> AIDKE[AI Data & Knowledge Engineer]
  MLE --> AIDKE
  AIDKE --> SAIDKE[Staff AI Platform Engineer]
  AIDKE --> ARCH[AI Solutions Architect]
  SAIDKE --> PRIN[Principal / Distinguished]
  style AIDKE fill:#eef2ff,stroke:#3730a3
```
Figure 15.1 — Typical career paths converging on the AI Data & Knowledge Engineer role.

16. References & Further Reading

Starter reading only; your vendor contracts and internal policies still win. Canonical HTML: LinhTruong.com (Linh Truong).

Foundational books

  • Kleppmann — Designing Data-Intensive Applications
  • Reis & Housley — Fundamentals of Data Engineering
  • Kimball — The Data Warehouse Toolkit
  • Allemang, Hendler & Gandon — Semantic Web for the Working Ontologist
  • Hogan et al. — Knowledge Graphs (open access)
  • Huyen — Designing Machine Learning Systems; AI Engineering (2024)

Specs & standards

  • Apache Iceberg / Delta Lake / Hudi specs
  • W3C: RDF 1.1, OWL 2, SHACL, SKOS, PROV-O
  • schema.org, FIBO, SNOMED CT, ISO 8000
  • OpenLineage, OpenTelemetry, MLflow Model Registry
  • NIST AI RMF, ISO/IEC 42001, EU AI Act
  • OWASP Top 10 for LLM Applications

Seminal / influential papers

  • Lewis et al. — Retrieval-Augmented Generation (2020)
  • Karpukhin et al. — Dense Passage Retrieval (2020)
  • Khattab & Zaharia — ColBERT / ColBERTv2
  • Microsoft Research — From Local to Global: GraphRAG (2024)
  • Anthropic — Contextual Retrieval (2024)
  • Wang et al. (Google) — Self-Consistency; Dhuliawala et al. (Meta) — Chain-of-Verification
  • Schick et al. (Meta) — Toolformer; Yao et al. — ReAct

Living resources

  • Data Engineering Weekly (newsletter)
  • Chip Huyen's blog & AI Engineering newsletter
  • Eugene Yan — applied ML & LLM systems essays
  • Hamel Husain — LLM evals, AI Engineering
  • Jay Alammar / Cohere blog — retrieval & embeddings
  • Databricks, Snowflake, AWS, GCP engineering blogs

Closing bias. I describe the job in one line: make grounded answers fast, owned, and auditable—so the model is a part of the system you can reason about, not a black box you shrug at after a bad rollout.