Data & knowledge for AI · field notes · May 2026

AI Data & Knowledge Engineer
Grounding models in governed data, retrieval you can defend, and answers you can trace

Source / canonical copy: LinhTruong.com. Share that link if you pass this file around so attribution stays with it.

Notes on the stack I reach for when software engineers have to ship data- and knowledge-grounded AI: ingestion and lakehouse, vectors and graphs, RAG through GraphRAG-style patterns, agents, evaluation, governance, and the LLMOps wiring that keeps quality from drifting. Opinionated, refreshed for 2026, and slanted toward what holds up in audits and postmortems—not demo-day polish.

Lakehouse · RAG / GraphRAG · Knowledge Graphs · Vector + Hybrid Search · LLMOps / DataOps · Agentic AI · Governance & Lineage

1. Executive Summary

Most of the pain I see in production is data and knowledge, not which LLM badge is on the box. Models churn; what lasts is information that is permissioned, curated, and structured enough to retrieve under scrutiny. AI Data & Knowledge Engineering is the work of building and running that layer—not the slide version, the on-call version.

  • ~80% · common share of time spent on data & knowledge work (rule of thumb)
  • +15–50% · typical accuracy lift from hybrid + GraphRAG over naive RAG (see the ladder in §7.1)
  • 10–100× · cost variance from poor retrieval design
  • < 5% · of orgs have production-grade AI data contracts
How I draw it on a whiteboard. Four beats: a lakehouse for raw + curated tables, a semantic layer (ontology + graph) for meaning, vector + hybrid search for recall, and an evaluation + governance spine so you can show provenance when something goes wrong.

2. The Role & Responsibilities

Data Engineer (DE)

Pipes & Plumbing

Owns ingestion, ELT/ETL, lakehouse, streaming, schemas, SLAs.

  • Batch + streaming pipelines
  • Data modeling (3NF, dimensional, Data Vault)
  • Storage formats (Parquet, Iceberg, Delta, Hudi)
  • Orchestration & observability
Knowledge Engineer (KE)

Meaning & Structure

Owns ontologies, taxonomies, knowledge graphs, entity resolution.

  • Domain ontology design (RDF/OWL, SHACL, schema.org)
  • Entity resolution & canonicalization
  • Graph construction & reasoning
  • Semantic / metrics layers
AI / ML Engineer (MLE)

Models & Retrieval

Owns embeddings, retrieval, fine-tuning, agents, evaluation.

  • Embedding & reranker selection
  • Vector indexes (HNSW, IVF, ScaNN, DiskANN)
  • RAG / GraphRAG / agentic patterns
  • Online + offline eval, guardrails

The AI Data & Knowledge Engineer sits at the intersection. A software engineer moving into this role trades "deploy a service" for "deploy a data product" — versioned, contract-checked, observable, and consumable by both humans and models.

```mermaid
flowchart LR
  subgraph DE["Data Engineering"]
    A1[Ingestion<br/>CDC, Events] --> A2[Lakehouse<br/>Bronze/Silver/Gold]
    A2 --> A3[Feature Store]
  end
  subgraph KE["Knowledge Engineering"]
    B1[Ontology / Taxonomy] --> B2[Entity Resolution]
    B2 --> B3[Knowledge Graph]
    A2 --> B2
  end
  subgraph MLE["AI / ML Engineering"]
    C1[Embeddings] --> C2[Vector Index]
    C2 --> C3[Hybrid + GraphRAG]
    C3 --> C4[Agents / Apps]
    A3 --> C1
    B3 --> C3
  end
  style DE fill:#eef2ff,stroke:#3730a3
  style KE fill:#f5f3ff,stroke:#5b21b6
  style MLE fill:#ecfeff,stroke:#155e75
```
Figure 2.1 — The three pillars and where they meet.

3. Reference Architecture

The canonical end-to-end pipeline a Software Engineer should be able to draw on a whiteboard:

```mermaid
flowchart TB
  subgraph S["① Sources"]
    S1[OLTP DBs<br/>Postgres, MySQL]
    S2[SaaS APIs<br/>Salesforce, Stripe]
    S3[Event Streams<br/>Kafka, Kinesis]
    S4[Files / Docs<br/>PDF, HTML, S3]
    S5[Unstructured<br/>Images, Audio, Video]
  end
  subgraph I["② Ingest & Capture"]
    I1[CDC<br/>Debezium / Fivetran]
    I2[Stream Processors<br/>Flink / Spark Structured]
    I3[Doc Loaders & Parsers<br/>Unstructured.io / Docling]
  end
  subgraph L["③ Lakehouse"]
    L1[Bronze<br/>raw immutable]
    L2[Silver<br/>cleaned, conformed]
    L3[Gold<br/>business / serving]
  end
  subgraph K["④ Semantic & Knowledge"]
    K1[Metrics / Semantic Layer<br/>dbt Semantic, Cube]
    K2[Entity Resolution<br/>Zingg / Splink]
    K3[Knowledge Graph<br/>Neo4j / TigerGraph / RDF]
    K4[Ontology<br/>OWL / SHACL / SKOS]
  end
  subgraph A["⑤ AI Surfaces"]
    A1[Feature Store<br/>Feast / Tecton]
    A2[Vector Store<br/>pgvector / Qdrant / Milvus]
    A3[Reranker + Hybrid Search]
    A4[LLM Gateway<br/>routing, caching, PII]
    A5[Agents & Tools]
  end
  subgraph X["⑥ Applications"]
    X1[Copilots / Assistants]
    X2[Search & Q&A]
    X3[Decision / Recsys]
    X4[Analytics & BI]
  end
  subgraph G["Cross-cutting: Governance, Eval, Ops"]
    G1[Catalog + Lineage<br/>Unity / OpenMetadata / DataHub]
    G2[Quality & Contracts<br/>Great Expectations / Soda]
    G3[Eval & Observability<br/>LangSmith / Arize / Phoenix]
    G4[Security & Privacy<br/>RBAC, ABAC, PII, DLP]
    G5[CI/CD/CT & FinOps]
  end
  S1 --> I1 --> L1
  S2 --> I1
  S3 --> I2 --> L1
  S4 --> I3 --> L1
  S5 --> I3
  L1 --> L2 --> L3
  L3 --> K1
  L2 --> K2 --> K3
  K4 --> K3
  L3 --> A1
  L2 --> A2
  K3 --> A3
  A1 --> A5
  A2 --> A3 --> A5
  A4 --> A5
  A5 --> X1
  A3 --> X2
  A1 --> X3
  K1 --> X4
  G1 -.-> L1
  G1 -.-> K3
  G2 -.-> L2
  G3 -.-> A5
  G4 -.-> A4
  G5 -.-> A5
```
Figure 3.1 — Reference architecture for an AI Data & Knowledge platform.
Engineer's heuristic. If a box on this diagram has no owner, no SLO, and no contract, it is not a system — it is a future incident.

4. The 2026 Tech Stack (Picks & Trade-offs)

| Layer | Strong picks (2026) | When to choose | Watch-outs |
|---|---|---|---|
| Lakehouse storage | Apache Iceberg, Delta Lake, Hudi | Iceberg for openness/multi-engine; Delta if Databricks-heavy | Table maintenance (compaction, expiration) is mandatory |
| Query / compute | DuckDB, Trino, Spark, Snowflake, BigQuery | DuckDB for local dev & small data; Trino for federation; Spark/Snowflake at scale | Cost explodes with bad partitioning / scans |
| Streaming | Kafka, Redpanda, Flink, Materialize | Flink for stateful; Materialize for SQL-on-streams | Schema evolution + exactly-once need design |
| Transformations | dbt, SQLMesh, Spark, Polars | dbt for SQL teams; SQLMesh for column-level lineage & virtual envs | Macro sprawl; lacking tests |
| Orchestration | Dagster, Airflow 2.x, Prefect, Temporal | Dagster for asset-centric; Temporal for durable agent workflows | DAG-of-DAGs sprawl |
| Vector store | pgvector, Qdrant, Milvus, Weaviate, LanceDB | pgvector if Postgres-shop; Qdrant/Milvus for scale + filters | Index rebuild cost; payload schema drift |
| Embeddings | OpenAI text-embedding-3, Voyage, Cohere v3, BGE-M3, E5-mistral | Multilingual: BGE-M3 / Cohere; long-context: Voyage | Re-embedding is expensive — version aggressively |
| Rerankers | Cohere Rerank, Voyage Rerank, BGE-reranker, ColBERT-v2 | Always add a reranker; +10–30% nDCG typical | Latency budget; batch carefully |
| Knowledge graph | Neo4j, TigerGraph, ArangoDB, Memgraph, Apache Jena, GraphDB | RDF/OWL for ontology rigor; LPG for app speed | Don't model the world — model the use case |
| Entity resolution | Zingg, Splink, dedupe.io, Senzing | Splink for transparency; Zingg ML-first | Threshold tuning, manual review queues |
| Feature store | Feast, Tecton, Databricks FS | Feast for OSS; Tecton for streaming features | Online/offline skew is the #1 bug |
| LLM gateway | LiteLLM, Portkey, Kong AI, Cloudflare AI Gateway | Always put a gateway in front: routing, retries, caching, PII, $$ | One more hop — measure latency |
| Eval & observability | LangSmith, Arize Phoenix, Langfuse, Ragas, TruLens | Phoenix/Langfuse open-source; Arize/LangSmith managed | Eval datasets must be versioned like code |
| Catalog & lineage | Unity Catalog, OpenMetadata, DataHub, Apache Atlas | Unity if Databricks; OpenMetadata/DataHub otherwise | Lineage is only useful if it's column-level |
| Quality / contracts | Great Expectations, Soda, dbt tests, Monte Carlo, Acceldata | GE/Soda in CI; Monte Carlo for runtime anomaly detection | Tests rot — review quarterly |
Tool-chasing kills. Pick one tool per layer. Standardize. The bottleneck is rarely the tool — it's schema discipline and ownership.

5. Data Engineering Pillar

5.1 The Medallion Pattern (Bronze → Silver → Gold)

```mermaid
flowchart LR
  R[Raw sources] -->|append-only, schema-on-read| B[Bronze<br/>immutable, replayable]
  B -->|deduplicate, conform, type, mask PII| S[Silver<br/>cleaned, conformed]
  S -->|aggregate, model, semantic| G[Gold<br/>business-facing]
  G --> FS[Feature Store]
  G --> SEM[Semantic Layer]
  S --> VEC[Vector / Search Index]
  S --> KG[Knowledge Graph]
  style B fill:#fff7ed,stroke:#9a3412
  style S fill:#f1f5f9,stroke:#475569
  style G fill:#ecfdf5,stroke:#047857
```
Figure 5.1 — Medallion lakehouse layers and their downstream consumers.

5.2 Storage & Format Choices

5.3 Streaming vs Batch Decision

| Need | Choose | Why |
|---|---|---|
| Reporting, training data | Batch (hourly/daily) | Simpler, cheaper, easier to test |
| Feature freshness < 1 min | Streaming (Flink/Spark SS) | Latency-bound use cases |
| CDC from OLTP | Debezium → Kafka → Iceberg | Replayable, no source load |
| Document ingestion | Event-driven (S3 → queue → worker) | Bursty, idempotent |

5.4 The Modern Document Ingestion Pipeline (for RAG)

```mermaid
flowchart LR
  F[File arrives<br/>PDF / HTML / DOCX / Image] --> P[Parse + Layout<br/>Docling / Unstructured / Azure DI]
  P --> N[Normalize<br/>Markdown / JSON blocks]
  N --> E[Extract structure<br/>tables, headings, captions]
  E --> C[Chunk<br/>semantic + sliding + parent-doc]
  C --> M[Enrich<br/>title, summary, entities, ACL]
  M --> EM[Embed<br/>multi-vector / late-interaction]
  EM --> V[Vector store]
  M --> K[Knowledge graph]
  M --> S[Search index BM25]
  style P fill:#eef2ff,stroke:#3730a3
  style C fill:#ecfeff,stroke:#155e75
  style EM fill:#f5f3ff,stroke:#5b21b6
```
Figure 5.2 — Document → retrieval-ready pipeline. Skipping layout-aware parsing is the #1 RAG quality killer.

5.5 Chunking Strategies (ranked, 2026)

  1. Structure-aware (headings, sections, tables) — best default; see the sketch after this list.
  2. Semantic chunking (embedding-based boundary detection).
  3. Parent-document retrieval (embed small, return large).
  4. Late-chunking / contextual embedding (embed in context of doc; 2024+ technique).
  5. Fixed-size sliding window — only as fallback.
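A minimal sketch of strategy 1 with strategy 5 as the fallback, assuming Markdown input (what the normalize step in Figure 5.2 emits); `max_chars`, `overlap`, and the heading regex are illustrative defaults, not tuned values:

```python
import re
from typing import Iterator

HEADING = re.compile(r"^#{1,4} ", re.MULTILINE)

def chunk_markdown(doc: str, max_chars: int = 4000, overlap: int = 400) -> Iterator[dict]:
    """Structure-aware chunking: split at headings, window only oversized sections."""
    starts = [m.start() for m in HEADING.finditer(doc)]
    if not starts or starts[0] != 0:
        starts = [0] + starts                # keep any preamble before the first heading
    for lo, hi in zip(starts, starts[1:] + [len(doc)]):
        section = doc[lo:hi].strip()
        if not section:
            continue
        title = section.splitlines()[0]
        if len(section) <= max_chars:
            yield {"title": title, "text": section}
        else:
            # Fallback: fixed-size sliding window within the oversized section.
            step = max_chars - overlap
            for i in range(0, len(section), step):
                yield {"title": title, "text": section[i : i + max_chars]}
```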

6. Knowledge Engineering Pillar

6.1 The Knowledge Stack

```mermaid
flowchart TB
  O[Ontology<br/>classes, properties, axioms<br/>OWL / SHACL / schema.org]
  T[Taxonomy / Controlled Vocabulary<br/>SKOS]
  ER[Entity Resolution<br/>canonical IDs]
  KG[Knowledge Graph<br/>typed nodes & edges]
  SR[Semantic Reasoner / Rules<br/>SHACL, SWRL, Datalog]
  SL[Semantic / Metrics Layer<br/>dbt-semantic, Cube, MetricFlow]
  APP[Apps + LLMs]
  O --> T
  O --> ER
  O --> KG
  T --> KG
  ER --> KG
  KG --> SR
  KG --> SL
  SL --> APP
  KG --> APP
  SR --> APP
  style O fill:#f5f3ff,stroke:#5b21b6
  style KG fill:#ecfeff,stroke:#155e75
  style SL fill:#ecfdf5,stroke:#047857
```
Figure 6.1 — Knowledge layer: ontology → taxonomy → ER → KG → reasoning → semantic layer.

6.2 Ontology Design — Rules of Thumb

6.3 Entity Resolution Pipeline

```mermaid
flowchart LR
  R[Raw records] --> N[Normalize<br/>casing, unicode, addresses]
  N --> B[Blocking<br/>candidate pairs]
  B --> SC[Score<br/>ML / rules / Fellegi-Sunter]
  SC --> CL[Cluster<br/>connected components]
  CL --> RV[Human review<br/>active learning]
  RV --> CN[Canonical IDs + survivorship]
  CN --> KG[Knowledge Graph]
  style SC fill:#eef2ff,stroke:#3730a3
  style RV fill:#fffbeb,stroke:#92400e
```
Figure 6.2 — ER is a loop, not a one-shot. Budget for review tooling.

6.4 The Semantic Layer

A semantic layer defines metrics, dimensions, and joins once so BI, apps, and LLMs all answer the same question the same way. This is the difference between "ChatGPT for our database" demos and systems that survive an audit.

```yaml
# dbt Semantic Layer / MetricFlow style
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params: { time_granularity: day }
      - name: status
        type: categorical
    measures:
      - name: revenue
        agg: sum
        expr: gross_amount - discounts

metrics:
  - name: weekly_active_revenue
    type: simple
    type_params:
      measure: revenue
    filter: "{{ Dimension('orders__status') }} = 'completed'"
```

7. Retrieval & RAG Patterns

7.1 The RAG Maturity Ladder

| Level | Pattern | When good enough | Typical lift |
|---|---|---|---|
| L0 | Stuff context into prompt | POC, < 5 docs | — |
| L1 | Naive vector RAG | FAQ, narrow domain | baseline |
| L2 | Hybrid (BM25 + vector) + rerank | Most production cases | +15–30% nDCG |
| L3 | Query rewriting + HyDE + multi-query | Ambiguous queries | +5–15% |
| L4 | Parent-doc / contextual / late-chunking | Long docs, technical PDFs | +10–20% |
| L5 | GraphRAG / hybrid graph + vector | Multi-hop, entity-heavy | +20–50% on multi-hop |
| L6 | Agentic retrieval (planner + tools) | Cross-source reasoning | opens new capabilities |

7.2 Hybrid Retrieval Architecture (L2 — production default)

```mermaid
flowchart LR
  Q[User query] --> QR[Query rewriting<br/>+ classification]
  QR --> P1[Lexical BM25 / SPLADE]
  QR --> P2[Dense vector kNN]
  QR --> P3[Metadata filters<br/>ACL, time, source]
  P1 --> F[Fusion RRF / weighted]
  P2 --> F
  P3 --> F
  F --> RR[Cross-encoder reranker]
  RR --> CTX[Context packer<br/>dedup, diversify, cite]
  CTX --> LLM[LLM]
  LLM --> ANS[Answer + citations]
  ANS --> EV[Eval / trace]
  style F fill:#eef2ff,stroke:#3730a3
  style RR fill:#ecfeff,stroke:#155e75
  style EV fill:#fffbeb,stroke:#92400e
```
Figure 7.1 — Hybrid retrieval is the production default. Always include a reranker and citations.

7.3 GraphRAG (L5)

GraphRAG augments retrieval with a knowledge graph: entities and relationships extracted from the corpus become first-class retrieval keys. It is strong for multi-hop questions ("Which suppliers of company X were affected by event Y?") and for global, corpus-wide summaries that vector search alone cannot produce.

```mermaid
flowchart TB
  DOC[Corpus] --> EX[Entity + relation extraction<br/>LLM + NER]
  EX --> KG[(Knowledge Graph)]
  KG --> COM[Community detection<br/>Leiden / Louvain]
  COM --> SUM[Community summaries]
  Q[Query] --> QC[Query classifier]
  QC -->|local| LK[Local: ego-network around entity]
  QC -->|global| GS[Global: community summaries map-reduce]
  LK --> CTX[Context]
  GS --> CTX
  CTX --> LLM
  style KG fill:#f5f3ff,stroke:#5b21b6
  style COM fill:#ecfeff,stroke:#155e75
```
Figure 7.2 — GraphRAG (Microsoft Research, 2024) flow: local vs global queries.
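A minimal sketch of the local branch, assuming the extracted graph lives in networkx with entity names as node ids and a `type` attribute on edges; the radius and the triple format are illustrative choices:

```python
import networkx as nx

def local_context(kg: nx.MultiDiGraph, entity: str, radius: int = 2) -> str:
    """Expand the ego-network around an entity and linearize it for the prompt."""
    ego = nx.ego_graph(kg, entity, radius=radius, undirected=True)
    triples = [
        f"{u} -[{data.get('type', 'related_to')}]-> {v}"
        for u, v, data in kg.edges(data=True)
        if u in ego and v in ego
    ]
    return "\n".join(sorted(set(triples)))
```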

7.4 Practical RAG Code Skeleton

```python
from typing import List

def retrieve(query: str, user: str, k: int = 50) -> List[dict]:
    # 1. Rewrite + classify (multi-query)
    rewrites = llm_rewrite(query, n=3)
    # 2. Parallel lexical + dense, with ACL filters applied at query time
    lex   = bm25.search(rewrites, k=k)
    dense = vector.search(embed(rewrites), k=k, filter=acl(user))
    # 3. Reciprocal Rank Fusion (see the rrf sketch below)
    fused = rrf([lex, dense], k=k)
    # 4. Cross-encoder rerank the fused candidates, keep the best 8
    return reranker.rerank(query, fused[:k])[:8]

def answer(query: str, user: str) -> dict:
    ctx = retrieve(query, user)
    packed = pack_context(ctx, max_tokens=6_000, dedup=True)
    out = llm.chat(
        system=SYSTEM_WITH_CITATIONS,
        user=query,
        context=packed,
        tools=[search_kg, lookup_metric],   # optional agentic hop
    )
    log_trace(query, ctx, out)              # eval / observability
    return out
```
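The `rrf` call above is where the hybrid lists meet. A minimal reciprocal-rank-fusion sketch, assuming each result list is ordered best-first and every hit carries a stable `id`; the constant 60 is the conventional default from the original RRF paper:

```python
from collections import defaultdict

def rrf(result_lists: list[list[dict]], k: int = 50, c: int = 60) -> list[dict]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (c + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    by_id: dict[str, dict] = {}
    for results in result_lists:
        for rank, hit in enumerate(results, start=1):
            scores[hit["id"]] += 1.0 / (c + rank)
            by_id[hit["id"]] = hit
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [by_id[i] for i in top]
```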
The four retrieval failure modes — memorize these: (1) Recall miss (right doc not retrieved), (2) Ranking miss (retrieved but buried), (3) Context dilution (good chunk drowned by noise), (4) Grounding miss (LLM ignores context). Each has different fixes; eval should measure each separately.

8. Agentic Data Workflows

Agents make data & knowledge systems active: they plan, call tools, branch, and self-correct. The software-engineer mindset: an agent is a state machine with tool-calls and a budget, not magic.

```mermaid
flowchart TB
  U[User goal] --> PL[Planner / Router]
  PL -->|simple| RAG[RAG tool]
  PL -->|structured| SQL[Text-to-SQL on semantic layer]
  PL -->|graph| KG[KG query tool]
  PL -->|action| API[App API tool]
  RAG --> CR[Critic / Verifier]
  SQL --> CR
  KG --> CR
  API --> CR
  CR -->|ok| ANS[Answer + citations]
  CR -->|retry| PL
  ANS --> MEM[(Memory / Trace)]
  MEM -.-> PL
  style PL fill:#eef2ff,stroke:#3730a3
  style CR fill:#fffbeb,stroke:#92400e
```
Figure 8.1 — Plan → tools → critic loop. Bounded retries and explicit budgets are non-negotiable.
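A framework-free sketch of that state machine; `llm_plan` and `llm_verify` are assumed helpers (the plan returns a tool name plus args, or a finish action), and the budget numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_steps: int = 6
    max_cost_usd: float = 0.50
    spent_usd: float = 0.0

def run_agent(goal: str, tools: dict, budget: Budget | None = None) -> dict:
    budget = budget or Budget()
    trace: list[dict] = []
    for step in range(budget.max_steps):                 # hard step cap
        plan = llm_plan(goal, trace)                     # choose tool + args, or finish
        if plan["action"] == "finish":
            return {"answer": plan["answer"], "trace": trace}
        result = tools[plan["action"]](**plan["args"])
        budget.spent_usd += plan.get("cost_usd", 0.0)
        trace.append({"step": step, "action": plan["action"], "result": result})
        if budget.spent_usd > budget.max_cost_usd:       # hard dollar cap
            break
        if not llm_verify(goal, result):                 # critic step; loop replans
            trace.append({"step": step, "note": "verifier rejected, replanning"})
    return {"answer": None, "trace": trace, "error": "budget exhausted"}
```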

8.1 Engineering Rules for Production Agents

8.2 Text-to-SQL on the Semantic Layer

Don't let LLMs write raw SQL against raw tables. Expose the semantic layer as the tool surface. The LLM picks metrics + dimensions + filters; the semantic layer compiles deterministic SQL. This collapses error rates from ~50% to <5% on real schemas.
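A sketch of that tool surface using the metrics defined in §6.4; the JSON schema is what the LLM sees, and `semantic_layer.query` stands in for whatever compiler your metrics layer exposes (hypothetical names throughout):

```python
# Constrained tool schema: the LLM picks from enumerated metrics and dimensions.
METRIC_QUERY_TOOL = {
    "name": "query_metrics",
    "description": "Answer numeric questions via governed metrics. Never writes SQL.",
    "parameters": {
        "type": "object",
        "properties": {
            "metrics":    {"type": "array", "items": {"enum": ["revenue", "weekly_active_revenue"]}},
            "group_by":   {"type": "array", "items": {"enum": ["order_date", "status"]}},
            "filters":    {"type": "array", "items": {"type": "string"}},
            "time_grain": {"enum": ["day", "week", "month"]},
        },
        "required": ["metrics"],
    },
}

def query_metrics(metrics, group_by=(), filters=(), time_grain="day"):
    # The semantic layer, not the LLM, compiles deterministic SQL.
    return semantic_layer.query(metrics=list(metrics), group_by=list(group_by),
                                where=list(filters), grain=time_grain)
```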

9. Data & Knowledge Quality

9.1 Data Contracts

A data contract is a producer-signed schema + SLA + semantics. It is the single biggest cultural shift required to make AI data reliable.

```yaml
# contract: orders.v3
owner: team-commerce
schema:
  order_id: { type: uuid, pii: false, required: true }
  customer_id: { type: uuid, pii: false, required: true }
  amount_cents: { type: int64, required: true, range: [0, 10_000_000] }
  status: { type: enum, values: [pending, paid, refunded, cancelled] }
  created_at: { type: timestamp_utc, required: true }
freshness_sla: 15m
volume_sla: { min_rows_per_day: 1_000, max_change_pct: 30 }
quality_checks:
  - unique: [order_id]
  - not_null: [order_id, customer_id, amount_cents, created_at]
  - referential: { customer_id: customers.customer_id }
breaking_change_policy: deprecate_30d_then_remove
consumers: [feature_store.orders_v3, kg.order_node, dashboards.revenue]
```
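A minimal sketch of enforcing the schema portion of this contract in CI, assuming the YAML above sits on disk and the day's batch arrives as a pandas DataFrame; freshness and referential checks are left out for brevity:

```python
import pandas as pd
import yaml

def check_contract(df: pd.DataFrame, contract_path: str) -> list[str]:
    contract = yaml.safe_load(open(contract_path))
    errors: list[str] = []
    for col, spec in contract["schema"].items():
        if spec.get("required") and (col not in df or df[col].isna().any()):
            errors.append(f"{col}: required but missing/null")
        if "range" in spec and col in df:
            lo, hi = spec["range"]
            if not df[col].dropna().between(lo, hi).all():
                errors.append(f"{col}: out of range [{lo}, {hi}]")
        if "values" in spec and col in df:
            bad = set(df[col].dropna()) - set(spec["values"])
            if bad:
                errors.append(f"{col}: unexpected values {sorted(bad)}")
    for check in contract.get("quality_checks", []):
        for col in check.get("unique", []):
            if df[col].duplicated().any():
                errors.append(f"{col}: duplicates found")
    return errors  # a non-empty list fails the CI step
```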

9.2 The Test Pyramid for Data

```mermaid
flowchart TB
  A[Unit tests on transforms<br/>fast, mocked] --> B[Contract tests on inputs<br/>schema + ranges]
  B --> C[Integration tests on pipelines<br/>fixtures + assertions]
  C --> D[Property tests on invariants<br/>row counts, distributions]
  D --> E[Runtime anomaly detection<br/>volume, freshness, schema drift]
  style A fill:#ecfdf5,stroke:#047857
  style E fill:#fef2f2,stroke:#b91c1c
```
Figure 9.1 — Test pyramid. Don't skip the bottom in favor of dashboards at the top.
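One worked rung from the middle of the pyramid: a pytest-style property test on Silver orders. The `silver_orders` fixture and the 20% refund band are illustrative assumptions:

```python
import pandas as pd

def test_orders_invariants(silver_orders: pd.DataFrame):
    # Property tests: invariants that must hold on any day's data.
    assert silver_orders["order_id"].is_unique
    assert (silver_orders["amount_cents"] >= 0).all()
    # Distribution guard: refund share should stay in a sane band.
    refund_share = (silver_orders["status"] == "refunded").mean()
    assert refund_share < 0.20, f"refund share suspicious: {refund_share:.1%}"
```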

9.3 Knowledge Quality

10. Evaluation & Observability

10.1 Eval Strategy Matrix

| Layer | What you measure | How |
|---|---|---|
| Retrieval | Recall@k, MRR, nDCG, context precision | Labeled query set + LLM-as-judge for relevance (sketch below) |
| Generation | Faithfulness, answer relevance, citation correctness | RAGAS, TruLens, custom LLM judges with rubrics |
| Task | Task success rate, tool-call accuracy | Scenario suite + golden traces |
| Safety | PII leak, jailbreak resistance, toxicity | Red-team prompts + classifiers |
| Business | Deflection rate, CSAT, $ saved/earned | Online A/B + product analytics |
| Cost / latency | p50/p95 latency, $/query, cache hit rate | Gateway metrics + traces |
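A minimal sketch for the retrieval row, assuming a labeled set where each row carries the query, the acting user, and the set of relevant chunk ids, plus the `retrieve` function from §7.4:

```python
def recall_at_k(eval_set: list[dict], k: int = 8) -> float:
    """Macro-averaged recall@k over rows like
    {"query": str, "user": str, "relevant_ids": set[str]}."""
    per_query = []
    for row in eval_set:
        got = {hit["id"] for hit in retrieve(row["query"], row["user"])[:k]}
        rel = row["relevant_ids"]
        per_query.append(len(got & rel) / len(rel))
    return sum(per_query) / len(per_query)
```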

10.2 Offline → Online Loop

```mermaid
flowchart LR
  LOG[Prod traces] --> CUR[Curate eval set<br/>hard, edge, regression]
  CUR --> OFF[Offline eval<br/>regression suite in CI]
  OFF --> CHG[Change<br/>prompt / index / model]
  CHG --> SHADOW[Shadow / canary]
  SHADOW --> AB[A/B test online]
  AB --> WIN{wins?}
  WIN -->|yes| ROLL[Rollout]
  WIN -->|no| BACK[Roll back + add to eval]
  BACK --> CUR
  ROLL --> LOG
  style OFF fill:#ecfdf5,stroke:#047857
  style AB fill:#eef2ff,stroke:#3730a3
```
Figure 10.1 — The eval flywheel. Every prod incident becomes a regression test.

10.3 LLM-as-Judge — Use Wisely

11. MLOps / LLMOps

11.1 The Four Lifecycles

```mermaid
flowchart LR
  subgraph DATA["Data lifecycle"]
    D1[Ingest] --> D2[Transform] --> D3[Validate] --> D4[Publish]
  end
  subgraph INDEX["Index lifecycle"]
    I1[Parse + chunk] --> I2[Embed] --> I3[Build index] --> I4[Promote]
  end
  subgraph MODEL["Model lifecycle"]
    M1[Select / fine-tune] --> M2[Eval] --> M3[Register] --> M4[Deploy]
  end
  subgraph APP["App / agent lifecycle"]
    A1[Prompt / graph] --> A2[Eval] --> A3[Canary] --> A4[Release]
  end
  D4 --> I1
  D4 --> M1
  I4 --> A1
  M4 --> A1
```
Figure 11.1 — Four independent lifecycles. Each needs versioning, CI, and rollback.

11.2 CI / CD / CT (Continuous Training/Tuning)

11.3 Versioning Everything

| Artifact | How to version |
|---|---|
| Datasets | Iceberg/Delta snapshots, LakeFS, DVC |
| Schemas / contracts | Git + semver, registered in catalog |
| Embeddings | Model id + dim + normalization + chunker version |
| Indexes | Alias swap (blue/green); never mutate in place (sketch below) |
| Prompts | Git + prompt registry (LangSmith, Langfuse) |
| Agent graphs | Code + graph definition checksum |
| Eval sets | Versioned in object store with row-level lineage |
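A sketch of the index rule, with the alias pointer kept outside the index itself; a JSON file stands in for whatever alias mechanism your vector store or search engine provides (names are illustrative):

```python
import json
import pathlib

REGISTRY = pathlib.Path("index_aliases.json")

def promote(alias: str, new_index: str) -> str | None:
    """Blue/green: build e.g. docs-v42 offline, then atomically repoint the alias."""
    aliases = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    old = aliases.get(alias)
    aliases[alias] = new_index          # e.g. "docs" -> "docs-v42"
    REGISTRY.write_text(json.dumps(aliases, indent=2))
    return old                          # keep the old index around for rollback
```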

12. Governance, Security, and Cost

12.1 Data Governance Pillars

  • Catalog: searchable inventory of every asset, owner, freshness, contract.
  • Lineage: column-level upstream/downstream — essential for impact analysis.
  • Access: RBAC + ABAC + row/column policies; propagate to vector store payloads.
  • Privacy: PII tagging, masking, tokenization, DSAR support, retention rules.

12.2 PII & RAG — Specific Hazards

Vector stores leak. If a document contains PII, the chunk, the embedding, AND any metadata payload must be treated as PII. ACL must be checked at query time against payload tags — not just at ingestion.
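A minimal sketch of that query-time check, assuming ingestion stamped each chunk payload with `acl_groups` and a `pii` flag; enforcement lives in the retrieval path, not just at ingestion:

```python
def acl_filter(hits: list[dict], user_groups: set[str], allow_pii: bool = False) -> list[dict]:
    """Drop chunks the caller may not see: after retrieval, before the LLM."""
    visible = []
    for hit in hits:
        payload = hit["payload"]
        if not user_groups & set(payload.get("acl_groups", [])):
            continue                       # no overlapping group: invisible
        if payload.get("pii") and not allow_pii:
            continue                       # PII-tagged chunk needs an explicit grant
        visible.append(hit)
    return visible
```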

12.3 Compliance Map (2026)

| Regime | What it forces on you |
|---|---|
| GDPR / CCPA / CPRA | Lawful basis, DSAR, deletion across all stores including vectors |
| EU AI Act | Risk class, data governance evidence, transparency, logging |
| SOC 2 / ISO 27001 | Access logs, change management, vendor mgmt |
| HIPAA / PCI / FINRA | Sector-specific encryption, retention, audit |
| NIST AI RMF / ISO 42001 | AI management system, eval, incident response |

12.4 FinOps for AI Data

13. Implementation Strategy

13.1 Strategic Operating Model

```mermaid
flowchart LR
  V[Business value hypothesis] --> UC[Pick 1–2 use cases<br/>narrow, measurable]
  UC --> DATA[Inventory data & knowledge gaps]
  DATA --> MVP[Thin slice MVP<br/>L2 hybrid RAG + contracts]
  MVP --> EV[Eval set + baselines]
  EV --> SHIP[Ship behind flag]
  SHIP --> MEAS[Measure business + tech KPIs]
  MEAS --> ITER[Iterate or kill]
  ITER -->|expand| PLAT[Promote shared components to platform]
  style MVP fill:#ecfdf5,stroke:#047857
  style PLAT fill:#eef2ff,stroke:#3730a3
```
Figure 13.1 — Use-case-led, platform-emergent. Do not build the platform first.

13.2 The 90-Day Plan for a Software Engineer Stepping In

Days 1–15 · Discover. Map data sources, owners, SLAs. Pick 1 high-value use case with a measurable KPI. Stand up a catalog (OpenMetadata/DataHub/Unity). Define data contracts for the top 5 tables. Choose lakehouse format & engine.

Days 16–35 · Foundation. Build Bronze/Silver/Gold for the use-case data. Add CI with dbt/SQLMesh + Great Expectations / Soda. Stand up an LLM gateway with logging, PII redaction, model routing. Pick embedding model + vector store; version everything.

Days 36–60 · Retrieve & Reason. Ship L2 hybrid RAG (BM25 + dense + reranker) with citations. Build a labeled eval set (≥200 queries). Add observability (Langfuse / Phoenix / LangSmith). If multi-hop matters, prototype GraphRAG over a small KG built from Silver + extracted entities.

Days 61–80 · Harden. Add ACL on retrieval; integrate the semantic layer for any numeric questions; introduce shadow + canary; wire feedback collection. Run red-team prompts; close PII gaps. Establish on-call & runbooks.

Days 81–90 · Scale. Promote reusable pieces (chunker, embedder, gateway, eval harness) to a platform repo. Publish the data contract and prompt registries. Write a postmortem on the first incident. Plan use case #2 — but reuse, don't rebuild.

13.3 Decision Records You'll Need

14. Anti-patterns & Pitfalls

| Anti-pattern | Why it kills | Do instead |
|---|---|---|
| Naive RAG in prod | Recall + ranking are both bad; no citations | Hybrid + rerank + citations from day one |
| Re-embedding without versioning | Index mismatch; silent quality cliff | Embedding artifact = (model, dim, normalizer, chunker) tuple |
| One giant chunk size for all docs | Underfits long technical PDFs; wastes context on short ones | Structure-aware + parent-doc retrieval |
| LLM writes raw SQL on raw tables | Hallucinated joins, wrong metrics, no auditability | Expose semantic layer as the tool |
| Ontology-first boil-the-ocean | Months of modeling, zero shipped value | Use-case ontology slice; reuse standards |
| No data contracts | Producers break consumers silently | Contracts in Git, CI-enforced |
| Eval = vibes | Cannot detect regressions; can't justify changes | Versioned eval set + automated regression in CI |
| Agents with unlimited steps/$$ | Runaway cost, infinite loops | Hard budgets, durable execution, critic step |
| PII tags lost at chunk time | Vector store leaks; ACL bypass | Carry PII + ACL tags into payload; enforce at query |
| Single vendor, single model | Outage + price = existential risk | Gateway + multi-model routing + fallback |
| "Chat with your data" demos shipped as products | No grounding, no metrics, no governance | Narrow scope, semantic layer, citations, eval |

15. Skills Matrix & Career Path

15.1 Core Competencies (T-shape)

Foundations (must)

  • Python + typed SQL
  • Distributed systems basics
  • Parquet / columnar, partitioning
  • Git, CI/CD, Docker, IaC
  • Probability + stats reading-level

Data (must)

  • Lakehouse formats (Iceberg/Delta)
  • dbt / SQLMesh
  • Spark / DuckDB / Polars
  • Kafka + stream processing
  • Great Expectations / Soda

Knowledge (deepen)

  • RDF/OWL or LPG modeling
  • SHACL / Cypher / SPARQL
  • Entity resolution (Splink/Zingg)
  • Semantic / metrics layers

AI / Retrieval (deepen)

  • Embeddings + reranking
  • Hybrid + GraphRAG patterns
  • Prompting + structured outputs
  • Agent frameworks (LangGraph, LlamaIndex, Haystack)
  • Eval (RAGAS, Phoenix, Langfuse)

Ops & Platform

  • Orchestration (Dagster/Airflow)
  • Feature stores (Feast/Tecton)
  • Observability + tracing
  • FinOps for AI workloads

Governance & Soft

  • Data contracts & ADRs
  • Privacy: GDPR / AI Act
  • Threat modeling (LLM-specific: OWASP LLM Top 10)
  • Stakeholder + product partnering

15.2 Career Ladder

```mermaid
flowchart LR
  SE[Software Engineer] --> DE[Data Engineer]
  SE --> MLE[ML Engineer]
  DE --> AIDKE[AI Data & Knowledge Engineer]
  MLE --> AIDKE
  AIDKE --> SAIDKE[Staff AI Platform Engineer]
  AIDKE --> ARCH[AI Solutions Architect]
  SAIDKE --> PRIN[Principal / Distinguished]
  style AIDKE fill:#eef2ff,stroke:#3730a3
```
Figure 15.1 — Typical career paths converging on the AI Data & Knowledge Engineer role.

16. References & Further Reading

Starter reading only; your vendor contracts and internal policies still win. Canonical HTML: LinhTruong.com (Linh Truong).

Foundational books

  • Kleppmann — Designing Data-Intensive Applications
  • Reis & Housley — Fundamentals of Data Engineering
  • Kimball — The Data Warehouse Toolkit
  • Allemang, Hendler & Gandon — Semantic Web for the Working Ontologist
  • Hogan et al. — Knowledge Graphs (open access)
  • Huyen — Designing Machine Learning Systems; AI Engineering (2024)

Specs & standards

  • Apache Iceberg / Delta Lake / Hudi specs
  • W3C: RDF 1.1, OWL 2, SHACL, SKOS, PROV-O
  • schema.org, FIBO, SNOMED CT, ISO 8000
  • OpenLineage, OpenTelemetry, MLflow Model Registry
  • NIST AI RMF, ISO/IEC 42001, EU AI Act
  • OWASP Top 10 for LLM Applications

Seminal / influential papers

  • Lewis et al. — Retrieval-Augmented Generation (2020)
  • Karpukhin et al. — Dense Passage Retrieval (2020)
  • Khattab & Zaharia — ColBERT / ColBERTv2
  • Microsoft Research — From Local to Global: GraphRAG (2024)
  • Anthropic — Contextual Retrieval (2024)
  • Wang et al. (Google) — Self-Consistency; Dhuliawala et al. (Meta) — Chain-of-Verification
  • Schick et al. (Meta) — Toolformer; Yao et al. — ReAct

Living resources

  • Data Engineering Weekly (newsletter)
  • Chip Huyen's blog & AI Engineering newsletter
  • Eugene Yan — applied ML & LLM systems essays
  • Hamel Husain — LLM evals, AI Engineering
  • Jay Alammar / Cohere blog — retrieval & embeddings
  • Databricks, Snowflake, AWS, GCP engineering blogs

Closing bias. I describe the job in one line: make grounded answers fast, owned, and auditable—so the model is a part of the system you can reason about, not a black box you shrug at after a bad rollout.