Linh Truong · Product strategy · 2026

The AI Product Manager
Operating Manual

Canvases, diagrams, and decision checks I use with teams—from discovery through scale—for AI that actually ships and stays trustworthy.

Author: Linh Truong, MA (Harvard), MBA

Source: LinhTruong.com

Email: Linh@Alumni.Harvard.edu

Edition: v2026.1

Reading time: ~35–45 min

For: PMs, group PMs, heads of product

Covers: LLM, agentic, multimodal, embedded AI (B2B, consumer, internal)

The New Mandate of the AI PM
Competency Model & T-Shape
The AI Product Lifecycle
Strategy Canvas & North-Star
Technical Foundations Every PM Needs
Discovery, Problem Framing & Validation
Data, Evals & Quality Loops
Metrics: North-Star, HEART, Guardrails
Risk, Safety, Ethics & Governance
Build vs. Buy vs. Fine-tune vs. RAG
Pricing, Unit Economics & Cost Control
Team Topology & RACI
Roadmapping & Prioritization
Go-to-Market & Adoption
Common Pitfalls & Anti-Patterns
The 30 / 60 / 90 Day Plan
Tooling Stack Reference
Closing: The Durable AI PM
References & sources

78%

Enterprises with AI in production

McKinsey State of AI, 2025

3.7×

ROI gap, leaders vs. laggards

BCG AI Maturity Index, 2025

42%

AI pilots that reach scale

Gartner, 2025

11mo

Median PoC → production

a16z Enterprise survey, 2025

01The New Mandate of the AI Product Manager

The PM job did not change. The medium did. AI shifts the product surface from deterministic features to probabilistic capabilities — and that flips the operating model.

Traditional PMs ship features that behave the same way on Tuesday as they did on Monday. AI PMs ship capabilities whose quality is a distribution. Two consequences:

Evals replace spec sheets. A “done” feature is no longer a checked acceptance criterion — it is a measured pass-rate on a representative test set.
Data is the new roadmap. What you can build is bounded by the data you can collect, label, and govern. Roadmaps without a data plan are wishlists.
UX absorbs uncertainty. Confidence, fallbacks, citations, and undo become first-class UX primitives, not afterthoughts.
Cost is a feature. Latency and unit economics are now product decisions, not infra concerns.

Fig. 1 — The product surface shifts from a discrete behavior to a probability distribution.

“In AI products, the PM's job is not to specify the answer — it's to define the question, the bar, and the experience when the answer is wrong.” — Linh Truong

02Competency Model: The T-Shaped AI PM

An AI PM is not a research scientist — but cannot be a passenger either. The bar is fluency, not authorship: enough depth to ask the right questions and make the right tradeoffs.

Fig. 2 — Six competency lobes. Most AI PMs are strong in 2–3 and adequate in the rest. The differentiator is depth in ② and ③.

Competency	Novice	Practitioner	Expert
Technical fluency	Knows what a token is	Designs eval sets & prompts; chooses RAG vs. fine-tune	Co-designs system architecture with eng; reasons about cost/latency curves
Data literacy	Reads dashboards	Defines labeling guidelines & sampling strategy	Owns a data flywheel with feedback loops in production
Risk & ethics	Lists obvious harms	Runs structured red-team; owns disclosure UX	Builds governance program mapped to EU AI Act / NIST AI RMF
Business acumen	Knows pricing	Models cost-per-query & gross margin	Drives pricing reinvention around value, not seats

03The AI Product Lifecycle

A two-loop model: an outer product loop (months) and an inner quality loop (days–hours). Most failures come from teams running only one.

Fig. 3 — The two-loop operating model. Outer loop is the PM's classical territory; the inner loop is where AI products are won or lost.

04The AI Strategy Canvas

Before writing a PRD, fill the canvas. If any of the seven boxes are empty, you are not ready to commit roadmap.

Fig. 4 — The 7-box AI Strategy Canvas. Each box has a stress-test question; commit only when all seven pass.

05Technical Foundations Every AI PM Needs

You will not write the code. You will make tradeoffs about it weekly. Here is the minimum surface area.

5.1 The model landscape (2026)

Class	What it is	When PM picks it	Risks
Frontier general LLM Claude / GPT / Gemini	Hosted, broad, instruction-following	Fast time-to-value, broad tasks, lower volume	Vendor lock, cost at scale, data egress
Open-weights LLM Llama / Mistral / Qwen	Self-hostable, customizable	Data sovereignty, fine-tuning, on-prem	Ops burden, slower iteration
Small / specialized distilled · task-tuned	Cheap, fast, narrow	High-volume narrow tasks (classify, extract)	Brittle to drift
Multimodal	Text + vision + audio + video	Doc understanding, accessibility, robotics	Eval is harder; PII in pixels
Agentic systems	Tool-using, multi-step planning	Workflow automation, “do” vs “answer”	Long-horizon failures compound

5.2 The four common architectures — and when to choose each

Fig. 5 — Four common architectures. The PM's job is to choose the simplest one that clears the quality bar at acceptable cost.

Rule of thumb: start with A, add B when knowledge matters, add C when style or structure matters, reach for D only when the task is multi-step and a human would otherwise click through 5+ screens.

5.3 Tokens, context, latency, cost — the four levers

Tokens

~¾

Words per token (English). Plan budgets by tokens, not words.

Context window

200k+

Common today. Bigger ≠ better — retrieval quality decays past mid-window.

Latency

P95

What users feel. Streaming hides P50; never P95.

Cost / 1M tok

10–100×

Spread between cheapest and frontier. Route by task.

06Discovery, Problem Framing & Validation

AI does not change discovery; it raises the cost of building the wrong thing. Spend more time on the problem, not less.

6.1 The Three-Question Filter

Is this a real problem? Measured in current time/money spent, not stated interest.
Is AI the best lever? Or would a rule, a form, or a search box do it cheaper and safer?
Can we afford the error mode? What happens at the 1% and 0.01% bad output — and who bears the cost?

Failure pattern: teams skip Q2. “AI” is added to a doc-search problem that needed indexing. Result: higher cost, lower trust, same outcome.

6.2 The Wizard-of-Oz prototype

Before any model integration, simulate the experience with a human in the loop. If users do not love the experience when answers are perfect, no model will save you. If they do, you now have a quality bar.

6.3 Risk-classed problem framing

Tier	Examples	Bar to ship
Low Drafting / brainstorming	Email drafts, alt copy, summaries	>70% “useful” rating; user always edits
Mid Decision support	Triage, prioritization, lead scoring	Cited evidence, override always present, eval on rep set
High Autonomous action	Sending email, executing trades, code commits	Multi-stage approval, audit log, kill-switch, scoped permissions
Regulated Health / legal / financial advice	Diagnosis, contract terms, fiduciary advice	Domain expert review of outputs; EU AI Act conformity; clear disclosure

07Data, Evals & the Quality Loop

Evals are not QA. Evals are the product spec. The PM who owns the eval set owns the product.

Fig. 6 — The eval pyramid. Build bottom-up. Every team has CI tests; few have judge calibration; almost none have rigorous online evals — which is exactly where competitive advantage lives.

7.1 The PM's eval responsibilities

Curate the golden set. 50–500 high-signal examples covering happy paths, edge cases, and your “newspaper test” harms.
Define the rubric. What does “good” mean, in writing, on this task? Without a rubric, every eval is a vibe check.
Calibrate the judge. If using LLM-as-judge, validate against human ratings on a sample. Re-calibrate when the judge model changes.
Own the regression policy. No model or prompt change ships if it loses on the golden set.

7.2 The data flywheel

Fig. 7 — The data flywheel. The PM's job is to ensure every loop closes — most break at “implicit signals” (not captured) or “eval gates” (not enforced).

08Metrics: North-Star, HEART & AI Guardrails

Classical product metrics (engagement, retention) are necessary but insufficient. AI products need a layer for quality and a layer for safety.

Layer	What it measures	Sample metrics
North-Star	Customer value created	Successful tasks completed; revenue per active user
Engagement (HEART)	Happiness, Engagement, Adoption, Retention, Task success	CSAT, DAU/MAU, activation rate, D30 retention, task completion
AI Quality	Output goodness	Eval pass-rate, hallucination rate, citation accuracy, regeneration rate
AI Safety / Guardrails	What must not happen	Policy-violation rate, jailbreak success, PII leak rate
Unit Economics	Sustainability at scale	Cost per successful task, gross margin, P95 latency

The cardinal rule: never optimize a single metric. AI products can game any one of them — e.g., a model that refuses everything has zero hallucinations and zero usefulness. Always pair an “up” metric with a guardrail.

8.1 Worked example: an AI support assistant

Metric	Type	Target	Guardrail pair
Tickets self-served	North-Star	↑ +30% YoY	CSAT ≥ baseline
First-response accuracy	Quality	≥ 92%	Escalation false-negative < 1%
Hallucinated policy citations	Safety	= 0	—
Cost per resolved ticket	Econ	< $0.15	P95 latency < 4s

09Risk, Safety, Ethics & Governance

Governance is product strategy. Regulatory class shapes architecture choices that are expensive to reverse later.

9.1 Failure modes the PM must name

Hallucination

Plausible-sounding falsehoods. Mitigate via RAG, citations, constrained generation, retrieval grounding.

Bias & fairness

Disparate quality across user groups. Mitigate via stratified evals, dataset audits, counterfactual tests.

Prompt injection

Hostile inputs hijack instructions. Mitigate via input sanitization, separated trust zones, output validation.

Data leakage

Training/inference data appearing in outputs. Mitigate via tenant isolation, no-train guarantees, output filters.

Over-reliance

Users skip review on low-friction outputs. Mitigate via friction, confidence cues, sampling forced-review.

Drift

Quality decays as world or model changes. Mitigate via monitoring, scheduled re-evals, alerting.

9.2 Regulatory landscape (2026 snapshot)

Framework	Scope	What it asks of you
EU AI Act	Anyone serving EU users	Risk-class your system; conformity assessment for high-risk; transparency for limited-risk; GPAI documentation
NIST AI RMF	US, voluntary but referenced by gov contracts	Govern, Map, Measure, Manage — produce documentation across all four
ISO/IEC 42001	Org-wide AI management	Auditable management system; common in enterprise procurement
Sector-specific	HIPAA, GLBA, FDA SaMD, etc.	Pre-existing rules apply with AI-specific guidance overlays

PM action: on day one of any new AI initiative, write the system's EU AI Act risk classification on the first page of the PRD. If it is high-risk, route procurement, legal, and security before design — not after.

9.3 Red-team in product cadence

Before each launch: structured red-team with adversarial prompts spanning jailbreaks, prompt injection, biased inputs, and worst-case user intents.
Monthly: a rotating internal team tries to break the system using new techniques. Findings convert to eval cases.
External: for high-risk systems, contract a third-party red-team annually.

10Build vs. Buy vs. Fine-tune vs. RAG

A decision tree, not a debate.

Fig. 8 — Build/buy decision tree. Most enterprise AI products live in “Frontier API + RAG” for years before earning the right to fine-tune.

11Pricing, Unit Economics & Cost Control

AI products break the per-seat SaaS model. Consumption costs are real, variable, and unforgiving.

11.1 Pricing model menu

Model	Best for	Risk
Per-seat	Predictable usage; co-pilot products	Heavy users subsidize light; margin erosion
Per-task / outcome	Agentic products doing discrete jobs	Need clear unit of value; gaming risk
Usage / token	Developer / API products	Hard to forecast; bill shock
Tiered with credits	Mid-market with variability	Complexity; renegotiation cycles
Outcome-based	Replacement-of-labor positioning	Attribution; long sales cycle

11.2 The cost stack and where PMs control it

Fig. 9 — Inference dominates most AI products' cost. Routing (small model for easy tasks, frontier for hard) is the highest-leverage PM decision.

Cost control (how I run it): (1) instrument cost-per-successful-task as a first-class metric; (2) build a model router with quality-cost-latency curves per task; (3) cache aggressively at the semantic level; (4) prompt-compress and trim context; (5) re-run unit economics every major model release — frontier prices fall ~70%/year.

12Team Topology & RACI

AI teams have more disciplines than classical product teams — and more handoffs. The PM is the seam-stitcher.

Fig. 10 — A typical AI product pod. PM doesn't manage every line, but is the only role with line-of-sight to every dependency.

12.1 RACI for an AI feature launch

Activity	PM	ML Eng	Design	Data	Legal
Problem framing & success metric	A/R	C	C	C	I
Eval set curation	A	R	C	R	C
Architecture choice	A	R	I	C	C
UX of uncertainty	A	C	R	I	C
Risk class & disclosures	A	C	C	I	R
Launch decision (go/no-go)	A	R	R	R	R

13Roadmapping & Prioritization

Roadmaps for AI products are theses, not Gantt charts. Bet on capabilities, not features.

13.1 The three-horizon AI roadmap

Fig. 11 — Three horizons. Most teams over-invest in H1, under-invest in H2, and confuse H3 for vision deck décor.

13.2 Prioritization: RICE-AI

Classical RICE (Reach × Impact × Confidence ÷ Effort) needs two adjustments for AI work:

Add a Risk multiplier (0–1). Discount score by expected harm and reputational cost.
Add a Defensibility kicker. Bonus for work that strengthens the data flywheel — those bets compound.

Score = (Reach × Impact × Confidence × Defensibility × (1 − Risk)) ÷ Effort

14Go-to-Market & Adoption

AI products fail in market for one reason more than any other: users don't know when to trust them. GTM solves trust before it sells capability.

14.1 Adoption curve archetypes

Augment — the product helps users do something faster. Easy wins; low risk; price ceiling lower.
Automate — the product does something for the user. Harder trust curve; higher pricing power.
Amplify — the product lets users do new things they could not do before. Hardest GTM; biggest category prize.

14.2 The trust ladder

Fig. 12 — Trust ladder. Most B2B AI products should ship at “Draft” and graduate users to “Execute” per category as quality is proven.

14.3 GTM motions that work

Land with a single high-value task. Not a platform pitch — a job done better, today.
Show, then sell. Live demos on the buyer's data outperform every other content asset.
Champion enablement. Train one power user per account; document their wins; spread by anecdote.
ROI math, conservatively framed. Convert task-time saved or error-rate reduced into dollars with the customer in the room.
Transparency as marketing. Publish your eval methodology and limitations. Buyers reward honesty.

15Common Pitfalls & Anti-Patterns

Demo-driven development. Shipping the demo as the product. The demo path is curated; production isn't. Build for the long tail from day one.

No eval, no progress. Without a golden set, every prompt change is a guess. Subjective “feels better” is not engineering.

Cost discovered at scale. Pricing locked before unit economics modeled. Then growth = losses.

UX without uncertainty. Confident-looking output with no source, no confidence, no undo. Each error erodes the whole product.

Capability fishing. “What can we do with AI?” is the wrong question. “What painful job is now possible?” is the right one.

One big launch. AI products improve by living in production. Gradual rollouts >> big bangs.

Outsourcing risk. “Legal will handle it.” They won't. Risk class belongs on page 1 of the PRD.

Frontier dependency without exit. No model abstraction, no routing, no second vendor. One vendor incident = product outage.

16The 30 / 60 / 90-Day Plan for a New AI PM

Fig. 13 — A 90-day plan biased toward learning before bets, and bets before shipping.

17Tooling Stack Reference (2026)

Categories worth knowing, with archetypes. Tool choices change yearly; categories don't.

Category	What it does	Archetypes
Frontier model APIs	Hosted LLMs / multimodal	Claude · GPT · Gemini
Open-weights serving	Self-host models	vLLM · TGI · Ollama
Orchestration	Chains, agents, tool use	LangChain · LlamaIndex · in-house
Vector DB	Embeddings & retrieval	pgvector · Pinecone · Weaviate · Qdrant
Evals & observability	Test, monitor, alert	Braintrust · LangSmith · Arize · Helicone
Prompt management	Version, A/B, deploy prompts	PromptLayer · in-house registry
Annotation / labeling	Ground-truth creation	Label Studio · Scale · Surge
Guardrails / safety	Filter, classify, redact	Llama Guard · NeMo Guardrails · in-house
Cost / routing	Smart model selection	LiteLLM · Martian · in-house router
Governance	Risk register, model cards	Internal wiki · Credo AI · ModelOp

18Closing: The Durable AI Product Manager

Models and vendors churn fast. Judgment doesn’t: framing the problem, setting the bar, designing for failure, and running a compounding quality loop.

Be skeptical of the model. Be relentless about the user. Be precise about the bar. Be honest about the risk. Ship.

Nail that, and you beat most roadmaps full of clever features. The stack will move; the job of the PM won’t.

One-page summary · (1) Probabilistic product = distribution, not feature · (2) Evals are the spec · (3) UX absorbs uncertainty · (4) Data flywheel or no moat · (5) Risk class on page 1 · (6) Cost is a feature · (7) Trust earned in rungs, not leaps · (8) The PM owns the seams.

19References & sources

Below is the reading list behind this note: KPI sources (§1), discovery and metrics (§6–8), evals and RAG (§7, §10), risk (§9), economics and delivery (§11–13). Survey numbers change every release—pull the publisher’s current PDF before you cite a statistic.

For formal citations, prefer stable URLs and publisher PDFs. arXiv preprints are fine where that’s the canonical version; vendor docs are for day-to-day work, not your bibliography.

Industry context & adoption (KPI strip, §1)

McKinsey & Company, “The state of AI.” Annual survey series—enterprise adoption and economics. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
Boston Consulting Group, AI maturity / value surveys (e.g., AI maturity index materials). Use the year labeled on the report you download.
Gartner research on AI deployment and pilot-to-scale rates—access via subscription or public summaries; verify headline stats against the underlying note.
Andreessen Horowitz (a16z), enterprise AI surveys & data infrastructure essays. Useful directional context for B2B AI GTM (§14).

Product discovery, strategy & UX metrics (§2–8)

Cagan, Inspired: How to Create Tech Products Customers Love. Wiley—product discovery and empowered teams; maps to lifecycle §3–6.
Ries, The Lean Startup. Crown Business—validated learning and experiment design.
Torres, Continuous Discovery Habits. Product Talk—interview cadence and opportunity mapping (§6).
Olsen, The Lean Product Playbook. Wiley—hypothesis-driven PM process.
Rodden et al., “Measuring the User Experience on a Large Scale” (HEART framework). Google / CHI lineage—maps to §8 HEART.
Christensen et al., “Know Your Customers’ Jobs to Be Done.” Harvard Business Review, 2016—JTBD framing for “one job, done better.”

Software delivery, teams & operating models (§12–13)

Forsgren, Humble & Kim, Accelerate—DORA metrics and delivery performance. IT Revolution.
Skelton & Pais, Team Topologies—stream-aligned teams vs platform; RACI adjacency in §12.

Machine learning systems & AI product engineering (§5, §7, §10)

Huyen, Designing Machine Learning Systems. O’Reilly—data, deployment, and monitoring vocabulary for PMs working with engineers.
Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020—RAG baseline. https://arxiv.org/abs/2005.11401
Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023—RAG eval metrics named in tooling discussions. https://arxiv.org/abs/2309.15217
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023—human-calibrated judging. https://arxiv.org/abs/2306.05685
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022—fine-tuning trade-space in §10. https://arxiv.org/abs/2106.09685

Safety, security & AI governance (§9)

European Union, Artificial Intelligence Act (Regulation (EU) 2024/1689). Primary legal text. EUR-Lex
NIST, AI Risk Management Framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework
ISO/IEC 42001—AI management system standard (purchase from ISO or national body).
OWASP Top 10 for Large Language Model Applications. OWASP project page
OECD AI Principles—high-level policy framing. https://oecd.ai/en/ai-principles

Pricing, platforms & economics of information (§11, §14)

Christensen, The Innovator’s Dilemma. Harvard Business Review Press—incumbent dynamics when new capabilities (e.g., AI) reshape value chains.
Shapiro & Varian, Information Rules: A Strategic Guide to the Network Economy. Harvard Business Press—versioning, bundling, and metering strategies; useful mental model for token / usage pricing.

Reliability & cost discipline (§11)

Google, Site Reliability Engineering (free). O’Reilly / Google—SLOs and error budgets when AI features share production services. SRE book
Dean & Barroso, “The Tail at Scale.” Communications of the ACM, 2013—why tail latency matters for AI UX.

KPI strip at the top: Re-check McKinsey, BCG, Gartner, and a16z figures against each publisher’s latest report before you drop them in a deck—headlines and years shift.

The AI Product ManagerOperating Manual

Contents

01The New Mandate of the AI Product Manager

02Competency Model: The T-Shaped AI PM

03The AI Product Lifecycle

04The AI Strategy Canvas

05Technical Foundations Every AI PM Needs

5.1 The model landscape (2026)

5.2 The four common architectures — and when to choose each

5.3 Tokens, context, latency, cost — the four levers

06Discovery, Problem Framing & Validation

6.1 The Three-Question Filter

6.2 The Wizard-of-Oz prototype

6.3 Risk-classed problem framing

07Data, Evals & the Quality Loop

7.1 The PM's eval responsibilities

7.2 The data flywheel

08Metrics: North-Star, HEART & AI Guardrails

8.1 Worked example: an AI support assistant

09Risk, Safety, Ethics & Governance

9.1 Failure modes the PM must name

Hallucination

Bias & fairness

Prompt injection

Data leakage

Over-reliance

Drift

9.2 Regulatory landscape (2026 snapshot)

9.3 Red-team in product cadence

10Build vs. Buy vs. Fine-tune vs. RAG

11Pricing, Unit Economics & Cost Control

11.1 Pricing model menu

11.2 The cost stack and where PMs control it

12Team Topology & RACI

12.1 RACI for an AI feature launch

13Roadmapping & Prioritization

13.1 The three-horizon AI roadmap

13.2 Prioritization: RICE-AI

14Go-to-Market & Adoption

14.1 Adoption curve archetypes

14.2 The trust ladder

14.3 GTM motions that work

15Common Pitfalls & Anti-Patterns

16The 30 / 60 / 90-Day Plan for a New AI PM

17Tooling Stack Reference (2026)

18Closing: The Durable AI Product Manager

19References & sources

Industry context & adoption (KPI strip, §1)

Product discovery, strategy & UX metrics (§2–8)

Software delivery, teams & operating models (§12–13)

Machine learning systems & AI product engineering (§5, §7, §10)

Safety, security & AI governance (§9)

Pricing, platforms & economics of information (§11, §14)

Reliability & cost discipline (§11)

The AI Product Manager
Operating Manual