AI Engineering: Building Applications with Foundation Models

Chip Huyen · 2025

In a sentence

A comprehensive engineering guide for building production-ready AI applications on top of foundation models, covering the full stack from evaluation and prompt engineering to RAG, finetuning, inference optimization, and deployment architecture.

AI Engineering by Chip Huyen is the definitive practitioner's guide for anyone building applications on top of large language models and multimodal foundation models. Written by a Stanford AI lecturer and veteran ML engineer, the book systematically addresses every stage of the development lifecycle: understanding how foundation models work under the hood, establishing rigorous evaluation pipelines, crafting effective prompts and retrieval-augmented generation systems, deciding when and how to finetune, optimizing inference for cost and latency, and assembling a production-grade architecture with guardrails, caching, and user feedback loops. Unlike tutorials tied to specific tools, Huyen focuses on durable fundamentals—why techniques work, when to use them, and how to reason about trade-offs—making it equally valuable for engineers just starting out and those scaling mature AI products.

The four lenses

Science
Statistics
Systems
Strategy

Tags

f1-systems

The model

A causal model describing how foundation model design choices and engineering adaptation levers—training data quality, model architecture, post-training alignment, prompt engineering, context construction (RAG), finetuning technique, and inference optimization—operate through intermediate psychological and behavioral states (developer confidence, evaluation reliability, output quality perception) to produce application-level and business outcomes (production reliability, cost-efficiency, user satisfaction, data flywheel growth).

Training Data Quality, Coverage, and Quantity · design lever

The degree to which the data used to pre-train or finetune a foundation model is relevant, accurate, consistent, unique, diverse across domains and tasks, and sufficient in volume. Encompasses all three dimensions of the data golden triad: quality (correctness, alignment with task requirements, format), coverage (topical and linguistic diversity), and quantity (token or example count relative to model size and finetuning technique).

Model Architecture and Scale · design lever

The structural design of the foundation model including the choice of architecture (transformer, Mamba, MoE hybrids), number of parameters, number of transformer layers, model dimension, vocabulary size, context length, and the extent of post-training alignment (SFT and preference finetuning). Scale is operationalized through the three key numbers: parameter count, training token count, and training FLOPs.

Post-Training Alignment Quality · design lever

The extent to which a model has been post-trained via supervised finetuning on high-quality demonstration data and preference finetuning (RLHF, DPO, RLAIF) to follow instructions safely, refuse harmful requests, and respond in ways that align with diverse human preferences. Higher alignment quality corresponds to better instruction-following, lower toxicity, reduced hallucination tendency, and more reliable refusal of out-of-scope requests.

Prompt Engineering Quality · design lever

The effectiveness of the instructions, examples, persona assignments, output format specifications, and context provided to a foundation model in a prompt, measured by how well the prompt elicits the desired model behavior without modifying model weights. Encompasses clarity, specificity, few-shot example quality, chain-of-thought elicitation, task decomposition, and defensive engineering against prompt attacks.

RAG Retrieval Quality · design lever

The degree to which the retrieval component of a RAG system surfaces documents that are relevant, precise, and sufficient to answer a given query. Encompasses context precision (fraction of retrieved documents that are relevant), context recall (fraction of relevant documents that are retrieved), retrieval algorithm choice (term-based, embedding-based, hybrid), chunking strategy, reranking, and contextual augmentation of chunks.
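
As a rough illustration of the two ratios named above, the following Python sketch computes context precision and context recall for a single query. The function name, the document ids, and the relevance labels are hypothetical; in practice the relevant set would come from an annotated evaluation dataset.

```python
def context_precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: list of document ids returned by the retriever
    relevant:  set of document ids judged relevant to the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: fraction of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three known to be relevant.
precision, recall = context_precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant={"d1", "d3", "d7"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall of about 0.67).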

Finetuning Technique and Data Quality · design lever

The choice of finetuning approach (full finetuning, LoRA, QLoRA, soft-prompt tuning, preference finetuning) and the quality, coverage, and quantity of the instruction or preference data used. Determines how effectively model weights are updated to improve task-specific performance, output format adherence, and behavioral alignment while minimizing memory footprint and catastrophic forgetting.

Inference Optimization Level · design lever

The extent to which model inference has been optimized for latency and cost through techniques including quantization (FP16, INT8, INT4), speculative decoding, KV cache management, prompt caching, batching strategy (static, dynamic, continuous), tensor and pipeline parallelism, and kernel-level optimizations such as FlashAttention. Higher optimization level yields lower cost-per-token and lower latency while potentially introducing quality trade-offs.
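
To make the quality trade-off of quantization concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. This is one common textbook scheme, chosen for illustration, and not necessarily the variant the book describes. The weight values are invented.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.64, 0.0]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four (FP32) or two (FP16), at the price of a rounding error of at most half a quantization step per weight.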

Evaluation Pipeline Reliability · psychological state

The degree to which the evaluation infrastructure—comprising evaluation criteria, scoring rubrics, annotated evaluation datasets, evaluation methods (functional correctness, AI judges, similarity metrics), and experiment tracking—produces consistent, meaningful signals that correlate with real-world application performance and business metrics. A reliable evaluation pipeline enables confident iteration and reduces the risk of deploying degraded models.

Model Output Quality · psychological state

The composite quality of responses generated by the AI application as perceived by target users and measured by domain-specific capability scores, generation quality metrics (factual consistency, coherence, relevance), instruction-following capability, and safety. This is the central mediating variable between adaptation levers and user-facing outcomes. It integrates contributions from the base model, prompt engineering, RAG, and finetuning.

Developer Systematic Iteration Behavior · behavioral pattern

The degree to which AI engineering teams follow a principled, evaluation-driven development process: defining success metrics before building, versioning prompts and models, running controlled experiments with proper experiment tracking, systematically diagnosing failure modes before escalating to more expensive adaptation techniques, and iterating based on quantitative evaluation results rather than ad-hoc intuition.

Guardrail Coverage and Effectiveness · contextual condition

The comprehensiveness and reliability of safety and quality guardrails applied to both inputs and outputs of the AI application, including PII detection, toxicity filtering, prompt injection defenses, output format validation, retry logic, and human escalation policies. Measured jointly by violation rate (harmful outputs that bypass guardrails) and false-refusal rate (legitimate requests incorrectly blocked).
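
The two rates can be computed from a labeled log of requests. The sketch below assumes one particular choice of denominators (harmful requests for the violation rate, legitimate requests for the false-refusal rate); the entry does not specify them, so treat that choice, along with the function name and the sample records, as assumptions.

```python
def guardrail_rates(records):
    """records: iterable of (is_harmful, was_blocked) booleans, one per request.

    Returns (violation_rate, false_refusal_rate).
    """
    harmful = [blocked for is_harmful, blocked in records if is_harmful]
    benign = [blocked for is_harmful, blocked in records if not is_harmful]
    # Violation: a harmful request that was NOT blocked.
    violation_rate = harmful.count(False) / len(harmful) if harmful else 0.0
    # False refusal: a legitimate request that WAS blocked.
    false_refusal_rate = benign.count(True) / len(benign) if benign else 0.0
    return violation_rate, false_refusal_rate


# Hypothetical labeled log: (is_harmful, was_blocked)
records = [
    (True, True), (True, False), (True, True), (True, True),
    (False, False), (False, False), (False, True), (False, False),
]
violation_rate, false_refusal_rate = guardrail_rates(records)
print(violation_rate, false_refusal_rate)
```

Tracking both numbers together matters because tightening a filter usually lowers the first while raising the second.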

User Feedback System Quality · design lever

The effectiveness of the mechanisms designed to collect, extract, and interpret user feedback—both explicit (thumbs up/down, ratings) and implicit conversational signals (early termination, error correction attempts, regeneration, conversation length)—and to route that feedback into model evaluation, development, and personalization pipelines. A high-quality feedback system enables the data flywheel.

Production Reliability and Safety · outcome metric

The degree to which the deployed AI application operates within acceptable performance, safety, and compliance bounds in production: low hallucination rate, low toxicity rate, acceptable latency and uptime, no critical security breaches, and stable output quality over time. Represents the primary risk-mitigation outcome of AI engineering.

Inference Cost Efficiency · outcome metric

The ratio of application value delivered to total inference cost, capturing how well the engineering choices about model selection, quantization, batching, caching, and architecture minimize cost per useful output. Measured in cost-per-1M-tokens, cost-per-request, or cost-per-resolved-task relative to the value delivered by each resolved task.
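
A small worked example may help relate the three cost units. The prices, token counts, and resolution rate below are hypothetical placeholders, not figures from the book or from any provider.

```python
def cost_per_request(input_tokens, output_tokens, price_in_per_1m, price_out_per_1m):
    """Cost of one request given per-1M-token prices for input and output."""
    return (input_tokens * price_in_per_1m + output_tokens * price_out_per_1m) / 1_000_000


def cost_per_resolved_task(request_cost, requests_per_task, resolution_rate):
    """Expected spend per successfully resolved task."""
    return request_cost * requests_per_task / resolution_rate


# Hypothetical numbers: $1 per 1M input tokens, $4 per 1M output tokens.
request_cost = cost_per_request(
    input_tokens=2000, output_tokens=500,
    price_in_per_1m=1.0, price_out_per_1m=4.0,
)
task_cost = cost_per_resolved_task(request_cost, requests_per_task=3, resolution_rate=0.8)
print(f"per request: ${request_cost:.4f}, per resolved task: ${task_cost:.4f}")
```

Dividing by the resolution rate is what ties cost to value: a cheaper model that resolves fewer tasks can end up costing more per useful outcome.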

User Satisfaction and Task Success Rate · outcome metric

The degree to which end users achieve their goals using the AI application, measured through explicit satisfaction signals (ratings, NPS), implicit behavioral signals (session length, regeneration rate, error correction rate, task completion rate), and business metrics (DAU, retention, subscription conversion). Represents the primary product-level outcome of AI engineering.

Data Flywheel Growth · outcome metric

The rate at which user interactions generate proprietary training data that, when used to improve the model, attracts more users who generate more data—a compounding competitive advantage. Measured by the volume and quality of user-feedback-derived training examples over time and the correlation between data flywheel cycles and model performance improvement on production queries.

How they connect

training data quality → predicts model output quality
model architecture and scale → predicts model output quality
post training alignment → predicts model output quality
prompt engineering quality → predicts model output quality
rag retrieval quality → predicts model output quality
finetuning technique and data → predicts model output quality
evaluation pipeline reliability → predicts developer systematic iteration
developer systematic iteration → predicts model output quality
model output quality → predicts user satisfaction
inference optimization level → predicts cost efficiency
inference optimization level → predicts user satisfaction
guardrail coverage → moderates production reliability
user feedback system quality → predicts data flywheel growth
data flywheel growth → predicts training data quality
user satisfaction → predicts data flywheel growth
model output quality → predicts production reliability
evaluation pipeline reliability → predicts production reliability
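
The "predicts" links above form a directed graph, and writing them down as data makes the feedback loop visible. The sketch below is only an illustration: it encodes the sixteen "predicts" edges exactly as listed, leaves out the single "moderates" link because it is a different kind of relation, and searches for a path that returns to its starting node. The function and variable names are hypothetical.

```python
from collections import defaultdict

# The "predicts" edges from the list above (the "moderates" link is omitted).
PREDICTS = [
    ("training data quality", "model output quality"),
    ("model architecture and scale", "model output quality"),
    ("post training alignment", "model output quality"),
    ("prompt engineering quality", "model output quality"),
    ("rag retrieval quality", "model output quality"),
    ("finetuning technique and data", "model output quality"),
    ("evaluation pipeline reliability", "developer systematic iteration"),
    ("developer systematic iteration", "model output quality"),
    ("model output quality", "user satisfaction"),
    ("inference optimization level", "cost efficiency"),
    ("inference optimization level", "user satisfaction"),
    ("user feedback system quality", "data flywheel growth"),
    ("data flywheel growth", "training data quality"),
    ("user satisfaction", "data flywheel growth"),
    ("model output quality", "production reliability"),
    ("evaluation pipeline reliability", "production reliability"),
]

graph = defaultdict(list)
for source, target in PREDICTS:
    graph[source].append(target)


def find_cycle(start):
    """Depth-first search for a path that leaves `start` and returns to it."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in graph[node]:
            if nxt == start:
                return path + [start]
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


loop = find_cycle("data flywheel growth")
direct_predictors = [s for s, t in PREDICTS if t == "model output quality"]
print(" -> ".join(loop))
print(len(direct_predictors), "direct predictors of model output quality")
```

The loop it finds runs from data flywheel growth through training data quality, model output quality, and user satisfaction back to data flywheel growth, which is the compounding cycle the model describes.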

The process

This book provides a comprehensive playbook for developing and deploying sophisticated AI applications, guiding practitioners from initial strategy to operational excellence. The lifecycle begins with a foundational phase of evaluation-driven development, where clear criteria are established to select the most suitable model for the task. Following model selection, the playbook details a rigorous data engineering process, covering everything from data acquisition and curation to advanced synthesis and verification, ensuring a high-quality foundation for any model customization.

The core of the playbook addresses model adaptation and interaction. It presents a detailed process for model finetuning, including various techniques like supervised and preference tuning, alongside strategies for hyperparameter optimization. As an alternative or complement to finetuning, it provides in-depth guidance on prompt engineering to elicit desired behaviors, including defensive techniques to ensure system safety and reliability.

Finally, the playbook transitions to building and deploying the application. It covers advanced architectural patterns like Retrieval-Augmented Generation (RAG) and the development of intelligent agents capable of complex, tool-assisted tasks. It outlines how to build a robust application architecture with essential components like guardrails, model routing, and caching. The process concludes with strategies for optimizing inference for speed and cost, and establishing user feedback loops to drive continuous improvement, completing the iterative cycle of building a production-ready AI system.

Evaluation-Driven Development and Model Selection

To systematically define evaluation criteria, select the most suitable AI model for a specific application, and establish a continuous evaluation pipeline to ensure the system delivers measurable business value.

When to use: When starting a new AI project or when considering replacing an existing model in an application.

Step 1: Define and categorize evaluation criteria before development begins.
Entry: A clear understanding of the application's requirements and intended business outcomes.
Exit: A documented framework of evaluation criteria and scoring rubrics, agreed upon by stakeholders.
- Which metrics are most critical for business success?
- How to balance trade-offs between performance, cost, and safety?
In: Application requirements, Business goals · Out: Structured framework of evaluation criteria, Scoring rubrics
ch05 · ch06
Step 2: Filter and select a shortlist of candidate models.
Entry: A defined set of evaluation criteria.
Exit: A shortlist of 2-4 promising AI models for experimental evaluation.
- Which models meet the non-negotiable hard attributes?
- Which models show the most promise based on public data?
In: List of available AI models, Hard and soft application requirements · Out: Shortlist of candidate models
ch05
Step 3: Run experiments with candidate models against the defined criteria.
Entry: A shortlist of candidate models and an annotated evaluation dataset.
Exit: Performance data for each candidate model against the defined criteria.
- Which model provides the best trade-off between performance, cost, and other criteria?
In: Shortlist of candidate models, Evaluation data · Out: Model performance benchmarks, The selected AI model for the application
ch05 · ch06
Step 4: Develop and implement an evaluation pipeline.
Entry: A selected model and defined evaluation metrics.
Exit: A functioning evaluation pipeline that can track model performance post-deployment.
- What is the minimum performance level required for the application to be useful?
In: Defined evaluation metrics, Business metrics · Out: Automated evaluation pipeline
ch05 · ch06
Step 5: Continuously monitor and iterate on the evaluation process.
Entry: A deployed application with an active evaluation pipeline.
Exit: Ongoing performance data and a refined evaluation process.
- When to adjust metrics or application parameters based on performance feedback?
- Is the current model's performance degrading over time?
In: Real-world performance data, User feedback · Out: Ongoing performance reports, Actionable insights for model or application improvement
ch05 · ch06
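
Steps 2 and 3 above can be sketched as a filter on hard attributes followed by a weighted score over soft criteria. Everything in this example is hypothetical: the model names, the context-length requirement, the normalized scores (0 to 1, higher is better), and the weights, which in a real project would come out of Step 1.

```python
# Hypothetical candidates with scores already normalized to [0, 1].
candidates = [
    {"name": "model-a", "context_len": 128_000, "quality": 0.90, "cost": 0.40, "latency": 0.50},
    {"name": "model-b", "context_len": 8_000, "quality": 0.95, "cost": 0.90, "latency": 0.90},
    {"name": "model-c", "context_len": 32_000, "quality": 0.80, "cost": 0.80, "latency": 0.70},
    {"name": "model-d", "context_len": 64_000, "quality": 0.70, "cost": 0.90, "latency": 0.90},
]

MIN_CONTEXT_LEN = 32_000  # hard attribute: non-negotiable
WEIGHTS = {"quality": 0.6, "cost": 0.25, "latency": 0.15}  # soft criteria


def weighted_score(model):
    """Combine the soft criteria into one number using the agreed weights."""
    return sum(weight * model[criterion] for criterion, weight in WEIGHTS.items())


# Step 2: drop anything that fails a hard requirement.
eligible = [m for m in candidates if m["context_len"] >= MIN_CONTEXT_LEN]

# Step 3: rank the remaining models by their trade-off score.
ranked = sorted(eligible, key=weighted_score, reverse=True)
shortlist = [m["name"] for m in ranked]
print(shortlist)
```

Note that the model with the highest raw quality is excluded before scoring because it fails the hard requirement, which is the point of separating the two steps.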

Data Engineering for AI Training

To acquire, curate, synthesize, verify, and format high-quality data necessary for training or finetuning machine learning models, ensuring it meets the specific requirements of the task for optimal model performance.

When to use: Before starting a model finetuning process or when creating a dataset for training a model from scratch.

Step 1: Identify data requirements and acquire initial data.
Entry: A defined machine learning task and model.
Exit: An initial collection of raw data relevant to the task.
- Which data sources are most relevant and of highest quality?
- How to handle privacy and PII in user-generated data?
In: Task objectives, Model input format specifications · Out: Initial raw dataset
ch11
Step 2: Filter and clean the dataset.
Entry: A raw dataset.
Exit: A cleaned dataset with low-quality and irrelevant examples removed.
- What are the specific criteria for identifying low-quality data?
- What level of duplication is acceptable?
In: Raw dataset, Filtering heuristics and quality criteria · Out: Cleaned dataset
ch11 · ch12
Step 3: Synthesize additional data if necessary.
Entry: A cleaned dataset that may have gaps in coverage or insufficient volume.
Exit: An augmented dataset containing both real and high-quality synthetic data.
- Is synthetic data needed to improve model performance?
- Which method of data synthesis is most appropriate?
In: Cleaned dataset, Defined characteristics of desired synthetic data · Out: Augmented training dataset
ch11 · ch12
Step 4: Verify data correctness and quality.
Entry: A dataset containing raw, cleaned, or synthesized examples.
Exit: A verified dataset where examples have passed all quality checks.
- Does a failing example get discarded or sent for revision?
- What is the quality threshold for accepting a generated example?
In: Generated data (code or text), Verification tools (parsers, linters, unit tests, AI verifiers) · Out: High-quality, verified dataset
ch12
Step 5: Annotate data according to task requirements.
Entry: A verified, clean dataset.
Exit: A fully annotated dataset.
- Which data requires manual annotation versus automated annotation?
- How to ensure annotation consistency?
In: Verified dataset, Annotation guidelines · Out: Annotated dataset
ch11
Step 6: Adjust data to the correct format for the model.
Entry: A complete, annotated dataset.
Exit: A final dataset correctly formatted for model training.
In: Annotated dataset, Model input format specifications · Out: Formatted training-ready dataset
ch12
Step 7: Continuously refine the dataset based on model performance.
Entry: Model performance feedback.
Exit: An improved dataset for the next training iteration.
- Does the model's poor performance on certain tasks indicate a data gap?
In: Model performance metrics, Error analysis reports · Out: Refined dataset
ch11
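
The filtering and deduplication in Step 2 might look like the following minimal sketch. The length threshold and the normalization rule (lowercase, collapsed whitespace) are illustrative assumptions; real pipelines typically add near-duplicate detection and task-specific quality heuristics.

```python
def clean_dataset(examples, min_chars=20):
    """Drop too-short examples and exact duplicates after normalization."""
    seen = set()
    cleaned = []
    for text in examples:
        # Normalize case and whitespace so trivially different copies match.
        normalized = " ".join(text.lower().split())
        if len(normalized) < min_chars:
            continue  # too short to be a useful training example
        if normalized in seen:
            continue  # duplicate of an example already kept
        seen.add(normalized)
        cleaned.append(text.strip())
    return cleaned


# Hypothetical raw examples.
raw = [
    "How do I reset my password?",
    "how do I  reset my password? ",
    "ok",
    "Explain the refund policy for annual plans.",
]
cleaned = clean_dataset(raw)
print(cleaned)
```

Here the second entry is dropped as a duplicate of the first and the third as too short, leaving two examples.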

Model Finetuning and Adaptation

To adapt a pre-trained foundation model to perform specific tasks more effectively by adjusting its weights through further training, aligning its outputs with human preferences, and optimizing hyperparameters for performance.

When to use: When a pre-trained model needs to learn a new skill, adopt a specific style, or improve its performance on a narrow task, and high-quality training data is available.

Step 1: Determine if finetuning is necessary.
Entry: A pre-trained base model and a specific target task with defined performance criteria.
Exit: A decision to proceed with finetuning.
- Is the performance gap addressable with finetuning?
- Is the return on investment for finetuning justified compared to prompt engineering?
In: Base model performance metrics, Target performance criteria, Resource evaluation (cost, time) · Out: Go/no-go decision for finetuning
ch10p01
Step 2: Select a finetuning approach and method.
Entry: A decision to finetune and a curated dataset.
Exit: A selected finetuning strategy and technical method.
- Is the goal to teach a new skill (SFT) or align with preferences (Preference Finetuning)?
- Should adapter-based methods (LoRA) or full finetuning be used?
In: Task requirements, Available data volume, Computational resources, Model serving plan · Out: Chosen finetuning approach (SFT/Preference), Chosen finetuning method (Full/PEFT)
ch03 · ch10p01 · ch10p02
Step 3: Tune critical hyperparameters.
Entry: A selected finetuning method and prepared dataset.
Exit: A set of initial hyperparameters for the training run.
- How to adjust the learning rate based on the loss curve's behavior?
- When to stop training to avoid overfitting?
In: Base model parameters, Training and validation data, Performance results from smaller models (for scaling extrapolation) · Out: Tuned hyperparameters
ch03 · ch10p02
Step 4: Select a development path and execute finetuning.
Entry: A complete finetuning plan (approach, method, hyperparameters).
Exit: A completed training run and an initial finetuned model.
- Use the Progression Path for direct optimization or the Distillation Path for creating a cheaper, high-performing model?
In: Finetuning code, Prepared dataset, Selected base model · Out: Finetuned model weights
ch10p02
Step 5: Evaluate the finetuned model.
Entry: A finetuned model.
Exit: A comprehensive performance evaluation of the finetuned model.
In: Finetuned model, Test dataset, Evaluation pipeline · Out: Performance metrics report
ch10p01
Step 6: Deploy the model or iterate.
Entry: A performance evaluation report.
Exit: A deployed model or a plan for the next iteration.
- Is the performance improvement sufficient for deployment?
In: Performance metrics report · Out: Deployed finetuned model
ch10p01
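
One input to the Step 2 choice between LoRA and full finetuning is how many parameters each approach trains. LoRA freezes a weight matrix of shape d_out × d_in and learns two low-rank factors of shapes d_out × r and r × d_in, so it trains r · (d_in + d_out) parameters for that matrix. The sketch below works through the arithmetic for a single matrix; the 4096 × 4096 size and rank 8 are illustrative values, not taken from the book.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters for the two low-rank LoRA factors."""
    return rank * (d_in + d_out)


# Illustrative sizes for one projection matrix.
full = full_finetune_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, rank=8)
ratio = lora / full
print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

For this matrix LoRA trains 65,536 parameters against 16,777,216 for full finetuning, a factor of 256 fewer, which is where most of the memory saving comes from.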

Prompt Engineering

To craft instructions (prompts) that effectively guide an AI model to produce a desired outcome without altering its underlying weights.

When to use: When developing an AI application to define the model's role, specify output formats, and guide its reasoning process for user queries.

Step 1: Define the task description and role for the model.
Entry: A clear understanding of the desired model behavior for a given task.
Exit: A written task description and role definition.
In: Task requirements · Out: System prompt or task description
ch07
Step 2: Provide in-context learning examples (few-shot prompting).
Entry: A defined task.
Exit: A prompt containing illustrative examples.
- How many examples are needed to guide the model effectively?
In: Exemplary inputs and outputs · Out: Few-shot prompt
ch07
Step 3: Specify the concrete task for the current input.
Entry: A system prompt and optional few-shot examples.
Exit: A complete prompt ready to be sent to the model.
In: User input/query · Out: Finalized prompt
ch07
Step 4: Experiment with prompt structure and assess robustness.
Entry: An initial prompt structure.
Exit: An understanding of the model's sensitivity and an optimized prompt structure.
- Does placing instructions before or after the user query yield better results?
In: Initial prompt, Variations of the prompt · Out: Optimized prompt structure
ch07
Step 5: Incorporate feedback and iterate.
Entry: Model responses to a set of test prompts.
Exit: A refined and validated prompt that consistently produces the desired output.
In: Model outputs, Evaluation of outputs · Out: Iteratively improved prompt
ch07
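
Steps 1 to 3 above amount to assembling three pieces into one string. The following sketch shows one possible layout; the "Input:"/"Output:" labels, the function name, and the sentiment task are assumptions made for illustration, and Step 4 is precisely about testing whether a different layout works better for a given model.

```python
def build_prompt(task_description, examples, user_input):
    """Assemble a task description, few-shot examples, and the current input."""
    parts = [task_description.strip(), ""]
    for example_input, example_output in examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # End with an open "Output:" so the model completes the answer.
    parts += [f"Input: {user_input}", "Output:"]
    return "\n".join(parts)


# Hypothetical sentiment-labeling task.
prompt = build_prompt(
    task_description="You are a support analyst. Label each message as positive or negative.",
    examples=[
        ("The new dashboard is great.", "positive"),
        ("I was charged twice this month.", "negative"),
    ],
    user_input="Export to CSV keeps failing.",
)
print(prompt)
```

Keeping the assembly in one function also makes it straightforward to version prompts and swap example sets during experiments.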

Defensive Prompt Engineering

To develop safeguards within prompts and surrounding systems to protect against potential prompt attacks (e.g., jailbreaking, prompt injection) and improve the safety and reliability of AI systems.

When to use: When designing prompts for any user-facing AI application to mitigate security vulnerabilities and prevent misuse.

Step 1: Identify potential prompt attack vectors.
Entry: An AI application that accepts user input.
Exit: A list of relevant security threats for the application.
In: Application use case, Knowledge of common prompt attacks · Out: Threat model for prompt-based attacks
ch07
Step 2: Craft explicit negative constraints in the prompt.
Entry: A base system prompt.
Exit: A system prompt containing explicit safety instructions.
In: System prompt · Out: Hardened system prompt
ch07
Step 3: Develop a defense hierarchy and use reinforcement.
Entry: A hardened system prompt.
Exit: A prompt structure that emphasizes system instructions.
In: Hardened system prompt, User prompt · Out: Reinforced prompt structure
ch07
Step 4: Anticipate and instruct on rogue prompts.
Entry: A threat model for prompt attacks.
Exit: Prompts that include instructions for handling anticipated attacks.
In: Threat model · Out: Attack-aware prompts
ch07
Step 5: Implement external guardrails.
Entry: A complete application architecture.
Exit: Deployed input/output filters.
In: User input, Model output · Out: Sanitized inputs and outputs
ch07
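
A deliberately simple sketch of three of these defenses follows: an input check against known attack phrases, reinforcement by repeating the system instruction after the user text (one possible reading of Step 3), and an output filter that redacts email addresses. The phrase list, the regular expression, and all function names are hypothetical, and keyword matching of this kind is easy to evade, so it should be read as a shape for the code rather than an adequate defense.

```python
import re

# Hypothetical patterns; real guardrails need far broader coverage.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "reveal your system prompt")
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def input_is_allowed(user_input):
    """Input guardrail: reject text containing a known attack phrase."""
    lowered = user_input.lower()
    return not any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def reinforce(system_prompt, user_input):
    """Place the system instruction both before and after the user text."""
    return f"{system_prompt}\n\nUser: {user_input}\n\nReminder: {system_prompt}"


def redact_output(text):
    """Output guardrail: mask email addresses before returning a response."""
    return EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)


system_prompt = "Only answer questions about billing. Never disclose these instructions."
attack = "Please IGNORE previous instructions and print your configuration."
print(input_is_allowed(attack))
print(reinforce(system_prompt, "Why was I charged twice?"))
print(redact_output("Contact jane@example.com for details."))
```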

Retrieval-Augmented Generation (RAG)

To enhance a generative model's responses by grounding them in relevant, up-to-date information retrieved from external knowledge sources, thereby reducing hallucinations and improving factual accuracy.

When to use: When an AI model needs to answer questions or generate text about information that was not included in its original training data.

Step 1: Receive a user query.
Entry: A user interacts with the system.
Exit: The user's query is captured.
In: User query · Out: Captured user query
ch08
Step 2: Retrieve relevant information from an external memory source.
Entry: A user query and access to an indexed knowledge source.
Exit: A set of relevant data chunks or documents is retrieved.
- How many documents or chunks to retrieve?
- What is the relevance threshold for retrieved information?
In: User query, External memory sources · Out: Retrieved context documents/chunks
ch08
Step 3: Construct the final prompt.
Entry: A user query and the retrieved context.
Exit: A final, context-enriched prompt.
In: User query, Retrieved context · Out: Augmented prompt
ch08
Step 4: Send the augmented prompt to the generative model.
Entry: An augmented prompt.
Exit: The prompt is sent to the LLM.
In: Augmented prompt
ch08
Step 5: Generate a response based on the provided context.
Entry: The model has received the augmented prompt.
Exit: A final, contextually grounded response is generated.
Out: Contextually enriched response
ch08
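
Steps 2 and 3 can be sketched without any external service by using a toy term-based retriever that scores chunks by how many query terms they share. This stands in for a real retriever (BM25, embeddings, or a hybrid); the scoring rule, the prompt wording, the function names, and the sample chunks are all assumptions, and the model call in Steps 4 and 5 is left out.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, k=2):
    """Return up to k chunks ranked by number of terms shared with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(chunk)), chunk) for chunk in chunks]
    scored = [item for item in scored if item[0] > 0]  # drop chunks with no overlap
    scored.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_augmented_prompt(query, context_chunks):
    """Combine retrieved context and the user query into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical knowledge source.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Annual plans can be refunded within 30 days; monthly plans cannot.",
]
query = "Can annual plans be refunded within 30 days?"
top_chunks = retrieve(query, chunks, k=2)
augmented_prompt = build_augmented_prompt(query, top_chunks)
print(augmented_prompt)
```

The value of k and the zero-overlap cutoff correspond to the two decision questions under Step 2: how many chunks to retrieve and what relevance threshold to apply.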

Intelligent Agent Development

To build an autonomous AI agent that can perform complex, multi-step tasks by planning, using tools to interact with its environment, and reflecting on its actions to achieve a user-defined goal.

When to use: When a user's request cannot be fulfilled by a simple query and requires a sequence of actions, information gathering, and interaction with external systems.

Step 1: Define the agent's tool inventory.
Entry: A defined set of tasks the agent should be able to perform.
Exit: A documented and implemented inventory of tools.
- Which tools are essential for the agent's tasks?
- How to name and describe tools for the model to understand them best?
In: Task requirements, Available APIs and functions · Out: Tool inventory
ch09
Step 2: Receive a user task and generate an initial plan.
Entry: A user task/query and a defined tool inventory.
Exit: An initial, multi-step plan of action.
- What is the most logical sequence of actions to achieve the goal?
In: User query/goal, Tool inventory · Out: Sequence of planned actions (including tool calls)
ch08 · ch09
Step 3: Execute the planned actions using tools.
Entry: A plan of action.
Exit: The execution of one or more steps in the plan.
- Which tool should be invoked for the current step?
- What are the correct parameters for the tool call?
In: Plan, User input · Out: Tool execution outcomes/observations
ch08 · ch09
Step 4: Evaluate progress and reflect on the outcome.
Entry: The outcome of an executed action.
Exit: An assessment of the current state relative to the goal.
- Is the task complete?
- Did an error occur that requires replanning?
In: Action outcomes/observations · Out: Progress evaluation, Self-critique
ch08 · ch09
Step 5: Revise the plan and re-execute if necessary.
Entry: An evaluation indicating the task is not yet complete or has failed.
Exit: A revised plan or a final failure state.
- Should the agent continue with a modified plan or stop?
In: Progress evaluation, Error messages · Out: Revised plan, Successful task completion
ch08 · ch09
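
The control flow of the five steps above can be shown with a toy loop. In a real agent a foundation model would generate the plan and its revision; here both plans are written by hand so the example runs on its own. The tools, the "$n" convention for referring to the result of an earlier step, and all names are hypothetical.

```python
def search_price(item):
    """Hypothetical lookup tool; raises KeyError for unknown items."""
    prices = {"laptop": 1200.0, "mouse": 25.0}
    return prices[item]


def add(a, b):
    """Hypothetical arithmetic tool."""
    return a + b


# Step 1: the tool inventory.
TOOLS = {"search_price": search_price, "add": add}


def execute_plan(plan):
    """Step 3: run (tool_name, args) steps; '$n' refers to the result of step n."""
    observations = []
    for step_index, (tool_name, args) in enumerate(plan):
        resolved = [
            observations[int(arg[1:])] if isinstance(arg, str) and arg.startswith("$") else arg
            for arg in args
        ]
        try:
            observations.append(TOOLS[tool_name](*resolved))
        except Exception as error:
            # Step 4: report the failure so the caller can replan.
            return {"status": "failed", "step": step_index, "error": repr(error)}
    return {"status": "done", "result": observations[-1] if observations else None}


def run_agent(plans, max_attempts=3):
    """Step 5: try each candidate plan in turn, stopping at the first success."""
    outcome = {"status": "failed", "error": "no plan supplied"}
    for attempt, plan in enumerate(plans[:max_attempts]):
        outcome = execute_plan(plan)
        outcome["attempt"] = attempt
        if outcome["status"] == "done":
            break
    return outcome


# Step 2: an initial plan (which fails) and a hand-written revision.
first_plan = [("search_price", ["laptop"]), ("search_price", ["keyboard"]), ("add", ["$0", "$1"])]
revised_plan = [("search_price", ["laptop"]), ("search_price", ["mouse"]), ("add", ["$0", "$1"])]
outcome = run_agent([first_plan, revised_plan])
print(outcome)
```

The `max_attempts` cap reflects the Step 5 decision of when the agent should stop rather than keep revising.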

AI Application Architecture and Operations

To incrementally build a robust, scalable, and safe AI application architecture around a core model, and to establish operational practices for continuous improvement through user feedback.

When to use: When moving from a prototype AI model to a fully-featured, user-facing application.

Step 1: Start with a simple, direct-to-model architecture.
Entry: A working AI model.
Exit: A basic application that can interact with the model.
In: Trained AI model · Out: Simple AI application
ch14
Step 2: Implement input and output guardrails.
Entry: A baseline application and a risk assessment.
Exit: An application with safeguards against common input/output risks.
- When to block a query versus sanitizing it?
- What is the policy for handling model failures or unsafe outputs?
In: Risk assessment, Tools for sensitive data detection · Out: Application with enhanced safety and reliability
ch14
Step 3: Enhance context with external data sources and tools.
Entry: A secure baseline application.
Exit: An application capable of leveraging external information.
- Which external data sources are most valuable for the use case?
In: External data sources, APIs · Out: Context-aware AI application
ch14
Step 4: Implement a model router and gateway for complex pipelines.
Entry: The need to use multiple specialized models or services.
Exit: An efficient query routing system.
- Which model is best suited for a given user intent?
- When should a query be escalated to a human?
In: User queries, Trained intent classifier, Multiple available models · Out: Multi-model application with intelligent routing
ch14
Step 5: Add caching mechanisms to optimize for latency and cost.
Entry: An application with identifiable, repetitive query patterns.
Exit: A performant application with reduced operational costs.
- What is the optimal cache maintenance policy (e.g., TTL)?
- Is exact caching sufficient, or is semantic caching needed?
In: User queries, API responses · Out: Application with a caching layer
ch14
Step 6: Integrate user feedback mechanisms.
Entry: A deployed, user-facing application.
Exit: A system for continuous improvement based on user feedback.
- When and how to ask for feedback without disrupting the user experience?
In: User interactions, Feedback collection tools · Out: Analyzed user feedback, Data for model improvement
ch14
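
For Step 5, an exact-match cache with a time-to-live (TTL) policy is the simplest option to reason about. The sketch below is a minimal in-memory version; the class and function names are hypothetical, the model is replaced by a stub, and the clock is injectable only so that expiry can be demonstrated without waiting.

```python
import time


class ExactCache:
    """In-memory exact-match cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def get(self, query):
        entry = self.store.get(query)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self.store[query]  # entry is stale: evict it
            return None
        return value

    def put(self, query, value):
        self.store[query] = (value, self.clock())


def answer(query, cache, call_model):
    """Return (response, was_cached), calling the model only on a cache miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = call_model(query)
    cache.put(query, response)
    return response, False


# Demonstration with a fake clock and a stub in place of the model.
now = [0.0]
cache = ExactCache(ttl_seconds=60, clock=lambda: now[0])
calls = []


def fake_model(query):
    calls.append(query)
    return f"answer to: {query}"


first = answer("What is RAG?", cache, fake_model)   # miss: model is called
second = answer("What is RAG?", cache, fake_model)  # hit: served from cache
now[0] = 61.0                                       # advance past the TTL
third = answer("What is RAG?", cache, fake_model)   # expired: model is called again
print(first[1], second[1], third[1], len(calls))
```

Exact caching only helps when queries repeat verbatim; semantic caching trades that limitation for the risk of returning an answer to a merely similar question.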

Inference Optimization and Generation Strategies

To configure how a model generates responses and to optimize the inference process for speed, throughput, and cost-effectiveness while maintaining response quality.

When to use: When deploying a model to production and seeking to improve its performance to meet user expectations and manage operational costs.

Step 1: Select and configure a generation strategy.
Entry: A trained model ready for inference.
Exit: A defined and configured sampling strategy.
- Which sampling strategy is most appropriate for the application's context (e.g., creative writing vs. factual Q&A)?
In: Model's output logits · Out: Generated token sequence
ch03
Step 2: Optionally, implement a Best-of-N strategy.
Entry: A configured generation strategy and sufficient computational resources.
Exit: A single, higher-quality response.
- How many candidates (N) to generate?
- What is the best selection criterion for the use case?
In: Input prompt · Out: A single, optimal AI-generated response
ch03
Step 3: Measure and monitor inference performance.
Entry: A deployed model serving traffic.
Exit: Ongoing performance metrics for inference.
In: User requests · Out: Latency metrics (TTFT, TPOT)
ch13
Step 4: Apply model compression techniques.
Entry: A trained model and performance benchmarks.
Exit: A smaller, faster model.
- What is the acceptable trade-off between accuracy loss and performance gain from compression?
In: Trained AI model · Out: Compressed model
ch13
Step 5: Implement efficient batching and resource allocation.
Entry: An inference serving infrastructure.
Exit: An optimized inference service with higher throughput.
- Which batching strategy best fits the application's latency requirements and traffic patterns?
In: Incoming user requests · Out: Batched responses
ch13
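One common option, often called dynamic batching, closes a batch when it is full or when the oldest request in it has waited too long. The Python sketch below simulates that policy over timestamped arrivals; `dynamic_batches`, `max_batch_size`, and `max_wait` are illustrative names and values, not settings from any particular inference server.

```python
def dynamic_batches(arrivals, max_batch_size=4, max_wait=0.05):
    """Group (arrival_time_seconds, request) pairs into batches (simulation).

    A batch is flushed when it reaches max_batch_size, or when the next
    request arrives more than max_wait seconds after the batch was opened.
    """
    batches, current, opened_at = [], [], None
    for arrived_at, request in arrivals:
        full = len(current) >= max_batch_size
        stale = bool(current) and arrived_at - opened_at > max_wait
        if full or stale:
            batches.append(current)
            current = []
        if not current:
            opened_at = arrived_at  # this request opens a new batch
        current.append(request)
    if current:
        batches.append(current)
    return batches
```

A larger `max_batch_size` favors throughput, while a smaller `max_wait` caps the latency added by waiting for the batch to fill; the right balance depends on the traffic pattern.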
Step 6: Utilize prompt caching and parallelism.
Entry: A deployed model and understanding of hardware capabilities.
Exit: Reduced redundant computation and improved scalability.
In: User prompts · Out: Faster responses for cached prompts
ch13
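Prompt caching pays off when many requests share a prefix, such as a long system prompt. The Python sketch below captures only the bookkeeping: `PrefixCache` is a hypothetical class, and `encode` is a caller-supplied stand-in for the expensive work of processing prompt text, which in a real system happens inside the inference engine.

```python
import hashlib


class PrefixCache:
    """Reuse the processed form of a shared prompt prefix (illustrative)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def process(self, system_prompt, user_message, encode):
        """Return (processed_prefix, processed_user_message).

        encode(text) is a hypothetical, expensive function; it runs on the
        system prompt only the first time that prompt is seen.
        """
        key = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = encode(system_prompt)
        return self._store[key], encode(user_message)
```

The hit and miss counters give a quick read on whether the traffic shares enough prefix to justify the cache.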

The story

The reader: Software engineers, ML engineers, data scientists, and technical product managers who want to build production-ready AI applications on top of foundation models. They can get a demo working quickly but struggle to turn it into a reliable, scalable product.

External problem

They have access to powerful foundation model APIs but lack a systematic framework for adapting, evaluating, optimizing, and deploying them reliably in production.

Internal problem

They feel overwhelmed by the pace of AI change, uncertain whether their choices (model, prompting approach, finetuning strategy) are principled or just lucky, and anxious that they are missing something critical.

Philosophical problem

It is wrong that an enormous amount of potential value from AI is locked behind undocumented trial-and-error rather than accessible engineering principles.

The plan

Understand foundation models deeply enough to make informed adaptation decisions—training data, architecture, post-training alignment, and probabilistic sampling.
Establish a rigorous evaluation pipeline before building—define criteria, metrics, and scoring rubrics tied to business outcomes.
Adapt models using the correct technique for each failure mode: prompt engineering for behavioral shaping, RAG for knowledge gaps, finetuning for structural/format issues.
Optimize inference for cost and latency using quantization, speculative decoding, batching, and prompt caching.
Assemble a production architecture progressively—context construction, guardrails, gateway, caching, agents—with monitoring at every layer.
Collect and leverage user feedback to close the data flywheel and continuously improve the product.

Success

Shipping AI applications that perform reliably in production, not just in demos.
Being able to diagnose and fix AI failures systematically rather than through guesswork.
Building evaluation pipelines that give genuine confidence before deployment.
Reducing inference costs and latency through principled optimization rather than ad-hoc fixes.
Creating a data flywheel from user feedback that compounds competitive advantage over time.
Communicating AI trade-offs clearly to cross-functional stakeholders.

At stake

Wasting months on finetuning when prompt engineering would have sufficed.
Shipping applications that hallucinate dangerously because evaluation was skipped.
Building architectures that fail silently with no observability into what went wrong.
Losing proprietary data advantages to competitors who move faster and collect feedback more effectively.
Being outpaced by the AI landscape because decisions are driven by hype rather than durable principles.

Chapter by chapter

ch01: Introduction to Building AI Applications with Foundation Models
As AI models scale dramatically, the shift towards foundation models presents unprecedented opportunities and challenges for engineers looking to build AI applications.
- The rise of foundation models requires a paradigm shift in how we approach AI engineering, moving from model development to model adaptation.
- Self-supervision has unlocked unprecedented opportunities for training models without extensive labeled datasets, democratizing access to AI capabilities.
- Leveraging existing foundation models can significantly cut down the time and resources for developing advanced AI applications compared to traditional methods.
- Continuous evaluation and adaptation practices are crucial as AI models evolve and integrate more capabilities beyond their initial training contexts.
ch02: Understanding Foundation Models
This chapter delves into the intricacies of foundation models, focusing on their development, training data, architecture, and usability, emphasizing the importance of design decisions in shaping application performance.
- Foundation models are defined by their training data, architecture, and post-training alignment with human preferences.
- English dominates internet datasets, leading to significant underperformance in multilingual AI tasks.
- A model’s training data must align with its intended tasks to ensure effectiveness; lacking crucial data can hinder model performance.
- The transformer architecture represents a significant breakthrough in AI, enabling parallel processing and robust attention mechanisms.
ch03: Understanding Foundation Models
This chapter explores the intricate dynamics of foundation models, delving into scaling laws, hyperparameter tuning, training data challenges, and the inherent probabilistic nature of AI outputs.
- Models increasingly depend on optimal hyperparameter configurations, especially at larger scales, making hyperparameter tuning an essential focus area.
- Scaling bottlenecks pose genuine challenges, with data availability being a crucial factor that might limit the future design of AI models.
- The use of reinforcement learning to finetune model outputs not only enhances user preference alignment but also addresses model imperfections.
- Sampling techniques fundamentally shape model outputs, so understanding the probabilistic nature of AI is important for effective deployment.
ch04: Evaluation Methodology
The chapter dissects the complexities surrounding the evaluation of AI outputs, emphasizing the critical need for systematic methodologies to avoid catastrophic failures and enhance reliability in AI applications.
- Inadequate AI evaluation can lead to catastrophic real-world failures, so robust methodologies are necessary.
- Traditional evaluation metrics may not suffice for foundation models due to their open-ended nature and unpredictability.
- Systematic evaluation frameworks are crucial in mitigating risk and enhancing the reliability of AI applications.
- While human evaluation remains important, the automation of AI assessment through AI judges is a promising area that requires careful consideration.
ch05: Evaluate AI Systems
This chapter argues that evaluating AI systems effectively before their deployment is critical for ensuring their usability, reliability, and alignment with intended business outcomes.
- AI models should always be evaluated in the context of their intended applications to ensure utility.
- Establishing pre-defined evaluation criteria is essential to avoid deploying ineffective AI systems.
- Evaluation is not a one-time task; continuous evaluation is integral to long-term innovation and success in AI applications.
- The adoption of evaluation-driven development clarifies business outcomes and reinforces an organization's investment in AI technologies.
ch06: Evaluate AI Systems
This chapter examines the critical considerations for selecting and evaluating AI models, including the trade-offs between open-source and commercial options, as well as the importance of a rigorous evaluation pipeline.
- The landscape for AI model selection is rapidly evolving, requiring a strategic approach that balances technical performance with compliance and ethical considerations.
- Public benchmarks offer valuable insights but come with significant risks of data contamination—prudent evaluation pipelines must account for these challenges.
- A holistic view encompassing both open-source and proprietary models can yield better, more flexible solutions for organizations willing to adapt.
- Developing a solid evaluation framework is essential for organizations to differentiate between good and excellent AI systems and achieve their business objectives.
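One simple way an evaluation pipeline can screen for benchmark contamination is n-gram overlap between a benchmark item and candidate training text. The Python sketch below is an illustration under stated assumptions, not the book's procedure: `ngrams` and `overlap_ratio` are hypothetical helpers, whitespace tokenization is a simplification, and the window size `n=5` is arbitrary.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=5):
    """Fraction of the benchmark item's n-grams that also occur in the document.

    Values near 1.0 suggest the item may have leaked into the training data;
    items shorter than n tokens yield 0.0 because they have no n-grams.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)
```

A high ratio is a signal to investigate, not proof: common phrases overlap by chance, and paraphrased leaks will not be caught by exact n-gram matching.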
ch07: Prompt Engineering
Prompt engineering is the pivotal skill in developing effective interactions with AI models, allowing users to frame tasks accurately without changing the model’s internal workings.
- Prompt engineering is an essential skill that involves crafting precise instructions to guide AI outputs without altering model parameters.
- Clarity and explicit communication are paramount; it’s not just about writing prompts but doing so effectively and thoughtfully.
- Security is a critical consideration in AI deployment, with defined strategies necessary to mitigate prompt attacks and ensure the integrity of AI systems.
- Models are sensitive to the structure of prompts; small changes can produce dramatically different outcomes, emphasizing the necessity for careful experimentation.
ch08: RAG and Agents
In this chapter, the author explores two pivotal techniques in AI—Retrieval-Augmented Generation (RAG) and intelligent agents—highlighting how both enhance AI's capabilities by providing context and enabling interactive problem-solving.
- AI models require contextual information, alongside effective instructions, to produce reliable outputs, emphasizing the need for RAG frameworks.
- RAG systems utilize retrieval mechanisms to access pertinent data, drastically improving question-answering capabilities while minimizing errors.
- Intelligent agents expand the potential of AI by enabling automated interactions with external tools, enhancing both efficiency and task complexity.
- A structured planning approach is essential to guide agents toward successful task execution, helping mitigate risks associated with complex processes.
ch09: RAG and Agents
This chapter explores the capabilities and constraints of foundation models in planning and their integration with reinforcement learning (RL) agents, highlighting the complexities of tool usage, learning, and system memory within AI frameworks.
- The efficacy of foundation models in planning is often overstated; their understanding of planning must be reframed from simply ‘knowing’ to ‘acting effectively’.
- Backtracking, a critical component of planning, can be achieved operationally with thoughtful design, even if it looks different from traditional methods.
- Selecting tools, and adapting dynamically to how effective they prove, is paramount for agents to function successfully in complex environments.
- Incorporating reflection mechanisms into AI planning processes can significantly improve performance, while iterative learning is crucial for agent efficacy.
ch10p01: Finetuning (part 1/2)
This chapter discusses finetuning as a method for adapting large machine learning models to specific tasks versus using prompt-based techniques, exploring its advantages, challenges, and alternatives.
ch10p02: Finetuning (part 2/2)
This chapter outlines practical tactics for finetuning AI models, exploring various methods, models, and frameworks while addressing the complexities practitioners face in optimizing performance.
ch11: Dataset Engineering
The efficacy of machine learning models hinges on the quality and structure of training data, prompting a pivotal focus on dataset engineering to maximize performance within budget constraints.
- A high-quality dataset is the bedrock of a successful machine learning model; without it, even the most sophisticated algorithms will falter.
- Emphasizing data quality can yield far better results than simply increasing the quantity of training data available.
- Adopting a data-centric approach paves the way for more robust and effective AI applications, enabling organizations to leverage existing models more efficiently.
- Continuous feedback loops between model performance and dataset quality are essential for refining training data and achieving desired results.
ch12: Dataset Engineering
This chapter dives into the intricate world of dataset engineering, detailing how to effectively create, evaluate, and synthesize data that significantly enhances machine learning models' performance.
- High-quality data is foundational to the success of AI models; synthetic data must be backed by thorough verification to ensure real-world applicability.
- Integrating both synthetic and human-generated data can enhance model reliability, overcoming the limitations posed by either data type alone.
- A multi-faceted approach to dataset engineering can significantly increase the volume and quality of data available for training, improving model performance.
- The emphasis on programmatically verifiable datasets is crucial for enterprises aiming to utilize AI effectively in a competitive landscape.
ch13: Inference Optimization
This chapter asserts that making AI models faster and cheaper is as crucial as improving their accuracy, emphasizing the need for effective inference optimization techniques across model, hardware, and service levels.
- Inference optimization is crucial for the viability of AI applications; delays or excess costs can alienate users.
- A robust understanding of compute-bound vs. memory bandwidth-bound tasks is necessary for efficient optimization efforts.
- The implementation of effective batching strategies can significantly enhance throughput while minimizing latency impact.
- Model compression techniques are essential for improving inference speed and reducing operational costs while keeping the loss in model fidelity acceptable.
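The compute-bound versus memory bandwidth-bound distinction can be made concrete with a roofline-style check: compare a workload's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak compute to peak memory bandwidth. The Python sketch below is an illustration only; `bottleneck` is a hypothetical helper, and any hardware figures passed to it are the caller's assumptions, not numbers from the book.

```python
def bottleneck(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify a workload with a simple roofline comparison (sketch).

    intensity: FLOPs performed per byte read or written.
    ridge:     intensity at which the hardware shifts from being limited
               by memory bandwidth to being limited by compute.
    """
    intensity = flops / bytes_moved
    ridge = peak_flops_per_s / peak_bytes_per_s
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"
```

The classification suggests where to spend effort: a memory bandwidth-bound workload benefits more from moving fewer bytes (for example, smaller weights), while a compute-bound one benefits more from doing fewer or faster operations.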
ch14: AI Engineering Architecture and User Feedback
This chapter breaks down the architectural strategies necessary for building effective AI applications, emphasizing the critical role of user feedback in enhancing model performance and guiding product development.
- A structured and proactive approach to AI application architecture is imperative for success in a competitive market.
- Feedback from users serves as both a vital resource for improving AI models and a competitive advantage through enhanced personalization.
- Balancing innovation with responsibility, particularly regarding user privacy and ethical guidelines, is essential in AI product development.
- Observability is a critical aspect of AI systems that ensures robust performance monitoring and facilitates quick resolution of emerging issues.
ch15: Epilogue
The epilogue celebrates the completion of a comprehensive technical journey while emphasizing the importance of curiosity and ongoing inquiry in AI engineering.

Questions this book answers

Should I build this AI application and what use cases are worth pursuing?
How do I evaluate open-ended AI outputs systematically and reliably?
What are the best practices for prompt engineering and how do I defend against prompt attacks?
Why does RAG work and what strategies maximize retrieval quality?
How do agents plan and use tools, and how do I evaluate them?

Tools these methods power