AI Engineering: Building Applications with Foundation Models
Chip Huyen · 2025
In a sentence
A comprehensive engineering guide for building production-ready AI applications on top of foundation models, covering the full stack from evaluation and prompt engineering to RAG, finetuning, inference optimization, and deployment architecture.
AI Engineering by Chip Huyen is the definitive practitioner's guide for anyone building applications on top of large language models and multimodal foundation models. Written by a Stanford AI lecturer and veteran ML engineer, the book systematically addresses every stage of the development lifecycle: understanding how foundation models work under the hood, establishing rigorous evaluation pipelines, crafting effective prompts and retrieval-augmented generation systems, deciding when and how to finetune, optimizing inference for cost and latency, and assembling a production-grade architecture with guardrails, caching, and user feedback loops. Unlike tutorials tied to specific tools, Huyen focuses on durable fundamentals—why techniques work, when to use them, and how to reason about trade-offs—making it equally valuable for engineers just starting out and those scaling mature AI products.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A causal model describing how foundation model design choices and engineering adaptation levers—training data quality, model architecture, post-training alignment, prompt engineering, context construction (RAG), finetuning technique, and inference optimization—operate through intermediate psychological and behavioral states (developer confidence, evaluation reliability, output quality perception) to produce application-level and business outcomes (production reliability, cost-efficiency, user satisfaction, data flywheel growth).
Training Data Quality, Coverage, and Quantity · design lever
The degree to which the data used to pre-train or finetune a foundation model is relevant, accurate, consistent, unique, diverse across domains and tasks, and sufficient in volume. Encompasses all three dimensions of the data golden triad: quality (correctness, alignment with task requirements, format), coverage (topical and linguistic diversity), and quantity (token or example count relative to model size and finetuning technique).
Model Architecture and Scale · design lever
The structural design of the foundation model including the choice of architecture (transformer, Mamba, MoE hybrids), number of parameters, number of transformer layers, model dimension, vocabulary size, context length, and the extent of post-training alignment (SFT and preference finetuning). Scale is operationalized through the three key numbers: parameter count, training token count, and training FLOPs.
Post-Training Alignment Quality · design lever
The extent to which a model has been post-trained via supervised finetuning on high-quality demonstration data and preference finetuning (RLHF, DPO, RLAIF) to follow instructions safely, refuse harmful requests, and respond in ways that align with diverse human preferences. Higher alignment quality corresponds to better instruction-following, lower toxicity, reduced hallucination tendency, and more reliable refusal of out-of-scope requests.
Prompt Engineering Quality · design lever
The effectiveness of the instructions, examples, persona assignments, output format specifications, and context provided to a foundation model in a prompt, measured by how well the prompt elicits the desired model behavior without modifying model weights. Encompasses clarity, specificity, few-shot example quality, chain-of-thought elicitation, task decomposition, and defensive engineering against prompt attacks.
RAG Retrieval Quality · design lever
The degree to which the retrieval component of a RAG system surfaces documents that are relevant, precise, and sufficient to answer a given query. Encompasses context precision (fraction of retrieved documents that are relevant), context recall (fraction of relevant documents that are retrieved), retrieval algorithm choice (term-based, embedding-based, hybrid), chunking strategy, reranking, and contextual augmentation of chunks.
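The two retrieval metrics named above can be sketched in a few lines. This is a minimal illustration, assuming documents are identified by string ids and that a labeled set of relevant ids exists for the query; the function names and sample ids are hypothetical and not taken from the book.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    found = relevant & set(retrieved)
    return len(found) / len(relevant)


# Hypothetical query: 4 documents retrieved, 3 documents labeled relevant.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7"]

precision = context_precision(retrieved_ids, relevant_ids)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved_ids, relevant_ids)        # 2 of 3 relevant were retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers matters because they pull in opposite directions: retrieving more chunks tends to raise recall while lowering precision.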
Finetuning Technique and Data Quality · design lever
The choice of finetuning approach (full finetuning, LoRA, QLoRA, soft-prompt tuning, preference finetuning) and the quality, coverage, and quantity of the instruction or preference data used. Determines how effectively model weights are updated to improve task-specific performance, output format adherence, and behavioral alignment while minimizing memory footprint and catastrophic forgetting.
Inference Optimization Level · design lever
The extent to which model inference has been optimized for latency and cost through techniques including quantization (FP16, INT8, INT4), speculative decoding, KV cache management, prompt caching, batching strategy (static, dynamic, continuous), tensor and pipeline parallelism, and kernel-level optimizations such as FlashAttention. Higher optimization level yields lower cost-per-token and lower latency while potentially introducing quality trade-offs.
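To make the quality trade-off of quantization concrete, here is a minimal sketch of one simple scheme: symmetric absmax quantization to INT8 with a single shared scale. The scheme, the function names, and the sample weights are assumptions chosen for illustration; production systems typically use per-channel or per-group scales and more elaborate methods.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weight values, not taken from any real model.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

The small weight 0.003 rounds to zero here, which shows where the quality loss comes from: values much smaller than the largest weight lose most of their precision.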
Evaluation Pipeline Reliability · psychological state
The degree to which the evaluation infrastructure—comprising evaluation criteria, scoring rubrics, annotated evaluation datasets, evaluation methods (functional correctness, AI judges, similarity metrics), and experiment tracking—produces consistent, meaningful signals that correlate with real-world application performance and business metrics. A reliable evaluation pipeline enables confident iteration and reduces the risk of deploying degraded models.
Model Output Quality · psychological state
The composite quality of responses generated by the AI application as perceived by target users and measured by domain-specific capability scores, generation quality metrics (factual consistency, coherence, relevance), instruction-following capability, and safety. This is the central mediating variable between adaptation levers and user-facing outcomes. It integrates contributions from the base model, prompt engineering, RAG, and finetuning.
Developer Systematic Iteration Behavior · behavioral pattern
The degree to which AI engineering teams follow a principled, evaluation-driven development process: defining success metrics before building, versioning prompts and models, running controlled experiments with proper experiment tracking, systematically diagnosing failure modes before escalating to more expensive adaptation techniques, and iterating based on quantitative evaluation results rather than ad-hoc intuition.
Guardrail Coverage and Effectiveness · contextual condition
The comprehensiveness and reliability of safety and quality guardrails applied to both inputs and outputs of the AI application, including PII detection, toxicity filtering, prompt injection defenses, output format validation, retry logic, and human escalation policies. Measured jointly by violation rate (harmful outputs that bypass guardrails) and false-refusal rate (legitimate requests incorrectly blocked).
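The two guardrail measures can be computed from a labeled audit log. The sketch below assumes each record is a pair `(is_harmful, was_blocked)`, and it assumes the denominators: violation rate is taken over harmful cases and false-refusal rate over legitimate ones. The source names the two rates but does not fix these denominators, so treat them as one reasonable choice.

```python
def guardrail_rates(records):
    """records: list of (is_harmful, was_blocked) boolean pairs."""
    harmful = [blocked for is_harmful, blocked in records if is_harmful]
    legitimate = [blocked for is_harmful, blocked in records if not is_harmful]

    # Harmful cases that slipped past the guardrail.
    violations = sum(1 for blocked in harmful if not blocked)
    # Legitimate cases that were blocked anyway.
    false_refusals = sum(1 for blocked in legitimate if blocked)

    violation_rate = violations / len(harmful) if harmful else 0.0
    false_refusal_rate = false_refusals / len(legitimate) if legitimate else 0.0
    return violation_rate, false_refusal_rate


# Hypothetical audit sample: 3 harmful and 3 legitimate cases.
audit = [
    (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False),
]
violation_rate, false_refusal_rate = guardrail_rates(audit)
print(f"violation={violation_rate:.2f} false_refusal={false_refusal_rate:.2f}")
```

Reporting the two rates together guards against a degenerate fix: blocking everything drives the violation rate to zero while the false-refusal rate goes to one.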
User Feedback System Quality · design lever
The effectiveness of the mechanisms designed to collect, extract, and interpret user feedback—both explicit (thumbs up/down, ratings) and implicit conversational signals (early termination, error correction attempts, regeneration, conversation length)—and to route that feedback into model evaluation, development, and personalization pipelines. A high-quality feedback system enables the data flywheel.
Production Reliability and Safety · outcome metric
The degree to which the deployed AI application operates within acceptable performance, safety, and compliance bounds in production: low hallucination rate, low toxicity rate, acceptable latency and uptime, no critical security breaches, and stable output quality over time. Represents the primary risk-mitigation outcome of AI engineering.
Inference Cost Efficiency · outcome metric
The ratio of application value delivered to total inference cost, capturing how well the engineering choices about model selection, quantization, batching, caching, and architecture minimize cost per useful output. Measured in cost-per-1M-tokens, cost-per-request, or cost-per-resolved-task relative to the value delivered by each resolved task.
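As a worked example of these cost measures, the sketch below converts per-token prices into cost-per-request and cost-per-resolved-task. All prices, token counts, and resolution numbers are hypothetical placeholders, not figures from the book or from any provider.

```python
def cost_per_request(input_tokens, output_tokens,
                     input_price_per_1m, output_price_per_1m):
    """Cost of one request given prices quoted per 1M tokens."""
    total = input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    return total / 1_000_000


def cost_per_resolved_task(total_cost, resolved_tasks):
    """Total spend divided by the number of tasks actually resolved."""
    if resolved_tasks == 0:
        return float("inf")
    return total_cost / resolved_tasks


# Hypothetical workload: 1,000 requests, of which 700 resolve the user's task.
per_request = cost_per_request(
    input_tokens=1500, output_tokens=500,
    input_price_per_1m=2.0, output_price_per_1m=8.0,
)
total_cost = per_request * 1000
per_resolved = cost_per_resolved_task(total_cost, resolved_tasks=700)
print(f"per request: ${per_request:.4f}, per resolved task: ${per_resolved:.4f}")
```

Cost-per-resolved-task is the stricter measure: a cheaper model that resolves fewer tasks can end up costing more per useful output.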
User Satisfaction and Task Success Rate · outcome metric
The degree to which end users achieve their goals using the AI application, measured through explicit satisfaction signals (ratings, NPS), implicit behavioral signals (session length, regeneration rate, error correction rate, task completion rate), and business metrics (DAU, retention, subscription conversion). Represents the primary product-level outcome of AI engineering.
Data Flywheel Growth · outcome metric
The rate at which user interactions generate proprietary training data that, when used to improve the model, attracts more users who generate more data—a compounding competitive advantage. Measured by the volume and quality of user-feedback-derived training examples over time and the correlation between data flywheel cycles and model performance improvement on production queries.
How they connect
- training data quality → predicts model output quality
- model architecture and scale → predicts model output quality
- post training alignment → predicts model output quality
- prompt engineering quality → predicts model output quality
- rag retrieval quality → predicts model output quality
- finetuning technique and data → predicts model output quality
- evaluation pipeline reliability → predicts developer systematic iteration
- developer systematic iteration → predicts model output quality
- model output quality → predicts user satisfaction
- inference optimization level → predicts cost efficiency
- inference optimization level → predicts user satisfaction
- guardrail coverage → moderates production reliability
- user feedback system quality → predicts data flywheel growth
- data flywheel growth → predicts training data quality
- user satisfaction → predicts data flywheel growth
- model output quality → predicts production reliability
- evaluation pipeline reliability → predicts production reliability
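The connections listed above form a small directed graph, and encoding them as data makes the structure easy to query. The sketch below is one possible encoding: the edge tuples copy the list verbatim, while the `upstream` helper and its name are illustrative assumptions. It collects every node that can influence a given node, which also exposes the feedback loop through the data flywheel.

```python
from collections import defaultdict

# (source, relation, destination), copied from the list above.
EDGES = [
    ("training data quality", "predicts", "model output quality"),
    ("model architecture and scale", "predicts", "model output quality"),
    ("post training alignment", "predicts", "model output quality"),
    ("prompt engineering quality", "predicts", "model output quality"),
    ("rag retrieval quality", "predicts", "model output quality"),
    ("finetuning technique and data", "predicts", "model output quality"),
    ("evaluation pipeline reliability", "predicts", "developer systematic iteration"),
    ("developer systematic iteration", "predicts", "model output quality"),
    ("model output quality", "predicts", "user satisfaction"),
    ("inference optimization level", "predicts", "cost efficiency"),
    ("inference optimization level", "predicts", "user satisfaction"),
    ("guardrail coverage", "moderates", "production reliability"),
    ("user feedback system quality", "predicts", "data flywheel growth"),
    ("data flywheel growth", "predicts", "training data quality"),
    ("user satisfaction", "predicts", "data flywheel growth"),
    ("model output quality", "predicts", "production reliability"),
    ("evaluation pipeline reliability", "predicts", "production reliability"),
]


def upstream(target, edges=EDGES):
    """Return every node with a directed path into `target`."""
    parents = defaultdict(set)
    for source, _relation, dest in edges:
        parents[dest].add(source)

    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("cost efficiency")))
# A node that appears in its own upstream set sits on a feedback loop.
print("training data quality" in upstream("training data quality"))
```

In this encoding, training data quality is upstream of itself via model output quality, user satisfaction, and data flywheel growth, which is the compounding loop the model describes.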
The story
The reader
Software engineers, ML engineers, data scientists, and technical product managers who want to build production-ready AI applications on top of foundation models: people who can get a demo working quickly but struggle to move it to a reliable, scalable product.
External problem
They have access to powerful foundation model APIs but lack a systematic framework for adapting, evaluating, optimizing, and deploying them reliably in production.
Internal problem
They feel overwhelmed by the pace of AI change, uncertain whether their choices (model, prompting approach, finetuning strategy) are principled or just lucky, and anxious that they are missing something critical.
Philosophical problem
It is wrong that an enormous amount of potential value from AI is locked behind undocumented trial-and-error rather than accessible engineering principles.
The plan
- Understand foundation models deeply enough to make informed adaptation decisions—training data, architecture, post-training alignment, and probabilistic sampling.
- Establish a rigorous evaluation pipeline before building—define criteria, metrics, and scoring rubrics tied to business outcomes.
- Adapt models using the correct technique for each failure mode: prompt engineering for behavioral shaping, RAG for knowledge gaps, finetuning for structural/format issues.
- Optimize inference for cost and latency using quantization, speculative decoding, batching, and prompt caching.
- Assemble a production architecture progressively—context construction, guardrails, gateway, caching, agents—with monitoring at every layer.
- Collect and leverage user feedback to close the data flywheel and continuously improve the product.
Success
- Shipping AI applications that perform reliably in production, not just in demos.
- Being able to diagnose and fix AI failures systematically rather than through guesswork.
- Building evaluation pipelines that give genuine confidence before deployment.
- Reducing inference costs and latency through principled optimization rather than ad-hoc fixes.
- Creating a data flywheel from user feedback that compounds competitive advantage over time.
- Communicating AI trade-offs clearly to cross-functional stakeholders.
At stake
- Wasting months on finetuning when prompt engineering would have sufficed.
- Shipping applications that hallucinate dangerously because evaluation was skipped.
- Building architectures that fail silently with no observability into what went wrong.
- Losing proprietary data advantages to competitors who move faster and collect feedback more effectively.
- Being outpaced by the AI landscape because decisions are driven by hype rather than durable principles.
Chapter by chapter
ch01 · Introduction to Building AI Applications with Foundation Models
As AI models scale dramatically, the shift towards foundation models presents unprecedented opportunities and challenges for engineers looking to build AI applications.
- The rise of foundation models requires a paradigm shift in how we approach AI engineering, moving from model development to model adaptation.
- Self-supervision has unlocked unprecedented opportunities for training models without extensive labeled datasets, democratizing access to AI capabilities.
- Leveraging existing foundation models can significantly cut down the time and resources for developing advanced AI applications compared to traditional methods.
- Continuous evaluation and adaptation practices are crucial as AI models evolve and integrate more capabilities beyond their initial training contexts.
ch02 · Understanding Foundation Models
This chapter delves into the intricacies of foundation models, focusing on their development, training data, architecture, and usability, emphasizing the importance of design decisions in shaping application performance.
- Foundation models are defined by their training data, architecture, and post-training alignment with human preferences.
- English dominates internet datasets, leading to significant underperformance in multilingual AI tasks.
- A model’s training data must align with its intended tasks to ensure effectiveness; lacking crucial data can hinder model performance.
- The transformer architecture represents a significant breakthrough in AI, enabling parallel processing and robust attention mechanisms.
ch03 · Understanding Foundation Models
This chapter explores the intricate dynamics of foundation models, delving into scaling laws, hyperparameter tuning, training data challenges, and the inherent probabilistic nature of AI outputs.
- Models increasingly depend on optimal hyperparameter configurations, especially at larger scales, making hyperparameter tuning an essential focus area.
- Scaling bottlenecks pose genuine challenges, with data availability being a crucial factor that might limit the future design of AI models.
- The use of reinforcement learning to finetune model outputs not only enhances user preference alignment but also addresses model imperfections.
- Sampling techniques fundamentally shape model outputs, so understanding the probabilistic nature of AI is essential for effective deployment.
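To illustrate how sampling settings change outputs, here is a minimal sketch of two widely used controls, temperature scaling and top-k filtering, applied to a toy list of logits. The function names and logit values are hypothetical, and the summary above does not say which sampling techniques the chapter covers, so this is an illustration of the general idea only.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token index from the k highest-scoring logits only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax_with_temperature([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


# Toy logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.0, -1.0]
for t in (0.5, 1.0, 2.0):
    top_prob = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: p(most likely token)={top_prob:.3f}")

print(sample_top_k(logits, k=2, temperature=1.0, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the most likely token, and higher ones flatten the distribution. With `k=1` the sampler always returns the highest-scoring token, which makes the output deterministic.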
ch04 · Evaluation Methodology
The chapter dissects the complexities surrounding the evaluation of AI outputs, emphasizing the critical need for systematic methodologies to avoid catastrophic failures and enhance reliability in AI applications.
- Inadequate AI evaluation can lead to catastrophic real-world failures, so robust methodologies are necessary.
- Traditional evaluation metrics may not suffice for foundation models due to their open-ended nature and unpredictability.
- Systematic evaluation frameworks are crucial in mitigating risk and enhancing the reliability of AI applications.
- While human evaluation remains important, the automation of AI assessment through AI judges is a promising area that requires careful consideration.
ch05 · Evaluate AI Systems
This chapter argues that evaluating AI systems effectively before their deployment is critical for ensuring their usability, reliability, and alignment with intended business outcomes.
- AI models should always be evaluated in the context of their intended applications to ensure utility.
- Establishing pre-defined evaluation criteria is essential to avoid deploying ineffective AI systems.
- Evaluation is not a one-time task; continuous evaluation is integral to fostering long-term innovation and success in AI applications.
- The adoption of evaluation-driven development clarifies business outcomes and reinforces an organization's investment in AI technologies.
ch06 · Evaluate AI Systems
This chapter examines the critical considerations for selecting and evaluating AI models, including the trade-offs between open-source and commercial options, as well as the importance of a rigorous evaluation pipeline.
- The landscape for AI model selection is rapidly evolving, requiring a strategic approach that balances technical performance with compliance and ethical considerations.
- Public benchmarks offer valuable insights but come with significant risks of data contamination—prudent evaluation pipelines must account for these challenges.
- A holistic view encompassing both open-source and proprietary models can yield better, more flexible solutions for organizations willing to adapt.
- Developing a solid evaluation framework is essential for organizations to differentiate between good and excellent AI systems and achieve their business objectives.
ch07 · Prompt Engineering
Prompt engineering is the pivotal skill in developing effective interactions with AI models, allowing users to frame tasks accurately without changing the model’s internal workings.
- Prompt engineering is an essential skill that involves crafting precise instructions to guide AI outputs without altering model parameters.
- Clarity and explicit communication are paramount; it’s not just about writing prompts but doing so effectively and thoughtfully.
- Security is a critical consideration in AI deployment, with defined strategies necessary to mitigate prompt attacks and ensure the integrity of AI systems.
- Models are sensitive to the structure of prompts; small changes can produce dramatically different outcomes, emphasizing the necessity for careful experimentation.
ch08 · RAG and Agents
In this chapter, the author explores two pivotal techniques in AI—Retrieval-Augmented Generation (RAG) and intelligent agents—highlighting how both enhance AI's capabilities by providing context and enabling interactive problem-solving.
- AI models require contextual information, alongside effective instructions, to produce reliable outputs, emphasizing the need for RAG frameworks.
- RAG systems utilize retrieval mechanisms to access pertinent data, drastically improving question-answering capabilities while minimizing errors.
- Intelligent agents expand the potential of AI by enabling automated interactions with external tools, increasing both efficiency and the complexity of tasks that can be handled.
- A structured planning approach is essential to guide agents toward successful task execution, helping mitigate risks associated with complex processes.
ch09 · RAG and Agents
This chapter explores the capabilities and constraints of foundation models in planning and their integration with reinforcement learning (RL) agents, highlighting the complexities of tool usage, learning, and system memory within AI frameworks.
- The efficacy of foundation models in planning is often overstated; their understanding of planning must be reframed from simply ‘knowing’ to ‘acting effectively’.
- Backtracking, a critical component of planning, can be achieved operationally with a thoughtful design approach, even if it looks different than traditional methods.
- Tool selection and the ability to adapt dynamically to tool effectiveness are paramount for agents to operate successfully in complex environments.
- Incorporating reflection mechanisms into AI planning processes can significantly improve performance, while iterative learning is crucial for agent efficacy.
ch10p01 · Finetuning (part 1/2)
This chapter discusses finetuning as a method for adapting large machine learning models to specific tasks versus using prompt-based techniques, exploring its advantages, challenges, and alternatives.
ch10p02 · Finetuning (part 2/2)
This chapter outlines practical tactics for finetuning AI models, exploring various methods, models, and frameworks while addressing the complexities practitioners face in optimizing performance.
ch11 · Dataset Engineering
The efficacy of machine learning models hinges on the quality and structure of training data, prompting a pivotal focus on dataset engineering to maximize performance within budget constraints.
- A high-quality dataset is the bedrock of a successful machine learning model; without it, even the most sophisticated algorithms will falter.
- Emphasizing data quality can yield far better results than simply increasing the quantity of training data available.
- Adopting a data-centric approach paves the way for more robust and effective AI applications, enabling organizations to leverage existing models more efficiently.
- Continuous feedback loops between model performance and dataset quality are essential for refining training data and achieving desired results.
ch12 · Dataset Engineering
This chapter dives into the intricate world of dataset engineering, detailing how to effectively create, evaluate, and synthesize data that significantly enhances machine learning models' performance.
- High-quality data is foundational to the success of AI models; synthetic data must be backed by thorough verification to ensure real-world applicability.
- Integrating both synthetic and human-generated data can enhance model reliability, overcoming the limitations posed by either data type alone.
- A multi-faceted approach in dataset engineering can significantly augment the volume and quality of data available for training models, enhancing performance markers.
- The emphasis on programmatically verifiable datasets is crucial for enterprises aiming to utilize AI effectively in a competitive landscape.
ch13 · Inference Optimization
This chapter asserts that making AI models faster and cheaper is as crucial as improving their accuracy, emphasizing the need for effective inference optimization techniques across model, hardware, and service levels.
- Inference optimization is crucial for the viability of AI applications; delays or excess costs can alienate users.
- A robust understanding of compute-bound vs. memory bandwidth-bound tasks is necessary for efficient optimization efforts.
- The implementation of effective batching strategies can significantly enhance throughput while minimizing latency impact.
- Model compression techniques are essential for improving inference speed and reducing operational costs while keeping the loss of model fidelity small.
ch14 · AI Engineering Architecture and User Feedback
This chapter breaks down the architectural strategies necessary for building effective AI applications, emphasizing the critical role of user feedback in enhancing model performance and guiding product development.
- A structured and proactive approach to AI application architecture is imperative for success in a competitive market.
- Feedback from users serves as both a vital resource for improving AI models and a competitive advantage through enhanced personalization.
- Balancing innovation with responsibility, particularly regarding user privacy and ethical guidelines, is essential in AI product development.
- Observability is a critical aspect of AI systems that ensures robust performance monitoring and facilitates quick resolution of emerging issues.
ch15 · Epilogue
The epilogue celebrates the completion of a comprehensive technical journey while emphasizing the importance of curiosity and ongoing inquiry in AI engineering.