peopleanalyst

AI Agents and Applications (with LangChain, LangGraph, and MCP)

Roberto Infante · 2025

In a sentence

A hands-on developer guide that takes you from LLM prompt basics through advanced RAG, multi-tool agents, multi-agent systems, and the Model Context Protocol using LangChain, LangGraph, and LangSmith.

This book is a comprehensive, code-driven journey through the full spectrum of LLM-powered application development. Beginning with the fundamentals of prompt engineering and the OpenAI API, it progressively builds toward sophisticated architectures: summarization engines, Q&A chatbots grounded in private knowledge bases via Retrieval-Augmented Generation, and finally autonomous multi-tool AI agents and multi-agent systems. Each concept is illustrated with practical, runnable Python examples centered on a travel industry theme. The book covers LangChain's modular component model, LangGraph's stateful graph-based agent framework, LangSmith's observability tooling, and the emerging Model Context Protocol (MCP) standard. Readers learn not just how to wire components together but also how to reason about trade-offs in chunk size, embedding strategy, query transformation, and routing, and how to handle production concerns such as memory and guardrails, which makes the book a practical reference for software developers building real-world AI applications.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

Tags

f1-systems

The model

A causal model of LLM application development. Design levers (prompt engineering, indexing strategy, query transformation, retrieval architecture, agent orchestration design, observability, and MCP integration) drive intermediate system states such as retrieval relevance and agent reasoning quality, which in turn determine the outcome metrics: answer accuracy, hallucination rate, system reliability, and developer productivity.

Prompt Engineering Quality · design lever

The degree to which prompts are deliberately structured with clear persona, context, instructions, output format, tone, and in-context examples (zero-shot through few-shot and chain-of-thought) to guide the LLM toward accurate and appropriate outputs for a given task. High quality reflects well-crafted, task-specific prompt templates with appropriate examples and safety instructions such as instructing the model to admit ignorance rather than hallucinate.
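
As a minimal sketch of the structure described above, the following assembles a prompt from a persona, context, instructions, an output format, and few-shot examples, and adds an instruction to admit ignorance. The function name `build_prompt`, the section labels, and the sample travel text are illustrative assumptions, not taken from the book.

```python
def build_prompt(persona, context, instruction, output_format, examples, question):
    """Assemble a structured few-shot prompt from its labeled parts."""
    # Each (question, answer) pair becomes one in-context example.
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    sections = [
        f"Persona: {persona}",
        f"Context:\n{context}",
        # Safety instruction: prefer admitting ignorance over guessing.
        f"Instructions: {instruction} "
        "If the context does not contain the answer, reply \"I don't know\".",
        f"Output format: {output_format}",
        f"Examples:\n{shots}",
        f"Question: {question}\nAnswer:",
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    persona="You are a travel assistant for UK seaside holidays.",
    context="Newquay has several surfing beaches. "
            "Penzance has a ferry to the Isles of Scilly.",
    instruction="Answer using only the context.",
    output_format="One short sentence.",
    examples=[
        ("Which town has surfing beaches?", "Newquay"),
        ("Which town has flights to Mars?", "I don't know"),
    ],
    question="Which town has a ferry to the Isles of Scilly?",
)
print(prompt)
```

Passing an empty `examples` list gives the zero-shot variant of the same template.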

Indexing Strategy Sophistication · design lever

The extent to which the content ingestion pipeline employs advanced techniques beyond naive fixed-size chunking, including appropriate splitting strategies (HTML header, markdown header, recursive character, sentence-based), multi-vector embedding schemes (child chunk embeddings, summary embeddings, hypothetical question embeddings), chunk expansion with adjacent context, and metadata enrichment. Higher sophistication means the vector store indexes each document in multiple semantically rich ways.
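
One of the multi-vector schemes named above, child chunk embeddings, can be sketched without any framework: small child chunks are what would be embedded and searched, and each carries the id of the larger parent chunk that would be handed to the LLM. Fixed-size character splitting is used here only as a stand-in for the splitters listed above, the embedding step is omitted, and all names are hypothetical.

```python
def split_fixed(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_parent_child_index(document, parent_size=400, child_size=100):
    """Return (parents, children); each child records its parent's id."""
    parents = {}
    children = []
    for parent_id, parent in enumerate(split_fixed(document, parent_size)):
        parents[parent_id] = parent
        for child in split_fixed(parent, child_size):
            # In a real pipeline the child text would be embedded and stored
            # in the vector store with parent_id as metadata.
            children.append({"text": child, "parent_id": parent_id})
    return parents, children


document = "Cornwall has sandy beaches. " * 40  # 1,120 characters of sample text
parents, children = build_parent_child_index(document)

# A search hit on a child is expanded to its parent before prompting the LLM.
hit = children[5]
expanded_context = parents[hit["parent_id"]]
print(len(parents), len(children), len(expanded_context))
```

The design choice is that small chunks tend to match queries precisely while larger parents give the model enough surrounding context to answer.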

Query Transformation Sophistication · design lever

The degree to which the system applies pre-retrieval query reformulation techniques including rewrite-retrieve-read, multiple query generation, step-back abstraction, hypothetical document embeddings (HyDE), and single-step or multi-step decomposition of complex questions. Higher sophistication means the system does not pass raw user queries directly to the vector store but instead reformulates them to maximize semantic alignment with indexed content.
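
Multiple query generation, one of the techniques listed above, can be sketched as follows: the original question and several reformulations are each sent to the retriever, and the results are merged without duplicates. The LLM call is replaced by a hard-coded stub (`fake_variants`) and the vector store by a toy keyword matcher (`keyword_retrieve`); both are assumptions for illustration only.

```python
def multi_query_retrieve(question, generate_variants, retrieve, k=3):
    """Retrieve with the question plus its rewrites, keeping first-seen order."""
    queries = [question] + generate_variants(question)
    seen, merged = set(), []
    for query in queries:
        for doc in retrieve(query, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


DOCS = [
    "Guide to the best beaches in Cornwall",
    "Swimming spots along the Cornwall coast",
    "Train timetables for Devon",
]


def fake_variants(question):
    """Stub for an LLM that rewrites the question (hypothetical output)."""
    return ["best beaches in cornwall", "cornwall coast swimming spots"]


def keyword_retrieve(query, k):
    """Toy retriever: rank documents by the number of shared lowercase words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in DOCS]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0][:k]


question = "where can I swim in cornwall?"
raw_only = keyword_retrieve(question, 3)
results = multi_query_retrieve(question, fake_variants, keyword_retrieve)
print(raw_only)
print(results)
```

In this toy setup the raw question alone misses the swimming document, while the reformulated queries surface it.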

Retrieval Architecture Breadth · design lever

The extent to which the application draws on multiple heterogeneous content stores beyond a single vector database, including relational SQL databases, graph databases, and metadata-filtered vector search, with appropriate routing logic to direct each query to the most relevant store and post-processing techniques such as reciprocal rank fusion to rerank results from multiple sources.
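
Reciprocal rank fusion, mentioned above, merges ranked lists from several stores by summing 1 / (k + rank) for every list in which a document appears. The sketch below assumes the commonly used smoothing constant k = 60 and hypothetical document ids; the source does not specify either.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists; a higher fused score means a better final rank."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more from appearing near the top of a list.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from three heterogeneous stores.
vector_hits = ["a", "b", "c"]
sql_hits = ["b", "d"]
graph_hits = ["b", "a"]

fused = reciprocal_rank_fusion([vector_hits, sql_hits, graph_hits])
print(fused)
```

Because only ranks are used, the stores do not need comparable relevance scores, which is what makes the method convenient for mixing vector, SQL, and graph results.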

Agent Orchestration Design Quality · design lever

The quality of the agent architecture as measured by adoption of explicit typed state management (TypedDict state schemas), appropriate use of LangGraph's conditional edges and node functions, clear tool descriptions and system prompts that guide LLM tool selection, use of proven patterns such as ReAct and supervisor, and avoidance of brittle linear chains for tasks requiring dynamic decision-making. Encompasses both single-agent and multi-agent system design.
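
The ideas of a typed state schema, node functions, and a conditional edge can be illustrated in plain Python. This is a framework-free sketch, not LangGraph's API: the `run` loop, the node names, and the canned tool output are all hypothetical stand-ins, and the "LLM" step is a simple rule.

```python
from typing import Callable, Dict, List, TypedDict


class AgentState(TypedDict):
    question: str
    observations: List[str]
    answer: str


def reason(state: AgentState) -> AgentState:
    """Stand-in for the LLM step: answer only once an observation exists."""
    if state["observations"]:
        return {**state, "answer": f"Based on: {state['observations'][-1]}"}
    return state


def call_tool(state: AgentState) -> AgentState:
    """Stand-in for a tool node: appends a canned observation."""
    observation = f"weather lookup for '{state['question']}' -> sunny"
    return {**state, "observations": state["observations"] + [observation]}


def route(state: AgentState) -> str:
    """Conditional edge: choose the next node from the current state."""
    return "END" if state["answer"] else "call_tool"


NODES: Dict[str, Callable[[AgentState], AgentState]] = {
    "reason": reason,
    "call_tool": call_tool,
}


def run(state: AgentState, max_steps: int = 10) -> AgentState:
    """Alternate reason -> (tool -> reason)* until the router returns END."""
    current = "reason"
    for _ in range(max_steps):
        state = NODES[current](state)
        if current == "reason":
            next_node = route(state)
            if next_node == "END":
                return state
            current = next_node
        else:
            current = "reason"  # fixed edge: tool output goes back to the LLM
    raise RuntimeError("step limit reached without a final answer")


final = run({"question": "Newquay", "observations": [], "answer": ""})
print(final["answer"])
```

Unlike a linear chain, the number of tool calls is decided at run time by the router, and the step limit guards against loops that never converge.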

Observability and Tracing Practice · design lever

The extent to which the development team instruments the LLM application with end-to-end tracing using tools such as LangSmith, enabling inspection of every LLM call, tool invocation, retrieved document, and message flow. High observability practice means traces are captured for all environments, examined during development, and used to diagnose issues in production rather than relying on surface-level outputs alone.
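
To make the idea of end-to-end tracing concrete, here is a minimal in-memory sketch in which every step records its inputs, output, and duration. It is not LangSmith's API; the `traced` decorator, the `TRACE` list, and the stubbed retriever and LLM functions are hypothetical.

```python
import functools
import time

TRACE = []  # one record per completed step, in completion order


def traced(kind):
    """Decorator that records inputs, output, and duration of each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "kind": kind,
                "name": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("retriever")
def retrieve(query):
    return [f"chunk about {query}"]  # stand-in for a vector store lookup


@traced("llm")
def generate(query, chunks):
    return f"Answer to '{query}' using {len(chunks)} chunk(s)"  # stand-in for an LLM


@traced("chain")
def answer(query):
    return generate(query, retrieve(query))


answer("Cornwall beaches")
for record in TRACE:
    print(record["kind"], record["name"], record["output"])
```

Inspecting the retriever record separately from the LLM record is what lets a developer tell a retrieval failure from a generation failure, which the final output alone does not reveal.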

MCP Integration Adoption · design lever

The degree to which the agent application leverages the Model Context Protocol to discover and consume remote tools published by MCP servers, rather than maintaining bespoke per-service tool wrappers. Higher adoption means external capabilities are accessed through standardized MCP clients (FastMCP, LangChain MultiServerMCPClient) and the agent tool registry is populated dynamically from MCP server tool catalogs.
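
The difference between bespoke wrappers and a dynamically populated tool registry can be sketched as follows. The `FakeMCPServer` class is a hypothetical in-process stand-in, not FastMCP or MultiServerMCPClient; it only mimics the shape of an MCP tool listing, where each tool has a name, a description, and a JSON input schema. The tool names and outputs are invented for illustration.

```python
class FakeMCPServer:
    """Hypothetical stand-in for a remote MCP server (no network involved)."""

    def __init__(self, tools):
        self._tools = tools  # name -> (description, input_schema, handler)

    def list_tools(self):
        """Return the tool catalog the server publishes."""
        return [
            {"name": name, "description": desc, "inputSchema": schema}
            for name, (desc, schema, _handler) in self._tools.items()
        ]

    def call_tool(self, name, arguments):
        """Invoke a published tool by name with keyword arguments."""
        return self._tools[name][2](**arguments)


def build_registry(servers):
    """Populate the agent's tool registry from every server's catalog."""
    registry = {}
    for server in servers:
        for tool in server.list_tools():
            registry[tool["name"]] = {
                "description": tool["description"],
                # Bind server and tool name now to avoid late-binding bugs.
                "call": lambda arguments, s=server, n=tool["name"]: s.call_tool(n, arguments),
            }
    return registry


town_schema = {
    "type": "object",
    "properties": {"town": {"type": "string"}},
    "required": ["town"],
}
weather = FakeMCPServer({
    "get_weather": ("Current weather for a town", town_schema,
                    lambda town: f"{town}: sunny"),
})
hotels = FakeMCPServer({
    "find_hotels": ("Hotels in a town", town_schema,
                    lambda town: [f"{town} Inn"]),
})

registry = build_registry([weather, hotels])
result = registry["get_weather"]["call"]({"town": "Newquay"})
print(sorted(registry), result)
```

Adding a third server requires no new wrapper code: its tools appear in the registry as soon as it is passed to `build_registry`.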

Retrieval Relevance · psychological state

The degree to which the document chunks returned by the retrieval stage of a RAG pipeline are semantically pertinent to the user's question and contain sufficient information for the LLM to synthesize a correct and complete answer. High retrieval relevance means the top-k results consistently contain the key facts needed, with minimal noise or off-topic content, and that context length is appropriate—neither too sparse to answer nor too verbose to confuse the model.
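
The source does not prescribe how to measure retrieval relevance. One common assumption is to use the standard precision@k and recall@k metrics over a set of chunks labeled relevant for a question, as in this sketch with hypothetical chunk ids.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant (noise check)."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks found in the top-k (coverage check)."""
    if not relevant:
        return 0.0
    found = set(retrieved[:k]) & set(relevant)
    return len(found) / len(relevant)


retrieved = ["c1", "c7", "c3", "c9"]   # ranked output of the retriever
relevant = {"c1", "c3", "c4"}          # chunks a human judged as needed

p = precision_at_k(retrieved, relevant, k=4)
r = recall_at_k(retrieved, relevant, k=4)
print(p, r)
```

Low precision corresponds to the "noise or off-topic content" case above, and low recall to context that is "too sparse to answer".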

Agent Reasoning and Tool Selection Quality · psychological state

The quality of the agent's intermediate reasoning steps, including its ability to correctly identify which tools to call, formulate accurate tool arguments, interpret tool outputs, know when to request additional tool calls versus when to synthesize a final answer, and avoid defaulting to internal model knowledge when tools should be used. High quality means the agent's chain of thought is coherent, grounded in retrieved evidence, and converges efficiently to a correct final answer.

Context Grounding Quality · psychological state

The degree to which the LLM's generated responses are anchored to verified external context provided in the prompt (retrieved chunks, tool outputs, database records) rather than to the model's pre-trained parametric knowledge. High grounding means the model cites or refers to retrieved sources, stays within the boundaries of provided context, and explicitly acknowledges uncertainty rather than fabricating plausible-sounding but unverified information.

Answer Accuracy · outcome metric

The correctness and completeness of the final natural language response delivered to the user, measured against a ground truth reference. High accuracy means the answer correctly addresses all parts of the user's question, contains no factual errors, and does not omit critical information available in the knowledge base. Encompasses both factual precision and response completeness.

Hallucination Rate · outcome metric

The proportion of LLM responses that contain fabricated, unsupported, or factually incorrect information not present in the retrieved context or tool outputs. Lower is better. Hallucinations occur when the model fills gaps with plausible-sounding content drawn from parametric memory rather than grounded context, particularly when the retrieved context does not contain the answer.
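
As a worked example of the proportion defined above, the sketch below computes the rate from per-response judgments. It assumes each response has already been labeled (by a human reviewer or an evaluator) as containing unsupported content or not; how those labels are produced is outside the sketch.

```python
def hallucination_rate(judgments):
    """Share of responses flagged as containing unsupported content.

    `judgments` holds one boolean per response: True means the response
    contained information not present in the retrieved context or tool outputs.
    """
    if not judgments:
        raise ValueError("at least one judged response is required")
    return sum(judgments) / len(judgments)


# Hypothetical labels for four responses; one was flagged.
labels = [False, True, False, False]
rate = hallucination_rate(labels)
print(f"{rate:.0%}")
```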

System Reliability and Maintainability · outcome metric

The degree to which the LLM application or agent system behaves predictably, handles edge cases gracefully, can be debugged and extended without major refactoring, and remains stable as underlying models or data sources change. High reliability reflects modular component design, explicit state management, robust error handling, and the ability to swap LLMs or vector stores without rewriting application logic.

Developer Productivity · outcome metric

The speed and ease with which a software developer can design, implement, test, and iterate on LLM-powered applications and agent systems. High productivity reflects reduced boilerplate through framework abstractions (LangChain, LangGraph, pre-built agent components), faster debugging through observability tooling (LangSmith), and reuse of community resources (LangChain Hub, MCP ecosystem) rather than building every component from scratch.

How they connect

  • prompt engineering quality influences context grounding quality
  • prompt engineering quality influences agent reasoning quality
  • prompt engineering quality influences answer accuracy
  • indexing strategy sophistication predicts retrieval relevance
  • query transformation sophistication predicts retrieval relevance
  • retrieval architecture breadth influences retrieval relevance
  • retrieval relevance predicts context grounding quality
  • retrieval relevance predicts answer accuracy
  • context grounding quality predicts answer accuracy
  • context grounding quality predicts hallucination rate
  • agent orchestration design predicts agent reasoning quality
  • agent reasoning quality predicts answer accuracy
  • agent reasoning quality influences system reliability
  • observability practice influences agent reasoning quality
  • observability practice influences system reliability
  • mcp integration influences developer productivity
  • mcp integration influences system reliability
  • indexing strategy sophistication influences context grounding quality, mediated by retrieval relevance
  • query transformation sophistication influences context grounding quality, mediated by retrieval relevance
  • prompt engineering quality influences hallucination rate
  • agent orchestration design predicts system reliability
  • answer accuracy correlates with hallucination rate
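
The connections listed above can be read as a directed graph. The sketch below encodes the directed links (the final correlation between answer accuracy and hallucination rate is left out because it has no direction) and walks the graph to find everything downstream of a given lever. The function name `downstream` and the choice to treat every link as a plain directed edge are assumptions made for illustration.

```python
from collections import defaultdict, deque

EDGES = [
    ("prompt engineering quality", "context grounding quality"),
    ("prompt engineering quality", "agent reasoning quality"),
    ("prompt engineering quality", "answer accuracy"),
    ("prompt engineering quality", "hallucination rate"),
    ("indexing strategy sophistication", "retrieval relevance"),
    ("indexing strategy sophistication", "context grounding quality"),
    ("query transformation sophistication", "retrieval relevance"),
    ("query transformation sophistication", "context grounding quality"),
    ("retrieval architecture breadth", "retrieval relevance"),
    ("retrieval relevance", "context grounding quality"),
    ("retrieval relevance", "answer accuracy"),
    ("context grounding quality", "answer accuracy"),
    ("context grounding quality", "hallucination rate"),
    ("agent orchestration design", "agent reasoning quality"),
    ("agent orchestration design", "system reliability"),
    ("agent reasoning quality", "answer accuracy"),
    ("agent reasoning quality", "system reliability"),
    ("observability practice", "agent reasoning quality"),
    ("observability practice", "system reliability"),
    ("mcp integration", "developer productivity"),
    ("mcp integration", "system reliability"),
]

GRAPH = defaultdict(list)
for source, target in EDGES:
    GRAPH[source].append(target)


def downstream(node):
    """Return every construct reachable from `node`, sorted by name."""
    seen, queue = set(), deque([node])
    while queue:
        for target in GRAPH.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)


print(downstream("indexing strategy sophistication"))
print(downstream("mcp integration"))
```

Read this way, the retrieval-side levers reach answer accuracy and hallucination rate only through retrieval relevance and context grounding, while MCP integration touches only reliability and productivity.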

The story

The reader

A software developer or technical practitioner who wants to build reliable, production-grade AI applications and agents using large language models but is overwhelmed by the rapidly evolving ecosystem and unsure how to move from toy demos to robust systems.

External problem

The developer needs to design and implement LLM-powered applications—summarization engines, Q&A chatbots, and autonomous agents—that work reliably on real data, but encounters brittle pipelines, poor retrieval quality, hallucinating models, and fragile integrations every time they try.

Internal problem

They feel uncertain and out of their depth in a field that changes daily, worried that the approach they are building will become obsolete or that they are missing critical techniques that separate toy demos from production systems.

Philosophical problem

Developers should not have to reinvent the same boilerplate infrastructure over and over, and AI systems should be grounded in verified knowledge rather than generating plausible-sounding falsehoods.

The plan

  1. Master the fundamentals of prompt engineering and programmatic LLM interaction using the OpenAI API and LangChain.
  2. Build summarization engines and research tools using LangChain chains, LCEL, map-reduce, and refine strategies.
  3. Implement RAG from scratch to understand vector stores, embeddings, and retrieval before using LangChain abstractions.
  4. Apply advanced indexing (parent-child chunks, multi-vector retrieval, hypothetical questions, summaries) and query transformations (rewrite, step-back, HyDE, decomposition) to achieve production-quality retrieval.
  5. Extend RAG to heterogeneous data sources with metadata self-querying, SQL generation, knowledge graph querying, chain routing, and reciprocal rank fusion.
  6. Transition from linear chains to stateful, conditional LangGraph workflows with explicit state management.
  7. Build single-tool and multi-tool ReAct agents using LangGraph, understanding tool calling protocol, tool registration, and LLM guidance techniques.
  8. Compose multi-agent systems using router and supervisor patterns, enabling specialized agents to collaborate on complex requests.
  9. Build and consume MCP servers with FastMCP 2 and integrate them into agent applications alongside local tools.
  10. Monitor, trace, and debug the entire system lifecycle with LangSmith.

Success

  • The reader confidently designs, builds, debugs, and maintains LLM-powered applications and multi-agent systems in production.
  • They understand the architectural trade-offs between engines, chatbots, and agents, and choose the right pattern for each use case.
  • Their RAG systems surface genuinely relevant context and rarely hallucinate, because they apply advanced indexing and query transformation techniques.
  • Their agents use tools reliably and adapt dynamically to user requests rather than relying on model memory.
  • They can integrate any external service into an agent via MCP without writing bespoke wrappers.
  • They have end-to-end observability through LangSmith and can diagnose issues quickly in production.

At stake

  • Without these techniques, developers continue building brittle, one-off pipelines that fail in production due to hallucinations, poor retrieval, and unmanageable orchestration code.
  • They waste time reinventing infrastructure that LangChain, LangGraph, and MCP already solve, slowing delivery and accumulating technical debt.
  • Their LLM applications erode user trust by confidently returning incorrect information instead of admitting uncertainty.
  • They miss the shift to agent-based architectures and MCP-standardized tooling, leaving them unable to participate in the rapidly growing ecosystem of AI-ready services.
