Spring AI in Action

Craig Walls · 2025

In a sentence

A hands-on guide for Java developers to build production-grade generative AI applications using Spring AI, covering everything from basic prompting to RAG, tools, MCP, multimodal generation, observability, security, and autonomous agents.

Spring AI in Action is a practical guide for Java and Spring Boot developers who want to use generative AI without switching to Python. Starting from a simple question-and-answer REST service, the book progressively builds a complete 'Board Game Buddy' application that demonstrates retrieval-augmented generation, conversational memory, tool use, Model Context Protocol (MCP) servers and clients, voice and image generation, observability with Prometheus and Grafana, security with Spring Security, and agentic workflow patterns, including the new Embabel planning framework. Written by Craig Walls, author of Spring in Action, the book treats Spring AI as a first-class citizen of the Spring ecosystem. It shows how Spring AI's consistent client abstractions work across OpenAI, Anthropic, Azure, Ollama, and other providers, so that generative AI capabilities become natural extensions of an existing Spring Boot project.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

Tags

f1-systems

The model

A causal model describing how design levers available to Spring Boot developers (prompt engineering, RAG configuration, memory management, tool integration, security controls, observability instrumentation, and agentic composition) drive intermediate psychological and behavioral states in both the system and the developer, ultimately producing outcome metrics such as response quality, operational safety, developer productivity, and system observability.

Prompt Quality (design lever)

The degree to which a prompt submitted to an LLM is well-structured, contains appropriate instructions, assigns correct roles (system vs. user), uses templates with filled parameters, and provides sufficient grounding context to elicit accurate and relevant responses.
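As a rough illustration of "templates with filled parameters", the sketch below fills `{name}` placeholders in a single pass. It is plain Java and deliberately does not use Spring AI's own template classes; the class name `PromptTemplateSketch`, the method `fill`, and the sample game and question are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is NOT Spring AI's template API.
class PromptTemplateSketch {

    // Matches placeholders such as {game} or {question}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Replace each {name} with its value in one pass.
    // Placeholders with no matching parameter are left untouched.
    static String fill(String template, Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = params.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "Answer questions about {game}. Question: {question}";
        System.out.println(fill(template,
            Map.of("game", "Chess", "question", "How does castling work?")));
    }
}
```

Filling in a single pass means a value that itself contains braces is not expanded a second time, which keeps user-supplied text from being treated as template syntax.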

RAG Configuration Quality (design lever)

The effectiveness of the retrieval-augmented generation setup, encompassing document chunking strategy, embedding model choice, vector store selection, similarity search parameters (Top-K, similarity threshold, metadata filters), and query transformation techniques such as rewriting and expansion.
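To make the Top-K and similarity-threshold parameters concrete, here is a minimal sketch of a similarity search over an in-memory list. It assumes cosine similarity and tiny hand-written embeddings. The types `Chunk` and `Scored` and the method `search` are hypothetical names for illustration, not the Spring AI vector store API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT the Spring AI VectorStore API.
class TopKSearchSketch {

    // A stored chunk with a precomputed embedding (hypothetical type).
    record Chunk(String text, double[] embedding) {}

    // A chunk paired with its similarity to the query.
    record Scored(Chunk chunk, double score) {}

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Keep chunks scoring at or above the threshold, best first, at most topK.
    static List<Scored> search(List<Chunk> store, double[] query, int topK, double threshold) {
        List<Scored> hits = new ArrayList<>();
        for (Chunk c : store) {
            double s = cosine(c.embedding(), query);
            if (s >= threshold) {
                hits.add(new Scored(c, s));
            }
        }
        hits.sort((x, y) -> Double.compare(y.score(), x.score()));
        return new ArrayList<>(hits.subList(0, Math.min(topK, hits.size())));
    }

    public static void main(String[] args) {
        List<Chunk> store = List.of(
            new Chunk("setup rules", new double[] {1.0, 0.0}),
            new Chunk("scoring rules", new double[] {0.8, 0.6}),
            new Chunk("publisher history", new double[] {0.0, 1.0}));
        for (Scored s : search(store, new double[] {1.0, 0.0}, 2, 0.5)) {
            System.out.println(s.chunk().text());
        }
    }
}
```

With the sample data, the search returns the setup and scoring chunks in that order and drops the third, whose similarity to the query is zero. Raising the threshold trades recall for relevance, and lowering Top-K caps how many tokens the retrieved context can consume.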

Conversational Memory Configuration (design lever)

The design choices governing how chat history is stored, retrieved, and injected into prompts, including the choice of chat memory advisor (MessageChatMemoryAdvisor, PromptChatMemoryAdvisor, or VectorStoreChatMemoryAdvisor), the persistence backend (in-memory, JDBC, Cassandra, Neo4j, vector store), the message window size, and conversation ID scoping.
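The interaction between message window size and conversation ID scoping can be sketched in a few lines. This is an in-memory toy under stated assumptions: messages are plain strings, and the class `WindowedMemorySketch` with its `add` and `history` methods is hypothetical rather than one of the Spring AI memory classes named above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT Spring AI's chat memory implementation.
class WindowedMemorySketch {

    private final int maxMessages;
    // One independent window per conversation ID.
    private final Map<String, Deque<String>> byConversation = new HashMap<>();

    WindowedMemorySketch(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    // Append a message and evict the oldest ones once the window is full.
    void add(String conversationId, String message) {
        Deque<String> window =
            byConversation.computeIfAbsent(conversationId, id -> new ArrayDeque<>());
        window.addLast(message);
        while (window.size() > maxMessages) {
            window.removeFirst();
        }
    }

    // Messages that would be injected into the next prompt, oldest first.
    List<String> history(String conversationId) {
        Deque<String> window = byConversation.get(conversationId);
        return window == null ? new ArrayList<>() : new ArrayList<>(window);
    }

    public static void main(String[] args) {
        WindowedMemorySketch memory = new WindowedMemorySketch(2);
        memory.add("alice", "What is the goal of the game?");
        memory.add("alice", "How many players can play?");
        memory.add("alice", "How long does a round take?");
        System.out.println(memory.history("alice")); // only the last two messages remain
    }
}
```

A smaller window lowers token use per request but makes the model forget earlier turns sooner. Keying the map by conversation ID is what keeps one user's history out of another user's prompt.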

Tool Integration Completeness (design lever)

The extent to which an application exposes relevant, well-described, and correctly scoped @Tool-annotated methods or Function implementations to the LLM, enabling it to retrieve real-time data or trigger application actions that it could not accomplish from training alone.

AI Security Controls (design lever)

The breadth and depth of security mechanisms applied to the generative AI layer, including Spring Security authorization on tool methods via @PreAuthorize, metadata filters on vector store similarity searches to enforce document access controls, SafeGuardAdvisor for sensitive term blocking, CanaryWordAdvisor for prompt-leak detection, and ModerationModel usage for content policy enforcement.

Observability Instrumentation Level (design lever)

The degree to which the Spring AI application is instrumented for operational visibility, including enabling Spring Boot Actuator metrics, configuring Micrometer Prometheus export, building Grafana dashboards for token usage and operation latency, and enabling distributed tracing with Jaeger or Zipkin to trace requests through the full AI pipeline.

Agentic Workflow Composition (design lever)

The sophistication of the agentic architecture employed, ranging from single-prompt interactions through explicit workflow patterns (chaining, routing, parallelization) to fully autonomous planning via Embabel's Goal-Oriented Action Planning, where the agent derives its own execution path from action pre- and postconditions.
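The idea of an agent deriving its own execution path from action pre- and postconditions can be shown with a toy planner. The sketch below is not the Embabel API and makes simplifying assumptions: world state is a set of string facts, effects only add facts, and the search is a plain breadth-first search rather than whatever strategy Embabel uses internally. All names (`GoapSketch`, `Action`, `plan`, and the sample actions) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Toy planner in plain Java; this is NOT the Embabel API.
class GoapSketch {

    // An action applies when all preconditions hold; it then adds its effects.
    record Action(String name, Set<String> preconditions, Set<String> effects) {}

    // A search node: the facts known so far and the actions taken to reach them.
    record Node(Set<String> facts, List<String> path) {}

    // Breadth-first search for a shortest action sequence that makes `goal` true.
    static Optional<List<String>> plan(Set<String> start, String goal, List<Action> actions) {
        Deque<Node> frontier = new ArrayDeque<>();
        Set<Set<String>> seen = new HashSet<>();
        frontier.addLast(new Node(start, List.of()));
        seen.add(start);
        while (!frontier.isEmpty()) {
            Node node = frontier.removeFirst();
            if (node.facts().contains(goal)) {
                return Optional.of(node.path());
            }
            for (Action a : actions) {
                if (!node.facts().containsAll(a.preconditions())) {
                    continue; // preconditions not met in this state
                }
                Set<String> next = new HashSet<>(node.facts());
                next.addAll(a.effects());
                if (seen.add(next)) { // skip states already explored
                    List<String> path = new ArrayList<>(node.path());
                    path.add(a.name());
                    frontier.addLast(new Node(next, path));
                }
            }
        }
        return Optional.empty(); // no sequence of actions reaches the goal
    }

    public static void main(String[] args) {
        List<Action> actions = List.of(
            new Action("answer", Set.of("questionText", "rulesContext"), Set.of("answer")),
            new Action("retrieveRules", Set.of("questionText"), Set.of("rulesContext")),
            new Action("transcribe", Set.of("audio"), Set.of("questionText")));
        System.out.println(plan(Set.of("audio"), "answer", actions));
    }
}
```

The actions are listed in an arbitrary order and no workflow is hard-coded, yet the planner recovers the sequence transcribe, retrieveRules, answer because each action's preconditions are only satisfied by the effects of the one before it.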

Model and Provider Selection (design lever)

The choice of AI provider (OpenAI, Anthropic, Azure OpenAI, Ollama, Mistral, Google, etc.) and specific model (gpt-4o-mini, gpt-4.1, gemma, llama, etc.), along with tuning of generation parameters such as temperature, Top-P, and Top-K, which determine the capability ceiling, cost, context window size, and response variability of the application.

Context Relevance in Prompt (psychological state)

The degree to which the documents, tool results, and conversation history injected into a prompt as context are genuinely relevant to the current user query, as determined by vector similarity search quality, metadata filtering accuracy, and query transformation effectiveness.

LLM Response Grounding (psychological state)

The extent to which the LLM's generated response is anchored in the provided prompt context (documents, tool outputs, memory) rather than relying solely on parametric training knowledge, thereby reducing hallucination and improving factual accuracy.

Token Efficiency (behavioral pattern)

The ratio of useful information delivered to total tokens consumed per interaction, reflecting how well prompt construction, RAG chunking, and memory window sizing avoid wasting tokens on irrelevant or redundant content, directly affecting API cost and context window utilization.

Developer Productivity (behavioral pattern)

The speed and ease with which a Java developer can implement, iterate on, test, and maintain generative AI features within a Spring Boot application, encompassing the reduction of boilerplate through Spring AI autoconfiguration, the portability of code across providers, and the availability of testing infrastructure.

Response Quality (outcome metric)

The overall quality of the LLM-generated response as experienced by the end user, encompassing factual accuracy (FactCheckingEvaluator), relevance to the question (RelevancyEvaluator), coherence across multi-turn conversation, appropriate tone and language, and absence of hallucinated content.

Operational Safety (outcome metric)

The degree to which the application prevents unauthorized data access through the AI layer, resists adversarial and manipulative prompting, blocks content policy violations, and ensures that tool invocations respect authorization boundaries, measured as the absence of security incidents and moderation violations.

System Observability (outcome metric)

The operational visibility into the generative AI application's behavior, including token usage metrics, operation latency distributions, active operation counts, vector store query performance, and end-to-end distributed traces linking user requests to LLM calls, tool invocations, and vector store operations.

API Cost (outcome metric)

The monetary cost incurred for LLM API usage, determined by the total number of prompt tokens and completion tokens consumed across chat, embedding, and image generation operations, multiplied by the per-token pricing of the selected model and provider.
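The arithmetic behind this definition is simple enough to sketch. The example assumes that prompt and completion tokens are priced separately and that prices are quoted per million tokens. The class `ApiCostSketch` and the prices used are hypothetical and do not reflect any provider's actual rates.

```java
// Plain-Java illustration; the prices used here are made up, not real provider rates.
class ApiCostSketch {

    // Cost = promptTokens * promptPrice + completionTokens * completionPrice,
    // with both prices expressed per one million tokens.
    static double cost(long promptTokens, long completionTokens,
                       double promptPricePerMillion, double completionPricePerMillion) {
        return promptTokens / 1_000_000.0 * promptPricePerMillion
             + completionTokens / 1_000_000.0 * completionPricePerMillion;
    }

    public static void main(String[] args) {
        // 2M prompt tokens at 1.00 per million plus 0.5M completion tokens at 4.00 per million.
        System.out.println(cost(2_000_000, 500_000, 1.0, 4.0));
    }
}
```

With these made-up prices, the prompt side and the completion side each contribute 2.00 for a total of 4.00, which shows why trimming retrieved context and memory windows matters even when completions are short.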

Feature Capability Breadth (outcome metric)

The range of generative AI capabilities exposed by the application, including text Q&A, RAG over documents, conversational memory, real-time tool invocation, MCP server/client integration, voice transcription and synthesis, vision understanding, image generation, summarization, translation, sentiment analysis, and agentic workflows.

How they connect

  • prompt quality influences context relevance
  • rag configuration predicts context relevance
  • context relevance predicts llm grounding
  • llm grounding predicts response quality
  • memory configuration influences context relevance
  • tool integration influences llm grounding
  • tool integration influences token efficiency
  • security controls predicts operational safety
  • observability instrumentation predicts system observability
  • observability instrumentation influences api cost
  • model selection influences response quality
  • model selection predicts api cost
  • model selection moderates token efficiency
  • prompt quality influences token efficiency
  • token efficiency predicts api cost
  • agentic composition predicts feature capability breadth
  • agentic composition influences api cost
  • prompt quality influences developer productivity
  • developer productivity influences feature capability breadth
  • context relevance mediates response quality
  • llm grounding mediates response quality
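The relationships listed above form a directed graph, and one way to explore it is to ask which constructs lie downstream of a given lever. The sketch below treats every relationship (influences, predicts, moderates, mediates) as a plain directed edge, which is a simplifying assumption. The class `CausalGraphSketch` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration; every relationship type is flattened to a directed edge.
class CausalGraphSketch {

    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    // Record that `from` affects `to`.
    void link(String from, String to) {
        edges.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    // Every construct reachable downstream of `start`, found by depth-first traversal.
    Set<String> downstream(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            Set<String> targets = edges.get(node);
            if (targets == null) {
                continue; // an outcome with no outgoing edges
            }
            for (String next : targets) {
                if (seen.add(next)) {
                    stack.push(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        CausalGraphSketch graph = new CausalGraphSketch();
        graph.link("rag configuration", "context relevance");
        graph.link("context relevance", "llm grounding");
        graph.link("llm grounding", "response quality");
        graph.link("token efficiency", "api cost");
        System.out.println(graph.downstream("rag configuration"));
    }
}
```

Using only four of the listed edges, the traversal from "rag configuration" reaches context relevance, llm grounding, and response quality but not api cost, which mirrors the chain described in the list.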

The story

The reader

A Java or Spring Boot developer who wants to add generative AI capabilities to their applications but feels excluded from the AI wave because it seems to require Python and unfamiliar frameworks.

External problem

There is no well-established, idiomatic Java path for integrating LLMs, RAG, tools, memory, security, and agents into production Spring applications.

Internal problem

The developer feels left behind and anxious that their Java expertise is becoming irrelevant as AI becomes central to software development.

Philosophical problem

It is wrong that a developer's language choice should bar them from participating in one of the most transformative shifts in software history.

The plan

  1. Start with the simplest possible Spring AI application: a REST endpoint that sends a prompt to OpenAI and returns the answer.
  2. Learn to test AI responses using evaluators for relevancy and factual accuracy, and WireMock for deterministic unit tests.
  3. Master prompt construction: templates, roles, context stuffing, output conversion, streaming, and response metadata.
  4. Implement RAG with a vector store so the application can answer questions about any document, not just what the LLM was trained on.
  5. Add conversational memory so users can hold multi-turn dialogues.
  6. Enable tool use so LLMs can fetch real-time data and take actions via application code.
  7. Package tools into MCP Servers and consume them from MCP Clients for reusability and independent deployment.
  8. Extend the application with voice transcription, speech synthesis, vision, and image generation.
  9. Instrument the application with Micrometer metrics, Prometheus, Grafana dashboards, and distributed tracing.
  10. Secure RAG, tools, and prompts using Spring Security authorization and adversarial-prompt mitigation.
  11. Apply common generative AI patterns: summarization, translation, and sentiment analysis.
  12. Compose agentic workflows using chaining, routing, and parallelization, then graduate to self-planning agents with Embabel.

Success

  • Confidently add generative AI features to any existing Spring Boot project without touching Python.
  • Build RAG systems that answer factual questions about proprietary documents with high accuracy.
  • Deliver conversational AI experiences with persistent, per-user memory.
  • Expose and consume reusable tool collections via the industry-standard Model Context Protocol (MCP).
  • Ship multimodal features (voice, vision, image generation) as natural extensions of a Spring application.
  • Observe and optimize token usage and latency with production-grade dashboards and traces.
  • Protect AI features with the same Spring Security patterns already used in the rest of the application.
  • Design autonomous agents that plan their own workflows to achieve goals without hard-coded execution logic.

At stake

  • Remain on the sidelines of the generative AI revolution while competitors ship AI-powered features.
  • Rewrite business logic in Python, abandoning years of Java investment and domain knowledge.
  • Build brittle AI integrations that hallucinate, leak sensitive data, or break when the AI provider changes.
  • Operate AI features as black boxes with no visibility into cost, latency, or correctness.
  • Expose premium documents or privileged tool invocations to unauthorized users through an unsecured AI layer.
