Designing Data-Intensive Applications

Martin Kleppmann · 2017

In a sentence

A deep, principles-first guide to the architecture of reliable, scalable, and maintainable data systems, explaining the trade-offs behind databases, distributed systems, and data processing.

Behind the dizzying array of buzzwords—NoSQL, Big Data, eventual consistency, CAP theorem, MapReduce, stream processing—lie a small set of enduring principles that govern how data systems behave. Martin Kleppmann's Designing Data-Intensive Applications strips away the marketing to give software engineers and architects a technically precise understanding of how databases store and retrieve data, how replication and partitioning work, what guarantees transactions and consensus can and cannot provide, and how batch and stream processing fit together. Rather than tutoring you on one tool, it teaches you to reason about the fundamental trade-offs—reliability vs. cost, consistency vs. availability, timeliness vs. integrity—so you can choose, combine, and operate the right tools for any data-intensive application and design systems that survive the messy realities of hardware faults, network failures, and human error.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

Tags

f1-systems

The model

A causal framework expressing how design levers (data model, storage engine, replication, partitioning, transaction guarantees, dataflow integration) and contextual conditions (faults, load, unreliable networks and clocks) shape intermediate properties of the system and its engineering team (consistency guarantees, fault tolerance, complexity), which in turn drive the outcomes of reliability, scalability, and maintainability.

Data Model and Encoding Choice · design lever

The selection of how data is structured and serialized (relational, document, graph) and how it is encoded for storage and transmission, including schema evolution strategy.
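As a rough illustration of schema evolution, the sketch below decodes JSON records against a reader schema. The schemas, field names, and `decode` helper are hypothetical and chosen only for this example; real schema-based formats have their own resolution rules.

```python
import json

# Hypothetical schema versions: v2 adds an optional field with a default.
SCHEMA_V1 = {"user_id": None, "name": ""}
SCHEMA_V2 = {"user_id": None, "name": "", "email": ""}


def decode(payload: str, schema: dict) -> dict:
    """Decode a JSON record against a reader schema.

    Fields the reader does not know are ignored (forward compatibility);
    fields missing from the payload fall back to the schema default
    (backward compatibility).
    """
    raw = json.loads(payload)
    return {field: raw.get(field, default) for field, default in schema.items()}


old_record = json.dumps({"user_id": 1, "name": "Ada"})
new_record = json.dumps({"user_id": 2, "name": "Lin", "email": "lin@example.com"})

print(decode(new_record, SCHEMA_V1))  # old code reading new data
print(decode(old_record, SCHEMA_V2))  # new code reading old data
```

One caveat of this simple approach: if old code decodes a new record and writes it back, the unknown field is dropped, so a production design would need to preserve fields it does not understand.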

Storage Engine Design · design lever

The internal data structures and write/read strategies used to persist and retrieve data, such as log-structured (LSM-tree), page-oriented (B-tree), or column-oriented storage, optimized for transactional or analytic workloads.
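To make the log-structured idea concrete, here is a minimal sketch of an append-only log with an in-memory hash index. The `LogStore` class is hypothetical, uses an in-memory buffer in place of a file, and omits compaction, crash recovery, and concurrency control.

```python
import io
from typing import Dict, Optional, Tuple


class LogStore:
    """Append-only log plus an in-memory hash index of byte offsets."""

    def __init__(self) -> None:
        self._log = io.BytesIO()  # stands in for a file on disk
        self._index: Dict[str, Tuple[int, int]] = {}  # key -> (offset, length)

    def set(self, key: str, value: str) -> None:
        data = value.encode("utf-8")
        offset = self._log.seek(0, io.SEEK_END)  # writes always go to the end
        self._log.write(data)
        self._index[key] = (offset, len(data))  # newest write wins

    def get(self, key: str) -> Optional[str]:
        entry = self._index.get(key)
        if entry is None:
            return None
        offset, length = entry
        self._log.seek(offset)
        return self._log.read(length).decode("utf-8")

    def log_size(self) -> int:
        """Bytes in the log, including values that have been overwritten."""
        return self._log.getbuffer().nbytes


db = LogStore()
db.set("a", "1")
db.set("b", "22")
db.set("a", "333")  # the old value for "a" stays in the log as garbage
print(db.get("a"), db.log_size())
```

Writes are sequential appends and reads need one seek, which is the basic trade the log-structured family makes; the stale bytes left behind are what compaction would reclaim.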

Replication Strategy · design lever

The approach to keeping copies of data on multiple nodes, including single-leader, multi-leader, and leaderless replication, and synchronous versus asynchronous propagation of changes.
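For leaderless replication, a common rule of thumb is that reads see the latest write when the write quorum w and read quorum r satisfy w + r > n for n replicas. The sketch below is a toy model under that assumption; the function names are hypothetical and it ignores sloppy quorums, concurrent writes, and read repair.

```python
from typing import List, Optional, Tuple

Versioned = Tuple[int, str]  # (version number, value)


def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read set of size r must intersect every write set of size w."""
    return w + r > n


def write(replicas: List[Optional[Versioned]], w: int, item: Versioned) -> None:
    """Worst case for the reader: only the first w replicas receive the write."""
    for i in range(w):
        replicas[i] = item


def read(replicas: List[Optional[Versioned]], r: int) -> Optional[Versioned]:
    """Worst case for the reader: query the last r replicas, keep the newest version."""
    responses = [x for x in replicas[len(replicas) - r:] if x is not None]
    return max(responses, default=None)


strict = [(1, "old")] * 3
write(strict, w=2, item=(2, "new"))
print(read(strict, r=2))   # overlapping quorums: the new value is visible

loose = [(1, "old")] * 3
write(loose, w=1, item=(2, "new"))
print(read(loose, r=1))    # no overlap: a stale value can be returned
```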

Partitioning Strategy · design lever

The method of splitting a large dataset across nodes, including key-range and hash partitioning, secondary index partitioning, and rebalancing approaches to spread load and storage.
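The difference between the two main partitioning schemes can be sketched in a few lines. The helper names and boundary values below are hypothetical; taking the hash modulo the partition count is used only for brevity, since it forces most keys to move when the count changes.

```python
import bisect
import hashlib
from typing import List


def hash_partition(key: str, num_partitions: int) -> int:
    """Assign a key by a stable hash; adjacent keys are scattered."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(key: str, boundaries: List[str]) -> int:
    """Assign a key by sorted boundaries; adjacent keys stay together."""
    return bisect.bisect_right(boundaries, key)


boundaries = ["h", "p"]  # partition 0: < "h", 1: "h" to < "p", 2: >= "p"
for key in ["apple", "kiwi", "zebra"]:
    print(key, range_partition(key, boundaries), hash_partition(key, 3))
```

Range partitioning keeps a scan over neighbouring keys on one partition but can concentrate load there; hashing evens out load at the cost of turning range scans into queries against every partition.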

Transaction and Isolation Guarantee Level · design lever

The strength of safety guarantees provided for grouped reads and writes, ranging from weak isolation levels (read committed, snapshot isolation) to serializability and atomic commit across nodes.

Distributed Environment Unreliability · contextual condition

The contextual condition that networks drop or delay packets, clocks drift and are unsynchronized, and processes pause unpredictably, producing partial failures that are nondeterministic and hard to detect.

Workload Load · contextual condition

The volume and pattern of data and requests placed on the system, including read/write ratio, data volume, concurrency, and access patterns that stress its capacity.

Consensus and Ordering Mechanism · behavioral pattern

The use of mechanisms for getting nodes to agree and for ordering events, including linearizability, total order broadcast, and consensus algorithms that enable correct coordination despite faults.
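One simple way to order events across nodes is a Lamport timestamp, sketched below. The `LamportClock` class is illustrative only; it yields a total order consistent with causality, but by itself it is not a consensus algorithm and cannot tell a node when the order is final.

```python
from typing import Tuple

Timestamp = Tuple[int, str]  # (counter, node id); tuples compare counter first


class LamportClock:
    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> Timestamp:
        """Stamp a local event."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote: Timestamp) -> Timestamp:
        """Stamp the receipt of a message: jump past the sender's counter."""
        self.counter = max(self.counter, remote[0])
        return self.tick()


a, b = LamportClock("a"), LamportClock("b")
t1 = a.tick()        # event on node a
t2 = b.receive(t1)   # b receives a's message
t3 = b.tick()        # later event on node b
t4 = a.receive(t3)   # a receives b's message
print(sorted([t4, t2, t1, t3]))  # sorts back into causal order
```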

Dataflow Integration Approach · behavioral pattern

The way heterogeneous data systems are kept in sync through derived data, using batch processing, stream processing, change data capture, and event logs to flow changes between systems.
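A derived view fed from an event log can be sketched as a consumer that remembers the last offset it applied, which makes redelivered events harmless. The `DerivedBalances` class and the event layout are hypothetical, and the sketch assumes events arrive in offset order from a single log partition.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, int]  # (log offset, account, balance change)


class DerivedBalances:
    """A view derived from an event log, safe to feed the same event twice."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self.last_offset = -1

    def apply(self, event: Event) -> bool:
        offset, account, delta = event
        if offset <= self.last_offset:
            return False  # already applied; skip the duplicate
        self.balances[account] = self.balances.get(account, 0) + delta
        self.last_offset = offset
        return True


log: List[Event] = [(0, "alice", 50), (1, "bob", 20), (2, "alice", -30)]

view = DerivedBalances()
for event in log:
    view.apply(event)
for event in log:  # simulate redelivery after a consumer restart
    view.apply(event)
print(view.balances)
```

Because the log is the source of truth, the view can also be discarded and rebuilt from offset 0, which is the reprocessing capability listed among the candidate measures.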

System Complexity · psychological state

The degree of accidental and inherent complexity in the system, manifested as tight coupling, tangled dependencies, and difficulty for engineers to understand and reason about behavior.

Reliability · outcome metric

The system continuing to work correctly, performing the desired function at the desired performance level, even in the face of hardware faults, software errors, and human mistakes.

Scalability · outcome metric

The system's ability to cope with increased load in data volume, traffic, or complexity through reasonable means of adding computing resources while maintaining good performance.

Maintainability · outcome metric

The ease with which engineering and operations teams can keep the system running, understand it, and adapt it to new use cases over time, encompassing operability, simplicity, and evolvability.

How they connect

  • data model choice influences system complexity
  • data model choice influences maintainability outcome
  • storage engine design influences scalability outcome
  • replication strategy influences reliability outcome
  • replication strategy predicts consensus and ordering
  • partitioning strategy influences scalability outcome
  • transaction guarantee level influences reliability outcome
  • transaction guarantee level influences scalability outcome
  • distributed environment unreliability moderates consensus and ordering
  • distributed environment unreliability influences reliability outcome
  • workload load moderates scalability outcome
  • consensus and ordering mediates reliability outcome
  • dataflow integration mediates maintainability outcome
  • dataflow integration influences reliability outcome
  • system complexity influences maintainability outcome
  • system complexity influences reliability outcome

A candidate measure

Designing Data-Intensive Applications — derived measurement candidates

Data Model and Encoding Choice

Proportion of data in each model type; Encoding format compatibility coverage; Number of schema migrations performed without downtime

self-report suitability: medium

Storage Engine Design

Write amplification ratio; Sequential vs random I/O proportion; Read latency for point and range queries

self-report suitability: low

Replication Strategy

Replication lag (seconds); Data loss incidents on failover; Conflict resolution counts

self-report suitability: medium

Partitioning Strategy

Coefficient of variation of per-partition load; Hot-spot request concentration; Rebalancing frequency and duration

self-report suitability: medium
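The first two partitioning measures reduce to simple arithmetic over per-partition request counts. The sketch below assumes the population standard deviation and made-up load figures; the function names are hypothetical.

```python
import statistics
from typing import Sequence


def load_cv(loads: Sequence[float]) -> float:
    """Coefficient of variation: population std dev divided by the mean."""
    mean = statistics.fmean(loads)
    if mean == 0:
        return 0.0
    return statistics.pstdev(loads) / mean


def hot_spot_share(loads: Sequence[float]) -> float:
    """Fraction of all requests handled by the busiest partition."""
    total = sum(loads)
    return max(loads) / total if total else 0.0


balanced = [100, 100, 100, 100]
skewed = [100, 100, 100, 700]  # illustrative numbers, not measurements
print(load_cv(balanced), hot_spot_share(balanced))
print(round(load_cv(skewed), 3), hot_spot_share(skewed))
```

A coefficient of variation near 0 indicates evenly spread load, while a value near or above 1 suggests one or a few partitions dominate.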

Transaction and Isolation Guarantee Level

Incidence of observed concurrency anomalies; Pass/fail of serializability tests; Distributed transaction success rate

self-report suitability: medium

Distributed Environment Unreliability

Network fault events per month; Maximum observed clock offset; Distribution of GC/process pause durations

self-report suitability: none

Workload Load

Requests per second; Dataset size in bytes; Fan-out factor; Read/write ratio

self-report suitability: medium

Consensus and Ordering Mechanism

Linearizability test pass rate; Leader election frequency; Ordering violations detected

self-report suitability: low

Dataflow Integration Approach

Derived data lag; Number of dual-write race conditions; Reprocessing capability presence

self-report suitability: medium

System Complexity

Coupling and dependency metrics; Developer-reported complexity ratings; Change effort estimates

self-report suitability: high

Reliability

Uptime percentage; Mean time to failure; Data loss/corruption incident count

self-report suitability: low

Scalability

Peak sustainable requests per second; p95/p99 latency at load levels; Scaling efficiency per added node

self-report suitability: low
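Tail latency figures such as p95 and p99 can be computed from raw samples with the nearest-rank method, sketched below. The choice of nearest rank is an assumption of this example; monitoring tools often interpolate or use approximate histograms instead.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # illustrative: 1 ms to 100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
```

Percentiles should be computed over the raw samples of a time window; averaging per-node percentiles does not give a meaningful result.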

Maintainability

Average change lead time; Team-reported operability scores; Operational incident frequency

self-report suitability: high

The story

The reader

A software engineer or architect who builds backend systems and wants to design data-intensive applications that are reliable, scalable, and maintainable.

External problem

The landscape of data tools is fast-changing and overwhelming, and no single tool meets all requirements, forcing engineers to combine systems they don't fully understand.

Internal problem

They feel uncertain whether their architectural choices are correct, fearing subtle bugs, data loss, or systems that buckle under load or failure.

Philosophical problem

It's wrong to make critical engineering decisions based on buzzwords and vendor hype rather than a precise understanding of the underlying trade-offs.

The plan

  1. Learn the foundational concerns: reliability, scalability, and maintainability.
  2. Understand how data is modeled, stored, encoded, and evolved on a single machine.
  3. Master the challenges of distributing data: replication, partitioning, and transactions.
  4. Confront the hard limits of distributed systems: faults, clocks, consistency, and consensus.
  5. Apply batch and stream processing to derive data and integrate heterogeneous systems.
  6. Design future-proof architectures around dataflow, correctness, and ethical responsibility.

Success

  • You can confidently choose and combine the right tools for each problem.
  • You build systems that survive hardware faults, network failures, and human error.
  • You reason clearly about consistency, ordering, and correctness in distributed settings.
  • You design evolvable systems that adapt as requirements and technologies change.

At stake

  • You build fragile systems that silently lose or corrupt data under faults.
  • You make costly architectural mistakes by trusting hype over understanding.
  • You become locked into inflexible designs that can't evolve.
  • You inadvertently cause harm through poor handling of data, privacy, and bias.

Chapter by chapter

  1. ch01 · Reliable, Scalable, and Maintainable Applications

    This chapter outlines the key principles of designing data-intensive applications, emphasizing the importance of reliability, scalability, and maintainability in creating robust software systems.

  2. ch02 · Data Models and Query Languages

    This chapter explores the foundational aspects of data models and query languages, emphasizing how different models impact software development and data manipulation.

    • Data models significantly shape not only the structure of applications but also the way developers understand and approach problems.
    • Relational databases continue to dominate but face increasing competition from more flexible data models like document and graph databases.
    • Key factors driving the choice of data models include relationship complexity, scalability, and ease of query execution.
    • The emergence of NoSQL has expanded the toolkit available to developers, necessitating a nuanced understanding of the trade-offs involved.
  3. ch03 · Storage and Retrieval

    This chapter explores the internal mechanisms of database storage and retrieval, contrasting various storage engines to help application developers select the right one for their needs.

  4. ch04 · Storage and Retrieval

    This chapter contrasts the mechanisms of data storage and retrieval in databases, delineating how different architectures cater to online transaction processing (OLTP) and online analytic processing (OLAP) to meet diverse query patterns and performance needs.

  5. ch05 · Encoding and Evolution

    This chapter explores the dynamic nature of data applications, highlighting the importance of encoding formats that support both backward and forward compatibility to facilitate system evolvability.

    • Embracing evolvability in design ensures that systems can adapt without disruption.
    • Maintaining both backward and forward compatibility is critical for reducing the risks associated with schema changes.
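    • Schema-based binary encodings such as Thrift, Protocol Buffers, and Avro are more compact than JSON or XML and make compatibility rules explicit.
    • Rolling upgrades deploy a new version to a few nodes at a time, so old and new code run side by side and must be able to read each other's data.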
  6. ch06p01 · Replication (part 1/2)

    Replication keeps copies of data on multiple nodes to improve availability, latency, and read throughput, yet it introduces complexities in keeping those copies consistent.

    • Replication is not merely about duplicating data but requires an intricate understanding of consistency, availability, and performance.
    • Single-leader replication simplifies consistency because all writes go through one node, but a leader failure requires failover; multi-leader and leaderless systems tolerate such faults better but must resolve conflicts between concurrent writes.
    • Managing replication involves trade-offs; asynchronous replication keeps writes fast but can lose recently acknowledged writes if the leader fails before followers catch up.
    • Techniques such as read-after-write consistency and proper conflict resolution are essential to provide a seamless user experience.
  7. ch06p02 · Replication (part 2/2)

    This chapter delves into the complexities and methodologies of data replication in distributed systems, emphasizing the pivotal nature of replication in achieving high availability and fault tolerance.

  8. ch07 · Partitioning

    This chapter addresses the necessity of partitioning large datasets to enhance scalability and mitigate the risks of data skew and hot spots, focusing on techniques, their implications, and the management of distributed data systems.

    • Partitioning is essential for scaling large datasets across distributed systems, directly impacting query performance and operational efficiency.
    • An effective partitioning strategy requires a balance between access patterns and data distribution to prevent hot spots from developing.
    • Key range partitioning supports efficient range queries but risks load imbalance, while hash partitioning provides even distribution but complicates range access.
    • Rebalancing the data across nodes is critical when systems evolve, ensuring equitable access and performance consistency throughout the data lifecycle.
  9. ch08p01 · Transactions (part 1/2)

    This chapter explores the intricacies of transactions in databases, asserting their necessity for managing concurrency and reliability while exposing the complexities that arise with their implementation.

  10. ch08p02 · Transactions (part 2/2)

    This chapter explores the intricacies of transaction management in databases, detailing how transactions mitigate concurrency issues through various isolation levels, and emphasizing the significance of serializable snapshot isolation (SSI) as a robust solution.

    • Transactions serve as an essential abstraction layer that simplifies error management in complex database systems.
    • Serializable snapshot isolation is an optimistic alternative to two-phase locking: transactions proceed without blocking each other and are aborted at commit time if a conflict is detected.
    • Various isolation levels affect the outcome of transaction interactions, with serializable isolation being the only level that protects against all race conditions discussed.
    • Shorter transaction duration can significantly reduce the rate of aborts, leading to better application performance.
  11. ch09 · The Trouble with Distributed Systems

    This chapter confronts the many ways things can go wrong in distributed systems, from unreliable networks to nondeterministic partial failures, and argues that engineers must design for failure rather than assume the best case.

    • The adage “anything that can go wrong will go wrong” is especially true for distributed systems, and engineers must therefore design with failure in mind.
    • Unlike single-node systems, distributed systems face a spectrum of faults, including unpredictable network behaviors and partial failures.
    • Each node should be treated as fallible; implementing quorum-based decisions can mitigate risks associated with relying on a single node's operational state.
    • Network communication is inherently unreliable; it’s essential to build systems that can handle uncertainties in message delivery and timing.
  12. ch10 · The Trouble with Distributed Systems

    This chapter explores the complexities of managing leadership and locks in distributed systems, highlighting the pitfalls of false assumptions about node reliability and the critical role of fencing tokens in preventing data corruption.

    • A node cannot declare itself leader on its own judgment; leadership must be backed by a quorum of nodes, otherwise two nodes may both act as leader (split brain).
    • Fencing tokens serve as a crucial defense against the mistaken assumption of ownership in a resource, thus preventing data corruption in multi-client write scenarios.
    • Byzantine faults, where nodes may send arbitrary or malicious messages, add significant complexity; most server-side data systems instead assume that nodes are unreliable but honest.
    • Correctness of distributed algorithms is stated in terms of safety properties, which must always hold, and liveness properties, which need only hold eventually and may carry caveats.
  13. ch11p01 · Consistency and Consensus (part 1/2)

    This chapter explores the complexities of achieving consensus in distributed systems, examining the importance of fault tolerance and the trade-offs involved in consistency models.

    • Achieving consensus in distributed systems is critical but inherently challenging because networks are unreliable and nodes can fail.
    • Different consistency models, from eventual consistency to linearizability, come with trade-offs that can significantly impact system performance and dependability.
    • Understanding and implementing consensus algorithms are essential steps toward building fault-tolerant distributed systems.
    • The perceived simplicity of achieving consensus can lead to flawed designs if underlying complexities are not adequately considered.
  14. ch11p02 · Consistency and Consensus (part 2/2)

    This chapter delves into the challenges and methodologies of maintaining consistency in distributed systems, particularly focusing on the execution of consensus algorithms and the limitations of distributed transactions.

    • Atomically committing message acknowledgments together with database writes is essential for ensuring that each message is processed effectively exactly once.
    • The XA standard enables two-phase commit across diverse systems, although it introduces significant operational challenges.
    • In-doubt transactions keep holding their locks until the coordinator recovers, blocking other transactions and sometimes requiring manual resolution.
    • Consensus algorithms let a distributed system keep making decisions despite node failures, provided a majority of nodes remains available.
  15. ch12 · Batch Processing

    Batch processing systems operate by executing extensive data tasks without the immediate need for user interaction, setting them apart from online request-response systems.

    • Batch processing systems work on large, bounded datasets and are judged mainly by throughput rather than response time.
    • MapReduce, while historically significant, showed how large batch jobs can run reliably on clusters of commodity hardware.
    • Embracing Unix tools can enhance your efficiency in data analysis without the need for complex programming.
    • The principles of modularity and simplicity in the Unix philosophy resonate with modern agile practices, advocating for rapid iteration and experimentation in system design.
  16. ch13 · Batch Processing

    This chapter delves into batch processing within the context of Hadoop's MapReduce framework, detailing its operations, efficiencies, and workflows while contrasting it with traditional databases and emerging data processing paradigms.

  17. ch14p01 · Stream Processing (part 1/2)

    This chapter explores stream processing as an approach to handling unbounded data input in real-time, contrasting it with traditional batch processing and examining its mechanics, including event streams, messaging systems, and fault tolerance.

    • Stream processing is a necessary evolution for handling unbounded, continuous data input effectively.
    • Log-based message brokers such as Apache Kafka and Amazon Kinesis provide the infrastructure for storing and delivering event streams reliably.
    • Continuous data flow necessitates a shift from traditional batch processing mechanics to a more agile, event-driven approach.
    • Robust fault tolerance measures, including checkpointing and micro-batching, are critical in maintaining data integrity amidst real-time processing challenges.
  18. ch14p02 · Stream Processing (part 2/2)

    This chapter delves into the intricacies of stream processing, detailing its protocols and methodologies for ensuring fault tolerance, state management, and the handling of message processing, ultimately supporting the concept of continuous data flows in real-time applications.

    • Idempotence is critical in stream processing to prevent adverse effects from processing the same message multiple times.
    • Local state management, through periodic snapshots, is often more efficient than relying solely on remote databases for recovery.
    • Stream processing architectures can significantly benefit from utilizing log-based brokers for message delivery and state tracking.
    • Effective joins in stream processing (stream-stream, stream-table, table-table) allow for enriching and correlating data in real-time.
  19. ch15 · The Future of Data Systems

    This chapter presents a forward-looking perspective on data systems, advocating for innovative design choices that prioritize robustness, correctness, and adaptability in the face of diverse integration needs.

    • The future of data systems hinges on integrating diverse tools tailored to specific application needs, rather than relying on generalized solutions.
    • Effective data integration requires an understanding of one’s organizational data flows and the complexities entailed in managing multiple systems simultaneously.
    • Change data capture and event sourcing represent promising advancements in maintaining data consistency and promoting system robustness.
    • As systems scale, it becomes increasingly necessary to address both the limits of total order broadcast and the causal dependencies between events to prevent data inconsistencies.
  20. ch16p01 · The Future of Data Systems (part 1/3)

    This chapter explores the gradual evolution of data systems through deriving views, the lambda architecture, and the unbundling of databases, all while addressing the increasing complexity of integrating batch and stream processing.

  21. ch16p02 · The Future of Data Systems (part 2/3)

    As algorithmic decision-making proliferates, the chapter argues that individuals may face systematic exclusion from essential societal functions due to biased data, highlighting the urgent need for accountability and ethical frameworks in data systems.

    • "Algorithmic prison" refers to the restrictive barriers faced by individuals categorized as risky by automated systems, limiting their societal participation.
    • Data-driven systems can exacerbate existing biases, amplifying discrimination unless actively countered by ethical oversight.
    • Algorithms are not necessarily more impartial than humans; they can reflect and reinforce societal prejudices.
    • The lack of accountability for algorithmic decisions risks eroding trust in technology and societal institutions.
  22. ch16p03 · The Future of Data Systems (part 3/3)

    This chapter explores the shifting paradigms in data systems, emphasizing the necessity for evolving architectures that can handle the complexities of modern data processing and integrity.

    • A shift towards microservices can provide significantly higher resilience in data handling compared to traditional monolithic architectures.
    • The CAP theorem is narrower than its reputation suggests; reasoning about the specific trade-offs between linearizability, availability, and latency is more useful when designing distributed systems.
    • Techniques borrowed from distributed ledgers and certificate transparency could make data systems more auditable and increase trust in their integrity.
    • Change data capture is not merely an option but a necessity for maintaining accurate and timely data across complex systems.
