Designing Data-Intensive Applications

Martin Kleppmann · 2017

In a sentence

A deep, principles-first guide to the architecture of reliable, scalable, and maintainable data systems, explaining the trade-offs behind databases, distributed systems, and data processing.

Behind the dizzying array of buzzwords—NoSQL, Big Data, eventual consistency, CAP theorem, MapReduce, stream processing—lie a small set of enduring principles that govern how data systems behave. Martin Kleppmann's Designing Data-Intensive Applications strips away the marketing to give software engineers and architects a technically precise understanding of how databases store and retrieve data, how replication and partitioning work, what guarantees transactions and consensus can and cannot provide, and how batch and stream processing fit together. Rather than tutoring you on one tool, it teaches you to reason about the fundamental trade-offs—reliability vs. cost, consistency vs. availability, timeliness vs. integrity—so you can choose, combine, and operate the right tools for any data-intensive application and design systems that survive the messy realities of hardware faults, network failures, and human error.

Select an Appropriate Data Model

To choose the best data model (e.g., relational, document, graph) that aligns with application requirements, data relationships, and expected use cases.

When to use: When designing a new application or feature that requires data persistence.

Step 1: Analyze application requirements, focusing on the types of relationships the data will have (one-to-one, one-to-many, many-to-many).
Entry: A clear understanding of the application's domain and features.
Exit: A documented list of data entities and their relationships.
In: Application requirements · Out: Data relationship analysis
ch02
Step 2: Evaluate the capabilities of different data models (relational, document, graph) against the application's needs.
Entry: Data relationship analysis is complete.
Exit: A comparative analysis of suitable data models.
- Which models are viable candidates for the application.
In: Data relationship analysis · Out: Comparative analysis of data models
ch02
Step 3: Consider non-functional requirements such as performance, scalability, data integrity, and flexibility for each candidate model.
Entry: A set of candidate data models has been identified.
Exit: An evaluation of each model against non-functional requirements.
In: Performance and scalability constraints · Out: Model evaluation report
ch02
Step 4: Make an informed decision on the data model that best suits the application's operational context and requirements.
Entry: Full analysis of models against functional and non-functional requirements is complete.
Exit: A final data model is selected and documented.
- Selecting the final data model.
In: Model evaluation report · Out: Selected data model
ch02

Model Application Data in Layers

To efficiently model real-world entities and relationships by abstracting them into layers, from application objects down to a general-purpose data model for storage.

When to use: During the development of an application's data persistence layer.

Step 1: Identify real-world entities relevant to the application domain (e.g., people, organizations, products).
Entry: Application domain is understood.
Exit: A list of core entities is defined.
In: Domain knowledge · Out: List of real-world entities
ch02
Step 2: Model these entities as objects or data structures within the application code and develop APIs for their manipulation.
Entry: Entities are identified.
Exit: Application-level data structures and APIs are created.
In: List of real-world entities · Out: Application objects/data structures
ch02
Step 3: Translate the application data structures into a general-purpose data model format (e.g., JSON, XML, relational tables, graph).
Entry: Application objects are defined.
Exit: A representation of the data in a standard format.
- Which general-purpose data model to use.
In: Application objects/data structures · Out: Structured data representation
ch02
Step 4: Store and process the data using a database system that supports the chosen data model.
Entry: Data is represented in a general-purpose format.
Exit: Data is persistently stored and queryable.
In: Structured data representation · Out: Stored data
ch02
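
The layering in the steps above can be sketched in a few lines of Python. This is only an illustration: the `Person` and `Position` classes and their fields are hypothetical, and JSON is just one of the general-purpose formats named in Step 3.

```python
import json
from dataclasses import dataclass, asdict, field

# Step 2: application-level objects (hypothetical entities for illustration).
@dataclass
class Position:
    title: str
    organization: str

@dataclass
class Person:
    name: str
    positions: list = field(default_factory=list)

person = Person("Ada", [Position("Engineer", "Example Corp")])

# Step 3: translate the object graph into a general-purpose model (a JSON document).
document = json.dumps(asdict(person))

# Step 4 would hand `document` to a database; here it is only parsed back.
restored = json.loads(document)
```

The one-to-many relationship between a person and their positions is stored inline in the document. A relational layout would instead put positions in their own table with a foreign key.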

Implement Object-Relational Mapping (ORM)

To reduce boilerplate code and seamlessly translate data between object-oriented application code and relational database tables.

When to use: When building the data access layer of an application that interacts with a relational database.

Step 1: Identify the application objects that require persistent storage in the database.
Entry: Application objects and database schema are defined.
Exit: A list of objects to be persisted is created.
In: Application data structures, Database schema · Out: List of objects for persistence
ch02
Step 2: Use an ORM framework to define mappings between application classes and database tables.
Entry: An ORM framework has been chosen.
Exit: Mappings between objects and tables are configured.
In: ORM framework · Out: Object-relational mappings
ch02
Step 3: Implement data manipulation (CRUD) operations using the ORM's methods, which automatically generate and execute SQL queries.
Entry: Mappings are configured.
Exit: Application code uses ORM methods for database interaction.
In: Object-relational mappings · Out: Data access layer code
ch02
Step 4: Optimize performance by configuring the ORM's data retrieval strategies, caching, and connection pooling.
Entry: Basic CRUD operations are implemented.
Exit: ORM performance is tuned for the application's workload.
- Which data relationships should be eagerly vs. lazily loaded.
ch02

Build an LSM-Tree Based Key-Value Store

To build a high-performance, write-optimized key-value store using the Log-Structured Merge-Tree (LSM-Tree) architecture, suitable for high-throughput workloads.

When to use: When designing a database or storage system that needs to handle a very high rate of writes.

Step 1: Handle incoming writes by first appending them to a write-ahead log (WAL) on disk for durability.
Entry: A write request is received.
Exit: The write operation is durably logged.
In: Key-value pair · Out: WAL entry
ch03
Step 2: After logging, add the key-value pair to an in-memory sorted structure, called a memtable (e.g., a red-black tree or skip list).
Entry: Write is logged to WAL.
Exit: Memtable is updated with the new key-value pair.
In: Key-value pair · Out: Updated memtable
ch03
Step 3: When the memtable exceeds a configured size threshold, flush it to a new file on disk as a Sorted String Table (SSTable).
Entry: Memtable size exceeds threshold.
Exit: A new, sorted, immutable SSTable file is created on disk.
- When to flush the memtable to disk.
In: Full memtable · Out: SSTable file
ch03
Step 4: To handle a read request, first check the memtable for the key, then check the most recent SSTable on disk, and then subsequent older SSTables in order.
Entry: A read request for a key is received.
Exit: The most recent value for the key is returned, or not-found.
In: Key · Out: Value
ch03
Step 5: Keep a sparse in-memory index that stores the byte offsets of some keys within each SSTable file. Because the file is sorted, a lookup can jump to the nearest indexed key and scan forward from there.
Entry: SSTables are created.
Exit: An in-memory index for SSTables is maintained.
In: SSTable file · Out: In-memory index
ch03
Step 6: Periodically run a background compaction process to merge multiple SSTable segments.
Entry: Multiple SSTable segments exist on disk.
Exit: A new, compacted SSTable segment is created, and old segments are deleted.
- Which segments to merge and when.
In: SSTable segments · Out: Compacted SSTable segment
ch03
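
The write, flush, read, and compaction steps above can be sketched as a toy store. This is a simplified illustration, not a production design: the class name `TinyLSM` and the `memtable_limit` parameter are hypothetical, segments are kept in Python lists rather than files, and the sparse index from Step 5 is left out (a binary search over the whole segment stands in for it).

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: a memtable plus immutable sorted segments (held in memory here)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entry count)
        self.wal = []        # stands in for an append-only log file on disk
        self.memtable = {}   # sorted at flush time; stands in for a sorted tree
        self.segments = []   # newest last; each is a sorted list of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))   # Step 1: log the write first
        self.memtable[key] = value      # Step 2: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                # Step 3: flush once the threshold is reached

    def flush(self):
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []   # the log is no longer needed once the memtable is persisted

    def get(self, key):
        if key in self.memtable:        # Step 4: memtable first
            return self.memtable[key]
        for segment in reversed(self.segments):   # then newest segment to oldest
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        merged = {}
        for segment in self.segments:   # Step 6: oldest first, so newer values win
            merged.update(segment)
        self.segments = [sorted(merged.items())]
```

Calling `put` repeatedly, then `compact`, leaves a single segment that holds only the latest value for each key.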

Implement Column-Oriented Storage

To optimize storage and query performance for analytical workloads by storing data column by column instead of row by row.

When to use: When building a data warehouse or an analytical query engine for large datasets.

Step 1: Organize data on disk by storing all values for a single column together in a contiguous block.
Entry: A dataset with a defined schema is ready for storage.
Exit: Data is physically stored on disk in a columnar format.
In: Dataset · Out: Columnar data files
ch04
Step 2: Apply compression techniques to each column's data.
Entry: Data is stored in columnar format.
Exit: Column data is compressed on disk.
- Which compression algorithm to use for each column.
In: Columnar data files · Out: Compressed columnar data files
ch04
Step 3: Develop query execution logic that only reads the columns required for a given query from disk.
Entry: A query is submitted.
Exit: Only the necessary columns for the query are loaded into memory.
In: SQL query · Out: In-memory column data
ch04
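
A small sketch of the three steps above, with in-memory lists standing in for column files. Run-length encoding is used here only as one simple example of a per-column compression scheme, and the function names and sample rows are hypothetical.

```python
from itertools import groupby

def to_columns(rows, schema):
    """Step 1: pivot row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in schema}

def rle_encode(values):
    """Step 2: run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(runs):
    return [value for value, count in runs for _ in range(count)]

def sum_where(encoded, filter_col, wanted, sum_col):
    """Step 3: decode only the two columns the query touches."""
    keys = rle_decode(encoded[filter_col])
    amounts = rle_decode(encoded[sum_col])
    return sum(a for k, a in zip(keys, amounts) if k == wanted)

rows = [
    {"product": "apple", "store": 1, "quantity": 3},
    {"product": "apple", "store": 2, "quantity": 5},
    {"product": "pear", "store": 1, "quantity": 2},
]
columns = to_columns(rows, ["product", "store", "quantity"])
encoded = {name: rle_encode(values) for name, values in columns.items()}
total = sum_where(encoded, "product", "apple", "quantity")
```

The `store` column is never decoded for this query, which is the saving that column-oriented layouts aim for on wide tables.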

Implement Multi-Column Indexing

To efficiently query multiple columns in a database simultaneously by creating a single, composite index.

When to use: When optimizing queries that have predicates on several columns in the `WHERE` clause.

Step 1: Identify the columns that are frequently queried together.
Entry: Slow multi-column queries have been identified.
Exit: A set of columns for a composite index is determined.
In: Query logs · Out: List of columns to index
ch04
Step 2: Define the order of the columns in the concatenated index.
Entry: Columns for the index are identified.
Exit: The order of columns for the index is defined.
- Deciding the optimal order of columns in the index.
In: List of columns to index · Out: Ordered list of columns
ch04
Step 3: Create the concatenated index using the database's DDL command.
Entry: The ordered list of columns is defined.
Exit: A multi-column index is created in the database.
In: Ordered list of columns · Out: Multi-column index
ch04
Step 4: Verify that the query optimizer uses the new index to execute the target queries.
Entry: The index is created.
Exit: Queries are confirmed to be using the new index, and performance is improved.
In: SQL query · Out: Query execution plan
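
As a concrete illustration of Steps 3 and 4, the sketch below uses SQLite through Python's standard `sqlite3` module. The `people` table, its columns, and the index name are hypothetical, and other databases expose the plan through their own commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO people (last_name, first_name, city) VALUES (?, ?, ?)",
    [("Smith", "Ann", "Leeds"), ("Smith", "Bob", "York"), ("Jones", "Ann", "Bath")],
)

# Step 3: column order matters. This index serves lookups by last_name alone
# or by (last_name, first_name), but not by first_name alone.
conn.execute("CREATE INDEX idx_people_name ON people (last_name, first_name)")

# Step 4: ask the planner how it would run the target query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT city FROM people WHERE last_name = ? AND first_name = ?",
    ("Smith", "Ann"),
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # the fourth column holds the description
```

If `plan_text` mentions `idx_people_name`, the planner chose the composite index for this query. The exact wording of the plan differs between SQLite versions.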

Optimize Database Queries

To enhance the performance of database queries by reducing their execution time and resource consumption.

When to use: When application response times are slow, or database load is high.

Step 1: Identify slow or resource-intensive queries by analyzing database performance monitoring tools and logs.
Entry: Performance degradation is observed.
Exit: A list of candidate queries for optimization is created.
- Which queries to prioritize for optimization based on frequency and impact.
In: Database performance metrics, Query logs · Out: List of slow queries
ch02
Step 2: Use the database's query analysis tool (e.g., `EXPLAIN PLAN`) to understand the current execution plan for a slow query.
Entry: A slow query has been selected for analysis.
Exit: The query's execution plan is understood.
In: SQL query · Out: Query execution plan
ch02
Step 3: Modify the query or database schema based on the analysis to improve performance.
Entry: The execution plan shows a performance bottleneck.
Exit: A change (e.g., new index, rewritten query) is implemented.
- Whether to add an index, rewrite the query, or change the schema.
In: Query execution plan · Out: Optimized query or new index
ch02 · ch04
Step 4: Regularly monitor query performance to detect new bottlenecks as data and access patterns evolve.
Entry: Initial optimization is complete.
Exit: An ongoing monitoring process is in place.
ch02

Partition Data Across Multiple Nodes

To scale a database beyond a single machine by dividing a large dataset into smaller, more manageable partitions distributed across multiple nodes.

When to use: When a dataset's size or query throughput exceeds the capacity of a single server.

Step 1: Analyze the dataset's size, growth rate, and query access patterns to determine the need for partitioning.
Entry: Performance or storage capacity limits are being approached.
Exit: A decision to partition the data is made.
In: Dataset characteristics, Performance metrics
ch07
Step 2: Choose a partitioning strategy: key-range partitioning or hash partitioning.
Entry: The decision to partition is made.
Exit: A partitioning strategy is selected.
- Choosing between key-range and hash partitioning based on query patterns.
In: Query access patterns · Out: Selected partitioning strategy
ch07
Step 3: Implement the chosen partitioning method by assigning key ranges or hash ranges to each node in the cluster.
Entry: A partitioning strategy is selected.
Exit: The dataset is distributed across multiple nodes according to the partitioning scheme.
In: Selected partitioning strategy · Out: Partitioned dataset
ch07
Step 4: Combine partitioning with replication to ensure fault tolerance for each partition.
Entry: Data is partitioned.
Exit: Each partition is replicated for high availability.
ch07
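
The two strategies from Step 2 and the replica placement from Step 4 can be sketched as below. All function names are hypothetical, MD5 is used only as a stable hash (not for security), and the number of partitions is assumed to be fixed independently of the number of nodes.

```python
import bisect
import hashlib

def hash_partition(key, num_partitions):
    """Hash partitioning: spreads keys evenly but loses key ordering."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def range_partition(key, upper_bounds):
    """Key-range partitioning: upper_bounds are sorted, exclusive upper limits.

    With bounds ["h", "q"], partition 0 holds keys below "h", partition 1
    holds "h" up to (not including) "q", and partition 2 holds the rest.
    """
    return bisect.bisect_right(upper_bounds, key)

def replica_nodes(partition, nodes, replication_factor):
    """Step 4: place each partition on several consecutive nodes."""
    return [nodes[(partition + i) % len(nodes)] for i in range(replication_factor)]
```

Key-range partitioning keeps adjacent keys together, which helps range scans but can create hot spots. Hash partitioning trades away efficient range scans for a more even spread.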

Rebalance Partitions

To redistribute data and query load evenly across nodes in a distributed database when nodes are added or removed, or when data distribution becomes skewed.

When to use: When adding or removing nodes from a cluster, or when monitoring shows significant load imbalance.

Step 1: Monitor the data size and query load on each partition and node to detect imbalances.
Entry: A partitioned system is in operation.
Exit: Load imbalance is detected or a change in cluster size is planned.
In: System load metrics · Out: Rebalancing trigger
ch07
Step 2: Initiate the rebalancing process, which calculates a new assignment of partitions to nodes.
Entry: Rebalancing is triggered.
Exit: A new partition assignment plan is created.
In: Current partition assignments, Cluster topology · Out: New partition assignment plan
ch07
Step 3: Move partition data from source nodes to destination nodes according to the new plan.
Entry: A new assignment plan is ready.
Exit: Data for partitions has been moved to the new nodes.
In: New partition assignment plan
ch07
Step 4: Update the routing tier or coordination service with the new location of the moved partitions.
Entry: Partition data has been moved.
Exit: The routing layer is updated with the new partition locations.
ch07
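
One way to compute the new assignment in Step 2 is sketched below. It assumes a fixed number of partitions and balances by partition count only, which is a simplification; a real system would also weigh data size and query load. The function name `plan_rebalance` is hypothetical.

```python
def plan_rebalance(assignment, nodes):
    """Return a new partition -> node map that moves few partitions.

    assignment: dict of partition id -> current node
    nodes: the target list of nodes (nodes may have been added or removed)
    """
    new_assignment = dict(assignment)
    load = {node: 0 for node in nodes}
    orphans = []
    for partition, node in sorted(assignment.items()):
        if node in load:
            load[node] += 1
        else:
            orphans.append(partition)   # its node is leaving the cluster

    # Partitions on departed nodes go to whichever node is least loaded.
    for partition in orphans:
        target = min(nodes, key=lambda n: load[n])
        new_assignment[partition] = target
        load[target] += 1

    # Then shift one partition at a time from the busiest to the idlest node.
    while True:
        busiest = max(nodes, key=lambda n: load[n])
        idlest = min(nodes, key=lambda n: load[n])
        if load[busiest] - load[idlest] <= 1:
            break
        partition = next(p for p, n in sorted(new_assignment.items()) if n == busiest)
        new_assignment[partition] = idlest
        load[busiest] -= 1
        load[idlest] += 1
    return new_assignment
```

Comparing the old and new maps gives the list of partitions to move in Step 3 and the entries to update in the routing tier in Step 4.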

Implement Request Routing in a Partitioned System

To direct client requests to the correct node in a partitioned database that holds the data for the requested key.

When to use: When building or interacting with a partitioned (sharded) database.

Step 1: Maintain an up-to-date mapping of which partition lives on which node.
Entry: A partitioned database cluster exists.
Exit: A partition-to-node mapping is available.
Out: Partition-to-node mapping
ch07
Step 2: When a client sends a request, the routing layer determines which partition the request's key belongs to.
Entry: A client request is received.
Exit: The target partition for the request is identified.
In: Client request with key · Out: Target partition ID
ch07
Step 3: Look up the node responsible for the target partition in the partition-to-node mapping.
Entry: Target partition is identified.
Exit: The target node address is identified.
In: Target partition ID, Partition-to-node mapping · Out: Target node address
ch07
Step 4: Forward the request to the identified target node for execution.
Entry: Target node is identified.
Exit: Request is sent to the correct node.
In: Client request, Target node address
ch07
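
A minimal routing tier following the four steps above might look like this. The `Router` class is hypothetical, hash partitioning is assumed, and `send` is a caller-supplied function standing in for a network call.

```python
import hashlib

class Router:
    def __init__(self, num_partitions, partition_to_node):
        self.num_partitions = num_partitions
        self.partition_to_node = dict(partition_to_node)  # Step 1: the mapping

    def partition_for(self, key):
        """Step 2: map a key to its partition with a stable hash."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_partitions

    def node_for(self, key):
        """Step 3: look up the node that owns the key's partition."""
        return self.partition_to_node[self.partition_for(key)]

    def route(self, key, send):
        """Step 4: forward the request through send(node, key)."""
        return send(self.node_for(key), key)
```

After a rebalance, only `partition_to_node` changes. The key-to-partition function stays the same, so keys do not need to be rehashed.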

Implement Leader-Based Replication

To maintain data consistency and availability in a distributed system by designating a single leader to handle all writes and propagate them to follower replicas.

When to use: When setting up replication for a database such as PostgreSQL or MySQL, or for one of the many NoSQL systems that use a single leader.

Step 1: Designate one replica as the leader (primary) and configure other nodes as followers (secondaries).
Entry: A cluster of database nodes is available.
Exit: Leader and follower roles are assigned.
- Which node to designate as the initial leader.
In: Node configuration files
ch06p01 · ch16p03
Step 2: Direct all client write requests to the leader node.
Entry: Roles are assigned.
Exit: Clients are configured to send writes to the leader.
In: Write requests
ch06p01
Step 3: The leader writes the new data to its local storage and appends the change to a replication log.
Entry: Leader receives a write request.
Exit: Data is written locally and logged by the leader.
In: Write request · Out: Replication log entry
ch06p01 · ch16p03
Step 4: The leader sends the data change from its log to all followers.
Entry: Data is logged by the leader.
Exit: Followers receive the data change.
In: Replication log entry
ch06p01
Step 5: Followers apply the changes from the leader to their local data copy.
Entry: Followers receive a data change.
Exit: Follower's local database is updated.
In: Data change
ch06p01
Step 6: Configure clients to read from either the leader or any follower, depending on consistency requirements.
Entry: Replication is active.
Exit: Clients are configured for read access.
ch06p01
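
The flow of a write through the leader and on to followers can be sketched with in-process objects. The `Replica` and `Leader` classes are hypothetical, and replication here is asynchronous and triggered by hand through `replicate()`, so a follower may serve stale reads until it catches up.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0   # number of log entries applied so far

    def apply(self, entries):
        """Step 5: apply changes in log order."""
        for key, value in entries:
            self.data[key] = value
            self.applied += 1

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.log = []
        self.followers = followers

    def write(self, key, value):
        """Steps 2-3: accept the write, log it, and apply it locally."""
        self.log.append((key, value))
        self.apply([(key, value)])

    def replicate(self):
        """Step 4: send each follower the log entries it has not yet applied."""
        for follower in self.followers:
            follower.apply(self.log[follower.applied:])
```

Reading from a follower between `write` and `replicate` returns the old value, which is the replication lag that Step 6 asks clients to account for.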

Implement Leaderless Replication

To achieve high write availability and low latency, especially in multi-datacenter deployments, by allowing any replica to accept write requests.

When to use: When designing a system like Cassandra or Riak that needs to remain available for writes even during network partitions or node failures.

Step 1: Configure all nodes in the cluster to accept both read and write requests from clients.
Entry: A cluster of database nodes is available.
Exit: All nodes are configured as write replicas.
ch16p03
Step 2: When a node receives a write, it sends the write to all other replicas responsible for that data.
Entry: A node receives a write request.
Exit: The write is propagated to other replicas.
In: Write request
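Step 3: Use quorum reads and writes: with n replicas, choose w and r so that w + r > n, so every read overlaps with at least one replica that holds the latest acknowledged write.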
Entry: A read or write request is initiated.
Exit: The operation is confirmed by a quorum of nodes.
- Choosing the appropriate quorum sizes (W and R) to balance read/write latency and consistency.
In: Read/write request
ch16p03
Step 4: Implement a mechanism to detect and resolve write conflicts caused by concurrent updates to the same key.
Entry: A write conflict is detected.
Exit: The conflict is resolved according to a defined policy.
- Choosing a conflict resolution strategy.
In: Conflicting write requests · Out: Resolved value
ch06p01 · ch16p03
Step 5: Implement a repair mechanism (e.g., read repair on each read, or a background anti-entropy process) to ensure that nodes with stale data eventually catch up.
Entry: Nodes may have inconsistent data due to failures or partitions.
Exit: Stale data is eventually updated on all replicas.
ch16p03
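
A toy version of quorum writes, quorum reads, and read repair is sketched below. It makes several simplifying assumptions: version numbers are supplied by the caller, the highest version simply wins (so truly concurrent writes are not detected), and reads contact every reachable replica rather than exactly `r` of them. The names `Node`, `quorum_write`, and `quorum_read` are hypothetical.

```python
class Node:
    def __init__(self):
        self.store = {}   # key -> (version, value)
        self.up = True

def quorum_write(nodes, key, version, value, w):
    """Send the write to every reachable replica; succeed if at least w acknowledge."""
    acks = 0
    for node in nodes:
        if node.up:
            current = node.store.get(key, (0, None))
            if version > current[0]:
                node.store[key] = (version, value)
            acks += 1
    return acks >= w

def quorum_read(nodes, key, r):
    """Read from reachable replicas, return the newest value, and repair stale ones."""
    replies = [(node, node.store.get(key, (0, None))) for node in nodes if node.up]
    if len(replies) < r:
        raise RuntimeError("not enough replicas reachable for a read quorum")
    newest = max((reply for _, reply in replies), key=lambda reply: reply[0])
    for node, reply in replies:
        if reply[0] < newest[0]:
            node.store[key] = newest   # read repair
    return newest[1]
```

With three replicas and `w = r = 2`, a replica that missed a write while it was down still cannot cause a stale read, because any two replicas include at least one that saw the write.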

Manage Node Lifecycle and Failover in a Replicated System

To ensure high availability by managing the addition of new nodes, recovery of failed nodes, and automatic promotion of a new leader when the current one fails.

When to use: During the operation of a replicated database to handle planned scaling or unplanned node failures.

Step 1: To add a new follower, take a consistent snapshot of the leader's database and copy it to the new node.
Entry: A new node needs to be added to the cluster.
Exit: A database snapshot is available on the new node.
In: Leader's database · Out: Database snapshot
ch06p01
Step 2: The new follower connects to the leader and requests all data changes that occurred since the snapshot was taken to catch up.
Entry: Snapshot is restored on the new node.
Exit: The new follower is fully synchronized and actively replicating.
In: Database snapshot, Replication log · Out: Synchronized follower
ch06p01
Step 3: A follower that recovers after a temporary outage reconnects to the leader and requests all changes it missed while disconnected.
Entry: A follower node restarts after a crash or network partition.
Exit: The recovered follower is fully synchronized.
In: Follower's last known transaction ID · Out: Resynchronized follower
ch06p01
Step 4: Continuously monitor the leader's health using heartbeats or a failure detector.
Entry: A leader-based system is operational.
Exit: Leader failure is detected.
In: Node status information · Out: Leader failure signal
ch06p01
Step 5: If the leader fails, automatically trigger a failover process by starting a leader election among the remaining followers.
Entry: Leader failure is detected.
Exit: A leader election is initiated.
In: Leader failure signal
ch06p01
Step 6: Elect a new leader, typically the follower with the most up-to-date replication log, to minimize data loss.
Entry: Leader election is initiated.
Exit: A new leader is chosen.
- How to choose the new leader from available followers.
In: Follower replication statuses · Out: New leader
ch06p01
Step 7: Reconfigure the system and clients to direct writes to the new leader.
Entry: A new leader is chosen.
Exit: The system is reconfigured to use the new leader.
ch06p01

Detect Node Failures in a Distributed System

To reliably identify unresponsive or failed nodes in a distributed system to trigger recovery or failover mechanisms.

When to use: As a core component of any high-availability distributed system for monitoring node health.

Step 1: Continuously measure network round-trip times (RTT) between nodes to establish a baseline performance profile.
Entry: A distributed system is operational.
Exit: A statistical model of network RTT is available.
In: Network monitoring data · Out: RTT distribution data
ch09
Step 2: Configure an appropriate timeout value for requests based on the observed network latency, adding a buffer for variability.
Entry: RTT data is available.
Exit: A timeout value is configured.
- How long the timeout should be.
In: RTT distribution data · Out: Configured timeout
ch09
Step 3: A monitoring node sends periodic heartbeat requests to a target node.
Entry: Monitoring is active.
Exit: A heartbeat request is sent.
Out: Heartbeat request
ch09
Step 4: If a response is not received within the configured timeout, the target node is considered potentially failed.
Entry: A heartbeat request has been sent.
Exit: A timeout occurs.
Out: Potential failure event
ch09
Step 5: Retry the request a specified number of times to guard against transient network issues.
Entry: A timeout has occurred.
Exit: All retries have failed.
ch09
Step 6: If all retries fail, declare the node as dead and notify other system components (e.g., a leader election module or load balancer).
Entry: All retries have failed.
Exit: The node is marked as dead and a notification is sent.
Out: Node failure notification
ch09
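
The timeout and retry logic in the steps above can be sketched as two small functions. The formula used for the timeout (mean RTT plus a multiple of the standard deviation) is an assumption chosen for illustration, not a rule from the source, and `ping` is a caller-supplied function standing in for a real heartbeat request.

```python
import statistics

def choose_timeout(rtt_samples, safety_factor=4.0):
    """Step 2: mean observed RTT plus a buffer proportional to its variability."""
    return statistics.mean(rtt_samples) + safety_factor * statistics.pstdev(rtt_samples)

def is_node_dead(ping, timeout, retries=3):
    """Steps 3-6: ping(timeout) returns an RTT in seconds, or None on timeout.

    The node is declared dead only if the first attempt and every retry time out.
    """
    for _ in range(1 + retries):
        if ping(timeout) is not None:
            return False
    return True
```

A longer timeout or more retries reduces false positives during load spikes, at the cost of slower detection of real failures.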

Manage ACID Transactions

To ensure data integrity and consistency in the face of concurrent operations and failures by grouping database operations into atomic units with ACID properties.

When to use: When performing a sequence of read and write operations that must all succeed or all fail together.

Step 1: Choose an appropriate isolation level (e.g., Read Committed, Repeatable Read, Serializable) based on the application's tolerance for concurrency anomalies versus performance overhead.
Entry: A transactional workload is being designed.
Exit: An isolation level is selected for the transaction.
- Choosing between stronger vs. weaker isolation levels.
In: Application consistency requirements · Out: Selected isolation level
ch08p01
Step 2: Begin a transaction to group a series of read and write operations.
Entry: A multi-step database operation needs to be performed.
Exit: A transaction context is started.
ch08p01
Step 3: Execute the database operations within the transaction.
Entry: A transaction is started.
Exit: All database operations are executed.
In: Read/write commands
ch08p01
Step 4: Implement error handling to detect failures (e.g., constraint violations, deadlocks, network errors) during execution.
Entry: Operations are being executed.
Exit: An error is detected or all operations complete successfully.
Out: Success or error signal
ch08p01
Step 5: If all operations succeed, commit the transaction to make the changes durable and visible to other transactions.
Entry: All operations completed successfully.
Exit: Transaction is committed.
ch08p01
Step 6: If any operation fails or a consistency violation is detected, abort the transaction and roll back all its changes.
Entry: An error occurred during the transaction.
Exit: Transaction is rolled back.
- Whether to commit or rollback based on the success of operations.
In: Error signal
ch08p01
Step 7: Implement a retry strategy (e.g., with exponential backoff) for transient errors like deadlocks, where retrying the transaction may succeed.
Entry: A transient error caused a transaction to abort.
Exit: The transaction is retried or a final error is reported.
- Whether to retry a transaction based on the type of error.
In: Error type
ch08p01
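
The commit, rollback, and retry steps can be illustrated with SQLite through Python's standard `sqlite3` module. The `accounts` table and the `transfer` function are hypothetical. Treating `OperationalError` as transient is an assumption that fits SQLite's "database is locked" errors; other databases signal deadlocks and serialization failures differently. Choosing an isolation level (Step 1) is not shown.

```python
import sqlite3
import time

def transfer(conn, src, dst, amount, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            with conn:   # commits on success, rolls back if the block raises
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
                )
                conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
                )
            return True
        except sqlite3.OperationalError:
            # Assumed transient (e.g. a locked database): back off, then retry.
            time.sleep(0.01 * 2 ** attempt)
        except sqlite3.IntegrityError:
            return False   # a constraint violation will not succeed on retry
    return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()
```

When a transfer would overdraw the source account, the second `UPDATE` violates the `CHECK` constraint and the credit already applied to the destination is rolled back with it.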

Implement Serializable Snapshot Isolation (SSI)

To provide full serializability (the strongest isolation level) with high performance by using an optimistic approach that avoids most locking.

When to use: When implementing a database that requires both strong consistency guarantees and high concurrency.

Step 1: Allow transactions to read from a consistent snapshot of the database without taking locks.
Entry: A transaction begins.
Exit: The transaction has a consistent view of the database.
ch08p02
Step 2: While a transaction runs, the database tracks its read and write sets.
Entry: A transaction is executing.
Exit: The transaction's read and write sets are tracked.
In: Read and write operations · Out: Transaction read/write sets
ch08p02
Step 3: When a transaction attempts to commit, the database checks for serialization conflicts.
Entry: A transaction requests to commit.
Exit: A decision is made whether the commit is safe.
- Determine if the transaction can commit based on serializability criteria.
In: Transaction read/write sets · Out: Commit or abort decision
ch08p02
Step 4: If a conflict is detected, abort the transaction that would violate serializability.
Entry: A serialization conflict is detected.
Exit: The conflicting transaction is aborted.
ch08p02

Achieve Consensus in a Distributed System

To ensure all nodes in a distributed system agree on a single value or state (e.g., a transaction outcome or the current leader), preventing inconsistencies like split-brain.

When to use: When a distributed system needs to perform an operation that requires all nodes to agree, such as electing a new leader or committing a distributed transaction.

Step 1: A node initiates a proposal for a value or action (e.g., electing itself as leader).
Entry: A decision needs to be made across the cluster.
Exit: A proposal is formulated.
Out: Proposal
ch11p01 · ch11p02
Step 2: The proposer sends the proposal, often with an incrementing epoch or term number to distinguish it from stale proposals, to other nodes.
Entry: A proposal is formulated.
Exit: The proposal is sent to other nodes.
In: Proposal
ch11p01 · ch11p02
Step 3: Each node evaluates the proposal based on its own state and the algorithm's rules, then sends a vote (accept/reject) back to the proposer.
Entry: A node receives a proposal.
Exit: The node sends its vote.
- Whether to accept or reject the proposal.
In: Proposal · Out: Vote
ch11p01 · ch11p02
Step 4: The proposer collects votes and waits for a quorum (typically a majority) of nodes to accept the proposal.
Entry: The proposer has sent a proposal.
Exit: A quorum of votes is received or the round times out.
In: Votes from other nodes
ch11p01 · ch11p02
Step 5: If a quorum is reached, the decision is considered committed and is communicated to all nodes.
Entry: A quorum of votes has been achieved.
Exit: The decision is committed and broadcast.
Out: Committed decision
ch11p01 · ch11p02
Step 6: If a quorum is not reached (e.g., due to timeouts or rejections), the process is aborted or retried, potentially with a new proposal in a new round.
Entry: A quorum is not achieved within the timeout.
Exit: The consensus round fails.
- Whether to retry with a new proposal.
ch11p01

Implement Atomic Distributed Transactions (2PC/XA)

To achieve atomic commitment for a transaction that spans multiple nodes or heterogeneous systems (e.g., multiple databases, message queues), ensuring all participants either commit or abort together.

When to use: When an application needs to update data in multiple different databases or systems as a single atomic operation.

Step 1: A transaction coordinator begins the transaction and assigns it a globally unique ID.
Entry: A distributed transaction is required.
Exit: A transaction is initiated with a coordinator.
Out: Global transaction ID
ch11p01 · ch11p02
Step 2: Phase 1 (Prepare): The coordinator sends a 'prepare' request to all participating nodes.
Entry: The application has performed all its operations on the participants.
Exit: All participants receive a 'prepare' request.
ch11p01 · ch11p02
Step 3: Each participant checks if it can commit the transaction. If so, it makes its changes durable (e.g., writes to a WAL) and replies 'yes' to the coordinator; otherwise, it replies 'no'.
Entry: A participant receives a 'prepare' request.
Exit: The participant sends its vote to the coordinator.
In: Prepare request · Out: Participant vote ('yes'/'no')
ch11p01 · ch11p02
Step 4: Phase 2 (Commit/Abort): The coordinator collects all responses.
Entry: The coordinator has received votes from all participants or timed out.
Exit: The coordinator makes a final commit/abort decision.
- The coordinator must decide whether to commit or abort based on participant responses.
In: Participant votes · Out: Final transaction outcome
ch11p01 · ch11p02
Step 5: If all participants replied 'yes', the coordinator writes a 'commit' decision to its own durable log and sends a 'commit' command to all participants.
Entry: All participants voted 'yes'.
Exit: All participants receive a 'commit' command.
ch11p01 · ch11p02
Step 6: If any participant replied 'no' or timed out, the coordinator writes an 'abort' decision to its log and sends an 'abort' command to all participants.
Entry: At least one participant voted 'no' or timed out.
Exit: All participants receive an 'abort' command.
ch11p01 · ch11p02
Step 7: Participants receive the final command and either commit or roll back their prepared changes.
Entry: A participant receives a final command from the coordinator.
Exit: The participant's part of the transaction is finalized.
In: Commit/abort command
ch11p01 · ch11p02
Step 8: In case of coordinator failure after the prepare phase, restart the coordinator, read its transaction log to resolve in-doubt transactions, and notify participants of the final outcome.
Entry: The coordinator process restarts after a failure.
Exit: In-doubt transactions are resolved.
In: Coordinator's transaction log
ch11p02
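
The two phases above can be sketched with in-process objects. The `Participant` and `Coordinator` classes are hypothetical, a plain dict stands in for the coordinator's durable log, and timeouts and crash recovery (Step 8) are left out.

```python
import uuid

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self, txid):
        """Phase 1: vote yes only if the commit is guaranteed to succeed later."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, txid, decision):
        """Phase 2: obey the coordinator's decision ("committed" or "aborted")."""
        self.state = decision

class Coordinator:
    def __init__(self):
        self.log = {}   # stands in for a durable log: txid -> decision

    def run(self, participants):
        txid = uuid.uuid4().hex                            # Step 1: global ID
        votes = [p.prepare(txid) for p in participants]    # Steps 2-3: prepare
        decision = "committed" if all(votes) else "aborted"
        self.log[txid] = decision   # the commit point: record before telling anyone
        for p in participants:                             # Steps 5-7: broadcast
            p.finish(txid, decision)
        return txid, decision
```

Writing the decision to the log before notifying participants is what lets a restarted coordinator resolve in-doubt transactions, since a prepared participant cannot safely decide on its own.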

Use Fencing Tokens to Prevent Split-Brain

To prevent a node that has been paused or mistakenly believes it is still the leader from making writes that corrupt data, especially after a failover.

When to use: When implementing a distributed lock or a leader election mechanism to ensure safe access to a shared resource.

Step 1A client requests a lock or lease from a lock service to become the leader for a resource.
Entry: A client needs to perform a leadership role.
Exit: A lock request is sent.
Out: Lock request
ch10
Step 2The lock service grants the lock and returns a unique, strictly monotonically increasing fencing token (e.g., a number).
Entry: The lock service can grant the lock.
Exit: The client receives a lock and a fencing token.
In: Lock request · Out: Fencing token
ch10
Step 3The client (now the leader) must include its fencing token with every write request it sends to the storage service.
Entry: The client holds a lock and a token.
Exit: A write request with a fencing token is sent.
In: Write operation · Out: Write request with token
ch10
Step 4The storage service maintains the highest token number it has seen for a given resource.
Entry: Storage service receives a write request with a token.
Exit: The received token is compared with the stored maximum token.
In: Write request with token
ch10
Step 5If the token in the request is older (lower) than the highest token seen, the storage service rejects the write.
Entry: The request token is lower than the stored maximum.
Exit: The write request is rejected.
- Comparison of the received token with the highest processed token determines whether the write is accepted or rejected.
Out: Write rejection
ch10
Step 6If the token is equal to or higher than the highest token seen, the storage service accepts the write and updates its highest-seen token.
Entry: The request token is higher than or equal to the stored maximum.
Exit: The write is accepted and the maximum token is updated.
Out: Write acceptance
ch10
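A minimal sketch of the token check, assuming a single protected resource and an in-memory lock service. `LockService` and `StorageService` are hypothetical names, and lease expiry is only implied by the second `acquire` call rather than modeled.

```python
class LockService:
    """Grants leases, each with a strictly increasing fencing token."""

    def __init__(self) -> None:
        self._next_token = 0

    def acquire(self) -> int:
        self._next_token += 1
        return self._next_token


class StorageService:
    """Rejects writes that carry a token lower than the highest one seen."""

    def __init__(self) -> None:
        self.max_token = 0
        self.data = {}

    def write(self, key: str, value: str, token: int) -> bool:
        if token < self.max_token:
            return False  # stale leader: reject
        self.max_token = token
        self.data[key] = value
        return True


locks = LockService()
storage = StorageService()

old = locks.acquire()  # client A becomes leader
new = locks.acquire()  # A's lease is assumed expired; client B becomes leader

results = [
    storage.write("file", "from B", new),
    storage.write("file", "from A (stale)", old),  # A wakes up after a pause
    storage.write("file", "from B again", new),    # same token is still valid
]
```

The check lives in the storage service on purpose: the paused client cannot be trusted to notice that its lease has expired.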

Perform Extract-Transform-Load (ETL)

To move data from operational (OLTP) databases into a data warehouse for analysis, transforming it into a query-friendly format.

When to use: To populate and maintain a data warehouse for analytical querying.

Step 1Extract data from one or more source OLTP databases.
Entry: The ETL job is scheduled to run.
Exit: Source data is copied to a staging area.
- Whether to perform a full or incremental extraction.
In: OLTP databases · Out: Extracted raw data
ch04
Step 2Transform the extracted data into a schema suitable for analysis.
Entry: Raw data is in the staging area.
Exit: Data is cleaned and restructured for analysis.
In: Extracted raw data, Transformation criteria · Out: Transformed data
ch04
Step 3Load the transformed data into the target data warehouse.
Entry: Data transformation is complete.
Exit: The data warehouse is updated with the new data.
In: Transformed data · Out: Updated data warehouse
ch04
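The three steps can be sketched with two in-memory SQLite databases standing in for the OLTP source and the warehouse. The table names, columns, and the daily-total transformation are illustrative assumptions, and the extraction here is a full one rather than incremental.

```python
import sqlite3

# Source OLTP database (hypothetical schema and rows).
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, "
    "amount_cents INTEGER, created_at TEXT)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", 1250, "2024-01-01T09:00:00"),
        (2, "bob", 800, "2024-01-01T10:30:00"),
        (3, "alice", 450, "2024-01-01T18:15:00"),
    ],
)

# Target warehouse with an analysis-friendly fact table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_daily_sales (day TEXT, customer TEXT, total_cents INTEGER)"
)


def extract(conn):
    """Full extraction: copy the raw rows out of the source."""
    return conn.execute(
        "SELECT customer, amount_cents, created_at FROM orders"
    ).fetchall()


def transform(rows):
    """Aggregate individual orders into one row per (day, customer)."""
    totals = {}
    for customer, cents, created_at in rows:
        key = (created_at[:10], customer)
        totals[key] = totals.get(key, 0) + cents
    return [(day, cust, total) for (day, cust), total in sorted(totals.items())]


def load(conn, rows):
    conn.executemany("INSERT INTO fact_daily_sales VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(oltp)))
loaded = warehouse.execute(
    "SELECT day, customer, total_cents FROM fact_daily_sales ORDER BY customer"
).fetchall()
```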

Execute Distributed Batch Processing with MapReduce

To process and analyze large datasets in a distributed, fault-tolerant manner using the MapReduce framework, including complex workflows with joins and chained jobs.

When to use: When a batch processing task is too large or too long to run on a single machine.

Step 1Define and implement custom mapper and reducer functions for the specific data processing task.
Entry: A batch processing task is defined.
Exit: Mapper and reducer code is written.
In: Processing logic · Out: Application code (e.g., JAR file)
ch12 · ch13
Step 2The framework reads input files from a distributed filesystem (like HDFS), breaking them into records and passing each to a map task.
Entry: Input data is available in HDFS.
Exit: Map tasks are launched and start processing data.
In: Input data files
ch12 · ch13
Step 3The mapper function processes each record and emits zero or more key-value pairs.
Entry: A map task is running.
Exit: Mapper output is generated.
In: Input record · Out: Intermediate key-value pairs
ch12 · ch13
Step 4The framework partitions the mapper output by key (e.g., using a hash function) and sorts the key-value pairs within each partition. This is the 'shuffle and sort' phase.
Entry: Mappers have produced output.
Exit: Sorted and partitioned data is ready for reducers.
In: Intermediate key-value pairs · Out: Grouped and sorted key-value pairs
ch12 · ch13
Step 5Reducers fetch their assigned partitions from the mappers.
Entry: Shuffle and sort phase is complete.
Exit: Reducers have their input data.
In: Grouped and sorted key-value pairs
ch13
Step 6The reducer function is called once for each unique key, with an iterator over all its associated values, and writes the final output to the distributed filesystem.
Entry: A reducer task is running.
Exit: Final output is written to HDFS.
In: Key and list of values · Out: Final output records
ch12 · ch13
Step 7For complex workflows, chain multiple MapReduce jobs by configuring the output directory of one job as the input for the next.
Entry: A multi-step processing pipeline is required.
Exit: A chain of dependent MapReduce jobs is configured.
ch13
Step 8To perform joins, design mappers for different datasets to emit the same join key, so the shuffle and sort phase groups all related records together for the reducer to process.
Entry: Data from multiple datasets needs to be joined.
Exit: The reducer receives all records for a given join key together.
- Choosing the appropriate joining method (e.g., sort-merge vs. broadcast hash join).
In: Multiple input datasets · Out: Joined records
ch13
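To make the data flow concrete, here is a single-process sketch of the map, shuffle-and-sort, and reduce phases, followed by a word count and a reduce-side join. It only imitates the programming model: `run_mapreduce` is a hypothetical helper, and there is no distributed filesystem, task scheduling, or fault tolerance.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer, num_partitions=2):
    # Map phase: each record yields zero or more (key, value) pairs,
    # which are assigned to a partition by hashing the key.
    partitions = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            partitions[hash(key) % num_partitions].append((key, value))

    # Shuffle and sort, then reduce: within each partition, sort by key and
    # call the reducer once per key with an iterator over its values.
    output = []
    for pid in sorted(partitions):
        pairs = sorted(partitions[pid], key=itemgetter(0))
        for key, group in groupby(pairs, key=itemgetter(0)):
            output.extend(reducer(key, (v for _, v in group)))
    return output


def wc_map(line):
    for word in line.split():
        yield (word, 1)


def wc_reduce(word, counts):
    yield (word, sum(counts))


word_counts = dict(run_mapreduce(["a b a", "b a"], wc_map, wc_reduce))


# Reduce-side join: both datasets emit the same join key (the user id).
def join_map(record):
    kind, user_id, payload = record
    yield (user_id, (kind, payload))


def join_reduce(user_id, values):
    values = list(values)
    names = [p for kind, p in values if kind == "user"]
    for kind, p in values:
        if kind == "event":
            for name in names:
                yield (user_id, name, p)


records = [
    ("user", 1, "alice"),
    ("event", 1, "/home"),
    ("event", 1, "/cart"),
    ("user", 2, "bob"),
]
joined = sorted(run_mapreduce(records, join_map, join_reduce))
```

Because every record with the same key lands in the same partition and is sorted next to its peers, the join reducer sees a user's profile and all of that user's events together.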

Analyze Logs with Unix Tools

To perform simple, effective batch processing and analysis of log files by composing standard Unix command-line tools.

When to use: For quick, ad-hoc analysis of web server logs, application logs, or other text-based data files.

Step 1Read the log file using `cat` and pipe its content to the next command.
Entry: A log file is available.
Exit: The log file content is streamed to standard output.
In: Log file · Out: Stream of log lines
ch12
Step 2Use `awk` to extract the relevant field(s) from each line.
Entry: Log lines are being streamed.
Exit: A stream of the extracted fields is produced.
- Which field(s) to extract for the analysis.
In: Stream of log lines · Out: Stream of extracted fields
ch12
Step 3Use `sort` to group identical lines together.
Entry: A stream of fields is available.
Exit: A sorted stream of fields is produced.
In: Stream of extracted fields · Out: Sorted stream of fields
ch12
Step 4Use `uniq -c` to count the occurrences of each unique line.
Entry: The stream is sorted.
Exit: A stream of counts and unique values is produced.
In: Sorted stream of fields · Out: Stream of counts and values
ch12
Step 5Use `sort -r -n` to sort the counted list in reverse numerical order, bringing the most frequent items to the top.
Entry: A stream of counts and values is available.
Exit: A stream sorted by frequency is produced.
In: Stream of counts and values · Out: Frequency-sorted stream
ch12
Step 6Use `head` to display the top N results.
Entry: The stream is sorted by frequency.
Exit: The final result is displayed.
In: Frequency-sorted stream · Out: Top N results
ch12
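For comparison, the same analysis can be written as a short program that counts in an in-memory hash table instead of sorting. This is a sketch under assumptions: the `top_urls` helper is hypothetical, and the default field position (the seventh whitespace-separated field, as `awk '{print $7}'` would select) assumes a common web server access-log layout in which that field is the requested URL.

```python
from collections import Counter


def top_urls(log_lines, n=5, field=6):
    """Count occurrences of one whitespace-separated field and return the top n.

    field=6 is zero-based, so it corresponds to awk's $7.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) > field:  # skip malformed or short lines
            counts[parts[field]] += 1
    return counts.most_common(n)


sample_log = [
    '1.2.3.4 - - [27/Feb/2015:17:55:11 +0000] "GET /a.css HTTP/1.1" 200 3377',
    '1.2.3.5 - - [27/Feb/2015:17:55:12 +0000] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.4 - - [27/Feb/2015:17:55:13 +0000] "GET /a.css HTTP/1.1" 200 3377',
]
top = top_urls(sample_log, n=2)
```

The hash-table version keeps one counter per distinct value in memory, whereas the `sort | uniq -c` pipeline relies on sorting; which is preferable depends on how many distinct values the input contains.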

Implement Stream Processing for Real-Time Data

To process an unbounded stream of events in real-time, enabling continuous data handling for applications like monitoring, alerting, and real-time analytics.

When to use: When building applications for fraud detection, real-time monitoring, or personalized user experiences.

Step 1Identify the sources of events (e.g., user clicks, sensor readings, database changes) that will form the input stream.
Entry: A real-time processing requirement is identified.
Exit: Event sources are defined.
Out: List of event sources
ch14p01
Step 2Use a messaging system (e.g., Apache Kafka, AWS Kinesis) to ingest events from producers and make them available to consumers.
Entry: Event sources are defined.
Exit: A messaging system is set up to ingest events.
In: Events · Out: Event stream
ch14p01
Step 3Implement one or more stream processors (consumers) that subscribe to the event stream.
Entry: An event stream is available.
Exit: Stream processors are deployed and subscribed.
In: Event stream
ch14p01
Step 4Define the processing logic for the stream processor, which can include filtering, transforming, or aggregating events.
Entry: Stream processors are subscribed.
Exit: Processing logic is implemented.
In: Events · Out: Processed events or derived streams
ch14p01
Step 5Send the output of the stream processor to a sink, which could be another event stream, a database, a cache, or a user-facing notification system.
Entry: Events are processed.
Exit: The output is delivered to a sink.
- Where to send the processed results.
In: Processed events
ch14p01
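The pipeline shape described above (source, processing logic, sink) can be sketched with Python generators. This is only an in-memory illustration: `event_source` stands in for a subscription to a messaging system, the event fields and the alert threshold of 100 are made up for the example, and no real broker client is used.

```python
from collections import defaultdict


def event_source():
    # Stands in for a consumer reading from a message broker.
    yield from [
        {"user": "u1", "type": "click", "amount": 0},
        {"user": "u2", "type": "purchase", "amount": 40},
        {"user": "u1", "type": "purchase", "amount": 25},
        {"user": "u2", "type": "purchase", "amount": 70},
    ]


def only_purchases(events):
    # Filtering step.
    for event in events:
        if event["type"] == "purchase":
            yield event


def running_totals(events):
    # Aggregating step: emits an updated total after every event.
    totals = defaultdict(int)
    for event in events:
        totals[event["user"]] += event["amount"]
        yield {"user": event["user"], "total": totals[event["user"]]}


alerts = []  # the sink; could equally be another stream, a database, or a cache
for update in running_totals(only_purchases(event_source())):
    if update["total"] > 100:
        alerts.append(update)
```

Each stage consumes events one at a time as they arrive, so the same structure works whether the source is finite, as here, or unbounded.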

Ensure Idempotent Stream Operations

To prevent unintended side effects (like duplicate processing) when a stream processing operation is retried after a failure, by ensuring that performing the operation multiple times has the same effect as performing it once.

When to use: When designing a fault-tolerant stream processor that interacts with external systems (e.g., databases, APIs).

Step 1Identify the operations within the stream processor that have external side effects (e.g., writing to a database, sending a notification).
Entry: A stream processing job is being designed.
Exit: A list of side-effecting operations is created.
Out: List of side-effecting operations
ch14p02
Step 2For each operation, determine if it is naturally idempotent (e.g., setting a key to a fixed value).
Entry: An operation is identified.
Exit: The idempotency of the operation is determined.
- Determining whether an operation is idempotent or requires redesign.
In: Operation
ch14p02
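Step 3If an operation is not idempotent (e.g., incrementing a counter), redesign it to be so, for example by attaching a unique idempotency key (such as the message offset) to each request so that duplicates can be detected.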
Entry: An operation is identified as non-idempotent.
Exit: The operation is redesigned to be idempotent.
In: Non-idempotent operation · Out: Idempotent operation
ch14p02
Step 4Implement the redesigned operation, ensuring the external system can handle the idempotency key to prevent duplicate effects.
Entry: The operation is redesigned.
Exit: The idempotent operation is implemented and can be safely retried.
ch14p02
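A minimal sketch of turning a non-idempotent increment into an idempotent one. The `Account` class is hypothetical, and using the message offset as the idempotency key is one possible choice; a real external system would also need to make the key check and the update atomic.

```python
class Account:
    """External system that remembers which idempotency keys it has applied."""

    def __init__(self):
        self.balance = 0
        self.applied = set()

    def credit(self, amount, idempotency_key):
        if idempotency_key in self.applied:
            return False  # duplicate delivery: ignore
        self.balance += amount
        self.applied.add(idempotency_key)
        return True


acct = Account()
msg = {"offset": 42, "amount": 10}

first = acct.credit(msg["amount"], idempotency_key=msg["offset"])
retry = acct.credit(msg["amount"], idempotency_key=msg["offset"])  # after a failure
```

Without the key, the retry would credit the account twice; with it, the second call has no effect.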

Manage State and Fault Tolerance in Stream Processing

To maintain operational continuity and exactly-once processing guarantees in stateful stream applications during failures.

When to use: When building any stateful stream processing application that must be reliable.

Step 1Maintain the operator's state locally in memory or on local disk for low-latency access during processing.
Entry: A stateful stream operator is designed.
Exit: State is managed locally by the operator.
- Choosing between local versus remote state management based on performance characteristics.
Out: Local operator state
ch14p02
Step 2Periodically write consistent snapshots of the local state to durable, replicated storage (e.g., HDFS, S3). This is known as checkpointing.
Entry: The operator is processing events and updating its state.
Exit: A consistent checkpoint of the state and input offset is written to durable storage.
In: Local operator state · Out: State checkpoint
ch14p02
Step 3As an alternative or complement to checkpointing, use micro-batching to process events in small, atomic groups, which can simplify state management and recovery.
Entry: A stream processing framework is chosen.
Exit: The framework is configured to use micro-batching.
ch14p02
Step 4Upon failure, a new instance of the stream processing operator is started.
Entry: An operator instance fails.
Exit: A new operator instance is launched.
ch14p02
Step 5The new instance recovers its state by loading the most recent successful checkpoint from durable storage.
Entry: A new operator instance is launched.
Exit: The operator's state is restored from the last checkpoint.
In: State checkpoint · Out: Restored local state
ch14p02
Step 6The operator resumes consuming the input stream from the message offset that was saved as part of the checkpoint.
Entry: The operator's state is restored.
Exit: The operator resumes processing from the correct position in the input stream.
In: Input stream offset
ch14p02
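The checkpoint-and-restore cycle can be sketched as follows. This is a simplified illustration: `CountingOperator` is a hypothetical operator that counts keys, a plain dict stands in for durable replicated storage, a list stands in for the replayable input stream, and the crash is simulated with an exception.

```python
import copy


class CountingOperator:
    def __init__(self, store):
        self.store = store  # dict standing in for durable storage
        self.state = {}     # local operator state
        self.offset = 0     # next input position to process

    def checkpoint(self):
        # State and input offset are saved together so they stay consistent.
        self.store["checkpoint"] = {
            "state": copy.deepcopy(self.state),
            "offset": self.offset,
        }

    def restore(self):
        saved = self.store.get("checkpoint")
        if saved:
            self.state = copy.deepcopy(saved["state"])
            self.offset = saved["offset"]

    def run(self, log, checkpoint_every=2, crash_at=None):
        while self.offset < len(log):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated crash")
            key = log[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.checkpoint()


log = ["a", "b", "a", "c", "a"]
store = {}

op = CountingOperator(store)
try:
    op.run(log, crash_at=3)  # fails after processing three events
except RuntimeError:
    pass

replacement = CountingOperator(store)
replacement.restore()        # state and offset from the last checkpoint
replacement.run(log)
final_state = replacement.state
```

The event at offset 2 is processed twice overall, but because the restored state predates it, the final counts are the same as if no failure had happened. That guarantee covers only the operator's own state; side effects on external systems still need idempotence or a similar mechanism.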

Join Data Streams

To combine and correlate data from multiple real-time data streams or between a stream and a static dataset (table).

When to use: When building applications that need to correlate user activity events with user profile data, or match buy and sell orders in a trading system.

Step 1Identify the input streams or tables to be joined and the join key.
Entry: A requirement to combine data from multiple sources is identified.
Exit: Input sources and join key are defined.
Out: Join definition
ch14p02
Step 2Choose the appropriate type of join based on the nature of the inputs.
Entry: Join definition is available.
Exit: A join type is selected.
- Determining which joining method to use based on the data sources.
In: Join definition · Out: Selected join type
ch14p02
Step 3For stream-stream joins, define a time window to constrain the join.
Entry: A stream-stream join is selected.
Exit: A time window is defined.
Out: Window definition
ch14p02
Step 4Implement the join operation using a stream processing framework.
Entry: Join type and window are defined.
Exit: The join logic is implemented.
In: Input streams · Out: Joined output stream
ch14p02
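A stream-stream join with a time window can be sketched as below. The function name `window_join`, the "left"/"right" stream labels, and the search/click example events are illustrative assumptions, and the sketch assumes events arrive in timestamp order, which real streams do not guarantee.

```python
def window_join(events, window_seconds):
    """Join two streams on key, matching events at most window_seconds apart.

    events: iterable of (timestamp, side, key, payload), side is "left" or "right".
    Yields (key, left_payload, right_payload).
    """
    recent = {"left": [], "right": []}
    for ts, side, key, payload in events:
        other = "right" if side == "left" else "left"
        # Expire buffered events from the other stream that left the window.
        recent[other] = [e for e in recent[other] if ts - e[0] <= window_seconds]
        for _, other_key, other_payload in recent[other]:
            if other_key == key:
                if side == "left":
                    yield (key, payload, other_payload)
                else:
                    yield (key, other_payload, payload)
        recent[side].append((ts, key, payload))


events = [
    (0, "left", "s1", "search:shoes"),
    (5, "right", "s1", "click:item9"),
    (100, "right", "s1", "click:item3"),  # outside the 60-second window
    (101, "left", "s2", "search:hats"),
]
matches = list(window_join(events, window_seconds=60))
```

The window bounds how much state the operator has to keep: without it, every event would have to be buffered indefinitely in case a matching event arrived later.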

Integrate Systems with Change Data Capture (CDC)

To capture all data changes from a system of record (e.g., a transactional database) and stream them to derived data systems (e.g., search indexes, caches, data warehouses) to maintain consistency.

When to use: When building derived data systems that must stay in sync with a primary database without adding significant load to it.

Step 1Configure the source database (system of record) to enable detailed transaction logging.
Entry: A source database is identified as the system of record.
Exit: The source database is configured for change logging.
In: Source database · Out: Transaction log
ch15 · ch16p03
Step 2Deploy a CDC tool or process (e.g., Debezium, Maxwell) that reads the database's transaction log.
Entry: Change logging is enabled.
Exit: A CDC process is running and tailing the transaction log.
- Choosing between trigger-based versus log-based change capture methods.
In: Transaction log
ch15 · ch16p03
Step 3The CDC tool converts the low-level log entries into a structured stream of change events (e.g., in JSON format).
Entry: The CDC process is reading the log.
Exit: A stream of structured change events is produced.
In: Log entries · Out: Change event stream
ch15 · ch16p03
Step 4Publish the change events to a durable message broker or event stream platform like Apache Kafka.
Entry: Change events are being produced.
Exit: Change events are published to a message broker.
In: Change event stream
ch15 · ch16p03
Step 5Downstream systems (e.g., search index, cache, data warehouse) subscribe to the event stream and apply the changes in the same order they occurred in the source system.
Entry: A change event stream is available on a message broker.
Exit: Derived data systems are updated based on the change events.
In: Change event stream · Out: Updated derived data systems
ch15 · ch16p03
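The consuming side of this flow can be sketched as follows. The JSON event shape (`op`, `key`, `row`) is a hypothetical format chosen for the example, not the output of Debezium, Maxwell, or any other specific tool, and a Python list stands in for the message broker.

```python
import json

# Change events in the order they were committed in the source database.
change_log = [
    json.dumps({"op": "insert", "key": 1, "row": {"name": "Ada", "city": "London"}}),
    json.dumps({"op": "update", "key": 1, "row": {"name": "Ada", "city": "Paris"}}),
    json.dumps({"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Paris"}}),
    json.dumps({"op": "delete", "key": 2, "row": None}),
]


def apply_to_cache(cache, event):
    """Derived system 1: a key-value cache of the latest row per key."""
    if event["op"] == "delete":
        cache.pop(event["key"], None)
    else:
        cache[event["key"]] = event["row"]


class CityIndex:
    """Derived system 2: a secondary index from city to row keys."""

    def __init__(self):
        self.by_city = {}
        self._city_of = {}

    def apply(self, event):
        key = event["key"]
        old_city = self._city_of.pop(key, None)
        if old_city is not None:
            self.by_city[old_city].discard(key)
        if event["op"] != "delete":
            city = event["row"]["city"]
            self._city_of[key] = city
            self.by_city.setdefault(city, set()).add(key)


cache = {}
index = CityIndex()
for raw in change_log:  # both consumers apply the events in log order
    event = json.loads(raw)
    apply_to_cache(cache, event)
    index.apply(event)
```

Applying the events in log order matters: swapping the update and the first insert, for instance, would leave the cache with the stale London row.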

Manage Derived Data System Updates

To ensure that derived data systems (e.g., search indexes, materialized views) are updated reliably and consistently from a source of truth, typically an event log.

When to use: When operating systems like search indexes, caches, or analytics dashboards that are populated from a primary data source.

Step 1Define an event log (e.g., from CDC or application events) as the single source of truth for all changes.
Entry: A derived data system needs to be maintained.
Exit: An event log is established as the source of truth.
Out: Event log
ch15
Step 2Implement a deterministic process that reads from the event log and applies transformations to generate the derived data.
Entry: An event log is available.
Exit: A deterministic update process is implemented.
In: Event log, Transformation rules · Out: Derived data
ch15
Step 3Write the transformed data to the derived data system (e.g., load into a search index).
Entry: Data has been transformed.
Exit: The derived data system is updated.
In: Derived data
ch15
Step 4Make the update operations idempotent to allow for safe retries in case of failure.
Entry: The update process is designed.
Exit: Update operations are idempotent.
ch15
Step 5If the derived data system becomes corrupted or needs to be rebuilt, reprocess the entire event log from the beginning to recreate it from scratch.
Entry: The derived data system needs to be rebuilt.
Exit: The derived data system is fully recreated from the event log.
In: Event log · Out: Rebuilt derived data system
ch15
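A small sketch of why determinism and idempotence make rebuilding safe. The follow/unfollow events and the `apply_event` and `rebuild` helpers are hypothetical; the derived view maps each user to the set of their followers.

```python
def apply_event(view, event):
    """Deterministic, idempotent update: set operations tolerate repeats."""
    kind, follower, followee = event
    followers = view.setdefault(followee, set())
    if kind == "follow":
        followers.add(follower)
    elif kind == "unfollow":
        followers.discard(follower)


def rebuild(event_log):
    """Recreate the derived view from scratch by replaying the whole log."""
    view = {}
    for event in event_log:
        apply_event(view, event)
    return view


event_log = [
    ("follow", "ann", "bob"),
    ("follow", "cat", "bob"),
    ("unfollow", "ann", "bob"),
]

# A live view in which every update was retried once after a simulated failure.
live_view = {}
for event in event_log:
    apply_event(live_view, event)
    apply_event(live_view, event)

rebuilt_view = rebuild(event_log)
```

Had the view stored a follower count updated with `+= 1`, the retries would have inflated it; storing the set keeps retries harmless and lets a full replay reproduce the same result.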

Perform Gradual Schema Migration

To update a data schema in a live production system without causing downtime or disrupting user access.

When to use: When a database schema needs to be changed to support new application features or requirements.

Step 1Instead of modifying the existing schema in place, create a new derived view or table with the desired new schema.
Entry: A schema change is required.
Exit: Both old and new schema versions are available.
In: New schema requirements · Out: New schema view/table
ch16p01
Step 2Modify the application to be able to read from both the old and new schemas.
Entry: Both schema versions are available.
Exit: Application code is backward-compatible.
Step 3Gradually shift a small number of users or a percentage of traffic to start reading from the new schema.
Entry: Application can read from both schemas.
Exit: A portion of read traffic is using the new schema.
ch16p01
Step 4Monitor performance and error rates for traffic using the new schema. If issues arise, roll back by directing all traffic back to the old schema.
Entry: Traffic is being served from the new schema.
Exit: The new schema is validated or rolled back.
- If issues arise, revert to the old view without losing functionality.
In: Performance metrics, Error logs
ch16p01
Step 5Once confident in the new schema's stability, gradually increase the proportion of users accessing it until all read traffic is on the new schema.
Entry: The new schema is stable under partial load.
Exit: All read traffic is using the new schema.
ch16p01
Step 6Perform a similar gradual migration for write traffic, and once all traffic is on the new schema, decommission and drop the old one.
Entry: All read traffic is on the new schema.
Exit: The old schema is removed.
ch16p01
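The gradual shift of read traffic can be sketched with deterministic, percentage-based routing. Everything named here is an assumption for illustration: the `uses_new_schema` helper, the CRC32 bucketing, and the example schemas (a single `full_name` column in the old table versus `first_name` and `last_name` in the new view).

```python
import zlib

old_table = {"u1": {"full_name": "Ada Lovelace"}}
new_view = {"u1": {"first_name": "Ada", "last_name": "Lovelace"}}


def uses_new_schema(user_id: str, rollout_percent: int) -> bool:
    """Assign each user to a stable bucket 0-99; low buckets migrate first."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < rollout_percent


def read_display_name(user_id: str, rollout_percent: int) -> str:
    """The application can read from either schema and returns the same shape."""
    if uses_new_schema(user_id, rollout_percent):
        row = new_view[user_id]
        return f"{row['first_name']} {row['last_name']}"
    return old_table[user_id]["full_name"]


name_before = read_display_name("u1", rollout_percent=0)    # everyone on old
name_after = read_display_name("u1", rollout_percent=100)   # everyone on new
```

Because the bucket is derived from the user id, a user who has moved to the new schema stays there as the percentage rises, and setting the percentage back to 0 acts as the rollback.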

Apply Lambda Architecture

To combine batch and stream processing to serve queries on large datasets, providing a balance of low latency for recent data and accuracy for historical data.

When to use: When building a system that needs to answer queries based on both up-to-the-minute data and the full historical dataset.

Step 1Append all incoming data as immutable events to a master dataset.
Entry: A stream of incoming data is available.
Exit: Data is stored in a durable, append-only log.
In: Incoming data streams · Out: Master dataset (event log)
ch16p01
Step 2Implement a batch layer that periodically reprocesses the entire master dataset to create accurate, pre-computed batch views.
Entry: A master dataset exists.
Exit: Batch views are created and stored.
In: Master dataset · Out: Batch views
ch16p01
Step 3Implement a speed (stream) layer that consumes the incoming data in real-time to produce incremental, approximate updates.
Entry: A stream of incoming data is available.
Exit: Real-time views are created and stored.
In: Incoming data streams · Out: Real-time views
ch16p01
Step 4Implement a serving layer that receives queries from users.
Entry: Batch and real-time views are available.
Exit: The serving layer is ready to accept queries.
ch16p01
Step 5To answer a query, the serving layer merges results from both the batch view and the real-time view.
Entry: A query is received.
Exit: A merged result is returned to the user.
In: User query, Batch views, Real-time views · Out: Query result
ch16p01
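The division of labor between the layers can be sketched with page-view counts. This is a toy model under stated assumptions: the function names are hypothetical, the master dataset is a Python list, and the real-time view is recomputed from a slice for brevity, whereas an actual speed layer would update it incrementally.

```python
from collections import Counter

master_dataset = []  # append-only log of immutable events


def ingest(event):
    master_dataset.append(event)


def batch_view(events, up_to):
    """Batch layer: recompute counts over everything up to the last batch run."""
    return Counter(e["page"] for e in events[:up_to])


def realtime_view(events, since):
    """Speed layer: counts for events the batch layer has not covered yet."""
    return Counter(e["page"] for e in events[since:])


def query(page, batch, realtime):
    """Serving layer: merge the two views (missing keys count as 0)."""
    return batch[page] + realtime[page]


for page in ["/home", "/cart", "/home", "/home", "/cart"]:
    ingest({"page": page})

batch_cutoff = 3  # the last batch run saw the first three events
batch = batch_view(master_dataset, up_to=batch_cutoff)
realtime = realtime_view(master_dataset, since=batch_cutoff)

home_views = query("/home", batch, realtime)
cart_views = query("/cart", batch, realtime)
```

Once the next batch run completes, the cutoff advances and the real-time view for the covered range can be discarded.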

Unbundle Databases into Specialized Components

To build more scalable, robust, and performant data systems by composing specialized, best-of-breed tools for different functions (e.g., storage, indexing, stream processing) rather than using a single monolithic database.

When to use: When the limitations of a single, general-purpose database become a bottleneck for performance, scalability, or functionality.

Step 1Identify the distinct functions your application requires from its data infrastructure (e.g., durable storage, secondary indexing, caching, full-text search, stream processing, batch analytics).
Entry: A new data system is being designed or an existing one is being re-architected.
Exit: A list of required data management functions is created.
In: Application requirements · Out: List of data functions
ch16p01
Step 2For each function, select a specialized tool or system that is optimized for that specific task.
Entry: Data functions are identified.
Exit: A set of specialized tools is selected.
- Choosing the best tool for each specific concern.
In: List of data functions · Out: Technology stack
ch16p01
Step 3Use an event log (e.g., from Change Data Capture) as the central point of integration to keep all the specialized components synchronized.
Entry: A technology stack is selected.
Exit: An integration method is designed.
ch16p01
Step 4Develop the application logic to query the appropriate specialized system for each type of request.
Entry: The integrated system is deployed.
Exit: The application is able to leverage the strengths of each specialized component.
ch16p01

A candidate measure

Designing Data-Intensive Applications — derived measurement candidates

Data Model and Encoding Choice

Proportion of data in each model type; Encoding format compatibility coverage; Number of schema migrations performed without downtime