What is PeopleAnalyst?

PeopleAnalyst is the front door for people-analytics research: 205+ works indexed and profiled, 40+ citation-grade findings extracted, and peer-reviewed behavioral science translated from academic to actionable — the missing manual for the people analytics you always meant to do.

What is people analytics?

People analytics is not a dashboard. It is behavioral science and statistical inference applied to workforce decisions — a discipline with its own methodology, spanning measurement, organizational design, talent, leadership, and analytics craft.

Why does AI in HR need measurement science?

AI is being deployed in high-stakes people decisions — hiring, performance, attrition — without the measurement science to evaluate whether it works or whom it harms. Construct validity, effect sizes, and criterion validity are the vocabulary for asking an AI vendor the right questions.

How is the research made accessible?

The evidence is indexed and searchable: 205+ works, 40+ citation-grade insight cards, and 8 research arcs, so the right finding reaches the right decision at the right time.

What separates good people measurement from assertion?

Good measurement has a method: construct validity, reliability, and effect-size interpretation are not optional — they are what separates evidence from assertion.

Big Data: A Very Short Introduction (Very Short Introductions)

Very Short Introductions

In a sentence

A concise introduction to what big data is, how it is collected, stored, and analysed, and how it is transforming medicine, business, security, and society.

Big Data: A Very Short Introduction demystifies one of the defining technological forces of our age, charting how data evolved from notched bones and census tallies to the exabyte-scale streams of the digital universe. Dawn Holmes explains, in plain language and with clear diagrams, what distinguishes big data from traditional 'small data'—volume, variety, velocity, and veracity—and how new storage architectures (Hadoop, NoSQL, the Cloud) and analytic techniques (clustering, classification, MapReduce, Bloom filters, PageRank, recommender systems) extract useful information from massive, messy datasets. Through vivid case studies—Google Flu Trends, the Ebola and Nepal disaster responses, Amazon, Netflix, the Snowden leaks, WikiLeaks, and the rise of smart homes and cities—the book shows both the immense promise and the real perils of a data-driven world, ending with a call to use big data's power responsibly.

The four lenses

Science
Statistics
Systems
Strategy

Tags

ai · applied-statistics

The model

A causal framework describing how data-generating conditions and design choices (storage architecture, analytic technique, data quality, security measures) drive intermediate processing states (information extraction, predictive accuracy) that produce outcomes such as decision quality, organizational value, and privacy/security risk.

Data Volume, Velocity, and Variety · contextual condition

The scale, speed of generation, and heterogeneity (structured, semi-structured, unstructured) of data produced across search engines, sensors, social media, and other digital sources that together characterize big data.

Data Veracity (Quality) · contextual condition

The accuracy, reliability, and trustworthiness of collected data, recognizing that digital-age data is often imprecise, uncertain, biased, or simply untrue and requires pre-processing for consistency.

Distributed Storage Architecture · design lever

The design choice of scalable storage and management systems such as Hadoop distributed file systems, NoSQL databases, and Cloud infrastructure that enable horizontal scalability and fault tolerance for big data.

Analytic Technique Application · design lever

The application of data mining and machine learning methods such as clustering, classification, MapReduce, Bloom filters, PageRank, and recommender algorithms to discover patterns and extract knowledge from big data.

Useful Information Extraction · behavioral pattern

The intermediate state in which raw, often unstructured data is transformed into meaningful, valuable information and patterns that can inform understanding, prediction, and action.

Predictive Model Accuracy · behavioral pattern

The degree to which models built from big data correctly forecast outcomes, sensitive to model construction issues such as over-fitting, spurious correlation, and failure to update for changing conditions.

Data Security and Privacy Measures · design lever

The protective controls such as encryption, firewalls, access authentication, and anonymization deployed to safeguard data from theft, tampering, hacking, and unauthorized disclosure.

Decision Quality and Organizational Value · outcome metric

The downstream benefits of big data including improved business decisions, targeted marketing, better patient care, cost reduction, scientific discovery, and societal efficiency gains in smart homes and cities.

Privacy and Security Risk · outcome metric

The adverse outcome of exposure to data theft, breaches, surveillance, identity theft, and loss of personal privacy arising from large-scale collection and storage of personal data.

How they connect

data volume velocity variety → influences information extraction
storage architecture → predicts information extraction
analytic technique → predicts information extraction
information extraction → influences predictive accuracy
data veracity → moderates predictive accuracy
data volume velocity variety → influences predictive accuracy
predictive accuracy → predicts decision and value outcomes
information extraction → predicts decision and value outcomes
data volume velocity variety → predicts privacy security risk
security measures → moderates privacy security risk
storage architecture → influences privacy security risk

The process

This book provides an introduction to the world of big data, explaining its core characteristics, storage methods, and analytical techniques. The overall playbook is not a single, linear workflow but rather a toolkit of fundamental algorithms and processes for extracting value from massive datasets. The practitioner's journey begins with foundational methods for handling and processing data at scale, such as MapReduce for parallel computation and compression algorithms for efficient storage. Once the data is manageable, the playbook offers specific analytical techniques to derive insights. These include probabilistic methods like Bloom filters for rapid set membership testing, machine learning approaches like decision trees for classification tasks such as fraud detection, and graph analysis algorithms like PageRank for ranking information. The playbook extends to building recommender systems using collaborative filtering, a key process for personalizing user experiences in e-commerce and media. Collectively, these processes enable a data scientist to transform vast, raw, and often unstructured data into structured information, patterns, and predictions. The book contextualizes these methods within real-world applications in business, medicine, and security, illustrating their power while also highlighting the significant challenges of veracity, privacy, and security in the big data era.

Processing Big Data with MapReduce

To process and analyze very large datasets in parallel by distributing the computational workload across many computers.

When to use: When a dataset is too large to be processed on a single machine and the task involves aggregation or transformation that can be performed on subsets of the data independently.

Step 1: Execute the Map function to process input data.
Entry: Input data is available in a distributed file system (like HDFS).
Exit: All input data has been processed and converted into intermediate key-value pairs.
In: Large data files · Out: A set of intermediate key-value pairs
Step 2: Perform the Shuffle and Sort step.
Entry: The Map step is complete.
Exit: All intermediate key-value pairs are grouped by their unique key.
In: Intermediate key-value pairs from the Map step · Out: Grouped and sorted key-value pairs
Step 3: Execute the Reduce function to aggregate results.
Entry: The Shuffle and Sort step is complete.
Exit: Final aggregated results are generated and stored.
In: Grouped and sorted key-value pairs · Out: Final output file with aggregated key-value pairs
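The three steps above can be sketched in a single process. This is a minimal illustration, not a distributed implementation: the word-count task, the function names (map_phase, shuffle_and_sort, reduce_phase, word_count), and the sample text are all assumptions chosen for the example, and a real cluster would run the Map calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit one (word, 1) pair for every word in a chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs):
    """Shuffle and Sort: group intermediate values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, by summing the counts)."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for a block of a large file. Here the Map calls
    # run one after another; on a cluster they would run in parallel.
    intermediate = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle_and_sort(intermediate))


if __name__ == "__main__":
    print(word_count(["big data is big", "data about data"]))
```

Because each chunk is mapped independently, the same map_phase could in principle be handed to many workers without changing the Reduce step.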

Compressing Data using Huffman Coding

To perform lossless data compression on text files, reducing storage requirements by assigning shorter codes to more frequent characters.

When to use: When storage space needs to be minimized for text files without losing any of the original information.

Step 1: Count the frequency of each character in the input string.
Entry: An uncompressed string of characters is available.
Exit: A frequency count for every unique character is complete.
In: Uncompressed character string · Out: Frequency table of characters
Step 2: Build a binary tree from the character frequencies.
Entry: Frequency table is available.
Exit: A single binary tree containing all characters as leaves is constructed.
In: Frequency table of characters · Out: Completed binary tree
Step 3: Assign binary codes to each branch of the tree.
Entry: The binary tree is complete.
Exit: All branches are labeled.
In: Completed binary tree · Out: Labeled binary tree
Step 4: Generate the Huffman code for each character.
Entry: The binary tree is labeled.
Exit: A codebook mapping each character to its binary code is created.
In: Labeled binary tree · Out: Huffman code for each character
Step 5: Encode the original string.
Entry: Huffman codes for all characters are generated.
Exit: The original string is fully encoded into a compressed binary string.
In: Original character string, Huffman codes · Out: Compressed binary string
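A compact sketch of these five steps follows. It assumes the standard construction in which the two least frequent subtrees are merged repeatedly, with left branches labeled 0 and right branches labeled 1; the function names (huffman_codes, encode, decode) and the sample string are illustrative choices, and the output is a string of '0' and '1' characters rather than packed bits.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each character in text to its Huffman bit string."""
    freq = Counter(text)                       # Step 1: frequency table
    if not freq:
        return {}
    if len(freq) == 1:                         # edge case: one distinct character
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, tree). A tree is either a single
    # character (leaf) or a (left, right) tuple (internal node).
    heap = [(weight, i, ch) for i, (ch, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:                       # Step 2: merge the two lightest trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tie_breaker, (left, right)))
        tie_breaker += 1

    codes = {}

    def walk(node, prefix):                    # Steps 3-4: label branches, read codes
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    """Step 5: replace every character with its code."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Invert the encoding; this works because no code is a prefix of another."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    codes = huffman_codes("abracadabra")
    print(codes, encode("abracadabra", codes))
```

The most frequent character receives the shortest code, which is where the saving over a fixed-width encoding comes from.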

Using a Bloom Filter for Set Membership Testing

To quickly and efficiently test whether an element is a member of a very large set, using less memory than storing the entire set. It is a probabilistic method that avoids false negatives but allows for a small rate of false positives.

When to use: When you need to check if an item exists in a massive list and speed and memory efficiency are more critical than 100% accuracy.

Step 1: Initialize a bit array.
Entry: The size of the array and the number of hash functions have been determined.
Exit: A bit array of the desired size is created with all values set to 0.
Out: Empty bit array
Step 2: Populate the filter with the known set of elements.
Entry: An empty bit array and a list of known elements are available.
Exit: All elements from the known set have been hashed and their corresponding bits in the array are set to 1.
In: A set of known elements (e.g., malicious URLs) · Out: A populated Bloom filter (bit array)
Step 3: Query the filter with a new element.
Entry: The Bloom filter is populated and a new element needs to be checked.
Exit: A set of array indices corresponding to the new element is generated.
In: A new element to test · Out: A set of array indices
Step 4: Determine set membership.
Entry: The array indices for the new element have been generated.
Exit: A probabilistic determination of set membership is made.
- If any bit is 0, conclude 'definitely not in set'.
- If all bits are 1, conclude 'probably in set'.
In: A set of array indices, The populated Bloom filter · Out: A membership decision ('definitely not in set' or 'probably in set')
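The following is a small sketch of the four steps, under stated assumptions: the class name BloomFilter, its default size and hash count, and the trick of deriving several hash functions by salting SHA-256 with a counter are all choices made for illustration, not details taken from the book. The URLs are placeholders.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size                 # Step 1: every bit starts at 0

    def _indices(self, item):
        # Derive num_hashes array positions by salting one hash with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        """Step 2: set the bit at every position the item hashes to."""
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        """Steps 3-4: False means definitely absent, True means probably present."""
        return all(self.bits[idx] for idx in self._indices(item))


if __name__ == "__main__":
    blocked = BloomFilter()
    blocked.add("http://malicious.example/login")
    print(blocked.might_contain("http://malicious.example/login"))
    print(blocked.might_contain("http://harmless.example/"))
```

A larger bit array or a better-tuned number of hash functions lowers the false-positive rate, at the cost of memory or hashing time.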

Detecting Credit Card Fraud using a Decision Tree

To classify new credit card transactions as either genuine or fraudulent using a pre-built decision tree model.

When to use: When a new credit card transaction occurs and needs to be automatically checked for potential fraud in real-time.

Step 1: Check if the credit card has been reported lost or stolen.
Entry: A new transaction is initiated.
Exit: The card's status (lost/stolen or not) is known.
- If yes, classify as fraudulent.
- If no, proceed to the next step.
In: Transaction data · Out: Card status
Step 2: Check if the purchase is unusual for the customer.
Entry: The card has not been reported lost or stolen.
Exit: The transaction is determined to be usual or unusual.
- If not unusual, classify as genuine.
- If unusual, proceed to the next step.
In: Transaction data, Customer purchase history · Out: Purchase typicality assessment
Step 3: Trigger a verification call to the customer.
Entry: The purchase has been flagged as unusual.
Exit: The customer has been contacted and has responded.
In: Unusual purchase flag · Out: Customer verification response
Step 4: Classify the transaction based on customer confirmation.
Entry: A response from the customer has been received.
Exit: The transaction is definitively classified as genuine or fraudulent.
- If customer confirms, classify as genuine.
- If customer denies, classify as fraudulent.
In: Customer verification response · Out: Final transaction classification
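The tree above is small enough to write out as nested conditions. The sketch below assumes the three questions have already been reduced to simple flags; the function name classify_transaction, its parameter names, and the 'pending verification' label for a customer who has not yet answered are hypothetical and introduced only for this example.

```python
def classify_transaction(card_reported_stolen, purchase_is_unusual, customer_confirms=None):
    """Walk the decision tree from the root and return a label for one transaction.

    customer_confirms is None until the verification call has been answered.
    """
    if card_reported_stolen:                   # Step 1: lost or stolen card
        return "fraudulent"
    if not purchase_is_unusual:                # Step 2: typical purchase
        return "genuine"
    if customer_confirms is None:              # Step 3: call made, no answer yet
        return "pending verification"
    # Step 4: the customer's answer decides the final class.
    return "genuine" if customer_confirms else "fraudulent"


if __name__ == "__main__":
    print(classify_transaction(False, True, customer_confirms=False))
```

In practice the 'unusual purchase' flag would itself come from a model trained on the customer's history; here it is simply passed in.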

Ranking Webpages using the PageRank Algorithm

To calculate the relative importance of webpages within a network (like the World Wide Web) based on their link structure, giving search engines an importance signal for ordering results.

When to use: When needing to rank a large set of interlinked documents to determine which are the most important or authoritative.

Step 1: Model the web as a directed graph.
Entry: A crawl of the web or a set of documents is available.
Exit: A graph structure representing the webpages and their links is created.
In: A set of webpages and their hyperlinks · Out: A directed graph of the web
Step 2: Initialize the PageRank for all pages.
Entry: The graph model is complete.
Exit: Every page has an initial PageRank score.
In: The directed graph · Out: Initial PageRank scores for all pages
Step 3: Iteratively update PageRank scores.
Entry: Pages have been assigned a PageRank score (either initial or from a previous iteration).
Exit: The PageRank scores for all pages have been updated for the current iteration.
In: Current PageRank scores, The directed graph · Out: Updated PageRank scores
Step 4: Continue iterating until the scores converge.
Entry: At least one iteration has been completed.
Exit: The PageRank scores have converged to their final values.
In: Updated PageRank scores · Out: Final, stable PageRank scores
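One common way to realise these steps is power iteration with a damping factor. The sketch below assumes that formulation; the damping value of 0.85, the convergence tolerance, the handling of pages with no outgoing links, and the four-page example graph are all assumptions for illustration rather than details from the book.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Return a dict of PageRank scores for a graph given as {page: [linked pages]}.

    Each page's list of outgoing links is assumed to contain no duplicates.
    """
    # Step 1: the set of pages is every source plus every link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # Step 2: uniform starting scores

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {}
        for page in pages:                     # Step 3: one update per page
            incoming = sum(
                ranks[q] / len(links[q]) for q in pages if page in links.get(q, ())
            )
            new_ranks[page] = (1 - damping) / n + damping * (incoming + dangling / n)
        change = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:                       # Step 4: stop once scores settle
            break
    return ranks


if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

The scores always sum to 1, so they can be read as the share of attention each page receives from a reader who follows links at random.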

Building a Collaborative Filtering Recommender System

To provide personalized recommendations to a user by identifying other users with similar tastes and suggesting items that those similar users have liked.

When to use: When a business wants to increase user engagement and sales by suggesting relevant products or content to individual users.

Step 1: Collect user preference data.
Entry: A platform with users and items exists.
Exit: A dataset of user-item interactions is available.
In: User activity on the platform · Out: User preference data (e.g., purchase lists, rating matrices)
Step 2: Calculate similarity between the target user and other users.
Entry: A target user is identified and preference data is available.
Exit: A similarity score for the target user against all other users is calculated.
In: Target user's preference data, All other users' preference data · Out: A list of users ranked by similarity to the target user
Step 3: Identify the most similar users (neighbors).
Entry: Similarity scores have been calculated.
Exit: A neighborhood of similar users is identified.
In: Ranked list of similar users · Out: A set of 'neighbor' users
Step 4: Generate recommendations.
Entry: A neighborhood of similar users has been identified.
Exit: A ranked list of recommended items is generated for the target user.
In: Preference data of 'neighbor' users, Preference data of the target user · Out: A list of recommended items
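A user-based version of these steps can be sketched with plain dictionaries. The choice of cosine similarity, the similarity-weighted scoring, the parameter names (k, top_n), and the sample users and ratings are all assumptions for illustration; production recommenders typically use much larger, sparser data and more refined similarity measures.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two {item: rating} dicts (0.0 if nothing is shared)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, k=2, top_n=3):
    """Recommend items the target has not rated, based on the k most similar users."""
    # Step 2: score every other user against the target.
    similarities = sorted(
        ((cosine_similarity(ratings[target], ratings[user]), user)
         for user in ratings if user != target),
        reverse=True,
    )
    # Step 3: keep the k nearest users that share at least some taste.
    neighbours = [(sim, user) for sim, user in similarities[:k] if sim > 0]
    # Step 4: weight each neighbour's ratings by similarity and rank unseen items.
    scores = {}
    for sim, user in neighbours:
        for item, rating in ratings[user].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    ratings = {                                # Step 1: collected preference data
        "ana": {"book_a": 5, "book_b": 4},
        "ben": {"book_a": 5, "book_b": 5, "book_c": 4},
        "cat": {"book_d": 5},
        "dan": {"book_a": 4, "book_c": 2, "book_e": 5},
    }
    print(recommend("ana", ratings))
```

Users with no overlap with the target contribute nothing, which is why an item rated only by such a user never appears in the list.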

The story

The reader

A curious general reader who wants to understand what big data really is and how it is changing their everyday life and the wider world.

External problem

Big data is everywhere but explanations are either superficial or buried in mathematical textbooks aimed at graduate students.

Internal problem

The reader feels swamped, intimidated, and uncertain about a powerful technology shaping their privacy, health, and work.

Philosophical problem

Such a transformative force should not be opaque to ordinary people who have a stake in how it is used.

The plan

Learn how data evolved and what distinguishes big data via the four V's.
Understand how big data is stored using distributed file systems, NoSQL, and the Cloud.
See how analytic techniques mine raw data into useful information.
Explore real applications in medicine, business, security, and society.
Reflect on the privacy, security, and ethical responsibilities involved.

Success

The reader can confidently define big data, recognize its techniques in daily life, and critically evaluate its benefits and risks.
They make more informed choices about their own data, privacy, and the technologies they adopt.

At stake

The reader remains mystified by big data and passively subject to its uses and abuses.
They overlook the privacy, security, and ethical stakes of a rapidly data-driven world.

Questions this book answers

What is data and what makes 'big data' special?
How is big data stored and managed when it exceeds traditional databases?
How do analytic techniques turn raw data into useful information?
How is big data applied in medicine, business, and society?
What security, privacy, and ethical risks does big data create?

Glossary

Data Volume, Velocity, and Variety: The defining scale, generation speed, and heterogeneity of data that together constitute the big data condition.
Data Veracity (Quality): The accuracy, reliability, and trustworthiness of data being collected and analysed.
Distributed Storage Architecture: The chosen systems for scalable, fault-tolerant storage and management of big data.
Analytic Technique Application: The deployment of data mining and machine learning algorithms to discover patterns in big data.
Useful Information Extraction: The transformation of raw data into meaningful, valuable information and patterns.
Predictive Model Accuracy: The correctness of forecasts produced by models built from big data.
Data Security and Privacy Measures: Protective controls deployed to safeguard data from theft, tampering, and unauthorized access.
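Decision Quality and Organizational Value: The beneficial outcomes of big data including better decisions, value creation, improved care, and societal efficiency.
Privacy and Security Risk: The adverse exposure to data theft, breaches, surveillance, and loss of personal privacy arising from large-scale collection and storage of personal data.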

Related in the literature

The measurement literature behind this signal — sourced, so you can defend it.

“Title : Artificial Intelligence: A Very Short Introduction (Very Short Introductions) Author: Boden, Margaret A. ASIN : B07DPP4J1B ISBN : 9780191080074 Artificial Intelligence : A Very Short Introduction VERY SHORT INTRODUCTIONS are for anyone wanting a stimulating and…”
Source: Artificial Intelligence a Very Short Introduction · match 63%
“Organizations: A Very Short Introduction VERY SHORT INTRODUCTIONS are for anyone wanting a stimulating and accessible way in to a new subject. They are written by experts, and have been published in more than 25 languages worldwide. The series began in 1995, and now represents a…”
Source: Organizations a Very Short Introduction · match 61%
“meaningfully discussed without frequent reference to its collection, storage, analysis, and use by the big commercial players. Since research departments in companies such as Google and Amazon have been responsible for many of the major developments in big data, frequent…”
Source: Big Data a Very Short Introduction · match 58%

Resources: Artificial Intelligence a Very Short Introduction · Organizations a Very Short Introduction · Big Data a Very Short Introduction