Part IV — AI in Product, Operations, and Decision Support · AI Human Interaction Guide

4.1 The product/operations/decision-support intersection

When enterprises talk about getting value from AI, they usually mean one of three things. AI in product — building AI features into the products and services the enterprise sells (recommendation engines; intelligent search; in-product assistants). AI in operations — using AI to automate or augment operational processes (supply-chain optimization; fraud detection; process mining; predictive maintenance). AI in decision support — using AI to inform executive and managerial decisions (forecasting; scenario modeling; risk analysis; talent decisions).

The three categories overlap and the boundaries are conventional rather than principled. The guide treats them together because they share the same underlying methodological challenge: an AI system that produces recommendations requires the user (or the system itself, in agentic configurations) to take action on those recommendations, and the action-taking is where the methodology gap from Part I §1.3 surfaces most visibly. Drift is more consequential when actions follow outputs. Hallucination is more consequential when fabricated outputs drive real-world commits. Sycophancy is more consequential when the user's framing shapes how the system reasons about a decision.

The challenges play out differently across the three:

In product, the AI's outputs are typically consumed by the product's end user. The product is the integration surface; the user's experience of the AI's failures is the product's reputation problem. The empirical evidence on user-facing AI failures is fragmentary but consistent: users tolerate occasional misses more than they tolerate confident incorrectness; calibration matters more than peak quality; the AI's humility is a feature.

In operations, the AI's outputs typically drive automated actions or strongly guide human action. The integration surface is the operational system; the cost of an AI failure is measured in shipped product, transacted dollars, or operational disruption. The empirical evidence on operational AI is older and richer than on user-facing AI: pattern-recognition systems for fraud detection, forecasting systems for inventory management, and anomaly detection for predictive maintenance have been in production for decades.

In decision support, the AI's outputs typically inform managerial or executive decisions. The integration surface is the manager's deliberation process. The cost of a failure is harder to measure — bad decisions made on AI-augmented analysis blend into the broader noise of bad decisions made on conventional analysis — but the cumulative cost is potentially large because decisions at this level have outsize organizational consequence.

The chapter walks the three surfaces in turn, with the agentic-AI work (treated in §4.2 because it crosses the boundaries) introduced first as the active frontier shaping how all three are evolving.

4.2 Agentic AI systems — what they are, what they aren't

The term "agent" in AI has carried different meanings over the field's history. In the symbolic-AI tradition, an agent was a system that perceived an environment and took actions to maximize a goal under an explicit goal representation. In the foundation-model era (2023 onward), "agentic" has come to mean something narrower and looser: a system that takes multiple steps to accomplish a task, often calling external tools (APIs, databases, code execution) between LLM invocations, with some persistence across steps.

The 2024-2026 wave of agentic systems has produced several reference implementations: ChatGPT's tool-calling and code-interpreter features; Claude's tool-use API; Anthropic's Computer Use feature; OpenAI's Operator; AutoGPT, LangGraph, CrewAI, and similar open-source orchestration frameworks; enterprise-specific agentic systems (customer-service agents that read knowledge bases; coding agents that traverse repositories; research agents that query the web and synthesize findings).

The promise of agentic AI is straightforward: rather than the user prompting the system for a single output, the user describes a goal and the system takes the multiple steps required to accomplish it. The reality has been more complicated. Three observations from the deployment record:

Agentic systems amplify the methodology-gap failure modes from Part I §1.3. Drift accumulates across steps; an agent's behavior in step 7 depends on the state accumulated through steps 1-6, which depends on the (possibly hallucinated, possibly sycophantic) outputs from earlier steps. Long-context drift (the Laban 2025 finding from Part V §5.2.2) compounds across agent execution.¹ The single-turn benchmark on which an agent's underlying foundation model was evaluated does not predict the multi-step agent's behavior; the agent is operating outside the benchmark distribution by construction.

The reliability question is harder than it looks. If each step of an agent is a single LLM call that is correct 95% of the time, and the steps fail independently, an N-step chain is correct end to end with probability (0.95)^N. For a 10-step agent that is about 60%. The practical picture is harsher than this arithmetic suggests because steps are not independent: an error early in the chain contaminates the state that later steps depend on, which biases the system toward further failures later.
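The compounding arithmetic can be checked directly. This is a minimal sketch under the optimistic assumption of independent steps with a uniform per-step success rate; the function name `compound_success_rate` is illustrative, not part of any library.

```python
def compound_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Illustrative chain lengths; 0.95 is the per-step rate used in the text.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {compound_success_rate(0.95, n):.1%}")
```

For ten steps at 95% the result is roughly 0.599, which matches the ~60% figure. Because real agent steps are correlated, this number is better read as a reference point than as a prediction.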

Tool-calling reliability is the practical bottleneck. Foundation models in 2026 are reasonably good at deciding which tool to call and at constructing the inputs to it. They are less reliable at handling tool outputs that don't match what the model expected, at recovering from tool failures, and at avoiding repeated calls that produce circular state. The robustness work required to make agentic systems production-grade is engineering work that the orchestration frameworks are doing, but it is not yet at the maturity level the foundation-model layer has reached.
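The three weak spots named above (unexpected tool output, tool failure, and repeated calls that go in circles) can each be guarded explicitly. The following is a sketch only: `GuardedToolRunner`, `ToolCallError`, and the retry and repeat limits are hypothetical names and values chosen for illustration, and they do not describe how any particular orchestration framework works.

```python
import json
from typing import Any, Callable, Dict


class ToolCallError(Exception):
    """Raised when a tool call cannot be completed safely."""


class GuardedToolRunner:
    """Hypothetical wrapper around the tool calls made by an agent loop."""

    def __init__(self, max_retries: int = 2, max_repeats: int = 2) -> None:
        self.max_retries = max_retries  # extra attempts after a failure
        self.max_repeats = max_repeats  # identical calls allowed per run
        self._seen: Dict[str, int] = {}

    def call(self, tool: Callable[..., Any], args: Dict[str, Any],
             validate: Callable[[Any], bool]) -> Any:
        name = getattr(tool, "__name__", repr(tool))
        # Guard 1: refuse identical calls that keep recurring (circular state).
        key = f"{name}:{json.dumps(args, sort_keys=True, default=str)}"
        self._seen[key] = self._seen.get(key, 0) + 1
        if self._seen[key] > self.max_repeats:
            raise ToolCallError(f"repeated call detected: {key}")

        last_error: Exception = ToolCallError("no attempt made")
        for _ in range(self.max_retries + 1):
            try:
                result = tool(**args)
            except Exception as exc:
                # Guard 2: the tool itself failed; remember why and retry.
                last_error = exc
                continue
            if validate(result):
                # Guard 3: the output has the shape the caller expects.
                return result
            last_error = ToolCallError(f"unexpected output: {result!r}")
        raise ToolCallError(f"{name} failed after retries") from last_error
```

The design choice is to fail loudly: when a guard trips, the agent loop receives an exception it has to handle instead of a malformed result that silently enters the next step's state.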

The guide treats agentic AI as an active frontier rather than a settled capability. The work happening at frontier labs (Anthropic's Computer Use; OpenAI's o-series models trained explicitly for multi-step reasoning; Google's agentic Gemini features) and at the open-source orchestration layer is changing the picture quickly. The general shape of the gap — single-step capability exceeds multi-step reliability — is unlikely to close completely; the question is how much it narrows.

The implications for enterprise adoption: agentic deployments in 2026 are most safely scoped to constrained domains with rollback paths. A coding agent that produces pull-request diffs that a human reviews before merging is in this category. A customer-service agent that drafts responses for human approval is in this category. A research agent that produces written drafts citing its sources is in this category — the human verifies the citations exist and verifies the synthesis. Agentic deployments that commit actions in the world without human review (autonomous trade execution; autonomous customer-account changes; autonomous code merges to production) sit beyond what the empirical reliability record supports.

4.3 AI in product development workflows

The product-development application of AI has two main shapes: AI in the products being built (recommendation engines; intelligent search; conversational interfaces) and AI in the development of those products (coding agents; design assistants; documentation generation; testing automation). The second category has accumulated the richest empirical record.

Coding workflows. The empirical evidence on AI coding tools is the most-cited single body of work on AI productivity. The Brynjolfsson NBER w31161 customer-support study and the Peng et al. controlled-task coding benchmark both anchor the "AI raises productivity" narrative. The METR 2025 sign-inversion study counters the narrative for experienced developers on familiar codebases.² The aggregated picture, drawing on the AHI program's longitudinal-cognitive-effects review: AI coding tools robustly help novices and intermediates on routine work; the gains diminish or invert for experts working in familiar territory; the cognitive offloading failure mode (Part V §5.2.4) accumulates over time for workers who use AI tools to skip the harder work that builds expertise.

The operational implication for product organizations: AI coding tools deliver real productivity gains when deployed selectively. Deploying them uniformly across the engineering organization without attention to who benefits and who doesn't is the same uniform-rollout failure mode that Part II §2.2 named for the workforce more broadly.

Design workflows. The empirical evidence on AI design tools (Figma's AI features; Midjourney for ideation; specialized design copilots) is thinner than for coding. The pattern emerging from practitioner reports — generative-AI design tools are useful for ideation and rapid prototyping, less useful for production-quality finished work — is consistent with the cognitive-redistribution framing: AI substitutes for the production of design candidates (which is what novices benefit from) but not for the evaluation of design quality (which is what experts contribute).

Documentation and testing workflows. These are the workflows where AI's bounded, verifiable, novice-augmenting characteristics fit best. Examples include generating API documentation from code (where the code is the source of truth and the documentation can be verified against it), drafting test cases (where running the tests is the verification), and writing changelog entries (where the diff is the source). The pattern: AI tools work best when there's an external verifier (the code; the tests; the diff) the human can use to check the output.
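One way to make the external-verifier pattern concrete is to check drafted documentation against the code it describes. The sketch below uses Python's standard `inspect.signature` to compare parameter names; the helper `check_documented_params`, the example function `attrition_rate`, and the drafted parameter list are all hypothetical.

```python
import inspect
from typing import Callable, Dict, List


def check_documented_params(func: Callable, documented: List[str]) -> Dict[str, List[str]]:
    """Compare parameter names in drafted docs against the real signature."""
    actual = list(inspect.signature(func).parameters)
    return {
        "missing_from_docs": [p for p in actual if p not in documented],
        "not_in_code": [p for p in documented if p not in actual],
    }


def attrition_rate(leavers: int, average_headcount: float) -> float:
    """Hypothetical function whose documentation was drafted by an AI tool."""
    return leavers / average_headcount


# Parameter names as they appear in a (hypothetical) AI-drafted API doc.
drafted_doc_params = ["leavers", "headcount"]
report = check_documented_params(attrition_rate, drafted_doc_params)
print(report)
```

The check is deliberately narrow: it catches a renamed or invented parameter, which is the kind of error the code can adjudicate, and leaves judgments about whether the prose is clear to the human reviewer.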

4.4 Decision support — VOI, structured frameworks, and the methodology gap

AI in decision support is where the methodology gap from Part I §1.3 surfaces with the highest decision-stakes. A manager consulting an AI tool to inform a hiring decision, a promotion decision, a compensation decision, a strategic-direction decision, a capacity-allocation decision is the configuration in which AI outputs most directly drive consequential outcomes for other humans.

The conventional decision-support shape: an AI tool produces a recommendation; the manager evaluates the recommendation; the manager takes (or doesn't take) the recommended action. The methodology gap surfaces in two places:

The AI's recommendation is calibrated to something the manager may not be able to interrogate. The system was trained on data the manager doesn't have access to; the system's reasoning over the case is opaque; the system's output is a recommendation that looks confident regardless of whether the underlying analysis was on solid ground. The manager who consults the recommendation has the burden of verification but not the tools.

The decision-quality outcome is hard to measure. Bad decisions made on AI-augmented analysis blend into the broader noise of bad decisions; the counterfactual (what the manager would have decided without the AI) is not observable. The AI's failure mode is not detectable as a discrete failure event; it accumulates as a quiet drift in decision quality that may take quarters or years to surface.

Value of Information as the structural correction

Formal Value of Information (VOI) analysis, the framework borrowed from Howard, Raiffa, and the broader decision-theory tradition, addresses both gaps. Rather than asking "what should I decide?", VOI asks "how much would more information be worth, and what additional information would I most want before deciding?" The reframing matters because it makes the uncertainty in the AI's recommendation a first-class object in the decision rather than a hidden assumption.

A VOI-informed decision process treats the AI's recommendation as one input, calibrated by the analysis that produced it, alongside other inputs (the manager's experience; the team's contextual knowledge; explicit risk-tolerance considerations). The framework explicitly asks whether the cost of more analysis would be repaid by improved decision quality. For high-stakes decisions with high uncertainty, the VOI analysis often recommends more analysis before deciding; for low-stakes decisions with low uncertainty, it recommends deciding now. AI-augmented analysis is one specific kind of analysis that fits into the VOI frame.
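The question of whether more analysis would repay its cost has an upper bound that is simple to compute for a discrete decision: the Expected Value of Perfect Information (EVPI). The sketch below is a textbook calculation; the hiring scenario, the priors, and the payoffs (in thousands of dollars) are made-up numbers for illustration.

```python
from typing import Dict

Payoffs = Dict[str, Dict[str, float]]  # action -> state -> payoff


def expected_payoff(action_payoffs: Dict[str, float], probs: Dict[str, float]) -> float:
    """Expected payoff of one action under the given state probabilities."""
    return sum(probs[state] * action_payoffs[state] for state in probs)


def evpi(payoffs: Payoffs, priors: Dict[str, float]) -> float:
    """Expected Value of Perfect Information for a discrete decision."""
    # Decide now: pick the single action with the best expected payoff.
    decide_now = max(expected_payoff(p, priors) for p in payoffs.values())
    # Decide after learning the true state: pick the best action per state,
    # then average over how likely each state is.
    decide_informed = sum(
        priors[state] * max(p[state] for p in payoffs.values())
        for state in priors
    )
    return decide_informed - decide_now


# Hypothetical hiring decision; every number here is invented.
priors = {"strong_performer": 0.6, "weak_performer": 0.4}
payoffs = {
    "hire": {"strong_performer": 100.0, "weak_performer": -80.0},
    "pass": {"strong_performer": 0.0, "weak_performer": 0.0},
}
print(f"EVPI: {evpi(payoffs, priors):.1f}")
```

With these numbers, deciding now is worth 28 (hire) and deciding with perfect foresight is worth 60, so the EVPI is 32. No additional analysis, AI-generated or otherwise, is worth paying more than that for this decision.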

The guide treats VOI as the structural correction that decision-support AI deployments most often lack. Formal VOI analysis is rarely implemented as production software in enterprise tooling; the People Analytics Toolbox's forecasting spoke (Monte Carlo simulation + EVPI + discrete EVSI on aligned-chance decision trees) is one of the few production implementations.³ Without VOI-style analysis surrounding AI-augmented decision support, the methodology gap from Part I §1.3 — AI's confident-looking outputs without calibration scaffolding — extends directly into managerial deliberation.
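For readers unfamiliar with discrete EVSI, the following sketch shows the standard preposterior calculation for an imperfect signal, such as an assessment that is right more often than wrong. It is a generic textbook version and makes no claim about how the Toolbox's forecasting spoke is implemented; the payoffs, priors, and signal accuracies are invented for illustration.

```python
from typing import Dict

Payoffs = Dict[str, Dict[str, float]]     # action -> state -> payoff
Likelihood = Dict[str, Dict[str, float]]  # signal -> state -> P(signal | state)


def best_expected_payoff(payoffs: Payoffs, probs: Dict[str, float]) -> float:
    """Expected payoff of the best action under the given probabilities."""
    return max(sum(probs[s] * p[s] for s in probs) for p in payoffs.values())


def evsi(payoffs: Payoffs, priors: Dict[str, float], likelihood: Likelihood) -> float:
    """Expected Value of Sample Information for a discrete, noisy signal."""
    decide_now = best_expected_payoff(payoffs, priors)
    decide_after_signal = 0.0
    for given_state in likelihood.values():
        # Marginal probability of observing this signal.
        p_signal = sum(given_state[s] * priors[s] for s in priors)
        if p_signal == 0.0:
            continue
        # Bayes' rule: update beliefs about the state given the signal.
        posterior = {s: given_state[s] * priors[s] / p_signal for s in priors}
        decide_after_signal += p_signal * best_expected_payoff(payoffs, posterior)
    return decide_after_signal - decide_now


# Hypothetical hiring decision and assessment accuracy; all numbers invented.
priors = {"strong": 0.6, "weak": 0.4}
payoffs = {
    "hire": {"strong": 100.0, "weak": -80.0},
    "pass": {"strong": 0.0, "weak": 0.0},
}
assessment = {
    "positive": {"strong": 0.8, "weak": 0.3},
    "negative": {"strong": 0.2, "weak": 0.7},
}
print(f"EVSI: {evsi(payoffs, priors, assessment):.1f}")
```

With these numbers the noisy assessment is worth 10.4, against a perfect-information ceiling of 32 for the same decision. The gap between the two is what a better-calibrated signal could still add.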

Structured decision frameworks

VOI is one of a small family of structured decision-analysis frameworks the literature has developed: Kepner-Tregoe Decision Analysis; multi-attribute utility theory; analytic hierarchy process; scenario planning; Monte Carlo simulation against parameterized uncertainty. These frameworks share a feature relevant to AI integration: they treat the decision as a structure with components (alternatives; criteria; uncertainties; constraints; valuations) rather than as a single question with a single answer.

AI tools integrate cleanly into structured frameworks. The AI can produce candidate alternatives; weight criteria against a stated objective; surface uncertainties the framework should treat explicitly; run Monte Carlo simulations over parameter ranges. The integration is much harder when the AI is asked to produce the decision — at that point the framework is hidden and the AI's failure modes propagate without checks.
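A small example of keeping the framework visible: in a simple additive multi-attribute model, an AI tool might propose the alternatives and draft the ratings, while the weights and the final ranking stay explicit and inspectable. The sketch below assumes ratings on a 0 to 1 scale; `rank_alternatives`, the staffing alternatives, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_alternatives(
    ratings: Dict[str, Dict[str, float]],  # alternative -> criterion -> rating in [0, 1]
    weights: Dict[str, float],             # criterion -> importance (any positive scale)
) -> List[Tuple[str, float]]:
    """Rank alternatives by weighted-sum score, highest first."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    scores = {
        name: sum(weights[c] / total * criteria[c] for c in weights)
        for name, criteria in ratings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical staffing decision; ratings could be AI-drafted, then reviewed.
ratings = {
    "promote_internal": {"cost": 0.8, "time_to_fill": 0.9, "capability": 0.6},
    "hire_external": {"cost": 0.4, "time_to_fill": 0.5, "capability": 0.9},
}
weights = {"cost": 1.0, "time_to_fill": 1.0, "capability": 2.0}  # set by the manager

for name, score in rank_alternatives(ratings, weights):
    print(f"{name}: {score:.3f}")
```

Because every rating and weight is a named input, a reviewer can challenge any single number and rerun the ranking, which is not possible when the tool returns only a recommendation.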

People-analytics decision support specifically

The decision-support category overlaps with the workforce category from Part II in one important place: people-analytics decisions are decisions about people. AI-augmented people-analytics tools (engagement-analysis tools; manager-effectiveness scoring; retention-risk models; performance-evaluation augmentation) are decision-support tools applied to workforce decisions, with the additional weight that the decisions affect people's careers.

The combination of consequential decisions about people, AI outputs carrying the methodology-gap failure modes, limited managerial capacity to verify the AI's analysis, and the bias-amplification loops from Part V §5.2.5 produces specific risks that justify the suppression-gate-as-substrate pattern Performix uses (Part V §5.5). The structural answer: people-analytics AI tools should not produce outputs that allow individual respondents to be identified from team-aggregate reports, should not produce individual-level scores from protected feedback, and should not be designed in a way that allows them to be used for surveillance, scoring, disciplinary action, or retaliation. These are design constraints, not configuration options.⁴
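The simplest piece of such a gate is a minimum-group-size (min-N) rule applied before any aggregate is reported. The sketch below illustrates only that one idea and is not Performix's implementation; the function name `team_averages`, the threshold of 5, and the sample data are assumptions.

```python
from statistics import mean
from typing import Dict, List, Optional


def team_averages(
    responses: Dict[str, List[float]],  # team -> individual survey scores
    min_n: int = 5,                     # assumed threshold, not a standard
) -> Dict[str, Optional[float]]:
    """Report a team average only when enough people responded."""
    if min_n < 1:
        raise ValueError("min_n must be at least 1")
    return {
        team: (mean(scores) if len(scores) >= min_n else None)  # None = suppressed
        for team, scores in responses.items()
    }


# Hypothetical engagement scores on a 1-5 scale.
responses = {
    "platform": [4, 5, 3, 4, 4, 5],
    "design": [2, 5, 4],
}
print(team_averages(responses))
```

The gate returns the aggregate or nothing; it never returns the underlying rows. A min-N rule alone does not stop someone from differencing two overlapping reports, which is one reason a fuller gate layers further checks on top of it.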

4.5 AI in operations — process automation and the human-in-the-loop question

Operational AI is the oldest enterprise AI category — pattern-recognition systems for fraud detection have been in production since the 1990s; predictive-maintenance and inventory-forecasting systems have a similarly long track record. The contemporary frontier in operational AI is two-fold: extending the older pattern-recognition systems with foundation-model capabilities (giving fraud-detection systems the ability to explain their decisions in natural language; giving inventory systems the ability to consult unstructured documents), and building entirely new agentic systems for tasks that previously required human judgment at every step (customer-service agents; document-processing agents; specialized research agents).

The empirical record on operational AI is richer than on user-facing AI because the systems have been measured for longer. The pattern emerging from roughly three decades of operational AI deployment:

Pattern-recognition AI in well-defined operational contexts is reliable. Fraud-detection systems achieve high accuracy on their training distribution; inventory-forecasting systems produce useful demand forecasts; predictive-maintenance systems reduce unplanned downtime. The methodology for deploying these systems — feature engineering, model selection, validation against held-out data, monitoring for drift — is mature. The failure modes are known and addressable.
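As one concrete form of drift monitoring, the Population Stability Index (PSI) compares how a feature or score is distributed in production against its distribution at training time. The sketch below is a plain-Python version with equal-width bins taken from the baseline; the bin count, the `eps` floor for empty bins, and the sample data are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a recent sample."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range go into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical model scores: training-time baseline vs. a shifted recent batch.
baseline = [i / 100 for i in range(100)]
recent = [0.5 + i / 200 for i in range(100)]
print(f"PSI: {psi(baseline, recent):.2f}")
```

A PSI of zero means the two samples fall into the bins in identical proportions. Practitioners often cite rule-of-thumb cutoffs such as 0.1 and 0.25 for moderate and large shifts, but those are conventions to calibrate per system, not guarantees.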

Foundation-model-augmented operational AI is less mature but increasingly capable. Typical extensions add conversational interfaces to operational systems, document-understanding capabilities, and multi-modal inputs (image-based quality control; voice-based logging). These extensions inherit the foundation-model failure modes from Part I §1.3 (hallucination, drift, sycophancy) and require additional methodology to handle them. The operational system's existing safety properties (transaction validation; rule-based checks; human approval gates) carry over partly but don't fully cover the foundation-model failure surface.

Fully agentic operational AI is the active frontier. These are systems that not only detect anomalies but autonomously take corrective action; that not only forecast demand but autonomously adjust inventory; that not only summarize documents but autonomously make commitments based on them. The reliability arithmetic from §4.2 applies. Most enterprise agentic-operational deployments in 2026 sit in human-in-the-loop configurations rather than fully autonomous ones, for sound empirical reasons.

The human-in-the-loop question

The default architecture for operational AI in 2026 includes a human in the loop — a human reviewer or approver who sees the AI's output before it drives an action. The architecture has the obvious safety property: the AI's failure modes are caught by the human before they propagate. It also has a less-obvious cost: the human's review capacity is a bottleneck, and the human's review quality degrades when the AI is mostly right.

The Bainbridge 1983 paper Ironies of Automation is the canonical reference here. Bainbridge observed that as automated systems become more reliable, the human operators' vigilance for failures decreases. The human-in-the-loop is most reliable when the AI is unreliable enough that the human stays attentive; the human-in-the-loop becomes brittle precisely when the AI gets good enough that the human stops paying close attention. This is the empirical anchor for the DevPlane research program's C1 risk-compensation field study — a pre-registered test of whether the Bainbridge effect operates in real coding work with AI tools.⁵

The operational implication: human-in-the-loop is necessary but not sufficient. Pairing it with anomaly-detection on the human's review behavior (are reviews getting faster? are they becoming more confirmatory and less critical?) is the structural correction. Without that monitoring, the human-in-the-loop architecture quietly degrades to a rubber-stamp without anyone noticing.
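The two questions in parentheses can be turned into simple signals computed from review logs. The sketch below compares a recent window of reviews against a baseline window; the `Review` record, the function `rubber_stamp_signals`, and both thresholds are hypothetical and would need calibration against real review data.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Review:
    seconds_spent: float  # time the reviewer spent on the AI output
    approved: bool        # whether the output was approved unchanged


def rubber_stamp_signals(
    baseline: List[Review],
    recent: List[Review],
    speedup_ratio: float = 0.5,      # assumed: recent median under half of baseline
    approval_ceiling: float = 0.98,  # assumed: near-total approval is suspicious
) -> Dict[str, bool]:
    """Flag review patterns consistent with declining reviewer vigilance."""
    if not baseline or not recent:
        raise ValueError("both review windows must be non-empty")
    base_median = median(r.seconds_spent for r in baseline)
    recent_median = median(r.seconds_spent for r in recent)
    approval_rate = sum(r.approved for r in recent) / len(recent)
    return {
        "reviews_much_faster": recent_median < speedup_ratio * base_median,
        "near_total_approval": approval_rate >= approval_ceiling,
    }


# Hypothetical logs: careful early reviews, then quick uniform approvals.
early = [Review(240, True), Review(310, False), Review(280, True), Review(200, True)]
late = [Review(40, True), Review(35, True), Review(50, True), Review(45, True)]
print(rubber_stamp_signals(early, late))
```

Neither flag proves rubber-stamping, since reviews can legitimately get faster as the AI improves. The flags mark where to sample reviews for a closer audit.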

4.6 Part-end glossary, bibliography, and cross-references

Glossary

Action-grade reliability. The standard an AI system must meet when its outputs drive actions in the world: calibration, traceability, recoverability, and observability of failure modes. Distinct from output quality, which most AI evaluation infrastructure measures.

Agentic AI. A system built on foundation-model substrate that decomposes a task into multiple steps, executes those steps (often via tool calls), maintains state across steps, and produces a result at the end.

Bainbridge effect (Ironies of Automation). The observation that human operators' vigilance for automation failures decreases as automation reliability increases. Bainbridge 1983 is the canonical reference.

EVPI / EVSI. Expected Value of Perfect Information / Expected Value of Sample Information. The two primitives in Value-of-Information analysis: how much would knowing the truth be worth (EVPI); how much would a noisy sample be worth (EVSI). Implemented in production form in the People Analytics Toolbox's forecasting spoke.

Human-in-the-loop. An AI deployment architecture in which a human reviewer or approver sees the AI's output before it drives an action. The default safety architecture for high-stakes operational AI; vulnerable to the Bainbridge / rubber-stamp failure mode without auxiliary instrumentation.

Kepner-Tregoe. A structured decision-analysis methodology developed in the 1960s, organized around explicit problem analysis, decision analysis, and potential-problem analysis. Integrates cleanly with AI-augmented analysis as one structured-framework option.

Monte Carlo simulation. A computational method for parameterized uncertainty — running many random samples through a model to surface the distribution of outcomes. The People Analytics Toolbox's forecasting spoke implements Monte Carlo as a callable service.

Rubber-stamping. The failure mode in which human-in-the-loop reviewers approve AI outputs without substantive review, typically because the AI's apparent reliability has reduced the reviewer's vigilance. A consequence of the Bainbridge effect.

Tool-calling. The pattern in agentic AI systems by which the foundation model invokes external APIs, databases, or code execution between language-generation steps. The practical bottleneck in agentic-system reliability.

Value of Information (VOI) analysis. A decision-theory framework that quantifies how much additional information would be worth before making a decision. The structural correction to AI-augmented decision support that lacks calibration scaffolding.

Bibliography (Part IV)

Bainbridge, Lisanne. Ironies of Automation. Automatica, 1983.

Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. Generative AI at Work. NBER Working Paper 31161, 2023.

Howard, Ronald A., and James E. Matheson, eds. Readings on the Principles and Applications of Decision Analysis. Strategic Decisions Group, 1984.

Lee, Hao-Ping, et al. Confidence in Generative AI and Critical Thinking. Microsoft Research / CHI 2025.

METR. Experienced Open-Source Developers Slower with AI Tools on Familiar Repositories. 2025.

Peng, S., et al. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. 2023.

Raiffa, Howard. Decision Analysis: Introductory Lectures on Choices Under Uncertainty. Addison-Wesley, 1968.

People Analytics Toolbox. Forecasting spoke — Monte Carlo + EVPI + discrete EVSI on aligned-chance decision trees. peopleanalyst.com/research/pa-platform/forecasting.

DevPlane Research Program. C1 — Risk Compensation in Human-AI Coordination. Pre-registered field study at peopleanalyst.com/research/devplane/.

Cross-references

Concept introduced here	Where it gets fuller treatment
The methodology gap surfacing in decision-stakes contexts	Part I §1.3
The Bainbridge effect / Ironies of Automation	Part V §5.5 (instrumentation as design constraint); DevPlane research program
VOI as the decision-support correction	People Analytics Toolbox `forecasting` spoke
Suppression-gate-as-substrate in people-analytics AI	Part V §5.5 (Performix); Part VI §6.5
Cognitive offloading in coding workflows	Part V §5.2.4 (the AHI program's longitudinal cognitive-effects review)
Human-in-the-loop architectures + rubber-stamp drift	Part V §5.5 (design-constraint pattern); Part VI §6.4 (auditability)
Agentic AI's compound reliability arithmetic	Part I §1.5 (foundation-model substrate); Part V §5.2.2 (long-context drift)

¹ Laban, Philippe, et al. LLMs Get Lost in Multi-Turn Conversation. 2025. The single-turn-to-multi-turn ~39% performance degradation compounds across agent steps because each step is at minimum a multi-turn interaction relative to the previous step's state.
² METR. Experienced Open-Source Developers Slower with AI Tools on Familiar Repositories. 2025. The sign-inversion finding against the Peng et al. controlled-task benchmark. Documented in the AHI program review at content/research/ai-human-interaction/sources/topic-reviews/longitudinal-cognitive-effects-and-skill-change-in-ai-assisted-programming.md.
³ People Analytics Toolbox forecasting spoke. Monte Carlo simulation, EVPI, and discrete EVSI on aligned-chance decision trees, exposed as service-substrate over HTTP and MCP. The spoke is one of the few production implementations of formal VOI analysis in enterprise AI tooling.
⁴ Performix protected-feedback capability: the min-N + redaction + identity-risk-scoring + role-based-visibility + safe-aggregation-policies primitive that every other Performix capability passes through. Documented at /Users/mikewest/Vibe Coding Projects/Performix/docs/VISION.md and the Performix product card at peopleanalyst.com.
⁵ Bainbridge, Lisanne. Ironies of Automation. Automatica 19, no. 6 (1983): 775-779. The C1 pre-registered field study at the DevPlane research surface (peopleanalyst.com/research/devplane/) tests whether the effect operates in real coding work with AI tools.