What is PeopleAnalyst?

PeopleAnalyst is the front door for people-analytics research: 205+ works indexed and profiled, 40+ citation-grade findings extracted, and peer-reviewed behavioral science translated from academic to actionable — the missing manual for the people analytics you always meant to do.

What is people analytics?

People analytics is not a dashboard. It is behavioral science and statistical inference applied to workforce decisions — a discipline with its own methodology, spanning measurement, organizational design, talent, leadership, and analytics craft.

Why does AI in HR need measurement science?

AI is being deployed in high-stakes people decisions — hiring, performance, attrition — without the measurement science to evaluate whether it works or whom it harms. Construct validity, effect sizes, and criterion validity are the vocabulary for asking an AI vendor the right questions.

How is the research made accessible?

The evidence is indexed and searchable: 205+ works, 40+ citation-grade insight cards, and 8 research arcs, so the right finding reaches the right decision at the right time.

What separates good people measurement from assertion?

Good measurement has a method: construct validity, reliability, and effect-size interpretation are not optional — they are what separates evidence from assertion.

CanonicAI (canonicai.com)

CanonicAI — corpus in, canonical data out; the public face of the engine at canonicai.com.

CanonicAI is the engine that turns a provided corpus (books, papers, domain documents) into canonical, queryable datasets. It also owns the canonical schema and provenance contract that the rest of the PeopleAnalyst family builds on. It is a production line, not a prompt.

Microstory

Customer: Portfolio properties — PA-site, vela, principia, Performix, DevPlane, Fourth & Two, anycomp, segmentation-studio — that need the same library, the same canonical IDs, and the same passages queryable the same way.
Problem · external: Each consumer re-implements retrieval, identity, and integrity against the substrate, with definitions drifting per product.
Problem · internal: Every cross-portfolio question feels like a week of plumbing before the actual question gets answered.
Problem · philosophical: Cross-portfolio infrastructure that ultimately speaks to humans about humans should not have seven slightly different definitions of competency, persona, or passage. It should be one substrate, legible to every consumer the same way.
Guide: `meta-factory-prod` — the thin Vercel API host that exposes the OLD meta-factory engine over a frozen v1.x REST + MCP contract plus a build-time-bundled cross-portfolio library snapshot every product reads from.
Plan: Pin to a v1.x contract version → read snapshot, library, and passages via REST or MCP per `docs/CONSUMERS.md` → migrate explicitly on major bumps; the engine and write-side stay in OLD, the host stays thin.
Success: One library, one canonical-output spec, one passage search, one identity authority — and a contract changelog the consumer can read in a sitting.
Failure avoided: Per-product taxonomy drift, runtime coupling to engine internals, and the substrate sliding back toward the half-renovated state the OLD/PROD split was designed to fix.
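
The pin-then-migrate plan above can be sketched as a small compatibility check. This is a minimal illustration, assuming the contract version is a dotted string such as `1.5.0`; the helper names `parseVersion` and `isCompatible` are hypothetical and are not taken from `docs/CONSUMERS.md`.

```typescript
// Hypothetical consumer-side helper; not part of the published contract.
interface ContractVersion {
  major: number;
  minor: number;
}

// Parse "1.5.0" or "v1.5" into major/minor. The patch level is ignored.
function parseVersion(raw: string): ContractVersion {
  const match = /^v?(\d+)\.(\d+)/.exec(raw.trim());
  if (!match) {
    throw new Error(`Unrecognized contract version: ${raw}`);
  }
  return { major: Number(match[1]), minor: Number(match[2]) };
}

// A consumer pinned to 1.3 can read a host serving 1.5 (minors only add),
// but not 2.0 (major bump, explicit migration) or 1.2 (host older than the pin).
function isCompatible(pinned: string, served: string): boolean {
  const want = parseVersion(pinned);
  const have = parseVersion(served);
  return have.major === want.major && have.minor >= want.minor;
}

const checks = [
  isCompatible("1.3", "1.5.0"), // same major, newer minor
  isCompatible("1.3", "2.0.0"), // major bump
  isCompatible("1.5", "1.3.0"), // host behind the pin
];
console.log(checks); // [ true, false, false ]
```

A consumer could run a check like this at startup and fail fast on a major mismatch, rather than discovering a changed field at query time.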

The problem

Canonicalizing a corpus by hand means driving an LLM through a brittle, multi-step extraction for every document, and getting something slow, inconsistent, and unrepeatable. Anyone can write a prompt. The defensible thing is running thousands of multi-step extractions reliably, idempotently, and with lineage, at a cost you control. That is why 'factory' is the honest metaphor: a production line with QA and inventory control, not a clever prompt. The earlier 'meta-factory' framing tried to be every kind of factory at once (compensation, jobs, segmentation, business ideas); the narrowing decision was to own one job completely, canonicalization, and let the rest become downstream products.

What I built

The load-bearing pipeline. The **Book Factory** (collector → organizer → referee) takes PDF → text → chapter-respecting detection → deconstructed, tagged canonical summaries, BrandScripts, and factor models. The **Article Factory** (research_agent + measurement-ingest) takes peer-reviewed papers → instruments, constructs, citations, and effect data feeding Principia. Both run over a ~8,491-asset registry with SHA-256 provenance on every source file. The engine is model-agnostic through a single `llm.ts` router (Anthropic / OpenAI / Gemini), with idempotency guards, a durable processing index, and cost discipline at ~$0.13 per research-run at 30K-passage scale.

The **federation spine** is the schema-authority half: `measurement-core` (the canonical vocabulary every consumer conforms to), `library-core` (one work-identity), and `measurement-ingest` (publishes to Principia's registry). It makes the output not just data, but the version of record that downstream tools build against.

**Narrowed (2026):** compensation, job-family, segmentation, and HR-metrics folded into the People Analytics Toolbox; the prompt-driven business services (business ideas, requirements, personas, publishing) were packaged out as standalone products. What remains is the pure corpus→canonical engine.

**Now live (2026):** per-book factor models reconcile into a cross-book **total model** per domain cluster (canonical constructs with tier / coverage / provenance, reconciled relationships, and the *open divergences* the books genuinely disagree on), served as a **REST + MCP service** (`GET /api/v1/clusters/{id}/model` · `get_cluster_model`). This is the grounding spine for the **capability-guide pipeline** (corpus → model → guide) that powers bicycle.guide. Ingestion is **~2.5× faster** (dependency-level step-parallelism) and runs **30-book-concurrent on Modal**, off-machine and durable. Every capability ships **REST + MCP** per the portfolio service standard, so the engine is natively AI-controllable.
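
The idempotency guards and durable processing index mentioned above can be illustrated with a small sketch. It assumes a step is keyed by the SHA-256 of its source bytes plus the step name, and it uses an in-memory `Map` as a stand-in for the durable index. The names `stepKey` and `runOnce` are hypothetical, and the real engine's steps are presumably asynchronous LLM calls rather than the synchronous callback used here.

```typescript
import { createHash } from "node:crypto";

// Stand-in for the durable processing index. A real index would persist
// to disk or a database so that it survives restarts.
const processingIndex = new Map<string, string>();

// Key a step by the SHA-256 of the source bytes plus the step name, so the
// same source through the same step always maps to the same key.
function stepKey(source: Buffer, step: string): string {
  const sourceHash = createHash("sha256").update(source).digest("hex");
  return `${sourceHash}:${step}`;
}

// Run `work` only if this (source, step) pair has not completed before.
function runOnce(
  source: Buffer,
  step: string,
  work: () => string,
): { output: string; skipped: boolean } {
  const key = stepKey(source, step);
  const cached = processingIndex.get(key);
  if (cached !== undefined) {
    return { output: cached, skipped: true };
  }
  const output = work();
  processingIndex.set(key, output); // recorded only after the step succeeds
  return { output, skipped: false };
}

// Demo: the second call is served from the index, so the work runs once.
let calls = 0;
const sourceBytes = Buffer.from("chapter one text");
const summarize = (): string => {
  calls += 1;
  return "summary-v1";
};
const first = runOnce(sourceBytes, "summarize", summarize);
const second = runOnce(sourceBytes, "summarize", summarize);
console.log(first.skipped, second.skipped, calls); // false true 1
```

Keying on content rather than file name means a renamed file is not reprocessed, while an edited file is.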

What's novel

01. The moat is the production line, not the prompt: thousands of multi-step extractions run reliably, idempotently, and with full lineage, at a controlled cost. Multi-step LLM orchestration + process control + data management is the defensible asset; the prompts are replaceable.
02. Producer + schema authority: CanonicAI doesn't just emit datasets, it owns `measurement-core`, the canonical vocabulary and provenance contract every downstream tool conforms to. Constructs, items, instruments, and effect sizes are defined once in canonical form, so measurement compares cleanly across Principia / the toolbox / Performix instead of drifting per consumer.
03. One engine, many corpora: the same 'corpus in → deconstructed, tagged, canonical datasets out' line runs over books, peer-reviewed articles, and other domains. The corpus is provided (the library catalog is the source of truth); CanonicAI's job is canonicalization, not acquisition.
04. From books to a callable model of a field: per-book factor models reconcile into a cross-book total model (canonical constructs + relationships + the genuine disagreements), exposed as a live REST+MCP service that grounds capability guides. Corpus in → a queryable model of a domain out, not just summaries; consumers ground on it over HTTP/MCP, not files.
05. Chapter-respecting ingestion fidelity: books are deconstructed at real chapter boundaries (whitespace-insensitive locators, hybrid detection), not naively chunked, so the canonical summary, factor model, and per-chapter extracts stay faithful to the source's structure.
06. Cryptographic provenance contract: SHA-256 is tracked for every source file; safe-delete invariants require hash verification before any local delete. Every output traces to its source; the system cannot lose source material to a careless deletion.
07. ~$0.13 per research-run synthesis at 30K+ passage scale: most 'AI research' tools run 10–100× more expensive because they retrieve without pattern extraction. Extract patterns once, cite the evidence, don't re-retrieve.
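
The safe-delete invariant in item 06 can be sketched as follows. This is an illustration only: it assumes the invariant means "the local file's current SHA-256 must equal the hash on record before the file may be deleted". The function names `sha256Hex` and `safeDelete` are hypothetical, and the real engine may also check that a remote copy exists.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdtempSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function sha256Hex(bytes: Buffer): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Delete the local file only if its current hash matches the hash on record.
// Returns true when the file was deleted, false when the check refused.
function safeDelete(localPath: string, recordedHash: string): boolean {
  const actual = sha256Hex(readFileSync(localPath));
  if (actual !== recordedHash) {
    return false; // local bytes differ from the registered source: keep them
  }
  unlinkSync(localPath);
  return true;
}

// Demo against a throwaway temp directory.
const dir = mkdtempSync(join(tmpdir(), "safe-delete-"));
const verified = join(dir, "verified.pdf");
const drifted = join(dir, "drifted.pdf");
writeFileSync(verified, "original bytes");
writeFileSync(drifted, "edited bytes");
const recorded = sha256Hex(Buffer.from("original bytes"));

const deletedVerified = safeDelete(verified, recorded);
const deletedDrifted = safeDelete(drifted, recorded);
console.log(deletedVerified, existsSync(verified)); // true false
console.log(deletedDrifted, existsSync(drifted)); // false true
```

The design choice is that a mismatch is a refusal, not an error: a file that drifted from its registered hash is exactly the material that must not be lost.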

Recent ships

2026-05-18 **DP-161 + DP-163 (Phase 1):** MetaFactory Console v2 lift: 10 admin routes under app/admin/substrate/* + login, 57 shadcn primitives, console widgets (substrate-browser, record-detail, ingestion-jobs, integrity-dashboard, pathb-planner, drm-queue, cross-property-memberships), gold/Geist-Mono operator-console visual discipline (P233). SHA d5914a3.
2026-06-03 **DP-161 Phase 2 activation:** Supabase transaction pooler + Vercel DATABASE_URL + META_FACTORY_ADMIN_SECRET; 6 admin HTTP write routes live (smoke 7/7). Migration 002_metafactory_console_phase_2.sql on project meta-factory (gujosrsqmunzpuorkjaq).
2026-05-18 **DP-161 Phase 2 prep:** supabase/migrations/002_metafactory_console_phase_2.sql (10 tables) + 6 admin HTTP routes + scripts/mcp-meta-factory.ts v1.5.0 with 6 mirrored admin write tools; gated on DATABASE_URL so nothing 500s pre-activation. SHA 2e3917b.
2026-05-18 **MF-150:** chat-capture pipeline (Chrome extension → /api/capture/chat-turn → chat_turns_raw → /books). SHA e6492ed.
2026-05-18 **DP-162:** portfolio adapter corrected for structured consumes[] edges. SHA db65cca.
2026-05-18 **MF-012:** receiver-archive notification handoff (HRIS fold landed in toolbox; segmentation-studio receiver repo safe to archive).
2026-05-14 **MF-200:** research-discovery engine (problem-anchored scans). SHA 85b9fae.
2026-05-13 **Stream 8 / state-ui:** six accessibility wins: /state/fixes, /state/queues, filters, evidence pane, weekly cron. SHA fdca4bf.
2026-05-13 **MF-100 → MF-106:** Content-State Service v1: /state page + REST API, cloud-mirrored canonical_outputs + restore, quality validators (caught 350+ registry mismaps), reconciliation reports + registry consolidation (846 proposals), re-extraction workflow with budget gates.
2026-05-11 **MF-050:** chapter-level passages search baseline: 32,768 chunks indexed across 513 books, 99 ms/query, MCP search_passages tool v1.3.0. SHA 96d214a.
2026-05-09 **Phase 1C cloud-press:** REST + MCP host live at meta-factory-prod.vercel.app; library snapshot + Supabase Storage content; documented in docs/handoff/2026-05-09-phase-1c-shipped.md.

In progress

· MetaFactory Console v2 Phase 3: wrap the OLD engine MCP for ingestion orchestration through the same console (queued behind Phase 2).
· PA-022 Path B curate batch: title-match plan output landed (4 actionable, 23 NOT_FOUND, 17 MISSING_TEXT); awaiting Mike's budget-reset decision on the 68-book aggregate-fixable rerun (~$68 actual vs $25 authorized).
· PAT-47 substrate-first disposition: producer half dropped per 2026-05-18 stash disposition; substrate stays canonical, consumer half tracked in toolbox.
· MF-031 UI/UX leverage survey: pills production-ready (0.5d lift), player 2-3d, kanban 1-3d; rollout plan staged in docs/ui/MF-031-leverage-survey.md.
· PA-SPEC §5 alignment ask: off-session ask to PA-site; unblocks every Phase 2 canonical_id join.

Packageable components

Component	Stage	Reuse
Cross-portfolio library snapshot `lib/library/data/library.snapshot.json`	production	Consumed by PA-site, vela, principia, Performix, DevPlane, Fourth & Two via REST + MCP (Stream 7).
Asset registry snapshot `lib/v1/data/asset-registry.snapshot.json`	production	Bundled read surface for the engine's 5,013-entry registry.
MCP server scaffold `scripts/mcp-meta-factory.ts`	production (v1.5.0)	Reference implementation for portfolio MCP servers — 13 read tools + 6 admin write tools + job-family-agent suite.
Operator-console visual discipline (P233) `app/admin/substrate/`, `components/ui/`, `components/console-layout.tsx`, `app/globals.css`	early-build	Gold accent + green-only-status + Geist Mono for code-shaped strings; inherit across new admin routes per `AGENTS.md`.
Admin write-surface scaffold `app/api/v1/admin/{jobs,records/[id]/{tags,overlay,memberships,lifecycle},remediation}/route.ts` + `lib/db/client.ts`	early-build	`dbNotProvisionedError()` gating pattern — durable-write routes ship dark and activate on env.
Library importer `scripts/import-library-from-pa-site.ts`	production	Snapshot + cloud-storage refresh seam between OLD engine outputs and PROD host.
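
The "ship dark, activate on env" gating pattern listed in the table can be sketched as below. The real signature and return shape of `dbNotProvisionedError()` in `lib/db/client.ts` are not shown in this document, so the `RouteResult` type, the 503 status, the error body, and the `gated` wrapper are all assumptions made for illustration.

```typescript
// Hypothetical shapes: the real dbNotProvisionedError() may differ.
interface RouteResult {
  status: number;
  body: Record<string, unknown>;
}

type Env = Record<string, string | undefined>;

// Assumed refusal shape; the 503 status is an illustrative choice.
function dbNotProvisionedError(): RouteResult {
  return {
    status: 503,
    body: { error: "db_not_provisioned", detail: "DATABASE_URL is not set" },
  };
}

// Wrap a write handler so it stays dark until DATABASE_URL is set.
function gated(
  env: Env,
  handler: (databaseUrl: string) => RouteResult,
): RouteResult {
  const databaseUrl = env["DATABASE_URL"];
  if (!databaseUrl) {
    return dbNotProvisionedError(); // a clean refusal instead of a 500
  }
  return handler(databaseUrl);
}

// Placeholder write handler; a real one would open a connection and insert.
const createJob = (_databaseUrl: string): RouteResult => ({
  status: 201,
  body: { ok: true },
});

const dark = gated({}, createJob);
const live = gated({ DATABASE_URL: "postgres://example.invalid/db" }, createJob);
console.log(dark.status, live.status); // 503 201
```

Passing the environment in as an argument, rather than reading `process.env` inside the wrapper, keeps the gate testable without touching real configuration.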

Architecture

`meta-factory-prod` is a thin API host on Vercel over a build-time-bundled library snapshot, with the engine and write-side living in OLD `people-analyst/meta-factory` (MF-DEC-1 settled). The contract is frozen at v1.x and additive-only (`docs/API-CONTRACT.md` + `CONTRACT-CHANGELOG.md`), so consumers pin a version and migrate explicitly on majors. Phase 1C shipped 2026-05-09; Phase 2 — durable operational DB on Neon plus the MetaFactory Console write surface — is staged behind a `DATABASE_URL` gate so nothing 500s pre-activation, and Phase 3 wraps the OLD engine MCP for ingestion orchestration through the same console. The seam between OLD and PROD is a manual snapshot + Supabase-Storage refresh; the two repos evolve independently.
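
What "additive-only" means for a frozen v1.x contract can be made concrete with a small check. This is a sketch under simplifying assumptions: a response schema is reduced to a flat map of field name to type name, and the field names below (`canonical_id`, `title`, `tier`) are illustrative rather than taken from `docs/API-CONTRACT.md`. The function `breakingChanges` is hypothetical.

```typescript
// Field name -> type name: a deliberately simplified stand-in for a schema.
type Shape = Record<string, string>;

// A change is additive when every existing field survives with the same type.
// New fields are allowed; removals and type changes are breaking.
function breakingChanges(previous: Shape, next: Shape): string[] {
  const problems: string[] = [];
  for (const [field, type] of Object.entries(previous)) {
    if (!(field in next)) {
      problems.push(`removed: ${field}`);
    } else if (next[field] !== type) {
      problems.push(`retyped: ${field} (${type} -> ${next[field]})`);
    }
  }
  return problems;
}

// Illustrative shapes only.
const shapeBefore: Shape = { canonical_id: "string", title: "string" };
const shapeAdditive: Shape = { canonical_id: "string", title: "string", tier: "string" };
const shapeBreaking: Shape = { canonical_id: "number", tier: "string" };

console.log(breakingChanges(shapeBefore, shapeAdditive)); // []
console.log(breakingChanges(shapeBefore, shapeBreaking));
// [ 'retyped: canonical_id (string -> number)', 'removed: title' ]
```

Under this reading, an empty result is what allows a minor bump, and any non-empty result would call for a major version and an explicit consumer migration.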

Outcome

Renamed **CanonicAI** (2026-06-10); public identity is live at canonicai.com, while the engine itself stays private. Scope narrowed in 2026 to the corpus→canonical engine: compensation / job-family / segmentation / HR-metrics capabilities folded into the People Analytics Toolbox, and the prompt-driven business services were packaged out as standalone products. What remains is the Book Factory, the Article Factory, and the federation spine (`measurement-core` / `library-core` / `measurement-ingest`).

The asset registry stands at ~8,491 entries across six domains (research, books, onet, hr_metrics, competency, bls) with SHA-256 provenance. The engine is model-agnostic via a single `llm.ts` router, with cost discipline at ~$0.13 per research-run. It is consumed by Principia, the toolbox, and peopleanalyst.com. The everything-factory state is behind us; the engine does one job, completely.

The cross-book **total model is now a live REST+MCP service** (`get_cluster_model`; first cluster *start-a-company* = 33 books → 10 core founder constructs), the **capability-guide pipeline** (corpus → model → guide) is proven end-to-end and grounds **bicycle.guide**, and ingestion runs ~2.5× faster / 30-wide on Modal.

CanonicAI started as 'meta-factory' — a good internal codename that helped think architecturally, and a repo that tried to be every kind of factory at once: books, research, competencies, personas, jobs, compensation, segmentation, business ideas. The maturing move was the narrowing. The defensible thing was never 'an AI that runs prompts' — anyone can write a prompt. It was running thousands of multi-step extractions reliably, idempotently, and with lineage, at a cost you control: a production line with QA and inventory control. So the scope collapsed to one job done completely — turn a provided corpus into canonical, queryable datasets, and own the schema + provenance the rest of the family builds on. Everything that did something *with* those outputs (compensation tools, job models, segmentation, business services) moved downstream to the products that monetize them. What's left is the producer and source-of-truth: corpus in, canonical data out.