CanonicAI

CanonicAI workflow — provided corpus (books, papers, domain documents) → chapter-respecting ingestion → ~8,491-entry asset registry with SHA-256 provenance → canonical, tagged datasets → consumed by Principia, the People Analytics Toolbox, and peopleanalyst.com. (Illustration: the engine has no UI; the diagram stands in for a screenshot — note: SVG predates the 2026 narrowing and should be redrawn.)


Corpus in, canonical data out. CanonicAI is the engine that turns a provided corpus — books, papers, domain documents — into canonical, queryable datasets, and owns the canonical schema + provenance the rest of the PeopleAnalyst family builds on. A production line, not a prompt.

Microstory
Customer
Portfolio properties — PA-site, vela, principia, Performix, DevPlane, Fourth & Two, anycomp, segmentation-studio — that need the same library, the same canonical IDs, and the same passages queryable the same way.
Problem · external
Each consumer re-implements retrieval, identity, and integrity against the substrate, with definitions drifting per product.
Problem · internal
Every cross-portfolio question feels like a week of plumbing before the actual question gets answered.
Problem · philosophical
Cross-portfolio infrastructure that ultimately speaks to humans about humans should not have seven slightly-different definitions of competency, persona, or passage — it should be one substrate, legible to every consumer the same way.
Guide
`meta-factory-prod` — the thin Vercel API host that exposes the OLD meta-factory engine over a frozen v1.x REST + MCP contract plus a build-time-bundled cross-portfolio library snapshot every product reads from.
Plan
Pin to a v1.x contract version → read snapshot, library, and passages via REST or MCP per `docs/CONSUMERS.md` → migrate explicitly on major bumps; the engine and write-side stay in OLD, the host stays thin.
Success
One library, one canonical-output spec, one passage search, one identity authority — and a contract changelog the consumer can read in a sitting.
Failure avoided
Per-product taxonomy drift, runtime coupling to engine internals, and the substrate sliding back toward the half-renovated state the OLD/PROD split was designed to fix.
The problem

Doing this by hand means driving an LLM through a brittle, multi-step extraction for every document — and getting something slow, inconsistent, and unrepeatable. Anyone can write a prompt. The defensible thing is running thousands of multi-step extractions reliably, idempotently, and with lineage — at a cost you control. That is why 'factory' is the honest metaphor: a production line with QA and inventory control, not a clever prompt. The earlier 'meta-factory' framing tried to be every kind of factory at once (compensation, jobs, segmentation, business ideas); the narrowing decision was to own one job completely — canonicalization — and let the rest become downstream products.

What I built

The load-bearing pipeline. The **Book Factory** (collector → organizer → referee: PDF → text → chapter-respecting detection → deconstructed, tagged canonical summaries, BrandScripts, and factor models) and the **Article Factory** (research_agent + measurement-ingest: peer-reviewed papers → instruments, constructs, citations, effect data feeding Principia), over a ~8,491-asset registry with SHA-256 provenance on every source file. Model-agnostic through a single `llm.ts` router (Anthropic / OpenAI / Gemini), with idempotency guards, a durable processing index, and cost discipline at ~$0.13 per research-run at 30K-passage scale. The **federation spine** — `measurement-core` (the canonical vocabulary every consumer conforms to), `library-core` (one work-identity), `measurement-ingest` (publishes to Principia's registry) — is the schema-authority half: not just data out, but the version-of-record downstream tools build against. **Narrowed (2026):** compensation, job-family, segmentation, and HR-metrics folded into the People Analytics Toolbox; the prompt-driven business services (business ideas, requirements, personas, publishing) packaged out as standalone products. What remains is the pure corpus→canonical engine.
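
The idempotency guard and durable processing index mentioned above can be illustrated with a minimal sketch. It assumes each extraction step is keyed by the source file's SHA-256 plus a step name and prompt version; `stepKey`, `ProcessingIndex`, and `runOnce` are hypothetical names, and the in-memory map only stands in for whatever durable store the engine actually uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical names throughout: stepKey, ProcessingIndex, and runOnce are
// illustrative and are not taken from the CanonicAI codebase.

/** SHA-256 of a source's bytes, hex-encoded. */
function sha256Hex(data: string | Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

/** Key for one extraction step over one exact version of a source. */
function stepKey(sourceHash: string, step: string, promptVersion: string): string {
  return `${sourceHash}:${step}:${promptVersion}`;
}

/** In-memory stand-in for a durable processing index (assumption). */
class ProcessingIndex<T> {
  private readonly done = new Map<string, T>();

  get(key: string): T | undefined {
    return this.done.get(key);
  }

  record(key: string, result: T): void {
    this.done.set(key, result);
  }
}

/** Runs `work` only if this key has no recorded result; otherwise reuses it. */
function runOnce<T>(
  index: ProcessingIndex<T>,
  key: string,
  work: () => T,
): { result: T; skipped: boolean } {
  const cached = index.get(key);
  if (cached !== undefined) {
    return { result: cached, skipped: true };
  }
  const result = work();
  index.record(key, result);
  return { result, skipped: false };
}
```

Under these assumptions, re-running a batch costs nothing for sources whose bytes and prompt version are unchanged, while an edited source or a revised prompt produces a new key and is processed again.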

What's novel
  • 01 The moat is the production line, not the prompt: thousands of multi-step extractions run reliably, idempotently, and with full lineage, at a controlled cost. Multi-step LLM orchestration + process control + data management is the defensible asset; the prompts are replaceable.
  • 02 Producer + schema authority: CanonicAI doesn't just emit datasets, it owns `measurement-core`, the canonical vocabulary and provenance contract every downstream tool conforms to. Constructs, items, instruments, and effect-sizes are defined once in canonical form, so measurement compares cleanly across Principia / the toolbox / Performix instead of drifting per consumer.
  • 03 One engine, many corpora: the same 'corpus in → deconstructed, tagged, canonical datasets out' line runs over books, peer-reviewed articles, and other domains. The corpus is provided (the library catalog is the source of truth); CanonicAI's job is canonicalization, not acquisition.
  • 04 Chapter-respecting ingestion fidelity: books are deconstructed at real chapter boundaries (whitespace-insensitive locators, hybrid detection), not naively chunked, so the canonical summary, factor model, and per-chapter extracts stay faithful to the source's structure.
  • 05 Cryptographic provenance contract: SHA-256 tracked for every source file; safe-delete invariants require hash verification before any local delete. Every output traces to its source; the system cannot lose source material to a careless deletion.
  • 06 ~$0.13 per research-run synthesis at 30K+ passage scale: most 'AI research' tools run 10–100× more expensive because they retrieve without pattern extraction. Extract patterns once, cite the evidence, don't re-retrieve.
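
The safe-delete invariant in item 05 could look roughly like the sketch below. The source says only that hash verification precedes any local delete; the specific checks here (the local file must still match the registry hash, and an archived copy must match it too) are an assumption, and `RegistryEntry`, `fileSha256`, and `safeDelete` are hypothetical names.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdtempSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical shape: the real registry schema is not shown in the source.
interface RegistryEntry {
  path: string;   // local path of the source file
  sha256: string; // hex digest recorded at ingestion
}

/** SHA-256 of a file's current bytes, hex-encoded. */
function fileSha256(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

/**
 * Deletes the local file only when both checks pass (assumed invariant):
 * the local bytes still match the registry, and the archived copy does too.
 */
function safeDelete(entry: RegistryEntry, archivedSha256: string): void {
  if (fileSha256(entry.path) !== entry.sha256) {
    throw new Error(`local file drifted from registry: ${entry.path}`);
  }
  if (archivedSha256 !== entry.sha256) {
    throw new Error(`archived copy does not match registry: ${entry.path}`);
  }
  unlinkSync(entry.path);
}

// Usage against a throwaway temp file.
const sourcePath = join(mkdtempSync(join(tmpdir(), "safe-delete-")), "source.txt");
writeFileSync(sourcePath, "chapter one");
const entry: RegistryEntry = { path: sourcePath, sha256: fileSha256(sourcePath) };

// A mismatched archive hash blocks the delete and leaves the file in place.
let blocked = false;
try {
  safeDelete(entry, "0".repeat(64));
} catch {
  blocked = true;
}
const survivedMismatch = existsSync(sourcePath);

// A verified archive hash allows the delete.
safeDelete(entry, entry.sha256);
const deletedAfterVerify = !existsSync(sourcePath);
```

The point of the design is that deletion is the last step and the only destructive one, so any failed comparison leaves the source untouched.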
Recent ships
  1. 2026-05-18 **DP-161 + DP-163 (Phase 1):** MetaFactory Console v2 lift: 10 admin routes under app/admin/substrate/* + login, 57 shadcn primitives, console widgets (substrate-browser, record-detail, ingestion-jobs, integrity-dashboard, pathb-planner, drm-queue, cross-property-memberships), gold/Geist-Mono operator-console visual discipline (P233). SHA d5914a3.
  2. 2026-06-03 **DP-161 Phase 2 activation:** Supabase transaction pooler + Vercel DATABASE_URL + META_FACTORY_ADMIN_SECRET; 6 admin HTTP write routes live (smoke 7/7). Migration 002_metafactory_console_phase_2.sql on project meta-factory (gujosrsqmunzpuorkjaq).
  3. 2026-05-18 **DP-161 Phase 2 prep:** supabase/migrations/002_metafactory_console_phase_2.sql (10 tables) + 6 admin HTTP routes + scripts/mcp-meta-factory.ts v1.5.0 with 6 mirrored admin write tools; gated on DATABASE_URL so nothing 500s pre-activation. SHA 2e3917b.
  4. 2026-05-18 **MF-150:** chat-capture pipeline (Chrome extension → /api/capture/chat-turn → chat_turns_raw/books). SHA e6492ed.
  5. 2026-05-18 **DP-162:** portfolio adapter corrected for structured consumes[] edges. SHA db65cca.
  6. 2026-05-18 **MF-012:** receiver-archive notification handoff (HRIS fold landed in toolbox; segmentation-studio receiver repo safe to archive).
  7. 2026-05-14 **MF-200:** research-discovery engine (problem-anchored scans). SHA 85b9fae.
  8. 2026-05-13 **Stream 8 / state-ui:** six accessibility wins: /state/fixes, /state/queues, filters, evidence pane, weekly cron. SHA fdca4bf.
  9. 2026-05-13 **MF-100 → MF-106:** Content-State Service v1: /state page + REST API, cloud-mirrored canonical_outputs + restore, quality validators (caught 350+ registry mismaps), reconciliation reports + registry consolidation (846 proposals), re-extraction workflow with budget gates.
  10. 2026-05-11 **MF-050:** chapter-level passages search baseline: 32,768 chunks indexed across 513 books, 99 ms/query, MCP search_passages tool v1.3.0. SHA 96d214a.
  11. 2026-05-09 **Phase 1C cloud-press:** REST + MCP host live at meta-factory-prod.vercel.app; library snapshot + Supabase Storage content; documented in docs/handoff/2026-05-09-phase-1c-shipped.md.
In progress
  • MetaFactory Console v2 Phase 3: wrap the OLD engine MCP for ingestion orchestration through the same console (queued behind Phase 2).
  • PA-022 Path B curate batch: title-match plan output landed (4 actionable, 23 NOT_FOUND, 17 MISSING_TEXT); awaiting Mike's budget-reset decision on the 68-book aggregate-fixable rerun (~$68 actual vs $25 authorized).
  • PAT-47 substrate-first disposition: producer half dropped per 2026-05-18 stash disposition; substrate stays canonical, consumer half tracked in toolbox.
  • MF-031 UI/UX leverage survey: pills production-ready (0.5d lift), player 2-3d, kanban 1-3d; rollout plan staged in docs/ui/MF-031-leverage-survey.md.
  • PA-SPEC §5 alignment ask: off-session ask to PA-site; unblocks every Phase 2 canonical_id join.
Packageable components
| Component | Path | Stage | Reuse |
| --- | --- | --- | --- |
| Cross-portfolio library snapshot | `lib/library/data/library.snapshot.json` | production | Consumed by PA-site, vela, principia, Performix, DevPlane, Fourth & Two via REST + MCP (Stream 7). |
| Asset registry snapshot | `lib/v1/data/asset-registry.snapshot.json` | production | Bundled read surface for the engine's 5,013-entry registry. |
| MCP server scaffold | `scripts/mcp-meta-factory.ts` | production (v1.5.0) | Reference implementation for portfolio MCP servers: 13 read tools + 6 admin write tools + job-family-agent suite. |
| Operator-console visual discipline (P233) | `app/admin/substrate/*`, `components/ui/*`, `components/console-layout.tsx`, `app/globals.css` | early-build | Gold accent + green-only-status + Geist Mono for code-shaped strings; inherit across new admin routes per AGENTS.md. |
| Admin write-surface scaffold | `app/api/v1/admin/{jobs,records/[id]/{tags,overlay,memberships,lifecycle},remediation}/route.ts` + `lib/db/client.ts` | early-build | `dbNotProvisionedError()` gating pattern: durable-write routes ship dark and activate on env. |
| Library importer | `scripts/import-library-from-pa-site.ts` | production | Snapshot + cloud-storage refresh seam between OLD engine outputs and PROD host. |
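
The "ship dark, activate on env" gating pattern named in the admin write-surface row can be sketched as follows. Only the `dbNotProvisionedError()` name and the `DATABASE_URL` gate come from the text; the return shape, the 503 status, and the `withDatabase` and `createJob` helpers are assumptions chosen to keep the example framework-neutral.

```typescript
// Hypothetical sketch: the real helper in lib/db/client.ts is not shown in the
// source, so the shapes and the 503 status below are assumptions.

interface HttpResult {
  status: number;
  body: Record<string, unknown>;
}

type Env = Record<string, string | undefined>;

/** Structured "not activated yet" answer, instead of an unhandled 500. */
function dbNotProvisionedError(): HttpResult {
  return {
    status: 503,
    body: {
      error: "db_not_provisioned",
      message: "DATABASE_URL is not set; durable-write routes are inactive.",
    },
  };
}

/** Runs the handler only when the database is configured in the environment. */
function withDatabase(env: Env, handler: (databaseUrl: string) => HttpResult): HttpResult {
  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) {
    return dbNotProvisionedError();
  }
  return handler(databaseUrl);
}

/** Example write route (hypothetical): accepted only once the env is set. */
function createJob(env: Env, payload: { kind: string }): HttpResult {
  return withDatabase(env, () => ({
    status: 201,
    body: { accepted: payload.kind },
  }));
}
```

With this shape a route can be deployed before the database exists: callers get a predictable, documented response until the environment variable is set, and no code change is needed to activate it.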
Architecture

`meta-factory-prod` is a thin API host on Vercel over a build-time-bundled library snapshot, with the engine and write-side living in OLD `people-analyst/meta-factory` (MF-DEC-1 settled). The contract is frozen at v1.x and additive-only (`docs/API-CONTRACT.md` + `CONTRACT-CHANGELOG.md`), so consumers pin a version and migrate explicitly on majors. Phase 1C shipped 2026-05-09; Phase 2 — durable operational DB on Neon plus the MetaFactory Console write surface — is staged behind a `DATABASE_URL` gate so nothing 500s pre-activation, and Phase 3 wraps the OLD engine MCP for ingestion orchestration through the same console. The seam between OLD and PROD is a manual snapshot + Supabase-Storage refresh; the two repos evolve independently.
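
Because the contract is additive-only within v1.x, a consumer's pin check can be as small as the sketch below: same major version, and a served minor at least as new as the pinned one. `parseContractVersion` and `isCompatible` are hypothetical helpers, and how a consumer learns the served version is deliberately left out since the source does not describe it.

```typescript
// Hypothetical helpers; not part of the documented contract.

interface ContractVersion {
  major: number;
  minor: number;
}

/** Parses "1.5", "1.5.0", or "v1.5.0" into major and minor numbers. */
function parseContractVersion(text: string): ContractVersion {
  const match = /^v?(\d+)\.(\d+)(?:\.\d+)?$/.exec(text.trim());
  if (!match) {
    throw new Error(`unrecognized contract version: ${text}`);
  }
  return { major: Number(match[1]), minor: Number(match[2]) };
}

/**
 * Additive-only rule: a served version satisfies a pin when the major matches
 * and the served minor is at least the pinned minor.
 */
function isCompatible(pinned: string, served: string): boolean {
  const pin = parseContractVersion(pinned);
  const host = parseContractVersion(served);
  return host.major === pin.major && host.minor >= pin.minor;
}
```

For example, a consumer pinned to `1.3` would accept a host serving `1.5.0`, while a host serving `2.0.0` would fail the check and signal that an explicit migration is due.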

Outcome

Private; being renamed **CanonicAI** (canonicai.com) for public-facing identity. Scope narrowed in 2026 to the corpus→canonical engine: compensation / job-family / segmentation / HR-metrics capabilities folded into the People Analytics Toolbox, and the prompt-driven business services packaged out as standalone products — leaving the Book Factory, the Article Factory, and the federation spine (`measurement-core` / `library-core` / `measurement-ingest`). Asset registry at ~8,491 entries across six domains (research, books, onet, hr_metrics, competency, bls) with SHA-256 provenance. Model-agnostic via a single `llm.ts` router; cost discipline at ~$0.13 per research-run. Consumed by Principia, the toolbox, and peopleanalyst.com. The everything-factory state is behind us; the engine does one job, completely.

CanonicAI started as 'meta-factory' — a good internal codename that helped think architecturally, and a repo that tried to be every kind of factory at once: books, research, competencies, personas, jobs, compensation, segmentation, business ideas. The maturing move was the narrowing. The defensible thing was never 'an AI that runs prompts' — anyone can write a prompt. It was running thousands of multi-step extractions reliably, idempotently, and with lineage, at a cost you control: a production line with QA and inventory control. So the scope collapsed to one job done completely — turn a provided corpus into canonical, queryable datasets, and own the schema + provenance the rest of the family builds on. Everything that did something *with* those outputs (compensation tools, job models, segmentation, business services) moved downstream to the products that monetize them. What's left is the producer and source-of-truth: corpus in, canonical data out.