8-command pipeline · Claude Code · v0.9.1
Multi-agent review at every gate. And then the system measures whether those reviews are worth anything.
A structured planning, multi-agent review, and execution-discipline pipeline for
Claude Code.
Ships 40 reviewers — 29 always-available pipeline personas, 9 domain personas, plus 2 focused Claude Code subagents.
Each phase calls only the slice it needs. One /wrap at the end compiles what you learned into durable memory.
Or jump to the pipeline ↓
Why this exists. I've always been a builder, so when none of the harnesses I could find had the self-learning loops I wanted — a fully self-improving harness that adapts to how the user actually leverages the tool — I built one. The 5 multi-agent gates give the leverage. The judging is what gives me the trust.
As one of my friends recently confessed: "I don't ship code anymore, I ship outcomes."
— Justin · MIT-licensed, genuinely experimental · Read the full note →
Five gates between an idea and shipped code. Specialist agents run in parallel at each one — their findings judged, deduplicated, and synthesized before the next phase starts. After the first /spec you can jump to /autorun if you want less control or want to run things overnight. Otherwise, step through the process gate by gate and evaluate the plan as you go.
flowchart TD
K["/kickoff\nconstruction + agent roster"]:::setup
S["/spec\nQ&A · confidence-tracked"]:::define
SR["/spec-review\nrequirements · gaps · ambiguity\nfeasibility · scope · stakeholders"]:::review
JS1["Judge · Dedupe · Synth\ncluster · attribute · compose → review.md"]:::synth
P["/plan\napi · data-model · ux · scalability\nsecurity · integration · wave-sequencer"]:::plan
JS2["Judge · Dedupe · Synth → plan.md"]:::synth
C["/check\ncompleteness · sequencing · risk\nscope-discipline · testability"]:::gate
JS3["Judge · Dedupe · Synth → check.md\nthree-tier verdict (v0.9.1+)"]:::synth
V1["GO\nclean pass"]:::verdictGo
V2["GO_WITH_FIXES\nwarn → followups.jsonl\n(common case in permissive mode)"]:::verdictWarn
V3["NO_GO\narchitectural · security · unclassified\nhalt"]:::verdictBlock
B["/build\nparallel execute\n(consumes followups wave 1)"]:::execute
W["/wrap"]:::wrap
K --> S --> SR --> JS1 --> P --> JS2 --> C --> JS3
JS3 --> V1 --> B
JS3 --> V2 --> B
JS3 --> V3
B --> W
SP["Superpowers\nTDD · verification"]:::side
CX["Codex\nadversarial review"]:::accent
KL["Knowledge layer\ngraphify · wiki"]:::side
PM["Persona Metrics\nload-bearing · silent · survival rates"]:::metrics
SP -.-> B
CX -.-> SR
CX -.-> C
CX -.-> B
W -. compiles .-> KL
JS1 ==records==> PM
JS2 ==> PM
JS3 ==> PM
W ==surfaces drift==> PM
KL -. "wiki-query · graphify\nauto-memory" .-> S
PM -. "drift informs\nroster decisions" .-> K
%% Codex edges (links 4,5,6 = CX→SR, CX→C, CX→B) are visually softened
%% so they read as ambient adversarial reviews without competing with
%% the main forward flow. Same intent as the other dashed-orange edges,
%% just lower-opacity stroke so overlapping reads as a wash, not a clash.
linkStyle 4,5,6 stroke:#92400e,stroke-width:1.5px,stroke-dasharray:4 3,opacity:0.55
classDef setup fill:#1e3a5f,stroke:#7c9cff,color:#bfdbfe,stroke-width:2px
classDef define fill:#0f4c4c,stroke:#5eead4,color:#99f6e4,stroke-width:2px
classDef review fill:#7c2d12,stroke:#fdba74,color:#fed7aa,stroke-width:2px
classDef plan fill:#3b1f7a,stroke:#c4b5fd,color:#ede9fe,stroke-width:2px
classDef gate fill:#881337,stroke:#fda4af,color:#ffe4e6,stroke-width:2px
classDef execute fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
classDef wrap fill:#27272a,stroke:#a1a1aa,color:#e4e4e7,stroke-width:2px
classDef synth fill:#0c3a5f,stroke:#7dd3fc,color:#bae6fd,stroke-width:2px
classDef side fill:#1e293b,stroke:#64748b,color:#94a3b8,stroke-width:2px,stroke-dasharray:4 3
classDef accent fill:#451a03,stroke:#fde68a,color:#fde68a,stroke-width:2px,stroke-dasharray:4 3
classDef metrics fill:#2d1b69,stroke:#a78bfa,color:#e9d5ff,stroke-width:3px
classDef verdictGo fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
classDef verdictWarn fill:#713f12,stroke:#fde047,color:#fef9c3,stroke-width:2px
classDef verdictBlock fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2,stroke-width:2px
Three-tier verdict at /check (v0.9.1+). The pipeline-gate-permissiveness change replaces the old binary halt-or-pass behavior with three verdicts: GO (clean pass), GO_WITH_FIXES (non-architectural findings warn-routed to followups.jsonl and consumed by /build wave 1 — the common case in permissive mode), and NO_GO (architectural, security, or unclassified findings halt the pipeline). The same verdict shape applies at /spec-review and /plan; /check remains the last gate before code.
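The routing above can be sketched in a few lines. This is a hedged model of the three-tier logic, not the pipeline's actual implementation — the finding shape (`{"class": ..., "note": ...}`) is an assumption:

```python
import json

# Classes that halt the pipeline, per the NO_GO description above.
BLOCKING = {"architectural", "security", "unclassified"}

def route_verdict(findings):
    """Return (verdict, followups) for a list of finding dicts."""
    if any(f["class"] in BLOCKING for f in findings):
        return "NO_GO", []                 # halt: blocking classes never warn
    if findings:
        return "GO_WITH_FIXES", findings   # warn-route to followups.jsonl
    return "GO", []                        # clean pass

verdict, followups = route_verdict([{"class": "test-quality", "note": "flaky spec"}])
# followups.jsonl is one JSON object per line, consumed by /build wave 1:
followups_jsonl = "\n".join(json.dumps(f) for f in followups)
```

The point of the sketch: GO_WITH_FIXES is not a softer NO_GO — blocking classes are checked first, so no security or architectural finding can ever be warn-routed.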
Each gate runs specialist agents in parallel; a Judge then clusters their raw output, attributes each finding to its source, and composes the synthesized artifact that drives the next phase. spec.md captures intent through Q&A, then feeds 6 PRD reviewers → Judge → review.md; that feeds 7 design agents → Judge → plan.md; that feeds 5 plan validators → Judge → check.md. /build executes against the plan with TDD and verification discipline. /wrap distills what changed back into memory and the wiki. Small work skips the gates it doesn't need — same flow scales from a typo fix to a V2.
Codex (optional). If installed, an OpenAI Codex agent runs an adversarial pass at /spec-review, /check, and /build — independent perspective from a different model family, joining the same Judge synthesis as the in-house personas. Silent skip if not set up.
Every gate quietly records which findings shaped the next phase. /wrap Phase 1c renders the drift table and emits triage candidates for any persona that shifted meaningfully — closing the loop into Phase 2.
flowchart LR
R["Roster
28 personas + Codex"]:::roster
G["3 multi-agent gates
/spec-review · /plan · /check"]:::gate
E["findings.jsonl
participation.jsonl
survival.jsonl"]:::data
W["/wrap Phase 1c
10-feature window
drift → [TRIAGE]"]:::metrics
H["Human reads drift
roster judgment"]:::human
R --> G
G ==records==> E
E ==> W
W ==surfaces==> H
H -. roster edit .-> R
classDef roster fill:#1e293b,stroke:#64748b,color:#94a3b8,stroke-width:2px
classDef gate fill:#881337,stroke:#fda4af,color:#ffe4e6,stroke-width:2px
classDef data fill:#0c3a5f,stroke:#7dd3fc,color:#bae6fd,stroke-width:2px
classDef metrics fill:#2d1b69,stroke:#a78bfa,color:#e9d5ff,stroke-width:3px
classDef human fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
Each Judge writes findings.jsonl (what each persona raised) and participation.jsonl (who ran). When you revise the spec or plan, a survival classifier compares pre- and post-artifacts and labels each finding addressed / not_addressed / rejected. /wrap Phase 1c rolls a 10-feature window into per-persona stats — load-bearing rate, silent rate, survival — renders drift (e.g. ↑ a11y 4% → 18%, ↓ test-quality 22% → 9%), and emits [TRIAGE MEMORY] lines for any persona that shifted ≥ 5pp. Those triage candidates flow into Phase 2's approval gate — roster edits can be written in the same session.
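The drift math reduces to rate deltas over the window. A minimal sketch, assuming a flat record shape that the real findings.jsonl / participation.jsonl files may not share:

```python
from collections import defaultdict

def load_bearing_rates(records):
    """records: [{'persona': 'a11y', 'load_bearing': True}, ...] for one window."""
    ran, hit = defaultdict(int), defaultdict(int)
    for r in records:
        ran[r["persona"]] += 1
        hit[r["persona"]] += r["load_bearing"]  # bool counts as 0/1
    return {p: hit[p] / ran[p] for p in ran}

def triage_candidates(prev, curr, threshold=0.05):
    """[TRIAGE MEMORY] lines for any persona whose rate shifted >= 5pp."""
    lines = []
    for persona, rate in curr.items():
        delta = rate - prev.get(persona, 0.0)
        if abs(delta) >= threshold:
            arrow = "↑" if delta > 0 else "↓"
            lines.append(f"[TRIAGE MEMORY] {persona} {arrow} "
                         f"{prev.get(persona, 0.0):.0%} → {rate:.0%}")
    return lines

# Mirrors the example drift above: ↑ a11y 4% → 18%
print(triage_candidates({"a11y": 0.04}, {"a11y": 0.18}))
```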
/wrap doesn't just end a session — it compiles what you learned into stores that the next session reads from. Every /spec and /kickoff starts smarter than the last.
flowchart LR
subgraph SN["Session N"]
direction TB
W["/wrap
distill · capture · index"]:::wrap
end
W --> G[("graphify graph
code structure ·
god nodes")]:::store
W --> WIKI[("Obsidian wiki
distilled
knowledge pages")]:::store
W --> MEM[("CLAUDE.md
+ auto-memory
preferences ·
decisions")]:::store
W --> RAW[("_raw/
cheap
captures")]:::store
RAW -. wiki-ingest at next /wrap .-> WIKI
subgraph SN1["Session N+1, N+2, ..."]
direction TB
S["/spec · /kickoff
starts with full prior context"]:::define
end
G -. /graphify query .-> S
WIKI -. wiki-query .-> S
MEM -. auto-loaded at session start .-> S
classDef wrap fill:#3f3f46,stroke:#d4d4d8,color:#fff
classDef define fill:#0f766e,stroke:#5eead4,color:#fff
classDef store fill:#1e293b,stroke:#7c9cff,color:#e7e9ee
Compile, don't retrieve. Capture is cheap during the session ("capture this: X" → _raw/). Distillation happens once at /wrap. Reads at the start of the next session are free — the wiki is already structured, the graph is already built, memory is already loaded.
Measured weekly. Graph-driven queries on real codebases land ~10–20× fewer tokens than full-corpus reads — 14.2× on a 1.5K-node codebase, 16.4× on a 2.2K-node one. scripts/benchmark-json.sh writes data to dashboard/data/<project>.jsonl; see the example dashboard for what it looks like, then run open ~/Projects/MonsterFlow/dashboard/index.html from a clone to see your own data.
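For a sense of how a multiplier like 14.2× falls out of that JSONL, here's a reading sketch — the field names are assumptions, since the actual schema is defined by scripts/benchmark-json.sh:

```python
import io
import json

def mean_reduction(jsonl_stream):
    """Average full-corpus-to-graph token ratio across benchmark records.
    Field names ('full_corpus_tokens', 'graph_tokens') are illustrative."""
    ratios = [
        rec["full_corpus_tokens"] / rec["graph_tokens"]
        for rec in map(json.loads, jsonl_stream)
    ]
    return sum(ratios) / len(ratios)

sample = io.StringIO('{"full_corpus_tokens": 142000, "graph_tokens": 10000}\n')
print(f"{mean_reduction(sample):.1f}x")  # 142000 / 10000 = 14.2x
```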
The default /wrap runs the full ingestion chain — no flags needed. /wrap-quick skips all three insight phases for speed. /wrap-full forces phases that would otherwise soft-skip.
flowchart TD
ENTRY["/wrap\nwrap-quick · wrap-full"]:::entry
P1["Phase 1 · always\nsummary + token cost"]:::always
P1a["Phase 1a · default\nfacets → friction · outcome\nskip: quick"]:::auto
P1b["Phase 1b · default\ninsights-parser → report.html\nCLAUDE.md · hooks · skills · prompts\nskip: quick"]:::auto
P1c["Phase 1c · default\npersona drift · [TRIAGE] on ≥5pp\nskip: quick or cold-start"]:::auto
P2["Phase 2 · always\nlearning triage\napprove → CLAUDE.md · Memory · Settings · Skills"]:::triage
P2c["Phase 2c · conditional\nwiki flush + distill\nif vault present"]:::wiki
P3["Phases 3–4 · default\ngit loose ends · dep audit\npermission cleanup\nskip: quick (partial)"]:::loose
P5["Phase 5 · default\nCLAUDE.md health check\nskip: quick"]:::health
CM["CLAUDE.md"]:::artifact
MEM["Memory\nfeedback · project · ref"]:::artifact
ENTRY --> P1 --> P1a --> P1b --> P1c --> P2
P2 --> CM & MEM
P2 -. vault .-> P2c
P2 --> P3 --> P5
classDef entry fill:#1e3a5f,stroke:#7c9cff,color:#bfdbfe,stroke-width:2px
classDef always fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
classDef auto fill:#0c3a5f,stroke:#7dd3fc,color:#bae6fd,stroke-width:2px
classDef triage fill:#3b1f7a,stroke:#c4b5fd,color:#ede9fe,stroke-width:2px
classDef wiki fill:#451a03,stroke:#fde68a,color:#fde68a,stroke-width:2px
classDef loose fill:#27272a,stroke:#a1a1aa,color:#e4e4e7,stroke-width:2px
classDef health fill:#0f4c4c,stroke:#5eead4,color:#99f6e4,stroke-width:2px
classDef artifact fill:#1e293b,stroke:#64748b,color:#94a3b8,stroke-width:2px,stroke-dasharray:4 3
Three automatic insight phases feed one triage gate. Phase 1a reads the per-session facets file — friction, outcome, helpfulness. Phase 1b parses report.html from the built-in /insights command, extracting pre-written CLAUDE.md sections, friction patterns, hook configs, and skill templates. Phase 1c computes persona drift across the last 10 features and emits triage candidates for any rate that shifted ≥ 5pp. All [TRIAGE] lines converge at Phase 2 — one approval gate writes CLAUDE.md edits, feedback memories, settings.json hooks, and skill files in the same session.
After the gate, the pipeline surfaces its own next version. Copyable prompts appear for next-session use. Horizon cards from /insights — autonomous pipelines, parallel worktree racing, multi-repo sync — surface as /spec candidates: the pipeline proposing what to build next.
Each command writes a persistent artifact under docs/specs/<feature>/.
| Command | What it does | Agents |
|---|---|---|
| /kickoff | One-time project init — scans repo, drafts constitution, picks agent roster | — |
| /spec | Confidence-tracked Q&A — writes spec.md | Interactive |
| /spec-review | Parallel PRD review — gaps, risks, ambiguity; + Codex adversarial pass (optional) | 6 reviewers |
| /plan | Architecture + implementation design (incl. wave-sequencer for data-contract precedence) | 7 designers |
| /check | Last gate before code — validates the plan; + Codex adversarial pass (optional) | 5 validators |
| /build | Parallel execution with verification discipline; + Codex implementation review (optional) | Superpowers |
| /autorun | Headless overnight pipeline — queues a spec and drives all 8 stages unattended. Single-slug per invocation (per AC#24); multi-spec queues use autorun-batch.sh --mode=overnight. Per-axis warn/block policy framework (verdict, branch, codex_probe, verify_infra) lets you say "warn overnight, block in supervised mode"; security and integrity findings are hardcoded blocks regardless. Works cross-project: engine scripts stay in MonsterFlow, target git/docs/queue live in $PWD. | Shell |
| /flow | Displays the workflow reference card | — |
| /wrap | Session wrap-up — three automatic insight phases feed one triage gate (CLAUDE.md · Memory · hooks · skills), then git loose ends. Variants: quick (fast, skips insights) · full (forces soft-skip phases) | — |
Unattended overnight runs need a sharper question than "should we halt?". The autorun-overnight-policy spec (26 ACs, shipped via PR #6) replaced the old halt-on-anything behavior with a per-axis warn/block framework. The principle: more permissive overnight, except for security gaps. Testing in the morning with warnings beats a halted pipeline at 3am.
Per-axis policy
Four overrideable axes — verdict, branch, codex_probe, verify_infra — each independently set to warn or block. --mode=overnight warns on everything; --mode=supervised blocks on everything. Per-axis env vars override the mode preset. Three classes are hardcoded blocks regardless of mode:
sev:security by reviewers

Sticky run-degraded gate
Any single warn during a run sets RUN_DEGRADED=1 (sticky). Auto-merge fires only when RUN_DEGRADED=0 AND CODEX_HIGH_COUNT=0. Non-clean runs ship as a PR awaiting review — you wake to artifacts + a PR, not a halted pipeline.
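Both mechanisms fit in a short sketch: per-axis warn/block resolution plus the sticky degraded flag gating auto-merge. The env-var names (`POLICY_<AXIS>`) are illustrative, not the scripts' real interface:

```python
AXES = ("verdict", "branch", "codex_probe", "verify_infra")
MODE_PRESET = {"overnight": "warn", "supervised": "block"}

def resolve_policy(mode, env):
    """Each axis takes the mode preset unless a hypothetical
    POLICY_<AXIS> env var overrides it."""
    return {
        axis: env.get(f"POLICY_{axis.upper()}", MODE_PRESET[mode])
        for axis in AXES
    }

class Run:
    """Sticky degraded gate: one warn anywhere disqualifies auto-merge."""
    def __init__(self):
        self.run_degraded = 0
        self.codex_high_count = 0

    def warn(self, axis):
        self.run_degraded = 1  # sticky: never reset within a run

    def can_auto_merge(self):
        return self.run_degraded == 0 and self.codex_high_count == 0

run = Run()
policy = resolve_policy("overnight", {})  # every axis resolves to "warn"
if policy["verdict"] == "warn":
    run.warn("verdict")                   # non-clean run → PR for review, no auto-merge
```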
Single-slug invocation + queue-loop wrapper
run.sh <slug> processes exactly one slug per invocation. Multi-spec queues call the new autorun-batch.sh wrapper, which iterates queue/*.spec.md and honors queue/STOP at iteration boundaries. Cron migration:
# Before (≤ v0.6):
0 22 * * * cd /path/to/repo && scripts/autorun/run.sh
# After (v0.7+):
0 22 * * * cd /path/to/repo && scripts/autorun/autorun-batch.sh --mode=overnight
Single-fence verdict extractor (D33)
Synthesis emits a fenced check-verdict block at the end of its output; a deterministic shell+Python post-processor (_policy_json.py extract-fence) extracts it to check-verdict.json. Multi-fence detection — more than one check-verdict fence — blocks as a possible prompt-injection attempt. NFKC-normalize + zero-width-strip happens before scanning, so disguised-character fences (homoglyph attacks) get caught.
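The extractor's core — normalize first, then demand exactly one fence — can be sketched as follows. This is a hedged model of the D33 flow; the real _policy_json.py may differ in pattern and error handling (the fence delimiter is built with chr(96) only so this sketch can live inside documentation):

```python
import json
import re
import unicodedata

FENCE = chr(96) * 3  # a literal triple-backtick fence delimiter
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def extract_verdict(synthesis_output):
    """Extract the single check-verdict fence, or raise to block the run."""
    # NFKC folds disguised/homoglyph characters; translate strips
    # zero-width characters, so hidden fences are counted like plain ones.
    clean = unicodedata.normalize("NFKC", synthesis_output).translate(ZERO_WIDTH)
    fences = re.findall(FENCE + r"check-verdict\n(.*?)\n" + FENCE, clean, re.S)
    if len(fences) != 1:
        # zero or multiple fences → block as possible prompt injection
        raise ValueError(f"expected 1 check-verdict fence, found {len(fences)}")
    return json.loads(fences[0])  # would be written to check-verdict.json
```

Note how this also exhibits the v1 limitation described below: the check is a count, not an authentication, so a single fence quoted from reviewed content would still pass.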
Known v1 limitation
D33 multi-fence rejection blocks the easy attack class but does not authenticate a single fence quoted from reviewed content. If synthesis omits its own fence and reviewed content quotes a single fake one, count==1 passes and a forged GO ships. Mitigation is detection-hardening, not prevention. For repos processing untrusted spec sources (third-party PRs, externally-authored queue items), set verdict_policy=block and disable unattended auto-merge until the architectural fix lands. The fix is carved into the autorun-verdict-deterministic follow-up spec — deterministic verdict aggregation from structured reviewer outputs (drops the synthesis-emits-sidecar pattern entirely).
Full design + 26 acceptance criteria: docs/specs/autorun-overnight-policy/.
The pipeline ships in this repo (v0.9.1). Everything else is third-party — installed at the latest version from its source. No version pinning required for normal use.
Required
| Tool | Why | How to get it |
|---|---|---|
| Claude Code CLI | The harness this pipeline runs in | claude.com/claude-code |
| Python ≥ 3.9 | Used by session-cost.py, persona-metrics scripts, benchmarks | brew install python |
Plugins
| Tier | Plugin | Purpose |
|---|---|---|
| Always-on | superpowers | Execution discipline — TDD, debugging, verification, code review |
| Always-on | context7 | Library / framework / API documentation fetching |
| On-demand | firecrawl · code-review · ralph-loop · playwright | Research · GitHub PR review · micro-iteration · browser automation |
| Periodic | claude-md-management · skill-creator · claude-code-setup | Meta-tooling — audit CLAUDE.md, build new skills, recommend automations |
$ claude plugins install superpowers context7

Optional integrations
| Integration | Why | Install |
|---|---|---|
| graphify (recommended for best performance) | Knowledge-graph backend driving the 10–20× token reduction shown in the Knowledge loop | pip install graphify (last reviewed: 0.4.21) |
| Codex | Adversarial reviewer at /spec-review, /check, /build — silent skip if not installed | npm i -g @openai/codex + openai/codex-plugin-cc marketplace |
| Obsidian | Destination for distilled wiki pages produced at /wrap | obsidian.md + set OBSIDIAN_VAULT_PATH |
| gh CLI | Used by code-review plugin and a few git-aware scripts | brew install gh && gh auth login |
Clone, run the installer, then open any project and type /kickoff.
$ git clone https://github.com/Jstottlemyer/MonsterFlow.git ~/Projects/MonsterFlow
$ cd ~/Projects/MonsterFlow && ./install.sh
The installer symlinks commands, personas, templates, and settings into ~/.claude/, then offers to install plugins.
29 always-available pipeline personas + 9 domain personas + 2 focused Claude Code subagents (autorun-shell-reviewer, persona-metrics-validator). A session calls only the slice for the current phase — never all 40 at once.
/spec-review · 6 — Requirements · Gaps · Ambiguity · Feasibility · Scope · Stakeholders
/plan · 7 — API · Data Model · UX · Scalability · Security · Integration · Wave Sequencer
/check · 5 — Completeness · Sequencing · Risk · Scope Discipline · Testability
Code review · 9 — Correctness · Dependency · Design Quality · Documentation · Performance · Resilience · Security · Test Quality · Wiring
Judge (quality scoring) · Synthesis (multi-agent consolidation) — used by /spec-review, /plan, /check
Domain personas — mobile/ (6, iOS) · games/ (3, game-dev). Loaded only when /kickoff matches the project — never globally active. Projects can add their own (e.g. AuthTools adds 5).
/flow reference card

What you see in-session when you type /flow.
╔══════════════════════════════════════════════════════════════╗
║                       SESSION WORKFLOW                       ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  PROJECT SETUP (once per project)                            ║
║    /kickoff → constitution + agent roster                    ║
║                                                              ║
║  FEATURE (full pipeline)                                     ║
║    /spec → /spec-review → /plan → /check → /build            ║
║    define   6 PRD    7 design   5 plan    execute            ║
║    (Q&A)    agents   agents     agents    (parallel)         ║
║    + firecrawl (research) · context7 (API docs)              ║
║    + codex adversarial review at spec-review, check, build   ║
║      (optional — silent skip if not set up)                  ║
║                                                              ║
║  WORK-SIZE SCALING                                           ║
║    Bug fix:        describe it → fix it → verify             ║
║    Small change:   /spec (quick) → /build                    ║
║    Feature:        full pipeline above                       ║
║    V2/Rework:      revise existing spec → full pipeline      ║
║                                                              ║
║  PARALLEL WORK                                               ║
║    "work on X, Y, and Z in parallel"                         ║
║    → Each dispatched to a subagent                           ║
║                                                              ║
║  IN-SESSION DISCIPLINE [Superpowers]                         ║
║    → systematic-debugging · verification-before-done         ║
║    → requesting-code-review · ralph-loop (micro-iteration)   ║
║                                                              ║
║  CODE REVIEW                                                 ║
║    Quick:  superpowers requesting-code-review                ║
║    PR:     /code-review plugin                               ║
║    Full:   9 parallel code-review personas                   ║
║                                                              ║
║  ARTIFACTS                                                   ║
║    docs/specs/constitution.md      (project principles)      ║
║    docs/specs/<feature>/spec.md    (living spec)             ║
║    docs/specs/<feature>/review.md  (PRD review findings)     ║
║    docs/specs/<feature>/plan.md    (implementation plan)     ║
║    docs/specs/<feature>/check.md   (gap checkpoint)          ║
║                                                              ║
║  KNOWLEDGE LAYER [graphify + obsidian]                       ║
║    Fires automagically at /wrap — no typing, no friction:    ║
║      _raw/ → wiki pages            (wiki-ingest)             ║
║      session → projects/<name>/    (wiki-update)             ║
║      graph export + lint           (wiki-export · wiki-lint) ║
║      graphify digest → _raw/       (silent arch snapshot)    ║
║    Manual (rare):                                            ║
║      /graphify [path]          build code knowledge graph    ║
║      /graphify query "Q"       graph traversal answer        ║
║      "what do I know about X"  wiki-query                    ║
║      "capture this: X"         wiki-capture → _raw/          ║
║    Compile, don't retrieve. Capture cheap, distill at /wrap. ║
║                                                              ║
║  SESSION END                                                 ║
║    /wrap → insights (facets · report.html · persona drift)   ║
║            triage gate (CLAUDE.md · memory · hooks · skills) ║
║            knowledge flush · git loose ends                  ║
║                                                              ║
╠══════════════════════════════════════════════════════════════╣
║  AGENTS: review(6) plan(7) check(5) code-review(9)           ║
║          + judge · synthesis · domain agents                 ║
║                                                              ║
║  PLUGINS                                                     ║
║    Always-on:  superpowers · context7                        ║
║    On-demand:  firecrawl · code-review · ralph-loop          ║
║                playwright                                    ║
║    Periodic:   claude-md-management · skill-creator          ║
║                claude-code-setup                             ║
║    Optional:   codex — adversarial review at spec-review,    ║
║                /check, /build (silent skip if not set up)    ║
║                                                              ║
║  Superpowers: in-session execution discipline                ║
║  Plugins: specialized capabilities                           ║
║  You say WHAT. Claude handles HOW.                           ║
╚══════════════════════════════════════════════════════════════╝