The MonsterFlow mascot — a friendly six-armed coding monster wearing a leather harness, each arm typing on a different floating terminal window.

8-command pipeline · Claude Code · v0.9.1

You say WHAT.
Claude handles HOW.

Multi-agent review at every gate. And then the system measures whether those reviews are worth anything.

A structured planning, multi-agent review, and execution-discipline pipeline for Claude Code. Ships 40 reviewers — 29 always-available pipeline personas, 9 domain personas, plus 2 focused Claude Code subagents. Each phase calls only the slice it needs. One /wrap at the end compiles what you learned into durable memory.


Why this exists. I've always been a builder, so when none of the harnesses I could find had the self-learning loops I wanted — a fully self-improving harness that adapts to how the user actually leverages the tool — I built one. The 5 multi-agent gates give the leverage. The judging is what gives me the trust.

As one of my friends recently confessed: I don't ship code anymore, I ship outcomes.

— Justin · MIT-licensed, genuinely experimental

The pipeline

Five gates between an idea and shipped code. Specialist agents run in parallel at each one, and their findings are judged, deduplicated, and synthesized before the next phase starts. After the first /spec you can jump to /autorun if you want less control or to run things overnight. Otherwise, step through the process gate by gate and evaluate the plan as you go.

flowchart TD
    K["/kickoff\nconstitution + agent roster"]:::setup
    S["/spec\nQ&A · confidence-tracked"]:::define
    SR["/spec-review\nrequirements · gaps · ambiguity\nfeasibility · scope · stakeholders"]:::review
    JS1["Judge · Dedupe · Synth\ncluster · attribute · compose → review.md"]:::synth
    P["/plan\napi · data-model · ux · scalability\nsecurity · integration · wave-sequencer"]:::plan
    JS2["Judge · Dedupe · Synth → plan.md"]:::synth
    C["/check\ncompleteness · sequencing · risk\nscope-discipline · testability"]:::gate
    JS3["Judge · Dedupe · Synth → check.md\nthree-tier verdict (v0.9.1+)"]:::synth
    V1["GO\nclean pass"]:::verdictGo
    V2["GO_WITH_FIXES\nwarn → followups.jsonl\n(common case in permissive mode)"]:::verdictWarn
    V3["NO_GO\narchitectural · security · unclassified\nhalt"]:::verdictBlock
    B["/build\nparallel execute\n(consumes followups wave 1)"]:::execute
    W["/wrap"]:::wrap
    K --> S --> SR --> JS1 --> P --> JS2 --> C --> JS3
    JS3 --> V1 --> B
    JS3 --> V2 --> B
    JS3 --> V3
    B --> W

    SP["Superpowers\nTDD · verification"]:::side
    CX["Codex\nadversarial review"]:::accent
    KL["Knowledge layer\ngraphify · wiki"]:::side
    PM["Persona Metrics\nload-bearing · silent · survival rates"]:::metrics

    SP -.-> B
    CX -.-> SR
    CX -.-> C
    CX -.-> B
    W -. compiles .-> KL
    JS1 ==records==> PM
    JS2 ==> PM
    JS3 ==> PM
    W ==surfaces drift==> PM
    KL -. "wiki-query · graphify\nauto-memory" .-> S
    PM -. "drift informs\nroster decisions" .-> K

    %% Codex edges (links 14,15,16 = CX→SR, CX→C, CX→B; link indices are
    %% 0-based in declaration order) are visually softened
    %% so they read as ambient adversarial reviews without competing with
    %% the main forward flow. Same intent as the other dashed-orange edges,
    %% just lower-opacity stroke so overlapping reads as a wash, not a clash.
    linkStyle 14,15,16 stroke:#92400e,stroke-width:1.5px,stroke-dasharray:4 3,opacity:0.55
    classDef setup   fill:#1e3a5f,stroke:#7c9cff,color:#bfdbfe,stroke-width:2px
    classDef define  fill:#0f4c4c,stroke:#5eead4,color:#99f6e4,stroke-width:2px
    classDef review  fill:#7c2d12,stroke:#fdba74,color:#fed7aa,stroke-width:2px
    classDef plan    fill:#3b1f7a,stroke:#c4b5fd,color:#ede9fe,stroke-width:2px
    classDef gate    fill:#881337,stroke:#fda4af,color:#ffe4e6,stroke-width:2px
    classDef execute fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
    classDef wrap    fill:#27272a,stroke:#a1a1aa,color:#e4e4e7,stroke-width:2px
    classDef synth   fill:#0c3a5f,stroke:#7dd3fc,color:#bae6fd,stroke-width:2px
    classDef side    fill:#1e293b,stroke:#64748b,color:#94a3b8,stroke-width:2px,stroke-dasharray:4 3
    classDef accent  fill:#451a03,stroke:#fde68a,color:#fde68a,stroke-width:2px,stroke-dasharray:4 3
    classDef metrics fill:#2d1b69,stroke:#a78bfa,color:#e9d5ff,stroke-width:3px
    classDef verdictGo    fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
    classDef verdictWarn  fill:#713f12,stroke:#fde047,color:#fef9c3,stroke-width:2px
    classDef verdictBlock fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2,stroke-width:2px
      

Three-tier verdict at /check (v0.9.1+). The pipeline-gate-permissiveness change replaces the old binary halt-or-pass behavior with three verdicts: GO (clean pass), GO_WITH_FIXES (non-architectural findings warn-routed to followups.jsonl and consumed by /build wave 1 — the common case in permissive mode), and NO_GO (architectural, security, or unclassified findings halt the pipeline). The same verdict shape applies at /spec-review and /plan; /check remains the last gate before code.
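The routing rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MonsterFlow's implementation: the finding shape, the `class` field name, and the `decide_verdict` helper are hypothetical. Only the three verdict names and the blocking classes come from the text.

```python
# Hypothetical sketch of the three-tier verdict rule described above.
# The finding shape ({"id": ..., "class": ...}) is an assumption, not the real schema.
BLOCKING_CLASSES = {"architectural", "security", "unclassified"}


def decide_verdict(findings):
    """Return (verdict, followups) for a list of finding dicts."""
    if not findings:
        return "GO", []
    # A finding with no class is treated as unclassified, which blocks.
    classes = [f.get("class", "unclassified") for f in findings]
    if any(c in BLOCKING_CLASSES for c in classes):
        return "NO_GO", []
    # Everything left is non-architectural: warn and hand off to /build wave 1.
    return "GO_WITH_FIXES", list(findings)
```

In this sketch the second element of the result stands in for the entries that would be written to followups.jsonl; a single blocking finding is enough to halt, regardless of how many warn-level findings accompany it.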

Each gate runs specialist agents in parallel; a Judge then clusters their raw output, attributes each finding to its source, and composes the synthesized artifact that drives the next phase. spec.md captures intent through Q&A, then feeds 6 PRD reviewers → Judge → review.md; that feeds 7 design agents → Judge → plan.md; that feeds 5 plan validators → Judge → check.md. /build executes against the plan with TDD and verification discipline. /wrap distills what changed back into memory and the wiki. Small work skips the gates it doesn't need — same flow scales from a typo fix to a V2.

Codex (optional). If installed, an OpenAI Codex agent runs an adversarial pass at /spec-review, /check, and /build — independent perspective from a different model family, joining the same Judge synthesis as the in-house personas. Silent skip if not set up.

Self-learning

Every gate quietly records which findings shaped the next phase. /wrap Phase 1c renders the drift table and emits triage candidates for any persona that shifted meaningfully — closing the loop into Phase 2.

flowchart LR
    R["Roster\n28 personas + Codex"]:::roster
    G["3 multi-agent gates\n/spec-review · /plan · /check"]:::gate
    E["findings.jsonl\nparticipation.jsonl\nsurvival.jsonl"]:::data
    W["/wrap Phase 1c\n10-feature window\ndrift → [TRIAGE]"]:::metrics
    H["Human reads drift\nroster judgment"]:::human
    R --> G
    G ==records==> E
    E ==> W
    W ==surfaces==> H
    H -. roster edit .-> R
    classDef roster  fill:#1e293b,stroke:#64748b,color:#94a3b8,stroke-width:2px
    classDef gate    fill:#881337,stroke:#fda4af,color:#ffe4e6,stroke-width:2px
    classDef data    fill:#0c3a5f,stroke:#7dd3fc,color:#bae6fd,stroke-width:2px
    classDef metrics fill:#2d1b69,stroke:#a78bfa,color:#e9d5ff,stroke-width:3px
    classDef human   fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px

Each Judge writes findings.jsonl (what each persona raised) and participation.jsonl (who ran). When you revise the spec or plan, a survival classifier compares pre- and post-artifacts and labels each finding addressed / not_addressed / rejected. /wrap Phase 1c rolls a 10-feature window into per-persona stats — load-bearing rate, silent rate, survival — renders drift (e.g. ↑ a11y 4% → 18%, ↓ test-quality 22% → 9%), and emits [TRIAGE MEMORY] lines for any persona that shifted ≥ 5pp. Those triage candidates flow into Phase 2's approval gate — roster edits can be written in the same session.
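As a rough illustration of the drift step, the sketch below computes a per-persona rate from labeled findings and flags any persona whose rate moved by at least 5 percentage points between two windows. The record keys (`persona`, `outcome`), the definition of the rate as "share of findings labeled addressed", and the exact output line format are assumptions made for the example; the real schema and formulas live in the persona-metrics scripts.

```python
from collections import defaultdict


def survival_rates(findings):
    """Share of each persona's findings labeled 'addressed' (assumed definition)."""
    raised = defaultdict(int)
    addressed = defaultdict(int)
    for finding in findings:
        raised[finding["persona"]] += 1
        if finding["outcome"] == "addressed":
            addressed[finding["persona"]] += 1
    return {persona: addressed[persona] / raised[persona] for persona in raised}


def drift_lines(previous, current, threshold_pp=5.0):
    """Emit one triage line per persona whose rate shifted by >= threshold_pp."""
    lines = []
    for persona in sorted(set(previous) & set(current)):
        # Round to avoid float noise right at the threshold (e.g. 4.999999...).
        delta_pp = round((current[persona] - previous[persona]) * 100, 6)
        if abs(delta_pp) >= threshold_pp:
            arrow = "↑" if delta_pp > 0 else "↓"
            lines.append(
                f"[TRIAGE MEMORY] {arrow} {persona} "
                f"{previous[persona]:.0%} → {current[persona]:.0%}"
            )
    return lines
```

Personas that appear in only one of the two windows are skipped here, which is one possible reading of the cold-start case; the shipped behavior may differ.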

The knowledge loop

/wrap doesn't just end a session — it compiles what you learned into stores that the next session reads from. Every /spec and /kickoff starts smarter than the last.

flowchart LR
    subgraph SN["Session N"]
      direction TB
      W["/wrap\ndistill · capture · index"]:::wrap
    end
    W --> G[("graphify graph\ncode structure ·\ngod nodes")]:::store
    W --> WIKI[("Obsidian wiki\ndistilled\nknowledge pages")]:::store
    W --> MEM[("CLAUDE.md\n+ auto-memory\npreferences ·\ndecisions")]:::store
    W --> RAW[("_raw/\ncheap\ncaptures")]:::store
    RAW -. wiki-ingest at next /wrap .-> WIKI
    subgraph SN1["Session N+1, N+2, ..."]
      direction TB
      S["/spec · /kickoff\nstarts with full prior context"]:::define
    end
    G -. /graphify query .-> S
    WIKI -. wiki-query .-> S
    MEM -. auto-loaded at session start .-> S
    classDef wrap   fill:#3f3f46,stroke:#d4d4d8,color:#fff
    classDef define fill:#0f766e,stroke:#5eead4,color:#fff
    classDef store  fill:#1e293b,stroke:#7c9cff,color:#e7e9ee

Compile, don't retrieve. Capture is cheap during the session ("capture this: X" → _raw/). Distillation happens once at /wrap. Reads at the start of the next session are free: the wiki is already structured, the graph is already built, memory is already loaded.

Measured weekly. Graph-driven queries on real codebases use roughly 10–20× fewer tokens than full-corpus reads: 14.2× on a 1.5K-node codebase, 16.4× on a 2.2K-node one. scripts/benchmark-json.sh writes data to dashboard/data/<project>.jsonl; see the example dashboard for what it looks like, then run open ~/Projects/MonsterFlow/dashboard/index.html from a clone to see your own data.

Session wrap

The default /wrap runs the full ingestion chain — no flags needed. /wrap-quick skips all three insight phases for speed. /wrap-full forces phases that would otherwise soft-skip.

flowchart TD
    ENTRY["/wrap\nwrap-quick · wrap-full"]:::entry
    P1["Phase 1 · always\nsummary + token cost"]:::always
    P1a["Phase 1a · default\nfacets → friction · outcome\nskip: quick"]:::auto
    P1b["Phase 1b · default\ninsights-parser → report.html\nCLAUDE.md · hooks · skills · prompts\nskip: quick"]:::auto
    P1c["Phase 1c · default\npersona drift · [TRIAGE] on ≥5pp\nskip: quick or cold-start"]:::auto
    P2["Phase 2 · always\nlearning triage\napprove → CLAUDE.md · Memory · Settings · Skills"]:::triage
    P2c["Phase 2c · conditional\nwiki flush + distill\nif vault present"]:::wiki
    P3["Phases 3–4 · default\ngit loose ends · dep audit\npermission cleanup\nskip: quick (partial)"]:::loose
    P5["Phase 5 · default\nCLAUDE.md health check\nskip: quick"]:::health
    CM["CLAUDE.md"]:::artifact
    MEM["Memory\nfeedback · project · ref"]:::artifact
    ENTRY --> P1 --> P1a --> P1b --> P1c --> P2
    P2 --> CM & MEM
    P2 -. vault .-> P2c
    P2 --> P3 --> P5
    classDef entry    fill:#1e3a5f,stroke:#7c9cff,color:#bfdbfe,stroke-width:2px
    classDef always   fill:#14532d,stroke:#86efac,color:#bbf7d0,stroke-width:2px
    classDef auto     fill:#0c3a5f,stroke:#7dd3fc,color:#bae6fd,stroke-width:2px
    classDef triage   fill:#3b1f7a,stroke:#c4b5fd,color:#ede9fe,stroke-width:2px
    classDef wiki     fill:#451a03,stroke:#fde68a,color:#fde68a,stroke-width:2px
    classDef loose    fill:#27272a,stroke:#a1a1aa,color:#e4e4e7,stroke-width:2px
    classDef health   fill:#0f4c4c,stroke:#5eead4,color:#99f6e4,stroke-width:2px
    classDef artifact fill:#1e293b,stroke:#64748b,color:#94a3b8,stroke-width:2px,stroke-dasharray:4 3
      

Three automatic insight phases feed one triage gate. Phase 1a reads the per-session facets file — friction, outcome, helpfulness. Phase 1b parses report.html from the built-in /insights command, extracting pre-written CLAUDE.md sections, friction patterns, hook configs, and skill templates. Phase 1c computes persona drift across the last 10 features and emits triage candidates for any rate that shifted ≥ 5pp. All [TRIAGE] lines converge at Phase 2 — one approval gate writes CLAUDE.md edits, feedback memories, settings.json hooks, and skill files in the same session.

After the gate, the pipeline surfaces its own next version. Copyable prompts appear for next-session use. Horizon cards from /insights — autonomous pipelines, parallel worktree racing, multi-repo sync — surface as /spec candidates: the pipeline proposing what to build next.

Commands

Each command writes a persistent artifact under docs/specs/<feature>/.

| Command | What it does | Agents |
|---|---|---|
| /kickoff | One-time project init — scans repo, drafts constitution, picks agent roster | |
| /spec | Confidence-tracked Q&A — writes spec.md | Interactive |
| /spec-review | Parallel PRD review — gaps, risks, ambiguity; + Codex adversarial pass (optional) | 6 reviewers |
| /plan | Architecture + implementation design (incl. wave-sequencer for data-contract precedence) | 7 designers |
| /check | Last gate before code — validates the plan; + Codex adversarial pass (optional) | 5 validators |
| /build | Parallel execution with verification discipline; + Codex implementation review (optional) | Superpowers |
| /autorun | Headless overnight pipeline — queues a spec and drives all 8 stages unattended. Single-slug per invocation (per AC#24); multi-spec queues use autorun-batch.sh --mode=overnight. Per-axis warn/block policy framework (verdict, branch, codex_probe, verify_infra) lets you say "warn overnight, block in supervised mode"; security and integrity findings are hardcoded blocks regardless. Works cross-project: engine scripts stay in MonsterFlow, target git/docs/queue live in $PWD. | Shell |
| /flow | Displays the workflow reference card | |
| /wrap | Session wrap-up — three automatic insight phases feed one triage gate (CLAUDE.md · Memory · hooks · skills), then git loose ends. Variants: quick (fast, skips insights) · full (forces soft-skip phases) | |

Overnight policy framework (v0.7+)

Unattended overnight runs need a sharper question than "should we halt?". The autorun-overnight-policy spec (26 ACs, shipped via PR #6) replaced the old halt-on-anything behavior with a per-axis warn/block framework. The principle: more permissive overnight, except for security gaps. Testing in the morning with warnings beats a halted pipeline at 3am.

Per-axis policy

Four overrideable axes — verdict, branch, codex_probe, verify_infra — each independently set to warn or block. --mode=overnight warns on everything; --mode=supervised blocks on everything. Per-axis env vars override the mode preset. Three classes are hardcoded blocks regardless of mode:

Sticky run-degraded gate

Any single warn during a run sets RUN_DEGRADED=1 (sticky). Auto-merge fires only when RUN_DEGRADED=0 AND CODEX_HIGH_COUNT=0. Non-clean runs ship as a PR awaiting review — you wake to artifacts + a PR, not a halted pipeline.
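The per-axis policy and the sticky gate can be pictured together with a small sketch. It is a simplified model, not the autorun engine: the `<axis>_policy` override keys are hypothetical (only verdict_policy is named in this document, the rest are assumed by analogy), the `RunState` class is invented for illustration, and the hardcoded block classes are left out.

```python
AXES = ("verdict", "branch", "codex_probe", "verify_infra")
MODE_PRESETS = {"overnight": "warn", "supervised": "block"}


def resolve_policy(mode, overrides=None):
    """Start from the mode preset, then apply per-axis overrides."""
    overrides = overrides or {}
    policy = {}
    for axis in AXES:
        # "<axis>_policy" keys are hypothetical, modeled on verdict_policy.
        value = overrides.get(f"{axis}_policy", MODE_PRESETS[mode])
        if value not in ("warn", "block"):
            raise ValueError(f"bad policy for {axis}: {value!r}")
        policy[axis] = value
    return policy


class RunState:
    """Tracks the sticky degraded flag for one run (illustrative only)."""

    def __init__(self, policy):
        self.policy = policy
        self.run_degraded = 0      # mirrors RUN_DEGRADED
        self.codex_high_count = 0  # mirrors CODEX_HIGH_COUNT
        self.halted = False

    def record_finding(self, axis):
        if self.policy[axis] == "block":
            self.halted = True
        else:
            self.run_degraded = 1  # sticky: nothing ever resets it to 0

    def can_auto_merge(self):
        return (not self.halted
                and self.run_degraded == 0
                and self.codex_high_count == 0)
```

For example, `resolve_policy("overnight", {"verdict_policy": "block"})` keeps three axes at warn and blocks only on verdict, and a single warn on any axis is enough to turn a would-be auto-merge into a PR awaiting review.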

Single-slug invocation + queue-loop wrapper

run.sh <slug> processes exactly one slug per invocation. Multi-spec queues call the new autorun-batch.sh wrapper, which iterates queue/*.spec.md and honors queue/STOP at iteration boundaries. Cron migration:

# Before (≤ v0.6):
0 22 * * * cd /path/to/repo && scripts/autorun/run.sh

# After (v0.7+):
0 22 * * * cd /path/to/repo && scripts/autorun/autorun-batch.sh --mode=overnight
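For readers who want the queue-loop behavior spelled out, here is a minimal Python sketch of what the wrapper is described as doing: iterate queue/*.spec.md, process one slug at a time, and check for queue/STOP only between iterations. The real wrapper is the shell script autorun-batch.sh; `run_batch` and the `run_one` callback are hypothetical names used only for this illustration.

```python
from pathlib import Path


def run_batch(queue_dir, run_one):
    """Process each queued spec once, honoring a STOP file between iterations."""
    queue = Path(queue_dir)
    processed = []
    for spec in sorted(queue.glob("*.spec.md")):
        # STOP is only checked at iteration boundaries, never mid-run.
        if (queue / "STOP").exists():
            break
        slug = spec.name[: -len(".spec.md")]
        run_one(slug)  # stands in for one single-slug invocation of run.sh
        processed.append(slug)
    return processed
```

Checking the stop file only at the top of the loop means a slug that has already started always runs to completion, which matches the "iteration boundaries" wording above.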

Single-fence verdict extractor (D33)

Synthesis emits a fenced check-verdict block at the end of its output; a deterministic shell+Python post-processor (_policy_json.py extract-fence) extracts it to check-verdict.json. Multi-fence detection — more than one check-verdict fence — blocks as a possible prompt-injection attempt. NFKC-normalize + zero-width-strip happens before scanning, so disguised-character fences (homoglyph attacks) get caught.
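A minimal sketch of the extraction idea follows, assuming the fence body is JSON and the fence is a triple-backtick block labeled check-verdict. It is not the code in _policy_json.py: the function name, the zero-width character set, and the `{"verdict": ...}` payload shape are assumptions made for the example.

```python
import json
import re
import unicodedata

FENCE = "`" * 3  # built this way so the sketch itself contains no literal fence
# Zero-width characters removed before scanning (illustrative, not exhaustive).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
FENCE_RE = re.compile(
    r"^" + FENCE + r"check-verdict[ \t]*\n(.*?)\n" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_verdict(synthesis_output):
    """Return the parsed verdict, or raise if the fence count is not exactly one."""
    # Normalize first so disguised fences are counted like plain ones.
    cleaned = unicodedata.normalize("NFKC", synthesis_output).translate(ZERO_WIDTH)
    bodies = FENCE_RE.findall(cleaned)
    if len(bodies) != 1:
        raise ValueError(
            f"expected exactly one check-verdict fence, found {len(bodies)}"
        )
    return json.loads(bodies[0])
```

Normalizing before counting is the important ordering: a second fence whose label hides a zero-width character would otherwise slip past the scan and leave the count at one.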

Known v1 limitation

D33 multi-fence rejection blocks the easy attack class but does not authenticate a single fence quoted from reviewed content. If synthesis omits its own fence and reviewed content quotes a single fake one, count==1 passes and a forged GO ships. Mitigation is detection-hardening, not prevention. For repos processing untrusted spec sources (third-party PRs, externally-authored queue items), set verdict_policy=block and disable unattended auto-merge until the architectural fix lands. The fix is carved into the autorun-verdict-deterministic follow-up spec — deterministic verdict aggregation from structured reviewer outputs (drops the synthesis-emits-sidecar pattern entirely).

Full design + 26 acceptance criteria: docs/specs/autorun-overnight-policy/.

Requirements

The pipeline ships in this repo (v0.9.1). Everything else is third-party — installed at latest from its source. No version pinning required for normal use.

Required

| Tool | Why | How to get it |
|---|---|---|
| Claude Code CLI | The harness this pipeline runs in | claude.com/claude-code |
| Python ≥ 3.9 | Used by session-cost.py, persona-metrics scripts, benchmarks | brew install python |

Plugins

| Tier | Plugin | Purpose |
|---|---|---|
| Always-on | superpowers | Execution discipline — TDD, debugging, verification, code review |
| Always-on | context7 | Library / framework / API documentation fetching |
| On-demand | firecrawl · code-review · ralph-loop · playwright | Research · GitHub PR review · micro-iteration · browser automation |
| Periodic | claude-md-management · skill-creator · claude-code-setup | Meta-tooling — audit CLAUDE.md, build new skills, recommend automations |

$ claude plugins install superpowers context7

Optional integrations

| Integration | Why | Install |
|---|---|---|
| graphify (recommended for best performance) | Knowledge-graph backend driving the 10–20× token reduction shown in the Knowledge loop | pip install graphifyy (last reviewed: 0.4.21) |
| Codex | Adversarial reviewer at /spec-review, /check, /build — silent skip if not installed | npm i -g @openai/codex + openai/codex-plugin-cc marketplace |
| Obsidian | Destination for distilled wiki pages produced at /wrap | obsidian.md + set OBSIDIAN_VAULT_PATH |
| gh CLI | Used by code-review plugin and a few git-aware scripts | brew install gh && gh auth login |

Install

Clone, run the installer, then open any project and type /kickoff.

$ git clone https://github.com/Jstottlemyer/MonsterFlow.git ~/Projects/MonsterFlow
$ cd ~/Projects/MonsterFlow && ./install.sh

The installer symlinks commands, personas, templates, and settings into ~/.claude/, then offers to install plugins.

Agent roster — 40 total (38 personas + 2 subagents)

29 always-available pipeline personas + 9 domain personas + 2 focused Claude Code subagents (autorun-shell-reviewer, persona-metrics-validator). A session calls only the slice for the current phase — never all 40 at once.

Review · /spec-review 6

Requirements · Gaps · Ambiguity · Feasibility · Scope · Stakeholders

Plan · /plan 7

API · Data Model · UX · Scalability · Security · Integration · Wave Sequencer

Check · /check 5

Completeness · Sequencing · Risk · Scope Discipline · Testability

Code review · full mode 9

Correctness · Dependency · Design Quality · Documentation · Performance · Resilience · Security · Test Quality · Wiring

Synthesis layer 2

Judge (quality scoring) · Synthesis (multi-agent consolidation) — used by /spec-review, /plan, /check

Domain agents 9

mobile/ 6 iOS · games/ 3 game-dev. Loaded only when /kickoff matches the project — never globally active. Projects can add their own (e.g. AuthTools adds 5).

The /flow reference card

What you see in-session when you type /flow.

╔══════════════════════════════════════════════════════════════╗
║                    SESSION WORKFLOW                          ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  PROJECT SETUP (once per project)                            ║
║  /kickoff  →  constitution + agent roster                    ║
║                                                              ║
║  FEATURE (full pipeline)                                     ║
║  /spec  →  /spec-review  →  /plan  →  /check  →  /build      ║
║   define    6 PRD          7 design   5 plan     execute     ║
║   (Q&A)     agents         agents     agents     (parallel)  ║
║  + firecrawl (research) · context7 (API docs)                ║
║  + codex adversarial review at spec-review, check, build     ║
║    (optional — silent skip if not set up)                    ║
║                                                              ║
║  WORK-SIZE SCALING                                           ║
║  Bug fix:      describe it → fix it → verify                 ║
║  Small change: /spec (quick) → /build                        ║
║  Feature:      full pipeline above                           ║
║  V2/Rework:    revise existing spec → full pipeline          ║
║                                                              ║
║  PARALLEL WORK                                               ║
║  "work on X, Y, and Z in parallel"                           ║
║    → Each dispatched to a subagent                           ║
║                                                              ║
║  IN-SESSION DISCIPLINE                      [Superpowers]    ║
║  → systematic-debugging · verification-before-done           ║
║  → requesting-code-review · ralph-loop (micro-iteration)     ║
║                                                              ║
║  CODE REVIEW                                                 ║
║  Quick:  superpowers requesting-code-review                  ║
║  PR:     /code-review plugin                                 ║
║  Full:   9 parallel code-review personas                     ║
║                                                              ║
║  ARTIFACTS                                                   ║
║  docs/specs/constitution.md     (project principles)         ║
║  docs/specs/<feature>/spec.md   (living spec)                ║
║  docs/specs/<feature>/review.md (PRD review findings)        ║
║  docs/specs/<feature>/plan.md   (implementation plan)        ║
║  docs/specs/<feature>/check.md  (gap checkpoint)             ║
║                                                              ║
║  KNOWLEDGE LAYER                   [graphify + obsidian]     ║
║  Fires automagically at /wrap — no typing, no friction:      ║
║    _raw/ → wiki pages           (wiki-ingest)                ║
║    session → projects/<name>/   (wiki-update)                ║
║    graph export + lint          (wiki-export · wiki-lint)    ║
║    graphify digest → _raw/      (silent arch snapshot)       ║
║  Manual (rare):                                              ║
║    /graphify [path]    build code knowledge graph            ║
║    /graphify query "Q" graph traversal answer                ║
║    "what do I know about X"  wiki-query                      ║
║    "capture this: X"         wiki-capture → _raw/            ║
║  Compile, don't retrieve. Capture cheap, distill at /wrap.   ║
║                                                              ║
║  SESSION END                                                 ║
║  /wrap → insights (facets · report.html · persona drift)     ║
║          triage gate (CLAUDE.md · memory · hooks · skills)   ║
║          knowledge flush · git loose ends                    ║
║                                                              ║
╠══════════════════════════════════════════════════════════════╣
║  AGENTS: review(6) plan(7) check(5) code-review(9)           ║
║  + judge · synthesis · domain agents                         ║
║                                                              ║
║  PLUGINS                                                     ║
║  Always-on:  superpowers · context7                          ║
║  On-demand:  firecrawl · code-review · ralph-loop            ║
║              playwright                                      ║
║  Periodic:   claude-md-management · skill-creator            ║
║              claude-code-setup                               ║
║  Optional:   codex — adversarial review at spec-review,      ║
║              /check, /build (silent skip if not set up)      ║
║                                                              ║
║  Superpowers: in-session execution discipline                ║
║  Plugins: specialized capabilities                           ║
║  You say WHAT. Claude handles HOW.                           ║
╚══════════════════════════════════════════════════════════════╝