
Research Loop

What is it? (Start here)

Imagine a doctor who wants to try a new treatment on a patient. They don't just try it immediately — they write a hypothesis ("I expect this drug to reduce inflammation within 48 hours"), define what success looks like (measurable outcomes), and run a controlled trial. A colleague reviews the results without knowing the original hypothesis (blind review), to avoid confirmation bias. Only then is the treatment adopted.

The Research Loop is that process for architectural decisions. Before any agent builds a new framework, swaps a core dependency, or changes a data model, they must: write a hypothesis, run a time-boxed experiment, and pass a blind review. No PROPOSAL.md means no merge. No exceptions.

A real example

Splinter proposes switching from a basic text search to a hybrid vector + BM25 retrieval system. Instead of building it directly, the process is:

  1. Hypothesis written in PROPOSAL.md: "Hybrid vector + BM25 retrieval with 70/30 weighting will reduce p95 search latency below 50ms on a 1,000-document corpus, compared to 120ms baseline."
  2. Success criteria defined: p95 ≤ 50ms on benchmark X, recall rate ≥ 90% on test queries, no regression on exact-match queries.
  3. Time-boxed PoC built in an isolated worktree — not production code, just enough to run benchmarks.
  4. Velma reviews the diff and numbers without seeing the proposal — she sees benchmark output and code changes, not the motivation or desired outcome.
  5. Splinter accepts based on evidence. The pattern goes to clan-learnings: "hybrid vec+BM25, 70/30, 50ms p95 at 1k docs."

If the experiment fails (latency is 80ms, not 50ms), that also goes to clan-learnings as an anti-pattern. Other agents won't repeat the same experiment.
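Checking a criterion like "p95 ≤ 50ms, recall ≥ 90%" is mechanical once the benchmark emits raw numbers. A minimal sketch of that check, assuming the benchmark produces a list of per-query latencies in milliseconds (the sample data and thresholds echo the example above; nothing here is a real fleet API):

```python
import statistics

def p95(latencies_ms: list[float]) -> float:
    """95th percentile of latency samples, in milliseconds."""
    return statistics.quantiles(latencies_ms, n=100)[94]

def meets_criteria(latencies_ms: list[float], recall: float,
                   *, p95_budget_ms: float = 50.0, min_recall: float = 0.90) -> bool:
    """Evaluate the PROPOSAL.md success criteria against benchmark output."""
    return p95(latencies_ms) <= p95_budget_ms and recall >= min_recall

# Hypothetical benchmark output: 100 query latencies between 40ms and 49ms.
samples = [40.0 + (i % 10) for i in range(100)]
print(meets_criteria(samples, recall=0.93))  # → True: p95 is 49ms, recall 93%
```

The decision is then a yes/no against the thresholds written down before the experiment ran, not a judgment call made after seeing the numbers.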

[Diagram: Research Loop — architectural change gate. Hypothesis before architecture; evidence before merge; no PROPOSAL.md = no merge for architectural PRs. Flow: triggers (Standing Order #16: new pattern, framework/lib swap, data model changes, cross-cutting concerns, 3+ agent impact, low reversibility, previous failure on same problem) → PROPOSAL.md filed in worktree root (Hypothesis, Success Criteria, Time Box) → time-boxed PoC (build the experiment, not the feature; never merges to main) → blind review by Velma 🔍 (sees code changes and benchmark output only, not the hypothesis or desired outcome) → Splinter 🐀 accepts or rejects on evidence, not opinion. Accept: pattern published to clan-learnings, fleet learns. Reject: anti-pattern published too, saves fleet time. Skipped for: bug fixes, features in existing patterns, config changes, documentation.]

How it works — the mechanism

When does it activate?

Six triggers require the research loop (Standing Order #16). Any one is sufficient:

  • New architectural pattern — something not already in clan-learnings
  • Framework or library swap — changing a dependency others rely on
  • Data model changes — schema migrations, new entity relationships
  • Cross-cutting concerns — changes that touch 3+ agents' workflows
  • Low reversibility — changes that are hard to roll back cleanly
  • Previous failure on same problem — the same approach failed before

The loop does not apply to: bug fixes, features within existing patterns, config changes, or documentation. Only architectural decisions.
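The trigger test is a simple disjunction: any single condition is enough to require the loop. A sketch of how an agent might gate its own work (the flag names are hypothetical stand-ins, not a real fleet schema):

```python
from dataclasses import dataclass

@dataclass
class Change:
    """Properties of a proposed change, as judged by the proposing agent."""
    new_pattern: bool = False        # not already in clan-learnings
    dependency_swap: bool = False    # framework or library others rely on
    data_model_change: bool = False  # schema migration, new relationships
    agents_affected: int = 1         # cross-cutting if it touches 3+ agents
    hard_to_revert: bool = False     # low reversibility
    failed_before: bool = False      # same approach failed previously

def requires_research_loop(c: Change) -> bool:
    """Standing Order #16: any one trigger is sufficient."""
    return any([
        c.new_pattern,
        c.dependency_swap,
        c.data_model_change,
        c.agents_affected >= 3,
        c.hard_to_revert,
        c.failed_before,
    ])

print(requires_research_loop(Change(dependency_swap=True)))  # → True
print(requires_research_loop(Change()))                      # → False: bug fix, config, docs
```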

PROPOSAL.md — hypothesis first, always

The proposal file goes in the worktree root before any implementation. Three sections are required:

  • Hypothesis — a falsifiable statement with measurable outcome
  • Success Criteria — specific, objective thresholds
  • Time Box — a hard limit on experiment duration

"This will be better" is not a hypothesis.
"p95 latency under 50ms on benchmark X" is a hypothesis.
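Put together, a minimal PROPOSAL.md has exactly those three sections. A sketch, reusing the hybrid retrieval example from above (the time box duration is illustrative, not a mandated value):

```markdown
## Hypothesis
Hybrid vector + BM25 retrieval with 70/30 weighting will reduce p95
search latency below 50ms on a 1,000-document corpus (baseline: 120ms).

## Success Criteria
- p95 ≤ 50ms on benchmark X
- Recall ≥ 90% on the test query set
- No regression on exact-match queries

## Time Box
2 days. If the evidence is inconclusive when the box expires, the
experiment stops and the result is published as-is.
```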

Time-boxed PoC — evidence, not features

The PoC builds the minimum needed to test the hypothesis. It does not build the feature. It does not handle production edge cases. It exists only to generate data — benchmarks, profiles, test results — that directly address the success criteria. It never merges to main.
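One way to make the time box a hard limit rather than an aspiration is to enforce it in the experiment harness itself. A minimal sketch, assuming the experiment yields one measurement per iteration (the `fake_benchmark` generator is a hypothetical stand-in for real timed queries):

```python
import time

def run_timeboxed(experiment, budget_s: float) -> list:
    """Collect measurements from an experiment, stopping at the time box.

    Whatever evidence accumulated before the budget ran out is returned;
    partial evidence still gets published.
    """
    deadline = time.monotonic() + budget_s
    evidence = []
    for measurement in experiment():
        evidence.append(measurement)
        if time.monotonic() >= deadline:
            break  # hard limit: stop issuing work, ship what we have
    return evidence

# Hypothetical experiment: each iteration yields one latency sample.
def fake_benchmark():
    while True:
        yield 42.0  # stand-in for a real timed query

print(len(run_timeboxed(fake_benchmark, budget_s=0.01)) > 0)  # → True
```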

Blind review — Velma without the proposal

This is the critical gate. Velma reviews the PoC diff without seeing the PROPOSAL.md. She sees code changes and numbers. She does not see the hypothesis, the desired outcome, or the original motivation. This removes confirmation bias from the review — if the evidence is compelling, it shows in the diff and the numbers, not in how the problem was framed.

Splinter's decision — architecture judgment

Splinter accepts or rejects based on Velma's review and the PoC artefacts. Splinter can accept a PoC Velma had concerns about (with documented rationale), or reject one Velma approved (architecture reasons override functional validation). The two reviews are independent inputs, not a consensus.

Outcomes feed the fleet — always

Both pass and fail outcomes publish to clan-learnings:

  • Accepted: becomes a positive pattern — "hybrid vec+BM25, 70/30, 50ms p95"
  • Rejected: becomes an anti-pattern — "direct token swap breaks Convex client scope; use middleware hook instead"

A rejected experiment that publishes its result saves every other agent from repeating the same failure. The experiment wasn't wasted — it produced knowledge.
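Both outcomes can share one record shape; only the verdict differs. A sketch of what a clan-learnings entry might carry (field names and the evidence pointers are assumptions, not a real schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Learning:
    """One published experiment outcome: pattern or anti-pattern."""
    summary: str    # one-line takeaway other agents can act on
    accepted: bool  # True = pattern, False = anti-pattern
    evidence: str   # pointer to benchmark output / PoC artefacts

def publish(outcome: Learning, store: list) -> None:
    """Pass and fail both go to the fleet-wide store; nothing is discarded."""
    store.append(outcome)

clan_learnings: list[Learning] = []
publish(Learning("hybrid vec+BM25, 70/30, 50ms p95 at 1k docs",
                 accepted=True, evidence="bench/run-17"), clan_learnings)
publish(Learning("direct token swap breaks Convex client scope; use middleware hook",
                 accepted=False, evidence="bench/run-18"), clan_learnings)
print(len(clan_learnings))  # → 2
```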

How it connects to the larger loop

  • Accepted outcomes become patterns in clan-learnings — available fleet-wide at the next session start
  • PROPOSAL.md is written in the journal worktree — it's part of the task's documented history
  • When an experiment hits 3 failures and triggers M6 (Invert the Approach), that's the Reflection Loop's mutation protocol activating inside the research loop