# Discovery Gossip — Epidemic Broadcast Protocol

## What is gossip? (Start here)
Imagine a team of doctors in a hospital. One doctor discovers that a new drug interacts badly with a common anaesthetic. She tells two colleagues. They each tell two more. Within hours, every doctor in the hospital knows — without a memo, without a meeting, without anyone coordinating the broadcast. That's gossip.
Agent gossip works the same way. Cantona discovers the Convex client silently swallows errors on cold starts. He publishes that finding to the gossip bus. Two agents receive it on their next heartbeat. Those agents each push it to two more. Within minutes the whole fleet knows — without any coordinator, notification, or direct messaging.
The key difference from before: we used to run `tail -20 discoveries.jsonl` at session start and call that gossip. That was a passive shared-file read — all-or-nothing, no push. Real gossip is epidemic broadcast: every heartbeat, each agent pushes state to random peers, and peers push to their peers. Information spreads like a rumor, reaching the whole fleet in O(log_f n) rounds (fan-out f, fleet size n).
## Epidemic broadcast vs the old `tail -20`

The old `tail -20` approach had fundamental problems:
| Property | Old tail -20 | Epidemic broadcast |
|---|---|---|
| Direction | Passive read at session start only | Push on every heartbeat (every 60s) |
| Mid-session propagation | ❌ None — a discovery made at 10am is invisible until 8am tomorrow | ✅ All agents receive it within ~3 minutes |
| State exchange | Full file read (all or nothing) | Digest/delta — only missing entries transferred |
| Version tracking | ❌ None — duplicate entries possible | ✅ Version vectors + Lamport timestamps prevent duplicates |
| TTL / GC | ❌ None — file grows forever | ✅ 72h TTL, 500-entry cap, automatic GC |
| Convergence | Depends on session timing — non-deterministic | ✅ O(log_f n) rounds — guaranteed |
| Fault tolerance | ❌ None — missed reads are permanent gaps | ✅ Anti-entropy full-sync every 10 rounds closes gaps |
With 7 agents and fan-out 2, a discovery reaches all agents in 3 rounds — ~3 minutes at 60s heartbeat intervals. That's the difference between "I discovered this yesterday and nobody knew" and "the fleet knew before I finished typing the solution."
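The convergence claim can be checked with a small simulation. This is an illustrative sketch under simplified assumptions (synchronized rounds, uniform random peer choice), not the fleet's actual code:

```python
import random

def avg_rounds_to_converge(n=7, fanout=2, trials=2000, seed=1):
    """Monte Carlo estimate of rounds until every agent holds an entry,
    with each informed agent pushing to `fanout` random peers per round."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        informed = {0}          # agent 0 publishes the discovery
        rounds = 0
        while len(informed) < n:
            pushed = set()
            for agent in informed:
                others = [p for p in range(n) if p != agent]
                pushed.update(rng.sample(others, fanout))
            informed |= pushed
            rounds += 1
        total += rounds
    return total / trials

print(round(avg_rounds_to_converge(), 1))
```

With the defaults (7 agents, fan-out 2), this averages close to 3 rounds, consistent with the table above.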
## How a gossip round works — step by step

Every 60 seconds, each agent runs this loop:

```python
def gossip_round(self):
    # 1. Select 2 random peers (fan-out = 2)
    peers = random.sample([p for p in all_peers if p != self], 2)

    # 2. For each peer, exchange digest/delta for both channels
    for peer in peers:
        for channel in ["discoveries", "patterns"]:
            my_digest = {
                "agent": self.agent_id,
                "channel": channel,
                "round": self.round_number,
                "vector": self.version_vector[channel],           # what I know per agent
                "entry_ids": list(self.replica[channel].keys()),  # what I have
                "my_lamport": self.clock.now(),
            }
            delta = send_digest_to_peer(peer, my_digest)

            # 3. Apply what the peer has that I don't (OR-Set merge)
            for entry in delta.missing_entries:
                self.apply_entry(entry, channel)

            # 4. Merge version vectors (max per agent slot)
            self.merge_vector(self.version_vector[channel], delta.new_vector)

    # 5. Every 10 rounds: anti-entropy full-sync (fault-tolerance safety net)
    if self.round_number % 10 == 0:
        self.anti_entropy_full_sync()
```
The digest you send to a peer tells them exactly what you have — not the entries themselves, just the entry IDs and your version vector. The peer computes what you're missing and sends back only those entries. No unnecessary transfer.
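The peer's side of that exchange can be sketched in a few lines — the function and field names here are illustrative, not the protocol's actual API:

```python
def compute_delta(peer_digest, my_replica, my_vector):
    """Given a peer's digest, return the entries they're missing plus my vector.
    `my_replica` maps entry ID -> entry; `peer_digest["entry_ids"]` is what they hold."""
    missing_ids = set(my_replica) - set(peer_digest["entry_ids"])
    return {
        "missing_entries": [my_replica[i] for i in sorted(missing_ids)],
        "new_vector": dict(my_vector),
    }

# Usage sketch: the peer holds one entry the digest's sender doesn't
replica = {"d-velma-2": {"id": "d-velma-2", "lamport": 2}}
digest = {"entry_ids": ["d-tank-1"]}
delta = compute_delta(digest, replica, {"velma": 2})
```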
## Digest and delta — a concrete exchange
When Agent Tank gossips with Agent Velma, here's what actually flows over the wire:
1. Tank sends his digest to Velma
```json
// Tank's digest — what he has in the discoveries channel
{
  "agent": "tank",
  "channel": "discoveries",
  "round": 42,
  "vector": {
    "tank": 1742400000000,
    "cantona": 1742398000000,
    "velma": 1742399000000,
    "popashot": 1742399500000,
    "zerocool": 1742396000000,
    "slash": 1742397000000
  },
  "entry_ids": [
    "d-tank-1742400000000",
    "d-tank-1742399000000",
    "d-tank-1742398000000",
    "d-cantona-1742398000000"
  ],
  "my_lamport": 1742400000000
}
```

The `vector` tells Velma what Tank has seen from each agent. The `entry_ids` list is the full set of discovery IDs Tank's replica contains. Velma compares against her own replica to compute what's missing.
2. Velma computes the delta — what Tank is missing
```
// What Velma has that Tank doesn't (diff = Velma's IDs minus Tank's IDs)
missing_from_tank = set(Velma.entry_ids) - set(Tank.entry_ids)
// e.g., if Velma has d-velma-1742399500000 and d-popashot-1742399500000
// but Tank only has entries up to 1742399000000 for those agents
// → missing_from_tank = { "d-velma-1742399500000", "d-popashot-1742399500000" }
```
3. Velma sends the delta back to Tank
```json
// Velma's delta response to Tank
{
  "from_agent": "velma",
  "channel": "discoveries",
  "round": 42,
  "missing_entries": [
    {
      "id": "d-velma-1742399500000",
      "type": "discovery",
      "agent": "velma",
      "ts": "2026-03-18T10:05:00Z",
      "lamport": 1742399500000,
      "context": "monitoring convex deploy after 2h idle",
      "problem": "30s cold-start delay on first request after idle",
      "solution": "pre-warm with lightweight ping every 90 minutes via cron",
      "tags": ["convex", "cold-start", "cron"],
      "confidence": "high"
    },
    {
      "id": "d-popashot-1742399500000",
      "type": "discovery",
      "agent": "popashot",
      "ts": "2026-03-18T10:04:30Z",
      "lamport": 1742399500000,
      "context": "concurrent convex client init",
      "problem": "two agents starting at same time causes conflict resolution timeout",
      "solution": "add jitter to agent startup delay, stagger by 5s",
      "tags": ["convex", "concurrency", "startup"],
      "confidence": "high"
    }
  ],
  "new_vector": {
    "tank": 1742400000000,
    "velma": 1742399500000,
    "cantona": 1742398000000,
    "popashot": 1742399500000,
    "zerocool": 1742396000000,
    "slash": 1742397000000
  }
}
```

Tank applies these two entries to his replica (OR-Set merge — if the ID is new, insert it; if it exists, take the higher Lamport version). Tank then merges his version vector with Velma's: `result[agent] = max(local[agent], remote[agent])`. Both agents are now consistent. No coordinator, no locks on the gossip exchange, no waiting for other agents to be online.
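That apply step can be sketched as follows — a minimal sketch, assuming the replica is an in-memory dict keyed by entry ID (the real apply presumably also persists to the JSONL):

```python
def apply_entry(replica, entry):
    """OR-Set insert with LWW resolution: new IDs are added; an existing ID
    is replaced only if the incoming copy carries a higher Lamport."""
    existing = replica.get(entry["id"])
    if existing is None or entry["lamport"] > existing["lamport"]:
        replica[entry["id"]] = entry

replica = {}
apply_entry(replica, {"id": "d-velma-1742399500000", "lamport": 1742399500000})
apply_entry(replica, {"id": "d-velma-1742399500000", "lamport": 1})  # stale copy, ignored
```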
## Version vectors and Lamport timestamps
Every write gets two numbers attached to it:
- Lamport timestamp — a single counter that increases on every write. If event A causally precedes event B, then Lamport(A) < Lamport(B). Provides total ordering across all agents without synchronized clocks.
- Version vector — a map from agent ID → last-seen Lamport for that agent's writes. Each agent maintains one per gossip channel. Tells you exactly what version of each agent's state you have.
```
// Agent Tank's version vector for discoveries channel
{
  "tank": 1742400000000,      // I've seen tank's writes up to this Lamport
  "cantona": 1742398000000,   // I've seen cantona's writes up to here
  "velma": 1742399500000,     // I've seen velma's writes up to here
  "popashot": 1742399500000,
  "zerocool": 1742396000000,
  "slash": 1742397000000
}

// To check if I know about a specific entry:
entry_is_known = entry.lamport <= my_vector[entry.agent]
// e.g., is d-cantona-1742399000000 known? 1742399000000 <= 1742398000000 → NO (gap!)
```
Why both? Lamport gives ordering (if A happened before B, we know). Version vectors give partial ordering (you know what you know about each agent independently). Together they give you deterministic merge without coordination.
The ID format encodes both: d-tank-1742400000000 is a unique, orderable identifier that reflects the Lamport at time of writing.
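A minimal Lamport clock sketch, following the `max(local, received) + 1` rule described later in this document (the class and method names are assumptions, not the fleet's actual API):

```python
class LamportClock:
    """Logical clock: ticks on local writes, jumps past any observed remote value."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # local event (e.g., publishing a discovery)
        self.time += 1
        return self.time

    def observe(self, remote_lamport):
        # receiving a remote entry: jump past whatever we just saw,
        # so our next write is ordered after it
        self.time = max(self.time, remote_lamport) + 1
        return self.time

clock = LamportClock()
clock.tick()        # → 1
clock.observe(100)  # → 101
```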
## TTL and garbage collection — 72 hours, 500 entries
Discoveries are ephemeral. They capture immediate findings that lose relevance over time. Two limits enforce this:
- TTL: 72 hours — after 72 hours from `ts`, a discovery is considered expired. GC runs during anti-entropy rounds (every 10 gossip rounds) and soft-deletes expired entries by marking them `"retracted": true`. They stay in the JSONL (append-only) but are excluded from the replica and never shown.
- Cap: 500 entries — if the replica exceeds 500 non-retracted entries, the oldest entries (by `ts`, FIFO) are soft-deleted first. This prevents unbounded growth even if TTL doesn't trigger.
```python
# Garbage collection runs every anti-entropy round (every 10 heartbeats)
# Three phases:
#   1. Skip already-retracted entries
#   2. FIFO eviction if over 500 entries
#   3. Hard TTL expiry (entries older than 72h)
#
# After GC: replica size ≤ 500, all entries ≤ 72h old, no retracted entries
def garbage_collect(replica):
    candidates = [e for e in replica.values() if not e.get("retracted")]

    # FIFO eviction: soft-delete the oldest entries past the cap
    if len(candidates) > MAX_DISCOVERIES:
        candidates.sort(key=lambda e: e["ts"])
        excess = candidates[:len(candidates) - MAX_DISCOVERIES]
        for entry in excess:
            entry["retracted"] = True

    # TTL expiry: soft-delete anything older than 72 hours
    now = time.time()
    for entry in candidates:
        if now - iso_to_ts(entry["ts"]) > 72 * 3600:
            entry["retracted"] = True

    return {k: v for k, v in replica.items() if not v.get("retracted")}
```
Patterns (patterns.jsonl) have no TTL and no cap. They are permanent validated learnings. A pattern entry never expires — it can be superseded by a newer entry (via "supersedes" field) but never retracted.
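Supersession can be resolved at read time with a simple filter — a sketch assuming each pattern entry may carry a `supersedes` field naming the older entry's `id`:

```python
def active_patterns(entries):
    """Return patterns that no newer entry supersedes."""
    superseded = {e["supersedes"] for e in entries if e.get("supersedes")}
    return [e for e in entries if e["id"] not in superseded]

patterns = [
    {"id": "p-1", "lesson": "stagger startups by 5s"},
    {"id": "p-2", "lesson": "stagger startups with random jitter", "supersedes": "p-1"},
]
# active_patterns(patterns) keeps only p-2; p-1 stays in the JSONL but is never shown
```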
## A real example — Tank publishes, Cantona receives
Tank is monitoring a production deploy. He notices a 30-second cold-start delay when no agent has been active for more than 2 hours. After 20 minutes of debugging, he identifies the cause (Convex free tier hibernates on idle) and a solution (cron ping every 90 minutes). He publishes his discovery:
```shell
echo '{"id":"d-tank-1742400000000","type":"discovery","agent":"tank",
"ts":"2026-03-18T10:00:00Z","lamport":1742400000000,
"context":"monitoring convex deploy after 2h idle period",
"problem":"30s cold-start delay on first request after idle",
"solution":"pre-warm with a lightweight ping every 90 minutes via cron",
"tags":["convex","cold-start","cron"],"confidence":"high"
}' >> ~/d/clan-learnings/gossip/discoveries.jsonl
```

His Lamport clock increments: `lamport = max(local, received) + 1`. The entry is written with ID `d-tank-1742400000000`. His version vector for discoveries is updated: `tank: 1742400000000`.
At the next heartbeat (≤60s), Tank's gossip round selects 2 random peers — say Velma and Popashot. He sends them his digest. They compute deltas. Velma is missing the cold-start entry and applies it. Popashot is also missing it and applies it. At Velma's next heartbeat (≤60s later), she gossips with Cantona and Slash. Cantona receives the cold-start discovery. The whole fleet knows within 3 rounds (~3 minutes). Nobody had to start a session. Nobody had to remember to share. The epidemic handled it.
## Two channels: discoveries (ephemeral) and patterns (permanent)
| Channel | What goes here | TTL / Cap | Bar for entry |
|---|---|---|---|
| `discoveries.jsonl` | Fresh, immediate findings — undocumented behaviours, workarounds that cost >15 min, performance surprises | 72h TTL · 500-entry cap | Non-obvious · not already in patterns · not a secret |
| `patterns.jsonl` | Validated, repeatable lessons — corrections that hit 3+ occurrences, successful mutation strategies | No TTL · no cap | Verified · generalises beyond one agent · worth encoding permanently |
Discoveries are quick to publish and naturally expire (72h window + cap). Patterns are permanent — promoted from corrections after validation, never expire, and live in patterns.md forever (or until superseded). `patterns.md` is generated — never edit it directly.
## Why no merge conflicts? — CRDT semantics

Multiple agents can publish discoveries simultaneously. This could cause conflicts — but it doesn't, because of CRDTs (Conflict-free Replicated Data Types). Three CRDT types keep the knowledge base consistent without any coordinator:
- OR-Set (Observed-Remove Set) — for adding entries. Concurrent adds from multiple agents always converge to the union. No add is ever lost. An entry is identified by its `id` (which includes the Lamport).
- LWW-Register (Last-Write-Wins) — for updates. When the same entry ID appears from two agents, the one with the higher Lamport wins. No coordination needed — the math does it.
- G-Counter (Grow-only Counter) — for occurrence counts in patterns. Each agent maintains its own count slot; merge takes the maximum per agent. Counts only ever go up.
```python
# OR-Set merge: union of IDs, LWW per ID
def merge(entry_a, entry_b):
    if entry_a["lamport"] >= entry_b["lamport"]:
        return entry_a
    return entry_b

# Version vector merge: max per agent slot (idempotent, commutative, associative)
def merge_vector(local, remote):
    result = {}
    for agent in set(local.keys()) | set(remote.keys()):
        result[agent] = max(local.get(agent, 0), remote.get(agent, 0))
    return result
```
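The merge code above covers the LWW and version-vector cases; the G-Counter merge for pattern occurrence counts follows the same per-slot-max shape. A sketch, not the fleet's actual implementation:

```python
def merge_gcounter(local, remote):
    """G-Counter merge: per-agent max; the total count is the sum of all slots."""
    return {agent: max(local.get(agent, 0), remote.get(agent, 0))
            for agent in set(local) | set(remote)}

tank  = {"tank": 3, "velma": 1}     # tank's view of a pattern's occurrence counts
velma = {"velma": 2, "cantona": 4}  # velma's view
merged = merge_gcounter(tank, velma)
print(sum(merged.values()))  # 9 — counts only ever go up
```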
Concurrent file writes are still protected by fcntl advisory locks on the shared filesystem. The CRDT semantics handle logical conflicts; the locks handle physical concurrent writes to the JSONL.
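The locked append path might look like this — a sketch assuming POSIX `fcntl.flock`; the helper name is illustrative:

```python
import fcntl
import json

def append_entry_locked(path, entry):
    """Append one JSONL line under an exclusive advisory lock, so two agents
    hitting the shared file at once can't interleave partial lines."""
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            f.write(json.dumps(entry) + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Advisory locks only work when every writer cooperates — which holds here, since all agents go through the same append path.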
## What NOT to publish
- Routine fixes — "fixed a typo" or "updated a dependency" is noise
- Things already in patterns.md — duplicate signal, check first
- Secrets — API keys, passwords, tokens, personal data — never
- Speculation — only publish what you confirmed actually works
## Searching fleet knowledge before you start work
Before starting on any unfamiliar topic, every agent checks what the fleet already knows. This is a standing order — not optional:
```shell
grep -A5 "topic" ~/d/clan-learnings/patterns.md
```

Fleet intelligence is the first stop — before web search, before exploring the codebase, before asking another agent. An agent that skips this check risks rediscovering something the fleet already knows.
## Deeper dives
- Reflection Loop — how corrections become patterns (the promotion pipeline)
- Distributed Knowledge — gossip at fleet scale, CRDT convergence guarantees, CAP tradeoffs
- Architecture Reference — full technical spec with CRDT data structures, transport modes, and convergence math