Summary
The AgentWorlds platform is moving to a Gateway-as-single-source-of-truth architecture for world runtime liveness. This requires three protocol-level changes in the Gateway and SDK (sections 1-3), plus a hand-written OpenAPI spec for the Gateway (section 4).
Spec: https://gist.github.com/Jing-yilin/c2777c4b46fe0d52692ec159ba6e5d93 (Phase 2)
1. POST /peer/heartbeat — Lightweight liveness signal
Problem: The only liveness signal today is a full POST /peer/announce (Ed25519-signed, carrying the full identity/endpoints/capabilities payload). Running it every 30s is overkill: it uses an expensive registration path as a lease-renewal path.
Solution: Add a lightweight heartbeat endpoint that only refreshes lastSeen:
// gateway/server.mjs
peer.post("/peer/heartbeat", async (req, reply) => {
  const { agentId, ts, signature } = req.body ?? {};

  // Only agents that have already announced (and so have a known public key) may heartbeat.
  const agent = registry.get(agentId);
  if (!agent?.publicKey) return reply.code(404).send({ error: "Unknown agent" });

  // The signature covers { agentId, ts } under the heartbeat domain separator.
  if (!verifyWithDomainSeparator(DOMAIN_SEPARATORS.HEARTBEAT, agent.publicKey, { agentId, ts }, signature)) {
    return reply.code(403).send({ error: "Invalid signature" });
  }

  agent.lastSeen = Date.now();
  // Do NOT trigger saveRegistry() here: heartbeats update memory only.
  return { ok: true };
});
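The handler above verifies a signature over `ts` but does not show a freshness check on it, so a captured heartbeat could in principle be replayed later. One possible guard is sketched below. The name `isFreshTimestamp` and the 60s tolerance are assumptions for illustration, not part of the spec.

```javascript
// Hypothetical tolerance for clock skew and network delay; the spec does not define one.
const MAX_HEARTBEAT_SKEW_MS = 60_000;

// Returns true when `ts` is a finite number within `maxSkewMs` of `now`
// (in either direction, so slightly fast client clocks are accepted too).
function isFreshTimestamp(ts, now = Date.now(), maxSkewMs = MAX_HEARTBEAT_SKEW_MS) {
  if (typeof ts !== "number" || !Number.isFinite(ts)) return false;
  return Math.abs(now - ts) <= maxSkewMs;
}
```

If adopted, the handler could reject stale heartbeats before signature verification, for example with a 400 response. Which status code to use is likewise an open choice.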
SDK changes:
- Add DOMAIN_SEPARATORS.HEARTBEAT = "aw:hb:" in crypto.ts
- Add sendHeartbeat() in gateway-announce.ts
- startGatewayAnnounce(): full announce every 10min (unchanged) + heartbeat every 30s (new)
- createWorldServer(): automatically starts heartbeat alongside announce
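The two-interval schedule described for startGatewayAnnounce() could look roughly like the sketch below. This is not the SDK's actual code: the function name, the injected `announce` / `heartbeat` callbacks, and the injectable timer functions are assumptions made so the scheduling logic can be shown (and tested) in isolation. The immediate first announce is also an assumption; it is motivated by the Gateway returning 404 for heartbeats from agents it has not yet seen.

```javascript
const ANNOUNCE_INTERVAL_MS = 10 * 60 * 1000; // full announce every 10 min (unchanged)
const HEARTBEAT_INTERVAL_MS = 30 * 1000;     // lightweight heartbeat every 30 s (new)

// Hypothetical sketch: callers pass their own announce/heartbeat functions,
// which are expected to handle their own network errors.
function startGatewayAnnounceSketch({
  announce,
  heartbeat,
  setIntervalFn = (fn, ms) => setInterval(fn, ms),
  clearIntervalFn = (id) => clearInterval(id),
}) {
  // Announce once up front so the Gateway knows the public key
  // before the first heartbeat arrives.
  announce();

  const announceTimer = setIntervalFn(announce, ANNOUNCE_INTERVAL_MS);
  const heartbeatTimer = setIntervalFn(heartbeat, HEARTBEAT_INTERVAL_MS);

  // Return a stop function so createWorldServer() (or tests) can shut both loops down.
  return function stop() {
    clearIntervalFn(announceTimer);
    clearIntervalFn(heartbeatTimer);
  };
}
```

Keeping the two timers independent means a slow or failing announce does not delay heartbeats, which matters once the stale TTL is short.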
2. Reduce Gateway stale TTL: 15min → 90s
Problem: With Gateway as the sole liveness source, a crashed world stays visible for up to 15 minutes. This is unacceptable for a live directory.
Solution:
const DEFAULT_STALE_TTL_MS = 90 * 1000; // was: 15 * 60 * 1000
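With a 90s TTL and 30s heartbeats, the TTL spans three heartbeat intervals, so a world can miss roughly two consecutive heartbeats before it drops out of the directory. A minimal sketch of the staleness check follows. It assumes the registry is a `Map` keyed by `agentId` whose entries carry `lastSeen` in milliseconds; `isStale` and `pruneStale` are illustrative names, not existing Gateway functions.

```javascript
const DEFAULT_STALE_TTL_MS = 90 * 1000;

// An entry is stale once more than `ttlMs` has elapsed since it was last seen.
function isStale(agent, now = Date.now(), ttlMs = DEFAULT_STALE_TTL_MS) {
  return now - agent.lastSeen > ttlMs;
}

// Removes stale entries from the registry and returns the removed agent IDs.
// Deleting from a Map while iterating over it is safe in JavaScript.
function pruneStale(registry, now = Date.now(), ttlMs = DEFAULT_STALE_TTL_MS) {
  const removed = [];
  for (const [agentId, agent] of registry) {
    if (isStale(agent, now, ttlMs)) {
      registry.delete(agentId);
      removed.push(agentId);
    }
  }
  return removed;
}
```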
Persistence adjustment: With a 90s TTL and 30s heartbeats, lastSeen updates are frequent. Heartbeats should update only the in-memory registry; a periodic disk snapshot every 30-60s covers crash recovery:
// Heartbeat: memory only (no saveRegistry)
// Announce: triggers saveRegistry (existing behavior)
// New: periodic snapshot every 30s for crash recovery
const _snapshotTimer = setInterval(() => {
  if (registryModified) {
    writeRegistry();
    registryModified = false; // reset so an unchanged registry is not rewritten on the next tick
  }
}, 30_000);
3. Optional announce webhook
Problem: When AgentWorlds deploys a world via SSM, the platform needs to know when the world has successfully registered with Gateway. Currently there is no callback mechanism.
Solution: Fire a webhook on first-seen announce (edge-triggered, idempotent):
const WEBHOOK_URL = process.env.WEBHOOK_URL || null;

// In upsertAgent():
if (isFirstSeen && WEBHOOK_URL) {
  fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ event: "world.announced", agentId, worldId, ts: Date.now() }),
    signal: AbortSignal.timeout(5000),
  }).catch(() => {}); // best-effort, not blocking
}
- No WEBHOOK_URL → no webhook fired (local dev friendly)
- Idempotent: only on first-seen after boot or after TTL expiry
- Best-effort: fire-and-forget, not a critical path
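The edge-triggered condition ("first-seen after boot or after TTL expiry") can be sketched as below. How `isFirstSeen` is really computed inside upsertAgent() is not shown in this issue, so the helper names, the `Map`-based registry, and the `onFirstSeen` callback are assumptions for illustration.

```javascript
// True when the agent has no record yet (e.g. after a Gateway boot),
// or when its existing record has already outlived the stale TTL.
function isFirstSeen(existing, now, ttlMs) {
  return !existing || now - existing.lastSeen > ttlMs;
}

// Hypothetical upsert: merges new fields into the record, refreshes lastSeen,
// and invokes onFirstSeen only on the "not live" -> "live" transition.
function upsertAgentSketch(registry, agentId, fields, options = {}) {
  const { now = Date.now(), ttlMs = 90 * 1000, onFirstSeen = () => {} } = options;
  const existing = registry.get(agentId);
  const firstSeen = isFirstSeen(existing, now, ttlMs);

  registry.set(agentId, { ...existing, ...fields, lastSeen: now });

  if (firstSeen) onFirstSeen(agentId);
  return firstSeen;
}
```

Because repeated announces from a live world leave `firstSeen` false, the webhook would fire once per liveness period rather than once per announce. Receivers should still tolerate duplicates, since delivery is best-effort.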
4. Hand-written gateway/openapi.yaml
Add an OpenAPI 3.1 spec covering the 7 public Gateway endpoints:
GET /health
GET /worlds
GET /world/{worldId}
GET /agents
POST /peer/announce
POST /peer/heartbeat (new)
WS /ws (document as info)
This allows AgentWorlds (and other consumers) to generate TypeScript types from the spec instead of hand-writing interfaces that drift out of sync.
Checklist
- Add DOMAIN_SEPARATORS.HEARTBEAT to SDK crypto.ts
- Add sendHeartbeat() to SDK gateway-announce.ts
- Integrate heartbeat into startGatewayAnnounce() (30s interval)
- Auto-start heartbeat in createWorldServer()
- Add POST /peer/heartbeat to gateway/server.mjs
- Reduce DEFAULT_STALE_TTL_MS to 90s in gateway/server.mjs
- Add periodic registry snapshot to gateway/server.mjs
- Add WEBHOOK_URL env + first-seen webhook in gateway/server.mjs
- Write gateway/openapi.yaml
- Update gateway/Dockerfile with WEBHOOK_URL env documentation