Architecture
Three components work together for crash resilience:
- Detached runners: agent processes run independently of Polpo’s main process
- RunStore: SQLite-backed registry of running processes
- Orphan recovery: automatic reconnection on restart
Detached Runners
When Polpo spawns an agent, it doesn’t run it in-process. Instead, it launches a detached runner as a separate Node.js process:
- Runner process has its own PID and survives Polpo crashes
- Uses detached: true + unref() so Polpo can exit without killing it
- Config is passed via a temporary JSON file in .polpo/tmp/
- Results are written to SQLite (RunStore), not IPC
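The spawn pattern described above can be sketched with Node's standard child_process module. This is a minimal illustration, not Polpo's actual code: the RunnerConfig shape, the function names, and the --config flag are assumptions made for the example.

```typescript
import { spawn } from "node:child_process";
import { randomUUID } from "node:crypto";
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical config shape; the real runner config is not shown in this document.
interface RunnerConfig {
  taskId: string;
  agent: string;
}

// Parent side: write the config to a temporary JSON file and return its path.
function writeRunnerConfig(tmpDir: string, config: RunnerConfig): string {
  mkdirSync(tmpDir, { recursive: true });
  const configPath = join(tmpDir, `runner-${randomUUID()}.json`);
  writeFileSync(configPath, JSON.stringify(config), "utf8");
  return configPath;
}

// Runner side: load the config that the parent wrote.
function readRunnerConfig(configPath: string): RunnerConfig {
  return JSON.parse(readFileSync(configPath, "utf8")) as RunnerConfig;
}

// Launch the runner as its own process and return its PID.
// The "--config" flag is an assumption for this sketch.
function spawnDetachedRunner(runnerScript: string, configPath: string): number | undefined {
  const child = spawn(process.execPath, [runnerScript, "--config", configPath], {
    detached: true, // the child leads its own process group and can outlive the parent
    stdio: "ignore", // no pipes that would tie the child to the parent
  });
  child.unref(); // the parent's event loop no longer waits for the child
  return child.pid;
}
```

Setting stdio to "ignore" matters here: with inherited or piped stdio, the child can remain attached to the parent even when detached is true.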
RunStore
The RunStore tracks all runner processes in SQLite, using busy_timeout=5000 to handle concurrent access.
Recovery Flow
When Polpo starts, it checks for orphaned processes. The failure cases resolve as follows:
- If Polpo crashes, runners keep working. On restart, Polpo reconnects.
- If a runner crashes, Polpo detects the dead PID and retries the task.
- If both crash, on restart the dead PIDs trigger retries.
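The three cases above reduce to one decision per active record: is the recorded PID still alive? The sketch below shows one way to make that decision in Node.js. The RunRecord shape and the function names are hypothetical, since the RunStore schema is not shown here.

```typescript
// Hypothetical record shape; the actual RunStore schema is not shown in this document.
interface RunRecord {
  taskId: string;
  pid: number;
}

interface RecoveryPlan {
  reconnect: RunRecord[]; // runner still alive: resume tracking it
  retry: RunRecord[]; // runner dead: schedule the task again
}

// Signal 0 performs the existence check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Split the active records into those to reconnect to and those to retry.
function planRecovery(
  active: RunRecord[],
  alive: (pid: number) => boolean = isPidAlive,
): RecoveryPlan {
  const plan: RecoveryPlan = { reconnect: [], retry: [] };
  for (const record of active) {
    (alive(record.pid) ? plan.reconnect : plan.retry).push(record);
  }
  return plan;
}
```

A bare PID check can be misled if the operating system has reused the PID for an unrelated process. Whether Polpo guards against this is not stated in this section.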
Graceful Shutdown
When Polpo receives SIGTERM, SIGINT, or SIGHUP:
- Stop spawning new tasks
- SIGTERM all runner PIDs
- Wait for RunStore writes (runners log their exit state)
- Force-mark any remaining active records as killed
- Exit cleanly
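As a rough sketch, the shutdown sequence above could be wired up as follows. The ShutdownHooks interface stands in for Polpo internals that are not shown here, and the 5-second grace period is an assumed value, not a documented one.

```typescript
// Hypothetical hooks standing in for Polpo internals.
interface ShutdownHooks {
  stopSpawning(): void;
  activePids(): number[];
  waitForRunStoreWrites(timeoutMs: number): Promise<void>;
  markRemainingKilled(): void;
}

// Runs the shutdown steps in order; the caller exits afterwards.
async function gracefulShutdown(hooks: ShutdownHooks, graceMs = 5000): Promise<void> {
  hooks.stopSpawning();
  for (const pid of hooks.activePids()) {
    try {
      process.kill(pid, "SIGTERM");
    } catch {
      // The runner has already exited, so there is nothing to signal.
    }
  }
  // Give runners time to record their exit state.
  await hooks.waitForRunStoreWrites(graceMs);
  // Anything still marked active at this point is recorded as killed.
  hooks.markRemainingKilled();
}

// Register the same handler for all three signals.
function installShutdownHandlers(hooks: ShutdownHooks): void {
  for (const signal of ["SIGTERM", "SIGINT", "SIGHUP"] as const) {
    process.once(signal, () => {
      void gracefulShutdown(hooks).finally(() => process.exit(0));
    });
  }
}
```

Using process.once rather than process.on means a second signal during shutdown falls back to Node's default behaviour and terminates the process immediately.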
Health Checks
The orchestrator periodically checks active runners.
Timeout Detection
Tasks that exceed maxDuration (default: 30 minutes) are killed.
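A timeout check of this kind comes down to comparing elapsed time against the limit on each health-check pass. The following sketch assumes start times are stored as epoch milliseconds and that the limit is expressed in milliseconds; the ActiveRun shape and function names are hypothetical.

```typescript
const DEFAULT_MAX_DURATION_MS = 30 * 60 * 1000; // 30 minutes, per the default above

// Hypothetical record shape: start time in epoch milliseconds.
interface ActiveRun {
  pid: number;
  startedAt: number;
}

// True once the run has been going for longer than the limit.
function hasTimedOut(
  run: ActiveRun,
  now: number,
  maxDurationMs: number = DEFAULT_MAX_DURATION_MS,
): boolean {
  return now - run.startedAt > maxDurationMs;
}

// Returns the runs that should be killed on this pass.
function findTimedOut(
  runs: ActiveRun[],
  now: number = Date.now(),
  maxDurationMs: number = DEFAULT_MAX_DURATION_MS,
): ActiveRun[] {
  return runs.filter((run) => hasTimedOut(run, now, maxDurationMs));
}
```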
Stale Detection
Agents that haven’t reported activity for staleThreshold (default: 5 minutes) get a warning, then are killed.
Staleness is evaluated from activity.lastUpdate on each tick. Stale agents trigger the agent:stale event.
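The warn-then-kill escalation can be modelled as a small per-agent state machine. The sketch below assumes the warning is issued on the first tick that finds the agent stale and the kill on the next tick that still finds it stale; the text does not specify the actual timing. The warned flag, the event payload, and the function names are likewise hypothetical.

```typescript
import { EventEmitter } from "node:events";

const DEFAULT_STALE_THRESHOLD_MS = 5 * 60 * 1000; // 5 minutes, per the default above

type StaleAction = "ok" | "warn" | "kill";

// Hypothetical per-agent state; only activity.lastUpdate is named in the text.
interface AgentActivity {
  lastUpdate: number; // epoch ms of the last reported activity
  warned: boolean; // whether a stale warning has already been issued
}

const healthEvents = new EventEmitter();

// Decide what to do with one agent on one tick.
function checkStale(
  activity: AgentActivity,
  now: number,
  thresholdMs: number = DEFAULT_STALE_THRESHOLD_MS,
): StaleAction {
  if (now - activity.lastUpdate <= thresholdMs) {
    activity.warned = false; // fresh activity clears an earlier warning
    return "ok";
  }
  if (!activity.warned) {
    activity.warned = true;
    return "warn";
  }
  return "kill";
}

// Run the check and emit agent:stale whenever the agent is stale.
function tickStaleCheck(
  agentId: string,
  activity: AgentActivity,
  now: number = Date.now(),
): StaleAction {
  const action = checkStale(activity, now);
  if (action !== "ok") healthEvents.emit("agent:stale", { agentId, action });
  return action;
}
```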
Settings are configured in .polpo/polpo.json under the settings object.
Volatile Agent Cleanup
Volatile agents (scoped to a mission) are cleaned up when their mission completes. A cleanedGroups set prevents repeated cleanup attempts:
- Mission reaches terminal state (completed/failed/cancelled)
- Find all volatile agents for that mission group
- Remove them from the team
- Mark the group as cleaned
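The steps above might look like this in code. Only the cleanedGroups set is named in the text; the Agent shape, the non-terminal status names, and the VolatileCleaner class are assumptions made for the sketch.

```typescript
// Hypothetical agent shape.
interface Agent {
  name: string;
  volatile: boolean;
  missionGroup?: string;
}

// "pending" and "running" are assumed names for non-terminal states.
type MissionStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

const TERMINAL_STATUSES: ReadonlySet<MissionStatus> = new Set<MissionStatus>([
  "completed",
  "failed",
  "cancelled",
]);

class VolatileCleaner {
  private readonly cleanedGroups = new Set<string>();

  // Returns the team after cleanup. The input array is returned unchanged
  // when the mission is not terminal or the group was already cleaned.
  cleanup(team: Agent[], group: string, status: MissionStatus): Agent[] {
    if (!TERMINAL_STATUSES.has(status) || this.cleanedGroups.has(group)) {
      return team;
    }
    const remaining = team.filter(
      (agent) => !(agent.volatile && agent.missionGroup === group),
    );
    this.cleanedGroups.add(group);
    return remaining;
  }
}
```

Recording the group in the set makes cleanup idempotent, so a terminal mission that is observed on several ticks is only processed once.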
Database Safety
All SQLite operations use:
- WAL mode: Write-ahead logging for concurrent reads
- busy_timeout=5000: Wait up to 5 seconds for locks
- Atomic writes: Task transitions use write-then-rename pattern
- Own connections: The orchestrator and each runner have independent SQLite connections
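For readers unfamiliar with the write-then-rename pattern, here is a generic file-based sketch using Node's fs module. How Polpo applies the pattern to task transitions is not specified in this section, so the TaskState shape and the function names are illustrative only.

```typescript
import { closeSync, fsyncSync, openSync, readFileSync, renameSync, writeSync } from "node:fs";
import { basename, dirname, join } from "node:path";

// Hypothetical task state, used only to have something to persist.
interface TaskState {
  id: string;
  status: string;
}

// Write to a temporary file in the same directory, then rename it over the target.
function atomicWriteFile(targetPath: string, contents: string): void {
  const tmpPath = join(dirname(targetPath), `.${basename(targetPath)}.${process.pid}.tmp`);
  const fd = openSync(tmpPath, "w");
  try {
    writeSync(fd, contents);
    fsyncSync(fd); // flush to disk before the rename makes the data visible
  } finally {
    closeSync(fd);
  }
  // rename is atomic on POSIX systems when both paths are on the same filesystem
  renameSync(tmpPath, targetPath);
}

function saveTaskState(path: string, state: TaskState): void {
  atomicWriteFile(path, JSON.stringify(state));
}

function loadTaskState(path: string): TaskState {
  return JSON.parse(readFileSync(path, "utf8")) as TaskState;
}
```

Readers of the target path see either the old contents or the new contents, never a partially written file, because the rename replaces the file in a single step.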