Polpo is designed to survive crashes without losing work. This guide explains the resilience mechanisms.

Architecture

Three components work together for crash resilience:
  1. Detached runners — agent processes run independently of Polpo’s main process
  2. RunStore — SQLite-backed registry of running processes
  3. Orphan recovery — automatic reconnection on restart

Detached Runners

When Polpo spawns an agent, it doesn’t run it in-process. Instead, it launches a detached runner as a separate Node.js process:
[Diagram: detached runner communication flow — the orchestrator writes the config file, spawns the runner, the runner initializes the agent and writes its results to the RunStore.]
Key properties:
  • Runner process has its own PID and survives Polpo crashes
  • Uses detached: true + unref() so Polpo can exit without killing it
  • Config is passed via a temporary JSON file in .polpo/tmp/
  • Results are written to SQLite (RunStore), not IPC
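The properties above can be sketched in a few lines of Node.js. This is a minimal illustration, not Polpo's actual internals: the runner script path and config shape are assumptions.

```typescript
import { spawn } from "node:child_process";
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch of launching a detached runner. Config is handed
// over via a temp JSON file rather than IPC, so no channel ties the
// runner's lifetime to the parent's.
export function spawnDetachedRunner(
  runnerScript: string,
  config: Record<string, unknown>,
): number {
  const dir = mkdtempSync(join(tmpdir(), "polpo-"));
  const configPath = join(dir, "run-config.json");
  writeFileSync(configPath, JSON.stringify(config));

  const child = spawn(process.execPath, [runnerScript, configPath], {
    detached: true, // give the runner its own process group
    stdio: "ignore", // no pipes back to the parent
  });
  child.unref(); // let the parent exit without waiting on the child

  return child.pid!; // the runner's own PID, recorded in the RunStore
}
```

Because the child is both `detached` and `unref()`'d, the orchestrator can crash or exit at any point without taking the runner down with it.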

RunStore

The RunStore tracks all runner processes in SQLite:
interface RunRecord {
  runId: string;
  taskId: string;
  agentName: string;
  pid: number;
  state: "running" | "completed" | "failed" | "killed";
  exitCode?: number;
  stdout?: string;
  stderr?: string;
  duration?: number;
  startedAt: string;
  endedAt?: string;
}
Both Polpo and its runners write to the same SQLite database with busy_timeout=5000 to handle concurrent access.
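Each record presumably maps onto one table row. A hypothetical DDL sketch consistent with the `RunRecord` interface above (the actual schema is not shown in these docs):

```sql
-- Hypothetical table mirroring the RunRecord interface.
CREATE TABLE IF NOT EXISTS runs (
  run_id     TEXT PRIMARY KEY,
  task_id    TEXT NOT NULL,
  agent_name TEXT NOT NULL,
  pid        INTEGER NOT NULL,
  state      TEXT NOT NULL
             CHECK (state IN ('running', 'completed', 'failed', 'killed')),
  exit_code  INTEGER,
  stdout     TEXT,
  stderr     TEXT,
  duration   INTEGER,  -- milliseconds
  started_at TEXT NOT NULL,
  ended_at   TEXT
);
```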

Recovery Flow

When Polpo starts, it checks for orphaned processes:
[Diagram: recovery flow — the orchestrator restarts, reads the RunStore, and checks each PID; alive processes keep running, dead processes get retried.]
This means:
  • If Polpo crashes, runners keep working. On restart, Polpo reconnects.
  • If a runner crashes, Polpo detects the dead PID and retries the task.
  • If both crash, on restart the dead PIDs trigger retries.
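The decision table above reduces to a PID liveness probe plus a retry hook. A minimal sketch, assuming the record and retry shapes (signal `0` is the standard POSIX existence check and delivers nothing):

```typescript
// Probe whether a process with this PID still exists.
export function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check, sends no signal
    return true;
  } catch {
    return false; // ESRCH (or EPERM on foreign processes): treat as gone
  }
}

// Hypothetical orphan-recovery pass over RunStore records.
export function recoverOrphans(
  records: { runId: string; pid: number; state: string }[],
  retry: (runId: string) => void,
): void {
  for (const r of records) {
    if (r.state !== "running") continue; // finished runs need nothing
    if (isPidAlive(r.pid)) continue; // runner survived; reconnect to it
    retry(r.runId); // dead PID: re-queue the task
  }
}
```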

Graceful Shutdown

When Polpo receives SIGTERM, SIGINT, or SIGHUP:
  1. Stop spawning new tasks
  2. SIGTERM all runner PIDs
  3. Wait for RunStore writes (runners log their exit state)
  4. Force-mark any remaining active records as killed
  5. Exit cleanly
// Signal handlers
process.on("SIGTERM", () => orchestrator.gracefulStop());
process.on("SIGHUP",  () => orchestrator.gracefulStop());
process.on("SIGINT",  () => orchestrator.gracefulStop());
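The five shutdown steps can be sketched as follows, with the RunStore reduced to an in-memory map and the signal call injected; the real orchestrator writes to SQLite and actually waits for runner exits, so this is an illustration of the sequence only.

```typescript
export type RunState = "running" | "completed" | "failed" | "killed";

// Hypothetical shutdown sequence over active runner records.
export function gracefulStop(
  active: Map<number, RunState>, // pid -> current state
  sendSignal: (pid: number) => void, // e.g. (pid) => process.kill(pid, "SIGTERM")
): void {
  // (1) caller has already stopped spawning new tasks
  // (2) SIGTERM every runner still marked running
  for (const [pid, state] of active) {
    if (state === "running") sendSignal(pid);
  }
  // (3) the real orchestrator now waits for runners to log their
  //     exit state to the RunStore before continuing
  // (4) force-mark anything still active as killed
  for (const [pid, state] of active) {
    if (state === "running") active.set(pid, "killed");
  }
  // (5) caller exits cleanly
}
```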
For automatic retry logic, escalation policies, and the fix phase, see Escalation Chain.

Health Checks

The orchestrator periodically checks active runners for two failure modes: timeouts and staleness.

Timeout Detection

Tasks that exceed maxDuration (default: 30 minutes) are killed:
{
  "settings": {
    "taskTimeout": 1800000
  }
}
Override maxDuration per task when creating tasks via the TUI/CLI:
orchestrator.addTask({
  title: "Long-running migration",
  maxDuration: 3600000  // 1 hour
})

Stale Detection

Agents that haven’t reported activity for staleThreshold (default: 5 minutes) get a warning, then are killed:
{
  "settings": {
    "staleThreshold": 300000
  }
}
The orchestrator checks activity.lastUpdate on each tick. Stale agents trigger the agent:stale event.
Settings are configured in .polpo/polpo.json under the settings object.
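The two checks above can be sketched as a single tick function. The record shape and defaults below are assumptions, with times in epoch milliseconds and the defaults matching `taskTimeout` and `staleThreshold` from the settings:

```typescript
// Hypothetical shape of an active run as seen by the health check.
interface ActiveRun {
  runId: string;
  startedAt: number; // epoch ms
  lastUpdate: number; // epoch ms, from activity.lastUpdate
  maxDuration?: number; // per-task override, if any
}

// One health-check tick: returns which runs to kill for timeout
// and which should trigger the agent:stale event.
export function healthTick(
  runs: ActiveRun[],
  now: number,
  settings = { taskTimeout: 1_800_000, staleThreshold: 300_000 },
): { timedOut: string[]; stale: string[] } {
  const timedOut: string[] = [];
  const stale: string[] = [];
  for (const run of runs) {
    const limit = run.maxDuration ?? settings.taskTimeout;
    if (now - run.startedAt > limit) {
      timedOut.push(run.runId); // exceeded maxDuration: kill
    } else if (now - run.lastUpdate > settings.staleThreshold) {
      stale.push(run.runId); // no recent activity: agent:stale
    }
  }
  return { timedOut, stale };
}
```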

Volatile Agent Cleanup

Volatile agents (scoped to a mission) are cleaned up when their mission completes. A cleanedGroups set prevents repeated cleanup attempts:
  1. Mission reaches terminal state (completed/failed/cancelled)
  2. Find all volatile agents for that mission group
  3. Remove them from the team
  4. Mark the group as cleaned
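The steps above can be sketched as a small guarded cleanup pass; the agent and team shapes here are assumptions for illustration:

```typescript
// Hypothetical agent record scoped to a mission group.
interface Agent {
  name: string;
  volatile: boolean;
  missionGroup?: string;
}

// Guard set: groups that have already been cleaned up.
const cleanedGroups = new Set<string>();

// Remove volatile agents for a finished mission, exactly once per group.
export function cleanupVolatileAgents(
  team: Agent[],
  missionGroup: string,
): Agent[] {
  if (cleanedGroups.has(missionGroup)) return team; // already cleaned; skip
  cleanedGroups.add(missionGroup); // mark so retries are no-ops
  return team.filter(
    (a) => !(a.volatile && a.missionGroup === missionGroup),
  );
}
```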

Database Safety

All SQLite operations use:
  • WAL mode: Write-ahead logging for concurrent reads
  • busy_timeout=5000: Wait up to 5 seconds for locks
  • Atomic writes: Task transitions use write-then-rename pattern
  • Own connections: The orchestrator and each runner have independent SQLite connections
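The first two settings correspond to standard SQLite pragmas; each connection would issue something like:

```sql
PRAGMA journal_mode = WAL;  -- readers proceed while a writer appends to the WAL
PRAGMA busy_timeout = 5000; -- block up to 5 s on a locked database before erroring
```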
If you need to reset corrupted state, delete .polpo/state.db (or .polpo/state.json) and restart. The orchestrator will recreate it.