Scoring

Each reviewer agent scores the task output across multiple dimensions on a 1-5 scale. Scores from the 3 parallel reviewers are merged via median consensus, and the weighted average determines pass/fail.

Default Dimensions

When no custom dimensions are specified, Polpo uses these defaults:

| Dimension | Weight | Description |
| --- | --- | --- |
| `correctness` | 0.35 | Task achieves its stated goals correctly |
| `completeness` | 0.30 | All requirements in the description are addressed |
| `code_quality` | 0.20 | Code is clean, readable, and maintainable |
| `edge_cases` | 0.15 | Edge cases and error conditions are handled |

Score Scale

Each dimension is scored 1-5:

| Score | Meaning |
| --- | --- |
| 1 | Completely wrong or missing |
| 2 | Major issues, barely functional |
| 3 | Acceptable but with notable gaps |
| 4 | Good quality with minor issues |
| 5 | Excellent, exceeds expectations |

Global Score

The global score is a weighted average of all dimension scores:

globalScore = Σ (dimension.score × dimension.weight)

A task passes when `globalScore >= threshold` (default threshold: 3.0).
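As a worked example of the formula, the following minimal sketch computes a global score from the default weights and applies the default threshold. The function names `global_score` and `passes` are hypothetical and are not part of Polpo; the sketch also assumes the weights sum to 1.0, as the defaults do.

```python
# Default dimension weights from the Default Dimensions table (they sum to 1.0).
DEFAULT_WEIGHTS = {
    "correctness": 0.35,
    "completeness": 0.30,
    "code_quality": 0.20,
    "edge_cases": 0.15,
}


def global_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-dimension scores (hypothetical helper)."""
    return sum(scores[name] * weight for name, weight in weights.items())


def passes(scores, threshold=3.0, weights=DEFAULT_WEIGHTS):
    """A task passes when its global score reaches the threshold."""
    return global_score(scores, weights) >= threshold


example = {"correctness": 4, "completeness": 3, "code_quality": 5, "edge_cases": 2}
print(round(global_score(example), 2), passes(example))
```

For the example scores the sum is 4 × 0.35 + 3 × 0.30 + 5 × 0.20 + 2 × 0.15 = 3.6, which clears the default threshold of 3.0.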

Multi-Judge Consensus

Each of the 3 reviewer agents independently analyzes the agent’s work and submits a structured review with per-dimension scores. For file-based tasks, reviewers explore the codebase with tools; for output-based tasks (APIs, emails, etc.), reviewers assess the execution timeline and outcomes directly.

Polpo merges the results using median scoring: entries that deviate from the median by more than 1.5 points are excluded as outliers. Consensus requires at least 2 of the 3 reviewers to succeed; if only 1 succeeds, its scores are used directly.

Each reviewer produces chain-of-thought reasoning per dimension:

{
  "dimension": "correctness",
  "score": 4,
  "reasoning": "The API endpoints correctly implement CRUD operations. GET /users returns paginated results. POST /users validates required fields. However, the PATCH endpoint doesn't handle partial updates correctly — it requires all fields.",
  "weight": 0.35
}

This reasoning is included in retry/fix feedback so agents know exactly what to improve.
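The median merge described in this section can be sketched for a single dimension as follows. This is an illustration under stated assumptions, not Polpo's implementation: the helper name `merge_dimension` is hypothetical, and the text does not say how the surviving scores are combined after outliers are dropped, so the sketch assumes the median is taken again over the scores that remain.

```python
from statistics import median

OUTLIER_DISTANCE = 1.5  # from the text: larger deviations from the median are dropped


def merge_dimension(scores):
    """Merge one dimension's scores from the reviewers that succeeded."""
    if not scores:
        raise ValueError("no successful reviews to merge")
    if len(scores) == 1:
        # Only one reviewer succeeded: its score is used directly.
        return float(scores[0])
    mid = median(scores)
    # Keep scores within the allowed distance; fall back to the median itself
    # if every score was excluded (possible with two far-apart scores).
    kept = [s for s in scores if abs(s - mid) <= OUTLIER_DISTANCE] or [mid]
    # Assumption: the merged value is the median of the surviving scores.
    return float(median(kept))


print(merge_dimension([4, 4, 1]))
```

With scores of 4, 4, and 1, the median is 4, the score of 1 deviates by 3 points and is excluded, and the merged score is 4.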

Custom Rubrics

For fine-grained control, define custom rubrics per dimension:

{
  "dimensions": [
    {
      "name": "performance",
      "weight": 0.5,
      "rubric": {
        "1": "O(n²) or worse algorithms, no caching",
        "2": "Basic functionality but poor algorithmic choices",
        "3": "Reasonable performance, standard patterns",
        "4": "Good performance with appropriate caching",
        "5": "Optimal algorithms, efficient caching, lazy loading"
      }
    }
  ]
}

Quality Gates

Quality gates are checkpoints defined in mission documents that control task execution flow. A gate sits between two phases of tasks: it waits for `afterTasks` to complete, evaluates pass/fail criteria, and then either unblocks `blocksTasks` or halts the mission.

Quality gate flow: Phase 1 tasks complete → quality gate (review, minScore:4) → passed → Phase 2 tasks unblocked

Gate Properties

| Property | Type | Description |
| --- | --- | --- |
| `name` | `string` | Gate identifier |
| `afterTasks` | `string[]` | Tasks that must complete before evaluation |
| `blocksTasks` | `string[]` | Tasks blocked until this gate passes |
| `minScore` | `number` | Minimum average score of `afterTasks` to pass (1-5 scale) |
| `requireAllPassed` | `boolean` | All `afterTasks` must have status `done` (not `failed`) |
| `notifyChannels` | `string[]` | Notification channels to alert on pass/fail |

Gate Configuration in Missions

{
  "name": "Build Feature X",
  "tasks": [
    { "title": "Implement core module", "assignTo": "backend-agent" },
    { "title": "Write unit tests", "assignTo": "test-agent" },
    { "title": "Integration testing", "assignTo": "test-agent" },
    { "title": "Deploy to staging", "assignTo": "devops-agent" }
  ],
  "qualityGates": [
    {
      "name": "code-review",
      "afterTasks": ["Implement core module", "Write unit tests"],
      "blocksTasks": ["Integration testing"],
      "minScore": 4.0,
      "requireAllPassed": true,
      "notifyChannels": ["slack-alerts"]
    },
    {
      "name": "integration-check",
      "afterTasks": ["Integration testing"],
      "blocksTasks": ["Deploy to staging"],
      "requireAllPassed": true
    }
  ]
}

Gate Evaluation

Gates are evaluated by the `MissionExecutor` when it checks whether blocked tasks can proceed. The evaluation:

1. Checks that all `afterTasks` are in a terminal state (`done` or `failed`).
2. If `requireAllPassed` is set, verifies that none of the `afterTasks` failed.
3. If `minScore` is set, computes the average assessment score across `afterTasks` and compares it against the threshold.

Gates are evaluated at most once: after a gate passes, the result is cached and the gate is not re-evaluated.
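The evaluation steps above can be sketched as a small function. The function name, the `"pending"` return value, and the shape of the `tasks` mapping are assumptions made for illustration; the sketch also assumes that a gate with `minScore` set fails when none of its tasks has a score, which the text does not specify. Caching of passed gates is not modeled.

```python
TERMINAL = {"done", "failed"}


def evaluate_gate(gate, tasks):
    """Return "pending", "passed", or "failed" for one gate.

    `tasks` maps a task title to a dict with a "status" and an optional
    "score" (hypothetical shape, for illustration only).
    """
    after = [tasks[title] for title in gate["afterTasks"]]

    # Step 1: every afterTask must be in a terminal state.
    if any(t["status"] not in TERMINAL for t in after):
        return "pending"

    # Step 2: with requireAllPassed, a single failed task fails the gate.
    if gate.get("requireAllPassed") and any(t["status"] == "failed" for t in after):
        return "failed"

    # Step 3: with minScore, compare the average assessment score.
    if "minScore" in gate:
        scores = [t["score"] for t in after if t.get("score") is not None]
        # Assumption: no scored tasks means the gate cannot meet minScore.
        if not scores or sum(scores) / len(scores) < gate["minScore"]:
            return "failed"

    return "passed"


gate = {
    "name": "code-review",
    "afterTasks": ["Implement core module", "Write unit tests"],
    "minScore": 4.0,
    "requireAllPassed": True,
}
tasks = {
    "Implement core module": {"status": "done", "score": 4.5},
    "Write unit tests": {"status": "done", "score": 4.0},
}
print(evaluate_gate(gate, tasks))
```

Here both tasks are `done` and their average score is 4.25, so the `code-review` gate passes.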

Mission Quality Threshold

Missions can define a `qualityThreshold`, the minimum weighted average score required across all completed tasks:

{
  "name": "Release Pipeline",
  "qualityThreshold": 3.5,
  "tasks": [
    { "title": "Critical path", "priority": 2.0, "assignTo": "agent-a" },
    { "title": "Nice-to-have", "priority": 0.5, "assignTo": "agent-b" }
  ]
}

If the weighted average score falls below the threshold, a `quality:threshold:failed` event is emitted.
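A worked example may help here. The sketch below assumes that each task's `priority` acts as its weight in the average, which the example configuration suggests but the text does not state outright. The function name `mission_score`, the `score` field, and the default priority of 1.0 are likewise hypothetical.

```python
def mission_score(tasks):
    """Priority-weighted average score over tasks that have a score."""
    scored = [t for t in tasks if t.get("score") is not None]
    # Assumption: a task without an explicit priority has weight 1.0.
    total_weight = sum(t.get("priority", 1.0) for t in scored)
    if total_weight == 0:
        return None
    weighted = sum(t["score"] * t.get("priority", 1.0) for t in scored)
    return weighted / total_weight


completed = [
    {"title": "Critical path", "priority": 2.0, "score": 4.0},
    {"title": "Nice-to-have", "priority": 0.5, "score": 2.0},
]
quality_threshold = 3.5
score = mission_score(completed)
print(round(score, 2), score >= quality_threshold)
```

Under these assumptions the mission scores (4.0 × 2.0 + 2.0 × 0.5) / 2.5 = 3.6, which meets the threshold of 3.5. If the two task scores were swapped, the result would be 2.4 and the mission would fall below the threshold.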

Quality Metrics

The `QualityController` aggregates metrics per entity (task, agent, or mission):

| Metric | Description |
| --- | --- |
| `totalAssessments` | Total assessments run |
| `passedAssessments` | Assessments that passed |
| `avgScore` | Average global score (1-5) |
| `minScore` / `maxScore` | Score range |
| `dimensionScores` | Per-dimension average scores |
| `totalRetries` | Total retries consumed |
| `totalFixes` | Total fix attempts consumed |
| `deadlinesMet` / `deadlinesMissed` | SLA tracking |
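To illustrate how the score-related metrics relate to individual assessments, here is a minimal sketch that aggregates a list of assessment records. The record shape (`score`, `passed`) and the function name `aggregate` are assumptions for illustration only; they do not describe the `QualityController` interface.

```python
def aggregate(assessments):
    """Summarize a list of {"score": float, "passed": bool} records."""
    scores = [a["score"] for a in assessments]
    return {
        "totalAssessments": len(assessments),
        "passedAssessments": sum(1 for a in assessments if a["passed"]),
        # Score statistics are undefined when nothing has been assessed yet.
        "avgScore": sum(scores) / len(scores) if scores else None,
        "minScore": min(scores) if scores else None,
        "maxScore": max(scores) if scores else None,
    }


history = [
    {"score": 3.0, "passed": True},
    {"score": 4.5, "passed": True},
    {"score": 2.0, "passed": False},
]
print(aggregate(history))
```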

SLA Monitor

The `SLAMonitor` tracks deadlines on tasks and missions, emitting warning and violation events as deadlines approach or pass. The monitor runs on Polpo’s tick loop, checking all active tasks and missions with deadlines at a configurable interval (default: 30 seconds).

SLA Configuration

{
  "settings": {
    "sla": {
      "warningThreshold": 0.8,
      "checkIntervalMs": 30000,
      "warningChannels": ["slack-alerts"],
      "violationChannels": ["slack-alerts", "pagerduty"],
      "violationAction": "notify"
    }
  }
}

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `warningThreshold` | `number` | `0.8` | Fraction of deadline elapsed before warning (0-1) |
| `checkIntervalMs` | `number` | `30000` | Check interval in milliseconds |
| `warningChannels` | `string[]` | `[]` | Channels for warning notifications |
| `violationChannels` | `string[]` | `[]` | Channels for violation notifications |
| `violationAction` | `"notify" \| "fail"` | `"notify"` | Action on violation |

The "fail" violation action immediately transitions the task to failed status, bypassing normal retry and escalation flows. Use with care.
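The warning and violation logic can be sketched as a single check run on each tick. This is an illustration, not the monitor's actual code: the function name `check_sla`, the `started_ms` parameter, and the returned dictionary are hypothetical, and the sketch assumes a warning fires once the elapsed fraction reaches `warningThreshold` (inclusive).

```python
def check_sla(started_ms, deadline_ms, now_ms, warning_threshold=0.8):
    """Classify a deadline as "ok", "warning", or "violated"."""
    if now_ms > deadline_ms:
        # Past the deadline: report how far over we are.
        return {"state": "violated", "overdueMs": now_ms - deadline_ms}

    total = deadline_ms - started_ms
    elapsed = now_ms - started_ms
    # Guard against a zero-length window (deadline equal to start time).
    fraction_used = elapsed / total if total > 0 else 1.0

    state = "warning" if fraction_used >= warning_threshold else "ok"
    return {
        "state": state,
        "elapsed": elapsed,
        "remaining": deadline_ms - now_ms,
        "fractionUsed": fraction_used,
    }


# A 100-second window checked at 50 s, 80 s, and 120 s after the start.
for now in (50_000, 80_000, 120_000):
    print(now, check_sla(0, 100_000, now)["state"])
```

With the default threshold of 0.8, the three checks classify as ok, warning, and violated respectively. What happens after a violation (notify or fail) is governed by `violationAction` and is not modeled here.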

Events

| Event | Payload | Description |
| --- | --- | --- |
| `quality:gate:passed` | `{ missionId, gateName, avgScore? }` | A quality gate passed |
| `quality:gate:failed` | `{ missionId, gateName, reason, avgScore? }` | A quality gate failed |
| `quality:threshold:failed` | `{ missionId, avgScore, threshold }` | Mission did not meet its quality threshold |
| `sla:warning` | `{ entityId, entityType, deadline, elapsed, remaining, percentUsed }` | Approaching deadline |
| `sla:violated` | `{ entityId, entityType, deadline, overdueMs }` | Deadline exceeded |
| `sla:met` | `{ entityId, entityType, deadline, marginMs }` | Completed before deadline |
