Each reviewer agent scores the task output across multiple dimensions on a 1-5 scale. Scores from the 3 parallel reviewers are merged via median consensus, and the weighted average determines pass/fail.

Default Dimensions

When no custom dimensions are specified, Polpo uses these defaults:
| Dimension | Weight | Description |
| --- | --- | --- |
| correctness | 0.35 | Task achieves its stated goals correctly |
| completeness | 0.30 | All requirements in the description are addressed |
| code_quality | 0.20 | Code is clean, readable, and maintainable |
| edge_cases | 0.15 | Edge cases and error conditions are handled |

Score Scale

Each dimension is scored 1-5:
| Score | Meaning |
| --- | --- |
| 1 | Completely wrong or missing |
| 2 | Major issues, barely functional |
| 3 | Acceptable but with notable gaps |
| 4 | Good quality with minor issues |
| 5 | Excellent, exceeds expectations |

Global Score

The global score is a weighted average of all dimension scores:
globalScore = Σ (dimension.score × dimension.weight)
Tasks pass when globalScore >= threshold (default: 3.0).
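As an illustration of the formula, here is a minimal Python sketch using the default weights. The function names are hypothetical and not part of Polpo's API. Dividing by the total weight and rounding the result are assumptions added so the sketch behaves sensibly when weights do not sum to exactly 1; with the default weights, the result matches the formula as written.

```python
# Default dimension weights, as listed under "Default Dimensions".
DEFAULT_WEIGHTS = {
    "correctness": 0.35,
    "completeness": 0.30,
    "code_quality": 0.20,
    "edge_cases": 0.15,
}


def global_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-dimension scores (each on the 1-5 scale)."""
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    total_weight = sum(weights[name] for name in scores)
    # Assumption: normalize by the total weight and round away
    # floating-point noise so boundary comparisons are stable.
    return round(weighted_sum / total_weight, 4)


def passes(scores, threshold=3.0, weights=DEFAULT_WEIGHTS):
    """A task passes when its global score reaches the threshold."""
    return global_score(scores, weights) >= threshold
```

For scores of 4, 3, 4, and 2 on the four default dimensions, the global score is 4 × 0.35 + 3 × 0.30 + 4 × 0.20 + 2 × 0.15 = 3.4, which passes the default threshold of 3.0.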

Multi-Judge Consensus

Each of the 3 reviewer agents independently analyzes the agent’s work and submits a structured review with per-dimension scores. For file-based tasks, reviewers explore the codebase with tools; for output-based tasks (APIs, emails, etc.), reviewers assess the execution timeline and outcomes directly.
Polpo merges the results using median scoring: entries that deviate from the median by more than 1.5 are excluded as outliers. Consensus requires at least 2 of the 3 reviewers to succeed; if only 1 succeeds, its scores are used directly.
Each reviewer produces chain-of-thought reasoning per dimension:
{
  "dimension": "correctness",
  "score": 4,
  "reasoning": "The API endpoints correctly implement CRUD operations. GET /users returns paginated results. POST /users validates required fields. However, the PATCH endpoint doesn't handle partial updates correctly — it requires all fields.",
  "weight": 0.35
}
This reasoning is included in retry/fix feedback so agents know exactly what to improve.
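The merge step for a single dimension could look roughly like the sketch below. The function name is hypothetical, and the source does not say how the final value is derived once outliers are removed; this sketch assumes the median is recomputed over the surviving scores.

```python
from statistics import median

# From the text: entries more than 1.5 away from the median are outliers.
OUTLIER_DEVIATION = 1.5


def merge_dimension(scores):
    """Merge one dimension's scores from the reviewers that succeeded."""
    if not scores:
        raise ValueError("no reviewer succeeded")
    if len(scores) == 1:
        # Only one reviewer succeeded: its score is used directly.
        return scores[0]
    mid = median(scores)
    kept = [s for s in scores if abs(s - mid) <= OUTLIER_DEVIATION]
    if not kept:
        # Assumption: if every score is an outlier, fall back to the median.
        return mid
    # Assumption: the consensus is the median of the surviving scores.
    return median(kept)
```

Under these assumptions, reviewer scores of 4, 4, and 5 merge to 4, while scores of 1, 4, and 5 drop the 1 as an outlier and merge to 4.5.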

Custom Rubrics

For fine-grained control, define custom rubrics per dimension:
{
  "dimensions": [
    {
      "name": "performance",
      "weight": 0.5,
      "rubric": {
        "1": "O(n²) or worse algorithms, no caching",
        "2": "Basic functionality but poor algorithmic choices",
        "3": "Reasonable performance, standard patterns",
        "4": "Good performance with appropriate caching",
        "5": "Optimal algorithms, efficient caching, lazy loading"
      }
    }
  ]
}

Quality Gates

Quality gates are checkpoints defined in mission documents that control task execution flow. A gate sits between two phases of tasks: it waits for afterTasks to complete, evaluates pass/fail criteria, and then either unblocks blocksTasks or halts the mission.
Quality gate flow: Phase 1 tasks complete → quality gate (review, minScore:4) → passed → Phase 2 tasks unblocked

Gate Properties

| Property | Type | Description |
| --- | --- | --- |
| name | string | Gate identifier |
| afterTasks | string[] | Tasks that must complete before evaluation |
| blocksTasks | string[] | Tasks blocked until this gate passes |
| minScore | number | Minimum average score of afterTasks to pass (1-5 scale) |
| requireAllPassed | boolean | All afterTasks must have status done (not failed) |
| notifyChannels | string[] | Notification channels to alert on pass/fail |

Gate Configuration in Missions

{
  "name": "Build Feature X",
  "tasks": [
    { "title": "Implement core module", "assignTo": "backend-agent" },
    { "title": "Write unit tests", "assignTo": "test-agent" },
    { "title": "Integration testing", "assignTo": "test-agent" },
    { "title": "Deploy to staging", "assignTo": "devops-agent" }
  ],
  "qualityGates": [
    {
      "name": "code-review",
      "afterTasks": ["Implement core module", "Write unit tests"],
      "blocksTasks": ["Integration testing"],
      "minScore": 4.0,
      "requireAllPassed": true,
      "notifyChannels": ["slack-alerts"]
    },
    {
      "name": "integration-check",
      "afterTasks": ["Integration testing"],
      "blocksTasks": ["Deploy to staging"],
      "requireAllPassed": true
    }
  ]
}

Gate Evaluation

Gates are evaluated by the MissionExecutor when checking if blocked tasks can proceed:
  1. Checks that all afterTasks are in a terminal state (done or failed)
  2. If requireAllPassed is set, verifies none of the afterTasks failed
  3. If minScore is set, computes the average assessment score across afterTasks and compares against the threshold
Each gate is evaluated at most once: after it passes, the result is cached and the gate is not re-evaluated.
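The three checks can be sketched as follows. This is not the MissionExecutor's actual code: the `Task` and `Gate` classes, the `"pending"` return value, and the `"in_progress"` status are hypothetical names chosen for illustration, and failing a `minScore` gate when no task has a score is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    title: str
    status: str                    # e.g. "in_progress", "done", "failed"
    score: Optional[float] = None  # global assessment score, if assessed


@dataclass
class Gate:
    name: str
    after_tasks: list
    blocks_tasks: list
    min_score: Optional[float] = None
    require_all_passed: bool = False


def evaluate_gate(gate, tasks_by_title):
    """Return "pending", "passed", or "failed" for a gate."""
    tasks = [tasks_by_title[title] for title in gate.after_tasks]
    # 1. Every afterTask must be in a terminal state.
    if any(t.status not in ("done", "failed") for t in tasks):
        return "pending"
    # 2. requireAllPassed: none of the afterTasks may have failed.
    if gate.require_all_passed and any(t.status == "failed" for t in tasks):
        return "failed"
    # 3. minScore: compare the average assessment score to the threshold.
    if gate.min_score is not None:
        scores = [t.score for t in tasks if t.score is not None]
        # Assumption: a minScore gate with no scored tasks fails.
        if not scores or sum(scores) / len(scores) < gate.min_score:
            return "failed"
    return "passed"
```

With the code-review gate from the mission example (`minScore` 4.0), two done tasks scored 4.5 and 3.5 average exactly 4.0, so the gate passes and "Integration testing" is unblocked.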

Mission Quality Threshold

Missions can define a qualityThreshold — a minimum weighted average score across all completed tasks:
{
  "name": "Release Pipeline",
  "qualityThreshold": 3.5,
  "tasks": [
    { "title": "Critical path", "priority": 2.0, "assignTo": "agent-a" },
    { "title": "Nice-to-have", "priority": 0.5, "assignTo": "agent-b" }
  ]
}
If the weighted average score falls below the threshold, a quality:threshold:failed event is emitted.
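The source does not spell out which weights the average uses. The sketch below assumes each task's `priority` serves as its weight; the function names and the `emit` callback are hypothetical. The event name and payload fields follow the Events table.

```python
def mission_score(tasks):
    """Priority-weighted average over (score, priority) pairs."""
    total_priority = sum(priority for _, priority in tasks)
    if total_priority == 0:
        return None  # nothing to average yet
    return sum(score * priority for score, priority in tasks) / total_priority


def check_threshold(mission_id, tasks, threshold, emit):
    """Emit quality:threshold:failed when the mission score is too low."""
    avg = mission_score(tasks)
    if avg is not None and avg < threshold:
        emit("quality:threshold:failed",
             {"missionId": mission_id, "avgScore": avg, "threshold": threshold})
        return False
    return True


# Usage: record emitted events in a list.
events = []
ok = check_threshold(
    "release-pipeline",
    [(3.0, 2.0), (5.0, 0.5)],  # (score, priority) for the two tasks above
    3.5,
    lambda name, payload: events.append((name, payload)),
)
```

Under this assumption, the high-priority task dominates: scores of 3.0 (priority 2.0) and 5.0 (priority 0.5) give (6.0 + 2.5) / 2.5 = 3.4, which is below 3.5 even though the unweighted mean is 4.0.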

Quality Metrics

The QualityController aggregates metrics per entity (task, agent, or mission):
| Metric | Description |
| --- | --- |
| totalAssessments | Total assessments run |
| passedAssessments | Assessments that passed |
| avgScore | Average global score (1-5) |
| minScore / maxScore | Score range |
| dimensionScores | Per-dimension average scores |
| totalRetries | Total retries consumed |
| totalFixes | Total fix attempts consumed |
| deadlinesMet / deadlinesMissed | SLA tracking |
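To make the score-related metrics concrete, here is a rough sketch of how they could be derived from a list of assessment records. The input record shape (`globalScore`, `passed`, `dimensions`) and the function name are assumptions for illustration, not the QualityController's actual data model; retry, fix, and deadline counters are omitted.

```python
from collections import defaultdict


def aggregate_metrics(assessments):
    """Summarize assessment records for one entity (task, agent, or mission)."""
    scores = [a["globalScore"] for a in assessments]
    per_dimension = defaultdict(list)
    for a in assessments:
        for name, score in a["dimensions"].items():
            per_dimension[name].append(score)
    return {
        "totalAssessments": len(assessments),
        "passedAssessments": sum(1 for a in assessments if a["passed"]),
        "avgScore": sum(scores) / len(scores) if scores else None,
        "minScore": min(scores, default=None),
        "maxScore": max(scores, default=None),
        # Average each dimension across all assessments that scored it.
        "dimensionScores": {
            name: sum(values) / len(values)
            for name, values in per_dimension.items()
        },
    }
```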

SLA Monitor

The SLAMonitor tracks deadlines on tasks and missions, emitting warning and violation events as deadlines approach or pass. The monitor runs on Polpo’s tick loop, checking all active tasks and missions with deadlines at a configurable interval (default: 30 seconds).
SLA monitor: task with deadline → 80% elapsed: warning → deadline passed: violated → completes in time: met

SLA Configuration

{
  "settings": {
    "sla": {
      "warningThreshold": 0.8,
      "checkIntervalMs": 30000,
      "warningChannels": ["slack-alerts"],
      "violationChannels": ["slack-alerts", "pagerduty"],
      "violationAction": "notify"
    }
  }
}
| Property | Type | Default | Description |
| --- | --- | --- | --- |
| warningThreshold | number | 0.8 | Fraction of deadline elapsed before warning (0-1) |
| checkIntervalMs | number | 30000 | Check interval in milliseconds |
| warningChannels | string[] | [] | Channels for warning notifications |
| violationChannels | string[] | [] | Channels for violation notifications |
| violationAction | "notify" \| "fail" | "notify" | Action on violation |
The "fail" violation action immediately transitions the task to failed status, bypassing normal retry and escalation flows. Use with care.
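The warning and violation logic can be illustrated with a small classifier. This is a sketch under stated assumptions: the elapsed fraction is measured from a start timestamp (the source does not say which reference point Polpo uses), and the function name and the `"ok"` label are hypothetical.

```python
def sla_status(started_at_ms, deadline_ms, now_ms,
               completed_at_ms=None, warning_threshold=0.8):
    """Classify a deadline as "ok", "warning", "violated", or "met"."""
    if completed_at_ms is not None:
        # Finished work is judged by when it completed.
        return "met" if completed_at_ms <= deadline_ms else "violated"
    if now_ms > deadline_ms:
        return "violated"
    window = deadline_ms - started_at_ms
    # Assumption: percent used is measured from the start timestamp.
    percent_used = (now_ms - started_at_ms) / window if window > 0 else 1.0
    return "warning" if percent_used >= warning_threshold else "ok"
```

With a 100-second window and the default `warningThreshold` of 0.8, a check at 50 seconds is still within budget, a check at 80 seconds yields a warning, and any check after 100 seconds yields a violation.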

Events

| Event | Payload | Description |
| --- | --- | --- |
| quality:gate:passed | { missionId, gateName, avgScore? } | A quality gate passed |
| quality:gate:failed | { missionId, gateName, reason, avgScore? } | A quality gate failed |
| quality:threshold:failed | { missionId, avgScore, threshold } | Mission did not meet its quality threshold |
| sla:warning | { entityId, entityType, deadline, elapsed, remaining, percentUsed } | Approaching deadline |
| sla:violated | { entityId, entityType, deadline, overdueMs } | Deadline exceeded |
| sla:met | { entityId, entityType, deadline, marginMs } | Completed before deadline |