Each reviewer agent scores the task output across multiple dimensions on a 1-5 scale. Scores from the 3 parallel reviewers are merged via median consensus, and the weighted average determines pass/fail.
Default Dimensions
When no custom dimensions are specified, Polpo uses these defaults:
| Dimension | Weight | Description |
|---|---|---|
| correctness | 0.35 | Task achieves its stated goals correctly |
| completeness | 0.30 | All requirements in the description are addressed |
| code_quality | 0.20 | Code is clean, readable, and maintainable |
| edge_cases | 0.15 | Edge cases and error conditions are handled |
Score Scale
Each dimension is scored 1-5:
| Score | Meaning |
|---|---|
| 1 | Completely wrong or missing |
| 2 | Major issues, barely functional |
| 3 | Acceptable but with notable gaps |
| 4 | Good quality with minor issues |
| 5 | Excellent, exceeds expectations |
Global Score
The global score is a weighted average of all dimension scores:
globalScore = Σ (dimension.score × dimension.weight)
Tasks pass when globalScore >= threshold (default: 3.0).
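As a minimal sketch of the formula above, the following computes the global score from the default weights and applies the default threshold. The function names and the example scores are illustrative, not part of Polpo's API. Note that the sum is a true weighted average only when the weights sum to 1, as the defaults do.

```python
# Default weights from the Default Dimensions table.
DEFAULT_WEIGHTS = {
    "correctness": 0.35,
    "completeness": 0.30,
    "code_quality": 0.20,
    "edge_cases": 0.15,
}


def global_score(scores, weights=DEFAULT_WEIGHTS):
    """Sum of score x weight over all dimensions."""
    return sum(scores[name] * weight for name, weight in weights.items())


def passes(scores, threshold=3.0, weights=DEFAULT_WEIGHTS):
    """A task passes when its global score reaches the threshold."""
    return global_score(scores, weights) >= threshold


# Hypothetical review: strong on correctness, weak on edge cases.
example = {"correctness": 4, "completeness": 3, "code_quality": 4, "edge_cases": 2}
print(round(global_score(example), 2), passes(example))
```

With these scores the sum is 1.4 + 0.9 + 0.8 + 0.3 = 3.4, which clears the default threshold of 3.0 but would fail a stricter threshold of 3.5.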
Multi-Judge Consensus
Each of the 3 reviewer agents independently analyzes the agent’s work and submits a structured review with per-dimension scores. For file-based tasks, reviewers explore the codebase with tools; for output-based tasks (APIs, emails, etc.), reviewers assess the execution timeline and outcomes directly. Polpo merges the results using median scoring: entries that deviate from the median by more than 1.5 points are excluded as outliers. Consensus requires at least 2 of the 3 reviewers to succeed; if only 1 succeeds, its scores are used directly.
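The merge step can be sketched roughly as follows. This is an illustration under assumptions, not Polpo's implementation: it assumes the merge runs once per dimension, and that the consensus value is the median of the scores that remain after outliers are dropped. The function name `merge_dimension` is hypothetical.

```python
from statistics import median

OUTLIER_DEVIATION = 1.5  # from the text: max allowed distance from the median


def merge_dimension(scores):
    """Merge one dimension's scores from the reviewers that succeeded."""
    if not scores:
        raise ValueError("no reviewer succeeded")
    if len(scores) == 1:
        return scores[0]  # only one reviewer succeeded: use its score directly
    mid = median(scores)
    kept = [s for s in scores if abs(s - mid) <= OUTLIER_DEVIATION]
    # Assumption: consensus is the median of the surviving scores. If every
    # score is an outlier (two far-apart reviewers), fall back to the median.
    return median(kept) if kept else mid


print(merge_dimension([4, 4, 1]))  # the 1 is 3 points from the median and is dropped
```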
Each reviewer produces chain-of-thought reasoning per dimension:
{
  "dimension": "correctness",
  "score": 4,
  "reasoning": "The API endpoints correctly implement CRUD operations. GET /users returns paginated results. POST /users validates required fields. However, the PATCH endpoint doesn't handle partial updates correctly: it requires all fields.",
  "weight": 0.35
}
This reasoning is included in retry/fix feedback so agents know exactly what to improve.
Custom Rubrics
For fine-grained control, define custom rubrics per dimension:
{
  "dimensions": [
    {
      "name": "performance",
      "weight": 0.5,
      "rubric": {
        "1": "O(n²) or worse algorithms, no caching",
        "2": "Basic functionality but poor algorithmic choices",
        "3": "Reasonable performance, standard patterns",
        "4": "Good performance with appropriate caching",
        "5": "Optimal algorithms, efficient caching, lazy loading"
      }
    }
  ]
}
Quality Gates
Quality gates are checkpoints defined in mission documents that control task execution flow. A gate sits between two phases of tasks: it waits for afterTasks to complete, evaluates pass/fail criteria, and then either unblocks blocksTasks or halts the mission.
Gate Properties
| Property | Type | Description |
|---|---|---|
| name | string | Gate identifier |
| afterTasks | string[] | Tasks that must complete before evaluation |
| blocksTasks | string[] | Tasks blocked until this gate passes |
| minScore | number | Minimum average score of afterTasks to pass (1-5 scale) |
| requireAllPassed | boolean | All afterTasks must have status done (not failed) |
| notifyChannels | string[] | Notification channels to alert on pass/fail |
Gate Configuration in Missions
{
  "name": "Build Feature X",
  "tasks": [
    { "title": "Implement core module", "assignTo": "backend-agent" },
    { "title": "Write unit tests", "assignTo": "test-agent" },
    { "title": "Integration testing", "assignTo": "test-agent" },
    { "title": "Deploy to staging", "assignTo": "devops-agent" }
  ],
  "qualityGates": [
    {
      "name": "code-review",
      "afterTasks": ["Implement core module", "Write unit tests"],
      "blocksTasks": ["Integration testing"],
      "minScore": 4.0,
      "requireAllPassed": true,
      "notifyChannels": ["slack-alerts"]
    },
    {
      "name": "integration-check",
      "afterTasks": ["Integration testing"],
      "blocksTasks": ["Deploy to staging"],
      "requireAllPassed": true
    }
  ]
}
Gate Evaluation
Gates are evaluated by the MissionExecutor when checking if blocked tasks can proceed:
- Checks that all afterTasks are in a terminal state (done or failed)
- If requireAllPassed is set, verifies that none of the afterTasks failed
- If minScore is set, computes the average assessment score across afterTasks and compares it against the threshold
Gates are evaluated at most once — after passing, they are cached and not re-evaluated.
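The evaluation steps above can be sketched as a single function. This is a hedged illustration rather than the MissionExecutor's actual code: the `Gate` dataclass, the `evaluate_gate` function, the shape of the `tasks` mapping, and the "pending" return value are all assumptions made for the example. It also assumes that only passing results are cached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gate:
    name: str
    afterTasks: list
    blocksTasks: list
    minScore: Optional[float] = None
    requireAllPassed: bool = False


def evaluate_gate(gate, tasks, passed_cache):
    """Return "pending", "passed", or "failed" for a gate.

    tasks maps a task title to {"status": str, "score": float or None}.
    passed_cache is a set holding the names of gates that already passed.
    """
    if gate.name in passed_cache:
        return "passed"  # cached: a passed gate is not re-evaluated

    upstream = [tasks[title] for title in gate.afterTasks]

    # Step 1: every upstream task must be in a terminal state.
    if any(t["status"] not in ("done", "failed") for t in upstream):
        return "pending"

    # Step 2: optionally require that no upstream task failed.
    if gate.requireAllPassed and any(t["status"] == "failed" for t in upstream):
        return "failed"

    # Step 3: optionally compare the average assessment score to minScore.
    if gate.minScore is not None:
        scores = [t["score"] for t in upstream if t["score"] is not None]
        # Assumption: a gate with minScore but no scored tasks fails.
        if not scores or sum(scores) / len(scores) < gate.minScore:
            return "failed"

    passed_cache.add(gate.name)
    return "passed"
```

For the code-review gate in the mission example, two upstream tasks scored 4.5 and 4.0 average 4.25, which meets the minScore of 4.0 and unblocks "Integration testing".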
Mission Quality Threshold
Missions can define a qualityThreshold — a minimum weighted average score across all completed tasks:
{
  "name": "Release Pipeline",
  "qualityThreshold": 3.5,
  "tasks": [
    { "title": "Critical path", "priority": 2.0, "assignTo": "agent-a" },
    { "title": "Nice-to-have", "priority": 0.5, "assignTo": "agent-b" }
  ]
}
If the weighted average score falls below the threshold, a quality:threshold:failed event is emitted.
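The following sketch shows one way the threshold check could work. It assumes that each task's priority acts as its weight in the average and that a missing priority counts as 1.0; the text does not state either point, so treat both as guesses. The function names and the per-task `score` field are hypothetical.

```python
def mission_score(tasks):
    """Priority-weighted average of task scores.

    Assumption: priority is the weight, defaulting to 1.0 when absent.
    """
    total_weight = sum(t.get("priority", 1.0) for t in tasks)
    if total_weight <= 0:
        raise ValueError("total priority must be positive")
    weighted = sum(t["score"] * t.get("priority", 1.0) for t in tasks)
    return weighted / total_weight


def check_threshold(mission_id, tasks, threshold):
    """Return a quality:threshold:failed event dict, or None if the mission passes."""
    avg = mission_score(tasks)
    if avg < threshold:
        return {
            "event": "quality:threshold:failed",
            "payload": {"missionId": mission_id, "avgScore": avg, "threshold": threshold},
        }
    return None


# Hypothetical scores for the two tasks in the Release Pipeline example.
completed = [
    {"title": "Critical path", "priority": 2.0, "score": 4.0},
    {"title": "Nice-to-have", "priority": 0.5, "score": 2.0},
]
```

Here the weighted average is (4.0 × 2.0 + 2.0 × 0.5) / 2.5 = 3.6, so the mission clears a threshold of 3.5 even though one task scored 2.0.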
Quality Metrics
The QualityController aggregates metrics per entity (task, agent, or mission):
| Metric | Description |
|---|---|
| totalAssessments | Total assessments run |
| passedAssessments | Assessments that passed |
| avgScore | Average global score (1-5) |
| minScore / maxScore | Score range |
| dimensionScores | Per-dimension average scores |
| totalRetries | Total retries consumed |
| totalFixes | Total fix attempts consumed |
| deadlinesMet / deadlinesMissed | SLA tracking |
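To make the score-related metrics concrete, here is a small sketch that derives them from a list of assessments. The input shape (`globalScore`, `passed`, `dimensions`) and the function name are assumptions for illustration; the QualityController's real data model is not described here. Retry, fix, and deadline counters are omitted because they come from other sources.

```python
def aggregate_metrics(assessments):
    """Aggregate score metrics for one entity.

    Each assessment is assumed to look like:
    {"globalScore": float, "passed": bool, "dimensions": {name: score}}
    """
    if not assessments:
        return {"totalAssessments": 0, "passedAssessments": 0}

    scores = [a["globalScore"] for a in assessments]

    # Collect every score seen for each dimension.
    by_dimension = {}
    for a in assessments:
        for name, score in a["dimensions"].items():
            by_dimension.setdefault(name, []).append(score)

    return {
        "totalAssessments": len(assessments),
        "passedAssessments": sum(1 for a in assessments if a["passed"]),
        "avgScore": sum(scores) / len(scores),
        "minScore": min(scores),
        "maxScore": max(scores),
        "dimensionScores": {n: sum(v) / len(v) for n, v in by_dimension.items()},
    }
```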
SLA Monitor
The SLAMonitor tracks deadlines on tasks and missions, emitting warning and violation events as deadlines approach or pass.
The monitor runs on Polpo’s tick loop, checking all active tasks and missions with deadlines at a configurable interval (default: 30 seconds).
SLA Configuration
{
  "settings": {
    "sla": {
      "warningThreshold": 0.8,
      "checkIntervalMs": 30000,
      "warningChannels": ["slack-alerts"],
      "violationChannels": ["slack-alerts", "pagerduty"],
      "violationAction": "notify"
    }
  }
}
| Property | Type | Default | Description |
|---|---|---|---|
| warningThreshold | number | 0.8 | Fraction of deadline elapsed before warning (0-1) |
| checkIntervalMs | number | 30000 | Check interval in milliseconds |
| warningChannels | string[] | [] | Channels for warning notifications |
| violationChannels | string[] | [] | Channels for violation notifications |
| violationAction | "notify" \| "fail" | "notify" | Action on violation |
The "fail" violation action immediately transitions the task to failed status, bypassing normal retry and escalation flows. Use with care.
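A single check of the kind the monitor performs on each interval might look like the sketch below. Several details are assumptions: the deadline budget is measured from a start timestamp, `percentUsed` is expressed as a fraction between 0 and 1, and the function name `check_sla` is hypothetical. The sketch is also stateless, so it does not deduplicate repeated warnings the way a real monitor presumably would.

```python
def check_sla(entity_id, entity_type, started_at_ms, deadline_ms, now_ms,
              warning_threshold=0.8):
    """Return (event_name, payload) for a warning or violation, else None."""
    base = {"entityId": entity_id, "entityType": entity_type, "deadline": deadline_ms}

    # Past the deadline: report how late we are.
    if now_ms > deadline_ms:
        return "sla:violated", {**base, "overdueMs": now_ms - deadline_ms}

    budget = deadline_ms - started_at_ms
    elapsed = now_ms - started_at_ms
    # Assumption: a zero or negative budget counts as fully used.
    fraction_used = elapsed / budget if budget > 0 else 1.0

    if fraction_used >= warning_threshold:
        return "sla:warning", {
            **base,
            "elapsed": elapsed,
            "remaining": deadline_ms - now_ms,
            "percentUsed": fraction_used,
        }
    return None
```

With a 100-second budget and the default warningThreshold of 0.8, the first warning would fire once 80 seconds have elapsed.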
Events
| Event | Payload | Description |
|---|---|---|
| quality:gate:passed | { missionId, gateName, avgScore? } | A quality gate passed |
| quality:gate:failed | { missionId, gateName, reason, avgScore? } | A quality gate failed |
| quality:threshold:failed | { missionId, avgScore, threshold } | Mission did not meet its quality threshold |
| sla:warning | { entityId, entityType, deadline, elapsed, remaining, percentUsed } | Approaching deadline |
| sla:violated | { entityId, entityType, deadline, overdueMs } | Deadline exceeded |
| sla:met | { entityId, entityType, deadline, marginMs } | Completed before deadline |
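As an illustration of how the SLA events relate to the SLA settings, the sketch below maps an event name to the channels to notify and to whether the task should be failed. The function `route_sla_event` is hypothetical, and it assumes that sla:met triggers no notification, which the text does not confirm. Defaults follow the SLA Configuration table.

```python
def route_sla_event(event_name, sla_settings):
    """Return (channels_to_notify, should_fail_task) for an SLA event."""
    if event_name == "sla:warning":
        return list(sla_settings.get("warningChannels", [])), False
    if event_name == "sla:violated":
        action = sla_settings.get("violationAction", "notify")
        return list(sla_settings.get("violationChannels", [])), action == "fail"
    # Assumption: other events, including sla:met, notify no SLA channel.
    return [], False


# Settings copied from the SLA Configuration example.
sla = {
    "warningThreshold": 0.8,
    "checkIntervalMs": 30000,
    "warningChannels": ["slack-alerts"],
    "violationChannels": ["slack-alerts", "pagerduty"],
    "violationAction": "notify",
}
print(route_sla_event("sla:violated", sla))
```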