When an agent finishes a task, Polpo reviews the work automatically. Three independent G-Eval reviewer agents run in parallel, scoring the agent’s work across multiple dimensions. For tasks that produce file changes, reviewers explore the codebase with tools (file reads, grep, glob). For tasks that work via external tools, APIs, or text output (e.g. email drafts, web requests), reviewers analyze the agent’s execution timeline, tool usage, and registered outcomes directly — no filesystem exploration needed. Scores are merged via median consensus — outliers are filtered, and the final score reflects agreement between reviewers.

A separate expectation judge acts as a meta-judge: when a task fails, Polpo first checks whether the expectations themselves were wrong (bad test command, wrong file path, threshold too strict) before blaming the agent.
Review system: agent completes task → run expectations → 3 parallel G-Eval reviewers read code and score → median consensus → passed? yes → done, no → expectation judge
The review pipeline runs in order:
  1. Collect context — gather task description, expectations, agent output, stderr, execution timeline (from JSONL transcript logs), tool usage stats, and registered outcomes
  2. File existence checks — verify required files exist
  3. Test/script execution — run test commands and validation scripts
  4. Diff analysis — analyze files created and edited by the agent
  5. Multi-judge LLM review — 3 parallel G-Eval reviewers score all dimensions independently
  6. Consensus — median aggregation across reviewers, outlier filtering
  7. Result determination — pass/fail decision with feedback generation
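
The median consensus in steps 5–6 can be sketched as below. The exact outlier rule is internal to Polpo, so the fixed deviation cutoff here is an assumption:

```typescript
// Sketch of median-consensus aggregation across reviewer scores.
// The outlier rule (drop scores far from the median) is an assumption,
// not Polpo's exact implementation.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

function consensus(scores: number[], maxDeviation = 1.5): number {
  const m = median(scores);
  // Keep only scores within maxDeviation of the median, then re-take the median.
  const kept = scores.filter((s) => Math.abs(s - m) <= maxDeviation);
  return median(kept.length > 0 ? kept : scores);
}

consensus([3.5, 4.0, 1.0]); // the 1.0 outlier is dropped → median of [3.5, 4.0] = 3.75
```

A single reviewer scoring far below its peers is filtered out, so one hallucinated failure cannot sink an otherwise-agreed pass.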

Expectation Types

Tasks can define four types of expectations:

test — Run a Test Command

{
  "expectations": [
    {
      "type": "test",
      "command": "npm test"
    }
  ]
}
Runs the command and checks the exit code. Exit code 0 = pass.
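
A minimal sketch of this check in Node, assuming the runner simply shells out and inspects the exit code (runTestExpectation is an illustrative name, not Polpo’s API):

```typescript
// Sketch: a "test" expectation passes iff the command exits with code 0.
// runTestExpectation is an illustrative name, not Polpo's API.
import { spawnSync } from "node:child_process";

function runTestExpectation(command: string, cwd: string): boolean {
  // shell: true lets a command line like "npm test" be parsed by the shell
  const result = spawnSync(command, { shell: true, cwd, encoding: "utf8" });
  return result.status === 0; // exit code 0 = pass
}
```

A non-zero status, or a spawn failure (where status is null), both count as a fail.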

file_exists — Check File Presence

{
  "expectations": [
    {
      "type": "file_exists",
      "paths": [
        "src/routes/users.ts",
        "src/routes/posts.ts"
      ]
    }
  ]
}
Verifies that the specified files exist in the working directory.
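
This check can be sketched as a filter over the listed paths; returning the missing ones is what enables specific failure feedback (names are illustrative):

```typescript
// Sketch: a "file_exists" expectation passes only when every listed path
// exists under the working directory. Names are illustrative, not Polpo's API.
import { existsSync } from "node:fs";
import { join } from "node:path";

function checkFileExists(
  paths: string[],
  workdir: string,
): { passed: boolean; missing: string[] } {
  const missing = paths.filter((p) => !existsSync(join(workdir, p)));
  return { passed: missing.length === 0, missing };
}
```

Returning the missing paths, rather than a bare boolean, is what lets the agent be told exactly which file was not produced.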

script — Run a Custom Script

{
  "expectations": [
    {
      "type": "script",
      "command": "node scripts/validate-schema.js"
    }
  ]
}
Similar to test but intended for custom validation scripts that check specific conditions.

llm_review — AI-Powered Code Review

{
  "expectations": [
    {
      "type": "llm_review",
      "criteria": "Verify proper error handling and input validation",
      "threshold": 3.5,
      "dimensions": [
        {
          "name": "security",
          "weight": 0.5,
          "rubric": "No SQL injection, XSS, or authentication bypass vulnerabilities"
        },
        {
          "name": "error_handling",
          "weight": 0.5,
          "rubric": "All error paths return proper HTTP status codes with messages"
        }
      ]
    }
  ]
}
The most powerful expectation type. Three independent LLM reviewer agents explore the agent’s output with tools and score the work against configurable dimensions. Scores are merged via median consensus. See Scoring for dimensions, rubrics, and score mechanics.
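
Assuming the overall score is a weight-normalized average of the per-dimension scores compared against the threshold (see Scoring for the authoritative mechanics), the pass decision can be sketched as:

```typescript
// Sketch: weighted average of dimension scores vs. the threshold.
// The 1–5 scale and the weighted-mean rule are assumptions here;
// see the Scoring docs for the exact mechanics.
interface Dimension {
  name: string;
  weight: number;
  score: number; // assumed 1–5 scale
}

function reviewPassed(dimensions: Dimension[], threshold: number): boolean {
  // Normalize so weights need not sum exactly to 1.
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  const weighted =
    dimensions.reduce((sum, d) => sum + d.weight * d.score, 0) / totalWeight;
  return weighted >= threshold;
}

reviewPassed(
  [
    { name: "security", weight: 0.5, score: 4 },
    { name: "error_handling", weight: 0.5, score: 3 },
  ],
  3.5,
); // (0.5 * 4 + 0.5 * 3) / 1 = 3.5, which meets the 3.5 threshold → true
```
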
Expectations are evaluated in order. If a file_exists or test check fails, the llm_review is skipped to save API calls. The agent receives specific feedback about which check failed.
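
This ordering can be sketched as a short-circuiting loop, so the expensive llm_review is never reached when a cheap check fails (the types below are assumptions):

```typescript
// Sketch: evaluate expectations in declaration order and stop at the first
// failure, so an llm_review (which costs API calls) is skipped when a cheap
// file_exists or test check already failed. Types are assumptions.
type Expectation =
  | { type: "file_exists"; paths: string[] }
  | { type: "test" | "script"; command: string }
  | { type: "llm_review"; criteria?: string; threshold: number };

function evaluate(
  expectations: Expectation[],
  run: (e: Expectation) => boolean,
): { passed: boolean; failed?: Expectation } {
  for (const e of expectations) {
    if (!run(e)) return { passed: false, failed: e }; // short-circuit
  }
  return { passed: true };
}
```

The returned failed entry identifies exactly which check broke, which is the basis for the feedback the agent receives.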

Review Flow

Task completes → Run expectations in order:
  1. file_exists checks
  2. test/script commands
  3. llm_review → 3 parallel reviewer agents
     → median consensus scoring
→ Aggregate results
→ Pass: task → done
→ Fail: expectation judge evaluates
  → Expectations wrong? → auto-correct → re-assess
  → Agent work wrong? → fix phase (or retry)

Two-Phase Reviewer Architecture

Each of the 3 independent reviewer agents follows a two-phase process:

Phase 1 — Evidence Gathering

Polpo determines the review mode based on whether the agent produced file changes:
  • File-based tasks: The reviewer explores the codebase using read_file, glob, and grep tools (up to 20 turns), building a detailed analysis with specific file:line evidence.
  • Output-based tasks (no files created/edited): The reviewer analyzes the execution timeline, agent output, stderr, tool call results, and registered outcomes that are provided inline — no filesystem exploration is needed.
In both cases, the reviewer builds a detailed analysis for each evaluation dimension. No scoring happens in this phase.
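
The mode selection can be sketched as a single predicate on the agent’s file changes (names are illustrative, not Polpo’s API):

```typescript
// Sketch: Phase 1's review mode follows from whether the agent touched files.
// Names are illustrative, not Polpo's API.
function reviewMode(
  filesCreated: string[],
  filesEdited: string[],
): "file-based" | "output-based" {
  return filesCreated.length + filesEdited.length > 0
    ? "file-based" // explore the codebase with read_file / glob / grep
    : "output-based"; // analyze timeline, output, stderr, outcomes inline
}
```
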

Phase 2 — Scoring

A separate LLM call receives the full Phase 1 analysis and must produce structured scores by calling the submit_review tool. Polpo uses 3 scoring strategies as fallback (forced tool choice → prompt-based → raw JSON), and validates all output with Zod schemas for type safety and coercion.
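
The fallback chain can be sketched as trying each strategy in order and recording errors from failed attempts, mirroring the scoringAttemptErrors field (the types are assumptions):

```typescript
// Sketch of the scoring fallback chain: try strategies in order, return the
// first success, and keep the errors from earlier attempts (mirrors
// scoringAttemptErrors). Types are assumptions.
type Strategy = { name: string; run: () => unknown };

function scoreWithFallback(strategies: Strategy[]): {
  result?: unknown;
  errors: string[];
} {
  const errors: string[] = [];
  for (const s of strategies) {
    try {
      return { result: s.run(), errors };
    } catch (err) {
      errors.push(`${s.name}: ${(err as Error).message}`);
    }
  }
  return { errors }; // every strategy failed
}
```

Keeping the earlier errors means a success via the last-resort strategy still leaves a trace of why the preferred strategies failed.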

Reviewer Trace Persistence

The complete trace of each reviewer is persisted on the task result for full transparency and debugging:
  • Phase 1 analysis text — task.result.assessment.checks[].reviewers[].exploration.analysis
  • Files read during exploration — task.result.assessment.checks[].reviewers[].exploration.filesRead
  • Full conversation (prompts, responses, tool calls, tool results) — task.result.assessment.checks[].reviewers[].exploration.messages
  • Per-dimension scores with reasoning and evidence — task.result.assessment.checks[].reviewers[].scores
  • Scoring strategy attempt errors — task.result.assessment.checks[].reviewers[].scoringAttemptErrors
This means you can reconstruct exactly what each reviewer read, how it reasoned, which files it inspected, and how it arrived at its scores — even after the assessment is complete.
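
As a sketch, walking those persisted paths to summarize each reviewer might look like this (the field types below are assumptions inferred from the paths above):

```typescript
// Sketch: walking the persisted reviewer traces on a task result.
// The shape mirrors the documented paths; field types are assumptions.
interface ReviewerTrace {
  exploration: { analysis: string; filesRead: string[]; messages: unknown[] };
  scores: { name: string; score: number; reasoning: string }[];
  scoringAttemptErrors: string[];
}

function summarizeReviewers(checks: { reviewers: ReviewerTrace[] }[]): string[] {
  return checks.flatMap((check) =>
    check.reviewers.map(
      (r, i) =>
        `reviewer ${i}: ${r.exploration.filesRead.length} files read, ` +
        `${r.scores.length} scored dimensions`,
    ),
  );
}
```
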

Retry with Feedback

When the review fails, the feedback includes:
  • Which expectations failed and why
  • Per-dimension scores with reasoning
  • Execution timeline — what the agent did step-by-step (tool calls, outputs, errors)
  • Agent stderr — errors and warnings from execution
  • Registered outcomes — explicit deliverables the agent produced
  • Specific suggestions for improvement
  • Original task description (preserved across retries)
The fix phase sends this feedback to the agent so it can make targeted corrections without starting from scratch.

Retry and Fix Phase

When review fails, Polpo uses a multi-stage retry strategy:
  1. Fix phase — Polpo sends targeted feedback to the agent (up to maxFixAttempts attempts)
  2. Full retry — Polpo resets the task and starts from scratch (up to maxRetries attempts)
  3. Escalation — Polpo switches to a different agent or model if configured via the escalation policy
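
The three stages can be sketched as loops over the two attempt budgets, with escalation as the last resort (the callbacks are stand-ins, not Polpo’s API):

```typescript
// Sketch of the multi-stage retry strategy: fix attempts, then full retries,
// then escalation. attempt() and escalate() are stand-ins, not Polpo's API.
function runWithRetries(
  attempt: (mode: "fix" | "retry") => boolean,
  escalate: () => boolean,
  maxFixAttempts: number,
  maxRetries: number,
): boolean {
  for (let i = 0; i < maxFixAttempts; i++) {
    if (attempt("fix")) return true; // fix phase: targeted feedback
  }
  for (let i = 0; i < maxRetries; i++) {
    if (attempt("retry")) return true; // full retry: reset, start over
  }
  return escalate(); // escalation: different agent or model
}
```
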

Expectation Judge

Sometimes the expectations (acceptance criteria) are wrong, not the agent’s work. For example:
  • A test command that was incorrect in the mission
  • A file path that doesn’t match the actual project structure
  • An LLM review threshold that’s too strict for the task scope
Without the expectation judge, the agent would be repeatedly retried for failing criteria that can never be met. When a task fails, Polpo’s expectation judge evaluates whether the expectations themselves might be wrong:
Expectation judge: task fails assessment → agent did work correctly? → YES: correct expectations → NO: normal retry/fix
The judge LLM receives the task description, the agent’s output, the failed expectations, and review feedback — then determines whether the expectations are realistic.

Correction Types

The judge can fix test commands, file paths, and score thresholds:
// Test command fix
{ "type": "test", "command": "npm test -- --grep users" }
→ { "type": "test", "command": "npm test -- --grep \"user management\"" }

// File path fix
{ "type": "file_exists", "paths": ["src/routes/users.ts"] }
→ { "type": "file_exists", "paths": ["src/api/users.ts"] }

// Threshold adjustment
{ "type": "llm_review", "threshold": 4.5 }
→ { "type": "llm_review", "threshold": 3.5 }
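
All three correction types reduce to replacing an expectation entry. A minimal sketch that applies corrections without mutating the original list (the correction shape is an assumption):

```typescript
// Sketch: applying judge corrections as positional replacements, leaving the
// original expectation list untouched. The correction shape is an assumption.
function applyCorrections<T>(
  expectations: T[],
  corrections: { index: number; corrected: T }[],
): T[] {
  const next = [...expectations];
  for (const c of corrections) {
    if (c.index >= 0 && c.index < next.length) next[c.index] = c.corrected;
  }
  return next;
}
```

Copying rather than mutating mirrors the safeguard that the original task description is preserved for the audit trail.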

Safeguards

  • Only runs on first failure — doesn’t keep correcting on subsequent retries
  • Conservative — when in doubt, doesn’t correct (lets the normal retry flow handle it)
  • Audit trail — the assessment:corrected event logs what was changed
  • Original preserved — task.originalDescription is never modified
The expectation judge is most useful when missions are AI-generated. Human-written expectations are typically more accurate, but AI-generated missions may include test commands or file paths that don’t match the actual project structure.

Configuration

{
  "settings": {
    "maxFixAttempts": 2,
    "maxRetries": 3,
    "orchestratorModel": "claude-sonnet-4-6"
  }
}
Override per-task:
{
  "tasks": [
    {
      "title": "Critical API endpoint",
      "assignTo": "backend-dev",
      "description": "...",
      "maxRetries": 5,
      "expectations": [
        {
          "type": "llm_review",
          "threshold": 4.0,
          "dimensions": []
        }
      ]
    }
  ]
}
See Configuration Reference for the full schema.

Events

  • assessment:corrected — payload { taskId, corrections } — expectations were auto-corrected