Assessment is the automated quality layer that scores every task result. It uses an LLM-as-Judge approach: a separate LLM instance reviews the agent’s output against defined expectations and produces a structured score.

How it works

1. **Run expectations.** All expectations (test, file_exists, script, llm_review) execute in parallel.
2. **Phase 1: Exploration.** The judge model explores the task result using tools (read files, run commands) for up to 20 turns to gather evidence.
3. **Phase 2: Structured scoring.** The judge produces a G-Eval score across weighted dimensions.
4. **Consensus.** Three parallel reviewers score independently; median scores are taken with outlier filtering.
5. **Verdict.** If the average score meets the threshold, the task passes. Otherwise, the fix loop begins.

Expectation types

Expectations define what “correct” looks like. All expectations run in parallel before scoring begins.
| Type | Description | Example |
| --- | --- | --- |
| test | Run a test command, assert exit code 0 | npm test |
| file_exists | Check that a file was created | dist/index.js |
| script | Run an arbitrary script and check output | ./validate.sh |
| llm_review | Ask the judge to evaluate a specific aspect | "Is the code well-documented?" |
{
  "expectations": [
    { "type": "test", "command": "npm test" },
    { "type": "file_exists", "path": "dist/index.js" },
    { "type": "script", "command": "./validate.sh", "expectExitCode": 0 },
    { "type": "llm_review", "prompt": "Does the implementation handle edge cases?" }
  ]
}
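The parallel-execution step can be sketched as follows. This is a minimal illustration, not the real assessment API: the `Expectation` type and `runExpectation`/`runAll` helpers are assumed names, and only the file_exists check is implemented here.

```typescript
import * as fs from "fs";

// Illustrative expectation shapes mirroring the config above.
type Expectation =
  | { type: "test"; command: string }
  | { type: "file_exists"; path: string }
  | { type: "script"; command: string; expectExitCode?: number }
  | { type: "llm_review"; prompt: string };

async function runExpectation(e: Expectation): Promise<boolean> {
  switch (e.type) {
    case "file_exists":
      // Resolve to true if the path is accessible, false otherwise.
      return fs.promises.access(e.path).then(() => true, () => false);
    default:
      // test / script / llm_review checks are elided in this sketch
      // and treated as passing.
      return true;
  }
}

// All expectations execute in parallel before scoring begins.
async function runAll(expectations: Expectation[]): Promise<boolean[]> {
  return Promise.all(expectations.map(runExpectation));
}
```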

G-Eval scoring

Scoring uses the G-Eval framework with four default dimensions:
| Dimension | Weight | What it measures |
| --- | --- | --- |
| correctness | 0.35 | Does the output satisfy the task requirements? |
| completeness | 0.30 | Are all aspects of the task addressed? |
| code_quality | 0.20 | Is the code clean, idiomatic, and maintainable? |
| edge_cases | 0.15 | Are edge cases and error conditions handled? |
Each dimension is scored 1-5. The weighted average produces the final score.
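The weighted average can be sketched like this. The function and type names are illustrative, not the actual scoring code; only the default dimensions and weights come from the table above.

```typescript
type Dimension = { name: string; weight: number };

// Default G-Eval dimensions and weights from the table above.
const DEFAULT_DIMENSIONS: Dimension[] = [
  { name: "correctness", weight: 0.35 },
  { name: "completeness", weight: 0.3 },
  { name: "code_quality", weight: 0.2 },
  { name: "edge_cases", weight: 0.15 },
];

// Each dimension is scored 1-5; the final score is the weighted average.
// Dividing by the total weight keeps the result on the 1-5 scale even if
// custom weights do not sum to exactly 1.0.
function finalScore(
  scores: Record<string, number>,
  dims: Dimension[] = DEFAULT_DIMENSIONS
): number {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0);
  const weighted = dims.reduce((sum, d) => sum + d.weight * (scores[d.name] ?? 0), 0);
  return weighted / totalWeight;
}

// Example: strong correctness and completeness, weaker quality dimensions.
// 0.35*5 + 0.30*4 + 0.20*3 + 0.15*3 ≈ 4.0
finalScore({ correctness: 5, completeness: 4, code_quality: 3, edge_cases: 3 });
```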

Custom dimensions

Override the defaults by defining your own dimensions, or let the judge model generate task-specific dimensions automatically:
{
  "expectations": [
    {
      "type": "llm_review",
      "prompt": "Evaluate the API design",
      "dimensions": [
        { "name": "restfulness", "weight": 0.4, "description": "Follows REST conventions" },
        { "name": "documentation", "weight": 0.3, "description": "Endpoints are well-documented" },
        { "name": "error_handling", "weight": 0.3, "description": "Proper HTTP status codes and error bodies" }
      ]
    }
  ]
}
If you omit dimensions, the judge can generate task-specific dimensions based on the task description and expected outcomes. The four defaults are used as a fallback.

Score threshold

The default passing threshold is 3.0 out of 5. Override it per mission with qualityThreshold:
{
  "qualityThreshold": 4.0
}

Multi-reviewer consensus

To reduce variance, assessment runs 3 parallel reviewers. The final score for each dimension is the median of the three scores. Outlier filtering discards any individual score more than 1.5 points from the median before computing the final average. This prevents a single erratic reviewer from skewing the result.
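The consensus step described above can be sketched as follows, assuming three per-dimension reviewer scores as input. The helper names are illustrative, not the real implementation.

```typescript
// Median of a list of scores.
function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Discard any score more than 1.5 points from the median, then average
// the surviving scores. A single erratic reviewer is filtered out.
function consensus(scores: number[]): number {
  const m = median(scores);
  const kept = scores.filter((x) => Math.abs(x - m) <= 1.5);
  return kept.reduce((a, b) => a + b, 0) / kept.length;
}

// One erratic reviewer (score 1) is dropped:
// median of [4, 5, 1] is 4; [4, 5] survive; average is 4.5.
consensus([4, 5, 1]);
```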

2-phase architecture

Assessment runs in two phases. In phase 1 (exploration), the judge model uses tools (file reading, command execution) to explore the task output, running up to 20 turns of tool calls to gather evidence before scoring. This phase lets the judge verify file contents, run test suites, check build outputs, and inspect any artifact the agent produced. In phase 2 (structured scoring), the judge produces the weighted G-Eval score described above.

Fix and retry loop

When a task fails assessment:
fail → fix attempt (with feedback) → reassess → pass?
                                         yes → done
                                         no  → retry (re-execute from scratch)
  1. Fix: The agent receives the reviewer’s feedback and attempts targeted fixes. Up to maxFixAttempts (default 2) per review cycle.
  2. Reassess: The fixed result is scored again.
  3. Retry: If fixes don’t raise the score above threshold, the task is retried from scratch (up to maxRetries).
Each retry is a full re-execution. The agent starts fresh but receives context about previous failures. Tasks with sideEffects: true require human approval before each retry.
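The control flow above can be sketched as a loop. This is a simplified model, not the actual implementation: the `Hooks` callbacks stand in for the real execute/assess/fix steps, and only the option names (maxFixAttempts, maxRetries, qualityThreshold) come from the docs.

```typescript
type Hooks = {
  execute: (attempt: number) => Promise<string>;
  assess: (result: string) => Promise<{ score: number; feedback: string }>;
  fix: (result: string, feedback: string) => Promise<string>;
};

async function runWithRetries(
  opts: { maxFixAttempts: number; maxRetries: number; qualityThreshold: number },
  hooks: Hooks
): Promise<boolean> {
  for (let retry = 0; retry <= opts.maxRetries; retry++) {
    // Each retry is a full re-execution (with prior-failure context).
    let result = await hooks.execute(retry);
    let { score, feedback } = await hooks.assess(result);

    // Targeted fixes using reviewer feedback, up to maxFixAttempts,
    // reassessing after each fix.
    for (let fix = 0; fix < opts.maxFixAttempts && score < opts.qualityThreshold; fix++) {
      result = await hooks.fix(result, feedback);
      ({ score, feedback } = await hooks.assess(result));
    }

    if (score >= opts.qualityThreshold) return true; // pass
  }
  return false; // retries exhausted
}
```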

Assessment triggers

| Trigger | When it fires |
| --- | --- |
| initial | First assessment after task execution |
| reassess | After a fix attempt |
| fix | Agent is producing a fix based on feedback |
| retry | Task re-executing from scratch |
| auto-correct | System-initiated correction |
| judge | Manual judge invocation via API |

Judge model

The judge model defaults to the project’s configured model. Override it with the POLPO_JUDGE_MODEL environment variable:
POLPO_JUDGE_MODEL=anthropic:claude-sonnet-4-5
Using a different model for judging than for execution reduces self-assessment bias. A common pattern is to use a stronger model as judge (e.g., claude-sonnet-4-5 judging claude-haiku-4).