Assessment is the automated quality layer that scores every task result. It uses an LLM-as-Judge approach: a separate LLM instance reviews the agent’s output against defined expectations and produces a structured score.

How it works

1. **Run expectations.** All expectations (test, file_exists, script, llm_review) execute in parallel.
2. **Phase 1: Exploration.** The judge model explores the task result using tools (read files, run commands) for up to 20 turns to gather evidence.
3. **Phase 2: Structured scoring.** The judge produces a G-Eval score across weighted dimensions.
4. **Consensus.** Three parallel reviewers score independently; median scores are taken with outlier filtering.
5. **Verdict.** If the average score meets the threshold, the task passes. Otherwise, the fix loop begins.

Expectation types

Expectations define what “correct” looks like. All expectations run in parallel before scoring begins.
| Type | Description | Example |
| --- | --- | --- |
| test | Run a test command, assert exit code 0 | npm test |
| file_exists | Check that a file was created | dist/index.js |
| script | Run an arbitrary script and check output | ./validate.sh |
| llm_review | Ask the judge to evaluate a specific aspect | "Is the code well-documented?" |
{
  "expectations": [
    { "type": "test", "command": "npm test" },
    { "type": "file_exists", "path": "dist/index.js" },
    { "type": "script", "command": "./validate.sh", "expectExitCode": 0 },
    { "type": "llm_review", "prompt": "Does the implementation handle edge cases?" }
  ]
}
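The parallel-execution step can be sketched as follows. This is a minimal illustration, not the real assessment API: the `Expectation` type and `runExpectation`/`runAll` helpers are assumed names, and only the file_exists check is implemented here.

```typescript
import * as fs from "fs";

// Illustrative expectation shapes mirroring the config above.
type Expectation =
  | { type: "test"; command: string }
  | { type: "file_exists"; path: string }
  | { type: "script"; command: string; expectExitCode?: number }
  | { type: "llm_review"; prompt: string };

async function runExpectation(e: Expectation): Promise<boolean> {
  switch (e.type) {
    case "file_exists":
      // Resolve to true if the path is accessible, false otherwise.
      return fs.promises.access(e.path).then(() => true, () => false);
    default:
      // test / script / llm_review checks are elided in this sketch
      // and treated as passing.
      return true;
  }
}

// All expectations execute in parallel before scoring begins.
async function runAll(expectations: Expectation[]): Promise<boolean[]> {
  return Promise.all(expectations.map(runExpectation));
}
```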

G-Eval scoring

Scoring uses the G-Eval framework with four default dimensions:
| Dimension | Weight | What it measures |
| --- | --- | --- |
| correctness | 0.35 | Does the output satisfy the task requirements? |
| completeness | 0.30 | Are all aspects of the task addressed? |
| code_quality | 0.20 | Is the code clean, idiomatic, and maintainable? |
| edge_cases | 0.15 | Are edge cases and error conditions handled? |
Each dimension is scored 1-5. The weighted average produces the final score.
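The weighted average can be sketched like this. The function and type names are illustrative, not the actual scoring code; only the default dimensions and weights come from the table above.

```typescript
type Dimension = { name: string; weight: number };

// Default G-Eval dimensions and weights from the table above.
const DEFAULT_DIMENSIONS: Dimension[] = [
  { name: "correctness", weight: 0.35 },
  { name: "completeness", weight: 0.3 },
  { name: "code_quality", weight: 0.2 },
  { name: "edge_cases", weight: 0.15 },
];

// Each dimension is scored 1-5; the final score is the weighted average.
// Dividing by the total weight keeps the result on the 1-5 scale even if
// custom weights do not sum to exactly 1.0.
function finalScore(
  scores: Record<string, number>,
  dims: Dimension[] = DEFAULT_DIMENSIONS
): number {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0);
  const weighted = dims.reduce((sum, d) => sum + d.weight * (scores[d.name] ?? 0), 0);
  return weighted / totalWeight;
}

// Example: strong correctness and completeness, weaker quality dimensions.
// 0.35*5 + 0.30*4 + 0.20*3 + 0.15*3 ≈ 4.0
finalScore({ correctness: 5, completeness: 4, code_quality: 3, edge_cases: 3 });
```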

Custom dimensions

Override the defaults by defining your own dimensions, or let the judge model generate task-specific dimensions automatically:
{
  "expectations": [
    {
      "type": "llm_review",
      "prompt": "Evaluate the API design",
      "dimensions": [
        { "name": "restfulness", "weight": 0.4, "description": "Follows REST conventions" },
        { "name": "documentation", "weight": 0.3, "description": "Endpoints are well-documented" },
        { "name": "error_handling", "weight": 0.3, "description": "Proper HTTP status codes and error bodies" }
      ]
    }
  ]
}
If you omit dimensions, the judge can generate task-specific dimensions based on the task description and expected outcomes. The four defaults are used as a fallback.

Score threshold

The default passing threshold is 3.0 out of 5. Override it per mission with qualityThreshold:
{
  "qualityThreshold": 4.0
}

Multi-reviewer consensus

To reduce variance, assessment runs 3 parallel reviewers. The final score for each dimension is the median of the three scores. Outlier filtering discards any individual score more than 1.5 points from the median before computing the final average. This prevents a single erratic reviewer from skewing the result.
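The consensus step described above can be sketched as follows, assuming three per-dimension reviewer scores as input. The helper names are illustrative, not the real implementation.

```typescript
// Median of a list of scores.
function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Discard any score more than 1.5 points from the median, then average
// the surviving scores. A single erratic reviewer is filtered out.
function consensus(scores: number[]): number {
  const m = median(scores);
  const kept = scores.filter((x) => Math.abs(x - m) <= 1.5);
  return kept.reduce((a, b) => a + b, 0) / kept.length;
}

// One erratic reviewer (score 1) is dropped:
// median of [4, 5, 1] is 4; [4, 5] survive; average is 4.5.
consensus([4, 5, 1]);
```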

2-phase architecture

Assessment runs in two phases. In phase 1 (exploration), the judge model uses tools (file reading, command execution) to explore the task output, running up to 20 turns of tool calls to gather evidence before scoring. This phase lets the judge verify file contents, run test suites, check build outputs, and inspect any artifact the agent produced. In phase 2 (structured scoring), the judge produces the weighted G-Eval score described above.

Fix and retry loop

When a task fails assessment:
fail → fix attempt (with feedback) → reassess → pass?
                                         yes → done
                                         no  → retry (re-execute from scratch)
  1. Fix: The agent receives the reviewer’s feedback and attempts targeted fixes. Up to maxFixAttempts (default 2) per review cycle.
  2. Reassess: The fixed result is scored again.
  3. Retry: If fixes don’t raise the score above threshold, the task is retried from scratch (up to maxRetries).
Each retry is a full re-execution. The agent starts fresh but receives context about previous failures. Tasks with sideEffects: true require human approval before each retry.
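The control flow above can be sketched as a loop. This is a simplified model, not the actual implementation: the `Hooks` callbacks stand in for the real execute/assess/fix steps, and only the option names (maxFixAttempts, maxRetries, qualityThreshold) come from the docs.

```typescript
type Hooks = {
  execute: (attempt: number) => Promise<string>;
  assess: (result: string) => Promise<{ score: number; feedback: string }>;
  fix: (result: string, feedback: string) => Promise<string>;
};

async function runWithRetries(
  opts: { maxFixAttempts: number; maxRetries: number; qualityThreshold: number },
  hooks: Hooks
): Promise<boolean> {
  for (let retry = 0; retry <= opts.maxRetries; retry++) {
    // Each retry is a full re-execution (with prior-failure context).
    let result = await hooks.execute(retry);
    let { score, feedback } = await hooks.assess(result);

    // Targeted fixes using reviewer feedback, up to maxFixAttempts,
    // reassessing after each fix.
    for (let fix = 0; fix < opts.maxFixAttempts && score < opts.qualityThreshold; fix++) {
      result = await hooks.fix(result, feedback);
      ({ score, feedback } = await hooks.assess(result));
    }

    if (score >= opts.qualityThreshold) return true; // pass
  }
  return false; // retries exhausted
}
```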

Assessment triggers

| Trigger | When it fires |
| --- | --- |
| initial | First assessment after task execution |
| reassess | After a fix attempt |
| fix | Agent is producing a fix based on feedback |
| retry | Task re-executing from scratch |
| auto-correct | System-initiated correction |
| judge | Manual judge invocation via API |

Judge model

The judge model defaults to the project’s configured model. Override it with the POLPO_JUDGE_MODEL environment variable:
POLPO_JUDGE_MODEL=anthropic:claude-sonnet-4-5
Using a different model for judging than for execution reduces self-assessment bias. A common pattern is to use a stronger model as judge (e.g., claude-sonnet-4-5 judging claude-haiku-4).