# How it works

## Phase 1: Exploration
The judge model explores the task result using tools (read files, run commands) for up to 20 turns to gather evidence.
## Consensus
Three parallel reviewers score independently. Median scores are taken with outlier filtering.
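As a minimal sketch (function names are illustrative, not the actual API), the consensus rule looks like:

```typescript
// Median of a list of scores (1-5 scale).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Consensus: discard any reviewer score more than 1.5 points from the
// median, then take the median of what remains.
function consensusScore(scores: number[]): number {
  const m = median(scores);
  const kept = scores.filter((x) => Math.abs(x - m) <= 1.5);
  return median(kept);
}
```

With reviewers scoring 4, 4.5, and 1, the outlier 1 is discarded and the consensus lands between the two agreeing reviewers.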
## Expectation types

Expectations define what “correct” looks like. All expectations run in parallel before scoring begins.

| Type | Description | Example |
|---|---|---|
| `test` | Run a test command, assert exit code 0 | `npm test` |
| `file_exists` | Check that a file was created | `dist/index.js` |
| `script` | Run an arbitrary script and check output | `./validate.sh` |
| `llm_review` | Ask the judge to evaluate a specific aspect | "Is the code well-documented?" |
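For illustration, a hypothetical expectation list covering all four types — the field names here are assumptions, not the actual schema:

```typescript
// Hypothetical expectation objects; `type` values come from the table
// above, but the other field names are illustrative.
const expectations = [
  { type: "test", command: "npm test" },
  { type: "file_exists", path: "dist/index.js" },
  { type: "script", command: "./validate.sh" },
  { type: "llm_review", prompt: "Is the code well-documented?" },
];
```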
## G-Eval scoring

Scoring uses the G-Eval framework with four default dimensions:

| Dimension | Weight | What it measures |
|---|---|---|
| `correctness` | 0.35 | Does the output satisfy the task requirements? |
| `completeness` | 0.30 | Are all aspects of the task addressed? |
| `code_quality` | 0.20 | Is the code clean, idiomatic, and maintainable? |
| `edge_cases` | 0.15 | Are edge cases and error conditions handled? |
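The weighted overall score implied by the table can be sketched as follows — a hypothetical helper, not the actual scoring code:

```typescript
// Default G-Eval dimension weights from the table above (sum to 1.0).
const weights: Record<string, number> = {
  correctness: 0.35,
  completeness: 0.3,
  code_quality: 0.2,
  edge_cases: 0.15,
};

// Weighted average of per-dimension scores (each on a 1-5 scale).
function overallScore(scores: Record<string, number>): number {
  return Object.entries(weights).reduce(
    (sum, [dim, w]) => sum + w * (scores[dim] ?? 0),
    0
  );
}
```

A result scoring 4 on every dimension yields an overall 4.0, since the weights sum to 1.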
## Custom dimensions

Override the defaults by defining your own `dimensions`, or omit them entirely: the judge model can then generate task-specific dimensions from the task description and expected outcomes, falling back to the four defaults.

## Score threshold

The default passing threshold is 3.0 out of 5. Override it per mission with `qualityThreshold`.
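For example, a mission config combining both overrides — only `dimensions` and `qualityThreshold` come from the docs; the other key names are assumptions about the schema:

```typescript
// Hypothetical mission config; shape is illustrative.
const mission = {
  task: "Build the CLI entry point",
  qualityThreshold: 3.5, // raise the default 3.0 passing bar
  dimensions: [
    { name: "correctness", weight: 0.5 },
    { name: "cli_ergonomics", weight: 0.5 }, // task-specific dimension
  ],
};
```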
## Multi-reviewer consensus

To reduce variance, assessment runs three parallel reviewers. For each dimension, outlier filtering discards any individual score more than 1.5 points from the median of the three, and the final score is the median of the remaining scores. This prevents a single erratic reviewer from skewing the result.

## 2-phase architecture
- Phase 1: Exploration
- Phase 2: Structured scoring
The judge model uses tools (file reading, command execution) to explore the task output, running up to 20 turns of tool calls to gather evidence before scoring. This phase lets the judge verify file contents, run test suites, check build outputs, and inspect any artifact the agent produced.
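The two-phase loop can be sketched schematically — the types and helper below are illustrative; only the 20-turn budget comes from the docs:

```typescript
// Schematic exploration phase of the judge.
type ToolCall = { tool: "read_file" | "run_command"; arg: string };

const MAX_TURNS = 20; // exploration budget

function explore(nextCall: (turn: number) => ToolCall | null): ToolCall[] {
  const evidence: ToolCall[] = [];
  // Phase 1: up to 20 tool-call turns to gather evidence.
  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const call = nextCall(turn);
    if (!call) break; // judge decides it has enough evidence
    evidence.push(call);
  }
  // Phase 2 (structured scoring) would consume `evidence` here.
  return evidence;
}
```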
## Fix and retry loop

When a task fails assessment:

- Fix: the agent receives the reviewer’s feedback and attempts targeted fixes, up to `maxFixAttempts` (default 2) per review cycle.
- Reassess: the fixed result is scored again.
- Retry: if fixes don’t raise the score above the threshold, the task is retried from scratch (up to `maxRetries`).
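The control flow of the steps above can be sketched as follows — function names and score handling are illustrative; only `maxFixAttempts` and `maxRetries` come from the docs:

```typescript
const maxFixAttempts = 2; // default fix attempts per review cycle
const maxRetries = 1;     // illustrative value

// Sketch of the recovery loop: run, fix up to maxFixAttempts times,
// then retry from scratch if still below threshold.
function runWithRecovery(
  execute: () => number,          // runs the task, returns a score out of 5
  fix: (score: number) => number, // applies feedback, returns reassessed score
  threshold = 3.0
): { score: number; passed: boolean } {
  let score = 0;
  for (let retry = 0; retry <= maxRetries; retry++) {
    score = execute(); // retried from scratch on later iterations
    for (let attempt = 0; attempt < maxFixAttempts; attempt++) {
      if (score >= threshold) break;
      score = fix(score); // Fix, then Reassess
    }
    if (score >= threshold) return { score, passed: true };
  }
  return { score, passed: false };
}
```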
## Assessment triggers
| Trigger | When it fires |
|---|---|
| `initial` | First assessment after task execution |
| `reassess` | After a fix attempt |
| `fix` | Agent is producing a fix based on feedback |
| `retry` | Task re-executing from scratch |
| `auto-correct` | System-initiated correction |
| `judge` | Manual judge invocation via API |
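Consumed from code, the trigger names above form a natural union type — a hypothetical typing; the real API may model them differently:

```typescript
// Trigger names from the table, as a TypeScript union (illustrative).
type AssessmentTrigger =
  | "initial"
  | "reassess"
  | "fix"
  | "retry"
  | "auto-correct"
  | "judge";

const allTriggers: AssessmentTrigger[] = [
  "initial", "reassess", "fix", "retry", "auto-correct", "judge",
];
```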
## Judge model

The judge model defaults to the project’s configured model. Override it with the `POLPO_JUDGE_MODEL` environment variable.
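A minimal sketch of how such an override is typically resolved — the helper and fallback argument are illustrative, not the project’s actual API:

```typescript
// Resolve the effective judge model: POLPO_JUDGE_MODEL wins; otherwise
// fall back to the project's configured model (a placeholder here).
function effectiveJudgeModel(projectModel: string): string {
  return process.env.POLPO_JUDGE_MODEL ?? projectModel;
}
```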
Using a different model for judging than for execution reduces self-assessment bias. A common pattern is to use a stronger model as judge (e.g., `claude-sonnet-4-5` judging `claude-haiku-4`).