20% of overall score

Oversight Quality

Measures whether you catch and correct bad outputs at the right frequency — calibrated supervision, not passive acceptance or constant correction.

Weight: 20% of overall score


Definition

Oversight Quality measures whether you catch and correct bad outputs at the right frequency. Unlike every other dimension, the underlying measure is not "higher is always better": more corrections do not automatically earn a higher score. The score peaks at a moderate correction rate, high enough to show active supervision and low enough to show that delegation is working.

Two failure modes exist: passive acceptance (zero corrections, implying all outputs are accepted uncritically) and over-correction (constant redirection, implying either poor initial task framing or misplaced distrust).


How it's measured

Two sources are combined to detect oversight events in each session:

Keyword-based correction count — human turns are scanned for the following correction signals. Each match counts as one oversight event:

no , wrong, not that, don't, stop, wait, actually, instead, undo, revert, that's not, not right, incorrect, you missed, you forgot
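The keyword scan can be sketched in a few lines. This is an illustration only, not the scorer's actual implementation. It assumes signals match whole words case-insensitively (so "no" does not fire inside "not" or "know"), and that a turn counts as at most one oversight event even when it contains several signals. The function names are hypothetical.

```python
import re

# Correction signals copied from the list above.
CORRECTION_SIGNALS = [
    "no", "wrong", "not that", "don't", "stop", "wait", "actually",
    "instead", "undo", "revert", "that's not", "not right", "incorrect",
    "you missed", "you forgot",
]

# Assumption: whole-word, case-insensitive matching.
_SIGNAL_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in CORRECTION_SIGNALS) + r")\b",
    re.IGNORECASE,
)


def is_keyword_correction(turn: str) -> bool:
    """Return True if the human turn contains at least one correction signal."""
    # Treat curly apostrophes as straight ones so "that’s not" still matches.
    normalized = turn.replace("\u2019", "'")
    return _SIGNAL_RE.search(normalized) is not None


def count_keyword_events(human_turns: list[str]) -> int:
    """Count flagged turns (assumption: at most one event per turn)."""
    return sum(is_keyword_correction(t) for t in human_turns)
```

Whole-word matching is a design choice made here to avoid false positives from common words such as "now" or "know"; a plain substring scan would flag far more turns.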

LLM-classified events — an LLM reviewer classifies each human turn as one of: correction, redirection, validation, or pure_input. Corrections and redirections count as oversight events.

The LLM classification takes precedence when available. The correction rate is calculated as:

correction rate = oversight events / total human turns
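A minimal sketch of how the two sources might be combined into the rate above. It assumes precedence is applied per turn: when an LLM label exists for a turn it decides the outcome, otherwise the keyword result is used. The `HumanTurn` type and its field names are hypothetical, and the keyword result is assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import Optional

# Labels that count as oversight events, per the text above.
OVERSIGHT_LABELS = {"correction", "redirection"}


@dataclass
class HumanTurn:
    text: str
    # One of: correction, redirection, validation, pure_input; None if unavailable.
    llm_label: Optional[str] = None
    # Result of the keyword scan for this turn (assumed precomputed).
    keyword_hit: bool = False


def is_oversight_event(turn: HumanTurn) -> bool:
    """LLM label takes precedence when available; otherwise use the keyword scan."""
    if turn.llm_label is not None:
        return turn.llm_label in OVERSIGHT_LABELS
    return turn.keyword_hit


def correction_rate(turns: list[HumanTurn]) -> float:
    """correction rate = oversight events / total human turns."""
    if not turns:
        return 0.0  # assumption: an empty session has a rate of zero
    return sum(is_oversight_event(t) for t in turns) / len(turns)
```

Note that a turn labelled validation by the LLM is not counted even if it contains a keyword such as "actually", which is the practical effect of giving the classifier precedence.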

This rate is then mapped to a score using an inverted-U curve. The curve peaks at a 20% correction rate, which earns a score of 10:

Correction rate   Score   Signal
20%               10      Optimal: active, calibrated supervision
10–30%            8–10    Strong oversight band
5–10%             5–7     Under-supervising
< 5%              1–4     Passive: outputs accepted without verification
30–50%            5–7     Over-correcting
> 50%             1–4     Micro-managing: poor delegation or task mismatch
0%                1       No oversight detected
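The table gives score bands rather than an exact curve, so the sketch below only looks up the band for a given rate; it does not try to reproduce the scorer's actual function. Boundary handling is an assumption: lower bounds are treated as inclusive, 30% is placed in the strong band, and 50% in the over-correcting band. The function name `score_band` is hypothetical.

```python
import math


def score_band(rate: float) -> tuple[int, int, str]:
    """Return (min_score, max_score, signal) for a correction rate in [0, 1]."""
    if rate < 0.0:
        raise ValueError("correction rate cannot be negative")
    if rate == 0.0:
        return (1, 1, "No oversight detected")
    if math.isclose(rate, 0.20, abs_tol=1e-9):
        return (10, 10, "Optimal")
    if rate < 0.05:
        return (1, 4, "Passive")
    if rate < 0.10:
        return (5, 7, "Under-supervising")
    if rate <= 0.30:  # assumption: 30% belongs to the strong band
        return (8, 10, "Strong oversight band")
    if rate <= 0.50:  # assumption: 50% belongs to the over-correcting band
        return (5, 7, "Over-correcting")
    return (1, 4, "Micro-managing")
```

For example, a session with 3 oversight events across 20 human turns has a rate of 0.15 and lands in the strong band (8 to 10).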

What high vs low looks like

High (score 8–10)

  • Reviewing Claude's output before accepting it, with targeted corrections when something is wrong
  • Catching logical errors, wrong assumptions, or missed edge cases — then redirecting precisely
  • Validation turns ("this looks right, continue") count positively
  • Correction rate lands in the 10–30% range across sessions

Low — passive (score 1–3, correction rate near 0%)

  • Accepting all outputs without review
  • No corrections across multiple sessions
  • Treating Claude's output as final rather than as a first draft to validate

Low — over-correcting (score 1–4, correction rate > 50%)

  • Constant redirections suggesting the task was poorly scoped from the start
  • Re-explaining the same requirement multiple times per session
  • Using Claude interactively rather than as an autonomous tool

Behavioural patterns in real sessions

Anthropic's work study raises a specific concern about oversight: supervision requires the same coding expertise that delegation may erode over time. Engineers who offload coding to Claude may gradually lose the technical depth needed to evaluate whether Claude's output is correct, a risk that compounds as autonomy increases.

This is why Oversight Quality carries significant weight even as Autonomy Calibration rewards longer uninterrupted runs. The two dimensions create a productive tension: grant autonomy, but verify output. The research found that engineers who maintained high oversight quality alongside high autonomy were the ones whose complexity scores rose fastest over the study period.

The cohort data also shows that 55% of Anthropic engineers use Claude daily for debugging — a task type where output verification is built into the workflow (run the tests, see if the bug is fixed). Debugging sessions tend to produce naturally high Oversight Quality scores because verification is inherent to the task.


How it affects your overall score

Oversight Quality carries 20% of your total score.

A one-point improvement in this dimension adds 0.20 points to your overall score.
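The arithmetic behind that figure, as a small sketch. It assumes the overall score is a weighted sum of dimension scores on a shared scale, which the 0.20-per-point statement implies but this page does not spell out. The function name is hypothetical.

```python
OVERSIGHT_WEIGHT = 0.20  # from the text: 20% of the overall score


def overall_contribution(dimension_score: float) -> float:
    """Points this dimension contributes to the overall score (assumed weighted sum)."""
    return OVERSIGHT_WEIGHT * dimension_score


# Moving from 6 to 7 in this dimension:
gain = overall_contribution(7) - overall_contribution(6)
print(f"gain in overall score: {gain:.2f}")
```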

Because the scoring curve is non-linear (it peaks at a moderate correction rate rather than at the highest one), this is the one dimension where reducing a behavior, specifically over-correction, can raise your score.

It interacts strongly with Autonomy Calibration (high autonomy only earns its score if oversight quality remains healthy) and Delegation Intelligence (well-chosen tasks are easier to verify).
