Weight: 20% of overall score · How the overall score is calculated
Definition
Autonomy Calibration measures whether you grant Claude enough uninterrupted space to do substantive work. Every human turn in a session interrupts Claude's execution chain — some interruptions are necessary corrections, but most are premature check-ins that fragment complex tasks and eliminate the efficiency gains of AI delegation.
High autonomy does not mean blind trust. It means knowing when to intervene and when to let the tool run.
How it's measured
Two signals are extracted from each session's metadata:
- Max consecutive tool calls: the longest uninterrupted sequence of tool calls Claude completed before a human turn. This is the primary signal.
- Ratio of max consecutive tool calls to human turns: normalises the raw tool call count against session length, so a short focused session scores comparably to a long session.
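As an illustration of the primary signal, the sketch below finds the longest run of tool calls in a session. The event labels `"human"` and `"tool"` and the flat list layout are assumptions made for this example, not the actual session metadata format. How other events (such as assistant text) affect a run is also an assumption: here they neither extend nor reset it.

```python
from typing import Iterable


def max_consecutive_tool_calls(events: Iterable[str]) -> int:
    """Return the longest run of "tool" events not broken by a "human" event."""
    longest = 0
    current = 0
    for event in events:
        if event == "tool":
            current += 1
            longest = max(longest, current)
        elif event == "human":
            # A human turn ends the current run.
            current = 0
        # Any other event type (assumed) leaves the run unchanged.
    return longest


# Hypothetical session: a run of 2 tool calls, then a run of 4.
session = ["human", "tool", "tool", "human", "tool", "tool", "tool", "tool", "human"]
print(max_consecutive_tool_calls(session))  # 4
```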
avgMaxConsec = average(maxConsecutiveToolCalls) across non-pure-chat sessions
avgTurns = average(totalTurns) across all sessions
ratio = avgMaxConsec / avgTurns
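The three lines above can be sketched in Python as follows. The `Session` record and its field names are hypothetical, chosen only for this example. The sketch follows the formula as written: `avgMaxConsec` is averaged over sessions with at least one tool call, and `avgTurns` over all sessions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Session:
    # Hypothetical per-session metadata fields.
    max_consecutive_tool_calls: int
    total_tool_calls: int
    total_turns: int


def autonomy_ratio(sessions: Sequence[Session]) -> Optional[float]:
    """Return avgMaxConsec / avgTurns, or None if it cannot be computed."""
    # Pure-chat sessions (zero tool calls) are left out of avgMaxConsec.
    tool_sessions = [s for s in sessions if s.total_tool_calls > 0]
    if not tool_sessions:
        return None
    avg_max_consec = mean(s.max_consecutive_tool_calls for s in tool_sessions)
    # avgTurns is taken across all sessions, as in the formula above.
    avg_turns = mean(s.total_turns for s in sessions)
    if avg_turns == 0:
        return None
    return avg_max_consec / avg_turns


sessions = [
    Session(max_consecutive_tool_calls=12, total_tool_calls=20, total_turns=4),
    Session(max_consecutive_tool_calls=8, total_tool_calls=10, total_turns=6),
    Session(max_consecutive_tool_calls=0, total_tool_calls=0, total_turns=5),  # pure chat
]
print(autonomy_ratio(sessions))  # 2.0  (10 / 5)
```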
The two Anthropic benchmark anchors that define the scale:
- Ratio 1.58 = February 2025 cohort baseline (9.8 tool calls ÷ 6.2 turns) → score 6
- Ratio 5.17 = August 2025 best practice (21.2 tool calls ÷ 4.1 turns) → score 10
The ratio is mapped to a 1–10 score using thresholds derived from the Anthropic cohort data:
| Ratio | Score |
|---|---|
| ≥ 5.17 | 10 |
| ≥ 4.0 | 9 |
| ≥ 3.0 | 8 |
| ≥ 2.5 | 7 |
| ≥ 1.58 | 6 |
| ≥ 1.2 | 5 |
| ≥ 0.8 | 4 |
| ≥ 0.5 | 3 |
| ≥ 0.2 | 2 |
| < 0.2 | 1 |
Sessions with zero tool calls (pure chat) are excluded from the avgMaxConsec average.
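The threshold table can be read as a simple top-down lookup. The sketch below is one way to express it. The function and constant names are illustrative, and only the threshold values come from the table.

```python
# (minimum ratio, score) pairs copied from the table, highest first.
THRESHOLDS = [
    (5.17, 10),
    (4.0, 9),
    (3.0, 8),
    (2.5, 7),
    (1.58, 6),
    (1.2, 5),
    (0.8, 4),
    (0.5, 3),
    (0.2, 2),
]


def ratio_to_score(ratio: float) -> int:
    """Map a ratio to a 1-10 score using the first threshold it meets."""
    for minimum, score in THRESHOLDS:
        if ratio >= minimum:
            return score
    return 1  # anything below 0.2


# The two benchmark anchors land on their stated scores.
print(ratio_to_score(9.8 / 6.2))   # 6
print(ratio_to_score(21.2 / 4.1))  # 10
```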
What high vs low looks like
High (score 8–10)
- Providing a clear task brief, then letting Claude read files, run tests, edit code, and return a result before intervening
- Tool call chains of 15–25+ calls before a human turn
- Interventions are purposeful: a correction, a scope change, or a follow-on task — not a check-in
Low (score 1–4)
- Interrupting after every 2–3 tool calls to confirm Claude is on track
- Asking Claude to "just show me the plan first" before each step
- Treating Claude like a junior developer who needs constant approval rather than a tool given a spec
Behavioural patterns in real sessions
The most striking finding in Anthropic's internal work study is the change in maximum consecutive tool calls between February and August 2025.
The cohort average went from 9.8 to 21.2, a 116% increase in six months. This was not driven by longer sessions: human turns per session fell from 6.2 to 4.1 over the same period. The data points to engineers interrupting less and letting Claude run further before stepping in.
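The percentages follow directly from the four cohort figures quoted above. A quick check, with `percent_change` as a helper defined only for this example:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


tool_call_change = percent_change(9.8, 21.2)  # max consecutive tool calls
turn_change = percent_change(6.2, 4.1)        # human turns per session

print(f"{tool_call_change:+.0f}% tool calls, {turn_change:+.0f}% human turns")
# +116% tool calls, -34% human turns
```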
This pattern was most pronounced in the pre-training and security teams, who also showed the highest task complexity scores — suggesting that engineers doing technically complex work learned earlier that long autonomous runs were necessary to get useful output.
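The implication for scoring: a ratio of about 1.58 (max consecutive tool calls divided by human turns) corresponds to the February 2025 cohort baseline and maps to a score of 6. A ratio of about 5.17 corresponds to the August 2025 cohort average, which the scale treats as best practice and maps to a score of 10. These two anchors define the scoring scale.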
How it affects your overall score
Autonomy Calibration carries 20% of your total score.
A one-point improvement in this dimension adds 0.20 points to your overall score.
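A minimal sketch of that arithmetic, assuming the overall score is a weighted sum of dimension scores on the same scale. Only the 20% weight comes from this page; the function name is illustrative and the other dimensions' weights are not modelled here.

```python
AUTONOMY_WEIGHT = 0.20  # 20% of the overall score


def overall_contribution(dimension_score: float, weight: float = AUTONOMY_WEIGHT) -> float:
    """Points this dimension contributes to the overall score (assumed weighted sum)."""
    return dimension_score * weight


# Moving from a 6 to a 7 in Autonomy Calibration:
delta = overall_contribution(7) - overall_contribution(6)
print(round(delta, 2))  # 0.2
```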
This dimension is most sensitive to habit change — the same tasks, handled with fewer check-in interruptions, can move this score significantly without changing what you delegate.
It interacts strongly with Delegation Intelligence (well-scoped tasks are easier to run autonomously) and Oversight Quality (autonomy only works if you can verify output afterward).