A behavioral scoring system that measures how effectively you use AI-assisted coding — not what you build, but how you delegate, supervise, and iterate. Benchmarked against Anthropic's internal engineering cohort across 200,000+ real Claude Code sessions.
What the score measures
Most AI productivity metrics measure what gets built. This score measures how you build — the delegation decisions, oversight patterns, and autonomy calibration that determine whether you're extracting full value from AI-assisted coding.
The framework is grounded in Anthropic's own analysis of real-world Claude usage. Using the methodology developed in Clio: Privacy-Preserving Insights into Real-World AI Use — a system that analyzes behavioral patterns across millions of interactions without exposing raw conversation content — Anthropic studied how its own engineers' Claude Code usage evolved from February to August 2025.
Key finding: Between February and August 2025, Anthropic engineers increased maximum consecutive tool calls by 116% (9.8 → 21.2), reduced human turns per session by 33% (6.2 → 4.1), and raised average task complexity from 3.2 to 3.8 on a 1–5 scale — without sacrificing oversight quality.
The maturity score translates these behavioral patterns into a single comparable number, giving engineers a clear benchmark for where they are and what to improve.
Research foundation
| Stat | Source |
|---|---|
| 200,000+ Claude Code transcripts analyzed (Feb → Aug 2025) | Anthropic work study |
| 132 engineers and researchers surveyed | Anthropic work study |
| 27% of AI-assisted work is net-new — tasks that wouldn't have been done otherwise | Anthropic work study |
| 1M+ conversations analyzed for behavioral pattern methodology | Clio paper |
The six dimensions were derived from what the research identified as the highest-signal behavioral differences between engineers who gained the most from Claude Code and those who plateaued. Delegation choice, interruption frequency, task diversity, and new-work generation all emerged as separating factors.
The 6 dimensions
Each dimension is scored 1–10 and weighted by its impact on effective AI collaboration.
Delegation Intelligence — 25%
Are you delegating tasks that Claude is actually suited for — appropriately complex, verifiable, and code-adjacent?
- High: debugging, refactoring, self-contained feature work, papercut fixes
- Low: high-level design, strategic decisions, tasks requiring organizational context
The research found the most-delegated tasks among high performers are "easily verifiable, low-complexity, self-contained, boring, or throwaway code" — tasks where AI error is cheap to catch.
Autonomy Calibration — 20%
Are you letting Claude run uninterrupted long enough to do substantive work? Excessive steering creates overhead that cancels out AI productivity gains.
- High: long consecutive tool call chains, few human interruptions per task
- Low: constant re-steering, short autonomous runs before intervention
The 116% increase in max consecutive tool calls from Feb to Aug 2025 is the clearest single signal separating developing from high-maturity users.
Oversight Quality — 20%
When Claude goes off-track, do you catch it and redirect it precisely? This dimension rewards calibrated supervision — not passive acceptance and not constant correction.
- High: targeted corrections at the right moments; consistent output validation
- Low: zero corrections (passive acceptance of all outputs) or extremely high correction rate (poor delegation or task mismatch)
The work study notes that effective supervision requires the same coding expertise that delegation may erode over time — making oversight a critical ongoing skill, not a passive one.
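As a rough illustration of the "neither passive nor over-steering" idea, the sketch below assumes each human turn has already been labelled with one of the four labels the analyzer's oversight-detection call returns (correction, redirection, validation, pure_input). The function name, the returned keys, and especially the `high_rate` cut-off of 0.5 are hypothetical; the source does not say where "extremely high" begins.

```python
from collections import Counter

# Labels named in the analyzer description; everything else here is illustrative.
LABELS = ("correction", "redirection", "validation", "pure_input")


def oversight_profile(turn_labels: list[str], high_rate: float = 0.5) -> dict:
    """Summarise labelled human turns into rates and a coarse pattern.

    high_rate is an assumed threshold for "extremely high correction rate".
    """
    counts = Counter(turn_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    total = len(turn_labels)
    if total == 0:
        return {"steering_rate": 0.0, "validation_rate": 0.0, "pattern": "no_data"}

    # Corrections and redirections both count as steering in this sketch.
    steering = (counts["correction"] + counts["redirection"]) / total
    validation = counts["validation"] / total

    if steering == 0:
        pattern = "passive_acceptance"   # zero corrections: a low-score pattern
    elif steering >= high_rate:
        pattern = "over_steering"        # very high correction rate: also low
    else:
        pattern = "calibrated"
    return {"steering_rate": steering, "validation_rate": validation, "pattern": pattern}


print(oversight_profile(["pure_input", "validation", "correction", "pure_input"]))
```

Both extremes map to a low-score pattern here, which mirrors the High/Low description above; how the real analyzer converts these rates into a 1–10 score is not specified.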
Complexity Progression — 15%
Is the complexity of tasks you delegate increasing over time? Staying at low-complexity tasks signals that trust in Claude is not growing.
- High: upward trend in task complexity; architecture, feature implementation, design planning
- Low: flat or declining complexity — only simple edits and fixes delegated
Anthropic engineers increased average task complexity from 3.2 to 3.8 (on a 1–5 scale) across this period. Feature implementation grew from 14.3% to 36.9% of all sessions; design and planning from 1.0% to 9.9%.
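One simple way to check whether your own complexity is trending upward is a least-squares slope over sessions in chronological order. This is a minimal sketch, assuming you already have a 1–5 complexity rating per session; `complexity_slope` is a hypothetical helper, and the source does not state that the analyzer uses this particular trend measure.

```python
def complexity_slope(complexities: list[float]) -> float:
    """Least-squares slope of complexity against session index (0, 1, 2, ...).

    Positive means complexity is rising over time; zero or negative means
    flat or declining. Returns 0.0 when there are fewer than two sessions.
    """
    n = len(complexities)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(complexities) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(complexities))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Sessions ordered oldest to newest, each rated on the 1-5 complexity scale.
print(complexity_slope([2, 2, 3, 3, 4]))
```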
Task Breadth — 10%
How wide a range of task types are you delegating? Engineers who confine Claude to one task type miss the compounding benefits of full-stack AI collaboration.
- High: debugging, front-end, data science, refactoring, code understanding all present
- Low: only one or two task types used across sessions
The study found engineers becoming "more full-stack" by leveraging Claude across domains previously requiring specialist knowledge.
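Breadth can be approximated by counting distinct task types and checking how much one type dominates. The sketch below assumes each session carries a `task_type` string; the helper name and the two returned measures are illustrative choices, not the analyzer's documented formula.

```python
from collections import Counter


def task_breadth(task_types: list[str]) -> dict:
    """Count distinct task types and the share taken by the most common one."""
    counts = Counter(task_types)
    if not counts:
        return {"distinct_types": 0, "dominant_share": 0.0}
    top_count = counts.most_common(1)[0][1]
    return {
        "distinct_types": len(counts),
        "dominant_share": top_count / len(task_types),
    }


print(task_breadth(["debugging", "debugging", "refactoring", "front-end"]))
```

A low distinct count combined with a dominant share close to 1.0 corresponds to the "only one or two task types" pattern described above.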
New Work Generation — 10%
What proportion of your AI-assisted sessions involve tasks that wouldn't have been done without Claude? This is AI creating genuine economic surplus, not just redistributing existing work.
- High: throwaway tooling, exploratory prototypes, work in unfamiliar domains
- Low: AI used only for tasks already on the roadmap
Anthropic found 27% of Claude-assisted work falls into this category across its internal cohort.
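To compare yourself with the 27% figure, you can compute the share of sessions flagged as new work. This sketch assumes one boolean `is_new_work` flag per session; the function name is hypothetical.

```python
COHORT_NEW_WORK_SHARE = 0.27  # figure quoted from the Anthropic work study


def new_work_share(is_new_work_flags: list[bool]) -> float:
    """Fraction of sessions flagged as work that would not otherwise have been done."""
    if not is_new_work_flags:
        return 0.0
    return sum(1 for flag in is_new_work_flags if flag) / len(is_new_work_flags)


share = new_work_share([True, False, False, True])
print(share, share >= COHORT_NEW_WORK_SHARE)
```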
Weighting rationale
| Dimension | Weight | Rationale |
|---|---|---|
| Delegation Intelligence | 25% | Wrong task selection is the most common failure mode; it undermines all other dimensions |
| Autonomy Calibration | 20% | The 116% tool call increase is the clearest behavioral signal of maturity growth |
| Oversight Quality | 20% | High autonomy without effective supervision is the primary risk of advanced AI use |
| Complexity Progression | 15% | Flat complexity signals stalled trust — growth requires delegating harder work over time |
| Task Breadth | 10% | Cross-domain delegation produces compounding returns; single-type use caps gains |
| New Work Generation | 10% | Weighted lower because it varies significantly by role and project type |
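The weights above sum to 100%, so the overall score can be read as a weighted average of the six 1–10 dimension scores. The sketch below shows that arithmetic; the weights come from the table, while the function name, dictionary keys, and rounding to two decimals are assumptions made for illustration.

```python
# Weights from the table above; key names are hypothetical.
WEIGHTS = {
    "delegation_intelligence": 0.25,
    "autonomy_calibration": 0.20,
    "oversight_quality": 0.20,
    "complexity_progression": 0.15,
    "task_breadth": 0.10,
    "new_work_generation": 0.10,
}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of the six dimension scores, each expected in [1, 10]."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= dimension_scores[name] <= 10:
            raise ValueError(f"{name} must be between 1 and 10")
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(total, 2)


example = {
    "delegation_intelligence": 8,
    "autonomy_calibration": 6,
    "oversight_quality": 7,
    "complexity_progression": 5,
    "task_breadth": 4,
    "new_work_generation": 3,
}
print(composite_score(example))  # 2.0 + 1.2 + 1.4 + 0.75 + 0.4 + 0.3 = 6.05
```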
Maturity levels
1–3 · Early Adopter
Using Claude for simple, highly supervised tasks. Short autonomous runs, heavy steering, narrow task range. Matches engineers who report being able to fully delegate only 0–20% of their work.
3–5 · Developing Collaborator
Growing delegation confidence with some oversight patterns emerging. Correction rate is still high relative to session length, indicating over-steering. Task complexity predominantly low-to-medium.
5–7 · Effective Delegator
Strategic task selection with appropriate autonomy granted. Longer uninterrupted tool call chains. Consistent output validation without excessive intervention. At or slightly above the February 2025 Anthropic engineering cohort baseline.
7–9 · AI-Native Builder
High autonomy granted with strong oversight discipline. Wide task range — including front-end, data science, and architecture work. Generating net-new work that previously couldn't be attempted. Consistent with top-quartile Anthropic engineers in the August 2025 data.
9–10 · AI Power User
Benchmark-beating metrics across all six dimensions. Sustained complexity progression and optimal autonomy/oversight balance. Consistent with the highest-performing segment of the Anthropic internal engineering cohort by August 2025.
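The level bands can be turned into a simple lookup. Because adjacent bands share their endpoints (3, 5, 7, and 9 each appear in two bands), the sketch below assumes a boundary score belongs to the higher level; that tie-breaking rule and the function name are assumptions, not something the level definitions state.

```python
# Lower bounds and names copied from the level headings, highest first.
LEVELS = [
    (9.0, "AI Power User"),
    (7.0, "AI-Native Builder"),
    (5.0, "Effective Delegator"),
    (3.0, "Developing Collaborator"),
    (1.0, "Early Adopter"),
]


def maturity_level(score: float) -> str:
    """Map a 1-10 score to a level name (boundary scores go to the higher level)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    for lower_bound, name in LEVELS:
        if score >= lower_bound:
            return name
    raise AssertionError("unreachable: score >= 1 always matches the last band")


print(maturity_level(6.05))
```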
Benchmark data
Anthropic analyzed 200,000+ Claude Code session transcripts from its own engineers across a six-month period (February → August 2025). The cohort spans pre-training, security, alignment, and non-technical teams.
| Metric | Feb 2025 | Aug 2025 | Change |
|---|---|---|---|
| Max consecutive tool calls | 9.8 | 21.2 | +116% |
| Avg human turns per session | 6.2 | 4.1 | −33% |
| Avg task complexity (1–5) | 3.2 | 3.8 | +19% |
| Feature implementation share | 14.3% | 36.9% | +158% |
| Design & planning share | 1.0% | 9.9% | +890% |
| Papercut fix share | — | 8.6% | — |
These numbers define the scoring range for each dimension. A score of 10 on Autonomy Calibration corresponds to tool call patterns at or above the August 2025 cohort benchmark. A score of 5 corresponds roughly to the February 2025 baseline.
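The two anchor points above (February baseline at a score of 5, August benchmark at a score of 10) are enough to sketch one possible mapping for Autonomy Calibration. The version below draws a straight line through the anchors and clamps the result to 1–10; the linear shape, the clamping, and the function name are assumptions, since only the two anchors are given.

```python
FEB_2025_MAX_TOOL_CALLS = 9.8   # benchmark table value, anchored at score 5
AUG_2025_MAX_TOOL_CALLS = 21.2  # benchmark table value, anchored at score 10


def autonomy_score(max_consecutive_tool_calls: float) -> float:
    """Hypothetical linear interpolation through the two anchors, clamped to [1, 10]."""
    span = AUG_2025_MAX_TOOL_CALLS - FEB_2025_MAX_TOOL_CALLS
    fraction = (max_consecutive_tool_calls - FEB_2025_MAX_TOOL_CALLS) / span
    raw = 5.0 + 5.0 * fraction
    return max(1.0, min(10.0, raw))


for calls in (4, 9.8, 15.5, 21.2, 30):
    print(calls, round(autonomy_score(calls), 2))
```

Under this sketch, anything at or beyond 21.2 consecutive tool calls saturates at 10, which matches the "at or above the August 2025 cohort benchmark" wording.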
How the analyzer works
The Claude Code Maturity Score analyzer makes three kinds of API call per analysis run. Calls that carry per-session data are sent in batches of 5 sessions to stay within rate limits. If your project has more than 100 sessions, structural metrics are computed on all of them, but LLM classification runs on a uniform sample of 100.
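The sampling and batching rules can be sketched as follows. The cap of 100 and the batch size of 5 come from the description above; the function names are hypothetical, and "uniform sample" is interpreted here as a uniform random sample without replacement, which is an assumption.

```python
import random

SAMPLE_CAP = 100  # LLM classification runs on at most 100 sessions
BATCH_SIZE = 5    # sessions per request batch


def sample_sessions(session_ids: list[str], cap: int = SAMPLE_CAP, seed=None) -> list[str]:
    """Return every session when at or under the cap, else a uniform random sample."""
    if len(session_ids) <= cap:
        return list(session_ids)
    return random.Random(seed).sample(session_ids, cap)


def batches(items: list[str], size: int = BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


all_sessions = [f"session-{i}" for i in range(250)]
to_classify = sample_sessions(all_sessions, seed=0)
print(len(to_classify), sum(1 for _ in batches(to_classify)))  # 100 sessions in 20 batches
```

Structural metrics such as turn counts would still be computed over `all_sessions`; only the sampled subset is sent for classification.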
| API call | What it sends | What it returns |
|---|---|---|
| 1 — Session classification | First 2 + last human message per session (max 1,200 chars), turn count, max tool calls | task_type, complexity, is_new_work, delegation_appropriateness |
| 2 — Oversight detection | Up to 20 human turns per session, each capped at 150 characters | Each turn labelled: correction, redirection, validation, or pure_input |
| 3 — Holistic summary | Aggregated dimension scores and metadata only — no message content | 3 strengths, 3 gaps, delegation pattern, maturity narrative |
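The truncation rules in the first two rows can be sketched like this. The limits (first 2 plus last human message, 1,200 characters, 20 turns, 150 characters) come from the table; the function names and separator are hypothetical. Two details are assumptions because the table leaves them open: the 1,200-character cap is applied to the combined excerpt rather than to each message, and "up to 20 human turns" is taken to mean the first 20.

```python
def classification_excerpt(human_messages: list[str], max_chars: int = 1200) -> str:
    """First two plus last human message, joined and truncated to max_chars.

    Assumption: the cap applies to the combined text, not to each message.
    """
    picked = human_messages[:2]
    if len(human_messages) > 2:
        picked.append(human_messages[-1])  # avoid duplicating when there are <= 2 turns
    return "\n---\n".join(picked)[:max_chars]


def oversight_turns(human_messages: list[str], max_turns: int = 20, max_chars: int = 150) -> list[str]:
    """Up to max_turns human turns (assumed: the first ones), each cut to max_chars."""
    return [message[:max_chars] for message in human_messages[:max_turns]]


messages = [f"turn {i}: " + "x" * 300 for i in range(30)]
print(len(classification_excerpt(messages)), len(oversight_turns(messages)))
```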
Privacy: Session files never leave your machine. Only the summarised inputs listed above are sent to the API, using your own API key, and no full conversation content is transmitted. All other processing runs in your browser, and this tool stores and logs nothing.
Limitations acknowledged by Anthropic: The cohort has selection bias toward engaged respondents, social desirability bias from non-anonymous responses, and reflects Anthropic employees who may have above-average AI familiarity. Patterns may have shifted with newer model releases since August 2025.