Claude Code Maturity Score

A behavioral scoring system that measures how effectively you use AI-assisted coding — not what you build, but how you delegate, supervise, and iterate. Benchmarked against Anthropic's internal engineering cohort across 200,000+ real Claude Code sessions.


6 scored dimensions

25% of score

Delegation Intelligence

Measures whether you choose the right tasks to hand off to Claude — appropriately complex, verifiable, and code-adjacent.

20% of score

Autonomy Calibration

Measures whether you grant Claude enough uninterrupted space to do substantive work — without collapsing into micromanagement.

20% of score

Oversight Quality

Measures whether you catch and correct bad outputs at the right frequency — calibrated supervision, not passive acceptance or constant correction.

15% of score

Complexity Progression

Measures whether the difficulty of tasks you delegate is increasing over time — tracking whether your trust in Claude is actually growing.

10% of score

Task Breadth

Measures how wide a range of task types you delegate — rewarding coverage across domains, not just depth in one.

10% of score

New Work Generation

Measures what proportion of your sessions involve tasks that wouldn't have been done without AI — genuine economic surplus, not just faster delivery of existing work.


What the score measures

Most AI productivity metrics measure what gets built. This score measures how you build — the delegation decisions, oversight patterns, and autonomy calibration that determine whether you're extracting full value from AI-assisted coding.

The framework is grounded in Anthropic's own analysis of real-world Claude usage. Using the methodology developed in Clio: Privacy-Preserving Insights into Real-World AI Use — a system that analyzes behavioral patterns across millions of interactions without exposing raw conversation content — Anthropic studied how its own engineers' Claude Code usage evolved from February to August 2025.

Key finding: Between February and August 2025, Anthropic engineers increased maximum consecutive tool calls by 116% (9.8 → 21.2), reduced human turns per session by 33% (6.2 → 4.1), and raised average task complexity from 3.2 to 3.8 on a 1–5 scale — without sacrificing oversight quality.

The maturity score translates these behavioral patterns into a single comparable number, giving engineers a clear benchmark for where they are and what to improve.


Research foundation

Stat | Source
200,000+ Claude Code transcripts analyzed (Feb → Aug 2025) | Anthropic work study
132 engineers and researchers surveyed | Anthropic work study
27% of AI-assisted work is net-new (tasks that wouldn't have been done otherwise) | Anthropic work study
1M+ conversations analyzed for behavioral pattern methodology | Clio paper

The six dimensions were derived from what the research identified as the highest-signal behavioral differences between engineers who gained the most from Claude Code and those who plateaued. Delegation choice, interruption frequency, task diversity, and new-work generation all emerged as separating factors.


The 6 dimensions

Each dimension is scored 1–10 and weighted by its impact on effective AI collaboration.

Delegation Intelligence — 25%

Are you delegating tasks that Claude is actually suited for — appropriately complex, verifiable, and code-adjacent?

  • High: debugging, refactoring, self-contained feature work, papercut fixes
  • Low: high-level design, strategic decisions, tasks requiring organizational context

The research found the most-delegated tasks among high performers are "easily verifiable, low-complexity, self-contained, boring, or throwaway code" — tasks where AI error is cheap to catch.

Autonomy Calibration — 20%

Are you letting Claude run uninterrupted long enough to do substantive work? Excessive steering creates overhead that cancels out AI productivity gains.

  • High: long consecutive tool call chains, few human interruptions per task
  • Low: constant re-steering, short autonomous runs before intervention

The 116% increase in max consecutive tool calls from Feb to Aug 2025 is the clearest single signal separating developing from high-maturity users.
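As a rough illustration of the "max consecutive tool calls" signal, the sketch below counts the longest run of tool calls that no human turn interrupts. The event labels (`"human"`, `"tool_call"`, `"assistant"`) are hypothetical, and the choice to let assistant text pass without breaking a run is an assumption; the page does not specify how the analyzer parses sessions.

```python
from typing import Iterable


def max_consecutive_tool_calls(events: Iterable[str]) -> int:
    """Longest run of 'tool_call' events not interrupted by a 'human' turn."""
    longest = current = 0
    for kind in events:
        if kind == "tool_call":
            current += 1
            longest = max(longest, current)
        elif kind == "human":
            current = 0  # a human turn ends the autonomous run
        # Assumption: other event kinds (e.g. assistant text) neither
        # extend nor break the run.
    return longest


session = ["human", "tool_call", "tool_call", "assistant",
           "tool_call", "human", "tool_call"]
print(max_consecutive_tool_calls(session))  # 3
```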

Oversight Quality — 20%

When Claude goes off-track, do you catch it and redirect it precisely? This dimension rewards calibrated supervision — not passive acceptance and not constant correction.

  • High: targeted corrections at the right moments; consistent output validation
  • Low: zero corrections (passive acceptance of all outputs) or extremely high correction rate (poor delegation or task mismatch)

The work study notes that effective supervision requires the same coding expertise that delegation may erode over time — making oversight a critical ongoing skill, not a passive one.
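One way to make "calibrated supervision" concrete is to look at how a session's human turns split across the four labels the oversight-detection step returns (correction, redirection, validation, pure_input). The sketch below only computes those shares; the function names are hypothetical, and it deliberately sets no thresholds, since the page does not state what correction rate counts as too low or too high.

```python
from collections import Counter

LABELS = ("correction", "redirection", "validation", "pure_input")


def label_shares(turn_labels):
    """Fraction of human turns carrying each oversight label."""
    total = len(turn_labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(turn_labels)
    return {label: counts[label] / total for label in LABELS}


def intervention_rate(turn_labels):
    """Share of turns that steer Claude (corrections plus redirections)."""
    shares = label_shares(turn_labels)
    return shares["correction"] + shares["redirection"]


turns = ["pure_input", "validation", "correction", "pure_input"]
print(label_shares(turns)["correction"])  # 0.25
print(intervention_rate(turns))           # 0.25
```

An intervention rate of exactly zero would correspond to the "passive acceptance" end of the low-score description, and a rate near one to the "constant correction" end.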

Complexity Progression — 15%

Is the complexity of tasks you delegate increasing over time? Staying at low-complexity tasks signals that trust in Claude is not growing.

  • High: upward trend in task complexity; architecture, feature implementation, design planning
  • Low: flat or declining complexity — only simple edits and fixes delegated

Anthropic engineers increased average task complexity from 3.2 to 3.8 (on a 1–5 scale) across this period. Feature implementation grew from 14.3% to 36.9% of all sessions; design and planning from 1.0% to 9.9%.
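A simple way to check whether delegated complexity is trending upward is to fit a straight line through per-session complexity ratings in chronological order and read off the slope. This is a minimal sketch under that assumption; the page does not say how the analyzer measures the trend, and `complexity_slope` is a hypothetical name.

```python
def complexity_slope(complexities):
    """Least-squares slope of complexity (1-5) against session order."""
    n = len(complexities)
    if n < 2:
        return 0.0  # a trend needs at least two sessions
    mean_x = (n - 1) / 2
    mean_y = sum(complexities) / n
    numerator = sum((i - mean_x) * (y - mean_y)
                    for i, y in enumerate(complexities))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


print(complexity_slope([1, 2, 3, 4]))  # 1.0 (rising by one point per session)
print(complexity_slope([3, 3, 3]))     # 0.0 (flat)
```

A positive slope matches the "upward trend" description, while zero or negative values match "flat or declining complexity".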

Task Breadth — 10%

How wide a range of task types are you delegating? Engineers who confine Claude to one task type miss the compounding benefits of full-stack AI collaboration.

  • High: debugging, front-end, data science, refactoring, code understanding all present
  • Low: only one or two task types used across sessions

The study found engineers becoming "more full-stack" by leveraging Claude across domains previously requiring specialist knowledge.

New Work Generation — 10%

What proportion of your AI-assisted sessions involve tasks that wouldn't have been done without Claude? This is AI creating genuine economic surplus, not just redistributing existing work.

  • High: throwaway tooling, exploratory prototypes, work in unfamiliar domains
  • Low: AI used only for tasks already on the roadmap

Anthropic found 27% of Claude-assisted work falls into this category across its internal cohort.
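Task Breadth and New Work Generation can both be read off the per-session classification output. The sketch below assumes each session is a dict carrying the `task_type` and `is_new_work` fields named in the analyzer description; the function name and the sample task-type strings are hypothetical.

```python
def breadth_and_new_work(sessions):
    """Return (number of distinct task types, share of net-new sessions)."""
    if not sessions:
        return 0, 0.0
    task_types = {s["task_type"] for s in sessions}
    new_work = sum(1 for s in sessions if s["is_new_work"])
    return len(task_types), new_work / len(sessions)


classified = [
    {"task_type": "debugging", "is_new_work": False},
    {"task_type": "refactoring", "is_new_work": False},
    {"task_type": "debugging", "is_new_work": False},
    {"task_type": "data_science", "is_new_work": True},
]
print(breadth_and_new_work(classified))  # (3, 0.25)
```

For comparison, a new-work share of 0.27 would match the 27% figure reported for the internal cohort.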


Weighting rationale

Dimension | Weight | Rationale
Delegation Intelligence | 25% | Wrong task selection is the most common failure mode; it undermines all other dimensions
Autonomy Calibration | 20% | The 116% tool call increase is the clearest behavioral signal of maturity growth
Oversight Quality | 20% | High autonomy without effective supervision is the primary risk of advanced AI use
Complexity Progression | 15% | Flat complexity signals stalled trust; growth requires delegating harder work over time
Task Breadth | 10% | Cross-domain delegation produces compounding returns; single-type use caps gains
New Work Generation | 10% | Weighted lower because it varies significantly by role and project type
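Because the weights sum to 100%, the overall score can plausibly be computed as a weighted average of the six 1–10 dimension scores. The sketch below assumes exactly that; the page does not spell out the aggregation formula, and the dictionary keys are hypothetical names.

```python
WEIGHTS = {
    "delegation_intelligence": 0.25,
    "autonomy_calibration": 0.20,
    "oversight_quality": 0.20,
    "complexity_progression": 0.15,
    "task_breadth": 0.10,
    "new_work_generation": 0.10,
}


def overall_score(scores):
    """Weighted average of the six dimension scores (each 1-10)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {
    "delegation_intelligence": 8,
    "autonomy_calibration": 6,
    "oversight_quality": 7,
    "complexity_progression": 5,
    "task_breadth": 4,
    "new_work_generation": 3,
}
print(round(overall_score(example), 2))  # 6.05
```

In this example the two lowest scores sit on the two 10% dimensions, so they pull the total down less than a weak Delegation Intelligence score would.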

Maturity levels

1–3 · Early Adopter

Using Claude for simple, highly supervised tasks. Short autonomous runs, heavy steering, narrow task range. Matches engineers who report being able to fully delegate only 0–20% of their work.

3–5 · Developing Collaborator

Growing delegation confidence with some oversight patterns emerging. Correction rate is still high relative to session length, indicating over-steering. Task complexity predominantly low-to-medium.

5–7 · Effective Delegator

Strategic task selection with appropriate autonomy granted. Longer uninterrupted tool call chains. Consistent output validation without excessive intervention. At or somewhat above the February 2025 Anthropic engineering cohort baseline, which corresponds to a score of roughly 5.

7–9 · AI-Native Builder

High autonomy granted with strong oversight discipline. Wide task range — including front-end, data science, and architecture work. Generating net-new work that previously couldn't be attempted. Consistent with top-quartile Anthropic engineers in the August 2025 data.

9–10 · AI Power User

Benchmark-beating metrics across all six dimensions. Sustained complexity progression and optimal autonomy/oversight balance. Consistent with the highest-performing segment of the Anthropic internal engineering cohort by August 2025.
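The level bands share their boundary values (3, 5, 7, and 9 each appear in two bands), so a lookup has to pick a convention. The sketch below assumes each boundary belongs to the higher band, which is an assumption rather than something the page states; `maturity_level` is a hypothetical name.

```python
# Lower bound of each band, highest first.
# Assumption: a boundary score belongs to the higher band.
LEVELS = [
    (9.0, "AI Power User"),
    (7.0, "AI-Native Builder"),
    (5.0, "Effective Delegator"),
    (3.0, "Developing Collaborator"),
    (1.0, "Early Adopter"),
]


def maturity_level(score):
    """Map an overall score in [1, 10] to its maturity level name."""
    if not 1.0 <= score <= 10.0:
        raise ValueError("score must be between 1 and 10")
    for lower_bound, name in LEVELS:
        if score >= lower_bound:
            return name


print(maturity_level(6.05))  # Effective Delegator
print(maturity_level(3))     # Developing Collaborator
```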


Benchmark data

Anthropic analyzed 200,000+ Claude Code session transcripts from its own engineers across a six-month period (February → August 2025). The cohort spans pre-training, security, alignment, and non-technical teams.

Metric | Feb 2025 | Aug 2025 | Change
Max consecutive tool calls | 9.8 | 21.2 | +116%
Avg human turns per session | 6.2 | 4.1 | −33%
Avg task complexity (1–5) | 3.2 | 3.8 | +19%
Feature implementation share | 14.3% | 36.9% | +158%
Design & planning share | 1.0% | 9.9% | +890%

Papercut fix share: 8.6%

These numbers define the scoring range for each dimension. A score of 10 on Autonomy Calibration corresponds to tool call patterns at or above the August 2025 cohort benchmark. A score of 5 corresponds roughly to the February 2025 baseline.
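The two anchor points for Autonomy Calibration (a score of roughly 5 at the February baseline of 9.8 max consecutive tool calls, and 10 at the August benchmark of 21.2) suggest a simple mapping. The sketch below assumes a straight line through those two points, clamped to the 1–10 range; the linear shape, the clamping, and the function name are assumptions, since the page gives only the anchors.

```python
FEB_BASELINE = 9.8    # max consecutive tool calls, scored about 5
AUG_BENCHMARK = 21.2  # max consecutive tool calls, scored 10


def autonomy_score(max_tool_calls):
    """Linear interpolation through the two anchors, clamped to [1, 10]."""
    slope = (10.0 - 5.0) / (AUG_BENCHMARK - FEB_BASELINE)
    raw = 5.0 + slope * (max_tool_calls - FEB_BASELINE)
    return max(1.0, min(10.0, raw))


print(autonomy_score(9.8))             # 5.0
print(round(autonomy_score(15.5), 2))  # 7.5 (halfway between the anchors)
print(autonomy_score(30))              # 10.0 (clamped at the top)
```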


How the analyzer works

The Claude Code Maturity Score analyzer makes three kinds of API call per analysis run, with sessions sent in batches of 5 to stay within rate limits. If your project has more than 100 sessions, structural metrics are computed on all of them, but LLM classification runs on a uniform random sample of 100.
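The sampling and batching described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the actual analyzer runs in the browser rather than in Python.

```python
import random


def sample_sessions(sessions, limit=100, seed=None):
    """Return all sessions, or a uniform random sample of `limit` of them."""
    sessions = list(sessions)
    if len(sessions) <= limit:
        return sessions
    # random.sample draws without replacement, so every session is
    # equally likely to be chosen and none is picked twice.
    return random.Random(seed).sample(sessions, limit)


def batches(items, size=5):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


session_ids = list(range(250))
sampled = sample_sessions(session_ids, seed=0)
print(len(sampled))                 # 100
print(len(list(batches(sampled))))  # 20
```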

API call | What it sends | What it returns
1. Session classification | First 2 + last human message per session (max 1,200 chars), turn count, max tool calls | task_type, complexity, is_new_work, delegation_appropriateness
2. Oversight detection | Up to 20 human turns per session, each capped at 150 characters | Each turn labelled: correction, redirection, validation, or pure_input
3. Holistic summary | Aggregated dimension scores and metadata only (no message content) | 3 strengths, 3 gaps, delegation pattern, maturity narrative
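To show how little text the first two calls carry, here is a minimal sketch of the truncation they describe. Several details are assumptions: that the 1,200-character cap applies to the combined excerpt rather than to each message, that the "up to 20" turns are the first 20, and the separator string. The function names are hypothetical.

```python
def classification_excerpt(human_messages, max_chars=1200):
    """First two plus last human message, joined and cut to max_chars."""
    if len(human_messages) <= 3:
        picked = list(human_messages)  # avoids repeating a message
    else:
        picked = human_messages[:2] + human_messages[-1:]
    # Assumption: the cap applies to the combined excerpt.
    return "\n---\n".join(picked)[:max_chars]


def oversight_turns(human_messages, max_turns=20, max_chars=150):
    """Up to max_turns human turns, each cut to max_chars characters."""
    # Assumption: the first max_turns turns are the ones kept.
    return [message[:max_chars] for message in human_messages[:max_turns]]


messages = ["fix the login bug", "use the existing helper", "run tests", "looks good"]
print(classification_excerpt(messages))
print(len(oversight_turns(["x" * 200] * 25)))  # 20
```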

Privacy: Session files stay on your machine; only the summarised inputs listed above are sent to the API, using your own API key. No full conversation content is transmitted. Parsing and scoring run in your browser, and this tool stores and logs nothing.

Limitations acknowledged by Anthropic: The cohort has selection bias toward engaged respondents, social desirability bias from non-anonymous responses, and reflects Anthropic employees who may have above-average AI familiarity. Patterns may have shifted with newer model releases since August 2025.

See where you stand

Upload your Claude Code session files. Analysis runs in your browser using your API key — nothing is stored.
