Weight: 25% of overall score · How the overall score is calculated
Definition
Delegation Intelligence measures whether you are choosing the right tasks to hand off to Claude. It is not about how much you delegate; it is about the quality of that selection. Delegating the wrong tasks produces poor outputs and high correction rates, and it erodes trust in AI tooling. Delegating the right tasks produces reliable, verifiable results with minimal steering overhead.
This is the highest-weighted dimension because task selection failure cascades into every other dimension. Poor delegation creates false signals across autonomy, oversight, and complexity scores.
How it's measured
Each session is classified by an LLM reviewer across two signals:
Task type — the session is assigned one of: debugging, feature_implementation, refactoring, code_understanding, design_planning, data_science, front_end, papercut_fix, other.
Delegation appropriateness — the reviewer judges whether the task is a good fit for Claude given its complexity and type: good, poor, or unclear.
A session scores as good delegation when:
good_session = (delegation_appropriateness = "good")
    AND (complexity ≤ 3
         OR task_type ∈ {debugging, refactoring, papercut_fix, code_understanding, data_science, front_end})
score = 1 + (good_sessions / total_sessions) × 9
The scale is linear: 0% good sessions gives a score of 1, and 100% good sessions gives a score of 10.
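The rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the reviewer's actual implementation: the `Session` class, the function names, and the integer type of `complexity` are assumptions, and the score mapping follows the stated endpoints (0% gives 1, 100% gives 10, linear in between).

```python
from dataclasses import dataclass

# Task types that qualify at any complexity, per the rule above.
QUALIFYING_TASK_TYPES = {
    "debugging", "refactoring", "papercut_fix",
    "code_understanding", "data_science", "front_end",
}


@dataclass
class Session:  # hypothetical container for one reviewed session
    task_type: str
    delegation_appropriateness: str  # "good", "poor", or "unclear"
    complexity: int  # assumed integer; the source does not define its range


def is_good_session(s: Session) -> bool:
    """Apply the good-delegation rule to a single session."""
    return s.delegation_appropriateness == "good" and (
        s.complexity <= 3 or s.task_type in QUALIFYING_TASK_TYPES
    )


def delegation_score(sessions: list[Session]) -> float:
    """Map the share of good sessions linearly onto the 1 to 10 scale."""
    if not sessions:
        raise ValueError("need at least one session")
    good = sum(is_good_session(s) for s in sessions)
    return 1 + 9 * good / len(sessions)


sessions = [
    Session("debugging", "good", 5),               # good: qualifying task type
    Session("design_planning", "good", 2),         # good: complexity <= 3
    Session("feature_implementation", "good", 4),  # not good: fails both branches
    Session("refactoring", "poor", 1),             # not good: judged "poor"
]
print(delegation_score(sessions))  # 5.5
```

Note that a feature_implementation or design_planning session only counts as good when its complexity is 3 or lower, because neither type appears in the qualifying set.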
What high vs low looks like
High (score 8–10)
- Debugging sessions: isolating a specific failure, not redesigning the system
- Refactoring a well-defined module with clear before/after criteria
- Writing throwaway scripts, test fixtures, or one-off data transformations
- Front-end implementation from a defined spec
- Tasks where correctness is easy to verify by running or reviewing output
Low (score 1–4)
- Asking Claude to make architectural decisions without constraints
- Delegating tasks that require organizational context Claude cannot have
- Handing off work where you cannot verify the output without significant re-work
- Using Claude for strategic planning or prioritization decisions
Behavioural patterns in real sessions
Anthropic's internal work study found a consistent pattern in what its engineers chose to delegate. The most commonly delegated tasks were described as "easily verifiable, low-complexity, self-contained, boring, or throwaway code." These are exactly the tasks that score well on this dimension.
The least-delegated tasks were "high-level design, strategic thinking, and organizational context decisions" — tasks that scored poorly when engineers did attempt to delegate them, producing outputs that required heavy correction or were discarded entirely.
44% of Claude-assisted work involved tasks employees "wouldn't enjoy doing manually" — a reliable proxy for tasks that are well-scoped, repetitive, and verifiable.
By August 2025, feature implementation had grown from 14.3% to 36.9% of sessions in the cohort, which suggests that as engineers gained experience, they learned to frame feature work as well-scoped delegation tasks rather than open-ended requests.
Papercut fixes — small, high-value, low-risk changes — accounted for 8.6% of all sessions in the cohort by August 2025. These are a reliable indicator of good delegation: self-contained, verifiable, and typically below the complexity threshold where AI errors compound.
How it affects your overall score
Delegation Intelligence carries 25% of your total score — the largest single weight.
A one-point improvement in this dimension adds 0.25 points to your overall score.
It also has indirect effects: sessions with poor delegation tend to have high correction rates (suppressing your Oversight Quality score) and low tool call autonomy (suppressing your Autonomy Calibration score). Improving delegation choice is typically the fastest path to improving three dimensions simultaneously.
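The direct effect of the 25% weight can be checked with a small sketch. It assumes the overall score is a weighted sum of dimension scores, which this section implies but does not spell out; the function name and the example scores are hypothetical.

```python
DELEGATION_WEIGHT = 0.25  # 25% of the overall score, as stated above


def overall_contribution(dimension_score: float,
                         weight: float = DELEGATION_WEIGHT) -> float:
    """Points this dimension contributes to the overall score (assumed weighted sum)."""
    return weight * dimension_score


# Raising Delegation Intelligence from 6.0 to 7.0 (hypothetical scores)
delta = overall_contribution(7.0) - overall_contribution(6.0)
print(delta)  # 0.25
```

This covers only the direct contribution. The indirect effects on other dimensions depend on their own weights and formulas, which are not given here.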