Leaderboard
Current rankings of code generation models on SlopCodeBench
Showing 8 of 14 runs
| Agent/Model | Version | Thinking | % Solved | % Checkpoints | $/Checkpoint | Erosion | Verbosity |
|---|---|---|---|---|---|---|---|
| Claude Code/Opus 4.5 | 2.0.51 | High | 0.0 | 10.75 | $4.56 | 0.437 | 0.905 |
| Codex/GPT 5.1-Codex-Max | 0.65.0 | High | 0.0 | 10.75 | $2.86 | 0.402 | 0.768 |
| Codex/GPT 5.2 | 0.71.0 | High | 0.0 | 10.75 | $4.55 | 0.425 | 0.765 |
| Codex/GPT 5.2-Codex | 0.74.0 | High | 0.0 | 10.75 | $3.09 | 0.348 | 0.777 |
| Claude Code/Opus 4.5 | 2.0.62 | High | 0.0 | 8.60 | $2.63 | 0.408 | 0.905 |
| Claude Code/Opus 4.5 | 2.0.75 | High | 0.0 | 7.53 | $2.69 | 0.457 | 0.911 |
| Claude Code/Sonnet 4.5 | 2.0.51 | High | 0.0 | 5.38 | $1.49 | 0.490 | 0.749 |
| Claude Code/GLM 4.7 | 2.0.75 | High | 0.0 | 4.30 | $1.61 | 0.443 | 0.758 |
[Chart omitted: per-model results for GLM 4.7, GPT 5.1-Codex-Max, GPT 5.2, GPT 5.2-Codex, Opus 4.5, and Sonnet 4.5]
Metric Definitions
% Solved — Percentage of problems solved
% Checkpoints — Percentage of checkpoints solved
$/Checkpoint — Dollar cost per checkpoint solved
Erosion — Share of functions with cyclomatic complexity above 10, plus lint errors per line of code; functions with complexity above 30 are counted twice (see the sketch below)
Verbosity — Count of spans flagged by AST-grep rules or an LLM-as-judge, relative to logical lines of code
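As a rough illustration of how the Erosion score could be computed, here is a minimal Python sketch. It assumes cyclomatic complexity comes from radon's `cc_visit` and that the lint error count is produced by a separate linter run; the actual tooling, weighting, and aggregation used by SlopCodeBench are not specified on this page, so treat this as an interpretation of the definition, not the benchmark's implementation.

```python
# Minimal sketch of the Erosion definition above. Assumptions (not
# from the leaderboard): complexity is measured with radon's cc_visit,
# and lint_errors is the number of lint findings for the same file.
from radon.complexity import cc_visit


def erosion(source: str, lint_errors: int) -> float:
    """Share of functions with cyclomatic complexity > 10
    (CC > 30 counted twice), plus lint errors per line of code."""
    blocks = cc_visit(source)  # one entry per function/method/class
    complex_count = sum(
        2 if b.complexity > 30 else 1  # CC > 30 counts twice
        for b in blocks
        if b.complexity > 10
    )
    share_complex = complex_count / len(blocks) if blocks else 0.0
    lines = max(1, len(source.splitlines()))
    return share_complex + lint_errors / lines
```

In practice this would be computed per file and then aggregated over everything the agent wrote; how SlopCodeBench rolls per-file scores into the single Erosion column is left open here.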
*Last updated: January 11, 2026*