Leaderboard

Current rankings of code generation models on SlopCodeBench

Showing 11 of 25 runs
Showing best version by % Checkpoints.
Model                       Harness      Version  Strict Solve  Iso Solve  $/CKPT  Erosion  Verbosity
GPT 5.3-Codex (High)        Codex        0.98.0   9.68          23.66      $3.14   0.676    0.356
Opus 4.6 (High)             Claude Code  2.1.32   17.20         21.51      $3.47   0.774    0.346
GPT 5.4 (High)              Codex        0.110.0  11.83         20.43      $3.27   0.515    0.286
GPT 5.2 (High)              Codex        0.71.0   10.75         19.35      $4.55   0.711    0.358
Sonnet 4.6 (High)           Claude Code  2.1.44   8.54          18.29      $1.92   0.703    0.313
GPT 5.2-Codex (High)        Codex        0.80.0   9.68          18.28      $2.89   0.689    0.388
Opus 4.5 (High)             Claude Code  2.0.51   10.87         17.39      $2.64   0.710    0.287
GPT 5.1-Codex-Max (High)    Codex        0.65.0   10.75         17.20      $2.86   0.642    0.331
Sonnet 4.5 (High)           Claude Code  2.0.65   5.38          16.13      $1.49   0.682    0.293
GPT 5.3-Codex-Spark (High)  Codex        0.100.0  5.38          12.90      $0.91   0.575    0.352
GLM 4.7 (High)              Claude Code  2.0.76   4.30          9.68       $1.61   0.664    0.305
[Chart not shown. Legend: OpenAI, Anthropic, Z-AI, Other]
Metric Definitions
% Solved — Percentage of problems solved
CKPT Solved — A checkpoint counts as solved only if it and all prior checkpoints are solved.
Isolated Solved — Percentage of checkpoints that pass their own tests, regardless of prior checkpoints.
Core Solved — Only the core tests for a checkpoint need to pass.
$/CKPT — Average USD cost per checkpoint
Erosion — Fraction of total complexity mass in high-complexity functions (CC > 10), where mass(f) = CC(f) × √SLOC(f). 0 = no high-complexity functions, 1 = all mass in high-CC functions.
Verbosity — Union of AST-Grep flagged lines and clone lines divided by LOC. Bounded [0, 1].
% AST-Grep — Percentage of lines flagged by AST-Grep rules for wasteful code patterns.
% Cloned — Percentage of lines that are structural duplicates (clone lines / LOC).
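
The Erosion and Verbosity definitions above can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the benchmark's own implementation. The function names, the (cc, sloc) input shape, the use of line-number sets, and the 0.0 result for empty input are all assumptions made for the example.

```python
import math

# Threshold from the Erosion definition: CC > 10 counts as high-complexity.
HIGH_CC_THRESHOLD = 10


def mass(cc: int, sloc: int) -> float:
    """Complexity mass of one function: CC(f) * sqrt(SLOC(f))."""
    return cc * math.sqrt(sloc)


def erosion(functions: list[tuple[int, int]]) -> float:
    """Fraction of total complexity mass held by functions with CC > 10.

    `functions` is a list of (cc, sloc) pairs (hypothetical input shape).
    Returns 0.0 when there is no mass at all (an assumption).
    """
    total = sum(mass(cc, sloc) for cc, sloc in functions)
    if total == 0:
        return 0.0
    high = sum(mass(cc, sloc) for cc, sloc in functions if cc > HIGH_CC_THRESHOLD)
    return high / total


def verbosity(flagged_lines: set[int], clone_lines: set[int], loc: int) -> float:
    """Union of AST-Grep flagged lines and clone lines, divided by LOC.

    Taking the set union means a line that is both flagged and cloned
    is counted once, which keeps the ratio within [0, 1].
    """
    if loc == 0:
        return 0.0
    return len(flagged_lines | clone_lines) / loc


# Three functions with masses 4, 48 and 15; only the second has CC > 10.
print(round(erosion([(2, 4), (12, 16), (5, 9)]), 3))  # 0.716
# Lines 1-3 flagged, lines 3-4 cloned, 10 lines total: 4 distinct lines.
print(verbosity({1, 2, 3}, {3, 4}, 10))  # 0.4
```

Because mass scales with the square root of SLOC, a long function with high CC weighs more than a short one, but less than linearly in its length.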

* Last updated: March 27, 2026