Leaderboard

Current rankings of code generation models on SlopCodeBench

Showing 11 of 20 runs (the best version for each model and harness pair, ranked by % Checkpoints).
| Model | Harness | Version | Isolated Solved (%) | $/CKPT | Erosion | Verbosity |
|---|---|---|---|---|---|---|
| GPT 5.3-Codex (High) | Codex | 0.98.0 | 23.66 | $3.14 | 0.678 | 0.844 |
| Opus 4.6 (High) | Claude Code | 2.1.32 | 21.51 | $3.47 | 0.647 | 0.965 |
| GPT 5.4 (High) | Codex | 0.110.0 | 20.43 | $3.27 | 0.643 | 0.682 |
| GPT 5.2 (High) | Codex | 0.71.0 | 19.35 | $4.55 | 0.671 | 0.765 |
| Opus 4.5 (High) | Claude Code | 2.0.51 | 17.39 | $2.64 | 0.642 | 0.932 |
| GPT 5.1-Codex-Max (High) | Codex | 0.65.0 | 17.20 | $2.86 | 0.656 | 0.768 |
| GPT 5.2-Codex (High) | Codex | 0.74.0 | 16.13 | $3.21 | 0.673 | 0.723 |
| Sonnet 4.6 (High) | Claude Code | 2.1.44 | 16.13 | $1.92 | 0.609 | 0.953 |
| Sonnet 4.5 (High) | Claude Code | 2.0.76 | 12.90 | $1.54 | 0.627 | 0.909 |
| GPT 5.3-Codex-Spark (High) | Codex | 0.100.0 | 12.90 | $0.91 | 0.564 | 0.819 |
| GLM 4.7 (High) | Claude Code | 2.0.75 | 9.68 | $1.61 | 0.626 | 0.758 |
[Chart: runs grouped by vendor (OpenAI, Anthropic, Z-AI, Other); not reproducible in text]
Metric Definitions
% Solved — percentage of problems solved.
CKPT Solved — percentage of checkpoints where that checkpoint and all prior checkpoints are solved.
Isolated Solved — percentage of checkpoints whose own tests pass, evaluated in isolation from prior checkpoints.
Core Solved — percentage of checkpoints whose core tests pass.
$/CKPT — average USD cost per checkpoint.
Erosion — Gini coefficient of complexity mass distributed across functions. 0 = complexity spread uniformly; 1 = all complexity concentrated in one function.
Verbosity — ratio of spans flagged by AST grep rules or an LLM-as-judge to logical lines of code.
Clone Lines / 1K LOC — duplicated (clone) lines per 1,000 lines of code at the final checkpoint.
Complexity Concentration — per-function Gini coefficient at the final checkpoint.
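Both Erosion and Complexity Concentration are Gini coefficients over per-function complexity values. As an illustration only (not the benchmark's actual implementation), a minimal Gini computation over hypothetical per-function cyclomatic-complexity counts could look like:

```python
def gini(values):
    """Gini coefficient of non-negative values.

    Returns 0.0 for a perfectly uniform distribution and approaches
    1.0 as all mass concentrates in a single value.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over sorted values x_1 <= ... <= x_n:
    # G = (2 * sum(i * x_i)) / (n * total) - (n + 1) / n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical per-function complexity counts for two codebases:
uniform = [3, 3, 3, 3]        # complexity evenly spread -> gini 0.0
concentrated = [0, 0, 0, 12]  # one "god function" -> gini 0.75
```

For n functions with all complexity in one of them, the coefficient is (n - 1) / n, which is why real-world scores cluster well below 1.0.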

* Last updated: March 6, 2026