Leaderboard

Current rankings of code generation models on SlopCodeBench

Showing 16 of 24 runs
For each model and harness pair, only the best version by % Checkpoints is shown.
| Model | Harness | Version | Strict Solve (%) | Iso Solve (%) | $/CKPT | Erosion | Verbosity |
|---|---|---|---|---|---|---|---|
| GPT 5.5 (High) | Codex | 0.124.0 | 14.29 | 28.06 | $1.51 | 0.494 | 0.269 |
| GPT 5.3-Codex (High) | Codex | 0.98.0 | 11.22 | 26.02 | $0.69 | 0.644 | 0.336 |
| GPT 5.4 (High) | Codex | 0.110.0 | 10.71 | 23.47 | $0.82 | 0.508 | 0.273 |
| GPT 5.2-Codex (High) | Codex | 0.93.0 | 9.69 | 21.94 | $0.85 | 0.728 | 0.398 |
| Opus 4.6 (High) | Claude Code | 2.1.32 | 9.69 | 20.92 | $3.17 | 0.737 | 0.318 |
| Opus 4.7 (High) | Claude Code | 2.1.111 | 8.16 | 20.92 | $2.17 | 0.759 | 0.357 |
| Kimi K2.6 (High) | Kimi CLI | 1.37.0 | 10.71 | 18.88 | $0.74 | 0.764 | 0.399 |
| Opus 4.5 (High) | Claude Code | 2.0.51 | 9.18 | 17.35 | $2.53 | 0.691 | 0.297 |
| Sonnet 4.6 (High) | Claude Code | 2.1.44 | 7.14 | 16.84 | $1.96 | 0.741 | 0.316 |
| GLM 5.1 (High) | Claude Code | 2.1.44 | 9.69 | 13.78 | $1.47 | 0.684 | 0.322 |
| GPT 5.4-Mini (High) | Codex | 0.110.0 | 5.10 | 13.78 | $0.45 | 0.655 | 0.330 |
| Kimi K2.5 (High) | Kimi CLI | 1.37.0 | 4.59 | 9.69 | $0.33 | 0.712 | 0.309 |
| GLM 5.1 (High) | OpenCode | 1.4.3 | 5.61 | 8.16 | $0.59 | 0.550 | 0.387 |
| GPT 5.3-Codex-Spark (High) | Codex | 0.100.0 | 3.06 | 8.16 | $0.20 | 0.586 | 0.357 |
| Kimi K2.5 (High) | Claude Code | 2.1.44 | 3.57 | 7.14 | $1.07 | 0.692 | 0.310 |
| MiniMax M2.7 (High) | Claude Code | 2.1.44 | 2.55 | 4.08 | $0.33 | 0.500 | 0.265 |
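
The gap between the Strict Solve and Iso Solve columns comes from how the two are counted. The sketch below is illustrative only. It assumes Strict Solve corresponds to "CKPT Solved" in the metric definitions (a checkpoint counts only if it and every earlier checkpoint of the same problem pass) and Iso Solve to "Isolated Solved" (a checkpoint counts whenever its own tests pass). The function name `solve_rates` and the input layout are hypothetical, not part of SlopCodeBench.

```python
def solve_rates(problems):
    """Return (strict %, isolated %) over all checkpoints.

    `problems` is a list of problems; each problem is an ordered list of
    booleans, True meaning that checkpoint's own tests passed.
    (Hypothetical input layout, chosen for illustration.)
    """
    total = strict = isolated = 0
    for checkpoints in problems:
        all_prior_pass = True  # becomes False after the first failure
        for passed in checkpoints:
            total += 1
            if passed:
                isolated += 1
            all_prior_pass = all_prior_pass and passed
            if all_prior_pass:
                strict += 1
    if total == 0:
        return 0.0, 0.0
    return 100 * strict / total, 100 * isolated / total


# Toy data, not benchmark results: the third checkpoint of the first
# problem passes on its own but follows a failed checkpoint.
toy = [[True, False, True], [True, True]]
strict_pct, iso_pct = solve_rates(toy)
print(strict_pct, iso_pct)
```

Under these assumptions the strict rate can never exceed the isolated rate, which matches every row of the table.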
Metric Definitions
CKPT Solved: The checkpoint and all prior checkpoints are solved.
Isolated Solved: Percentage of checkpoints that pass their own tests; tests from prior checkpoints are not required.
Core Solved: Only the core tests for the checkpoint need to pass.
$/CKPT: Average cost per checkpoint, in USD.
Erosion: Fraction of total complexity mass in high-complexity functions (cyclomatic complexity CC > 10), where mass(f) = CC(f) × √SLOC(f). 0 = no high-complexity functions, 1 = all mass in high-CC functions.
Verbosity: Number of lines in the union of AST-Grep flagged lines and clone lines, divided by LOC. Bounded [0, 1].
% AST-Grep: Percentage of lines flagged by AST-Grep rules for wasteful code patterns.
% Cloned: Percentage of lines that are structural duplicates (clone lines / LOC).
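
The Erosion and Verbosity definitions can be sketched in a few lines of Python. This is a minimal illustration of the formulas as stated, not the benchmark's own implementation. It assumes per-function CC and SLOC values and the sets of flagged and cloned line numbers have already been measured by some other tool. The function names `erosion` and `verbosity` are hypothetical.

```python
import math


def erosion(functions, cc_threshold=10):
    """Share of complexity mass held by functions with CC > cc_threshold.

    `functions` is an iterable of (cc, sloc) pairs, one per function,
    and mass(f) = CC(f) * sqrt(SLOC(f)).
    """
    total = 0.0
    high = 0.0
    for cc, sloc in functions:
        mass = cc * math.sqrt(sloc)
        total += mass
        if cc > cc_threshold:
            high += mass
    # Assumption: code with no measurable mass scores 0.
    return high / total if total else 0.0


def verbosity(flagged_lines, clone_lines, loc):
    """Lines that are flagged or cloned (counted once each), divided by LOC."""
    if loc <= 0:
        return 0.0
    return len(set(flagged_lines) | set(clone_lines)) / loc


# Toy inputs, not benchmark data.
# Masses: 2 * sqrt(4) = 4 and 12 * sqrt(9) = 36, so erosion = 36 / 40.
print(erosion([(2, 4), (12, 9)]))
# Line 3 is both flagged and cloned, so the union has 4 lines out of 10.
print(verbosity({1, 2, 3}, {3, 4}, 10))
```

Taking the union rather than the sum of the two line sets is what keeps Verbosity within [0, 1]: a line that is both flagged and cloned is counted once.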

* Last updated: April 24, 2026