Leaderboard

Current rankings of code generation models on SlopCodeBench

Showing 8 of 14 runs
| Agent/Model | Version | Thinking | % Solved | % Checkpoints | $/Checkpoint | Erosion | Verbosity |
|---|---|---|---|---|---|---|---|
| Claude Code/Opus 4.5 | 2.0.51 | High | 0.0 | 10.75 | $4.56 | 0.437 | 0.905 |
| Codex/GPT 5.1-Codex-Max | 0.65.0 | High | 0.0 | 10.75 | $2.86 | 0.402 | 0.768 |
| Codex/GPT 5.2 | 0.71.0 | High | 0.0 | 10.75 | $4.55 | 0.425 | 0.765 |
| Codex/GPT 5.2-Codex | 0.74.0 | High | 0.0 | 10.75 | $3.09 | 0.348 | 0.777 |
| Claude Code/Opus 4.5 | 2.0.62 | High | 0.0 | 8.60 | $2.63 | 0.408 | 0.905 |
| Claude Code/Opus 4.5 | 2.0.75 | High | 0.0 | 7.53 | $2.69 | 0.457 | 0.911 |
| Claude Code/Sonnet 4.5 | 2.0.51 | High | 0.0 | 5.38 | $1.49 | 0.490 | 0.749 |
| Claude Code/GLM 4.7 | 2.0.75 | High | 0.0 | 4.30 | $1.61 | 0.443 | 0.758 |
[Chart: metric comparison across GLM 4.7, GPT 5.1-Codex-Max, GPT 5.2, GPT 5.2-Codex, Opus 4.5, and Sonnet 4.5]
Metric Definitions
% Solved — Percentage of problems solved
% Checkpoints — Percentage of checkpoints solved
Erosion — Percentage of functions with cyclomatic complexity (CC) above 10, plus lint errors per line of code. Functions with CC above 30 are counted twice; see the sketch below.
Verbosity — Count of spans flagged by AST Grep rules or an LLM-as-Judge, relative to logical lines of code
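
The Erosion and Verbosity formulas can be made concrete with a short sketch. This is a minimal illustration, not the benchmark's actual implementation: it assumes Python sources scored with the radon library for cyclomatic complexity, and it takes the lint-error and flagged-span counts as precomputed inputs, since the benchmark's exact tooling isn't specified here.

```python
# Minimal sketch of the Erosion and Verbosity metrics as defined above.
# Assumptions (not from the benchmark itself): Python sources, cyclomatic
# complexity via the radon library, and lint/flagged-span counts supplied
# by the caller.
from radon.complexity import cc_visit


def erosion(source: str, lint_errors: int, lines_of_code: int) -> float:
    """Share of overly complex functions plus lint errors per line of code."""
    blocks = cc_visit(source)  # one block per function, method, or class
    lint_term = lint_errors / lines_of_code if lines_of_code else 0.0
    if not blocks:
        return lint_term
    complex_weight = sum(
        2 if block.complexity > 30 else 1  # CC > 30 counts twice
        for block in blocks
        if block.complexity > 10
    )
    return complex_weight / len(blocks) + lint_term


def verbosity(flagged_spans: int, logical_loc: int) -> float:
    """Flagged spans (AST Grep or LLM-as-Judge) per logical line of code."""
    return flagged_spans / logical_loc if logical_loc else 0.0
```

Worked through by hand: a repo with 40 functions, 6 of them above CC 10 and one of those above CC 30, plus 12 lint errors over 2,000 lines, scores (5 + 2)/40 + 12/2000 = 0.175 + 0.006 = 0.181. Double-counting CC > 30 functions means one deeply tangled function moves the score as much as two moderately complex ones.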

* Last updated: January 11, 2026