SlopCodeBench
A community benchmark measuring code erosion as agents iteratively extend their own solutions across checkpoints.
Top 10 Models

| Rank | Model / Agent | Score | Erosion metrics |
|------|---------------|-------|-----------------|
| 01 | GPT 5.5 / Codex | 28.1% | 0.494 · 0.269 |
| 02 | GPT 5.3-Codex / Codex | 26.0% | 0.644 · 0.336 |
| 03 | GPT 5.4 / Codex | 25.5% | 0.278 · 0.193 |
| 04 | GPT 5.2-Codex / Codex | 21.9% | 0.728 · 0.398 |
| 05 | Opus 4.6 / Claude Code | 20.9% | 0.737 · 0.318 |
| 06 | Opus 4.7 / Claude Code | 20.9% | 0.759 · 0.357 |
| 07 | Kimi K2.6 / Kimi CLI | 18.9% | 0.764 · 0.399 |
| 08 | Opus 4.5 / Claude Code | 17.3% | 0.691 · 0.297 |
| 09 | Sonnet 4.6 / Claude Code | 16.8% | 0.741 · 0.316 |
| 10 | Composer 2 / Cursor CLI | 16.3% | 0.716 · 0.353 |
Overview
SlopCodeBench evaluates coding agents the way real software actually gets built: through repeated requirement changes and extensions. Each problem is a sequence of checkpoints — the agent implements an initial version, then extends its own solution as new requirements arrive. Evaluation is black-box: only a CLI or API contract is given, with no prescribed architecture, function signatures, or module boundaries, so early design decisions compound across the run. Beyond correctness, we measure code erosion — verbosity, dead branches, and redundant structure — to surface the agents that stay clean under sustained change instead of patching their way into slop.
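To make the checkpoint loop concrete, here is a minimal sketch of how a harness might drive an agent through a problem's checkpoints. It is illustrative only: the `agent_cmd` interface, the stdin-based spec delivery, and the lines-of-code erosion proxy are assumptions for this sketch, not the actual SlopCodeBench API, and the real benchmark's erosion measurement also covers dead branches and redundant structure rather than raw verbosity alone.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path

@dataclass
class CheckpointResult:
    checkpoint: int
    passed: bool
    loc: int  # crude erosion proxy: total lines of code after this checkpoint

def run_tests(workdir: Path, test_cmd: list[str]) -> bool:
    """Black-box check: only the CLI/API contract is exercised,
    never the internal structure of the agent's solution."""
    return subprocess.run(test_cmd, cwd=workdir).returncode == 0

def count_loc(workdir: Path) -> int:
    """Toy erosion proxy (assumed for this sketch): total Python LOC.
    A real erosion metric would also flag dead code and duplication."""
    return sum(len(p.read_text().splitlines()) for p in workdir.rglob("*.py"))

def run_checkpoints(workdir: Path, specs: list[str], agent_cmd: list[str],
                    test_cmds: list[list[str]]) -> list[CheckpointResult]:
    results = []
    for i, (spec, test_cmd) in enumerate(zip(specs, test_cmds), start=1):
        # The agent receives only the new requirements and must extend
        # whatever solution it left in `workdir` at earlier checkpoints,
        # so early design decisions compound across the run.
        subprocess.run(agent_cmd, input=spec, text=True, cwd=workdir, check=True)
        results.append(CheckpointResult(i, run_tests(workdir, test_cmd),
                                        count_loc(workdir)))
    return results
```

The key design point the sketch captures is statefulness: the working directory persists across checkpoints, so an agent that hacks checkpoint 1 together pays for it at checkpoints 2 through N.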
Supported by
Special thanks to Snorkel AI for supporting this work through the Open Benchmarks Grant.


