SWE-bench Verified and SWE-bench Pro

What it measures: how well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests associated with that issue.

The specifics: there are many variants (Full, Verified, Lite, Bash-only, Multimodal). Most labs report SWE-bench Verified in their charts, a cleaned and human-reviewed subset.

Notes and quirks of SWE-bench Verified:
- It has 500 problems, all in Python. Over 40% are issues from the Django repository; the rest are libraries. Web applications are entirely missing.
- The repositories the agents have to operate in are real, hefty open-source projects, but the solutions to these issues are small: think surgical edits or small function additions. The mean solution is 11 lines of code; the median is 4. Amazon found that 77.6% of solutions touch only one function.
- All the issues are from 2023 or earlier, so this data was almost certainly in the training sets. That makes it hard to tell how much of the improvement is due to memorisation.
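For concreteness, the grading rule behind "passes the unit tests" can be sketched as follows. SWE-bench marks an issue resolved when the tests that failed before the patch (FAIL_TO_PASS) now pass, and the tests that passed before (PASS_TO_PASS) still pass. This is a minimal sketch of that check, not the actual harness; the function name and dict-based test results are illustrative assumptions.

```python
def resolved(results_before: dict[str, bool],
             results_after: dict[str, bool],
             fail_to_pass: list[str],
             pass_to_pass: list[str]) -> bool:
    """Sketch of SWE-bench-style grading (not the real harness).

    results_before/results_after map test IDs to pass (True) / fail (False),
    from running the repo's test suite without and with the candidate patch.
    """
    # FAIL_TO_PASS: tests that reproduce the issue must go fail -> pass.
    f2p_ok = all(not results_before[t] and results_after[t] for t in fail_to_pass)
    # PASS_TO_PASS: previously passing tests must not regress.
    p2p_ok = all(results_before[t] and results_after[t] for t in pass_to_pass)
    return f2p_ok and p2p_ok
```

The second condition matters: a patch that fixes the issue but breaks an unrelated test is scored as a failure, which is part of why small, surgical edits do well on this benchmark.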
I wrote a post looking into multiple SWE/coding benchmarks. Many of them measure something narrower than their names suggest.
blog.nilenso.com/blog/2025/09...