Why it matters:
Small leaderboard gaps can reflect benchmark artifacts, not just model quality.
Tests should prefer behavior-first assertions, and naming requirements should be explicit when they matter.
Full post:
jatinganhotra.dev/blog/swe-age...
Posts by Jatin Ganhotra
High-risk rates are concentrated, not uniform:
SWE-bench_Pro: 10.9%
Multi-SWE-bench: 5.5%
SWE-PolyBench: 3.2%
SWE-bench: 1.5%
SWE-bench_Verified: 0.8%
SWE-bench_Multilingual: 0.0%
To isolate the highest-risk cases, I asked:
1. Is the symbol name mentioned in the issue text?
2. Does that symbol already exist elsewhere in the repo?
If neither is true, I count it as high-risk coupling.
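The two-question screen above can be sketched in a few lines of Python. This is my own illustration, not the original analysis code; names like `issue_text` and `repo_symbols` are invented for the example:

```python
import re

def is_high_risk(symbol: str, issue_text: str, repo_symbols: set) -> bool:
    """Flag a test-referenced symbol as high-risk naming coupling.

    A symbol is high-risk when (1) the issue text never mentions it and
    (2) it does not already exist elsewhere in the repo -- i.e. the only
    place the name appears is the reference solution itself.
    """
    mentioned_in_issue = re.search(rf"\b{re.escape(symbol)}\b", issue_text) is not None
    exists_in_repo = symbol in repo_symbols
    return not mentioned_in_issue and not exists_in_repo

# Toy example: the test calls `get_config_or_default`, but the issue only
# says "crash when the config file is missing" and the repo has no such symbol.
issue = "App crashes when the config file is missing."
print(is_high_risk("get_config_or_default", issue, {"load_config", "App"}))  # True
print(is_high_risk("load_config", issue, {"load_config", "App"}))            # False
```

Real screening would need symbol extraction from diffs and test patches (e.g. via `ast`), but the classification rule itself is just this conjunction.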
In a broad screening pass, 2,167 instances (28.6%) had tests referencing symbols newly introduced in the reference solution.
That is not a final false-negative estimate.
But it is a strong signal that naming coupling is common enough to matter.
I scanned 6 SWE-bench-style datasets:
SWE-bench
SWE-bench_Verified
SWE-bench_Multilingual
SWE-bench_Pro
Multi-SWE-bench
SWE-PolyBench
Total: 7,567 instances.
That means an agent can produce a behaviorally correct fix and still be graded as wrong because it used a different reasonable name.
Same behavior.
Different identifier.
Benchmark failure.
The failure mode is simple:
Tests sometimes call a symbol that was newly introduced in the reference solution.
If that name was never made explicit in the issue, evaluation is no longer just about behavior. It is also about reproducing a hidden naming choice.
A coding agent can solve the right bug and still fail a benchmark for choosing the wrong unstated identifier.
I call this a hidden naming contract.
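To make the contract concrete, here is a hypothetical example (all symbol names invented for illustration): the reference solution introduces a new helper, the benchmark test hard-codes that exact name, and a behaviorally identical agent fix under a different reasonable name fails at import time:

```python
# reference_solution.py -- gold patch introduces a brand-new helper:
def sanitize_path(p: str) -> str:
    """Strip traversal components so '../../etc' cannot escape the root."""
    return "/".join(part for part in p.split("/") if part not in ("", ".", ".."))

# agent_solution.py -- behaviorally identical fix, different reasonable name:
def clean_user_path(p: str) -> str:
    return "/".join(part for part in p.split("/") if part not in ("", ".", ".."))

# test_patch.py -- the benchmark's test bakes in the gold patch's name:
#     from mymodule import sanitize_path
#     assert sanitize_path("../../etc/passwd") == "etc/passwd"
#
# The agent's patch satisfies the behavioral assertion but the import
# fails, because `clean_user_path` != `sanitize_path`.
assert sanitize_path("../../etc/passwd") == "etc/passwd"
assert clean_user_path("../../etc/passwd") == "etc/passwd"  # same behavior
```

If the issue text never said "add `sanitize_path`", the test is grading a naming choice, not the fix.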
New post:
jatinganhotra.dev/blog/swe-age...
The paper behind #IBMResearch iSWE-Agent is now out on arXiv.
If you saw the posts about iSWE-Agent topping the Java leaderboards on Multi-SWE-Bench and SWE-PolyBench, this paper describes the system behind those results: agent design, workflow, and Java-aware tooling.
arxiv.org/abs/2603.11356
3/ The eval “needle” should move to harder + broader benchmarks: SWE‑Bench Pro, Multi‑SWE‑Bench, SWE‑PolyBench.
iSWE‑Agent is top-ranked on Java for Multi‑SWE‑Bench and SWE‑PolyBench - research.ibm.com/blog/ibm-sof... , bsky.app/profile/did:...
2/ I argued last year from a different lens: Verified is becoming non‑discriminative as the leaderboard saturates; measure the frontier slice.
jatinganhotra.dev/blog/swe-age...
1/ OpenAI: SWE‑Bench Verified is no longer a good frontier eval — test/spec mismatch + contamination.
openai.com/index/why-we...
🚀 IBM Research's iSWE-Agent is now #1 on the SWE-PolyBench (full) Java leaderboard🎉
On the Verified subset, iSWE-Agent scores 46.38% on Java — matching Atlassian Rovo Dev and significantly outperforming Prometheus (33.33%).
More details: jatinganhotra.dev/news/
#AI #Java #SWEPolyBench
(repost welcome) The Generative Model Alignment team at IBM Research is looking for interns for next summer! Two openings, two topics:
🍰Reinforcement Learning environments for LLMs
🐎Speculative and non-autoregressive generation for LLMs
Interested or curious? DM me or email ramon.astudillo@ibm.com
4/4 Ready to see how AI really stacks up against human developers?
Join researchers and developers already evaluating patches → swebencharena.com
#AI #SoftwareEngineering #CodeQuality #AIEvaluation #SWEBenchArena
3/4 Unlike other platforms:
🚫 PR Arena: Tracks merge rates, not code quality
🚫 Yupp AI: Known models, not blind
🚫 SWE Arena: General coding, not SWE tasks
✅ SWE-Bench-Arena: Blind quality evaluation of real bug fixes
2/4 SWE-Bench-Arena fills this gap with blind evaluation across 5 dimensions:
• Simplicity
• Readability
• Performance
• Maintainability
• Correctness
No bias. Just quality assessment.
🧵 1/4 Current AI coding benchmarks miss the mark.
Claude 4 Sonnet hits 72.7% on SWE-Bench, but industry data shows code clones rose 48% (8.3% to 12.3%) and refactoring rates dropped from 25% to 10% since AI adoption.
(GitClear: gitclear.com/ai_assistant_code_quality_2025_research)
Try evaluating patches → swebencharena.com
What quality issues have you noticed with AI-generated code?
#AIEvaluation #SWEBenchArena #CodeQuality #AI #SoftwareEngineering
We need diverse perspectives from:
🎓 AI researchers
👩‍💻 Professional developers
📚 Academic teams
🚀 Startup engineers
Your input shapes the future of AI code evaluation standards.
How it works:
• Real GitHub issues from actual projects
• Side-by-side patch comparison
• Blind evaluation (you don't know which is AI vs human)
• Multi-dimensional quality assessment
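As a sketch of what "blind" implies mechanically (my own illustration, not SWE-Bench-Arena's actual code): randomize which side each patch lands on, show only the diffs, and keep the human/AI labels sealed until the vote is recorded.

```python
import random

def present_blind(patch_a: dict, patch_b: dict, rng: random.Random):
    """Shuffle two patches into anonymous Left/Right slots.

    Each patch dict carries 'diff' (shown to the evaluator) and
    'source' (hidden, e.g. 'human' or 'ai'). The returned answer key
    maps slot -> true source and is consulted only after the vote.
    """
    pair = [patch_a, patch_b]
    rng.shuffle(pair)  # evaluator cannot infer source from position
    shown = [{"slot": s, "diff": p["diff"]} for s, p in zip(("Left", "Right"), pair)]
    answer_key = {s: p["source"] for s, p in zip(("Left", "Right"), pair)}
    return shown, answer_key

shown, key = present_blind(
    {"diff": "- old\n+ new", "source": "ai"},
    {"diff": "- old\n+ fix", "source": "human"},
    random.Random(),
)
# `shown` contains only slots and diffs; `key` is unsealed after the vote.
print([s["slot"] for s in shown])  # ['Left', 'Right']
```

Keeping the answer key out of the rendered payload is the whole trick: the evaluator's browser never receives the source labels.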
Early results are fascinating - some AI solutions are surprisingly elegant, others create hidden technical debt 📊
That's why we built SWE-Bench-Arena - the first blind evaluation platform for AI code quality.
Instead of just "does it work?", we ask:
✅ Is it maintainable?
✅ Will teams understand it?
✅ Does it follow best practices?
✅ Is it unnecessarily complex?
🔍 AI models hit 72%+ on coding benchmarks, but there's a hidden problem...
Recent data shows concerning trends since AI adoption:
• 48% increase in code cloning
• Refactoring dropped from 25% to 10%
• Developers report "missing context" as #1 issue
Are we optimizing for the wrong metrics? 🧵
5. I call it the Visual Complexity Penalty — and I break it down in detail in my latest post:
🔗 jatinganhotra.dev/blog/swe-age...
📊 Includes full leaderboard analysis, complexity breakdown, and takeaways.
RT if you're building SWE agents — or trying to understand their real limits.
4. This isn't a benchmark artifact.
It's a wake-up call.
🧠 Current AI systems cannot effectively combine visual + structural code understanding.
And that's a serious problem for real-world software workflows.
3. It's not just the images.
Multimodal tasks often require multi-file edits and focus on JavaScript-based, user-facing applications rather than Python backends.
The combination of visual reasoning + frontend complexity is devastating.
2. Why the collapse?
📸 90.6% of instances in SWE-bench Multimodal contain visual content.
When images are present, solve rates drop from ~100% to ~25% across all top-performing agents.
1. SWE agents are getting better. Some achieve 70-75% accuracy on code-only benchmarks like SWE-bench Verified.
But when the same models are tested on SWE-bench Multimodal, scores fall to ~30%.
🚨 New Blog Post:
AI agents collapse under visual complexity.
A 73.2% performance drop when images are introduced in SWE-bench Multimodal.
Here's why this matters — and what it tells us about the future of AI in software engineering:
🧵👇