Why it matters:
Small leaderboard gaps can reflect benchmark artifacts, not just model quality.
Tests should prefer behavior-first assertions, and naming requirements should be explicit when they matter.
Full post:
jatinganhotra.dev/blog/swe-age...
Posts by Jatin Ganhotra
High-risk rates are concentrated, not uniform:
SWE-bench_Pro: 10.9%
Multi-SWE-bench: 5.5%
SWE-PolyBench: 3.2%
SWE-bench: 1.5%
SWE-bench_Verified: 0.8%
SWE-bench_Multilingual: 0.0%
To isolate the highest-risk cases, I asked:
1. Is the symbol name mentioned in the issue text?
2. Does that symbol already exist elsewhere in the repo?
If neither is true, I count it as high-risk coupling.
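The two-question screen above can be sketched in a few lines of Python. This is my own illustration, not the original analysis code; names like `issue_text` and `repo_symbols` are invented for the example:

```python
import re

def is_high_risk(symbol: str, issue_text: str, repo_symbols: set) -> bool:
    """Flag a test-referenced symbol as high-risk naming coupling.

    A symbol is high-risk when (1) the issue text never mentions it and
    (2) it does not already exist elsewhere in the repo -- i.e. the only
    place the name appears is the reference solution itself.
    """
    mentioned_in_issue = re.search(rf"\b{re.escape(symbol)}\b", issue_text) is not None
    exists_in_repo = symbol in repo_symbols
    return not mentioned_in_issue and not exists_in_repo

# Toy example: the test calls `get_config_or_default`, but the issue only
# says "crash when the config file is missing" and the repo has no such symbol.
issue = "App crashes when the config file is missing."
print(is_high_risk("get_config_or_default", issue, {"load_config", "App"}))  # True
print(is_high_risk("load_config", issue, {"load_config", "App"}))            # False
```

Real screening would need symbol extraction from diffs and test patches (e.g. via `ast`), but the classification rule itself is just this conjunction.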
In a broad screening pass, 2,167 instances (28.6%) had tests referencing symbols newly introduced in the reference solution.
That is not a final false-negative estimate.
But it is a strong signal that naming coupling is common enough to matter.
I scanned 6 SWE-bench-style datasets:
SWE-bench
SWE-bench_Verified
SWE-bench_Multilingual
SWE-bench_Pro
Multi-SWE-bench
SWE-PolyBench
Total: 7,567 instances.
That means an agent can produce a behaviorally correct fix and still be graded as wrong because it used a different reasonable name.
Same behavior.
Different identifier.
Benchmark failure.
The failure mode is simple:
Tests sometimes call a symbol that was newly introduced in the reference solution.
If that name was never made explicit in the issue, evaluation is no longer just about behavior. It is also about reproducing a hidden naming choice.
A coding agent can solve the right bug and still fail a benchmark for choosing the wrong unstated identifier.
I call this a hidden naming contract.
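To make the contract concrete, here is a hypothetical example (all symbol names invented for illustration): the reference solution introduces a new helper, the benchmark test hard-codes that exact name, and a behaviorally identical agent fix under a different reasonable name fails at import time:

```python
# reference_solution.py -- gold patch introduces a brand-new helper:
def sanitize_path(p: str) -> str:
    """Strip traversal components so '../../etc' cannot escape the root."""
    return "/".join(part for part in p.split("/") if part not in ("", ".", ".."))

# agent_solution.py -- behaviorally identical fix, different reasonable name:
def clean_user_path(p: str) -> str:
    return "/".join(part for part in p.split("/") if part not in ("", ".", ".."))

# test_patch.py -- the benchmark's test bakes in the gold patch's name:
#     from mymodule import sanitize_path
#     assert sanitize_path("../../etc/passwd") == "etc/passwd"
#
# The agent's patch satisfies the behavioral assertion but the import
# fails, because `clean_user_path` != `sanitize_path`.
assert sanitize_path("../../etc/passwd") == "etc/passwd"
assert clean_user_path("../../etc/passwd") == "etc/passwd"  # same behavior
```

If the issue text never said "add `sanitize_path`", the test is grading a naming choice, not the fix.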
New post:
jatinganhotra.dev/blog/swe-age...
The paper behind #IBMResearch iSWE-Agent is now out on arXiv.
If you saw the posts about iSWE-Agent topping the Java leaderboards on Multi-SWE-Bench and SWE-PolyBench, this paper describes the system behind those results: agent design, workflow, and Java-aware tooling.
arxiv.org/abs/2603.11356
3/ The eval “needle” should move to harder + broader benchmarks: SWE‑Bench Pro, Multi‑SWE‑Bench, SWE‑PolyBench.
iSWE‑Agent is top-ranked on Java for Multi‑SWE‑Bench and SWE‑PolyBench - research.ibm.com/blog/ibm-sof... , bsky.app/profile/did:...
2/ I argued last year from a different lens: Verified is becoming non‑discriminative as the leaderboard saturates; measure the frontier slice.
jatinganhotra.dev/blog/swe-age...
1/ OpenAI: SWE‑Bench Verified is no longer a good frontier eval — test/spec mismatch + contamination.
openai.com/index/why-we...
🚀 IBM Research's iSWE-Agent is now #1 on the SWE-PolyBench (full) Java leaderboard🎉
On the Verified subset, iSWE-Agent scores 46.38% on Java — matching Atlassian Rovo Dev and significantly outperforming Prometheus (33.33%).
More details: jatinganhotra.dev/news/
#AI #Java #SWEPolyBench
(repost welcome) The Generative Model Alignment team at IBM Research is looking for interns for next summer! Two openings, two topics:
🍰Reinforcement Learning environments for LLMs
🐎Speculative and non-autoregressive generation for LLMs
Interested or curious? DM me or email ramon.astudillo@ibm.com
4/4 Ready to see how AI really stacks up against human developers?
Join researchers and developers already evaluating patches → swebencharena.com
#AI #SoftwareEngineering #CodeQuality #AIEvaluation #SWEBenchArena
3/4 Unlike other platforms:
🚫 PR Arena: Tracks merge rates, not code quality
🚫 Yupp AI: Known models, not blind
🚫 SWE Arena: General coding, not SWE tasks
✅ SWE-Bench-Arena: Blind quality evaluation of real bug fixes
2/4 SWE-Bench-Arena fills this gap with blind evaluation across 5 dimensions:
• Simplicity
• Readability
• Performance
• Maintainability
• Correctness
No bias. Just quality assessment.
🧵 1/4 Current AI coding benchmarks miss the mark.
Claude 4 Sonnet hits 72.7% on SWE-Bench, but industry data shows code clones rose 48% (8.3% to 12.3%) and refactoring rates dropped from 25% to 10% since AI adoption.
(GitClear: gitclear.com/ai_assistant_code_quality_2025_research)
Try evaluating patches → swebencharena.com
What quality issues have you noticed with AI-generated code?
#AIEvaluation #SWEBenchArena #CodeQuality #AI #SoftwareEngineering
We need diverse perspectives from:
🎓 AI researchers
👩‍💻 Professional developers
📚 Academic teams
🚀 Startup engineers
Your input shapes the future of AI code evaluation standards.
How it works:
• Real GitHub issues from actual projects
• Side-by-side patch comparison
• Blind evaluation (you don't know which is AI vs human)
• Multi-dimensional quality assessment
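As a sketch of what "blind" implies mechanically (my own illustration, not SWE-Bench-Arena's actual code): randomize which side each patch lands on, show only the diffs, and keep the human/AI labels sealed until the vote is recorded.

```python
import random

def present_blind(patch_a: dict, patch_b: dict, rng: random.Random):
    """Shuffle two patches into anonymous Left/Right slots.

    Each patch dict carries 'diff' (shown to the evaluator) and
    'source' (hidden, e.g. 'human' or 'ai'). The returned answer key
    maps slot -> true source and is consulted only after the vote.
    """
    pair = [patch_a, patch_b]
    rng.shuffle(pair)  # evaluator cannot infer source from position
    shown = [{"slot": s, "diff": p["diff"]} for s, p in zip(("Left", "Right"), pair)]
    answer_key = {s: p["source"] for s, p in zip(("Left", "Right"), pair)}
    return shown, answer_key

shown, key = present_blind(
    {"diff": "- old\n+ new", "source": "ai"},
    {"diff": "- old\n+ fix", "source": "human"},
    random.Random(),
)
# `shown` contains only slots and diffs; `key` is unsealed after the vote.
print([s["slot"] for s in shown])  # ['Left', 'Right']
```

Keeping the answer key out of the rendered payload is the whole trick: the evaluator's browser never receives the source labels.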
Early results are fascinating - some AI solutions are surprisingly elegant, others create hidden technical debt 📊
That's why we built SWE-Bench-Arena - the first blind evaluation platform for AI code quality.
Instead of just "does it work?", we ask:
✅ Is it maintainable?
✅ Will teams understand it?
✅ Does it follow best practices?
✅ Is it unnecessarily complex?
🔍 AI models hit 72%+ on coding benchmarks, but there's a hidden problem...
Recent data shows concerning trends since AI adoption:
• 48% increase in code cloning
• Refactoring dropped from 25% to 10%
• Developers report "missing context" as #1 issue
Are we optimizing for the wrong metrics? 🧵
5. I call it the Visual Complexity Penalty — and I break it down in detail in my latest post:
🔗 jatinganhotra.dev/blog/swe-age...
📊 Includes full leaderboard analysis, complexity breakdown, and takeaways.
RT if you're building SWE agents — or trying to understand their real limits.
4. This isn't a benchmark artifact.
It's a wake-up call.
🧠 Current AI systems cannot effectively combine visual + structural code understanding.
And that's a serious problem for real-world software workflows.
3. It's not just the images.
Multimodal tasks often require multi-file edits and focus on JavaScript-based, user-facing applications rather than Python backends.
The combination of visual reasoning + frontend complexity is devastating.
2. Why the collapse?
📸 90.6% of instances in SWE-bench Multimodal contain visual content.
When images are present, solve rates drop from ~100% to ~25% across all top-performing agents.
1. SWE agents are getting better. Some achieve 70-75% accuracy on code-only benchmarks like SWE-bench Verified.
But when the same models are tested on SWE-bench Multimodal, scores fall to ~30%.
🚨 New Blog Post:
AI agents collapse under visual complexity.
A 73.2% performance drop when images are introduced in SWE-bench Multimodal.
Here's why this matters — and what it tells us about the future of AI in software engineering:
🧵👇