SWE-bench Verified and SWE-bench Pro

What it measures: how well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests associated with that issue.

The specifics: there are many variants (Full, Verified, Lite, Bash-only, Multimodal). Most labs report SWE-bench Verified in their charts, a cleaned and human-reviewed subset.

Notes and quirks of SWE-bench Verified:
- It has 500 problems, all in Python. Over 40% are issues from the Django repository; the rest are libraries. Web applications are entirely missing.
- The repositories the agents have to operate in are real, hefty open-source projects, but the solutions to these issues are small: think surgical edits or small function additions. The mean solution is 11 lines of code; the median is 4. Amazon found that 77.6% of solutions touch only one function.
- All the issues are from 2023 or earlier, so this data was almost certainly in the training sets. That makes it hard to tell how much of the improvement is due to memorisation.
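For concreteness, the grading rule behind "passes the unit tests" can be sketched as follows. SWE-bench marks an issue resolved when the tests that failed before the patch (FAIL_TO_PASS) now pass, and the tests that passed before (PASS_TO_PASS) still pass. This is a minimal sketch of that check, not the actual harness; the function name and dict-based test results are illustrative assumptions.

```python
def resolved(results_before: dict[str, bool],
             results_after: dict[str, bool],
             fail_to_pass: list[str],
             pass_to_pass: list[str]) -> bool:
    """Sketch of SWE-bench-style grading (not the real harness).

    results_before/results_after map test IDs to pass (True) / fail (False),
    from running the repo's test suite without and with the candidate patch.
    """
    # FAIL_TO_PASS: tests that reproduce the issue must go fail -> pass.
    f2p_ok = all(not results_before[t] and results_after[t] for t in fail_to_pass)
    # PASS_TO_PASS: previously passing tests must not regress.
    p2p_ok = all(results_before[t] and results_after[t] for t in pass_to_pass)
    return f2p_ok and p2p_ok
```

The second condition matters: a patch that fixes the issue but breaks an unrelated test is scored as a failure, which is part of why small, surgical edits do well on this benchmark.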
I wrote a post looking into multiple SWE/coding benchmarks. Many of them measure something narrower than their names suggest.
blog.nilenso.com/blog/2025/09...