
Posts by atharva

analogies are like wet balloons: they are not very effective at explaining things

2 weeks ago 1 0 0 0

Hacker types who build random software side projects for fun, without needing to justify their utility or ability to generate capital, are one of the oldest programmer stereotypes.

Many such builders still exist!

2 months ago 0 0 0 0

I've been calling it Lee Sedol'd, but Deep Blue is perhaps a better term for it. I thought the AlphaGo documentary was perhaps the greatest depiction of this "Deep Blue" feeling, especially knowing where we are now.

2 months ago 2 0 0 0
Codex CLI vs Claude Code on autonomy

My colleague analysed the system prompts for Codex and Claude and realised they feel different because of deliberate product decisions in the prompts!

blog.nilenso.com/blog/2026/02...

2 months ago 6 0 0 0

It's also OpenAI-led and the reference schema is more-or-less identical to the current Responses API. It doesn't seem like Anthropic or Google have bought into this—they have competing formats.

The vendors that are bought in don't have fully compliant implementations yet.

2 months ago 0 0 0 0
Open Responses documentation overview.

I was hoping the OpenResponses API would be a meaningful step towards dealing with the LLM API standardisation headaches, but right now the spec is really undercooked.

There are lots of inconsistencies/contradictions between the reference schemas and what the specification says!

www.openresponses.org

2 months ago 2 0 1 0
Overview - A2A Protocol The official documentation for the Agent2Agent (A2A) protocol. The A2A protocol is an open standard that allows different AI agents to securely communicate, collaborate, and solve complex problems tog...

The problem it is solving makes sense (cross-vendor agent communication), but I don't understand why there's such a massive and detailed spec for an *anticipated* use case that hasn't properly materialised yet.

Castles in the sky energy.

a2a-protocol.org/latest/speci...

2 months ago 1 0 0 0
A2A Protocol The official documentation for the Agent2Agent (A2A) protocol. The A2A protocol is an open standard that allows different AI agents to securely communicate, collaborate, and solve complex problems tog...

Is the A2A protocol completely useless? I don't know of anyone building enterprise multi-agent communication.

Why design such a thick protocol for a use case that does not yet exist in practice?

a2a-protocol.org/latest/

Feels like another SOAP/CORBA, etc.

2 months ago 1 0 1 0

ese
bsky.app/profile/grac...

2 months ago 1 0 0 0
Ese Large language models are better thinkers than writers. Well okay, they don't think as humans do, but I've been letting it write the vast chunk of my computer programs over the last year or so, which ...

thoughts on ese.
atharvaraykar.com/ese/

2 months ago 3 1 1 0
Quantifying infrastructure noise in agentic coding evals Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

The other reason is that infrastructure noise and variations can affect benchmarks a lot; I wonder if that's the case with the SWE Bench Pro runs.

www.anthropic.com/engineering/...

2 months ago 1 0 0 0

The harness matters a lot. SWE Bench Pro uses SWE-Agent-Mini by default. OpenAI likely reports a result on their own harness.

Codex is a weird model that performs much worse in generic, minimal harnesses. That's likely why the tool shapes in Codex CLI are strange, like "apply_patch".

2 months ago 1 0 1 0

While SWE Bench Pro is a pretty good benchmark (especially compared to Verified), the leaderboard rankings clearly "look wrong". No one will agree that Claude 4 Sonnet is better than 5.2 Codex.

Really shows how insufficient public benchmarks are becoming at conveying model capabilities.

2 months ago 1 0 0 0
SWE-Bench Pro (Public Dataset) Explore the SEAL leaderboard with expert-driven LLM benchmarks and updated AI model leaderboards, ranking top models across coding, reasoning and more.

It's particularly strange that Anthropic won't report SWE-Bench Pro in their announcements. Their models have always done better on it than OpenAI's (at least on the public dataset):
scale.com/leaderboard/...

I think it might just be that ~80% solved looks more impressive than ~50% solved.

2 months ago 1 0 2 0

the screenshot is from my post: blog.nilenso.com/blog/2025/09...

2 months ago 3 0 0 0


    Our tasks typically use environments that do not significantly change unless directly acted upon by the agent. In contrast, real tasks often occur in the context of a changing environment.

    […]

    Similarly, very few of our tasks are punishing of single mistakes. This is in part to reduce the expected cost of collecting human baselines.

This is not at all like the tasks I am doing.

METR acknowledges the messiness of the real world. They have come up with a “messiness rating” for their tasks, and the “mean messiness” of their tasks is 3.2/16.

By METR’s definitions, the kind of software engineering work that I’m mostly exposed to would score at least around 7-8, given that software engineering projects are path-dependent, dynamic and without clear counterfactuals. I have worked on problems that get to around 13/16 levels of messiness.

    An increase in task messiness by 1 point reduces mean success rates by roughly 8.1%

Extrapolating from METR’s measured effect of messiness, GPT-5 would go from 70% to around 40% success rate for 2-hour tasks. This maps to my experienced reality.
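The extrapolation above can be sketched as a small calculation. Note this assumes METR's ~8.1%-per-point figure applies linearly, in absolute percentage points, and extends beyond their measured messiness range; METR doesn't claim any of that, so treat it as a rough back-of-the-envelope reading.

```python
def extrapolated_success(base_rate, base_messiness, target_messiness,
                         drop_per_point=8.1):
    """Estimate success rate (in percent) at a higher messiness level,
    assuming a linear, additive drop per messiness point."""
    return base_rate - drop_per_point * (target_messiness - base_messiness)

# GPT-5 at ~70% on 2-hour tasks, METR mean messiness 3.2,
# my day-to-day work at ~7 on their scale:
estimate = extrapolated_success(70, 3.2, 7)  # comes out near 40%
```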


The METR tasks are narrow (i.e., "not messy") and not very numerous, so it's hard to generalise the automatability of software engineering from that alone.

It looks like 100% replacement for software engineering can happen, but perhaps not in the next 2 years at least.

2 months ago 5 0 1 0
Taking Jaggedness Seriously Why we should expect AI capabilities to keep being extremely uneven, and why that matters

"Taking Jaggedness Seriously" by Helen Toner talks about this in some depth.

helentoner.substack.com/p/taking-jag...

2 months ago 6 1 0 0
screenshot:

PART I: 2025 KEY INITIATIVES. Toxicity Filtering: Toxicity is a persistent challenge for all large-scale social apps. As communities grow, maintaining space for both friendly conversation and fierce disagreement requires intentional design choices. Our community doubled in size over the past year, and with that growth came tension: how to preserve healthy discourse while respecting genuine debate and diverse user preferences. Toxic and inflammatory discourse appears across all forms of social media; and almost universally, it's the case that a small percentage of people contribute disproportionately to causing this problem. A tiny number of users can have an outsize impact on conversation quality and on people's willingness to participate. In 2023-2024, anti-social behavior, such as harassment, trolling, and intolerance, consistently ranked among our top complaints reported by users. This content drives people away from forming connections, posting, or engaging, for fear of attacks and pile-ons.


screenshot:

In October, we began experimenting with improving conversation quality, starting with replies. Rather than only reacting after users report abusive or toxic interactions, we launched an experiment to identify replies that are toxic, spammy, off-topic, or posted in bad faith, and reduce their visibility in the Bluesky app. This approach adds friction: most viewers casually scanning a conversation won't encounter the toxic or potentially harmful replies, while preserving content access in case we get it wrong. These replies remain accessible in the thread for those who want to see them. We also made sure this feature is aware of who you follow: replies from accounts you follow appear above the fold, while toxic replies from people you don't follow require an additional click to view. After implementing this detection, daily reports of anti-social behavior dropped by approximately 79%. This reduction demonstrates measurable improvement in user experience: people are encountering substantially less toxicity in their day-to-day interactions on Bluesky.


my guess is it's due to this initiative by the Bluesky team.

bsky.social/about/blog/0...

2 months ago 26 2 2 1

The actual issue to solve, at least for large projects, is getting DoS'd by a flood of low-quality slop patches.

It's a similar problem to the old Hacktoberfest spam issues, but perhaps worse in scale and scope.

2 months ago 3 0 1 0

I also like to think of forking as something like a function call's stack allocation, whose memory/"context" gets dumped after the work is done and substituted with a return value, which for agents would be a summary of sorts.
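A toy sketch of this analogy, with all names hypothetical: the subagent runs on a copy of the parent context (the stack frame), its working transcript is discarded when it returns, and only a summary (the return value) survives.

```python
# Hypothetical sketch of fork-as-function-call for agents. The child gets
# a copy of the parent's context, does its work, and everything except a
# summary "return value" is thrown away.
def fork_subagent(parent_context, task, run_agent, summarise):
    child_context = list(parent_context) + [task]  # copy = new stack frame
    transcript = run_agent(child_context)          # child does the work
    return summarise(transcript)                   # only this survives

# Usage with stubbed-out agent and summariser:
parent = ["system prompt", "conversation so far"]
summary = fork_subagent(parent, "refactor module X",
                        run_agent=lambda ctx: ["...many steps..."],
                        summarise=lambda t: "refactored X; tests pass")
parent.append(summary)  # child frame is gone; parent keeps the summary
```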

2 months ago 0 0 1 0

nice to see someone else who is fork-pilled.

@mariozechner.at's pi coding agent handles this pattern very well (still manual, like what you described with your Claude Code workflow, but with much smoother UX and first-class support)

2 months ago 1 0 1 0

I've already had some aggressive muting and "not interested" preference spamming in place, and it isn't working quite as well anymore. It's just enough friction to make me try newer platforms for the time being!

2 months ago 1 0 0 0
clawdbot star history showing hockey stick growth, inflection point on Jan 20-ish


do you have any idea what caused the inflection point?

2 months ago 0 0 2 0

Some people in some normie-ish group chats I'm in thought this was a product by Anthropic. Bet they've already got support requests for it. Perhaps they don't want their name attached to this, which is horribly insecure for people who don't know what they are doing.

2 months ago 1 0 0 0
How the Lobsters front page works Lobsters is a computing-focused community centered around link aggregation and discussion. The code is open source, so I had a look at how the front page algorithm works. This is it: $$\textbf{hotn...

things I have dumped on the internet this month.

How the lobsters algorithm works:
atharvaraykar.com/lobsters/

11:59 PM:
atharvaraykar.com/reinforce/

2 months ago 1 0 0 0

fwiw, I'm trying to use this site (and Substack) more ever since the new X algorithm completely trashed my feed; it's surfacing only toxic sludge and slop

the vibe here has improved quite a bit in the meantime.

but I can also imagine a timeline where I stop microblogging altogether and touch grass

2 months ago 5 0 1 0

Exploring the weirdness of this would fall under the goals of AI village. At least as I understand it. But they definitely should not be unleashing these agents "outside the lab", hence my mention of this needing to be opt-in/consented or sandboxed in some way.

3 months ago 3 0 0 0

yeah, they messed up with today's goal; these kinds of things need to be opt-in.

3 months ago 3 0 1 0

The "excellence" still depends on whether the language is in the training distribution. It's pretty competent at Python and JS. Less so in Clojure, or HashiCorp Configuration Language.

Even so, I agree that it's still a productivity boost across most languages.

4 months ago 5 0 0 0
SWE-bench Verified and SWE-bench Pro
What it measures

How well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests for that issue.
The specifics

There are many variants: Full, Verified, Lite, Bash-only, Multimodal. Most labs in their chart report on SWE-bench Verified, which is a cleaned and human-reviewed subset.

Notes and quirks of SWE-bench Verified:

    It has 500 problems, all in Python. Over 40% are issues from the Django source repository; the rest are libraries. Web applications are entirely missing. The repositories that the agents have to operate in are real, hefty open source projects.
    Solutions to these issues are small—think surgical edits or small function additions. The mean solution is 11 lines of code, and the median is 4. Amazon found that 77.6% of the solutions touch only one function.
    All the issues are from 2023 and earlier. This data was almost certainly in the training sets. Thus it’s hard to tell how much of the improvements are due to memorisation.


I wrote a post looking into multiple SWE/coding benchmarks. Many of them measure something narrower than their names suggest.

blog.nilenso.com/blog/2025/09...

6 months ago 1 1 0 0