Horizontal bar chart comparing agentic coding reliability across three groups. Frontier models (Claude Sonnet 4.5, Gemini Pro 3.1, GPT 4.1) score 80-100% correct. Local models from four months ago (Qwen 3 14B, GPT OSS 20B, Mistral 3.1 24B) all score 0%. Today's local models (Gemma 4 26B-A4B, Qwen 3.5 35B-A3B) both score 90%.
A few months ago, every LLM I could run on my MacBook scored 0% on an agentic coding eval I put together. This month's Qwen 3.5 and Gemma 4 releases both scored 90%.
On my blog: simonpcouch.com/blog/2026-04...