>> I have been running a few social experiments on this front to see whether a group of users can differentiate a Frontier Model from a less expensive model for specific functions.
I will write more about this later if folks are interested.
I would love to share notes
Posts by Basel Ismail
The questions I keep coming back to are:
> Whether the remaining gaps in conversation and memory tasks matter enough for your specific use case, or whether 90% of the way there at 10% of the cost is the better trade. Time will tell.
> For anyone building AI products right now, the calculus on model selection just changed.
That maps well to how we think about cost management when you're running agents at scale.
This is a challenge I struggle with daily in my projects, because model cost is only one small component of the larger infrastructure cost, and it is not a simple optimization problem.
In the past, I have said that model routing algorithms are the future, and I have never felt more strongly about that than I do now.
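To make the idea concrete, here is a minimal sketch of cost-aware model routing. The model names, prices, and the complexity heuristic are all illustrative assumptions, not a real API:

```python
# Hypothetical cost-aware router: planning-heavy or very long tasks go to
# the frontier model, everything else to the cheaper open model.

FRONTIER = {"name": "frontier-model", "cost_per_m_tokens": 25.0}
OPEN = {"name": "open-model", "cost_per_m_tokens": 3.0}

def route(task: str, requires_planning: bool) -> dict:
    """Pick a model based on a crude task-complexity heuristic."""
    if requires_planning or len(task) > 2000:
        return FRONTIER
    return OPEN
```

Real routers use learned classifiers or confidence scores instead of a length threshold, but the shape of the decision is the same.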
> The demand for these models and the associated infosec risks will continue to grow exponentially, placing even more computational demand on the overall system.
What caught my attention (full writeup here: blog.langchain.com/open-models...) is:
> the pattern of using a frontier model for planning and then swapping to an open model for execution mid-session...as you don't need to run a frontier model all the way through!
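That plan-with-frontier, execute-with-open pattern can be sketched in a few lines. `call_model` here is a hypothetical stand-in for a real inference client, and the step parsing is faked for illustration:

```python
# Plan once on the expensive model, then run every execution step on the
# cheap one; this is the mid-session swap described above.

def call_model(model: str, prompt: str) -> str:
    # placeholder for an API call; returns a tagged echo for illustration
    return f"[{model}] {prompt[:40]}"

def run_session(task: str) -> list[str]:
    # one frontier call produces the plan...
    plan = call_model("frontier-model", f"Plan the steps for: {task}")
    steps = ["step 1", "step 2", "step 3"]  # in reality, parsed from `plan`
    # ...then each execution step runs on the open model
    return [call_model("open-model", step) for step in steps]
```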
> GLM-5 on Baseten averaged 0.65 seconds of latency and 70 tokens a second compared to 2.56 seconds and 34 tokens a second for Opus.
> For anything user-facing, that's a different product experience entirely, but this truly depends on the app's intention and the complexity of its functions
LangChain ran both open and closed models through the same agent tasks, and GLM-5 hit 0.64 correctness vs Opus at 0.68...
...all while costing roughly $3/M output tokens instead of $25.
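The quoted prices make the math easy to sanity-check. The monthly volume below is an assumed workload, not a figure from the writeup:

```python
# Back-of-the-envelope cost comparison using the quoted per-token prices.

PRICE_OPEN = 3.0       # $/M output tokens, as quoted for GLM-5
PRICE_FRONTIER = 25.0  # $/M output tokens, as quoted for Opus
monthly_tokens_m = 50  # assumption: 50M output tokens per month

open_cost = monthly_tokens_m * PRICE_OPEN          # $150
frontier_cost = monthly_tokens_m * PRICE_FRONTIER  # $1,250
ratio = frontier_cost / open_cost                  # ~8.3x
```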
The latency gap is even more striking.
This is a controversial topic, because it ruffles many feathers when you evaluate an open-source model (at 1/10th to 1/20th of the price) against its far more expensive benchmark.
Open source models are now scoring within a few points of Claude Opus and GPT-5.4 on agentic evals by @Langchain, and the cost difference is fairly absurd.
I wonder how many tech and engineering departments are genuinely planning ahead for a world where mid-to-late 2028 is the number.
The gap between "useful coding assistant" and "autonomous coder" is closing faster than the benchmarks would even suggest.
As wild as that may sound, that is the truth, and I am not convinced enough business executives are grasping that.
From the building side, I already see teams developing and deploying new features faster than ever, work that would have taken weeks to months a year ago.
That is a very specific bar. And the fact that some AI lab researchers are privately doubling down on even shorter timelines than those forecast by these leaders is worth sitting with.
Look at one of the most powerful examples of hypergrowth:
> Claude Code hit $2.5B annualized revenue in Feb, nine months after launch
This is unbelievable if you really process how quickly it reached this point.
...partly because agentic coding just blew up in practice @EpochAIResearch, a trend I have mentioned a few times now.
It's a very interesting research project, and the most recent change was triggered partly by new model performance (Gemini 3, GPT-5.2, Claude Opus 4.6), of course.
This continues the fast trend in METR's coding time horizon benchmark, and...
The AC milestone is basically the point at which an AGI-powered company would likely consider laying off all of its human software engineers rather than slowing down its adoption of AI software engineering.
The AI Futures Project just shifted its median Automated Coder (AC) timeline forward by about 1.5 years.
@DKokotajlo moved it from late 2029 to mid 2028, @eli_lifland from early 2032 to mid 2030.
YES! Exactly what I have been saying for a while, but it's so hard to convince others who oversimplify everything and blindly over-rely on the model itself.
Makes me wonder how long before the harness becomes the real moat and the model becomes fully swappable. The way the model is used is what matters: whether RAG is used, whether the model is fine-tuned on a proprietary dataset, etc. Those factors are critical.
The gap between a demo and something production-grade is almost always in the plumbing, the context handling, the caching, the tool integration. Not the model call itself.
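One piece of that plumbing, caching, fits in a few lines. This is a toy prompt-level response cache sketched with `functools.lru_cache`; production systems also key on model parameters, temperature, and context:

```python
# Minimal response cache: identical (model, prompt) pairs skip the model call.

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_completion(model: str, prompt: str) -> str:
    # placeholder for a real (expensive) model call
    return f"{model}: answer for '{prompt}'"

first = cached_completion("open-model", "explain this stack trace")
second = cached_completion("open-model", "explain this stack trace")
# the second identical call is served from the cache, with no model hit
```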
Building an enterprise-grade app is not the same as rapid prototyping with experimental, inexpensive models.
The model matters, but the orchestration layer around it might matter more. I have mentioned this before and it's now showing up more and more
This tracks with what I keep seeing when we ship key product features that have AI integrated.
The implication is pretty wild for anyone building AI products right now. If you dropped DeepSeek or Kimi into this same harness and tuned it just a little bit, you would probably get surprisingly strong coding performance.
2) it does file-read deduplication so unchanged files never get reprocessed, and when tool outputs get too large, it writes them to disk and just passes a preview plus a reference. That is serious context management.
Two things stood out to me:
1) it keeps a structured markdown file during each session with sections like "Errors & Corrections" and a worklog, basically mimicking how a good engineer keeps notes while debugging.
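Here is a guess at what maintaining such a session file might look like. The section name comes from the writeup; the helper itself is hypothetical:

```python
# Append a note under a markdown section, engineer's-notebook style.

import os
import tempfile

def append_worklog(path: str, section: str, note: str) -> None:
    with open(path, "a") as f:
        f.write(f"\n## {section}\n- {note}\n")

# illustrative usage, written to a temp location
notes = os.path.join(tempfile.mkdtemp(), "session_notes.md")
append_worklog(notes, "Errors & Corrections",
               "off-by-one in chunk loop; fixed by clamping the index")
```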
Sebastian Raschka breaks it down here: x.com/rasbt/statu... @rasbt
The Claude Code source code leaked, and the most interesting finding is how much of its performance comes from the software harness, not the model itself.
Most of the clones have been taken down already; I was half convinced it was a marketing ploy when the first leak happened.