
Posts by Basel Ismail

>> I have been running a few social experiments on this front to see whether a group of users can differentiate a Frontier Model from a less expensive model for specific functions.

I will write more about this later if more folks are interested in it.

I would love to share notes

2 weeks ago

The question I keep coming back to is:

> Whether the remaining gaps in conversation and memory tasks matter enough for your specific use case, or if 90% of the way there at 10% of the cost is the better trade, time will tell.

2 weeks ago

> For anyone building AI products right now, the calculus on model selection just changed.

2 weeks ago

That maps well to how we think about cost management when you're running agents at scale.

This is a challenge I struggle with daily in my projects: model cost is only one small component of the larger infrastructure cost, and it is not a simple optimization problem.

2 weeks ago

In the past, I have said that model routing algorithms are the future, and I have never felt more strongly about it than I do now.

> The demand for these models and the associated infosec risks will continue to grow exponentially, placing even more computational demand on the overall system.

2 weeks ago
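To make the routing idea concrete, here is a minimal sketch of a cost-aware router; the model names, prices, and eval scores are illustrative assumptions, not a real pricing table or API.

```python
# Minimal cost-aware model router sketch. All model names, prices, and
# quality scores below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    usd_per_m_output_tokens: float
    quality: float  # offline eval score in [0, 1]

MODELS = [
    Model("open-small", 0.5, 0.55),
    Model("open-large", 3.0, 0.64),
    Model("frontier", 25.0, 0.68),
]

def route(required_quality: float) -> Model:
    """Pick the cheapest model whose eval score clears the task's bar."""
    eligible = [m for m in MODELS if m.quality >= required_quality]
    if not eligible:
        # Nothing clears the bar: fall back to the strongest model.
        return max(MODELS, key=lambda m: m.quality)
    return min(eligible, key=lambda m: m.usd_per_m_output_tokens)
```

The interesting part in practice is choosing `required_quality` per task type, which is exactly where per-function evals like the social experiments above come in.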
Open Models have crossed a threshold 💡 TL;DR: Open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks (file operations, tool use, and instruction following) at a fraction of the cost and latency. Here's what our evals show and how to start using them in Deep Agents. Over the...

What caught my attention (full writeup: blog.langchain.com/open-models...) is:

> the pattern of using a frontier model for planning and then swapping to an open model for execution mid-session...as you don't need to run a frontier model all the way through!

2 weeks ago

> GLM-5 on Baseten averaged 0.65 seconds of latency and 70 tokens a second compared to 2.56 seconds and 34 tokens a second for Opus.

> For anything user-facing, that's a different product experience entirely, but this truly depends on the app's intent and the complexity of its functions.

2 weeks ago

LangChain ran both open and closed models through the same agent tasks, and GLM-5 hit 0.64 correctness vs Opus at 0.68...

...all while costing roughly $3/M output tokens instead of $25.

The latency gap is even more striking.

2 weeks ago
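A quick back-of-the-envelope on those numbers; the per-task token count here is my own illustrative assumption, not from the eval.

```python
# Per-task cost at the quoted rates. The token count per task is an
# assumed illustration, not a figure from the LangChain eval.
open_rate, frontier_rate = 3.0, 25.0  # USD per 1M output tokens
tokens_per_task = 5_000               # assumption for illustration

open_cost = open_rate * tokens_per_task / 1_000_000
frontier_cost = frontier_rate * tokens_per_task / 1_000_000
print(f"${open_cost:.3f} vs ${frontier_cost:.3f} per task "
      f"({frontier_rate / open_rate:.1f}x cheaper)")
```

At those rates the open model is roughly 8x cheaper per token while scoring 0.64 vs 0.68 on correctness, about 94% of the frontier score.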

This is a very controversial topic, because it ruffles many feathers when you evaluate an open-source model (at 1/10th to 1/20th of the price) against its more expensive closed-source counterpart.

2 weeks ago

Open source models are now scoring within a few points of Claude Opus and GPT-5.4 on agentic evals by @Langchain, and the cost difference is fairly absurd.

2 weeks ago

I wonder how many tech and engineering departments are genuinely planning ahead for a world where mid-to-late 2028 is the date?

2 weeks ago

The gap between "useful coding assistant" and "autonomous coder" is closing faster than the benchmarks would even suggest.

As wild as that may sound, that is the truth, and I am not convinced enough business executives are grasping that.

2 weeks ago

From the building side, I already see teams developing and deploying new features faster than ever, shipping work in days that would have taken weeks to months a year ago.

2 weeks ago

That is a very specific bar. And the fact that some AI lab researchers are privately doubling down on even shorter timelines than those forecast by these leaders is worth sitting with.

2 weeks ago
Q1 2026 Timelines Update: We told you we'd be updating in both directions!

This is the full study: blog.aifutures.org/p/q1-2026-t...

2 weeks ago

Look at one of the most powerful examples of hypergrowth:

> Claude Code hit $2.5B annualized revenue in Feb, nine months after launch

This is unbelievable when you really process how quickly it reached its current state.

2 weeks ago

...partly because agentic coding just blew up in practice, per @EpochAIResearch, a trend I have mentioned a few times now.

2 weeks ago

It's a very interesting research project, and the most recent change was triggered partly by new model performance (Gemini 3, GPT-5.2, Claude Opus 4.6), of course.

This continues the fast trend in METR's coding time horizon benchmark, and...

2 weeks ago

The AC milestone is basically the point at which an AGI-powered company would sooner lay off all of its human software engineers than slow down its adoption of AI software engineering.

2 weeks ago

The AI Futures Project just shifted its median Automated Coder (AC) timeline forward by about 1.5 years.

@DKokotajlo moved it from late 2029 to mid 2028, @eli_lifland from early 2032 to mid 2030.

2 weeks ago

YES! Exactly what I have been saying for a while, but it's so hard to convince others who oversimplify everything and rely blindly on the model alone.

2 weeks ago

Makes me wonder how long before the harness becomes the real moat and the model becomes fully swappable. How the model is used is what matters: whether RAG is involved, whether the model is fine-tuned on a proprietary dataset, and so on. Those factors are critical.

2 weeks ago

The gap between a demo and something production-grade is almost always in the plumbing: the context handling, the caching, the tool integration. Not the model call itself.

Building an enterprise-grade app is not the same as rapid prototyping with experimental, inexpensive models.

2 weeks ago

The model matters, but the orchestration layer around it might matter more. I have mentioned this before, and it's now showing up more and more.

This tracks with what I keep seeing when we ship key product features which have AI integrated.

2 weeks ago

The implication is pretty wild for anyone building AI products right now. If you dropped DeepSeek or Kimi into this same harness and tuned it just a little bit, you would probably get surprisingly strong coding performance.

2 weeks ago

2) it does file-read deduplication so unchanged files never get reprocessed, and when tool outputs get too large, it writes them to disk and just passes a preview plus a reference. That is serious context management.

2 weeks ago

Two things stood out to me:

1) it keeps a structured markdown file during each session with sections like "Errors & Corrections" and a worklog, basically mimicking how a good engineer keeps notes while debugging.

2 weeks ago

Sebastian Raschka breaks it down here: x.com/rasbt/statu... @rasbt

2 weeks ago

The Claude Code source code leaked and the most interesting finding is how much of its performance comes from the software harness, not the model itself.

Most of the clones have been taken down already; I was half convinced it was a marketing ploy when the first leak happened.

2 weeks ago
Skillful.sh | The Definitive AI Agent Ecosystem Directory: 137,000+ MCP servers, AI skills, and agents from 50+ directories. All in one place.

No need to worry about the exact formatting requirements for publishing to each service!

skillful.sh

3 weeks ago