Posts by Ethan Mollick
Flopped at the sestina, too.
Kimi 2.6 Thinking seems very good for an open-weights model, but it has many rough edges compared to closed SoTA. The gap remains.
The Lem Test resulted in a 74-page thinking trace... and an okay-ish answer.
It did an okay TikZ unicorn, an adequate twigl shader for a neogothic city in the waves, etc.
AI reviewers then ranked the submissions, and gave the same ordering every time, regardless of which model was doing the ranking: Codex GPT-5.4 > GPT-5.3-Codex > Opus 4.6 > humans.
Paper: claude-code-economist.com/data/paper.pdf
Classic study gave 146 economist teams the same dataset & got wildly different answers
New paper reruns it with agentic AI. Claude Code & Codex land near the human median but with far tighter dispersion & no extremes
This suggests that agentic AI is now useful for doing scalable economics research
The imaginary optimal selfish scenario for OpenAI, in retrospect, was to keep Reasoners a secret, skip releasing o1 and o1-preview, and release o3 as GPT-5.
There would have been no DeepSeek moment, other labs might not have discovered Reasoners quickly, and OpenAI's lead would have been hard to beat.
The reason this is odd is that Google is trusted by enterprises & has the compute to burn, so a good harness would solve so many of Gemini’s gaps and make it an easier sell to companies. The model can make Office documents, for example, but the harness doesn’t allow it.
It could also decide when to use other Google tools (and Google has a lot of very good AI tools) and apply them, taking advantage of the ecosystem, but it doesn’t do so consistently.
And there are many really good AI components at Google: they have top-flight image, music, & video generation, and good UIs for each of them.
Google AI Studio is probably the best playground for AI experimentation. NotebookLM still has no equivalent product from other labs, etc.
I assume something will be coming out here eventually, but the gap with Claude and ChatGPT has only been growing.
2/
The continuing gap between the capabilities of Gemini Pro 3.1 (a very good model) and the capabilities of the Gemini app/website is odd. The model can do what Claude/GPT can do, but there is only a minimal harness for tools (file creation, research, etc.), no auditable CoT/actions, a manual canvas, etc.
1/
And we have not seen Mythos (or whatever OpenAI and Google are releasing).
Nope.
A major lesson to take away from Opus 4.7 is that, while there are a lot of arguments about implementation and personality, models keep improving measurably on economically important tasks with each release (and releases are accelerating: it has been two months since Opus 4.6), with no signs of slowdown.
Still refuses to write sestinas for some reason, so I don't think all the rough edges are gone.
I'll give Anthropic credit for moving quickly. Opus 4.7 Adaptive Thinking now triggers thinking much more often, including for the tasks it failed at yesterday. That also means it is doing a lot more web search.
So far, a large improvement in output quality on non-coding tasks.
Procedurally generated Bruegel with little workers and everything!
With max thinking, Opus 4.7 is quite impressive and shows a real sense of style.
In two prompts: "implement the Tower of Babel, in 3D, in as sophisticated and visually interesting a way as possible. It should be interactive" and then "make it better."
Play: tower-of-babel-1776392618.netlify.app
I was told by Anthropic that they are looking at ways of fixing this, which is good.
And I had posted about it too. Ha, too much research out there.
I forgot about that paper!
I think the adaptive thinking requirement in the new Claude Opus 4.7 is bad in the ways that all AI effort routers are bad, but magnified by the fact that there is no manual override like in ChatGPT.
It regularly decides that non-math/code stuff is "low effort" & produces worse results.
I have found that asking for a sestina regularly triggers Opus 4.7's safety guardrails.
The forbidden poetic form!
Claude remains irreducibly Claude, across many generations. If you know, you know.
(The fact that models have distinct personalities that are consistent across generations is technically interesting; it also makes it easy to use new releases when they come along, because they feel very similar.)
Instead of the gold standard, we can, as a thought experiment, imagine an inference standard of exchange: the FLOP. (Unlike tokens, this accounts for AI ability.)
With some AI help, I figure $1 buys roughly 10^17 managed-LLM inference FLOPs
So that $4 coffee would cost half an exaFLOP, choom
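A back-of-the-envelope check on that arithmetic (a minimal sketch in Python; the FLOPs-per-dollar constant is the rough figure from the post above, not a quoted price):

```python
# Pricing everyday goods in inference FLOPs.
# Assumption from the post above: $1 buys roughly 1e17 managed-LLM inference FLOPs.
FLOPS_PER_DOLLAR = 1e17
EXAFLOP = 1e18  # 1 exaFLOP = 1e18 FLOPs

def usd_to_exaflops(usd: float) -> float:
    """Convert a dollar price to its equivalent in exaFLOPs of inference."""
    return usd * FLOPS_PER_DOLLAR / EXAFLOP

print(usd_to_exaflops(4.0))  # 0.4 -- a $4 coffee is roughly half an exaFLOP
```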
But it isn’t because the expert mathematicians who called out the previous issues say it isn’t.
Comment from a math professor on the quality of the latest proofs.
Because some of the best mathematicians in the world say they are not.
This is becoming a pattern in AI that makes talking about capabilities challenging.
First, there are overstated claims (like the flubbed Erdős problems that were announced last year), then minor wins (AI helps with discovery), then breakthroughs.
The first stage feels like (& often is) hype, but…
AI keeps getting better, but the last time the shape of the jagged frontier changed radically was o1 & the Reasoner.
A good mental model for the coming months is that models get extremely good at the things they are already quite good at (coding), but their weaknesses will remain similar (long-form fiction).