Posts by Ethan Mollick
Flopped at the sestina, too.
Kimi 2.6 Thinking seems very good for an open-weights model, but it has many rough edges compared to closed SoTA. The gap remains.
The Lem Test resulted in a 74-page thinking trace... and an okay-ish answer.
It did an okay TikZ unicorn, an adequate twigl shader for a neogothic city in the waves, etc.
AI reviewers then ranked the submissions, and gave the same ordering every time, regardless of which model was doing the ranking: Codex GPT-5.4 > GPT-5.3-Codex > Opus 4.6 > humans.
Paper: claude-code-economist.com/data/paper.pdf
Classic study gave 146 economist teams the same dataset & got wildly different answers
New paper reruns it with agentic AI. Claude Code & Codex land near the human median but with far tighter dispersion & no extremes
This suggests that agentic AI is now useful for doing scalable economics research
The imaginary optimal selfish scenario for OpenAI, in retrospect, was to keep Reasoners a secret, skip releasing o1 and o1-preview, and release o3 as GPT-5.
There would have been no DeepSeek moment, other labs might not have discovered Reasoners quickly, and OpenAI's lead would have been hard to beat.
The reason this is odd is that Google is trusted by enterprises & has the compute to burn, so a good harness would solve so many of Gemini’s gaps and make it an easier sell to companies. The model can make Office documents, for example, but the harness doesn’t allow it.
It could also decide when to use other Google tools (and Google has a lot of very good AI tools) and apply them, taking advantage of the ecosystem, but it doesn’t do so consistently.
And there are many really good AI components at Google: they have top-flight image, music, & video generation, and good UIs for each of them.
Google AI Studio is probably the best playground for AI experimentation. NotebookLM still has no equivalent product from other labs, etc.
I assume something will be coming out here eventually, but the gap with Claude and ChatGPT has only been growing.
2/
The continuing gap between the capabilities of Gemini Pro 3.1 (a very good model) and the capabilities of the Gemini app/website is odd. The model can do what Claude/GPT can do, but there is only a minimal harness for tools (file creation, research, etc.), no auditable CoT/actions, a manual canvas, etc.
1/
And we have not seen Mythos (or whatever OpenAI and Google are releasing).
Nope.
A major lesson to take away from Opus 4.7 is that, while there are a lot of arguments about implementation and personality, models keep improving measurably on economically important tasks with each release (and releases are accelerating: it has been two months since Opus 4.6), with no signs of slowdown.
Still refuses to write sestinas for some reason, so I don't think all the rough edges are gone.
I'll give Anthropic credit for moving quickly. Opus 4.7 Adaptive Thinking now triggers thinking much more often, including for the tasks it failed at yesterday. That also means it is doing a lot more web search.
So far, a large improvement in output quality on non-coding tasks.
Procedurally generated Bruegel with little workers and everything!
With max thinking, Opus 4.7 is quite impressive and shows a real sense of style.
In two prompts: "implement the Tower of Babel, in 3D, in as sophisticated and visually interesting a way as possible. It should be interactive" and then "make it better."
Play: tower-of-babel-1776392618.netlify.app
I was told by Anthropic that they are looking at ways of fixing this, which is good.
And I had posted about it too. Ha, too much research out there.
I forgot about that paper!
I think the adaptive thinking requirement in the new Claude Opus 4.7 is bad in the ways that all AI effort routers are bad, but magnified by the fact that there is no manual override like in ChatGPT.
It regularly decides that non-math/code stuff is "low effort" & produces worse results.
I have found that asking for a sestina regularly triggers Opus 4.7's safety guardrails.
The forbidden poetic form!
Claude remains irreducibly Claude, across many generations. If you know, you know.
(The fact that models have distinct personalities that are consistent across generations is technically interesting; it also makes it easy to use new releases when they come along, because they feel very similar.)
Instead of the gold standard, we can, as a thought experiment, imagine an inference standard of exchange: the FLOP. (Unlike tokens, this accounts for AI ability.)
With some AI help, I figure $1 buys roughly 10^17 managed-LLM inference FLOPs
So that $4 coffee would cost half an exaFLOP, choom
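A back-of-the-envelope check on that arithmetic (a minimal sketch in Python; the FLOPs-per-dollar constant is the rough figure from the post above, not a quoted price):

```python
# Pricing everyday goods in inference FLOPs.
# Assumption from the post above: $1 buys roughly 1e17 managed-LLM inference FLOPs.
FLOPS_PER_DOLLAR = 1e17
EXAFLOP = 1e18  # 1 exaFLOP = 1e18 FLOPs

def usd_to_exaflops(usd: float) -> float:
    """Convert a dollar price to its equivalent in exaFLOPs of inference."""
    return usd * FLOPS_PER_DOLLAR / EXAFLOP

print(usd_to_exaflops(4.0))  # 0.4 -- a $4 coffee is roughly half an exaFLOP
```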
But it isn’t because the expert mathematicians who called out the previous issues say it isn’t.
Comment from a math professor on the quality of the latest proofs.
Because some of the best mathematicians in the world say they are not.
This is becoming a pattern in AI that makes talking about capabilities challenging.
First, there are overstated claims (like the flubbed Erdős problems that were announced last year), then minor wins (AI helps with discovery), then breakthroughs.
The first stage feels like (& often is) hype, but…
AI keeps getting better, but the last time the shape of the jagged frontier changed radically was o1 & the Reasoner.
A good mental model for the coming months is that models get extremely good at the things they are already quite good at (coding), but their weaknesses will remain similar (long-form fiction).