I wasn't familiar with WeirdML, and in a way the outstanding performance of Gemma4 might prove your point as well as anything (or else the team that did distillation for Gemini Flash also did Gemma and they are really really good)
Posts by Nick Lothian
Yes, they also say "Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher."
Yes, I feel like "elite" does a lot of work there. But it's reasonably balanced.
Not sure they are downplaying it: "We have many years of experience picking apart the work of the world’s best security researchers, and Mythos Preview is every bit as capable. So far we’ve found no category or complexity of vulnerability that humans can find that this model can’t."
The timing works very well in Australia. Generally our timezone sucks, but for once we win!
I found the new Nemotron-Cascade-2-30B-A3B is stronger than Qwen 3.5-35B-A3B for my benchmark. Very impressed with it!
@ajfisher.social there are some interesting results in this for Inception Mercury 2 - it's certainly fast! See sql-benchmark.nicklothian.com#fastest-models for details around the issues I had with it.
Thinking mode isn't a clear win for small models
Even very heavily quantized models can perform well
Models differ hugely in their token efficiency
Interesting things I learnt:
Grok and GLM5 beat Claude Opus and Sonnet. Same score but *much* cheaper!
Qwen 27B is great.
Nvidia's Nemotron-Cascade-2-30B-A3B beats Qwen 3.5-35B-A3B.
I wrote a new Agentic text-to-SQL benchmark and tested every local model I could against it: sql-benchmark.nicklothian.com
Thanks to DuckDB WASM you can try your own models from the browser.
Is Mercury2 open source? I couldn't find the weights. It is available on OpenRouter though.
hmm yeah, and the obvious question is how weak can you go?
Taalas's chatjimmy.ai can write a Sudoku at 15000 TPS, but its Llama 3 model isn't going to do agentic engineering no matter how hard you harness it.
This is pretty interesting, but it's unclear what happens when a problem is too hard for the weaker model.
I haven't tried this exact approach but there absolutely are cases that overload weaker models. In those cases the weaker model just churns - do you detect this?
Using a 2026 model to backtest 2024-2026 data is *not* legit. 🤷🏻‍♂️
He backtested using a 2026 model against September 2024 - March 2026 data.
Too bad about all that world knowledge built into the model. It would take real effort to make it *NOT* work.
Levels and Locks as Agent Orchestration primitives
nicklothian.com/blog/2026/02...
Hmm makes some sense..
Was the short 60-turn limit run against the same hold-out test set? TS at 82% is a bit higher.
(Python guy here, but I have noticed I've been using TS more for AI projects because the type system is nice, hence interested in hard metrics)
This is super interesting.
I'm surprised TS dropped so much in the long SQL test. Any insights there?
Llama 3.1 8B running at 16,000 tokens per second on taalas.com custom hardware.
umm yeah ok.
I mean yes you can get it to do it of course, but it's pretty rare unless you are doing it deliberately.
I don't think asking for fast square-root code and getting something it's seen before is the same as giving it a task-based prompt.
Humans memorize things without understanding too.
I don't know why you'd say that? Riddles confuse humans, and we claim humans understand things.
In the modern RL-trained era, LLMs aren't just parroting training data; they have sophisticated end-to-end understanding of tasks.
The term "understanding" is as loaded as "intelligence" but they certainly aren't parroting training data.
I agree with most of your post, but this misses some of the major criticisms of the term "Stochastic Parrot".
That comes from dl.acm.org/doi/epdf/10.... where one of the main claims is that LLMs don't "understand".
@timkellogg.me you might find x.com/AJakkli/stat... interesting:
"What happens when you leave two copies of the same model talking to each other? They have different attractor states: Grok devolves into gibberish while GPT-5.2 starts writing code and editing imaginary spreadsheets"
Strong agree.
Although the $20 plan is very generous compared to the Claude one, so maybe it is a way to build market share
The big labs do extra RL on their models for special use cases like this all the time. Listen to the podcast with Yi Tay on the Gemini IMO Gold Medal where he talks about this.
I guess it's not technically exactly the same model but Noam's point is mostly right.
www.youtube.com/watch?v=unUe...
It actually makes perfect sense. Only Agents can work out how to create matplotlib charts that aren't just copies of other charts anyway.
>slightly outdated. glm, step exist
No, that's the point. These are the top *used* models, still.
Nemotron is a *great* model and the 3B and 12B models are free on OpenRouter for people to try out.
LMNotebook generates great infographics of WTF code does if you rename python files to .txt and dump them in there.
Gaslighting from Opus. Grr