I wasn't familiar with WeirdML, and in a way the outstanding performance of Gemma4 might prove your point as well as anything (or else the team that did distillation for Gemini Flash also did Gemma and they are really really good)
Posts by Nick Lothian
Yes, they also say "Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher."
Yes, I feel like "elite" does a lot of work there. But it's reasonably balanced.
Not sure they are downplaying it: "We have many years of experience picking apart the work of the world’s best security researchers, and Mythos Preview is every bit as capable. So far we’ve found no category or complexity of vulnerability that humans can find that this model can’t."
The timing works very well in Australia. Generally our timezone sucks, but for once we win!
I found the new Nemotron-Cascade-2-30B-A3B is stronger than Qwen 3.5-35B-A3B for my benchmark. Very impressed with it!
@ajfisher.social there are some interesting results in this for Inception Mercury 2 - it's certainly fast! See sql-benchmark.nicklothian.com#fastest-models for details around the issues I had with it.
Thinking mode isn't a clear win for small models
Even very heavily quantized models can perform well
Models differ hugely in their token efficiency
Interesting things I learnt:
Grok and GLM5 beat Claude Opus and Sonnet. Same score but *much* cheaper!
Qwen 27B is great.
Nvidia's Nemotron-Cascade-2-30B-A3B beats Qwen 3.5-35B-A3B.
I wrote a new Agentic text-to-SQL benchmark and tested every local model I could against it: sql-benchmark.nicklothian.com
Thanks to DuckDB WASM you can try your own models from the browser.
Is Mercury2 open source? I couldn't find the weights. It is available on OpenRouter though.
hmm yeah, and the obvious question is how weak can you go?
Taalas's chatjimmy.ai can write a Sudoku at 15000 TPS, but its Llama 3 model isn't going to do agentic engineering no matter how hard you harness it.
This is pretty interesting, but it's unclear what happens when a problem is too hard for the weaker model.
I haven't tried this exact approach but there absolutely are cases that overload weaker models. In those cases the weaker model just churns - do you detect this?
Using a 2026 model to backtest 2024-2026 data is *not* legit. 🤷🏻‍♂️
He backtested using a 2026 model against September 2024 - March 2026 data.
Too bad about all that world knowledge built into the model. It would take real effort to make it *NOT* work.
Levels and Locks as Agent Orchestration primitives
nicklothian.com/blog/2026/02...
Hmm makes some sense..
Was the short 60-turn limit run against the same hold-out test set? TS at 82% is a bit higher.
(Python guy here, but I have noticed I've been using TS more for AI projects because the type system is nice, hence interested in hard metrics)
This is super interesting.
I'm surprised TS dropped so much in the long SQL test. Any insights there?
Llama 3.1 8B running at 16,000 tokens per second on taalas.com custom hardware.
umm yeah ok.
I mean yes you can get it to do it of course, but it's pretty rare unless you are doing it deliberately.
I don't think asking for fast square-root code and getting something it's seen before is the same as giving it a task-based prompt.
Humans memorize things without understanding too.
I don't know why you'd say that? Riddles confuse humans, and we claim humans understand things.
In the modern RL-trained era, LLMs aren't just parroting training data; they have sophisticated end-to-end understanding of tasks.
The term "understanding" is as loaded as "intelligence" but they certainly aren't parroting training data.
I agree with most of your post, but this misses some of the major criticisms of the term "Stochastic Parrot".
That comes from dl.acm.org/doi/epdf/10.... where one of the main claims is that LLMs don't "understand".
@timkellogg.me you might find x.com/AJakkli/stat... interesting:
"What happens when you leave two copies of the same model talking to each other? They have different attractor states: Grok devolves into gibberish while GPT-5.2 starts writing code and editing imaginary spreadsheets"
Strong agree.
Although the $20 plan is very generous compared to the Claude one, so maybe it is a way to build market share
The big labs do extra RL on their models for special use cases like this all the time. Listen to the podcast with Yi Tay on the Gemini IMO Gold Medal where he talks about this.
I guess it's not technically exactly the same model but Noam's point is mostly right.
www.youtube.com/watch?v=unUe...
It actually makes perfect sense. Only Agents can work out how to create matplotlib charts that aren't just copies of other charts anyway.
>slightly outdated. glm, step exist
No, that's the point. These are the top *used* models, still.
Nemotron is a *great* model and the 3B and 12B models are free on OpenRouter for people to try out.
LMNotebook generates great infographics of WTF code does if you rename python files to .txt and dump them in there.
Gaslighting from Opus. Grr