Gemma 4 Day
near-Kimi 2.5 on your laptop
- 32B & 26B-A4B
- effective 4B & 2B for mobile
- Apache 2
blog.google/innovation-a...
Posts by Thaddée Tyl
With today's Gemma 4 announcement, you might want to try the model yourself without knowing where to start. Well, how about an AIventure where you learn about vibe-coding, agents and gen-AI with Gemma 4 specifically!
developers.google.com/solutions/le...
Gemma 4 day? 🤞
I like "goal-oriented processing".
We're fully in the weeds the paper describes now, with a good part of post-training targeted at sycophancy-adjacent rewards (DPO and the like, depending on the dataset), which fosters parasocial relationships with sometimes negative results.
One useful lesson: Muon is a reasonable optimizer.
But obviously if you're hill climbing, you never get to the taller hill.
Spent an hour going back and forth with ChatGPT trying to pinpoint a configuration issue. It guessed all manner of reasonable causes, none of which was the right one.
Spent 5 min doing a reverse image search of the error. Someone on the Web had the same issue. Instant fix.
Friendly laser fire
@funranium.bsky.social is this a 3 digit count or 4 digit count of swear words situation?
Which website do you use to generate this video?
@duckduckgo.com Is there a way to add 1Password to your browser? I don't see extensions.
That strongly implies that Mistral's next step is TTS. In fact, other tokens corroborate it: while [AUDIO] likely indicates that speech tokens follow, [REF] might indicate a reference voice pattern to copy, and [OUTPUT_AUDIO] might start converting text to audio.
It also outputs a [word] token, which fits the [STREAMING_WORD] token found in Voxtral 2.
Why have that?
For text-to-speech: when the model knows it has finished outputting the audio for a word, it generates the WORD action, so that we can feed it the next word to say.
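A toy sketch of how that word-paced loop could work (the [word] action name is from the post; the stand-in model, function names and token strings are illustrative, not Mistral's actual API):

```python
# Toy sketch of a word-paced TTS decode loop.
# WORD is the action from the post; fake_model is a stand-in, not a real model.
WORD = "[word]"

def fake_model(word):
    """Stand-in: emits a few audio tokens for `word`, then the WORD action."""
    return [f"audio({word},{i})" for i in range(3)] + [WORD]

def speak(words):
    """Feed the model the next word each time it emits the WORD action."""
    audio_tokens = []
    for word in words:
        for tok in fake_model(word):
            if tok == WORD:
                break  # the model is done with this word: feed the next one
            audio_tokens.append(tok)
    return audio_tokens

print(speak(["hello", "world"]))
```

The point of the pacing action: the caller, not the model, holds the text, so the model only ever needs the current word plus a signal to request the next.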
But there are a lot more tokens in there that are unexplained!
To learn more, we can look at what inspired Mistral: Kyutai's Delayed Streams Modeling, arxiv.org/abs/2509.08753
It has the same delay design with the [pad] tokens.
Of course, the model cannot output exactly one word per text token, since the audio does not contain exactly one word per 80 ms frame.
The trick? Look at those new tokens: when the model needs to wait before outputting a word, it outputs a [STREAMING_PAD] token.
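That pad-until-ready behavior can be sketched in a few lines (only the [STREAMING_PAD] token name and the 80 ms clock are from the posts; the per-word frame counts are illustrative):

```python
# Toy sketch of delayed-stream transcription: one text-stream token per
# 80 ms audio frame, padding until the word has been fully heard.
PAD = "[STREAMING_PAD]"

def transcribe(frames_per_word):
    """frames_per_word[i] = how many 80 ms frames word i spans."""
    stream = []
    for word_idx, n_frames in enumerate(frames_per_word):
        stream += [PAD] * (n_frames - 1)  # wait while the word is incomplete
        stream.append(f"word{word_idx}")  # then emit the text token
    return stream

# A 3-frame word followed by a 1-frame word:
print(transcribe([3, 1]))  # two pads, then 'word0', then 'word1'
```

This is what makes the delay variable: the text stream stays locked to the audio clock, and the pads absorb however long each word takes.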
4. The audio token embedding history + delay tokens go through a Transformer to output a speech token. This is why the delay is variable: it can be any multiple of 80 ms.
5. The history of speech token embeddings goes through a Transformer to output a text token embedding → text token probs → text.
Look at its architecture:
1. The audio, sampled at 16 kHz, is cut into 80 ms frames (16000 × 0.08 = 1280 floats per frame).
2. It is converted to a spectrogram.
3. A Whisper-style encoder converts it to an audio token embedding through a convnet.
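The frame arithmetic in step 1 can be checked in a few lines (pure-Python sketch; only the 16 kHz rate and the 80 ms frame length come from the post):

```python
# Step 1 of the front end: cut the waveform into 80 ms frames.
SAMPLE_RATE = 16_000   # Hz
FRAME_SECONDS = 0.08   # 80 ms
FRAME_SAMPLES = int(SAMPLE_RATE * FRAME_SECONDS)  # 16000 * 0.08 = 1280 floats

def frame_audio(samples):
    """Cut a waveform (list of floats) into full 1280-sample frames."""
    n = len(samples) // FRAME_SAMPLES
    return [samples[i * FRAME_SAMPLES:(i + 1) * FRAME_SAMPLES] for i in range(n)]

one_second = [0.0] * SAMPLE_RATE
frames = frame_audio(one_second)
print(FRAME_SAMPLES, len(frames))  # 1280 12 — i.e. 12.5 frames/s, truncated
```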
Shoutout to Voxtral 2, which really feels unparalleled in quality.
The interesting bit is its ability to do realtime transcription.
How does it do that, with a variable delay?
On Math, honestly, it is impressive how close it is to GPT-OSS 20B and Gemini 3 Flash, even when it does not beat them.
All in all, one of the best local models out there. Architecturally, one of the most innovative.
Reasoning is another big purpose. Using MLA, it may be quite good at in-context reasoning on a large corpus, even locally.
But it won't be far above leading local models like Ministral 3. Meanwhile API models like Gemini 3 and DeepSeek will surpass it at the same price.
Where GLM-4.7 Flash shines is when you feed it enormous inputs.
That is typical of agentic coding tools. It’s on the Pareto frontier there.
Better than GPT-OSS 20B, cheaper and faster than Devstral Small 2.
What happened to Z.ai servers in December, for them to suddenly have a spiky boost in token throughput?!
One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.
We found that if you simply delete them after pretraining and recalibrate for <1% of the original budget, you unlock massive context windows. Smarter, not harder.
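A minimal sketch of what "deleting" the positional embeddings means, assuming a RoPE-style setup (the toy rotary function and dimensions are illustrative, not the paper's actual configuration): RoPE only rotates q and k before the dot product, so dropping it leaves a plain position-free (NoPE) attention score, with no other change to the weights.

```python
# RoPE rotates pairs of channels of q and k by position-dependent angles;
# "deleting" it (DroPE -> NoPE) just skips the rotation. Toy dims here.
import numpy as np

def rope(x, pos, base=10_000.0):
    """Apply a toy rotary embedding at position `pos` to vector x."""
    half = x.shape[-1] // 2
    angles = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

with_rope = rope(q, pos=3) @ rope(k, pos=7)  # position-aware attention score
no_pe = q @ k                                # after DroPE: just drop the rotation
print(with_rope != no_pe)  # generally differs: only the preprocessing changed
```

At position 0 the rotation is the identity, which is why deleting it is such a light surgery: the attention weights themselves are untouched, hence the <1% recalibration budget.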
Phenomenal work.
I wonder about DroPE scaling laws: can it be executed after 4B pretraining tokens regardless of model size (and then the rest of pretraining does NoPE)? Or does it have to be done at the end of pretraining?
As always, find these comparisons at metabench.organisons.com
and the announcement at www.minimax.io/news/minimax...
Other metrics don't improve as much… but the M2 baseline was already quite good.
Keep in mind that this model is much faster than the others around it, clocking at 100 tokens per second compared to similar ones doing 30 tokens/sec.
M2.1 from @MiniMax__AI has a welcome jump in agentic coding! It matches @Zai_org’s GLM-4.7 released yesterday, but at a lower cost.
As always, the full leaderboard is here: metabench.organisons.com
And the announcement: z.ai/blog/glm-4.7
Other metrics are good, but the improvement is more marginal, such as in raw agentic use (typical of customer service):
As often, code training improves math as well, where we see a very positive jump!
Impressive jump on agentic coding according to its benchmarks! Now on par with Claude Opus 4.1 (from 5 months ago!), K2 Thinking, and GPT-5.2 Codex, at a lower cost.
A bit overshadowed by DeepSeek, whose DSA mechanisms achieve great cost cuts.