Gemma 4 Day
near-Kimi 2.5 on your laptop
- 32B & 26B-A4B
- effective 4B & 2B for mobile
- Apache 2
blog.google/innovation-a...
Posts by Thaddée Tyl
With today's Gemma 4 announcement, you might want to try the model yourself without knowing where to start. Well, how about an AIventure where you learn about vibe-coding, agents and gen-AI with Gemma 4 specifically!
developers.google.com/solutions/le...
Gemma 4 day? 🤞
I like "goal-oriented processing".
We're fully in the weeds the paper describes now, with a good part of post-training targeted at sycophancy-adjacent rewards (DPO and the like, depending on the dataset), which fosters parasocial relationships with sometimes negative results.
One useful lesson: Muon is a reasonable optimizer.
But obviously if you're hill climbing, you never get to the taller hill.
Spent an hour going back and forth with ChatGPT trying to pinpoint a configuration issue. It guessed all manner of reasonable causes, none of which was the right one.
Spent 5 min doing a reverse image search of the error. Someone on the Web had the same issue. Instant fix.
Friendly laser fire
@funranium.bsky.social is this a 3 digit count or 4 digit count of swear words situation?
Which website do you use to generate this video?
@duckduckgo.com Is there a way to add 1Password to your browser? I don't see extensions.
That strongly implies that Mistral's next step is TTS. In fact, other tokens corroborate it: while [AUDIO] likely indicates that speech tokens follow, [REF] might indicate a reference voice pattern to copy, and [OUTPUT_AUDIO] might start converting text to audio.
It also outputs a [word] token, which fits the [STREAMING_WORD] token found in Voxtral 2.
Why have that?
For text-to-speech: when the model knows it has finished outputting the audio for a word, it generates the WORD action, so that we can feed it the next word to say.
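A toy sketch of how that word-paced loop could work (the [word] action name is from the post; the stand-in model, function names and token strings are illustrative, not Mistral's actual API):

```python
# Toy sketch of a word-paced TTS decode loop.
# WORD is the action from the post; fake_model is a stand-in, not a real model.
WORD = "[word]"

def fake_model(word):
    """Stand-in: emits a few audio tokens for `word`, then the WORD action."""
    return [f"audio({word},{i})" for i in range(3)] + [WORD]

def speak(words):
    """Feed the model the next word each time it emits the WORD action."""
    audio_tokens = []
    for word in words:
        for tok in fake_model(word):
            if tok == WORD:
                break  # the model is done with this word: feed the next one
            audio_tokens.append(tok)
    return audio_tokens

print(speak(["hello", "world"]))
```

The point of the pacing action: the caller, not the model, holds the text, so the model only ever needs the current word plus a signal to request the next.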
But there are a lot more tokens in there that are unexplained!
To learn more, we can look at what inspired Mistral: Kyutai's Delayed Streams Modeling, arxiv.org/abs/2509.08753
It has the same delay design with the [pad] tokens.
Of course, the model cannot output exactly one word per text token, since the audio does not contain exactly one word per 80 ms frame.
The trick? Look at those new tokens: when the model needs to wait before outputting a word, it outputs a [STREAMING_PAD] token.
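That pad-until-ready behavior can be sketched in a few lines (only the [STREAMING_PAD] token name and the 80 ms clock are from the posts; the per-word frame counts are illustrative):

```python
# Toy sketch of delayed-stream transcription: one text-stream token per
# 80 ms audio frame, padding until the word has been fully heard.
PAD = "[STREAMING_PAD]"

def transcribe(frames_per_word):
    """frames_per_word[i] = how many 80 ms frames word i spans."""
    stream = []
    for word_idx, n_frames in enumerate(frames_per_word):
        stream += [PAD] * (n_frames - 1)  # wait while the word is incomplete
        stream.append(f"word{word_idx}")  # then emit the text token
    return stream

# A 3-frame word followed by a 1-frame word:
print(transcribe([3, 1]))  # two pads, then 'word0', then 'word1'
```

This is what makes the delay variable: the text stream stays locked to the audio clock, and the pads absorb however long each word takes.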
4. The audio token embedding history + delay tokens go through a Transformer to output a speech token. This is why the delay is variable: it can be any multiple of 80 ms.
5. The history of speech token embeddings goes through a Transformer to output a text token embedding → text token probs → text.
Look at its architecture:
1. The audio, sampled at 16 kHz, is cut into 80 ms frames (16000 × 0.08 = 1280 floats per frame).
2. It is converted to a spectrogram.
3. A Whisper-style encoder converts it to an audio token embedding through a convnet.
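The frame arithmetic in step 1 can be checked in a few lines (pure-Python sketch; only the 16 kHz rate and the 80 ms frame length come from the post):

```python
# Step 1 of the front end: cut the waveform into 80 ms frames.
SAMPLE_RATE = 16_000   # Hz
FRAME_SECONDS = 0.08   # 80 ms
FRAME_SAMPLES = int(SAMPLE_RATE * FRAME_SECONDS)  # 16000 * 0.08 = 1280 floats

def frame_audio(samples):
    """Cut a waveform (list of floats) into full 1280-sample frames."""
    n = len(samples) // FRAME_SAMPLES
    return [samples[i * FRAME_SAMPLES:(i + 1) * FRAME_SAMPLES] for i in range(n)]

one_second = [0.0] * SAMPLE_RATE
frames = frame_audio(one_second)
print(FRAME_SAMPLES, len(frames))  # 1280 12 — i.e. 12.5 frames/s, truncated
```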
Shoutout to Voxtral 2, which really feels unparalleled in quality.
The interesting bit is its ability to do realtime transcription.
How does it do that, with a variable delay?
On Math, honestly, it is impressive how close it is to GPT-OSS 20B and Gemini 3 Flash, even when it does not beat them.
All in all, one of the best local models out there. Architecturally, one of the most innovative.
Reasoning is another big purpose. Using MLA, it may be quite good at in-context reasoning on a large corpus, even locally.
But it won't be far above leading local models like Ministral 3. Meanwhile API models like Gemini 3 and DeepSeek will surpass it at the same price.
Where GLM-4.7 Flash shines is when you feed it enormous inputs.
That is typical of agentic coding tools. It’s on the Pareto frontier there.
Better than GPT-OSS 20B, cheaper and faster than Devstral Small 2.
What happened to Z.ai servers in December, for them to suddenly have a spiky boost in token throughput?!
One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.
We found that if you simply delete them after pretraining and recalibrate for <1% of the original budget, you unlock massive context windows. Smarter, not harder.
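A minimal sketch of what "deleting" the positional embeddings means, assuming a RoPE-style setup (the toy rotary function and dimensions are illustrative, not the paper's actual configuration): RoPE only rotates q and k before the dot product, so dropping it leaves a plain position-free (NoPE) attention score, with no other change to the weights.

```python
# RoPE rotates pairs of channels of q and k by position-dependent angles;
# "deleting" it (DroPE -> NoPE) just skips the rotation. Toy dims here.
import numpy as np

def rope(x, pos, base=10_000.0):
    """Apply a toy rotary embedding at position `pos` to vector x."""
    half = x.shape[-1] // 2
    angles = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

with_rope = rope(q, pos=3) @ rope(k, pos=7)  # position-aware attention score
no_pe = q @ k                                # after DroPE: just drop the rotation
print(with_rope != no_pe)  # generally differs: only the preprocessing changed
```

At position 0 the rotation is the identity, which is why deleting it is such a light surgery: the attention weights themselves are untouched, hence the <1% recalibration budget.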
Phenomenal work.
I wonder about DroPE scaling laws: can it be executed after 4B pretraining tokens regardless of model size (and then the rest of pretraining does NoPE)? Or does it have to be done at the end of pretraining?
As always, find these comparisons at metabench.organisons.com
and the announcement at www.minimax.io/news/minimax...
Other metrics don't improve as much… but the M2 baseline was already quite good.
Keep in mind that this model is much faster than the others around it, clocking at 100 tokens per second compared to similar ones doing 30 tokens/sec.
M2.1 from @MiniMax__AI has a welcome jump in agentic coding! It matches @Zai_org’s GLM-4.7 released yesterday, but at a lower cost.
As always, the full leaderboard is here: metabench.organisons.com
And the announcement: z.ai/blog/glm-4.7
Other metrics are good, but the improvement is more marginal, such as in raw agentic use (typical of customer service):
As often, code training improves math as well, where we see a very positive jump!
Impressive jump on agentic coding according to its benchmarks! Now on par with Claude Opus 4.1 (from 5 months ago!), K2 Thinking, and GPT-5.2 Codex, at a lower cost.
A bit overshadowed by DeepSeek, whose DSA mechanisms achieve great cost cuts.