1. Don't fall for anti-open-model fearmongering, but
2. acknowledge that AI capabilities are proceeding fast, and eventually there may be a reason to be more careful with open weight models
I don't think Mythos is that trigger, but I'm not 100% confident
www.interconnects.ai/p/claude-myt...
Made Claude Opus out of GPT-5.4. Well, sort of.
I used a genetic algorithm to make GPT act more like Claude on coding tasks.
Not by making it smarter. Mostly by tuning the boring stuff that changes how it feels in practice: tool cadence, stop/go judgment, how deep it digs, and when it stops.
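The loop can be sketched as a toy genetic algorithm (every knob name, range, and the fitness signal below are made up for illustration; the post doesn't share the actual config): mutate behavioral parameters, keep whatever scores closest to a target "feel".

```python
import random

# Hypothetical behavioral knobs (illustrative names, not the real config).
PARAM_RANGES = {
    "tool_call_interval": (1, 10),   # steps between tool calls
    "stop_threshold": (0.0, 1.0),    # confidence needed to stop
    "search_depth": (1, 8),          # how deep the agent digs
}

def random_genome():
    return {k: random.uniform(*r) for k, r in PARAM_RANGES.items()}

def fitness(genome, target):
    # Stand-in for "how Claude-like does GPT feel with these settings":
    # negative squared distance to a target behavior profile.
    return -sum((genome[k] - target[k]) ** 2 for k in genome)

def mutate(genome, rate=0.3):
    child = dict(genome)
    for k, (lo, hi) in PARAM_RANGES.items():
        if random.random() < rate:
            child[k] = min(hi, max(lo, child[k] + random.gauss(0, (hi - lo) * 0.1)))
    return child

def evolve(target, pop_size=20, generations=30):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, target), reverse=True)
        elite = pop[: pop_size // 4]               # keep the top quarter
        pop = elite + [mutate(random.choice(elite)) for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda g: fitness(g, target))

best = evolve({"tool_call_interval": 3.0, "stop_threshold": 0.7, "search_depth": 5.0})
```

In the real setup the fitness function would score transcripts against Claude's behavior rather than a fixed target vector; the GA machinery stays the same.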
Created ngi, inspired by the latest Cursor post.
It is a faster and more efficient alternative to grep for agents.
Already using it, and it works nicely!
github.com/erogol/ngi
I was vibing on a few LLM projects with zero visibility into what was actually happening: silent retry bugs burning tokens, wrong API keys, traffic to dead endpoints. I couldn't debug any of it. So I created this:
github.com/erogol/toklog
Agentic tools like OpenClaw grow with every PR — but bigger codebases are harder for AI to understand and extend.
What if we kept the core tiny and let agents adapt themselves to user needs? No PR, just evolve.
My post on MiMo-Audio
open.substack.com/pub/erogol/p...
🔥 Trained on 100M+ hours and shows emergent few-shot learning:
• Voice conversion
• Emotion transfer
• Speech translation
• Cross-modal reasoning
⚡ Key finding: Speech follows same scaling laws as text LLMs
My breakdown of VibeVoice - new open-weight TTS model from Microsoft.
open.substack.com/pub/erogol/p...
Microsoft released a TTS model… nice…
You can create long-form convos and podcasts with 4 distinct voices.
huggingface.co/microsoft/Vi...
KyutaiTTS solved streaming text-to-speech with a state machine that generates audio word-by-word as text arrives.
220ms latency, 10-second voice cloning, 32 concurrent users on a single GPU.
No more waiting for complete sentences.
Full analysis: erogol.substack.com/p/model-chec...
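The word-boundary idea can be sketched as a tiny state machine (a toy illustration of the concept, not Kyutai's actual implementation): synthesize each word the moment its boundary arrives instead of buffering whole sentences.

```python
from enum import Enum, auto

class State(Enum):
    BUFFERING = auto()   # accumulating characters of the current word
    FLUSHING = auto()    # a word boundary arrived; emit its audio

def stream_tts(char_stream, synthesize):
    """Yield one audio chunk per word, as soon as the word completes."""
    state, word = State.BUFFERING, []
    for ch in char_stream:
        state = State.FLUSHING if ch.isspace() else State.BUFFERING
        if state is State.FLUSHING:
            if word:
                yield synthesize("".join(word))
                word = []
        else:
            word.append(ch)
    if word:                      # flush the trailing word
        yield synthesize("".join(word))

# Toy synthesizer: chunk length stands in for real audio frames.
chunks = list(stream_tts("no more waiting for sentences", lambda w: b"\x00" * len(w)))
```

The real system additionally has to pace audio emission against the model's generation speed; this sketch only shows the per-word flushing.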
This is such a great idea
Claude is the best coding model.
Gemini causes frequent syntax errors.
OpenAI doesn't even understand the task at hand.
Lately I've been spending some time with diffusion LMs and working on a NanoGPT-style LLaDA model.
So far I haven't achieved results comparable to AR models, but it's a good start.
github.com/erogol/BlaGP...
This work was done in collaboration with Jeff Clune’s lab at UBC, and led by his PhD students Jenny Zhang and Shengran Hu, together with Cong Lu and Robert Lange.
Paper: arxiv.org/abs/2505.22954
Code: github.com/jennyzzt/dgm
⚡ Machine Learns issue 48 is out
🚀 dKV-Cache accelerates diffusion models by up to 10x
🔐 OpenAI's authentication play (think OAuth for AI)
🎯 PaTH Attention beats RoPE on long-context tasks
🤖 Humanoid Robot fights became real
open.substack.com/pub/erogol/p...
Following the breadcrumbs, implemented PLE from Gemma3n.
It gave a significant performance boost and resulted in a new best model with almost no compute overhead.
github.com/erogol/BlaGPT
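Rough sketch of the PLE (Per-Layer Embedding) idea as I read it (the dimensions, projection, and table packing below are my assumptions, not Gemma3n's or BlaGPT's exact layout): each token carries an extra small embedding per layer, projected up and added to that layer's input, so the extra capacity is mostly cheap lookups rather than extra FLOPs.

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """One small per-layer embedding per token, packed into a single
    table and projected up to the model width for the requested layer."""

    def __init__(self, vocab_size, n_layers, d_ple=64, d_model=256):
        super().__init__()
        # one small embedding table per layer, stored as a single tensor
        self.table = nn.Embedding(vocab_size, n_layers * d_ple)
        self.proj = nn.ModuleList(
            nn.Linear(d_ple, d_model, bias=False) for _ in range(n_layers)
        )
        self.n_layers, self.d_ple = n_layers, d_ple

    def forward(self, tokens, layer_idx):
        # (batch, seq) -> (batch, seq, n_layers, d_ple), pick this layer's slice
        e = self.table(tokens).view(*tokens.shape, self.n_layers, self.d_ple)
        return self.proj[layer_idx](e[..., layer_idx, :])

tokens = torch.randint(0, 1000, (2, 16))
ple = PerLayerEmbedding(vocab_size=1000, n_layers=4)
h_extra = ple(tokens, layer_idx=0)   # add to layer 0's hidden state
```

Inside a transformer block this would be used as `h = h + ple(tokens, layer_idx)` before the attention/MLP sublayers.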
My paper notes on 2 new papers:
- Model Merging in Pre-training of Large Language Models
- Do Not Let Low-Probability Tokens Over-Dominate in RL
open.substack.com/pub/erogol/p...
Muon really works. Got the best results in BlaGPT.
```
torchrun --standalone --nproc_per_node=8 train.py --run_name best_model --model_name best
```
github.com/erogol/BlaGPT
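For context, the core of Muon, sketched from the public reference implementation (BlaGPT's integration may differ): momentum SGD where each 2D update matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Push G's singular values toward 1 with a quintic Newton-Schulz
    iteration (coefficients from the Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalize it, apply."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
    return weight, momentum
```

The orthogonalization is deliberately approximate (singular values land near 1, not exactly 1), which is why only a handful of iterations are needed per step.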
My results:
• Canon Layers definitely improved performance when placed before Attention/MLP blocks
• Softpick had worse validation loss but completely removed attention sinks
• Parallel blocks matched baseline performance but trained 15% faster
Parallel Transformer blocks run MLP and Attention in parallel instead of one after another.
So you get: z = x + MLP(x) + Attention(x)
PaLM models use this approach, which improves memory usage and speed without hurting performance.
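A minimal PyTorch sketch of the shape of it (the single shared pre-norm follows PaLM's description; BlaGPT's version may differ in details):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block where Attention and MLP read the same input
    and their outputs are summed: z = x + Attn(LN(x)) + MLP(LN(x))."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm(x)                      # one shared pre-norm for both branches
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)     # both branches added to the residual

x = torch.randn(2, 16, 256)
layer = ParallelBlock()
y = layer(x)
```

The speedup comes from fusing the two branches' input projections and running them concurrently; sequential blocks must wait for attention before the MLP can start.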
The Canon Layers paper shows they boost performance when added to transformer blocks.
They also help models without positional encoding work just as well as RoPE models.
❗Worth noting that RWKV used a similar idea years ago.
Canon Layers are basically causal 1D convolutions that mix the current hidden state with previous states (how many depends on the kernel size).
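That can be sketched as a residual causal depthwise conv (my reading of the recipe; the kernel size and depthwise choice here are assumptions):

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Causal depthwise 1D convolution: each position mixes its own
    hidden state with the previous (kernel_size - 1) states, added
    back residually before an Attention/MLP block."""

    def __init__(self, d_model=256, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model,                    # depthwise: no channel mixing
            padding=kernel_size - 1,           # left-pad so it stays causal
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = x.transpose(1, 2)                  # -> (batch, d_model, seq)
        h = self.conv(h)[..., : x.size(1)]     # trim the right-side padding
        return x + h.transpose(1, 2)           # residual mix

x = torch.randn(2, 16, 256)
layer = CanonLayer()
y = layer(x)
```

Because each position only sees a short causal window of past states, the layer injects local order information, which is why it can stand in for positional encodings.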
Softpick replaces regular softmax in attention blocks.
It allows zero values in the numerator and lets negative values contribute to the denominator.
This prevents attention sinks while keeping math properties similar to regular softmax.
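As a formula (as I recall it from the Softpick paper): softpick(x)_i = relu(e^{x_i} - 1) / Σ_j |e^{x_j} - 1|. A small numerically-stable sketch:

```python
import torch

def softpick(x, dim=-1, eps=1e-8):
    """relu(e^x - 1) / sum |e^x - 1|: entries with x <= 0 get exactly
    zero weight but still add |e^x - 1| to the denominator, so rows
    can sum to less than 1 (no mass forced onto a sink token)."""
    m = x.max(dim=dim, keepdim=True).values
    e_x = torch.exp(x - m)        # everything rescaled by e^{-m} for stability
    e_0 = torch.exp(-m)           # e^0 under the same rescaling
    num = torch.relu(e_x - e_0)
    den = torch.abs(e_x - e_0).sum(dim=dim, keepdim=True)
    return num / (den + eps)

w = softpick(torch.tensor([[2.0, -1.0, 0.0, 0.5]]))
```

Note the scores at -1.0 and 0.0 get exactly zero weight, yet they still shrink the weights of the positive scores through the denominator, which is what removes the sink behavior.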
🧵 Here is a small thread with my notes about some of the recent Transformer papers.
- Softpick: an alternative to softmax in Attention
- Canon Layers: mixing states with conv1d
- Parallel Transformer blocks
Machine Learns #45 - the no-fluff AI newsletter - is out!
I normally share it bi-weekly, but last week was full enough, so here we go.
open.substack.com/pub/erogol/p...
Updated my LLM usage and cancelled ChatGPT sub for now
Coding - Claude, Gemini 2.5
Reading papers - Claude
Research - Gemini 2.5
Daily - Gemini 2.5
Search - Gemini 2.5
Thanks :)