1. Don't fall for anti-open-model fearmongering, but
2. acknowledge that AI capabilities are proceeding fast, and eventually there may be a reason to be more careful with open weight models
I don't think Mythos is that trigger, but I'm not 100% confident
www.interconnects.ai/p/claude-myt...
Made Claude Opus out of GPT-5.4. Well, sort of.
I used a genetic algorithm to make GPT act more like Claude on coding tasks.
Not by making it smarter. Mostly by tuning the boring stuff that changes how it feels in practice: tool cadence, stop/go judgment, how deep it digs, and when it stops.
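The loop can be sketched as a toy genetic algorithm (every knob name, range, and the fitness signal below are made up for illustration; the post doesn't share the actual config): mutate behavioral parameters, keep whatever scores closest to a target "feel".

```python
import random

# Hypothetical behavioral knobs (illustrative names, not the real config).
PARAM_RANGES = {
    "tool_call_interval": (1, 10),   # steps between tool calls
    "stop_threshold": (0.0, 1.0),    # confidence needed to stop
    "search_depth": (1, 8),          # how deep the agent digs
}

def random_genome():
    return {k: random.uniform(*r) for k, r in PARAM_RANGES.items()}

def fitness(genome, target):
    # Stand-in for "how Claude-like does GPT feel with these settings":
    # negative squared distance to a target behavior profile.
    return -sum((genome[k] - target[k]) ** 2 for k in genome)

def mutate(genome, rate=0.3):
    child = dict(genome)
    for k, (lo, hi) in PARAM_RANGES.items():
        if random.random() < rate:
            child[k] = min(hi, max(lo, child[k] + random.gauss(0, (hi - lo) * 0.1)))
    return child

def evolve(target, pop_size=20, generations=30):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, target), reverse=True)
        elite = pop[: pop_size // 4]               # keep the top quarter
        pop = elite + [mutate(random.choice(elite)) for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda g: fitness(g, target))

best = evolve({"tool_call_interval": 3.0, "stop_threshold": 0.7, "search_depth": 5.0})
```

In the real setup the fitness function would score transcripts against Claude's behavior rather than a fixed target vector; the GA machinery stays the same.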
Created ngi, inspired by the latest Cursor post.
It is a faster and more efficient alternative to grep for agents.
Already using it, and it works nicely!
github.com/erogol/ngi
I was vibing on a few LLM projects with zero visibility into what was actually happening: silent retry bugs burning tokens, wrong API keys, traffic to dead endpoints. I couldn't debug any of it. So I created this:
github.com/erogol/toklog
Agentic tools like OpenClaw grow with every PR — but bigger codebases are harder for AI to understand and extend.
What if we kept the core tiny and let agents adapt themselves to user needs? No PR, just evolve.
My post on MiMo-Audio
open.substack.com/pub/erogol/p...
🔥 Trained on 100M+ hours and shows emergent few-shot learning:
• Voice conversion
• Emotion transfer
• Speech translation
• Cross-modal reasoning
⚡ Key finding: Speech follows same scaling laws as text LLMs
My breakdown of VibeVoice - new open-weight TTS model from Microsoft.
open.substack.com/pub/erogol/p...
Microsoft released a TTS model… nice…
You can create long-form convos and podcasts with 4 distinct voices.
huggingface.co/microsoft/Vi...
KyutaiTTS solved streaming text-to-speech with a state machine that generates audio word-by-word as text arrives.
220ms latency, 10-second voice cloning, 32 concurrent users on a single GPU.
No more waiting for complete sentences.
Full analysis: erogol.substack.com/p/model-chec...
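The word-boundary idea can be sketched as a tiny state machine (a toy illustration of the concept, not Kyutai's actual implementation): synthesize each word the moment its boundary arrives instead of buffering whole sentences.

```python
from enum import Enum, auto

class State(Enum):
    BUFFERING = auto()   # accumulating characters of the current word
    FLUSHING = auto()    # a word boundary arrived; emit its audio

def stream_tts(char_stream, synthesize):
    """Yield one audio chunk per word, as soon as the word completes."""
    state, word = State.BUFFERING, []
    for ch in char_stream:
        state = State.FLUSHING if ch.isspace() else State.BUFFERING
        if state is State.FLUSHING:
            if word:
                yield synthesize("".join(word))
                word = []
        else:
            word.append(ch)
    if word:                      # flush the trailing word
        yield synthesize("".join(word))

# Toy synthesizer: chunk length stands in for real audio frames.
chunks = list(stream_tts("no more waiting for sentences", lambda w: b"\x00" * len(w)))
```

The real system additionally has to pace audio emission against the model's generation speed; this sketch only shows the per-word flushing.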
This is such a great idea
Claude is the best coding model.
Gemini causes frequent syntax errors.
OpenAI doesn't even understand the task at hand.
Lately I've been spending some time with diffusion LMs and working on a NanoGPT-style LLaDA model.
So far I haven't achieved results comparable to AR models, but it's a good start.
github.com/erogol/BlaGP...
This work was done in collaboration with Jeff Clune’s lab at UBC, and led by his PhD students Jenny Zhang and Shengran Hu, together with Cong Lu and Robert Lange.
Paper: arxiv.org/abs/2505.22954
Code: github.com/jennyzzt/dgm
⚡ Machine Learns issue 48 is out
🚀 dKV-Cache accelerates diffusion models by up to 10x
🔐 OpenAI's authentication play (think OAuth for AI)
🎯 PaTH Attention beats RoPE on long-context tasks
🤖 Humanoid Robot fights became real
open.substack.com/pub/erogol/p...
Following the breadcrumbs, implemented PLE from Gemma3n.
It gave a significant performance boost and resulted in a new best model with almost no compute overhead.
github.com/erogol/BlaGPT
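Rough sketch of the PLE (Per-Layer Embedding) idea as I read it (the dimensions, projection, and table packing below are my assumptions, not Gemma3n's or BlaGPT's exact layout): each token carries an extra small embedding per layer, projected up and added to that layer's input, so the extra capacity is mostly cheap lookups rather than extra FLOPs.

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """One small per-layer embedding per token, packed into a single
    table and projected up to the model width for the requested layer."""

    def __init__(self, vocab_size, n_layers, d_ple=64, d_model=256):
        super().__init__()
        # one small embedding table per layer, stored as a single tensor
        self.table = nn.Embedding(vocab_size, n_layers * d_ple)
        self.proj = nn.ModuleList(
            nn.Linear(d_ple, d_model, bias=False) for _ in range(n_layers)
        )
        self.n_layers, self.d_ple = n_layers, d_ple

    def forward(self, tokens, layer_idx):
        # (batch, seq) -> (batch, seq, n_layers, d_ple), pick this layer's slice
        e = self.table(tokens).view(*tokens.shape, self.n_layers, self.d_ple)
        return self.proj[layer_idx](e[..., layer_idx, :])

tokens = torch.randint(0, 1000, (2, 16))
ple = PerLayerEmbedding(vocab_size=1000, n_layers=4)
h_extra = ple(tokens, layer_idx=0)   # add to layer 0's hidden state
```

Inside a transformer block this would be used as `h = h + ple(tokens, layer_idx)` before the attention/MLP sublayers.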
My paper notes on 2 new papers:
- Model Merging in Pre-training of Large Language Models
- Do Not Let Low-Probability Tokens Over-Dominate in RL
open.substack.com/pub/erogol/p...
Muon really works. Got the best results in BlaGPT.
```
torchrun --standalone --nproc_per_node=8 train.py --run_name best_model --model_name best
```
github.com/erogol/BlaGPT
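For context, the core of Muon, sketched from the public reference implementation (BlaGPT's integration may differ): momentum SGD where each 2D update matrix is approximately orthogonalized with a few Newton-Schulz iterations before being applied.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Push G's singular values toward 1 with a quintic Newton-Schulz
    iteration (coefficients from the Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalize it, apply."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
    return weight, momentum
```

The orthogonalization is deliberately approximate (singular values land near 1, not exactly 1), which is why only a handful of iterations are needed per step.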
My results:
• Canon Layers definitely improved performance when placed before Attention/MLP blocks
• Softpick had worse validation loss but completely removed attention sinks
• Parallel blocks matched baseline performance but trained 15% faster
Parallel Transformer blocks run MLP and Attention in parallel instead of one after another.
So you get: z = x + MLP(x) + Attention(x)
PaLM models use this approach, which improves memory usage and speed without hurting performance.
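A minimal PyTorch sketch of the shape of it (the single shared pre-norm follows PaLM's description; BlaGPT's version may differ in details):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block where Attention and MLP read the same input
    and their outputs are summed: z = x + Attn(LN(x)) + MLP(LN(x))."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm(x)                      # one shared pre-norm for both branches
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)     # both branches added to the residual

x = torch.randn(2, 16, 256)
layer = ParallelBlock()
y = layer(x)
```

The speedup comes from fusing the two branches' input projections and running them concurrently; sequential blocks must wait for attention before the MLP can start.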
The Canon Layers paper shows they boost performance when added to transformer blocks.
They also help models without positional encoding work just as well as RoPE models.
❗Worth noting that RWKV used a similar idea years ago.
Canon Layers are basically causal 1D convolutions that mix the current hidden state with previous states (how many depends on the kernel size).
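That can be sketched as a residual causal depthwise conv (my reading of the recipe; the kernel size and depthwise choice here are assumptions):

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Causal depthwise 1D convolution: each position mixes its own
    hidden state with the previous (kernel_size - 1) states, added
    back residually before an Attention/MLP block."""

    def __init__(self, d_model=256, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model,                    # depthwise: no channel mixing
            padding=kernel_size - 1,           # left-pad so it stays causal
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = x.transpose(1, 2)                  # -> (batch, d_model, seq)
        h = self.conv(h)[..., : x.size(1)]     # trim the right-side padding
        return x + h.transpose(1, 2)           # residual mix

x = torch.randn(2, 16, 256)
layer = CanonLayer()
y = layer(x)
```

Because each position only sees a short causal window of past states, the layer injects local order information, which is why it can stand in for positional encodings.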
Softpick replaces regular softmax in attention blocks.
It allows zero values in the numerator and lets negative values contribute to the denominator.
This prevents attention sinks while keeping math properties similar to regular softmax.
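As a formula (as I recall it from the Softpick paper): softpick(x)_i = relu(e^{x_i} - 1) / Σ_j |e^{x_j} - 1|. A small numerically-stable sketch:

```python
import torch

def softpick(x, dim=-1, eps=1e-8):
    """relu(e^x - 1) / sum |e^x - 1|: entries with x <= 0 get exactly
    zero weight but still add |e^x - 1| to the denominator, so rows
    can sum to less than 1 (no mass forced onto a sink token)."""
    m = x.max(dim=dim, keepdim=True).values
    e_x = torch.exp(x - m)        # everything rescaled by e^{-m} for stability
    e_0 = torch.exp(-m)           # e^0 under the same rescaling
    num = torch.relu(e_x - e_0)
    den = torch.abs(e_x - e_0).sum(dim=dim, keepdim=True)
    return num / (den + eps)

w = softpick(torch.tensor([[2.0, -1.0, 0.0, 0.5]]))
```

Note the scores at -1.0 and 0.0 get exactly zero weight, yet they still shrink the weights of the positive scores through the denominator, which is what removes the sink behavior.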
🧵 Here is a small thread with my notes about some of the recent Transformer papers.
- Softpick: an alternative to softmax in Attention
- Canon Layers: mixing states with conv1d
- Parallel Transformer blocks
Machine Learns #45 - the no-fluff AI newsletter - is out!
I normally share it bi-weekly, but last week was full enough, so here we go.
open.substack.com/pub/erogol/p...
Updated my LLM usage and cancelled ChatGPT sub for now
Coding - Claude, Gemini 2.5
Reading papers - Claude
Research - Gemini 2.5
Daily - Gemini 2.5
Search - Gemini 2.5
Thanks :)