At least for now Brazil is 10/10 on things that matter (breakfast)
More formal announcement: I'll be at ICLR this week to present the oral on Common Corpus with Pavel Chizhov. Happy to discuss anything data research, synthetic pretraining, models, and products.
Mostly doing ablations/shifting layers/adapting the tokenizer to multilingual.
Yep.
Not surprised at all. I've been reorganizing my complete pretraining stack around models below 5M parameters.
Basically identifying titles/running titles/footnotes, etc. at pretraining scale (so the smaller, the better). Went with a four-layer custom encoder, actually 12M total, but most of it goes to the token embedding.
(the layers are still valid in other contexts, but otherwise)
Parameter ceiling seems ridiculously low now that we can generate data at will. For a complex token classification task (editorial structure detection) I'm maxing out at 4M.
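For illustration, a minimal sketch of what a tiny encoder like that could look like in PyTorch. The four layers and the embedding-heavy parameter split come from the posts above; the subword vocabulary size, hidden size, and label set are my assumptions, not the actual model.

```python
# Illustrative sketch only: a tiny encoder for token-level editorial
# structure detection (title / running title / footnote / body).
# Sizes and label names are hypothetical; no positional encoding, kept minimal.
import torch
import torch.nn as nn

LABELS = ["body", "title", "running_title", "footnote"]  # hypothetical tag set

class TinyStructureTagger(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # With a 32k subword vocab, the embedding alone is ~8M parameters,
        # dwarfing the ~3M spent on the four transformer layers.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, len(LABELS))

    def forward(self, token_ids, attention_mask=None):
        x = self.embed(token_ids)
        pad_mask = None if attention_mask is None else ~attention_mask.bool()
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.head(x)  # per-token logits over structure labels

model = TinyStructureTagger()
print(sum(p.numel() for p in model.parameters()))  # roughly 11-12M with these sizes
```

With these (assumed) sizes the embedding dominates the parameter count, which is consistent with a ~12M total where only a few million sit in the actual layers.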
(from the reference book: hott.github.io/book/hott-on...)
There's even a genealogical case: the concepts of evidence/judgement come straight from Brentano (through Martin-Löf) and his own personal twist on Aristotelian logic.
I doubt anyone has ever done that, but there is a philosophical thesis to be written on the ties between Homotopy Type Theory and ancient logic.
It’s been going on for a few months. Strong secular religious feel all around.
Would definitely have appreciated seeing more of them in Alt-Edic.
Don't know how to tell Mistral that some of it already exists…
Calm looks definitely Common Corpus coded.
Just realized Common Corpus is (again!) in Anthropic's Transformer Circuits: new chapter on emotions transformer-circuits.pub/2026/emotion...
Just a little more RL and…
Hum. I have a slight Chinese doubt over Meta’s new model…
Meanwhile, it just occurred to me that the "AI con" is still on a book tour.
Meanwhile, the training section is "we used data", I guess.
Transformer Circuits once more confirmed to be the one oblique access to Anthropic's core research: the most interesting parts of the 240-page mythos report.
Here it's just markdown files, nothing fancy. The main change comes from the evolution of the models, which are much more effective at searching on their own.
Maybe it's too soon to talk about a vibe shift toward data research in AI, but it's definitely easier to get papers accepted.
One of the weirdest parts of my job right now is being naturally led to read philosophical works for practical data pipeline design.
Apparently not by a small margin, but I'll wait this week for someone else to rerun the evals.
For a change, it's the data.
Hmm. Seems SOTA.
Accuracy close to ModernBERT and less than 100 GPU hours to process all of Common Corpus. Exactly what I've needed all along.
Looks like I'm bound to reinvent all the pretraining tooling around sub-4M byte-level models. Classification done, language detection next.
Not so much a stretch: for classification I’m training exclusively byte-level tiny models now.
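Again purely illustrative: a sketch of what a sub-4M byte-level classifier could look like, where the 256-entry byte vocabulary keeps the embedding tiny, so nearly all parameters go to the layers. The exact sizes, pooling choice, and label count are my assumptions, not the actual models.

```python
# Illustrative sketch only: a sub-4M byte-level document classifier
# (usable for document-type or language classification). Hypothetical sizes.
import torch
import torch.nn as nn

class TinyByteClassifier(nn.Module):
    def __init__(self, n_classes=10, d_model=256, n_layers=4, n_heads=4, vocab=259):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)  # byte vocab: only ~66k parameters
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, byte_ids):
        x = self.encoder(self.embed(byte_ids))
        # mean-pool over the byte sequence, then classify the whole document
        return self.head(x.mean(dim=1))

def encode(text: str, max_len: int = 2048) -> torch.Tensor:
    """Raw UTF-8 bytes as token ids; no tokenizer to train or adapt."""
    return torch.tensor(list(text.encode("utf-8"))[:max_len]).unsqueeze(0)

model = TinyByteClassifier()
print(sum(p.numel() for p in model.parameters()))  # ~3.2M, under a 4M ceiling
logits = model(encode("Un exemple de texte à classer."))
```

The design point is simply that swapping a subword vocabulary for raw bytes moves the parameter budget out of the embedding table, which is what makes the sub-4M ceiling plausible for these classification tasks.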