At least for now Brazil is 10/10 on things that matter (breakfast)
More formal announcement: I'll be at ICLR this week to present the oral on Common Corpus with Pavel Chizhov. Happy to discuss anything data research, synthetic pretraining, models, and products.
Mostly doing ablations/shifting layers/adapting the tokenizer to multilingual.
Yep.
Not surprised at all. I've been reorganizing my complete pretraining stack around models below 5M parameters.
Basically identifying titles/running titles/footnotes, etc. at pretraining scale (so the smaller, the better). Went with a four-layer custom encoder, actually 12M total, but most of it goes to the token embedding.
(the layers are still valid in other contexts, but otherwise)
Parameter ceiling seems ridiculously low now that we can generate data at will. For a complex token classification task (editorial structure detection) I'm maxing out at 4M.
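For illustration, a minimal sketch of what a tiny encoder like that could look like in PyTorch. The four layers and the embedding-heavy parameter split come from the posts above; the subword vocabulary size, hidden size, and label set are my assumptions, not the actual model.

```python
# Illustrative sketch only: a tiny encoder for token-level editorial
# structure detection (title / running title / footnote / body).
# Sizes and label names are hypothetical; no positional encoding, kept minimal.
import torch
import torch.nn as nn

LABELS = ["body", "title", "running_title", "footnote"]  # hypothetical tag set

class TinyStructureTagger(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # With a 32k subword vocab, the embedding alone is ~8M parameters,
        # dwarfing the ~3M spent on the four transformer layers.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, len(LABELS))

    def forward(self, token_ids, attention_mask=None):
        x = self.embed(token_ids)
        pad_mask = None if attention_mask is None else ~attention_mask.bool()
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.head(x)  # per-token logits over structure labels

model = TinyStructureTagger()
print(sum(p.numel() for p in model.parameters()))  # roughly 11-12M with these sizes
```

With these (assumed) sizes the embedding dominates the parameter count, which is consistent with a ~12M total where only a few million sit in the actual layers.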
(from the reference book: hott.github.io/book/hott-on...)
There's even a genealogical case: the concepts of evidence/judgement come straight from Brentano (through Martin-Löf) and his own personal twist on Aristotelian logic.
I doubt anyone has ever done that, but there is a philosophical thesis to be written on the ties between Homotopy Type Theory and ancient logic.
It’s been going on for a few months. Strong secular religious feel all around.
Would definitely have appreciated seeing more of them in Alt-Edic.
Don't know how to tell Mistral that some of it already exists…
Calm looks definitely Common Corpus coded.
Just realized Common Corpus is (again!) in Anthropic's Transformer Circuits: new chapter on emotions transformer-circuits.pub/2026/emotion...
Just a little more RL and…
Hum. I have a slight Chinese doubt over Meta’s new model…
Meanwhile, it just occurred to me that the "AI con" is still on a book tour.
Meanwhile, the training section is "we used data", I guess.
Transformer Circuits once more confirmed to be the one oblique access to Anthropic's core research: the most interesting parts of the 240-page mythos report.
Here it's just markdown files, nothing fancy. The main change comes from the evolution of the models, which are much more effective at searching on their own.
Maybe it's too soon to talk about a vibe shift toward data research in AI, but it's definitely easier to get papers accepted.
One of the weirdest parts of my job right now is being naturally led to read philosophical works for practical data pipeline design.
Apparently not by a small margin, but I'll wait this week for someone else to rerun the evals.
For a change, it's the data.
Hmm. Seems SOTA.
Accuracy close to ModernBERT and less than 100 GPU hours to process all of Common Corpus. Exactly what I've needed all along.
Looks like I'm bound to reinvent all the pretraining tooling around sub-4M byte-level models. Classification done, language detection next.
Not so much a stretch: for classification I’m training exclusively byte-level tiny models now.
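Again purely illustrative: a sketch of what a sub-4M byte-level classifier could look like, where the 256-entry byte vocabulary keeps the embedding tiny, so nearly all parameters go to the layers. The exact sizes, pooling choice, and label count are my assumptions, not the actual models.

```python
# Illustrative sketch only: a sub-4M byte-level document classifier
# (usable for document-type or language classification). Hypothetical sizes.
import torch
import torch.nn as nn

class TinyByteClassifier(nn.Module):
    def __init__(self, n_classes=10, d_model=256, n_layers=4, n_heads=4, vocab=259):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)  # byte vocab: only ~66k parameters
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, byte_ids):
        x = self.encoder(self.embed(byte_ids))
        # mean-pool over the byte sequence, then classify the whole document
        return self.head(x.mean(dim=1))

def encode(text: str, max_len: int = 2048) -> torch.Tensor:
    """Raw UTF-8 bytes as token ids; no tokenizer to train or adapt."""
    return torch.tensor(list(text.encode("utf-8"))[:max_len]).unsqueeze(0)

model = TinyByteClassifier()
print(sum(p.numel() for p in model.parameters()))  # ~3.2M, under a 4M ceiling
logits = model(encode("Un exemple de texte à classer."))
```

The design point is simply that swapping a subword vocabulary for raw bytes moves the parameter budget out of the embedding table, which is what makes the sub-4M ceiling plausible for these classification tasks.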