Tomorrow 2pm I am giving a talk at the AfricaNLP workshop about actionable interpretability for low-resource language modelling. Stop by if you're interested in the intersection of interpretability and data-efficient modelling!
Posts by Francois Meyer
Don't miss the BabyBabelLM talk on Friday! Find the paper on developmentally plausible data in many languages here: aclanthology.org/2026.eacl-lo...
I am in Rabat for @eaclmeeting.bsky.social!
I am giving an invited talk at AfricaNLP on Saturday about actionable interpretability for low-resource language modelling.
@jumelet.bsky.social is presenting BabyBabelLM on Friday in the 11AM session on Parameter-Efficient Tuning and Training Dynamics.
I am at @aaclmeeting.bsky.social in Mumbai to present our paper "The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages". I'm giving a talk tomorrow at 2pm in Room 21, in the MC-OP2 session on Multilingualism and Cross-Lingual NLP.
I will be at @aaclmeeting.bsky.social in Mumbai to present this!
Our model enables a new type of analysis by tracking subwords as a learnable part of LM training. Like other LM dynamics, subword learning progresses in clear stages. Optimal subwords change over time, so using fixed tokenisers like BPE/ULM might be constraining model learning.
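For contrast with the learnable setup: here is a minimal sketch of standard BPE training (word counts are made up for illustration). The merge rules are learned once from corpus statistics and then frozen, so the tokenisation cannot adapt as the LM trains.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict.
    Once learned, these merges are frozen: the tokenisation
    never changes during LM training."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
```

A greedy, one-shot procedure like this is exactly what the paper's learnable segmentation relaxes: under T-SSLM, which subwords are "best" can keep shifting with training.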
We see 4 stages of subword learning.
(1) Initially, subwords change rapidly.
(2) Next, learning trajectories undergo a sudden shift (around 30% in the plot below).
(3) After a while, subword boundaries stabilise.
(4) In finetuning, subwords change again to suit downstream tasks.
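One hypothetical way to make these stages visible, assuming per-word segmentations have been dumped at each checkpoint, is to score boundary agreement between consecutive checkpoints. Function names and the isiXhosa example below are illustrative, not from the paper:

```python
def boundaries(segmentation):
    """Split positions of a segmented word,
    e.g. ["ndi", "ya", "hamba"] -> {3, 5}."""
    pos, cuts = 0, set()
    for piece in segmentation[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts

def boundary_agreement(seg_a, seg_b):
    """F1 between the boundary sets of two segmentations
    of the same word (1.0 = identical segmentation)."""
    a, b = boundaries(seg_a), boundaries(seg_b)
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    prec = len(a & b) / len(b)
    rec = len(a & b) / len(a)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def checkpoint_stability(ckpt_segs):
    """Mean boundary agreement between consecutive checkpoints:
    low scores mark rapid-change stages, scores near 1.0 mark
    stabilisation."""
    scores = []
    for prev, curr in zip(ckpt_segs, ckpt_segs[1:]):
        per_word = [boundary_agreement(s1, s2) for s1, s2 in zip(prev, curr)]
        scores.append(sum(per_word) / len(per_word))
    return scores
```

Plotting these per-checkpoint scores over training would surface stages (1)-(3) as a dip, a shift, and a plateau.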
We study subword learning for 3 morphologically diverse languages: isiXhosa is agglutinative, Setswana is disjunctive (morphemes space-separated), and English is a typological middle ground. Learning dynamics vary across languages, with agglutinative isiXhosa being most unstable.
T-SSLM (Transformer Subword Segmental LM) marginalises over tokenisation candidates and learns which subwords optimise its training objective. We extract its learned subwords over the course of training, using metrics like fertility and productivity to track subword properties.
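The marginalisation can be sketched as a character-level forward DP. Everything below is a toy stand-in: in T-SSLM the subword scores come from a Transformer conditioned on context, whereas here a fixed unigram table (made-up probabilities over illustrative isiXhosa subwords) plays that role.

```python
import math

# Toy subword "lexicon" with log-probabilities. In the actual
# T-SSLM these scores are produced by a Transformer in context;
# this fixed table is only for illustration.
LOGP = {
    "ndi": math.log(0.2), "ya": math.log(0.2), "hamba": math.log(0.1),
    "ndiya": math.log(0.05), "yahamba": math.log(0.02),
}
CHAR_LOGP = math.log(0.01)  # back-off score for single characters

def subword_logp(s):
    if s in LOGP:
        return LOGP[s]
    if len(s) == 1:
        return CHAR_LOGP
    return float("-inf")  # unknown multi-character subword

def marginal_logp(word, max_len=8):
    """Forward DP over all segmentations of `word`:
    alpha[i] = logsumexp_j(alpha[j] + logp(word[j:i]))."""
    alpha = [float("-inf")] * (len(word) + 1)
    alpha[0] = 0.0
    for i in range(1, len(word) + 1):
        terms = []
        for j in range(max(0, i - max_len), i):
            lp = subword_logp(word[j:i])
            if alpha[j] > float("-inf") and lp > float("-inf"):
                terms.append(alpha[j] + lp)
        if terms:
            m = max(terms)
            alpha[i] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha[-1]
```

Because the objective sums over every segmentation rather than committing to one, gradient updates can shift probability mass between segmentations, which is what makes the subwords themselves learnable.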
Tokenisation is usually fixed, so research on LM learning dynamics (how grammar/knowledge emerges during training) excludes subword learning. We create an architecture that learns tokenisation during training and study how its subword units evolve across checkpoints.
If a language model could dynamically optimise subword tokenisation, how would its subwords evolve during training? In our new paper we study the learning dynamics of subword segmentation:
arxiv.org/pdf/2511.09197
Introducing BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data!
LLMs learn from vastly more data than humans ever experience. BabyLM challenges this paradigm by focusing on developmentally plausible data.
We extend this effort to 45 new languages!
Do you really want to see what multilingual effort looks like?
Here's the proof! BabyBabelLM is the first Multilingual Benchmark of Developmentally Plausible Training Data, available for 45 languages to the NLP community.
arxiv.org/abs/2510.10159
Today our poster will be up at @loreslm.bsky.social Poster Session #2 (2-3pm local time Abu Dhabi).
It's also available online at Whova: whova.com/portal/webap...
This work was carried out by three great UCT CS Honours students - Alexis, Charl, and Hishaam.
This work unites two directions of research: cognitively plausible modelling and NLP for low-resource languages. We hope more researchers pursue work at the intersection of these two subfields, since they share the goal of improving data-efficiency in the era of scaling.
However, unlike in the original BabyLM challenge, our isiXhosa BabyLMs do not outperform all skylines. We attribute this to a lack of developmentally plausible isiXhosa data. The success of English BabyLMs is due to both modelling innovations and highly curated pretraining data.
We pretrain two of the top BabyLM submissions (ELC-BERT and MLSM) for isiXhosa and evaluate them on isiXhosa POS tagging, NER, and topic classification. The BabyLMs outperform an isiXhosa RoBERTa, and ELC-BERT even outperforms XLM-R on two tasks.
The BabyLM challenge (babylm.github.io) produced new sample-efficient architectures. We investigate the potential of BabyLMs to improve LMs for low-resource languages with limited pretraining data. As a case study we use isiXhosa, a language with corpora similar in size to BabyLM strict-small.
Our paper "BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context" will be presented at The First Workshop on Language Models for Low-Resource Languages at #COLING2025 in Abu Dhabi.
Paper: arxiv.org/pdf/2501.03855