
Posts by Francois Meyer

Tomorrow at 2pm I am giving a talk at the AfricaNLP workshop about actionable interpretability for low-resource language modelling. Stop by if you're interested in the intersection of interpretability and data-efficient modelling!

3 weeks ago

Don't miss the BabyBabelLM talk on Friday! Find the paper on developmentally plausible data in many languages here: aclanthology.org/2026.eacl-lo...

4 weeks ago

I am in Rabat for @eaclmeeting.bsky.social!

I am giving an invited talk at AfricaNLP on Saturday about actionable interpretability for low-resource language modelling.

@jumelet.bsky.social is presenting BabyBabelLM on Friday in the 11AM session on Parameter-Efficient Tuning and Training Dynamics.

4 weeks ago

I am at @aaclmeeting.bsky.social in Mumbai to present our paper "The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages". I'm giving a talk tomorrow at 2pm in Room 21, in the MC-OP2 session on Multilingualism and Cross-Lingual NLP.

4 months ago

I will be at @aaclmeeting.bsky.social in Mumbai to present this!

5 months ago

Our model enables a new type of analysis by tracking subwords as a learnable part of LM training. Like other LM dynamics, subword learning progresses in clear stages. Optimal subwords change over time, so using fixed tokenisers like BPE/ULM might be constraining model learning.

5 months ago

We see 4 stages of subword learning.
(1) Initially, subwords change rapidly.
(2) Next, learning trajectories undergo a sudden shift (around 30% in the plot below).
(3) After a while, subword boundaries stabilise.
(4) In finetuning, subwords change again to suit downstream tasks.

5 months ago

We study subword learning for 3 morphologically diverse languages: isiXhosa is agglutinative, Setswana is disjunctive (morphemes are space-separated), and English is a typological middle ground. Learning dynamics vary across languages, with agglutinative isiXhosa being the most unstable.

5 months ago

T-SSLM (Transformer Subword Segmental LM) marginalises over tokenisation candidates and learns which subwords optimise its training objective. We extract its learned subwords over the course of training, using metrics like fertility and productivity to track subword properties.
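To illustrate one of these metrics: fertility is commonly computed as the average number of subwords a tokeniser produces per word. A minimal sketch, assuming an illustrative `fertility` helper and a toy fixed-length tokeniser (neither is the paper's code):

```python
def fertility(words, tokenise):
    """Average number of subwords produced per word (lower = more word-like units)."""
    pieces = [tokenise(w) for w in words]
    return sum(len(p) for p in pieces) / len(words)

# Toy tokeniser: fixed 3-character chunks, a stand-in for a learned segmenter.
toy_tokenise = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]

print(fertility(["ndiyahamba", "cat"], toy_tokenise))  # → 2.5 (4 + 1 pieces over 2 words)
```

Tracking a statistic like this across training checkpoints is what makes it possible to see how a model's learned subword properties drift over time.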

5 months ago

Tokenisation is usually fixed, so research on LM learning dynamics (how grammar/knowledge emerges during training) excludes subword learning. We create an architecture that learns tokenisation during training and study how its subword units evolve across checkpoints.
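To make "learning tokenisation" concrete: a subword segmental model scores every possible segmentation of its input and sums over them with dynamic programming. A minimal sketch under a toy unigram subword model (the `marginal_prob` function and the probability table are illustrative assumptions; T-SSLM conditions subword probabilities on context rather than using fixed unigram scores):

```python
def marginal_prob(word, probs, max_len=4):
    """Total probability of a word, summed over all segmentations into known subwords."""
    # alpha[j] holds the probability mass of all segmentations of word[:j].
    alpha = [0.0] * (len(word) + 1)
    alpha[0] = 1.0
    for j in range(1, len(word) + 1):
        for i in range(max(0, j - max_len), j):
            piece = word[i:j]
            if piece in probs:
                alpha[j] += alpha[i] * probs[piece]
    return alpha[-1]

probs = {"un": 0.1, "do": 0.1, "undo": 0.02}
print(marginal_prob("undo", probs))  # ≈ 0.03: "un"+"do" (0.1*0.1) plus "undo" (0.02)
```

Training against this marginal lets gradient descent shift probability mass between competing segmentations, which is why the model's preferred subwords can change over the course of training.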

5 months ago

If a language model could dynamically optimise subword tokenisation, how would its subwords evolve during training? In our new paper we study the learning dynamics of subword segmentation:
arxiv.org/pdf/2511.09197

5 months ago

๐ŸŒIntroducing BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data!

LLMs learn from vastly more data than humans ever experience. BabyLM challenges this paradigm by focusing on developmentally plausible data

We extend this effort to 45 new languages!

6 months ago

๐ƒ๐จ ๐ฒ๐จ๐ฎ ๐ซ๐ž๐š๐ฅ๐ฅ๐ฒ ๐ฐ๐š๐ง๐ญ ๐ญ๐จ ๐ฌ๐ž๐ž ๐ฐ๐ก๐š๐ญ ๐ฆ๐ฎ๐ฅ๐ญ๐ข๐ฅ๐ข๐ง๐ ๐ฎ๐š๐ฅ ๐ž๐Ÿ๐Ÿ๐จ๐ซ๐ญ ๐ฅ๐จ๐จ๐ค๐ฌ ๐ฅ๐ข๐ค๐ž? ๐Ÿ‡จ๐Ÿ‡ณ๐Ÿ‡ฎ๐Ÿ‡ฉ๐Ÿ‡ธ๐Ÿ‡ช

Here's the proof! BabyBabelLM is the first Multilingual Benchmark of Developmentally Plausible Training Data, available for 45 languages to the NLP community 🎉

arxiv.org/abs/2510.10159

6 months ago

Today our poster will be up at @loreslm.bsky.social Poster Session #2 (2-3pm local time Abu Dhabi).

It's also available online at Whova: whova.com/portal/webap...

1 year ago

This work was carried out by three great UCT CS Honours students - Alexis, Charl, and Hishaam.

1 year ago

This work unites two directions of research: cognitively plausible modelling and NLP for low-resource languages. We hope more researchers pursue work at the intersection of these two subfields, since they share the goal of improving data-efficiency in the era of scaling.

1 year ago

However, unlike in the original BabyLM challenge, our isiXhosa BabyLMs do not outperform all skylines. We attribute this to a lack of developmentally plausible isiXhosa data. The success of English BabyLMs is due to both modelling innovations and highly curated pretraining data.

1 year ago

We pretrain two of the top BabyLM submissions (ELC-BERT and MLSM) for isiXhosa and evaluate them on isiXhosa POS tagging, NER, and topic classification. The BabyLMs outperform an isiXhosa RoBERTa, and ELC-BERT even outperforms XLM-R on two tasks.

1 year ago

The BabyLM challenge (babylm.github.io) produced new sample-efficient architectures. We investigate the potential of BabyLMs to improve LMs for low-resource languages with limited pretraining data. As a case study we use isiXhosa, a language with corpora similar in size to BabyLM strict-small.

1 year ago

Our paper "BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context" will be presented at The First Workshop on Language Models for Low-Resource Languages at #COLING2025 in Abu Dhabi.

Paper: arxiv.org/pdf/2501.03855

1 year ago