BPE-knockout just got outperformed by an algorithm that modifies BPE tokenisers in a feedback loop to make them absorb more and more constraints. It doesn't even need more data to do that. It uses the tokeniser itself as a dataset. 🧵
Posts by LAGoM NLP
📅 March 27, 11:00 (Findings Poster Session 6)
📅 March 29 at the SIGTYP Workshop
"TIPA: Typologically Informed Parameter Aggregation"
A training-free method to boost cross-lingual performance by mixing existing language adapters based on typological similarity.
by Stef Accou and @wpoelman.bsky.social
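The aggregation idea can be sketched in a few lines. This is a hypothetical illustration, not the TIPA implementation: the function name, the softmax weighting, and the toy similarity scores are all assumptions; the only grounded idea is mixing existing adapters by typological similarity without any training.

```python
import math

def aggregate_adapters(adapters, similarities):
    """Hypothetical sketch (names and weighting scheme are assumptions):
    mix language-adapter parameters with a softmax over typological
    similarity scores -- no gradient steps, purely training-free."""
    m = max(similarities)
    exps = [math.exp(s - m) for s in similarities]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Element-wise weighted average of every parameter vector.
    return {
        name: [
            sum(w * ad[name][i] for w, ad in zip(weights, adapters))
            for i in range(len(adapters[0][name]))
        ]
        for name in adapters[0]
    }

# Toy example: two source-language adapters, one 4-value parameter each.
adapter_de = {"W": [1.0, 0.0, 0.0, 1.0]}
adapter_nl = {"W": [3.0, 0.0, 0.0, 3.0]}
# Equal assumed similarity to the target language -> element-wise mean.
mixed = aggregate_adapters([adapter_de, adapter_nl], [0.5, 0.5])
print(mixed["W"])
```

With equal similarity scores the softmax yields equal weights, so the mixed adapter is simply the mean of the two source adapters.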
📅 March 25, 16:30 (Structural Foundations and Cross-Lingual Representations Oral Session)
📅 March 28 at the Multilingual and Multicultural Evaluation Workshop
"Form and Meaning in Intrinsic Multilingual Evaluations"
Parallel data might not be enough for fair cross-lingual evaluation: consistency in meaning does not neutralize differences in form.
by @wpoelman.bsky.social and @mdlhx.bsky.social
📅 March 25, 16:30 (Findings Poster Session 3)
"ReBPE: Iteratively Improving the Internal Structure of a Structured Tokeniser by Mining its Internal Structure"
If you are tired of BPE tokenisers not segmenting words like you want, you can now instruct them with examples to create their own constraints.
by Thomas Bauwens and @mdlhx.bsky.social
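To give a flavour of constraint-aware BPE segmentation (this is a toy illustration in the spirit of BPE-knockout, not the ReBPE algorithm from the paper; the function, merge table, and boundary encoding are all invented for the example): a desired segmentation like "un|holy" can be enforced by refusing any merge that would erase the wanted boundary.

```python
def constrained_bpe_segment(word, merges, boundaries):
    """Illustrative sketch only (not the ReBPE algorithm): apply BPE
    merges in priority order, but block any merge whose resulting token
    would straddle a desired boundary (a character offset in `word`)."""
    tokens = list(word)
    for a, b in merges:
        out, pos, i = [], 0, 0  # pos = character offset of tokens[i]
        while i < len(tokens):
            if (i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b
                    and pos + len(a) not in boundaries):
                out.append(a + b)       # merge is allowed
                pos += len(a) + len(b)
                i += 2
            else:
                out.append(tokens[i])   # keep token as-is
                pos += len(tokens[i])
                i += 1
        tokens = out
    return tokens

# Invented toy merge table; boundary {2} demands the split "un|holy".
merges = [("u", "n"), ("h", "o"), ("l", "y"), ("ho", "ly"), ("un", "holy")]
print(constrained_bpe_segment("unholy", merges, boundaries={2}))
print(constrained_bpe_segment("unholy", merges, boundaries=set()))
```

With the boundary active, the final merge ("un", "holy") is blocked and the word comes out as two tokens; without it, the merges run to completion.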
LAGoM will present several papers at #EACL2026 in Rabat next week! Our work at this year’s conference spans tokenisation, multilingual evaluation, and model design.
New EACL paper (with @mdlhx.bsky.social)! We tested whether comparing the perplexity of parallel data across languages is fair. Turns out: it depends. We show that the choice of test set (even with consistent meaning) can flip conclusions about which language is easier to model.
Paper: arxiv.org/abs/2601.10580
Authors: @wpoelman.bsky.social, Thomas Bauwens and @mdlhx.bsky.social
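To make the comparison concrete, here is a minimal sketch of per-language perplexity on two parallel test sets. The per-token log-probabilities are invented purely to illustrate how the ranking can flip between test sets; they are not results from the paper.

```python
import math

def perplexity(logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Invented per-token log-probabilities (natural log) for two languages
# on two parallel test sets with the same meaning.
set_a = {"lang1": [-2.0, -2.0, -2.0], "lang2": [-2.5, -2.5]}
set_b = {"lang1": [-3.0, -3.0, -3.0], "lang2": [-2.5, -2.5]}

for name, data in [("set A", set_a), ("set B", set_b)]:
    ppl = {lang: perplexity(lp) for lang, lp in data.items()}
    easier = min(ppl, key=ppl.get)
    print(f"{name}: easier to model = {easier}")
```

On set A, lang1 looks easier to model; on set B, lang2 does, even though both sets are "the same" in meaning.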
We are presenting this paper at #EMNLP2025 in the “Multilinguality and Language Diversity” oral session this Wednesday (November 5th) from 11:00-12:30 (UTC+8). Paper: aclanthology.org/2025.emnlp-m... Code: github.com/LAGoM-NLP/Co...
Our proposed tokenizer metrics are a step in that direction.
We disentangle more such factors in an attempt to outline what the "ideal" experiment would look like and how to work backwards to a feasible setup. This way, we spell out the requirements for reliably answering whether, and how, morphology relates to language modeling.
Finally, we examine experimental factors that confounded the experiments and conclusions of prior research; coarse language grouping is one of several such confounds.
What's more: using entropy allows for finer-grained ordering of languages than the coarse groupings of "agglutinative" and "fusional".
We compute the normalized entropy over each token's distribution of neighbors, and indeed find that agglutinative languages tend to have higher entropy than fusional languages on average.
To measure this token ambiguity, we revisit the idea of accessor variety (AV) from Harris (1955) and Feng et al. (2004), counting which tokens neighbor each other in a corpus and how often.
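The neighbor counting and entropy described in this thread can be sketched as follows. The exact formulation (right-hand neighbors only, normalization by the log of the accessor variety) is an assumption for illustration, not necessarily the paper's definition.

```python
import math
from collections import Counter, defaultdict

def neighbor_entropies(tokens):
    """Sketch (details assumed): for each token, count its right-hand
    neighbors in the corpus, then compute the entropy of that neighbor
    distribution, normalized by log(accessor variety)."""
    right = defaultdict(Counter)
    for left, nxt in zip(tokens, tokens[1:]):
        right[left][nxt] += 1
    result = {}
    for tok, counts in right.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        h = -sum(p * math.log(p) for p in probs)
        av = len(counts)  # accessor variety: number of distinct neighbors
        result[tok] = h / math.log(av) if av > 1 else 0.0
    return result

corpus = ["the", "cat", "the", "dog", "the", "cat"]
ents = neighbor_entropies(corpus)
print(ents["the"])  # "the" is followed by "cat" twice and "dog" once
```

A token whose neighbors are spread evenly gets entropy near 1; a token with a single possible neighbor (like "cat" here, always followed by "the") gets 0.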
In our new #EMNLP2025 paper, we argue that such statistics should relate directly to what a language model actually does: reliably predicting the next token produced by its tokenizer. Specifically, if the most recent token has more contextual ambiguity, it is harder to predict the next token. We then hypothesize that this contextual ambiguity is higher in morphologically complex languages.
When is a language hard to model? Previous research has suggested both that morphological complexity does and that it does not play a role, but those studies relate the performance of language models to corpus statistics of words or subword tokens in isolation.
Ok, added the ones that were missing from yours to ours
✅
You're included in the NLP labs starter pack, see go.bsky.app/LKGekew
* (Findings) Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in BERT by Elke Vandermeerschen and @mdlhx.bsky.social, presented by Elke. URL: aclanthology.org/2025.finding...
Our group has two papers at #acl2025:
* (Main) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model by Thomas Bauwens, David Kaczér and @mdlhx.bsky.social, presented by Thomas. URL: aclanthology.org/2025.acl-lon...
The submission deadline for #CLIN35 has been extended by one week! New deadline: June 20th. 🔊 Spread the word! More info: clin35.ccl.kuleuven.be/call-for-abs...
Reminder, a few more days to apply!
📅 Don't forget! The deadline for submitting your abstract to the #CLIN conference in Leuven is coming: 13th of June! Submitting is easy: name, title of your work, 500-word abstract, done! #nlp #nlproc #compling #llm #ai #dutch clin35.ccl.kuleuven.be
We are hiring in #nlproc!!
✅