
Posts by LAGoM NLP


BPE-knockout just got outperformed by an algorithm that modifies BPE tokenisers in a feedback loop to make them absorb more and more constraints. It doesn't even need more data to do that. It uses the tokeniser itself as a dataset. 🧵

1 week ago 1 2 1 0
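The feedback loop described above can be pictured with a toy sketch. This is hypothetical code, not ReBPE's implementation: greedy longest-match stands in for real BPE merging, and all names and details are ours. The idea it illustrates: given a few example segmentations, knock out vocabulary units that cross the marked boundaries, then run the pruned tokeniser on its own vocabulary to mine further boundaries, which feed the next round.

```python
# Hypothetical sketch of a tokeniser-refinement feedback loop (not ReBPE's API).

def tokenize(word, vocab):
    """Greedy longest-match segmentation, single characters as fallback."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if j == i + 1 or word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def boundaries_of(pieces):
    """Internal boundary positions implied by a segmentation."""
    cuts, pos = set(), 0
    for p in pieces[:-1]:
        pos += len(p)
        cuts.add(pos)
    return cuts

def knock_out(vocab, word, cuts):
    """Drop every vocab unit that straddles a required boundary in `word`."""
    kept = set()
    for v in vocab:
        start, bad = word.find(v), False
        while start != -1 and not bad:
            bad = any(start < c < start + len(v) for c in cuts)
            start = word.find(v, start + 1)
        if not bad:
            kept.add(v)
    return kept

def refine(vocab, examples, rounds=3):
    """examples: {word: set of required boundary positions}."""
    vocab = set(vocab)
    for _ in range(rounds):
        for word, cuts in examples.items():
            vocab = knock_out(vocab, word, cuts)
        # Feedback step: use the tokeniser itself as the dataset. Any unit
        # that now decomposes into other vocabulary units contributes its
        # internal boundaries as new constraints for the next round.
        mined = {}
        for unit in sorted(vocab):
            pieces = tokenize(unit, vocab - {unit})
            if len(pieces) > 1 and all(p in vocab for p in pieces):
                mined[unit] = boundaries_of(pieces)
        examples = {**examples, **mined}
    return vocab
```

With the example "un|law|ful" (boundaries at 2 and 5), round one removes units like "unl" and "lawful"; the feedback step then notices that "unfair" decomposes into "un" + "fair" and removes it in the next round, even though it never appeared in an example, illustrating how constraints can propagate without extra data.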

📅 March 27, 11:00 (Findings Poster Session 6)
📅 March 29 at the SIGTYP Workshop

1 month ago 0 1 0 0

"TIPA: Typologically Informed Parameter Aggregation"
A training-free method to boost cross-lingual performance by mixing existing language adapters based on typological similarity.
by Stef Accou and @wpoelman.bsky.social

1 month ago 0 0 1 0
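A minimal sketch of the training-free aggregation idea described above (not the TIPA implementation): build an adapter for a target language as a convex combination of existing language adapters, weighted by typological similarity. The similarity scores and parameter names below are illustrative.

```python
# Illustrative training-free adapter mixing (names and values made up).

def mix_adapters(adapters, similarities):
    """adapters: {lang: {param_name: list of floats}}
    similarities: {lang: float}, typological similarity of each source
    language to the target language."""
    total = sum(similarities.values())
    weights = {lang: s / total for lang, s in similarities.items()}  # normalise
    langs = list(adapters)
    mixed = {}
    for name in adapters[langs[0]]:
        size = len(adapters[langs[0]][name])
        mixed[name] = [
            sum(weights[lang] * adapters[lang][name][i] for lang in langs)
            for i in range(size)
        ]
    return mixed
```

In a real setting the parameters would be the weight tensors of trained language adapters and the similarities could come from typological feature vectors; the point is that no gradient step is needed.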

📅 March 25, 16:30 (Structural Foundations and Cross-Lingual Representations Oral Session)
📅 March 28 at the Multilingual and Multicultural Evaluation Workshop

1 month ago 0 0 1 0

"Form and Meaning in Intrinsic Multilingual Evaluations"
Parallel data might not be enough for fair cross-lingual evaluation: consistency in meaning does not neutralize differences in form.
by @wpoelman.bsky.social and @mdlhx.bsky.social

1 month ago 0 0 1 0

📅 March 25, 16:30 (Findings Poster Session 3)

1 month ago 0 0 1 0

"ReBPE: Iteratively Improving the Internal Structure of a Structured Tokeniser by Mining its Internal Structure"
If you are tired of BPE tokenisers not segmenting words the way you want, you can now instruct them with examples to create their own constraints.
by Thomas Bauwens and @mdlhx.bsky.social

1 month ago 0 0 1 0

LAGoM will present several papers at #EACL2026 in Rabat next week! Our work at this year’s conference spans tokenisation, multilingual evaluation, and model design.

1 month ago 1 1 1 0
Form and Meaning in Intrinsic Multilingual Evaluations Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforwar...

New EACL paper (with @mdlhx.bsky.social)! We tested whether comparing perplexity on parallel data across languages is fair. Turns out: it depends. We show that the choice of test set (even with consistent meaning) can flip conclusions about which language is easier to model.

Paper: arxiv.org/abs/2601.10580

2 months ago 10 3 0 0
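For concreteness, here is a toy version of the quantity at issue (not the paper's code): converting token-level log-probabilities into bits-per-character, a common way to compare parallel test sets across languages whose tokenisers produce different numbers of tokens. The log-probabilities below are made up.

```python
import math

def bits_per_char(token_logprobs, text):
    """Total negative log-likelihood in nats, converted to bits,
    normalised by the character count of the text."""
    total_nats = -sum(token_logprobs)      # NLL in nats
    total_bits = total_nats / math.log(2)  # nats -> bits
    return total_bits / len(text)
```

Normalising by characters rather than tokens is what makes the metric comparable across tokenisations; the paper's point is that even then, which parallel test set you pick can change the ranking of languages.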

Authors: @wpoelman.bsky.social, Thomas Bauwens and @mdlhx.bsky.social

5 months ago 0 0 0 0
Confounding Factors in Relating Model Performance to Morphology Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.

We are presenting this paper at #EMNLP2025 in the “Multilinguality and Language Diversity” oral session this Wednesday (November 5th) from 11:00-12:30 (UTC+8). Paper: aclanthology.org/2025.emnlp-m... Code: github.com/LAGoM-NLP/Co...

5 months ago 3 0 1 0

Our proposed tokenizer metrics are a step in that direction.

5 months ago 0 0 1 0

We disentangle more such factors in an attempt to outline what the “ideal” experiment would look like and how to work backwards to a feasible setup. This way, we outline the requirements to reliably answer whether, and how, morphology relates to language modeling.

5 months ago 0 0 1 0

Finally, we take a look at experimental factors that confounded conclusions in prior research. Coarse language grouping is one of several such factors.

5 months ago 0 0 1 0

What's more: using entropy allows for finer-grained ordering of languages than the coarse groupings of "agglutinative" and "fusional".

5 months ago 2 0 1 0

We compute the normalized entropy over each token's distribution of neighbors, and indeed find that agglutinative languages tend to have higher entropy than fusional languages on average.

5 months ago 2 0 1 0

To measure this token ambiguity, we revisit the idea of accessor variety (AV) from Harris (1955) and Feng et al. (2004), counting which tokens neighbor each other in a corpus and how many times.

5 months ago 1 0 1 0
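A compact sketch of the statistic described in the two posts above: count each token's right-hand neighbours in a corpus, then take the entropy of that distribution, normalised by its maximum (the log of the number of distinct neighbours). Details here are illustrative, not the paper's exact formulation.

```python
import math
from collections import Counter, defaultdict

def neighbor_entropies(token_stream):
    """Normalised entropy of each token's right-neighbour distribution."""
    neighbors = defaultdict(Counter)
    for left, right in zip(token_stream, token_stream[1:]):
        neighbors[left][right] += 1
    entropies = {}
    for tok, counts in neighbors.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        h = -sum(p * math.log(p) for p in probs)
        # Normalise by the maximum entropy for this many distinct neighbours,
        # so values are comparable across tokens (and languages).
        entropies[tok] = h / math.log(len(counts)) if len(counts) > 1 else 0.0
    return entropies
```

A token followed by many different tokens with roughly equal frequency gets a value near 1; a token with one habitual successor gets 0.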

it is harder to predict the next token. We then hypothesize that this contextual ambiguity is higher in morphologically complex languages.

5 months ago 0 0 1 0

In our new #EMNLP2025 paper, we argue that such statistics should relate directly to what a language model actually does: reliably predicting the next token produced by its tokenizer. We argue that if the most recent token has more contextual ambiguity,

5 months ago 3 0 1 0

When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but it has done so by relating the performance of language models to corpus statistics of words or subword tokens in isolation.

5 months ago 7 3 1 0

Ok, added the ones that were missing from yours to ours

8 months ago 1 0 0 0

8 months ago 1 0 0 0

You're included in the NLP labs starter pack, see go.bsky.app/LKGekew

8 months ago 1 0 1 0
Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in BERT Elke Vandermeerschen, Miryam De Lhoneux. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

* (Findings) Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in BERT by Elke Vandermeerschen and @mdlhx.bsky.social, presented by Elke. URL: aclanthology.org/2025.finding...

8 months ago 3 0 0 0
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model Thomas Bauwens, David Kaczér, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

Our group has two papers at #acl2025:
* (Main) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model by Thomas Bauwens, David Kaczér and @mdlhx.bsky.social, presented by Thomas. URL: aclanthology.org/2025.acl-lon...

8 months ago 3 0 1 2
CLIN35 - Call for Abstracts We invite submissions for CLIN35, the 35th edition of the Computational Linguistics in the Netherlands (CLIN) conference, which will take place in Leuven on September 12th, 2025. Abstracts describing ...

The submission deadline for #CLIN35 has been extended by one week! New deadline: June 20th. 🔊 Spread the word! More info: clin35.ccl.kuleuven.be/call-for-abs...

10 months ago 1 3 0 0

Reminder, a few more days to apply!

10 months ago 2 3 0 0
CLIN35 Computational Linguistics in The Netherlands (CLIN) is a yearly conference on computational linguistics. Each year the conference is organized by a different institution in the Dutch-speaking region. ...

📅 Don't forget! The deadline for submitting your abstract to the #CLIN conference in Leuven is coming: 13th of June! Submitting is easy: name, title of your work, 500-word abstract, done! #nlp #nlproc #compling #llm #ai #dutch clin35.ccl.kuleuven.be

10 months ago 1 2 0 2

We are hiring in #nlproc!!

11 months ago 2 1 0 0

1 year ago 1 0 0 0