ACL 2025, the world’s largest NLP conference with almost 2,000 papers presented, just took place in Vienna! 🎓✨ Here is a quick snapshot of the event via a short interview with one of the authors whose work caught my attention.
🎥 Watch: youtu.be/GBISWggsQOA
ACL paper: aclanthology.org/2023.acl-lon...
Models: github.com/Heidelberg-N...
Read more: cl.uni-heidelberg.de/nlpgroup/new...
Morphological Analysis Demo: huggingface.co/spaces/bowph...
Machine Translation Demo: huggingface.co/spaces/bowphs/
Best Thesis Award: www.gscl.org/en/activitie...
I am honored to receive the 2025 #GSCL Best Thesis Award at #KONVENS in Hildesheim for my Master’s thesis, which investigates multilinguality and develops language models for Ancient Greek and Latin. Thank you to my mentors and collaborators. I look forward to what comes next.
Looking at Bruegel's Tower of Babel in Vienna makes you wonder: how can multilingual language models overcome language barriers? Find out tomorrow!
📍 Level 1 (ironic, right?), Room 1.15-1
🕐 2 PM
#ACL2025NLP
Read the full paper here: arxiv.org/pdf/2506.01629
Reach out if you have any questions or if you are attending ACL and want to say hi. 🙋
Table comparing text generations between early and late checkpoints for the concepts "earthquake" and "joy". Early checkpoint generations show language-specific text, while late checkpoint generations demonstrate a shift toward "language-agnostic" (= English) text.
This phenomenon has a visible effect on text generation: In BLOOM-560m, activating 'earthquake' neurons derived from Spanish data at checkpoint 10,000 generates Spanish text. At checkpoint 400,000, the same method yields English text!
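A toy sketch of that intervention (purely illustrative; BLOOM's real unembedding, neuron indices, and weights are nothing like this): the same concept neuron is boosted, but decoding through an early vs. a late projection surfaces different-language tokens.

```python
import numpy as np

# Toy model of concept-neuron steering. Rows of W index hidden neurons,
# columns index vocabulary items. All weights here are made up.
vocab = ["terremoto", "earthquake"]

W_early = np.array([[2.0, 0.1],   # neuron 0 still maps to Spanish "terremoto"
                    [0.1, 1.0]])
W_late  = np.array([[0.1, 2.0],   # neuron 0 repurposed: now surfaces English
                    [1.0, 0.1]])

def steer_and_decode(W, neuron, strength=3.0):
    """Boost one hidden neuron, project to the vocabulary, take the argmax."""
    hidden = np.zeros(W.shape[0])
    hidden[neuron] += strength
    logits = hidden @ W
    return vocab[int(np.argmax(logits))]

# Same neuron, same strength -- different checkpoint, different language.
print(steer_and_decode(W_early, neuron=0), steer_and_decode(W_late, neuron=0))
```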
Average overlap proportion of expert neurons across layers and training checkpoints. Later checkpoints exhibit more shared neurons, particularly in the middle layers.
This is not a bug, it's a feature! These layers are repurposing the space to form cross-lingual abstractions.
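The overlap statistic behind that figure can be illustrated with a toy computation (the paper's exact metric and neuron sets are not shown in these posts; the neuron IDs below are invented, and mean Jaccard overlap stands in for the reported proportion):

```python
from itertools import combinations

def avg_pairwise_overlap(expert_sets):
    """Mean Jaccard overlap of expert-neuron sets over all language pairs."""
    pairs = list(combinations(expert_sets, 2))
    scores = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(scores) / len(scores)

# Hypothetical expert-neuron sets per language for one middle layer.
early = {"es": {1, 2, 3, 4}, "fr": {5, 6, 7, 8}, "zh": {9, 10, 11, 12}}
late  = {"es": {1, 2, 3, 4}, "fr": {2, 3, 4, 5}, "zh": {3, 4, 5, 6}}

early_overlap = avg_pairwise_overlap(list(early.values()))
late_overlap  = avg_pairwise_overlap(list(late.values()))
print(early_overlap, late_overlap)  # later checkpoint: languages share neurons
```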
We track this by examining how specific concepts (like "earthquake" or "joy") align across languages.
We ask a probing classifier: "Given this hidden state from layer l, what is the language of the source text?" The results are striking: earlier checkpoints consistently solve this with high accuracy across layers. Later checkpoints, however, exhibit clear performance drops.
Probing classifier performance comparison between early and late checkpoint across layers. While the early checkpoint shows uniformly high performance, the later checkpoint exhibits relatively high variance across layers.
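A minimal sketch of the probing idea (synthetic "hidden states" and a nearest-centroid probe stand in for the real representations and trained classifier; all numbers are invented): when language clusters collapse into a shared space, language ID gets harder to read off the state.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_states(centers, n=50, noise=0.1):
    """Synthetic 'hidden states': one Gaussian cluster per language."""
    X, y = [], []
    for label, c in enumerate(centers):
        X.append(c + noise * rng.standard_normal((n, len(c))))
        y += [label] * n
    return np.vstack(X), np.array(y)

def probe_accuracy(X, y):
    """Nearest-centroid probe (fit on the same data, fine for a sketch)."""
    centroids = np.stack([X[y == k].mean(axis=0) for k in np.unique(y)])
    pred = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
    return (pred == y).mean()

# Early checkpoint: language clusters well separated -> probe succeeds.
early_X, early_y = make_states(centers=np.eye(3) * 5.0)
# Late checkpoint: clusters collapse toward a shared space -> probe degrades.
late_X, late_y = make_states(centers=np.eye(3) * 0.1)

early_acc = probe_accuracy(early_X, early_y)
late_acc = probe_accuracy(late_X, late_y)
print(early_acc, late_acc)
```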
How and when do multilingual LMs achieve cross-lingual generalization during pre-training? And why do later, supposedly more advanced, checkpoints lose some of their language identification abilities in the process? Our #ACL2025 paper investigates.
Debates aren’t always black and white: opposing sides often share common ground. These partial agreements are key to meaningful compromises.
Presenting “Perspectivized Stance Vectors” (PSVs) — an interpretable method to identify nuanced (dis)agreements
📜 arxiv.org/abs/2502.09644
🧵 More details below
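A hypothetical illustration of the idea (the paper's actual construction of PSVs may differ; the perspectives and stance values below are made up): instead of one global pro/con label, each side gets a stance per perspective, and partial agreements are the perspectives where the stances match.

```python
# Stances per perspective: +1 pro, -1 con, 0 neutral. All values invented.
PERSPECTIVES = ["economy", "health", "environment", "personal freedom"]

side_a = {"economy": +1, "health": -1, "environment": -1, "personal freedom": +1}
side_b = {"economy": -1, "health": -1, "environment": -1, "personal freedom": +1}

def common_ground(a, b):
    """Perspectives on which both sides take the same non-neutral stance."""
    return [p for p in PERSPECTIVES if a[p] == b[p] and a[p] != 0]

# Overall opponents (they disagree on the economy), yet they share ground:
print(common_ground(side_a, side_b))
```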
Read the full paper: aclanthology.org/2025.finding...
Work by Creston Brooks, Johannes Haubold, Charlie Cowen-Breen, Jay White, Desmond DeVaul, me, Karthik Narasimhan, and Barbara Graziosi
Our work brings new computational methods to a field traditionally dominated by manual scholarship, potentially accelerating the discovery of textual errors that have remained hidden for centuries.
Perhaps most surprising: even powerful models like GPT-4 performed barely above random chance on this specialized task! This highlights the limitations of general-purpose LLMs when dealing with ancient text restoration.
We tested several error detection methods and found that our discriminator-based approach outperforms all others. Interestingly, scribal errors (the oldest type) are universally more difficult to detect than print or digitization errors across ALL methods.
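A much-simplified stand-in for such an error detector (the paper trains a neural discriminator; here a character-bigram model flags the less plausible variant, and the "scribal slip" below is invented toy Latin, not Greek):

```python
import math
from collections import Counter

# Fit character-bigram counts on "clean" text, then score candidate
# passages by average smoothed bigram log-probability.
clean_corpus = "arma virumque cano troiae qui primus ab oris"

counts = Counter(zip(clean_corpus, clean_corpus[1:]))
total = sum(counts.values())

def plausibility(text):
    """Average smoothed log-probability of the text's character bigrams."""
    bigrams = list(zip(text, text[1:]))
    return sum(math.log((counts[b] + 1) / (total + 1)) for b in bigrams) / len(bigrams)

original = "arma virumque cano"
corrupted = "arma uirnmqve cano"   # invented, plausible-looking scribal slip
print(plausibility(original) > plausibility(corrupted))
```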
Prior work has only evaluated error detection on artificially generated errors. Our dataset contains REAL errors that naturally accumulated over centuries: the subtle mistakes that survived precisely because they often appear perfectly reasonable.
Creating this dataset was painstaking! Our domain expert spent over 100 hours reviewing potential errors, categorizing them as scribal errors (from manuscript copying), print errors (from creating editions), or digitization errors (from converting editions to digital text).
In "An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them," we introduce the first expert-labeled dataset of real errors in ancient texts, enabling proper evaluation of error detection methods on authentic textual problems.
What did Aristotle actually write? We think we know, but reality is messy. As ancient Greek texts traveled through 2,500 years of history, they were copied and recopied countless times, accumulating subtle errors with each generation. Our new #NAACL2025 paper tackles this fascinating challenge.