
Posts by Jindřich Libovický

Nice work by my student @gianlucavico.bsky.social on a topic close to home: crowdsourcing Piedmontese to test LLMs on non-standard orthography. New dataset covering tokenization, classification & translation.

3 weeks ago

On the Credibility of Evaluating LLMs using Survey Questions
by @jlibovicky.bsky.social
aclanthology.org/2026.mme-mai...
Survey-based LLM value evals are unreliable — prompting & decoding choices skew results drastically.
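A minimal sketch of why decoding choice matters (illustrative numbers only, not from the paper): with the same option logits over a Likert scale, greedy decoding and the model's full answer distribution can report very different "values".

```python
import math

# Hypothetical logits an LLM could assign to Likert options 1..5
logits = {1: 1.4, 2: 0.2, 3: 0.3, 4: 1.5, 5: 0.2}

# Softmax over the option logits -> the model's full answer distribution
z = sum(math.exp(v) for v in logits.values())
dist = {k: math.exp(v) / z for k, v in logits.items()}

# Greedy decoding reports only the single most likely option...
greedy = max(logits, key=logits.get)

# ...while the distribution's mean tells a different story
expected = sum(k * p for k, p in dist.items())

print(greedy)              # 4 (leaning "agree")
print(round(expected, 2))  # ~2.8 (leaning "disagree")
```

With a bimodal distribution like this one, the argmax sits on one mode while the expectation lands near the middle, so the two evaluation protocols disagree about what the model "believes".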

3 weeks ago

So proud of my new PhD student Adnan Al Ali for presenting their master's thesis work at EACL! 🎓 A great contribution to understanding bias in AI text detectors across languages.

3 weeks ago

I'm at #EACL2026 in Rabat 🇲🇦. Find me and talk to me about tokenization or multilingual model eval.

Also, check out our work on how to eval morphological plausibility of your tokenizer if you don't have gold segmentation data, but you happen to have morphosyntactic features 👇

3 weeks ago

Thanks for organizing this! I am definitely coming, and everyone interested in tokenization at #EACL2026 should too. 🫵

1 month ago
Natural Language Processing | ÚFAL

Intro to NLP for bachelor students ufal.mff.cuni.cz/courses/npfl...
Most of it would be History of NLP in the Stanford slides 😀

1 month ago

None of this would work without my TAs: Dušan Variš, Tomáš Musil, Jan Bronec, @gianlucavico.bsky.social , Adnan Al Ali, Kristýna Onderková, and @straka-milan.bsky.social taking care of ReCodEx: recodex.mff.cuni.cz. Thank you 🙏

1 month ago

Some students find the assignments too time-consuming. Fair. But here's what the data shows over 3 years:
📉 Forum questions dropped ~4×
📈 Full bonus points: 20% → 27% → 52%
📉 Avg. test attempts: 2.7 → 2.4 → 1.9

Asking less, achieving more, iterating less. 🤔

1 month ago
Introduction to Machine Learning with Python | ÚFAL

3rd run of teaching ML to 250+ bachelor students (with great materials originally by @straka-milan.bsky.social). Core philosophy: explain the math, implement algorithms from scratch, Kaggle-style competitions, all auto-graded. ufal.mff.cuni.cz/courses/npfl...

But look what LLMs did to the course 👇

1 month ago

Spent time making AI-generated images of Bayes' Rule, Laplace Smoothing, Markov Chains & Shannon Entropy for class today 🎨🤖 Even though the images are objectively hilarious, none of the 50 students in the room laughed. Or even smiled. 💀

1 month ago

More interesting: humans are predictably inconsistent in their values. LLMs capture this but overgeneralize: they become more stereotypically consistent than actual humans.

After several rejections, finally publishable. To appear at the Multilingual Multicultural Evaluation workshop at EACL 2026.

2 months ago

I reviewed papers evaluating LLM values using sociology questionnaires. Different methods, different results. Didn't trust them, so I tested it myself.
Methodology matters. Short answers vs. chain-of-thought, squared error vs. KL divergence: each choice changes which populations an LLM "aligns" with.
www.arxiv.org/pdf/2602.04033
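A toy illustration of the metric-choice point (made-up distributions, not the paper's data): the same model answer distribution can be "closest" to different populations depending on whether you score with squared error or KL divergence.

```python
import math

model = [0.60, 0.30, 0.10]  # LLM's distribution over 3 answer options
pop_a = [0.50, 0.45, 0.05]  # hypothetical population A
pop_b = [0.55, 0.44, 0.01]  # hypothetical population B

def squared_error(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kl_divergence(p, q):
    # KL(p || q); assumes q has no zero entries where p > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Squared error says the model is closer to population B...
assert squared_error(model, pop_b) < squared_error(model, pop_a)
# ...while KL divergence says it is closer to population A.
assert kl_divergence(model, pop_b) > kl_divergence(model, pop_a)
```

The disagreement comes from KL's sensitivity to near-zero probabilities: population B assigns almost no mass to an option the model considers plausible, which squared error barely notices.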

2 months ago

We have updated the CUS-QA pre-print, our benchmark for regional knowledge about Czechia, Slovakia, and Ukraine arxiv.org/abs/2507.22752
It now includes retrieval-augmented generation results and a more detailed analysis of model performance by question topic and visual context.

2 months ago

👉 What do we do?
We use the good old IBM Model 1 to align subwords with morphological features from UniMorph, and we show it captures the same signal as morpheme boundary recall.
👉 Why does it matter?
For many languages, gold segmentation data is missing, while morphological features are much more widely available.
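A toy sketch of the idea (hypothetical mini-corpus and feature tags, not the paper's actual implementation): run IBM Model 1 EM on (subword tokens, morphological features) pairs and read off which subword each feature aligns to.

```python
from collections import defaultdict

# Toy corpus: each "sentence" pairs a word's subword tokens with its
# UniMorph-style morphological features (hypothetical examples).
corpus = [
    (["walk", "##ed"], ["V", "PST"]),
    (["walk", "##s"],  ["V", "PRS"]),
    (["jump", "##ed"], ["V", "PST"]),
]

subwords = {s for toks, _ in corpus for s in toks}
# IBM Model 1: EM over translation probabilities t(subword | feature)
t = defaultdict(lambda: 1.0 / len(subwords))  # uniform init

for _ in range(20):
    count = defaultdict(float)
    total = defaultdict(float)
    for toks, feats in corpus:
        for s in toks:
            norm = sum(t[(s, f)] for f in feats)
            for f in feats:
                c = t[(s, f)] / norm  # E-step: expected alignment count
                count[(s, f)] += c
                total[f] += c
    for (s, f), c in count.items():
        t[(s, f)] = c / total[f]      # M-step: renormalize per feature

# After EM, the past-tense feature PST aligns to the "-ed" suffix.
best = max(subwords, key=lambda s: t[(s, "PST")])
print(best)  # "##ed"
```

The stems appear with both tenses, so EM pushes the tense features toward the suffixes that actually co-vary with them; plausibility can then be scored from how well these alignments respect subword boundaries.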

2 months ago
Preview: Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features

We (= mostly @abyste.bsky.social) developed a way to evaluate how morphological a #tokenization is w/o gold segmentation labels. arxiv.org/abs/2601.18536 The key: align subword tokens with morphological features from UniMorph using IBM Model 1.
To appear in EACL 2026 Findings.

2 months ago

Happy holidays! 🎄🎅🤩🎁

3 months ago

Attenzione! 🇮🇹 Know Piedmontese or Neapolitan speakers? @gianlucavico.bsky.social is collecting crowd-sourced translations to evaluate LLM performance on these regional languages. Participate!

5 months ago

Cultural awareness is trickier. Different data for different cultures means we can't really compare performance across cultures in a straightforward way. And there's no clear optimization target for cultural awareness beyond curating diverse training data.

6 months ago

☝️🧵 Most current approaches emphasize language neutrality: about two-thirds of VL benchmarks use translation-based evaluation. This makes sense because we can explicitly train for language neutrality when we have parallel data. But... 🧵👇

6 months ago

With @andrei-a-manea.bsky.social, we posted a survey on multilingual vision-language models 👉 arxiv.org/pdf/2509.22123
We reviewed 31 models+21 benchmarks. There's a tension between language neutrality (same results across languages) & cultural awareness (context matters differently across cultures)

6 months ago

Most vision-language models only work in English. We explore how different parallel data types (machine-translated vs authentic captions) affect cross-lingual transfer. Key finding: authentic data can outperform machine translation, and multilingual training beats bilingual approaches. #NLP

7 months ago

So proud of my PhD student @andrei-a-manea.bsky.social for his first first-author publication! 🎉 He presented this work last week at TSD. Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders arxiv.org/pdf/2504.21681

7 months ago

For evaluation researchers: Simple string-overlap metrics (BLEU, chrF) work surprisingly well for factual QA. 🤔 When answers are mostly named entities, exact matches matter more than we thought.

LLM-as-judge 🦙🧑‍⚖️ correlates best with human judgment, though.
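A rough sketch of the string-overlap idea (simplified chrF: a single n-gram order and plain F1 instead of the real metric's n = 1..6 average with beta = 2): character n-gram F1 behaves almost like exact match on short named-entity answers.

```python
from collections import Counter

def char_ngram_f1(hyp, ref, n=3):
    """Simplified chrF-style score: character n-gram F1."""
    def ngrams(s):
        s = s.replace(" ", "")
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    if not h or not r:
        return float(hyp.strip() == ref.strip())
    overlap = sum((h & r).values())        # clipped n-gram matches
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Named-entity answers make this behave almost like exact match:
print(char_ngram_f1("Bratislava", "Bratislava"))  # 1.0
print(char_ngram_f1("Brno", "Bratislava"))        # 0.0
```

When the expected answer is a single entity name, almost any character-level mismatch destroys the n-gram overlap, which is why these cheap metrics track correctness so well in this setting.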

7 months ago

The results are... humbling 😅
Even the best models:

>40% accuracy on textual questions
<30% on visual questions
Often perform better in English than the local language (!!)

Visual QA with regional images is especially challenging.

7 months ago

The problem: Most QA benchmarks focus on globally known facts. But real users ask about local geography, culture, and history.

We collected questions from native speakers in Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦 about facts locals know but outsiders don't.

7 months ago
Preview: ufal/cus-qa · Datasets at Hugging Face

🧵 We're releasing CUS-QA - a new benchmark for testing LLMs on regional knowledge!
Find out what your model knows about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦!
👉 Textual and visual questions, answers, and human judgment on model outputs!
huggingface.co/datasets/ufa...
www.arxiv.org/abs/2507.22752

7 months ago

Stay tuned, we will release the dataset soon...

8 months ago

We need to have poster fights at the end of every conference.

8 months ago

Just presented MAGBIG, a new dataset and evaluation methodology for gender bias in multilingual text-to-image generation. Grammatical gender matters when studying these biases across languages!
Thanks to Felix Friedrich, @kathaem.bsky.social and all co-authors - it was fun to work on this together!

8 months ago

This week I am at #ACL2025NLP in Vienna 🎡🇦🇹. Find me 🕵️ or message 💌 me if you want to chat about multilinguality or tokenization. Stop 🛑 by our poster on gender bias in text-to-image generation on Monday aclanthology.org/2025.acl-lon...

8 months ago