Our experts contributed to the latest #HPLT dataset publication, which contains some very interesting results! See here: t.co/uN2zoSF251 #DataScience
You are also welcome to join the "Multilingualism: from data crawling to evaluation" birds-of-a-feather (BoF) event, co-organized by the #HPLT project.
Join us to discuss web-scale text data collection and processing, as well as open multilingual #LLM training and evaluation. You will have […]
If you are attending the ACL 2025 conference in Vienna, come to the poster presenting the latest #HPLT v2 datasets (the paper is available here: https://arxiv.org/abs/2503.10267 ).
You can find the HPLT folks on Wednesday, July 30, 11:00 at the in-person poster session, Level 0, Exhibit Halls X4 […]
Reference LLMs from #HPLT and #OpenEuroLLM
Happy to share the first models I contributed to as part of the #HPLT + @openeurollm.bsky.social project and the @turkunlp.bsky.social group :)
📢 First release: 38 monolingual reference LLMs (2.15B params) via #HPLT + #OpenEuroLLM
⚙️ Trained on 100B tokens from the HPLT v2 dataset
🌍 Cover EU langs + others
⚙️ Based on LLaMA, trained on #LUMI
📈 Useful for evaluation
Downloads + more info at openeurollm.eu/blog/hplt-oe...
1. "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)", describing a new generation of the #HPLT web-crawled corpora in 193 languages. LTG co-authors: Nikolay Arefyev, Mariia Fedorova, Andrey Kutuzov, Petter Mæhlum, Vladislav Mikhailov, Stephan Oepen […]
That's a wrap for @nodalida.bsky.social! Short, nice and intense. I presented our work on efficient MT at @helsinki-nlp.bsky.social within the #HPLT project ⚡️