🎥 Videos from our Tokenization Workshop are now live! Watch invited talks, panel discussions, and the best paper presentation at icml.cc/virtual/2025... #Tokenization #NLP #LLMs
Posts by Tomasz Limisiewicz
Check the BLT poster at @aclmeeting.bsky.social. It’s just a foretaste before the main presentation at @tokshop.bsky.social next week from Artidoro Pagnoni!
Looking forward to our panel at 3:30. We’ll talk about the future of tokenization: BLT, SuperBPE @alisawuffles.bsky.social, H-nets Albert Gu and further breakthroughs in tokenization @uvp.bsky.social, Sander Land, Kris Cao
bsky.app/profile/toks...
It’d be great to meet at the Tokenization Workshop @tokshop.bsky.social #icml
tomorrow, July 18, starting at 8:45 in Meeting 112-113!
The TokShop schedule is now live! Join us at #ICML2025 for invited talks, poster sessions, and a panel on the future of tokenization. tokenization-workshop.github.io/schedule #Tokenization #LLM #NLP
I'm pleased to be in Vancouver for @ICML this week 🇨🇦🤖. I'll be happy to chat about multilingual and multimodal LMs and tokenization (or tokenization-free modeling).
If you have experience with tokenization (who doesn’t?), your help with reviewing will be hugely appreciated! 🔠🔡
Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬
Why bother with rebuttal when the perfect venue is right around the corner!
Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! 🚀
#NAACL2025 ended more than a week ago & @ufal-cuni.bsky.social folks were there:
Main conf: @kathaem.bsky.social presented joint work w/ @tomlim.bsky.social, @jlibovicky.bsky.social and Alex Fraser: Beyond Literal Token Overlap: Token Alignability for Multilinguality aclanthology.org/2025.naacl-s...
📣 Call for Papers Alert: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.
It’s finally official: the long-awaited Tokenization Workshop is here!
So, apparently, confusing these two buttons can ignite a serious flame-war in reviewer-author discussion🔥 @aclmeeting.bsky.social
Excited to continue my research adventure as a postdoc at @uwnlp.bsky.social and Meta! I’ve joined @lukezettlemoyer.bsky.social’s fantastic lab. Together, we plan to rethink how LLMs perceive data to extend their capabilities to uncharted languages and, further, beyond text!
Paper 👉Beyond Literal Token Overlap: Token Alignability for Multilinguality👈 by @kathaem.bsky.social, @tomlim.bsky.social, @jlibovicky.bsky.social and Alex Fraser will appear at #NAACL2025! arxiv.org/abs/2502.06468 Congratulations to all authors! 🥳
Happy to say that our paper "Beyond Literal Token Overlap: Token Alignability for Multilinguality" will be presented at #NAACL2025!
This is work with @tomlim.bsky.social, @jlibovicky.bsky.social, and Alex Fraser.
arxiv.org/abs/2502.06468
#newpaper #NLP #NLProc
It’d be great to stay in touch!
Work in progress -- suggestions for NLP-ers based in the EU/Europe & already on Bluesky very welcome!
go.bsky.app/NZDc31B
Haha, that's me, both name and surname 😁
Ahh, that's a pity to miss that.
Thanks, I'm happy to hear that 🙂. Do you have a rough estimate of when to expect a call for workshop proposals?
How about workshops before or after the main conference?
Good to see you here! #nlp
Also this one:
Lexically Grounded Subword Segmentation
aclanthology.org/2024.emnlp-m...
Poster Session Nov 12 (Tue) 2 pm 🙂
Fantastic list, thank you!
Tokenization is so back! at #EMNLP
Also, if you are in Miami for EMNLP this week, don’t miss Hila Gonen's MRL keynote about fair multilingual tokenization (including MYTE).
Happening on Saturday (Nov 16) at 9:50 am ET at the MRL workshop (room: Jasmine).
from transformers import T5ForConditionalGeneration
from transformers import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large
MODEL = f"Tomlim/myt5_{MODEL_SIZE}"

model = T5ForConditionalGeneration.from_pretrained(MODEL, use_safetensors=True)
tokenizer = MyT5Tokenizer.from_pretrained(MODEL)
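For context on why byte-level encoding matters for non-Latin scripts, here is a minimal sketch using only the Python standard library. It shows how many UTF-8 bytes the same kind of short word takes in different scripts. It is not the MYTE algorithm, and the helper name `utf8_bytes_per_char` and the sample words are illustrative assumptions.

```python
# Illustrative only: plain UTF-8 byte counts, NOT the MYTE encoding itself.
# The sample words and the helper name below are hypothetical.
SAMPLES = {
    "English": "hello",
    "Greek": "\u03b3\u03b5\u03b9\u03b1",                # "γεια"
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",    # "नमस्ते"
}


def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    for language, word in SAMPLES.items():
        n_bytes = len(word.encode("utf-8"))
        ratio = utf8_bytes_per_char(word)
        print(f"{language}: {len(word)} code points, {n_bytes} bytes ({ratio:.1f} bytes/char)")
```

A plain byte-level model sees one token per byte, so scripts that need more bytes per character produce longer input sequences for comparable text.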
#firstpost
Are you working on NLP for low-resource or non-Latin script languages?
If yes, I have great news for you! Our MYTE tokenizer and MyT5 models 🪲 are now easily available through 🤗. It’s easy to try:
If you are interested in AI, follow the folks in this starter pack! I have just updated it to include a few new arrivals here, but please let me know who else is missing
go.bsky.app/SipA7it
Great list, thanks for making the start at 🦋 easier. I’d also love to be added to the list!
That's awesome. Time for a fresh start at 🦋