
Posts by Craig Schmidt

Gandalf the White. A quote for our times.

2 months ago 0 0 1 0

The red cups are a brand called Solo cups. They have always been red.

3 months ago 2 0 0 0

I’m at @colmweb.org this week in Montreal. Come see our BoundlessBPE paper in the Wed morning poster session. Love to talk to anyone else here, especially about tokenization. #COLM2025

6 months ago 0 0 0 0

I believe he’s talking about Olin College of Engineering. Created from scratch as an undergraduate-only school, with its first class in 2002. Kind of a Harvey Mudd of the east. The campus is near me, and they seem to attract great students.

6 months ago 1 0 0 0
WordPiece can't always avoid <unk> even with ByteLevel pretokenization. · Issue #1863 · huggingface/tokenizers The ByteLevel pre-tokenizer is largely used to avoid the possibility of an <unk> token. However, there is a problem with the continuation characters in WordPiece that prevents you from adding all o...

The other is that there isn't a way to specify an initial vocabulary with all 256 bytes, including the continuation-character ("##") forms. See github.com/huggingface/.... So in short, if you use their WordPiece you might get <UNK> tokens.
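To see why the continuation characters matter, here's a minimal sketch of WordPiece's greedy longest-match-first segmentation in plain Python (not the Hugging Face implementation; `wordpiece_tokenize` and the toy vocab are my own illustration). Because non-initial pieces must carry the "##" prefix, a word can fall back to [UNK] even when every one of its characters is in the vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: the first piece is matched as-is,
    # every later piece must exist in the vocab with a "##" prefix.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece: the *whole word* becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

# Vocab contains every character, but is missing the continuation form "##c":
vocab = {"a", "b", "c", "##a", "##b"}
print(wordpiece_tokenize("ab", vocab))   # ['a', '##b']
print(wordpiece_tokenize("abc", vocab))  # ['[UNK]'] -- "c" is there, "##c" isn't
```

So guaranteeing coverage requires both the 256 byte tokens *and* their 256 "##"-prefixed continuation forms in the initial vocabulary, which is the part you can't currently specify.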

6 months ago 1 0 0 0
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem · Stéphan Tulkens' Blog

There are two different ways that the Hugging Face WordPiece implementation can produce <UNK> tokens even with ByteLevel pretokenization. A nice blog post from Stéphan Tulkens talks about how to fix one of them, in response to a question of mine.
stephantul.github.io/blog/better-...

6 months ago 2 0 2 0

I've been using GPT-5 on my phone (since it isn't in my web account yet). I've had several bad responses with logical inconsistencies. My hot take: what if GPT-5 is mostly about saving OpenAI money on inference? That would explain why they are deprecating all the other models so quickly.

8 months ago 2 0 0 0

@crampell.bsky.social’s post got me to thinking and…yes…Trump has apparently canceled the research grant of Judea Pearl, who is one of the world’s leading scholars, is Jewish, Israeli-American, & is vocally opposed to antisemitism, & is the father of Daniel Pearl.
www.science.org/content/arti...

8 months ago 209 91 8 8
Job Postings (OBP) · Georg-August-Universität Göttingen

Interested in multilingual tokenization in #NLP? Lisa Beinborn and I are hiring!

PhD candidate position in Göttingen, Germany: www.uni-goettingen.de/de/644546.ht...

PostDoc position in Leuven, Belgium:
www.kuleuven.be/personeel/jo...

Deadline 6th of June

10 months ago 25 13 2 2

I've posted a few papers I missed including yours here bsky.app/profile/crai.... Thomas pointed that out about 5 seconds after I posted on the discord :-)

8 months ago 1 0 1 0
Causal Estimation of Tokenisation Bias Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

16) Causal Estimation of Tokenisation Bias
Pietro Lesci et al
aclanthology.org/2025.acl-lon...

8 months ago 2 0 0 0
Tokenisation is NP-Complete Philip Whittington, Gregor Bachmann, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

15) Tokenisation is NP-Complete
Philip Whittington et al
aclanthology.org/2025.acl-lon...

8 months ago 3 1 1 0
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model Thomas Bauwens, David Kaczér, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

14) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens et al
aclanthology.org/2025.acl-lon...

8 months ago 2 0 1 1

And of course I missed some tokenization related papers at #ACL2025 in my previous post. Any more I should add?

8 months ago 2 0 1 0
Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages Georgy Andryushchenko, Vladimir V. Ivanov. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 2025.

13) Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages
Georgy Andryushchenko et al
aclanthology.org/2025.acl-srw...

8 months ago 1 0 0 0
Retrofitting Large Language Models with Dynamic Tokenization Darius Feher, Ivan Vulić, Benjamin Minixhofer. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

12) Retrofitting Large Language Models with Dynamic Tokenization
Darius Feher et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
TokAlign: Efficient Vocabulary Adaptation via Token Alignment Chong Li, Jiajun Zhang, Chengqing Zong. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

11) TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Chong Li et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

10) Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025.

9) Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Andrew Gambardella et al
aclanthology.org/2025.acl-sho...

8 months ago 1 0 1 0
Adversarial Tokenization Renato Geh, Zilei Shao, Guy Van Den Broeck. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

8) Adversarial Tokenization
Renato Lui Geh et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
Incorporating Domain Knowledge into Materials Tokenization Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

7) Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
Beyond Text Compression: Evaluating Tokenizers Across Scales Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

6) Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...

5) Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Zhu Xu et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0
Unsupervised Morphological Tree Tokenizer Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

4) Unsupervised Morphological Tree Tokenizer
Qingyang Zhu et al
aclanthology.org/2025.finding...

8 months ago 1 0 1 0
Splintering Nonconcatenative Languages for Better Tokenization Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

3) Splintering Nonconcatenative Languages for Better Tokenization
Bar Gazit et al
aclanthology.org/2025.finding...

8 months ago 1 0 1 0
Tokenization is Sensitive to Language Variation Anna Wegmann, Dong Nguyen, David Jurgens. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

2) Tokenization is Sensitive to Language Variation
Anna Wegmann et al
aclanthology.org/2025.finding...

8 months ago 1 0 1 0
Byte Latent Transformer: Patches Scale Better Than Tokens Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini...

1) Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni et al
aclanthology.org/2025.acl-lon...

8 months ago 1 0 1 0

I'm sadly not at #ACL2025, but the work on tokenization seems to continue to explode. Here are the tokenization-related papers I could find, in no particular order. Let me know if I missed any.

8 months ago 11 4 2 0

Really grateful to the organizers for the recognition of our work!

8 months ago 12 1 1 0
ICML Poster · Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning · ICML 2025

You’re right, these results apply to general “big” datasets like The Pile or RedPajama. There are several papers at ICML on weighting datasets, like Chameleon (icml.cc/virtual/2025...), that could probably let you get away with less data.

8 months ago 1 0 1 0