
Posts by Marco

So is BPE (for NLP). It's always crazy to me that these came out at roughly the same time.

aclanthology.org/P16-1162/

9 hours ago 13 3 0 0

Just needs prediction markets and it will be product of the year.

19 hours ago 0 0 0 0

Like if there are a few similar moves that all look good (no hanging pieces, equal trades, etc.) but one leads to a piece loss a few moves later. These are obviously great calculation and visualization puzzles, but are pretty hard to get correct and are frustrating for people using the app casually.

20 hours ago 1 0 0 0

The primary thing I'm having trouble with is positions where the blunder doesn't become apparent until 4-5 moves later. These are not good for a single-turn daily puzzle app.

blunder.clinic #42 • 1200
🟩🟩🟨⬜🟥🟥
5/6 💪

blunder.clinic #42 • 1500
🟩⬜🟨🟨⬜🟥
4/6 💪

blunder.clinic #42 • 1800
⬜🟩🟨🟨🟥⬜
4/6 💪

20 hours ago 1 0 2 0
Video

I built an interactive guide on how Shazam (the music identification app) works!

This is the next installment in my newly coined "How The Heck?" series, where we explore everyday tech that can feel like magic (QR Codes, GPS, and now this one).

Hope you enjoy it!

perthirtysix.com/how-the-heck...

1 day ago 174 45 9 8
Preview
Where is the signal in tokenization space?
Renato Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.

Last one for this thread for now.

1 day ago 2 0 0 0
Preview
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocab...

Some inference ones:

1 day ago 2 0 1 0
Preview
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens, David Kaczér, Miryam de Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

Another training one:

1 day ago 1 0 1 0
Preview
Distributional Properties of Subword Regularization
Marco Cognetta, Vilém Zouhar, Naoaki Okazaki. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.

(one of mine 😅)

1 day ago 1 0 1 0
Preview
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language mo...

A few others:

1 day ago 1 0 1 0
Preview
Stochasticity in Tokenisation Improves Robustness
The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerabilities in perturbations of tokenisation of the input indicate that models trained with a...

Stochastic tokenization is one of my favorite topics. Here is a recent preprint on arXiv.

h/t @trappmartin.bsky.social
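If the term is new to you, the core idea is easy to sketch. Here's a toy version of one flavor, BPE-dropout (skip each merge with probability p, so the same word segments differently across samples); the merge table is invented purely for illustration:

```python
import random

# Toy merge list in priority order; a real tokenizer learns thousands of these.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_dropout_encode(word: str, p: float = 0.1) -> list[str]:
    """Segment `word` with BPE, skipping each eligible merge with probability p."""
    symbols = list(word)
    for a, b in MERGES:                       # apply merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and random.random() >= p:
                symbols[i : i + 2] = [a + b]  # perform the merge
            else:
                i += 1
    return symbols

# With p=0 this is deterministic BPE; with p>0 segmentations vary per call.
print(bpe_dropout_encode("lower", p=0.5))  # e.g. ['low', 'er'] or ['lo', 'w', 'e', 'r']
```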

1 day ago 17 1 1 1

I watched a video where they trained a new barista by giving her a crash course and then making her make like 100 lattes in a row. This is the dream, but I don't know that many people in the Bay.

Making 1-2 per day just isn't enough for consistent improvement.

Maybe there is a lesson in this.

1 day ago 0 0 0 0
Preview
Brandon Chou on Instagram: "coffee, nujabes on repeat, and good company all around 😌 this was a test run so we had limited capacity - looking forward to the next ones!" 107K likes, 308 comments - withbrandonn on June 27, 2023.

This is my inspiration.

1 day ago 0 0 1 0
Post image Post image Post image

I have been wanting to host an "apartment cafe" for a while, but I need to get a bit more consistent with my latte art. So I held a dry run for a few people this weekend so I could just make a bunch of lattes in a row without risking my health.

Here are some of my better ones.

1 day ago 5 0 1 0
Post image
1 day ago 5481 1356 37 70

Oh, it's actually not his last name but his given name:

현덕 = 賢德 (roughly, "virtuous")

His last name is 송 = 宋 (Song).

TIL! This is a cool name for sure. Thanks!

1 day ago 1 0 0 0

Hmm, I can't say that "virtuous" and "protoss player" go together in my head!

1 day ago 1 0 1 0
Preview
Lee Sedol - Wikipedia

A thought occurred to me last night that Lee Sedol (the Go player) is about as close to Korean nominative determinism as I've ever heard.

이세돌 ≈ 二三石 (roughly "two, three, stone"), if you squint a bit.

1 day ago 5 0 1 0
A screenshot of a white-on-black terminal depicting a 19x19 go board in ascii graphics, with empty grid intersections as periods, and black and white as Os and #s

It’s absolutely incredible that one of the largest Japanese-run Go servers, which has been running since 1992, is still accessed entirely via Telnet. And while most players use GUI clients that use Telnet under the hood, you can still connect manually and get ASCII graphics streamed to you.
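If you don't have a telnet client handy, a few lines of Python get you the same raw stream. The host/port below are the telnet address IGS has advertised for years, but treat them as an assumption and check the server's site first:

```python
import socket

# igs.joyjoy.net:6969 is IGS's long-published telnet address (assumption).
with socket.create_connection(("igs.joyjoy.net", 6969)) as s:
    while data := s.recv(4096):  # b"" means the server closed the connection
        # Raw telnet: a few IAC negotiation bytes may print as replacement chars.
        print(data.decode("ascii", errors="replace"), end="", flush=True)
```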

2 days ago 4136 886 10 43

One of these days I'll get 6/6 x 3

blunder.clinic #41 • 1200
🟩🟩🟨🟨🟥⬜
5/6 💪

blunder.clinic #41 • 1500
🟩🟩⬜🟨🟥⬜
4/6 💪

blunder.clinic #41 • 1800
🟩🟩⬜🟨⬜⬜
3/6 💪

#chess

2 days ago 1 0 0 0

Missed a lot of greens today 😬

blunder.clinic #40 • 1200
🟩🟩🟨🟨⬜🟥
5/6 💪

blunder.clinic #40 • 1500
⬜⬜🟨🟨🟥🟥
4/6 💪

blunder.clinic #40 • 1800
⬜🟩🟨🟨🟥⬜
4/6 💪

3 days ago 2 0 0 0
Preview
How Do You Measure an A.I. Boom?

Cool profile of @metr.org’s work in the NYT today! Particularly like this from my colleague Ajeya: “METR is an organization that asks... what we think would be most valuable for the world to know about A.I. and its risks, and then the answers are what they are.”
www.nytimes.com/2026/04/17/t....

4 days ago 13 4 0 0

measured Opus 4.7's new tokenizer today. English: 1.45× more tokens than 4.6. Cyrillic: 1.00×.

the optimization play is clear: write your codebase in Russian-transliterated keywords.

функция факториал(н):

free 30% token discount, unreadable to English speakers, huge productivity win
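For the curious, the measurement itself is just token counts over parallel strings, averaged over a corpus. A rough sketch with OpenAI's tiktoken standing in for the actual (non-public) tokenizer; the counts it prints are illustrative, not the Opus ratios above:

```python
import tiktoken  # OpenAI's tokenizer library, used here as a stand-in

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english":  "function factorial(n):",
    "cyrillic": "функция факториал(н):",  # the same line, transliterated
}
counts = {name: len(enc.encode(text)) for name, text in samples.items()}
print(counts)
# counts["english"] / counts["cyrillic"] is the kind of ratio quoted above;
# averaging over a large parallel corpus gives a stable estimate.
print(counts["english"] / counts["cyrillic"])
```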

3 days ago 94 15 5 4

This is a great idea haha. Now how can it be adapted to English classes?

3 days ago 7 0 1 0

Are you a PhD student interested in game AI? Submit to AIIDE 2026’s Doctoral Consortium! The feedback a DC provides helps sharpen your dissertation focus while informing you about potential career options.

Applications are due July 25, 2026! Read more at: tinyurl.com/aiide26dc

3 days ago 3 5 0 0

This is a good point, because treating whitespace as its own token would fix this problem.

3 days ago 0 0 0 0
Preview
The Art of Prompt Design: Prompt Boundaries and Token Healing
Learn how standard greedy tokenization introduces a subtle and powerful bias that can have all kinds of unintended consequences.

Token healing partially solves the problem by popping the last token of your prompt and then constraining the model so that the next token it produces starts with the surface form of the popped token, letting the popped text and the start of the completion fuse into a single token.
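A minimal sketch of that loop, assuming a generic tokenizer object (`encode`, `decode`, and `vocab_size` are placeholder names here, not any particular library's API):

```python
def token_heal(prompt: str, tokenizer) -> tuple[list[int], list[int]]:
    """Pop the prompt's last token and compute which vocab ids may follow.

    The model should then be constrained so its first generated token is one
    of `allowed`, letting it regenerate the popped text as part of a longer
    token (e.g. " http" can become " https").
    """
    ids = tokenizer.encode(prompt)
    popped = ids.pop()                        # remove the trailing token
    healed_text = tokenizer.decode([popped])  # its surface form
    # O(vocab) scan for clarity; a real implementation would precompute this.
    allowed = [
        t for t in range(tokenizer.vocab_size)
        if tokenizer.decode([t]).startswith(healed_text)
    ]
    return ids, allowed
```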

3 days ago 2 0 0 0
Preview
Are you going to finish that? A Practical Study of the Partial Token Problem
Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt ...

Ah yeah this still happens in a lot of prompting things + constrained generation. The "partial token problem" and "token healing" are the key words.

Here's a great paper about it.

3 days ago 3 0 1 0

Here is one reason you might want to treat whitespace separately: words w/ and w/o a leading space get tokenized differently.

Having a separate whitespace token would unify a lot of these.
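Concretely, a quick demo (using OpenAI's tiktoken as a stand-in; the assumption is that any BPE tokenizer that folds leading spaces into word tokens behaves the same way):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ("tokenization", " tokenization"):
    ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in ids]
    print(repr(text), "->", ids, pieces)
# The first token of each sequence differs: the leading space is fused into
# the word-piece rather than standing alone as its own token.
```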

3 days ago 2 1 1 0

I feel like I read an xkcd where they place a computer inside of Narnia to speed up computation in our world, but now I can't find it.

There is a Narnia comic with a computer, but it differs right at the end.

3 days ago 1 0 0 0