4 papers submitted & accepted at #ACL2026! So grateful to work alongside & learn from amazing minds, pushing the boundaries of speech technologies, machine learning, and computational linguistics. See you in San Diego!
Posts by Kwanghee Choi
Thanks a lot for the interest in our work! Here's the recording for people who missed the seminar: youtu.be/DtFYKvNo9IQ
Yeah, exactly! Phonotactic restrictions were actually one of the things we were also interested in, and we're currently brainstorming about them for future work.
Yeah, that can also happen because they might simply be too close to each other. In our case, we compared such arithmetic against upper (same phone) and lower (different phone) baselines, and also ran some offset-based analogy tests (Sec A.1). Personally, though, speech editing feels most compelling.
Thanks for the interest! I'm not sure whether I fully understood your question, but our observations implied that it is likely position-independent (in such devoicing cases, the voicing vector is likely not "activated") and inventory-independent (crosslingual).
Yup, we managed to predict [p] via [b] + [t] - [d] in the self-supervised representation space! We tested a couple hundred of those analogies, and >90% were successful. Further, we showed speech editing on existing speech based on such phonological feature-driven vectors!
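To make the analogy test concrete, here's a toy sketch in the spirit of the result (the embeddings, dimensions, and the `cosine` helper below are all made up for illustration; real phone embeddings would come from averaged S3M frame representations):

```python
import numpy as np

# Synthetic "phone embeddings" built so that voiced/voiceless pairs
# differ by a shared voicing direction -- purely illustrative.
rng = np.random.default_rng(0)
dim = 8
voicing = rng.normal(size=dim)               # shared "voicing" feature vector
base_bilabial = rng.normal(size=dim)
base_alveolar = rng.normal(size=dim)
phones = {
    "b": base_bilabial + voicing,  # voiced bilabial
    "p": base_bilabial,            # voiceless bilabial
    "d": base_alveolar + voicing,  # voiced alveolar
    "t": base_alveolar,            # voiceless alveolar
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Analogy: [b] + [t] - [d] should land closest to [p].
query = phones["b"] + phones["t"] - phones["d"]
pred = max(phones, key=lambda p: cosine(query, phones[p]))
print(pred)  # -> p
```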
Huge thanks to my wonderful coauthors, Eunjung and Cheol-jun, and my two favorite Davids, Mortensen and Harwath, the best advisors I could ask for! Can't wait to see what we cook up next!
Together, both papers take a step beyond the usual "what info do S3Ms encode?" probing paradigm. We aim to answer how that info is actually encoded, geometrically. Come see for yourself Thursday!
Slides: docs.google.com/presentation...
Paper 2 (submitted to IS): "Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces"
We further show how sequences of phone(me)s can be encoded, i.e., contextualized, in a single S3M frame.
arxiv.org/abs/2603.12642
Paper 1 (submitted to Jan ARR):
"[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic"
We show how phone(me)s are encoded in S3Ms: as a linear combination of phonological feature vectors.
arxiv.org/abs/2602.18899
This is my third time presenting this work (previous stops: UT Austin on 3/6 and CMU on 3/13), but it's the first public one, so everyone can join!
Email me (kwanghee@utexas.edu) or Marianne (m.l.s.deheerkloots@uva.nl) for the Zoom link.
Self-supervised Speech Models are Phonological Feature Machines!
Excited to be giving an invited talk this Thursday (March 19th, 3pm Amsterdam time)!
Huge thanks to @mdhk.net at the University of Amsterdam for the invite!
✨New paper✨
We find that script (e.g., Cyrillic, Latin) is a linear direction in the activation space of Whisper, enabling transliteration at test time by adding such script directions to the activations, producing e.g. Cyrillic Japanese transcriptions.
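Here's a minimal toy sketch of the idea, using the common mean-difference way of finding a steering direction (the synthetic setup and all names below are illustrative, not the paper's actual code; in practice the activations would come from Whisper's decoder):

```python
import numpy as np

# Synthetic stand-ins for model activations: "Cyrillic" activations are
# "Latin" activations shifted by a ground-truth script axis.
rng = np.random.default_rng(0)
dim = 16
true_dir = rng.normal(size=dim)          # ground-truth script axis (toy)
latin = rng.normal(size=(100, dim))      # activations for Latin-script outputs
cyrillic = latin + true_dir              # same content, different script

# Estimate the "script direction" as the difference of class means.
script_dir = cyrillic.mean(axis=0) - latin.mean(axis=0)

# Test-time intervention: push a Latin-script activation toward Cyrillic.
steered = latin[0] + script_dir
print(np.allclose(script_dir, true_dir))  # the mean difference recovers the axis
```

In this toy setup the recovery is exact by construction; with real model activations the direction is only approximate, which is what makes the linearity finding interesting.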
Apply to CMU LTI's Summer 2026 "Language Technology for All" internship! Open to pre-doctoral students new to language tech (non-CS backgrounds welcome). 12-14 weeks in-person in Pittsburgh; travel + stipend paid. Deadline: Feb 20, 11:59pm ET. Apply: forms.gle/cUu8g6wb27Hs...
Had such a great time presenting our tutorial on Interpretability Techniques for Speech Models at #Interspeech2025!
For anyone looking for an introduction to the topic, we've now uploaded all materials to the website: interpretingdl.github.io/speech-inter...
This wouldn't have been possible without my awesome co-first-author @mmiagshatoy.bsky.social and wonderful supervisors @shinjiw.bsky.social and @strubell.bsky.social!
I'll see you in Rotterdam, Wed 17:00-17:20, Area8-Oral4 (Streaming ASR)! (10/10)
There are also a bunch of engineering tricks that can improve performance. We provide a Pareto-optimal baseline after applying all the available tricks, positioning our work as a foundation for future work in this direction. github.com/Masao-Someki... (9/n)
We also verified that DSUs are learnable with fewer weights (fewer layers), i.e., more lightweight! This implies that we're using self-supervised models inefficiently when extracting DSUs. (8/n)
We verified that DSUs are learnable with a limited attention window, i.e., streamable! This implies that DSUs are temporally "local". (7/n)
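For intuition, a limited attention window can be written as a mask like this (a generic causal sliding-window sketch, not necessarily the exact setup in the paper):

```python
import numpy as np

def window_mask(num_frames, window):
    """mask[i, j] is True when frame i may attend to frame j:
    only the current frame and the (window - 1) frames before it."""
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    return (j <= i) & (j > i - window)

m = window_mask(6, window=3)
print(m.astype(int))  # lower-banded matrix: each row attends to <= 3 frames
```

Because each frame only looks a fixed distance into the past, the model never has to wait for future audio, which is what makes it streamable.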
After modifying the architecture, we fine-tune it with the DSUs extracted from the original full model. We now treat DSUs as "ground truth" for the smaller model. (6/n)
However, the underlying Transformer model is heavy and non-streamable. We make the model more lightweight (by reducing the number of layers) and streamable (via a streaming attention window). (5/n)
Why DSUs?
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, roughly 3 orders of magnitude bigger!)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "act" like phonemes (4/n)
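A quick back-of-envelope check on (1), using the quoted figures:

```python
import math

# Quoted figures from above: ~512 kbps raw .wav vs ~0.6 kbps DSUs.
ratio = 512 / 0.6
print(round(ratio))                  # compression factor
print(round(math.log10(ratio), 1))  # orders of magnitude
```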
A whirlwind overview of discrete speech units (DSUs): we first train a Transformer model with self-supervision (i.e., self-supervised speech models, S3Ms). Then, we simply apply k-means on top of its frame representations, and the k-means cluster indices become the DSUs! (3/n)
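A minimal sketch of that pipeline, with random features standing in for S3M frame representations (the layer choice, feature dimension, and cluster count below are placeholders, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for S3M features: (num_frames, feature_dim).
# In practice these would be hidden states from e.g. a HuBERT layer.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 32))

# Fit k-means on the frames; each frame's cluster index is its DSU.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)
dsus = km.predict(frames)  # one discrete unit per frame
print(dsus.shape)          # (200,) -- a discrete "token" sequence for the audio
```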
In short, yes!
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), so they can be made more lightweight.
(2) DSUs do not require the full temporal receptive field, so they are streamable. (2/n)
Can we make discrete speech units lightweight and streamable? Excited to share our new #Interspeech2025 paper: On-device Streaming Discrete Speech Units arxiv.org/abs/2506.01845 (1/n)
www.nature.com/articles/350...
Ted Chiang. Catching crumbs from the table. Nature 405, 517 (2000). My favorite sci-fi short story, which summarizes surprisingly well what I actually do nowadays. I bet self-supervised speech models contain undiscovered theories of phonetics and phonology.
It's good to finally have a good reference for this stuff! Kudos to the authors.
arxiv.org/abs/2501.18374
Check out my presentation and poster for more details. I'll see you at NAACL, 4/30 14:00-15:30 Poster Session C! youtu.be/ZRF4u1eThJM (9/9)
We provide all the code and additional textgrids for everyone to use! github.com/juice500ml/a... (8/n)
We provide an extensive benchmark containing both pathological and non-native speech, with 8 different methods and 4 different speech features. It measures how accurately each speech feature models each phoneme. (7/n)