4 papers submitted & accepted at #ACL2026! So grateful to work alongside & learn from amazing minds, pushing the boundaries of speech technologies, machine learning, and computational linguistics. See you in San Diego!
Posts by Kwanghee Choi
Thanks a lot for the interest in our work! Here's the recording for people who missed the seminar: youtu.be/DtFYKvNo9IQ
Yeah, exactly! Phonotactic restrictions were actually one of the things we were also interested in, and we're currently brainstorming about them for future work.
Yeah, that can also happen because they might simply be too close to each other. In our case, we compared such arithmetic against upper (same phone) and lower (different phone) baselines, and also ran some offset-based analogy tests (Sec A.1). Personally, though, speech editing feels most compelling.
Thanks for the interest! I'm not sure whether I fully understood your question, but our observations implied that it is likely position-independent (in such devoicing cases, the voicing vector is likely not "activated") and inventory-independent (crosslingual).
Yup, we managed to predict [p] via [b] + [t] - [d] in the self-supervised representation space! We tested a couple hundred of those analogies, and >90% were successful. Further, we showed speech editing on existing speech based on such phonological feature-driven vectors!
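To make the analogy test concrete, here's a toy sketch in the spirit of the result (the embeddings, dimensions, and the `cosine` helper below are all made up for illustration; real phone embeddings would come from averaged S3M frame representations):

```python
import numpy as np

# Synthetic "phone embeddings" built so that voiced/voiceless pairs
# differ by a shared voicing direction -- purely illustrative.
rng = np.random.default_rng(0)
dim = 8
voicing = rng.normal(size=dim)               # shared "voicing" feature vector
base_bilabial = rng.normal(size=dim)
base_alveolar = rng.normal(size=dim)
phones = {
    "b": base_bilabial + voicing,  # voiced bilabial
    "p": base_bilabial,            # voiceless bilabial
    "d": base_alveolar + voicing,  # voiced alveolar
    "t": base_alveolar,            # voiceless alveolar
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Analogy: [b] + [t] - [d] should land closest to [p].
query = phones["b"] + phones["t"] - phones["d"]
pred = max(phones, key=lambda p: cosine(query, phones[p]))
print(pred)  # -> p
```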
Huge thanks to my wonderful coauthors, Eunjung and Cheol-jun, and my two favorite Davids, Mortensen and Harwath, the best advisors I could ask for! Can't wait to see what we cook up next!
Together, both papers take a step beyond the usual "what info do S3Ms encode?" probing paradigm. We aim to answer how that info is actually encoded, geometrically. Come see for yourself Thursday!
Slides: docs.google.com/presentation...
Paper 2 (submitted to IS): "Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces"
We further show how sequences of phone(me)s can be encoded, i.e., contextualized, in a single S3M frame.
arxiv.org/abs/2603.12642
Paper 1 (submitted to Jan ARR):
"[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic"
We show how phone(me)s are encoded in S3Ms: as a linear combination of phonological feature vectors.
arxiv.org/abs/2602.18899
This is my third time presenting this work (previous stops: UT Austin on 3/6 and CMU on 3/13), but it's the first public one, so everyone can join!
Email me (kwanghee@utexas.edu) or Marianne (m.l.s.deheerkloots@uva.nl) for the Zoom link.
Self-supervised Speech Models are Phonological Feature Machines!
Excited to be giving an invited talk this Thursday (March 19th, 3pm Amsterdam time)!
Huge thanks to @mdhk.net at the University of Amsterdam for the invite!
✨New paper✨
We find that script (e.g., Cyrillic, Latin) is a linear direction in the activation space of Whisper, enabling transliteration at test time by adding such script directions to the activations, producing e.g. Cyrillic Japanese transcriptions.
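Here's a minimal toy sketch of the idea, using the common mean-difference way of finding a steering direction (the synthetic setup and all names below are illustrative, not the paper's actual code; in practice the activations would come from Whisper's decoder):

```python
import numpy as np

# Synthetic stand-ins for model activations: "Cyrillic" activations are
# "Latin" activations shifted by a ground-truth script axis.
rng = np.random.default_rng(0)
dim = 16
true_dir = rng.normal(size=dim)          # ground-truth script axis (toy)
latin = rng.normal(size=(100, dim))      # activations for Latin-script outputs
cyrillic = latin + true_dir              # same content, different script

# Estimate the "script direction" as the difference of class means.
script_dir = cyrillic.mean(axis=0) - latin.mean(axis=0)

# Test-time intervention: push a Latin-script activation toward Cyrillic.
steered = latin[0] + script_dir
print(np.allclose(script_dir, true_dir))  # the mean difference recovers the axis
```

In this toy setup the recovery is exact by construction; with real model activations the direction is only approximate, which is what makes the linearity finding interesting.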
Apply to CMU LTI's Summer 2026 "Language Technology for All" internship! Open to pre-doctoral students new to language tech (non-CS backgrounds welcome). 12-14 weeks in-person in Pittsburgh; travel + stipend paid. Deadline: Feb 20, 11:59pm ET. Apply: forms.gle/cUu8g6wb27Hs...
Had such a great time presenting our tutorial on Interpretability Techniques for Speech Models at #Interspeech2025!
For anyone looking for an introduction to the topic, we've now uploaded all materials to the website: interpretingdl.github.io/speech-inter...
This wouldn't have been possible without my awesome co-first-author @mmiagshatoy.bsky.social and wonderful supervisors @shinjiw.bsky.social and @strubell.bsky.social!
I'll see you in Rotterdam, Wed 17:00-17:20, Area8-Oral4 (Streaming ASR)! (10/10)
There are also a bunch of engineering tricks that can improve performance. We provide a Pareto-optimal baseline after applying all the available tricks, positioning our work as a foundation for future work in this direction. github.com/Masao-Someki... (9/n)
We also verified that DSUs are learnable with fewer weights (fewer layers), i.e., more lightweight! This implies that we're using self-supervised models inefficiently when extracting DSUs. (8/n)
We verified that DSUs are learnable with a limited attention window, i.e., streamable! This implies that DSUs are temporally "local". (7/n)
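For intuition, a limited attention window can be written as a mask like this (a generic causal sliding-window sketch, not necessarily the exact setup in the paper):

```python
import numpy as np

def window_mask(num_frames, window):
    """mask[i, j] is True when frame i may attend to frame j:
    only the current frame and the (window - 1) frames before it."""
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    return (j <= i) & (j > i - window)

m = window_mask(6, window=3)
print(m.astype(int))  # lower-banded matrix: each row attends to <= 3 frames
```

Because each frame only looks a fixed distance into the past, the model never has to wait for future audio, which is what makes it streamable.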
After modifying the architecture, we fine-tune it with the DSUs extracted from the original full model. We now treat DSUs as "ground truth" for the smaller model. (6/n)
However, the underlying Transformer model is heavy and non-streamable. We make the model more lightweight (by reducing the number of layers) and streamable (via a streaming attention window). (5/n)
Why DSUs?
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, roughly 3 orders of magnitude bigger!)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "act" like phonemes (4/n)
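A quick back-of-envelope check on (1), using the quoted figures:

```python
import math

# Quoted figures from above: ~512 kbps raw .wav vs ~0.6 kbps DSUs.
ratio = 512 / 0.6
print(round(ratio))                  # compression factor
print(round(math.log10(ratio), 1))  # orders of magnitude
```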
A whirlwind overview of discrete speech units (DSUs): we first train a Transformer model with self-supervision (i.e., self-supervised speech models, S3Ms). Then, we simply apply k-means on top of its frame representations, and the k-means cluster indices become the DSUs! (3/n)
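A minimal sketch of that pipeline, with random features standing in for S3M frame representations (the layer choice, feature dimension, and cluster count below are placeholders, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for S3M features: (num_frames, feature_dim).
# In practice these would be hidden states from e.g. a HuBERT layer.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 32))

# Fit k-means on the frames; each frame's cluster index is its DSU.
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)
dsus = km.predict(frames)  # one discrete unit per frame
print(dsus.shape)          # (200,) -- a discrete "token" sequence for the audio
```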
In short, yes!
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), so they can be made more lightweight.
(2) DSUs do not require the full temporal receptive field, so they are streamable. (2/n)
Can we make discrete speech units lightweight and streamable? Excited to share our new #Interspeech2025 paper: On-device Streaming Discrete Speech Units arxiv.org/abs/2506.01845 (1/n)
www.nature.com/articles/350...
Ted Chiang. Catching crumbs from the table. Nature 405, 517 (2000). My favorite sci-fi short story, which summarizes surprisingly well what I actually do nowadays. I bet self-supervised speech models contain undiscovered theories of phonetics and phonology.
It's good to finally have a good reference for this stuff! Kudos to the authors.
arxiv.org/abs/2501.18374
Check out my presentation and poster for more details. I'll see you at NAACL, 4/30 14:00-15:30 Poster Session C! youtu.be/ZRF4u1eThJM (9/9)
We provide all the code and additional textgrids for everyone to use! github.com/juice500ml/a... (8/n)
We provide an extensive benchmark containing both pathological and non-native speech, with 8 different methods and 4 different speech features. It measures how accurately each speech feature models each phoneme. (7/n)