
Posts by Kwanghee Choi

Post image

4 papers submitted & accepted at #ACL2026 🎉 So grateful to work alongside & learn from amazing minds, pushing the boundaries of speech technologies, machine learning, and computational linguistics. See you in San Diego!

1 week ago 3 2 0 0

Thanks a lot for the interest in our work! Here's the recording for people who missed the seminar: youtu.be/DtFYKvNo9IQ

1 month ago 1 0 0 0

Yeah, exactly! Actually, phonotactic restrictions were one of the things we were also interested in, and we're currently brainstorming future work around them.

1 month ago 1 0 0 0

Yeah, that may also happen because they might simply be too close to each other. In our case, we compared such arithmetic against upper (same phone) and lower (different phone) baselines, and also ran some offset-based analogy tests (Sec A.1). Personally, though, speech editing feels the most compelling.

1 month ago 1 0 0 0

Thanks for the interest! I'm not sure I fully understood your question, but our observations implied that the effect is likely position-independent (in such devoicing cases, the voicing vector is likely not "activated") and inventory-independent (crosslingual).

1 month ago 1 0 1 0

Yup, we managed to predict [p] via [b] + [t] − [d] in the self-supervised representation space! We actually tested a couple hundred such analogies, and >90% were successful. Further, we showed speech editing of existing speech based on such phonological feature-driven vectors!

1 month ago 2 0 1 0
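For a flavor of how such an analogy test works, here's a toy numpy sketch. The phonological feature vectors and phone embeddings below are synthetic stand-ins (not the actual S3M representations from the paper); they merely mimic the additive structure the result describes:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy setup: each phone embedding is a sum of (hypothetical) phonological
# feature vectors, mimicking the additive structure found in the paper.
features = {f: rng.normal(size=dim) for f in ["labial", "alveolar", "voiced", "stop"]}

def embed(feats):
    return sum(features[f] for f in feats)

phones = {
    "p": embed(["labial", "stop"]),
    "b": embed(["labial", "stop", "voiced"]),
    "t": embed(["alveolar", "stop"]),
    "d": embed(["alveolar", "stop", "voiced"]),
}

def nearest(vec):
    # cosine similarity against every phone embedding
    sims = {p: vec @ e / (np.linalg.norm(vec) * np.linalg.norm(e))
            for p, e in phones.items()}
    return max(sims, key=sims.get)

# [b] + [t] - [d] cancels "voiced" and swaps place of articulation -> "p"
print(nearest(phones["b"] + phones["t"] - phones["d"]))
```

In this toy world the arithmetic is exact by construction; the interesting empirical finding is that real S3M representations behave similarly.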

Huge thanks to my wonderful coauthors, Eunjung and Cheol-jun, and my two favorite Davids, Mortensen 👍 and Harwath 🤠 — best advisors I could ask for 🙏 Can't wait to see what we cook up next! 🚀

1 month ago 0 1 0 0
Preview
Self-supervised Speech Models are Phonological Vector Machines | Kwanghee Choi, kwanghee@utexas.edu

🧵 Together, both papers take a step beyond the usual "what info do S3Ms encode?" probing paradigm. We aim to answer how that info is actually encoded geometrically. Come see for yourself Thursday! 👀
Slides: docs.google.com/presentation...

1 month ago 0 0 1 0
Post image

📄 Paper 2 (submitted to IS): "Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces"
We further show how sequences of phone(me)s can be encoded, i.e., contextualized, in a single S3M frame.
arxiv.org/abs/2603.12642

1 month ago 1 0 1 0
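A minimal numpy illustration of the geometric idea (all sizes and position names here are hypothetical, not the paper's actual setup): if each context position gets its own orthogonal subspace, a single frame can superpose several phone vectors and still let each one be recovered by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, sub = 64, 8

# One orthonormal subspace per context position (previous / current / next
# phone). Taking disjoint column blocks of a QR factor makes them exactly
# mutually orthogonal.
Q = np.linalg.qr(rng.normal(size=(dim, 3 * sub)))[0]
bases = {"prev": Q[:, :sub], "curr": Q[:, sub:2 * sub], "next": Q[:, 2 * sub:]}
codes = {pos: rng.normal(size=sub) for pos in bases}

# a single "frame" encodes the whole local phone sequence as a superposition
frame = sum(bases[pos] @ codes[pos] for pos in bases)

# projecting onto the "curr" subspace reads that phone's vector back out,
# since the cross-terms vanish under orthogonality
recovered = bases["curr"].T @ frame
print(np.allclose(recovered, codes["curr"]))
```

The same projection works for "prev" and "next", which is what makes position-dependent subspaces a plausible mechanism for packing context into one frame.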
Post image

📄 Paper 1 (submitted to Jan ARR):
"[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic"
We show how phone(me)s are encoded in S3Ms: as a linear combination of phonological feature vectors.
arxiv.org/abs/2602.18899

1 month ago 0 0 1 0

This is my third time presenting this work — previous stops were UT Austin (3/6) and CMU (3/13) — but this is the first public one, so everyone can join! 🎉
📩 Email me (kwanghee@utexas.edu) or Marianne (m.l.s.deheerkloots@uva.nl) for the Zoom link.

1 month ago 0 0 1 0
Post image

๐’๐ž๐ฅ๐Ÿ-๐ฌ๐ฎ๐ฉ๐ž๐ซ๐ฏ๐ข๐ฌ๐ž๐ ๐’๐ฉ๐ž๐ž๐œ๐ก ๐Œ๐จ๐๐ž๐ฅ๐ฌ ๐š๐ซ๐ž ๐๐ก๐จ๐ง๐จ๐ฅ๐จ๐ ๐ข๐œ๐š๐ฅ ๐•๐ž๐œ๐ญ๐จ๐ซ ๐Œ๐š๐œ๐ก๐ข๐ง๐ž๐ฌ!
๐Ÿ—ฃ๏ธ Excited to be giving an invited talk this Thursday (March 19th, 3pm Amsterdam time)!
Huge thanks to @mdhk.net at University of Amsterdam for the invite ๐Ÿ™

1 month ago 6 2 2 2
Post image

✨New paper✨

We find script (e.g. Cyrillic, Latin) to be a linear direction in the activation space of Whisper, enabling transliteration at test time by adding such script directions to the activations — producing e.g. Cyrillic Japanese transcriptions.

3 months ago 10 5 1 0
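A toy numpy sketch of this kind of test-time steering, with synthetic activations standing in for Whisper's (the "script direction" here is simulated, not extracted from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Toy stand-in for the finding: activations for two scripts differ by a
# roughly constant vector (the "script direction").
script_dir = rng.normal(size=dim)
latin = rng.normal(size=(100, dim))
cyrillic = latin + script_dir  # hypothetical: same content, shifted script

# estimate the direction as the mean difference of activations
est_dir = cyrillic.mean(axis=0) - latin.mean(axis=0)

# steer a new Latin-script activation toward Cyrillic at test time
x = rng.normal(size=dim)
steered = x + est_dir

# the steered activation lands (much) closer to the Cyrillic version of x
print(np.linalg.norm(steered - (x + script_dir))
      < np.linalg.norm(x - (x + script_dir)))
```

In a real model the direction would be estimated from paired activations across scripts and added at an intermediate layer, but the mean-difference-then-add recipe is the same.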
Preview
CMU LTI Summer 2026 Internship Program Application We are looking for applicants for the Carnegie Mellon University Language Technology Institute's Summer 2026 "Language Technology for All" internship program. The main goal of this internship is to pr...

🚀 Apply to CMU LTI's Summer 2026 "Language Technology for All" internship! 🎓 Open to pre-doctoral students new to language tech (non-CS backgrounds welcome). 🔬 12–14 weeks in-person in Pittsburgh — travel + stipend paid. 💸 Deadline: Feb 20, 11:59pm ET. Apply → forms.gle/cUu8g6wb27Hs...

2 months ago 14 12 2 0

Had such a great time presenting our tutorial on Interpretability Techniques for Speech Models at #Interspeech2025! 🔍

For anyone looking for an introduction to the topic, we've now uploaded all materials to the website: interpretingdl.github.io/speech-inter...

8 months ago 41 15 2 1

This wouldn't have been possible without my awesome co-first-author @mmiagshatoy.bsky.social and wonderful supervisors @shinjiw.bsky.social and @strubell.bsky.social!
I'll see you in Rotterdam, Wed 17:00-17:20, Area8-Oral4 (Streaming ASR)! (10/10)

8 months ago 0 0 0 0
Post image

There's also a bunch of engineering tricks that can improve performance. We provide a Pareto-optimal baseline after applying all the available tricks, positioning our work as a foundation for future work in this direction. github.com/Masao-Someki... (9/n)

8 months ago 0 0 1 0
Post image

We also verified that DSUs are learnable with fewer weights (# of layers), i.e., more lightweight! This implies that we're using self-supervised models inefficiently when extracting DSUs. (8/n)

8 months ago 0 0 1 0
Post image

We verified that DSUs are learnable with a limited attention size (window size), i.e., streamable! This implies that DSUs are temporally "local". (7/n)

8 months ago 0 0 1 0
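A limited attention window of this kind is just a causal sliding-window mask; a minimal sketch (the sequence length and window size below are illustrative):

```python
import numpy as np

def window_mask(T, w):
    # frame t may attend only to frames (t - w, t]: a causal sliding window
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - w)

# 6 frames, window of 3: each row shows which past frames that frame can see
print(window_mask(6, 3).astype(int))
```

Because no frame looks ahead and each frame only needs the last few frames, inference can run on a stream with bounded memory, which is exactly what makes windowed DSU extraction streamable.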
Post image

After modifying the architecture, we fine-tune it with the DSUs extracted from the original full model. We now treat DSUs as "ground truth" for smaller models. (6/n)

8 months ago 0 0 1 0
Post image

However, the underlying Transformer model is heavy and non-streamable. We make the model more lightweight (by reducing the # of layers) and streamable (via a streaming window). (5/n)

8 months ago 0 0 1 0

Why DSUs?
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, roughly three orders of magnitude bigger!)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "act" like phonemes (4/n)

8 months ago 0 0 1 0
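The ~0.6kbps figure is easy to sanity-check: assuming a typical S3M frame rate of 50 Hz and a k-means codebook of 2000 clusters (both illustrative numbers, not necessarily the paper's exact configuration), each frame costs log2(K) bits:

```python
import math

# Back-of-the-envelope DSU bitrate under the assumptions above
frame_rate = 50                       # frames per second (typical for S3Ms)
n_clusters = 2000                     # k-means codebook size (illustrative)
bits_per_frame = math.log2(n_clusters)
kbps = frame_rate * bits_per_frame / 1000
print(round(kbps, 2))                 # ~0.55 kbps, in the ballpark of ~0.6
```

Against a ~512kbps .wav stream, that is roughly a 1000x reduction, hence the transmission-efficiency pitch.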
Post image

A whirlwind overview of discrete speech units (DSUs): we first train a Transformer model with self-supervision (i.e., self-supervised speech models, S3Ms). Then, we simply apply k-means on top of its features, and the k-means cluster indices become the DSUs! (3/n)

8 months ago 0 0 1 0
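The recipe in this post can be sketched end-to-end in a few lines of numpy (random features stand in for real S3M frames; the feature dimensions and cluster count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for S3M frame features: (T frames, D dims) from one utterance.
feats = rng.normal(size=(200, 16))
K = 8  # codebook size (illustrative)

# A few Lloyd iterations of k-means, the core of the DSU recipe
centroids = feats[rng.choice(len(feats), K, replace=False)]
for _ in range(10):
    # assign each frame to its nearest centroid
    dists = np.linalg.norm(feats[:, None] - centroids[None], axis=-1)
    assign = dists.argmin(axis=1)
    # recompute centroids as cluster means (keep old centroid if a cluster empties)
    centroids = np.stack([feats[assign == k].mean(axis=0) if (assign == k).any()
                          else centroids[k] for k in range(K)])

dsus = assign  # each frame's cluster index is its discrete speech unit
print(dsus.shape)
```

In practice one would fit k-means on features pooled over a large corpus and then quantize new utterances with the frozen centroids, but the frame-to-index mapping is the same.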

In short, yes!
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), so they can be made more lightweight.
(2) DSUs do not require the full temporal receptive field, so they are streamable. (2/n)

8 months ago 0 0 1 0
Post image

Can we make discrete speech units lightweight 🪶 and streamable 🏎? Excited to share our new #Interspeech2025 paper: On-device Streaming Discrete Speech Units arxiv.org/abs/2506.01845 (1/n)

8 months ago 1 1 2 0
Preview
Catching crumbs from the table - Nature In the face of metahuman science, humans have become metascientists.

www.nature.com/articles/350...
Ted Chiang. Catching crumbs from the table. Nature 405, 517 (2000). My favorite sci-fi short, which summarizes surprisingly well what I actually do nowadays. I bet self-supervised speech models contain undiscovered theories of phonetics and phonology.

10 months ago 3 0 0 0
Preview
Proofs for Folklore Theorems on the Radon-Nikodym Derivative In this paper, rigorous statements and formal proofs are presented for both foundational and advanced folklore theorems on the Radon-Nikodym derivative. The cases of conditional and marginal probabili...

It's good to finally have a solid reference for this stuff! Kudos to the authors.
arxiv.org/abs/2501.18374

11 months ago 4 2 0 1
Post image

Check out my presentation and poster for more details. I'll see you at NAACL, 4/30 14:00-15:30 Poster Session C! youtu.be/ZRF4u1eThJM (9/9)

11 months ago 1 0 0 0
Preview
GitHub - juice500ml/acoustic-units-for-ood: Official implementation for the paper "Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment" (NAACL 2025)

We provide all the code and additional textgrids for everyone to use! github.com/juice500ml/a... (8/n)

11 months ago 0 0 1 0
Post image

We provide an extensive benchmark containing both pathological and non-native speech, with 8 different methods and 4 different speech features. It measures how accurately each speech feature models each phoneme. (7/n)

11 months ago 0 0 1 0