Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim, Kwanghee Choi, Eunjung Yeo, Ryan Soh-Eun Shim, Hanyu Zhou, Brendon Boldt, Karen Rosero Jacome, Kalvin Chang, Darsh Agrawal, Keer Xu, ...
PRiSM: Benchmarking Phone Realization in Speech Models
https://arxiv.org/abs/2601.14046
Can we make discrete speech units lightweight and streamable? Excited to share our new #Interspeech2025 paper: On-device Streaming Discrete Speech Units arxiv.org/abs/2506.01845 (1/n)
Meows, music, murmurs and more - we trained a general purpose audio encoder and open sourced the code, checkpoint and evaluation toolkit.
We've open-sourced NatureLM-audio, the first audio-language foundation model for #bioacoustics.
Trained on large-scale animal vocalization, human speech & music datasets, the model enables zero-shot classification, detection & querying across diverse species & environments.
Resources for ESPnet-SDS:
Codebase (part of ESPnet): github.com/espnet/espnet
README & User Guide: github.com/espnet/espne...
Demo Video: www.youtube.com/watch?v=kI_D...
New #NAACL2025 demo! Excited to introduce ESPnet-SDS, a new open-source toolkit for building unified web interfaces for both cascaded & end-to-end spoken dialogue systems, providing real-time evaluation, and more!
Paper: arxiv.org/abs/2503.08533
Live Demo: huggingface.co/spaces/Siddh...
New #ICLR2025 Paper Alert!
Can Audio Foundation Models like Moshi and GPT-4o truly engage in natural conversations?
We benchmark their turn-taking abilities and uncover major gaps in conversational AI.
Paper: arxiv.org/abs/2503.01174
Wait, I thought The Rock was named Dwayne Johnson
gpu poverty is real
Happy New Year
Philip Whittington, Gregor Bachmann, Tiago Pimentel
Tokenisation is NP-Complete
https://arxiv.org/abs/2412.15210
Today, we're introducing NatureLM-audio: the first large audio-language model tailored for understanding animal sounds. arxiv.org/abs/2411.07186
Announcing FineWeb2: A sparkling update with 1000s of languages.
We applied the same data-driven approach that led to SOTA English performance in FineWeb to thousands of languages.
FineWeb2 has 8TB of compressed text data and outperforms other datasets.
LanguageBind arxiv.org/abs/2310.01852
Language as the pivot modality instead of images; different training dataset.
WAVLab is up on bsky!
We are excited to announce the launch of ML SUPERB 2.0 (multilingual.superbbenchmark.org) as part of the Interspeech 2024 official challenge! We hope this upgraded version of ML SUPERB advances universal access to speech processing worldwide. Please join us!
#Interspeech2025
I've started putting together a starter pack with people working on Speech Technology and Speech Science: go.bsky.app/BQ7mbkA
(Self-)nominations welcome!
Examples from the dataset: a world map surrounded by spectrograms showing animal sounds from different regions of the world.
Scatter plot where each point is a sound dataset; the x-axis is the number of categories in the dataset and the y-axis is its duration in hours. iNatSounds is shown as the largest dataset on both axes.
iNatSounds: new dataset from folks @inaturalist.bsky.social & co-authors; looks to be one of the largest public datasets of animal sounds
openreview.net/forum?id=QCY...
github.com/visipedia/in...
#prattle
#bioacoustics
We're here too now!
Me (shikharb@bsky.social) and our lab bsky.app/profile/wavl...