I'll be at @iclr-conf.bsky.social this week, presenting TTSDS2 as an oral on Friday at 11:18 local time (iclr.cc/virtual/2026...) - looking forward to meeting people there!
Posts by Christoph Minixhofer
Currently working on three different papers that use BWS, each running its listening tests with a different method because of different universities/first authors. It’s time we had a well-maintained, easy-to-use open-source framework for listening tests. If you know of any, let me know!
Turns out it’s an oral. Looking forward to Rio 🇧🇷
A pre-release of *ttsdb*, my collection of SOTA TTS models, is out now - github.com/ttsds/ttsdb
The aim is to provide a simple CLI and a collection of Python packages to make it easy to synthesise speech across a variety of models. Docs and website coming soon!
🧪 My paper on Text-to-Speech evaluation using distributional measures has been accepted to ICLR 2026! 🎉
openreview.net/forum?id=uGa...
In my opinion, we should focus much more on the distributions of synthetically generated speech; we showed that such distributional measures correlate highly with human ratings.
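The gist of a distributional measure can be sketched in a few lines. This is a toy illustration only, not the actual TTSDS2 metric: the feature values, sample sizes, and the choice of Wasserstein distance are all assumptions for the example.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-D "feature" values (e.g. pitch, or one embedding
# dimension) extracted from real and from synthetic speech.
real_feats = rng.normal(loc=0.0, scale=1.0, size=1000)
synth_feats = rng.normal(loc=0.3, scale=1.2, size=1000)

# Distance between the two empirical distributions: the smaller it is,
# the closer the synthetic feature distribution is to the real one.
dist = wasserstein_distance(real_feats, synth_feats)
print(f"Wasserstein distance: {dist:.3f}")
```

The point of scoring a system this way is that it compares whole distributions rather than individual utterances, which is exactly the framing of the paper.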
Just came across this wonderful blogpost on pangrams in many languages. If only there were a similar collection full of phonetic pangrams!
clagnut.com/blog/2380/#P...
www.technomoralfutures.uk/news-databas...
Happy Monday! Here's me thinking about speech tech, voices, and death thanks to the lovely @technomoralfutures.bsky.social
content notes: discussion of death, grief, online abuse
Passed my viva yesterday 🥳
Here's the pre-viva talk if anyone's interested, my work was/is about quantifying the distributional distance between real and synthetic speech.
youtu.be/Ii-6buwAoCg
First time going to a big gym in the UK, and somehow the practice of saying a little “sorry” as you go past someone cracks me up in that setting.
Fill in the blank:
"My p-value is smaller than 0.05, so..."
Wrong answers only.
… so glad I ran 20 experiments this time!
If your p-value remains stubbornly above 0.05, there are some creative ways to describe that as well, see this blog post: mchankins.wordpress.com/2013/04/21/s...
I don't download new HF models often, but when I do, it's during the 0.008% of downtime :(
TTSDS2 is one of the papers accepted by the @neuripsconf.bsky.social area chairs but rejected by the senior area chairs, with no explanation as to why. A bit frustrating after the long review process.
100% agreed, also crisps are a snack, not a side dish for lunch
Accents are also best seen as a distribution, not a group of labels imo. We tried to incorporate some proxy of accent in TTSDS2, but a simple phone distribution did not work all that well, probably because it’s hard to disentangle from lexical content…
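A phone-distribution proxy of the kind mentioned above could be sketched like this. Everything here is hypothetical for illustration (the phone inventory, the counts, and the use of Jensen–Shannon distance are my assumptions, not the TTSDS2 implementation):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical phone counts for two accents of English
# (made-up numbers, purely for illustration).
phones = ["AE", "AA", "R", "T", "OH"]
accent_a = np.array([120, 30, 90, 200, 40], dtype=float)
accent_b = np.array([80, 70, 40, 210, 60], dtype=float)

# Normalise counts into probability distributions over phones.
p = accent_a / accent_a.sum()
q = accent_b / accent_b.sum()

# Jensen-Shannon distance between the phone distributions:
# 0 = identical, 1 = maximally different (with base-2 logarithm).
d = jensenshannon(p, q, base=2)
print(f"JS distance: {d:.3f}")
```

The disentanglement problem from the post shows up immediately here: two recordings of different sentences in the *same* accent would also produce different phone counts, so lexical content leaks into the measure.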
It's been a great #interspeech2025!
I presented a TTS-for-ASR paper:
www.isca-archive.org/interspeech_...
And one on prosody reps: www.isca-archive.org/interspeech_...
There were many interesting questions & comments - if you have more and didn't get the chance feel free to send me a message.
I’ll be presenting this tomorrow at 8.50 at #interspeech2025, come by if you’re interested in prosodic representations!
Thank you to everyone who stopped by, I’m grateful for all the feedback and interesting questions #interspeech2025
In other news — if you’re an early bird and at #interspeech, feel free to drop by my poster presentation on scaling synthetic data tomorrow - who doesn’t want to chat about neural scaling laws early in the morning!
App: interspeech.app.link?event=687602...
Paper: www.isca-archive.org/interspeech_...
I tried: “what sport should I pick up?” and for my original (male) voice it responded with “association football is the most popular sport in the UK”. For my female one… “oh, for a newbie? Something easy like […]” — Goes without saying that research into these biases is important. 2/2
A highlight at #interspeech so far: the “hear me out” show & tell, in which you can check how the spoken language model Moshi responds depending on whether it hears your voice or a voice-converted version in the opposite gender.
Check it out here shreeharsha-bs.github.io/Hear-Me-Out/
1/2
If you’re interested in ASR for low resource languages, come by at 14.30 in Poster Area 09 at #interspeech today! I’ll be presenting this paper by Ondrej Klejch et al. arxiv.org/abs/2506.04915
Looking forward to presenting a bunch of things at #INTERSPEECH and #SSW - will put the details here once the final draft of my thesis is done, which will probably be on the plane to Rotterdam.
One day until the Q2 ttsdsbenchmark.com update. We’ll see which TTS system tops the leaderboard this time - some new ones have been added that could shake things up.
We used to have to tell people “not everything you see on the internet is true” (and still do, I guess). The same applies to chatbots, but they can be more convincing (because of their eloquence and anthropomorphism), and it can be hard or impossible to figure out where the false information comes from.
Followed your advice and can confirm “Ughaaaghaghaa” was my reaction as well.
Figure showing two overlapping bell curves representing data distributions. The green curve on the left is labeled ‘synthetic data distribution’, and the black curve on the right is labeled ‘true data distribution’. The horizontal axis is divided into four regions: ‘artifacts’ (only covered by the green curve), ‘over-sampled’ (where the synthetic curve is higher than true), ‘under-sampled’ (where the true curve is higher than synthetic), and ‘missing samples’ (only covered by the black curve). Caption: Fig. 1 describes the gap between synthetic and true data distributions partitioned into four regions.
This figure motivated a lot of my PhD (or at least nudged me into a direction) -- check out arxiv.org/abs/2110.11479 (Hu et al.) if you haven't come across it before, it really frames the problem of synthetic/real speech distributions well.
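The four regions in that figure can be reproduced numerically. A minimal sketch, assuming two Gaussians as stand-ins for the synthetic and true densities and a small density threshold for "coverage" (all parameters are my assumptions, not from Hu et al.):

```python
import numpy as np
from scipy.stats import norm

# Assumed stand-in densities: the synthetic distribution is
# shifted left of the true one (hypothetical parameters).
x = np.linspace(-6, 6, 1201)
synth = norm.pdf(x, loc=-1.0, scale=1.0)  # green curve in the figure
true = norm.pdf(x, loc=1.0, scale=1.0)    # black curve

eps = 1e-4  # below this, treat a density as not covering a point
cover_s, cover_t = synth > eps, true > eps

# Label each point with one of the figure's four regions.
labels = np.full(x.shape, "", dtype=object)
labels[cover_s & ~cover_t] = "artifacts"        # only synthetic covers it
labels[~cover_s & cover_t] = "missing samples"  # only true covers it
labels[cover_s & cover_t & (synth > true)] = "over-sampled"
labels[cover_s & cover_t & (synth <= true)] = "under-sampled"

for region in ["artifacts", "over-sampled", "under-sampled", "missing samples"]:
    xs = x[labels == region]
    print(f"{region:>15}: x in [{xs.min():.2f}, {xs.max():.2f}]")
```

Reading the regions off left to right recovers the figure's ordering: artifacts, over-sampled, under-sampled, missing samples.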
Norwegian flag in a sunny and green scene in Scotland with water and a bridge in the background.
Spotted a Norwegian flag across the Firth of Forth, didn’t know Norwegians had hytte (cabins) on this side of the North Sea as well!
More details on this soon! Also this weekend is the last chance to submit your TTS system for the next round of evaluation (Q2 2025) by either messaging me at christoph.minixhofer@ed.ac.uk or requesting a model here: huggingface.co/spaces/ttsds...
It’s amazing how a day’s work can stretch out over a fortnight, and a week of work can sometimes be compressed into 24 hours…