Posts by Christoph Minixhofer

I'll be at @iclr-conf.bsky.social this week, presenting TTSDS2 as an oral on Friday at 11:18 local time (iclr.cc/virtual/2026...) - looking forward to meeting people there!

1 day ago 0 0 0 0

Currently on three different papers using BWS, with three different methods of running the listening tests because of different universities and first authors. It’s time we had an open-source framework for listening tests that is well maintained and easy to use. If you know of any, let me know!

2 months ago 1 0 0 0

Turns out it’s an oral. Looking forward to Rio 🇧🇷

2 months ago 0 0 0 0
GitHub - ttsds/ttsdb: A database for modern, open-source TTS systems. Contribute to ttsds/ttsdb development by creating an account on GitHub.

A pre-release of *ttsdb*, my collection of SOTA TTS models, is out now - github.com/ttsds/ttsdb

The aim is to provide a simple CLI and a collection of Python packages that make it easy to synthesise speech across a variety of models. Docs and website coming soon!

2 months ago 1 0 0 0
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text... Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are...

🧪 My paper on Text-to-Speech evaluation using distributional measures has been accepted to ICLR 2026! 🎉
openreview.net/forum?id=uGa...
In my opinion, we should focus much more on the distributions of synthetically generated speech, and we showed this correlates highly with human ratings.
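To make "distributions of synthetic speech" concrete, here is a minimal sketch of one common distributional distance, the 2-Wasserstein distance between two one-dimensional samples. It is an illustration only and not the benchmark's implementation: the function name `wasserstein2_1d`, the pitch-like numbers, and the two hypothetical TTS systems are all assumptions made for this example.

```python
import numpy as np


def wasserstein2_1d(a, b):
    """2-Wasserstein distance between two equal-sized 1-D samples.

    For equal sample sizes in one dimension, the optimal transport plan
    pairs the sorted values, so the distance is the RMS difference
    between the two sorted samples.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Hypothetical data: a pitch-like feature (Hz) for real and synthetic speech.
rng = np.random.default_rng(0)
real = rng.normal(loc=120.0, scale=20.0, size=5000)
close_tts = rng.normal(loc=122.0, scale=19.0, size=5000)  # similar spread
flat_tts = rng.normal(loc=120.0, scale=5.0, size=5000)    # too little variation

d_close = wasserstein2_1d(real, close_tts)
d_flat = wasserstein2_1d(real, flat_tts)
print(f"close system: {d_close:.2f}, flat system: {d_flat:.2f}")
```

In this toy setup the "flat" system matches the mean of the real data exactly but still ends up much further away, which is the kind of mismatch a per-utterance average score can hide.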

2 months ago 3 0 1 0
List of pangrams There used to be a page on Wikipedia listing pangrams in various languages. This was deleted yesterday. Pangrams can be occasionally useful for designers, so I’ve resurrected the page here, pretty ...

Just came across this wonderful blogpost on pangrams in many languages. If only there was a similar collection full of phonetic pangrams!
clagnut.com/blog/2380/#P...

2 months ago 1 0 0 0
Text-to-speech voices as human remains — Centre for Technomoral Futures Speech technology researchers worldwide are working on improving the smoothness and fidelity of text-to-speech towards the goal of accessible communication for all. However, TTS models are also being ...

www.technomoralfutures.uk/news-databas...
Happy Monday! Here's me thinking about speech tech, voices, and death thanks to the lovely @technomoralfutures.bsky.social

content notes: discussion of death, grief, online abuse

4 months ago 8 4 0 0
Quantifying the Distributional Distance between Synthetic and Real Speech (Pre-Viva Talk) YouTube video by Christoph Minixhofer

Passed my viva yesterday 🥳
Here's the pre-viva talk if anyone's interested, my work was/is about quantifying the distributional distance between real and synthetic speech.
youtu.be/Ii-6buwAoCg

5 months ago 1 0 0 0

First time going to a big gym in the UK, and somehow the practice of saying a little “sorry” as you go past someone cracks me up in that setting.

5 months ago 1 0 0 0

Fill in the blank:

"My p-value is smaller than 0.05, so..."

Wrong answers only.

5 months ago 4 2 2 0
Still Not Significant What to do if your p-value is just over the arbitrary threshold for ‘significance’ of p=0.05? You don’t need to play the significance testing game – there are better methods…

… so glad I ran 20 experiments this time!

If your p-value remains stubbornly above 0.05, there are some creative ways to describe that as well, see this blog post: mchankins.wordpress.com/2013/04/21/s...
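The arithmetic behind the "20 experiments" joke can be sketched in a few lines. This assumes 20 independent tests of true null hypotheses at a 0.05 threshold; the function name `familywise_error_rate` and the Bonferroni correction shown alongside it are illustrative additions and do not come from the linked blog post.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


alpha = 0.05
n_experiments = 20

# Chance that at least one of the 20 experiments looks "significant" by luck.
p_any_false_positive = familywise_error_rate(alpha, n_experiments)

# Bonferroni correction: per-test threshold that keeps the overall rate near alpha.
bonferroni_alpha = alpha / n_experiments

print(f"P(at least one false positive) = {p_any_false_positive:.3f}")
print(f"Bonferroni per-test threshold  = {bonferroni_alpha:.4f}")
```

Under these assumptions the chance of at least one spurious "significant" result is roughly 64%, which is why running 20 experiments and reporting the one that worked is a wrong answer.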

5 months ago 5 1 1 0

I don't download new HF models often, but when I do, it's during the 0.008% of downtime :(

6 months ago 0 0 0 0

TTSDS2 is one of the papers accepted by the @neuripsconf.bsky.social area chairs but rejected by the senior area chairs with no explanation as to why. A bit frustrating after the long review process.

7 months ago 0 0 0 0

100% agreed, also crisps are a snack, not a side dish for lunch

7 months ago 1 0 0 0

Accents are also best seen as a distribution, not a group of labels imo. We tried to incorporate some proxy of accent in TTSDS2, but a simple phone distribution did not work all that well, probably because it’s hard to disentangle from lexical content…

7 months ago 1 0 1 0

It's been a great #interspeech2025!
I presented a TTS-for-ASR paper:
www.isca-archive.org/interspeech_...
And one on prosody reps: www.isca-archive.org/interspeech_...
There were many interesting questions & comments - if you have more and didn't get the chance to ask, feel free to send me a message.

8 months ago 2 0 0 0

I’ll be presenting this tomorrow at 8.50 at #interspeech2025, come by if you’re interested in prosodic representations!

8 months ago 1 0 0 0

Thank you to everyone who stopped by, I’m grateful for all the feedback and interesting questions #interspeech2025

8 months ago 1 0 0 0

In other news — if you’re an early bird and at #interspeech, feel free to drop by my poster presentation on scaling synthetic data tomorrow - who doesn’t want to chat about neural scaling laws early in the morning!
App: interspeech.app.link?event=687602...
Paper: www.isca-archive.org/interspeech_...

8 months ago 2 0 1 0

I tried: “what sport should I pick up?” and for my original (male) voice it responded with “association football is the most popular sport in the UK”. For my female one… “oh, for a newbie? Something easy like […]” — Goes without saying that research into these biases is important. 2/2

8 months ago 3 1 0 0
Hear Me Out Interactive evaluation and bias discovery platform for speech-to-speech conversational AI

A highlight at #interspeech so far: the “hear me out” show&tell, in which you can check how the spoken language model Moshi responds depending on whether it hears your own voice or a version converted to the opposite gender.
Check it out here shreeharsha-bs.github.io/Hear-Me-Out/
1/2

8 months ago 3 1 1 0
A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantiti...

If you’re interested in ASR for low resource languages, come by at 14.30 in Poster Area 09 at #interspeech today! I’ll be presenting this paper by Ondrej Klejch et al. arxiv.org/abs/2506.04915

8 months ago 2 0 0 0

Looking forward to presenting a bunch of things at #INTERSPEECH and #SSW - will put the details here once my thesis final draft is done, which will probably be on the plane to Rotterdam.

8 months ago 0 0 0 0

One day until the Q2 ttsdsbenchmark.com update. We’ll see which TTS system tops the leaderboard this time - some new ones have been added that could shake things up.

9 months ago 0 0 0 0

We used to have to tell people “not everything you see on the internet is true” (and still do, I guess). The same applies to chatbots, but they can be more convincing (because of their eloquence and anthropomorphism), and it is hard or impossible to figure out where the false information comes from.

9 months ago 8 0 0 0

Followed your advice and can confirm “Ughaaaghaghaa” was my reaction as well.

9 months ago 0 0 0 0
Figure showing two overlapping bell curves representing data distributions. The green curve on the left is labeled ‘synthetic data distribution’, and the black curve on the right is labeled ‘true data distribution’. The horizontal axis is divided into four regions: ‘artifacts’ (only covered by the green curve), ‘over-sampled’ (where the synthetic curve is higher than true), ‘under-sampled’ (where the true curve is higher than synthetic), and ‘missing samples’ (only covered by the black curve). Caption: Fig. 1 describes the gap between synthetic and true data distributions partitioned into four regions.

This figure motivated a lot of my PhD (or at least nudged me into a direction) -- check out arxiv.org/abs/2110.11479 (Hu et al.) if you haven't come across it before, it really frames the problem of synthetic/real speech distributions well.

9 months ago 0 0 0 0
Norwegian flag in a sunny and green scene in Scotland with water and a bridge in the background.

Spotted a Norwegian flag across the Firth of Forth, didn’t know Norwegians had hytte on this side of the North Sea as well!

9 months ago 0 0 0 0

More details on this soon! Also this weekend is the last chance to submit your TTS system for the next round of evaluation (Q2 2025) by either messaging me at christoph.minixhofer@ed.ac.uk or requesting a model here: huggingface.co/spaces/ttsds...

9 months ago 1 1 0 0

It’s amazing how a day’s work can stretch out over a fortnight, and a week of work can be compressed into 24 hours sometimes…

9 months ago 0 0 0 0