Genuine question: when reading papers from another discipline (especially journal papers), how can one distinguish if they are in a good journal? 🫠
Posts by Zhaofeng Lin
Congrats!! 🤩
This paper investigates how AVSR systems exploit visual information.
Our findings reveal varying patterns across systems, with mismatches to human perception.
We recommend reporting *effective SNR gains* alongside WERs for a more comprehensive performance assessment 🧐
[8/8] 🧵
Results show Auto-AVSR may rely more on audio, with a weaker correlation between MaFI scores and IWERs in AV mode.
In contrast, AVEC shows a stronger use of visual information, with a significant negative correlation, especially in noisy conditions.
[7/8] 🧵
Finally, we explore the relationship between AVSR errors and Mouth & Facial Informativeness (MaFI) scores.
We calculated Individual WERs (IWERs) for each word and computed Pearson correlations between MaFI scores and IWERs for audio-only, video-only, and AV models. [6/8] 🧵
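A minimal sketch of the correlation analysis above, with made-up MaFI scores and IWERs (the real values come from the paper's data, not from here):

```python
# Hypothetical per-word MaFI scores and Individual WERs (IWERs); all
# numbers invented purely to illustrate the Pearson correlation step.
import numpy as np

def pearson_r(x, y):
    # Standard sample Pearson correlation coefficient.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# If a model exploits the visual stream, more visually informative words
# (higher MaFI) should tend to have lower IWERs -> negative correlation.
mafi = [0.2, 0.5, 0.9, 1.3, 1.8]
iwer = [0.60, 0.55, 0.40, 0.30, 0.15]

r = pearson_r(mafi, iwer)  # strongly negative for this toy data
```

A significant negative `r` (as AVEC shows) is the signature of visual-information use described in the next post.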
Occlusion tests reveal AVSR models rely differently on visual segments.
Auto-AVSR & AV-RelScore are equally affected by initial & middle occlusions, while AVEC is more impacted by middle occlusion.
Unlike humans, AVSR models do not depend on initial visual cues.
[5/8] 🧵
Speech perception research shows that visual cues at the start of a word have a stronger impact on humans.
We test AVSR by occluding the initial vs. middle third of frames for each word, comparing 3 conditions: no occlusion, initial occlusion, and middle occlusion.
[4/8] 🧵
First, we revisit *effective SNR gain* - the SNR difference between 0 dB and the point at which the AVSR system's WER equals the audio-only system's WER at 0 dB.
This metric quantifies how much the visual modality reduces WER relative to the audio-only system. [3/8] 🧵
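A hedged sketch of how the effective SNR gain could be computed from WER-vs-SNR curves by linear interpolation; all SNRs and WERs below are invented for illustration:

```python
# Toy WER-vs-SNR curves (all numbers made up). The effective SNR gain is
# the distance from 0 dB to the SNR at which the AVSR WER matches the
# audio-only WER measured at 0 dB.
import numpy as np

snrs      = np.array([-10.0, -5.0, 0.0, 5.0])   # test SNRs (dB)
wer_audio = np.array([0.80, 0.55, 0.30, 0.15])  # audio-only WERs
wer_avsr  = np.array([0.55, 0.30, 0.12, 0.06])  # audio-visual WERs

ref = np.interp(0.0, snrs, wer_audio)           # audio-only WER at 0 dB

# WER falls as SNR rises, so flip both arrays to interpolate SNR from WER
# (np.interp needs an increasing x-axis).
match_snr = np.interp(ref, wer_avsr[::-1], snrs[::-1])
effective_snr_gain = 0.0 - match_snr            # in dB
```

Here the toy AVSR curve reaches the 0 dB audio-only WER at -5 dB, giving a 5 dB gain, i.e. the visual stream "buys" 5 dB of noise robustness.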
In this paper, we take a step back to assess SOTA systems (all from 2023) from a different perspective by considering *human speech perception*.
Through this, we hope to gain insights as to whether the visual component is being fully exploited in existing AVSR systems. [2/8] 🧵
Unfortunately, due to a visa issue, I won’t attend #ICASSP2025 in person - but I’ll share my work here! 😀
Excited to share my first ICASSP paper: ‘Uncovering the Visual Contribution in Audio-Visual Speech Recognition’ 📄
🔗https://ieeexplore.ieee.org/abstract/document/10888423
[1/8] 🧵
My first PhD paper, “Uncovering the Visual Contribution in Audio-Visual Speech Recognition”, was accepted to ICASSP2025 🇮🇳!
#ICASSP2025
icassp rebuttal submitted🤞🤞
🙋‍♂️
Hi speech people, super exciting news here!
We are running another "Multimodal information based speech (MISP)" Challenge at @interspeech.bsky.social
Participate!
Spread the word!
More info 👇
mispchallenge.github.io/mispchalleng...
Checking on this side to see if the sky is blue 🤔