Take Note: testing the intelligibility of synthesized speech
Speech synthesis technology has advanced enormously since 1980, when the first Text-To-Speech (TTS) systems became publicly available. Improvements in acoustics, linguistics, and signal processing, along with the advent of artificial intelligence, have produced now-common TTS systems that synthesize human voices intelligibly, even in noisy conditions. Automatic Speech Recognition (ASR) systems have seen similarly large improvements over the years; some ASR systems now distinguish speech from noise better than human listeners do.
Yang et al. evaluated the speech intelligibility of both human and synthesized voices using recordings of four human talkers (two female, two male) and twelve synthesized voices (six female, six male). The synthesized speech was generated by three commercial TTS platforms: Amazon Polly, Microsoft Azure Text-To-Speech, and Google Text-To-Speech.
The researchers then transcribed those recordings, mixed with noise, using five ASR platforms and found that two of the systems tested recognized 10% more words than human listeners did.
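The study's own scoring procedure is not reproduced here, but word-level recognition accuracy of the kind compared above can be sketched as follows. The function name and example sentences are illustrative, not taken from the paper; this simply aligns a transcript against a reference sentence and reports the fraction of reference words recognized.

```python
from difflib import SequenceMatcher

def word_accuracy(reference: str, transcript: str) -> float:
    """Fraction of reference words recognized, scored by aligning
    the transcript's word sequence against the reference's."""
    ref_words = reference.lower().split()
    hyp_words = transcript.lower().split()
    matcher = SequenceMatcher(None, ref_words, hyp_words)
    # Count words that align between the two sequences.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)

# Example: a transcript of a sentence heard in noise, with one word missed.
ref = "the birch canoe slid on the smooth planks"
hyp = "the birch canoe slid on smooth planks"
print(round(word_accuracy(ref, hyp), 3))  # 7 of 8 reference words recognized
```

The same scorer applies whether the transcript comes from a human listener or an ASR system, which is what makes a direct human-versus-ASR comparison possible.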
“With the help of modern ASR systems, we can start thinking about further improving speech synthesis technology,” said author Ye Yang. “For example, we may use ASR to screen highly intelligible speech materials and use these materials to develop a speech synthesis system that produces highly intelligible speech. Such a system is beneficial for hearing loss listeners, new language learners, or noisy scenarios.”
The results also demonstrated a strong correlation between ASR and human recognition results, and they illuminated areas for further advancement.
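A correlation like the one reported is what would let ASR scores stand in for human intelligibility judgments. A minimal Pearson-correlation check over per-voice scores might look like the sketch below; the score lists are invented for illustration and are not the study's data.

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-voice intelligibility scores (fraction of words recognized),
# one entry per synthesized voice, scored by humans and by an ASR system.
human_scores = [0.92, 0.85, 0.78, 0.70, 0.66, 0.55]
asr_scores = [0.95, 0.88, 0.80, 0.74, 0.65, 0.58]
print(round(pearson(human_scores, asr_scores), 3))
```

A coefficient near 1.0 would indicate that voices humans find hard to understand are the same ones the ASR system struggles with, supporting the use of ASR as an intelligibility screen.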
“We may improve speech enhancement, or noise reduction, systems by utilizing the intelligibility prediction power of an ASR system,” said Yang. “There are a lot of potential uses of ASR systems waiting to be explored.”
Source: “Evaluating synthesized speech intelligibility in noise,” by Ye Yang, Dathan Nguyen, Katherine Chen, and Fan-Gang Zeng, JASA Express Letters (2025). The article can be accessed at https://doi.org/10.1121/10.0036397