Speech processing


Involved faculty members:

Odette Scharenborg, Zhengjun Yue


In recent years, the Multimedia Computing (MMC) Group has built up a rich track record in human and automatic speech processing, marked by numerous impactful publications in the field's major conferences and journals. These venues include the Interspeech conference, the IEEE ICASSP conference, and journals such as Speech Communication, Computer Speech & Language, the Journal of the Acoustical Society of America, and IEEE Transactions on Audio, Speech, and Language Processing. The MMC group has pioneered several new research areas, which remain its focus and which make it unique among speech processing research groups worldwide:

  • Using knowledge about human speech processing to help improve deep learning-based automatic speech recognition algorithms
  • Systematic comparisons between human and automatic speech processing architectures and performance
  • Building computational models of human speech processing using techniques from automatic speech recognition
  • Using visualisations of the speech representations in deep neural networks to improve understanding of how speech is processed by deep learning-based automatic speech recognisers

The main focus of the speech processing research at MMC is building speech technology that is available to everyone, irrespective of the language they speak or their type of (disordered) speech. There is a strong focus on under-resourced languages, languages without a common writing system, and under-resourced types of speech (e.g., oral cancer speech and dysarthric speech). Research in these areas has gained momentum through long-running and fruitful collaborations with Johns Hopkins University, Baltimore, MD, USA, and the University of Illinois at Urbana-Champaign, IL, USA, which culminated in a highly successful JSALT workshop in 2017 and several high-impact publications. In our research, we continuously draw parallels with human listening, as humans are the ‘ideal’ speech recognisers: they quickly adapt to new types of speech and learn new languages.

Representative publications

  1. Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication – Special Issue on Bridging the Gap between Human and Automatic Speech Processing, 49, 336-347.
  2. Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stueker, S., Godard, P., Moeller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L., Dupoux, E. (2018). Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the “Speaking Rosetta” JSALT 2017 workshop. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 4979-4983.
  3. Scharenborg, O., Tiesmeyer, S., Hasegawa-Johnson, M., Dehak, N. (2018). Visualizing phoneme category adaptation in deep neural networks. Proceedings of Interspeech, Hyderabad, India.
  4. Moro Velazquez, L., Cho, J., Watanabe, S., Hasegawa-Johnson, M., Scharenborg, O., Kim, H., Dehak, N. (2019). Study of the performance of automatic speech recognition systems in speakers with Parkinson’s Disease. Proceedings of Interspeech, Graz, Austria.
  5. Halpern, B., van Son, R., van den Brekel, M., Scharenborg, O. (2020). Detecting and analysing spontaneous oral cancer speech in the wild. Proceedings of Interspeech, Shanghai, China.
  6. Feng, S., Scharenborg, O. (2020). Unsupervised subword modeling using autoregressive pretraining and cross-lingual phone-aware modeling. Proceedings of Interspeech, Shanghai, China.
  7. Żelasko, P., Moro Velazquez, L., Hasegawa-Johnson, M., Scharenborg, O., Dehak, N. (2020). That sounds familiar: an analysis of phonetic representations transfer across languages. Proceedings of Interspeech, Shanghai, China.