Cross-modal retrieval, generation and captioning


Involved faculty members:

Alan Hanjalic, Odette Scharenborg

Building on two decades of research on video content analysis and retrieval, including seminal publications on affective video content representation and modelling, cross-modal retrieval, generation and captioning has emerged as a challenging new research direction within the MMC Group. It gained momentum through collaborations with the University of Electronic Science and Technology of China, Chengdu, China, and with the University of Illinois at Urbana-Champaign, IL, USA. Our research targets innovative methodological and algorithmic concepts that automatically infer semantic links between pieces of information conveyed by different modalities. Such concepts can be used in a cross-modal retrieval scenario, in which, for example, an image is found based on a textual or spoken description of its semantic content, but also in scenarios involving “translating” information from one modality into another. Examples of such scenarios are image and video captioning (in text and speech), visual question generation, and image generation from spoken descriptions.
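The core idea behind such cross-modal retrieval can be illustrated with a minimal sketch: once learned encoders (not shown here) have mapped images and textual or spoken descriptions into a shared embedding space, retrieval reduces to a nearest-neighbour search under cosine similarity. The toy embeddings below are hypothetical stand-ins for encoder outputs, not the group's actual models.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def retrieve(query_embs: np.ndarray, gallery_embs: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the top-k gallery items for each query embedding."""
    sims = cosine_similarity(query_embs, gallery_embs)
    return np.argsort(-sims, axis=1)[:, :k]

# Hypothetical example: pretend the image and text encoders have already
# produced embeddings in the same shared space. A matching caption embedding
# lies close to its image embedding; learning such encoders is the hard part.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))                  # 5 images, 8-dim space
text_embs = image_embs + 0.05 * rng.normal(size=(5, 8))  # noisy matching captions

top1 = retrieve(text_embs, image_embs, k=1)  # text-to-image retrieval
```

With well-aligned embeddings, each caption retrieves its corresponding image; in practice the alignment is learned, e.g. with adversarial objectives as in the group's publications below.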

The MMC Group has rapidly built a strong track record in cross-modal retrieval, generation and captioning, marked by numerous impactful publications in the major conference and journal venues in this field. These venues include the ACM International Conference on Multimedia (ACM Multimedia), IEEE Transactions on Image Processing, IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE/ACM Transactions on Audio, Speech, and Language Processing, and IEEE Transactions on Neural Networks and Learning Systems. These publications have already earned two Best Paper Awards, in 2017 and 2019.


Representative publications

  1. B Wang, Y Yang, X Xu, A Hanjalic, HT Shen: Adversarial cross-modal retrieval, ACM International Conference on Multimedia, 2017 (Best Paper Award)
  2. J Song, Y Guo, L Gao, X Li, A Hanjalic, HT Shen: From deterministic to generative: Multimodal stochastic RNNs for video captioning, IEEE Transactions on Neural Networks and Learning Systems, 2018
  3. Y Yang, J Zhou, J Ai, Y Bin, A Hanjalic, HT Shen, Y Ji: Video captioning by adversarial LSTM, IEEE Transactions on Image Processing, 2018
  4. M Hasegawa-Johnson, A Black, L Ondel, O Scharenborg, F Ciannella: Image2speech: Automatically generating audio descriptions of images, Journal of International Science and General Applications, 2018
  5. T Wang, X Xu, Y Yang, A Hanjalic, HT Shen, J Song: Matching images and text with multi-modal tensor fusion and re-ranking, ACM Multimedia, 2019
  6. J Choi, M Larson, G Friedland, A Hanjalic: From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval, IEEE International Conference on Multimedia Big Data (BigMM), 2019 (Best Paper Award)
  7. X Wang, T Qiao, J Zhu, A Hanjalic, O Scharenborg: S2IGAN: Speech-to-Image Generation via Adversarial Learning, Interspeech, Shanghai, China, 2020
  8. J van der Hout, Z D’Haese, M Hasegawa-Johnson, O Scharenborg: Evaluating automatically generated phoneme captions for images, Interspeech, Shanghai, China, 2020