Cross-modal embeddings for video and audio retrieval

D Surís, A Duarte, A Salvador… - Proceedings of the …, 2018 - openaccess.thecvf.com
Proceedings of the European Conference on Computer Vision …, 2018
Abstract
In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and to retrieve images that match a given query audio. The results in terms of Recall@K, obtained over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning.
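To make the evaluation concrete, the following is a minimal sketch of cross-modal retrieval scored with Recall@K, assuming the audio and visual features have already been projected into the common space. It is not the authors' implementation: the use of cosine similarity is an assumption, the function names (`cosine`, `rank_candidates`, `recall_at_k`) are hypothetical, and the two-dimensional vectors are toy values invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        if i in rank_candidates(query, candidates)[:k]:
            hits += 1
    return hits / len(queries)


# Toy joint embeddings (invented): entry i of each list comes from the same clip.
video_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.6, 0.8], [0.2, 0.9]]

# Video-to-audio retrieval: each silent video queries the pool of audio embeddings.
for k in (1, 2, 3):
    print(f"Recall@{k} (video -> audio): {recall_at_k(video_emb, audio_emb, k):.3f}")

# Audio-to-video retrieval uses the same function with the roles swapped.
print(f"Recall@1 (audio -> video): {recall_at_k(audio_emb, video_emb, 1):.3f}")
```

Here a query counts as a hit when the embedding from the other modality of the same clip appears among its K nearest candidates, so Recall@K is the share of queries that are hits.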