
Cross-modal embeddings for video and audio retrieval


D Surís, A Duarte, A Salvador, J Torres, X Giró-i-Nieto

Proceedings of the European Conference on Computer Vision …, 2018 • openaccess.thecvf.com


Abstract

In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K, obtained over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning.
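
To make the Recall@K metric concrete, the following is a minimal sketch of how it could be computed for cross-modal retrieval once both modalities live in a common space. It is not the authors' evaluation code: the cosine-similarity ranking, the convention that the i-th video is paired with the i-th audio clip, the helper names `cosine` and `recall_at_k`, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        # Rank every gallery item by similarity to the query, best first.
        ranked = sorted(
            range(len(gallery)),
            key=lambda j: cosine(query, gallery[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy embeddings, invented for illustration (assumed already projected
# into the common space); video_emb[i] is paired with audio_emb[i].
video_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.7, 0.8], [0.1, 1.0]]

for k in (1, 2, 3):
    print(f"video -> audio Recall@{k}: {recall_at_k(video_emb, audio_emb, k):.3f}")
```

Swapping the two arguments (`recall_at_k(audio_emb, video_emb, k)`) gives the audio-to-video direction under the same assumptions.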


