WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks

AC Duarte, F Roldan, M Tubau, J Escur… - ICASSP, 2019 - openaccess.thecvf.com
Audio and visual signals are the most common modalities used by humans to identify other humans and sense their emotional state. Features extracted from these two signals are often highly correlated, allowing us to imagine the visual appearance of a person just by listening to their voice, or to build expectations about the tone or pitch of their voice just by looking at a picture of the speaker. When it comes to image generation, however, this multimodal correlation is still under-explored.

In this paper, we focus on cross-modal visual generation, more specifically, the generation of facial images given a speech signal. Unlike recent works, we aim to generate the whole face image at pixel level, conditioning only on the raw speech signal (i.e., without the use of any handcrafted features) and without requiring any prior knowledge (e.g., a speaker image or face model). To this end, we propose a conditional generative adversarial model (shown in Figure 1) that is trained on the aligned audio and video channels in a self-supervised way.

Learning such a model requires high-quality, aligned samples. This makes the most commonly used datasets, such as Lip Reading in the Wild [6] or VoxCeleb [17], unsuitable for our approach, as the position of the speaker, the background, and the quality of the videos and the acoustic signal can vary significantly across samples. We therefore built a new video dataset from YouTube, composed of videos uploaded to the platform by well-established users (commonly known as youtubers), who record themselves speaking in front of the camera in their personal home studios. Hence, our main contributions …
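To make the speech-conditioned GAN setup concrete, the sketch below shows one plausible arrangement of the three pieces the text describes: an encoder that maps the raw waveform to an embedding, a generator that produces a face image from that embedding plus noise, and a discriminator that scores image/speech pairs. All layer sizes, the 64x64 output resolution, and the class names are illustrative assumptions for this sketch, not the authors' exact Wav2Pix architecture.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encodes a raw waveform into a fixed-size embedding (hypothetical layer sizes)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # collapse the time axis
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, wav):                     # wav: (B, 1, T) raw samples
        h = self.net(wav).squeeze(-1)           # (B, 128)
        return self.proj(h)                     # (B, embed_dim)

class Generator(nn.Module):
    """Maps speech embedding + noise to a 64x64 RGB face image."""
    def __init__(self, embed_dim=128, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(embed_dim + noise_dim, 256, 4, 1, 0), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),                    # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),                     # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),                      # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                       # 64x64
        )

    def forward(self, speech_emb, noise):
        z = torch.cat([speech_emb, noise], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

class Discriminator(nn.Module):
    """Scores an image/speech pair: aligned real pair vs. generated image."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 16x16
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 8x8
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2), # 4x4
        )
        self.out = nn.Linear(256 * 4 * 4 + embed_dim, 1)

    def forward(self, img, speech_emb):
        h = self.conv(img).flatten(1)
        return self.out(torch.cat([h, speech_emb], dim=1))

# Smoke test with random tensors standing in for one aligned speech/frame batch.
enc, gen, dis = SpeechEncoder(), Generator(), Discriminator()
wav = torch.randn(4, 1, 16000)                  # ~1 s of raw audio at 16 kHz
emb = enc(wav)
fake = gen(emb, torch.randn(4, 100))            # (4, 3, 64, 64) generated faces
score = dis(fake, emb)                          # (4, 1) real/fake logits
```

In the self-supervised setting the paper describes, the "labels" come for free from alignment: a frame paired with the speech of its own video clip is a real example for the discriminator, while generated images conditioned on that same speech are fakes, so no manual annotation is needed.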