WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks

AC Duarte, F Roldan, M Tubau, J Escur… - ICASSP, 2019 - openaccess.thecvf.com
Audio and visual signals are the most common modalities used by humans to identify other humans and sense their emotional state. Features extracted from these two signals are often highly correlated, allowing us to imagine the visual appearance of a person just by listening to their voice, or to build expectations about the tone or pitch of their voice just by looking at a picture of the speaker. When it comes to image generation, however, this multimodal correlation is still under-explored.

In this paper, we focus on cross-modal visual generation, more specifically, the generation of facial images given a speech signal. Unlike recent works, we aim to generate the whole face image at pixel level, conditioning only on the raw speech signal (i.e., without the use of any handcrafted features) and without requiring any prior knowledge (e.g., a speaker image or face model). To this end, we propose a conditional generative adversarial model (shown in Figure 1) that is trained on the aligned audio and video channels in a self-supervised way.

Learning such a model requires high-quality, aligned samples. This makes the most commonly used datasets, such as Lip Reading in the Wild [6] or VoxCeleb [17], unsuitable for our approach, as the position of the speaker, the background, and the quality of the videos and the acoustic signal can vary significantly across samples. We therefore built a new video dataset from YouTube, composed of videos uploaded to the platform by well-established users (commonly known as youtubers), who record themselves speaking in front of the camera in their personal home studios. Hence, our main contributions …
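To make the speech-conditioned GAN setup concrete, the sketch below shows one plausible arrangement of the three pieces the text describes: an encoder that maps the raw waveform to an embedding, a generator that produces a face image from that embedding plus noise, and a discriminator that scores image/speech pairs. All layer sizes, the 64x64 output resolution, and the class names are illustrative assumptions for this sketch, not the authors' exact Wav2Pix architecture.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encodes a raw waveform into a fixed-size embedding (hypothetical layer sizes)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # collapse the time axis
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, wav):                     # wav: (B, 1, T) raw samples
        h = self.net(wav).squeeze(-1)           # (B, 128)
        return self.proj(h)                     # (B, embed_dim)

class Generator(nn.Module):
    """Maps speech embedding + noise to a 64x64 RGB face image."""
    def __init__(self, embed_dim=128, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(embed_dim + noise_dim, 256, 4, 1, 0), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),                    # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),                     # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),                      # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                       # 64x64
        )

    def forward(self, speech_emb, noise):
        z = torch.cat([speech_emb, noise], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

class Discriminator(nn.Module):
    """Scores an image/speech pair: aligned real pair vs. generated image."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 16x16
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 8x8
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2), # 4x4
        )
        self.out = nn.Linear(256 * 4 * 4 + embed_dim, 1)

    def forward(self, img, speech_emb):
        h = self.conv(img).flatten(1)
        return self.out(torch.cat([h, speech_emb], dim=1))

# Smoke test with random tensors standing in for one aligned speech/frame batch.
enc, gen, dis = SpeechEncoder(), Generator(), Discriminator()
wav = torch.randn(4, 1, 16000)                  # ~1 s of raw audio at 16 kHz
emb = enc(wav)
fake = gen(emb, torch.randn(4, 100))            # (4, 3, 64, 64) generated faces
score = dis(fake, emb)                          # (4, 1) real/fake logits
```

In the self-supervised setting the paper describes, the "labels" come for free from alignment: a frame paired with the speech of its own video clip is a real example for the discriminator, while generated images conditioned on that same speech are fakes, so no manual annotation is needed.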