ISCA Archive SSW 2010

Speech synthesis without the right data

Simon King

Constructing a speech synthesiser using a large, carefully recorded, single-speaker database of read text is straightforward and good results can be obtained every time, using conventional concatenative or statistical parametric methods. But this scenario is very restrictive and those good results are only actually obtained if: the voice is built offline and carefully checked for errors; the speech is recorded in clean conditions; the word transcriptions are correct; accurate phonetic labels are available or can be obtained; the speech is in the required language and speaking style, from a suitable speaker; etc. A large number of applications become possible if we can escape these restrictions - applications where one or more of the above conditions is not satisfied. Examples include: prosthetic voices, where the speech available from the patient may already be disordered, is very limited in quantity and recorded under unfavourable conditions; cross-lingual voices, where the aim is to produce a synthetic voice in a target language that sounds like a particular speaker, yet we only have sample speech from that speaker in some other source language; voices for accents or languages where we do not have detailed knowledge of the phonology or where other resources, such as pronunciation dictionaries or prosodic models, are unavailable; and situations where we only have untranscribed speech from which to build a voice. In this tutorial, I will look at a few examples of such applications and describe some of the techniques that can be used to construct synthetic voices in scenarios where the conventional approach is not possible.
