ISCA Archive AVSP 2015

Comparison of single-model and multiple-model prediction-based audiovisual fusion

Stavros Petridis, Varun Rajgarhia, Maja Pantic

Prediction-based fusion is a recently proposed audiovisual fusion approach which outperforms feature-level fusion on laughter-vs-speech discrimination. One set of predictive models is trained per class; it learns the audio-to-visual and visual-to-audio feature mappings together with the time evolution of the audio and visual features. Classification of a new input is performed via prediction: the predictors of each class produce predictions of the expected audio/visual features, and their prediction errors are combined per class. The model which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, provides its label to the input. In all previous work, a single set of predictors was trained on the entire training set for each class. In this work, we investigate the use of multiple sets of predictors per class. The main idea is that since the models are trained on clusters of the data, they become more specialised and produce lower prediction errors, which in turn can enhance classification performance. We experimented with subject-based clustering and with clustering based on laughter type, voiced and unvoiced. Results are presented on laughter-vs-speech discrimination in a cross-database experiment using the AMI and MAHNOB databases. The use of multiple sets of models results in a significant performance increase, with the latter clustering approach achieving the best performance. Overall, an increase of over 4% and 10% is observed in F1 for speech and laughter, respectively, on both datasets.
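The classification rule described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it assumes hypothetical linear regressors as stand-ins for the learned audio-to-visual and visual-to-audio predictors, summed mean-squared error as the combined prediction error, and a minimum-over-sets rule for the multiple-model variant, none of which are specified here beyond the abstract.

```python
# Sketch of prediction-based fusion classification (illustrative only).
# ClassPredictors stands in for one set of cross-modal predictive models;
# LinearRegression is a hypothetical choice of predictor.
import numpy as np
from sklearn.linear_model import LinearRegression


class ClassPredictors:
    """Audio->visual and visual->audio predictors for a single class."""

    def __init__(self):
        self.a2v = LinearRegression()
        self.v2a = LinearRegression()

    def fit(self, audio_feats, visual_feats):
        self.a2v.fit(audio_feats, visual_feats)
        self.v2a.fit(visual_feats, audio_feats)
        return self

    def prediction_error(self, audio_feats, visual_feats):
        # Combine the two cross-modal prediction errors (here: summed MSE;
        # the actual combination rule is an assumption).
        err_v = np.mean((self.a2v.predict(audio_feats) - visual_feats) ** 2)
        err_a = np.mean((self.v2a.predict(visual_feats) - audio_feats) ** 2)
        return err_v + err_a


def classify(audio_feats, visual_feats, predictors_by_class):
    # Single-model variant: one predictor set per class; the class whose
    # predictors yield the lowest prediction error provides the label.
    errors = {label: p.prediction_error(audio_feats, visual_feats)
              for label, p in predictors_by_class.items()}
    return min(errors, key=errors.get)


def classify_multi(audio_feats, visual_feats, predictor_sets_by_class):
    # Multiple-model variant: each class has several predictor sets (e.g. one
    # per subject cluster or per laughter type). Taking the minimum error
    # among a class's sets is one plausible combination rule.
    errors = {label: min(p.prediction_error(audio_feats, visual_feats)
                         for p in sets)
              for label, sets in predictor_sets_by_class.items()}
    return min(errors, key=errors.get)
```

In this sketch, the multiple-model variant simply lets the most specialised predictor set represent its class, which mirrors the intuition in the abstract that cluster-specific models produce lower prediction errors for inputs resembling their cluster.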

Index Terms: Prediction-based fusion, Audiovisual fusion, Nonlinguistic Information Processing