Audio-visual Speaker Diarization: Improved Voice Activity Detection with CNN based Feature Extraction

Konstantinos Fanaras, Antonios Tragoudaras, Charalampos Antoniadis, Yehia Massoud

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

Speaker diarization is the task of identifying 'who spoke when'. Today, speakers' audio clips are usually accompanied by visual information, and recent work has substantially improved speaker diarization performance by exploiting the visual information synchronized with the audio in Audio-Visual (AV) content. This paper presents a deep learning architecture for an AV speaker diarization system that emphasizes Voice Activity Detection (VAD). Traditional AV speaker diarization systems perform VAD with hand-crafted features, such as Mel-frequency cepstral coefficients. In contrast, the VAD module in the proposed system employs Convolutional Neural Networks (CNNs) to learn and extract features directly from the audio waveforms. Experimental results on the AMI Meeting Corpus indicate that the proposed multimodal speaker diarization system reaches a state-of-the-art VAD False Alarm rate thanks to the CNN-based VAD, which in turn boosts the performance of the whole system.
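The abstract reports a VAD False Alarm rate but does not give the exact scoring formula. As an illustration only, the sketch below assumes a frame-level definition: the fraction of reference non-speech frames that the detector labels as speech. The function name `vad_false_alarm_rate` and the 0/1 frame labels are hypothetical and not taken from the paper; diarization scoring tools often normalize by total reference speech time instead, so the published numbers may follow a different convention.

```python
from typing import Sequence


def vad_false_alarm_rate(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of reference non-speech frames (0) predicted as speech (1).

    Assumed definition for illustration; not confirmed by the paper.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of frames")

    # Count frames that the reference marks as non-speech.
    non_speech = sum(1 for r in reference if r == 0)
    if non_speech == 0:
        # No non-speech frames, so no false alarm is possible.
        return 0.0

    # A false alarm is a non-speech frame that the detector marked as speech.
    false_alarms = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    return false_alarms / non_speech


# Toy example: 5 non-speech frames, 2 of them wrongly flagged as speech.
reference = [0, 0, 1, 1, 1, 0, 0, 0]
predicted = [0, 1, 1, 1, 0, 0, 0, 1]
rate = vad_false_alarm_rate(reference, predicted)
print(rate)  # 0.4
```

Under this assumed definition, the missed speech frame in the toy example does not affect the result: only speech predictions on non-speech frames count as false alarms.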

Original language: English (US)
Title of host publication: MWSCAS 2022 - 65th IEEE International Midwest Symposium on Circuits and Systems, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665402798
State: Published - 2022
Event: 65th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2022 - Fukuoka, Japan
Duration: Aug 7 2022 - Aug 10 2022

Publication series

Name: Midwest Symposium on Circuits and Systems
Volume: 2022-August
ISSN (Print): 1548-3746

Conference

Conference: 65th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2022
Country/Territory: Japan
City: Fukuoka
Period: 08/7/22 - 08/10/22

Keywords

  • audio-visual
  • convolutional neural networks
  • deep learning
  • false alarm rate
  • speaker diarization
  • voice activity detection

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Electrical and Electronic Engineering
