TY - GEN
T1 - Audio-visual Speaker Diarization
T2 - 65th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2022
AU - Fanaras, Konstantinos
AU - Tragoudaras, Antonios
AU - Antoniadis, Charalampos
AU - Massoud, Yehia
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Speaker diarization is the task of identifying 'who spoke when'. Nowadays, speakers' audio clips are usually accompanied by visual information, and recent works have substantially improved speaker diarization performance by exploiting the visual information synchronized with the audio in Audio-Visual (AV) content. This paper presents a deep learning architecture for an AV speaker diarization system with an emphasis on Voice Activity Detection (VAD). Traditional AV speaker diarization systems perform VAD with hand-crafted features, such as Mel-frequency cepstral coefficients. In contrast, the VAD module in the proposed system employs Convolutional Neural Networks (CNN) to learn and extract features directly from the audio waveforms. Experimental results on the AMI Meeting Corpus indicate that the proposed multimodal speaker diarization system reaches a state-of-the-art VAD False Alarm rate thanks to the CNN-based VAD, which in turn boosts the whole system's performance.
AB - Speaker diarization is the task of identifying 'who spoke when'. Nowadays, speakers' audio clips are usually accompanied by visual information, and recent works have substantially improved speaker diarization performance by exploiting the visual information synchronized with the audio in Audio-Visual (AV) content. This paper presents a deep learning architecture for an AV speaker diarization system with an emphasis on Voice Activity Detection (VAD). Traditional AV speaker diarization systems perform VAD with hand-crafted features, such as Mel-frequency cepstral coefficients. In contrast, the VAD module in the proposed system employs Convolutional Neural Networks (CNN) to learn and extract features directly from the audio waveforms. Experimental results on the AMI Meeting Corpus indicate that the proposed multimodal speaker diarization system reaches a state-of-the-art VAD False Alarm rate thanks to the CNN-based VAD, which in turn boosts the whole system's performance.
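N1 - Illustration: the abstract describes a CNN-based VAD that learns features directly from raw audio waveforms instead of hand-crafted MFCCs. The following minimal Python/PyTorch sketch shows that general idea only; the model name, layer sizes, and frame length are illustrative assumptions, not the authors' architecture.

    # Hypothetical sketch of a CNN-based VAD on raw waveform frames (assumes PyTorch).
    import torch
    import torch.nn as nn

    class WaveformVAD(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                # Strided 1-D convolutions learn filterbank-like features
                # directly from the waveform, replacing hand-crafted MFCCs.
                nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(),
                nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.classifier = nn.Linear(64, 1)  # speech / non-speech logit per frame

        def forward(self, wav):                     # wav: (batch, 1, samples)
            x = self.features(wav).squeeze(-1)      # (batch, 64)
            return torch.sigmoid(self.classifier(x))  # speech probability

    # Example: score one 400 ms frame sampled at 16 kHz (untrained weights).
    model = WaveformVAD()
    frame = torch.randn(1, 1, 6400)
    print(model(frame))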
KW - audio-visual
KW - convolutional neural networks
KW - deep learning
KW - false alarm rate
KW - speaker diarization
KW - voice activity detection
UR - http://www.scopus.com/inward/record.url?scp=85137424359&partnerID=8YFLogxK
U2 - 10.1109/MWSCAS54063.2022.9859533
DO - 10.1109/MWSCAS54063.2022.9859533
M3 - Conference contribution
AN - SCOPUS:85137424359
T3 - Midwest Symposium on Circuits and Systems
BT - MWSCAS 2022 - 65th IEEE International Midwest Symposium on Circuits and Systems, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 7 August 2022 through 10 August 2022
ER -