Abstract
Current methods for active speaker detection focus on modeling audiovisual information from a single speaker. This strategy can be adequate in single-speaker scenarios, but it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our new model learns pairwise and temporal relations from a structured ensemble of audiovisual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. We also find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. Moreover, ablation studies verify that this result is a direct consequence of our long-term multi-speaker analysis.
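To make the abstract's idea concrete, below is a minimal sketch, not the authors' released code, of one plausible reading of the architecture: per-speaker audiovisual embeddings are assembled into a structured ensemble, pairwise relations are modeled across candidate speakers, and temporal relations are modeled over a long horizon. All names, shapes, and layer choices (`ActiveSpeakerContextSketch`, attention for the pairwise stage, an LSTM for the temporal stage) are illustrative assumptions.

```python
# Hypothetical sketch of the Active Speaker Context idea; module names,
# shapes, and layer choices are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ActiveSpeakerContextSketch(nn.Module):
    """Scores whether a reference speaker is talking, given audiovisual
    features for S candidate speakers over T timesteps."""

    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        # Pairwise relations: self-attention across the speaker axis.
        self.pairwise = nn.MultiheadAttention(feat_dim, num_heads=4,
                                              batch_first=True)
        # Temporal relations: recurrence over the (long) time axis.
        self.temporal = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, ensemble: torch.Tensor) -> torch.Tensor:
        # ensemble: (B, S, T, D) audiovisual features; by convention here,
        # speaker index 0 is the reference whose activity we predict.
        B, S, T, D = ensemble.shape
        # Fold time into the batch so attention runs across speakers only.
        x = ensemble.permute(0, 2, 1, 3).reshape(B * T, S, D)  # (B*T, S, D)
        x, _ = self.pairwise(x, x, x)                # speaker-context features
        x = x.reshape(B, T, S, D)[:, :, 0, :]        # keep the reference track
        x, _ = self.temporal(x)                      # (B, T, hidden_dim)
        return self.classifier(x).squeeze(-1)        # per-frame logits


# Usage: 2 clips, 3 candidate speakers, 64 timesteps, 128-dim features.
scores = ActiveSpeakerContextSketch()(torch.randn(2, 3, 64, 128))
print(scores.shape)  # torch.Size([2, 64])
```

The two-stage split mirrors the abstract's claim: the structured ensemble alone (stacking all candidates' features) supplies speaker context, while the temporal stage extends the analysis beyond short clips.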
| Original language | English (US) |
| --- | --- |
| Title of host publication | 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
| Publisher | IEEE |
| ISBN (Print) | 978-1-7281-7169-2 |
| DOIs | |
| State | Published - 2020 |