TY - GEN
T1 - OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos
AU - Ramazanova, Merey
AU - Escorcia, Victor
AU - Heilbron, Fabian Caba
AU - Zhao, Chen
AU - Ghanem, Bernard
N1 - KAUST Repository Item: Exported on 2023-08-21
Acknowledgements: This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding, and SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).
PY - 2023/6
Y1 - 2023/6
N2 - Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multi-modal signals. However, most current localization methods use third-person videos and only incorporate visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric Temporal Action Localization (TAL). We validate our approach on two large-scale datasets, EPIC-KITCHENS and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context. Namely, we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% on these datasets, respectively.
UR - http://hdl.handle.net/10754/675565
UR - https://ieeexplore.ieee.org/document/10208463/
U2 - 10.1109/cvprw59228.2023.00516
DO - 10.1109/cvprw59228.2023.00516
M3 - Conference contribution
BT - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
PB - IEEE
ER -