TY - JOUR
T1 - MAIN: Multi-Attention Instance Network for video segmentation
AU - León Alcázar, Juan
AU - Bravo, María A.
AU - Jeanneret, Guillaume
AU - Thabet, Ali Kassem
AU - Brox, Thomas
AU - Arbeláez, Pablo
AU - Ghanem, Bernard
N1 - KAUST Repository Item: Exported on 2021-06-28
Acknowledgements: This work was partially supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research, and by the German-Colombian Academic Cooperation between the German Research Foundation (DFG grant BR 3815/9-1) and Universidad de los Andes, Colombia.
PY - 2021/6/24
Y1 - 2021/6/24
AB - Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on the integration of generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Network (MAIN), overcomes challenging segmentation scenarios over arbitrary videos without modelling sequence- or instance-specific knowledge. We design MAIN to segment multiple instances in a single forward pass, and optimize it with a novel loss function that favors class-agnostic predictions and assigns instance-specific penalties. We achieve state-of-the-art performance on the challenging YouTube-VOS dataset and benchmark, improving the unseen Jaccard and F-Metric by 6.8% and 12.7% respectively, while operating in real time (30.3 FPS).
UR - http://hdl.handle.net/10754/660665
UR - https://linkinghub.elsevier.com/retrieve/pii/S1077314221000849
U2 - 10.1016/j.cviu.2021.103240
DO - 10.1016/j.cviu.2021.103240
M3 - Article
SN - 1077-3142
SP - 103240
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
ER -