TY - JOUR
T1 - Learning a strong detector for action localization in videos
AU - Zhang, Yongqiang
AU - Ding, Mingli
AU - Bai, Yancheng
AU - Liu, Dandan
AU - Ghanem, Bernard
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: This work was supported by Natural Science Foundation of China, Grant no. 61603372.
PY - 2019/10/9
Y1 - 2019/10/9
N2 - We address the problem of spatio-temporal action localization in videos. Current state-of-the-art methods for this challenging task rely on an object detector to first localize actors at the frame level and then link or track the detections across time. Most of these methods focus on leveraging the temporal context of videos for action detection while overlooking the importance of the object detector itself. In this paper, we demonstrate the importance of the object detector in the action localization pipeline and propose a strong object detector, based on the single shot multibox detector (SSD) framework, for better action localization in videos. Unlike SSD, we introduce an anchor refinement branch at the end of the backbone network to refine the input anchors, and add a batch normalization layer before concatenating the intermediate feature maps at the frame level and after stacking feature maps at the clip level. The proposed strong detector makes two contributions: (1) it reduces the number of missed target objects at the frame level; (2) it generates deformable anchor cuboids for modeling temporally dynamic actions. Extensive experiments on UCF-Sports, J-HMDB and UCF-101 validate our claims, and we outperform the previous state-of-the-art methods by a large margin in terms of frame-mAP and video-mAP, especially at higher overlap thresholds.
UR - http://hdl.handle.net/10754/659224
UR - https://linkinghub.elsevier.com/retrieve/pii/S0167865519302776
UR - http://www.scopus.com/inward/record.url?scp=85073219145&partnerID=8YFLogxK
DO - 10.1016/j.patrec.2019.10.005
M3 - Article
SN - 0167-8655
VL - 128
SP - 407
EP - 413
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
ER -