TY - GEN
T1 - SegTAD: Precise Temporal Action Detection via Semantic Segmentation
AU - Zhao, Chen
AU - Ramazanova, Merey
AU - Xu, Mengmeng
AU - Ghanem, Bernard
N1 - KAUST Repository Item: Exported on 2023-04-05
Acknowledgements: This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
PY - 2023/2/14
Y1 - 2023/2/14
N2 - Temporal action detection (TAD) is an important yet challenging task in video analysis. Most existing works draw inspiration from image object detection and tend to reformulate it as a proposal generation - classification problem. However, there are two caveats with this paradigm. First, proposals are not equipped with annotated labels, which have to be empirically compiled; thus the information in the annotations is not necessarily employed precisely in the model training process. Second, there are large variations in the temporal scale of actions, and neglecting this fact may lead to deficient representation in the video features. To address these issues and precisely model TAD, we formulate the task from a novel perspective of semantic segmentation. Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free. We take advantage of them to provide precise supervision so as to mitigate the impact induced by the imprecise proposal labels. We propose a unified framework, SegTAD, composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN). We evaluate SegTAD on two important large-scale datasets for action detection, and it shows competitive performance on both datasets.
AB - Temporal action detection (TAD) is an important yet challenging task in video analysis. Most existing works draw inspiration from image object detection and tend to reformulate it as a proposal generation - classification problem. However, there are two caveats with this paradigm. First, proposals are not equipped with annotated labels, which have to be empirically compiled; thus the information in the annotations is not necessarily employed precisely in the model training process. Second, there are large variations in the temporal scale of actions, and neglecting this fact may lead to deficient representation in the video features. To address these issues and precisely model TAD, we formulate the task from a novel perspective of semantic segmentation. Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free. We take advantage of them to provide precise supervision so as to mitigate the impact induced by the imprecise proposal labels. We propose a unified framework, SegTAD, composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN). We evaluate SegTAD on two important large-scale datasets for action detection, and it shows competitive performance on both datasets.
UR - http://hdl.handle.net/10754/677950
UR - https://link.springer.com/10.1007/978-3-031-25069-9_37
UR - http://www.scopus.com/inward/record.url?scp=85151048555&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-25069-9_37
DO - 10.1007/978-3-031-25069-9_37
M3 - Conference contribution
SN - 9783031250682
SP - 576
EP - 593
BT - Lecture Notes in Computer Science
PB - Springer Nature Switzerland
ER -