With video data dominating internet traffic, it is crucial to develop automated models that can analyze and understand what humans do in videos. Such models must solve tasks such as action classification, temporal activity localization, spatiotemporal action detection, and video captioning. This dissertation aims to identify the challenges hindering progress in human action understanding and to propose novel solutions that overcome them. We identify three challenges: (i) the lack of tools to systematically profile algorithms' performance and understand their strengths and weaknesses, (ii) the high cost of large-scale video annotation, and (iii) the prohibitively large memory footprint of untrimmed videos, which forces localization algorithms to operate atop precomputed temporally-insensitive clip features.
To address the first challenge, we propose a novel diagnostic tool to analyze the performance of action detectors and compare different methods beyond a single scalar metric. We use our tool to analyze the top action localization algorithm and conclude that the most impactful aspects to work on are: devising strategies to better handle the temporal context around instances, improving robustness with respect to the absolute and relative sizes of instances, and proposing ways to reduce localization errors. Moreover, our analysis finds that the lack of agreement among annotators is not a significant roadblock to attaining progress in the field.
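As background for the localization errors discussed above, temporal action localization is commonly scored by the temporal intersection-over-union (tIoU) between a predicted segment and a ground-truth segment. The following is a minimal sketch of that measure only, not the diagnostic tool itself; the function name `temporal_iou` and the example segments are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction and annotation for one action instance.
overlap_score = temporal_iou((2.0, 6.0), (4.0, 8.0))
disjoint_score = temporal_iou((0.0, 1.0), (5.0, 6.0))
```

A prediction usually counts as correct only when its tIoU with a ground-truth instance exceeds a chosen threshold, which is why a single averaged score can hide whether a detector fails on short instances, long instances, or instance boundaries.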
We tackle the second challenge by proposing novel frameworks and algorithms that learn from videos with incomplete annotations (weak supervision) or no labels (self-supervision). In the weakly-supervised scenario, we study the temporal action localization task on untrimmed videos where only a weak video-level label is available. We propose a novel weakly-supervised method that uses an iterative refinement approach by estimating and training on snippet-level pseudo ground truth at every iteration. In the self-supervised setup, we study learning from unlabeled videos by exploiting the strong correlation between the visual frames and the audio signal. We propose a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps our model utilize the semantic correlation and the differences between the two modalities, resulting in the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
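The cross-modal idea above can be illustrated with a toy sketch: cluster the features of one modality, then treat the cluster assignments as classification targets for the other modality. This is an illustration under stated assumptions, not the dissertation's implementation: plain k-means stands in for the unspecified clustering algorithm, and the feature values, the helper names `kmeans` and `cross_entropy`, and the video logits are all hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic init (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one logit vector against a class index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


# Hypothetical audio features, one per unlabeled video.
audio_feats = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
# Cluster assignments in the audio modality act as pseudo-labels.
pseudo_labels = kmeans(audio_feats, k=2)

# Hypothetical video-model logits for the same four videos.
video_logits = [(2.0, -1.0), (-0.5, 1.5), (1.0, 0.0), (0.0, 0.3)]
# The video model would be trained to minimize this loss.
loss = sum(
    cross_entropy(z, y) for z, y in zip(video_logits, pseudo_labels)
) / len(pseudo_labels)
```

In this sketch the supervisory signal flows from audio to video; the same recipe applies with the roles of the two modalities swapped.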
Finally, the third challenge stems from localization methods using precomputed clip features extracted from video encoders typically trained for trimmed action classification tasks. Such features tend to be temporally insensitive, i.e., background (no action) segments can have similar representations to foreground (action) segments from the same untrimmed video. These temporally-insensitive features make it harder for the localization algorithm to learn the target task and thus negatively impact the final performance. We propose to mitigate this temporal insensitivity through a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information.
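One way to make the notion of temporal insensitivity concrete is to measure how similar foreground and background clip features from the same video are. The sketch below uses mean pairwise cosine similarity as an assumed proxy; the dissertation may quantify this differently, and the function names and toy feature vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_fg_bg_similarity(features, is_foreground):
    """Average cosine similarity over all foreground/background clip pairs."""
    fg = [f for f, m in zip(features, is_foreground) if m]
    bg = [f for f, m in zip(features, is_foreground) if not m]
    sims = [cosine(f, b) for f in fg for b in bg]
    return sum(sims) / len(sims)


# Per-clip mask for one untrimmed video: True marks action clips.
mask = [True, True, False, False]

# Hypothetical temporally-insensitive features: action and background look alike.
insensitive = [(1.0, 0.1), (1.0, 0.2), (0.9, 0.1), (1.0, 0.15)]
# Hypothetical temporally-sensitive features: action and background separate.
sensitive = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]

insensitive_score = mean_fg_bg_similarity(insensitive, mask)
sensitive_score = mean_fg_bg_similarity(sensitive, mask)
```

A score near 1 means the encoder barely distinguishes action clips from background clips, which is the situation the proposed pretraining aims to avoid.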
|Date of Award||Jan 2023|
|Original language||English (US)|
- Computer, Electrical and Mathematical Sciences and Engineering
|Supervisor||Bernard Ghanem (Supervisor)|