With the growth of online media, surveillance, and mobile cameras, the number and size of video databases are increasing at an incredible pace. For example, YouTube reported that over 400 hours of video are uploaded to its servers every minute. Arguably, people are the most important and interesting subjects of such videos. The computer vision community has embraced this observation, recognizing the crucial role that human action recognition plays in building smarter surveillance systems, semantically aware video indexes, and more natural human-computer interfaces. However, despite the explosion of video data, the ability to automatically recognize and understand human activities remains limited.
In this work, I address four challenges in scaling up action understanding. First, I tackle existing dataset limitations by using a flexible framework that allows continuous acquisition, crowdsourced annotation, and segmentation of online videos, culminating in a large-scale, rich, and easy-to-use activity dataset known as ActivityNet. Second, I develop an action proposal model that takes a video and directly generates temporal segments that are likely to contain human actions. The model has two appealing properties: (a) it retrieves temporal locations of activities with high recall, and (b) it produces these proposals quickly. Third, I introduce a model that exploits action-object and action-scene relationships to improve the localization quality of a fast generic action proposal method and to quickly prune irrelevant activities in a cascade fashion. These two features lead to an efficient and accurate cascade pipeline for temporal activity localization. Lastly, I introduce a novel active learning framework for temporal localization that aims to mitigate the data dependency issue of contemporary action detectors. By creating a large-scale video benchmark, designing efficient action scanning methods, enriching activity localization approaches with high-level semantics, and devising an effective strategy to build action detectors with limited data, this thesis takes a step toward general video understanding.
- Computer, Electrical and Mathematical Sciences and Engineering
Bernard Ghanem (Supervisor)
- Computer Vision
- Machine Learning
- Video Understanding
- Activity Localization