The proliferation of digital cameras and data communication has led to an exponential increase in video production and dissemination. As a result, automatic video analysis and understanding has become a crucial research topic in the computer vision community. However, the localization problem, which involves identifying a specific event within a large volume of data, particularly in long-form videos, remains a significant challenge.
While video activity recognition has been extensively studied, localizing a specific query in a long, untrimmed video requires the AI system to maintain a long-term understanding of the video, spanning minutes or even hours. We therefore focus on the challenging problem of query localization in long-form videos from three perspectives: temporal modeling, localization features, and data sampling.
First, a graph-based method is proposed to model long-range dependencies, which constitute the semantic context of human activities. Second, the task gap between action recognition and action localization is identified, and two methods, one from a pre-training perspective and one from an end-to-end learning perspective, are proposed to learn video representations with a minimal gap. Last, a spatiotemporal localization problem is addressed on a more realistic dataset constructed from recordings of real people with diverse occupations and geographic locations. Our study finds that data sampling is more critical than model architecture design, and we propose several ways to reduce sampling bias.
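The graph-based temporal modeling idea can be illustrated with a minimal, generic sketch. This is not the dissertation's actual architecture: the feature dimensions, the fully connected context graph, and the residual mixing weight below are illustrative assumptions. The core idea it demonstrates is that each video snippet enriches its own feature by aggregating context from related snippets via message passing over a graph, which lets distant snippets influence one another.

```python
import numpy as np

def temporal_graph_context(features, adjacency):
    """One round of graph message passing over video snippets.

    features:  (N, D) array, one D-dim feature per snippet.
    adjacency: (N, N) array, nonzero where two snippets are related.
    Each snippet averages its neighbors' features (long-range context)
    and mixes that context with its own feature, residual-style.
    """
    # Row-normalize the adjacency so each node averages its neighbors.
    deg = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.maximum(deg, 1.0)
    context = norm_adj @ features          # aggregated neighbor context
    return 0.5 * (features + context)      # mix self and context equally

# Toy example: 4 snippets with 2-D features, fully connected graph,
# so every snippet sees every other regardless of temporal distance.
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [0.0, 0.0]])
adj = np.ones((4, 4)) - np.eye(4)          # connect all distinct pairs
out = temporal_graph_context(feats, adj)
```

In practice such models stack several rounds of message passing with learned weights and build the graph from temporal or semantic affinities rather than connecting all pairs; the sketch above keeps only the aggregation step to show how long-range dependencies enter each snippet's representation.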
In summary, this dissertation aims to advance machine intelligence in video understanding, and we hope the work has practical implications for applications such as AI assistants, recommendation systems, health care, and security.
|Date of Award||May 2023|
|Original language||English (US)|
- Computer, Electrical and Mathematical Sciences and Engineering
|Supervisor||Bernard Ghanem|