Final Examination of Rui Hou for the degree of Doctor of Philosophy in Computer Science
This dissertation addresses the problem of action understanding in videos, which includes action recognition in trimmed videos, temporal action localization in untrimmed videos, spatio-temporal action detection, and video object/action segmentation. For action recognition, we propose a method for learning discriminative category-level features. The key observation is that one-vs-rest classifiers, which are ubiquitously employed for this task, face challenges in separating very similar categories (such as running vs. jogging). Our proposed method automatically identifies such pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space and a category-level similarity matrix in which each entry corresponds to the one-vs-one SVM margin for a pair of categories. We augment the one-vs-rest classifiers with a judicious selection of “two-vs-rest” classifier outputs, formed from such discriminative and mutually nearest (DaMN) pairs.

The above action recognition method is designed for clip-level classification on manually trimmed datasets. In the real world, however, the majority of videos are untrimmed, and an action of interest may occupy only a small part of a long video. For temporal action localization, we therefore propose to exploit the temporal structure of actions by modeling an action as a sequence of sub-actions, and we present a computationally efficient approach. A novel sub-action discovery algorithm is proposed in which the number of sub-actions for each action, as well as their types, is determined automatically from the training videos. To localize an action, an objective function combining the appearance, duration, and temporal structure of sub-actions is optimized as a shortest-path problem in a network-flow formulation.
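As a minimal sketch of the DaMN pair selection described above for action recognition, the following assumes a precomputed matrix `margin`, where `margin[i][j]` holds the one-vs-one SVM margin between categories i and j (a small margin meaning the two are hard to separate); the function name and data layout are illustrative, not the dissertation's implementation:

```python
def damn_pairs(margin):
    """Select discriminative and mutually nearest (DaMN) category pairs.

    margin[i][j] is the one-vs-one SVM margin between categories i and j
    (hypothetical layout); a small margin marks a confusable pair.
    A pair (i, j) is kept when each category is the other's nearest
    neighbour under this margin.
    """
    n = len(margin)
    nearest = [min((j for j in range(n) if j != i), key=lambda j: margin[i][j])
               for i in range(n)]
    # Report each mutual pair once, with the smaller index first.
    return [(i, nearest[i]) for i in range(n)
            if nearest[i] > i and nearest[nearest[i]] == i]
```

For example, with four categories where 0/1 (say, running/jogging) and 2/3 have small mutual margins, the function returns the pairs (0, 1) and (2, 3), which would then receive dedicated two-vs-rest classifiers.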
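The shortest-path view of sub-action localization can likewise be sketched with a toy dynamic program; the per-frame cost table `scores` and the constraint that sub-actions occur strictly in order are simplifying assumptions standing in for the full appearance/duration/temporal-structure objective:

```python
def localize(scores):
    """Toy shortest-path localization over sub-actions.

    scores[t][k] is the (hypothetical) appearance cost of assigning frame t
    to sub-action k. Sub-actions must appear in order: at each frame the
    path either stays in the current sub-action or advances to the next.
    Returns the minimum total cost and the per-frame sub-action labels.
    """
    T, K = len(scores), len(scores[0])
    INF = float('inf')
    dp = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    dp[0][0] = scores[0][0]          # the path must start in sub-action 0
    for t in range(1, T):
        for k in range(K):
            for prev in ([k - 1, k] if k else [0]):
                c = dp[t - 1][prev] + scores[t][k]
                if c < dp[t][k]:
                    dp[t][k], back[t][k] = c, prev
    # Backtrack from the last sub-action at the final frame.
    labels, k = [K - 1], K - 1
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        labels.append(k)
    labels.reverse()
    return dp[T - 1][K - 1], labels
```

On a 4-frame clip whose costs favour sub-action 0 early and sub-action 1 late, the program recovers the labelling [0, 0, 1, 1]; the dissertation's formulation solves the analogous problem as a shortest path in a network-flow graph.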
In the above method, actions are localized only temporally. The goal of action detection, however, is to detect every occurrence of a given action within a long video and to localize each detection both in space and in time. Towards that end, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips; then, for each clip, a set of tube proposals is generated based on 3D Convolutional Network (ConvNet) features. Finally, the tube proposals of different clips are linked together, and spatio-temporal action detection is performed using these linked video proposals.
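The linking of tube proposals across clips can be sketched as follows; the per-proposal actionness scores, the 2D boxes standing in for tubes, and the score-plus-overlap linking objective are illustrative assumptions, not the exact T-CNN criterion:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter) if inter else 0.0

def link_tubes(clips):
    """Link one proposal per clip into a video-level tube.

    clips[c] is a list of (score, box) proposals for clip c (hypothetical
    actionness scores). Dynamic programming picks the chain maximising the
    sum of scores plus the overlap between consecutive choices.
    """
    best = [s for s, _ in clips[0]]
    back = []
    for c in range(1, len(clips)):
        cur, bk = [], []
        for s, box in clips[c]:
            vals = [best[j] + iou(clips[c - 1][j][1], box)
                    for j in range(len(clips[c - 1]))]
            j = max(range(len(vals)), key=vals.__getitem__)
            cur.append(vals[j] + s)
            bk.append(j)
        best, back = cur, back + [bk]
    # Backtrack to recover the chosen proposal index in each clip.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for bk in reversed(back):
        j = bk[j]
        path.append(j)
    return list(reversed(path))
```

With two clips, each holding one high-scoring proposal around the actor and one spurious proposal elsewhere, the linking selects the overlapping high-score chain, index 0 in both clips.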
Localizing actions in videos at a finer spatio-temporal level remains a persistent goal for computer vision researchers and developers. Finally, to localize actions pixel-wise, we propose an end-to-end encoder-decoder-style 3D CNN for video object segmentation. The proposed approach leverages 3D separable convolutions and 3D pyramid pooling, which drastically reduce the number of trainable parameters. Additionally, the framework is extended to solve the video action segmentation problem by adding an extra classifier that predicts the action label for actors in videos.
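To see why 3D separable convolutions cut the parameter count, a back-of-the-envelope comparison is enough; the channel and kernel sizes below are illustrative, not taken from the dissertation's architecture:

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a standard 3D convolution with a k*k*k kernel."""
    return c_in * c_out * k ** 3

def separable3d_params(c_in, c_out, k):
    """Depthwise k*k*k convolution followed by a 1x1x1 pointwise one."""
    return c_in * k ** 3 + c_in * c_out

# Example: 256 -> 256 channels with a 3x3x3 kernel (illustrative sizes)
# standard:  256 * 256 * 27           = 1,769,472 weights
# separable: 256 * 27 + 256 * 256    =    72,448 weights (~24x fewer)
```

The same factorization applied throughout an encoder-decoder network is what drives the drastic reduction in trainable parameters noted above.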