
Visual Code-Sentences: A New Video Representation Based on Image Descriptor Sequences

Yusuke Mitarai and Masakazu Matsugu

Canon Inc. Digital System Technology Development Headquarters, Tokyo, Japan

Abstract. We present a new descriptor-sequence model for action recognition that enhances discriminative power in the spatio-temporal context, while maintaining robustness against background clutter as well as variability in inter-/intra-person behavior. We extend the Dense Trajectories framework for activity recognition (Wang et al., 2011) and introduce a pool of dynamic Bayesian networks (e.g., multiple HMMs) with histogram descriptors as codebooks of composite action categories represented at the respective key points. The entire set of codebooks, bound to spatio-temporal interest points, constitutes an intermediate feature representation that serves as a basis for generic action categories. This representation scheme is intended to serve as visual code-sentences that subsume a rich vocabulary of basis action categories. Through extensive experiments on the KTH, UCF Sports, and Hollywood2 datasets, we demonstrate improvements over state-of-the-art methods.
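The abstract describes matching local descriptor sequences against a pool of HMMs and aggregating the best-matching "code words" into a histogram representation. The following is a minimal sketch of that idea, assuming discrete (vector-quantized) descriptors and a pool of discrete-emission HMMs; the names `hmm_loglik` and `code_sentence` and all parameter choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hmm_loglik(obs, pi, A, B):
    """Scaled forward-algorithm log-likelihood of a discrete observation
    sequence `obs` under an HMM with initial distribution `pi` (S,),
    transition matrix `A` (S, S), and emission matrix `B` (S, V)."""
    alpha = pi * B[:, obs[0]]          # forward variables at t = 0
    c = alpha.sum()
    log_p = np.log(c)
    alpha /= c                          # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
        c = alpha.sum()
        log_p += np.log(c)
        alpha /= c
    return log_p

def code_sentence(sequences, hmm_pool):
    """Assign each local descriptor sequence to its best-matching HMM
    ('code word') and return the normalized histogram of assignments,
    i.e. one 'visual code-sentence' descriptor for the video."""
    hist = np.zeros(len(hmm_pool))
    for obs in sequences:
        scores = [hmm_loglik(obs, pi, A, B) for (pi, A, B) in hmm_pool]
        hist[int(np.argmax(scores))] += 1
    return hist / max(hist.sum(), 1)

if __name__ == "__main__":
    # Toy demo: 5 random HMMs (3 states, 8 symbols) and 40 random
    # quantized descriptor sequences of length 15.
    rng = np.random.default_rng(0)
    pool = [(rng.dirichlet(np.ones(3)),
             rng.dirichlet(np.ones(3), size=3),
             rng.dirichlet(np.ones(8), size=3)) for _ in range(5)]
    seqs = [rng.integers(0, 8, size=15) for _ in range(40)]
    print(code_sentence(seqs, pool))    # histogram over the 5 code words
```

In practice each sequence would come from descriptors sampled along a dense trajectory, and the histogram would feed a discriminative classifier over action categories.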

LNCS 7583, p. 321 ff.



© Springer-Verlag Berlin Heidelberg 2012