Weakly Supervised Learning of Object Segmentations from Web-Scale Video

Glenn Hartmann1, Matthias Grundmann2, Judy Hoffman3, David Tsai2, Vivek Kwatra1, Omid Madani1, Sudheendra Vijayanarasimhan1, Irfan Essa2, James Rehg2, and Rahul Sukthankar1

1Google Research, USA

2Georgia Institute of Technology, USA

3University of California, Berkeley, USA

Abstract. We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatio-temporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
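The weak-supervision step described in the abstract can be sketched as follows: every spatio-temporal segment inherits its video's tag as a noisy label, a simple classifier is trained on those labels, and high-scoring segments become object seeds for later graph-cut refinement. This is a minimal illustrative sketch only; the segment features, the classifier, and all names below are assumptions, and the paper's actual feature extraction, segmentation, and classifier choices are not reproduced here.

```python
# Hypothetical sketch of weakly supervised segment labeling (illustrative only).
# Segments are toy 2-D feature vectors; the real system uses rich
# spatio-temporal segment features, which are assumed to exist upstream.
import random

random.seed(0)

def make_segment(is_object):
    # Toy features: object segments cluster near (1, 1), background near (0, 0).
    base = (1.0, 1.0) if is_object else (0.0, 0.0)
    return tuple(b + random.gauss(0, 0.3) for b in base)

# Videos tagged "dog" contain a mix of object and background segments;
# untagged videos contain only background. This mirrors the noisy-tag setting.
tagged_videos = [[make_segment(i % 2 == 0) for i in range(20)] for _ in range(5)]
untagged_videos = [[make_segment(False) for _ in range(20)] for _ in range(5)]

# Weak labels: every segment in a tagged video is treated as positive,
# even though roughly half of them are background (label noise).
pos = [s for v in tagged_videos for s in v]
neg = [s for v in untagged_videos for s in v]

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(2))

c_pos, c_neg = centroid(pos), centroid(neg)

def score(seg):
    # Higher when the segment is closer to the positive centroid.
    def sqdist(c):
        return sum((seg[i] - c[i]) ** 2 for i in range(2))
    return sqdist(c_neg) - sqdist(c_pos)

# Object seeds: segments whose score clears a margin. In the full pipeline
# these seeds would initialize a graph-cut refinement of the object mask.
seeds = [s for v in tagged_videos for s in v if score(s) > 0.5]
```

Despite the label noise (background segments in tagged videos are mislabeled positive), the classifier separates the two clusters because the noise is symmetric across videos; the resulting seeds are dominated by true object segments.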

LNCS 7583, p. 198 ff.


© Springer-Verlag Berlin Heidelberg 2012