|
ThPT1 |
Poster Session Hall |
ThP1 |
Poster Session |
|
14:00-16:10, Paper ThPT1.1 | |
A Deep Multi-Level Network for Saliency Prediction |
Cornia, Marcella | Univ. of Modena and Reggio Emilia |
Baraldi, Lorenzo | Univ. of Modena and Reggio Emilia |
Serra, Giuseppe | Univ. of Modena and Reggio Emilia
Cucchiara, Rita | Univ. of Modena and Reggio Emilia
Keywords: Deep learning
Abstract: This paper presents a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction employ fully convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network that weights low- and high-level feature maps, and a prior learning network. We compare our solution with state-of-the-art saliency models on two public benchmark datasets. Results show that our model outperforms them under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark.
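The fusion idea is easy to sketch. Below is a minimal PyTorch illustration of combining feature maps from several depths into one saliency map; the layer sizes, the 1x1-convolution encoder and the module names are assumptions for illustration, not the authors' exact architecture (which also includes a prior learning network).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU())
        self.encode = nn.Conv2d(64 + 128 + 256, 64, 1)   # weights low/high-level maps
        self.predict = nn.Conv2d(64, 1, 1)               # final saliency map

    def forward(self, x):
        f1 = self.block1(x)                # low-level features
        f2 = self.block2(f1)               # mid-level features
        f3 = self.block3(f2)               # high-level features
        size = f3.shape[-2:]
        fused = torch.cat([F.interpolate(f1, size=size),
                           F.interpolate(f2, size=size), f3], dim=1)
        return self.predict(F.relu(self.encode(fused)))

sal = MultiLevelSaliency()(torch.rand(1, 3, 64, 64))
print(sal.shape)   # torch.Size([1, 1, 16, 16])
```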
|
|
14:00-16:10, Paper ThPT1.2 | |
Latent Regression Bayesian Network for Data Representation |
Nie, Siqi | RPI |
Zhao, Yue | Minzu Univ. of China |
Ji, Qiang | RPI |
Keywords: Deep learning
Abstract: Restricted Boltzmann machines (RBMs) are widely used for data representation and feature learning in various machine learning tasks. The undirected structure of an RBM allows inference to be performed efficiently, because the latent variables are independent of each other given the visible variables. However, we believe the correlations among latent variables are crucial for faithful data representation. Driven by this idea, we propose a counterpart of RBMs, namely latent regression Bayesian networks (LRBNs), which has a directed structure. One major difficulty of learning LRBNs is the intractable inference. To address this problem, we propose an inference method based on the conditional pseudo-likelihood that preserves the dependencies among the latent variables. For learning, we propose to employ the hard Expectation Maximization (EM) algorithm, which avoids the intractability of traditional EM by maxing out instead of summing out the latent variables when computing the data likelihood. Qualitative and quantitative evaluations of our model against state-of-the-art models and algorithms on benchmark data sets demonstrate the effectiveness of the proposed algorithm in data representation and reconstruction.
|
|
14:00-16:10, Paper ThPT1.3 | |
Pedestrian and Part Position Detection Using a Regression-Based Multiple Task Deep Convolutional Neural Network |
Yamashita, Takayoshi | Chubu Univ |
Fukui, Hiroshi | Chubu Univ |
Yamauchi, Yuji | Chubu Univ |
Fujiyoshi, Hironobu | Chubu Univ |
Keywords: Deep learning, 2D/3D object detection and recognition
Abstract: In driving support systems, it is necessary not only to detect the position of pedestrians, but also to estimate the distance between a pedestrian and the vehicle. In general approaches using monocular cameras, the upper and lower positions of each pedestrian are detected using a bounding box obtained from a pedestrian detection technique. The distance between the pedestrian and the vehicle is then estimated using these positions and the camera parameters. This conventional framework uses independent pedestrian detection and position detection processes to estimate the distance. In this paper, we propose a method to detect both the pedestrian and the positions of their parts simultaneously using a regression-based deep convolutional neural network (DCNN). This simultaneous detection enables the DCNN to train efficient parameters for the extraction of proper features, because the position information is explicitly tied to the pedestrian region. In a series of experiments, our method is shown to improve the pedestrian detection performance compared with methods based solely on pedestrian detection. The proposed approach also improves the detection accuracy of the head and leg positions compared with methods that consider only position detection. Using the results of position detection and the obtained camera parameters, our method achieves distance estimation to within 5% error.
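A hedged sketch of what such a joint objective can look like: one shared feature trunk feeds a classification head and a part-position regression head, and the regression term is only active for pedestrian samples. The head structure, loss weighting and part count are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    def __init__(self, in_features=256, num_parts=2):
        super().__init__()
        self.cls = nn.Linear(in_features, 2)          # pedestrian vs. background
        self.reg = nn.Linear(in_features, num_parts)  # e.g. head and leg y-positions

    def forward(self, feats):
        return self.cls(feats), self.reg(feats)

def multitask_loss(cls_logits, part_pred, labels, part_targets, alpha=1.0):
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels == 1                       # regress parts only on pedestrian samples,
    reg_loss = F.mse_loss(part_pred[pos], part_targets[pos]) if pos.any() else 0.0
    return cls_loss + alpha * reg_loss      # so position learning is tied to the region

head = MultiTaskHead()
logits, parts = head(torch.rand(8, 256))
loss = multitask_loss(logits, parts, torch.randint(0, 2, (8,)), torch.rand(8, 2))
```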
|
|
14:00-16:10, Paper ThPT1.4 | |
Beam Search for Learning a Deep Convolutional Neural Network of 3D Shapes |
Xu, Xu | Oregon State Univ
Todorovic, Sinisa | Oregon State Univ |
Keywords: Deep learning, 2D/3D object detection and recognition
Abstract: This paper addresses one of the basic problems in computer vision, that of recognizing 3D shapes of objects. Recent work typically represents a 3D shape as a set of binary variables corresponding to 3D voxels of a uniform 3D grid centered on the shape, and resorts to deep convolutional neural networks (CNNs) for modeling these binary variables. However, robust learning of CNNs is currently limited by the small datasets of 3D shapes available – an order of magnitude smaller than other common datasets in computer vision. Related work typically deals with the small training datasets using a number of ad hoc, hand-tuned strategies. To address this issue, we formulate CNN learning as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add new convolutional filters or new convolutional layers to a parent CNN, and thus transition to children states. The utility function of each action is efficiently computed by transferring parameter values of the parent CNN to its children, thereby enabling an efficient beam search. Our experimental evaluation on the 3D ModelNet dataset demonstrates that our model pursuit using the beam search yields a CNN with 3D shape classification performance superior to the state of the art.
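The search procedure itself is generic and can be sketched compactly. In the skeleton below, a state is a candidate network, the two action types grow it, and children inherit the parent's weights so that scoring stays cheap; the toy networks, actions and `toy_score` utility are placeholders, not the authors' code (in the paper, scoring would involve briefly training each child).

```python
import copy
import heapq

def beam_search(initial_net, actions, score_fn, beam_width=3, depth=5):
    """actions: functions that grow a network (add filters / add a layer)."""
    beam = [initial_net]
    for _ in range(depth):
        children = []
        for parent in beam:
            for action in actions:
                child = action(copy.deepcopy(parent))   # child inherits parent state
                children.append(child)
        beam = heapq.nlargest(beam_width, children, key=score_fn)  # prune to beam
    return max(beam, key=score_fn)

# Toy usage: "networks" are dicts, actions widen or deepen them.
widen = lambda net: {**net, "filters": net["filters"] + 16}
deepen = lambda net: {**net, "layers": net["layers"] + 1}
toy_score = lambda net: net["filters"] * 0.01 + net["layers"] * 0.1  # stand-in utility
best = beam_search({"filters": 16, "layers": 2}, [widen, deepen], toy_score)
print(best)
```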
|
|
14:00-16:10, Paper ThPT1.5 | |
Discriminant Auto Encoders for Face Recognition with Expression and Pose Variations |
Pathirage, Chathurdara Sri Nadith | Curtin Univ |
Li, Ling | Curtin Univ. of Tech |
Liu, Wanquan | Curtin Univ. of Tech |
Keywords: Deep learning, 2D/3D object detection and recognition, Biologically motivated vision
Abstract: The key challenge of face recognition is to develop effective feature representations that reduce intra-personal variations while enlarging inter-personal differences. This paper presents a novel non-linear discriminant error criterion which can be used for effective feature learning from raw pixels. Unlike many existing methods which assume the problem to be linear in nature, the proposed method utilizes a novel deep learning (DL) framework which makes no prior assumptions, thus exploiting the full potential of learning a highly non-linear transformation. High-level representations learnt via the proposed model are highly supervised and can help to boost the performance of subsequent classifiers such as LDA. This study clearly shows the value of using a non-linear discriminant error criterion as a tractable objective to guide the learning of useful high-level features in various face-related problems. The extracted features are learnt from local face regions, and the results of experiments performed on three different face image databases demonstrate the superiority and generalizability of our method compared to existing work, as well as the applicability of the concept to many different deep learning models of the same nature.
|
|
14:00-16:10, Paper ThPT1.6 | |
MRCNN: A Stateful Fast R-CNN |
Burlina, Philippe | Johns Hopkins Univ. Applied Physics Lab |
Keywords: Deep learning, 2D/3D object detection and recognition, Machine learning and data mining
Abstract: Deep convolutional neural networks (DCNNs) perform on par with or better than humans for image classification. Hence efforts have now shifted to more challenging tasks such as object detection and classification in images, video or RGBD. Recently developed region CNNs (R-CNN) such as Fast R-CNN [7] address this detection task for images. Instead, this paper is concerned with video and also focuses on resource-limited systems. Newly proposed methods accelerate R-CNN by sharing convolutional layers for proposal generation, location regression and labeling [12][13][19][25]. These approaches, when applied to video, are stateless: they process each image individually. This suggests an alternate route: to make R-CNN stateful and exploit temporal consistency. We extend Fast R-CNN by making it employ recursive Bayesian filtering and perform proposal propagation and reuse. We couple multi-target proposal/detection tracking (MTT) with R-CNN and do detection-to-track association. We call this approach MRCNN, short for MTT + R-CNN. In MRCNN, region proposals -- which are vetted via classification and regression in R-CNNs -- are treated as observations in MTT and propagated using assumed kinematics. Actual proposal generation (e.g. via Selective Search) need only be performed sporadically and/or periodically, and is replaced at all other times by MTT proposal predictions. Preliminary results show that MRCNNs can economize on both proposal and classification computations, and can yield up to a 10- to 30-fold decrease in the number of proposals generated, about one order of magnitude savings in proposal computation time, and nearly one order of magnitude improvement in overall computation time, for comparable localization and classification performance. This method can additionally be beneficial for false alarm abatement.
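The propagation step is the stateful part. A toy stand-in for it, assuming a constant-velocity motion model rather than the paper's full multi-target tracker:

```python
import numpy as np

def predict_proposals(boxes, velocities, dt=1.0):
    """boxes: (N, 4) [x1, y1, x2, y2]; velocities: (N, 2) pixels/frame (vx, vy)."""
    shift = np.tile(velocities * dt, 2)      # apply (vx, vy) to both box corners
    return boxes + shift                     # predicted proposals for the next frame

boxes = np.array([[10., 20., 50., 80.]])
vel = np.array([[2., -1.]])
print(predict_proposals(boxes, vel))         # [[12. 19. 52. 79.]]
```

Between sporadic runs of the actual proposal generator, predictions like these stand in for proposals, which is where the computational savings come from.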
|
|
14:00-16:10, Paper ThPT1.7 | |
MSR-CNN: Applying Motion Salient Region Based Descriptors for Action Recognition |
Tu, Zhigang | Arizona State Univ
Cao, Jun | Intel Corp |
Li, Yikang | Arizona State Univ |
Li, Baoxin | Arizona State Univ |
Keywords: Deep learning, 2D/3D object detection and recognition, Motion, tracking and video analysis
Abstract: In recent years the most popular video-based human action recognition methods rely on extracting feature representations using Convolutional Neural Networks (CNNs) and then using these representations to classify actions. In this work, we propose a fast and accurate video representation derived from the motion-salient region (MSR), which captures the features most useful for action labeling. By improving a well-performing foreground detection technique, the region of interest (ROI) corresponding to actors in the foreground, in both the appearance and the motion field, can be detected under various realistic challenges. Furthermore, we propose a complementary motion saliency measure to select a secondary ROI -- the major moving part of the human. Accordingly, an MSR-based CNN descriptor (MSR-CNN) is formulated to recognize human action, where the descriptor incorporates appearance and motion features along with tracks of the MSR. The computation can be efficiently implemented due to two characteristics: 1) only part of the RGB image and the motion field need to be processed; 2) less data is used as input for the CNN feature extraction. Comparative evaluation on the JHMDB and UCF Sports datasets shows that our method outperforms the state of the art in both accuracy and efficiency.
|
|
14:00-16:10, Paper ThPT1.8 | |
Convolutional Neural Networks for Object Recognition on Mobile Devices: A Case Study |
Tobias Quiroz, Jose Luis | Telecom-Bretagne |
Ducournau, Aurélien | Telecom Bretagne |
Rousseau, François | Inst. Mines Telecom |
Mercier, Grégoire | Telecom Bretagne |
Fablet, Ronan | Telecom Bretagne/LabSTICC |
Keywords: Deep learning, 2D/3D object detection and recognition, Pattern Recognition for Art, Cultural Heritage and Entertainment
Abstract: Deep Learning (DL), especially in the form of Convolutional Neural Networks (CNNs), has become the state of the art for a variety of pattern recognition problems. Advances in technology have allowed the use of high-end General-Purpose Graphics Processing Units (GPGPUs) to accelerate numerical problem solving. These advances are not only in terms of speed but also in terms of network size: nowadays computers are able to drive deeper, wider and more powerful models. State-of-the-art CNNs have achieved human-like performance in several recognition tasks such as handwritten character recognition, face recognition, scene labelling, object detection and image classification, among others. Meanwhile, mobile devices have become powerful enough to handle the computations required for deploying CNN models in near real-time. Here, we investigate the implementation of light-weight CNN schemes on mobile devices for domain-specific object recognition tasks.
|
|
14:00-16:10, Paper ThPT1.9 | |
Deep Feature Extraction in the DCT Domain |
Ghosh, Arthita | Univ. of Maryland Coll. Park |
Chellappa, Rama | Univ. of Maryland |
Keywords: Deep learning, Artificial neural networks, 2D/3D object detection and recognition
Abstract: We explore the effectiveness of deep features extracted by Convolutional Neural Networks (CNNs) in the Discrete Cosine Transform (DCT) domain on various image classification tasks such as pedestrian and face detection, material identification and object recognition. We perform the DCT operation on the feature maps generated by convolutional layers in CNNs. We compare the performance of the same network on the same datasets, with the same hyper-parameters, with and without the DCT step. Our results indicate that a DCT operation incorporated into the network after the convolution + thresholding layer and before pooling can have certain advantages, such as convergence over fewer training epochs and sparser weight matrices that are more conducive to pruning and hashing techniques.
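The placement of the transform is the whole trick, and it is easy to prototype offline. A sketch with SciPy, applying a 2D DCT per channel to post-activation feature maps (in an actual network this would sit as a differentiable layer between the nonlinearity and pooling):

```python
import numpy as np
from scipy.fft import dctn

def dct_feature_maps(feature_maps):
    """feature_maps: (channels, H, W) activations after conv + thresholding."""
    return np.stack([dctn(fm, norm='ortho') for fm in feature_maps])

acts = np.random.rand(8, 16, 16)     # toy post-activation maps
coeffs = dct_feature_maps(acts)      # same shape, now DCT coefficients
print(coeffs.shape)                  # (8, 16, 16)
```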
|
|
14:00-16:10, Paper ThPT1.10 | |
Faster Training of Very Deep Networks Via p-Norm Gates
Pham, Trang | Deakin Univ |
Tran, Truyen | Deakin Univ |
Phung, Dinh | Deakin Univ |
Venkatesh, Svetha | Deakin Univ |
Keywords: Deep learning, Artificial neural networks, Classification and clustering
Abstract: A major contributing factor to the recent advances in deep neural networks is structural units that let sensory information and gradients propagate easily. Gating is one such important structure that acts as a flow control. Gates are pervasive among state-of-the-art recurrent models such as LSTM and GRU, and feedforward models such as Residual Nets and Highway Networks. This enables learning very deep networks with hundreds of layers and helps achieve record-breaking results in vision (e.g., ImageNet with Residual Nets) and NLP (e.g., machine translation with GRU). However, there is little work analysing the role of gating in the learning process. In this paper, we propose a flexible p-norm gating scheme, which allows user-controllable flow and, as a consequence, can improve the learning speed. This scheme subsumes other existing gating schemes, including those in GRU, Highway Networks and Residual Nets, as special cases. Experiments on large sequence and vector datasets demonstrate that the proposed gating scheme helps improve the learning speed significantly without extra overhead.
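One way to write the scheme the title names (the notation here is a hedged reconstruction, not necessarily the paper's exact formulation): standard gating uses a convex pair $(\alpha, 1-\alpha)$, i.e. a gate vector of unit $\ell_1$-norm, whereas a $p$-norm gate constrains the pair to unit $p$-norm,

\[
\mathbf{h}_{\mathrm{out}} \;=\; \alpha \odot T(\mathbf{x}) \;+\; \bigl(1-\alpha^{p}\bigr)^{1/p} \odot \mathbf{x},
\qquad \alpha \in (0,1),\ p \ge 1,
\]

so $p=1$ recovers the GRU/Highway-style convex gate, while larger $p$ makes the two gate factors sum to more than one, letting both the transformed and the untransformed signal (and hence the gradient) pass with greater total magnitude.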
|
|
14:00-16:10, Paper ThPT1.11 | |
Coupled Convolution Layer for Convolutional Neural Network |
Uchida, Kazutaka | Tokyo Inst. of Tech |
Tanaka, Masayuki | Tokyo Inst. of Tech |
Okutomi, Masatoshi | Tokyo Inst. of Tech |
Keywords: Deep learning, Artificial neural networks, Classification and clustering
Abstract: We introduce a coupled convolution layer comprising two parallel convolutions with mutually constrained weights. Inspired by the human retina mechanism, we constrain the convolution weights such that one set of weights is the negative of the other, to mimic the responses of on-center and off-center retinal ganglion cells. Our analysis shows that the retina-like convolution layer, a special case of the coupled convolution layer, can be realized by a normal convolutional layer with a pair of activation functions designated as Biased ON/OFF ReLU. Experimental comparisons demonstrate that the proposed coupled convolution layer performs better without increasing the number of parameters, which reveals two important facts. First, the separation of the positive and negative parts into different channels plays an important role. Second, constraining weights across convolutions can produce better performance than training weights freely. We evaluate its effect by comparison with ReLU, LReLU, and PReLU using the CIFAR-10, CIFAR-100, and PlanktonSet 1.0 datasets.
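Because the second weight set is just the negation of the first, the coupled layer costs no extra parameters; it can be sketched as a single convolution followed by a paired activation. A hedged PyTorch illustration (kernel size and bias handling are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledConv(nn.Module):
    """One weight set W; the coupled branch uses -W implicitly via negation."""
    def __init__(self, in_ch, out_ch, bias_on=0.0, bias_off=0.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bias_on, self.bias_off = bias_on, bias_off

    def forward(self, x):
        r = self.conv(x)
        on = F.relu(r + self.bias_on)      # ON channel: response to W
        off = F.relu(-r + self.bias_off)   # OFF channel: response to -W, no new params
        return torch.cat([on, off], dim=1)

out = CoupledConv(3, 16)(torch.rand(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 32, 32, 32])
```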
|
|
14:00-16:10, Paper ThPT1.12 | |
Finetuning Convolutional Neural Networks for Visual Aesthetics |
Wang, Yeqing | Changzhou Coll. of Information Tech |
Li, Yi | Toyota Res. Inst. / Australian National Univ. / NICTA
Porikli, Fatih | ANU / NICTA
Keywords: Deep learning, Artificial neural networks, Image and video analysis and understanding
Abstract: Inferring the aesthetic quality of images is a challenging computer vision task due to its subjective and conceptual nature. Most image aesthetics evaluation approaches have focused on designing handcrafted features, and only a few have adopted learning of relevant and imperative characteristics in a data-driven manner. In this paper, we propose to fine-tune Convolutional Neural Networks (CNNs) for image aesthetics. Unlike previous deep learning based techniques, we employ pretrained models, namely AlexNet and the 16-layer VGGNet, and calibrate them to estimate visual aesthetic quality. This enables automatically exploiting the information inherent in much larger and more diversified image datasets. We tested our methods on the AVA and CUHKPQ image aesthetics datasets on two different training-testing partitions, and compared the performance using both local and contextual information. Experimental results suggest that our strategy is robust, effective and superior to state-of-the-art approaches.
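The calibration step amounts to swapping the classification head of a pretrained network and retraining. A minimal sketch using the torchvision model zoo (the two-way high/low-quality head and the choice of frozen layers are illustrative assumptions):

```python
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights='IMAGENET1K_V1')   # ImageNet-pretrained backbone
model.classifier[6] = nn.Linear(4096, 2)        # new head: high vs. low aesthetics
for p in model.features[:10].parameters():      # optionally freeze early conv blocks
    p.requires_grad = False
# ...then train with an ordinary cross-entropy loop on the aesthetics labels.
```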
|
|
14:00-16:10, Paper ThPT1.13 | |
Face Detection Based on Deep Convolutional Neural Networks Exploiting Incremental Facial Part Learning |
Triantafyllidou, Danai | Aristotle Univ. of Thessaloniki |
Tefas, Anastasios | Aristotle Univ. of Thessaloniki |
Keywords: Deep learning, Artificial neural networks, Machine learning and data mining
Abstract: Deep learning methods are powerful approaches but often require expensive computations and lead to models of high complexity which need to be trained with large amounts of data. In this paper, we consider the problem of face detection, and we propose a light-weight deep convolutional neural network that achieves a state-of-the-art recall rate on the challenging FDDB dataset. Our model is designed with a view to minimizing both training and run time, and outperforms the convolutional network used by DDFD for the same task. Our model consists of only 113,864 free parameters, whereas the previously proposed CNN for face detection had 60 million parameters. We propose a new training method that gradually increases the difficulty of both negative and positive examples, and that proves to drastically improve training speed and accuracy. Our second approach involves training a separate deep network to detect individual facial features, while creating a model that combines the outputs of the two different networks. Both methods are able to detect faces under severe occlusion and unconstrained pose variation, and cope with the difficulties and large variations of real-world face detection.
|
|
14:00-16:10, Paper ThPT1.14 | |
Learning to Semantically Segment High-Resolution Remote Sensing Images |
Nogueira, Keiller | Univ. Federal De Minas Gerais |
Dalla Mura, Mauro | Fondazione Bruno Kessler |
Chanussot, Jocelyn | Grenoble Inst. of Tech |
Schwartz, William | Federal Univ. of Minas Gerais |
dos Santos, Jefersson Alex | Univ. Federal De Minas Gerais |
Keywords: Deep learning, Artificial neural networks, Other applications
Abstract: Land cover classification is a task that requires methods capable of learning high-level features while dealing with a high volume of data. Overcoming these challenges, Convolutional Networks (ConvNets) can learn specific and adaptable features depending on the data while, at the same time, learning classifiers. In this work, we propose a novel technique to automatically perform pixel-wise land cover classification. To the best of our knowledge, there is no other work in the literature that performs pixel-wise semantic segmentation based on data-driven feature descriptors for high-resolution remote sensing images. The main idea is to exploit the power of ConvNet feature representations to learn how to semantically segment remote sensing images. First, our method learns each label in a pixel-wise manner by taking into account the spatial context of each pixel. In the prediction phase, the probability of a pixel belonging to a class is likewise estimated according to its spatial context and the learned patterns. We conducted a systematic evaluation of the proposed algorithm using two remote sensing datasets with very distinct properties. Our results show that the proposed algorithm provides improvements over traditional and state-of-the-art methods that range from 5% to 15% in terms of accuracy.
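The pixel-plus-context idea can be prototyped in a few lines: label each pixel from the patch centred on it. A toy inference loop (the `net` classifier is a stand-in, not the authors' ConvNet, and a real implementation would batch the patches):

```python
import numpy as np

def classify_pixels(image, net, patch=9):
    """Label every pixel from the patch (spatial context) centred on it."""
    half = patch // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode='reflect')
    H, W = image.shape[:2]
    labels = np.zeros((H, W), dtype=int)
    for y in range(H):
        for x in range(W):
            context = padded[y:y + patch, x:x + patch]
            labels[y, x] = net(context)     # per-pixel class from its context
    return labels

image = np.random.rand(32, 32, 3)
labels = classify_pixels(image, net=lambda c: int(c.mean() > 0.5))  # dummy classifier
print(labels.shape)   # (32, 32)
```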
|
|
14:00-16:10, Paper ThPT1.15 | |
On the Size of Convolutional Neural Networks and Generalization Performance |
Kabkab, Maya | Univ. of Maryland |
Hand, Emily | Univ. of Maryland |
Chellappa, Rama | Univ. of Maryland |
Keywords: Deep learning, Classification and clustering, Artificial neural networks
Abstract: While Convolutional Neural Networks (CNNs) have recently achieved impressive results on many classification tasks, it is still unclear why they perform so well and how to properly design them. In this work, we investigate the effect of the convolutional depth of a CNN on its generalization performance for binary classification problems. We prove a sufficient condition, polynomial in the depth of the CNN, on the training database size to guarantee such performance. We empirically test our theory on the problem of gender classification and explore the effect of varying the CNN depth, as well as the training distribution and set size.
|
|
14:00-16:10, Paper ThPT1.16 | |
Adaptive Hierarchical Classification Networks |
Nooka, Sai | RIT |
Chennupati, Vijaya Naga Jyoth Sumanth | Rochester Inst. of Tech |
Veerabhadra, Naga Karthik Reddy | Rochester Inst. of Tech |
Sah, Shagan | Rochester Inst. of Tech |
Ptucha, Raymond | Rochester Inst. of Tech |
Keywords: Deep learning, Classification and clustering, Artificial neural networks
Abstract: Hierarchical decomposition enables an increased number of classes in a classification problem. Class similarities guide the creation of a family of coarse-to-fine classifiers which solve categorical problems more effectively than a single flat classifier. High accuracies require precise configurations for each classifier in the family. This paper proposes a method to adaptively select the configuration of the hierarchical family of classifiers. Linkage statistics from overall and sub-classification confusion matrices define categorical groupings for an efficient and accurate classification framework. Depending on the number of classes and the complexity of the problem, an adaptive configuration manager chooses between a multi-layer perceptron and a deep convolutional neural network, then selects the complexity of each.
|
|
14:00-16:10, Paper ThPT1.17 | |
An Information Theoretic Feature Selection Framework Based on Integer Programming |
Nie, Siqi | RPI |
Gao, Tian | Rensselaer Pol. Inst |
Ji, Qiang | RPI |
Keywords: Dimensionality reduction and manifold learning
Abstract: We propose a general framework for information theoretic feature selection based on integer programming. Filter feature selection methods usually rely on a greedy forward or backward selection heuristic to find a satisfactory set of features, as the exact search is a combinatorial problem. We formulate the existing filter information theoretic criteria as an integer programming problem, and by varying the objective function we can represent many different existing scoring criteria. The integer programming framework can be solved efficiently by existing solvers. We empirically demonstrate the superior performance of the integer programming formulation over its corresponding greedy criterion.
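To make the formulation concrete, here is a toy mRMR-style selection written as an integer program (relevance minus pairwise redundancy, with the product x_i·x_j linearised through auxiliary binaries). It uses the PuLP modelling library and made-up scores; the paper's actual objective encodes existing information-theoretic criteria, which this sketch only imitates.

```python
import pulp

rel = [0.9, 0.8, 0.75, 0.3]                      # toy I(feature; class) scores
red = {(0, 1): 0.7, (0, 2): 0.1, (0, 3): 0.0,
       (1, 2): 0.2, (1, 3): 0.0, (2, 3): 0.0}    # toy I(feature_i; feature_j)
k = 2

prob = pulp.LpProblem("feature_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(rel))]
y = {(i, j): pulp.LpVariable(f"y_{i}_{j}", cat="Binary") for (i, j) in red}
prob += (pulp.lpSum(rel[i] * x[i] for i in range(len(rel)))
         - pulp.lpSum(red[ij] * y[ij] for ij in red))
for (i, j), yij in y.items():
    prob += yij >= x[i] + x[j] - 1               # y_ij = x_i AND x_j (linearised)
prob += pulp.lpSum(x) == k                       # select exactly k features
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([i for i in range(len(rel)) if x[i].value() == 1])   # -> [0, 2]
```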
|
|
14:00-16:10, Paper ThPT1.18 | |
Nonlinear Dimensionality Reduction by Curvature Minimization
Yoshiyasu, Yusuke | AIST |
Yoshida, Eiichi | AIST |
Attachments: Supplementary material
Keywords: Dimensionality reduction and manifold learning, 2D/3D object detection and recognition, Shape modeling and encoding
Abstract: In this paper, we introduce a nonlinear dimensionality reduction (NLDR) technique that can construct a low-dimensional embedding efficiently and accurately, with low embedding distortions. The key idea is to divide NLDR into nonlinearity reduction and linear dimensionality reduction, which simplifies the overall NLDR process. Nonlinearity reduction is based on the elastic shell model that measures in-plane stretching and bending energy. With this model, we minimize the curvature of the data, which is the source of nonlinearity, while preserving the original intrinsic properties (i.e., local lengths) as much as possible. We discretize and linearize our nonlinearity reduction model such that it leads to an iterative deformation technique that alternates between two steps in order to flatten a manifold: a curvature minimization step that solves a bi-Laplace system and a local length restoration step that solves a Poisson system. We propose an efficient optimization technique for both steps using a direct solver based on Cholesky decomposition, which exploits the fact that the system matrices stay constant; during iterations, we reuse the factorizations obtained once at the beginning and perform back substitutions only. Since our algorithm relies only on local geometric properties, it can accurately embed data with complicated topology. Experimental results show that our algorithm is faster than most other state-of-the-art algorithms and preserves local areas and angles better than previous approaches.
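The "factor once, back-substitute many times" pattern is worth seeing in code. A sketch with SciPy on a toy bi-Laplace system: note that SciPy's factorized() performs an LU decomposition, whereas the paper's Cholesky solver additionally exploits symmetric positive definiteness, and the right-hand-side update here is a stand-in.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import factorized

n = 100
L = sp.diags([1., -2., 1.], [-1, 0, 1], shape=(n, n)).tocsc()   # toy 1D Laplacian
A = (L @ L).tocsc()              # bi-Laplace system matrix, constant across steps
solve = factorized(A)            # factor once, before iterating

x = np.random.rand(n)
for _ in range(10):              # alternating flattening iterations
    b = np.sin(x)                # stand-in for the updated right-hand side
    x = solve(b)                 # back-substitution only, no refactorization
```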
|
|
14:00-16:10, Paper ThPT1.19 | |
Unsupervised Feature Extraction Using a Learned Graph with Clustering Structure |
Zhuge, Wenzhang | National Univ. of Defense Tech |
Hou, Chenping | National Univ. of Defense Tech |
Nie, Feiping | NWPU |
Yi, Dongyun | National Univ. of Defense Tech |
Keywords: Dimensionality reduction and manifold learning, Classification and clustering, Machine learning and data mining
Abstract: Feature extraction, one kind of dimensionality reduction methodology, has aroused considerable research interest during the last few decades. Traditional graph embedding methods construct a fixed graph from the original data to fulfill the aim of feature extraction. The lack of a graph learning mechanism leaves room for improvement of their performance. In this paper, we propose a novel framework, termed unsupervised feature extraction using a learned graph with clustering structure (LGCS), in which a graph learning mechanism is presented. To be specific, the proposed LGCS learns both a transformation matrix and a structured graph which has k connected components (where k is the number of clusters). To show the effectiveness of the framework, we present a method within our framework combining locality preserving projection (LPP) with the graph learning mechanism, and an iterative algorithm is designed to solve the corresponding optimization problem. Promising experimental results on real-world datasets validate the effectiveness of our proposed algorithm.
|
|
14:00-16:10, Paper ThPT1.20 | |
Simultaneous Visualization of Samples, Features and Multi-Labels |
Kudo, Mineichi | Hokkaido Univ |
Kimura, Keigo | Hokkaido Univ |
Haindl, Michael | Inst. of Information Theory and Automation |
Tenmoto, Hiroshi | Kushiro National Coll. of Tech |
|
|
14:00-16:10, Paper ThPT1.21 | |
Simplex-Based Dimension Estimation of Topological Manifolds |
Tasaki, Hajime | Chuo Univ |
Lenz, Reiner | Linköping Univ |
Chao, Jinhui | Department of Information and System Engineering, Chuo Univ |
Keywords: Dimensionality reduction and manifold learning, Machine learning and data mining
Abstract: Dimension reduction is one of the most important issues in machine learning and computational intelligence. Typical data sets are point clouds in a high dimensional space with a hidden structure to be found in low dimensional submanifolds. Finding this intrinsic manifold structure is very important in the understanding of the data and for reducing computational complexity. In this paper, we propose a novel approach for dimension estimation of topological manifolds based on measures of simplices. We also investigate the effects of resolution changes on dimension estimation in the framework of Morse theory. The result is a method that can be used for data located in simplicial complexes of varying dimensions and with no continuous or differentiable structure. The proposed method is applied to images of handwritten digits with known deforming dimensions, data with a nontrivial topology, and noisy data. We compare the estimates with results obtained by local PCA.
|
|
14:00-16:10, Paper ThPT1.22 | |
Robust Unsupervised Feature Selection by Nonnegative Sparse Subspace Learning
Wei, Zheng | Nanjing Univ. of Science and Tech |
Yan, Hui | Nanjing Univ. of Science and Tech |
Yang, Jian | Nanjing Univ. of Science and Tech |
Yang, Jingyu | Nanjing Univ. of Science and Tech |
Attachments: Supplementary material
Keywords: Dimensionality reduction and manifold learning, Machine learning and data mining
Abstract: Sparse subspace learning has been demonstrated to be effective in data mining and machine learning. In this paper, we cast the unsupervised feature selection scenario as a matrix factorization problem from the view of sparse subspace learning. By minimizing the reconstruction residual, the learned feature weight matrix with the l2,1-norm and the non-negative constraints not only removes the irrelevant features, but also captures the underlying low dimensional structure of the data points. Meanwhile, in order to enhance the model's robustness, we solve our problem with an l1-norm error function which is resistant to outliers and sparse noise. An efficient iterative algorithm is introduced to optimize this non-convex and non-smooth objective function, and a proof of its convergence is given. In particular, differing from conventional non-negative update rules, we design a novel multiplicative update rule to iteratively solve for the feature weight matrix, and we validate its non-negativity. Comparative experiments on various original datasets, with and without malicious pollution, demonstrate the performance superiority of our model.
|
|
14:00-16:10, Paper ThPT1.23 | |
Moment-Based Symmetry Detection for Scene Modeling and Recognition Using RGB-D Images |
Su, Jui-Yuan | Ming Chuan Univ |
Cheng, Shyi-Chyi | National Taiwan Ocean Univ., Taiwan
Hsieh, Jun-Wei | National Taiwan Ocean Univ
Hsu, Tzu-Hao | National Taiwan Ocean Univ |
Keywords: Dimensionality reduction and manifold learning, Representation and analysis in pixel/voxel images, Classification and clustering
Abstract: In this paper we present a novel unsupervised feature representation obtained by extracting salient symmetries in RGB-D images using the proposed moment-based symmetric patch detector. A fast indexing structure is also derived to group local symmetric patches into semantically meaningful symmetric parts. Given an RGB-D image, the hash-based symmetric patch indexing speeds up the search for symmetric patch pairs, which are further grouped into symmetric parts with nearly linear time complexity. In the context of symmetry matching and scene classification, the second part of this work presents a symmetry-based scene modeling, aiming at computing a robust part-based feature set for each image category. To verify the effectiveness of the symmetry detector, based on the pre-learned part-based scene model, a part-based voting scheme is constructed to annotate the scene type of the input RGB-D image. Experimental results show that the proposed approach outperforms the compared methods in terms of detection and recognition accuracy on publicly available datasets.
|
|
14:00-16:10, Paper ThPT1.24 | |
Unsupervised Object Counting without Object Recognition
Katsuki, Takayuki | IBM Res. - Tokyo |
Morimura, Tetsuro | IBM Res. - Tokyo |
Ide, Tsuyoshi | T. J. Watson Res. Center |
Attachments: Supplementary material
Keywords: Machine learning and data mining, Classification and clustering, Pattern Recognition for Surveillance and Security
Abstract: This paper addresses the problem of object counting, which is to estimate the number of objects of interest from an input observation. We formalize the problem as a posterior inference of the count by introducing a particular type of Gaussian mixture for the input observation, whose mixture indexes correspond to the count. Unlike existing approaches in image analysis, which typically perform explicit object detection using labeled training images, our approach does not need any labeled training data. Our idea is to use the stick-breaking process as a constraint to make it possible to interpret the mixture indexes as the count. We apply our method to the problem of counting vehicles in real-world web camera images and demonstrate that the accuracy and robustness of the proposed approach without any labeled training data are comparable to those of supervised alternatives.
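The stick-breaking construction the abstract leans on is compact enough to show directly: it generates mixture weights whose expected mass decays with the component index, which is what lets the mixture index be read as a count. A toy NumPy version, truncated at k components (values illustrative only):

```python
import numpy as np

def stick_breaking(alpha, k, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=k)                          # stick break points
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                                      # ordered weights

w = stick_breaking(alpha=2.0, k=8)
print(w.round(3), w.sum())   # expected mass decays with index; sum < 1 (truncated)
```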
|
|
14:00-16:10, Paper ThPT1.25 | |
MCNC: Multi-Channel Nonparametric Clustering from Heterogeneous Data |
Nguyen, Thanh-Binh | Deakin Univ |
Nguyen, Vu | Deakin Univ |
Venkatesh, Svetha | Deakin Univ |
Phung, Dinh | Deakin Univ |
Keywords: Machine learning and data mining, Classification and clustering, Statistical, syntactic and structural pattern recognition
Abstract: Bayesian nonparametric (BNP) models have recently become popular due to their flexibility in identifying the unknown number of clusters. However, they have difficulty handling heterogeneous data from multiple sources. Existing BNP methods either treat each of these sources independently – and hence do not benefit from the correlating information between them – or require data sources to be explicitly specified as primary and context channels. In this paper, we present a BNP framework, termed MCNC, which has the ability to (1) discover co-patterns from multiple sources; (2) explore multi-channel data simultaneously and treat them equally; (3) automatically identify a suitable number of patterns from the data; and (4) handle missing data. The key idea is to utilize a richer base measure of a BNP model, namely a product-space. We demonstrate our framework on synthetic and real-world datasets to discover identity–location–time (a.k.a. who–where–when) patterns. The experimental results highlight the effectiveness of our MCNC framework in both cases of complete and missing data.
|
|
14:00-16:10, Paper ThPT1.26 | |
Witness Identification in Multiple Instance Learning Using Random Subspaces |
Carbonneau, Marc-André | École de Tech. Supérieure
Granger, Eric | École de Tech. Supérieure
Gagnon, Ghyslain | École de Tech. Supérieure
Keywords: Classification and clustering, Semi-supervised learning and spectral methods
Abstract: Multiple instance learning (MIL) is a form of weakly-supervised learning where instances are organized in bags. A label is provided for bags, but not for instances. The MIL literature typically focuses on the classification of bags seen as one object, or as a combination of their instances. In both cases, performance is generally measured using labels assigned to entire bags. In this paper, the MIL problem is formulated as a knowledge discovery task for which algorithms seek to discover the witnesses (i.e. identify the positive instances), using the weak supervision provided by bag labels. Some MIL methods are suitable for instance classification, but perform poorly in applications where the witness rate is low, or when the positive class distribution is multimodal. A new method that clusters data projected into random subspaces is proposed to perform witness identification in these adverse settings. The proposed method is assessed on MIL data sets from three application domains, and compared to 7 reference MIL algorithms on the witness identification task. The proposed algorithm consistently ranks among the best methods in all experiments, while all other methods perform unevenly across data sets.
|
|
14:00-16:10, Paper ThPT1.28 | |
Joint K-Means Quantization for Approximate Nearest Neighbor Search |
Ozan, Ezgi Can | Tampere Univ. of Tech |
Kiranyaz, Serkan | Tampere Univ. of Tech |
Gabbouj, Moncef | Tampere Univ. of Tech
Keywords: Machine learning and data mining, Multimedia analysis, indexing and retrieval, Segmentation, features and descriptors
Abstract: Recently, Approximate Nearest Neighbor (ANN) search has become a very popular approach for similarity search on large-scale datasets. In this paper, we propose a novel vector quantization method for ANN, which introduces a joint multi-layer K-Means clustering solution for determining the codebooks. The performance of the proposed method is further improved by a joint encoding scheme. Experimental results verify the success of the proposed algorithm, as it outperforms the state-of-the-art methods.
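The multi-layer structure is essentially residual quantization: each layer's codebook quantizes what the previous layer left unexplained. A greedy scikit-learn sketch of that structure (the paper's contribution is to optimize the layers jointly and to encode jointly, which this simplified version omits):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_residual_codebooks(X, layers=2, k=16, seed=0):
    codebooks, residual = [], X.copy()
    for _ in range(layers):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(residual)
        codebooks.append(km.cluster_centers_)
        residual = residual - km.cluster_centers_[km.labels_]   # quantization error
    return codebooks

X = np.random.rand(500, 32)
print([cb.shape for cb in train_residual_codebooks(X)])   # [(16, 32), (16, 32)]
```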
|
|
14:00-16:10, Paper ThPT1.29 | |
Semi-Supervised Learning Competence of Classifiers Based on Graph for Dynamic Classifier Selection |
Hou, Cui qin | Fujitsu R&D Center Co. Ltd |
Xia, Yingju | Information Tech. Lab., Fujitsu Res. & Development
Xu, Zhuo ran | Fujitsu R&D Center Co. Ltd |
Sun, Jun | Fujitsu R&D Center Co., LTD |
Keywords: Machine learning and data mining, Statistical, syntactic and structural pattern recognition
Abstract: Classifier competence is critically important for dynamic classifier selection. This study proposes a semi-supervised learning algorithm that learns the competence of classifiers under the proposed graph-based optimization framework. First, it constructs a graph based on the training data and some unlabeled data. Then, it iteratively learns the competence of the classifiers. The learned competence not only reflects the competitiveness of the classifiers, but also varies smoothly over neighboring data. Experimental results on five different datasets show that dynamic classifier selection systems using the learned classifier competence perform better than systems using local accuracy as the classifier competence.
|
|
14:00-16:10, Paper ThPT1.30 | |
Learning Tubes |
Ulm, Michael | Austrian Inst. of Tech |
Braendle, Norbert | Austrian Inst. of Tech |
Keywords: Machine learning and data mining, Statistical, syntactic and structural pattern recognition
Abstract: We present a new method for analyzing data manifolds based on Weyl's tube theorem. The coefficients of the tube polynomial for a manifold provide geometric information such as the volume of the manifold or its Euler characteristic, thus providing bounds on the geometric nature of the manifold. We present an algorithm that estimates the coefficients of the tube polynomial for a given manifold and demonstrate its features on artificial datasets. We apply the algorithm to a real-world traffic dataset to determine the number and properties of clusters. We furthermore demonstrate that our algorithm can be used to determine the image coverage of an object, giving hints on where a manifold is not sufficiently sampled.
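For orientation, Weyl's theorem states that for a compact m-dimensional submanifold M of R^n, the volume of the tube of radius r around M is, for sufficiently small r, a polynomial in r:

\[
V_M(r) \;=\; \sum_{c=0}^{\lfloor m/2 \rfloor} k_{2c}(M)\, r^{\,n-m+2c},
\]

where (up to dimensional constants) k_0(M) is proportional to the volume of M and, for even m, the top coefficient is proportional to the Euler characteristic of M – exactly the invariants the abstract extracts from the estimated coefficients.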
|
|
14:00-16:10, Paper ThPT1.31 | |
Bayesian Nonparametric Multiple Instance Regression |
Subramanian, Saravanan | Deakin Univ |
Rana, Santu | Deakin Univ |
Gupta, Sunil Kumar | Deakin Univ |
Bagavathi Sivakumar, P | Dept. of Computer Science and Engineering, Amrita School of Engineering
Velayutham, Shunmuga | Dept. of Computer Science and Engineering, Amrita School of Engineering
Venkatesh, Svetha | Deakin Univ |
Keywords: Machine learning and data mining, Statistical, syntactic and structural pattern recognition
Abstract: Multiple Instance Regression jointly models a set of instances and its corresponding real-valued output. We present a novel multiple instance regression model that infers the subset of instances in each bag that best describes the bag label, and uses them to learn a predictive model in a unified framework. We assume that instances in each bag are drawn from a mixture distribution and thus naturally form groups, and that instances from one of these groups explain the bag label. The largest cluster is assumed to be correlated with the label. We evaluate this model on crop yield prediction and aerosol depth prediction problems. The predictive accuracy of our model is better than that of the state-of-the-art MIR methods.
|
|
14:00-16:10, Paper ThPT1.33 | |
Bayesian Approach to Learn Bayesian Networks Using Data and Constraints |
Gao, Xiao-guang | Northwestern Pol. Univ |
Yang, Yu | Northwestern Pol. Univ |
Guo, Zhigao | Northwestern Pol. Univ |
Chen, Da-qing | London South Bank Univ |
Keywords: Model selection, Machine learning and data mining
Abstract: One of the essential problems in Bayesian networks (BNs) is parameter learning. When purely data-driven methods fail to work, incorporating supplemental information, such as expert judgments, can improve the learning of BN parameters. In practice, expert judgments are provided and transformed into qualitative parameter constraints. Moreover, prior distributions of BN parameters are also useful information. In this paper we propose a Bayesian approach to learn parameters from small datasets by integrating both parameter constraints and prior distributions. First, the feasible parameter region is derived from the constraints. Then, using the prior distribution, a posterior distribution over the feasible region is developed based on Bayes' theorem. Finally, the parameter estimates are taken as the mean values of the posterior distribution. Learning experiments on standard BNs reveal that the proposed method outperforms most of the existing methods.
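A toy rendering of "mean of the posterior restricted to the feasible region", using rejection sampling on a single conditional-probability entry; the Beta posterior and the constraint threshold are made up for illustration (the paper derives the feasible region from qualitative expert constraints and works with the exact posterior):

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.beta(3 + 4, 2 + 6, size=100_000)  # toy Beta posterior (prior + counts)
feasible = samples[samples > 0.6]               # expert constraint: theta > 0.6
print(feasible.mean())                          # mean of the constrained posterior
```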
|
|
14:00-16:10, Paper ThPT1.34 | |
True-Negative Label Selection for Large-Scale Multi-Label Learning |
Kanehira, Atsushi | Univ. of Tokyo |
Shin, Andrew | The Univ. of Tokyo |
Harada, Tatsuya | The Univ. of Tokyo |
Keywords: Multimedia analysis, indexing and retrieval, Image and video analysis and understanding, Machine learning and data mining
Abstract: In this paper, we focus on training a classifier from large-scale data with incompletely assigned labels. In other words, we treat samples with the following properties: 1. assigned labels are definitely positive, 2. absent labels are not necessarily negative, and 3. samples are allowed to take more than one label. These properties are frequently found in various kinds of computer vision tasks, including image and video classification and retrieval. Many online algorithms for the multi-label task employ label sampling, which selects a label pair that reduces the largest penalty to update the model, thereby avoiding wasted computation. In the setting above, however, there are “false-negative” labels, which are originally positive labels but regarded as negative. Since it is highly likely for label sampling to select these labels as the negative labels in the sampled pair, it may severely degrade classification performance. In order to solve this problem while preserving the convergence property of the online algorithms, we propose a novel label sampling approach, which aims to fetch “true-negative” labels via a false-negativeness measure based on independently trained uni-class classifiers. Experimental results show the effectiveness of our approach.
|
|
14:00-16:10, Paper ThPT1.35 | |
Learning Data-Driven Image Similarity Measure |
Kobayashi, Takumi | National Inst. of Advanced Industrial Science And |
Keywords: Representation and analysis in pixel/voxel images, Statistical, syntactic and structural pattern recognition, Segmentation, features and descriptors
Abstract: Image quality assessment has gained greater interest due to the development of digital imaging and storage. In that field, the structural similarity (SSIM) index has been shown to agree favorably with human perceptual assessment, significantly outperforming mean squared error, i.e., L2 distance. The similarity measure function in SSIM, which compares a target (distorted) image with its reference (original) image, is hand-crafted in a simple form via a top-down approach based on the human visual system. It might, however, lack optimality, as it does not directly consider the relationship between image data and perceptual assessment (scores). In this paper, we propose a method to construct an image similarity measure based on actual data. The proposed method optimizes a similarity measure function by exploiting annotated data in a bottom-up, data-driven manner, while retaining the favorable structural-similarity property of SSIM. The non-linear similarity function is optimized to a global optimum with high generalization power. In addition, the proposed method is simply formulated and thus applicable to the family of SSIM, especially to FSIM, which has been proposed recently and exhibits performance superior to SSIM. Experimental results on image quality assessment demonstrate the effectiveness of the proposed method compared to the other methods.
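For reference, the hand-crafted measure the paper starts from is the standard SSIM comparison over local windows,

\[
\mathrm{SSIM}(x, y) \;=\; \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)},
\]

where \mu, \sigma^2 and \sigma_{xy} are the local means, variances and covariance of the two images and C_1, C_2 are small stabilizing constants; the proposed method replaces this fixed functional form with one learned from annotated scores.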
|
|
14:00-16:10, Paper ThPT1.36 | |
Information-Theoretic Atomic Representation for Robust Pattern Classification |
Wang, Yulong | Univ. of Macau |
Tang, YuanYan | Univ. of Macau
Li, Luoqing | Hubei Univ |
Wang, Patrick | Northeastern Univ |
Keywords: Classification and clustering, Face recognition, Handwriting Recognition
Abstract: Representation-based classifiers (RCs), including the sparse RC (SRC), have attracted intensive interest in pattern recognition in recent years. In our previous work, we proposed a general framework called the atomic representation-based classifier (ARC), which includes many popular RCs as special cases. Despite this empirical success, ARC and conventional RCs utilize the mean square error (MSE) criterion and assign the same weights to all entries of the test data, including both severely corrupted and clean ones. This makes ARC sensitive to entries with large noise and outliers. In this work, we propose an information-theoretic ARC (ITARC) framework to alleviate this limitation of ARC. Using ITARC as a general platform, we develop three novel representation-based classifiers. Experiments on public real-world datasets demonstrate the efficacy of ITARC for robust pattern recognition.
|
|
14:00-16:10, Paper ThPT1.37 | |
Fully Automatic Image Colorization Based on Convolutional Neural Network |
Varga, Domonkos | Inst. for Computer Science and Control, Hungarian Acad. of Sciences
Sziranyi, Tamas | MTA SZTAKI
Keywords: Artificial neural networks, Deep learning, Texture and color analysis
Abstract: This paper deals with automatic image colorization. This is a very difficult task, since it is an ill-posed problem that usually requires user intervention to achieve high quality. A fully automatic approach is proposed that is able to produce realistic colorization of an input grayscale image. Motivated by the recent success of deep learning techniques in image processing, we propose a feed-forward, two-stage architecture based on a Convolutional Neural Network that predicts the U and V color channels. Unlike most previous work, this paper presents a fully automatic colorization which is able to produce high-quality and realistic colorization even of complex scenes. Comprehensive experiments and qualitative and quantitative evaluations were conducted on images of the SUN database and on other images. We found that Quaternion Structural Similarity (QSSIM) provides, to some degree, a good basis for quantitative evaluation, which is why we chose QSSIM as a quality index for colorization.
|
|
14:00-16:10, Paper ThPT1.38 | |
Integrating Deep Features for Material Recognition |
Zhang, Yan | Tohoku Univ |
Ozay, Mete | Tohoku Univ |
Liu, Xing | Tohoku Univ
Okatani, Takayuki | Tohoku Univ |
Keywords: Deep learning
Abstract: This paper considers the problem of material recognition. Motivated by the observation of close interconnections between material and object recognition, we study how to select and integrate multiple features obtained by different models of Convolutional Neural Networks (CNNs) trained in a transfer learning setting. To be specific, we first compute activations of features using representations on images to select a set of samples which are best represented by the features. Then, we measure the uncertainty of the features by computing the entropy of the class distributions for each sample set. Finally, we compute the contribution of each feature to the representation of classes for feature selection and integration. Experimental results show that the proposed method achieves state-of-the-art performance on two benchmark datasets for material recognition. Additionally, we introduce a new material dataset, named EFMD, which extends the Flickr Material Database (FMD). By employing EFMD for transfer learning, we achieve 84.0±1.8% accuracy on the FMD dataset, which is close to the reported human performance of 84.9%.
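The entropy-based uncertainty measure is a small, self-contained computation. A hedged NumPy sketch (toy labels; in the paper the label sets come from the samples each CNN feature represents best):

```python
import numpy as np

def feature_entropy(best_sample_labels, num_classes):
    """Entropy of the class distribution of a feature's best-represented samples."""
    p = np.bincount(best_sample_labels, minlength=num_classes) / len(best_sample_labels)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(feature_entropy(np.array([0, 0, 1, 0]), 3))   # peaked -> low uncertainty
print(feature_entropy(np.array([0, 1, 2, 1]), 3))   # spread -> higher uncertainty
```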
|