Laboratoire d'InfoRmatique en Images et Systèmes d'information
UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon
Data scientists spend a big chunk of their time preparing, cleaning, and transforming raw data before getting the chance to feed this data to their well-crafted models. Despite the efforts to build robust prediction and classification models, data errors are still the main reason for low-quality results. These massive, labor-intensive data-cleaning exercises remain the main impediment to automatic end-to-end AI pipelines for data science.
In this talk, I focus on data cleaning as an inference problem that can be automated by leveraging the great advances in AI and ML over the last few years. I will describe the HoloClean++ framework, a machine learning framework for data profiling and cleaning (error detection and repair). The framework has multiple successful deployments cleaning census data, as well as pilots with commercial enterprises to boost the quality of source (training) data before feeding it to downstream analytics.
HoloClean++ builds two main probabilistic models: a data generation model (describing how the data was intended to look) and a realization model (describing how errors might be introduced into the intended clean data). The framework uses few-shot learning, data augmentation, and weak supervision to learn the parameters of these models, and uses them to predict both errors and their possible repairs.
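To make the two-model (noisy-channel) view concrete, here is a minimal, hypothetical sketch in Python: a generation model scores how likely a candidate clean value is given the row's context, and a realization model scores how the observed value could have been produced from that candidate; the repair is the candidate maximizing the product. The function names, the co-occurrence-based generation model, and the typo-rate realization model are all illustrative assumptions, not the HoloClean++ API.

```python
# A minimal, hypothetical sketch of the two-model (noisy-channel) view of cleaning.
# The generation model is just co-occurrence statistics and the realization model
# a fixed typo rate scaled by string similarity; illustrative assumptions only,
# not the HoloClean++ implementation.
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def train_generation_model(rows, target, context):
    """Estimate P(target value | context value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[context]][row[target]] += 1
    return {ctx: {v: c / sum(cnt.values()) for v, c in cnt.items()}
            for ctx, cnt in counts.items()}

def realization_prob(observed, candidate, typo_rate=0.15):
    """Toy realization model: the observed value equals the intended one with
    probability 1 - typo_rate; otherwise it is a corruption whose likelihood
    grows with string similarity to the intended value."""
    if observed == candidate:
        return 1.0 - typo_rate
    return typo_rate * SequenceMatcher(None, observed, candidate).ratio()

def repair_cell(row, target, context, gen_model):
    """Pick the candidate maximizing P(candidate | context) * P(observed | candidate)."""
    observed = row[target]
    candidates = gen_model.get(row[context], {observed: 1.0})
    return max(candidates,
               key=lambda v: candidates[v] * realization_prob(observed, v))

rows = ([{"zip": "53703", "city": "Madison"}] * 9
        + [{"zip": "53703", "city": "Madiso"}]     # likely typo
        + [{"zip": "60614", "city": "Chicago"}])
gen = train_generation_model(rows, target="city", context="zip")
print(repair_cell(rows[9], "city", "zip", gen))    # -> "Madison"
```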
While the idea of using statistical inference to model the joint distribution of the underlying data is not new, the challenge has always been: (1) how to scale a model with millions of data cells (each corresponding to a random variable); and (2) how to get enough training data to learn complex models capable of accurately predicting the anomalies and the repairs. HoloClean++ tackles exactly these two problems.
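On the training-data side, the abstract's mention of weak supervision can be illustrated with a small sketch: instead of hand-labeling errors, an integrity constraint over the data (here a toy functional dependency zip -> city) acts as a noisy labeling function whose labels could then train an error-detection classifier. The function and column names are hypothetical, and this shows only the general idea, not how HoloClean++ itself implements it.

```python
# A minimal, hypothetical weak-supervision sketch: a functional dependency
# (zip -> city) is used as a labeling function that flags cells disagreeing
# with the majority value of their group. Illustrative only; not the
# HoloClean++ implementation.
from collections import Counter, defaultdict

def weak_labels_fd(rows, lhs, rhs):
    """Label each rhs cell as dirty (1) if it disagrees with the majority
    rhs value seen for the same lhs value, else clean (0)."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    labels = []
    for row in rows:
        majority, _ = groups[row[lhs]].most_common(1)[0]
        labels.append(0 if row[rhs] == majority else 1)
    return labels

rows = ([{"zip": "53703", "city": "Madison"}] * 9
        + [{"zip": "53703", "city": "Madiso"}]
        + [{"zip": "60614", "city": "Chicago"}])
print(weak_labels_fd(rows, lhs="zip", rhs="city"))
# -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] (only the typo row is flagged)
```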
Bio: Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. His main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, machine learning for data curation, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, an elected SIGMOD vice chair, and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.