Beyond declarative mapping and cleaning

Paolo Papotti
Friday, April 10, 2015 - 13:00 to 14:00
Lyon 1, Nautibus building, ground floor, room C4

In the "big data" era, data integration is a popular activity in both academia and industry. Integrating hundreds of heterogeneous sources on a daily basis requires a great amount of manual work to make the data polished enough to be useful in the final applications, such as querying and mining. The problem is even harder in practice, as data is often dirty: typos, duplicates, and similar errors can lead to poor results in analytic tasks.
To achieve the level of automation and scalability required by the large number of sources, several successful systems have been proposed. They rely on a formal, declarative approach based on first-order logic, where users provide high-level specifications of the tasks (the "what"), and the systems compute optimal solutions without requiring human intervention on the generated code (the "how"). However, despite the positive results, there is still a gap between these proposals and the leading commercial systems. The latter are harder to maintain, debug, and test, but provide the level of personalization and detail that is needed to solve real-world problems. In this talk, I will describe some of my results in tackling mapping and cleaning with a declarative approach, and how this experience has pushed me to explore a new approach where user-defined functions and declarative specifications coexist in a unified, distributed system, ultimately taking the best of both worlds.