Laboratoire d'InfoRmatique en Images et Systèmes d'information
UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon
Entity Resolution (ER) is the task of detecting different entity descriptions that refer to the same real-world object. In this talk, we delve into the main end-to-end workflows that tackle it efficiently and effectively, scaling to large volumes of structured or semi-structured data. First, we describe the two main flavors of batch (i.e., budget-agnostic) Entity Resolution: the one based on a series of schema-agnostic blocking and block processing (i.e., meta-blocking) methods, and the one based on string similarity join techniques. We discuss the main methods for each workflow step and present an experimental analysis of their relative performance, highlighting their pros and cons. Next, we explain how these workflows can be adapted to operate progressively (i.e., in a budget-aware manner), producing results in a pay-as-you-go fashion. We discuss the main progressive methods employed for each type of workflow and compare them experimentally to their batch counterparts. Finally, we briefly discuss how all of the above methods can be massively parallelized through Apache Spark. We conclude with a short presentation of JedAI, an open-source system that supports all of the discussed end-to-end workflows and their corresponding methods.
George Papadakis, University of Athens, Greece
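
As a rough illustration of the blocking-based workflow mentioned in the abstract, the sketch below shows one common form of schema-agnostic blocking (token blocking), followed by a simple way of ranking candidate pairs by the number of blocks they share and cutting the list off at a comparison budget. It is a minimal sketch under stated assumptions, not the speaker's implementation and not JedAI's API: the function names, the toy `entities` records, and the choice of shared-block counting as the pair weight are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Build one block per token, ignoring attribute names (schema-agnostic)."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for value in description.values():
            # Every lower-cased word in any attribute value is a blocking key.
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block with a single entity yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def weighted_pairs(blocks):
    """Weight each candidate pair by the number of blocks it co-occurs in."""
    counts = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    # Highest weight first; ties are broken by the pair itself for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def progressive_candidates(entities, budget):
    """Return at most `budget` candidate pairs, most promising first."""
    ranked = weighted_pairs(token_blocking(entities))
    return [pair for pair, _ in ranked[:budget]]


# Hypothetical toy records with heterogeneous attribute names.
entities = {
    "e1": {"name": "Jane Smith", "city": "Lyon"},
    "e2": {"fullName": "smith, jane", "location": "Lyon France"},
    "e3": {"title": "John Doe", "town": "Athens"},
    "e4": {"label": "J. Smith", "place": "Athens"},
}

if __name__ == "__main__":
    for pair in progressive_candidates(entities, budget=2):
        print(pair)
```

With an unlimited budget the same ranking covers every pair that shares at least one block, which corresponds to the batch setting; a small budget keeps only the pairs with the strongest co-occurrence evidence. A real workflow would then apply a matching function to each emitted pair.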