Laboratoire d'InfoRmatique en Images et Systèmes d'information
UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon
Functional dependencies (FDs) are potentially useful to assess data quality. However, in order to identify data inconsistencies, patterns of semantically related data, and pursue robustness with respect to data errors, it has been necessary to devise approximate versions of functional dependencies, yielding relaxed functional dependencies (RFDs) [1]. The latter can be classified into three categories, depending on the constraint they relax with respect to the canonical definition of FD. In particular, there exist RFDs relaxing on the data comparison method, i.e., they exploit data similarity rather than equality, those relaxing on the extent, i.e., they admit the possibility that the RFD holds on a subset of data, and finally, hybrid RFDs relaxing on both criteria. In all such categories, thresholds might be used in either to specify the similarity between tuples or to indicate the percentage of the database on which the RFD holds.
While FDs were originally specified at database design time, as properties of a schema that should hold on every instance of it, recently there has been the need to devise algorithms to discover them from data, since some FDs would often been missed at design time. Moreover, in the era of big data the evolutions of a database instance can often yield changes in the semantics of data, which should probably affect previously specified FDs rather than be blocked by them. These becomes particularly critical in the context of RFDs, for which a design time specification is further complicated by the necessity to specify several types of thresholds. Thus, we need to devise suitable algorithms to automatically discover them from big data collections, in order to deploy them effectively. In the literature, there are several discovery algorithms for FDs, whereas few RFDs are equipped with specific algorithms for discovering them from data. To this end, based upon a formal definition of RFD, we aim to derive algorithms to potentially discover any type of RFD. In particular, we have devised a general method capable of validating any type of candidate RFD, including hybrid ones [2, 3]. We are also attempting to derive algorithms capable to automatically inferring thresholds. Finally, we are investigating incremental RFD discovery algorithms in order to both exploit distributed computing paradigms to enhance the efficiency of discovery process, but also to enable online discovery of RFDs in data streams [4].
[1] L. Caruccio, V. Deufemia, and G. Polese. Relaxed Functional Dependencies - A Survey of Approaches. IEEE Transactions on Knowledge and Data Engineering 28(1): 147-165 (2016). [2] L. Caruccio, V. Deufemia, and G. Polese. On the Discovery of Relaxed Functional Dependencies. In Proceedings of the 20th International Database Engineering & Applications Symposium (IDEAS): 53-61 (2016). [3] L. Caruccio, V. Deufemia, and G. Polese. Evolutionary mining of relaxed dependencies from big data collections. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics (WIMS): 5 (2017). [4] L. Caruccio, S. Cirillo, V. Deufemia, and G. Polese. Incremental Discovery of Functional Dependencies with a Bit-vector Algorithm. In Proceedings of the 27th Italian Symposium on Advanced Database Systems (SEBD): (2019).