Equipe BD
Laboratoire d'InfoRmatique en Images et Systèmes d'information
UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon

Could Functional Dependencies Help to Identify Balanced Classification Datasets?

Who: 
Marie LE GUILLY
When: 
Friday, April 6, 2018 - 12:45 to 13:45
Where: 
B. Pascal building, LIRIS meeting room

When machine learning algorithms are used to solve classification problems, a recurring difficulty is unbalanced datasets, especially in binary classification when one class is much larger than the other. Several solutions have been proposed, such as undersampling the majority class. Undersampling is generally based on statistical properties of the data. We propose a different approach that considers the functional dependencies and their interactions in the two classes. To this end, we propose to use the “distance of databases” defined by Katona, Keszler and Sali in a 2011 paper. This distance between two databases is based on the functional dependencies that hold in each database. We propose to study the influence of this distance on the performance of models classifying the tuples from the two databases: the conjecture is that it should be easier to discriminate between distant datasets. If the conjecture is correct, it could be used to build balanced datasets by choosing samples from the majority class that form a dataset distant from the minority class in terms of functional dependencies.
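
To make the idea of comparing two classes through their functional dependencies more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names (`holds`, `single_rhs_fds`, `fd_difference`), the toy attributes `a`, `b`, `c` and the toy tuples are all hypothetical, and the count it computes is a simplified stand-in, not the distance defined by Katona, Keszler and Sali.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def single_rhs_fds(rows, attributes):
    """Enumerate every non-trivial FD X -> a that holds in rows.

    Brute force: exponential in the number of attributes, toy data only.
    """
    fds = set()
    for size in range(1, len(attributes)):
        for lhs in combinations(attributes, size):
            for rhs in attributes:
                if rhs not in lhs and holds(rows, lhs, rhs):
                    fds.add((lhs, rhs))
    return fds


def fd_difference(rows_a, rows_b, attributes):
    """Illustrative proxy (NOT the Katona-Keszler-Sali distance):
    number of FDs satisfied by exactly one of the two tuple sets."""
    return len(single_rhs_fds(rows_a, attributes) ^ single_rhs_fds(rows_b, attributes))


# Hypothetical toy data: one small tuple set per class.
ATTRIBUTES = ("a", "b", "c")
minority = [{"a": 1, "b": 1, "c": 0}, {"a": 2, "b": 2, "c": 0}]
majority = [{"a": 1, "b": 1, "c": 0}, {"a": 1, "b": 2, "c": 1}]

print(fd_difference(minority, majority, ATTRIBUTES))
```

Under the conjecture in the abstract, an undersampling procedure would prefer majority-class samples for which such a measure, computed against the minority class, is large.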