Equipe BD
Laboratoire d'InfoRmatique en Images et Systèmes d'information
UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon

Designing and enacting data science pipelines as queries

Thursday, February 6, 2020 - 12:45 to 14:00
Université Lyon1, Dép. Informatique, Bât. Nautibus, salle C2

Since the emergence of the 5V’s (i.e., n-V’s) models describing non-functional properties of data, new visions of querying have emerged. Batch, on-demand queries, with expected complete and sound results, have evolved into complex data science pipelines combining processing and analytics tasks. Like a query described as a data flow, a data science pipeline is a combination of tasks. Unlike classic queries, which rely on well-defined data structures with associated operators, data science pipelines combine data visualization, cleaning, preparation, modelling and prediction, and assessment tasks. These tasks use input data with different structures. From our point of view, these data science pipelines can be considered a new type of query, which we call data science queries.

Data science queries produce models and forecasts, with associated error estimations, as results. These queries can be re-executed with different data processing and analytics methods to reduce such errors. Datasets no longer represent a complete, consistent model of a mini-world. Instead, they represent partial and incomplete observations of phenomena produced within complex systems. The objective of analysing datasets is to get close to a possible and transient understanding of these phenomena. This does not mean that data loading, in-memory/cache/disk indexing, data persistence, query optimization, concurrent access, consistency and access control, and other management functions are no longer required. Rather, it means that these functions must be revisited under weaker hypotheses, for example regarding data consistency, completeness and cleanness, to support the enactment of data science queries.

In my talk, I will introduce the challenges and possible research directions regarding data science queries, along with applications and initial projects. I will also discuss how previous research results can back up this project and how the project could be woven into the activities of the database group at LIRIS.