Laboratoire d'InfoRmatique en Images et Systèmes d'information
UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon
In many domains, scientific discoveries increasingly rely on our ability to exploit ever-growing volumes of data. A key challenge is managing the complexity of data life cycles, i.e. the various operations applied to data between their creation and their deletion: transfer, archival, replication, disposal, etc. These formerly straightforward operations become intractable when data volumes grow dramatically, because of the heterogeneity of data management software on the one hand, and the complexity of the infrastructures involved on the other. In this context, cooperation between different systems becomes very complex and requires ad hoc solutions and frequent human intervention.
This work contributes theoretical and practical tools for the formal and efficient management of data life cycles in large scientific applications. We propose a meta-model that, for the first time, makes it possible to represent formally and graphically the life cycle of data distributed across an assemblage of systems running on heterogeneous infrastructures. We then present Active Data[1], an implementation of this meta-model together with a programming model that allows code to be executed at each step of the data life cycle. Active Data programs have access to the complete state of the data at any time, in every system and infrastructure across which the data are distributed. These programs can thus make informed decisions based on global knowledge and implement many optimizations that would otherwise be impossible.
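To make the programming model more concrete, the following is a minimal Python sketch of the general idea: handlers are attached to life cycle transitions and receive the state of a data item across all systems when a transition occurs. It is an illustration only and does not reproduce the Active Data API; the class LifeCycle, its methods, the system names ("transfer", "storage") and the state names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical types for this sketch; not taken from Active Data.
Transition = Tuple[str, str, str]  # (system, from_state, to_state)
Handler = Callable[[str, str, Dict[str, str]], None]


class LifeCycle:
    """Tracks the state of each data item in each system and runs
    user code whenever a transition of the life cycle is published."""

    def __init__(self, transitions: Set[Transition]) -> None:
        self.transitions = transitions
        # state[data_id][system] is the current state of the item there.
        self.state: Dict[str, Dict[str, str]] = defaultdict(dict)
        self.handlers: Dict[Transition, List[Handler]] = defaultdict(list)

    def create(self, data_id: str, system: str) -> None:
        """Register a data item in a system, in the initial state."""
        self.state[data_id][system] = "created"

    def subscribe(self, transition: Transition, handler: Handler) -> None:
        """Attach user code to one transition of the life cycle."""
        if transition not in self.transitions:
            raise ValueError(f"unknown transition: {transition}")
        self.handlers[transition].append(handler)

    def publish(self, data_id: str, transition: Transition) -> None:
        """Record that a transition happened, then run its handlers."""
        system, src, dst = transition
        if transition not in self.transitions:
            raise ValueError(f"unknown transition: {transition}")
        if self.state[data_id].get(system) != src:
            raise ValueError(f"{data_id} is not in state {src!r} in {system}")
        self.state[data_id][system] = dst
        for handler in self.handlers[transition]:
            # Handlers receive a copy of the item's state in all systems.
            handler(data_id, system, dict(self.state[data_id]))


def run_example() -> List[str]:
    """Two systems handle the same file; the handler uses global state."""
    log: List[str] = []
    transfer_done: Transition = ("transfer", "created", "transferred")
    archive_done: Transition = ("storage", "created", "archived")
    life_cycle = LifeCycle({transfer_done, archive_done})

    def on_archived(data_id: str, system: str, global_state: Dict[str, str]) -> None:
        # The decision depends on the state of the item in another system.
        if global_state.get("transfer") == "transferred":
            log.append(f"{data_id}: local copy can be removed")
        else:
            log.append(f"{data_id}: transfer still pending")

    life_cycle.subscribe(archive_done, on_archived)
    life_cycle.create("file-1", "transfer")
    life_cycle.create("file-1", "storage")
    life_cycle.publish("file-1", transfer_done)
    life_cycle.publish("file-1", archive_done)
    return log
```

In this sketch, the handler attached to the archival transition decides whether a local copy may be removed by inspecting the state of the same item in the transfer system, which is the kind of decision based on global knowledge described above.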
Finally, we present performance evaluations and use cases that demonstrate the expressivity of the programming model and the quality of the implementation.