
Multimodal deep SensoriMotor Representation learning (MeSMRise)

Research project ANR-23-CE23-0021-01

04/01/24 - 09/30/28


Research hypotheses

Overview

In deep learning, many recent Self-Supervised Learning (SSL) models rely on instance-based discriminative tasks. The underlying idea is to define positive (resp. negative) pairs of data whose representations must be similar (resp. dissimilar). To define the positive pairs, the models apply a set of augmentations (e.g. resize, color jitter, blur) to the same input. The models thus learn to be invariant to the properties modified by these augmentations (or equivariant, when the impact of the augmentation on the representation is taken into account), but not to the semantic content of the input.
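As an illustration of such an instance-based discriminative task, the sketch below computes an InfoNCE-style contrastive loss in plain Python. It is a minimal example under stated assumptions, not the method used in the project: the function names (`info_nce`, `jitter`), the temperature value and the use of additive noise as a stand-in for image augmentations are all hypothetical choices made for this example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(views_a, views_b, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    views_a[i] and views_b[i] are treated as a positive pair (two views of
    the same input); views_b[j] with j != i act as negatives for anchor i.
    """
    a = [normalize(v) for v in views_a]
    b = [normalize(v) for v in views_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def jitter(v, rng, scale=0.05):
    """Hypothetical stand-in for an augmentation: small additive noise."""
    return [x + rng.gauss(0.0, scale) for x in v]


rng = random.Random(0)
inputs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

# Positive pairs are two noisy views of the same input.
aligned = info_nce([jitter(v, rng) for v in inputs],
                   [jitter(v, rng) for v in inputs])

# Mismatched pairs: each anchor is paired with a view of a different input.
shuffled = info_nce([jitter(v, rng) for v in inputs],
                    [jitter(v, rng) for v in inputs[1:] + inputs[:1]])
```

The loss is low when the two views of the same input are close and far from the other inputs (`aligned`), and higher when the designated positive comes from a different input (`shuffled`). Minimizing it therefore pushes the representation to ignore whatever the augmentation changed.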


However, there is “a fundamental misalignment between human and typical AI representations: while the former are grounded in rich sensorimotor experience, the latter are typically passive and limited to a few modalities such as vision and text” [1]. In this project, we propose to improve representation learning by taking inspiration from the way babies learn to explore their environment through actions that shape their multimodal experience. More specifically, we will build upon sensorimotor contingencies theory [2], which combines coherent pieces of evidence from neuroscience and psychology in a unified framework. Its key claims concern:

  • regularities, with sensorimotor contingencies defined as “the structure of the rules governing the sensory changes produced by various motor actions” (e.g. a straight line will be represented as the perceptual invariance caused by the eyes' translation along this line) [2]
  • active perception as the “organism’s exploration of the environment that is mediated by knowledge of sensory motor contingencies” [3].
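The straight-line example above can be made concrete with a toy sketch. It assumes a binary image as the environment, a small square patch as the sensory input and gaze translations as the motor actions. All names (`make_scene`, `glimpse`, `sensory_change`) are hypothetical and serve only to illustrate the idea of a contingency as a rule linking actions to sensory changes.

```python
def make_scene(size=9, line_row=4):
    """Binary image containing one horizontal straight line."""
    return [[1 if r == line_row else 0 for _ in range(size)]
            for r in range(size)]


def glimpse(scene, row, col, radius=1):
    """Sensory input: the square patch centred on the gaze position."""
    return [scene[r][col - radius:col + radius + 1]
            for r in range(row - radius, row + radius + 1)]


def sensory_change(scene, gaze, action):
    """Count the pixels that change when the gaze moves by `action`."""
    (row, col), (d_row, d_col) = gaze, action
    before = glimpse(scene, row, col)
    after = glimpse(scene, row + d_row, col + d_col)
    return sum(b != a
               for row_before, row_after in zip(before, after)
               for b, a in zip(row_before, row_after))


scene = make_scene()
gaze = (4, 4)  # fixating a point on the line

along = sensory_change(scene, gaze, (0, 1))   # translate along the line
across = sensory_change(scene, gaze, (1, 0))  # translate across the line
```

Here `along` is 0 and `across` is 6: moving along the line leaves the sensory input unchanged, while moving across it does not. In this toy setting, "being a straight line" corresponds to the set of actions under which the sensation is invariant.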

Our claim in this project is that, to go beyond invariant and equivariant pretext tasks, the next research axis to explore in SSL is to treat action as the core of the representation learning and perception process, guided by the concepts of sensorimotor contingencies theory. This conceptual shift, which uses action as a unifying key point in learning, should lead towards more general architectures and representations. It will also allow the agent to interact with the environment and thereby access its dynamics and properties. Moreover, having more human-like perceptual and learning mechanisms should help to generalize to various environments, as humans do.


___

[1] N. Hay, M. Stark, et al. Behavior is everything: Towards representing concepts with sensorimotor contingencies. AAAI, 2018

[2] J.K. O'Regan and A. Noë. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 2001

[3] E. Myin and J.K. O'Regan. Perceptual consciousness, access to modality and skill theories. A way to naturalize phenomenology? Journal of Consciousness Studies, 2002

Work package details

The project is structured in 4 WPs:

WP 1 considers learning unisensory and multisensory representations and perception with SSL architectures that integrate action as the core of their representation, in line with sensorimotor contingencies theory.

WP 2 considers learning higher-level representations based on the dynamics of the interaction with the environment, in order to define objects as networks of possible interactions. These representations will in turn support the learning in WP 1.

Building on these multi-level and multisensory representations, WP 3 will define a hierarchical policy for active exploration of the environment, in order to improve learning and perception in WP 1 and WP 2.

WP 4 defines the shared evaluation environment, consisting of a virtual world filled with objects that the agent can manipulate.

Project organisation