CoPhy

Counterfactual Learning of Physical Dynamics

Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world. This work poses a new problem of counterfactual learning of object mechanics from visual input. We develop the COPHY benchmark to assess the capacity of state-of-the-art models for causal physical reasoning in a synthetic 3D environment and propose a model for learning the physical dynamics in a counterfactual setting. Having observed a mechanical experiment that involves, for example, a falling tower of blocks, a set of bouncing balls or colliding objects, we learn to predict how its outcome is affected by an arbitrary intervention on its initial conditions, such as displacing one of the objects in the scene. The alternative future is predicted given the altered past and a latent representation of the confounders learned by the model in an end-to-end fashion with no supervision. We compare against feedforward video prediction baselines and show how observing alternative experiences allows the network to capture latent physical properties of the environment, which results in significantly more accurate predictions at the level of superhuman performance.

Authors

Fabien Baradel (INSA-Lyon, LIRIS), Natalia Neverova (Facebook AI Research), Julien Mille (INSA Centre Val de Loire, LIFAT), Greg Mori (Simon Fraser University, Borealis AI), Christian Wolf (INSA-Lyon, LIRIS)

Benchmark

We introduce the Counterfactual Physics benchmark suite (COPHY) for counterfactual reasoning of physical dynamics from raw visual input. It is composed of three tasks based on three physical scenarios: BlocktowerCF, BallsCF and CollisionCF, defined similarly to existing state-of-the-art environments for learning intuitive physics: ShapeStacks (Groth et al., 2018), the bouncing balls environment (Chang et al., 2017) and Collision (Ye et al., 2018), respectively. This was done to ensure natural continuity between the prior art in the field and the proposed counterfactual formulation. Each scenario includes training and test samples, which we call experiments. Each experiment is represented by two sequences of synthetic RGB images (covering a time span of 6 sec at 4 fps):

  • an observed sequence demonstrates the evolution of the dynamic system under the influence of the laws of physics (gravity, friction, etc.), from its initial state to its final state. For simplicity, we denote the initial state by A and the observed outcome by B;

  • a counterfactual sequence with an initial state C after the do-intervention, and the counterfactual outcome D.

A do-intervention is a visually observable change introduced to the initial physical setup (for instance, object displacement or removal).

Finally, the physical world in each experiment is parameterized by a set of visually unobservable quantities, or confounders (such as object masses, friction coefficients, direction and magnitude of gravitational forces), that cannot be uniquely estimated from a single time step. Our dataset provides ground truth values of all confounders for evaluation purposes. However, we do not assume access to this information during training or inference, and do not encourage it.

Each of the three scenarios in the COPHY benchmark is defined as follows.

BlocktowerCF

Each experiment involves K=3 or K=4 stacked cubes, which are initially at resting (but potentially unstable) positions. We define three different confounder variables: masses, m∈{1, 10}, and friction coefficients, µ∈{0.5, 1}, for each block, as well as gravity components in the X and Y directions, gx,y∈{−1, 0, 1}. The do-interventions include block displacement or removal. This set contains 146k sample experiments corresponding to 73k different geometric block configurations.

BallsCF

Experiments show K bouncing balls (K=2…6). Each ball has an initial random velocity. The confounder variables are the masses, m∈{1, 10}, and the friction coefficients, µ∈{0.5, 1}, of each ball. There are two do-operators: ball displacement or removal. There are in total 100k experiments corresponding to 50k different initial geometric configurations.

CollisionCF

This scenario involves moving objects colliding with static objects (balls or cylinders). The confounder variables are the masses, m∈{1, 10}, and the friction coefficients, µ∈{0.5, 1}, of each object. The do-interventions are limited to object displacement. The set includes 40k experiments with 20k unique geometric object configurations.

Usage

Given this data, the problem can be formalized as follows. During training, we are given the quadruplets of visual observations A, B, C, D, but do not have access to the values of the confounders. During testing, the objective is to reason on new visual data unobserved at training time and to predict the counterfactual outcome D, having observed the first sequence (A, B) and the modified initial state C after the do-intervention, which is known.
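As a minimal illustration of this protocol (the model interface below is hypothetical and not part of the released code), a counterfactual predictor consumes the observed pair (A, B) together with the modified state C and returns an estimate of D:

    def predict_counterfactual(model, seq_ab, state_c):
        """Hypothetical interface of a counterfactual predictor.

        seq_ab  : observed sequence A -> B, e.g. RGB frames of shape [T, H, W, 3]
        state_c : modified initial state C after the do-intervention, shape [H, W, 3]
        returns : predicted counterfactual outcome D, shape [T, H, W, 3]
        """
        # 1) infer a latent code for the unobservable confounders from the observed run
        confounder_code = model.encode_confounders(seq_ab)
        # 2) roll out the dynamics from the modified state C, conditioned on that code
        return model.rollout(state_c, confounder_code)

During training, the loss is computed against the ground-truth outcome D only; the confounder values themselves are never used as targets or inputs.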

The COPHY benchmark is by construction balanced and bias-free w.r.t. (1) the global statistics of all confounder values within each scenario, and (2) the distribution of possible outcomes of each experiment over the whole set of possible confounder values (for a given do-intervention). We make sure that the data does not degenerate to simple regularities which are solvable by conventional methods predicting the future from the past. In particular, for each experimental setup, we enforce the existence of at least two different confounder configurations resulting in significantly different object trajectories. This guarantees that estimating the confounder variables is necessary for visual reasoning on this dataset. More specifically, we ensure that for each experiment the set of possible counterfactual outcomes is balanced w.r.t. (1) tower stability for BlocktowerCF and (2) the distribution of object trajectories for BallsCF and CollisionCF. As a result, the BlocktowerCF set, for example, has 50 ± 5% stable and unstable counterfactual configurations.
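The kind of filtering this implies can be sketched as follows; the simulate function, the trajectory format and the threshold are illustrative assumptions, not the released generation code:

    import itertools
    import numpy as np

    def is_informative(simulate, initial_state, confounder_configs, threshold=0.5):
        """Keep an experiment only if at least two confounder configurations lead to
        clearly different object trajectories, so the confounders cannot be ignored."""
        # simulate(initial_state, confounders) -> object positions over time, [T, K, 3]
        trajectories = [simulate(initial_state, c) for c in confounder_configs]
        for t1, t2 in itertools.combinations(trajectories, 2):
            if np.mean(np.linalg.norm(t1 - t2, axis=-1)) > threshold:
                return True
        return False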

The exact distribution of stable/unstable examples for each confounder in this scenario is shown in the figure below.

All images for this benchmark have been rendered into the visual space (RGB, depth and instance segmentation) at a resolution of 448 × 448 px with PyBullet (only RGB images are used in this work). We ensure diversity in visual appearance between experiments by rendering the pairs of sequences over a set of randomized backgrounds. The ground truth physical properties of each object (3D pose, 4D quaternions, velocities) are sampled at a higher frame rate (20 fps) and also stored. The training / validation / test split is defined as 0.7 : 0.2 : 0.1 for each of the three scenarios.
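Since the RGB sequences are rendered at 4 fps and the ground-truth states are stored at 20 fps, aligning a video frame with its state sample is a factor-of-5 index mapping (assuming both streams start at t=0 and are uniformly sampled; this is an illustration, not released code):

    STATE_FPS = 20   # ground-truth state sampling rate (see above)
    VIDEO_FPS = 4    # rendered RGB frame rate (see above)
    RATIO = STATE_FPS // VIDEO_FPS  # = 5

    def state_index_for_frame(frame_idx):
        # e.g. frame 10 (t = 2.5 s) maps to state sample 50;
        # a 6 s sequence has ~24 RGB frames and ~120 state samples
        return frame_idx * RATIO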

Data Format

We provide two versions of the dataset:

  • 13GB: a lossily compressed version, where image sequences are stored in mp4 format, or
  • 550GB (!!): a losslessly compressed version, where individual frames are stored as .png images.

We highly recommend the lighter 13GB version.

The root directory is composed of 3 sub-directories (ballsCF, blocktowerCF and collisionCF), which correspond to the three proposed datasets explained above.

BallsCF:

  • Different subsets exist for different numbers of balls, corresponding to different subfolders.
  • Each subset is further organized into subfolders corresponding to the different seeds used for sampling/creation.
  • The train/validation/test splits are done over seeds; a text file indicates the split:

      <number_of_balls>/<split>_<number_balls>.txt                 
    
  • For each seed, the data is spread over several files:

      /<seed>/<ex_id>/confounders.py        # confounders  
      /<seed>/<ex_id>/explanations.txt      # doc
      /<seed>/<ex_id>/{ab,cd}/bboxes.npy    # pixel bounding boxes
      /<seed>/<ex_id>/colors.txt            # association between object and color
      /<seed>/<ex_id>/rgb.mp4               # rgb video
      /<seed>/<ex_id>/segm.mp4              # video of the segmentation
      /<seed>/<ex_id>/states.npy            # 3D states of each object
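As an illustration only, one experiment can be read roughly as below; the array shapes, the exact content of states.npy, and the choice of OpenCV as the video reader are assumptions to verify against the downloaded data (the CollisionCF and BlocktowerCF experiments below follow a similar layout):

    import os
    import numpy as np
    import cv2  # any video reader works; OpenCV is used here only as an example

    def load_experiment(exp_dir):
        """Rough sketch of reading one experiment directory (paths follow the listing above)."""
        states = np.load(os.path.join(exp_dir, "states.npy"))           # per-step object states
        bboxes_ab = np.load(os.path.join(exp_dir, "ab", "bboxes.npy"))  # pixel boxes, observed sequence
        bboxes_cd = np.load(os.path.join(exp_dir, "cd", "bboxes.npy"))  # pixel boxes, counterfactual sequence
        with open(os.path.join(exp_dir, "colors.txt")) as f:
            colors = f.read().splitlines()                              # object <-> color association

        frames = []
        cap = cv2.VideoCapture(os.path.join(exp_dir, "rgb.mp4"))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame[..., ::-1])  # OpenCV decodes to BGR; flip to RGB
        cap.release()
        return states, (bboxes_ab, bboxes_cd), colors, np.stack(frames)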
    

CollisionCF:

  • The data is organized into subfolders corresponding to the different seeds used for sampling/creation.
  • The train/validation/test splits are done over seeds; a text file indicates the split:

      <seed>/<split>_<type>.txt             # correspond to the train/val/test splits
      <seed>/<ex_id>/confounders.py         # confounders  
      <seed>/<ex_id>/explanations.txt
      <seed>/<ex_id>/{ab,cd}/bboxes.npy     # pixel bounding boxes
      <seed>/<ex_id>/colors.txt             # association between object and color
      <seed>/<ex_id>/rgb.mp4                # rgb video
      <seed>/<ex_id>/segm.mp4               # video of the segmentation
      <seed>/<ex_id>/states.npy             # states of each object: pose and velocities
    

BlocktowerCF:

  • Different subsets exist for different numbers of cubes, corresponding to different subfolders.
  • Each subset is further organized into subfolders corresponding to the different seeds used for sampling/creation.
  • The train/validation/test splits are done over seeds; a text file indicates the split:

      <number_of_cubes>/<split>_<number_balls>_<type>.txt # train/val/test splits	    
    
  • For each seed, the data is spread over several files:

      <seed>/<ex_id>/confounders.py # confounders  
      <seed>/<ex_id>/gravity.txt # gravity of the env 
      <seed>/<ex_id>/{ab,cd}/bboxes.npy # pixel bounding boxes
      <seed>/<ex_id>/colors.txt # association between object and color
      <seed>/<ex_id>/rgb.mp4 # rgb video
      <seed>/<ex_id>/segm.mp4 # video of the segmentation
      <seed>/<ex_id>/states.npy # 3D states of each object: pose and velocities
    

Extracting .png frames from the MPEG4 videos

To reduce data storage, we release the .mp4 files of each sequence. Below are the commands for extracting the .png frames, assuming that you are already in the directory where the mp4 file is located:

rgb frames

ffmpeg -hide_banner -loglevel panic -y -i rgb.mp4 rgb_%03d.png

segm frames

ffmpeg -hide_banner -loglevel panic -y -i segm.mp4 segm_%03d.png
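To run this over many experiments at once, a small wrapper around the same ffmpeg call can be used; the scenario directory and glob pattern below are placeholders to adapt to your local layout:

    import pathlib
    import subprocess

    root = pathlib.Path("ballsCF")  # or blocktowerCF / collisionCF
    for video in root.rglob("*.mp4"):
        prefix = video.stem  # "rgb" or "segm"
        # same ffmpeg invocation as above, writing the frames next to the video
        subprocess.run(
            ["ffmpeg", "-hide_banner", "-loglevel", "panic", "-y",
             "-i", str(video), str(video.parent / f"{prefix}_%03d.png")],
            check=True,
        )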

Conversely, if you want to re-create the mp4 video file from a set of .png files, you can use the commands below:

rgb video

ffmpeg -hide_banner -loglevel panic -y -framerate 5 -pattern_type glob -i 'rgb_*.png' rgb.mp4

segm video

ffmpeg -hide_banner -loglevel panic -y -framerate 5 -pattern_type glob -i 'segm_*.png' segm.mp4

Download

If you use this benchmark, please cite the following paper:

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, Christian Wolf. COPHY: Counterfactual Learning of Physical Dynamics. In International Conference on Learning Representations (ICLR), 2020.

MPEG4 Compressed 13GB version

  • Sequences are stored in MPEG4 format.
  • Download from the EU Zenodo site

Losslessly Compressed 550GB (!!) version

  • Individual frames are stored as .png images.
  • Download: available soon.

Paper

Conference version

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, Christian Wolf.
COPHY: Counterfactual Learning of Physical Dynamics.
International Conference on Learning Representations, 2020 [Openreview-Link].

Arxiv version

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, Christian Wolf.
COPHY: Counterfactual Learning of Physical Dynamics.
arXiv:1909.12000, 2019 [Arxiv].