DensePhysNet: Learning Dense Physical Object Representations via Multi-step Dynamic Interactions

Supplementary video for our setup, model, and results


We study the problem of learning physical object representations for robot manipulation. Understanding object physics is critical for successful object manipulation, but it is also challenging because physical object properties can rarely be inferred from the object’s static appearance. In this paper, we propose DensePhysNet, a system that actively executes a sequence of dynamic interactions (e.g., sliding and colliding) and uses a deep predictive model over its visual observations to learn dense, pixel-wise representations that reflect the physical properties of observed objects. Our experiments in both simulation and real-world settings demonstrate that the learned representations carry rich physical information and can be used directly to decode physical object properties such as friction and mass. The use of dense representations enables DensePhysNet to generalize well to novel scenes with more objects than seen in training. With knowledge of object physics, the learned representation also leads to more accurate and efficient manipulation in downstream tasks than state-of-the-art methods.


Figure 1: Our goal is to build a robotic system that learns a dense physical object representation from a few dynamic interactions with objects. The learned representation can then be used to decode an object's properties, such as its material and mass; applied in manipulation tasks such as sliding objects with unknown physics; and combined with a physics engine to tackle novel tasks.


Figure 2: DensePhysNet consists of five modules: (a) an image encoder, (b) a multi-step information aggregator, (c) an action encoder, (d) a cross convolutional layer, and (e) a motion predictor.
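
As a rough illustration of how modules (c) and (d) in Figure 2 could fit together, the sketch below assumes that the action encoder turns an action vector into one small kernel per feature channel, and that the cross convolutional layer applies those kernels to the dense feature map. The function names, the linear action encoder, and all tensor sizes are hypothetical and chosen only for illustration; this is not the DensePhysNet implementation.

```python
import numpy as np


def encode_action(action, weight, channels, ksize):
    """Hypothetical action encoder: a linear map from an action vector
    to one ksize x ksize kernel per feature channel."""
    return (weight @ action).reshape(channels, ksize, ksize)


def cross_convolve(features, kernels):
    """Apply one kernel per channel to a dense feature map.

    features: array of shape (C, H, W)
    kernels:  array of shape (C, k, k), with k odd
    Returns an array of shape (C, H, W) (zero padding, stride 1).
    As in most deep learning libraries, the kernel is not flipped.
    """
    _, height, width = features.shape
    ksize = kernels.shape[-1]
    pad = ksize // 2
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(features.shape, dtype=float)
    for dy in range(ksize):
        for dx in range(ksize):
            # Weight each shifted copy of the map by its per-channel kernel tap.
            tap = kernels[:, dy, dx, None, None]
            out += tap * padded[:, dy:dy + height, dx:dx + width]
    return out


# Toy usage with made-up sizes: 4 channels, an 8x8 feature map, a 3-D action.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))      # stand-in for the dense representation
weight = rng.normal(size=(4 * 3 * 3, 3))   # stand-in for learned encoder weights
action = np.array([0.5, -0.2, 1.0])
kernels = encode_action(action, weight, channels=4, ksize=3)
conditioned = cross_convolve(features, kernels)
print(conditioned.shape)  # (4, 8, 8)
```

Because the kernels depend on the action, the same feature map yields a different conditioned output for each action. Under the assumptions above, the motion predictor (e) would consume that output.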