Google Research Open-Sources ‘SAVi’: an object-centric architecture that extends the Slot Attention mechanism to videos


Humans understand the world in terms of distinct elements that act as compositional building blocks and can be processed independently and recombined. A compositional model of the world is the foundation of high-level cognitive skills such as language, causal reasoning, arithmetic, and planning, and it is essential for generalizing in a predictable and systematic way. Machine learning algorithms with object-centric representations have the potential to dramatically improve sample efficiency, robustness, generalization to new problems, and interpretability.

Unsupervised learning of multi-object representations has been explored in a variety of applications. These algorithms use object-centric inductive biases to learn to separate and represent objects from the statistical structure of the data alone, without supervision. Despite their promising results, these approaches are currently constrained by two major problems:

  1. They’re limited to toy data like moving 2D sprites or extremely crude 3D scenes, and they struggle with more realistic data with complex textures.
  2. It is not clear how to interact with these models during training and inference. The concept of an object is imprecise and task-dependent, and the segmentations these models produce do not always correspond to the tasks of interest.

To address the problem of unsupervised and weakly supervised multi-object segmentation and tracking in video data, new Google research introduces a sequential extension of Slot Attention called Slot Attention for Video (SAVi).


Inspired by predictor-corrector methods for integrating ordinary differential equations, SAVi performs a prediction step and a correction step for each observed video frame. To model the temporal dynamics and interactions of objects, the prediction step uses self-attention among the slots. The correction step uses slot-normalized cross-attention with the inputs to update (or correct) the set of slot representations. The output of the predictor is then used to initialize the corrector at the next time step, allowing the model to track objects consistently over time. Both steps are permutation equivariant with respect to the slots, which preserves slot symmetry.
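The loop described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the released SAVi code: learned projections, recurrent updates and normalization layers are left out, the corrector simply replaces each slot with an attention-weighted mean of the inputs, and the names `corrector`, `predictor` and `track` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def corrector(slots, inputs, eps=1e-8):
    """Slot-normalized cross-attention (simplified, no learned weights).

    slots:  (num_slots, dim), inputs: (num_inputs, dim).
    The softmax runs over the slot axis, so slots compete for each input.
    """
    dim = slots.shape[-1]
    logits = inputs @ slots.T / np.sqrt(dim)            # (num_inputs, num_slots)
    attn = softmax(logits, axis=1)                      # normalize over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    return weights.T @ inputs                           # weighted mean per slot

def predictor(slots):
    """Self-attention among the slots with a residual connection."""
    dim = slots.shape[-1]
    attn = softmax(slots @ slots.T / np.sqrt(dim), axis=1)
    return slots + attn @ slots

def track(frames, init_slots):
    """Run one corrector and one predictor step per frame.

    frames: (num_frames, num_inputs, dim). Returns the corrected slots
    for every frame, shape (num_frames, num_slots, dim).
    """
    slots = init_slots
    per_frame = []
    for inputs in frames:
        corrected = corrector(slots, inputs)   # update slots from the frame
        per_frame.append(corrected)
        slots = predictor(corrected)           # initializes the next corrector
    return np.stack(per_frame)
```

Because neither step depends on the order of the slots, permuting the initial slots only permutes the outputs in the same way, which is the equivariance property mentioned above.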

Recent work on learning object-centric representations has examined the incorporation of inductive biases associated with 3D scene geometry, both for static scenes and for videos. That work also aims to bridge the gap to visually richer and more realistic environments, but it is orthogonal to the use of conditioning and optical flow. The FlowCaps technique proposes to exploit optical flow in a similar way in a multi-object model. However, it uses capsules instead of slots and expects individual capsules to be dedicated to objects or object parts with a specific appearance, making it unsuitable for settings with a wide range of object types. SAVi instead represents objects using exchangeable, slot-based representations.

The conditional tasks the researchers study are closely related to semi-supervised video object segmentation (VOS), a computer vision problem in which segmentation masks are provided for the first video frame during evaluation. VOS is typically addressed through supervised learning on fully annotated videos or comparable datasets. The researchers instead focus on the setting where models have no access to any supervised information beyond the conditioning information on the first frame. Multi-object segmentation and tracking can emerge even when segmentation labels are absent at both training and test time.

During training, each video was split into consecutive 6-frame subsequences, with the first frame of each subsequence receiving the conditioning signal. The researchers train for 100,000 steps (200,000 for fully unsupervised video decomposition) with a batch size of 64. SAVi uses a total of 11 slots. Two iterations of Slot Attention per frame were used for the fully unsupervised video decomposition experiments, and only one iteration otherwise.
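The subsequence split can be sketched in plain Python as follows. The helper name `split_into_subsequences` is hypothetical, and dropping an incomplete trailing chunk is an assumption made for this illustration rather than something the article specifies.

```python
def split_into_subsequences(frames, length=6):
    """Split a list of frames into consecutive, non-overlapping chunks.

    Assumption for this sketch: frames that do not fill a complete
    chunk at the end of the video are dropped.
    """
    usable = len(frames) - len(frames) % length
    return [frames[i:i + length] for i in range(0, usable, length)]


# Example with a hypothetical 24-frame clip, frames represented by their indices.
clips = split_into_subsequences(list(range(24)))
# The first frame of each chunk is the one that would receive the conditioning signal.
conditioning_frames = [clip[0] for clip in clips]
```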

On the CATER dataset, the researchers first test SAVi in an unconditional setting with a standard RGB reconstruction target. Because the two frame-based baselines (Slot Attention and MONet) process each frame independently, they lack a built-in notion of temporal consistency. The (unconditional) SAVi model outperforms these baselines, demonstrating the suitability of the architecture for unsupervised object representation learning, but only on simple synthetic data.

To handle more realistic videos, the team shifted the training target from RGB image prediction to optical flow prediction. In addition, they condition the latent slots of the SAVi model on hints about the objects in the first frame of the video. In each scenario, SAVi was trained on six consecutive frames but run on the full video at test time. In video object segmentation, it is common to use precise segmentation information for the first frame, with models such as T-VOS or CRW propagating the initial masks through the video sequence.

T-VOS achieves 50.4% and 46.4% mIoU on the MOVi and MOVi++ datasets, respectively, while CRW achieves 42.4% and 50.9% mIoU. When trained to predict flow and with segmentation masks as a conditioning signal in the first frame, SAVi learns to produce temporally consistent masks that are much better on MOVi (72.0% mIoU) and slightly poorer than T-VOS and CRW on MOVi++ (43.0% mIoU).
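For reference, mIoU (mean intersection over union) compares predicted and ground-truth segmentation masks object by object and averages the result. The NumPy sketch below is illustrative only: the function name `mean_iou` is hypothetical, and it assumes that predicted object ids are already aligned with ground-truth ids (plausible when masks are propagated from a conditioned first frame), which may differ from the exact evaluation protocol behind the reported numbers.

```python
import numpy as np

def mean_iou(pred, true, num_objects):
    """Mean intersection over union across object ids 0..num_objects-1.

    pred, true: integer arrays of identical shape holding one object id
    per pixel. Ids that appear in neither array are skipped.
    """
    ious = []
    for k in range(num_objects):
        p, t = pred == k, true == k
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # object absent from both masks
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 0.0


# Tiny 2x2 example: object 0 has IoU 1/2, object 1 has IoU 2/3.
pred = np.array([[0, 0], [1, 1]])
true = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, true, num_objects=2)
```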

There are still some challenges to overcome before the system can be applied to all the visual and dynamic complexity of the real world.

  • First, the training method assumes that optical flow information is available at training time, which may not be the case for real-world videos.
  • Second, the settings considered in this study remain limited to rigid objects with rudimentary physics and, in the case of the MOVi datasets, to moving objects only. In addition, learning from optical flow alone is difficult for static objects.

Nonetheless, the research shows that the proposed model performs well at segmentation and tracking. This suggests that model capacity is not the main constraint on learning object-centric representations. Using location information to condition the initialization of slot representations could also open the door to a variety of semi-supervised techniques.


