Provable Compositional Generalization for Object-Centric Learning

Thaddäus Wiedemer*
MPI-IS, University of Tübingen, Tübingen AI Center
Jack Brady*
MPI-IS, University of Tübingen, Tübingen AI Center
Alexander Panfilov*
University of Tübingen, MPI-IS, Tübingen AI Center
Attila Juhos*
MPI-IS, University of Tübingen, Tübingen AI Center
Matthias Bethge
University of Tübingen, Tübingen AI Center
Wieland Brendel
MPI-IS, ELLIS Institute Tübingen, Tübingen AI Center

tl;dr: We show theoretical conditions under which compositional generalization is guaranteed for object-centric representation learning.

News

Feb '24 Our paper was accepted as an Oral at ICLR 2024!
Oct '23 The pre-print is now available on arXiv.

Overview

Object-centric representations encode each object in a scene in a separate slot. Because of this, they are thought to generalize compositionally: an object-centric model should be able to extract the correct representation even for unseen combinations of observed objects. In practice, however, when we train an object-centric autoencoder on a training set that contains only some object combinations, the model performs well on seen combinations but not on unseen ones.

We pose compositional generalization as a slot identifiability problem: Is the model able to recover the ground-truth slots on the training set? And if so, does this ability generalize to unseen combinations of objects?

We show that if the training set is chosen appropriately, the model's decoder is additive, and the model is trained with an additional loss term that we dub the compositional consistency loss, then it generalizes compositionally in this sense.

We understand compositional generalization in object-centric learning as the ability of a model to represent the objects in a scene in distinct slots, even for unseen object combinations. Plotting the reconstruction error over the entire domain shows that an additive decoder by itself generates valid images within and outside of the training set, that is, it generalizes (A). But combined with the encoder, the reconstruction error shoots up outside of the training domain, so the model as a whole is unable to generalize (B). Only if we add our proposed compositional consistency objective, which aligns the encoder with the decoder, is the entire model able to generalize (C).

How do we formalize compositional generalization and which assumptions do we make?

We start from a result in our earlier work showing that a model can correctly identify the slots of an assumed ground-truth data-generating process (that is, it can learn object-centric representations) if the data-generating process is compositional and irreducible (both terms introduced in our earlier work) and the model is trained on the entire domain. From there, we take four steps to arrive at our final result:

  1. We assume all the training data is generated from a slot-supported subset of the data domain, which is any subset that contains all variations of each individual object, but not necessarily all combinations of all objects.
  2. We extend our earlier result to show that a compositional autoencoder trained via a reconstruction objective can still identify the ground-truth slots on this training domain if the slot-supported subset is also convex. Note that the recovered training subset of the latent space (blue) is not exactly the same as the ground truth, since the model recovers it only up to some permissible ambiguities.
  3. Since the model recovers the individual slots, we can simply recombine them to obtain unseen combinations of objects in the model's latent space. We show that if the decoder is additive, its reconstructions of these unseen combinations correspond exactly to images of unseen object combinations (see the sketch after this list).
  4. However, for unseen object combinations, the encoder initially does not produce representations that can be expressed as a recombination of slots in the model's latent space. We therefore need to align the encoder output with the expected input to the decoder. We achieve this via our proposed compositional consistency loss.
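
To make the additivity assumption in step 3 concrete, here is a minimal PyTorch sketch of an additive decoder that renders each slot independently and sums the results. The architecture (a shared MLP applied per slot) and all sizes are illustrative assumptions, not the exact model from the paper.

import torch
import torch.nn as nn

class AdditiveDecoder(nn.Module):
    # Renders each slot with a shared per-slot decoder and sums the
    # results. Because the output is a slot-wise sum, it is invariant
    # to slot order and remains well-defined for slot combinations
    # never seen jointly during training.
    def __init__(self, d_slot: int = 16, img_dim: int = 3 * 64 * 64):
        super().__init__()
        self.slot_decoder = nn.Sequential(  # shared across slots
            nn.Linear(d_slot, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
        )

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        # slots: (batch, n_slots, d_slot)
        per_slot = self.slot_decoder(slots)  # (batch, n_slots, img_dim)
        return per_slot.sum(dim=1)           # additive composition

The sum is the defining property here: every output pixel is a sum of per-slot contributions, so decoding a recombination of slots is as valid as decoding a training sample.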

We arrive at a model whose representations slot-identify the ground-truth latent space, even for unseen combinations of objects.
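
Informally, and glossing over the paper's precise definition, slot identifiability means that the inferred slots match the ground-truth slots up to a permutation of slot indices and slot-wise invertible maps; in LaTeX notation:

\hat{z}_k = h_k\left( z_{\pi(k)} \right), \quad k = 1, \dots, K,

where \pi is a permutation and each h_k is invertible. These are exactly the permissible ambiguities mentioned in step 2 above.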

What does this look like in practice?

From an implementation perspective, our biggest contribution is the compositional consistency loss, which is added as a training objective on top of the default reconstruction loss. We can implement it by taking the model's latent representation of the batch during the forward pass, randomly recombining slots between samples, and then applying the decoder and encoder to the result. The resulting latent representation of each sample should be consistent with the recombination of latents, which we can measure, e.g., via an L2 loss.
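
A minimal sketch of this procedure in PyTorch, assuming an encoder that maps images to slots of shape (batch, n_slots, d_slot) and a decoder that maps slots back to images; the function name, interfaces, and the shuffling scheme are our assumptions, not the paper's reference implementation:

import torch
import torch.nn.functional as F

def compositional_consistency_loss(encoder, decoder, x):
    # Cycle-consistency of slot-wise recombined latents.
    slots = encoder(x)                                   # (B, K, D)
    B, K, D = slots.shape

    # Recombine slots across the batch: slot k of each sample is drawn
    # from a random sample in the batch. Most such combinations lie
    # outside the training distribution.
    idx = torch.randint(0, B, (B, K), device=slots.device)
    recombined = slots[idx, torch.arange(K, device=slots.device)]

    # Decode the recombined slots, then re-encode the generated image.
    x_gen = decoder(recombined)
    slots_cycle = encoder(x_gen)

    # The re-encoded slots should match the recombined ones (we ignore
    # possible slot re-ordering by the encoder for simplicity).
    return F.mse_loss(slots_cycle, recombined)

In training, this term would simply be added to the reconstruction loss, e.g. loss = recon_loss + lam * compositional_consistency_loss(encoder, decoder, x), with lam a hypothetical weighting factor.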

In latent space, recombining the slots of two random samples will most likely lead to a combination of slots that is not part of the training domain. The decoder can produce a valid image of this slot combination since it is additive and therefore invariant to the order and combination of slots.

The compositional consistency loss enforces cycle-consistency in the model's latent space after slot-wise recombinations, similar to the cycle-consistency in image space enforced by the default reconstruction loss (left). Most recombinations of slots between samples in a batch correspond to object combinations outside of the training domain (right).
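
Schematically, with encoder f, decoder g, and \tilde{z} a slot-wise recombination of encoded latents, the two objectives enforce these two cycle-consistencies (our notation, not the paper's exact formulation):

\mathcal{L}_{\text{rec}} = \left\| x - g(f(x)) \right\|^2, \qquad
\mathcal{L}_{\text{cc}} = \left\| \tilde{z} - f(g(\tilde{z})) \right\|^2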

We can show that the model slot-identifies the ground truth exactly when both losses are minimized. Additionally, we do not explicitly need to optimize for compositionality of the decoder, as this property is implicitly enforced during training.

We train multiple models in a controlled setting and see that models slot-identify if they minimize both training objectives (left) and that compositionality is implicitly satisfied (right).

What does this mean for object-centric methods like Slot Attention?

The popular object-centric method Slot Attention fails to generalize compositionally. Given our theoretical insights, we can understand what's going wrong.

First, its decoder is not additive since it uses a softmax operation across slots. Therefore, the decoder does not generalize. Replacing the softmax with a slot-wise sigmoid fixes this issue.
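
To illustrate the difference, here is a hedged sketch of a Slot Attention-style compositing step that combines per-slot RGB and alpha-mask outputs; tensor shapes and names are our assumptions, not the original implementation:

import torch

def composite(rgb, alpha_logits, additive=True):
    # rgb:          (B, K, 3, H, W) per-slot RGB predictions
    # alpha_logits: (B, K, 1, H, W) per-slot mask logits
    if additive:
        # Slot-wise sigmoid: each slot's weight depends only on that
        # slot, so the image is a sum of per-slot terms (additive).
        masks = torch.sigmoid(alpha_logits)
    else:
        # Softmax across slots, as in the original Slot Attention
        # decoder: each slot's weight depends on all slots, which
        # breaks additivity.
        masks = torch.softmax(alpha_logits, dim=1)
    return (masks * rgb).sum(dim=1)  # (B, 3, H, W)

With additive=True, the decoder satisfies the additivity condition from our theory; with additive=False, it does not.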

Second, as with a vanilla autoencoder, we need to add our compositional consistency training objective to enable the whole model to generalize.

Slot Attention fails to generalize compositionally out-of-the-box, but modifying it to satisfy additivity and training with the compositional consistency loss alleviates this.

BibTeX

If you find our study helpful, please cite our paper:

@inproceedings{wiedemer2024provable,
  title={Provable Compositional Generalization for Object-Centric Learning},
  author={
    Thadd{\"a}us Wiedemer and Jack Brady and Alexander Panfilov and Attila Juhos and Matthias Bethge and Wieland Brendel
  },
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=7VPTUWkiDQ}
}