Provable Compositional Generalization for Object-Centric Learning
tl;dr: We show theoretical conditions under which compositional generalization is guaranteed for object-centric representation learning.
News
Feb '24 | Our paper was accepted as an Oral at ICLR 2024! |
Oct '23 | The pre-print is now available on arXiv. |
Abstract
Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture will be true, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.
Overview
Object-centric representations encode each object in a scene in a separate slot. Because of this, they are thought to generalize compositionally, meaning an object-centric model should be able to extract the correct representation for unseen combinations of observed objects. In practice, however, when we train an object-centric autoencoder on a training set that only contains some object combinations, we find that the model performs well for seen combinations, but not for unseen ones.
We pose compositional generalization as a slot identifiability problem: Is the model able to reconstruct the ground-truth slots on the training set? If so, does this ability generalize to unseen combinations of objects?
We show that if the training set is chosen correctly, if the decoder of the model is additive, and if we train with an additional loss term that we dub the compositional consistency loss, then the model generalizes compositionally in this sense.
We understand compositional generalization in object-centric learning as the ability of a model to represent the objects in a scene in distinct slots, even for unseen object combinations. By plotting the reconstruction error over the entire domain, we see that an additive decoder by itself generates valid images within and outside of the training set, that is, it generalizes (A). But when put together with the encoder, the reconstruction error shoots up outside of the training domain: the model is unable to generalize (B). Only if we add our proposed compositional consistency objective, which aligns the encoder with the decoder, is the entire model able to generalize (C).
How do we formalize compositional generalization and which assumptions do we make?
We start from a result in our earlier work showing that a model can correctly identify the slots of an assumed ground-truth data-generating process (that is, it can learn object-centric representations) if the data-generating process is compositional and irreducible (both terms introduced in our earlier work) and the model is trained on the entire domain. From there, we take four steps to arrive at our final result:
- We assume all the training data is generated from a slot-supported subset of the data domain, which is any subset that contains all variations of each individual object, but not necessarily all combinations of all objects.
- We extend our earlier result to show that a compositional autoencoder trained via a reconstruction objective can still identify the ground-truth slots on this training domain if the slot-supported subset is also convex. Note that the recovered training subset of the latent space (blue) is not exactly the same as the ground-truth, since the model recovers it only up to some permissible ambiguities.
- Since the model recovered the individual slots, we can simply recombine them to obtain unseen combinations of objects in the model's latent space. We show that if the decoder is additive, its reconstructions of these unseen combinations correspond exactly to images of unseen object combinations.
- However, for unseen object combinations, the encoder initially does not produce representations that can be expressed as a recombination of slots in the model's latent space. We therefore need to align the encoder output with the expected input to the decoder. We achieve this via our proposed compositional consistency loss.
We arrive at a model whose representations slot-identify the ground-truth latent space, even for unseen combinations of objects.
What does this look like in practice?
From an implementation perspective, our biggest contribution is the compositional consistency loss, which is added as a training objective alongside the default reconstruction loss. We can implement it by taking the model's latent representation of the batch during the forward pass, randomly recombining slots between samples, and then applying the decoder and encoder to the result. The resulting latent representation of each sample should be consistent with the recombination of latents, which we can measure, e.g., via an L2 loss.
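A minimal PyTorch sketch of this consistency term, assuming a generic slot encoder and decoder; the function name, the recombination scheme, and the tensor shapes are illustrative assumptions, not the paper's reference implementation:

import torch
import torch.nn.functional as F

def compositional_consistency_loss(encoder, decoder, z):
    # z: slot latents of a batch, shape (batch, num_slots, slot_dim).
    # encoder/decoder are the model's own modules; all names here are placeholders.
    batch, num_slots, _ = z.shape
    # Randomly recombine slots across the batch: every slot position of every
    # sample is drawn from a (potentially) different sample in the batch.
    source = torch.randint(batch, (batch, num_slots))
    z_recombined = z[source, torch.arange(num_slots)]   # (batch, num_slots, slot_dim)
    # Decode the recombined slots and re-encode the generated image.
    x_generated = decoder(z_recombined)
    z_reencoded = encoder(x_generated)
    # The re-encoded latents should match the recombined latents
    # (slot-order ambiguities are ignored in this sketch).
    return F.mse_loss(z_reencoded, z_recombined)

In a training step, this term would simply be added to the reconstruction loss with some weighting factor.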
In latent space, recombining the slots of two random samples will most likely lead to a combination of slots that is not part of the training domain. The decoder can produce a valid image of this slot combination since it is additive and therefore invariant to the order and combination of slots.
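For illustration, here is a minimal sketch of such an additive decoder, assuming flattened image outputs and a shared per-slot network; the architecture is an assumption, not the one used in the paper:

import torch.nn as nn

class AdditiveDecoder(nn.Module):
    # The image is the sum of independently decoded slots, so the output does not
    # depend on slot order or on which sample a slot originally came from.
    def __init__(self, slot_dim: int, img_dim: int):
        super().__init__()
        # Small per-slot decoder, shared across slots (placeholder architecture).
        self.per_slot = nn.Sequential(
            nn.Linear(slot_dim, 256), nn.ReLU(), nn.Linear(256, img_dim)
        )

    def forward(self, z):                    # z: (batch, num_slots, slot_dim)
        per_slot_images = self.per_slot(z)   # (batch, num_slots, img_dim)
        return per_slot_images.sum(dim=1)    # additive composition over slots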
We can show that the model slot-identifies the ground truth exactly when both losses are minimized. Additionally, we do not explicitly need to optimize for compositionality of the decoder, as this property is implicitly enforced during training.
What does this mean for object-centric methods like Slot Attention?
The popular object-centric method Slot Attention fails to generalize compositionally for two reasons.
First, its decoder is not additive since it uses a softmax operation across slots. Therefore, the decoder does not generalize. Replacing the softmax with a slot-wise sigmoid fixes this issue.
Second, as with a vanilla autoencoder, we need to add our compositional consistency training objective to enable the whole model to generalize.
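As a rough sketch of the first change, the alpha-mask compositing of a Slot Attention-style decoder could be switched from a softmax across slots to a slot-wise sigmoid as follows; tensor names and shapes are assumptions, not the original implementation:

import torch

def composite_slots(rgb, alpha_logits, slotwise_sigmoid=True):
    # rgb:          (batch, num_slots, 3, H, W) per-slot color predictions
    # alpha_logits: (batch, num_slots, 1, H, W) per-slot mask logits
    if slotwise_sigmoid:
        # Each slot decides independently where it paints, keeping the
        # composition additive across slots.
        masks = torch.sigmoid(alpha_logits)
    else:
        # Softmax across the slot dimension couples all slots at every pixel,
        # which breaks additivity.
        masks = torch.softmax(alpha_logits, dim=1)
    return (masks * rgb).sum(dim=1)          # (batch, 3, H, W)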
Acknowledgements & Funding
We thank (in alphabetical order): Andrea Dittadi, Egor Krasheninnikov, Evgenii Kortukov, Julius von Kügelgen, Prasanna Mayilvahannan, Roland Zimmermann, Sébastien Lachapelle, and Thomas Kipf for helpful insights and discussions.
This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, 01IS18039B. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG.
The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting TW and AJ.
BibTeX
If you find our study helpful, please cite our paper:
@inproceedings{wiedemer2024provable,
title={Provable Compositional Generalization for Object-Centric Learning},
author={
Thadd{\"a}us Wiedemer and Jack Brady and Alexander Panfilov and Attila Juhos and Matthias Bethge and Wieland Brendel
},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=7VPTUWkiDQ}
}