Provably Learning Object-Centric Representations
MPI-IS & University of Cambridge
tl;dr: We analyze when object-centric representations can be learned without supervision and introduce two assumptions, compositionality and irreducibility, under which we prove that the ground-truth object representations can be identified.
News
May '23 | Our paper was accepted for an Oral Presentation at ICML 2023!
May '23 | The pre-print is now available on arXiv.
Abstract
Learning structured representations of the visual world in terms of objects promises to significantly improve the generalization abilities of current machine learning models. While recent efforts to this end have shown promising empirical progress, a theoretical account of when unsupervised object-centric representation learning is possible is still lacking. Consequently, understanding the reasons for the success of existing object-centric methods as well as designing new theoretically grounded methods remains challenging. In the present work, we analyze when object-centric representations can provably be learned without supervision. To this end, we first introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility. Under this generative process, we prove that the ground-truth object representations can be identified by an invertible and compositional inference model, even in the presence of dependencies between objects. We empirically validate our results through experiments on synthetic data. Finally, we provide evidence that our theory holds predictive power for existing object-centric models by empirically showing a close correspondence between models' compositionality and invertibility and their identifiability.
When can unsupervised object-centric representations provably be learned?
We assume that observed scenes comprising multiple objects are rendered by an unknown generator f from multiple ground-truth latent slots. We assume that this generative model has two key properties, which we call compositionality and irreducibility. Under this model, we prove that an invertible inference model with a compositional inverse yields latent slots that identify the ground-truth slots up to permutation and slot-wise invertible functions. We call this slot identifiability. To measure violations of compositionality in practice, we introduce a contrast function (compositional contrast) which is zero if and only if a function is compositional, while to measure invertibility, we rely on the reconstruction loss in an auto-encoder framework.
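To make the compositional contrast concrete, here is a minimal numpy sketch (not the paper's implementation): for each pixel, it multiplies the norms of the Jacobian blocks belonging to different slots and sums the result, so it vanishes exactly when every pixel depends on at most one slot. The finite-difference Jacobian and the two toy generators `f_comp` / `f_mix` are illustrative constructions, not from the paper.

```python
import numpy as np

def jacobian(f, z, eps=1e-6):
    """Finite-difference Jacobian of f: R^d -> R^n at z (shape n x d)."""
    z = np.asarray(z, dtype=float)
    f0 = f(z)
    J = np.zeros((f0.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f0) / eps
    return J

def compositional_contrast(f, z, slots):
    """Sum over pixels of pairwise products of slot-gradient norms.

    `slots` is a list of index arrays partitioning the latent dimensions.
    The contrast is zero iff each pixel depends on at most one slot.
    """
    J = jacobian(f, z)
    norms = np.stack([np.linalg.norm(J[:, s], axis=1) for s in slots])
    total = 0.0
    for k in range(len(slots)):
        for l in range(k + 1, len(slots)):
            total += np.sum(norms[k] * norms[l])
    return total

# Toy example: 2 slots of 2 latents each, 4 "pixels".
slots = [np.array([0, 1]), np.array([2, 3])]
f_comp = lambda z: np.array([z[0], z[0] * z[1], z[2], z[3] ** 2])  # one slot per pixel
f_mix  = lambda z: np.array([z[0] + z[2], z[1], z[2], z[3]])       # pixel 0 mixes slots

z = np.array([0.5, -1.0, 2.0, 0.3])
print(compositional_contrast(f_comp, z, slots))  # ~0: compositional
print(compositional_contrast(f_mix, z, slots))   # ~1: pixel 0 touches both slots
```

In a learned model, the same quantity can be evaluated on the decoder's Jacobian (via autodiff rather than finite differences) to measure how far it is from compositionality.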
How do we formalize object-centric data in our theoretical framework?
We define a generative model for multi-object scenes where objects are described by latent slots. We would like to enforce that each latent slot is responsible for encoding a distinct object in a scene. To this end, we make two core assumptions on the generator function from latent slots to images.
The first assumption we make, compositionality, captures that each object is generated by one latent slot. We formalize this through an assumption on the Jacobian matrix of the generator function, which states that each pixel has a non-zero partial derivative with respect to at most one latent slot. This imposes a local sparsity structure on the Jacobian, visualized below (Section 2.1).
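The sparsity structure above can be checked directly on a Jacobian matrix. The following toy check (an illustration of the assumption, not code from the paper) verifies that every pixel's row of the Jacobian has non-zero entries in at most one slot's column block:

```python
import numpy as np

def is_compositional(J, slots, tol=1e-8):
    """Check the Jacobian sparsity of Section 2.1: every pixel (row of J)
    has non-zero partial derivatives w.r.t. at most one slot's latents."""
    for row in J:
        active_slots = sum(np.any(np.abs(row[s]) > tol) for s in slots)
        if active_slots > 1:
            return False
    return True

slots = [[0, 1], [2, 3]]  # two slots, two latent dims each
J_ok = np.array([[1., 2., 0., 0.],   # pixel 0: slot 1 only
                 [0., 3., 0., 0.],   # pixel 1: slot 1 only
                 [0., 0., 1., 0.],   # pixel 2: slot 2 only
                 [0., 0., 0., 5.]])  # pixel 3: slot 2 only
J_bad = J_ok.copy()
J_bad[0, 2] = 1.  # pixel 0 now depends on both slots

print(is_compositional(J_ok, slots))   # True
print(is_compositional(J_bad, slots))  # False
```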
The second assumption we make, irreducibility, captures that each latent slot generates at most one object rather than multiple objects. To enforce this, we assume that the mechanisms which generate parts of the same object are dependent. We formalize a mechanism through the Jacobian matrix of the generator and dependence in a non-statistical sense using the rank of these mechanisms (Section 2.2).
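Roughly, the rank condition says that if a slot's pixels could be split into two groups whose Jacobian sub-blocks use disjoint latent directions (ranks adding up to the rank of the whole block), the slot would really encode two independent objects. The sketch below (an illustrative reading of Section 2.2, not the paper's exact definition) tests this for a slot's Jacobian block:

```python
import numpy as np
from itertools import combinations

def is_irreducible(J_slot, tol=1e-8):
    """Rank-based dependence check: for every split of a slot's pixels into
    two groups, the sub-mechanisms' ranks must sum to MORE than the rank of
    the full mechanism, i.e. the groups share latent directions (one object)."""
    n = J_slot.shape[0]
    rank_full = np.linalg.matrix_rank(J_slot, tol=tol)
    for r in range(1, n):
        for group in combinations(range(n), r):
            rest = [i for i in range(n) if i not in group]
            r1 = np.linalg.matrix_rank(J_slot[list(group)], tol=tol)
            r2 = np.linalg.matrix_rank(J_slot[rest], tol=tol)
            if r1 + r2 <= rank_full:
                return False  # reducible: slot splits into independent mechanisms
    return True

# One object: both pixels vary along the shared latent direction [1, 1].
J_one_object = np.array([[1., 1.],
                         [2., 2.]])
# Two objects packed into one slot: each pixel uses its own latent dimension.
J_two_objects = np.array([[1., 0.],
                          [0., 1.]])

print(is_irreducible(J_one_object))   # True
print(is_irreducible(J_two_objects))  # False
```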
When can object-centric representations be learned in our framework?
Under this generative model, we then prove our main theoretical result: inference models which are invertible (i.e., minimize the reconstruction loss) and have a compositional inverse (i.e., the decoder maps different slots to different pixels) identify the ground-truth latent slots (Theorem 1).
We validate empirically on controlled synthetic data that inference models which maximize invertibility and compositionality indeed identify the ground-truth latent slots (Section 5.1).
Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility and their slot identifiability (Section 5.2).
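Slot identifiability up to permutation and slot-wise invertible functions can be probed empirically. As a toy proxy (a simple linear-readout diagnostic, not the paper's exact metric), one can compute an R² matrix between ground-truth and inferred slots: identification shows up as exactly one high entry per row and column, i.e. a permutation pattern.

```python
import numpy as np

def r2_matrix(z_true, z_hat, K, d):
    """z_true, z_hat: (n, K*d) arrays holding K slots of d dims each.
    Entry (k, l) is the R^2 of linearly predicting true slot k from
    inferred slot l; a permutation pattern indicates slot identifiability."""
    n = z_true.shape[0]
    R = np.zeros((K, K))
    for k in range(K):
        t = z_true[:, k * d:(k + 1) * d]
        for l in range(K):
            h = np.column_stack([z_hat[:, l * d:(l + 1) * d], np.ones(n)])
            beta, *_ = np.linalg.lstsq(h, t, rcond=None)
            resid = t - h @ beta
            R[k, l] = 1 - resid.var() / t.var()
    return R

# Hypothetical "perfectly identified" case: inferred slots are scaled,
# permuted copies of the ground-truth slots (K=2 slots, d=2 dims each).
rng = np.random.default_rng(0)
z_true = rng.normal(size=(200, 4))
z_hat = np.column_stack([3 * z_true[:, 2:], -z_true[:, :2]])  # swap + scale
R = r2_matrix(z_true, z_hat, K=2, d=2)
print(np.round(R, 2))  # high off-diagonal, ~0 diagonal: a permutation pattern
```

A linear readout only detects linear slot-wise transformations; a nonlinear regressor would be needed to credit arbitrary slot-wise invertible functions.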
Acknowledgements & Funding
We thank: the anonymous reviewers for helpful suggestions which led to improvements in the manuscript, Andrea Dittadi for helpful discussions regarding experiments, Attila Juhos, Amin Charusaie, Michel Besserve, and Simon Buchholz for helpful technical discussions, and Zac Cranko for theoretical efforts in the early stages of the project. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting JB, RSZ and YS. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, 01IS18039B. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645.
BibTeX
If you find our work helpful, please cite our paper:
@inproceedings{brady2023provably,
author = {
Brady, Jack and
Zimmermann, Roland S. and
Sharma, Yash and
Sch{\"o}lkopf, Bernhard and
von K{\"u}gelgen, Julius and
Brendel, Wieland and
},
title = {
Provably Learning
Object-Centric Representations
},
year = {2023},
booktitle = {
Proceedings of the 40th International
Conference on Machine Learning
},
articleno = {126},
numpages = {25},
location = {Honolulu, Hawaii, USA},
series = {ICML'23}
}