# Provably Learning Object-Centric Representations

MPI-IS & University of Cambridge

**tl;dr:**
We analyze when object-centric representations can be learned without supervision and introduce two assumptions, compositionality and irreducibility, under which we prove that ground-truth object representations can be identified.

### News

- **May '23:** Our paper was accepted for an oral presentation at ICML 2023!
- **May '23:** The pre-print is now available on arXiv.

### Abstract

Learning structured representations of the visual world in terms of objects promises to significantly improve the generalization abilities of current machine learning models. While recent efforts to this end have shown promising empirical progress, a theoretical account of when unsupervised object-centric representation learning is possible is still lacking. Consequently, understanding the reasons for the success of existing object-centric methods as well as designing new theoretically grounded methods remains challenging. In the present work, we analyze when object-centric representations can provably be learned without supervision. To this end, we first introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility. Under this generative process, we prove that the ground-truth object representations can be identified by an invertible and compositional inference model, even in the presence of dependencies between objects. We empirically validate our results through experiments on synthetic data. Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility and their empirical identifiability.

### When can unsupervised object-centric representations provably be learned?

We assume that observed scenes comprising
multiple objects are rendered by an unknown generator f from multiple ground-truth latent slots. We assume that this
generative model has two key properties, which we call **compositionality** and **irreducibility**. Under this model, we
prove that an invertible inference model with a compositional inverse yields latent slots that identify the ground-truth slots
up to permutation and slot-wise invertible functions. We call this **slot identifiability**. To measure violations of compositionality in practice,
we introduce a contrast function (**compositional contrast**) which is zero if and only if a function is compositional, while to measure invertibility, we rely
on the reconstruction loss in an auto-encoder framework.
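To make the compositional contrast concrete, here is a minimal NumPy sketch. It assumes the contrast takes the form of a sum, over pixels, of pairwise products of slot-wise gradient norms (one natural reading of "zero if and only if the function is compositional"); for a linear generator, the Jacobian is simply its weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
slot_dim, n_slots, n_pixels = 2, 2, 6
slots = [slice(k * slot_dim, (k + 1) * slot_dim) for k in range(n_slots)]

def compositional_contrast(J):
    # Sum over pixels of pairwise products of slot-wise gradient norms.
    # Vanishes exactly when every pixel depends on at most one slot.
    total = 0.0
    for x in range(J.shape[0]):
        norms = [np.linalg.norm(J[x, s]) for s in slots]
        for k in range(n_slots):
            for l in range(k + 1, n_slots):
                total += norms[k] * norms[l]
    return total

# Compositional (linear) generator: pixels 0-2 read slot 0, pixels 3-5 slot 1.
J_comp = np.zeros((n_pixels, n_slots * slot_dim))
J_comp[:3, slots[0]] = rng.normal(size=(3, slot_dim))
J_comp[3:, slots[1]] = rng.normal(size=(3, slot_dim))

# Non-compositional generator: every pixel mixes both slots.
J_mix = rng.normal(size=(n_pixels, n_slots * slot_dim))

print(compositional_contrast(J_comp))  # 0.0
print(compositional_contrast(J_mix))   # strictly positive
```

For a nonlinear generator, `J` would be the Jacobian evaluated at a particular latent input, so the contrast is a per-scene quantity that can be averaged over the data.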

### How do we formalize object-centric data in our theoretical framework?

We define a generative model for multi-object scenes where objects are described by latent slots. We would like to enforce that each latent slot is responsible for encoding a distinct object in a scene. To this end, we make two core assumptions on the generator function from latent slots to images.

The first assumption we make, **compositionality**, captures that each object is generated by exactly one latent slot. We formalize this through an assumption on the Jacobian matrix of the generator function: each pixel has a non-zero partial derivative with respect to at most one latent slot. This imposes a local sparsity structure on the Jacobian, visualized below (Section 2.1).

*Figure: (**A**) For a compositional generator **f**, every pixel is affected by at most one latent slot. As a result, there always exists an ordering of the pixels such that the generator's Jacobian **J<sub>f</sub>** consists of disjoint blocks, one for each latent slot (bottom). Note that both the pixel ordering and the specific structure of the Jacobian are not fixed across scenes and might depend on the latent input z. (**B**) For a non-compositional generator, no pixel ordering exposes such a structure in the Jacobian, since the same pixel can be affected by more than one latent slot.*
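The point that the block pattern may vary with the latent input can be illustrated with a small NumPy sketch. The generator `f` below is a made-up toy, not one from the paper: which slot owns the middle pixels switches with the sign of one latent (think occlusion), yet every pixel always depends on at most one slot:

```python
import numpy as np

def f(z):
    # Toy compositional generator with two 2-dim slots over 4 pixels.
    z0, z1 = z[:2], z[2:]
    out = np.empty(4)
    out[0] = z0[0] * z0[1]       # always slot 0
    out[3] = z1[0] + z1[1]       # always slot 1
    if z0[0] > 0:                # slot 0 occludes slot 1 on pixels 1-2
        out[1], out[2] = np.sin(z0[1]), z0[0] ** 2
    else:                        # slot 1 is visible on pixels 1-2
        out[1], out[2] = z1[0] ** 2, np.cos(z1[1])
    return out

def jacobian(f, z, eps=1e-6):
    # Central finite differences.
    J = np.zeros((4, 4))
    for j in range(4):
        e = np.zeros(4); e[j] = eps
        J[:, j] = (f(z + e) - f(z - e)) / (2 * eps)
    return J

def owning_slots(J, tol=1e-4):
    # For each pixel, the set of slots with a non-zero gradient block.
    return [{k for k in range(2) if np.abs(J[x, 2 * k:2 * k + 2]).max() > tol}
            for x in range(4)]

za = np.array([0.7, 0.3, 0.5, -0.2])    # slot 0 owns pixels 1, 2
zb = np.array([-0.7, 0.3, 0.5, -0.2])   # slot 1 owns pixels 1, 2
print(owning_slots(jacobian(f, za)))    # [{0}, {0}, {0}, {1}]
print(owning_slots(jacobian(f, zb)))    # [{0}, {1}, {1}, {1}]
```

In both cases each pixel touches a single slot, so the Jacobian can be permuted into disjoint blocks, but the required permutation differs between the two scenes.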

The second assumption we make, **irreducibility**, captures that each latent slot generates at most one object rather than several. To enforce this, we assume that the mechanisms which generate parts of the same object are dependent. We formalize a mechanism through the Jacobian matrix of the generator, and dependence in a non-statistical sense via the rank of these mechanisms (Section 2.2).

*Figure: (**A**) A simple example of a reducible mechanism is one for which disjoint subsets of latents from the same slot render pixel groups S1 and S2 separately, such that they form independent sub-mechanisms. This independence between sub-mechanisms is indicated by the difference in colors. (**B**) Not all reducible mechanisms look as simple as panel A: here, S1 and S2 depend on every latent component in the slot, but the information in S1 ∪ S2 still decomposes across S1 and S2, as sub-mechanisms 1 and 2 are independent. (**C**) In contrast, for an irreducible mechanism, the information does not decompose across any pixel partition S, S′, so it is impossible to separate it into independent sub-mechanisms.*
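The rank-based notion can be probed numerically. The sketch below is our paraphrase of the idea rather than the paper's precise definition: a slot's mechanism (its Jacobian block, pixels × slot latents) counts as reducible if some split of its pixels into two groups yields sub-mechanisms whose ranks add up to the rank of the whole:

```python
import numpy as np
from itertools import combinations

def is_reducible(J, tol=1e-10):
    # J: Jacobian block of one slot (n_pixels x slot_dim). Reducible if some
    # pixel split S1, S2 satisfies rank(J[S1]) + rank(J[S2]) == rank(J),
    # i.e. the two sub-mechanisms carry independent information.
    pixels = range(J.shape[0])
    r_full = np.linalg.matrix_rank(J, tol=tol)
    for size in range(1, J.shape[0] // 2 + 1):
        for S1 in combinations(pixels, size):
            S2 = [x for x in pixels if x not in S1]
            r1 = np.linalg.matrix_rank(J[list(S1)], tol=tol)
            r2 = np.linalg.matrix_rank(J[S2], tol=tol)
            if r1 + r2 == r_full:
                return True
    return False

# Reducible (like panel A): pixels 0-1 use only latent 0, pixels 2-3 only
# latent 1, so the slot splits into two independent sub-mechanisms.
J_red = np.array([[1., 0.], [2., 0.], [0., 1.], [0., 3.]])

# Irreducible (like panel C): every pixel mixes both latents, and no pixel
# split makes the ranks additive.
J_irr = np.array([[1., 1.], [1., 2.], [2., 1.], [1., 3.]])

print(is_reducible(J_red))  # True
print(is_reducible(J_irr))  # False
```

Enumerating all pixel partitions is exponential, which is why a tractable contrast function is used for compositionality; this brute-force check is only meant to make the definition tangible on tiny examples.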

### When can object-centric representations be learned in our framework?

Under this generative model, we can then prove our **main theoretical result**: inference models which are invertible (i.e., minimize the reconstruction loss) and have a compositional inverse (i.e., the decoder maps different slots to different pixels) identify the ground-truth latent slots, achieving slot identifiability (Theorem 1).

We validate empirically on controlled synthetic data that inference models which maximize invertibility and compositionality indeed identify the ground-truth latent slots (Section 5.1).

Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility and their slot identifiability (Section 5.2).

*Figure: Results for existing object-centric models on our synthetic data (**A**) and on synthetic images (**B**). We evaluate the models' compositionality and invertibility, measured by our proposed compositional contrast and the reconstruction error, and find that these properties are predictive of the models' identifiability.*
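Slot identifiability (recovering the ground-truth slots up to permutation and slot-wise invertible maps) can be scored empirically. The sketch below is a hypothetical, simplified linear protocol, not the paper's exact evaluation: it matches learned slots to ground-truth slots over all permutations using a slot-wise R² of the best linear map:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
n, slot_dim, n_slots = 500, 2, 3

def r2(z_true, z_hat):
    # R^2 of the best linear map from a learned slot to a ground-truth slot.
    coef, *_ = np.linalg.lstsq(z_hat, z_true, rcond=None)
    resid = z_true - z_hat @ coef
    return 1 - resid.var() / z_true.var()

def slot_identifiability(z_true, z_hat):
    # Best average slot-wise R^2 over all slot permutations.
    gt = [z_true[:, k * slot_dim:(k + 1) * slot_dim] for k in range(n_slots)]
    inf = [z_hat[:, k * slot_dim:(k + 1) * slot_dim] for k in range(n_slots)]
    return max(np.mean([r2(gt[k], inf[p[k]]) for k in range(n_slots)])
               for p in permutations(range(n_slots)))

z = rng.normal(size=(n, n_slots * slot_dim))

# "Good" inference: slots recovered up to a permutation and invertible
# slot-wise linear maps -- exactly the identifiability class of Theorem 1.
perm = [2, 0, 1]
z_hat = np.concatenate(
    [z[:, k * slot_dim:(k + 1) * slot_dim] @ rng.normal(size=(slot_dim, slot_dim))
     for k in perm], axis=1)
print(round(slot_identifiability(z, z_hat), 3))  # 1.0

# "Entangled" inference: every learned slot mixes all ground-truth latents.
z_bad = z @ rng.normal(size=(6, 6))
print(slot_identifiability(z, z_bad) < 0.99)  # True
```

The first score is perfect because slot-wise invertible transformations and slot permutations are exactly the ambiguities that slot identifiability allows; dense mixing across slots cannot be undone slot-wise and scores markedly lower.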

### Acknowledgements & Funding

We thank: the anonymous reviewers for helpful suggestions which led to improvements in the manuscript, Andrea Dittadi for helpful discussions regarding experiments, Attila Juhos, Amin Charusaie, Michel Besserve, and Simon Buchholz for helpful technical discussions, and Zac Cranko for theoretical efforts in the early stages of the project. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting JB, RSZ and YS. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, 01IS18039B. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645.

### BibTeX

If you find our work helpful, please cite our paper:

```bibtex
@inproceedings{brady2023provably,
  author    = {Brady, Jack and
               Zimmermann, Roland S. and
               Sharma, Yash and
               Sch{\"o}lkopf, Bernhard and
               von K{\"u}gelgen, Julius and
               Brendel, Wieland},
  title     = {Provably Learning Object-Centric Representations},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  year      = {2023},
  articleno = {126},
  numpages  = {25},
  location  = {Honolulu, Hawaii, USA},
  series    = {ICML'23}
}
```