Provably Learning Object-Centric Representations

Jack Brady*
University of Tübingen & MPI-IS
Roland S. Zimmermann*
University of Tübingen & MPI-IS
Yash Sharma
University of Tübingen & MPI-IS
Bernhard Schölkopf
Julius von Kügelgen⁺
University of Cambridge & MPI-IS
Wieland Brendel⁺

tl;dr: We analyze when object-centric representations can be learned without supervision and introduce two assumptions, compositionality and irreducibility, under which we prove that ground-truth object representations can be identified.


May '23 Our paper was accepted for an Oral Presentation at ICML 2023!
May '23 The pre-print is now available on arXiv.


When can unsupervised object-centric representations provably be learned?

Overview of our theoretical framework. We introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility, and investigate how they relate to identifiability.

We assume that observed scenes comprising multiple objects are rendered by an unknown generator f from multiple ground-truth latent slots. We assume that this generative model has two key properties, which we call compositionality and irreducibility. Under this model, we prove that an invertible inference model with a compositional inverse yields latent slots that identify the ground-truth slots up to permutation and slot-wise invertible functions. We call this slot identifiability. To measure violations of compositionality in practice, we introduce a contrast function (compositional contrast) which is zero if and only if a function is compositional, while to measure invertibility, we rely on the reconstruction loss in an auto-encoder framework.
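To make the compositional contrast concrete, here is a minimal NumPy sketch of the idea. The function name, slot layout, and exact form are our own illustrative assumptions, not code from the paper: for each pixel we compute the gradient norm with respect to each slot and sum the pairwise products of these norms, which vanishes exactly when every pixel depends on at most one slot.

```python
import numpy as np

def compositional_contrast(jacobian, slot_dims):
    """Hedged sketch of a compositional contrast.

    jacobian: (n_pixels, n_latents) Jacobian of the decoder at one latent z.
    slot_dims: per-slot latent dimensionalities, summing to n_latents.

    For each pixel, take its gradient norm w.r.t. each slot and sum the
    pairwise products of these norms. The result is zero iff every pixel
    has a non-zero gradient for at most one slot (compositionality).
    """
    # Split the Jacobian columns into one block per slot.
    blocks = np.split(jacobian, np.cumsum(slot_dims)[:-1], axis=1)
    # Per-pixel gradient norm w.r.t. each slot: shape (n_pixels, n_slots).
    norms = np.stack([np.linalg.norm(b, axis=1) for b in blocks], axis=1)
    n_slots = norms.shape[1]
    contrast = 0.0
    for k in range(n_slots):
        for l in range(k + 1, n_slots):
            contrast += np.sum(norms[:, k] * norms[:, l])
    return contrast

# A compositional Jacobian: disjoint pixel blocks, one per slot.
J_comp = np.zeros((4, 4))
J_comp[:2, :2] = 1.0   # pixels 0-1 depend only on slot 1
J_comp[2:, 2:] = 1.0   # pixels 2-3 depend only on slot 2
assert compositional_contrast(J_comp, [2, 2]) == 0.0

# A non-compositional Jacobian: pixel 0 depends on both slots.
J_mix = J_comp.copy()
J_mix[0, 2] = 1.0
assert compositional_contrast(J_mix, [2, 2]) > 0.0
```

In practice this quantity would be evaluated on the decoder's Jacobian at sampled latents; here we pass toy Jacobians directly to keep the sketch self-contained.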

How do we formalize object-centric data in our theoretical framework?

We define a generative model for multi-object scenes where objects are described by latent slots. We would like to enforce that each latent slot is responsible for encoding a distinct object in a scene. To this end, we make two core assumptions on the generator function from latent slots to images.

The first assumption we make, compositionality, captures that each object is generated by exactly one latent slot. We formalize this through an assumption on the Jacobian matrix of the generator function, which states that each pixel has a non-zero partial derivative with respect to at most one latent slot. This imposes a local sparsity structure on the Jacobian, visualized below (Section 2.1).

Difference between a compositional and a non-compositional generator. (A) For a compositional generator f, every pixel is affected by at most one latent slot. As a result, there always exists an ordering of the pixels such that the generator's Jacobian Jf consists of disjoint blocks, one for each latent slot (bottom). Note that both the pixel ordering and the specific structure of the Jacobian are not fixed across scenes and might depend on the latent input z. (B) For a non-compositional generator, there exists no pixel ordering that exposes such a structure in the Jacobian, since the same pixel can be affected by more than one latent slot.
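The block structure in panel (A) can be recovered mechanically: if the generator is compositional, grouping pixels by the unique slot that affects them yields the block-diagonal ordering. The sketch below (function name and interface are our own, and it inspects a single Jacobian evaluation only) illustrates this.

```python
import numpy as np

def pixel_ordering(jacobian, slot_dims, tol=1e-12):
    """For a compositional generator, sort pixels by the (unique) slot
    whose latents affect them, exposing the block structure from the
    figure. Returns the permutation, or None if some pixel is affected
    by more than one slot (i.e., the generator is non-compositional)."""
    blocks = np.split(jacobian, np.cumsum(slot_dims)[:-1], axis=1)
    # active[n, k] = True iff pixel n has a non-zero gradient w.r.t. slot k
    active = np.stack([np.any(np.abs(b) > tol, axis=1) for b in blocks],
                      axis=1)
    if np.any(active.sum(axis=1) > 1):
        return None          # some pixel mixes slots: no such ordering
    slot_of_pixel = active.argmax(axis=1)
    return np.argsort(slot_of_pixel, kind="stable")

# Pixels 0 and 2 are rendered by slot 1, pixels 1 and 3 by slot 2.
J = np.zeros((4, 4))
J[[0, 2], :2] = 1.0
J[[1, 3], 2:] = 1.0
order = pixel_ordering(J, [2, 2])
# Reordering the rows as [0, 2, 1, 3] makes J block-diagonal.
assert list(order) == [0, 2, 1, 3]
```

As the caption notes, this ordering is local: it may differ between scenes, since the sparsity pattern can depend on the latent input z.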

The second assumption we make, irreducibility, captures that each latent slot generates at most one object rather than multiple objects. To enforce this, we assume that the mechanisms which generate parts of the same object are dependent. We formalize a mechanism through the slot-wise blocks of the generator's Jacobian, and capture dependence in a non-statistical sense via the rank of these mechanisms (Section 2.2).

(A) A simple example of a reducible mechanism is one for which disjoint subsets of latents from the same slot render pixel groups S1 and S2 separately such that they form independent sub-mechanisms. This independence between sub-mechanisms is indicated by the difference in colors. (B) Not all reducible mechanisms look as simple as panel A: here, S1 and S2 depend on every latent component in the slot, but the information in S1 ∪ S2 still decomposes across S1 and S2 as sub-mechanisms 1 and 2 are independent. (C) In contrast, for an irreducible mechanism, the information does not decompose across any pixel partition S, S′, and so it is impossible to separate it into independent sub-mechanisms.
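One way to make the rank-based notion of (in)dependence above tangible is a brute-force check over pixel partitions. In this hedged sketch (the function name and the exact rank condition are our reading of the idea, not the paper's code), a slot's mechanism is reducible if some non-trivial pixel partition splits it into sub-mechanisms whose ranks add up to the total rank:

```python
import numpy as np
from itertools import combinations

def is_irreducible(J):
    """Brute-force sketch of irreducibility: J has one row per pixel the
    slot renders and one column per latent in the slot. The mechanism is
    reducible if some non-trivial pixel partition S, S' satisfies
    rank(J_S) + rank(J_S') == rank(J), i.e. the two sub-mechanisms are
    independent in the rank sense; it is irreducible otherwise."""
    n = J.shape[0]
    total = np.linalg.matrix_rank(J)
    rows = range(n)
    # Checking subsets up to size n // 2 covers every partition once.
    for r in range(1, n // 2 + 1):
        for S in combinations(rows, r):
            Sc = [i for i in rows if i not in S]
            if (np.linalg.matrix_rank(J[list(S)]) +
                    np.linalg.matrix_rank(J[Sc]) == total):
                return False
    return True

# Reducible (panel A-style): each pixel group uses disjoint latents.
J_red = np.array([[1., 0.],
                  [0., 1.]])
assert not is_irreducible(J_red)

# Irreducible: no pixel partition decomposes the rank additively.
J_irr = np.array([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
assert is_irreducible(J_irr)
```

The exhaustive search over partitions is exponential in the number of pixels, so this is purely illustrative; the theory only needs the condition to hold, not an efficient test for it.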

When can object-centric representations be learned in our framework?

Under this generative model, we can then prove our main theoretical result: inference models which are invertible (i.e., minimize reconstruction loss) and have a compositional inverse (i.e., the decoder maps different slots to different pixels) identify the ground-truth latent slots (slot identifiability) (Theorem 1).

We validate empirically on controlled synthetic data that inference models which maximize invertibility and compositionality indeed identify the ground-truth latent slots (Section 5.1).
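A training signal combining the two quantities above could look as follows. This is a hedged sketch, not the paper's exact objective: we use a toy linear decoder (whose Jacobian is simply its weight matrix) so that both the reconstruction error and the compositionality penalty can be evaluated in closed form, and the weighting `lam` is an arbitrary illustrative hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear decoder x = W z with two 2-dimensional slots.
W = np.zeros((4, 4))
W[:2, :2] = rng.normal(size=(2, 2))  # slot 1 renders pixels 0-1
W[2:, 2:] = rng.normal(size=(2, 2))  # slot 2 renders pixels 2-3

def loss(W, z, x, lam=1.0):
    """Reconstruction error plus a compositionality penalty: the sum of
    pairwise products of per-pixel, per-slot gradient norms. For a
    linear decoder, the Jacobian is just W, so both terms are exact."""
    recon = np.sum((W @ z - x) ** 2)
    n1 = np.linalg.norm(W[:, :2], axis=1)   # grad norms w.r.t. slot 1
    n2 = np.linalg.norm(W[:, 2:], axis=1)   # grad norms w.r.t. slot 2
    contrast = np.sum(n1 * n2)
    return recon + lam * contrast

z = rng.normal(size=4)
x = W @ z
# A block-diagonal (compositional) decoder that reconstructs x perfectly
# attains zero loss; mixing slots incurs a positive contrast penalty.
assert np.isclose(loss(W, z, x), 0.0)
W_mix = W.copy()
W_mix[0, 2] = 1.0
assert loss(W_mix, z, x) > 0.0
```

The point of the sketch is only that the two terms play the roles described above: the reconstruction term encourages invertibility, while the contrast term is minimized exactly by compositional decoders.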

Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility on the one hand and their slot identifiability on the other (Section 5.2).

We train object-centric models on controlled synthetic data (A) and on synthetic images (B). We then evaluate the models' compositionality and invertibility, as measured by our proposed compositional contrast and the reconstruction error, respectively. We find that these properties are predictive of the models' identifiability.

Acknowledgements & Funding


If you find our work helpful, please cite our paper:

@inproceedings{brady2023provably,
  author    = {Brady, Jack and
               Zimmermann, Roland S. and
               Sharma, Yash and
               Sch{\"o}lkopf, Bernhard and
               von K{\"u}gelgen, Julius and
               Brendel, Wieland},
  title     = {Provably Learning Object-Centric Representations},
  year      = {2023},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  articleno = {126},
  numpages  = {25},
  location  = {Honolulu, Hawaii, USA},
  series    = {ICML'23}
}