InfoNCE: Identifying the Gap Between Theory and Practice

Evgenia Rusak*
University of Tübingen
MPI-IS
Patrik Reizinger*
MPI-IS
Attila Juhos*
MPI-IS
Oliver Bringmann
University of Tübingen
Roland S. Zimmermann°
MPI-IS
Wieland Brendel°
MPI-IS

tl;dr: We generalize previous identifiability results for contrastive learning toward anisotropic latents that better capture the effect of augmentations used in practical applications, thereby reducing the gap between theory and practice.

News

Jan '25 The paper was accepted for AISTATS 2025.
Feb '24 The pre-print is now available.

Abstract

Illustration of the mismatch between the standard CL model and practice. A: CL with the commonly used InfoNCE objective is identifiable when all latents change to the same extent across the positive pair, which is unlikely to happen in practice. B: The more likely scenario when augmentations affect different latents to a different extent leads to dimensional collapse and information loss. C: Our proposed objective, AnInfoNCE, accommodates that features can vary to a different degree in the positive pair, avoiding collapse.

Motivation

In Contrastive Learning (CL), a model receives two views of a training sample as input and must learn an embedding in which views of the same sample lie close together while views of different samples lie far apart. The views are generated using augmentations such as random cropping and color jittering. These augmentations can be viewed as transformations in an implicit latent space: different augmentations change different latent dimensions to a different extent.

The connection to identifiability theory was made by Zimmermann et al. 2021, who showed that CL can provably recover the ground-truth latents if the positive conditional distribution is isotropic, i.e., if the augmentations vary all latent dimensions to the same extent. However, this assumption is not realistic in practice: some latents, such as color, are varied strongly, while others, such as the shape or class of an object, are affected only minimally. Von Kügelgen et al. relaxed the isotropy assumption on the positive conditional distribution to two partitions: content (a δ-distribution) and style (a non-degenerate distribution). They showed that strongly varying latents (style) are lost while invariant latents (content) are learned. In this paper, we first allow the positive conditional distribution to vary each latent on a continuous spectrum, and then introduce AnInfoNCE, a new variant of InfoNCE that accommodates anisotropically changing latents and for which we prove identifiability.

Main Takeaways

  • We introduce AnInfoNCE, a generalized contrastive loss assuming anisotropic positive pair distributions, and present an identifiability proof.
  • Experimentally, we verify our loss in synthetic and well-controlled image experiments, having full knowledge of the ground-truth generative process.
  • We further demonstrate the efficacy of AnInfoNCE on CIFAR and ImageNet in recovering the latent factors but observe a trade-off with linear readout accuracy.

Theory

We designed a new loss function, coined AnInfoNCE, which, in contrast to InfoNCE, can model augmentations that induce anisotropic variances in the latent factors: $$ \mathcal{L}_{\text{AnInfoNCE}}({\boldsymbol f},{\bf\hat{\Lambda}}) = \underset{\substack{{\boldsymbol x}, {\boldsymbol x}^{\!+}\\ \{{\boldsymbol x}^{\!-}_i\}}}{\mathbb{E}} \left[ -\ln \frac{e^{-\left\Vert {\boldsymbol f}({\boldsymbol x}^{\!+}) - {\boldsymbol f}({\boldsymbol x})\right\Vert^2_{\bf\hat{\Lambda}}}}{e^{-\left\Vert {\boldsymbol f}({\boldsymbol x}^{\!+}) - {\boldsymbol f}({\boldsymbol x})\right\Vert^2_{\bf\hat{\Lambda}}} + \sum_{i=1}^M e^{-\left\Vert {\boldsymbol f}({\boldsymbol x}^{\!-}_i) - {\boldsymbol f}({\boldsymbol x})\right\Vert^2_{\bf\hat{\Lambda}}}} \right] $$ where the similarity function $$ -\lVert {\boldsymbol f}({\boldsymbol x}^{\!+}) - {\boldsymbol f}({\boldsymbol x}) \rVert^2_{\bf\hat{\Lambda}} = - \Big( {\boldsymbol f}({\boldsymbol x}^{\!+}) - {\boldsymbol f}({\boldsymbol x}) \Big)^{\!\!\top}{\bf\hat{\Lambda}}\Big( {\boldsymbol f}({\boldsymbol x}^{\!+}) - {\boldsymbol f}({\boldsymbol x}) \Big) $$ is now equipped with a trainable diagonal scaling matrix, \(\hat{\Lambda}\), the concentration parameter.
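To make the objective concrete, here is a minimal NumPy sketch of the loss for a single anchor. The function name, the array shapes, and the vector `lam` (the diagonal of \(\hat{\Lambda}\)) are illustrative choices, not the paper's implementation; in practice the encoder and \(\hat{\Lambda}\) would be trained jointly with autodiff.

```python
import numpy as np

def aninfonce_loss(f_x, f_pos, f_negs, lam):
    """AnInfoNCE loss for one anchor (illustrative NumPy sketch).

    f_x, f_pos : encoder outputs for the anchor and its positive view, shape (d,).
    f_negs     : encoder outputs for M negatives, shape (M, d).
    lam        : diagonal of the trainable concentration matrix, shape (d,).
    """
    def sim(a, b):
        diff = a - b
        # -(a - b)^T diag(lam) (a - b): the Lambda-weighted squared distance
        return -np.sum(lam * diff * diff)

    pos = sim(f_pos, f_x)
    negs = np.array([sim(fn, f_x) for fn in f_negs])
    logits = np.concatenate([[pos], negs])
    # cross-entropy with the positive in slot 0, via a stable log-sum-exp
    m = logits.max()
    return -pos + m + np.log(np.sum(np.exp(logits - m)))
```

Note that recovering ordinary InfoNCE amounts to fixing `lam` to a constant isotropic value instead of learning it per dimension.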

To understand the theoretical justification of this method, assume a data-generating process in which the observed data \({\boldsymbol x}, {\boldsymbol x}^{\!+},\{{\boldsymbol x}^{\!-}_i\}\) are generated by passing the normalized latent vectors \({\boldsymbol z}, {\boldsymbol z}^{\!+},\{{\boldsymbol z}^{\!-}_i\}\) through an invertible generator \(\boldsymbol{g}\). The anchor and the negatives \({\boldsymbol z}, \{{\boldsymbol z}^{\!-}_i\}\) are sampled uniformly from the hypersphere, whereas the positive \({\boldsymbol z}^{\!+}\) follows an anisotropic conditional distribution: $$ p({\boldsymbol z}) = \mathrm{const},\quad p({\boldsymbol z}^{\!+}|{\boldsymbol z}) \propto e^{-({\boldsymbol z}^{\!+}-{\boldsymbol z})^\top {\bf{\Lambda}}({\boldsymbol z}^{\!+} -{\boldsymbol z})} $$ The diagonal scaling matrix \(\bf\Lambda\) describes the anisotropic effect of augmentations on the latent features. Our main theoretical result shows that if a pair \(({\boldsymbol f},{\bf\hat{\Lambda}})\) minimizes \(\mathcal{L}_{\text{AnInfoNCE}}\), then the latent factors \({\boldsymbol z}\) are identified up to orthogonal transformations, i.e., we can recover the ground-truth latents up to simple transformations.
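A toy sampler for this data-generating process might look as follows. Projecting an anisotropic Gaussian step back onto the sphere is our simplification of the spherical conditional above, and the function names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_anchor(d):
    """Uniform sample on the unit hypersphere S^{d-1}."""
    z = rng.standard_normal(d)
    return z / np.linalg.norm(z)

def sample_positive(z, lam):
    """Approximate draw from p(z+|z) ∝ exp(-(z+ - z)^T Λ (z+ - z)):
    an anisotropic Gaussian step around z, projected back to the sphere.
    Dimensions with large lam (high concentration) barely move;
    dimensions with small lam vary strongly."""
    step = rng.standard_normal(z.size) / np.sqrt(2.0 * lam)
    z_pos = z + step
    return z_pos / np.linalg.norm(z_pos)
```

The observations would then be obtained by pushing these latents through a generator \(\boldsymbol{g}\), e.g. a randomly initialized invertible network.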

A Model Trained with AnInfoNCE Can Learn Content and Style Latents

Following previously introduced experimental settings, we consider a fully controlled data-generating process. Since we have control over the ground-truth latents, we can directly model a setting where augmentations affect different latents to a different extent. We achieve this by introducing an anisotropy in the positive conditional through the ground-truth concentration parameter: we assign half of the latent dimensions a low concentration and the other half a high concentration. A higher concentration parameter corresponds to a narrower positive conditional distribution and more content-like behavior, while dimensions with a lower concentration parameter correspond to style-like latents. Training a model on the generated data with AnInfoNCE, we observe that both content and style latents can be identified. In contrast, style information is lost when training with the regular InfoNCE loss.
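The half-and-half setup can be written down in a few lines; the dimensionality and the two concentration values below are illustrative, not the exact values from the paper.

```python
import numpy as np

d = 10
# Half of the dimensions get a high concentration (content-like: barely
# changed by augmentation), the other half a low one (style-like: varied
# strongly). The values 50.0 and 0.5 are illustrative choices.
lam_true = np.concatenate([np.full(d // 2, 50.0), np.full(d // 2, 0.5)])

# The per-dimension noise scale of the positive conditional is ~ 1/sqrt(2*lam),
# so high-concentration dimensions receive much smaller perturbations.
noise_scale = 1.0 / np.sqrt(2.0 * lam_true)
```

A successful training run with AnInfoNCE should then recover all `d` dimensions, while InfoNCE is expected to collapse the low-concentration (style) half.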

Training a model with AnInfoNCE recovers the ground truth content and style latents while style information is lost when training with regular InfoNCE.

Training with AnInfoNCE Improves Augmentation Readout - But There Is a Trade-Off with Classification Accuracy

We do not have access to the true data-generating process in real-world datasets, so we have to use a proxy task. Sampling from the true conditional distribution is modeled by using augmentations to generate the different views fed to the model; we can therefore assess how well the ground-truth latents are recovered by measuring how much information about the applied augmentations the model retains. Concretely, we calculate the linear readout accuracy on the augmentations used during training.
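As a sketch of what such a linear probe looks like, the following fits a ridge-regularized least-squares classifier on frozen features and reports test accuracy. The helper name and the one-hot regression formulation are our illustrative choices, not the exact probe used in the paper.

```python
import numpy as np

def linear_readout_accuracy(feats, labels, feats_test, labels_test):
    """Fit a linear classifier on frozen features (ridge-regularized least
    squares on one-hot targets) and report test accuracy: a minimal stand-in
    for the linear probes used to read out class / augmentation information."""
    n_classes = labels.max() + 1
    Y = np.eye(n_classes)[labels]                     # one-hot targets
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    # regularized normal equations: (X^T X + eps I) W = X^T Y
    W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)
    Xt = np.hstack([feats_test, np.ones((len(feats_test), 1))])
    preds = (Xt @ W).argmax(axis=1)
    return (preds == labels_test).mean()
```

For the augmentation readout, `labels` would encode which augmentation (or augmentation strength) produced each view, rather than the image class.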

Our experiments on CIFAR10 and ImageNet show that our loss with a trainable concentration parameter leads to much higher readout accuracy on the augmentations used during training, i.e., we successfully recover more latent dimensions than the regular InfoNCE loss. Surprisingly, this reduction in information loss does not translate into better downstream classification accuracy.

$$ \begin{array} {lcc}\hline \text{Training Objective} & \text{Class readout (\%)} & \text{Augmentation readout (\%)} \\ \hline \textbf{CIFAR10} \\ \text{InfoNCE} & \textbf{90.9} & 57.7 \\ \text{AnInfoNCE} & 88.4 & \textbf{80.5} \\ \hline \textbf{ImageNet} \\ \text{InfoNCE} & \textbf{68.2} & 74.4 \\ \text{AnInfoNCE} & 59.0 & \textbf{79.3} \\ \hline \end{array} $$

Training with AnInfoNCE successfully recovers more latent dimensions than regular InfoNCE, but we observe a trade-off with downstream linear accuracy.

Where does the trade-off between better recovery of latent factors and worse downstream accuracy on real-world data come from? Read our paper, where we extensively analyze the remaining mismatches between Contrastive Learning theory and practice and explore practical mitigation strategies!

Acknowledgements & Funding

BibTeX

If you find our work helpful, please cite our paper:

@article{rusak2024contrastive,
  author = {
    Rusak, Evgenia and
    Reizinger, Patrik and
    Juhos, Attila and
    Bringmann, Oliver and
    Zimmermann, Roland S. and
    Brendel, Wieland
  },
  title = {
    InfoNCE: Identifying the
    Gap Between Theory and Practice
  },
  year = {2024},
}