Does CLIP's generalization performance mainly stem from high train-test similarity?

Prasann Mayilvahanan*
University of Tübingen, MPI-IS, Tübingen AI Center

Thaddäus Wiedemer*
MPI-IS, University of Tübingen, Tübingen AI Center

Evgenia Rusak
University of Tübingen, MPI-IS, TÜbingen AI Center

Matthias Bethge
University of Tübingen, Tübingen AI Center

Wieland Brendel
MPI-IS, ELLIS Institute Tübingen, Tübingen AI Center

Code

tl;dr: CLIP's ability to generalize to standard OOD benchmarks does not mainly stem from highly similar images in its training dataset.

News

Feb '24	Our paper was accepted at ICLR 2024!
Feb '24	An earlier version of our paper was accepted at the NeurIPS 2023 DistShift workshop!
Oct '23	The pre-print is now available on arXiv.

Abstract

Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful CLIP's high zero-shot performance is as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet's train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP's overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP's performance, and other properties of the training data must drive CLIP to learn good representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION (¼ of its original size) on which CLIP can be trained to match its original performance

Acknowledgements & Funding

We would like to thank (in alphabetical order): Thomas Klein, George Pachitariu, Matthias Tangemann, Vishaal Udandarao, Max Wolff, and Roland Zimmermann for helpful discussions, feedback, and support with setting up the experiments.
This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philantropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG.
We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting PM, TW, and ER.

BibTeX

If you find our study helpful, please cite our paper:

@inproceedings{mayilvahanan2024does,
  title={Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?},
  author={
    Prasanna Mayilvahanan and Thadd{\"a}us Wiedemer and Evgenia Rusak and Matthias Bethge and Wieland Brendel
  },
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=tnBaiidobu}
}