Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Roland S. Zimmermann*
Thomas Klein*
University of Tübingen, MPI-IS
Wieland Brendel

tl;dr: We compare the mechanistic interpretability of vision models that differ in scale, architecture, training paradigm, and dataset size, and find that none of these design choices has a significant effect on the interpretability of individual units. We release a dataset of unit-wise interpretability scores that enables research on automated alignment.


Oct '23 Our paper was accepted as a Spotlight at NeurIPS 2023!
Oct '23 A shorter version of the paper was accepted at the NeurIPS 2023 workshop XAI in Action: Past, Present, and Future Applications.
Jul '23 The pre-print is now available on arXiv.
Jul '23 The IMI dataset is released on Zenodo.



Scaling up models, in both model and dataset size, has fueled much of the recent progress in performance and robustness. We now ask whether this development has also, incidentally, led to more interpretable models. To answer this question, we perform a large-scale psychophysics experiment investigating the interpretability of multiple models through the two most widely used mechanistic interpretability methods. We find no improvement, but rather the opposite: GoogLeNet, from ten years ago, is still more interpretable than eight state-of-the-art vision models!

Concretely, our experiment compares nine networks under both interpretability methods and shows that scaling has not led to increased interpretability. We also release a large dataset, ImageNet Mechanistic Interpretability, that contains human interpretability scores. We expect this dataset to enable automated measures for quantifying the interpretability of models and, thus, to bootstrap the development of more interpretable models.

Has scaling models in terms of their dataset and model size improved interpretability?

We leverage the experimental paradigm proposed in our earlier work to measure the unit-wise interpretability afforded by an explanation. In a large-scale psychophysical experiment, we compare models that differ in architecture, training objective, and training data. While these models reflect the advancements in model design in recent years (sorted by model size first and then dataset size), we surprisingly see little to no effect of these design choices on mechanistic, per-unit interpretability. These results might appear promising, since all models yield scores of about 80% (natural). However, we demonstrate that interpretability is far more limited than it first appears and breaks down dramatically as the task is made harder (see Section 4.4 of the paper).
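As a rough illustration of how such a paradigm yields per-unit scores, the sketch below aggregates two-alternative forced-choice responses into a fraction-correct score per unit. This is a hypothetical reconstruction, not the authors' code; the unit identifiers and the trial format are made-up placeholders.

```python
# Hedged sketch (not the authors' code): per-unit interpretability as the
# fraction of correct responses in a two-alternative forced-choice task,
# where participants pick which of two query images activates a unit more.
from collections import defaultdict

def unit_scores(trials):
    """trials: iterable of (unit_id, participant_choice, correct_choice)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for unit_id, choice, truth in trials:
        total[unit_id] += 1
        correct[unit_id] += int(choice == truth)
    # Score for each unit is the fraction of trials answered correctly.
    return {u: correct[u] / total[u] for u in total}

# Hypothetical example trials (unit names are placeholders):
scores = unit_scores([
    ("googlenet/mixed4a:7", "A", "A"),
    ("googlenet/mixed4a:7", "B", "A"),
    ("vit_l/blocks.17:42", "B", "B"),
])
```

Averaging such scores over a model's units gives the model-level interpretability values reported in the figures.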

We compare the mechanistic interpretability of nine vision models for two interpretability methods: maximally activating dataset samples (natural, orange) and feature visualizations (synthetic, blue).

Are better ImageNet classifiers more interpretable?

While the investigated models vary strongly in classification performance, as measured by ImageNet validation accuracy, their interpretability varies much less, for both natural exemplars and synthetic feature visualizations. More accurate classifiers are not necessarily more interpretable; for synthetic feature visualizations, interpretability may even regress with increasing accuracy.
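One simple way to probe such a relationship is a Pearson correlation between per-model accuracy and mean interpretability. The sketch below uses made-up placeholder numbers, not results from the paper, purely to show the computation.

```python
# Hedged sketch: does accuracy predict interpretability? We compute a
# Pearson correlation coefficient; the data points are hypothetical.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

accuracy = [0.70, 0.76, 0.82, 0.88]    # hypothetical top-1 accuracies
interp = [0.81, 0.79, 0.80, 0.78]      # hypothetical mean unit scores
r = pearson(accuracy, interp)          # near zero or negative here
```

A flat or negative correlation across models would match the qualitative finding above.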

Higher classification performance does not come with higher interpretability, for neither natural exemplars (orange) nor synthetic feature visualizations (blue).

ImageNet Mechanistic Interpretability: A new dataset for automated alignment

The results above paint a rather disappointing picture of the state of mechanistic interpretability of computer vision models: Just by scaling up models and datasets, we do not get increased interpretability for free, suggesting that if we want this property, we need to explicitly optimize for it.

To enable research on such automated evaluations, we release our experimental results as a new dataset called ImageNet Mechanistic Interpretability (IMI). This is the first dataset containing unit-wise interpretability measurements obtained through psychophysical experiments for multiple explanation methods and models. We hope it will enable the development of automated interpretability measures that can be used to directly optimize models for mechanistic interpretability. The dataset itself should be seen as a collection of labels and meta-information without a fixed set of features predictive of a unit's interpretability; indeed, finding and constructing features that predict the recorded labels is one of the open challenges posed by this line of research.

Acknowledgements & Funding


If you find our study or our dataset helpful, please cite our paper:

  @inproceedings{zimmermann2023scale,
    author = {
      Zimmermann, Roland S. and
      Klein, Thomas and
      Brendel, Wieland
    },
    title = {
      Scale Alone Does not Improve Mechanistic
      Interpretability in Vision Models
    },
    booktitle = {
      Thirty-seventh Conference on Neural
      Information Processing Systems
    },
    year = {2023},
  }