Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Roland S. Zimmermann*
Thomas Klein*
University of Tübingen, MPI-IS
Wieland Brendel

tl;dr: We compare the mechanistic interpretability of vision models that differ in scale, architecture, training paradigm, and dataset size, and find that none of these design choices has a significant effect on the interpretability of individual units. We release a dataset of unit-wise interpretability scores that enables research on automated alignment.


Oct '23 Our paper was accepted as a Spotlight at NeurIPS 2023!
Oct '23 A shorter version of the paper was accepted at the NeurIPS 2023 workshop XAI in Action: Past, Present, and Future Applications.
Jul '23 The pre-print is now available on arXiv.
Jul '23 The IMI dataset is released on Zenodo.



Scaling up models, in both model and dataset size, has fueled much of the recent progress in performance and robustness. We now ask whether this development has also, incidentally, led to more interpretable models. To answer this question, we perform a large-scale psychophysics experiment investigating the interpretability of multiple models through the two most widely used mechanistic interpretability methods. We find no improvement, but rather the opposite: GoogLeNet, from ten years ago, is still more interpretable than eight state-of-the-art vision models!

Concretely, our experiment compares nine networks under both interpretability methods and shows that scaling has not led to increased interpretability. We also release a large dataset, ImageNet Mechanistic Interpretability, that contains human interpretability scores. We expect this dataset to enable automated measures for quantifying the interpretability of models and, thus, to bootstrap the development of more interpretable models.

Has scaling models in terms of their dataset and model size improved interpretability?

We leverage the experimental paradigm proposed in our earlier work to measure the unit-wise interpretability afforded by an explanation. In a large-scale psychophysical experiment, we compare models that differ in architecture, training objective, and training data. While these models reflect the advancements in model design in recent years (sorted by model size first and then dataset size), we surprisingly see little to no effect of these design choices on mechanistic, per-unit interpretability. These results might appear promising, since all models yield scores of about 80% (natural). However, we demonstrate that interpretability is far more limited than it first appears and breaks down dramatically as the task is made harder (see Section 4.4 of the paper).
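As a rough illustration of how such a paradigm yields per-unit scores, the sketch below aggregates two-alternative forced-choice responses into a fraction-correct score per unit. This is a hypothetical reconstruction, not the authors' code; the unit identifiers and the trial format are made-up placeholders.

```python
# Hedged sketch (not the authors' code): per-unit interpretability as the
# fraction of correct responses in a two-alternative forced-choice task,
# where participants pick which of two query images activates a unit more.
from collections import defaultdict

def unit_scores(trials):
    """trials: iterable of (unit_id, participant_choice, correct_choice)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for unit_id, choice, truth in trials:
        total[unit_id] += 1
        correct[unit_id] += int(choice == truth)
    # Score for each unit is the fraction of trials answered correctly.
    return {u: correct[u] / total[u] for u in total}

# Hypothetical example trials (unit names are placeholders):
scores = unit_scores([
    ("googlenet/mixed4a:7", "A", "A"),
    ("googlenet/mixed4a:7", "B", "A"),
    ("vit_l/blocks.17:42", "B", "B"),
])
```

Averaging such scores over a model's units gives the model-level interpretability values reported in the figures.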

We compare the mechanistic interpretability of nine vision models for two interpretability methods: maximally activating dataset samples (natural, orange) and feature visualizations (synthetic, blue).

Are better ImageNet classifiers more interpretable?

While the investigated models vary strongly in classification performance, as measured by ImageNet validation accuracy, their interpretability varies much less, for both natural exemplars and synthetic feature visualizations. More accurate classifiers are not necessarily more interpretable; for synthetic feature visualizations, interpretability may even regress with increasing accuracy.
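One simple way to probe such a relationship is a Pearson correlation between per-model accuracy and mean interpretability. The sketch below uses made-up placeholder numbers, not results from the paper, purely to show the computation.

```python
# Hedged sketch: does accuracy predict interpretability? We compute a
# Pearson correlation coefficient; the data points are hypothetical.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

accuracy = [0.70, 0.76, 0.82, 0.88]    # hypothetical top-1 accuracies
interp = [0.81, 0.79, 0.80, 0.78]      # hypothetical mean unit scores
r = pearson(accuracy, interp)          # near zero or negative here
```

A flat or negative correlation across models would match the qualitative finding above.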

Higher classification performance does not come with higher interpretability, for neither natural exemplars (orange) nor synthetic feature visualizations (blue).

ImageNet Mechanistic Interpretability: A new dataset for automated alignment

The results above paint a rather disappointing picture of the state of mechanistic interpretability of computer vision models: Just by scaling up models and datasets, we do not get increased interpretability for free, suggesting that if we want this property, we need to explicitly optimize for it.

To enable research on such automated evaluations, we release our experimental results as a new dataset called ImageNet Mechanistic Interpretability (IMI). This is the first dataset containing unit-wise interpretability measurements obtained through psychophysical experiments for multiple explanation methods and models. We hope it will enable the development of automated interpretability measures that can be used to directly optimize models for mechanistic interpretability. The dataset itself should be seen as a collection of labels and meta-information without a fixed set of features predictive of a unit's interpretability; indeed, finding and constructing features that predict the recorded labels is one of the open challenges posed by this line of research.

Acknowledgements & Funding


If you find our study or our dataset helpful, please cite our paper:

  @inproceedings{zimmermann2023scale,
    author = {
      Zimmermann, Roland S. and
      Klein, Thomas and
      Brendel, Wieland
    },
    title = {
      Scale Alone Does not Improve Mechanistic
      Interpretability in Vision Models
    },
    booktitle = {
      Thirty-seventh Conference on Neural
      Information Processing Systems
    },
    year = {2023},
  }