Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
tl;dr: We compare the mechanistic interpretability of vision models differing with respect to scale, architecture, training paradigm and dataset size and find that none of these design choices have any significant effect on the interpretability of individual units. We release a dataset of unit-wise interpretability scores that enables research on automated alignment.
News
Oct '23 | Our paper was accepted as a Spotlight at NeurIPS 2023!
Oct '23 | A shorter version of the paper was accepted at the NeurIPS 2023 workshop XAI in Action: Past, Present, and Future Applications.
Jul '23 | The pre-print is now available on arXiv. |
Jul '23 | The IMI dataset is released on Zenodo. |
Abstract
In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We here use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models and find no scaling effect for interpretability - neither for model nor dataset size. Specifically, none of the nine investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 130'000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset is meant to facilitate research on automated instead of human-based interpretability evaluations that can ultimately be leveraged to directly optimize the mechanistic interpretability of models.
Motivation
Scaling up both model and dataset size has fueled much of the recent progress in performance and robustness. We ask whether this development has also, incidentally, led to more interpretable models. To answer this question, we perform a large-scale psychophysics experiment that probes the interpretability of multiple models through the two most widely used mechanistic interpretability methods. We do not find any improvement, but rather the opposite: GoogLeNet, from almost a decade ago, is still more interpretable than eight state-of-the-art vision models!
In short: we perform a large-scale psychophysics experiment that probes the interpretability of nine networks through the two most widely used mechanistic interpretability methods and find that scaling has not led to increased interpretability. We release a large dataset, called ImageNet Mechanistic Interpretability (IMI), that contains human interpretability scores. We expect this dataset to enable automated measures for quantifying the interpretability of models and, thus, to bootstrap the development of more interpretable models.
Has scaling models in terms of their dataset and model size improved interpretability?
We leverage the experimental paradigm proposed in our earlier work to measure the unit-wise interpretability afforded by an explanation. In a large-scale psychophysical experiment, we compare models that differ in architecture, training objectives, and training data. Although these models reflect the advances in model design of recent years (sorted by model size first and then dataset size), we surprisingly see little to no effect of these design choices on mechanistic, per-unit interpretability. While these results might appear promising, since all models yield scores of about 80% for natural exemplars, interpretability is far more limited than it first appears and breaks down dramatically as the task is made harder (see Section 4.4 of the paper).
We compare the mechanistic interpretability of nine vision models for two interpretability methods: maximally activating dataset samples (natural, orange) and feature visualizations (synthetic, blue).
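For readers unfamiliar with these two explanation types, the sketch below illustrates how both are typically produced for a single unit (channel) of a convolutional network. This is a minimal illustration, not the exact procedure used in our experiment: the model, layer, unit index, and dataset path are placeholders, and the feature-visualization step is a bare-bones activation-maximization loop without the regularizers and image parameterizations used by dedicated libraries.

import torch
import torchvision.models as models
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Placeholder model, layer, and unit -- the paper evaluates nine different models.
model = models.googlenet(weights="IMAGENET1K_V1").eval()
layer = model.inception4c   # hypothetical layer choice
unit = 17                   # hypothetical channel index

activations = {}

def hook(module, inputs, output):
    # Record the spatially averaged activation of the chosen channel.
    activations["value"] = output[:, unit].mean(dim=(1, 2))

layer.register_forward_hook(hook)

# Method 1: maximally activating dataset samples ("natural" explanations).
transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
dataset = ImageFolder("/path/to/imagenet/val", transform=transform)  # placeholder path
loader = DataLoader(dataset, batch_size=64, num_workers=4)

scores = []
with torch.no_grad():
    for images, _ in loader:
        model(images)
        scores.append(activations["value"])
scores = torch.cat(scores)
top9_indices = scores.topk(9).indices  # the nine most strongly activating exemplars

# Method 2: synthetic feature visualization via activation maximization.
# (Input normalization and regularization omitted for brevity.)
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)
for _ in range(256):
    optimizer.zero_grad()
    model(image)
    loss = -activations["value"].mean()  # maximize the unit's activation
    loss.backward()
    optimizer.step()
feature_visualization = image.detach().clamp(0, 1)

In the experiment, participants are shown such explanations for a unit and their responses determine that unit's interpretability score.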
Are better ImageNet classifiers more interpretable?
While the investigated models have strongly varying classification performance, as measured by the ImageNet validation accuracy, their interpretability shows less variation for both natural exemplars and synthetic feature visualizations. More accurate classifiers are not necessarily more interpretable. For synthetic feature visualizations, there might even be a regression of interpretability with increasing accuracy.
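One concrete way to quantify this (lack of a) relationship is to correlate each model's ImageNet validation accuracy with its mean interpretability score. The sketch below uses made-up placeholder numbers solely to illustrate the computation; the actual per-model values are reported in the paper and contained in the IMI dataset.

from scipy.stats import spearmanr

# Placeholder values for illustration only -- NOT the numbers from the paper.
val_accuracy     = {"model_a": 0.72, "model_b": 0.78, "model_c": 0.85}
interpretability = {"model_a": 0.81, "model_b": 0.80, "model_c": 0.79}

names = sorted(val_accuracy)
rho, p = spearmanr([val_accuracy[n] for n in names],
                   [interpretability[n] for n in names])
print(f"Spearman correlation between accuracy and interpretability: {rho:.2f} (p={p:.3f})")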
ImageNet Mechanistic Interpretability: A new dataset for automated alignment
The results above paint a rather disappointing picture of the state of mechanistic interpretability of computer vision models: just by scaling up models and datasets, we do not get increased interpretability for free, suggesting that if we want this property, we need to explicitly design and optimize models for it. Doing so at scale will require automated interpretability evaluations in place of costly human psychophysics experiments.
To enable research on such automated evaluations, we release our experimental results as a new dataset called ImageNet Mechanistic Interpretability (IMI). It contains more than 130'000 human responses from our psychophysical evaluation of 767 units across nine models, together with the resulting unit-wise interpretability scores, and is available on Zenodo.
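As a starting point for such automated evaluations, the snippet below sketches how unit-wise interpretability scores could be aggregated from the raw human responses. The file name and column names (model, unit, method, correct) are assumptions made purely for illustration; please consult the Zenodo record for the actual data format.

import pandas as pd

# Hypothetical file and column names -- see the Zenodo record for the real schema.
responses = pd.read_csv("imi_human_responses.csv")

# Interpretability of a unit = fraction of correct human responses for that unit,
# computed separately for natural exemplars and synthetic feature visualizations.
unit_scores = (
    responses
    .groupby(["model", "unit", "method"])["correct"]
    .mean()
    .rename("interpretability")
    .reset_index()
)

# Model-level summary, e.g., to compare architectures or training paradigms.
model_scores = unit_scores.groupby(["model", "method"])["interpretability"].mean()
print(model_scores)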
Acknowledgements & Funding
We thank Felix Wichmann, Evgenia Rusak, Robert-Jan Bruintjes, Robert Geirhos, Matthias Kümmerer, and Matthias Tangemann for their valuable feedback. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting RSZ and TK.
BibTeX
If you find our study or our dataset helpful, please cite our paper:
@inproceedings{zimmermann2023scale,
    author    = {Zimmermann, Roland S. and Klein, Thomas and Brendel, Wieland},
    title     = {Scale Alone Does not Improve Mechanistic Interpretability in Vision Models},
    booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
    year      = {2023},
}