Measuring Mechanistic Interpretability at Scale Without Humans
tl;dr: We introduce the first scalable method to measure per-unit interpretability in vision neural networks, demonstrate its alignment with human judgements, and use it to perform the largest evaluation of per-unit interpretability so far (over 70M units).
News
Feb '24: The pre-print is now available.
Abstract
In today's era, whatever we can measure at scale, we can optimize. So far, measuring the interpretability of units in deep neural networks (DNNs) for computer vision still requires direct human evaluation and is not scalable. As a result, the inner workings of DNNs remain a mystery despite the remarkable progress we have seen in their applications. In this work, we introduce the first scalable method to measure the per-unit interpretability in vision DNNs. This method does not require any human evaluations, yet its prediction correlates well with existing human interpretability measurements. We validate its predictive power through an interventional human psychophysics study. We demonstrate the usefulness of this measure by performing previously infeasible experiments: (1) A large-scale interpretability analysis across more than 70 million units from 835 computer vision models, and (2) an extensive analysis of how units transform during training. We find an anticorrelation between a model's downstream classification performance and per-unit interpretability, which is also observable during model training. Furthermore, we see that a layer's location and width influence its interpretability.
Definition of the Machine Interpretability Score.
A. We build on top of the established task definition proposed by Borowski et al. 2021 to quantify per-unit interpretability via human psychophysics experiments. The task quantifies how well participants understand the sensitivity of a unit by asking them to match strongly activating query images to strongly activating explanation images.
Motivation
The goal of understanding the inner workings of a neural network is inherently human-centric: Irrespective of what tools have been used, in the end, humans should have a better comprehension of the network. Removing the need for human labor by automating the interpretability evaluation can open up multiple high-impact research directions: One benefit is that it enables the creation of more interpretable networks by explicitly optimizing for interpretability — after all, what we can measure at scale, we can optimize. Moreover, it allows more efficient research on explanation methods and might lead to an increased overall understanding of neural networks.
You can find more detailed and interactive versions of some of the paper's visualizations here.
How Do We Automate Interpretability Evaluations?
We extend work by Borowski et al. 2021, which introduces a setup for quantifying how well humans can infer the sensitivity of a unit in a vision model, e.g., a (spatially averaged) channel in a CNN or a neuron in an MLP, from explanations. They use a 2-AFC task design in a psychophysics experiment (see the left side of the figure above) to measure how well humans understand a unit: after seeing visual explanations, participants must predict which of two extreme (query) images yields the higher activation. Specifically, two sets of explanations are displayed: highly and weakly activating images, called positive and negative explanations, respectively.
We automate this evaluation by passing the explanations and query images through a feature encoder to compute pair-wise image similarities (using DreamSim), and then use a simple (hard-coded) binary classifier to solve the underlying task. The Machine Interpretability Score (MIS) of a unit is the predicted probability of the correct choice, averaged over multiple tasks for that unit. The MIS proves to be highly correlated with human interpretability ratings and allows fast evaluation of new hypotheses.
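To make this construction concrete, here is a minimal sketch of the MIS computation for a single unit. It assumes precomputed image batches per 2-AFC task and an embed function mapping images to feature vectors (e.g., DreamSim embeddings); the softmax-over-similarity-differences classifier below is an illustrative stand-in for the paper's hard-coded classifier, not its exact implementation.

import torch
import torch.nn.functional as F

def mis_for_unit(tasks, embed):
    """Sketch of the Machine Interpretability Score (MIS) for one unit.

    `tasks` is a list of dicts, each holding the image tensors of one 2-AFC task:
      "pos_expl"  - strongly activating (positive) explanation images
      "neg_expl"  - weakly activating (negative) explanation images
      "query_pos" - the strongly activating query image (the correct choice)
      "query_neg" - the weakly activating query image
    `embed` maps an image batch to feature vectors, e.g. DreamSim embeddings.
    """
    probs = []
    for task in tasks:
        pos = F.normalize(embed(task["pos_expl"]), dim=-1)      # (n_expl, d)
        neg = F.normalize(embed(task["neg_expl"]), dim=-1)      # (n_expl, d)
        q_pos = F.normalize(embed(task["query_pos"]), dim=-1)   # (1, d)
        q_neg = F.normalize(embed(task["query_neg"]), dim=-1)   # (1, d)

        def evidence(q):
            # Evidence that a query belongs to the positive set: its mean cosine
            # similarity to the positive explanations minus that to the negatives.
            return (q @ pos.T).mean() - (q @ neg.T).mean()

        # Simple hard-coded binary classifier: a softmax over the two queries'
        # evidence yields the predicted probability of the correct choice.
        logits = torch.stack([evidence(q_pos), evidence(q_neg)])
        probs.append(torch.softmax(logits, dim=0)[0])

    # MIS: average predicted probability of the correct choice over all tasks.
    return torch.stack(probs).mean().item()

Averaging the predicted probability of the correct choice (rather than a hard decision) over several tasks yields a smooth per-unit score, where values close to 1 correspond to tasks that are easy to solve and hence to a highly interpretable unit.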
How Good is the Machine Interpretability Score?
Our proposed Machine Interpretability Score (MIS) explains existing interpretability annotations (Human Interpretability Score, HIS) from IMI well: It reproduces the model ranking presented in IMI while being fully automated and requiring no human labor, as evidenced by the strong correlation between MIS and HIS.
MIS Explains Interpretability Model Rankings. We see a strong correlation between the Machine Interpretability Score (MIS) and the Human Interpretability Score (HIS) when comparing models.
As the previous results were of a descriptive (correlational) nature, we now perform a causal (interventional) experiment: We use the MIS to determine the least and most interpretable units of a model and then test, in a human psychophysics study, whether participants indeed find these units harder or easier to interpret.
MIS Allows Detection of (Non-) Interpretable Units.
We demonstrate that the MIS can be used to detect highly or weakly interpretable units.
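As an illustration of how such an intervention can be set up, the sketch below ranks units by their MIS and selects both tails for a subsequent human study; the unit identifiers, scores, and selection size are hypothetical.

import numpy as np

def select_extreme_units(mis_per_unit, k=10):
    """Pick the k least and k most interpretable units according to their MIS.

    `mis_per_unit` maps a unit identifier (e.g. "layer4.1.conv2:17") to its MIS.
    The selected units can then be shown to human participants to test whether
    low-MIS units are indeed harder to interpret than high-MIS ones.
    """
    ranked = sorted(mis_per_unit, key=mis_per_unit.get)  # ascending by MIS
    return ranked[:k], ranked[-k:]

# Example with made-up scores for 100 units:
scores = {f"unit_{i}": s for i, s in enumerate(np.random.uniform(0.55, 0.95, 100))}
least, most = select_extreme_units(scores, k=5)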
How Does the Interpretability Differ across Models?
Equipped with a fast and cheap measure of human-perceived interpretability, we can now perform experiments considered infeasible before. We begin by substantially extending the analysis of Zimmermann et al. 2023 from a noisy average over a few units of a few models to all units of 835 models. Note that performing this analysis with human evaluations would have cost around one billion USD. See the full paper for more details on the experiment and the results.
Comparison of the Average Per-unit MIS for Models. We compare models regarding their average per-unit interpretability (as judged by MIS); the shaded area depicts the 5th to 95th percentile over units. We see that all models fall into an intermediate performance regime, with stronger changes in interpretability at the tails of the model ranking. Models probed by Zimmermann et al. 2023 are highlighted in red. You can find an interactive version of this plot here.
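The per-model summary shown above reduces each model's per-unit scores to simple statistics. A minimal sketch, assuming per-unit MIS values are already available as NumPy arrays (the model names and scores below are placeholders):

import numpy as np

def summarize_model(unit_mis):
    """Reduce one model's per-unit MIS values to the plotted statistics:
    the mean over units and the 5th/95th percentiles (the shaded band)."""
    return {
        "mean": float(unit_mis.mean()),
        "p05": float(np.percentile(unit_mis, 5)),
        "p95": float(np.percentile(unit_mis, 95)),
    }

# Ranking models by their average per-unit MIS (placeholder models and scores):
models = {name: np.random.uniform(0.6, 0.9, size=10_000) for name in ["model_a", "model_b"]}
ranking = sorted(models, key=lambda m: summarize_model(models[m])["mean"], reverse=True)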
How Does the Interpretability Change during Training?
We also show how our proposed measure allows even more fine-grained evaluations by tracking the interpretability of a model during training. Among other things, we find that the interpretability of a model decreases during training. For more insights, see the full paper.
Change of Interpretability During Training. For a ResNet-50 trained for 100 epochs on ImageNet, we track the MIS and top-1 accuracy after every epoch (epoch 0 refers to random initialization). While the MIS improves drastically in the first epoch, it monotonically decreases during the rest of training (left). This results in an inverse relationship between MIS and accuracy (right).
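A sketch of how such per-epoch tracking could be implemented, assuming checkpoints saved after every epoch and two evaluation helpers, compute_model_mis (mean MIS over all units) and val_accuracy (ImageNet top-1), which are assumptions here rather than parts of the released code:

import torch

def track_interpretability(model, checkpoint_paths, compute_model_mis, val_accuracy):
    """Track mean MIS and top-1 accuracy across training checkpoints.

    `checkpoint_paths[e]` is the state dict saved after epoch e (epoch 0 being the
    random initialization); `compute_model_mis(model)` and `val_accuracy(model)`
    are assumed evaluation helpers returning the mean per-unit MIS and the
    ImageNet top-1 accuracy, respectively.
    """
    history = []
    for epoch, path in enumerate(checkpoint_paths):
        model.load_state_dict(torch.load(path, map_location="cpu"))
        history.append({
            "epoch": epoch,
            "mis": compute_model_mis(model),
            "top1": val_accuracy(model),
        })
    return history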
Acknowledgements & Funding
We thank Evgenia Rusak, Prasanna Mayilvahanan, Thaddäus Wiedemer and Thomas Klein for their valuable feedback. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philanthropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting RSZ.
BibTeX
If you find our study or our dataset helpful, please cite our paper:
@article{zimmermann2024measuring,
    author = {Zimmermann, Roland S. and Klindt, David and Brendel, Wieland},
    title = {Measuring Mechanistic Interpretability at Scale Without Humans},
    year = {2024},
}