Measuring Mechanistic Interpretability at Scale Without Humans

Roland S. Zimmermann
MPI-IS
David Klindt
Stanford University
Wieland Brendel
MPI-IS

tl;dr: We introduce the first scalable method for measuring per-unit interpretability in vision neural networks, demonstrate its alignment with human judgements, and use it to perform the largest evaluation of per-unit interpretability so far (over 70M units).

News

Feb '24: The pre-print is now available.

Abstract

Definition of the Machine Interpretability Score. A. We build on top of the established task definition proposed by Borowski et al. 2021 to quantify the per-unit interpretability via human psychophysics experiments. The task quantifies how well participants understand the sensitivity of a unit by asking them to match strongly activating query images to strongly activating visual explanations of the unit. B. Crucially, we remove the need for humans and fully automate the evaluation: We pass the explanations and query images through a feature encoder to compute pair-wise image similarities (DreamSim) before using a (hard-coded) binary classifier to solve the underlying task. Finally, the Machine Interpretability Score (MIS) is the average of the predicted probability of the correct choice over N tasks. C. The MIS proves to be highly correlated with human interpretability ratings and allows fast evaluations of new hypotheses.

Motivation

The goal of understanding the inner workings of a neural network is inherently human-centric: Irrespective of what tools have been used, in the end, humans should have a better comprehension of the network. Removing the need for human labor by automating the interpretability evaluation can open up multiple high-impact research directions: One benefit is that it enables the creation of more interpretable networks by explicitly optimizing for interpretability — after all, what we can measure at scale, we can optimize. Moreover, it allows more efficient research on explanation methods and might lead to an increased overall understanding of neural networks.

You can find more detailed and interactive versions of some of the paper's visualizations here.

How Do We Automate Interpretability Evaluations?

We extend the work of Borowski et al. 2021, which introduces a setup for quantifying how well humans can infer the sensitivity of a unit in a vision model, e.g., a (spatially-averaged) channel in a CNN or a neuron in an MLP, from explanations: They leverage a 2-AFC task design in a psychophysics experiment (see the left side of the figure above) to measure how well humans understand a unit: after seeing visual explanations, participants must predict which of two extremely activating (query) images yields the higher activation. Specifically, two sets of explanations are displayed: highly and weakly activating images, called positive and negative explanations, respectively.
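
To make the task construction concrete, here is a minimal sketch of how a single 2-AFC trial could be assembled from a unit's activations on a probe dataset. All function and variable names are our own illustration; the exact split of the activation ranking into explanations and queries follows the protocol of Borowski et al. 2021 rather than this simplified version:

import torch

def build_2afc_trial(activations, images, k=9):
    # Assemble one 2-AFC trial for a single unit.
    # activations: 1-D tensor of the unit's (spatially averaged) activations
    # over a probe dataset; images: list of the corresponding probe images.
    order = torch.argsort(activations).tolist()   # indices sorted by ascending activation
    neg_expl = [images[i] for i in order[:k]]     # weakly activating (negative) explanations
    pos_expl = [images[i] for i in order[-k:]]    # strongly activating (positive) explanations
    # Two query images: one strongly and one weakly activating image that are
    # not reused as explanations; the first one is the correct answer.
    query_pos = images[order[-k - 1]]
    query_neg = images[order[k]]
    return pos_expl, neg_expl, query_pos, query_neg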

We automate this evaluation by passing the explanations and query images through a feature encoder to compute pair-wise image similarities (DreamSim) before using a simple (hard-coded) binary classifier to solve the underlying task. The Machine Interpretability Score (MIS) is the average of the predicted probability of the correct choice over multiple tasks for the same unit. The MIS proves to be highly correlated with human interpretability ratings and allows fast evaluations of new hypotheses.
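
The following sketch shows how such an automated evaluation could look, assuming a DreamSim-style encoder that maps a batch of images to embedding vectors. The helper names and the cosine-similarity classifier below are simplified illustrations; the exact form of the hard-coded classifier is described in the paper:

import torch
import torch.nn.functional as F

def trial_probability(encoder, pos_expl, neg_expl, query_pos, query_neg):
    # Predicted probability of the correct choice for one 2-AFC trial:
    # compare each query's average similarity to the positive vs. negative
    # explanations and turn the two margins into a probability via a softmax.
    def embed(imgs):
        return F.normalize(encoder(torch.stack(imgs)), dim=-1)

    e_pos, e_neg = embed(pos_expl), embed(neg_expl)
    q = embed([query_pos, query_neg])                                # (2, d)
    margin = (q @ e_pos.T).mean(dim=1) - (q @ e_neg.T).mean(dim=1)   # (2,)
    # The strongly activating query (index 0) should obtain the larger margin.
    return torch.softmax(margin, dim=0)[0]

def machine_interpretability_score(encoder, trials):
    # MIS of a unit: average predicted probability of the correct choice
    # over its N trials (e.g., built with build_2afc_trial above).
    return torch.stack([trial_probability(encoder, *t) for t in trials]).mean()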

How Good is the Machine Interpretability Score?

Our proposed Machine Interpretability Score (MIS) explains the existing interpretability annotations (Human Interpretability Score, HIS) from IMI well: It reproduces the ranking of models presented in IMI while being fully automated and requiring no human labor, as evidenced by the strong correlation between MIS and HIS.

MIS Explains Interpretability Model Rankings. We see a strong correlation between the Machine Interpretability Score (MIS) and the Human Interpretability Score (HIS) when comparing models.
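
To illustrate, the model-level agreement between the two scores can be summarized with a rank correlation; the values below are hypothetical placeholders, not results from the paper:

from scipy.stats import spearmanr

# Average per-unit scores for the same set of models (placeholder values).
mis_per_model = [0.82, 0.78, 0.85, 0.74, 0.80]
his_per_model = [0.79, 0.75, 0.83, 0.72, 0.77]

rho, p_value = spearmanr(mis_per_model, his_per_model)
print(f"Spearman correlation between MIS and HIS: {rho:.2f} (p={p_value:.3f})")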

As the previous results were of a descriptive (correlational) nature, we now perform a causal (interventional) experiment: We use the MIS to determine the least (hardest) and most (easiest) interpretable units in a GoogLeNet and a ResNet-50. We then use the psychophysics setup of Zimmermann et al. 2023 to measure their interpretability and compare them to randomly sampled units. Strikingly, the psychophysics results match the predicted properties: Units with the lowest MIS have significantly lower interpretability than random units, which in turn have significantly lower interpretability than units with the highest MIS. More details on the experiments and the results can be found in the full paper.

MIS Allows Detection of (Non-)Interpretable Units. We demonstrate that the MIS can be used to detect highly or weakly interpretable units without requiring any human psychophysics experiments. Error bars denote the 95% confidence interval.
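
A sketch of how units could be selected for such an interventional experiment, assuming the per-unit MIS values have already been computed (names are illustrative, not the paper's code):

import numpy as np

def select_units(mis_per_unit, n=50, seed=0):
    # Pick the hardest (lowest-MIS), easiest (highest-MIS), and random units.
    order = np.argsort(mis_per_unit)
    hardest = order[:n]
    easiest = order[-n:]
    random_units = np.random.default_rng(seed).choice(
        len(mis_per_unit), size=n, replace=False
    )
    return hardest, easiest, random_units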

How Does the Interpretability Differ across Models?

Equipped with a fast and cheap measure of human-perceived interpretability, we can now perform experiments that were previously considered infeasible. We begin by substantially extending the analysis of Zimmermann et al. 2023 from a noisy average over a few units for a few models to all units of 835 models. Note that performing this analysis with human evaluations would have amounted to costs of around one billion USD. See the full paper for more details on the experiment and the results.

Comparison of the Average Per-unit MIS for Models. We compare models regarding their average per-unit interpretability (as judged by MIS); the shaded area depicts the 5th to 95th percentile over units. We see that all models fall into an intermediate performance regime, with stronger changes in interpretability at the tails of the model ranking. Models probed by Zimmermann et al. 2023 are highlighted in red. You can find an interactive version of this plot here.

How Does the Interpretability Change during Training?

We also show how our proposed measure enables even more fine-grained evaluations by tracking the interpretability of a model during training. Among other things, we find that, after an initial increase, the interpretability of the model decreases as training progresses. For more insights, see the full paper.

Change of Interpretability During Training. For a ResNet-50 trained for 100 epochs on ImageNet, we track the MIS and top-1 accuracy after every epoch (epoch 0 refers to random initialization). While the MIS improves drastically in the first epoch, it decreases monotonically over the rest of training (left). This results in an inverse relation between MIS and accuracy (right).

Acknowledgements & Funding

BibTeX

If you find our study or our dataset helpful, please cite our paper:

@article{zimmermann2024mis,
  author = {
    Zimmermann, Roland S. and
    Klindt, David and
    Brendel, Wieland
  },
  title = {
    Measuring Mechanistic Interpretability
    at Scale Without Humans
  },
  year = {2024},
}