Measuring Mechanistic Interpretability at Scale Without Humans: Visualizations
Welcome to the extended and interactive visualizations of our paper. You can hide and highlight elements in a plot by clicking on an item in the legend. Please note that the visualizations are best viewed on a desktop or laptop; some of them might look different here than in the paper due to differences in the rendering engines.
Validating the Machine Interpretability Score
MIS Explains Interpretability Model Rankings. Our proposed Machine Interpretability Score (MIS) explains the existing interpretability annotations (Human Interpretability Score, HIS) from IMI well: it reproduces the ranking of models presented in IMI while being fully automated and requiring no human labor, as evidenced by the strong correlation between MIS and HIS. We compare the performance of the MIS for three different perceptual similarity functions (LPIPS, DISTS and DreamSim). Note that our final design choice uses DreamSim.
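As an illustration, the agreement between the two model rankings can be quantified with a rank correlation between per-model average scores. The sketch below is not the paper's exact aggregation; the model names and score values are purely hypothetical placeholders.

```python
# Minimal sketch (hypothetical data, not the paper's exact aggregation):
# per-model average HIS and MIS, aligned by model name.
import numpy as np
from scipy.stats import spearmanr, pearsonr

his_per_model = {"model_a": 0.78, "model_b": 0.71, "model_c": 0.66}  # hypothetical values
mis_per_model = {"model_a": 0.75, "model_b": 0.70, "model_c": 0.64}  # hypothetical values

models = sorted(his_per_model)
his = np.array([his_per_model[m] for m in models])
mis = np.array([mis_per_model[m] for m in models])

rho, _ = spearmanr(his, mis)   # rank agreement: does MIS reproduce the HIS ranking?
r, _ = pearsonr(his, mis)      # linear correlation between the two scores
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```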
MIS Explains Per-unit Interpretability Annotations. The proposed MIS explains not only summary statistics for an entire model (see here) but also individual per-unit interpretability annotations. We show the computed MIS and the recorded HIS for every unit in IMI. We compare the performance of the MIS for three different perceptual similarity functions (LPIPS, DISTS and DreamSim). Note that our final design choice uses DreamSim.
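For intuition, the sketch below shows one way a perceptual similarity function could stand in for a human judgment when scoring a single unit: given exemplar images for the unit and two candidate query images, the query that looks more similar to the exemplars is picked. The `sim` callable, the trial structure, and the averaging are illustrative assumptions only, not the paper's exact MIS definition; in particular, LPIPS, DISTS and DreamSim return distances, which would need to be converted to similarities.

```python
# Loose illustration (assumed interfaces, not the paper's exact MIS):
# `sim(a, b)` is assumed to return higher values for more similar images.
import numpy as np

def pick_matching_query(sim, exemplars, query_a, query_b):
    """Return 0 if query_a looks more like the unit's exemplars, else 1."""
    score_a = np.mean([sim(query_a, e) for e in exemplars])
    score_b = np.mean([sim(query_b, e) for e in exemplars])
    return 0 if score_a >= score_b else 1

def per_unit_score(sim, trials):
    """Fraction of trials in which the strongly activating query (by convention index 0) is picked."""
    return float(np.mean([pick_matching_query(sim, ex, qa, qb) == 0
                          for ex, qa, qb in trials]))
```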
Comparison of Models
Comparison of the Average Per-unit MIS for Models. We substantially extend the analysis of Zimmermann et al. 2023 from a noisy average over a few units of a few models to all units of 835 models. The models are compared in terms of their average per-unit interpretability (as judged by the MIS); the shaded area depicts the 5th to 95th percentile over units. We see that all models fall into an intermediate performance regime, with stronger changes in interpretability at the tails of the model ranking. Models probed by Zimmermann et al. 2023 are highlighted in red.
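The per-model summary shown here (mean with a 5th-to-95th-percentile band over units) can be reproduced from per-unit scores roughly as follows; `per_unit_mis` is an assumed mapping from model name to an array of per-unit MIS values.

```python
# Minimal sketch, assuming `per_unit_mis: dict[str, np.ndarray]` maps each
# model name to its per-unit MIS values.
import numpy as np

def model_summary(per_unit_mis):
    """Per model: mean MIS and the 5th/95th percentile over units, sorted by mean."""
    rows = []
    for model, scores in per_unit_mis.items():
        scores = np.asarray(scores)
        rows.append((model, scores.mean(),
                     np.percentile(scores, 5), np.percentile(scores, 95)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```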
Comparison of the Average Per-unit MIS for Different Layer Types and Models. We show the average interpretability of units from the most common layer types in vision models (BatchNorm, Conv, GroupNorm, LayerNorm, Linear). We follow Zimmermann et al. 2023 and restrict our analysis of Vision Transformers to the linear layers in each attention head. While not every layer type is used by every model, we still see some separation between types: linear and convolutional layers mostly outperform normalization layers. Models are sorted by average per-unit interpretability.
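A per-layer-type breakdown like the one plotted here can be computed from a per-unit table; the column names below (`model`, `layer_type`, `mis`) are assumptions about the data layout, not the format we actually store.

```python
# Minimal sketch, assuming a DataFrame with one row per unit and columns
# `model`, `layer_type` (e.g. "Conv", "Linear", "BatchNorm", ...), and `mis`.
import pandas as pd

def layer_type_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Average per-unit MIS per model and layer type; models sorted by overall mean MIS."""
    table = df.groupby(["model", "layer_type"])["mis"].mean().unstack("layer_type")
    order = df.groupby("model")["mis"].mean().sort_values(ascending=False).index
    return table.loc[order]
```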
Relation Between ImageNet Accuracy and MIS. The average per-unit MIS of a model is anticorrelated with the model's top-1 ImageNet classification accuracy.
Analysis of Constant Units
Ratio of Constant Units. We compute the ratio of units whose activation is constant with respect to the input (over the training set of ImageNet-2012) for all models considered. While this ratio is low for most models, it becomes large for a few of them.
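A unit is treated as constant if its activation does not vary across inputs. The sketch below assumes a pre-collected activation matrix and an illustrative tolerance parameter `tol`; both the data layout and the tolerance are assumptions for this example.

```python
# Minimal sketch, assuming `activations` has shape (num_images, num_units),
# e.g. one scalar activation per unit collected over ImageNet-2012 training images.
import numpy as np

def constant_unit_ratio(activations: np.ndarray, tol: float = 0.0) -> float:
    """Fraction of units whose activation range across inputs is at most `tol`."""
    per_unit_range = activations.max(axis=0) - activations.min(axis=0)
    return float(np.mean(per_unit_range <= tol))
```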