Don't trust your eyes: on the (un) reliability of feature visualizations

Abstract

How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to “explain” how neural networks process natural images. This can be used as a sanity check for feature visualizations. We underpin our empirical findings by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.

Wieland Brendel
Wieland Brendel
Principal Investigator (PI)

Wieland Brendel received his Diploma in physics from the University of Regensburg (2010) and his Ph.D. in computational neuroscience from the École normale supérieure in Paris (2014). He joined the University of Tübingen as a postdoctoral researcher in the group of Matthias Bethge, became a Principal Investigator and Team Lead in the Tübingen AI Center (2018) and an Emmy Noether Group Leader for Robust Machine Learning (2020). In May 2022, Wieland joined the Max-Planck Institute for Intelligent Systems as an independent Group Leader and is now a Hector-endowed Fellow at the ELLIS Institute Tübingen (since September 2023). He received the 2023 German Pattern Recognition Award for his substantial contributions on robust, generalisable and interpretable machine vision. Aside of his research, Wieland co-founded a nationwide school competition (bw-ki.de) and a machine learning startup focused on visual quality control.