Interpretability, pt 1

I summarize some recent results and theories about interpretability in machine learning.

According to Tim Rudner and Helen Toner of CSET, as machine learning has become both increasingly prevalent in automating decisions across a wide range of societal applications and increasingly complex in its training and architecture, such systems have actually been deployed with less human supervision than ever [Rudner & Toner 21]. We already have such systems in health care, banking, driving, and hiring. Increased complexity contributes to a lack of "explainability," otherwise known as interpretability. Users of, and those affected by, ML systems have the right to request an explanation for a decision (especially a high-stakes decision), and often do. In sum, these factors have pressured researchers to pursue interpretability of ML models.

As part of my reading on Interpretability, I wanted to summarize what I've learned, especially from the great communicator: Chris Olah, cofounder of Anthropic.

Introduction

Mechanistic interpretability refers to the ability to explain how a complex model works at a local level. In the context of ML/AI, it involves gaining insights into why a model makes its predictions or decisions. The goal is a human-understandable explanation of the decision-making process followed by the model, especially when the decisions being made are of high consequence, such as in healthcare, finance, and autonomous systems.

Some common techniques for mechanistic interpretability include:

  1. Feature Importance Analysis: Identifying which features or variables the model considers most important in making predictions; e.g., feature importance scores, permutation importance, and SHAP (SHapley Additive exPlanations) values (see the sketch after this list).
  2. Local Surrogate Explanations: Explaining individual predictions by perturbing input features and observing how the prediction changes; e.g., LIME (Local Interpretable Model-agnostic Explanations).
  3. Sensitivity Analysis: Evaluating how changes in input variables affect the model's output; e.g., through gradients or partial derivatives.
  4. Rule-based Explanations: Summarizing decisions with simplified rules that capture the model's general behavior.
  5. Model Distillation: Training a simpler, more interpretable model (e.g., decision tree) to mimic the behavior of a complex model.
  6. Visualizations: Creating visual representations of model behavior; e.g., feature importance plots, partial dependence plots, and saliency maps.
  7. Domain Knowledge Integration: Combining the model's outputs with existing domain knowledge or expert insights.
  8. Attention Mechanisms: For models like neural networks, attention mechanisms can highlight which parts of the input data were most influential in producing the output.
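
As a concrete illustration of the first technique, here is a minimal sketch of permutation importance using scikit-learn; the dataset and model are placeholders chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data and model; any fitted estimator works the same way.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in held-out accuracy;
# large drops mark features the model relies on most heavily.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, mean_drop in ranked[:5]:
    print(f"{name}: {mean_drop:.3f}")
```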

"Every approach to interpretability must somehow overcome the curse of dimensionality" [Olah 22]. The curse of dimensionality refers to how the number of inputs (or, equivalently, apparent volume of the domain) grows as a power of the dimension of the domain. That is, if our input consists of \(n\) variables \(x_1, x_2, ..., x_n\) which all range over the same, say, \(100\) values, then the number of input tuples is \(100^n\), which grows exponentially in \(n\). This means that high-dimensional input spaces pose problems for learning functions: it would seem that we need an exponential amount of data to understand the behavior of a function. How can we understand the function without taking an exponential amount of spatial or temporal resources?

Mechanistic Interpretability is the approach that takes advantage of the fact that the learned function is specified by a model with a finite description, akin to a compiled program: the length of that description does not grow exponentially in \(n\), so it at least seems plausible as an object small enough to be understood [Olah 22]. To start, the mechanistic approach attempts to ascribe interpretations to quantities that are not descriptively labeled, much as a reverse engineer assigns meaning to unlabeled variables and segments of memory. In this analogy, neural network parameters play the role of binary instructions, and neuron activations play the role of memory. Neuron activations in particular are high-dimensional, much like the input domain, so we aim to break them into smaller, independently understandable components. This is especially necessary when there are no mathematical assumptions, such as linearity of the map between inputs and outputs, that could act as shortcuts to interpretation.

In the context of understanding language models consisting of transformers with multilayer perceptron (MLP) layers, Elhage et al. argue that the non-linearity introduced by the MLP necessitates understanding the activations encoded in the MLP layers [Elhage et al 22].

Polysemanticity

A layer in a neural network represents a function mapping an input vector (each component of the input vector might be called a separate input) to an output vector. The input vector need not have the same length as the output vector; that is, the dimension of the domain need not match the dimension of the codomain. The set of all possible output vectors forms a vector space called the representation or activation space of the layer. Every vector space has infinitely many bases: sets of linearly independent vectors which, together, span the entire space via linear combinations. In this context, the vectors of the standard coordinate basis (the individual output components) are the neurons. In principle, the human-understandable features of this layer, i.e., the ways outputs in the representation vary with (possibly meaningful) changes along one of the input components, might embed along arbitrary directions or curves in the representation space.
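
To make "features might embed along arbitrary directions" concrete, here is a toy NumPy sketch (my own illustration, not taken from the cited work): a feature is planted along a random direction of the activation space, and it is recovered cleanly by projecting onto that direction rather than by reading off any single neuron.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                          # dimension of the activation space
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)          # unit-norm "feature direction"

feature_strength = rng.uniform(0, 1, size=200)  # how strongly the feature is present
activations = np.outer(feature_strength, direction) + 0.05 * rng.normal(size=(200, d))

# Projecting onto the feature direction recovers the feature almost exactly,
# while any single neuron sees only a scaled, noisier copy of it.
print(np.corrcoef(activations @ direction, feature_strength)[0, 1])   # close to 1.0
print(np.corrcoef(activations[:, 0], feature_strength)[0, 1])         # weaker; depends on direction[0]
```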

Features are considered important to understanding neural networks because each activation space has exponential volume, and we need to decompose it into features we can reason about independently to have any hope of understanding it [Olah 23]. Not only should we hope for a tractable number of features in this decomposition, but also for an organizational structure among these features. There is a general sense that these hopes can be fulfilled, but first let us dissect the problem.

If the representation space has a basis that not only spans all important features but is also aligned with them, we say that the layer has a privileged basis. As Chris Olah puts it, "Just as a CPU having operations that act on bytes encourages information to be grouped in bytes rather than randomly scattered over memory, activation functions often encourage features to be aligned with [an input] neuron, rather than corresponding to a random linear combination of neurons" [Olah 22]. This kind of regularity in feature alignment would make the job of reverse engineering a neural network's decision-making process much easier: neurons can map cleanly to understandable concepts. As an example, see Cammarata et al., who demonstrate that certain neurons in a trained model can be interpreted as curve detectors [Cammarata et al 20, 21].

However, it is not always the case that a layer has a privileged basis. Transformer language models are notorious for having neurons which do not correspond to understandable concepts. Many behaviors are possible, but we call those neurons which group several unrelated features together polysemantic [Olah et al, 20]. The term refers to the trait that a neuron will "respond" significantly to multiple features.

Elhage et al. explain that in transformers, "the token embeddings, residual stream, and attention vectors do not have privileged-bases, while the MLP layer activations are privileged" [Elhage et al 22].

One explanation for why a feature would not align with a neuron follows the Superposition Hypothesis [Elhage et al 22]. This hypothesis comments on the general behavior of neural networks, positing that it is inherently misguided to expect each neuron of a generic network to align with a single feature, for neural networks are designed "to represent more features than they have neurons, [thereby exploiting] a property of high-dimensional spaces to simulate a model with many more neurons" [Elhage et al 22]. That is, the network wants to express more features than it has neurons it could dedicate to a single feature each. In practice, it might happen that the network dedicates a particular neuron to each of the most prominent features, but by simple counting, there aren't enough neurons to go around for all the features the network wishes to simulate.

Polysemanticity and the Superposition Hypothesis [Elhage et al 22].

One argument for the plausibility of the Superposition Hypothesis rests on the following intuition:

  • A neural network typically embeds a feature along a direction in activation space (rather than along a non-linear curve), both for simplicity and because the underlying computations are built from matrix multiplications, which act linearly on such directions.
  • There are orders of magnitude more possible features to express than neurons: especially in a language or vision model, billions of possible variations might arrive as input without billions of neurons to specialize on each. How those neurons get distributed across specializations depends on the disproportionate importance of each input (e.g., some people's faces are photographed far more often than others).
  • The most efficient way to encode many 'facts' or learned truths is not always to give importance to just one parameter; it may be to distribute weight across multiple parameters, especially for the less common inputs.
  • Neural networks are motivated by minimizing a loss function throughout their training. One can reasonably expect that representing more features in the data helps to reduce loss up until the point when adding more features interferes with successfully simulating the more important features. So we should expect a neural network to follow a path of efficiently packing in features.
  • The Johnson-Lindenstrauss Lemma guarantees that if we have \(N\) features/directions we'd like to embed, then while we could assign each its own vector orthogonal to the rest and hence construct a space of dimension \(N\), it is also possible to fit all \(N\) vectors into a space of dimension only \(O\!\left(\frac{\log N}{\epsilon^2}\right)\) so that they are "almost orthogonal," in the sense that the inner product between any pair has absolute value no greater than a specified \(\epsilon\). A neural network could stumble upon representations of features in ways not much worse than what this result guarantees (see the numerical sketch after this list).
  • Moreover, because "features are sparse," i.e., only a small fraction of features are active for any given input, projections onto lower-dimensional spaces can be essentially invertible. This is the insight behind compressed sensing. In this sense, we can view a trained neural network as simulating a larger neural network in which every neuron corresponds to a "disentangled" or dedicated feature, at the cost of permitting slight interference between competing features within its neurons. We think of the neurons of this larger, simulated network as being projected onto the smaller network's activation space as "almost orthogonal" vectors. To mechanistic interpreters, this projected version manifests as polysemanticity.
Neural networks as projections of larger networks [Elhage et al 22].
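
A quick numerical check of the "almost orthogonal" point above (my own sketch, independent of the cited work): many more random unit vectors than dimensions can coexist with only small pairwise interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 4000, 1024            # several times more "features" than dimensions

vectors = rng.normal(size=(n_features, dim))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

gram = vectors @ vectors.T              # all pairwise inner products
np.fill_diagonal(gram, 0.0)             # ignore each vector's product with itself
print(np.abs(gram).max())               # roughly 0.2: small interference, far from 1.0
```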

For those models whose development the Superposition Hypothesis describes well, there are a few implications for how we go about understanding them:

  • We can either attempt to create models that exhibit less polysemanticity by design (see Softmax Linear Units as an attempt to do just this [Elhage et al 22]), or
  • Find a way to understand representations with polysemanticity through some ideas offered by the subject of compressed sensing.

Future Directions for Interpretability

Recently, Chris Olah expressed a series of "dreams" for mechanistic interpretability; in particular, for resolving the challenge posed by neurons exhibiting superposition/polysemanticity as well as scalability [Olah 23]. By scalability, Olah refers to the problem of transferring methods used for understanding smaller models to much larger models. He posits that there is more work to be done in considering the larger-scale structures that emerge from these models; the universality of how particular features are treated across different neural networks; the relationship between local/mechanistic and global properties of these models; and the automation of interpretation by AI. He views AI safety as a principal goal of these pursuits.

Larger-Scale Structure

Epistemically, mechanistic interpretability follows a bottom-up approach: local, neuron-level laws are used to understand the entire network. Such local objects of study include features [input data or values computed from input data], parameters [learned weights and biases], and circuits [sets of neurons connected across layers]. Olah explains that it is easy to be misled if one tries to go top-down:

"Consider a vision model trained on ImageNet. A natural hypothesis is that it will contain a dog head detector – and indeed, ImageNet models do seem to reliably contain dog head detectors! However, there will also be earlier layers, right before it contains a dog head detector, in which the model has detectors for a dog snout, dog fur, eyes, and so on, which combined can very accurately predict the presence of a dog head. This means that a linear probe will discover a "dog detector," even if the layer isn't best understood as having a dog head detector feature yet" [Olah 23].

For clarity, a linear probe is a simple and interpretable method for investigating the information contained in the learned representations of a neural network, with the goal of examining how well these representations capture certain semantic properties of the input data. It is constructed by adding a linear layer (a "probe head" or linear classifier) on top of the frozen learned layers of the network, mapping a representation to a target property or label. One can train the probe on whatever semantic task is desired, such as predicting the part of speech of a generated word. The performance of the trained probe reflects the extent to which the learned representations capture the desired semantic feature: high accuracy suggests the representation captures the feature well.
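
A minimal sketch of such a probe, assuming frozen activations and labels have already been extracted (the file names below are hypothetical placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: hidden states from one frozen layer, plus labels for the
# semantic property we want to probe (e.g., part of speech).
representations = np.load("layer8_activations.npy")    # shape (n_examples, hidden_dim)
labels = np.load("part_of_speech_labels.npy")          # shape (n_examples,)

X_train, X_test, y_train, y_test = train_test_split(
    representations, labels, test_size=0.2, random_state=0)

# The probe itself is just a linear classifier on top of the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
# High accuracy suggests the layer linearly encodes the property; it does not by
# itself show that the model uses that information downstream.
```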

Olah argues that, as opposed to the top-down approach, the pursuit of understanding local parts like circuits in mechanistic interpretability offers the ability to truly determine which claims about the network are true; i.e., all statements about model behavior can be broken down into statements about (sufficiently small) circuits. In this view, the model's circuits are seen as a "kind of epistemic foundation for interpretability" [Olah et al, 20].

Olah also points to circuit motifs, or recurring/analogous patterns across complex networks [Olah 23]. He argues that understanding how motifs form and function could be more important than the study of individual circuits [Olah et al, 20]. There is speculation that, in the same way the body has not only small-scale cells and a large-scale whole, but also medium-scale organs, regions, and inter-organ systems, the study of circuits and motifs could reveal larger-scale emergent structures within neural networks.

While the superposition common to transformer language models has complicated the search for larger-scale structures, the search has been more fruitful for vision models. And if more structure can be revealed within superposition, there is hope of finding similar results for language models.

  • Feature families/Equivariance: certain features (and the weights & circuits implementing them) might be describable as parameterized by one or more adjustable parameters. Equivariance [the ability of the model to treat incoming data equivalently modulo some family of transformations] is typically found in few-parameter families, but can appear in a wide variety of contexts and models. In general, we expect there to be a meta-organization of features, which would fulfill our hope that features truly help us assign meaning to the activation space [Olah 23].
Examples of equivariance for vision models [Olah et al 20].
More examples of equivariance for vision models [Olah et al 20].
  • Feature Organization by Weight: one empirical but largely not understood phenomenon is that when superposition occurs in a neuron, the neuron is more likely to superimpose features which are unlikely to interfere than features which are highly related. Because (1) related features are more likely to be connected across layers, and (2) unrelated features are more likely to be superimposed with one another, it seems that superposition obscures an underlying organization of features [Olah 23].
  • Weight banding: the phenomenon in which the weights of the last layer of a convolutional neural net with global average pooling, trained on, say, ImageNet, appear in quite uniform bands. It is thought that weight banding occurs to preserve information about larger-scale structure in images [Petrov et al 21].
Analogy between tissue striation and weight banding [Petrov et al 21].

Universality

Should there exist structures for organizing features and circuits that emerge across various neural networks, the job of interpreting networks becomes a task of transferring what we've already learned to new models. The claim that neural networks mostly construct their circuits (for, say, vision or language tasks) according to a universal structure is known as the Universality Hypothesis.

"The Universality Hypothesis determines what form of circuits research makes sense. If it was true in the strongest sense, one could imagine a kind of 'periodic table of visual features' which we observe and catalogue across models. On the other hand, if it was mostly false, we would need to focus on a handful of models of particular societal importance and hope they stop changing every year. There might also be in-between worlds, where some lessons transfer between models but others need to be learned from scratch [Olah et al, 20].
Visual demonstration of supposed universality in particular feature detectors across four trained models [Olah et al, 20].

There are various established examples of universality: Gabor filters, high-low frequency detectors, certain attentional features in transformer language models, and even examples connecting artificial curve detectors to neurons in the mouse visual cortex [Olah 23].

Micro to Macro

It is an art to have microscopic phenomena imply macroscopic laws.

Induction heads are microscopic phenomena with well-understood macroscopic effects: namely, causing visible "bumps" in transformer language model loss curves. Olsson et al describe an induction head as a circuit whose function is to look back over the input sequence for previous instances of the current token, note the token that appeared immediately after it, and predict that second token as the completion of the current token: a simple copy-and-complete routine [Olsson et al 22].
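
A plain-Python sketch of that copy-and-complete behavior (an illustration of the pattern the circuit implements, not of how the attention head actually computes it):

```python
def induction_prediction(tokens):
    """Look back for the most recent earlier occurrence of the current (last)
    token and predict the token that immediately followed that occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: this pattern offers no prediction

# "The cat sat on the mat . The cat" -> predicts "sat"
print(induction_prediction(["The", "cat", "sat", "on", "the", "mat", ".", "The", "cat"]))
```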

The "bump" in loss curves exhibited by induction heads [Olah 23]. 

It seems that the emergence of universal circuits is reflected in corresponding reductions in the loss curves of neural networks, at the moments and scales at which those circuits emerge.

Olah describes how the story of universal circuits, and of macroscopic theories arising from microscopic ones, might depend on scale:

"As models scale, they implement new circuits, creating scaling laws. A more complex story [to a macroscopic theory] might be that there are dependencies between circuits: the model can't jump directly to very complex circuits without simpler circuits as stepping stones. If we could understand this dependency graph – which circuits depend on which other circuits, how much capacity do they take up, and how much do they marginally reduce the loss – we might explain both learning dynamics and scaling laws in one mechanistic theory" [Olah 23].

Automated Interpretability

According to an interesting recent publication by Bills et al, it is possible to apply automation in order to scale interpretability techniques across an entire LLM [Bills et al 23]. They were able to explain many neuron activations using GPT-4, and acknowledged that while their scores were not yet high, the ability to separate out polysemantic neurons could immediately allow their tools to achieve much better results.
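
In that spirit, a highly simplified sketch of the idea: show an LLM a neuron's top-activating tokens and ask for a one-sentence explanation. The (token, activation) records below are made up, and the published pipeline also scores each explanation by simulating the neuron's activations, which is omitted here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical top-activating tokens for one neuron, with activation values.
top_activations = [("dollars", 8.1), ("euros", 7.4), ("yen", 6.9), ("pounds", 6.2)]
prompt = (
    "These tokens most strongly activate a single neuron in a language model:\n"
    + "\n".join(f"{tok}: {act}" for tok, act in top_activations)
    + "\nIn one sentence, what concept might this neuron represent?"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```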

It is not yet clear that Automated Interpretability is the best route for Mechanistic Interpretability to follow — perhaps there will prove to be benefits to human analysis. Olah challenges:

"If we are relying on AI to audit the safety of AI systems, it seems like there are significant concerns about whether we should trust the model we are using to help us. This isn't obviously an insurmountable objection – there are surely clever ways to try and work around it. But it seems likely that many of the ways you might work around it would make the failure modes of interpretability more correlated with the failure modes of other approaches to safety" [Olah 23].

Should the previously discussed methods for approaching interpretability prove insufficient for understanding the majority of models, it will be helpful to receive the assistance of AI in this task, accelerating our capacity to interpret while still allowing humans to verify the results.

Mechanistic Interpretability and Safety

Olah loosely speculates about the safety implications of interpretability. He envisions a path toward establishing claims of the form "in all situations, the model won't deliberately take a given harmful action" by first solving scalability and superposition, and then verifying the sufficient condition "there does not exist any feature that would cause the model to deliberately take that action" [Olah 23]. Building this kind of argument might require an encyclopedic knowledge of the implications of a universal set of circuits. Given the speculative nature of the Universality Hypothesis and the difficulty of documenting the implications of every circuit, this dream might be the hardest to reach any time soon.

Conclusion

I hope to write more about Interpretability soon!
