Poll results, and other updates
A tie between an explainer on model efficiency and an empirical survey of the interpretability literature.
In July, paying subscribers of The Counterfactual voted on which topic they’d like to see me explore in a future post.
The results were a tie between two options: first, an explainer on methods for making large language models (LLMs) more efficient; and second, an empirical survey of the interpretability literature—with a focus on the “explanatory virtues”, as described in Ayonrinde & Jaburi (2025). That means I’ll do both of them!
In terms of a rough timeline, my goal is to complete and publish these posts over the next few months. My expectation is that I can probably finish the efficiency explainer first, since the primary task will be reading up on the literature on model distillation, model pruning, and so on. The empirical survey of mechanistic interpretability papers will likely take a bit longer: I’ll need to code a pipeline for sampling candidate papers and validating whether they are in fact suitable examples, and then I’ll need to hand-annotate[1] which methods and languages they actually used.
In addition to these upcoming posts, I have some other articles in the works, including (as described here) a discussion of Artificial Intelligence and higher education, as well as a review of the recent Cognitive Science Society Conference, which I attended last week.
The full pitches
In case readers are curious, here’s the pitch for the model efficiency explainer:
Large language models (LLMs) need a lot of computational resources. The models themselves are generally very large (i.e., many parameters) and they’re also trained on extraordinarily large datasets. Training these models is thus generally prohibitively expensive—and running them at inference time can be quite expensive too, especially with the move towards “reasoning” models that generate long sequences of “thinking” tokens.
For a number of reasons—including cost, fears of running out of data, cognitive plausibility, and more—there is considerable interest in making models more efficient. There are, in turn, different ideas on how to do this. One popular approach is model distillation, in which researchers first train a large model and then “distill” it into a smaller model that can be run more efficiently. Researchers have also explored techniques like quantization (representing weights and activations with fewer bits), pruning (systematically removing “redundant” model components), and much more.
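To make the distillation idea a bit more concrete, here’s a minimal sketch of the standard distillation objective (in the spirit of Hinton et al., 2015): the smaller “student” model is trained to match the larger “teacher’s” softened output distribution, alongside the usual cross-entropy loss on the true labels. This is my own illustration rather than code from any particular paper, and the temperature and alpha hyperparameters are just illustrative placeholders.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft teacher-matching term with the usual hard-label loss."""
    # Soft targets: the teacher's output distribution, softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the student and teacher distributions,
    # rescaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```

In practice, the teacher’s logits are typically computed in a forward pass with gradients disabled (or precomputed offline), so only the student’s parameters are updated by this loss.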
If this option is chosen, I’ll spend some time reviewing the relevant literature and developing an explainer on how these techniques actually work. The closest analogs among my past posts would probably be my vision-language model explainer and my mechanistic interpretability explainer.
And here’s the pitch for the interpretability survey:
Kola Ayonrinde, a researcher at the UK AI Safety Institute, recently co-authored a really interesting paper that adopted a philosophy of science angle on mechanistic interpretability. The idea was to identify a set of explanatory virtues associated with different theories of scientific explanation, including accuracy, parsimony, falsifiability, and more. They then “graded” different interpretability techniques (e.g., sparse auto-encoders, compact proofs) according to these virtues.
If this option is selected, I will conduct a systematic literature review of recent interpretability papers, identify which methods were used, and present a quantitative breakdown of the “virtues” present (and absent) in this sample. By “systematic”, I mean something loosely based on the PRISMA method of conducting reviews:
I will select a random sample of papers, identified using a handful of search terms, from specific sources and time periods.
I will then read the abstract of each paper to determine its suitability for inclusion.
In the final sample of papers, I will apply a scoring rubric for cataloguing which techniques were used.
These techniques can then be mapped to different explanatory virtues using the scheme presented in the paper (see Table 1).
As a bonus, I will also tag which language(s) the researchers used. My suspicion is that the vast bulk of interpretability research is still being conducted with English stimuli and English models, but it’d be useful to quantify this more directly.
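To give a flavor of what this pipeline might look like, here’s a rough, hypothetical sketch of the sampling and screening steps described above. It assumes arXiv as the paper source (via its public query API); the search terms, sample size, and rubric fields are placeholders rather than the final design.

```python
import random
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace for the arXiv Atom feed

def fetch_candidates(query, max_results=200):
    """Pull candidate papers for one search term from the arXiv query API."""
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode({
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    })
    with urllib.request.urlopen(url) as response:
        feed = ET.fromstring(response.read())
    return [
        {
            "id": entry.findtext(f"{ATOM}id"),
            "title": entry.findtext(f"{ATOM}title").strip(),
            "abstract": entry.findtext(f"{ATOM}summary").strip(),
            "published": entry.findtext(f"{ATOM}published"),
        }
        for entry in feed.findall(f"{ATOM}entry")
    ]

# Placeholder search terms; the real ones would be chosen more deliberately.
candidates = {}
for term in ["mechanistic interpretability", "sparse autoencoder"]:
    for paper in fetch_candidates(term):
        candidates[paper["id"]] = paper  # de-duplicate across search terms

# Draw a random sample for abstract-level screening (inclusion/exclusion).
random.seed(42)
sample = random.sample(list(candidates.values()), k=min(100, len(candidates)))

# Each included paper is then annotated against a simple rubric, e.g.:
annotation_template = {
    "included": None,   # does the abstract pass the inclusion criteria?
    "methods": [],      # e.g., ["sparse autoencoders", "activation patching"]
    "languages": [],    # e.g., ["English"]
}
```

The annotation itself, and the mapping from methods onto explanatory virtues via Table 1 of Ayonrinde & Jaburi (2025), would then happen by hand on top of this sample.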
[1] I’ll probably test how well an LLM does this as well, but I’d like to do hand annotation for the primary sample to ensure data quality.