"Mechanistic interpretability" for LLMs…

Sean Trott

Jul 8, 2024

Trying to peek inside the "black box".

Read →

7 Comments

Cameron Jones

Jul 9

Awesome post! The figure are gorgeous!

"Maybe it’s the computational theory of the mind that’s doomed in the end—for both brains and LLMs."

Could you explain this in more depth?

Expand full comment

Reply (1)

Sean Trott

Jul 9

Thank you!

And yeah, I was trying to present one version of the critique that MI (and perhaps also cog neuro) is not really feasible or at least will bump up against a limit before it feels like we've hit a satisfying explanation. Namely, that our understanding of "cognition" as consisting of representations mediating perception/action, and of mental "operations" over those representations, just: 1) doesn't have that much further to go on its own; 2) ultimately isn't the right framework for connecting with the substrate itself, i.e., "the brain is under no obligation to follow the principles of psychology".

And if that's true, then a couple of things might follow:

1) It's more useful to focus on the substrate itself. In the case of neuroscience, this would involve things like mapping out the cell types in the brain (in terms of their morphology, gene expression profile, etc.), their connectivity; maybe gross anatomy; and then also things like neurophysiology, i.e., describing the dynamics of neural activity and relating it to those other features (connectivity, morphology). In other words: studying the brain as a biological organ, not a cognitive system.

2) It's less clear to me what "focusing on the substrate" for LLMs really means. It's not just that LLMs can be described with math (like the brain), they really *are* mathematical operations. Plus transformers are pretty homogenous and boring as "substrates" compared to the brain. But maybe there's interesting work to be done in terms of characterizing the dynamics of parameter updates throughout pre-training, redundancy or lack thereof in attention weights, and so on.

At any rate, I (obviously) don't endorse this view but it's a concern I have about my chosen line of work.

Expand full comment

Reply (1)

Cameron Jones

Jul 10

Thanks! This is really helpful. That makes a lot of sense and is also something I worry about. We've been discussing things along these lines at DISI and it reminded me of this van Gelder paper https://www.jstor.org/stable/pdf/2941061.pdf. I think there's a strong computational position which says that there is some algorithm or function that we can determine a given system as computing. But maybe that analysis depends on our understanding of what the system is "supposed to" be doing.

I think this problem is probably less likely to be an issue for LLMs, because at some high level we know what they're "supposed to" be doing. But maybe we should also have more 'dynamical systems' analysis of LLMs. Unsure how facetious I'm being.

Expand full comment

Reply (1)

Sean Trott

Jul 10

This paper looks great, thanks!

I think there’s a paper to write along the lines of “what kind of thing is an llm?” It would connect to but be orthogonal to the other discussions we’ve been having re: capabilities and such. Even just a descriptive account of the different metaphors/analogies would be useful and interesting. Been thinking of writing up an informal version here, but would be fun to collaborate on a paper!

Expand full comment

Reply (1)

Benjamin Riley

Jul 10

Oh man, please write this paper! And I'll echo Cameron -- this is terrific. I'm curious if you've peeked at the research I recently wrote about regarding language not being essential to thought? Feels relevant in ways I'm not sure I can articulate!

Expand full comment

Reply (1)

Sean Trott

Jul 11

Thank you!

I have seen it (and I also read the original paper), and I think it's an interesting connection/argument you made with respect to LLMs, i.e., if language is an adaptation for communication (not thought), then training on language alone doesn't get you human-like thought.

I've been mulling over writing a post relating to some of the topics in that paper, but briefly: I think one potential objection (which I'm not sure I share, as I probably broadly agree with the spirit of your critique) is that even if we suppose that language adapted for communication, that on its own doesn't necessarily entail that you can't get thought from language. E.g., if we take a very simple account in which there's something called Thought, and then something called Language which is a "inter-subjective translation" of Thought, then while it's true that Thought/Language are separable for *humans*, it's in principle possible that Language reflects enough of the structure of Thought that a system (like an LLM) could acquire much of that structure by training on Language alone.

You might object that there are a set of generative mechanisms *powering* Thought that you can't necessarily deduce from Language. I think that's a useful line of reasoning to go down, but does require (in my view) enumerating in more detail what those mechanisms might be. I think the Mahowald paper you cited in the more recent paper (dissociating language and thought) does a nice job of trying to disentangle these things in LLMs (even if I tend to view them as more continuous, myself).

Expand full comment

Kevin Borg

Apr 26

Hi Sean,

nice to meet you and thank you for this herculean education on MI, which I had never heard of until I met you.

I am not a coder, just a layman from another time and that was tough work after 4+hrs, but I do have 1 quick question if that's ok.

Do you feel like you've stumbled into: "The Library of Babel" and are now wandering around "The Garden Of Forking Paths"?

After reaching the end, my first thought was, Sean has just opened up the proverbial "Borgesian Can of Infinite Worms" !

Sean, I wish you all the success in your journey, you are a better man than me my friend, and I do hope you find the suitable solution, thank you again for your excellent article and introduction.

Cheers Kev Borg (Kiwi) <3

Expand full comment

The Counterfactual

"Mechanistic interpretability" for LLMs…