# Towards an unambiguous notation for LLMs
Names like "o1-preview" are pretty vague—can we do better?
ChatGPT is far from the only conversational language model in town: the number of pre-trained large language models (LLMs) available for both commercial and academic use is larger than ever, and it keeps growing. Even within the set of models trained by OpenAI, there’s (among others) GPT-4o, o1-preview, and GPT-4. Beyond OpenAI, we have “Claude” in its various incarnations (Claude 3 Opus, Claude 3.5 Sonnet, etc.), multiple versions of Meta’s Llama-3, and of course hundreds of thousands of open-source models available on HuggingFace.
As others have noted, these names aren’t always particularly interpretable, nor do they necessarily pick out unique referents. For example, “GPT” stands for “generative pre-trained transformer”. This is presumably[1] an accurate description of the GPT family of models, but pretty much all well-known LLMs these days are “generative pre-trained transformers” in some manner of speaking. The name “Claude”, of course, is even more opaque about the model in question, evoking instead the persona Anthropic has worked to create.
For commercial models, it makes sense that companies wouldn’t want to give away too many details about model architectures or training data. But from the standpoint of academic research, the two of us (Sean and Pam) still feel that it would be useful to have more precise terminology for the hundreds of thousands of open-source models whose architectural details are available—but which typically require digging through technical reports to figure out.
Such a notation would necessarily be technical, but it should also be informative and interpretable to the trained eye.
## A brief analogy to chemistry
One way to think about this is to analogize it to other domains that do have more precise terminology and notational practices.
A good example is chemistry. Molecular formulas explicitly indicate the elements present in a compound, as well as their respective quantities. This allows trained practitioners to “read off” information about compounds in a relatively unambiguous way: the notation “H2O” indicates that the compound in question has two hydrogen atoms and one oxygen atom.
Of course, a molecular formula isn’t entirely unambiguous. Isomers, for example, share the same molecular formula (the same elements in the same quantities), but their atoms are arranged differently in space. Thus, chemists also use other notations (e.g., structural formulas) to refer to compounds depending on exactly which information they’re trying to convey.
Still, even standard molecular formulas are much more precise than the names we use for LLMs.
## What should an LLM notation convey?
Designing a good notation is a communication problem. Good notations should convey relevant and useful information for picking out unique referents in some conceptual space—and they should do so systematically, such that someone trained in the notation can easily read off that information. As we’ve seen with molecular formulas, this notation may not be perfectly unambiguous, but it should convey the kind of information about LLMs that researchers and engineers need to know. But what information is that?
LLMs vary along a number of dimensions, including features of their architecture and their training data. Here’s a graphic illustrating some of that variance, taken from a previous post:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F282359bd-206d-4998-9964-6aa6c2404ed1_1566x524.png)
To simplify the problem, let’s start by focusing on architectural features. Most LLMs are transformer models, which use (among other things) self-attention to make predictions about tokens. When making these predictions, some models only look at preceding context (i.e., auto-regressive models: “salt and ___”), while others can look at both “sides” of a masked token (i.e., bidirectional models: “salt ___ pepper”). Thus, a good notation should probably indicate whether an LLM is a transformer model (as opposed to, say, an LSTM), and whether it is auto-regressive or bidirectional.
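To make that distinction concrete, here’s a minimal sketch (assuming the HuggingFace transformers library and two small public checkpoints, gpt2 and bert-base-uncased) contrasting left-to-right prediction with masked, bidirectional prediction:

```python
# Minimal sketch: auto-regressive vs. bidirectional prediction,
# using two small, publicly available checkpoints.
from transformers import pipeline

# Auto-regressive: predict a continuation from preceding context only.
generator = pipeline("text-generation", model="gpt2")
print(generator("salt and", max_new_tokens=1))

# Bidirectional: predict a masked token using context on both sides.
filler = pipeline("fill-mask", model="bert-base-uncased")
print(filler("salt [MASK] pepper"))
```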
Another architectural feature is a model’s size. The coarsest way to denote this is by the overall number of parameters. Some models already include this information in their names, like the Pythia suite of models: “pythia-14m” refers to a model with 14 million parameters, while “pythia-70m” refers to a model with 70 million parameters.
That’s already much more precise than many model names, and maybe it’s all most people need to know. But for some researchers (like ourselves), it’s also useful to know more granular information about those parameters. That “14m” comes from a combination of factors, including (but not limited to): the size of the model’s vocabulary (50304); the size of the embeddings used to represent each word (128); the number of hidden layers in the model (6); the number of attention heads at each layer (4); and the size of the feedforward network at each layer (512). Fortunately, the researchers at EleutherAI have summarized all the relevant information about their open-source models in a paper, and it’s also all available on the HuggingFace model hub.
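For readers who want to check those numbers themselves, here’s a minimal sketch (assuming the HuggingFace transformers library) that fetches just the configuration file for this checkpoint, without downloading any model weights:

```python
# Inspect the architectural details of pythia-14m by fetching only its config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/pythia-14m")

print(config.vocab_size)           # 50304: vocabulary size
print(config.hidden_size)          # 128: embedding size
print(config.num_hidden_layers)    # 6: hidden layers
print(config.num_attention_heads)  # 4: attention heads per layer
print(config.intermediate_size)    # 512: feedforward network size
```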
How, then, might we refer to the “pythia-14m” model in terms of its architecture? A notation might convey at least the following information:
Transformer (T), auto-regressive (A), vocabulary size of 50304 (V50304), embedding size of 128 (E128), six hidden layers (L6), four attention heads at each layer (A4), and a feedforward network size of 512 (FFN512).
So maybe something like:
T-A-V50304-E128-L6-A4-FFN512.
It’s not exactly easy reading, but it has the advantage of being precise and of allowing pretty rapid (and accurate) comparison to other model specifications. Pythia-70m has the same architecture, but it’s bigger (70M parameters):
T-A-V50304-E512-L6-A8-FFN2048.
This notation gives us quick insight into how the models are similar and how they’re different. They’re both auto-regressive transformer models with six hidden layers and the same vocabulary size, but the bigger model has larger embeddings, more attention heads at each layer, and a larger feedforward network. In principle, pythia-70m could’ve achieved its larger size with more layers but the same number of attention heads per layer, so it’s useful to know where that 70m comes from.
As we mentioned above, this information can be found in the original paper on the Pythia suite, and it’s also easy to access via an LLM’s “config” attribute in Python if the LLM is loaded with the HuggingFace transformers package. But not all papers make this information as clear as the Pythia paper does, and loading the full model in Python takes time (and compute resources) that not every researcher or journalist has.
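As an illustration of how lightweight that lookup could be, here’s a hypothetical helper of our own devising that builds the proposed notation string directly from a HuggingFace config. It assumes a decoder-only transformer with GPT-NeoX-style attribute names, so other architecture families would need their own field mapping:

```python
# Hypothetical sketch (not a standard tool): turn a HuggingFace config into
# the notation proposed above. Assumes a decoder-only ("T-A") transformer
# whose config uses GPT-NeoX-style attribute names.
from transformers import AutoConfig

def architecture_notation(model_id: str) -> str:
    cfg = AutoConfig.from_pretrained(model_id)
    return (
        f"T-A-V{cfg.vocab_size}-E{cfg.hidden_size}"
        f"-L{cfg.num_hidden_layers}-A{cfg.num_attention_heads}"
        f"-FFN{cfg.intermediate_size}"
    )

print(architecture_notation("EleutherAI/pythia-14m"))  # T-A-V50304-E128-L6-A4-FFN512
print(architecture_notation("EleutherAI/pythia-70m"))  # T-A-V50304-E512-L6-A8-FFN2048
```

In practice, the hard part would be agreeing on a mapping from each architecture family’s config fields to the notation’s axes, not the string formatting itself.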
## What about the data?
Above, we introduced a potential notation for referring to particular LLMs in terms of several important architectural features. But this notation is by no means exhaustive. It doesn’t include a number of other architectural details, such as how a model’s residual connections are implemented. And it also doesn’t include any information about a crucial part of how LLMs are trained: the data.
LLMs are often trained on billions of words (or technically “tokens”) sourced from various text corpora. When training an LLM, researchers have to select an appropriate tokenization protocol (e.g., byte-pair encoding) to break up strings of text into manageable chunks. Ideally, researchers should also have some sense of what’s contained in those text corpora. At minimum, we should know how many words (or tokens) an LLM was trained on, since this makes a difference for LLM performance. More ambitiously, it would be useful to know what languages and language varieties constitute the corpora in question: what registers, in which dialects, from which time periods?
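To make the word/token distinction concrete, here’s a small sketch (again assuming the transformers library) using GPT-2’s byte-pair-encoding tokenizer; any BPE tokenizer would illustrate the same point:

```python
# Byte-pair encoding in action: break a string into subword tokens and count them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Salt and pepper are common seasonings."
tokens = tokenizer.tokenize(text)

print(tokens)       # the BPE chunks the string is broken into
print(len(tokens))  # the token count, which is what training budgets are measured in
```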
It’s beyond our expertise to introduce a notation for linguistic data. But linguists—particularly those working with corpora—already use a variety of codes to describe languages (e.g., ISO language codes), language varieties (e.g., Glottolog codes), and text corpora (e.g., “en-GB-News”[2]). In principle, these codes could be combined as needed to create precise but interpretable labels for the bodies of text that LLMs are trained on.
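As a purely hypothetical illustration (none of these labels are a standard, and the field names are our own), such codes could be composed into a compact label for a training corpus:

```python
# Hypothetical sketch (our own invention, not a standard): compose an ISO 639-3
# language code with made-up register, time-span, and size fields to label a corpus.
def data_notation(language_iso: str, register: str, years: str, token_count: str) -> str:
    return f"{language_iso}-{register}-{years}-{token_count}tok"

# e.g., a fictional corpus of English news text from 1990-2020, roughly 300B tokens:
print(data_notation("eng", "News", "1990-2020", "300B"))  # eng-News-1990-2020-300Btok
```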
The main obstacle to this, unfortunately, is that we don’t always know this information about the text corpora used to train LLMs. This is obviously true for proprietary models (e.g., those trained by OpenAI). But it’s also hard to figure out even for open-source models: the datasets are so large that we often lack fine-grained information about exactly what’s included.
That doesn’t mean it’s impossible to figure out, however: we just need more dedicated work on actually characterizing these massive datasets in terms of theoretically relevant parameters.
## A marriage of architecture and data
We began this short essay with the observation that it would be useful to be able to refer to specific LLMs with some kind of meaningful, precise notation. Names like “Claude” or “BERT” are fine for casual conversation, but they don’t tell you as much as one might like to know about the models in question.
In our view, properties of an LLM’s architecture and training data seem like important parts of a good LLM notation. For example, two LLMs trained on exactly the same data but with slightly different architectures could be compared at a glance using the notation we proposed (bolding indicates axes along which the models differ):
(1) T-A-V50304-**E128**-L6-**A4**-**FFN512**
(2) T-A-V50304-**E512**-L6-**A8**-**FFN2048**
As we mentioned above, this notation isn’t exhaustive. It’s entirely possible that researchers in the field would find a different set of model properties more useful to include. Our goal in writing this essay was not to persuade practitioners to adopt specifically this notation.
But it’s still worth asking—especially because some readers are probably thinking it—who exactly is this notation for? To many people, these details are irrelevant or just too technical to understand. And, of course, a number of state-of-the-art models (like those trained by OpenAI or Anthropic) could never be described this way because their details have not been released.
We think there are a couple of target audiences.
The first, and most obvious, is researchers in the field. While it’s possible to glean this information from technical reports or model config files, it’s not always easily accessible—accompanying more colloquial names (like “BERT”) with this more precise notation would make figuring it out much easier. And that information, in turn, is crucial for doing the kind of LLM-ology the field so badly needs: careful empirical work that connects a model’s architecture and data to its performance. It’s possible a notation like this would even encourage researchers to think about these properties more carefully and systematically.[3]
The second group is harder to define, but includes what we think of as “people interested in AI developments”. Not all of these people will want to get into the weeds. But some do, and—as we suggested above—a clear, systematic notation can sometimes be a useful cognitive tool for gaining conceptual purchase on a new domain. The mere process of learning what these axes refer to (“L6”, “A4”, etc.) requires a kind of on-the-ground, mechanical understanding of what LLMs are and what they’re built to do.
It’s quite possible (even likely) that the notation we’ve cursorily suggested isn’t the right one—but we hope it’s a start.
[1] Since OpenAI’s models are proprietary, we can’t be strictly certain about this.
[2] This particular abbreviation indicates that the corpus contains English (en) text from Great Britain (GB), drawn from news sources.
[3] We can certainly speak to this experience: in a recent paper, we compared a number of pre-trained Spanish language models in their ability to predict human judgments, and it was at different points frustrating and enlightening to learn how many axes those models varied along.