Building inductive biases into LLMs
Language models are hungry for data—can we initialize their weights in ways that help reduce how much data they need?
Contemporary large language models (LLMs) typically need lots of training data: often hundreds of billions or even trillions of individual word tokens. This is part of what makes them “large”—along with the fact that the models themselves have many parameters.1 By contrast, humans probably encounter somewhere on the order of 150M words by age 20.
If you’re anything like me, it might be hard for you to conceptualize the difference between these numbers. A 2022 paper by Alex Warstadt and Sam Bowman (both of whom helped create the BabyLM challenge) illustrated these different scales visually, which you can see below. Alternatively, you can think about it in terms of percentages: although estimates of human language exposure vary, pretty much all estimates (~100M) fall below 0.1% of the number of words most contemporary LLMs are exposed to (>100B).

Modern LLMs have gotten very good at modeling language (as well as other tasks). But making LLMs this good seems to involve training them on >1000x as many words as humans encounter—along with additional fine-tuning or training, such as reinforcement learning from human feedback. From this perspective, LLMs look very inefficient—they’re hungry for linguistic data in a way that humans aren’t. Why is that?
Accounting for this gap would have at least two important implications.
The first is theoretical. Many debates in Cognitive Science hinge on the origins of various cognitive capacities, such as our fluency with language: are they innate or learned—and if they’re learned through experience, what kinds of experience matter? Humans, of course, are exposed to all sorts of input besides language, such as experience in the physical world; they’re also born with a neural system that’s been shaped by millennia of biological evolution. Figuring out which of these factors, when built into LLMs, helps them learn language more efficiently could inform those debates about humans.
The second is practical. Training LLMs on so much data costs a lot of money. From a financial perspective, AI companies would probably love to be able to produce capable LLMs more efficiently. This would also help academic researchers, who don’t currently have the resources to train models at a massive scale. A related concern is that in our efforts to scale models, we might eventually run out of training data: there are ongoing debates about whether we’ve hit “peak data” and, if so, what to do about it. If we could build models more efficiently, peak data would be less of a concern.
So what might help LLMs learn more efficiently? One promising approach starts with the notion of an inductive bias: maybe there’s a way to engineer LLMs—either in terms of their architecture or their initial weights—that makes them more well-suited to learning the statistical patterns of human language.
Inductive biases, briefly explained
Learning from experience involves drawing generalizations across individual experiences. The central challenge, however, is that there are often many generalizations that one could draw from a given sample of data, and not all of them are equally accurate. It can be helpful, then, to constrain the space of generalizations that one might make. That’s the crux of what an inductive bias is—a constraint.2
In a 2023 paper, computational linguists Aaron Mueller and Tal Linzen described an inductive bias as follows:
We use the term inductive bias to refer to the set of assumptions that a model relies on when generalizing to new data. Our usage includes any factor that leads the model to generalize in one way rather than another…
Learning the statistical patterns of language is a classic example of a generalization problem. Faced with examples of language in action, humans seem to draw accurate generalizations pretty easily. Among cognitive scientists, there’s considerable debate about whether and to what extent this indicates that humans possess an innate “module” tuned specifically for language learning. Regardless of one’s view on universal grammar, it’s true that humans are equipped at some level to learn language. It seems appropriate, then, to say that humans have some kind of inductive bias, which may or may not be tuned for language specifically. These “biases” are, in turn, the result of generations upon generations of biological evolution—and are likely “implemented” somehow in the structure of our brains and sensorimotor systems.
What about large language models?
LLMs are trained to make predictions about which words will occur in different contexts. When an LLM is first initialized, its predictions are typically very bad. That’s because its weights (or “parameters”) are usually initialized randomly—thus, its predictions will also be effectively random. But by observing more and more examples of how language works, an LLM can “update” its weights such that its predictions become more and more accurate. Crucially, this process of updating weights can be seen as drawing generalizations about the data. As Mueller and Linzen noted, the factors influencing these generalizations are equivalent to inductive biases:
Our usage [of “inductive bias”] includes any factor that leads the model to generalize in one way rather than another…
One example of such a factor could be a model’s architecture: past work (led by Tom McCoy) has found that different neural network architectures (LSTMs vs. GRUs) draw different kinds of generalizations. Additionally, that same Mueller and Linzen paper I referenced above suggests that the depth of a transformer model (i.e., the number of layers) may also influence the kinds of generalizations it draws. Choice of architectural properties, then, might constitute a choice of “inductive bias”.
More recent work, however, has focused on another factor: the model’s initial set of weights.
“Pre-pretraining” as a method for distilling inductive biases
Above, I mentioned that the initial weights of an LLM are typically randomized. The point of training the LLM is to update those weights from a random distribution of numbers to a different distribution that helps the LLM make better predictions.
Beyond that, there’s really nothing magical about training: it’s a convenient approach for arriving at a good set of model parameters. It would be even more convenient, of course, if that initial set of parameters wasn’t entirely random, but rather was “tuned” somehow to reflect the kind of inductive biases we think the LLM might need. That way, we might be able to train the LLM on fewer examples (and save lots of money and time). The challenge with doing that is we still lack a good understanding of how exactly the weights of an LLM produce the kinds of behaviors we want to see: they’re black boxes. If we knew exactly how it worked, we wouldn’t need to train the models—we could just analytically produce the optimal set of weights for the problem at hand!
That said, a number of recent papers have explored a variety of clever methods for producing a more reasonable—and less random—set of initial weights. One method is called “pre-pretraining”, which is a little confusing but does essentially describe how it works: if “pretraining” is the normal process of training a language model from a random set of weights3, then “pre-pretraining” is the training that goes on before that happens to make those initial weights a little less random. A closely related method is called “prior distillation”, which I might prefer as a term: a model’s “priors” are the set of assumptions (or inductive biases) it has about which generalizations are appropriate to draw, so “distilling” those priors means codifying them somehow in the set of initial weights (again, before the normal training process begins).
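To make that concrete, here is a minimal sketch of what the two-stage recipe might look like in PyTorch. The helpers (train_step, formal_language_batches, natural_language_batches) are hypothetical stand-ins, not any particular paper’s code; the point is simply that the only thing separating “pre-pretraining” from “pretraining” is which data the weights see first.

```python
import torch

def pre_pretrain_then_pretrain(model, train_step,
                               formal_language_batches, natural_language_batches):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stage 0: "pre-pretraining": nudge the random initialization toward
    # weights that encode useful assumptions (e.g., about recursive structure).
    for batch in formal_language_batches:
        train_step(model, batch, optimizer)

    # Stage 1: ordinary pretraining on natural language, starting from the
    # pre-pretrained weights rather than a fresh random draw.
    for batch in natural_language_batches:
        train_step(model, batch, optimizer)

    return model
```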
Building inductive biases into a model
To explain how this works in practice, I’ll focus on a 2023 paper by Tom McCoy and Tom Griffiths. Their goal was to “distill” the right inductive bias in a language model’s initial weights to enable it to learn from fewer examples. As a starting point, they defined the bias formally in Bayesian terms as a prior over possible formal languages. In their definition, a formal language is simply a set of strings defined by a formal, abstract rule:
For example, the set {AB, ABAB, ABABAB, ...} is a formal language defined by the expression (AB)+, meaning “one or more copies of AB.”
This might seem like a strange language, but it’s inspired by traditional approaches to modeling the grammatical structure of language. Many linguists believe that humans are very good—perhaps innately so—at learning recursive structures in particular4, such as the “tail recursion” exemplified by the (AB)+ rule. For example, if “A” here refers to a preposition and “B” to a noun phrase, then this formal language could capture (i.e., generate and/or parse) a construction like:
under the vase on the table in the library
McCoy and Griffiths defined a number of formal languages like the one above, then sampled examples from each of those languages. This part was relatively straightforward: the rules in each formal language are generative, meaning you can use them to generate examples. For example, the (AB)+ rule could be used to automatically generate hundreds or even thousands of examples that adhere to that language’s simple structure.
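Generating such a corpus takes only a few lines of code. Here is a minimal sketch for the (AB)+ example; the authors’ actual setup samples from a much richer distribution over probabilistic grammars, so treat this as illustration only.

```python
import random

def sample_ab_plus(max_copies=5):
    # "one or more copies of AB": pick how many copies, then concatenate
    n = random.randint(1, max_copies)
    return "AB" * n

corpus = [sample_ab_plus() for _ in range(1000)]
print(corpus[:3])   # e.g. ['ABAB', 'AB', 'ABABABAB']
```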
Finally, the authors trained a type of neural network—a long short-term memory network, or “LSTM”—on the examples they’d generated from each formal language. This part looked pretty similar to how language models are typically trained. One key difference was that the training data consisted of strings themselves generated by a formal grammar:
For example, if the sequence is ABA, it would be expected to first predict what the first token is (A); then to predict what the second token is (B) given that the first token is A; then to predict what the third token is (A) conditioned on the prefix AB; and finally to produce a special end-of-sequence token conditioned on the prefix ABA.
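In code, the prefix-to-target pairs for the string “ABA” described in that passage look something like the sketch below (the token ids and the <eos> symbol are illustrative choices, not the paper’s actual vocabulary).

```python
vocab = {"A": 0, "B": 1, "<eos>": 2}

def make_targets(string):
    ids = [vocab[ch] for ch in string] + [vocab["<eos>"]]
    # at step t, the model sees the prefix ids[:t] and must predict ids[t]
    return [(ids[:t], ids[t]) for t in range(len(ids))]

for prefix, target in make_targets("ABA"):
    print(prefix, "->", target)
# []           -> 0   (predict the first token, A)
# [0]          -> 1   (given A, predict B)
# [0, 1]       -> 0   (given AB, predict A)
# [0, 1, 0]    -> 2   (given ABA, predict <eos>)
```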
The other key difference was that the authors used something called model-agnostic meta-learning (or “MAML”). Instead of training the model to learn one particular formal language, MAML helps the model learn how to learn many different formal languages more easily. More precisely: the model is first trained on one of the formal languages in question, and then evaluated. The errors it makes help sculpt a new set of parameters, which are used as the initial parameters when training another model on a different formal language. This process is repeated for all of the formal languages. Over time, it yields a set of initial parameters that enables a model to learn any of the formal languages more efficiently.
If MAML is successful, this initialization should encode an inductive bias that enables the model to learn any language in our distribution from relatively few examples. Because we control the distribution of languages, we also control the inductive bias that is meta-learned. We refer to a neural network that has undergone inductive bias distillation as a prior-trained neural network because it has been given a particular prior via training.
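For intuition, here is a first-order sketch of that meta-learning loop in the style of Reptile, a simplified relative of MAML, rather than the paper’s exact procedure. The helpers make_lstm, sample_language, and language_loss are hypothetical.

```python
import copy
import torch

def meta_learn_init(make_lstm, sample_language, language_loss,
                    meta_steps=1000, inner_steps=5, inner_lr=0.01, meta_lr=0.1):
    meta_model = make_lstm()                         # the shared initialization
    for _ in range(meta_steps):
        language = sample_language()                 # draw one formal language
        learner = copy.deepcopy(meta_model)          # start from the current init
        opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                 # adapt to this one language
            loss = language_loss(learner, language)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            # nudge the shared initialization toward the adapted weights
            for p_meta, p_task in zip(meta_model.parameters(), learner.parameters()):
                p_meta.add_(meta_lr * (p_task - p_meta))
    return meta_model   # an initialization that adapts quickly to new languages
```

The key move is the last step: the shared initialization is pulled toward whatever weights worked well for each sampled language, so over many iterations it becomes a good starting point for all of them.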
These “prior-trained” neural networks are—perhaps unsurprisingly—much better at learning formal languages rapidly from data. More striking is the finding that they’re also faster at learning a natural language (English). That suggests that this method for identifying an initial set of parameters is pretty effective. In short, the authors successfully “distilled” an inductive bias into the model by first training it on formal languages.
Other papers have yielded similar results. For example, this very recent 2025 paper led by Gianluca Bencomo found that this meta-learning procedure produced models that rapidly learned natural language—and in fact, it substantially reduced other differences previously observed between model architectures and data representations. Put another way: the set of weights a model is initialized with matters just as much as (if not more than) the model’s architecture. Another recent paper led by Michael Hu found that “pre-pretraining” on formal languages was actually more effective than pre-pretraining on natural language. That suggests there’s something particularly important about the structure of those formal languages that helps produce a set of initial weights especially tuned for efficient language learning.5
Curriculum learning by another name?
Back in the early 1990s, the UC San Diego cognitive scientist Jeff Elman—one of the pioneers of recurrent neural networks (RNNs)—published a paper called Learning and development in neural networks: the importance of starting small. Elman’s key question in that paper was very much of a kind with the questions explored above: how do human children learn the structures of language? Like the papers of today, he tried to address this question using neural networks as a model of human learning (there really is nothing new under the sun).
The neural networks of that time were much smaller and much less powerful than ChatGPT. When trained from the start on sentences with complex hierarchical structure (e.g., “The dog [that the cat chased] barked”), they failed to learn the appropriate statistical and grammatical rules. Interestingly, however, when Elman manipulated their initial “memory capacity”—creating a kind of bottleneck in the initial stages—they ended up learning the appropriate structures more incrementally and more successfully. They “started small”, in other words, and in doing so, gradually learned the kinds of hierarchical structures that a fully “adult” network was unable to acquire. In the abstract, Elman wrote:
This result suggests that rather than being a limitation, developmental restrictions on resources may constitute a necessary prerequisite for mastering certain complex domains. Specifically, successful learning may depend on starting small.
Elman’s focus was on restricting the capacities of the network itself. In doing so, he helped introduce the idea that manipulating a network’s properties over the course of learning might be beneficial. A similar idea was eventually explored by manipulating properties of the input instead, e.g., starting with simpler examples and gradually progressing to more complicated ones. This idea was popularized in a 2009 paper led by Yoshua Bengio called Curriculum Learning.
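As a rough sketch of the curriculum idea (not Bengio et al.’s exact setup), one might order training examples by a crude difficulty proxy such as length and feed them to the model in stages; examples and train_on here are hypothetical.

```python
def curriculum_train(examples, train_on, n_stages=3):
    # shorter sentences stand in for "simpler" ones; real curricula use
    # richer difficulty measures (vocabulary rarity, syntactic depth, etc.)
    ordered = sorted(examples, key=lambda ex: len(ex[0]))
    stage_size = max(1, len(ordered) // n_stages)
    for stage in range(1, n_stages + 1):
        # each stage re-trains on everything seen so far, plus harder examples
        train_on(ordered[: stage * stage_size])
```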
Since its introduction, “curriculum learning” has fallen in and out of favor. Increasingly large foundation models like the GPT series pointed to the possibility that, with enough scale, models might be able to learn anything regardless of how the data were presented. But for both theoretical and practical reasons, there’s been a renewed push towards efficiency, in terms of both model and data scale (see Tim Lee’s reporting for more details on this trend).6
That push is exactly what’s led to the kind of research I described above on inductive biases. “Pre-pretraining” seems to me to be a kind of curriculum learning: a network is first trained on one kind of data (e.g., a formal language), then another (e.g., natural language). What’s particularly interesting about these results is that pre-pretraining on formal languages actually yields better results than simply pre-pretraining on more natural language data. That suggests there’s something important about exposing neural networks to the constrained, structural regularities of strings generated by explicit symbolic rules—exposure that helps them “bootstrap” their way into the trickier idiosyncrasies of actual human language.
Where you go depends on where you start
Humans, of course, aren’t “pre-pretrained” on explicit formal languages. But one view of the human language faculty—most associated with Noam Chomsky—is that humans are born with a certain proclivity for learning the kinds of grammatical structures observed in human language. Presumably, some combination of biological mutations and selective pressures ended up shaping this faculty.7 From this perspective, pre-pretraining on formal languages could be viewed as a kind of rough functional approximation of those forces: not at the implementational level, but in the sense that it (maybe) results in a similar kind of inductive bias.
That does raise a few really interesting theoretical questions, some of which happen to have practical implications.
First, is there a way to actually simulate those pressures somehow? This would require building hypotheses about the kinds of forces early hominids (or pre-hominids) were subject to, operationalizing those hypotheses computationally, then applying them to neural networks. The closest I’ve seen to this approach is the paper on genomic bottlenecks I’ve written about previously, which focuses more on the importance of compressing architectural information in a system with a smaller information capacity (the genome). I’m not sure exactly what it would look like to simulate those other forces.
Second, what other kinds of priors could be “distilled” into the network, and how might they help? It seems like priors about the structure of strings help models learn language. But I tend to be more interested in meaning. Is there some way to enrich word embeddings—or network weights—with information beyond that encoded by language statistics? I’m not just referring here to vision-language models, which explicitly involve additional training and aren’t necessarily very efficient. Rather, maybe there’s a way to codify core principles of embodied experience in the initialized weights of a network, and maybe those principles could help bootstrap language learning. There’s been some work along those lines, but I’ve yet to see it applied through the lens of inductive biases.
Third, is there some way to skip the distillation step altogether? This sounds a bit like cheating: of course it would be nice to magically initialize a neural network with the right weights, but what are the chances of that? In the machine learning literature, there’s a concept called the “lottery ticket hypothesis”:
Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
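As a rough illustration, one round of the “winning ticket” recipe, magnitude pruning followed by rewinding to the original initialization, might look like the sketch below. The model and train_fn arguments are hypothetical stand-ins, and the real procedure iterates this pruning several times.

```python
import copy
import torch

def find_winning_ticket(model, train_fn, prune_fraction=0.2):
    init_state = copy.deepcopy(model.state_dict())    # remember the "lottery draw"
    train_fn(model)                                    # train the dense network

    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:                            # prune weight matrices only
            k = max(1, int(param.numel() * prune_fraction))
            threshold = param.abs().flatten().kthvalue(k).values
            masks[name] = (param.abs() > threshold).float()

    model.load_state_dict(init_state)                  # rewind to the initialization
    with torch.no_grad():
        for name, param in model.named_parameters():   # zero out the pruned weights
            if name in masks:
                param.mul_(masks[name])
    return model, masks    # retrain with the masks held fixed to test the "ticket"
```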
I think it’s also related to the concept of knowledge distillation: the idea that once you’ve trained a really big model, you can “distill” its information into a smaller, more efficient model—even though that smaller model would’ve struggled to learn that information on its own.
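A minimal sketch of a standard distillation loss: the teacher’s logits are softened with a temperature and the student is trained to match them via KL divergence. The temperature value here is illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # soften both distributions so the teacher's relative preferences among
    # "wrong" answers remain visible to the student
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2
```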
Both concepts point to the following insights: first, that training is a convenient way to get to the weight matrices you want; and second, that scale is helpful primarily in that it gives you a larger space of “hypotheses” to search from as you’re looking for those weight matrices. The inductive bias literature suggests there might be some functional principles we can use to guide this search: some initializations are better than others, and we can identify those initializations by “pre-pretraining” on formal languages. What’s missing—at least to my knowledge—is a set of mathematical principles that specify exactly why. What makes those initializations better suited to learning what we want the networks to learn? Figuring that out would, in my view, be a huge step towards understanding the mechanistic principles by which language models learn to do what they do.
Often on the order of hundreds of billions, these days.
This is also consistent with how the term “bias” is used in machine learning and statistical modeling more generally: the bias of a statistical model is the set of assumptions that a model has about the form data can take. For example, linear regression makes a very strong assumption about the data—namely, that the relationship between X and Y is linear—and thus exhibits high “bias”.
As a side note, I’ve always found it a little funny that this is the term—it’s just training! The reason is historical: “pre-training” was a thing we did to produce a large foundational language model, which could then be further trained (or “fine-tuned”) for more domain-specific tasks. Again: it’s all training, but “pre-” gestured at the stage before applying the model to a researcher’s specific problem.
Hence the paper’s title: “Between circuits and Chomsky”.
See also this article on Understanding AI on the end of pretraining as we know it.