How do organisms come to know about the world?
The prototypical case that first comes to mind probably involves some manner of associative learning: an organism learns that some objects or locations are associated with reward (e.g., food) while others are associated with danger (e.g., physical harm), and acts accordingly in the future. At a biological level, these associations are probably formed via changes in the brain. Neuroscientists are still working out precisely which changes and mechanisms subserve which kinds of associative learning, but the process likely involves changes in the connection strength between neurons (or populations of neurons).1
This process of associative learning, then, has updated the organism’s model of the world: we might even say this organism has acquired information about the world and that this information has effectively been distilled in the material changes that took place in the organism’s brain.2
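To make that idea of “updating a model” a bit more concrete, here is a minimal sketch of one classic formalization of associative learning, the Rescorla-Wagner rule. This is purely illustrative: the rule, the learning rate, and the trial sequence below are choices of mine for the sake of the example, not a claim about how brains actually implement learning.

```python
# A toy Rescorla-Wagner update: an "association strength" V between a cue
# (e.g., a location) and an outcome (e.g., food) is nudged toward the
# observed outcome on every trial. The learning rate and trial sequence
# are arbitrary illustrative choices.

def rescorla_wagner(trials, alpha=0.2):
    """trials: list of 1 (reward present) or 0 (reward absent)."""
    V = 0.0                              # current association strength (the "world model")
    history = []
    for outcome in trials:
        prediction_error = outcome - V   # surprise: what happened vs. what was expected
        V += alpha * prediction_error    # distill that surprise into the stored weight
        history.append(round(V, 3))
    return history

# Ten rewarded trials followed by five unrewarded ones:
print(rescorla_wagner([1] * 10 + [0] * 5))
```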
Yet associative learning is not the only mechanism by which information can be distilled, even if it’s one of the paradigmatic cases. Distilling information—alternatively, “learning” or “constructing a world model”—can happen in many different ways, at many different timescales. Some of these timescales are slower, as in natural selection: they happen over many generations, and require many organisms to have some experience or set of experiences in order to distill the relevant information. And others can be much faster: humans can learn many things through language without needing to experience the thing in question at all.
What all these mechanisms have in common is a reliance on some process of compressing the statistics of the world into some simpler form—whether that’s DNA, synaptic weights, or language.
Learning through selection
I still remember when I first understood the theory of natural selection in a non-teleological way. Up to that point, I generally thought of it in teleological terms: certain traits were adapted “for” certain functions, almost as if the long arc of evolution had some kind of purpose.3 But at some point in college, I began to think of the process in more material, mechanistic terms: first, there’s variation in traits within a population (caused by genetic variation); second, some of those variants increase the likelihood of reproduction in a given environment more than others; and thus, third, some traits will be more likely to propagate in that environment, while others will be more likely to disappear. There was no purpose4 or telos here—just the inexorable carving of furrows into a hillside.
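That three-step loop is simple enough to write down directly. Below is a toy simulation of it, with arbitrary numbers of my own choosing (a single heritable trait, an environmental “optimum”, noisy inheritance); it’s a sketch of variation plus differential reproduction, not a model of any real population.

```python
import random

# Toy selection loop: variation (mutation), differential reproduction
# (fitness depends on distance from an environmental optimum), propagation.
# All numbers here are arbitrary illustrative choices.

ENV_OPTIMUM = 0.8   # the "statistic of the environment" being distilled

def fitness(trait):
    return max(0.0, 1.0 - abs(trait - ENV_OPTIMUM))

population = [random.random() for _ in range(200)]   # initial trait variation

for generation in range(50):
    # Parents reproduce in proportion to fitness...
    parents = random.choices(population, weights=[fitness(t) for t in population], k=200)
    # ...and offspring inherit the trait with a little mutation (fresh variation).
    population = [min(1.0, max(0.0, p + random.gauss(0, 0.02))) for p in parents]

mean_trait = sum(population) / len(population)
print(f"mean trait after selection: {mean_trait:.2f} (environmental optimum: {ENV_OPTIMUM})")
```

Run it a few times: the population mean drifts toward the environmental optimum, which is the sense in which the statistics of the environment end up “compressed” into the population.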
To be sure, there’s considerable nuance that this conceptualization glosses over. Evolutionary biologists still debate the relative role of random drift vs. outright “selection” in shaping populations5, the appropriate unit of analysis when making biological claims, and much more.
The key point, however, is that there’s some mechanism by which the structure of the environment affects the content of an organism’s DNA and thus the structure of the organism itself. As per the section above, we might even say that information about the environment is in an approximate sense compressed in that organism’s genetics—not directly, of course, but statistically: over time, we’d presumably observe some correspondence between the statistics of the environment and the statistics of a population’s genetics. Cycles of natural selection thus involve a kind of “learning”. It’s just that, unlike the associative learning we observe within a single organism’s lifespan, selection operates across many lifespans.
You might be waiting for an example. On some level, if you endorse the theory of natural selection, most of the genome counts: after all, the claim is that the genetic distribution we observe at a given time point has to a large extent been shaped by reproductive success in an environment.6 But we can also be more specific. Spiders, for example, don’t have to “learn” how to build a web: as far as I understand, they’re pretty much born with that “knowledge”.7 Camouflage is another case: the very structure or appearance of an organism has been altered—slowly, over generations—in ways that enabled previous organisms to avoid danger. The honeybee waggle dance—a system for communicating about food sources—is also at least partly genetic (though also probably partly learned).8
If those examples seem dubious—maybe some of them are partly learned by individual organisms, after all—we could turn to other cases, like the very morphology of an organism (be it animal, plant, bacterium, or otherwise). It’s strange to think of our bodies and physiologies as encoding “information”, but in some loose, metaphorical sense they do. Plants adapted to different environments tend to have different leaf shapes, reflecting the relative need for photosynthesis (e.g., in low-light conditions) vs. the need for water conservation (e.g., in the desert). Deep-sea fish often produce their own light. And we’ve all unfortunately become familiar with the fact that viruses mutate, often in ways that help them evade detection by the immune system.
Perhaps it seems odd to call these examples “knowledge”. But there’s a structural correspondence between the processes involved here and what I described earlier in associative learning: the connection is so strong, in fact, that there’s a framework in Neuroscience called neural Darwinism, which casts neural changes in the language of selection. We might say, then, that something like “learning” has occurred through the process of natural selection.
Learning in the brain
Brains allow organisms to encode novel information about the environment that isn’t already encoded in their DNA. As I mentioned at the start, this is probably the prototypical case of “learning”. And brains are, of course, the product of selection too—not for specific, genetically determined traits, but for the capacity to acquire information about statistical contingencies an organism encounters.
This is, on its face, kind of a remarkable development. For one, it appears to be much more efficient: learning cycles can happen within, rather than across, lifespans. And more generally, it’s an adaptation towards flexibility. Organisms with brains don’t have to follow fixed, genetically determined patterns of behavior: connections in the brain can change depending on an organism’s experience, and those changed connections can in turn affect the organism’s behavior.
It probably feels less strange to call these changes “learning”. But I’d argue the difference from the section above is one of degree, not one of kind: neural connections are not, in the end, constitutive of the world itself—and the sense in which they “distill” information about the world is simply that there is some kind of statistical correspondence. At the same time, brains come with rich anatomical structure of their own, which we might think of as prewired “priors” about the structure of the world, and which is itself the consequence of distillation across generations.
What’s perhaps most remarkable about this process is that the complex circuitry of brains must be built according to the “instructions” present in our DNA, even though these instructions are much simpler. This is what’s sometimes called a “genomic bottleneck”:
However, the information capacity of the genome is orders of magnitude smaller than that needed to specify the connectivity of an arbitrary brain circuit, indicating that the rules encoding circuit formation must fit through a “genomic bottleneck” as they pass from one generation to the next.
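To get a feel for the size of that gap, here is a rough back-of-envelope calculation. The figures (roughly 3 billion base pairs, on the order of 10^14 synapses) are commonly cited ballpark numbers I’m supplying for illustration, not values taken from the quoted paper.

```python
import math

# Rough back-of-envelope for the "genomic bottleneck" (ballpark figures only).
GENOME_BASE_PAIRS = 3e9          # ~3 billion base pairs in the human genome
BITS_PER_BASE = 2                # 4 possible bases -> 2 bits each
genome_bits = GENOME_BASE_PAIRS * BITS_PER_BASE           # ~6e9 bits (under a gigabyte)

NEURONS = 8.6e10                 # ~86 billion neurons
SYNAPSES = 1e14                  # ~100 trillion synapses
bits_per_connection = math.log2(NEURONS)                  # bits to address one target neuron
wiring_bits = SYNAPSES * bits_per_connection              # ~3.6e15 bits for an explicit wiring diagram

print(f"genome capacity:          ~{genome_bits:.1e} bits")
print(f"explicit wiring diagram:  ~{wiring_bits:.1e} bits")
print(f"ratio:                    ~{wiring_bits / genome_bits:.0e}x")
```

Even with generous assumptions, the genome comes up several orders of magnitude short of specifying every connection, which is exactly why the “rules” it encodes must be compressed.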
Learning through culture and language
Organisms like humans can distill information not only in their brains but in their cultural practices. These practices reflect knowledge about the world. For example, a group dependent upon maize might learn over time that nixtamalization is important for preventing pellagra (a deficiency in niacin), and this culinary practice would thus be enshrined in that group’s typical maize preparations. Crucially, these practices can be transmitted between members of a social group or even across groups. Transmission can in turn occur through a number of means, including direct imitation, ritualistic practice, or linguistic instruction.
Let’s contrast this with the scenarios above, focusing on the case of “adapting” to maize consumption. At the level of biological evolution, a population of organisms in a maize-rich environment might—if they’re very lucky—develop mutations over time that give them the innate ability to unlock the tryptophan in corn proteins, thereby preventing pellagra. At the level of individual learning, a given organism would need to learn either that non-nixtamalized corn will lead to a niacin deficiency, or that you can treat the corn with an alkali to make it more suitable for consumption. Each of these scenarios is challenging for the same reason: pellagra develops over a long time period. That means that the typical mechanisms for selecting for or against genetic variants (i.e., reproductive success or death) are unlikely to work: an organism might live long enough to reproduce, develop pellagra, then die. Similarly, the mechanisms that enable individual learning are also unlikely to work: it will be hard for any given individual to learn to associate the consumption of maize with ill health—and even harder to independently innovate nixtamalization.
In fact, the anthropologist Solomon Katz proposed in a 1990 article that the more distant the consequences of consuming a given food item are from the act of consumption, the more a social group will rely on culturally evolved practices that transform the food appropriately:
The more ‘distal’ the food becomes, the more the optimal food transformational practices evolve in the realm of culturally mediated ritual, myth, and symbol (pg. 242).
Even if those transformational practices aren’t explicitly cast by the group itself in terms of nixtamalization or preventing pellagra, the argument is that they serve this broader function (in addition to fulfilling other social or cultural functions). Notably, this is why other groups might go wrong when they try to copy those practices: as Katz points out, there were outbreaks of pellagra when maize was brought back to the Old World—presumably because the people preparing it didn’t recognize that nixtamalization was an important part of the preparation process.9
In contrast, once your social group has developed a cultural practice, that practice can be shared, bypassing the need for each individual to innovate the practice themselves. The relevant information has already been distilled. From the perspective of “learning cycles”, this is clearly much more efficient than individual learning or natural selection. (Notably, however, the distillation process itself might still take a long time.)
Nixtamalization might seem like a niche example, but this kind of culturally-mediated knowledge is everywhere when it comes to humans. It’s arguably our defining trait: humans living in modern industrialized societies enjoy the accumulated fruits of thousands of years of technological development—from writing systems to refrigeration.
One of the most critical technologies—perhaps the most important—is language. Language is itself a cultural practice, but it also allows us to describe and transmit other practices, as well as convey information more generally. I can describe facts about the world and thereby teach you those facts: you don’t need to experience those things yourself10. I can also use language to teach you how to perform a novel action (like cooking a new recipe). Moreover, the very structure and content of language itself encodes information. For example, the category systems we use to describe the different animals or plants we encounter reflect the distinctions that are relevant to communication.
Put another way, cultural practices like language distill information about the world. In principle, then—and probably in practice—you can learn a lot about the world from language alone. This last point brings us to another, newer mechanism by which information can be distilled: large neural networks trained on some of the collective output of a social group.
How large language models distill knowledge
I’ve written quite a bit about how large language models (LLMs) work on this newsletter, so I won’t go into too much detail here. The key point is that they’re trained on large bodies of text sourced from the Internet (and in some cases, images or other media). These datasets are so massive they’re hard to fathom: LLMs like Meta’s Llama 3 are trained on trillions of word “tokens”, which (more or less) amounts to trillions of words.11 This training process involves building representations that make LLMs better and better at predicting which words are likely to appear in a sentence, such as:
The day was rainy, so the sky was cloudy and ____
As I noted in the section above, you can learn a surprising amount about the world from language alone. Even if you’ve never seen the sky or experienced rain, reading lots of sentences containing those words might allow you to predict a word like “gray” in this context—as opposed to words like “blue” or “red”, or even less plausible completions like “incorrigible”.
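If you want to see this concretely, here is a minimal sketch of how you might inspect a small open model’s next-word distribution for that sentence, using GPT-2 via the Hugging Face transformers library as a convenient stand-in. It’s an illustration of the prediction objective, not a claim about how any particular production LLM was trained or evaluated.

```python
# Minimal sketch: ask a small pretrained causal LM which words it considers
# likely after the example prefix above. GPT-2 is used as a stand-in here;
# this is illustrative, not a description of any specific LLM's pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prefix = "The day was rainy, so the sky was cloudy and"
inputs = tokenizer(prefix, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                  # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>12s}  p={prob.item():.3f}")
```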
Now, the question of whether the LLM knows that cloudy skies are likely gray is a thorny philosophical problem. But even if you don’t think LLMs are capable of knowing things, it seems reasonable to assert that this information is encoded somehow in the weights of the model. Even skeptical views of LLMs—like the notion that they’re “blurry JPEGs of the web”—assume that LLMs work by compressing their training data in some way. That means that if the information is present in the training data, there’s some chance it’ll survive that compression process.12 Language does not, of course, exhibit a one-to-one mapping with external reality: it’s an imperfect medium that reflects our own communicative and cognitive constraints. At the same time, the topology of language-space presumably bears some resemblance to the world we inhabit, and thus it’s also likely that the topology of LLM representations corresponds in some systematic (even if limited) way to the world.
This is by no means a novel insight. There’s a long research tradition demonstrating that you can infer a surprising amount about the structure of human perceptual and conceptual knowledge from the distributional statistics of language use. As I’ve written before, LLMs are useful “model organisms” for testing this hypothesis. Indeed, recent work in Cognitive Science—like this 2024 Scientific Reports paper—demonstrates that language models produce judgments about perceptual properties that match up to human judgments pretty well.13
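As a toy illustration of the underlying idea (a miniature example of my own, not the method used in the paper linked above): even raw co-occurrence counts over a handful of made-up sentences place words that share perceptual properties closer together than words that don’t.

```python
from collections import Counter
from math import sqrt

# Tiny hand-made "corpus" (illustrative sentences, not real data).
corpus = [
    "the rainy sky was gray and cloudy",
    "the cloudy sky looked gray all day",
    "the clear sky was blue and bright",
    "the bright sun made the sky blue",
    "bananas are yellow and sweet",
    "the ripe banana was yellow",
]

def context_vector(word, window=2):
    """Bag-of-words counts of neighbors within `window` positions of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

gray, blue, yellow = (context_vector(w) for w in ("gray", "blue", "yellow"))
print("gray vs. blue:  ", round(cosine(gray, blue), 2))    # both sky/weather words
print("gray vs. yellow:", round(cosine(gray, yellow), 2))  # different domains
```

The point is not that counting neighbors is how LLMs work, but that distributional statistics alone already carry a surprising amount of structure for a model to pick up on.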
Again, the question of whether language models truly understand these concepts is a deep one. But even a skeptical view about model capabilities would likely acknowledge the correspondence between the structure of model representations and the structure of the world. The map, to be sure, is not the territory: there are probably distortions and gaps all over the place—yet the map is still far from random.
A telos of ever-greater efficiency?
Learning, then—defined as the distillation of information about the world—happens across timescales and at varying levels of efficiency. One way of looking at the mechanisms I’ve described above is as a series of ever more efficient distillation methods. Readers inclined towards transhumanism (or fans of 2001: A Space Odyssey14) might be especially attracted to this viewpoint: a long chain of changes birthed human intelligence, which in turn can be seen as a vessel for the birth of even more efficient distillation systems.
There’s definitely something to that perspective. As I pointed out above, it’s certainly more temporally efficient to acquire certain information about the world through language or other cultural practices—rather than learning those things through personal experience or through the long, arduous process of natural selection. I made a very rough sketch of this argument in the figure below:

Yet temporal efficiency is only one criterion. We presumably also care about accuracy: given some level of representational detail, we want those details to map pretty systematically onto the territory. That is, we want to avoid biases that distort our picture of the world. Another consideration is that the world around us is dynamic, not static. That means today’s accurate representation might be inaccurate tomorrow. One solution to this is to have a flexible distillation mechanism—one that can quickly adapt to these changes. Failing that, we want to avoid overfitting to the idiosyncrasies of any particular environment. Instead, one might try to get by with “good-enough” representations that are relatively transferable across contexts.15 These good-enough representations might also be efficient from a storage perspective, i.e., they require less storage capacity.
How do the different mechanisms we’ve discussed stack up from this perspective? I certainly don’t claim to have the answers here; we’re dealing with constructs at varying levels of abstraction and with different evaluative criteria. But we can still try to piece together a vague picture.
From the standpoint of flexibility, frozen model weights don’t seem particularly flexible—though maybe a process for “continual training” would remedy that, assuming you can avoid issues like catastrophic interference.
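To illustrate what catastrophic interference looks like in miniature (a deliberately simplified sketch of my own, not a description of how any production model is updated): a single set of weights trained sequentially on two conflicting tasks drifts toward the second task and “forgets” the first.

```python
import numpy as np

# Toy demonstration of catastrophic interference: one linear model is trained
# first on task A, then on task B, whose input-output mapping directly
# conflicts with A. Tasks, learning rate, and epoch counts are arbitrary
# illustrative choices.
rng = np.random.default_rng(0)

def make_task(true_w):
    X = rng.normal(size=(200, 5))
    return X, X @ true_w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.01, epochs=500):
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w = w - lr * grad
    return w

w_A = rng.normal(size=5)
X_A, y_A = make_task(w_A)
X_B, y_B = make_task(-w_A)        # task B directly conflicts with task A

w = train(np.zeros(5), X_A, y_A)
print("task A error after learning A:", round(mse(w, X_A, y_A), 4))
w = train(w, X_B, y_B)            # sequential training on B, with no replay of A
print("task A error after learning B:", round(mse(w, X_A, y_A), 4))  # jumps back up
```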
Yet you also have to consider the range of inputs the system is exposed to. An organism that can see but not smell will learn different associations than an organism that can smell but not see. Our Umwelt—the way we experience the world—determines what we can and can’t learn. This applies most obviously to associative learning within organisms, but it also matters for something like training a neural network: a transformer trained only on Wikipedia will learn different representations than a transformer trained only on chat transcripts—let alone a transformer trained on images or video. My intuition (which might well be wrong) is that a system with a more diverse range of inputs can produce more stable and accurate representations. I wrote about that in a past post:
One way I think about the utility of multimodal information is as offering a kind of stability to our representations. If you only have one source of information about a particular stimulus, you’re more prone to catastrophic failure—i.e., if that input stream fails in some way, your representation of that stimulus is completely distorted. But multiple sources of information about the world offer redundancy, which could prevent that kind of catastrophic failure.
Ultimately, my goal here is not to rank these mechanisms according to what’s “best”. I don’t really think that’s even a coherent question: there are always trade-offs between different evaluation criteria—or sometimes even within a given criterion across levels of analysis. Rather, I hope to have presented a (somewhat) useful conceptual framework, which allows us to compare the mechanisms by which information is distilled and transmitted.
This process is not entirely unlike, and is in fact the inspiration for, the kind of parameter updates we see in artificial neural networks.
Note that this claim is agnostic to whether the information is true. Associations can be incorrect or misleading—they’re still associations.
It probably didn’t help that Darwin himself made the comparison to artificial selection (e.g., of birds), which does involve an intentional, agentive “designer”.
The removal of agency from these theories does raise questions about the origins of our own agency (if it exists). I recommend Jessica Riskin’s The Restless Clock for a very thorough accounting of the “passive” vs. “active” mechanistic theories of life.
“Neutral theory” holds that a nontrivial proportion of a population’s genetic changes occur as a result of random genetic drift, as opposed to being reproductively beneficial in a given environment; most mutations are “neutral”.
Depending on how many mutations you view as “neutral”, I suppose.
An obligatory caveat here: I’m certainly not an arachnid expert. There does seem to be some evidence for the genetic basis of web-weaving (e.g., this 1972 paper), though that doesn’t mean spiders’ decisions aren’t influenced by immediate features of the environment.
This is a case where the research appears to be rapidly evolving. Different populations of bees use different “dialects” (i.e., features of the code map differently onto meanings like the distance or direction of a food source), and as far as I understand, there’s some debate about the extent to which those are genetically influenced vs. learned. It does appear to be true that the quality of the dance improves early in a bee’s life, which suggests some learning mechanism.
This kind of overlooking of important details in culinary or agricultural practices is discussed at great length in Seeing Like a State.
Though of course, personal experience with something may well augment your understanding of it.
It’s hard to calculate exactly how many words a human encounters in their lifetime, but most estimates fall between 2M and 7M words a year—so somewhere between roughly 40M and 140M words by age 20. LLMs like Llama 3 thus encounter orders of magnitude more words than an individual human. This raises interesting questions about the relative sample-inefficiency of LLMs compared to humans, which I’ll write about in a future post.
It’s also entirely possible that this training process produces “new” information as a result of the compression—i.e., novel abstractions across training exemplars.
Models trained on language statistics also (to some extent) reproduce gender stereotypes, capture judgments about visual concepts, display sensitivity to mental states, and much more.
Or, for that matter, Thus Spoke Zarathustra.
This last criterion is where the notion of a “genomic bottleneck” might be particularly relevant: because the genome has to be simpler than a brain circuit, it enforces a kind of regularization mechanism, compressing all that complexity to something that retains only the most useful features.