GPT-4 is "WEIRD"—what should we do about it?
When we use LLMs as "model organisms", which humans are we modeling? And how can we overcome the problem of unrepresentative data?
How do we know whether an empirical claim is “valid”?
One important criterion is external validity. Briefly, “external validity” asks whether the results of a given study generalize beyond the specific sample or test environment you studied. In most scientific fields, it’s very difficult—if not impossible—to study the entire population you’re interested in. Thus, researchers usually rely on a sample drawn from that population and try to draw conclusions about the population based on the behavior of that sample. In order for those generalizations to be valid, however, the sample should be random and representative.
As I’ve written before, this often isn’t true in practice. In psychology, for example, samples are disproportionately drawn from “WEIRD” populations (Western, educated, industrialized, rich, and democratic), which jeopardizes the generalizability of findings obtained from those samples. Psychology and related fields are trying to rise to the challenge of obtaining more representative samples, but it’s a hard problem and there’s still a long way to go. Indeed, it’s not even clear to me what a truly “random” sample would mean—for many questions in social science, cultural variation is so substantial that the idea of an “average human” simply doesn’t make much sense.
More recently, a similar problem has been raised with respect to large language models (LLMs). LLMs are sometimes used as “model organisms” for humans. But we know LLMs are not trained on a fully representative sample of text. Thus, which humans does a given LLM “represent”?
Model organisms, in brief
In biology, a model organism is a non-human organism used as a model for the study of a particular phenomenon that’s difficult or unethical to study in humans. For example, fruit flies have famously been used in a number of genetics experiments, and rodents are frequently used to study neural function. There are lots of benefits to using model organisms: we can learn valuable things to improve human society (e.g., develop vaccines or other treatments, etc.) without the cost of experimenting on humans.
But there are costs and risks too. Even ignoring the ethical quandaries of experimenting on non-human organisms that can’t consent, there’s the issue of external validity. To what extent do findings obtained in a sample of fruit flies or rodents generalize to humans? The answer isn’t always clear. Moreover, there might be bias even within the sample of model organisms being used for research. For example, biologists have traditionally used male rats as test subjects, which has led to issues when generalizing these results to humans. (The NIH now requires the use of both sexes in research on vertebrates unless the use of only one sex is justified.)
As I’ve written before, similar trade-offs arise when we use LLMs as model organisms. On the one hand, LLMs are useful for asking questions we can’t ask of humans. Because they’re (usually) trained on text alone, they’re useful for asking what kinds of behaviors and capacities can emerge from exposure to purely linguistic input—a central question in Cognitive Science. We can also intervene on LLMs more easily (and more ethically!) than we can on humans, allowing us to study the underlying representations that a system forms to process and predict language.
On the other hand, most LLMs are trained on biased samples of text. This is true on multiple levels. First, the largest LLMs are trained primarily on English, which isn’t representative of the world’s languages—just as no language is representative of the remarkable diversity across languages and language families. Even within English, LLMs are trained mostly on written English on the Internet, which is not representative of the diversity even within speakers of English.
And second, LLMs are exposed to a biased sample of cultural practices and values. This is true both for “pre-training” (the initial process of updating LLM weights using next-token prediction) and for the processes used to make LLMs like ChatGPT “behave” (e.g., reinforcement learning from human feedback, or RLHF). This bias in the behavior they’re exposed to, and in how they’re fine-tuned, should have predictable consequences for how LLMs themselves behave.
This, again, is not wholly surprising: different cultures are different, after all, and systems trained on behavior generated disproportionately by a subset of those cultures will reflect that subset’s values more so than the values of cultures that were omitted from (or at least under-represented in) the training data. And as the work of Joe Henrich and others demonstrates, variation in cultural practices is also correlated with variation at the level of individual psychology (e.g., moral decision-making, economic behavior, etc.).
LLMs are WEIRD
A recent preprint (Atari et al., 2023) makes this point concisely:
If culture can influence fundamental aspects of psychology, then the question is not really whether or not LLMs learn human-like traits and biases; rather, the question may be more accurately framed in terms of which humans LLMs acquire their psychology from.
Given what we know about their training data, then, we should expect most LLMs to behave more like individuals from WEIRD cultures. Is that true?
To address this question, Atari et al. (2023) used the World Values Survey (WVS), a dataset compiling attitudes towards justice, morality, corruption, migration, and other issues across the world. First, they presented key questions from the WVS to ChatGPT1; for each question, they obtained 1000 responses, to capture the variance in GPT’s behavior. Then, using a couple of different methods, they asked how correlated GPT’s responses were with the aggregate behavior from countries around the world.
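To make that sampling step concrete, here’s a minimal sketch of what repeatedly querying a model for one survey item might look like. This is not the authors’ code: the item wording, model name, and sampling settings are placeholders I’ve chosen for illustration, assuming the standard OpenAI Python client.

```python
# A minimal sketch (not the authors' code) of sampling many responses to one
# WVS-style item. The item text, model name, and settings are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ITEM = (
    "How important is religion in your life? Answer with a single number "
    "from 1 (very important) to 4 (not at all important)."
)

def sample_responses(item: str, n_total: int = 1000, model: str = "gpt-4") -> list[str]:
    """Collect n_total independent completions for a single survey item."""
    responses = []
    batch = 50  # the API limits how many completions one request can return
    for _ in range(n_total // batch):
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item}],
            n=batch,
            temperature=1.0,  # keep sampling stochastic so variance is visible
        )
        responses.extend(choice.message.content.strip() for choice in out.choices)
    return responses

answers = sample_responses(ITEM)
print(Counter(answers).most_common())  # distribution over the 1-4 scale
```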
One way to do this is to use hierarchical clustering, which tries to identify “clusters” of individual countries that GPT’s responses are closest to. Here’s what they found using this analysis:
Holistically taking into account all normalized variables, GPT was identified to be closest to the United States and Uruguay, and then to this cluster of cultures: Canada, Northern Ireland, New Zealand, Great Britain, Australia, Andorra, Germany, and the Netherlands. On the other hand, GPT responses were farthest away from cultures such as Ethiopia, Pakistan, and Kyrgyzstan.
In a related analysis, the authors asked whether a country’s cultural distance from the United States predicts how similar that country’s responses are to GPT’s. That is, are countries that are culturally closer to the USA also closer to GPT in terms of their values? The answer, pretty unequivocally, was “yes”. The correlation between a country’s similarity to GPT and its cultural distance from the United States was r = -0.7, which is generally seen as a pretty strong correlation in social science. This GPT:country similarity was also related to other variables, such as the UN’s Human Development Index (r = 0.85), GDP per capita (r = 0.85), and the proportion of individuals using the Internet (r = 0.69).
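Here’s a toy sketch of what those two analyses might look like in code, assuming you’ve already computed a matrix of country-level mean responses plus GPT’s mean responses. The data and variable names are made up for illustration; the preprint describes the authors’ actual pipeline.

```python
# Toy illustration (with fake data) of the two analyses described above:
# (1) hierarchical clustering of countries + GPT, and (2) correlating each
# country's similarity to GPT with its distance from the United States.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
countries = ["United States", "Netherlands", "Ethiopia", "Pakistan"]
X = rng.normal(size=(len(countries), 20))  # country x item mean responses (fake)
gpt = rng.normal(size=20)                  # GPT's mean response per item (fake)

# (1) Which cluster of countries does GPT fall into?
Z = linkage(np.vstack([X, gpt]), method="ward")
print(fcluster(Z, t=2, criterion="maxclust"))  # last entry is GPT's cluster

# (2) Is similarity to GPT predicted by cultural distance from the US?
sim_to_gpt = np.array([pearsonr(row, gpt)[0] for row in X])
dist_from_us = np.linalg.norm(X - X[0], axis=1)  # crude stand-in for cultural distance
r, p = pearsonr(sim_to_gpt, dist_from_us)
print(f"r = {r:.2f}")  # the paper reports roughly r = -0.7 with its real measures
```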
GPT also displayed a WEIRD bias in its “thinking style”. There’s no universal consensus on what “thinking style” means, but measures of thinking style tend to focus on two broad categories: “analytic” (grouping items in terms of their abstract categories) and “holistic” (grouping items in terms of their contextual or functional relationships). For example, if presented with pictures of three items—hair, beard, and shampoo—an “analytic” response might group hair and beard together, while a “holistic” response might group hair and shampoo together.2 Past work has found that WEIRD participants tend to behave more “analytically”, while non-WEIRD participants tend to behave more “holistically”. Sure enough, GPT responds “analytically” to this task, performing most similarly to respondents from Finland and the Netherlands.
The authors then prompted GPT with a variant of a task used to assess “self-concept”, i.e., whether individuals tend to see themselves more as individuals (associated with people from WEIRD countries) or as parts of a social whole (associated with people from non-WEIRD countries). In this task, people generate 10-20 statements starting with “I am…”, and these statements are categorized in terms of how they’re completed (i.e., whether they reference individual properties or membership in a group). The authors asked GPT to predict how the “average person” would respond. Once again, GPT tended to produce individualistic responses.
The important thing here is that GPT consistently behaves like individuals from “WEIRD” countries. In other words: even if you’re not convinced that “thinking style” is a real thing or that we’re measuring it the right way, whatever we are measuring displays pretty convincing behavioral correlations. The same goes for measures of cultural values from the World Values Survey and measures of self-concept. In each case, GPT behaves more similarly to people from WEIRD countries than people from non-WEIRD countries.3
More than just WEIRD
LLMs are usually trained on text found on the Internet. But as Crockett & Messeri (2023) point out, as of 2021, 2.9 billion people had never used the Internet. That means it’s highly unlikely that the linguistic behavior of those individuals is reflected in an LLM’s training data.
And further, as I noted above, even within WEIRD populations there’s plenty of bias in terms of which viewpoints get expressed online. A 2019 survey of Twitter users, for example, found that most tweets came from a minority of users, and that the 10% most active users were disproportionately interested in and focused on politics. At least at the time, Twitter users tended to be disproportionately young, highly educated, and wealthy (compared to people in the United States in general); they were also more likely to identify as Democrats. This probably isn’t surprising to anyone who’s used Twitter. People have even coined terms like “extremely online” to refer to the kind of person who spends a lot of time engaging with Internet culture, and that kind of person is not particularly representative of the median American.
Crockett & Messeri (2023), then, suggest that LLMs are not just “WEIRD”:
By default, then, LLMs are HYPER-WEIRD: beyond overrepresenting people from WEIRD countries, they overrepresent attitudes that are Hegemonic, Young, and Publicly ExpRessed.
The implication here is that relying on LLMs as model organisms might even be worse than the age-old approach of relying on people from WEIRD countries. If this analysis is correct, LLMs represent an even tinier slice of the world’s remarkable cultural and linguistic diversity.
Where do we go from here?
The world is a big place, and human culture can develop in all sorts of interesting directions. If we want to make claims about the population of humans in general—whether about their individual psychology, their social dynamics, their use of language, or their political behavior—it’s important to take this diversity into account.
Unfortunately, most research in social science hasn’t fully grappled with this issue. As I’ve pointed out before, psychologists have long relied on “model organisms” of their own: college undergraduates at four-year universities. The potential use of LLMs as “model organisms” brings new opportunities, but also new challenges.
What, then, are we to do? Should we stop using LLMs as model organisms altogether? Some researchers would surely answer “yes”, but I’ll advocate for a kind of middle ground.
The first thing to do is to properly scope claims. This is true whether you’re the one making the claim or assessing the claim. Just like you should assess the representativeness of a human sample, it makes sense to ask what an LLM used as a “model” is actually a model of. Here, as is often the case, it’s all about the training data. Atari et al. (2023) write:
Instead, researchers should step back and look at the sources of the input data, sources of human feedback fed into the models, and the psychological peculiarity that these future generations of LLMs are bestowed upon by WEIRD-people-generated data.
Unfortunately, sometimes this is easier said than done. With proprietary models like ChatGPT, researchers outside OpenAI don’t know exactly what data they’re trained on—let alone the details of how they are “tamed” using RLHF.
One solution is to rely on open-source models whose data has been made completely transparent, like OLMo—a new LLM released by the Allen Institute for AI. In addition to publishing all of the weights for this model, the authors have also released the dataset (“Dolma”) in full, along with descriptive details about how they acquired and curated the dataset. This kind of work is incredibly valuable: with open-source models (and datasets), researchers can at least be more confident about the inputs a given LLM was trained on, which in turn makes it easier to figure out how to scope a claim.
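As a rough illustration of why this transparency helps in practice, here’s what loading and prompting an open model via the Hugging Face transformers library might look like. I’m assuming a Hub identifier along the lines of allenai/OLMo-7B; check the model card for the exact name and for any extra packages the current release requires.

```python
# A rough sketch of prompting an open-weights model whose training data is
# documented. The model identifier and prompt are assumptions; check the
# OLMo model card on the Hugging Face Hub for the exact details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed identifier; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "How important is family in your life?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```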
Of course, researchers will still be interested in studying non-WEIRD cultures. It’s important to scope our claims, but if we stop there, then we’ll end up with a bunch of empirical work that simply neglects non-WEIRD cultures. Thus, if we want to use LLMs as model organisms, that means we need to actively work towards creating non-WEIRD LLMs as well.
Here, one approach is to fine-tune existing LLMs to address these sources of bias. For example, a 2023 preprint (Ramezani & Xu, 2023) found that fine-tuning English-centric LLMs on data from the World Values Survey (the same survey Atari et al. used) improved those models’ ability to predict a given country’s cultural attitudes. It remains unclear how far you can get with this kind of approach, or how generalizable it will be. It could be that fine-tuning a “WEIRD” LLM makes it less WEIRD for a given task, but won’t make it less WEIRD for other tasks.
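To give a flavor of what this looks like in practice, here’s a minimal sketch of one way to adapt an English-centric model with survey-derived text: turn country-level responses into short (prompt, answer) strings and run standard causal-language-model fine-tuning on them. This is not Ramezani & Xu’s exact recipe; the example records, the stand-in model, and the hyperparameters are all placeholders.

```python
# A minimal sketch (not Ramezani & Xu's exact method) of fine-tuning a causal
# LM on survey-derived text. The records, model, and settings are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

records = [  # in practice, built from country-level WVS responses
    {"country": "Pakistan", "question": "How important is religion in your life?",
     "answer": "Very important"},
    {"country": "Netherlands", "question": "How important is religion in your life?",
     "answer": "Not very important"},
]
texts = [f"In {r['country']}: Q: {r['question']} A: {r['answer']}" for r in records]

model_id = "gpt2"  # small stand-in model, just for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # predict the same tokens
    return enc

train_ds = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wvs-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_ds,
)
trainer.train()
```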
Another approach is to train new LLMs wholesale on languages other than English. This isn’t a new idea: Natural Language Processing (NLP) researchers have been training LLMs on other languages for years now, and many of those models are freely available through Hugging Face’s open-source hub. But the problem of English-centric, WEIRD LLMs is now recognized as a commercial problem as well (as this article from the Economist makes clear). After all, many people in the world don’t speak English (or if they do, they don’t use English as their primary language). That means that LLM-based companies with the aspiration of serving that population need to think about how they’ll offer services of equivalent quality. Models like ChatGPT simply don’t perform as well in other languages, even widely spoken ones like Bengali. To address this issue, companies like Krutrim are working on building LLMs that can generate (and process) text in 10 Indian languages.
The challenge here is that many of the world’s languages simply have less training data than languages like English. Some of these “low-resource” languages may have limited written materials, and others may not have a written form at all. Part of the solution here surely involves producing better language documentation, which requires careful fieldwork—collaborating with speakers in a community to produce a dictionary and, ideally, translations of texts into and out of that language. Researchers can also train multilingual models, in the hope that linguistic knowledge acquired from a high-resource language (like English) can be “transferred” to a low-resource language or used to “induce” bilingual dictionaries.
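To make the “induce a bilingual dictionary” idea a bit more concrete, here’s a toy sketch: if word embeddings for a high-resource and a low-resource language have been mapped into a shared space (say, via a learned linear alignment), you can propose translations by nearest-neighbor search. The vocabularies and vectors below are placeholders, not real aligned embeddings.

```python
# Toy sketch of bilingual dictionary induction via nearest-neighbor search in
# a shared embedding space. The vocabularies and vectors are placeholders.
import numpy as np

en_vocab = ["water", "fire", "tree"]
xx_vocab = ["word_a", "word_b", "word_c"]   # placeholder low-resource vocabulary
rng = np.random.default_rng(0)
en_vecs = rng.normal(size=(3, 300))         # stand-ins for aligned embeddings
xx_vecs = rng.normal(size=(3, 300))

def translate(word: str) -> str:
    """Return the low-resource word whose vector is closest (by cosine) to `word`."""
    v = en_vecs[en_vocab.index(word)]
    sims = xx_vecs @ v / (np.linalg.norm(xx_vecs, axis=1) * np.linalg.norm(v))
    return xx_vocab[int(np.argmax(sims))]

print(translate("water"))
```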
The last point I’ll make here is decidedly conceptual. As is hopefully clear by now, the notion of an “average human” doesn’t make all that much sense. That doesn’t mean we shouldn’t calculate things like averages. But it does mean that the search for a single model of human psychology will be, at best, very difficult—and perhaps impossible. If we think of this in terms of model fitting, the question is essentially how many parameters one needs to build a representative model of humans. My intuition is that we’ll need more than one. The behavior of a single person changes from one day to the next, from one context to another—to say nothing of the behavior of individuals across the world. If, as I’ve argued, models like GPT-4 sometimes capture the “wisdom of the crowd”, we need to ask not only how large that crowd is but also which humans that crowd represents.
So to sum up: we should scope our claims when using “WEIRD” models; we should also try to build and test more non-WEIRD models; and we should also think about what we mean by “average”. If you’ve read my post on how Cognitive Science needs linguistic diversity, that playbook probably sounds familiar. The hard part, at this point, is execution.
I’m assuming they used ChatGPT-4 (GPT-4), but I’m not sure.
I suppose people who shampoo their beards would group all three together?
Another recent preprint finds similar results: the behavior of ChatGPT and Bard on diagnostic measures of self-perception is most correlated with the behavior of individuals in English-speaking countries. Thanks to Wolfgang Messner for the pointer to this article!
Thank you for this detailed write-up. May I draw your attention to a recent preprint that we published on a related topic?
Messner, W., Greene, T., & Matalone, J. (2023). From Bytes to Biases: Investigating the Cultural Self-Perception of Large Language Models. arXiv Preprint. DOI: https://doi.org/10.48550/arXiv.2312.17256
Large language models (LLMs) are able to engage in natural-sounding conversations with humans, showcasing unprecedented capabilities for information retrieval and automated decision support. They have disrupted human-technology interaction and the way businesses operate. However, technologies based on generative artificial intelligence (GenAI) are known to hallucinate, misinform, and display biases introduced by the massive datasets on which they are trained. Existing research indicates that humans may unconsciously internalize these biases, which can persist even after they stop using the programs. This study explores the cultural self-perception of LLMs by prompting ChatGPT (OpenAI) and Bard (Google) with value questions derived from the GLOBE project. The findings reveal that their cultural self-perception is most closely aligned with the values of English-speaking countries and countries characterized by sustained economic competitiveness. Recognizing the cultural biases of LLMs and understanding how they work is crucial for all members of society because one does not want the black box of artificial intelligence to perpetuate bias in humans, who might, in turn, inadvertently create and train even more biased algorithms.