Building the science factory with LLMs
How much of the science "assembly line" can be automated?
I write a lot about research on Large Language Models (LLMs) and how they work. But most use cases for LLMs, of course, involve applying them to some target domain (e.g., programming [1]) and improving the efficiency or effectiveness of work in that domain. This post is about whether and how LLMs might accelerate progress in my own field: social science research—specifically, research on human cognition.
Accelerating progress in any scientific discipline would be a huge deal. Making science more efficient would mean discarding bad theories faster and perhaps even developing newer, better theories (where “better” could mean more practical, more parsimonious, and/or more predictive). If you think ideas are getting harder to find, then perhaps LLMs could partially counteract that stagnation. Many scientific disciplines are also facing a “replicability crisis”; one use case of LLMs could be to make it easier to verify whether published results are, in fact, reproducible and replicable (perhaps even before publication).
Finally, scientific discoveries can have major real-world impacts and applications. Of course, those impacts aren’t always (or only) positive—but that’s one more reason it’s useful to think through exactly how LLMs would impact the pace and direction of scientific research.
Defining the problem
Different social science disciplines focus on different units of analysis, from individual humans (e.g., psychology) to voting blocs or entire countries (e.g., political science). For the purpose of this post, I’m going to restrict the discussion to cognitive science, with an emphasis on language (a subfield called “psycholinguistics”). That’s because psycholinguistics is the subfield I know best, and also because I think it’s amenable to at least partial automation via LLMs.
Psycholinguists aim to describe the mechanisms underlying language processing and production. Accordingly, their explanations may span all levels of analysis: an “answer” to a research question could be about the series of cognitive operations involved, or it might make reference to specific brain regions that are implicated in carrying out those operations.
Perhaps most importantly, research in psycholinguistics is empirical, meaning that it’s based on data. Further, this empirical research is typically experimental in nature [2]. Theories are tested by designing studies that operationalize the main constructs in a theory, e.g., by manipulating key variables across sets of “stimuli”, which are in turn presented to human participants. Psycholinguists then measure responses to those stimuli (e.g., reaction time, eye-tracking, blood flow to various brain regions). If responses to stimuli are systematically different across conditions, that provides evidence that the experimental contrast is cognitively relevant—which, if the experiment has been carefully and appropriately designed, can provide evidence for or against a theory.
For example, suppose a psycholinguist is interested in whether the way an event is described affects the attribution of blame for accidental events: does a transitive construction (“She broke the vase”) make people think someone is more blameworthy than an intransitive construction (“The vase was broken”)?
To test this, psycholinguists might design stimuli describing various events in which an object is broken or damaged accidentally. For each type of event (e.g., a broken vase), a participant might read one of two construals: a transitive or intransitive description. Participants might then be asked to rate the blameworthiness of the agent involved on a scale, and these ratings could be compared across conditions. If ratings are reliably different (e.g., in a statistical test), it suggests this experimental contrast is capturing something “real” about the relevant variables involved in language processing and event construal. (In case you’re interested, a version of this study has been done, and the answer appears to be yes, at least for the stimuli tested.)
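To make that logic concrete, here’s a minimal sketch of the kind of comparison involved, using simulated ratings and a paired t-test. The numbers and effect are invented for illustration; real analyses would typically use mixed-effects models over many items and participants.

# A toy illustration of comparing blame ratings across conditions.
# The numbers below are simulated, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_participants = 50

# Simulated 1-10 blame ratings: transitive descriptions nudged slightly higher.
transitive = np.clip(rng.normal(loc=6.5, scale=1.5, size=n_participants), 1, 10)
intransitive = np.clip(rng.normal(loc=5.8, scale=1.5, size=n_participants), 1, 10)

# Paired t-test: did the same participants rate transitive items as more blameworthy?
result = stats.ttest_rel(transitive, intransitive)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")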
Decomposing the problem
An important step towards streamlining a complex process is breaking it into simpler components. This is a core insight behind the assembly line, and it’s also a key principle of modular software design. Seemingly hard problems can often be described in terms of distinct, relatively simple “modules”—each of which may be easier to automate than the whole process at once.
What are the relevant components of psycholinguistic research, and how easily could they be automated?
Let’s start by considering the end-to-end process of “coming up with a research idea” and “publishing a study on that topic”. My first attempt at breaking this process down is pictured below, with some additional inspiration from a recent NBER paper (Korinek, 2023) on how LLMs could improve research in economics.
Any process can be modularized almost indefinitely, but this hopefully gives a decent sense of the main activities involved. This process is also necessarily “idealized”; plenty of great research studies aren’t as linear as what’s presented here.
Having decomposed the problem, we can now ask where and how LLMs might be able to help.
Can LLMs help us generate hypotheses?
One of the hardest parts about science is coming up with a hypothesis. This usually involves some mixture of personal intuition and reading related research (i.e., a literature review). That second step is particularly important, as it enables researchers to identify what’s currently unknown.
When constructing a hypothesis, researchers should try to answer these two questions:
Does this address a gap in our existing knowledge?
Why is this gap important?
As Korinek (2023) writes, LLMs may be able to help with this in a few ways. First, they could help with “ideation”: providing a sounding board for thinking through ideas and identifying issues with lines of reasoning.
Second, LLMs could help with the process of finding relevant literature. My guess is that a targeted LLM with access to a specific database of research papers (perhaps using either fine-tuning or retrieval-augmented generation) would be most helpful here. For example, Elicit is an AI “research assistant” that searches across millions of papers on Semantic Scholar, and provides summaries and citations in response to specific research questions.
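To give a rough sense of what the retrieval piece could look like, here’s a minimal sketch of embedding-based search over a toy set of abstracts. The abstracts, the embedding model name, and the whole setup are placeholder assumptions—a real tool like Elicit is far more sophisticated.

# Minimal sketch of retrieval over a toy "database" of paper abstracts.
# Assumes the openai package and an API key; abstracts here are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

abstracts = {
    "smith_2020": "Transitive framing increases blame attribution for accidents.",
    "lee_2018": "Eye-tracking evidence for incremental processing of ambiguous words.",
}

def embed(text):
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank abstracts by cosine similarity to the research question.
query_vec = embed("Does sentence structure affect moral judgments?")
doc_vecs = {key: embed(text) for key, text in abstracts.items()}
ranked = sorted(doc_vecs, key=lambda k: cosine(query_vec, doc_vecs[k]), reverse=True)
print(ranked)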
Finally, and most critically, LLMs could help with the process of generating novel hypotheses, e.g., by forging connections that haven’t previously been made in the literature. An early version of this was actually implemented by one of my colleagues, Brad Voytek (along with Jessica Voytek). It works by counting how many times different scientific terms occur together (e.g., “amygdala” and “depression”), building a network of which terms occur most frequently, and then identifying pairs of terms that, based on the network structure, ought to be connected but aren’t.
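For a rough sense of the general approach, here’s a minimal sketch: count which terms co-occur in the same abstracts, build a graph, and score unconnected pairs with a simple link-prediction heuristic. The corpus and terms are toy placeholders, and the scoring (Jaccard similarity over shared neighbors) is my stand-in, not the Voyteks’ actual method.

# Toy sketch of hypothesis generation via a term co-occurrence network.
# Corpus and term list are placeholders; not the Voyteks' actual pipeline.
from itertools import combinations
import networkx as nx

terms = ["amygdala", "depression", "serotonin", "hippocampus", "memory"]
abstracts = [
    "amygdala activity predicts depression severity",
    "serotonin modulates amygdala responses",
    "hippocampus supports episodic memory",
    "serotonin and hippocampus interactions",
]

# Build a graph with an edge whenever two terms co-occur in the same abstract.
G = nx.Graph()
G.add_nodes_from(terms)
for text in abstracts:
    present = [t for t in terms if t in text]
    for a, b in combinations(present, 2):
        G.add_edge(a, b)

# Score term pairs that are NOT yet connected: high scores suggest a "missing" link.
candidates = [(a, b) for a, b in combinations(terms, 2) if not G.has_edge(a, b)]
for a, b, score in nx.jaccard_coefficient(G, candidates):
    print(f"{a} -- {b}: {score:.2f}")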
That paper was from 2012, and LLMs have come a long way since then. A more recent paper used GPT-4 to generate novel social psychology hypotheses. These hypotheses were compared to hypotheses from actual published research papers, and a group of social psychology experts (blinded to condition) ended up rating the GPT-4 hypotheses more highly along a number of dimensions. From the abstract (bolding mine):
…we generated hypotheses using GPT-4 and found that social psychology experts rated these generated hypotheses as higher in quality than human generated hypotheses on dimensions of clarity, originality, impact, plausibility, and relevance.
This all seems promising to me. An LLM “research assistant” could be particularly helpful at the early stages of research, when you’re still exploring the literature and thinking through possible directions.
I also want to be clear that I do think there will always be immense value in reading academic papers yourself. The experience of sitting—and struggling—with a text cannot, in my opinion, be replaced by summaries of the text, even if those summaries are very good. But there are only so many hours in a day, and the status quo is that researchers often cite papers without having fully read them anyway. A good LLM assistant could raise the “floor” by augmenting the literature review process, giving researchers familiarity with a larger body of work and pointing them to areas of the hypothesis space they might not have noticed.
Designing experiments
Once you have a hypothesis, it’s time to design an experiment. In psycholinguistics, this involves several steps:
Identify the key theoretical constructs from your hypothesis. For example, if the hypothesis is “people struggle more to process sentences with ambiguous words”, one relevant construct is “processing difficulty” and another is “ambiguous words”.
Decide how to operationalize those constructs. “Processing difficulty” is often operationalized as reading time, but it can be measured in other ways as well.
Develop stimuli. If your experiment involves reading sentences with or without ambiguous words, then you’ll need a bunch of sentences for each condition. You’ll also probably want to match those sentences as closely as possible across conditions to control for confounds.
Implement the experiment using experimental design software. I usually use JsPsych or Gorilla, but there are tons of options here. Some involve programming (and are typically more flexible), and others rely on GUIs.
Pre-register your predictions and analysis plan. Once you have your experiment designed, you should be able to make clear predictions about how the data will look. It’s good practice to pre-register those predictions, e.g., on sites like the Open Science Framework.
Run a pilot study. Before you run the official study, it’s also good practice to run a “pilot”, e.g., with a smaller sample. This is helpful for identifying bugs in your experiment design (e.g., maybe your randomization procedure doesn’t work!), but it can also be helpful for generating additional exploratory hypotheses.
If that seems complicated, that’s because it is! For newcomers to science, I think experimental design often feels like the “boring” side of research—you have to move away from the lofty world of ideas and think concretely about what you’re actually going to do. That’s in large part why I’ve grown to love it over the years: the process of translating an abstract hypothesis to a concrete design is really rewarding.
But there are still parts of this process I wish I could speed up. For example, developing stimuli is sometimes a really arduous task. Psycholinguistic experiments often involve upwards of forty stimuli per condition, and as I mentioned above, you need to be very careful about matching those stimuli for various features (e.g., length, word frequency, word concreteness) to eliminate potential confounds. Sometimes you also need “fillers”—stimuli you don’t plan on analyzing but which distract participants from the main purpose of the study. You might also need to “norm” these stimuli: for example, if you’re trying to manipulate something like “real-world plausibility” across your sentences, you might want to run a separate study ensuring that these sentences do, in fact, vary in the way you intended (and not in others).
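To give a flavor of the bookkeeping involved, here’s a small sketch that checks whether two stimulus lists are roughly matched on sentence length and average word frequency. The sentences are placeholders, and it assumes the wordfreq package for frequency estimates.

# Sketch: check that stimuli in two conditions are roughly matched on
# sentence length and average word frequency. Sentences are placeholders;
# assumes the wordfreq package is installed.
import numpy as np
from wordfreq import zipf_frequency

condition_a = ["He knocked over the vase.", "She broke the lamp."]
condition_b = ["The vase got knocked over.", "The lamp got broken."]

def describe(sentences):
    lengths = [len(s.split()) for s in sentences]
    freqs = [
        np.mean([zipf_frequency(w.strip(".,").lower(), "en") for w in s.split()])
        for s in sentences
    ]
    return np.mean(lengths), np.mean(freqs)

for label, stims in [("transitive", condition_a), ("intransitive", condition_b)]:
    mean_len, mean_freq = describe(stims)
    print(f"{label}: mean length = {mean_len:.1f} words, mean Zipf frequency = {mean_freq:.2f}")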
I think this portion of experimental design is ripe for innovation. I’ve written before (here and here) about how GPT-4 might be able to help psycholinguists norm stimuli. I’m now experimenting with using GPT-4 to create stimuli, e.g., using a prompt like the one below:
Hypothesis: When reading about unintended events (e.g., someone knocking over a vase), people assign more blame when the event is described using a transitive construction (e.g., "He knocked over the vase") than an intransitive construction (e.g., "The vase got knocked over"). Additionally, this will interact with the seriousness of the situation: for low-stakes situations (e.g., knocking over a vase), the construction will play a bigger role; for high-stakes situations (e.g., accidentally seriously injuring someone), the construction will not play a big role.
Independent Variable #1: Construction (transitive vs. intransitive).
Independent Variable #2: Seriousness (very serious vs. low stakes).
Dependent Variable: Blame assigned, on a scale from 1 (not at all to blame) to 10 (entirely to blame).
Task: Create 10 low-stakes scenarios and 10 very serious scenarios. Each scenario should have two versions, using either the Transitive or Intransitive construction in the critical sentence (but the same scenario). Each scenario should have approximately 1 sentence setting up the situation. The critical construction should appear in its own sentence. Please output these stimuli as a .csv-style table with the following structure:
Scenario number,Seriousness,Construction,Scenario Text,Critical Sentence.
Note that each column should be separated by a comma (and no space) as in a .csv. If a sentence has a comma, please wrap that sentence in quotes.
Here are the top rows of the .csv file that GPT-4 ended up generating:
Scenario number,Seriousness,Construction,Scenario Text,Critical Sentence
1,Low stakes,Transitive,"Tom was playing soccer in the backyard","He kicked the ball into the fence."
1,Low stakes,Intransitive,"Tom was playing soccer in the backyard","The ball got kicked into the fence."
There are 40 rows in all, for the 40 stimuli (20 scenarios, 2 versions each) GPT-4 created. In theory, this .csv file could be plugged into experimental design software like Gorilla to collect and measure responses from human participants (e.g., in a pilot study). Researchers might also decide they want to use GPT-4 to norm those stimuli beforehand (as I’ve written about before), which might feed back into the stimulus generation step.
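In principle, this loop can be scripted end to end. Here’s a rough sketch of sending a prompt like the one above to the API and loading the result for inspection—the model name, file names, and output handling are assumptions, and the generated rows would still need to be checked by hand.

# Sketch: send a stimulus-generation prompt to the API and load the CSV it returns.
# Model name and prompt file are placeholders; outputs should be checked by hand.
import io
import pandas as pd
from openai import OpenAI

client = OpenAI()
prompt = open("stimulus_prompt.txt").read()  # the prompt shown above

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
csv_text = response.choices[0].message.content

# Parse the .csv-style output into a DataFrame for spot-checking and norming.
stimuli = pd.read_csv(io.StringIO(csv_text))
print(stimuli.head())
stimuli.to_csv("generated_stimuli.csv", index=False)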
Data collection
The details of data collection depend on the kind of research. Some researchers (like myself) run behavioral studies online; other researchers run neuro-imaging studies (e.g., using EEG or fMRI) in-person—which is typically a much more complicated, time-consuming, and expensive process.
Despite what I’ve written before, I’m not that enthusiastic about replacing human participants with LLMs when the research question is about human cognition. I do think the “LLMs as participants” approach has a few use cases:
As a complement or “baseline” for human studies, e.g., to determine to what extent an effect in humans could emerge in a statistical language model.
As model organisms, e.g., when researchers cannot investigate a specific question in humans.
As norming tools (see the section above).
But again: if the question is about human language processing, then I think studying human linguistic behavior is what matters—so we’re unlikely to take humans out of the data collection loop anytime soon.
Data analysis
In psycholinguistics, data analysis usually involves wrangling data (tidying it up, merging data from different subjects, etc.), then running statistical analyses and creating visualizations. Different researchers analyze their data in different ways, but many researchers rely on programming languages like Python and R. This is another area where it’s clear to me that LLMs can—and already do—help researchers quite a bit.
The ability of LLMs to write code is now well-documented (see this New Yorker article for a great perspective from a programmer). I write code daily (in Python and R) for my line of work, and I now use GPT-4 frequently to help with different stages of data analysis. If I’m unsure how to implement something—e.g., I can’t remember the exact syntax for melting a DataFrame—I’ll describe the problem to GPT-4 and ask it to generate code. Usually, this code can be used with minimal modifications, and it saves me a bunch of time searching on Google.
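As a concrete example, here’s roughly the kind of snippet that request produces—reshaping a wide table of per-condition reading times into long format (the column names are made up):

# Reshape a wide DataFrame of per-condition reading times into long format.
# Column names are made up for illustration.
import pandas as pd

wide = pd.DataFrame({
    "participant": [1, 2, 3],
    "rt_transitive": [512, 498, 530],
    "rt_intransitive": [540, 525, 555],
})

long = wide.melt(
    id_vars="participant",
    value_vars=["rt_transitive", "rt_intransitive"],
    var_name="condition",
    value_name="reading_time",
)
print(long)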
There are also tools like Copilot, which integrate directly into the text editor you’re using to write code, as well as OpenAI’s Code Interpreter, which allows you to upload an actual dataset and run analyses described in natural language. I’m still curious to see how tools like Code Interpreter perform on much messier datasets—personally, I think a tool to help researchers wrangle messy data would be incredibly valuable.
Writing papers
I really like writing papers. But I know many researchers don’t enjoy the writing process. Additionally, many researchers are not native English speakers, and unfortunately, English is generally the “default” language of much of the academic publishing world.
LLMs seem well-suited to help with this process. Indeed, non-native English speakers have been using tools like Grammarly—which is based on similar Natural Language Processing technology—for years. Perhaps the main difference is that LLMs like GPT-4 are much more general-purpose. Given a description of a hypothesis and a statistical result, GPT-4 could probably write a pretty good summary of that result and why it matters.
Personally, I’ve also found GPT-4 useful for things like rewording or shortening abstracts. Many journals require abstracts to be shorter than 150 words, and I’ve been impressed at how effectively GPT-4 makes my abstracts pithier without sacrificing much in the way of informativity or precision.
What about peer review?
Another important thing to mention is that the process of producing research involves much more than just individual researchers writing papers.
A major component of the research production process is peer review. Peer review has its issues, but assuming some variant of academic journals and conferences sticks around, editors are still going to need a way to evaluate papers for their research quality. Right now, the review process looks something like this (admittedly over-simplified) diagram:
More precisely: journal editors receive a paper; they make a decision about whether to send it out for review; the paper is then reviewed, usually by 2-3 other researchers in the field; the editor then makes a decision about whether to accept the paper, ask for revisions, or reject the paper outright.
This process is sometimes very slow. There are various reasons for this, but one reason is that it’s hard to find reviewers—who typically aren’t paid—willing to read a paper and provide timely, helpful feedback on it. One suggestion would be to start paying reviewers, but this costs money, and those costs are presumably going to show up somewhere (e.g., journal subscription fees). Could LLMs help with this process?
There’s some evidence that they could. This recent paper asked GPT-4 to review papers published across the Nature family of journals and the ICLR conference, and compared its decisions to those made by human reviewers. Here’s what they found:
The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR).
In other words: GPT-4 and humans agree about ~35% of the time (on average)—which sounds low, but apparently humans only agree with each other about ~32% of the time anyway.
Another paper found more mixed evidence. LLMs may be able to help identify some mathematical errors and provide feedback on papers in a “checklist” format—but they struggled to compare multiple papers for quality, and still missed roughly 40% of the explicit mathematical errors.
Perhaps it’s too early to start incorporating LLMs into the review process. But given the potential efficiency gains—and given how slow the process can be currently—I think it makes sense to research their effectiveness. The other area that needs to be researched here is how the use of LLMs in the review process might introduce biases—and in particular, whether these biases are somehow different in direction or magnitude than the ones we know human reviewers sometimes bring to the table.
As I’ve written before, incorporating LLMs doesn’t have to mean replacing humans altogether: one could maintain some “minimum” of human reviewers assigned to a paper to ensure that LLMs aren’t the only ones making the decisions.
What about automatic replication?
As I mentioned in the introduction to this article, many scientific disciplines are also facing a “replicability crisis”: a surprisingly large percentage of published work does not hold up upon replication. There are various attempts to reform the scientific process, including pre-registration and making one’s data and analysis code available (i.e., “open data”). But there still isn’t always as much oversight as one would like—for example, a recent preprint suggests that fewer than 14% of paper reviewers report actually accessing the researcher’s pre-registration (and only 5% of editors do so).
Practices like pre-registration and open data are only as good as the institutional incentives and community norms they operate in: if reviewers aren’t expected to access a researcher’s data and double-check their work—and I can understand why unpaid reviewers wouldn’t want to go through that trouble—then these practices aren’t having as big a positive impact on the research process as they could.
Thus, one potential use case for LLMs that I’m really excited about would be automating (or partially automating) the process of reproducing statistical results in published papers, or even running entire replication studies. This could take a few different forms, ranging in complexity:
(1) Simplest version: accessing a researcher’s code and data files, and reading the code to evaluate whether it matches up with what the researchers said they did.
(2) Slightly more complicated version: actually running the analysis script and checking that the results match what the researchers reported in the manuscript.
(3) Much more complicated: using the Methods section, along with the published stimuli, to generate a new experiment and run a replication study.
Note that even (1) above would probably be an improvement upon the status quo. Like I said, the modal reviewer is not taking the time to read through a researcher’s analysis script (or their pre-registration); an LLM could act as another “pair of eyes”—again, raising the floor of performance. It’s also important to note that we’d probably want to enforce even stronger norms about requiring or highly incentivizing pre-registration or open data in the first place: otherwise researchers might be unfairly penalized for having included a pre-registration (or data), while researchers who neglect that entirely fly under the radar. [3]
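To give a flavor of what version (2) might look like, here’s a minimal sketch: re-run a submitted analysis script and compare its output against the values reported in the manuscript. The file names, the reported values, and the assumption that the script writes JSON to stdout are all hypothetical.

# Sketch of "version (2)": re-run an analysis script and compare its output
# to the statistics reported in the paper. All names and values are hypothetical.
import json
import subprocess

# Values transcribed from the manuscript (hypothetical).
reported = {"t_statistic": 2.41, "p_value": 0.017}

# Re-run the authors' analysis script, assuming it writes its results as JSON to stdout.
result = subprocess.run(
    ["python", "analysis.py"], capture_output=True, text=True, check=True
)
reproduced = json.loads(result.stdout)

# Flag any statistic that doesn't match the manuscript within a small tolerance.
for name, value in reported.items():
    match = abs(reproduced[name] - value) < 0.01
    print(f"{name}: reported={value}, reproduced={reproduced[name]}, match={match}")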
I think that version (3) would be hugely beneficial for the field. It’s also by far the hardest step. That said, I do think it’s possible, particularly if researchers are encouraged to submit clearly specified pre-registrations with all of their experimental materials (e.g., their stimuli). The task would effectively be learning to translate from a pre-registration to a fully designed experiment and analysis script. This would make it much faster to design and implement a replication study—and at least for behavioral studies that can be conducted online, it would be relatively straightforward (and fast!) to find out whether a study replicates or not.
Is this even what we want?
I’m a psycholinguist, and I enjoy conducting psycholinguistics research. Why, then, am I invested in automating much of this process? Do we even want “self-driving” scientific labs? Does automating our work risk compromising its quality? There are a few ways to answer this question, and I think it’s worth thinking through the trade-offs involved.
First, I’ve focused in this post on augmenting the research process, not replacing human researchers entirely. The process of producing research is inefficient for all sorts of reasons—and if LLMs can solve some of those inefficiencies, I think that’s a good thing. Researchers can still do their work, but they can do it faster, and ideally produce research that’s higher quality. Right now, researchers and reviewers take shortcuts: researchers may not norm their stimuli as carefully as they should, and reviewers may not look through a researcher’s pre-registration as carefully as they should. One can assert that researchers and reviewers should be more careful, but that doesn’t itself change their behavior. Instead, LLMs might be able to fill this role and improve the median quality of published research.
Of course, automation may beget further automation, and it’s natural to wonder whether what starts as “using AI to create stimuli” might one day turn into “labs entirely operated by AI systems”. What happens to human researchers then? I’m not an economist, so it’s hard for me to think clearly about this (check out Timothy Lee’s article on AI and which jobs are in danger of being automated). But my (probably naive) view is:
Fully automating an entire research discipline is in part a choice made by the community of researchers and their funding bodies.
If we get to that point, societies will need to have solved bigger issues anyway (universal basic income? sources of meaning in a post-scarcity world?); and in a world of “fully automated luxury communism”, presumably humans can do psycholinguistics for fun if that’s what makes their hearts sing.
At any rate, I have a hard time grappling with these arguments because they feel quite abstract. That doesn’t mean I rule them out a priori—just that I have a hard time figuring out how, exactly, different lines of reasoning impact more proximal decisions.
Returning, then, to more proximal issues: another question that arises is whether science really is a decomposable set of steps. An objection of this form might look like this: sure, it’s possible to describe the scientific process this way—but that description is an idealized abstraction, and “real” science is a much messier, fuzzier process. If that’s true, then the modular approach I’ve outlined in this post is hopelessly naive. Even worse, we might inadvertently terraform science by assuming it’s modular and thereby making it so in our attempts to automate it—perhaps compromising its integrity in the process.
I’m reminded here of the central argument presented in Seeing Like a State: our “high modernist” ambitions (and arrogance) often lead us to make simplifying assumptions about how complex processes work; when we combine these assumptions with top-down reforms (e.g., “scientific forestry”), we transform the world into our model of the world. Sometimes we don’t notice that we’ve missed crucial sources of complexity until it’s too late. Perhaps something similar might happen with science: we’ll “succeed” in automating science, but it will be a simpler, lower-fidelity version of what we currently call science. The science artisans will be replaced by science factories—the implication being that the factories will produce worse science than the artisans. [4]
Is this true? I do think there’s generally some legitimacy to this kind of philosophically conservative line of reasoning—it urges us to practice humility in the face of a process we may not fully understand (a la Chesterton’s fence). At the same time, I think it leans a little too heavily on the implicit assumption that factory-made things are necessarily worse than handmade things. It’s possible to have an inherent preference for handmade things and still acknowledge that something made in a factory can have equal quality in terms of the purpose for which that thing was made. The key question here is whether we really will lose something important by trying to break the practice of science (or just psycholinguistics) into distinct modules and then trying to make those modules more efficient. Personally, I’m somewhat resistant to the idea that there’s something ineffable or irreducible about this process—but I suppose it’s possible that it’s ineffable to us.
What happens next?
Assuming there is an appetite for making psycholinguistics more efficient—what’s the next step?
Here’s what I’m inclined to do, at least:
Refine the modularization I’ve described above. This was a first pass, and I think it could be improved—ideally by conducting qualitative and quantitative surveys of other psycholinguists.
Identify which modules would be most impactful to automate. Which parts of the process are particularly time-consuming, unpleasant, or simply done poorly in the status quo?
Identify which modules would be most tractable to automate. What’s the low-hanging fruit?
Having identified which modules are most impactful and which are most tractable, we could then start work on automating those parts of the process.
[1] Notably, w3schools—a resource I often direct my students to for additional programming practice—now has a page on using ChatGPT to help with coding.
[2] Though not always. Sometimes a study might be correlational, e.g., correlating different aspects of linguistic structure within or across languages. Nonetheless, there’s generally consensus that the best way to test a mechanistic theory is via an experiment.
[3] It’s similar to a problem one encounters in writing. If the writing is good, readers don’t notice it and can attack the arguments directly. If the writing is bad, it’s hard to tell what’s being argued, which might make the ideas harder to attack.
[4] Update (12/14/2023): There’s also an interesting connection here—made by Benjamin Riley in the comments—to the question of scientific paradigms and paradigm shifts. As I’ve written about before, “paradigms” are a key concept in philosophy of science, closely associated with the philosopher and historian Thomas Kuhn. They encompass things like: a scientific field’s basic axiomatic assumptions, what kinds of things a field considers evidence, what kinds of explanations are considered valid explanations, and much more. A paradigm “shift”, naturally, is when those things change. There are many possible causes for such a shift. But one interesting question that arises is whether the process of modularization and automation I describe above is more likely to “freeze” certain scientific practices into place (as I’ve written LLMs might do for language more generally), effectively locking us into the current paradigm or at least making it harder to shift to a new one.
Thought provoking as always Sean. My immediate, cheeky thought -- if LLMs could somehow help me to sort through the bevy of research being produced on AI and human cognition that would be super freakin' helpful, as I feel like I'm bookmarking six studies a day that I'll never get around to reading.
Slightly more substantively, your post has me pondering the nature of scientific inquiry and in particular, scientific revolutions in the Kuhnian sense, the moments when our paradigms shift. The approach you sketch out here is highly empirical, and no shame in that. But Kuhn observed that "what occurs during a scientific revolution is not fully reducible to a reinterpretation of individual and stable data." He suggested, with some force I think, that revolutions come when scientists engage in "thought experiments" that create new metaphors for understanding the world differently.
Can LLMs help with this? Maybe! I certainly can imagine them helping spur new "thought experiments" in scientists. But just as I'm concerned that art and literature may be impoverished by the creation of this new tool that is built to generate something that "looks like" a sort of averaging of what's already been created, in my darker imaginings I can envision a future where this technology locks us into the scientific paradigm of the present.
But hopefully you prove me wrong and become the modern Einstein of the mind.