Each month, paying subscribers can vote for which research topic they’d like to see me explore next, i.e., a kind of “citizen science”. Here’s the description of what I’m envisioning from the first poll:
In terms of the topics themselves: most of them will be empirical research projects—i.e., I’ll set out to answer a concrete question using data—and most of those will likely involve Large Language Models (LLMs) specifically. They’ll also mostly be bite-sized: things I can realistically accomplish, on the side, in the span of a month. I’m open to longer-term projects too (e.g., ≥2 months), but they might have to replace some number of the monthly projects. Finally, some of the topics will be research reviews: my attempt to synthesize what’s known on a particular topic.
In the first poll, subscribers voted for a post exploring whether and to what extent LLMs could capture readability judgments. I published that report last week; it was a ton of fun to work on and write up.
Now it’s time to vote on February’s research topic!
The options
Here are the options for this month’s poll (full descriptions below). Whichever option receives the most votes is the one I’ll focus on for the next month. Note that if there’s a close runner-up, I’ll definitely include that option in future polls. And if you have other topics you’d like to see written about, feel free to suggest them in the comments!
The other thing to mention is that one of these (“modifying readability”) will likely take more than 1 month (probably 2-3 months). If that option wins, it’ll be what I work on in February and March (at least), instead of having additional polls those months. So casting a vote for “modifying readability” is in essence casting a vote for multiple months (though each month would come with at least one update post, along with the standard other posts). The other options should all be doable within a 1-month span.
Option 1: LLMs + readability + cognition [study]
Last week, I published a post using LLMs to estimate the readability of different texts.
GPT-4 Turbo, the LLM I used, did pretty well: of all the predictors I tested, it was the best at capturing human readability judgments.
But there are lots of other potential predictors that I didn’t manage to test. In particular, there’s good evidence that certain psycholinguistic variables—such as the average concreteness, frequency, and age of acquisition of words in a text excerpt—matter a lot for readability. Some tools for calculating readability incorporate those variables specifically. The original CLEAR corpus paper found that these cognitively-informed variables were pretty predictive of readability, but it’s unclear how they stack up against those “black box” LLM judgments I wrote about in the last post.
Here’s what this option would involve:
I’d use tools like TAALES (“Tool for the Automatic Analysis of Lexical Sophistication”) to calculate more nuanced estimates of readability that are informed by psycholinguistic theories.
I’d compare the performance of GPT-4 Turbo’s readability estimates to that of the cognitively-informed estimates (a rough sketch of this comparison is included at the end of this option). How successful is each approach on its own? What about combined? Do they explain the same variance in human readability judgments, or different variance?
I’d also conduct a more detailed error analysis, looking at exactly where each approach “goes wrong”. Is there anything systematic about the errors made by either GPT-4 Turbo or this “psycholinguistic features” approach? For example, do LLMs tend to underestimate the difficulty of certain kinds of texts and overestimate the difficulty of others?
This option is pretty straightforward because the LLM judgments have already been produced. The main challenge would be calculating all those other cognitively-informed readability scores. If I can’t find a Python API to compute them, I’ll have to cobble together something using existing corpora (e.g., published norms for age of acquisition, concreteness, etc.). This might underestimate the potential explanatory power of those psycholinguistic variables, but hopefully it would at least give us a useful lower bound on their informativity.
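To give a flavor of the comparison I have in mind, here’s a minimal sketch in Python. The data file and every column name are hypothetical placeholders, and the real analysis would lean on cross-validated performance rather than in-sample R².

```python
# A minimal sketch of the planned model comparison (not the final analysis).
# "clear_with_predictors.csv" and all column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("clear_with_predictors.csv")  # one row per CLEAR excerpt

# Three regressions predicting human readability judgments:
# LLM estimate alone, psycholinguistic features alone, and both combined.
m_llm = smf.ols("human_readability ~ gpt4_estimate", data=df).fit()
m_psych = smf.ols("human_readability ~ concreteness + word_frequency + aoa", data=df).fit()
m_both = smf.ols(
    "human_readability ~ gpt4_estimate + concreteness + word_frequency + aoa",
    data=df,
).fit()

for label, model in [("LLM only", m_llm), ("Psycholinguistic only", m_psych), ("Combined", m_both)]:
    print(f"{label}: R^2 = {model.rsquared:.3f}")

# If the combined model barely improves on the LLM-only model, the two approaches
# are largely capturing the same variance; a sizable jump suggests complementary signal.
```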
Option 2: Modifying readability [>1mo study]
Another, more ambitious future direction for the readability project would be using LLMs to modify the readability of text (e.g., an “ELI5” LLM). My impression is that people do this already, either for making their own text easier to read, or for producing a more accessible summary of an article or book chapter. But how good are LLMs like GPT-4 Turbo at this, empirically?
This project would be more ambitious than the other options. Here’s what I’m envisioning:
I’d select a subset of texts (e.g., 100-200 texts) from the CLEAR corpus as my experimental materials.
Then, I’d ask GPT-4 Turbo to generate an “easier” version of each of these texts (or possibly a few easier versions); I might also try the opposite direction (e.g., creating a less accessible version). A rough sketch of this generation step is included after this list.
Then, I’d run an online human study (e.g., on Prolific) asking people to judge the readability of those new texts. Realistically this would probably involve comparing the original text to the new one and asking which is easier to read.
Optimistically, I’d also like to run a parallel study asking people to judge whether the modified version loses any information or whether it’s still just as informative as the original.
Just for fun, I might ask GPT-4 to do the same task as the humans and see how well its judgments stack up against the human judgments.
Finally, I’d ask whether the modified texts are, on average, judged to be easier than the originals—and also whether they’re judged as losing any information.
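Here’s the sketch of the generation step mentioned above, using the openai Python client (v1.x). The prompt wording and model name are placeholders I’d iterate on before running anything at scale.

```python
# A minimal sketch of the rewriting step; prompt wording and model name are placeholders.
from openai import OpenAI

client = OpenAI()

def simplify(excerpt: str) -> str:
    """Ask GPT-4 Turbo for an easier-to-read version of a CLEAR excerpt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # placeholder for whichever GPT-4 Turbo snapshot I end up using
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following passage so that it is easier to read "
                "(simpler vocabulary, shorter sentences), while preserving all "
                "of the information it contains:\n\n" + excerpt
            ),
        }],
    )
    return response.choices[0].message.content

# The resulting (original, simplified) pairs are what Prolific participants would then
# judge: which version is easier to read, and whether any information was lost.
```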
There are lots of decision points here, including things I probably haven’t thought about yet. For that reason, along with the fact that human studies take more time (and money) to set up, I’m anticipating that this would take longer than a month. That’s why, as I mentioned above, voting for this option will effectively mean casting a vote for multiple months. However, for the duration of this project, I’d still write an update post each month describing progress so far for those interested.
Option 3: Winter break hypothesis [study]
Is GPT getting lazier? Some people seem to think so—or at least they did in December 2023.
One specific hypothesis, which feels at least partly tongue-in-cheek, is that GPT’s “laziness” tracks the calendar: namely, that GPT-4 has “learned” that December is a holiday month, and therefore it doesn’t “want” to work as hard. At least one person (Rob Lynch) has tried to test this “winter break hypothesis”, and found some evidence broadly consistent with it: GPT-4 apparently generates shorter responses when it’s told in the prompt that the month is “December” than when it’s told the month is “May”.
This is interesting and also kind of funny. Now, I’m not fully convinced GPT has gotten measurably “lazier” in the first place. As this article points out, it’s possible that people have simply “habituated” to GPT’s strengths and now notice when it doesn’t do a good job; it’s also possible that a larger user base just means more people catching tail events, raising the salience of apparent “laziness”.
But at the very least, this “winter break hypothesis” feels like something that can be empirically tested. Regardless of whether GPT has gotten lazier over time, the claim is that when prompted with information about the month, GPT’s responses will systematically vary according to the prompted month. Apparently others struggled to replicate the analysis I mentioned above, which is partly why I want to look into this.
Here’s what I’d do:
Come up with a small range of simple tasks involving text generation.
Prompt GPT-4 with each of these tasks, along with instructions varying the month (from January through December). Another, orthogonal set of conditions would also add language specifying that the model is “on vacation” (to establish the effect size of this more direct manipulation).
For each reply, measure the length (number of tokens), as in Rob Lynch’s original analysis. Also, for each reply, use GPT-4 to measure the quality of the response (blind to condition).
Ask whether length and (measured) quality vary systematically by month and by the more overt vacation manipulation (“you’re on vacation!”).
This one’s pretty straightforward too. The main decision point will be coming up with a simple set of question/answer tasks to measure performance on.
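For concreteness, here’s roughly what the data-collection loop might look like, as a sketch using the openai Python client (v1.x). The tasks, prompt phrasing, and model name are all placeholders, and each cell would be sampled many times in the real study.

```python
# A rough sketch of the data-collection loop; tasks, prompt wording, and model name
# are placeholders, and each (month, task, vacation) cell would be sampled repeatedly.
import calendar
import itertools
from openai import OpenAI

client = OpenAI()

MONTHS = list(calendar.month_name)[1:]  # "January" ... "December"
TASKS = [
    "Write a short story about a lighthouse keeper.",  # hypothetical example tasks
    "Explain how photosynthesis works.",
]
VACATION = [False, True]  # the more overt "you're on vacation" manipulation

rows = []
for month, task, on_vacation in itertools.product(MONTHS, TASKS, VACATION):
    system = f"The current month is {month}."
    if on_vacation:
        system += " You are currently on vacation."
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # placeholder for whichever GPT-4 snapshot I end up using
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ],
    )
    rows.append({
        "month": month,
        "task": task,
        "on_vacation": on_vacation,
        "n_tokens": response.usage.completion_tokens,  # response length, as in Rob Lynch's analysis
        "text": response.choices[0].message.content,   # later rated for quality by GPT-4, blind to condition
    })
```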
Option 4: Movies and the past [study]
(This is an option from January’s poll that received some votes, so I’m putting it back in.)
Recently, the journalist Matt Yglesias wrote a really interesting article about movies these days and what they’re about. One of his points is that lots of recent films (especially blockbusters) are set in the past, and that it’d be nice to see more films set in contemporary times. Obviously there have always been period pieces, but many films we now think of as classics were set in the same time period (e.g., decade) in which they were made. He writes:
But “Taxi Driver” was made in the 1970s about the 1970s. “Goodfellas” was released in 1990, and while its setting stretches back to the past, it tells the story of Henry Hill through the 1980s. “The Departed” was a huge hit and Oscar winner in 2006, and it’s set in the mid-aughts. “Die Hard” is mostly an incredibly fun romp, but it’s also very much a movie depicting some of the particular social and cultural concerns of the 1980s.
An interesting discussion that arose in the comments of the piece was whether it is in fact empirically true that movies these days are more frequently about the past than they used to be.
There are all sorts of ways one could operationalize and test this claim, and I thought it could be fun to try. Here’s what I’m envisioning:
Collect a list of the top ~50-100 films for each year dating back to at least 1950 (or however far back a reliable list of films per year can be found).
Estimate the decade in which each film is primarily set, using a combination of LLM coding and human “gold standard” annotation.
Ask whether the rate of top films set in the present vs. the past has changed, and in which direction, over the years.
There are tons of degrees of freedom here. There’s not “one right way” to assess the claim, and the way you operationalize it likely matters a lot for the answer you get.
For example: plenty of films are set in the distant future or in some imagined past—should those films be excluded from the analysis? What threshold should be used for determining whether to exclude a film?
Another example: even once the dataset has been assembled, what’s the best way to quantitatively assess whether films are more frequently about the past? One approach would be to ask whether the proportion of films within a given year about the current decade vs. previous decades has decreased—but what about future decades? A slightly more sophisticated approach would be to calculate some difference score—i.e., the difference between the year in which a film is made and the estimated year in which it takes place—and ask whether that’s gone up or down. But that difference score will be impacted by outlier period pieces, e.g., films set in the distant past or distant future, which brings us back to the question of which films to exclude.
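To make those two approaches concrete, here’s a minimal sketch. It assumes a hypothetical dataframe with one row per film, containing the release year and the estimated year in which the film is primarily set; the exclusion questions above would still need to be settled first.

```python
# A minimal sketch of the two analysis approaches; the file and column names are
# hypothetical, and films set in the far future/past may need to be filtered first.
import pandas as pd

films = pd.read_csv("top_films_annotated.csv")  # columns: title, release_year, setting_year

# Approach 1: the share of each year's top films set in an earlier decade.
films["set_in_past"] = (films["setting_year"] // 10) < (films["release_year"] // 10)
past_share_by_year = films.groupby("release_year")["set_in_past"].mean()

# Approach 2: a difference score (release year minus estimated setting year);
# the median is less sensitive than the mean to outlier period pieces.
films["lookback"] = films["release_year"] - films["setting_year"]
lookback_by_year = films.groupby("release_year")["lookback"].median()

print(past_share_by_year.tail())
print(lookback_by_year.tail())
```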
When it comes to assessing the claim, it’s also possible that the effect is not monotonic: maybe films made in the early 20th century were more often set in the past, films made in the late 20th century were more often set in the present, and films made in the 21st century are again more often set in the past. This would be interesting to find out.
Another plausible outcome is that different ways of analyzing the data produce qualitatively different results, and that’s useful to know too!