For anyone new to the Counterfactual: each month, paying subscribers can vote for which research topic they’d like to see me focus on for that month. Then, the following month, I’ll publish an article on that topic (e.g., with empirical results). Those reader-sponsored articles will always be publicly available to anyone who wants to read them.
In the past, poll winners have all been original empirical studies using large language models (LLMs) to measure or modify the readability of texts, such as Modifying readability with LLMs (pt. 2).
In general, most topics will be empirical in nature, but they won’t all necessarily involve LLMs, and I also typically include at least one option for a review or explainer piece.
Now it’s time to vote on April’s research topic!
Poll options
Here are the options for this month’s poll (full descriptions below). Whichever option receives the most votes is the one I’ll focus on for the next month. Note that if there’s a close runner-up, I’ll definitely include that option in future polls. And if you have other topics you’d like to see written about, feel free to suggest them in the comments!
Option 1: Winter break hypothesis
This one was the runner-up in my last poll, so I wanted to include it again. Here’s the original description:
Is GPT getting lazier? Some people seem to think so—or at least they did in December 2023.
One specific hypothesis, which feels partly tongue-in-cheek, is that GPT’s “laziness” coincides with time—namely that GPT-4 has “learned” that December is a holiday month, and therefore it doesn’t “want” to work as hard. At least one person (Rob Lynch) has tried to test this “winter break hypothesis”, and found some evidence broadly consistent with it: GPT-4 apparently generates shorter responses when it’s told in the prompt that the month is “December” than when it’s told the month is “May”.
This is interesting and also kind of funny. Now, I’m not fully convinced GPT has gotten measurably “lazier” in the first place. As this article points out, it’s possible that people have simply “habituated” to GPT’s strengths and are noticing when it doesn’t do a good job; it’s also possible that more users simply means more people catching tail events, raising the salience of apparent “laziness”.
But at the very least, this “winter break hypothesis” feels like something that can be empirically tested. Regardless of whether GPT has gotten lazier over time, the claim is that when prompted with information about the month, GPT’s responses will systematically vary according to the prompted month. Apparently others struggled to replicate the analysis I mentioned above, which is partly why I want to look into this.
Here’s what I’d do:
Come up with a small range of simple tasks involving text generation.
Prompt GPT-4 with each of these tasks, along with instructions varying the month (from January through December). Another, orthogonal set of conditions would also add language specifying that the model is “on vacation” (to establish the effect size of this more direct manipulation).
For each reply, measure the length (number of tokens), as in Rob Lynch’s original analysis. Also, for each reply, use GPT-4 to rate the quality of the response (blind to condition).
Ask whether length and (measured) quality vary systematically by month and by the more overt manipulation check (“you’re on vacation!”).
This one’s pretty straightforward too. The main decision point will be coming up with a simple set of question/answer tasks to measure performance on.
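To make that concrete, here’s a rough sketch of what the prompting loop might look like, using the OpenAI Python client. The model name, the example tasks, and the exact prompt wording are placeholders rather than final design decisions.

```python
# Rough sketch of the month-manipulation experiment (illustrative only).
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# the model name, tasks, and prompt wording below are placeholders.
from openai import OpenAI

client = OpenAI()

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

TASKS = [
    "Write a short summary of how photosynthesis works.",
    "Explain the difference between a list and a tuple in Python.",
]

def run_condition(task: str, month: str, on_vacation: bool) -> dict:
    """Prompt the model with a task while varying the stated month (and,
    optionally, an overt 'you are on vacation' manipulation)."""
    system = f"The current month is {month}."
    if on_vacation:
        system += " You are currently on vacation."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ],
    )
    return {
        "month": month,
        "on_vacation": on_vacation,
        "task": task,
        "reply": response.choices[0].message.content,
        "length_tokens": response.usage.completion_tokens,
    }

# One call per task x month x vacation condition.
results = [
    run_condition(task, month, vacation)
    for task in TASKS
    for month in MONTHS
    for vacation in (False, True)
]
```

A second pass would feed each reply back to GPT-4, with no information about the condition, to get a quality rating; the analysis would then test whether length and rated quality vary systematically by month and by the vacation manipulation.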
Option 2: Reasoning about referents
A huge part of communication is reference: referring to things in the world or in our minds. One insight from Cognitive Science is that reference is in part a joint activity: speakers select a label for a referent (e.g., “the old armchair”) that they believe will be comprehensible to their addressee; their addressee, in turn, infers what the speaker was referring to based on what the speaker said.1
Researchers have created an interesting experimental paradigm to study the dynamics underlying the referential process, sometimes called the “tangram game”. In this approach, two human participants are brought into the lab and play a communicative game: one participant (the Director) is tasked with producing a label for a particular object, and the other (the Matcher) is tasked with figuring out which object in an array of objects the Director was referring to. The challenge is that these “objects” are fairly abstract—as depicted below, each one is essentially a black-and-white collection of polygons (a “tangram”).
Directors thus have to come up with a plausible label for the target tangram that would allow the Matcher to figure out which one they’re referring to. For example, one Director, trying to refer to the tangram labeled “I”, used the label:
All right, the next one looks like a person who's ice skating, except they're sticking two arms out in front.
One of the really cool things about this study is that the game is iterative, meaning that Directors/Matchers have to communicate about the same object multiple times. Over time, their descriptions become more efficient, e.g.:
The ice skater.
But even apart from the iterative aspect of the game, there’s something interesting about a task that forces people to use their visual and conceptual reasoning capacities to figure out which abstract array of shapes is the most plausible referent for a label.
In this study, I’ll ask to what extent contemporary LLMs are capable of doing that, using an SVG representation of each tangram. That is, given a human label for one of these tangrams, can an LLM figure out which tangram was intended from an array that also includes several “distractors”? I’ll be using the KiloGram dataset, which was published in 2022.
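As a rough illustration of how that might work, the sketch below gives the model a human-produced label along with the SVG markup for the target tangram and a few distractors, then asks it to pick one. The file paths, prompt wording, and model name here are all hypothetical; only the general shape of the task comes from the tangram paradigm.

```python
# Illustrative sketch of the tangram-matching task (not the final design).
# Assumes each tangram is available as an SVG file on disk; the paths,
# prompt wording, and model name are placeholders.
import random
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def build_prompt(label: str, svg_paths: list[Path]) -> str:
    """Present a human label plus the SVG markup for several candidate
    tangrams, and ask the model to choose the intended referent."""
    options = [f"Tangram {i + 1}:\n{p.read_text()}" for i, p in enumerate(svg_paths)]
    return (
        f'A speaker described one of the tangrams below as: "{label}"\n\n'
        + "\n\n".join(options)
        + f"\n\nWhich tangram (1-{len(svg_paths)}) was the speaker most likely "
        "referring to? Answer with a single number."
    )

# Hypothetical example: one target plus three distractors, shuffled.
target = Path("kilogram/svgs/tangram-001.svg")  # placeholder path
distractors = [Path(f"kilogram/svgs/tangram-{i:03d}.svg") for i in (42, 107, 233)]
candidates = [target] + distractors
random.shuffle(candidates)

prompt = build_prompt("a person who's ice skating, sticking two arms out in front",
                      candidates)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print("Model's choice:", response.choices[0].message.content)
print("Correct answer:", candidates.index(target) + 1)
```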
Option 3: Explainer on tokenization
We often talk about LLMs predicting the next word, but that’s not exactly accurate. In actuality, they’re trained to predict tokens. So what are “tokens”?
The answer is that they’re kind of like words, but not always. Before training an LLM on a text corpus, researchers first tokenize the text. Nowadays, this typically involves using an algorithm like byte pair encoding (BPE), which builds a vocabulary out of frequently occurring character sequences. In some cases, these tokens correspond to words, but in other cases, they correspond to parts of words, which on their own aren’t necessarily meaningful to humans (e.g., “vanquish” → “van” + “quish”).
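For a concrete sense of the mismatch between words and tokens, here’s a tiny example using tiktoken, the tokenizer library OpenAI publishes for its models. The exact splits depend on the tokenizer, so treat the output as illustrative rather than guaranteed.

```python
# Small demo of the word-vs-token mismatch using tiktoken.
# Exact splits depend on the tokenizer; the point is just that a single "word"
# can become several tokens, and those pieces needn't be meaningful on their own.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for word in ["the", "vanquish", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")

# Common words typically map to a single token, while rarer words get broken
# into subword pieces.
```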
If this option is selected, I’ll write an explainer on what tokenization is, why researchers use it, how some of the more common tokenization methods work, and finally, what we’ve learned about how LLMs represent these tokens.
There’s a very vigorous debate within the field about the extent to which speakers and addressees take each other’s perspective into account during this process.