Modifying readability with large language models (pt. 1)
Can GPT-4 Turbo rewrite texts to make them easier or harder to read?
Last month, paying subscribers of the Counterfactual voted for a project exploring whether large language models (LLMs) can modify the readability of texts. It’s a multi-part project that I’ll be reporting on in stages. This month’s report covers stage 1: modifying readability with GPT-4 Turbo and assessing those modified texts using readability formulas (and GPT-4 Turbo).
The same piece of information can be communicated in multiple ways—and some of those ways are easier to understand than others. Jargon has its place, but many writers want to convey their ideas in a way that’s generally accessible. This was the impetus behind the explainer on large language models I co-authored with Timothy Lee of Understanding AI; it’s also the motivation for things like Simple Wikipedia, ELI5, and the idea of “Basic English” more generally.
While some people enjoy this process of rewriting text to make it more readable, other people really don’t. That means there are lots of texts out there that are really difficult to read. In at least some of those cases, it might be useful to have a tool that could automatically make a given piece of text easier to read—while losing as little information as possible.
Large language models (LLMs) are good at producing coherent-seeming text. They’re also, as I wrote last month, pretty good at estimating the readability of a given piece of text. Could LLMs also be useful for modifying text to be easier to read?
In this post, I report on my initial results trying to do just that. As with last month’s post, there’s a GitHub with all the code in a Jupyter notebook if you’d like to replicate these analyses.
What I did: the high-level
Long-time readers of The Counterfactual will know that I emphasize the importance of operationalization in research design. This project is concerned with whether LLMs can be used to modify the readability of text. How should that question be operationalized in a falsifiable way?
One way to think about this is to conceive of the LLM as a kind of “function” that takes some text as input and produces a modified text as output. That modified text should either be easier or harder to read than the input, depending on what we asked the LLM to do. Then, to evaluate how successful this function is, we need a way to measure readability. Here, the gold standard would be human judgments. After all, the best way to find out whether something is easy to read for humans would be to ask some humans. Short of that, we’d need to rely on proxy measures that we think are correlated with human judgments.
That means there are at least three steps here:
Construct this “LLM readability function”.
Determine some proxy measures of readability.
Analyze whether the measures in (2) vary systematically with what the LLM function was asked to do.
Fortunately, step (1) is pretty straightforward. One of the advantages of LLMs is that they can be prompted to do something (e.g., “make this text easier to read”). So rather than writing a bunch of code to modify readability, we can just ask the LLM to do it. Here’s the prompt I used to ask GPT-4 Turbo to make text easier to read:
Read the passage below. Then, rewrite the passage so that it is easier to read.
When making the passage more readable, consider factors such as sentence structure, vocabulary complexity, and overall clarity. However, make sure that the passage conveys the same content.
Finally, try to make the new version approximately the same length as the original version.
These instructions were prepended to every text I presented to the LLM through the OpenAI API, along with a reminder at the end to make the text either easier or harder (depending on the goal).
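For concreteness, here’s a minimal sketch in Python of how that assembly might look. The function name and the exact wording of the end-of-prompt reminder are my own, for illustration; the actual strings are in the notebook.

```python
# A minimal sketch of the prompt assembly for the "Easy" condition.
# The reminder wording below is illustrative, not the exact string used.

INSTRUCTIONS_EASY = (
    "Read the passage below. Then, rewrite the passage so that it is easier to read.\n\n"
    "When making the passage more readable, consider factors such as sentence structure, "
    "vocabulary complexity, and overall clarity. However, make sure that the passage "
    "conveys the same content.\n\n"
    "Finally, try to make the new version approximately the same length as the original version."
)

def build_prompt(passage: str) -> str:
    """Prepend the instructions to a passage and append a reminder about the goal."""
    reminder = "Remember: rewrite the passage so that it is easier to read."
    return f"{INSTRUCTIONS_EASY}\n\nPassage:\n{passage}\n\n{reminder}"
```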
Of course, the trade-off with using LLMs to do this is that we don’t really know how they will interpret these instructions. That’s the point of steps 2-3: we want to quantify whether the outputs an LLM produces when prompted in this way are actually easier or harder to read (again, depending on what we intended). For this initial report, I assessed readability using three readability formulas, as well as asking GPT-4 Turbo itself in a separate prompt. No automated approach to measuring readability is perfect, which is why I tried out multiple approaches—if all the approaches point in the same direction, it should increase our confidence that they’re capturing something real.
Finally, I decided to analyze these results in a couple of ways. First, among the modified texts, do the texts that were produced in the “Easy” condition generally score as easier to read than those produced in the “Hard” condition? That’ll tell us whether the prompt manipulation works to produce systematically different texts. But it could be that GPT-4 Turbo is good at making texts harder to read but not easier, or vice versa. Which brings us to the second question: how much more or less readable are the “Easy” and “Hard” texts than the originals?
This method and analysis approach was all pre-registered on the Open Science Framework, or OSF (see link here).
The details
For those who are interested, I wanted to describe what I actually did in a little more detail.
First, I selected a random subset of 100 texts from the CLEAR Corpus. CLEAR has roughly 5,000 texts in total, and using GPT-4 Turbo to modify all of them would’ve been pretty expensive and time-consuming (much more so than just using Turbo to rate the readability of all of them, which is what I did last month).
Then, for each text, I presented it to GPT-4 Turbo using the OpenAI Python API in two different conditions: an “Easy” condition (using the prompt I described above) and a “Hard” condition (which was the inverse of that prompt). I used a temperature of 0 to encourage Turbo to be as deterministic as possible in its responses, and I allowed it to generate as many tokens as the number of tokens in the input passage, plus a “buffer” of 5. (I measured the number of tokens using tiktoken, a Python library.)
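Here’s roughly what that generation step looks like, sketched with the OpenAI Python client; the model string and message framing are assumptions on my part, and the exact code is in the notebook.

```python
# A sketch of the generation step (model name and message framing assumed).
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-class models

def modify_passage(passage: str, prompt: str) -> str:
    # Allow as many output tokens as the input passage contains, plus a buffer of 5.
    max_tokens = len(enc.encode(passage)) + 5
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # as deterministic as possible
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```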
For each modified text, I measured readability using the textstat package. I focused on three measures: Automated Readability Index, SMOG, and Flesch-Kincaid. I also elicited estimates from GPT-4 Turbo using the same approach I described in last month’s report; that approach was more strongly correlated with human judgments than the other metrics were.
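For reference, the formula-based scoring is only a few lines with textstat. A sketch (I’m showing the Flesch reading-ease variant, since the figures treat the Flesch score as “higher = easier”; the exact function calls are in the notebook):

```python
# A sketch of the formula-based scoring with textstat. SMOG and ARI are
# grade-level-style difficulty scores (higher = harder); Flesch reading ease
# runs the other way (higher = easier).
import textstat

def score_passage(text: str) -> dict:
    return {
        "ari": textstat.automated_readability_index(text),
        "smog": textstat.smog_index(text),
        "flesch": textstat.flesch_reading_ease(text),
    }
```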
Finally, I used the statsmodels package in Python to build and interpret the linear regression models for testing my hypotheses.
The caveats
As with any study, there are some limitations of this approach. I’m going to quickly focus on two of the most glaring issues here, in part because these are also the issues I plan to address in the next round of studies.
The first concern is that these readability metrics are just that—metrics—and so they may not capture what we actually care about. Last month’s report suggests that they track human judgments of readability pretty well, but maybe they diverge in important and systematic ways. That’s why I also plan to supplement the work presented here with actual human judgments.
The second concern is that these metrics don’t capture the appropriateness of the modified text. Does it lose critical information? How much? Reducing all of Principia Mathematica to a single, imprecise sentence (“It’s about math”) would certainly make it easier to read, but hardly does justice to the original text. Thus, I’m also going to collect human (and LLM) judgments to try to assess whether and how much information is lost in translation.
With those caveats in mind, let’s turn to the actual results!
What I found
How did GPT-4 Turbo do?
Recall that I approached this question in two different ways:
Among the modified texts, do the “Easy” ones actually score as easier than the “Hard” ones, as measured by the four readability indices?
If we compare the “Easy” and “Hard” texts to the original excerpts, do they score as different (in the expected direction)? (That is, are “Easy” texts easier and are “Hard” texts harder?)
The sections below discuss each of these approaches in turn, then dive into some related results.
“Easy” vs. “Hard”
Let’s start with the first question. To analyze this statistically, I built a regression model for each readability metric that looked something like this:
Readability (Modified) ~ Goal (Easy/Hard) + Readability (Original) + Difference in Length
The key question was whether, for each metric, goal was significantly predictive of the modified text’s readability.1 I included the other two predictors primarily as controls.
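In statsmodels’ formula syntax, that model looks something like the sketch below; the DataFrame and column names are mine, for illustration, and may differ from the notebook’s.

```python
# A sketch of the regression above, assuming a DataFrame with one row per
# modified text and the column names shown.
import pandas as pd
import statsmodels.formula.api as smf

def fit_goal_model(df: pd.DataFrame):
    """Columns assumed: readability_modified, readability_original,
    goal ('easy' or 'hard'), and length_diff (modified minus original length)."""
    model = smf.ols(
        "readability_modified ~ C(goal) + readability_original + length_diff",
        data=df,
    ).fit()
    return model  # the C(goal) coefficient is the key test
```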
For each metric, the coefficient for goal was indeed significant, even after correcting for multiple comparisons. That is, the probability of obtaining a coefficient at least that large if the true relationship were zero (the “p-value”) was very small.
Perhaps more importantly—and more persuasively—you can also see the differences by eye. The leftmost panel in the figure below shows the readability of the “Easy” and “Hard” modified texts, broken down by the metric used to assess readability. Note that these are all on different scales, and two of the metrics (SMOG and ARI) measure difficulty rather than readability (i.e., their scales are inverted).
Nonetheless, I wanted to show the raw data to make it clear that none of the condition-wise differences I’m mentioning come about purely by virtue of transforming the data in mysterious ways. In each case, there’s a clear, visible difference between the readability score for the “Easy” and “Hard” texts.
For an even clearer picture, you can also look at the middle panel: this shows the result of z-scoring each of the raw readability scores within each metric, and then inverting the scores for SMOG and ARI so that the directionality is the same. The advantage of z-scoring is that all of the scores are now on the same scale, where the numbers on the y-axis can be interpreted in terms of standard deviations (i.e., 1 = 1 standard deviation above the mean of the distribution). Here, the results are even more obvious: on average, the texts produced in the “Easy” condition score about one standard deviation higher in readability than the average modified text, and the texts produced in the “Hard” condition score about one standard deviation lower.
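If you want to reproduce that middle panel, the z-scoring step is straightforward; here’s a sketch assuming a long-format table with one row per (text, metric) pair and column names of my own choosing.

```python
# A sketch of the z-scoring used for the middle panel: standardize within each
# metric, then flip SMOG and ARI so that higher always means "more readable."
import pandas as pd

def zscore_and_align(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["z"] = out.groupby("metric")["score"].transform(
        lambda s: (s - s.mean()) / s.std()
    )
    out.loc[out["metric"].isin(["smog", "ari"]), "z"] *= -1
    return out
```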
In essence: the manipulation worked! Easier texts score as easier, and harder texts score as harder.
What about that rightmost panel? That’s helpful for visualizing the relative change in readability from the baseline, which I’ll turn to next.
What about the baseline?
The analysis above doesn’t show that the “Easy” texts are actually easier than the original text—just that they’re easier than the “Hard” modified texts.
To get at this second question, I calculated the difference between each modified text’s readability and the original passage’s readability (for each metric in turn). Then, focusing on the “Easy” and “Hard” cases separately, I asked whether that difference was significantly different from zero.2 Again, for each metric, I found a significant result: “Easy” texts tended to be more readable than their baselines, and “Hard” texts tended to be less readable.
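Concretely, that test amounts to something like the sketch below, run separately for each goal and each metric (column names are again my own; see footnote 2 for the paired t-test check).

```python
# A sketch of the baseline comparison: an intercept-only model on the difference
# scores, with a paired t-test as a sanity check.
import statsmodels.formula.api as smf
from scipy import stats

def test_against_baseline(df):
    """df: rows for a single goal ('easy' or 'hard') and a single metric, with
    columns readability_modified and readability_original (names assumed)."""
    df = df.assign(diff=df["readability_modified"] - df["readability_original"])
    intercept_only = smf.ols("diff ~ 1", data=df).fit()  # is the mean difference != 0?
    t_stat, p_val = stats.ttest_rel(df["readability_modified"], df["readability_original"])
    return intercept_only, (t_stat, p_val)
```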
That said, this effect was definitely less noticeable for the “Easy” texts than for the “Hard” texts. If you look at the rightmost panel in Figure 1, you’ll see that the difference scores are pretty close to zero for all the “Easy” cases (regardless of which metric was used) but clearly non-zero for all the “Hard” cases (again, regardless of which metric was used).
This suggests that GPT-4 Turbo did a better job of making texts harder than of making them easier. That could be because the texts were all already pretty easy, or it could be that it’s simply easier to make things more complicated than to make them more straightforward. The latter explanation certainly makes sense to me: it takes a lot more effort to write something that’s clear and comprehensible than to write something obscure and uninterpretable. Another possibility is that my instructions to use roughly the same number of words had an unintended effect: maybe in some cases, Turbo could’ve rewritten the original text using fewer (and simpler) words, but it used more words than it needed to because of those instructions. Either way, it’s an interesting result that I think merits some follow-up work to figure out exactly what’s going on here.
Comparing to the original text’s readability directly
Inspired by one of the comments on last month’s report (from Megan Skalbeck), I was also curious whether the readability of the original text was correlated with the readability of the modified text. That is, even if the modified text is easier (or harder) to read, does the fact that the original text was particularly easy or difficult to read show up in the modified text as well—a kind of lingering, “residual” effect of the original text’s complexity or simplicity?
The short answer: yes! As shown in Figure 2 below, there’s a clear, positive relationship between the readability score for the original text and the readability score for the modified text. This shows up over and above the effect of the prompt manipulation.
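If you’d like to check this yourself, the correlation is easy to compute from the scores on GitHub; here’s a quick sketch (column names assumed).

```python
# A quick sketch of the "residual effect" check: correlate original and modified
# readability scores within each goal condition.
import pandas as pd
from scipy import stats

def original_modified_correlation(df: pd.DataFrame) -> dict:
    results = {}
    for goal, sub in df.groupby("goal"):
        r, p = stats.pearsonr(sub["readability_original"], sub["readability_modified"])
        results[goal] = (r, p)
    return results
```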
Qualitative inspection
If you’re anything like me, you’re probably wondering what these modified texts look like. The modified texts are all available on GitHub if you want to dig into them yourself, but I did comb through some of them and find some interesting examples to highlight here.
It’s sometimes difficult to know where to start with qualitative analysis, so I looked for the “best” and “worst” examples of the “Easy” and “Hard” modifications (where “best” means an example whose score shifted furthest in the intended direction, and “worst” means one whose score shifted in the wrong direction).
I’ll start with one of the most successful “Hard” modifications, because I think it’s pretty funny—and a good indicator of what it means to make a text uselessly hard to read. For reference, here’s a sample of the original text, which is from the eBook for “Crusaders of New France” (available at Project Gutenberg).3
[Original] In the seventeenth century, moreover, France owed much of her national power to a highly-centralized and closely-knit scheme of government. Under Richelieu the strength of the monarchy had been enhanced and the power of the nobility broken.
And here’s part of that same passage, modified by GPT-4 Turbo to be harder to read:
[Modified - Hard] In the penultimate century's antepenultimate decade, France's national potency was indebted to an exceedingly centralized, intricately interwoven governmental structure. Under the aegis of Richelieu, the monarchy's robustness was augmented, concurrently fracturing the aristocracy's influence.
Same ideas, just much harder to read (at least to me). Which is encouraging, because it means that these metrics do seem to be tracking something “real”—and, of course, the prompt manipulation itself resulted in a noticeable change in the text’s readability.
Just for completeness, here’s the easy version of that passage:
[Modified - Easy] In the 1600s, France became a powerful nation thanks to a government that was very centralized and well-organized. Cardinal Richelieu had made the king's authority stronger and reduced the power of the nobles.
Definitely easier than the hard version. But I’m not sure whether it’s easier than the original, which I suppose is consistent with the quantitative results from above: Turbo’s “Easy” and “Hard” modifications are meaningfully different, but the “Easy” ones aren’t always so much easier than the original.
Are there any cases where GPT-4 Turbo does make something easier to read? Let’s take a look at the passage for which the gap in readability between the “Easy” modification and the original text is largest. The original passage is from “A Bed Of Nettles” by Grant Allen (also available at Project Gutenberg), which I’ve reproduced in part below:
[Original] But the sting certainly does not exhaust the whole philosophy of the nettle. Look, for example, at the stem and leaves. The nettle has found its chance in life, its one fitting vacancy, among the ditches and waste-places by roadsides or near cottages; and it has laid itself out for the circumstances in which it lives.
And the modified version:
[Modified - Easy] The nettle's story is more than just its sting. Take a look at its stem and leaves. The nettle has found its perfect spot in life, thriving in ditches and neglected areas by roads or near homes. It's perfectly suited for these environments.
This one does seem easier to me. It’s also, unfortunately, a little less interesting to read: “its perfect spot in life” is, at least to me, a less unique or evocative turn of phrase than “its one fitting vacancy”. Making things more readable comes at a cost! But it’s also notable, nonetheless, that Turbo seems to have identified specific strategies for modifying readability: infrequent words are replaced with more frequent ones (“cottages” —> “homes”), and long sentences are made shorter.
What about the cases where Turbo does badly? Are there any cases where the “Easy” text scores as harder than the original, or where the “Hard” text scores as easier than the original?
One of the passages where Turbo did the worst—i.e., the “Easy” version scored as harder than the original—was taken from this free kids’ book about fruit. Here’s part of the original passage, which, being a kids’ book, is already pretty easy to read:
[Original] Fruit grows on plants all over the world. Different fruits come from different countries. These are apricots from Armenia. Fruits and vegetables are different. Fruits have seeds but vegetables don't. Some seeds are very big, like in a peach. Some seeds are tiny, like in a kiwi. Fruits have seeds to make new plants.
Short words and short sentences throughout. Here’s the modified version:
[Modified - Easy] Fruits grow on plants all around the world, and they come from many different places. For example, apricots are from Armenia. It's important to know that fruits are not the same as vegetables. The main difference is that fruits have seeds, while vegetables usually don't. Seeds can vary in size; peach seeds are big, but kiwi seeds are very small.
Still seems pretty straightforward, but it does score as slightly harder to read, using both GPT-4 Turbo’s own ratings and the Flesch-Kincaid score. My guess is that this is because the sentences in the modified version are actually longer, on average, than the sentences in the original version. It includes clauses like “It’s important to know…” and “The main difference is …”.
I also looked for cases where the “Hard” version scored as easier to read than the original, but actually couldn’t find any (at least using Turbo’s ratings). However, the case with the smallest change in the expected direction was based on this passage from the Wikipedia article on temperature-programmed reduction (TPR):
[Original] Temperature-programmed reduction (TPR) is a technique for the characterization of solid materials and is often used in the field of heterogeneous catalysis to find the most efficient reduction conditions, an oxidized catalyst precursor is submitted to a programmed temperature rise while a reducing gas mixture is flowed over it.
That’s pretty technical. Here, Turbo’s “Hard” version is equally inscrutable, though it scores just a tad easier, according to Turbo itself:
[Modified - Hard] Temperature-programmed reduction (TPR), an analytical methodology for the elucidation of solid-state substances' characteristics, predominantly finds its utility within the domain of heterogeneous catalysis for the optimization of efficacious reduction parameters. In this process, an oxidized catalyst precursor undergoes a meticulously orchestrated thermal escalation under the influence of a reductive gaseous amalgam that is percolated over the substance.
I’m not sure I agree with Turbo’s assessment, personally. Both passages are clearly very technical, but I do think the “Hard” version ends up being even more difficult to read—perhaps needlessly so, though I’d need a TPR expert to weigh in to be sure. So maybe this is a win for Turbo’s modification (the Hard one really does seem harder) but not its assessment (the Hard one was assessed as easier)?
Out of curiosity, I also checked out the “Easy” version of that same passage:
[Modified - Easy] Temperature-programmed reduction (TPR) is a method used to study solid materials, especially in the area of creating better catalysts for chemical reactions. It helps to determine the best conditions for turning an oxidized catalyst into its active form. In this process, the catalyst is gradually heated while a mix of gases that can cause reduction is passed over it.
This one seems easier than the original text. I still don’t really understand what it’s talking about, but I feel like it gets me a bit closer than the original (or the “Hard” one). Of course, it’s also possible it loses some vital information—I’m not a subject-matter expert, so I’d have no way of knowing! That’s exactly why we’d need a follow-up study assessing whether and how much nuance was lost in the simplification.
What we learned
Based on February’s poll, my goal here was to assess how well LLMs could modify the readability of text.
Here’s a quick recap of what that involved:
I selected a random subset of text excerpts from the CLEAR corpus.
For each text, I used GPT-4 Turbo to make an “Easy” and “Hard” version of that excerpt.
I then calculated the readability of those modified text excerpts using several different readability metrics; one of those metrics involved using Turbo itself, since last month’s report showed it was relatively successful at predicting human judgments.
I found that Turbo’s modifications generally “worked”, i.e., “Easy” texts scored as easier than “Hard” texts across all metrics, and both scored as different from the original text in the expected direction (though “Easy” texts less so than “Hard” ones).
I also found that the readability of modified texts was positively correlated with the readability of the original text, suggesting a lingering effect of the original text’s content or complexity.
Combined with last month’s report, then, my conclusion is that GPT-4 Turbo is pretty good at both measuring and modifying the readability of text.
But as I’ve mentioned throughout, there are some key limitations to the current work. First, these readability scores are proxies for human judgments. Do the results hold up with actual human ratings of readability? And second, the modified versions were assessed only for readability—not informativity. Using LLMs to make texts more readable wouldn’t be very helpful if the modified versions lost all the information from the original. My goal is to address one or ideally both of these issues for next month’s report, which will also involve collecting human judgments in an online experiment.
As always, I welcome any comments, questions, or ideas on this work! Please let me know what you think by leaving a comment (or sending an email)—I always appreciate reading feedback.
Note that because this was effectively the same “hypothesis” being tested in multiple ways (different dependent variables), I also applied Holm-Bonferroni corrections for multiple comparisons.
Under the hood, this amounted to building an intercept-only regression model predicting that difference score and asking whether the intercept was significantly different from zero. Just to confirm that this approach made sense, I also conducted a paired sample t-test in each case, i.e., comparing “Easy” vs. original and “Hard” vs. original. The results were qualitatively the same, again correcting for multiple comparisons.
Note that in all these passages I’ll be showing, the full excerpt that Turbo rated and modified is a fair bit longer—I’m showing a truncated version just to get the point across.