Thanks for the interesting post. Just FYI: the link for "you can find a GitHub with Jupyter notebooks (and the data) available here", which appears early in the post, didn't work for me. But the link much later in the post for "as always, interested readers can explore other options" does work.
Sorry about that, and thanks for catching it! It should work now.
Thanks for this! Random but related note on your last thought about using LLMs to modify readability: I did some testing on that last summer by prompting GPT-4 to rewrite an AI-generated piece of text to a 9th-grade reading level, and then looking at the Flesch-Kincaid grade level (plus a bunch of other readability metrics) of the rewritten text. A few interesting findings, but the biggest one was how much the grade level of the original text impacted the grade level of the rewritten text -- essentially, GPT-4 was targeting the grade level in relative, rather than absolute, terms. My original texts were in five buckets based on the tone I had told it to target, and their average Flesch-Kincaid grade levels ranged from 9.19 for the "caring and supportive" tone bucket to 15.8 for the "sophisticated" bucket. For the latter bucket, the average score of the *rewritten* text (again, having told GPT-4 to rewrite it at a 9th-grade level) was 13.6. For the former, it was 7.4. (In both cases, n=24.)
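In case it's useful, the scoring step was roughly along these lines. This is just a minimal sketch, assuming something like the textstat package for the metrics; the texts and bucket structure below are placeholders for illustration, not my actual data or prompts.

```python
# Minimal sketch of the scoring step, assuming the textstat package.
# The texts here are short placeholders; in the actual test each bucket
# held 24 GPT-4-generated originals plus their "9th grade level" rewrites.
import statistics
import textstat

originals = {
    "caring and supportive": ["We know this choice can feel overwhelming, and we're here to help you through it."],
    "sophisticated": ["The contemporary landscape encompasses a remarkable diversity of longstanding traditions."],
}
rewrites = {
    "caring and supportive": ["This choice can feel like a lot. We're here to help."],
    "sophisticated": ["Today there are many options, each with its own long history."],
}

for bucket in originals:
    # Average Flesch-Kincaid grade level per tone bucket, before and after rewriting.
    orig_avg = statistics.mean(textstat.flesch_kincaid_grade(t) for t in originals[bucket])
    rewr_avg = statistics.mean(textstat.flesch_kincaid_grade(t) for t in rewrites[bucket])
    print(f"{bucket}: original avg grade {orig_avg:.1f} -> rewritten avg grade {rewr_avg:.1f}")
```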
Thanks for sharing, that's really interesting. So just to check: you first had GPT-4 generate texts based on some tone in the prompt ("caring and supportive" vs. "sophisticated"), then had GPT-4 *rewrite* those texts to a 9th-grade level, and then calculated readability for each of those texts—and the finding is that for those modified texts, scores like Flesch-Kincaid vary according to the original bucket the text was drawn from? I.e., there's some kind of "residual" effect of the original text?
If so, that's intriguing, and I wonder how much that effect could be decomposed into: a) the tone/style/level of the original text vs. b) the content of the original text. I.e., if there's an effect of the original text on the modified text, is that because the original text is about more inherently challenging content (in which case it makes sense that the modified one is still a little more challenging, even if reduced in textual complexity), or because it uses more complex language to start with, controlling for the difficulty of the content?
Do you happen to have this analysis written up anywhere? I'd love to read it if so.
Ah yes, relevant clarification—every piece of text (regardless of the requested tone) was about the same topic: Each one was a roughly 150-200 word introduction for an article explaining different types of saunas. So the informational content of the text was virtually the same. But otherwise yes, your summary is correct!
I don't have any writeup currently, though perhaps I should—just did it for fun. If the data would be of any use to you, though (i.e., the original and rewritten texts, 240 in total, readability metrics for all of them, and the prompts used, plus some basic analysis), I'd be more than happy to share. Otherwise, I'll come back and drop a link here if I do publish an analysis.
Got it. That sounds interesting and definitely relevant! I'd be curious to read a write-up and/or see the data if you're willing to share.