Interesting research, Sean. The pattern you raise, moderate correlations between the human and algorithmic measurements of readability in contrast to the higher r’s among the algorithmic measurements themselves, squares with mainstream readability research in the field of reading. But expert human raters exceed algorithms in predictive value. Huey introduced the first serious study of readability based on quantitative methods in 1935. Since then, such methods have been reified and ultimately discredited, and the current methods used to level the demands of texts, say, for the ACT tests follow a structured qualitative protocol built on mid-level constructs (experiential demands, prior academic participation cued in the text, etc.) rather than micro-constructs like vocabulary and sentence structure. Pragmatically, the assessors are looking for passages that do a good job of predicting future success in college. Although the tests still rely on multiple-choice technology for the items, with some loss of validity, they make up for it through better selection of passages representing a range of genres and stances.

Commercial publishers of K-12 material may still rely on formulas like SMOG, but there is, after all, some relationship between average word length and the cognitive demands of a text. Still, for pragmatic purposes, the best reading science points to weaknesses in quantitative methods that can be mitigated through expert, intersubjective human judgment. It’s very interesting that your raters agreed as much as they did; that in itself is evidence that humans tend to read texts in similar ways.

I’m not sure what it means for a bot to rewrite a text at a lower or higher level—in what larger context were the bots instructed to rate the passages? Hard for whom? For an abstract child at a certain grade level? To reduce the average sentence length by a particular percentage?
Thank you, Terry! Great points. I didn't know those details about the analysis behind ACT passage selection, so that was very interesting.
>> But expert human raters exceed algorithms in predictive value.
Agreed, and I think that's aligned with the finding that all the metrics correlate more with *each other* than with the human ratings. (And of course, that's also separate from the question of whether I'm eliciting readability judgments even from humans in the appropriate way.)
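For concreteness, here's a minimal sketch of the kind of comparison behind that observation. The metric scores and human ratings below are invented purely for illustration, not my actual data:

```python
import numpy as np

# Hypothetical scores for five passages: three automated, grade-level-style
# metrics plus a mean human difficulty rating. All numbers are made up.
scores = np.array([
    # FKGL   SMOG   ARI   human
    [ 5.2,   6.1,   5.0,   4.5],
    [ 8.7,   9.0,   8.4,   5.0],
    [ 3.1,   4.2,   2.9,   3.5],
    [11.4,  11.9,  12.1,   6.0],
    [ 7.0,   7.8,   6.6,   6.5],
])
labels = ["FKGL", "SMOG", "ARI", "human"]

# Pearson correlations between columns.
corr = np.corrcoef(scores, rowvar=False)

for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"r({labels[i]}, {labels[j]}) = {corr[i, j]:.2f}")
# In this toy data the metric-metric r's come out higher than the
# metric-human r's, mirroring the pattern described above.
```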
>> I’m not sure what it means for a bot to rewrite a text at a lower or higher level—in what larger context were the bots instructed to rate the passages? Hard for whom? For an abstract child at a certain grade level? To reduce the average sentence length by a particular percentage?
Agreed, and I think this is one of the key conceptual challenges I've faced in taking on this project. Although I have a background in psycholinguistics and LLMs, I don't have a background in readability specifically, so it's been a bit of a learning curve, and part of that curve has been grappling with what it really means for something to be more or less readable. As you note: more or less readable *for whom*? And under which conditions?
I'm mulling over a follow-up (conceptual) post discussing the very construct of readability, i.e., whether it's a single thing or whether we ought to treat it as *readabilities*. If you have any references or resources you think I'd benefit from reading, I'd really appreciate seeing them. Thanks again!
Keep in touch here or email me at tlunder@csus.edu.
Omg, I have been studying readability for years. Some recent stuff has been published, and I'll get you some citations. Psycholinguistics has informed theories of readability as a measurement construct, but so far as I know it hasn't moved us much beyond sampling sentence length as a proxy for structural complexity and word length as a proxy for semantic complexity. Much of the issue is this: texts can be challenging because of many more subjective factors, like motivation (as distinct from interest), genre, social and cultural expectations, etc. Since 2006, reading and literacy folks have pretty much thrown in the towel on any principled way to assign more than featherweight status to the formulas. Reading is a sociocultural practice with cognitive elements, one that is always situated and involves experiential knowledge, social status relationships, and belief systems.
https://www.researchgate.net/publication/378234167_The_State_of_the_Field_Qualitative_Analyses_of_Text_Complexity
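To make the proxy idea concrete (sentence length standing in for structural complexity, word length for semantic complexity), here is a toy sketch of a formula in that family. It uses the published coefficients of the Automated Readability Index, but the tokenization is deliberately naive and the sample texts are invented, so treat it as an illustration rather than a reference implementation:

```python
import re

def automated_readability_index(text: str) -> float:
    """Surface-feature readability estimate: characters per word stands in
    for semantic complexity, words per sentence for structural complexity.
    Tokenization is naive: sentences split on .!?, words matched as letter runs."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    chars = sum(len(w) for w in words)
    # Published ARI coefficients.
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

easy = "The cat sat on the mat. The dog ran to the park."
hard = ("Notwithstanding considerable methodological heterogeneity, "
        "longitudinal investigations corroborate the hypothesis.")
print(round(automated_readability_index(easy), 1))   # very low (even negative) for simple text
print(round(automated_readability_index(hard), 1))   # much higher estimate
```

Formulas like this capture only those surface proxies; none of the situated, sociocultural factors Terry describes enter into the score.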