4 Comments

Interesting research, Sean. The point you raise about the mid-level correlations between human and algorithmic measurements of readability, in contrast to the higher r's among the algorithmic measurements themselves, squares with mainstream readability research in the field of reading. But expert human raters exceed algorithms in predictive value. Huey introduced the first serious study of readability based on quantitative methods in 1935. Since then such methods have been reified and then ultimately discredited, and the current methods used to level the demands of texts, say, for the ACT tests rely on a structured qualitative protocol built on mid-level constructs (experiential demands, prior academic participation cued in the text, etc.) rather than micro-constructs like vocabulary and sentence structure. Pragmatically, the assessors are looking for passages that do a good job of predicting future success in college. Although the tests still make use of multiple-choice technology for items, at some cost to validity, they make up for it with a better selection of passages representing a range of genres and stances.

Commercial publishers of K-12 material may still rely on algorithms like SMOG, but there is, after all, some relationship between average word length and the cognitive demands of a text. Still, for pragmatic purposes the best reading science suggests that the weaknesses of quantitative methods can be improved on through expert intersubjective human judgment.

It's very interesting that your raters agreed as much as they did. That in itself is evidence that humans tend to read texts in similar ways. I'm not sure what it means for a bot to rewrite a text at a lower or higher level: in what larger context were the bots instructed to rate the passages? Hard for whom? For an abstract child at a certain grade level? To reduce the average sentence length by a particular percentage?
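
For concreteness, here is a minimal sketch of what a formula like SMOG actually measures, assuming Python, a rough vowel-group heuristic in place of real syllable counting, and an illustrative passage (the function names and sample text are mine, not from the post). The algorithm sees only surface features: polysyllabic word density and sentence count.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups (a heuristic, not dictionary-based)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_grade(text: str) -> float:
    """SMOG grade (McLaughlin, 1969): driven entirely by polysyllabic word density."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    n_sent = max(1, len(sentences))
    return 1.0430 * math.sqrt(polysyllables * (30 / n_sent)) + 3.1291


def mean_sentence_length(text: str) -> float:
    """Average words per sentence: the lever a bot moves when told to 'write simpler'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(lengths) / max(1, len(lengths))


passage = ("The committee deliberated extensively before reaching a consensus. "
           "Its recommendation was accepted.")
print(f"SMOG grade: {smog_grade(passage):.1f}")
print(f"Mean sentence length: {mean_sentence_length(passage):.1f} words")
```

Nothing in either number reflects experiential demands or prior academic participation cued in the text, which is presumably why the structured qualitative protocols exist in the first place.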
