People are worried about ChatGPT.
One specific concern I’ve come across is that it is increasingly difficult to distinguish text generated by a Large Language Model (LLM), such as ChatGPT, from text generated by a human being.
Many educators worry that students will use LLMs to write their papers or solve their programming assignments1. Others worry that LLMs will be used to generate propaganda (or just good-old-fashioned spam) on an unprecedented scale. That’s the concern voiced by Gary Marcus in this interview with Ezra Klein:
GARY MARCUS: And you also kind of laid bare the darkest version that I see in the short term, which is personalized propaganda. I mean, this is what Silicon Valley has always been good at, is getting detailed information, surveillance capitalism about you. And now you can plug that into GPT or something maybe a little bit more sophisticated and write targeted propaganda all day long. I mean, this is Orwellian, and it’s not implausible.
Because of these concerns, there’s substantial interest in developing systems that can detect whether a piece of content was generated by an LLM. The field of synthetic text detection is not new, but the massive popularity of ChatGPT has certainly raised the salience of the problem. There’s even a new web app called GPTZero that claims to identify whether a piece of text was machine-generated.
I’m going to set aside the (in some ways more interesting) question of whether this is desirable. I’ll focus instead on whether and to what extent it seems feasible.
What proposals are on the table?
This recent (December 2022) article lays out some of the dominant proposals for detecting synthetic text; much of the material is drawn from this literature review on the same topic.
Based on both that article and the literature review, I’d group the approaches I’ve read about into three broad categories.
Interpretable feature-based methods
These methods use an interpretable linguistic feature (or features) to help a computer classify whether a piece of text was generated by a human or machine.
One of the most conceptually interesting proposals here is to rely on differences in the frequency distribution of words in human-generated vs. machine-generated text. Word frequencies in human-generated text tend to follow a power-law distribution called Zipf’s law. Roughly, a word’s frequency is inversely proportional to its rank in a frequency table. From Wikipedia (bolding mine):
Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
Apparently, some machine-generated text doesn’t follow this distribution. In theory, then, one could (see the sketch just after these steps):
Calculate the frequency distribution D of words for a given piece of text.
Compare this distribution to some idealized distribution Z.
If D is sufficiently different from Z, label that text as machine-generated.
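Here’s a minimal sketch of what that might look like, assuming a crude regex tokenizer and a divergence threshold I’ve invented purely for illustration (a real detector would need to calibrate all of this on actual corpora):

```python
import math
import re
from collections import Counter

def zipf_divergence(text: str) -> float:
    """Compare a text's word-frequency distribution to an idealized Zipf curve.

    Returns the mean absolute difference (in log space) between the observed
    frequency at each rank and the frequency Zipf's law would predict.
    Smaller values mean the text looks more "Zipf-like".
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return 0.0
    total = sum(counts.values())

    # Observed frequencies, sorted by rank (most frequent word first).
    observed = [count / total for _, count in counts.most_common()]

    # Idealized Zipf distribution over the same number of ranks:
    # frequency proportional to 1 / rank, normalized to sum to 1.
    n = len(observed)
    harmonic = sum(1 / r for r in range(1, n + 1))
    ideal = [(1 / r) / harmonic for r in range(1, n + 1)]

    return sum(abs(math.log(o) - math.log(z)) for o, z in zip(observed, ideal)) / n

# Hypothetical decision rule: flag text whose distribution strays "too far" from Zipf.
ZIPF_THRESHOLD = 0.5  # placeholder; would need calibration on real data

def looks_machine_generated(text: str) -> bool:
    return zipf_divergence(text) > ZIPF_THRESHOLD
```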
Other, related approaches rely not just on the frequency of particular words but on sequences of words. Many LLMs––at least ones older than ChatGPT––generate more repetitive text than humans, i.e., they reuse the same sets of two or three words many times (though this isn’t always true for longer sequences of words, as I’ll discuss later).
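As a rough illustration of the repetition idea, one could compute the fraction of short word sequences that a text reuses (a sketch only; where to draw the line between “normal” and “suspiciously repetitive” is an empirical question I’m not answering here):

```python
from collections import Counter

def repeated_ngram_fraction(words: list[str], n: int = 3) -> float:
    """Fraction of n-grams in a text that occur more than once.

    Higher values indicate more repetitive text, which (for some older LLMs,
    at least) is a hint that the text may be machine-generated.
    """
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)
```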
The article mentions a few other feature-based approaches too, such as the number of typos (humans make more mistakes than ChatGPT) and the “fluency” of the text (older LLMs generate less coherent paragraphs).
All these approaches have in common two assumptions:
There is something different between human-generated and machine-generated text.
This “something different” is in some sense an interpretable feature, i.e., something that a human could realistically measure and describe.
Putting these assumptions together: the hope is that this measurable, interpretable feature can be used to help a classifier sort text (words, sentences, paragraphs, or even entire books) into two categories: Human or Machine.
Black box approaches
The second approach shares at least one assumption with the first: namely, that there is something different between human-generated and machine-generated text, on average.
But unlike the first approach, the “black box” method throws its hands up with respect to the interpretability of this something different. Instead, it passes the buck to other LLMs, which is not unlike asking a former thief to help catch a criminal (e.g., Frank Abagnale, the inspiration for Catch Me If You Can).
LLMs can be used to produce high-dimensional embeddings of text input. These embeddings are often very hard to interpret, but they can be used in various classification tasks, such as identifying whether a sentence expresses positive or negative sentiment. And synthetic text detection is, of course, just another classification task: a system must sort inputs into one of two piles, ideally giving some kind of confidence score as to its certainty about which pile a piece of text belongs to.
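In code, the overall shape of such a detector might look something like this. To keep the sketch runnable I’m using a hashing vectorizer as a stand-in for the LLM encoder, plus two toy training examples; in a real system, the features would be embeddings from an actual LLM and the training set would be a large corpus of text with known provenance:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data (0 = human-written, 1 = machine-generated).
train_texts = [
    "The cat knocked my coffee over again this morning, naturally.",
    "As an AI language model, I am happy to assist with that request.",
]
train_labels = [0, 1]

# The pipeline mirrors the black box recipe: map text to high-dimensional,
# hard-to-interpret features, then fit a classifier on top of them.
detector = make_pipeline(HashingVectorizer(), LogisticRegression())
detector.fit(train_texts, train_labels)

def probability_machine_generated(text: str) -> float:
    """Return the classifier's confidence that `text` was machine-generated."""
    return float(detector.predict_proba([text])[0, 1])

print(probability_machine_generated("I would be happy to help with that."))
```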
The reason I’m distinguishing this approach from the first is that it remains agnostic with respect to which features are useful for classifying text as machine-generated or not. Advocates of this approach would argue that this makes it much more powerful: hand-coding features always runs the risk that these features will be wrong––or discovered by bad actors––whereas an LLM is much more likely to discover patterns that may be indiscernible to humans, or at least very hard to codify.
Opponents, I assume, would argue that the uninterpretable nature of this approach is a big problem: the “detective” LLM might be relying on spurious patterns (thus incorrectly classifying many humans as LLMs), which are harder to detect precisely because the whole system is a black box.
Watermark approaches
The third approach is very different from the first two. It also seems the most interesting and promising to me––at least in some ways.
Rather than assume there’s something measurably different about human-generated vs. machine-generated text, this approach makes that difference explicit with a watermark. Here’s Scott Aaronson’s description of what he’s working on at OpenAI:
Basically, whenever GPT generates some long text, we want there to be an otherwise unnoticeable secret signal in its choices of words, which you can use to prove later that, yes, this came from GPT. We want it to be much harder to take a GPT output and pass it off as if it came from a human. This could be helpful for preventing academic plagiarism, obviously, but also, for example, mass generation of propaganda—you know, spamming every blog with seemingly on-topic comments supporting Russia’s invasion of Ukraine, without even a building full of trolls in Moscow. Or impersonating someone’s writing style in order to incriminate them. These are all things one might want to make harder, right?
As I understand it, his watermarking proposal is not unlike techniques from cryptography. Specifically, when generating text, the LLM uses a pseudo-random function to select each token in sequence; this function has a secret “key” known only to the developer, such as OpenAI. This shouldn’t really affect the output, but if the function is biased in a particular, very subtle way, then knowing the key should allow you to detect whether a piece of text was generated using that function:
But now you can choose a pseudorandom function that secretly biases a certain score—a sum over a certain function g evaluated at each n-gram (sequence of n consecutive tokens), for some small n—which score you can also compute if you know the key for this pseudorandom function.
Again, the idea is that this shouldn’t really degrade the quality of an LLM’s output. Like some real-world watermarks, it should be undetectable to an average reader (whether human or machine). But to someone who possesses the key, the watermark will make it very clear whether a string of words was LLM-generated.
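To make the shape of the idea a bit more concrete, here’s a toy sketch of the detection side (my own illustration of a keyed n-gram score, not OpenAI’s actual scheme; the threshold is arbitrary, and the generation-side biasing is only gestured at in a comment):

```python
import hashlib
import hmac

def g(ngram: tuple[str, ...], key: bytes) -> int:
    """Keyed pseudorandom bit for an n-gram: 0 or 1, effectively uniform without the key.

    During generation, the watermarked model would be nudged, very slightly,
    toward next tokens that make this bit come out 1. Without the key the bias
    is statistically invisible; with the key it's easy to measure.
    """
    digest = hmac.new(key, " ".join(ngram).encode(), hashlib.sha256).digest()
    return digest[0] & 1

def watermark_score(tokens: list[str], key: bytes, n: int = 4) -> float:
    """Average of g over all n-grams; hovers around 0.5 for unwatermarked text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.5
    return sum(g(ng, key) for ng in ngrams) / len(ngrams)

# Hypothetical detection rule: a score well above 0.5, computed over enough
# tokens, is strong evidence the text came from the watermarked model.
def looks_watermarked(tokens: list[str], key: bytes) -> bool:
    return watermark_score(tokens, key) > 0.65  # placeholder threshold
```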
Each approach has drawbacks
Unfortunately, all of these proposals have pretty clear limitations.
Interpretable linguistic features: discoverable and unreliable
Using interpretable linguistic features runs at least two risks. First, more easily interpretable features are arguably more discoverable by bad actors––and therefore more circumventable. Let’s say Reddit administrators begin using the frequency distribution of words to identify LLM-generated content. Someone who wants to post LLM-generated propaganda could presumably figure that out––it’s an interpretable linguistic feature, after all––and hack together a system that always produces content following Zipf’s law.
Second, some of the proposed linguistic features may simply be very unreliable. For example, although it seems intuitive that humans would generate more novel content than LLMs, a recent (2021) paper suggests that this isn’t always true. One way to measure this is to count up all the LLM-generated word sequences of length n (e.g., two words, three words, etc.), and ask what proportion of those were seen in the LLM’s training data. As you’d expect, smaller values of n yield fewer novel sequences––in the limit, of course, a sequence of length n = 1 can’t be novel unless you’re coining new words. But as n increases, the proportion of novel word sequences also increases. Critically, for larger values of n (e.g., n > 7), the LLM tested actually generated more novel sequences than the human baseline.
Now, I have some concerns about exactly how that human baseline was calculated. But nevertheless, these data suggest that “novelty” is not necessarily a reliable indicator of whether text was produced by an LLM. So a classifier using novelty to categorize text as machine-generated vs. human-generated may produce quite a few false positives (i.e., human-generated text classified as LLM-generated) as well as false negatives (i.e., LLM-generated text classified as human-generated).
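For concreteness, here’s roughly what the novelty measurement described above looks like in code (a sketch under obvious simplifying assumptions: in practice a training corpus is far too large to hold its n-grams in a Python set, but the logic is the same):

```python
def novel_ngram_proportion(generated: list[str],
                           training_ngrams: set[tuple[str, ...]],
                           n: int) -> float:
    """Proportion of n-grams in `generated` that never appear in the training data."""
    ngrams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    if not ngrams:
        return 0.0
    novel = sum(1 for ng in ngrams if ng not in training_ngrams)
    return novel / len(ngrams)
```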
Black box approaches: opaque and still ultimately gameable
I’m also somewhat pessimistic when it comes to black box approaches. It’s true that using an LLM to “catch” other LLMs gives us more flexibility than relying on hand-coded linguistic features. But it still seems possible to game: a bad actor could query the classifier with all sorts of text in an effort to decode the underlying function––they could even have another LLM trying to “outsmart” the detective LLM. (This is not too different from how generative adversarial networks work, incidentally.) In this case, I suppose it’s primarily a matter of trying to stay ahead, detection-wise (and hoping you’re not just catching up). But again, this assumes that there always will be some function that discriminates between human-generated and LLM-generated text. And I’m not confident that’s true.
Building on this concern about reliability, I worry about the societal impacts of classifying pieces of text––many of which might’ve been written by a human––as LLM-generated (or assigning them a probability of being LLM-generated). False positives are inevitable. And they’re especially concerning when the classifier’s underlying function is so opaque, as it would be with black box approaches. How would it feel to have your Reddit post removed automatically because it was classified––by an LLM, ironically––as “likely LLM-generated”?
Watermark approaches: reliant on good faith actors
Watermark approaches seem the most secure, assuming I’m understanding the underlying technology correctly. They also sidestep the problems I’ve listed above.
However, they do have a different kind of limitation. A watermark must be engineered into the model by its developers, which assumes that those developers want the model’s output to be detectable as LLM-generated. In the case of OpenAI, that assumption is correct: the developers at OpenAI are also concerned about nefarious uses of LLMs, so they’ve invested in engineering solutions (like watermarks) to avoid that outcome.
But that assumption seems much more tenuous if we consider the space of all people or organizations who could realistically train an LLM. The cost of training LLMs will likely continue to fall as the process is streamlined and made more efficient––for example, Mosaic estimates that an LLM of GPT-3 quality could be trained for less than $500K. That’s really not very much money. If that estimate is accurate, it seems plausible that some highly motivated individuals could pool their resources and train an LLM of their own––without a watermark, of course. And that’s to say nothing of various governments around the world (or corporations, etc.), most of which clearly have operating budgets well above $500K.
The problem of bad actors training their own LLMs is exacerbated by the fact that the materials required to train an LLM are considerably easier to get your hands on than, say, the materials required for nuclear proliferation. And it seems that this will only become more true, unless there is a drastic regulatory change.
Zooming out
In my view, detecting LLM-generated content without a watermark is going to be challenging. How much of a problem will this be, ultimately? I suspect the answer depends on at least two factors:
Whether you think most LLM-generated content in the future will be produced by models trained by a handful of companies (e.g., OpenAI, Google, Baidu).
Whether it’s still very harmful to have even a small proportion of LLM-generated content be undetectable.
If (1) is true, then strong regulations applied to those large companies may address the majority of LLM-generated content, e.g., by requiring these companies to implement something like a watermark. But if (2) is true, then a handful of bad actors could still muddy the waters significantly.
At the beginning of the essay, I sidestepped the question of whether and to what extent having undetectable LLM-generated content is a problem in the first place. In some ways this is the more interesting question, but it’s also not one I feel equipped to answer intelligently, at least not yet: it depends on our own values about what sorts of content we want to proliferate on digital media.
But speaking personally, I think there’s something unsettling about the cost of producing text content falling essentially to zero. The pace of content already feels frenetic and overwhelming at times, especially on social media platforms like Twitter. I’m not sure what to think of that din growing even louder and more cacophonous––or perhaps, if my other predictions are correct, ever more homogeneous, such that we find ourselves swimming in a waveless, predictable sea.
1. Though not everyone agrees that this amounts to “plagiarism”, and some suggest we should embrace ChatGPT as a writing and research tool.