11 Comments

Excellent! Thank you! One observation: I’ve found that ChatGPT writes poetry that rhymes by default. Even when you ask it not to rhyme, it slips into rhyming. This may be because it has trouble with negation. It has insisted that a word which does not rhyme with another word in any dialect does indeed rhyme. There is something funky going on with rhyming, which I think may be rooted in its limited use relative to other features absolutely required in 90% of linguistic communication. But what do I know? This was a great read—well-written and thorough for non-experts.

Thanks! I do think there's an interesting connection between rhyming and knowledge (implicit or otherwise) of phonology. Part of it could also be that GPT's training corpus (and perhaps RLHF, to the extent it incorporates poetry?) probably includes more rhyming poetry than free verse. But it's hard to say definitively given how little we know about what it was trained on.

A nice explanation, thank you for writing it. I was a little disappointed that you didn't cover the most straightforward hypothesis for how LLMs learn the characters in a word: the same way humans do, by being told them explicitly. Surely GPT-4's training data includes all sorts of English learning resources that contain strings like "the word 'dog' is spelled D - O - G." Has this method been ruled out?

I'm also confused about this line:

> Tokenizers based on spaces or punctuation (i.e., “word-based tokenizers”) end up with large vocabularies and struggle to handle unknown words—and they also struggle with writing systems that don’t separate words with spaces.

If I understand the setup correctly, struggling with unknown words would be a problem of the language model itself, while struggling to tokenize things without spaces to show where the word breaks are is a problem of the tokenizer. Those are entirely different programs, aren't they?

Thanks for the thoughtful questions!

> Surely GPT-4's training data includes all sorts of English learning resources that contain strings like "the word 'dog' is spelled D - O - G." Has this method been ruled out?

I think that's a possible candidate mechanism and AFAIK the paper I cited didn't rule it out (and I should've thought of it while writing this). A couple of thoughts:

1) The paper I discussed was using considerably smaller models (nothing approaching GPT-4), which may also have been trained on smaller corpora (again, we can't know for sure).

2) The mechanism you're describing makes sense particularly when the task involves asking the model to spell, i.e., produce tokens in a sequence corresponding to the letters in a word (because that's presumably what the training data would look like: something like "Dog" --> "D-O-G"; a rough sketch of the kind of string I mean is below). It could also influence the token representations themselves (and presumably it'd have to in order for the model to be able to use those representations to produce a candidate spelling).
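
To make that concrete, here's a rough sketch of the kind of corpus check I have in mind; the regex and the example strings are made up for illustration, not anything from the paper:

```python
import re

# Rough pattern for explicit letter-by-letter spelling strings,
# e.g. 'the word "dog" is spelled D - O - G' (illustrative, not exhaustive).
SPELLING_PATTERN = re.compile(
    r"\b(?:is spelled|is spelt)\s+(?:[A-Za-z]\s*[-, ]\s*)+[A-Za-z]\b"
)

def count_spelling_examples(lines):
    """Count lines that contain an explicit letter-by-letter spelling."""
    return sum(1 for line in lines if SPELLING_PATTERN.search(line))

# Hypothetical sample of training text:
sample = [
    'The word "dog" is spelled D - O - G.',
    "Cats make great pets.",
    "Banana is spelled B-A-N-A-N-A.",
]
print(count_spelling_examples(sample))  # -> 2
```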

I think it's worth following up on. I reached out to the paper authors to see whether they investigated this and will let you know what I hear (and update the explainer accordingly!).

> If I understand the setup correctly, struggling with unknown words would be a problem of the language model itself, while struggling to tokenize things without spaces to show where the word breaks are is a problem of the tokenizer. Those are entirely different programs, aren't they?

That's a fair point—by "struggle to handle unknown words" I meant that the tokenizer (by definition) won't have that word in its vocabulary, i.e., it won't map the form onto some token ID, but you're right that the bulk of the "struggling" in this scenario is probably better described as occurring downstream in the LLM.
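
To make the division of labor concrete, here's a toy sketch (the vocabulary is entirely made up): the word-based tokenizer's share of the "struggle" is just collapsing any unseen form to an <unk> ID; whatever the LLM then does with that impoverished input happens downstream.

```python
# Toy word-based tokenizer: split on whitespace, map unseen words to <unk>.
VOCAB = {"<unk>": 0, "the": 1, "dog": 2, "ran": 3, "fast": 4}

def word_tokenize(text):
    """Map each whitespace-separated word to an ID, or to <unk> if unseen."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

print(word_tokenize("the dog ran fast"))   # [1, 2, 3, 4]
print(word_tokenize("the dachshund ran"))  # [1, 0, 3] -- "dachshund" collapses to <unk>
```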

I get where you’re coming from. I had to slow down and reread several sections. I think if you read the post again repeating the mantra “tokens are not words, tokens are not words,” you’re gonna grasp what he is talking about: tokens aren’t divided by spaces the way words are. Racket is one word but two tokens—“rack” and “et.” This way the machine can process “ballet” and “ballot” and “shallot” and “mallet.” Instead of having to memorize multiple words separately, the machine can atomize words to reduce the load—it uses -et and -ot to distinguish between ballet, ballot, mallet, argot, depot, etc. Keep in mind it doesn’t need to know the meanings of words.
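
Here’s a toy version of what I mean; the subword vocabulary is hand-picked for the example, whereas real tokenizers (BPE, WordPiece) learn their pieces from data:

```python
# Toy subword tokenizer: greedy longest-prefix match against a small,
# hand-picked vocabulary (illustrative only; real subword vocabs are learned).
SUBWORDS = {"rack", "ball", "shall", "mall", "et", "ot"}

def subword_tokenize(word):
    """Greedily split a word into the longest known subword pieces."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in SUBWORDS:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[0])
            word = word[1:]
    return pieces

for w in ["racket", "ballet", "ballot", "shallot", "mallet"]:
    print(w, "->", subword_tokenize(w))
# racket -> ['rack', 'et'], ballet -> ['ball', 'et'], ballot -> ['ball', 'ot'], ...
```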

Great post, especially the experiment with forcing the morpheme boundaries (via pamenizer) to see whether the LLM shows any sign of understanding morphology (or at least plurals). Wouldn't be surprised if there's something across many of the self-attention mechanisms that picks up on plurals or tenses, but admittedly I only have a superficial understanding of the architecture of these models.

Thanks! Agreed that it seems plausible that, conditional on the tokenizer having tokenized something in a morphemic way, there's some kind of systematic information contained in the affix embedding (e.g., "##es"). I think there's a lot more interesting work to be done exploring exactly what those embeddings look like, and as you say, how the attention mechanism integrates that information with the root form (e.g., "mujer").
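
If anyone wants to poke at this, here's roughly the kind of probe I have in mind. It assumes a WordPiece model like multilingual BERT, where word-internal pieces carry a "##" prefix; whether a given word (e.g., "mujeres") actually splits at the morpheme boundary isn't guaranteed.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-multilingual-cased"  # WordPiece tokenizer with "##" continuation pieces
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

print(tok.tokenize("mujeres"))  # hoping for something like ['mujer', '##es'], but not guaranteed

# Static (input) embeddings for a few suffix-like pieces, if they're in the vocab
# (tokens missing from the vocab map to the [UNK] id instead).
emb = model.get_input_embeddings().weight  # shape: (vocab_size, hidden_size)
id_es, id_s, id_ing = tok.convert_tokens_to_ids(["##es", "##s", "##ing"])

cos = torch.nn.functional.cosine_similarity
print("##es vs ##s:  ", cos(emb[id_es], emb[id_s], dim=0).item())
print("##es vs ##ing:", cos(emb[id_es], emb[id_ing], dim=0).item())
```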

nice

Love this, Sean. Perhaps a future post can explore how the tokenization of number strings differs from how humans develop numerical understanding.
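
Even a quick look at how an existing tokenizer chops up digit strings could be a nice hook for that. Something like this sketch, using GPT-2's tokenizer as an example (the exact splits depend on its learned merges, so treat the output as illustrative):

```python
from transformers import AutoTokenizer

# How a BPE tokenizer (GPT-2's, as one example) splits digit strings;
# the splits it produces depend entirely on its learned merge rules.
tok = AutoTokenizer.from_pretrained("gpt2")
for s in ["7", "42", "12345", "3.14159", "1,000,000"]:
    print(s, "->", tok.tokenize(s))
```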

In reading theory, using rhymes is a major pedagogical strategy for improving phonemic awareness among toddlers. Your explanation of tokenizing suggests that the letter sequence (not the sound sequence) is salient to the bot. Could it be that the word “poem” in the prompt activates a subprocess that turns on some phonological algorithm that applies to the output? I’ve been trying to understand this from the first time I used a bot.
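
For what it's worth, the gap between letter sequences and sound sequences is easy to see with a pronunciation dictionary. Here is a small sketch using the pronouncing library (a wrapper around the CMU Pronouncing Dictionary); a model that only ever sees the letters gets none of this sound information for free:

```python
import pronouncing  # pip install pronouncing; wraps the CMU Pronouncing Dictionary

# Similar letter sequences, very different sounds:
for word in ["cough", "dough", "through"]:
    print(word, pronouncing.phones_for_word(word)[0])

# Rhyme is defined over the sounds, not the letters:
print("through" in pronouncing.rhymes("cough"))  # expected False, despite the shared "-ough"
print("through" in pronouncing.rhymes("blue"))   # expected True, despite the different spelling
```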

That's an interesting connection. I know there's been some work fine-tuning older LLMs (pre-GPT-4 era) to do rhyming, and it's a "task" that some areas of NLP have been interested in for a while. I also wonder how many of the rhymes produced by GPT are learned from poems that use similar rhymes.
