Discussion about this post

Terry Underwood

Excellent! Thank you! One observation: I’ve found that ChatGPT writes poetry that rhymes by default. Even when you ask it not to rhyme, it slips into rhyming. This may be because it has trouble with negation. It has insisted that a word which does not rhyme with another word in any dialect does indeed rhyme. There is something funky going on with rhyming, which I think may be rooted in its limited use relative to other features absolutely required in 90% of linguistic communication. But what do I know? This was a great read—well-written and thorough for non-experts.

Isaac King

A nice explanation, thank you for writing it. I was a little disappointed that you didn't cover the most straightforward hypothesis for how LLMs learn the characters in a word: the same way humans do, by being told them explicitly. Surely GPT-4's training data includes all sorts of English learning resources that contain strings like "the word 'dog' is spelled D - O - G." Has this method been ruled out?

I'm also confused about this line:

> Tokenizers based on spaces or punctuation (i.e., “word-based tokenizers”) end up with large vocabularies and struggle to handle unknown words—and they also struggle with writing systems that don’t separate words with spaces.

If I understand the setup correctly, struggling with unknown words would be a problem of the language model itself, while struggling to tokenize things without spaces to show where the word breaks are is a problem of the tokenizer. Those are entirely different programs, aren't they?
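To make the distinction concrete, here is a toy word-based tokenizer (the function and vocabulary names are illustrative, not from the article). Tokenization itself doesn't fail on an unknown word; the tokenizer just maps it to a single `<unk>` id. The real cost lands downstream: the language model receives the same id for every out-of-vocabulary word, so the information is lost before the model ever sees it.

```python
# Toy word-based tokenizer with a fixed vocabulary.
# Unknown words don't break tokenization; they all collapse
# to one <unk> id, so the model can't tell them apart.
def word_tokenize(text, vocab, unk_id=0):
    return [vocab.get(word, unk_id) for word in text.lower().split()]

vocab = {"<unk>": 0, "the": 1, "dog": 2, "ran": 3}

print(word_tokenize("the dog ran", vocab))      # [1, 2, 3]
print(word_tokenize("the axolotl ran", vocab))  # [1, 0, 3] — "axolotl" becomes <unk>
```

Note that this sketch still relies on `split()`, i.e., on spaces marking word boundaries. For a writing system without spaces, even this first step has no obvious segmentation to use, which is the separate tokenizer-side problem the quoted passage mentions.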

