Turn-taking in the age of Zoom

How communication media shape communication

May 05, 2023

Perhaps the most basic feature of human conversation is that we take turns.

Interruptions happen, to be sure. But by and large, one person talks at a time. Once they’re done talking, their interlocutor responds—and if they’re a cooperative conversation partner, they probably say something that’s contingent on the preceding turn.

This is so foundational that it may not seem worth pointing out. But it’s important to remember, because there’s some evidence that turn-taking (whether gestural or vocal) is evolutionarily quite a bit older than humans: vocal turn-taking has been observed in species as old as lemurs, which evolved ~65M years ago. And even though our closest relatives (e.g., chimps and bonobos) don’t display vocal turn-taking, they do display gestural turn-taking.

Image from Levinson (2016). Vocal turn-taking has been observed in primate lineages as old as 65M years, though, puzzlingly, it does seem to “skip” non-human great apes like chimpanzees (which nonetheless exhibit gestural turn-taking).

Why does this matter? If turn-taking predates humans, then turn-taking is the communicative “habitat” in which human language and culture evolved. That is, the technology of language—as well as our neural systems for producing and understanding language—is in some way constrained by and adapted to this environmental niche. Presumably, in the absence of a turn-taking system, language could in principle have taken a different evolutionary route.

Together, this suggests that it’s important to understand how exactly turn-taking works.

Universals in the timing of turns?

In 2009, Tanya Stivers and a host of other co-authors (including my colleague, Federico Rossano) published a paper called “Universals and cultural variation in turn-taking in conversation”.

They collected samples of face-to-face conversation from 10 typologically distinct languages around the world.[1] They then measured the transition time between turns in each conversation. That is, how long does it take for one person to start speaking after another person has finished?

Here, a positive value means the second speaker waited for the first to finish—this is called a “gap”—and the size of that value refers to the length of the gap. A negative value would mean the second speaker started speaking while the first was still finishing up their turn—this is called an “overlap”.
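To make the definition concrete, here is a minimal Python sketch of how a transition time could be computed from time-stamped turns. The `Turn` structure, the function names, and the three example turns are all my own illustrative assumptions; they are not the coding scheme or data from Stivers et al. (2009).

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    """One turn at talk (hypothetical structure, times in milliseconds)."""
    speaker: str
    start_ms: int
    end_ms: int


def transition_times(turns: List[Turn]) -> List[int]:
    """Offset at each change of speaker: next turn's start minus previous turn's end."""
    offsets = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker != nxt.speaker:
            offsets.append(nxt.start_ms - prev.end_ms)
    return offsets


def label(offset_ms: int) -> str:
    """Positive offsets are gaps; negative offsets are overlaps."""
    if offset_ms > 0:
        return "gap"
    if offset_ms < 0:
        return "overlap"
    return "no gap"


# Made-up timings, purely for illustration.
turns = [Turn("A", 0, 1200), Turn("B", 1400, 2500), Turn("A", 2400, 3000)]
offsets = transition_times(turns)
print([(o, label(o)) for o in offsets])  # [(200, 'gap'), (-100, 'overlap')]
```

In this toy example, B waits 200ms after A finishes (a gap), while A's second turn starts 100ms before B is done (an overlap).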

Across all ten languages, the mean transition time was 208ms: on average, it took about 1/5 of a second for the second speaker to start talking after the first had finished. This is, notably, a very short amount of time. It’s also, equally notably, a positive value: although turn transitions are very fast, speakers mostly manage to avoid interruptions.[2]

As depicted in the figure below, there was also considerable variation across languages.

Figure taken from Stivers et al. (2009). Mean transition time across each language. Error bars represent 1 standard deviation.

Languages like Danish had a relatively longer transition time on average (almost half a second), while languages like Japanese had much shorter transition times (only 7ms).

Is this variation a lot or a little? Honestly, it’s hard to say without some kind of baseline. But one way we can ground this result is by comparing it to the average length of time it takes to produce one syllable of speech, which is roughly 250ms.

That means:

Across all languages, turn transitions happen in about the amount of time it takes to produce one syllable (e.g., “Bob”).
Although languages vary in their average transition time, the “extremes” of this distribution (Japanese and Danish) are within ~250ms of the mean across all languages, i.e., the variation is about the duration of one syllable.
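The arithmetic behind those two points can be sketched in a few lines of Python. The 208ms mean, the 7ms Japanese value, and the 250ms syllable duration come from the text above; the Danish value of 470ms is my stand-in for "almost half a second" and should be treated as an assumption.

```python
SYLLABLE_MS = 250   # rough duration of one spoken syllable
MEAN_MS = 208       # mean transition time across all ten languages

# Japanese is from the text; Danish is an assumed stand-in for "almost half a second".
language_means_ms = {"Japanese": 7, "Danish": 470}


def in_syllables(ms: float) -> float:
    """Express a duration as a multiple of one syllable."""
    return ms / SYLLABLE_MS


print(round(in_syllables(MEAN_MS), 2))  # 0.83

for language, mean_ms in language_means_ms.items():
    deviation = mean_ms - MEAN_MS
    print(language, deviation, round(in_syllables(abs(deviation)), 2))
# Japanese -201 0.8
# Danish 262 1.05
```

So the overall mean is a bit under one syllable, and each extreme sits roughly one syllable away from it.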

This result has since been extended to signed languages (de Vos et al., 2016), with remarkably similar results.

Importantly, this doesn’t mean there was no overlap in any of these languages, and it also doesn’t mean there weren’t many turns with much longer transitions. This is a claim about the central tendency of the distribution.

It’s also worth pointing out that most of these analyses focus on yes/no questions and their replies (e.g., “Did you watch the game last night?”), which, intuitively, seem like they’d have faster responses on average. We also know, from Stivers et al. (2009), that a “yes” response tends to be faster than a “no”. All that said, in a supplementary analysis of a broader corpus of speech, the authors found qualitatively similar results—so it seems like this isn’t just about yes/no questions.

Why we got tired of Zoom

In 2020, many offices (and universities) switched to remote work. Meetings that used to be conducted face-to-face suddenly moved to Zoom, and workers learned to navigate the perils of mute and screen sharing, and to put up with constant interruptions and latencies.

There’s a lot to appreciate about technologies like Zoom: many meetings simply couldn’t happen face-to-face—I have a collaborator in Belgium, for example—and it also helps accommodate a more flexible work-from-home environment. It was also crucial during the early stages of the pandemic.

But speaking as someone who carried out quite a few meetings over Zoom (and even taught a couple of classes), I’m very glad to be meeting (and teaching) in person again. And I don’t think I’m alone. During the heights of the pandemic, many articles were written about “Zoom fatigue” and what could be done to minimize it.

Notably, one of the perennial frustrations with Zoom (perhaps the frustration) lay in difficulties with turn-taking. At least anecdotally, it felt much harder to get into the “rhythm” of a conversation over Zoom. This was likely due in part to not being able to observe all the various cues you can observe with someone who’s physically co-present. But strangely, I much prefer talking over the telephone to meeting over Zoom, and I don’t think it’s just about my association between Zoom and “work stuff”. Speculating a bit, I think it’s at least partially attributable to the fact that various lags in Zoom (because of bad Internet connections, etc.) often led to substantial “overlap” (i.e., interruptions). Even in a 1-1 meeting, latency issues meant the constant threat of interruption, waiting to make sure the other person wasn’t speaking, and so on. The threat of overlap thus produced longer and longer gaps between turns, disrupting the normal rhythm of conversation. All this added up to feeling kind of exhausted and ultimately “out of sync” with the other person.[3]

Again, I don’t think this is the only issue. I also prefer talking on the phone because it gives me a little more freedom of movement: I can go for a walk, pace around the room, etc., whereas this is harder with Zoom (especially if my camera is on). And people also likely got tired of constantly looking at themselves over Zoom.

But I do think these transmission delays played a big role. And recently, I came across a 2022 paper entitled “Zoom disrupts the rhythm of conversation”, which presents data consistent with my intuition. I’ll describe their results briefly below.

Zoom disrupts conversational rhythms

The authors ran two different experiments. In the first study, the content of the conversations was tightly controlled (all yes/no questions, as in the 2009 paper I mentioned previously); in the second study, they analyzed more spontaneous, free-form conversation. I’ll focus more on the second study here, though the results were qualitatively similar in both.

In the second study, participants were recorded having a conversation with the experimenter either in person (the “local” condition) or over Zoom (the “remote” condition). Conversation topics were designated in advance, e.g., “pop culture” or “life in Ann Arbor”.

After recording each conversation, the authors transcribed the dialogue and calculated how long it took to transition between turns (i.e., the average “gap”). As one might expect, gaps were significantly longer in the remote condition than in the local condition (by a factor of ~3).

Figure from Boland et al. (2022). Gaps were longer for the remote condition than the local condition. The experimenter also took longer to respond on average, though there was a significant effect of local/remote for both experimenters and participants.

Intriguingly, estimates of the average transmission delay—i.e., Zoom lags—are about 30-70ms. That means that these results cannot be accounted for merely by the fact that it takes a little longer for a spoken utterance to be transmitted over Zoom than in-person. Rather, the authors argue that this minor delay in transmission disrupted the underlying rhythmic dynamics of conversation.
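A quick back-of-the-envelope check shows why the delay alone falls short. The sketch below assumes a deliberately simple additive model (remote gap = local gap + transmission delay), which is my simplification and not the authors' analysis. The 208ms baseline is borrowed from the cross-linguistic mean in the 2009 paper purely as an illustrative local gap; it is not a figure from Boland et al. (2022).

```python
def max_local_gap_explained(delay_ms: float, factor: float) -> float:
    """Largest local gap (ms) for which the delay alone yields a factor-fold gap.

    Assumes remote = local + delay and remote = factor * local,
    which gives local = delay / (factor - 1).
    """
    return delay_ms / (factor - 1)


FACTOR = 3.0              # remote gaps were roughly 3x local gaps
DELAYS_MS = (30.0, 70.0)  # estimated range of transmission delay

for delay in DELAYS_MS:
    print(delay, max_local_gap_explained(delay, FACTOR))
# 30.0 15.0
# 70.0 35.0

# Illustrative baseline only (assumption): the 208ms cross-linguistic mean.
ILLUSTRATIVE_LOCAL_GAP_MS = 208.0
extra_gap_ms = (FACTOR - 1) * ILLUSTRATIVE_LOCAL_GAP_MS
share_explained = max(DELAYS_MS) / extra_gap_ms
print(extra_gap_ms, round(share_explained, 2))  # 416.0 0.17
```

Under these assumptions, a 30-70ms delay could only account for a tripling if local gaps were 15-35ms to begin with. With a more typical gap of around 200ms, even the largest delay covers less than a fifth of the extra waiting time.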

There’s a longstanding idea (dating back at least to 2005) that turn-taking timing follows an oscillatory pattern—which may, in turn, rest upon synchronization of neural oscillators across the participants in the conversation. I won’t go into the neural details here (and much is speculative anyway), but the high-level view is described by the authors as follows (bolding mine):

Just as ballroom dancers must coordinate their moves to the beat of the music, interlocutors must coordinate their utterances to the rhythm of the conversation, relying on neural synchronization to syllable rate…For example, if both interlocutors are speaking at the same rate, it is more straightforward to predict both one’s own timings and the timings of one’s partner…if the variable transmission delays from video-conferencing software cause discrepancies beyond the capacity of the system, interlocutors would be unable to maintain timing synchrony and the dialogue system would be destabilized—perhaps as we have observed in our experiments. (pg. 1279-1280)

In other words: in-person (or perhaps even over the phone), people can entrain to a shared conversational rhythm. But as soon as there’s a transmission delay—as there is over Zoom—this ability to align rhythmically becomes much more challenging. Ultimately, this leads to even longer gaps, in the attempt to avoid interruptions.

Communicative niche, revisited

At the beginning of this article, I made the following points:

Turn-taking likely predates human language.
Turn-taking timing is remarkably consistent across languages and cultures.
Human language is thus adapted to (and constrained by) the idiosyncrasies of turn-taking, and of turn-taking timing.

Turn-taking, in other words, can be viewed as a kind of “ecological niche” in which language resides. That is, human language has to be the kind of thing that can be used in a turn-taking system like the one we have.

What happens, then, when that turn-taking environment is altered in some way—as it was over Zoom?

In the short term, we saw effects like additional delays in turn-taking, along with Zoom fatigue and perhaps even occasional (if unwarranted) irritation with one’s conversational partner.

It’s probably too early to make claims about medium-term (and certainly long-term) effects. And crucially, the nature of these effects will likely depend on the extent to which technologies like Zoom end up pervading more or less of our lives. But if the majority of our conversations are conducted over something like Zoom, then I think the selection pressures operating over language and communication start to change in non-trivial ways. In particular, if we take seriously the idea that face-to-face interaction (with all the intricacies of turn-taking timing that come with it) is the natural habitat of language, then a disruption in timing is a disruption to habitat. Where exactly that will lead language is unclear, and will also depend on exactly how flexibly our neural and communicative systems adapt to the new habitat.

In a sense, this is not unlike the question of how widespread use of LLMs in communication may change language itself. In both cases, the adoption of a new technology has the potential to shape certain features (perhaps fundamental, perhaps not) of our linguistic systems. The question of exactly what those changes will entail is tough to answer; even tougher is the question of whether those changes are good or bad. But I think it’s worthwhile to face the future as clear-eyed as possible, and part of that means appreciating the deep embeddedness of human communication and culture in the fabric of technology and communication media.

For more on why this matters, check out my other post on the importance of linguistic diversity.

[2] The median transition times, importantly, were also positive.

[3] Note that the metaphors here all relate to the rhythm of dialogue.
