Revolutionizing the Study of Talk

By Shilo Rea

The advent of systems such as Google, YouTube and Wikipedia has made it seem as if all types of information are accessible at all times. Yet, for people who study spoken language in realistic conversations, none of those resources really help.

To address this gap, Carnegie Mellon University Psychology Professor Brian MacWhinney has spent the past three decades constructing ways to study how we learn, use and understand spoken language. With continual funding from the National Science Foundation (NSF) and the National Institutes of Health (NIH) and support from hundreds of researchers in 48 countries, he has built a system called TalkBank. This network of databases has revolutionized the study of human communication and advanced the development of standards and tools that researchers use to create, share, search and comment on source materials.

“As possibly the most complex psychological process, conversation is fascinating in itself, but we also want to understand how we can use this understanding to help people with and without language disorders communicate more effectively,” said MacWhinney. “There are times when our communications are misunderstood, whether in the context of different cultures, genders or race or because of an injury or disorder. To understand the various ways in which conversation can break down and to address these problems, we need to understand the basics of talk itself.”

The TalkBank system is being used by researchers in many disciplines—including linguistics, psychology, speech pathology, sociolinguistics, education and computer science.

The clinical populations studied within TalkBank include people with aphasia, dementia, right hemisphere disorder, traumatic brain injury, autism and stuttering. Other populations and situations include classroom discourse, child-parent interactions in the home, second language learners, bilingual code-switching, U.S. Supreme Court oral arguments, lectures, dialogs in groups of friends, dialogs in work groups and many others.

All TalkBank data are transcribed in the same format for analysis with the same set of computer programs.

“By putting everything into the same formal system, we have created a standard way to collect and analyze talk,” said MacWhinney.

The oldest of the 12 projects under the TalkBank umbrella is CHILDES, which began in 1984 and provides transcript and media data for the study of child language development. CHILDES includes 59 million words of transcript data from children learning 24 languages.

More than 8,000 articles have been published based on the use of CHILDES data and TalkBank programs. Most TalkBank transcripts are freely and openly available to the public, and the audio and video to which they are linked at the utterance level can be replayed directly over the Internet through searches.

These data are helpful to researchers on various levels, from microanalysis that looks at less than one minute of conversation to macro-analysis of big data across thousands of transcripts. In the ideal case, researchers can hypothesize and spot interesting patterns through microanalysis and then proceed to test the generality of these patterns through macro-analysis.

“For example, this interplay between micro- and macro-analysis can be especially useful when you want to explain why something happens in conversations, such as the male-female balance in conversation and who interrupts each other more, when, and why,” MacWhinney said.

Applications for TalkBank are still unfolding, with several very new projects under way. One such project focuses on language in dementia. Twenty-four tech groups from around the world, including Apple, IBM, and research groups in Israel, Singapore, Japan and elsewhere, have requested access to DementiaBank data to train their computational algorithms to spot when people start to display the features of dementia.

“I never would have thought these data would be used for this, and I’m quite curious to see how well they will do,” said MacWhinney.

A second new project, called HomeBank, collects daylong audio recordings in the home to study how patterns of parent-child interaction vary between social groups, families, languages, and types of children. Another project, FluencyBank, examines why the majority of children who are disfluent before age six end up normally fluent, whereas 25 percent of the children who are disfluent at this age end up with long-term patterns of disfluency and stuttering.

MacWhinney says that four basic principles have been crucial to the success of TalkBank: the commitment of researchers to open sharing of the products of their research; an emphasis on the roles of multiple disciplines in fully understanding the emergent nature of language, talk, and conversations; the integration of the data from hundreds of projects into a single consistent format; and the responsiveness of TalkBank to the research practices and needs of particular research communities.
