Carnegie Mellon University


December 06, 2018

Bible Readings Help Create New Multilingual Dataset

New resource can be used to build text-to-speech systems for hundreds of languages

By Byron Spice

Byron Spice
  • School of Computer Science
  • 412-268-9068

It's the Christmas season, which means that beloved Bible verses are being read and recited innumerable times — and in a vast number of languages. The Bible's global reach as evidenced this time of year has enabled a Carnegie Mellon University professor to create a language resource that could enhance communication in hundreds of languages.

By tapping online text and audio recordings of the New Testament in more than 700 languages, Alan Black, a professor in CMU's Language Technologies Institute, has created a dataset that can be used to build text-to-speech computer systems and other modern speech technologies for so-called low-resource languages. These languages, such as Kaqchikel in central Guatemala, Lun Bawang of Malaysia and Indonesia, and Mamprusi in northern Ghana, often are spoken by relatively small groups of people and generally lack the kind of technological tools for recognizing or translating language that are routinely available for high-resource languages such as English, Spanish or Mandarin Chinese.

Black said it generally isn't profitable to build such systems — or often even basic tools such as dictionaries or pronunciation guides — for low-resource languages. But that never mattered to Christian missionaries, he added.

"They don't care about commercial aspects," Black explained. "They care about the Word." In many cases, what few resources exist for these languages are the work of missionaries. "I suspect that for some of these languages these are the only written texts that exist."

Black was able to tap one of those evangelical resources, an online service that provides recordings of the New Testament in more than a thousand languages, to create what he calls the CMU Wilderness Multilingual Speech Dataset. This dataset, available for free download online via GitHub, includes audio, word pronunciations and other tools necessary to build text-to-speech systems.

From that service, Black downloaded recordings of more than 700 languages for which both audio and text were available. That represents about 10 percent of the world's languages, he noted.

"They are languages that missionaries would care about," Black said, including those spoken in areas such as Central and South America, West and East Africa, and Southeast Asia.

He then set about aligning the text with the audio, determining which words in the text corresponded with spoken words. By so doing, he was able to establish pronunciation rules that make it possible to vocalize any word in that language, not just those included in the Bible.

To make those alignments, Black and his CMU students were aided by the similar spelling and pronunciations across languages of three Hebrew names — Jesus, David and Abraham — and the first verse of the Book of Matthew: "The book of the genealogy of Jesus Christ, the son of David, the son of Abraham."

"I now probably know that first sentence in Matthew better than anyone else," Black added.

A computer program that makes a best guess at pronunciation helps create an initial alignment of text and audio. This first attempt often is incomprehensible, Black noted, but a machine learning program then analyzes the alignment and fine-tunes it.
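To illustrate what a rough first-pass alignment might look like, here is a minimal Python sketch. It is not Black's method: it simply assumes that the time spent speaking each word is proportional to its letter count, and the function name `initial_alignment` is hypothetical. The machine learning refinement step described above is not shown.

```python
def initial_alignment(words, audio_seconds):
    """Return (word, start, end) spans, giving each word a share of the
    audio proportional to its letter count (an illustrative assumption)."""
    total_letters = sum(len(w) for w in words)
    if total_letters == 0:
        return []
    spans = []
    letters_so_far = 0
    for w in words:
        start = audio_seconds * letters_so_far / total_letters
        letters_so_far += len(w)
        end = audio_seconds * letters_so_far / total_letters
        spans.append((w, start, end))
    return spans


# Example: the opening words of Matthew 1:1 over a 9-second clip.
for word, start, end in initial_alignment(["The", "book", "of"], 9.0):
    print(f"{word}: {start:.1f}s to {end:.1f}s")
```

A guess this crude would place many word boundaries in the wrong spot, which is consistent with Black's remark that the first attempt is often incomprehensible before it is fine-tuned.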

Thus far, he and his students have completed alignments for 600 of the languages and hope to finish the remaining, more troublesome languages soon. In some cases, poor quality recordings, misidentified languages and unrecognizable writing systems have thwarted their efforts.

Development of the dataset was an outgrowth of a Defense Advanced Research Projects Agency program called Lorelei, which sought ways to develop speech recognition tools for low-resource languages within a matter of hours or days. Such tools would be useful, for instance, in responding to epidemic outbreaks or other humanitarian crises.

Rather than build such tools on demand, which requires intensive work, Black worked to identify existing resources, such as these New Testament recordings, that could be tapped to create these tools inexpensively in advance. He and his students have demonstrated that tools such as a speech synthesizer can indeed be created using the Wilderness dataset.

Tools for processing and translating speech are particularly important for low-resource languages because many of their speakers are illiterate, Black explained.

The dataset also should be useful for linguists, he added, noting that it makes it possible to study how languages vary across the planet. For instance, the dataset includes about 100 languages from the Amazon basin, enabling studies of how words are formed and how they relate to words in other languages.