Carnegie Mellon University
February 15, 2021

Big Data Accelerates Biodiversity Research

By Stacy Kish

As the planet continues to warm and humans encroach on more wilderness areas, scientists warn of an unfolding sixth mass extinction. To evaluate the progression of this catastrophe, researchers need a large amount of high-quality data containing detailed records of plant and animal biodiversity around the world. The Global Biodiversity Information Facility (GBIF) provides the largest open-access biodiversity data network for researchers, conservation agencies and, ultimately, policy makers. It also provides a bridge to organizations, like museums and citizen science groups, that hold valuable biodiversity resources. With all of this information, could GBIF provide researchers with the resources they need to slow the threat of the next mass extinction?

Venturing into the Jungle of Data Science 

For centuries, museums have held the storehouse of specimens required to understand biodiversity across the planet. These archives serve as historical snapshots of biodiversity in one area, at one time. This information, until recently, has remained isolated. Recent efforts to digitize collections have produced a bridge between these rich troves, combining collections into a larger pool that researchers can tap to tackle bigger questions about global biodiversity.

“Datasets from thousands of museums across the globe are increasingly digitized and accessible in publicly searchable, online data portals,” said Mason Heberling, assistant curator of botany, co-chair of collections at Carnegie Museum of Natural History and first author on the study. “We are increasingly swimming in high volumes of data, but accessing and making sense of these data can be the limiting challenge.”

Big data is big, too big for one person to pore over in their spare time. Like any explorer venturing into the unknown, Heberling consulted some experts. In this case, he popped over for a chat with folks at the Digital Sciences, Humanities, Arts: Research and Publishing (dSHARP) coalition at Carnegie Mellon University. Their conversation crystallized into an approach to unearth the secrets hidden in the datasets. These early conversations continued, grew into a collaboration and resulted in a paper published in the February issue of the journal Proceedings of the National Academy of Sciences.

“This is a great example of how digital humanities can collaborate with the sciences,” said study co-author Scott Weingart, program director for the Digital Humanities at CMU. “Often times a collection is based on a particular place, time, or taxa. This [database aggregates] all of the information together so biodiversity science can make broader and more global claims.”

Big Data, Big Results

Heberling and Weingart used a machine-learning algorithm on more than 4,000 research articles, published between 2003 and 2019, to quantify the scientific impact of the GBIF database on biodiversity research. They classified the text according to the probability of certain words co-occurring within the text and across the database. Then, they evaluated the published data to identify changes in biodiversity research compared to research use, the types of data used, the groups using the data and the most common research topics.
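
The article does not name the algorithm the team used, so the following is only a minimal sketch of the underlying idea: counting how often pairs of words turn up in the same document and turning those counts into probabilities. The function name and the three toy abstracts are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(documents):
    """Estimate, for each word pair, the share of documents containing both words.

    This is an illustrative stand-in, not the method used in the study.
    """
    pair_counts = Counter()
    for text in documents:
        # Use each word once per document and sort so that every pair
        # is always stored in the same (alphabetical) order.
        words = sorted(set(text.lower().split()))
        pair_counts.update(combinations(words, 2))
    total = len(documents)
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical toy abstracts, made up for illustration only.
toy_abstracts = [
    "species distribution model climate",
    "climate species range shift",
    "invasive species management",
]

probs = cooccurrence_probabilities(toy_abstracts)
for pair, p in sorted(probs.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(p, 2))
```

Word pairs that co-occur often, such as "climate" and "species" in the toy data, hint at a shared research topic. Real topic-classification methods build on the same signal with far more sophisticated statistics.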

The team found that the data available through GBIF has increased by more than 1,150% in the past decade. This explosion in data is due in part to the participation of citizen scientists, who collect and input information into the database. In addition, research generated from GBIF-mediated data has grown sharply, from a total of 148 studies published in 2003-2009 to 723 studies published in 2019 alone. Finally, they found that GBIF-enabled research extends across all major scientific disciplines.
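
To put those figures on a common footing, here is a short back-of-the-envelope calculation. It assumes that 2003-2009 covers seven full publication years and that "increased by 1,150%" means growth on top of the starting amount; both readings are assumptions, not statements from the study.

```python
# Figures quoted in the article.
studies_2003_2009 = 148   # total studies over 2003 through 2009
studies_2019 = 723        # studies published in 2019 alone
data_increase_pct = 1150  # reported growth in available data

# Assumption: the early period spans seven full years, 2003 to 2009 inclusive.
years_in_early_period = 2009 - 2003 + 1
early_rate = studies_2003_2009 / years_in_early_period

# How many times higher the 2019 output is than the early yearly average.
fold_change_in_rate = studies_2019 / early_rate

# Assumption: a 1,150% increase means the new total is the old total
# plus 11.5 times the old total.
data_growth_factor = 1 + data_increase_pct / 100

print(f"Early period: about {early_rate:.1f} studies per year")
print(f"2019 output: about {fold_change_in_rate:.0f} times the early yearly average")
print(f"Data volume: about {data_growth_factor:.1f} times the starting amount")
```

Under these assumptions, the early period averaged about 21 studies per year, so the 2019 output was roughly 34 times that rate, while the volume of available data grew to about 12.5 times its starting size.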

“Combining community science-generated data and digitized museum records has been profound in understanding basic biology and the local to global impacts of environmental change,” said Heberling. “We show that making these data freely available in a single platform not only makes biodiversity discoveries more efficient, but also enables new research that wouldn't otherwise be possible.”

Some of the most common topics covered relate to species distribution models, climate studies and biodiversity informatics. Other research areas, like invasive species management and taxonomic treatment, showed a sharp decline. With more data on a global scale, it may be possible to create new theories and take research in new directions to understand the global scale of biodiversity science.

“When you are able to aggregate specimen and observation data from all over the world, we are able to study things on a scale previously untenable,” said Weingart.

Vestiges of Colonialism Remain

Surprisingly, the team also found the legacy of scientific colonialism in the modern aggregated environmental dataset. Any natural history museum is rife with the effects of colonial-inspired approaches to sample collection and, at times, theft. This approach has been codified in museums and environmental science over the last several hundred years.

“In some ways, we are seeing GBIF reinforces this history,” said Weingart. “Certain countries are exporters of scientists and importers of data.”

The researchers found that many of the biodiversity studies were conducted in regions across the southern hemisphere but authored by researchers from countries in the northern hemisphere. In particular, large amounts of data were collected in Latin America by researchers from European institutions. In fact, nearly 8% of regional studies were completed without authors from the study location, who could add nuance to the study results.

“Though legacies of scientific colonialism persist, the open access integration of data shows promise towards a more global and diverse research community,” said Heberling. “By making these data publicly available online, these specimens are digitally repatriated back to the region they came from.”

More Data, Please

Through this study, the team shows the value of large, international databases, lending support for American initiatives like the National Science Foundation’s Advancing Digitization of Biodiversity Collections program and similar programs around the world, as well as efforts to improve the global registry of the world’s natural history collections.

According to Heberling, this review is far from complete. This study highlights the need for continued efforts to develop a new era of data-intensive biodiversity science that embraces information from a variety of fields, including environmental sciences and policy, evolutionary biology, conservation and human health. 

“Our review demonstrates the scientific and societal value of combining freely available biodiversity data from many different sources,” said Heberling. “We hope our analysis encourages new uses of these data.”


Weingart and Heberling were joined by Joseph T. Miller, Daniel Noesgaard and Dmitry Schigel of the Global Biodiversity Information Facility in Copenhagen, Denmark, on the project titled “Data integration enables global biodiversity synthesis.” The project received funds from the GBIF Secretariat. The Digital Sciences, Humanities, Arts: Research & Publishing (dSHARP) coalition is a team of CMU faculty and staff in University Libraries and the Dietrich College of Humanities and Social Sciences dedicated to advancing research and teaching involving digital tools, methods and sources.