
Making mountains out of molehills: system builds public-access big data from many sources

  • How and where to store, manage, and share increasingly large data sets challenges scientists across disciplines.
  • Like a library network for scientific data, the Data Observation Network for Earth (DataONE) links member data repositories to ensure open and secure access to well-described and easily discovered Earth observational data.
  • The network provides guidelines and tools for researchers to document and preserve their data and make them available for future users to expand studies across time periods and locations.

What if we had a public library for scientific data?

The proliferation of sensors monitoring the Earth—from space to planes, drones, vehicles, park rangers, camera traps, and even animal tracking collars—has generated so much information that researchers now need new technology to access and manage it.

Scientists are increasingly uploading data to online platforms for storing and sharing genetic, taxonomic, and spatial data—such as Movebank, GenBank, Barcode Of Life Data systems (BOLD), Wildbook, CollectEarth, Global Biodiversity Information Facility (GBIF), and Map4Environment.

Forest along the Kinabatangan River in Sabah, Malaysia. Scientists increasingly rely on shared data sets to study complex systems and ecological processes. Photo credit: George Powell

As part of U.S. President Obama’s Big Data Initiative, the National Science Foundation (NSF) supported the formation of the Data Observation Network for Earth (DataONE). This network of data repositories came together in 2012 to address the growing need to manage vast amounts of diverse scientific data and make them available for science.

And like a library system, DataONE formalizes collaboration among these data centers to help scientists with three main big data challenges:

  1. Preserving and storing their data securely over time;
  2. Finding reliable data sets to help address large-scale and long-term research questions; and
  3. Visualizing and analyzing large amounts of data.

These issues are “especially important now as we deal with challenges that are long-term in nature, things like climate change, major movements of populations into new areas, and long-lasting droughts,” said William Michener, DataONE principal investigator from the University of New Mexico, in a news release.

Linking institutions and data

Similar to a library that stores books from different time periods on various aspects of a research subject, a network of data banks stores and manages data sets from research conducted at different times and places for future use beyond the initial project.

DataONE maintains and provides access to data through over 40 member data repositories where researchers can upload a data set for future use, by themselves or others. Member institutions—typically government, NGO, or university data centers or long-term research projects—preserve and provide access to contributed datasets, share data with the rest of the DataONE community, and facilitate data search and discovery by researchers, libraries, funders, and other repositories.

Giraffes at dawn in Zimbabwe’s Hwange National Park. Photo credit: George Powell

“A major growth area in science is developing an infrastructure that enables scientists to tackle…grand challenge questions. I think that is the future, data-enabled, data-intensive science,” Michener said.

Most DataONE member institutions store data on climate, forests, oceanography, biodiversity, and ecosystem processes, though the member data repositories vary in size and data type. “In the first phase of DataONE, we concentrated on repositories with environmental and biodiversity observational data,” said Rebecca Koskela, DataONE’s Executive Director, in an email. “[W]e added some social science data to the network through the Minnesota Population Center. We also have archeology data with the addition of the tDAR (the Digital Archaeological Record).”

Sharing of data sets allows researchers to generate new information in a way not possible until recent advancements in computing and communications. For example, scientists can increasingly study species, habitats, or processes across multiple places and time periods using multi-national, long-term data sets.

However, once scientists publish the results of a given study or complete their PhD, they often lack a plan for the longer-term role their data may play.

“Historically, most individual scientists have collected their data, they put it onto spreadsheets….those spreadsheets go onto a laptop, and they may be lost over time,” Michener said. “These data are often considered ‘orphan’ data. They have no one to take care of them, and they often disappear several years after the project is completed.”

Tree fern in cloud forest interior, Costa Rica. Combining research data on multiple species can help future scientists understand ecological patterns. Photo credit: Sue Palminteri

Tools to expand data’s contribution to science

DataONE provides tools to enhance the usefulness of these data by helping scientists to:

  1. properly format data spreadsheets through a system that analyzes files, identifies problems, and offers solutions to preserve their utility over the long term; and
  2. upload the cleaned-up orphan data with some associated descriptive information to a specific repository, to help other scientists use them in the future.

“A good 95% of the data that has been collected as part of science is orphan data, much of which has been lost and we can no longer retrieve,” Michener said.
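To give a sense of what an automated spreadsheet check can involve, here is a minimal sketch in Python. It is not DataONE's tool: the function name `check_table`, the rules it applies, and the sample rows are all invented for illustration, and it assumes the spreadsheet has been exported as CSV text.

```python
import csv
import io


def check_table(csv_text):
    """Return a list of human-readable problems found in CSV text.

    Hypothetical rules for illustration only: named, unique columns,
    consistent row lengths, and no blank cells.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = [name.strip() for name in rows[0]]

    # Every column needs a name, and names must be unique.
    for index, name in enumerate(header, start=1):
        if not name:
            problems.append(f"column {index} has no header")
    for name in sorted({n for n in header if n and header.count(n) > 1}):
        problems.append(f"duplicate header: {name}")

    # Every data row should match the header length and have no blank cells.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} cells, found {len(row)}"
            )
            continue
        for name, cell in zip(header, row):
            if not cell.strip():
                problems.append(f"line {line_no}: missing value in '{name}'")
    return problems


# Made-up sample data: one blank cell and one short row.
SAMPLE = (
    "site,date,species,count\n"
    "A1,2015-03-02,giraffe,4\n"
    "A2,2015-03-02,,7\n"
    "A3,2015-03-03,zebra\n"
)

for problem in check_table(SAMPLE):
    print(problem)
```

Run on the sample, the sketch reports the blank species cell on line 3 and the short row on line 4. A real checker would go further, for example by validating dates and units.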

More generally, DataONE helps its member nodes prepare, catalog, and describe their data to make them available to future users as “self-describing data sets” to address long-term, large-scale research questions.

Accurate and complete metadata make a data set easier to discover and use, which in turn helps others replicate the methods in the future. However, researchers frequently do not fully describe where, when, and how the data were collected.
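A completeness check for the "where, when, and how" of a data set might be sketched as follows. The field names (`title`, `location`, `date_range`, `methods`) are assumptions made for this example and do not come from any particular metadata standard or from DataONE itself.

```python
# Hypothetical required fields and what each one should answer.
REQUIRED_FIELDS = {
    "title": "what the data set contains",
    "location": "where the data were collected",
    "date_range": "when the data were collected",
    "methods": "how the data were collected",
}


def missing_metadata(record):
    """Return the required fields that are absent or blank in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# Made-up record: no date range, and the methods entry is blank.
record = {
    "title": "Example camera-trap detections",
    "location": "Example study site",
    "methods": "",
}

for field in missing_metadata(record):
    print(f"missing {field}: {REQUIRED_FIELDS[field]}")
```

Here the record would be flagged for lacking both a date range and a description of methods, the kind of gap that makes a data set hard to reuse later.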

Anthias in Bunaken National Park, North Sulawesi, Indonesia. Photo credit: Sue Palminteri

The DataONE network has created educational materials to help researchers create metadata, develop data management plans, and prepare their data for long-term storage. For example, an “Investigator Toolkit” offers access to customized tools that help scientists plan and carry out their data collection, as well as process, store, and share information once they’ve collected it. The network also records tutorials and monthly webinars and makes them freely available on its website.

Linking data banks enables users to search for a wide range of information from a single access point. The network also makes the data available through an outreach program to increase the involvement of educators, students, and the interested public in data collection, management, and analysis.
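The idea of a single access point can be illustrated with a toy example: one search function that looks through the catalogs of several repositories and returns the combined matches. The repository names and data set titles below are invented, and the real network's search service is not shown here.

```python
def search_all(repositories, keyword):
    """Search every repository's catalog; return sorted (repository, title) matches."""
    keyword = keyword.lower()
    matches = []
    for repo_name, titles in repositories.items():
        for title in titles:
            if keyword in title.lower():
                matches.append((repo_name, title))
    return sorted(matches)


# Hypothetical catalogs held by two separate repositories.
REPOSITORIES = {
    "repo_a": ["Forest plot census 1990-2010", "Stream temperature logs"],
    "repo_b": ["Forest canopy lidar survey", "Bird point counts"],
}

for repo_name, title in search_all(REPOSITORIES, "forest"):
    print(f"{repo_name}: {title}")
```

A single query for "forest" returns one data set from each repository, without the user needing to know in advance which repository holds what.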

Joining a network can also benefit institutions by broadening the reach of the data they preserve, offering assistance in managing their data, and facilitating access to other data repositories and potential collaborators. Members can also opt to share certain project data only with specified collaborators.

A hard-to-find Doherty’s bushshrike in Uganda. Data sets from multiple studies can help identify changes in species distributions over time. Photo credit: George Powell

DataONE requires certain capacities of its members to ensure long-term access to resources. Each institution must maintain and ensure access to the data sets over the long term, follow good curation and documentation practices, and use standardized metadata to describe data sets. Special software helps catalog and synchronize the data resources of each new member node.

“The repository selects which software stack that they will use to become a Member Node (our recommendation is to use the Generic Member Node software),” said Koskela. “The software is installed, testing occurs on our testing environment, and if testing is successful, the Member Node goes into production.”

The joining process is described on the DataONE website. Members initiate the request to join DataONE, said Koskela, but the limiting factor in adding more members is resources.
