Liberating species records from open data repositories for scientific discovery and reuse

Summary

The 2017 GBIF Ebbe Nielsen Challenge will award a total of €14,000 to developers and data scientists who create tools capable of liberating species records from open data repositories for scientific discovery and reuse.

Background

This year's Challenge will seek to leverage the growth of open data policies among scientific journals and research funders, which require researchers to make the data underlying their findings publicly available. Adoption of these policies represents an important first step toward increasing openness, transparency and reproducibility across all scientific domains, including biodiversity-related research.

To abide by these requirements, researchers often deposit datasets in public open-access repositories. Potential users are then able to find and access the data through repositories as well as data aggregators like OpenAIRE and DataONE. Many of these datasets are already structured in tables that contain the basic elements of biodiversity information needed to build species occurrence records: scientific names, dates, and geographic locations, among others.

However, most repositories, funders and journals do not yet encourage the use of standardized formats, which significantly limits the interoperability and reuse of these datasets. As a result, the wider reuse that many open data policies imply, if not state outright, falls short, even in cases where open licensing designations (like those provided through Creative Commons) seem to encourage it.

The Challenge

The 2017 GBIF Ebbe Nielsen Challenge seeks submissions that repurpose these datasets and adapt them to the Darwin Core Archive format (DwC-A), the interoperable and reusable standard that powers the publication of almost 800 million species occurrence records from the nearly 1,000 worldwide institutions now active in the GBIF network.

The 2017 Ebbe Nielsen Challenge will task developers and data scientists with creating web applications, scripts or other tools that automate the discovery and extraction of relevant biodiversity data from open data repositories. Such tools might generate datasets ready for publication on GBIF.org by:

  1. Automating searches of open data available in public repositories
  2. Effectively mining the information needed to generate checklist, species occurrence and sampling-event datasets (e.g. scientific names, dates and locations of occurrence) from datasets in these repositories
  3. Mapping datasets’ column headings and/or contents to standardized Darwin Core terms
  4. Routinely converting the reformatted data into Darwin Core Archive format, ready for publication through GBIF.org
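
Step 3 above, mapping column headings to Darwin Core terms, can be sketched in Python as a lookup from normalized headings to term names. This is a minimal illustration under stated assumptions, not a recommended implementation: the Darwin Core term names (scientificName, eventDate, decimalLatitude, decimalLongitude, country, locality) are real, but the synonym table, the function names and the sample row are made up for this example.

```python
import csv
import io
import re

# Hypothetical synonym table: normalized source heading -> Darwin Core term.
# The term names are real Darwin Core terms; the synonyms are illustrative guesses.
HEADER_SYNONYMS = {
    "species": "scientificName",
    "scientificname": "scientificName",
    "taxon": "scientificName",
    "date": "eventDate",
    "collectiondate": "eventDate",
    "lat": "decimalLatitude",
    "latitude": "decimalLatitude",
    "lon": "decimalLongitude",
    "long": "decimalLongitude",
    "longitude": "decimalLongitude",
    "country": "country",
    "site": "locality",
}


def normalize(heading):
    """Lowercase a heading and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", heading.lower())


def map_headers(headings):
    """Return {source heading: Darwin Core term} for the headings we recognize."""
    mapping = {}
    for heading in headings:
        term = HEADER_SYNONYMS.get(normalize(heading))
        if term is not None:
            mapping[heading] = term
    return mapping


def to_darwin_core(csv_text):
    """Rename recognized columns and drop the rest; returns a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = map_headers(reader.fieldnames or [])
    return [
        {term: row[source] for source, term in mapping.items()}
        for row in reader
    ]


# Made-up sample table standing in for a file downloaded from a repository.
sample = (
    "Species,Collection Date,Lat,Long,Observer notes\n"
    "Aedes aegypti,2014-03-02,-3.72,-38.54,near drain\n"
)
records = to_darwin_core(sample)
```

Unrecognized columns (here, "Observer notes") are silently dropped in this sketch; a real tool would probably report them so that a person can decide whether they map to another term.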

Resources and reference material

Background on Darwin Core and Darwin Core Archives

Examples of datasets manually harvested and published from open-data repositories

Global compendium of Aedes aegypti and Ae. albopictus occurrence

LTER sampling-event dataset, Bird census at the beach of Doñana Natural Space

Open-data repositories and aggregators 

The following list is not by any means exhaustive. We welcome suggestions on other relevant services to highlight for prospective Challenge entrants.

Extra credit

Keeping the 2016 Ebbe Nielsen Challenge in mind, GBIF is particularly interested in tools that address data biases and fill gaps by mobilizing occurrences from under-represented geographies, taxa, time periods, or thematic areas like vectors of human disease or alien and invasive species.

GBIF is also eager to see tools capable of converting open-access repository datasets into the quantitative 'sampling-event' format recently supported in the Darwin Core standard. Such datasets can capture richer information like species abundance, presence/absence, level of effort, and standard sampling methodologies and protocols.
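
As a rough illustration of what a sampling-event conversion involves, the Python sketch below splits a flat survey table into event records and occurrence records linked by eventID. The output keys (eventID, eventDate, locality, samplingProtocol, occurrenceID, scientificName, individualCount, occurrenceStatus) are Darwin Core terms; the input column names (site, date, method, species, count), the identifier scheme and the sample rows are assumptions invented for this example.

```python
def split_sampling_events(rows):
    """Split flat survey rows into event records and occurrence records."""
    events = {}
    occurrences = []
    for row in rows:
        # One event per unique combination of site, date and method.
        key = (row["site"], row["date"], row["method"])
        if key not in events:
            events[key] = {
                "eventID": f"event-{len(events) + 1}",
                "locality": row["site"],
                "eventDate": row["date"],
                "samplingProtocol": row["method"],
            }
        count = int(row["count"])
        occurrences.append({
            "occurrenceID": f"occ-{len(occurrences) + 1}",
            "eventID": events[key]["eventID"],
            "scientificName": row["species"],
            "individualCount": count,
            # A zero count is kept as an explicit absence record.
            "occurrenceStatus": "present" if count > 0 else "absent",
        })
    return list(events.values()), occurrences


# Made-up rows standing in for a repository dataset.
rows = [
    {"site": "Transect A", "date": "2016-05-01", "method": "line transect",
     "species": "Calidris alba", "count": "12"},
    {"site": "Transect A", "date": "2016-05-01", "method": "line transect",
     "species": "Larus fuscus", "count": "0"},
    {"site": "Transect A", "date": "2016-06-01", "method": "line transect",
     "species": "Calidris alba", "count": "7"},
]
events, occurrences = split_sampling_events(rows)
```

Keeping zero counts as "absent" records is the design choice that preserves presence/absence information, which a plain occurrence list would lose.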

Special thanks to the Swedish Research Council for its support of the 2017 Ebbe Nielsen Challenge.

Eligibility

Summarized from the Official Rules

The Challenge is open to individuals, teams of individuals, companies and their employees, and governmental agencies and their employees.

The Challenge is NOT open to:
  • Members of the GBIF Secretariat
  • Individuals currently under an external contract issued by the GBIF Secretariat
  • Members of the GBIF Science Committee
  • Heads of Delegation to GBIF

Requirements

Submissions will consist of three main elements:
  1. Entry details, including the names of all team members; identification of a lead team representative; and the objective of the Submission
  2. Narrative description, which explains the approach taken in the Submission; identifies which open data repository (or repositories) the Submission uses and why; estimates the number of datasets that could be mobilized by applying the Submission to the repository; and discusses any additional data processing, quality assessment or quality control steps required prior to publishing the output through GBIF.org.
  3. Example dataset, liberated from an open data repository and prepared for formatting and publishing as a Darwin Core Archive.

In addition, Submissions must:

  • Attribute and credit data originators, metadata authors, and others involved in the preparation of the original datasets
  • Produce outputs that are usable by and clearly documented for others, so that resulting datasets can be used without contacting the original authors 
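
For entrants unfamiliar with the target format, the Python sketch below shows one way an example dataset might be packaged as a simple Darwin Core Archive: a zip file holding a tab-delimited occurrence table and a meta.xml descriptor that maps each column to a Darwin Core term. It is a simplified sketch under assumptions, not a complete publishing workflow: the function names, file names and sample record are invented here, and the archive omits the eml.xml metadata file that would normally carry the attribution, licence and documentation details described above.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"
FIELDS = ["occurrenceID", "scientificName", "eventDate",
          "decimalLatitude", "decimalLongitude"]


def build_meta_xml(fields, data_file="occurrence.txt"):
    """Describe a tab-delimited occurrence core file for a Darwin Core Archive."""
    archive = ET.Element("archive", {"xmlns": "http://rs.tdwg.org/dwc/text/"})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": "\\t",   # literal backslash-t, as meta.xml expects
        "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": "",
        "ignoreHeaderLines": "1",
        "rowType": DWC + "Occurrence",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = data_file
    ET.SubElement(core, "id", {"index": "0"})
    for index, name in enumerate(fields):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC + name})
    return ET.tostring(archive, encoding="unicode")


def clean(value):
    """Collapse tabs and line breaks so a value fits in one tab-delimited cell."""
    return " ".join(str(value).split())


def build_archive(records, fields=FIELDS):
    """Return the bytes of a zip holding occurrence.txt and meta.xml."""
    lines = ["\t".join(fields)]
    for record in records:
        lines.append("\t".join(clean(record.get(name, "")) for name in fields))
    table = "\n".join(lines) + "\n"

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", table)
        archive.writestr("meta.xml", build_meta_xml(fields))
    return buffer.getvalue()


# Made-up record for illustration only.
archive_bytes = build_archive([{
    "occurrenceID": "occ-1",
    "scientificName": "Aedes aegypti",
    "eventDate": "2014-03-02",
    "decimalLatitude": "-3.72",
    "decimalLongitude": "-38.54",
}])
```

Writing `archive_bytes` to a file with a `.zip` extension yields an archive whose structure can then be checked with a Darwin Core Archive validator before any attempt at publication.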

How to enter

  1. Register for the Challenge. Registrants must either create a ChallengePost account or log in with an existing ChallengePost account. There is no charge for creating a ChallengePost account, and doing so will ensure that you receive updates and can access the “Enter a Submission” page. Note that all team members must also create a ChallengePost account in order to be added to a Submission.
  2. Familiarize yourself with the Darwin Core standard as well as methods for publishing data through the GBIF network or accessing data through GBIF.org, GBIF web services, or other tools like rgbif.
  3. Explore relevant public open data repositories and aggregators and their APIs as well as open-source data access and publishing tools (R packages, etc.) 
  4. Produce an application, script or other tool for liberating species records from open data repositories for scientific discovery and reuse 
  5. Submit a narrative description detailing your Submission, along with any relevant technical requirements or implementation details. Consider including a video, slides and/or a sample output from the tool.
  6. Complete and enter all of the required fields on the “Enter a Submission” page of the Challenge Website (each a “Submission”) by the end of the Challenge Submission Period, that is, by 23:00 CEST (UTC+2) on 5 Sept 2017.

Judges

Roderic Page
Professor of Taxonomy, University of Glasgow | Chair, GBIF Science Committee

Alexandre Antonelli
Professor in Biodiversity & Systematics, University of Gothenburg

Brenda Daly
Information Systems Manager, SANBI: South African National Biodiversity Institute

Rob Guralnick
Associate Curator, University of Florida

Ana Cláudia Mendes Malhado
Co-coordinator LACOS21 | Lecturer, Federal University of Alagoas, Brazil

Anabela Plos
Node Manager, GBIF Argentina | Museo Argentino de Ciencias Naturales (CONICET) and Sistema Nacional de Datos Biológicos (SNDB-MinCyT)

Amy Zanne
Associate Professor of Biology, George Washington University

Judging Criteria

  • Universality and Scale
    Is the Submission reusable and interoperable across different public data repositories and aggregator systems? Based on these choices, approximately how many datasets might the Submission be expected to liberate?
  • Innovation
    How creative is the Submission?
  • Functionality
    How well does the Submission work? Does it have a working prototype? Does it respect and maintain existing licences?