Liberating species records from open data repositories for scientific discovery and reuse
The 2017 GBIF Ebbe Nielsen Challenge will award a total of €14,000 to developers and data scientists who create tools capable of liberating species records from open data repositories for scientific discovery and reuse.
This year's Challenge will seek to leverage the growth of open data policies among scientific journals and research funders, which require researchers to make the data underlying their findings publicly available. Adoption of these policies represents an important first step toward increasing openness, transparency and reproducibility across all scientific domains, including biodiversity-related research.
To abide by these requirements, researchers often deposit datasets in public open-access repositories. Potential users are then able to find and access the data through repositories as well as data aggregators like OpenAIRE and DataONE. Many of these datasets are already structured in tables that contain the basic elements of biodiversity information needed to build species occurrence records: scientific names, dates, and geographic locations, among others.
However, the practices adopted by most repositories, funders and journals do not yet encourage the use of standardized formats, which significantly limits the interoperability and reuse of these datasets. As a result, the wider reuse of data that many open data policies imply, if not state outright, falls short, even in cases where open licensing designations (like those provided through Creative Commons) seem to encourage it.
The 2017 GBIF Ebbe Nielsen Challenge seeks submissions that repurpose these datasets and adapt them into the Darwin Core Archive format (DwC-A), the interoperable and reusable standard that powers the publication of almost 800 million species occurrence records from the nearly 1,000 worldwide institutions now active in the GBIF network.
The 2017 Ebbe Nielsen Challenge will task developers and data scientists with creating web applications, scripts or other tools that automate the discovery and extraction of relevant biodiversity data from open data repositories. Such tools might generate datasets ready for publication on GBIF.org by:
- Automating searches of open data available in public repositories
- Effectively mining the information needed to generate checklists, species occurrence and sampling-event datasets (e.g. scientific names, dates and locations of occurrence) from datasets in these repositories
- Mapping datasets’ column headings and/or contents to standardized Darwin Core terms
- Routinely converting the reformatted data into Darwin Core Archive format, ready for publication through GBIF.org
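As a rough illustration of the heading-mapping step listed above, the Python sketch below normalizes column headings and looks them up in a small synonym table. The table HEADING_TO_DWC, the function names and the sample row are hypothetical, invented for this example; only the Darwin Core term names (scientificName, eventDate, decimalLatitude, decimalLongitude) come from the standard. A real tool would need a far larger table and would probably inspect cell contents as well as headings.

```python
import csv
import io
import re

# Hypothetical synonym table: normalized source heading -> Darwin Core term.
HEADING_TO_DWC = {
    "species": "scientificName",
    "scientificname": "scientificName",
    "taxon": "scientificName",
    "date": "eventDate",
    "eventdate": "eventDate",
    "lat": "decimalLatitude",
    "latitude": "decimalLatitude",
    "lon": "decimalLongitude",
    "long": "decimalLongitude",
    "longitude": "decimalLongitude",
}


def normalize(heading):
    """Lower-case a heading and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", heading.lower())


def map_headings(headings):
    """Return (mapping, unmapped): source heading -> term, plus leftover headings."""
    mapping, unmapped = {}, []
    for heading in headings:
        term = HEADING_TO_DWC.get(normalize(heading))
        if term:
            mapping[heading] = term
        else:
            unmapped.append(heading)
    return mapping, unmapped


def to_dwc_rows(csv_text):
    """Re-key each CSV row with Darwin Core terms; unmapped columns are dropped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping, unmapped = map_headings(reader.fieldnames or [])
    rows = [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in reader
    ]
    return rows, unmapped


# Made-up sample table, not taken from any real dataset.
SAMPLE = "Species,Date,Lat,Long,Notes\nAedes aegypti,2014-06-01,-3.1,-60.0,trap 4\n"
rows, unmapped = to_dwc_rows(SAMPLE)
```

Returning the unmapped headings alongside the converted rows lets a person review what the table could not recognize, rather than losing those columns silently.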
Resources and reference material
Background on Darwin Core and Darwin Core Archives
- What is Darwin Core (and why does it matter)?
- Darwin Core Archive: A how-to guide
- Explainer on GBIF dataset types/classes
- DwC-A templates for checklists, occurrence datasets and sampling-event datasets
- Data quality recommendations
- Recommended terms for sampling events
- DwC-A Validator
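For orientation, a Darwin Core Archive is essentially a zip file holding one or more delimited text files plus a meta.xml descriptor that maps each column to a term URI. The Python sketch below, which uses only the standard library, writes a minimal occurrence archive. The function names and the file name occurrence.txt are choices made for this example, and the sketch omits the EML metadata document (eml.xml) that a dataset published through GBIF.org would also need. Treat it as an outline under those assumptions, not a complete publishing tool.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_NS = "http://rs.tdwg.org/dwc/terms/"   # namespace of Darwin Core terms
TEXT_NS = "http://rs.tdwg.org/dwc/text/"   # namespace of the meta.xml descriptor


def build_meta_xml(columns, data_file="occurrence.txt"):
    """Build a meta.xml string describing one tab-delimited core file."""
    ET.register_namespace("", TEXT_NS)
    archive = ET.Element(f"{{{TEXT_NS}}}archive")
    core = ET.SubElement(archive, f"{{{TEXT_NS}}}core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": "\\t",   # literal backslash-t, as the descriptor expects
        "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": DWC_NS + "Occurrence",
    })
    files = ET.SubElement(core, f"{{{TEXT_NS}}}files")
    ET.SubElement(files, f"{{{TEXT_NS}}}location").text = data_file
    # The first column is assumed to hold the record identifier.
    ET.SubElement(core, f"{{{TEXT_NS}}}id", {"index": "0"})
    for index, term in enumerate(columns):
        ET.SubElement(core, f"{{{TEXT_NS}}}field",
                      {"index": str(index), "term": DWC_NS + term})
    return ET.tostring(archive, encoding="unicode")


def write_dwca(rows, columns, target):
    """Write rows (dicts keyed by Darwin Core term) into a zip at target."""
    table = io.StringIO()
    writer = csv.DictWriter(table, fieldnames=columns, delimiter="\t",
                            lineterminator="\n", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", table.getvalue())
        archive.writestr("meta.xml", build_meta_xml(columns))
```

Here target can be a file path or an in-memory io.BytesIO object. An archive produced this way should still be checked with the DwC-A Validator before publication.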
Examples of datasets manually harvested and published from open-data repositories
Global compendium of Aedes aegypti and Ae. albopictus occurrence
- Kraemer MUG, Sinka ME, Duda KA, Mylne A, Shearer FM, Brady OJ, Messina JP, Barker CM, Moore CG, Carvalho RG, Coelho GE, Van Bortel W, Hendrickx G, Schaffner F, Wint GRW, Elyazar IRF, Teng H, Hay SI (2015) Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.47v3c.2
Originally published in
Kraemer MUG, Sinka ME, Duda KA et al. (2015) The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus. eLife 4:e08347 http://dx.doi.org/10.7554/eLife.08347
Kraemer MUG, Sinka ME, Duda KA et al. (2015) The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data 2(7): 150035. http://dx.doi.org/10.1038/sdata.2015.35
- On new GBIF.org: Global compendium of Aedes albopictus occurrence
- On new GBIF.org: Global compendium of Aedes aegypti occurrence
LTER sampling-event dataset, Bird census at the beach of Doñana Natural Space
- On DataONE: https://search.dataone.org/#view/knb-lter-europe-deims.13610.15384
- On LTER-Europe: https://data.lter-europe.net/deims/dataset/2a0762f2-4630-11e3-aeb9-005056ab003f
- On GBIF Spain IPT: http://www.gbif.es/ipt/resource?r=donana
- On new GBIF.org: https://demo.gbif.org/dataset/9a57e938-3616-4f8c-985a-c9b66e7a1347
Open-data repositories and aggregators
The following list is not by any means exhaustive. We welcome suggestions on other relevant services to highlight for prospective Challenge entrants.
- Dryad | Data access | rdryad
- FigShare | API feature list | rfigshare
- Zenodo | Developers site | rzenodo
- Mendeley Data | Dataset API
- OpenAIRE | API documentation
- DataONE | API reference | R for DataONE
Keeping the 2016 Ebbe Nielsen Challenge in mind, GBIF is particularly interested in tools that address data biases and fill gaps by mobilizing occurrences from under-represented geographies, taxa, time periods, or thematic areas like vectors of human disease or alien and invasive species.
GBIF is also eager to see tools capable of converting open-access repository datasets into the quantitative 'sampling-event' format recently supported in the Darwin Core standard. Such datasets can capture richer information like species abundance, presence/absence, level of effort, and standard sampling methodologies and protocols.
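To make the sampling-event idea concrete, the Python sketch below splits a flat census table into an event table and an occurrence table linked by eventID. The input keys (site, date, species, count), the way eventID and occurrenceID are constructed, and the sample rows are all assumptions made for this example. The output keys are Darwin Core terms. Recording a zero count as "absent" is what lets such a dataset carry presence/absence information.

```python
def split_sampling_events(records, protocol):
    """Split flat census rows into (events, occurrences) keyed by Darwin Core terms.

    Each record is a dict with hypothetical keys: site, date, species, count.
    Rows that share a site and date are treated as one sampling event.
    """
    events = {}
    occurrences = []
    for record in records:
        event_id = f"{record['site']}:{record['date']}"
        # Create the event the first time this site/date pair is seen.
        events.setdefault(event_id, {
            "eventID": event_id,
            "eventDate": record["date"],
            "locality": record["site"],
            "samplingProtocol": protocol,
        })
        count = int(record["count"])
        occurrences.append({
            "occurrenceID": f"{event_id}:{len(occurrences) + 1}",
            "eventID": event_id,
            "scientificName": record["species"],
            "individualCount": count,
            # A zero count is kept as an explicit absence, not dropped.
            "occurrenceStatus": "present" if count > 0 else "absent",
        })
    return list(events.values()), occurrences


# Made-up census rows, for illustration only.
census = [
    {"site": "Beach A", "date": "2016-05-01", "species": "Calidris alba", "count": "12"},
    {"site": "Beach A", "date": "2016-05-01", "species": "Larus fuscus", "count": "0"},
    {"site": "Beach B", "date": "2016-05-01", "species": "Calidris alba", "count": "3"},
]
events, occurrences = split_sampling_events(census, "shoreline transect count")
```

In an archive, the event table would serve as the core file and the occurrence table as an extension that refers back to it through eventID.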
Special thanks to the Swedish Research Council for its support of the 2017 Ebbe Nielsen Challenge.
- Members of the GBIF Secretariat
- Individuals currently under an external contract issued by the GBIF Secretariat
- Members of the GBIF Science Committee
- Heads of Delegation to GBIF
- Entry details, including the names of all team members; identification of a lead team representative; and the objective of the Submission
- Narrative description, which explains the approach taken in the Submission; identifies which open data repository (or repositories) the Submission uses and why; estimates the number of datasets that could be mobilized by applying the Submission to the repository; and discusses any additional data processing, quality assessment or quality control steps required prior to publishing the output through GBIF.org.
- Example dataset, liberated from an open data repository and prepared for formatting and publishing as a Darwin Core Archive.
In addition, Submissions must:
- Attribute and credit data originators, metadata authors, and others involved in the preparation of the original datasets
- Produce outputs that are usable by and clearly documented for others, so that resulting datasets can be used without contacting the original authors
How to enter
- Register for the Challenge. Registrants must either create a ChallengePost account or log in with an existing ChallengePost account. There is no charge for creating a ChallengePost account, and doing so will ensure that you receive updates and can access the “Enter a Submission” page. Note that all team members must also create a ChallengePost account in order to be added to a Submission.
- Familiarize yourself with the Darwin Core standard as well as methods for publishing data through the GBIF network or accessing data through GBIF.org, GBIF web services, or other tools like rgbif.
- Explore relevant public open data repositories and aggregators and their APIs as well as open-source data access and publishing tools (R packages, etc.)
- Produce an application, script or other tool for liberating species records from open data repositories for scientific discovery and reuse
- Submit a narrative description detailing your Submission, along with any relevant technical requirements or implementation details. Consider including a video, slides and/or a sample output from the tool.
- Complete and enter all of the required fields on the “Enter a Submission” page of the Challenge Website (each a “Submission”) by the end of the Challenge Submission Period, that is, by 2300 CEST (UTC+2) on 5 Sept 2017.
Professor of Taxonomy, University of Glasgow | Chair, GBIF Science Committee
Professor in Biodiversity & Systematics, University of Gothenburg
Information Systems Manager, SANBI: South African National Biodiversity Institute
Associate Curator, University of Florida
Ana Cláudia Mendes Malhado
Co-coordinator LACOS21 | Lecturer, Federal University of Alagoas, Brazil
Node Manager, GBIF Argentina | Museo Argentino de Ciencias Naturales (CONICET) and Sistema Nacional de Datos Biológicos (SNDB-MinCyT)
Associate Professor of Biology, George Washington University
Universality and Scale
Is the Submission reusable and interoperable across different public data repositories and aggregator systems? Based on these choices, approximately how many datasets might the Submission be expected to liberate?
How creative is the Submission?
How well does the Submission work? Does it have a working prototype? Does it respect and maintain existing licences?