jesus orozco • almost 9 years ago
Example of manually harvested dataset
In the "Global compendium of Aedes aegypti and Ae. albopictus occurrence" section (link: http://dx.doi.org/10.5061/dryad.47v3c.2), I see a csv file and a README.pdf file explaining the field names (data dictionary).
My question is: when I run a search under For researchers > Use data, none of the results I've looked into have csv files, although they have attachments in other formats (for instance Excel). How can I distinguish such datasets, or what can I do to retrieve an equivalent source?
Thank you

5 comments
Kyle Copas Manager • almost 9 years ago
Hi, Jesus,
Not quite following the last detail here: where are you running the search under 'For researchers > Use data'?
jesus orozco • almost 9 years ago
Hi Kyle,
These are the steps I'm following:
1) browse to http://datadryad.org/resource/doi:10.5061/dryad.47v3c.2
2) in the top menu, I pick the option "For researchers" -> "Use Data"
3) the search page is shown
4) enter the text "aegypti" in "Search Terms"
5) click "Go"
6) search results are shown
7) I click the first result link; in this case the text says "Ritchie SA, Townsend M, Paton CJ, ...(omitted lines for brevity)
8) details page is shown
Title: Application of wMelPop Wolbachia strain to crash local populations of Aedes aegypti
Downloaded: 33 times
Description: Studies of the survival of Aedes aegypti eggs infected with the Wolbachia strain wMelPop "popcorn"
Download: Ritchie popcorn extinction data repository....xlsx (32.74 Kb)
Details: View File Details
9) compared to the original asset (http://datadryad.org/resource/doi:10.5061/dryad.47v3c.2) one can see the differences.
Title: Ae. aegypti and Ae. albopictus occurrences
Downloaded: 487 times
Description: This datafile contains a comprehensive list of occurrences of Ae. aegypti and Ae. albopictus from 1960-2014.
Download: aegypti_albopictus.csv (3.406 Mb)
Details: View File Details
The file I found is in Excel format, and I can't find content structured in the same way as the csv.
At this point I was wondering whether there's a search keyword or operator that can be used to retrieve content suitable for feeding the resulting Darwin Core file.
Thanks
Kyle Copas Manager • almost 9 years ago
Right, thanks for the detailed example. And it's a good example of the challenge.
The search for aegypti definitely pulls up mosquito-related data, but the hitch with Ritchie and several of the other search results is that a quick scan of the title tells me that it's unlikely to contain GBIF-relevant data. For an occurrence dataset, we'd typically be looking for a species name, date and location. Ritchie et al. aren't really investigating anything to do with the occurrence of the species under review; they're looking at ecological patterns.
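As a rough illustration of that screening step, here is a minimal sketch that checks whether a file's column headers appear to cover a species name, a date and a location. The keyword lists and the function name are assumptions made for this example, not anything Dryad or GBIF provides, and plain substring matching is crude (for example, "lat" would also match "plate"), so treat the result as a hint only.

```python
# Hypothetical header keywords; adjust them to the files you actually encounter.
HINTS = {
    "species name": ("species", "scientificname", "taxon"),
    "date": ("date", "year"),
    "location": ("lat", "lon", "locality", "country", "site"),
}


def missing_occurrence_fields(headers):
    """Return the kinds of field that no header appears to cover."""
    # Normalize headers so "Collection date" and "collection_date" compare alike.
    normalized = [
        h.strip().lower().replace("_", "").replace(" ", "") for h in headers
    ]
    missing = []
    for kind, keywords in HINTS.items():
        covered = any(k in h for h in normalized for k in keywords)
        if not covered:
            missing.append(kind)
    return missing


# Invented header rows, for illustration only.
occurrence_like = ["Species", "Collection date", "Latitude", "Longitude"]
experiment_like = ["treatment", "egg_count", "hatch_rate"]
```

Calling `missing_occurrence_fields(experiment_like)` lists every kind of field that seems to be absent, which is a quick way to set aside files that are unlikely to hold occurrence records.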
So, what would I suggest...I'm not sure. If you go back to the search result where you found Ritchie, at a glance, I see very little promising among the next results, only a few 'definite maybes', like the next one.
With some additional interpretation, Dzul-Manzanilla could be coaxed into an occurrence dataset, maybe even a sampling-event one because it's got counts/abundance. The external interpretation derives from the fact that it's already a single species study on Aedes aegypti, even though the species name is not in the spreadsheet.
But for those that follow, none of the titles suggest to me that they'll evince a lot of interest in recorded observations of Ae. aegypti—they're all focused on other research around the species.
So, perhaps, broaden your search terms, including names of higher taxonomic groups. The first result for 'Diptera species', for instance, turns up an interesting candidate, because it includes not only Diptera spp. but also some plants they pollinate. Locations and dates, however, again require some interpretation.
Whatever you devise for searches, you may need some way of previewing the files and then either mapping their columns or prefilling values so that they fit an occurrence dataset.
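A minimal sketch of what mapping and prefilling might look like, assuming the spreadsheet has been exported to csv first. The source column names and the sample row are invented for illustration; the target names (scientificName, eventDate, decimalLatitude, decimalLongitude, individualCount) are Darwin Core terms. The prefilled species name stands in for the single-species case mentioned above, where the name is not in the spreadsheet itself.

```python
import csv
import io

# Hypothetical source columns; the real spreadsheet headers will differ.
COLUMN_MAP = {
    "date_collected": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "females": "individualCount",
}

# Values implied by the study but absent from the file, e.g. the species name.
DEFAULTS = {"scientificName": "Aedes aegypti"}


def to_occurrence_rows(rows, column_map, defaults):
    """Rename mapped columns and prefill constant values for each row."""
    out = []
    for row in rows:
        mapped = dict(defaults)  # start from the prefilled values
        for source, target in column_map.items():
            # Copy only columns that exist and are not empty.
            if source in row and row[source] != "":
                mapped[target] = row[source]
        out.append(mapped)
    return out


# Invented sample data standing in for an exported spreadsheet.
sample = io.StringIO(
    "date_collected,lat,lon,females\n"
    "2012-03-01,20.97,-89.62,4\n"
)
occurrences = to_occurrence_rows(csv.DictReader(sample), COLUMN_MAP, DEFAULTS)
```

Keeping the mapping and the defaults as plain dictionaries means each candidate dataset only needs its own small configuration, while the conversion function stays the same.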
Hope this helps.
jesus orozco • almost 9 years ago
Thanks for the prompt and detailed explanation, Kyle.
Kyle Copas Manager • almost 9 years ago
N.p.
Two more thoughts.
First, here's a wish list of datasets, which, even at this early stage, do not all come from the GBIF Secretariat. Might be some useful test cases hidden here: https://github.com/gbif/data-mobilization/issues
Second, just stumbled across a very interesting text file buried in Supplementary Materials here: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1000385#s8
GALLIFORM: WPA Eurasian Database v 1.0 has 65,000 records. The locations are countries, though, which might make its data more suitable to national checklists, say. But it neatly describes the problem the Challenge hopes to address, because in an article highlighting problems about access to biodiversity data, the supporting data itself remains largely hidden (at least in any format that's reusable as such).
Note, too, that the dataset in question is also deposited in Dryad: http://datadryad.org/resource/doi:10.5061/dryad.1464/1