User:Magnus Manske/NHM

This page tracks and coordinates efforts to use the Natural History Museum data portal with Wikidata, an effort originally kicked off by Richard Nevell via email.

Basic info

  • Their catalogue of specimens has ~3.5M entries, license CC0-1.0
  • Their catalogue of type specimens has ~76K images (mostly?) under CC-BY-4.0

What has been done

  • both the specimen catalogue and the multimedia list have been downloaded and extracted to Toolforge (/data/project/mix-n-match/manual_lists/nhm)
  • a preliminary species parsing has been used as the basis for a Mix'n'match catalog
  • several Wikidata items have been created on the basis of, and matched to, this catalog
    • So that we can quantify how many items were created: which account made the edits, and roughly when? That will be very useful for reporting on how this goes. Richard Nevell (talk) 13:43, 21 November 2017 (UTC)

Issues

  • specific data is difficult to extract from the species catalogue, e.g. species names mixed with the author name/date of first description
  • some species names appear to contain spelling errors, or are outdated
  • odd mix of CSV and JSON data
  • no apparent format documentation
  • no clear link between entries in the species catalogue and the multimedia list (best candidates are "BM\d+" IDs)
    • I suspect that spelling errors and the use of old names don't have an easy fix, although there is a data quality scale in the dataset which could help filter out entries with possibly problematic names. There might be some link between the catalogue entries and the multimedia list. I'll ask the NHM staff about it. As for documentation, do you have a rough idea of what would be useful, in case the NHM have some that hasn't been published yet? Richard Nevell (WMUK) (talk) 13:43, 21 November 2017 (UTC)
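
A rough sketch of how the first and last issues above might be approached in Python. It is only a heuristic under stated assumptions: the function names, the name pattern and the sample ID are hypothetical, the actual layout of the NHM fields is undocumented, and the only part taken from this page is the "BM\d+" ID shape.

```python
import re

# Heuristic: a capitalised genus followed by one or two lowercase epithets;
# whatever follows is treated as the authorship string.
# Known limitation: lowercase author particles ("de", "van") would be
# mistaken for an epithet.
NAME_RE = re.compile(
    r"^(?P<name>[A-Z][a-z-]+(?:\s+[a-z][a-z-]+){1,2})"
    r"(?:\s+(?P<authorship>.+))?$"
)
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

# "BM" followed by digits, not preceded by another letter or digit.
BM_ID_RE = re.compile(r"(?<![A-Za-z0-9])BM\d+")


def split_scientific_name(raw):
    """Split a raw name string into name, authorship and year (or None)."""
    text = " ".join(raw.split())  # collapse stray whitespace
    match = NAME_RE.match(text)
    if match is None:
        return None  # leave unparseable entries for manual review
    authorship = match.group("authorship")
    year = None
    if authorship:
        found = YEAR_RE.search(authorship)
        if found:
            year = int(found.group(1))
    return {"name": match.group("name"), "authorship": authorship, "year": year}


def extract_bm_ids(text):
    """Return the distinct BM IDs found in a text field, in order of appearance."""
    return list(dict.fromkeys(BM_ID_RE.findall(text)))
```

For example, split_scientific_name("Panthera leo (Linnaeus, 1758)") separates the name "Panthera leo" from the authorship "(Linnaeus, 1758)". Running extract_bm_ids over both the catalogue rows and the multimedia rows would give candidate join keys, assuming the IDs really are shared between the two lists.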

Goals

  • extract reliable species list
  • match to Wikidata
  • link NHM species to NHM images (under free license)
  • find full resolution images (NHM multimedia list only links to low-res "preview" images)
    • I'm in contact with the NHM so I'll ask about where the full-resolution images are. They were able to tell us where when Stan3 (talk · contribs) uploaded the audio files from BioAcoustica. Richard Nevell (WMUK) (talk) 13:43, 21 November 2017 (UTC)
  • find Wikidata species matched to NHM species that do not have an image
  • upload image from NHM to Commons, add image to Wikidata
  • use NHM SPARQL API or SQL API (documentation)
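
For the "matched species without an image" goal, one possible approach is to query the Wikidata Query Service rather than the NHM APIs. The sketch below only builds the query text. It assumes matching by taxon name (P225) and tests for a missing image (P18); the function name is hypothetical, and real matches made via Mix'n'match may need a different join.

```python
import json


def build_missing_image_query(taxon_names):
    """Build a SPARQL query for Wikidata items with one of the given
    taxon names (P225) and no image statement (P18)."""
    # json.dumps yields a double-quoted, backslash-escaped literal,
    # which is also a valid SPARQL string literal.
    values = " ".join(json.dumps(name) for name in taxon_names)
    return (
        "SELECT ?item ?name WHERE {\n"
        f"  VALUES ?name {{ {values} }}\n"
        "  ?item wdt:P225 ?name .\n"
        "  FILTER NOT EXISTS { ?item wdt:P18 [] }\n"
        "}"
    )


if __name__ == "__main__":
    print(build_missing_image_query(["Panthera leo", "Apatetes bourdillonii"]))
```

The resulting text can be pasted into the query editor at https://query.wikidata.org/ or sent to its SPARQL endpoint. Long name lists would probably need to be split into batches.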

Volunteers

  • I would love to help on an ad-hoc volunteer basis if I can? I have expertise in systematics and nomenclatural issues, particularly with plant taxa. I also have a few thousand links from specific NHM specimens to open access research papers if you're interested: http://rossmounce.co.uk/2015/05/24/bmnh-specimens-used-in-plos-one/ Metacladistics (talk) 12:54, 20 November 2017 (UTC)
    • @Metacladistics: Thanks for offering to help! I think we should come up with a list of tasks where volunteers can help. The data on papers discussing NHM specimens is very interesting. My starting point for this was using images from the NHM to illustrate Wikidata items and Wikipedia articles. Would it be worth having Wikidata items on individual specimens? Maybe not the whole lot, as it would add a huge number of new items to Wikidata, but maybe the ones which are discussed in published literature would be worth adding? Richard Nevell (WMUK) (talk) 16:33, 21 November 2017 (UTC)
      • Agree that all specimens should definitely NOT go into Wikidata (maybe revisit that decision in ten years' time, when Wikidata has conquered everything else). However, if a specimen is cited/mentioned in two or more academic papers, that could serve as some sort of 'notability' criterion. Make it three or more if one wants to be more restrictive. Surprisingly few specimens are detectably mentioned in two or more papers(!) Metacladistics (talk) 22:37, 22 November 2017 (UTC)

I don't fully understand how Wikidata or the Mix'n'match tool works in complex cases. For instance, I've done the sleuthing required to work out that "Corasia bourdillonii" https://tools.wmflabs.org/mix-n-match/#/entry/23974571 is a synonym of the currently accepted name Apatetes bourdillonii, but Wikidata doesn't have this taxon yet, so I can't assign a Q number. How does one add valid taxa that Wikidata doesn't yet have?

  • Never mind. I have figured out how to add valid taxa by trial and error. Metacladistics (talk) 23:46, 22 November 2017 (UTC)