CETAF Specimen Catalogue

From CETAF Identifiers Wiki
Revision as of 13:24, 23 January 2020 by Jörg Holetschek (Talk | contribs)

The CETAF Specimen Catalogue, maintained in Berlin, was set up to provide a single access point for linked open data published by CETAF institutions. By merging semantically annotated specimen data from different sources into a single triple store that is accessible through a SPARQL interface, we hope to facilitate new uses of the data, such as linkages between specimens or collectors.

Specimen data from an institution can be incorporated into the catalogue if

  1. it is published as RDF documents using the CETAF Specimen Preview Profile (CSPP), and
  2. the CETAF ID is published to GBIF as a GUID (occurrence ID).
Figure (Specimen Catalogue.jpg): Data Flow in the CETAF Specimen Catalogue

The starting point for the Catalogue is the GBIF Index, downloaded as a zip file every two to three months. After major elements of the index have been imported into a SQL Server database, the CETAF IDs of the 14 partners of the Stable Identifiers Implementers Group will be extracted into the CETAF ID Catalogue (hence the requirement that institutions publish the CETAF ID as the GBIF occurrence ID). This will happen regardless of the implementation level: IDs that resolve only to a human-readable representation of the object will be included as well. This list of IDs (22 million as of 23 January 2020) will be made accessible through a simple web service (implementation pending).
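
The extraction step can be illustrated with a minimal sketch. The Catalogue itself loads the index into SQL Server; the sketch below swaps that for a plain Python pass over a tab-separated occurrence file, which is only meant to show the idea. The column name `occurrenceID`, the URL prefixes and the function name `extract_cetaf_ids` are assumptions made for illustration and are not taken from the actual workflow.

```python
import csv
import io

# Hypothetical URL prefixes standing in for the partners' CETAF ID patterns.
PARTNER_PREFIXES = (
    "https://herbarium.example.org/object/",
    "http://data.example.net/specimen/",
)


def extract_cetaf_ids(tsv_lines, prefixes=PARTNER_PREFIXES, id_column="occurrenceID"):
    """Return the sorted, de-duplicated occurrence IDs that start with a partner prefix.

    `tsv_lines` is any iterable of text lines with a header row (assumed layout).
    `prefixes` must be a tuple, because str.startswith only accepts str or tuple.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    found = set()
    for row in reader:
        occurrence_id = (row.get(id_column) or "").strip()
        if occurrence_id.startswith(prefixes):
            found.add(occurrence_id)
    return sorted(found)


if __name__ == "__main__":
    # Tiny made-up sample: one matching ID (listed twice) and one non-CETAF ID.
    sample = io.StringIO(
        "gbifID\toccurrenceID\n"
        "1\thttps://herbarium.example.org/object/B100001\n"
        "2\turn:catalog:XYZ:42\n"
        "3\thttps://herbarium.example.org/object/B100001\n"
    )
    print(extract_cetaf_ids(sample))
```

Matching on the identifier prefix means that records whose occurrence ID is not a CETAF ID are dropped, which is why the second inclusion criterion above matters.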

For the institutions supporting Level 2 of the CETAF Identifier system (currently 6 partners), the RDF documents for the identifiers found in the ID Catalogue will be harvested and imported into a triple store (Apache Jena). This LOD cache will be made accessible through a SPARQL access point (still pending). When using this access point, please keep in mind that harvesting takes place at roughly the same intervals at which the ID Catalogue is rebuilt from the GBIF Index (every 2-3 months), so the cache can lag behind the data currently published by the institutions. Also, the number of specimens in the RDF cache is lower than the number of IDs in the ID Catalogue, since not all institutions provide RDF representations of their specimens.
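
As a rough illustration of the harvesting step, the sketch below requests an RDF representation for each identifier via HTTP content negotiation and skips identifiers that answer with anything else. It assumes that Level 2 resolvers return RDF/XML when asked for `application/rdf+xml`; the function names `build_rdf_request` and `harvest` are hypothetical, and loading the results into Apache Jena is left out.

```python
import urllib.request

# Assumption: RDF/XML is the serialization the harvester asks for.
RDF_MIME = "application/rdf+xml"


def build_rdf_request(cetaf_id):
    """Prepare a GET request that asks the resolver for RDF instead of HTML."""
    return urllib.request.Request(cetaf_id, headers={"Accept": RDF_MIME})


def harvest(cetaf_ids, fetch=urllib.request.urlopen, timeout=30):
    """Yield (cetaf_id, rdf_bytes) for every ID that answers with RDF.

    `fetch` defaults to urllib.request.urlopen and can be replaced for testing.
    IDs that fail to resolve or that return a non-RDF content type are skipped,
    which mirrors the fact that not every ID has an RDF representation.
    """
    for cetaf_id in cetaf_ids:
        try:
            with fetch(build_rdf_request(cetaf_id), timeout=timeout) as response:
                content_type = response.headers.get("Content-Type", "")
                if content_type.startswith(RDF_MIME):
                    yield cetaf_id, response.read()
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            continue
```

Because `harvest` is a generator, the IDs from the ID Catalogue can be streamed through it without holding all RDF documents in memory at once.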