Questions, problem solutions and further discussions (Guide of best practices)
If you use a colon (:) as the first character of a line, the text of that line is indented. If you add a comment or question, please mark it with your Wiki signature by typing 4 tildes ~~~~ (the Wiki replaces them on saving; advanced usage: 5 tildes ~~~~~ give the date only, 3 tildes ~~~ give the user name only). If you have questions, please feel free to add an entirely new section at the appropriate place or add a subsection to an existing section.
To be notified of changes on any particular page (via your user preferences e-mail options), use the star icon of the page to add it to your watchlist, or tick » [✓] Watch this page « below the Wiki editor.
What Institution has which Identifiers or Implementations?
See
CETAF Specimen Preview Profile (CSPP)
A set of standard data components for data exchange, see:
Splitting of collection specimens
This part is taken from Talk:Splitting of collection specimens (Guide best practices); see that talk page for possible further details:
Q1. What happens to the NSId when a physical specimen is split into parts?
[Alex Hardisty] The original DS and NSId are retained and updated to point to each of the new parts, with a relation (see below). Each new part gets its own DS/NSId. Each new DS is linked back to its parent.
- [Anton Güntsch (BGBM)] This is what we recommend as well. In addition, I think that the original specimen record needs to know its successors (and provide links to them). This might sound redundant but one cannot rely on the presence of the reverse relationship and a performant inference.
Q2. What happens to the NSId after the physical specimen ceases to exist?
[Alex Hardisty] The general approach is that, once created, a Digital Specimen and its corresponding NSId exist permanently. When the corresponding physical specimen ceases to exist (e.g., because it was destroyed, lost, etc.), the change in status should be recorded by inserting a new status information element into the Digital Specimen. Possible statuses are: extant, lost, destroyed, split. <any more?>
- [Anton Güntsch (BGBM)] Exactly. Digital records of specimens have to be kept forever in the CMS and get a meaningful status (‘unclear‘ might be an additional one). We need to agree a controlled terminology for this and we need to find an element representing this status. I believe that neither DwC nor ABCD has this already. Will check.
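The bookkeeping discussed in Q1 and Q2 could be sketched roughly as follows. This is only an illustration and not an agreed data model: the class name DigitalSpecimen, the field names and the example.org identifiers are hypothetical, and the status values are simply the ones mentioned in the discussion above, since no controlled terminology has been agreed yet.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Status values mentioned in the discussion; NOT an agreed vocabulary.
STATUSES = {"extant", "lost", "destroyed", "split", "unclear"}


@dataclass
class DigitalSpecimen:
    """Hypothetical record; it is never deleted, only its status changes."""
    nsid: str
    status: str = "extant"
    parent: Optional[str] = None                          # NSId of the split parent
    successors: List[str] = field(default_factory=list)   # NSIds of the parts

    def __post_init__(self) -> None:
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def split_specimen(original: DigitalSpecimen, part_ids: List[str]) -> List[DigitalSpecimen]:
    """Mark the original as split and create one new record per part."""
    original.status = "split"
    parts = []
    for part_id in part_ids:
        # Each part links back to its parent ...
        parts.append(DigitalSpecimen(nsid=part_id, parent=original.nsid))
        # ... and the parent explicitly knows its successors, so nobody
        # has to infer the reverse relationship.
        original.successors.append(part_id)
    return parts
```

Storing both directions is redundant on purpose: it follows the remark above that one cannot rely on the presence of the reverse relationship.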
Q3. How do I represent relationships between specimens (e.g., duplicates) in a standardized way?
[Alex Hardisty] What is the list of standard relations that must be supported? isDuplicateOf, isParatypeOf, hasHolotype, …
- [Anton Güntsch (BGBM)] Again, the terminology needs to be agreed/developed.
Common Technical Problems
Issues of particular institutes are listed separately; please refer to: Import issues with CETAF identifiers (user page)
Redirection and Issuing of RDF vs. Human Readable Page
If I [Andreas Plank] understand it correctly, on CETAF Level 2 (having machine-readable RDF metadata):
… the bare http://our-institution.org/any-specimen/123CETAF-ID does not necessarily need a URL redirect:
- it shall issue a web page to be read by humans (default)
- it shall issue RDF if requested via the HTTP header Accept: application/rdf+xml
… but if you have implemented a redirect of the original http://our-institution.org/any-specimen/123CETAF-ID to any other resource, let’s say http://our-institution.org/any-specimen/rdf/123CETAF-ID or http://our-institution.org/nice-webpage/specimen/bellis-perennis/some-yx456-ID, then implement the redirect with the HTTP response status code 303 “See Other” instead of the sometimes used code 302 “Found” (formerly “Moved Temporarily”), because it is hard to tell how a client will interpret a 302 response.
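The redirect variant described above might be sketched like this. It is a simplified illustration under stated assumptions, not a reference implementation: the function name respond and the HTML target path are hypothetical, the host and the RDF path are the example values from the text, and the Accept header is parsed very roughly (q-values are ignored).

```python
from typing import Dict, Tuple

BASE = "http://our-institution.org"  # example host used in the text


def respond(specimen_id: str, accept: str = "") -> Tuple[int, Dict[str, str]]:
    """Return (status code, headers) for a request to the bare CETAF-ID URI."""
    # Rough Accept parsing: keep only the media types, drop parameters.
    wanted = [part.split(";")[0].strip().lower() for part in accept.split(",")]
    if "application/rdf+xml" in wanted:
        target = f"{BASE}/any-specimen/rdf/{specimen_id}"
    else:
        # Default: the page for human readers (hypothetical path).
        target = f"{BASE}/nice-webpage/specimen/{specimen_id}"
    # 303 "See Other" instead of the ambiguous 302.
    return 303, {"Location": target, "Vary": "Accept"}
```

A client asking with Accept: application/rdf+xml would be sent to the RDF document, every other client to the human-readable page; the Vary header tells caches that the answer depends on the Accept header.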
Developing RDF or Proposal of Lightweight Data File Storage (TriG format)
(link to this section: #develop RDF via TriG format or use it as dump data storage)
RDF/XML is complicated to read and perhaps also to develop when nested data must be mapped. A more readable approach is to write the data in the TriG format[reference 1] and eventually convert it to RDF/XML or to any other needed data format. TriG is easy to read: it reads like a sentence whose data elements are separated by semicolons (;) and which ends with a dot (.); then comes the next “sentence”. The minimum example of the CETAF Specimen Preview Profile (CSPP) in TriG format looks as follows (the line numbers are added for reference only and are not part of the format):
1 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
2 @prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
3 @prefix dc: <http://purl.org/dc/terms/> .
4
5 <http://herbarium.bgbm.org/data/rdf/B100068798>
6 dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
7 dc:created "2019-11-11T15:41:25+01:00" .
8
9 <http://herbarium.bgbm.org/object/B100068798>
10 dc:title "Erysimum salangense Polatschek & Rech.f." ;
11 dc:created "1967-07-14" ;
12 dc:type "Specimen" ;
13 dc:publisher "BGBM" ;
14 dwc:scientificName "Erysimum salangense Polatschek & Rech.f." ;
15 dwc:previousIdentifications "Erysimum salangense Polatschek & Rech.f." ;
16 dwc:family "CRUCIFERAE" ;
17 dwc:countryCode "AF" ;
18 dwc:decimalLongitude "69.033332824707" ;
19 dwc:decimalLatitude "35.366664886475" ;
20 dwc:recordedBy "Rechinger,K.H." ;
21 dwc:fieldNumber "37047" ;
22 dwc:associatedMedia <http://ww2.bgbm.org/herbarium/images/B/10/00/68/79/B_10_0068798.jpg> .
One can see that line 6 declares a dc:subject which is then described from line 9 on; this becomes nested automatically later, when the conversion to RDF/XML is done. Lines 5 to 7 describe the RDF document itself, because it is delivered under a different URI than the one the client requested (the client requested the actual CETAF-ID URI (line 9) and got redirected to this document). Lines 9 and following describe the CETAF-ID URI, which stands for the actual herbarium specimen.
Going further and enriching the record with more data, the TriG format becomes more nested. Line numbers are added to illustrate the relations:
- line 8 describes a dc:subject which is further detailed from line 11 on;
- the description from line 11 on contains in line 36 a Wikibase entry (dwciri:recordedBy) that itself has details stated from line 48 on (and so forth, also with the IIIF triple for additional media) …
Note: dwciri:recordedBy has the same meaning as dwc:recordedBy, but as an RDF predicate dwciri:recordedBy is intended to be repeatable and to have an IRI-reference object.
1 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
2 @prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
3 @prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
4 @prefix dc: <http://purl.org/dc/terms/> .
5 @prefix owl: <http://www.w3.org/2002/07/owl#> .
6
7 <http://herbarium.bgbm.org/data/rdf/B100068798>
8 dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
9 dc:created "2019-11-11T15:41:25+01:00" .
10
11 <http://herbarium.bgbm.org/object/B100068798>
12 dc:title "Erysimum salangense Polatschek & Rech.f." ;
13 dc:description "A herbarium specimen of Erysimum salangense Polatschek & Rech.f. collected by Rechinger,K.H." ;
14 dc:creator "Rechinger, K.H." ;
15 dc:created "1967-07-14" ;
16 dc:type "Specimen" ;
17 dc:publisher "BGBM" ;
18 dwc:materialSampleID "http://herbarium.bgbm.org/object/B100068798" ;
19 dwc:basisOfRecord "PreservedSpecimen" ;
20 dwc:collectionCode "B" ;
21 dwc:catalogNumber "B 10 0068798" ;
22 dwc:scientificName "Erysimum salangense Polatschek & Rech.f." ;
23 dwc:previousIdentifications "Erysimum salangense Polatschek & Rech.f." ;
24 dwc:family "CRUCIFERAE" ;
25 dwc:genus "Erysimum" ;
26 dwc:specificEpithet "salangense" ;
27 dwc:country "Afghanistan" ;
28 dwc:countryCode "AF" ;
29 dwc:locality "\n Afghanistan: NE-Afghanistan, Kathagan. Sar-i Hauz, in declivibus borealibus jugi Salang. substr. granit. Alt.: 2600 m. 14.07.1967, Leg.: K. H. Rechinger 37047.\n " ;
30 dwc:decimalLongitude "69.033332824707" ;
31 dwc:decimalLatitude "35.366664886475" ;
32 dwc:eventDate "1967-07-14" ;
33 dwc:recordNumber "37047" ;
34 dwc:recordedBy "Rechinger,K.H." ;
35 dwc:fieldNumber "37047" ;
36 dwciri:recordedBy <http://www.wikidata.org/entity/Q78738> ;
37 dwc:associatedMedia <http://ww2.bgbm.org/herbarium/images/B/10/00/68/79/B_10_0068798.jpg> ;
38 dc:relation <http://herbarium.bgbm.org/iiif/B100068798> .
39
40 <http://herbarium.bgbm.org/iiif/B100068798>
41 dc:identifier <http://herbarium.bgbm.org/iiif/B100068798> ;
42 dc:type <http://iiif.io/api/presentation/3#Manifest> ;
43 dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
44 dc:format "application/ld+json" ;
45 dc:description "A IIIF resource for this specimen."@en ;
46 dc:created "" .
47
48 <http://www.wikidata.org/entity/Q78738>
49 owl:sameAs <http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/d5fea488-5786-4106-af90-396ef452c3aa> ;
50 owl:sameAs <https://viaf.org/viaf/100383596/> .
Of course you can convert between RDF/XML, TriG and N-Triples back and forth with command line tools; this is illustrated here with the CLI binaries of Apache Jena (https://jena.apache.org/):
#!/bin/bash
########## validate RDF or test conversion into TriG
rdfparse -t -s -R cetafid_123456.rdf
# parse RDF in test mode (-t), strict (-s most warnings are errors), assume RDF embedded XML document (-R)
ntriples --validate cetafid_123456.rdf > cetafid_123456.rdf.ttl.log # or
turtle --validate cetafid_123456.rdf > cetafid_123456.rdf.ttl.log
# validate conversion to triples: <subject> <predicate> <object>. errors to log file
########## convert RDF to TriG, n-triples (back and forth)
ntriples --quiet cetafid_123456.rdf > cetafid_123456.rdf.ttl # or
turtle --quiet cetafid_123456.rdf > cetafid_123456.rdf.ttl
# convert to triples format: <subject> <predicate> <object>. based on an RDF/XML document
# Note that --quiet does not suppress errors
turtle --output=trig cetafid_123456.rdf > cetafid_123456.rdf.trig # not formatted with property prefixes (streams data)
turtle --formatted=trig cetafid_123456.rdf > cetafid_123456.rdf.formatted.trig # formats data with property prefixes (needs more memory)
# convert to TriG format based on a RDF/XML document
turtle --output=trig --compress cetafid_123456.rdf > cetafid_123456.rdf.trig.gz # not formatted with property prefixes (streams data)
turtle --formatted=trig --compress cetafid_123456.rdf > cetafid_123456.rdf.formatted.trig.gz # formats data with property prefixes (needs more memory)
# convert to TriG format based on a RDF/XML document and compress it to gz
turtle --output=rdfxml cetafid_123456.rdf.trig > cetafid_123456.rdf.trig.rdf
# convert back to RDF/XML based on TriG format
Mistakes or Errors in RDF
When importing RDF into the SPARQL interface, some errors break the import process and must be fixed beforehand (see also the detailed import issues on User:Andreas Plank/Import issues with CETAF identifiers). Common mistakes or errors are:
- Proper XML Encoding: Make sure to follow the XML rules when encoding data into RDF, e.g. the ampersand & must be written as &amp;, and if data fields contain tag elements, the < or > must be encoded as &lt; or &gt;, and so forth (perhaps use https://www.w3.org/RDF/Validator/ in general, or a software or command line tool that can check it properly)
- RDF Data Elements Conforming to CSPP
  - (dc:kindOfMaterial) Reading the CSPP-elements documentation superficially, you might mistakenly think that a CSPP element is exactly the same as the data element in RDF. Please take care to distinguish the two: the CSPP elements are just for communication purposes and are not the data elements themselves ;-). For instance, the element kindOfMaterial shall be mapped into <dcterms:type></dcterms:type>, and the element collectorName shall be mapped into <dwc:recordedBy></dwc:recordedBy>, etc.; see the table of that documentation accordingly.
  - dc:relation nesting mistake: it is meant to be used only inside an rdf:Description:
    <rdf:Description rdf:about="..." ><!-- data --><dc:relation><!-- related rdf:Description nests here --></dc:relation><!-- data --></rdf:Description>
- URI Encoding: Encode URIs the right way, e.g. no bare spaces as in the following (technically wrongly encoded) example in the attribute rdf:about:
  <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/0U  0281519"><!-- … data omitted … --></rdf:Description>
  Each space must be properly percent-encoded as %20:
  <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/0U%20%200281519"><!-- … data omitted … --></rdf:Description>
  - for further reading see Section ‘Percent-Encoding’ (rfc3986#section-2.1) in Berners-Lee et al.[reference 2]
  - for further reading see the IRI specifications (‘3.2. IRIs’, #section-IRIs) in Klyne et al.[reference 3]
- Unicode / UTF-8 Problems: Sometimes odd characters cannot be read or encoded as UTF-8 characters, for example:
  - in http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 (the ? illustrates where the odd character is) in <dwc:municipality>Szirdokpisp?Ki</dwc:municipality>
    - rdfparse found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document
    - see Issues of Unicode of many MNHN-CETAF-IDs (listed on user page: Andreas Plank)
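The three kinds of mistakes listed above can be guarded against before the RDF is written. The following Python sketch uses only the standard library; the function names are hypothetical helpers for illustration, and the check for invalid characters covers only control characters and U+FFFE/U+FFFF (lone surrogates are not handled).

```python
import re
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters that XML 1.0 does not allow in element content: control
# characters other than tab, line feed and carriage return, plus
# U+FFFE and U+FFFF. Lone surrogates are not covered by this sketch.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def xml_text(value: str) -> str:
    """Escape &, < and > so the value can be used as XML element content."""
    return escape(value)


def uri_path_segment(value: str) -> str:
    """Percent-encode one path segment of a URI (a space becomes %20)."""
    return quote(value, safe="")


def find_invalid_xml_chars(value: str):
    """Return (position, code point) pairs of characters XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML_CHARS.finditer(value)]
```

For instance, xml_text could be applied to every literal value, uri_path_segment to the catalogue number before it is appended to the base URI, and find_invalid_xml_chars to locate characters such as the 0x19 reported by rdfparse above.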
Unicode Characters not in normal Form C
Querying Unicode string data may fail to return the expected characters, because one character can have several possible encodings (often the warning “String not in Unicode Normal Form C” is given, see https://en.wikipedia.org/wiki/Unicode_equivalence). Example: the Normal Form C string "Верховинський" is not identical to the decomposed string "Верховинський"; the two only appear the same when reading, but different code point sequences are used (illustrated using the JSON representation): '...\u0439' (й as a single character) vs. '...\u0438\u0306' (й composed of и plus a combining breve).
Best practice: If a single Unicode character is available, favour that single character over the composed character equivalent. There are technical helper functions to account for this, but the problem also pops up when using a simple search, and one wonders why nothing is found.
Using Apache Jena, one can circumvent this character encoding problem with the function fn:normalize-unicode("unicode string"); the following SPARQL query illustrates it:
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
# it means: s = subject; p = predicate; o = object
SELECT ?s ?p ?o
WHERE
{ ?s ?p ?o ;
dwc:locality ?locality
FILTER (
( ?s = <http://wu.jacq.org/object/WU0107989> )
&& contains(fn:normalize-unicode(?locality), "Верховинський")
)
}
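Outside of SPARQL, the same normalization can be tried out with Python's standard unicodedata module. This is only a small sketch to make the difference visible; the helper names same_text and is_nfc are hypothetical, and unicodedata.is_normalized requires Python 3.8 or newer.

```python
import unicodedata

composed = "\u0439"          # й as one single code point
decomposed = "\u0438\u0306"  # и followed by a combining breve


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to Normal Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def is_nfc(value: str) -> bool:
    """Tell whether a string is already in Normal Form C."""
    return unicodedata.is_normalized("NFC", value)
```

A plain comparison of composed and decomposed fails although both render as the same letter, whereas same_text treats them as equal; is_nfc can be used to find data that should be normalized before it is published.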
Missing Linkage of RDF Redirect Document to CETAF-ID
If you have set up a URL redirection to the CETAF-ID RDF, make sure that the describing RDF contains a semantic link from the redirected RDF document to the CETAF-ID via dcterms:subject[remark 1]. To illustrate what is meant, see the following minimal example, written first in TriG format[reference 1] and then in RDF/XML; the CETAF-ID is, so to say, a subject (dcterms:subject) of the redirect RDF document:
1 @prefix dcterms: <http://purl.org/dc/terms/> .
2 @prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
3 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
4
5 <https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234>
6 dcterms:subject <https://id.of.CETAF-ID.de/collection_XYZ/1234> ;
7 dcterms:title "rdf document for XYZ Collection Specimen XYZ-1234" .
8
9 <https://id.of.CETAF-ID.de/collection_XYZ/1234>
10 dcterms:title "Specimen XYZ-1234 (XYZ Collection)" ;
11 dcterms:created "2013-9-20" ;
12 … …
13 dwc:scientificName "Geophilus electricus (Linnaeus, 1758)" .
<rdf:RDF
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/" >
<rdf:Description rdf:about="https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234">
<dcterms:subject rdf:resource="https://id.of.CETAF-ID.de/collection_XYZ/1234"/>
<dcterms:title>rdf document for XYZ Collection Specimen XYZ-1234</dcterms:title>
</rdf:Description>
<rdf:Description rdf:about="https://id.of.CETAF-ID.de/collection_XYZ/1234">
<dcterms:title>Specimen XYZ-1234 (XYZ Collection)</dcterms:title>
<dcterms:created>2013-9-20</dcterms:created>
<!-- … -->
<dwc:scientificName>Geophilus electricus (Linnaeus, 1758)</dwc:scientificName>
</rdf:Description>
</rdf:RDF>
Note the following (TriG version):
- line 5 contains the URI of the RDF redirect document itself
- line 6 describes a (related) dcterms:subject which is further described from line 9 on; this is the actual CETAF-ID specimen
- from line 9 on the CETAF-ID specimen is described in detail
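Whether a delivered RDF/XML document contains this linkage can be checked with a few lines of Python. The sketch below uses only the standard library XML parser and therefore assumes the flat rdf:Description layout of the example above; the function name links_document_to_id is hypothetical, and a full RDF library would be needed for other serializations or typed nodes.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"


def links_document_to_id(rdf_xml: str, document_uri: str, cetaf_id: str) -> bool:
    """True if the RDF document states dcterms:subject <cetaf_id> about itself."""
    root = ET.fromstring(rdf_xml)
    for description in root.iter(f"{{{RDF}}}Description"):
        # Only look at the description of the redirect document itself.
        if description.get(f"{{{RDF}}}about") != document_uri:
            continue
        for subject in description.findall(f"{{{DCTERMS}}}subject"):
            if subject.get(f"{{{RDF}}}resource") == cetaf_id:
                return True
    return False


EXAMPLE = """<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234">
    <dcterms:subject rdf:resource="https://id.of.CETAF-ID.de/collection_XYZ/1234"/>
  </rdf:Description>
</rdf:RDF>"""
```

Called with the document URI and the CETAF-ID of the example, the function returns True; it returns False when the dcterms:subject statement is missing or points elsewhere.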
References
- [reference 1] Bizer, C. and Cyganiak, R. 2014. ‘RDF 1.1 TriG: RDF Dataset Language’. W3C Recommendation 25 February 2014. Edited by Gavin Carothers and Andy Seaborne. https://www.w3.org/TR/trig/
- [reference 2] Berners-Lee et al. 2005. ‘Uniform Resource Identifier (URI): Generic Syntax’. https://tools.ietf.org/html/rfc3986
- [reference 3] Klyne et al. 2014. ‘RDF 1.1 Concepts and Abstract Syntax’. W3C Recommendation. https://www.w3.org/TR/rdf11-concepts/
Remarks
- [remark 1] Often also written as dc:subject; checking the RDF’s prefix definitions, both prefixes should eventually resolve to <http://purl.org/dc/terms/>.