Difference between revisions of "User:Andreas Plank/Import issues with CETAF identifiers"

Revision as of 14:09, 10 October 2022


[Figure: Screenshot of the Firefox RESTED plugin (steps to retrieve an RDF data source)]

Note: Unresolved or pending issues are listed at the top; issues that are done are moved to the end. To check RDF in your browser or on the command line:

  1. in general, you can use https://www.w3.org/RDF/Validator/
    or the command line tools from Apache Jena (see Documentation), e.g. on Linux:
    /path/to/your/apache-jena-3.15.0/bin/rdfxml --validate "Testfile.rdf"
    # or with log file
    /path/to/your/apache-jena-3.15.0/bin/rdfparse -R "Testfile.rdf" > "Testfile.rdf.ttl" 2> "Testfile.rdf.log"
  2. more specifically, you can use the CETAF Specimen URI Tester (http://herbal.rbge.info)
  3. you can use a browser plugin to check the redirection to the source RDF, e.g. the RESTED Client with the added header Accept: application/rdf+xml (see the screenshot)


data.biodiversitydata.nl (Naturalis)

Some RDF files contain invalid URI entries, i.e. URIs that are not URL-encoded, e.g. <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/L  0934036`"> with bare spaces or accent characters. URIs with spaces are numerous (≈278.900); URIs with accent characters are few. Typical error messages look like this:

[line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200934036[`]...>
[line: …, col: 80] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200799429%20%20%20%20[`]...>
[line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200979378[`]...>
[line: …, col: 63] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L.4305564[`]...>

These URI entries have to be fixed, otherwise the records would not be imported; for the import, the issue is currently fixed manually by find and replace. --Andreas Plank (talk) 12:30, 8 July 2020 (CEST)

(Done) In any case, https://github.com/infinite-dao/glean-cetaf-rdfs/blob/main/bin/fixRDF_before_validateRDFs.sh checks for URL errors and tries to fix them --Andreas Plank (talk) 15:09, 10 October 2022 (CEST)
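
To illustrate this kind of fix, here is a minimal sketch that percent-encodes bare spaces and backticks inside rdf:about values. It is not taken from fixRDF_before_validateRDFs.sh: the function name encode_rdf_about is hypothetical, and whether a stray backtick should be encoded (%60) or simply removed is an assumption that has to be checked against the source data.

```shell
#!/bin/bash
# Hypothetical helper (not part of fixRDF_before_validateRDFs.sh):
# percent-encode bare spaces (%20) and backticks (%60) inside rdf:about="..." values.
# Each sed loop replaces one character per pass and repeats until none is left.
encode_rdf_about() {
  sed -E \
    -e ':space' -e 's/(rdf:about="[^" ]*) /\1%20/' -e 't space' \
    -e ':tick' -e 's/(rdf:about="[^"`]*)`/\1%60/' -e 't tick'
}

# Usage: encode_rdf_about < harvested.rdf > harvested-fixed.rdf
```

Text outside of rdf:about attributes is left untouched, so literal values containing spaces are not affected.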

coldb.mnhn.fr (MNHN)

Unicode/XML issues:

Notes for a general work around during harvest/import:
  • invalid Unicode characters break rdfparse and the subsequent import, so at this point the harvested RDF must first be fixed manually --Andreas Plank (talk) 12:33, 8 June 2020 (CEST)
  • characters that cannot be guessed properly are replaced by a question mark “?” at the position of the invalid Unicode character
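
The replacement by “?” described above could be sketched as follows. This is an assumption about how such a step might look, not the routine actually used during harvest; the function name replace_invalid_xml_chars is hypothetical. It replaces every control character that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return).

```shell
#!/bin/bash
# Hypothetical helper: replace characters that are not allowed in XML 1.0
# (control characters other than tab, line feed and carriage return) by "?".
# Octal ranges: \000-\010 (0x00-0x08), \013-\014 (0x0B-0x0C), \016-\037 (0x0E-0x1F).
# '[?*]' repeats "?" as often as needed to match the length of the first set.
replace_invalid_xml_chars() {
  LC_ALL=C tr '\000-\010\013\014\016-\037' '[?*]'
}

# Usage: replace_invalid_xml_chars < harvested.rdf > harvested-fixed.rdf
```

The original character cannot be recovered this way, so guesses such as M?ME => MÊME still need a manual decision.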

(Pending) Unicode/XML issues (most will be ignored, as these are not plant records to be used for the botany pilot):

  • http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 see ? <dwc:municipality>Szirdokpisp?Ki</dwc:municipality>
    rdfparse found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0058 see ? <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0062 see ? in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0061 see ? in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1986-0545 see ? <dwc:occurrenceRemarks>1986-545 A 547 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document
    M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0398 see ? <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document
    M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1259 see ? <dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1260</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document
    M?ME => MÊME (AP: probably »MÊME BOCAL QUE 1980-1260«)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0399 see ? in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
    M?ME => MÊME
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1260 see ? in <dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1259</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
    M?ME => MÊME
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0400 see ? in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
    M?ME => MÊME
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1995-0897 see ? <dwc:occurrenceRemarks>don du northern territory museum extrait du n°13530-003. Proc. Biol. Soc. Wash. v. 109 (no. 2). B?ocal a cote de la 230-0-0-1.</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 125] An invalid XML character (Unicode: 0x16) was found in the element content of the document
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/b-2510 see ? <dwc:occurrenceRemarks>PARALECTOTYPE DESIGNE PAR SPRINGER, 1962 IN COPEIA No 2? 2 : 4321 EX. EXTRAIT DE A.2024 / D.XII-23 , A.II-23 / VOIR SMITH. CONTR. TO ZOOL., No 73,</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 83] An invalid XML character (Unicode: 0x1b) was found in the element content of the document
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4687 see ? in <dwc:occurrenceRemarks>SYNTYPE?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 35] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4718 see ? in <dwc:occurrenceRemarks>SYNTYPES ?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254 / LS = 69 - 71 et 82 mm / LT = 78 - 80,5 et 92 mm</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 37] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4878 see ? in <dwc:locality>R?mire</dwc:locality>
    rdfparse found: [line: …, col: 20] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
    R?mire => R?mire (perhaps Rémire?)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4926 see ? in <dwc:locality>piste de St H?lie</dwc:locality>
    rdfparse found: [line: …, col: 32] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
    H?lie => H?lie

data.rbge.org.uk (RBGE)

(Pending) The XML of the RDF contains bare “&” characters, which are not properly escaped for XML (the proper escape is &amp;). This affects many RDF files, e.g. http://data.rbge.org.uk/herb/E00011206, and seems to be a generic problem --Andreas Plank (talk) 16:33, 16 March 2020 (CET)

  • it will be fixed during the harvesting routine, but the provided XML should be valid, including escaped ampersands (&amp;) --Andreas Plank (talk) 16:33, 16 March 2020 (CET)
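
A fix during harvesting could look like the following sketch. It is an illustration under stated assumptions, not the harvesting routine itself: the function name escape_bare_ampersands is hypothetical, and the placeholder @@ENT@@ is assumed never to occur in the data.

```shell
#!/bin/bash
# Hypothetical helper: escape bare "&" as "&amp;" and leave existing entities
# (&amp; &lt; &gt; &quot; &apos; and numeric references) untouched.
# Step 1 hides existing entities behind the placeholder @@ENT@@,
# step 2 escapes the remaining bare ampersands, step 3 restores the entities.
escape_bare_ampersands() {
  sed -E \
    -e 's/&(amp|lt|gt|quot|apos|#[0-9]+|#[xX][0-9A-Fa-f]+);/@@ENT@@\1;/g' \
    -e 's/&/\&amp;/g' \
    -e 's/@@ENT@@/\&/g'
}

# Usage: escape_bare_ampersands < harvested.rdf > harvested-fixed.rdf
```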

data.nhm.ac.uk (NHM)

(Pending; minor issue, does not block) A request with the header “Content-Type: application/rdf+xml” results in a 404 (Not Found) instead of RDF (see https://github.com/NaturalHistoryMuseum/ckanext-nhm/issues/458) --Andreas Plank (talk) 14:06, 18 February 2020 (CET)

  • minor issue, not relevant, because the header “Content-Type: application/rdf+xml” describes the (returned) resource, not the request; the request asks for RDF with the Accept header --Andreas Plank (talk) 10:40, 20 February 2020 (CET)

No or mixed up RDF description of CETAF-ID

For general guidance, see the example of the CETAF Specimen Preview Profile (CSPP).

id.luomus.fi (LUOMUS)

(Pending) The requested RDF does not describe the requested CETAF-ID http://id.luomus.fi/GL.749 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):

  1. http://id.luomus.fi/GL.749 gets redirected to http://id.luomus.fi/GL.749?format=RDFXML and
  2. analysing the RDF with Apache Jena’s rdfparse reveals that the ID appears only as the object of the relating triple <http://id.luomus.fi/GL.749?format=RDFXML> <http://purl.org/dc/terms/subject> <http://id.luomus.fi/GL.749>, but
  3. http://id.luomus.fi/GL.749 itself has no description of its own (rdf:Description); instead there are two descriptions, http://tun.fi/MY.275076 and http://tun.fi/MY.881682, which do not relate to http://id.luomus.fi/GL.749. So the CETAF-ID http://id.luomus.fi/GL.749 “hangs somewhat in the air” because it is not described.

--Andreas Plank (talk) 12:10, 20 February 2020 (CET)
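
Whether a CETAF-ID is described at all can be checked with a small sketch like the one below. It assumes that the RDF has already been converted to N-Triples (one triple per line, subject first), for example from the output of rdfparse; the function name count_triples_about and the file name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the triples in an N-Triples file whose subject is
# the given URI. A result of 0 means the URI is never described as a subject,
# i.e. it "hangs in the air".
#   $1 = URI without angle brackets, $2 = N-Triples file
count_triples_about() {
  awk -v subject="<$1>" '$1 == subject { n++ } END { print n + 0 }' "$2"
}

# Usage: count_triples_about 'http://id.luomus.fi/GL.749' 'GL.749.nt'
```

The same check applies to any other provider where the resolved RDF describes a different URI than the requested ID.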

id.zfmk.de (ZFMK)

(Pending) The requested RDF does not describe the requested CETAF-ID http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):

  1. http://id.zfmk.de/collection_ZFMK/1650/733377/90217 gets redirected to https://id.zfmk.de/collection_ZFMK/rdf/xml/CollectionSpecimen/1650/733377/90217/?shorturl=1 and
  2. analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, https://id.zfmk.de/collection_ZFMK/1650, which is unrelated to the ID
  3. http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself has no description of its own (rdf:Description) and “hangs somewhat in the air”
  4. the website states a stable URL, https://id.zfmk.de/collection_ZFMK/page/CollectionSpecimen/1650, but this URL does not return any RDF

--Andreas Plank (talk) 12:29, 20 February 2020 (CET)

specimens.kew.org (RBGK)

(Pending) The requested RDF is returned as HTML instead of RDF --Andreas Plank (talk) 14:32, 18 February 2020 (CET)

Fixing seems to be in progress, which is good, but some IDs from the GBIF API return a 404 page instead of a specimen (possibly old data records) --Andreas Plank (talk) 10:56, 16 July 2020 (CEST)

For instance under Linux:

wget --header='Accept: application/rdf+xml'  --content-on-error --output-document="specimens.kew.org⁄herbarium⁄1.000.rdf" "http://specimens.kew.org/herbarium/1.000"
head specimens.kew.org⁄herbarium⁄1.000.rdf 
# <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
# <html><head>
# <title>404 Not Found</title>
# </head><body>
# <h1>Not Found</h1>
# <p>The requested URL /herbarium/1.000 was not found on this server.</p>
# </body></html>
(Pending) The requested XML/RDF declares <?xml version="1.0" encoding="UTF-8"?> but is actually encoded as ISO-8859, which contradicts the declared encoding. This should be fixed so that all non-ASCII characters are mapped properly. --Andreas Plank (talk) 13:16, 11 August 2020 (CEST)

For instance under Linux:

#!/bin/bash
cetaf_uri='http://specimens.kew.org/herbarium/K000001999'
wget --quiet --output-file="${cetaf_uri##*/}.log" --header='Accept: application/rdf+xml' --content-on-error --output-document="${cetaf_uri##*/}.rdf" "${cetaf_uri}"
#   download quietly RDF into file 'K000001999.rdf'

/path/to/local/downloaded/apache-jena-3.14.0/bin/rdfxml --validate 'K000001999.rdf' 
# validate rdf via Apache-Jena command line tool
# K000001999.rdf :: 12:34:44 ERROR riot                 :: [line: 37, col: 67] Invalid byte 2 of 3-byte UTF-8 sequence.

file 'K000001999.rdf'
# show file generic properties
# K000001999.rdf: XML 1.0 document, ISO-8859 text, with very long lines

Comments (AP 2020-08-11 12:51:26):

  • K000001999.rdf contains ISO-8859 encoded characters, not UTF-8
  • a manual workaround would be: iconv -f ISO_8859-1 -t UTF-8 K000001999.rdf
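
The manual workaround can be wrapped so that files that are already valid UTF-8 are not converted twice. The following is a minimal sketch; the function name ensure_utf8 is hypothetical, and it assumes that any file that is not valid UTF-8 is ISO-8859-1, which is only known for the example above.

```shell
#!/bin/bash
# Hypothetical helper: write a UTF-8 copy of a harvested file.
# Assumption: a file that is not valid UTF-8 is ISO-8859-1.
#   $1 = input file, $2 = output file
ensure_utf8() {
  src="$1"
  dst="$2"
  if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
    cp "$src" "$dst"                              # already valid UTF-8, copy as is
  else
    iconv -f ISO-8859-1 -t UTF-8 "$src" > "$dst"  # re-encode to UTF-8
  fi
}

# Usage: ensure_utf8 'K000001999.rdf' 'K000001999-utf8.rdf'
```

The validity check matters because converting a file that is already UTF-8 from ISO-8859-1 would double-encode its non-ASCII characters.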

(Done) The requested RDF has a dc:relation nesting mistake: dc:relation is meant to appear only inside <rdf:Description rdf:about="..." ><!-- data --><dc:relation><!-- related rdf:Description nests here --></dc:relation><!-- data --></rdf:Description>.

Perhaps developing the RDF via the TriG format (see Questions, problem solutions and further discussions (Guide of best practices)) helps here? --Andreas Plank (talk) 13:30, 16 July 2020 (CEST)

The following example shows the actual RDF (first listing) and the output of the Linux diff command line tool (second listing) for the file retrieved with wget --header='Accept: application/rdf+xml' --content-on-error --output-document="K000001005.rdf" "http://specimens.kew.org/herbarium/K000001005"

The </rdf:Description> (of the CETAF-ID) in line 36 closes too early; it should close much later and enclose the <dc:relation> and all following elements:
35 <dwc:locationRemarks>in umbrosis.</dwc:locationRemarks>
36 </rdf:Description>
37 <!-- Image associated with the specimen -->
38 <dc:relation>
39 <rdf:Description rdf:about="http://www.kew.org/herbcatimg/588771.jpg">
40 <dc:identifier rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/>
41 <dc:type rdf:resource="http://purl.org/dc/dcmitype/Image"/>
42 <dc:subject rdf:resource="http://specimens.kew.org/herbarium/K000001005"/>
43 <dc:format>image/jpeg</dc:format>
44 <dc:description xml:lang="en">Image of herbarium specimen</dc:description>
45 <dc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
46 </rdf:Description>
47 </dc:relation>
48 <dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/>
49 </rdf:RDF>
Illustrated with diff (the hunk starts at line 33): the </rdf:Description> in line 36 moves to the very bottom, just before </rdf:RDF>:
--- K000001005.rdf	2020-07-16 10:25:35.236116113 +0200
+++ K000001005-fixed.rdf	2020-07-16 10:40:30.246263344 +0200
@@ -33,7 +33,6 @@
 <dwc:recordNumber>0</dwc:recordNumber>
 <dwc:country>Bahia</dwc:country>
 <dwc:locationRemarks>in umbrosis.</dwc:locationRemarks>
-</rdf:Description>
 <!-- Image associated with the specimen -->
 <dc:relation>
 <rdf:Description rdf:about="http://www.kew.org/herbcatimg/588771.jpg">
@@ -46,4 +45,5 @@
 </rdf:Description>
 </dc:relation>
 <dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/>
+</rdf:Description>
 </rdf:RDF>

Done --Andreas Plank (talk) 12:23, 10 August 2020 (CEST)

purl.org/nhmuio (NHMUO)

(Pending) The requested RDF does not describe the requested CETAF-ID http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):

  1. http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 gets redirected to https://data.gbif.no/resolver/O:L:14 and
  2. analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, http://purl.org/gbifnorway/id/O:L:14, which is unrelated to the ID
  3. http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself has no description of its own (rdf:Description) and “hangs somewhat in the air”

--Andreas Plank (talk) 13:30, 20 February 2020 (CET)


No RDF but HTML

col.smns-bw.org (SMNS)

(Pending) The requested RDF is returned as an HTML fragment instead of RDF. --Andreas Plank (talk) 14:38, 18 February 2020 (CET)

For instance under Linux:

wget --header='Accept: application/rdf+xml'  --content-on-error --output-document="col.smns-bw.org⁄object⁄S10000227722006.rdf" "http://col.smns-bw.org/object/S10000227722006"
file col.smns-bw.org⁄object⁄S10000227722006.rdf
# col.smns-bw.org⁄object⁄S10000227722006.rdf: HTML document, ISO-8859 text, with very long lines, with CRLF line terminators
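
During a larger harvest, downloads that are HTML instead of RDF can be sorted out automatically. The sketch below is a rough heuristic under stated assumptions: the function name looks_like_rdfxml is hypothetical, and it assumes that the provider uses the conventional rdf: namespace prefix for the root element.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) only if the downloaded file
# contains an <rdf:RDF root element; HTML error pages and fragments fail.
# Assumption: the conventional "rdf:" namespace prefix is used.
looks_like_rdfxml() {
  grep -q '<rdf:RDF' "$1"
}

# Usage:
# looks_like_rdfxml 'download.rdf' || echo 'download.rdf: no RDF/XML' >&2
```

This does not validate the RDF; it only separates obvious HTML responses before a validator such as rdfxml --validate is run.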

Fixed Issues

herbarium.bgbm.org (BGBM)

(Done) Some RDF files contain invalid URI entries, i.e. there is a tab/space character in the URI of owl:sameAs, which would break the whole data import. The error log of the triple store loader (tdbloader2) shows something like:

Bad URI: < http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2> Code: 0/ILLEGAL_CHARACTER in SCHEME: The character violates the grammar rules for URIs/IRIs. ERROR Bad character in IRI (space): <[space]...>

… see for instance line 63:

62 <rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619">
63                     <owl:sameAs rdf:resource="	http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" />
64                 <owl:sameAs rdf:resource="http://viaf.org/viaf/233473288" />
65           </rdf:Description>
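
A harvesting-side workaround for this kind of error could be sketched as follows. It is an illustration only, not the fix applied by the provider; the function name strip_leading_uri_whitespace is hypothetical, and it only handles whitespace at the start of rdf:resource and rdf:about values.

```shell
#!/bin/bash
# Hypothetical helper: remove tabs and spaces directly after the opening quote
# of rdf:resource="..." and rdf:about="..." values.
strip_leading_uri_whitespace() {
  sed -E 's/(rdf:(resource|about)=")[[:space:]]+/\1/g'
}

# Usage: strip_leading_uri_whitespace < harvested.rdf > harvested-fixed.rdf
```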

The following objects were detected:

snsb.info (SNSB)

(Done) Mistakes in naming RDF elements/properties: they are sometimes mixed up with CSPP element names, e.g. dwc:kindOfMaterial, where “kindOfMaterial” is only the CSPP element name, not the designated property term. I found the following mistakes (assuming the following prefixes):
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> and PREFIX dcterms: <http://purl.org/dc/terms/>
… for instance, the following elements are mistaken and do not resolve:

  • <dwc:kindOfMaterial> => <dcterms:type>
  • <dwc:collectionDate> => <dcterms:created>
  • <dwc:sourceLink> => <dcterms:publisher>

Perhaps there are more RDF elements to fix. --Andreas Plank (talk) 15:22, 8 October 2020 (CEST)

 Done --Andreas Plank (talk) 10:25, 14 October 2020 (CEST)
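
For harvested files that still contain the mistaken terms, the three replacements listed above could be applied with a sketch like this. The function name fix_snsb_terms is hypothetical, the mapping is limited to the three pairs named above, and the sketch assumes that the dcterms prefix is already declared in the RDF file.

```shell
#!/bin/bash
# Hypothetical helper: rename the three mistaken element names listed above in
# opening, closing and empty tags.
# Assumption: xmlns:dcterms is already declared in the RDF file.
fix_snsb_terms() {
  sed -E \
    -e 's#(</?)dwc:kindOfMaterial([[:space:]/>])#\1dcterms:type\2#g' \
    -e 's#(</?)dwc:collectionDate([[:space:]/>])#\1dcterms:created\2#g' \
    -e 's#(</?)dwc:sourceLink([[:space:]/>])#\1dcterms:publisher\2#g'
}

# Usage: fix_snsb_terms < harvested.rdf > harvested-fixed.rdf
```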