Difference between revisions of "User:Andreas Plank/Import issues with CETAF identifiers"

From CETAF Identifiers Wiki
Revision as of 11:39, 8 July 2020


Figure: Screenshot of the Firefox RESTED plugin (steps to retrieve an RDF data source)

Note: Unresolved or pending issues are listed first; issues that are done are moved to the end. To check RDF in your browser or on the command line:

  1. in general, you can use https://www.w3.org/RDF/Validator/
    or the command line tools from Apache Jena (see Documentation), e.g. on Linux:
    /path/to/your/apache-jena-3.15.0/bin/rdfxml --validate "Testfile.rdf"
    # or with log file
    /path/to/your/apache-jena-3.15.0/bin/rdfparse -R "Testfile.rdf" > "Testfile.rdf.ttl" 2> "Testfile.rdf.log"
  2. more specifically, you can use the CETAF Specimen URI Tester (http://herbal.rbge.info)
  3. you can use a browser plugin to check the redirection to the source RDF, e.g. the RESTED Client with the added header Accept: application/rdf+xml (see the screenshot)


data.biodiversitydata.nl (Naturalis)

(Pending) Some RDF files contain invalid URI entries, that is, URIs that are not URL-encoded, e.g. <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/L  0934036`"> with bare spaces or accent characters (such as the trailing backtick). There are many URIs with spaces (approximately 278,900) and a few with accent characters. They produce error messages like:

[line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200934036[`]...>
[line: …, col: 80] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200799429%20%20%20%20[`]...>
[line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200979378[`]...>
[line: …, col: 63] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L.4305564[`]...>

These URI entries have to be fixed, otherwise the data cannot be imported. For now, the issue is fixed manually by find and replace before the import. --Andreas Plank (talk) 12:30, 8 July 2020 (CEST)
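The find and replace step could be automated roughly as follows. This is only a sketch, not the routine actually used for the import: the function name `encode_uri_attributes` and the choice of characters kept as "safe" are assumptions, and it only handles double-quoted rdf:about and rdf:resource attributes. Whether a trailing backtick should be encoded or simply removed is a decision for the data provider.

```python
import re
from urllib.parse import quote

# Matches rdf:about="..." and rdf:resource="..." attribute values (double-quoted only).
URI_ATTR = re.compile(r'(rdf:(?:about|resource)=")([^"]*)(")')

# Characters left untouched: RFC 3986 reserved and unreserved punctuation,
# plus "%" so that already encoded sequences are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=-._~%"


def encode_uri_attributes(rdf_xml: str) -> str:
    """Percent-encode illegal characters (spaces, backticks, ...) in URI attributes."""
    def fix(match: re.Match) -> str:
        return match.group(1) + quote(match.group(2), safe=SAFE) + match.group(3)

    return URI_ATTR.sub(fix, rdf_xml)


line = '<rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/L  0934036`">'
print(encode_uri_attributes(line))
```

Because "%" is kept as a safe character, running the function twice on the same file does not change the result again.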

coldb.mnhn.fr (MNHN)

Unicode/XML issues:

Notes on a general workaround during harvest/import:
  • invalid Unicode characters break rdfparse and the subsequent import, so at this point the harvested RDF must first be fixed manually --Andreas Plank (talk) 12:33, 8 June 2020 (CEST)
  • characters that cannot be guessed properly are replaced by a question mark “?” at the position of the invalid Unicode character
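The replacement by a question mark can be sketched as below. It is a minimal example under the assumption that the harvested file has already been decoded to text; the function name `replace_invalid_xml_chars` is made up for illustration. The character class is the inverse of the allowed character ranges of XML 1.0.

```python
import re

# Everything that is NOT an allowed XML 1.0 character:
# allowed are tab, line feed, carriage return, U+0020-U+D7FF,
# U+E000-U+FFFD and U+10000-U+10FFFF.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid_xml_chars(text: str, placeholder: str = "?") -> str:
    """Replace each character that is invalid in XML 1.0 by a placeholder."""
    return INVALID_XML_CHARS.sub(placeholder, text)


# Control character 0x1a, as reported by rdfparse for one of the records below.
sample = "<dwc:locality>R\x1amire</dwc:locality>"
print(replace_invalid_xml_chars(sample))
```

The original character cannot be recovered this way, so the affected records still need a manual check against the source database.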

(Pending) Unicode/XML issues:

  • http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 see ? in <dwc:municipality>Szirdokpisp?Ki</dwc:municipality>
    rdfparse found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0058 see ? in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0062 see ? in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0061 see ? in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1986-0545 see ? in <dwc:occurrenceRemarks>1986-545 A 547 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document
    M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0398 see ? in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document
    M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1259 see ? in <dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1260</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document
    M?ME => MÊME (AP: probably »MÊME BOCAL QUE 1980-1260«)
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0399 see ? in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
    M?ME => MÊME
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1260 see ? in <dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1259</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
    M?ME => MÊME
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0400 see ? in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
    M?ME => MÊME
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/1995-0897 see ? in <dwc:occurrenceRemarks>don du northern territory museum extrait du n°13530-003. Proc. Biol. Soc. Wash. v. 109 (no. 2). B?ocal a cote de la 230-0-0-1.</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 125] An invalid XML character (Unicode: 0x16) was found in the element content of the document
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/b-2510 see ? in <dwc:occurrenceRemarks>PARALECTOTYPE DESIGNE PAR SPRINGER, 1962 IN COPEIA No 2? 2 : 4321 EX. EXTRAIT DE A.2024 / D.XII-23 , A.II-23 / VOIR SMITH. CONTR. TO ZOOL., No 73,</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 83] An invalid XML character (Unicode: 0x1b) was found in the element content of the document
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4687 see ? in <dwc:occurrenceRemarks>SYNTYPE?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 35] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4718 see ? in <dwc:occurrenceRemarks>SYNTYPES ?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254 / LS = 69 - 71 et 82 mm / LT = 78 - 80,5 et 92 mm</dwc:occurrenceRemarks>
    rdfparse found: [line: …, col: 37] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
  • http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4878 see ? in <dwc:locality>R?mire</dwc:locality>
    rdfparse found: [line: …, col: 20] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
    R?mire => R?mire (perhaps Rémire?)

data.rbge.org.uk (RBGE)

(Pending) The XML of the RDF contains bare “&” characters that are not properly escaped for XML (the proper escape is &amp;). This affects many RDF files, e.g. http://data.rbge.org.uk/herb/E00011206, and seems to be a generic problem --Andreas Plank (talk) 16:33, 16 March 2020 (CET)

  • this will be fixed during the harvesting routine, but the provided XML should be valid, including escaped ampersands (&amp;) --Andreas Plank (talk) 16:33, 16 March 2020 (CET)
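A fix during harvesting could look like the following sketch. It is not the actual harvesting code: the function name `escape_bare_ampersands` and the sample element are made up for illustration. The pattern only escapes an “&” that does not already start a named, decimal or hexadecimal entity reference, so correctly escaped content stays unchanged. It does not treat CDATA sections specially.

```python
import re

# An "&" that is not the start of an entity reference such as &amp; &#38; or &#x26;
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml_text: str) -> str:
    """Escape unescaped ampersands, leaving existing entity references alone."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


# Hypothetical element content, not taken from the RBGE data.
sample = "<dc:title>Hooker & Arnott &amp; Co</dc:title>"
print(escape_bare_ampersands(sample))
```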

data.nhm.ac.uk (NHM)

(Pending; minor issue, does not block) Requesting with “Content-Type: application/rdf+xml” results in 404 (not found) instead of returning RDF (see https://github.com/NaturalHistoryMuseum/ckanext-nhm/issues/458) --Andreas Plank (talk) 14:06, 18 February 2020 (CET)

  • minor issue, not relevant, because the header “Content-Type: application/rdf+xml” describes the (returned) resource, not the request --Andreas Plank (talk) 10:40, 20 February 2020 (CET)
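The difference between the two headers can be illustrated with a small sketch using Python's standard library. The URI is a placeholder and `build_rdf_request` is a hypothetical helper name; the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_rdf_request(uri: str) -> Request:
    # "Accept" tells the server which representation the client would like to get.
    # "Content-Type" describes a message body; a plain GET request has none,
    # so it is not set here.
    return Request(uri, headers={"Accept": "application/rdf+xml"}, method="GET")


# Placeholder URI, not a real specimen identifier.
request = build_rdf_request("http://example.org/specimen/1")
print(request.get_header("Accept"))
# urllib.request.urlopen(request) would send it; the Content-Type header of the
# response then states what the server actually returned.
```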

No or mixed up RDF description of CETAF-ID

For a general example, see the CETAF Specimen Preview Profile (CSPP).

id.luomus.fi (LUOMUS)

(Pending) The returned RDF does not describe the requested CETAF-ID http://id.luomus.fi/GL.749 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):

  1. http://id.luomus.fi/GL.749 gets redirected to http://id.luomus.fi/GL.749?format=RDFXML and
  2. analysing the RDF with Apache Jena’s rdfparse reveals that it contains <http://id.luomus.fi/GL.749?format=RDFXML> <http://purl.org/dc/terms/subject> <http://id.luomus.fi/GL.749>, which merely relates the two, but
  3. http://id.luomus.fi/GL.749 itself has no description (rdf:Description) of its own. There are two descriptions, http://tun.fi/MY.275076 and http://tun.fi/MY.881682, but neither relates to http://id.luomus.fi/GL.749. So the CETAF-ID http://id.luomus.fi/GL.749 “hangs somewhat in the air” because it is not described.

--Andreas Plank (talk) 12:10, 20 February 2020 (CET)
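A check of this kind can be sketched as follows, assuming the triples are available as N-Triples-style lines like the one quoted in step 2. The helper names are made up for illustration, and the second triple in the sample is hypothetical (its type IRI is a placeholder). The sketch only tests whether the requested ID occurs as a subject at all.

```python
import re

# Leading <subject> IRI of an N-Triples line (blank-node subjects are ignored).
SUBJECT = re.compile(r"^\s*<([^>]*)>")


def described_subjects(ntriples: str) -> set:
    """Collect all IRIs that occur in subject position."""
    subjects = set()
    for line in ntriples.splitlines():
        match = SUBJECT.match(line)
        if match:
            subjects.add(match.group(1))
    return subjects


def is_described(cetaf_id: str, ntriples: str) -> bool:
    """True if at least one triple has the CETAF-ID as its subject."""
    return cetaf_id in described_subjects(ntriples)


triples = """\
<http://id.luomus.fi/GL.749?format=RDFXML> <http://purl.org/dc/terms/subject> <http://id.luomus.fi/GL.749> .
<http://tun.fi/MY.275076> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Thing> .
"""
print(is_described("http://id.luomus.fi/GL.749", triples))
```

In this sample the ID only occurs in object position, so it counts as not described.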

id.zfmk.de (ZFMK)

(Pending) The returned RDF does not describe the requested CETAF-ID http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):

  1. http://id.zfmk.de/collection_ZFMK/1650/733377/90217 gets redirected to https://id.zfmk.de/collection_ZFMK/rdf/xml/CollectionSpecimen/1650/733377/90217/?shorturl=1 and
  2. analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, https://id.zfmk.de/collection_ZFMK/1650, which is unrelated to the ID
  3. http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself has no description (rdf:Description) of its own and “hangs somewhat in the air”
  4. the website states a stable URL, https://id.zfmk.de/collection_ZFMK/page/CollectionSpecimen/1650, but this URL does not return any RDF

--Andreas Plank (talk) 12:29, 20 February 2020 (CET)

purl.org/nhmuio (NHMUO)

(Pending) The returned RDF does not describe the requested CETAF-ID http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):

  1. http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 gets redirected to https://data.gbif.no/resolver/O:L:14 and
  2. analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, http://purl.org/gbifnorway/id/O:L:14, which is unrelated to the ID
  3. http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself has no description (rdf:Description) of its own and “hangs somewhat in the air”

--Andreas Plank (talk) 13:30, 20 February 2020 (CET)


No RDF but HTML

col.smns-bw.org (SMNS)

(Pending) The server returns an HTML fragment instead of the requested RDF. --Andreas Plank (talk) 14:38, 18 February 2020 (CET)

For instance under Linux:

wget --header='Accept: application/rdf+xml'  --content-on-error --output-document="col.smns-bw.org⁄object⁄S10000227722006.rdf" "http://col.smns-bw.org/object/S10000227722006"
file col.smns-bw.org⁄object⁄S10000227722006.rdf
# col.smns-bw.org⁄object⁄S10000227722006.rdf: HTML document, ISO-8859 text, with very long lines, with CRLF line terminators

specimens.kew.org (RBGK)

(Pending) The server returns HTML instead of the requested RDF --Andreas Plank (talk) 14:32, 18 February 2020 (CET)

For instance under Linux:

wget --header='Accept: application/rdf+xml'  --content-on-error --output-document="specimens.kew.org⁄herbarium⁄K001116483.rdf" "http://specimens.kew.org/herbarium/K001116483"
file specimens.kew.org⁄herbarium⁄K001116483.rdf 
# specimens.kew.org⁄herbarium⁄K001116483.rdf: HTML document, ASCII text, with very long lines, with CRLF, LF line terminators
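The same check as with the file command can be sketched in Python for a harvesting script. This is a rough heuristic only: the function name `sniff_payload` is made up, and it assumes that RDF/XML uses the conventional rdf: prefix for its root element.

```python
def sniff_payload(payload: bytes) -> str:
    """Guess from the first bytes of a response body whether it is RDF/XML or HTML.

    Returns 'rdfxml', 'html' or 'unknown'.
    """
    head = payload[:2048].decode("utf-8", errors="replace").lower()
    if "<rdf:rdf" in head:
        return "rdfxml"
    if "<!doctype html" in head or "<html" in head:
        return "html"
    return "unknown"


# Made-up minimal payloads for illustration.
print(sniff_payload(b'<?xml version="1.0"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'))
print(sniff_payload(b"<!DOCTYPE html>\r\n<html><body>Specimen</body></html>"))
```

An HTML fragment without an html element would come out as 'unknown', so any result other than 'rdfxml' deserves a closer look.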

Fixed Issues

herbarium.bgbm.org (BGBM)

(Done) Some RDF files contain invalid URI entries, i.e. there is a tab/space character in the URI in owl:sameAs, which would break the whole data import. The error log of the triple store loader (tdbloader2) shows something like:

Bad URI: < http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2> Code: 0/ILLEGAL_CHARACTER in SCHEME: The character violates the grammar rules for URIs/IRIs. ERROR Bad character in IRI (space): <[space]...>

… see for instance in line 63:

62 <rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619">
63                     <owl:sameAs rdf:resource="	http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" />
64                 <owl:sameAs rdf:resource="http://viaf.org/viaf/233473288" />
65           </rdf:Description>

The following objects were detected: