Difference between revisions of "User:Andreas Plank/Import issues with CETAF identifiers"
m (→coldb.mnhn.fr ({{abbr|MNHN}})) |
m (→coldb.mnhn.fr (MNHN): +unicode issues) |
Revision as of 14:17, 8 June 2020
Screenshot of the Firefox RESTED plugin (steps to retrieve an RDF data source)
Note: Unresolved or pending issues are listed at the top; resolved issues are moved to the end. To check for RDF in your browser, you can either (1) use the CETAF Specimen URI Tester (http://herbal.rbge.info) or (2) use a browser plugin, e.g. the RESTED client, and add the header Accept: application/rdf+xml (see the screenshot).
coldb.mnhn.fr (MNHN)
Unicode/XML issues:
- Notes for a general workaround during harvest/import:
  - Invalid Unicode characters break rdfparse and the subsequent import, so at this point the harvested RDF must first be fixed manually. --Andreas Plank (talk) 12:33, 8 June 2020 (CEST)
  - Characters that cannot be guessed properly are replaced by a question mark “?” at the position of the invalid Unicode character.
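The replacement rule from the notes above can be sketched in Python. This is a minimal illustration, not the harvest routine actually used: the function name `replace_invalid_xml_chars` is hypothetical, and the character class assumes the XML 1.0 definition of valid characters (control characters other than tab, line feed and carriage return are invalid).

```python
import re

# Characters that XML 1.0 does not allow in a document: the C0 control
# characters except tab (0x9), line feed (0xA) and carriage return (0xD),
# plus the non-characters U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def replace_invalid_xml_chars(text: str, placeholder: str = "?") -> str:
    """Replace every invalid XML 1.0 character with a placeholder (hypothetical helper)."""
    return INVALID_XML_CHARS.sub(placeholder, text)


# 0x14 is the character that rdfparse reports for the "M?ME" records.
sample = "<dwc:occurrenceRemarks>DANS LE M\x14ME BOCAL</dwc:occurrenceRemarks>"
print(replace_invalid_xml_chars(sample))
```

Guessing the intended character (e.g. M?ME => MÊME) still needs a manual step; the sketch only makes the file parseable.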
Pending unicode/XML issues:
- http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 see “?” in <dwc:municipality>Szirdokpisp?Ki</dwc:municipality>
  rdfparse found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0058 see “?” in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1986-0545 see “?” in <dwc:occurrenceRemarks>1986-545 A 547 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document
  M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0398 see “?” in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document
  M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1259 see “?” in <dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1260</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document
  M?ME => MÊME (AP: probably »MÊME BOCAL QUE 1980-1260«)
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0399 see “?” in <dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
  M?ME => MÊME
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1260 see “?” in <dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1259</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
  M?ME => MÊME
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1995-0897 see “?” in <dwc:occurrenceRemarks>don du northern territory museum extrait du n°13530-003. Proc. Biol. Soc. Wash. v. 109 (no. 2). B?ocal a cote de la 230-0-0-1.</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 125] An invalid XML character (Unicode: 0x16) was found in the element content of the document
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/b-2510 see “?” in <dwc:occurrenceRemarks>PARALECTOTYPE DESIGNE PAR SPRINGER, 1962 IN COPEIA No 2? 2 : 4321 EX. EXTRAIT DE A.2024 / D.XII-23 , A.II-23 / VOIR SMITH. CONTR. TO ZOOL., No 73,</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 83] An invalid XML character (Unicode: 0x1b) was found in the element content of the document
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0061 see “?” in <dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4687 see “?” in <dwc:occurrenceRemarks>SYNTYPE?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254</dwc:occurrenceRemarks>
  rdfparse found: [line: …, col: 35] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
data.rbge.org.uk (RBGE)
(Pending) The RDF/XML contains a bare “&” that is not properly escaped for XML (the proper escape is &amp;). This affects many RDF files, e.g. http://data.rbge.org.uk/herb/E00011206, and seems to be a generic problem. --Andreas Plank (talk) 16:33, 16 March 2020 (CET)
- It will be fixed during the harvesting routine, but the provided XML should be valid, including an escaped ampersand (&amp;). --Andreas Plank (talk) 16:33, 16 March 2020 (CET)
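A fix during harvesting could look roughly like the following Python sketch. It is an assumption about how such a repair might be done, not the routine actually in use; the helper name `escape_bare_ampersands` is hypothetical. The regular expression leaves anything that already looks like an entity or character reference untouched, so text such as “&T;” would be mistaken for an entity and left alone.

```python
import re

# An ampersand that does NOT start a named entity (&name;), a decimal
# character reference (&#38;) or a hexadecimal one (&#x26;).
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml_text: str) -> str:
    """Escape unescaped ampersands so the text becomes well-formed XML (hypothetical helper)."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


print(escape_bare_ampersands("<dc:title>Hooker & Arnott</dc:title>"))
```

Because already escaped ampersands are skipped, running the function twice gives the same result as running it once.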
data.nhm.ac.uk (NHM)
(Pending; minor issue, does not block) A request sent with the header “Content-Type: application/rdf+xml” results in 404 (Not Found) instead of RDF (see https://github.com/NaturalHistoryMuseum/ckanext-nhm/issues/458). --Andreas Plank (talk) 14:06, 18 February 2020 (CET)
- Minor issue, not relevant, because the header “Content-Type: application/rdf+xml” describes the (returned) resource, not the request. --Andreas Plank (talk) 10:40, 20 February 2020 (CET)
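To make the distinction concrete: a client asks for RDF with the Accept request header, while Content-Type labels a body that is actually being sent. The following Python sketch only builds the request object and does not contact any server; the URL is a placeholder, not a real NHM identifier, and the function name is hypothetical.

```python
from urllib.request import Request


def build_rdf_request(url: str) -> Request:
    """Build a GET request that asks for RDF/XML via content negotiation (hypothetical helper)."""
    # "Accept" states which representation the client would like to receive.
    # No "Content-Type" is set, because a GET request carries no body to describe.
    return Request(url, headers={"Accept": "application/rdf+xml"})


# Placeholder URL for illustration only.
request = build_rdf_request("http://example.org/object/placeholder-id")
print(request.get_header("Accept"))
```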
No or mixed up RDF description of CETAF-ID
For a general example, see the CETAF Specimen Preview Profile (CSPP).
id.luomus.fi (LUOMUS)
(Pending) The requested RDF does not describe the requested CETAF-ID http://id.luomus.fi/GL.749 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):
- http://id.luomus.fi/GL.749 gets redirected to http://id.luomus.fi/GL.749?format=RDFXML, and
- analysing the RDF with Apache Jena’s rdfparse reveals that it states <http://id.luomus.fi/GL.749?format=RDFXML> <http://purl.org/dc/terms/subject> <http://id.luomus.fi/GL.749>, which merely relates the two. http://id.luomus.fi/GL.749 itself has no description (rdf:Description); instead there are two descriptions, http://tun.fi/MY.275076 and http://tun.fi/MY.881682, which do not relate to http://id.luomus.fi/GL.749. So the CETAF-ID http://id.luomus.fi/GL.749 “hangs somewhat in the air” because it is not described. --Andreas Plank (talk) 12:10, 20 February 2020 (CET)
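The check described above (is the requested CETAF-ID itself the subject of a description?) can be approximated in Python without Apache Jena. This is a rough sketch under stated assumptions: it only looks at rdf:about attributes in RDF/XML, does not resolve relative URIs or xml:base, and the sample document is a simplified illustration built from the URIs mentioned above, not the actual LUOMUS response. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_ABOUT = f"{{{RDF_NS}}}about"


def described_subjects(rdf_xml: str) -> set:
    """Collect every URI that appears as rdf:about, i.e. every described subject."""
    root = ET.fromstring(rdf_xml)
    return {el.attrib[RDF_ABOUT] for el in root.iter() if RDF_ABOUT in el.attrib}


def is_described(rdf_xml: str, cetaf_id: str) -> bool:
    """True if the CETAF-ID itself is the subject of some description."""
    return cetaf_id in described_subjects(rdf_xml)


# Simplified, illustrative sample (not the real server response).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://id.luomus.fi/GL.749?format=RDFXML">
    <dcterms:subject rdf:resource="http://id.luomus.fi/GL.749"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://tun.fi/MY.275076"/>
</rdf:RDF>"""

print(is_described(SAMPLE, "http://id.luomus.fi/GL.749"))
```

In the sample the ID only occurs as an object (rdf:resource), never as a subject, which is the situation reported here. A full RDF parser remains the more reliable tool for real data.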
id.zfmk.de (ZFMK)
(Pending) The requested RDF does not describe the requested CETAF-ID http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):
- http://id.zfmk.de/collection_ZFMK/1650/733377/90217 gets redirected to https://id.zfmk.de/collection_ZFMK/rdf/xml/CollectionSpecimen/1650/733377/90217/?shorturl=1, and
- analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, https://id.zfmk.de/collection_ZFMK/1650, which is unrelated to the ID; http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself has no description (rdf:Description) and “hangs somewhat in the air”
  - the website states a stable URL, https://id.zfmk.de/collection_ZFMK/page/CollectionSpecimen/1650, but this very URL does not return any RDF
--Andreas Plank (talk) 12:29, 20 February 2020 (CET)
purl.org/nhmuio (NHMUO)
(Pending) The requested RDF does not describe the requested CETAF-ID http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):
- http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 gets redirected to https://data.gbif.no/resolver/O:L:14, and
- analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, http://purl.org/gbifnorway/id/O:L:14, which is unrelated to the ID; http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself has no description (rdf:Description) and “hangs somewhat in the air” --Andreas Plank (talk) 13:30, 20 February 2020 (CET)
No RDF but HTML
col.smns-bw.org (SMNS)
(Pending) The response to an RDF request is an HTML fragment instead of RDF. --Andreas Plank (talk) 14:38, 18 February 2020 (CET)
For instance under Linux:
wget --header='Accept: application/rdf+xml' --content-on-error --output-document="col.smns-bw.org⁄object⁄S10000227722006.rdf" "http://col.smns-bw.org/object/S10000227722006"
file col.smns-bw.org⁄object⁄S10000227722006.rdf
# col.smns-bw.org⁄object⁄S10000227722006.rdf: HTML document, ISO-8859 text, with very long lines, with CRLF line terminators
specimens.kew.org (RBGK)
(Pending) The response to an RDF request is HTML instead of RDF. --Andreas Plank (talk) 14:32, 18 February 2020 (CET)
For instance under Linux:
wget --header='Accept: application/rdf+xml' --content-on-error --output-document="specimens.kew.org⁄herbarium⁄K001116483.rdf" "http://specimens.kew.org/herbarium/K001116483"
file specimens.kew.org⁄herbarium⁄K001116483.rdf
# specimens.kew.org⁄herbarium⁄K001116483.rdf: HTML document, ASCII text, with very long lines, with CRLF, LF line terminators
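As an alternative to inspecting each download with file, a harvested response can be checked in Python for an rdf:RDF root element. This is a sketch with a hypothetical function name. It assumes the RDF/XML uses the common rdf:RDF wrapper (which the RDF/XML syntax allows to be omitted), so a negative result is a hint to look closer, not proof.

```python
import xml.etree.ElementTree as ET

RDF_ROOT_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"


def looks_like_rdf_xml(body: bytes) -> bool:
    """Return True if the body is well-formed XML with an rdf:RDF root element."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Typical for HTML, which is often not well-formed XML.
        return False
    return root.tag == RDF_ROOT_TAG


html_body = b"<html><body><p>Specimen page<br></p></body></html>"
rdf_body = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
print(looks_like_rdf_xml(html_body), looks_like_rdf_xml(rdf_body))
```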
Fixed Issues
herbarium.bgbm.org (BGBM)
(Done) Some RDF files contain invalid URI entries, i.e. there is a tab/space character in the URI in owl:sameAs, which would break the whole data import. The error log of the triple store loader (tdbloader2) shows something like:
Bad URI: < http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2> Code: 0/ILLEGAL_CHARACTER in SCHEME: The character violates the grammar rules for URIs/IRIs.
ERROR Bad character in IRI (space): <[space]...>
… see for instance line 63:
62 <rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619">
63   <owl:sameAs rdf:resource=" http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" />
64   <owl:sameAs rdf:resource="http://viaf.org/viaf/233473288" />
65 </rdf:Description>
The following objects were detected:
- http://herbarium.bgbm.org/data/rdf/B100000580 --Andreas Plank (talk) 16:21, 30 January 2020 (CET) Done --Andreas Plank (talk) 11:45, 3 February 2020 (CET)
- http://herbarium.bgbm.org/data/rdf/B100000503 --Andreas Plank (talk) 16:21, 30 January 2020 (CET) Done --Andreas Plank (talk) 11:45, 3 February 2020 (CET)
- http://herbarium.bgbm.org/data/rdf/B100000627 --Andreas Plank (talk) 16:21, 30 January 2020 (CET) Done --Andreas Plank (talk) 11:45, 3 February 2020 (CET)
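Affected files of this kind can be found before loading them into the triple store with a small scan. The Python sketch below is a simple line-based check with a hypothetical function name. It assumes the rdf prefix and double-quoted attribute values, as in the excerpt above, and is no replacement for the loader's own validation.

```python
import re

# rdf:resource="..." or rdf:about="..." with the value captured.
URI_ATTRIBUTE = re.compile(r'rdf:(?:resource|about)="([^"]*)"')


def find_uris_with_whitespace(rdf_xml: str) -> list:
    """Return (line number, URI) pairs for URIs containing a space, tab or other whitespace."""
    hits = []
    for line_number, line in enumerate(rdf_xml.splitlines(), start=1):
        for uri in URI_ATTRIBUTE.findall(line):
            if re.search(r"\s", uri):
                hits.append((line_number, uri))
    return hits


excerpt = """<rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619">
  <owl:sameAs rdf:resource=" http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" />
  <owl:sameAs rdf:resource="http://viaf.org/viaf/233473288" />
</rdf:Description>"""

for line_number, uri in find_uris_with_whitespace(excerpt):
    print(line_number, repr(uri))
```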