Difference between revisions of "User:Andreas Plank/Import issues with CETAF identifiers"
m (wget commands) |
m (→col.smns-bw.org (SMNS)) |
||
(43 intermediate revisions by the same user not shown) | |||
Line 4: | Line 4: | ||
| style="width:180px;vertical-align:top;" | Screenshot of the Firefox RESTED plugin (steps to retrieve an RDF data source) | | style="width:180px;vertical-align:top;" | Screenshot of the Firefox RESTED plugin (steps to retrieve an RDF data source) | ||
|} | |} | ||
− | '''Note:''' Unresolved or pending issues are on top and issues that are done get to the end. To check for RDF in your browser you can ( | + | '''Note:''' Unresolved or pending issues are on top and issues that are done get to the end. To check for RDF in your browser or on command line: |
+ | # you can use https://www.w3.org/RDF/Validator/ in general<br /> or use command line tools from [https://jena.apache.org/documentation/io/ Apache Jena (see Documentation)], e.g. on Linux: <div style="font-size:smaller;font-family:mono;">/path/to/your/apache-jena-3.15.0/bin/rdfxml --validate "Testfile.rdf" <br /># or with log file<br />/path/to/your/apache-jena-3.15.0/bin/rdfparse -R "Testfile.rdf" > "Testfile.rdf.ttl" 2> "Testfile.rdf.log"</div> | ||
+ | # you can more specifically use the [http://herbal.rbge.info CETAF Specimen URI Tester (http://herbal.rbge.info)] | ||
+ | # you can use a plugin in your browser, to basically evaluate redirection to the source RDF, e.g. [https://addons.mozilla.org/de/firefox/addon/rested/ RESTED Client] and then adding Header <code>Accept: application/rdf+xml</code> (see example aside) | ||
---- | ---- | ||
__TOC__ | __TOC__ | ||
+ | |||
+ | == coldb.mnhn.fr ({{abbr|MNHN}}) == | ||
+ | |||
+ | Unicode/XML issues{{anchor|issue Unicode and UTF-8 (MNHN)}}: | ||
+ | : Notes on a general workaround during harvest/import: | ||
+ | :* invalid Unicode characters break <code>rdfparse</code> and the subsequent import, so for now the harvested RDF must first be fixed manually --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 12:33, 8 June 2020 (CEST) | ||
+ | :* characters that cannot be guessed properly are replaced by a question mark “?” at the position of the invalid Unicode character | ||
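The question-mark replacement described above can be sketched on the command line. This is only a minimal sketch, not the harvesting routine itself: the function name <code>replace_invalid_xml_chars</code> is hypothetical, and it assumes that only the control characters forbidden in XML 1.0 (0x00–0x08, 0x0B, 0x0C, 0x0E–0x1F) need to be replaced.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read RDF/XML on stdin, write it to stdout with every
# control character that XML 1.0 forbids replaced by "?".
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are valid and kept.
replace_invalid_xml_chars() {
  # Octal ranges: 000-010 = 0x00-0x08, 013 = 0x0B, 014 = 0x0C, 016-037 = 0x0E-0x1F
  # '[?*]' repeats "?" as often as needed to match the length of the first set.
  LC_ALL=C tr '\000-\010\013\014\016-\037' '[?*]'
}

# Possible usage (file names are placeholders):
#   replace_invalid_xml_chars < harvested.rdf > harvested.fixed.rdf
```

The original character is lost this way, so a record such as »DANS LE M?ME BOCAL« still needs manual correction if the intended letter matters.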
+ | |||
+ | {{Tobedone|Pending}} Unicode/XML issues (most will be ignored, as these are not plant records to be used for the botany pilot): | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 see ? <syntaxhighlight lang="xml" inline><dwc:municipality>Szirdokpisp?Ki</dwc:municipality></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0058 see ? <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document. | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0062 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document. | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0061 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document. | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1986-0545 see ? <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>1986-545 A 547 DANS LE M?ME BOCAL</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document<!-- | ||
+ | --><br />M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«) | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0398 see ? <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document<!-- | ||
+ | --><br />M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«) | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1259 see ? <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1260</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document<!-- | ||
+ | --><br />M?ME => MÊME (AP: probably »MÊME BOCAL QUE 1980-1260«) | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0399 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.<!-- | ||
+ | --><br />M?ME => MÊME | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1260 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1259</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document.<!-- | ||
+ | --><br />M?ME => MÊME | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0400 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>1963-398 A 400 DANS LE MME BOCAL</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.<!-- | ||
+ | --><br />M?ME => MÊME | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/1995-0897 see ? <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>don du northern territory museum extrait du n°13530-003. Proc. Biol. Soc. Wash. v. 109 (no. 2). B?ocal a cote de la 230-0-0-1.</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 125] An invalid XML character (Unicode: 0x16) was found in the element content of the document | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/b-2510 see ? <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>PARALECTOTYPE DESIGNE PAR SPRINGER, 1962 IN COPEIA No 2? 2 : 4321 EX. EXTRAIT DE A.2024 / D.XII-23 , A.II-23 / VOIR SMITH. CONTR. TO ZOOL., No 73,</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 83] An invalid XML character (Unicode: 0x1b) was found in the element content of the document | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4687 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>SYNTYPE?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 35] An invalid XML character (Unicode: 0x13) was found in the element content of the document. | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4718 see ? in <syntaxhighlight lang="xml" inline><dwc:occurrenceRemarks>SYNTYPES ?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254 / LS = 69 - 71 et 82 mm / LT = 78 - 80,5 et 92 mm</dwc:occurrenceRemarks></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 37] An invalid XML character (Unicode: 0x13) was found in the element content of the document. | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4878 see ? in <syntaxhighlight lang="xml" inline><dwc:locality>R?mire</dwc:locality></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 20] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.<!-- | ||
+ | --><br />R?mire => R?mire (perhaps Rémire?) | ||
+ | * http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4926 see ? in <syntaxhighlight lang="xml" inline><dwc:locality>piste de St H?lie</dwc:locality></syntaxhighlight><!-- | ||
+ | --><br /><code>rdfparse</code> found: [line: …, col: 32] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.<!-- | ||
+ | --><br />H?lie => H?lie | ||
+ | |||
== data.nhm.ac.uk ({{abbr|NHM}}) == | == data.nhm.ac.uk ({{abbr|NHM}}) == | ||
− | ({{ | + | ({{Done|Pending (minor issue does not block)}}) Requesting “<code>Content-Type: application/rdf+xml</code>” results in 404 (not found) instead of getting RDF (see https://github.com/NaturalHistoryMuseum/ckanext-nhm/issues/458) --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 14:06, 18 February 2020 (CET) |
+ | <blockquote> | ||
+ | * minor issue, not relevant: the header “<code>Content-Type: application/rdf+xml</code>” describes the (returned) resource, not the request; a request should use “<code>Accept: application/rdf+xml</code>” instead --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 10:40, 20 February 2020 (CET) | ||
+ | </blockquote> | ||
− | == | + | == No or mixed up RDF description of {{abbr|CETAF-ID}} == |
− | ( | + | See perhaps the [[CETAF Specimen Preview Profile (CSPP)#example_CSPP-compliant_RDF|example of CETAF Specimen Preview Profile (CSPP)]] in general. |
+ | |||
+ | === id.zfmk.de ({{abbr|ZFMK}}) === | ||
+ | |||
+ | ({{Tobedone}}) The requested RDF does not describe the requested {{abbr|CETAF-ID}} <code><nowiki>http://id.zfmk.de/collection_ZFMK/1650/733377/90217</nowiki></code> itself, the ID “hangs somewhat in the air” (from a descriptive point of view): | ||
<blockquote> | <blockquote> | ||
− | + | # http://id.zfmk.de/collection_ZFMK/1650/733377/90217 gets redirected to https://id.zfmk.de/collection_ZFMK/rdf/xml/CollectionSpecimen/1650/733377/90217/?shorturl=1 and | |
− | + | # analysing the RDF with Apache Jena’s <code>rdfparse</code> reveals that it describes a different resource, <code><nowiki>https://id.zfmk.de/collection_ZFMK/1650</nowiki></code>, which is unrelated to the requested ID | ||
− | + | # <code><nowiki>http://id.zfmk.de/collection_ZFMK/1650/733377/90217</nowiki></code> itself has no related description (<code>rdf:Description</code>) and “hangs somewhat in the air” | |
− | + | # the website states a stable URL https://id.zfmk.de/collection_ZFMK/page/CollectionSpecimen/1650 but this URL does not return any RDF | ||
− | # | + | --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 12:29, 20 February 2020 (CET) |
− | </ | + | |
</blockquote> | </blockquote> | ||
− | |||
− | ({{Tobedone}}) Requested RDF is instead an HTML fragment but RDF.--[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 14:38, 18 February 2020 (CET) | + | |
+ | === purl.org/nhmuio ({{abbr|NHMUO}}) === | ||
+ | |||
+ | ({{Tobedone}}) The requested RDF does not describe the requested {{abbr|CETAF-ID}} <code><nowiki>http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3</nowiki></code> itself, the ID “hangs somewhat in the air” (from a descriptive point of view): | ||
+ | <blockquote> | ||
+ | # http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 gets redirected to https://data.gbif.no/resolver/O:L:14 and | ||
+ | # analysing the RDF with Apache Jena’s <code>rdfparse</code> reveals that it describes a different resource, <code><nowiki>http://purl.org/gbifnorway/id/O:L:14</nowiki></code>, which is unrelated to the requested ID | ||
+ | # <code><nowiki>http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3</nowiki></code> itself has no related description (<code>rdf:Description</code>) and “hangs somewhat in the air” | ||
+ | --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 13:30, 20 February 2020 (CET) | ||
+ | </blockquote> | ||
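Whether a requested {{abbr|CETAF-ID}} is described at all can be checked mechanically once the RDF is available as N-Triples. The following is a minimal sketch under assumptions: the helper name <code>has_description</code> is hypothetical, and the input file is assumed to be N-Triples with one triple per line (for instance converted beforehand with a local Apache Jena installation).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) if the N-Triples file "$1"
# contains at least one triple whose subject is the URI "$2".
# A URI that only occurs as an object is not counted as described.
has_description() {
  awk -v subject="<$2>" '$1 == subject { found = 1 } END { exit !found }' "$1"
}

# Possible usage (file name is a placeholder):
#   has_description specimen.nt 'http://id.zfmk.de/collection_ZFMK/1650/733377/90217' \
#     || echo 'requested ID has no description'
```

The check is purely textual: it compares the first field of each line with the requested URI and does not follow <code>owl:sameAs</code> or similar links.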
+ | |||
+ | == No RDF but HTML == | ||
+ | … | ||
+ | |||
+ | == Fixed Issues == | ||
+ | |||
+ | === col.smns-bw.org ({{abbr|SMNS}}) === | ||
+ | |||
+ | The requested RDF is returned as an HTML fragment instead of RDF. --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 14:38, 18 February 2020 (CET) | ||
+ | : ({{Done}}) Seems fixed --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 10:27, 12 October 2022 (CEST) | ||
<blockquote> | <blockquote> | ||
For instance under Linux: | For instance under Linux: | ||
− | <syntaxhighlight lang="bash"> | + | <syntaxhighlight lang="bash" style="font-size:smaller;"> |
wget --header='Accept: application/rdf+xml' --content-on-error --output-document="col.smns-bw.org⁄object⁄S10000227722006.rdf" "http://col.smns-bw.org/object/S10000227722006" | wget --header='Accept: application/rdf+xml' --content-on-error --output-document="col.smns-bw.org⁄object⁄S10000227722006.rdf" "http://col.smns-bw.org/object/S10000227722006" | ||
file col.smns-bw.org⁄object⁄S10000227722006.rdf | file col.smns-bw.org⁄object⁄S10000227722006.rdf | ||
Line 40: | Line 117: | ||
</blockquote> | </blockquote> | ||
− | == herbarium.bgbm.org ({{abbr|BGBM}}) == | + | === herbarium.bgbm.org ({{abbr|BGBM}}) === |
({{done}}) Some RDF files contain invalid URI entries, i.e. there is a tab/space character in the URI in <code>owl:sameAs</code>, which would break the whole data import. The error log of the triple store loader (tdbloader2) shows something like: | ({{done}}) Some RDF files contain invalid URI entries, i.e. there is a tab/space character in the URI in <code>owl:sameAs</code>, which would break the whole data import. The error log of the triple store loader (tdbloader2) shows something like: | ||
Line 46: | Line 123: | ||
<pre>Bad URI: < http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2> Code: 0/ILLEGAL_CHARACTER in SCHEME: The character violates the grammar rules for URIs/IRIs. ERROR Bad character in IRI (space): <[space]...></pre> | <pre>Bad URI: < http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2> Code: 0/ILLEGAL_CHARACTER in SCHEME: The character violates the grammar rules for URIs/IRIs. ERROR Bad character in IRI (space): <[space]...></pre> | ||
… see for instance in line 63: | … see for instance in line 63: | ||
− | <syntaxhighlight lang="xml" line start="62" highlight=" | + | <syntaxhighlight lang="xml" line start="62" highlight="2"> |
<rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619"> | <rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619"> | ||
<owl:sameAs rdf:resource=" http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" /> | <owl:sameAs rdf:resource=" http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" /> | ||
Line 56: | Line 133: | ||
* http://herbarium.bgbm.org/data/rdf/B100000503 --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:21, 30 January 2020 (CET) {{done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 11:45, 3 February 2020 (CET) | * http://herbarium.bgbm.org/data/rdf/B100000503 --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:21, 30 January 2020 (CET) {{done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 11:45, 3 February 2020 (CET) | ||
* http://herbarium.bgbm.org/data/rdf/B100000627 --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:21, 30 January 2020 (CET) {{done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 11:45, 3 February 2020 (CET) | * http://herbarium.bgbm.org/data/rdf/B100000627 --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:21, 30 January 2020 (CET) {{done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 11:45, 3 February 2020 (CET) | ||
+ | </blockquote> | ||
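As long as a data provider has not corrected such entries, the stray whitespace can be removed before loading. A minimal sketch, assuming the whitespace always sits directly after the opening quote of an <code>rdf:resource</code> or <code>rdf:about</code> attribute; the function name <code>trim_leading_uri_whitespace</code> is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read RDF/XML on stdin and remove spaces or tabs that
# directly follow the opening quote of rdf:resource="..." or rdf:about="...".
trim_leading_uri_whitespace() {
  sed -E 's/(rdf:(resource|about)=")[[:space:]]+/\1/g'
}

# Possible usage (file names are placeholders):
#   trim_leading_uri_whitespace < B100000503.rdf > B100000503.fixed.rdf
```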
+ | |||
+ | === snsb.info ({{abbr|SNSB}}) === | ||
+ | |||
+ | {{Done|Done: Mistakes of naming RDF elements/properties}}: they are sometimes mixed up with [[CSPP#CSPP_Elements|{{abbr|CSPP}}-Element names]], e.g. dwc:kindOfMaterial, where “kindOfMaterial” is only the CSPP element name, not the designated property (code) term. I found the following mistakes (assuming the following prefixes):<br /><!-- | ||
+ | --><code><nowiki>PREFIX dwc: <http://rs.tdwg.org/dwc/terms/></nowiki></code> and <!-- | ||
+ | --><code><nowiki>PREFIX dcterms: <http://purl.org/dc/terms/></nowiki></code><br /> … for instance, the following elements are mistaken and do not resolve: | ||
+ | * <code><nowiki><dwc:kindOfMaterial></nowiki></code> => <code><nowiki><dcterms:type></nowiki></code> | ||
+ | * <code><nowiki><dwc:collectionDate></nowiki></code> => <code><nowiki><dcterms:created></nowiki></code> | ||
+ | * <code><nowiki><dwc:sourceLink></nowiki></code> => <code><nowiki><dcterms:publisher></nowiki></code> | ||
+ | Perhaps there are more RDF elements to fix. | ||
+ | --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 15:22, 8 October 2020 (CEST) | ||
+ | : Done --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 10:25, 14 October 2020 (CEST) | ||
+ | |||
+ | === data.biodiversitydata.nl (Naturalis) === | ||
+ | |||
+ | Some RDF files contain invalid URI entries, that is, they are not {{abbr|URL}}-encoded, e.g. <syntaxhighlight lang="xml" inline><rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/L 0934036`"></syntaxhighlight> with bare spaces or accent characters. There are many {{abbr|URIs}} with spaces (about 278.900) and a few with accent characters, producing error messages like: | ||
+ | <blockquote> | ||
+ | <pre> | ||
+ | [line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200934036[`]...> | ||
+ | [line: …, col: 80] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200799429%20%20%20%20[`]...> | ||
+ | [line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200979378[`]...> | ||
+ | [line: …, col: 63] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L.4305564[`]...> | ||
+ | </pre> | ||
+ | </blockquote> | ||
+ | These URI entries have to be fixed, otherwise the data would not be imported; for the import, the issue is currently fixed manually by find and replace. | ||
+ | --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 12:30, 8 July 2020 (CEST) | ||
+ | |||
+ | {{Done}} in any case https://github.com/infinite-dao/glean-cetaf-rdfs/blob/main/bin/fixRDF_before_validateRDFs.sh checks URL errors and tries to fix it --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 15:09, 10 October 2022 (CEST) | ||
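The kind of find and replace meant here can be sketched as follows. This is not the linked script but a minimal illustration under assumptions: only bare spaces and the accent character “`” inside <code>rdf:about</code> values are percent-encoded, and the function name <code>encode_rdf_about</code> is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read RDF/XML on stdin and percent-encode bare spaces
# (%20) and grave accents (%60) inside rdf:about="..." attribute values.
# Each loop replaces one character per pass and repeats until none is left.
encode_rdf_about() {
  sed -E \
    -e ':space' -e 's/(rdf:about="[^" ]*) /\1%20/' -e 't space' \
    -e ':tick'  -e 's/(rdf:about="[^"`]*)`/\1%60/' -e 't tick'
}

# Possible usage (file names are placeholders):
#   encode_rdf_about < naturalis.rdf > naturalis.fixed.rdf
```

Whether a trailing “`” should be encoded or rather dropped as a data error is a curation decision that this sketch does not make.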
+ | |||
+ | === data.rbge.org.uk ({{abbr|RBGE}}) === | ||
+ | |||
+ | The RDF/XML contains bare “<code>&</code>”, which is not properly escaped in XML (the proper escape is <code>&amp;</code>). Many RDF files are affected, e.g. http://data.rbge.org.uk/herb/E00011206, so this seems to be a generic problem --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:33, 16 March 2020 (CET) | ||
+ | <blockquote> | ||
+ | * it will be fixed during the harvesting routine, but the provided XML should be valid, including an escaped ampersand <code>&</code> --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:33, 16 March 2020 (CET) | ||
+ | * {{Done}} in any case https://github.com/infinite-dao/glean-cetaf-rdfs/blob/main/bin/fixRDF_before_validateRDFs.sh checks <code>&</code> errors and tries to fix it --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 15:24, 10 October 2022 (CEST) | ||
+ | </blockquote> | ||
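One way to escape bare ampersands without double-escaping existing entities is a two-step replacement. This is a minimal sketch, not the linked script; the function name <code>escape_bare_ampersands</code> is hypothetical, and only the five predefined XML entities and numeric character references are treated as already escaped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read XML on stdin and escape bare "&" as "&amp;".
# Step 1 escapes every "&"; step 2 undoes the escape where the "&" already
# started a predefined entity or a numeric character reference.
escape_bare_ampersands() {
  sed -E \
    -e 's/&/\&amp;/g' \
    -e 's/&amp;((amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/\&\1/g'
}

# Possible usage (file names are placeholders):
#   escape_bare_ampersands < E00011206.rdf > E00011206.fixed.rdf
```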
+ | |||
+ | === id.luomus.fi ({{abbr|LUOMUS}}) === | ||
+ | |||
+ | The requested RDF does not describe the requested {{abbr|CETAF-ID}} <code><nowiki>http://id.luomus.fi/GL.749</nowiki></code> itself, the ID “hangs somewhat in the air” (from a descriptive point of view): | ||
+ | <blockquote> | ||
+ | # http://id.luomus.fi/GL.749 gets redirected to http://id.luomus.fi/GL.749?format=RDFXML and | ||
+ | # by analysing the RDF via Apache Jena’s <code>rdfparse</code> it reveals that it describes <syntaxhighlight lang="text" inline><http://id.luomus.fi/GL.749?format=RDFXML> <http://purl.org/dc/terms/subject> <http://id.luomus.fi/GL.749></syntaxhighlight> just to be related, but | ||
+ | # <code><nowiki>http://id.luomus.fi/GL.749</nowiki></code> itself has no related description (<code>rdf:Description</code>) but there are two descriptions <code><nowiki>http://tun.fi/MY.275076</nowiki></code> and <code><nowiki>http://tun.fi/MY.881682</nowiki></code> which do not relate to <code><nowiki>http://id.luomus.fi/GL.749</nowiki></code>. So {{abbr|CETAF-ID}} <code><nowiki>http://id.luomus.fi/GL.749</nowiki></code> “hangs somewhat in the air” because it is not described. | ||
+ | --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 12:10, 20 February 2020 (CET) | ||
+ | * {{Done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:19, 10 October 2022 (CEST) | ||
+ | </blockquote> | ||
+ | |||
+ | === specimens.kew.org ({{abbr|RBGK}}) === | ||
+ | |||
+ | The requested RDF is returned as HTML instead of RDF --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 14:32, 18 February 2020 (CET) | ||
+ | : fixing seems to be in progress, which is good, but some IDs from the GBIF API return a 404 page instead of a specimen (possibly an old data record) --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 10:56, 16 July 2020 (CEST) | ||
+ | : seems {{Done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:26, 10 October 2022 (CEST) | ||
+ | <blockquote> | ||
+ | For instance under Linux: | ||
+ | <syntaxhighlight lang="bash" style="font-size:smaller;"> | ||
+ | wget --header='Accept: application/rdf+xml' --content-on-error --output-document="specimens.kew.org⁄herbarium⁄1.000.rdf" "http://specimens.kew.org/herbarium/1.000" | ||
+ | head specimens.kew.org⁄herbarium⁄1.000.rdf | ||
+ | # <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> | ||
+ | # <html><head> | ||
+ | # <title>404 Not Found</title> | ||
+ | # </head><body> | ||
+ | # <h1>Not Found</h1> | ||
+ | # <p>The requested URL /herbarium/1.000 was not found on this server.</p> | ||
+ | # </body></html> | ||
+ | </syntaxhighlight> | ||
+ | </blockquote> | ||
+ | |||
+ | {{Alert box | ||
+ | |content=Requested XML/RDF is declared by <syntaxhighlight lang="xml" style="font-size:smaller;" inline><?xml version="1.0" encoding="UTF-8"?></syntaxhighlight> but actually encoded as ISO-8859 contradicting the declared xml encoding of UTF-8. This should be fixed to have all non ASCII characters properly mapped. --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 13:16, 11 August 2020 (CEST) | ||
+ | : {{Done}} --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 16:26, 10 October 2022 (CEST) | ||
+ | |type=info | ||
+ | |anchor=specimens.kew.org fix ISO-8859 encoding to be UTF-8 | ||
+ | }} | ||
+ | <blockquote> | ||
+ | For instance under Linux: | ||
+ | <syntaxhighlight lang="bash" style="font-size:smaller;"> | ||
+ | #!/bin/bash | ||
+ | cetaf_uri='http://specimens.kew.org/herbarium/K000001999' | ||
+ | wget --quiet --output-file="${cetaf_uri##*/}.log" --header='Accept: application/rdf+xml' --content-on-error --output-document="${cetaf_uri##*/}.rdf" "${cetaf_uri}" | ||
+ | # download quietly RDF into file 'K000001999.rdf' | ||
+ | |||
+ | /path/to/local/downloaded/apache-jena-3.14.0/bin/rdfxml --validate 'K000001999.rdf' | ||
+ | # validate rdf via Apache-Jena command line tool | ||
+ | # K000001999.rdf :: 12:34:44 ERROR riot :: [line: 37, col: 67] Invalid byte 2 of 3-byte UTF-8 sequence. | ||
+ | |||
+ | file 'K000001999.rdf' | ||
+ | # show file generic properties | ||
+ | # K000001999.rdf: XML 1.0 document, ISO-8859 text, with very long lines | ||
+ | </syntaxhighlight> | ||
+ | Comments (AP 2020-08-11 12:51:26): | ||
+ | * K000001999.rdf contains no UTF-8 but ISO-8859 encoded characters | ||
+ | * manual work around would be: <syntaxhighlight lang="bash" style="font-size:smaller;" inline>iconv -f ISO_8859-1 -t UTF-8 K000001999.rdf</syntaxhighlight> | ||
+ | </blockquote> | ||
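The manual workaround can be wrapped so that files which are already valid UTF-8 are left alone. A minimal sketch under assumptions: the function name <code>to_utf8</code> is hypothetical, and any file that is not valid UTF-8 is assumed to be ISO-8859-1, which the <code>file</code> output above suggests but does not prove.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the file "$1" as UTF-8 on stdout.
# If the file already decodes as UTF-8 it is passed through unchanged;
# otherwise it is converted, assuming ISO-8859-1 as the source encoding.
to_utf8() {
  if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
    cat "$1"
  else
    iconv -f ISO-8859-1 -t UTF-8 "$1"
  fi
}

# Possible usage:
#   to_utf8 K000001999.rdf > K000001999.utf8.rdf
```

The XML declaration already states UTF-8, so after the conversion the declared and the actual encoding agree.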
+ | |||
+ | ({{Done}}) Requested RDF has <tt>dc:relation</tt> nesting mistake: it is meant to be only inside <syntaxhighlight lang="xml" style="font-size:smaller;" inline><rdf:Description rdf:about="..." ><!-- data --><dc:relation><!-- related rdf:Description nests here --></dc:relation><!-- data --></rdf:Description></syntaxhighlight>, e.g.: | ||
+ | <blockquote> | ||
+ | Perhaps [[Questions, problem solutions and further discussions (Guide of best practices)#develop RDF via TriG format or use it as dump data storage|develop RDF via TriG format (on Questions, problem solutions and further discussions (Guide of best practices))]] helps here ? --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 13:30, 16 July 2020 (CEST) | ||
+ | |||
+ | The following example shows the actual RDF (left) and the output of the Linux <code>diff</code> command line tool (right) for the file retrieved with <syntaxhighlight lang="bash" style="font-size:smaller;" inline>wget --header='Accept: application/rdf+xml' --content-on-error --output-document="K000001005.rdf" "http://specimens.kew.org/herbarium/K000001005"</syntaxhighlight> | ||
+ | <table class="vertical-align-top"> | ||
+ | <tr><td style="max-width:600px">The closing <nowiki></rdf:Description></nowiki> (of the CETAF-ID) in line 36 should come much later, so that it encloses the <nowiki><dc:relation></nowiki> and all other elements: | ||
+ | <syntaxhighlight lang="xml" line start="35" highlight="2" style="font-size:smaller;"> | ||
+ | <dwc:locationRemarks>in umbrosis.</dwc:locationRemarks> | ||
+ | </rdf:Description> | ||
+ | <!-- Image associated with the specimen --> | ||
+ | <dc:relation> | ||
+ | <rdf:Description rdf:about="http://www.kew.org/herbcatimg/588771.jpg"> | ||
+ | <dc:identifier rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/> | ||
+ | <dc:type rdf:resource="http://purl.org/dc/dcmitype/Image"/> | ||
+ | <dc:subject rdf:resource="http://specimens.kew.org/herbarium/K000001005"/> | ||
+ | <dc:format>image/jpeg</dc:format> | ||
+ | <dc:description xml:lang="en">Image of herbarium specimen</dc:description> | ||
+ | <dc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/> | ||
+ | </rdf:Description> | ||
+ | </dc:relation> | ||
+ | <dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/> | ||
+ | </rdf:RDF> | ||
+ | </syntaxhighlight> | ||
+ | </td> | ||
+ | <td><div style="text-align:center;">Illustrated with <code>diff</code>: the <nowiki></rdf:Description></nowiki> in line 36 (the hunk starts at line 33) moves to the very bottom, just before <nowiki></rdf:RDF></nowiki>:</div> | ||
+ | <syntaxhighlight lang="diff" style="font-size:smaller;"> | ||
+ | --- K000001005.rdf 2020-07-16 10:25:35.236116113 +0200 | ||
+ | +++ K000001005-fixed.rdf 2020-07-16 10:40:30.246263344 +0200 | ||
+ | @@ -33,7 +33,6 @@ | ||
+ | <dwc:recordNumber>0</dwc:recordNumber> | ||
+ | <dwc:country>Bahia</dwc:country> | ||
+ | <dwc:locationRemarks>in umbrosis.</dwc:locationRemarks> | ||
+ | -</rdf:Description> | ||
+ | <!-- Image associated with the specimen --> | ||
+ | <dc:relation> | ||
+ | <rdf:Description rdf:about="http://www.kew.org/herbcatimg/588771.jpg"> | ||
+ | @@ -46,4 +45,5 @@ | ||
+ | </rdf:Description> | ||
+ | </dc:relation> | ||
+ | <dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/> | ||
+ | +</rdf:Description> | ||
+ | </rdf:RDF> | ||
+ | </syntaxhighlight></td></tr> | ||
+ | </table> | ||
+ | |||
+ | Done --[[User:Andreas Plank|Andreas Plank]] ([[User talk:Andreas Plank|talk]]) 12:23, 10 August 2020 (CEST) | ||
</blockquote> | </blockquote> |
Latest revision as of 09:28, 12 October 2022
Screenshot of the Firefox RESTED plugin (steps to retrieve an RDF data source) |
Note: Unresolved or pending issues are on top and issues that are done get to the end. To check for RDF in your browser or on command line:
- you can use https://www.w3.org/RDF/Validator/ in general, or use command line tools from Apache Jena (see Documentation), e.g. on Linux:
  /path/to/your/apache-jena-3.15.0/bin/rdfxml --validate "Testfile.rdf"
  # or with log file
  /path/to/your/apache-jena-3.15.0/bin/rdfparse -R "Testfile.rdf" > "Testfile.rdf.ttl" 2> "Testfile.rdf.log"
- you can more specifically use the CETAF Specimen URI Tester (http://herbal.rbge.info)
- you can use a plugin in your browser, to basically evaluate redirection to the source RDF, e.g. RESTED Client and then adding Header Accept: application/rdf+xml (see example aside)
coldb.mnhn.fr (MNHN)
Unicode/XML issues:
- Notes for a general work around during harvest/import:
  - invalid unicode characters break rdfparse and subsequent import, so the harvested RDF must be fixed first manually at this point --Andreas Plank (talk) 12:33, 8 June 2020 (CEST)
  - characters that can not be guessed properly will be replaced by a question mark “?” at that position where the wrong unicode character was before
Pending unicode/XML issues (but most will be ignored as these are no plant records to be used for the botany pilot):
- http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 see ?
<dwc:municipality>Szirdokpisp?Ki</dwc:municipality>
rdfparse found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0058 see ?
<dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0062 see ? in
<dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1966-0061 see ? in
<dwc:occurrenceRemarks>LOC.: TRAPEANG-REPOU, ROUTE VEAL-RENG, KAMP?T</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 71] An invalid XML character (Unicode: 0x5) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1986-0545 see ?
<dwc:occurrenceRemarks>1986-545 A 547 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0398 see ?
<dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
M?ME => MÊME (AP: probably »DANS LE MÊME BOCAL«)
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1259 see ?
<dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1260</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
M?ME => MÊME (AP: probably »MÊME BOCAL QUE 1980-1260«)
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0399 see ? in
<dwc:occurrenceRemarks>1963-398 A 400 DANS LE M?ME BOCAL</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
M?ME => MÊME
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1980-1260 see ? in
<dwc:occurrenceRemarks>M?ME BOCAL QUE 1980-1259</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 29] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
M?ME => MÊME
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1963-0400 see ? in
<dwc:occurrenceRemarks>1963-398 A 400 DANS LE MME BOCAL</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 52] An invalid XML character (Unicode: 0x14) was found in the element content of the document.
M?ME => MÊME
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/1995-0897 see ?
<dwc:occurrenceRemarks>don du northern territory museum extrait du n°13530-003. Proc. Biol. Soc. Wash. v. 109 (no. 2). B?ocal a cote de la 230-0-0-1.</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 125] An invalid XML character (Unicode: 0x16) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/b-2510 see ?
<dwc:occurrenceRemarks>PARALECTOTYPE DESIGNE PAR SPRINGER, 1962 IN COPEIA No 2? 2 : 4321 EX. EXTRAIT DE A.2024 / D.XII-23 , A.II-23 / VOIR SMITH. CONTR. TO ZOOL., No 73,</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 83] An invalid XML character (Unicode: 0x1b) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4687 see ? in
<dwc:occurrenceRemarks>SYNTYPE?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 35] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ic/a-4718 see ? in
<dwc:occurrenceRemarks>SYNTYPES ?DE BATRACHUS POROSISSIMUS CUVIER, 1829 IN REGNE ANIMAL (ed. 2) V. 2 : 254 / LS = 69 - 71 et 82 mm / LT = 78 - 80,5 et 92 mm</dwc:occurrenceRemarks>
rdfparse found: [line: …, col: 37] An invalid XML character (Unicode: 0x13) was found in the element content of the document.
- http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4878 see ? in
<dwc:locality>R?mire</dwc:locality>
rdfparse found: [line: …, col: 20] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
R?mire => R?mire (perhaps Rémire?)
- http://coldb.mnhn.fr/catalognumber/mnhn/ra/1991.4926 see ? in
<dwc:locality>piste de St H?lie</dwc:locality>
rdfparse found: [line: …, col: 32] An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
H?lie => H?lie
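All of the records above fail on control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return). The following is only a minimal sketch for counting and flagging such characters before running rdfparse. It assumes GNU tr, and the function names are hypothetical. The intended character (e.g. Ê in MÊME) still has to be restored by hand.

```shell
#!/bin/bash
# Hypothetical helpers, not part of any harvesting script referenced on this page.

# Replace every control character that is invalid in XML 1.0
# (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F) with a visible '?'.
# Assumption: GNU tr, which pads the second set with its last character.
flag_invalid_xml_chars() {
  LC_ALL=C tr '\000-\010\013\014\016-\037' '?'
}

# Count the invalid control characters in a file (0 means none were found).
count_invalid_xml_chars() {
  LC_ALL=C tr -d -c '\000-\010\013\014\016-\037' < "$1" | wc -c | tr -d ' '
}
```

A possible use is `count_invalid_xml_chars harvested.rdf` to decide whether a file needs attention, and `flag_invalid_xml_chars < harvested.rdf > harvested-flagged.rdf` to make the affected positions visible (both file names are placeholders).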
data.nhm.ac.uk (NHM)
(Pending, minor issue that does not block) Requesting “Content-Type: application/rdf+xml” results in 404 (not found) instead of returning RDF (see https://github.com/NaturalHistoryMuseum/ckanext-nhm/issues/458) --Andreas Plank (talk) 14:06, 18 February 2020 (CET)
- minor issue, not relevant, because the header “Content-Type: application/rdf+xml” is meant for the (returned) resource, not the request --Andreas Plank (talk) 10:40, 20 February 2020 (CET)
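To illustrate the distinction made in the comment above: “Accept” is the request header used for content negotiation, while “Content-Type” describes what the server returns. The following is a minimal sketch using wget; the function name is hypothetical and the URI passed to it is up to the caller.

```shell
#!/bin/bash
# Hypothetical helper: ask for RDF/XML via the "Accept" request header and
# print the "Content-Type" header(s) of the response.
show_returned_content_type() {
  local uri="$1"
  # --server-response prints the response headers on stderr,
  # so stderr is merged into stdout before filtering
  wget --server-response --header='Accept: application/rdf+xml' \
       --output-document=/dev/null "$uri" 2>&1 | grep -i 'Content-Type:'
}
```

If redirects are followed, one Content-Type line per response may be printed; the last one belongs to the finally returned resource.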
No or mixed up RDF description of CETAF-ID
See perhaps the example of CETAF Specimen Preview Profile (CSPP) in general.
id.zfmk.de (ZFMK)
(Pending) The requested RDF does not describe the requested CETAF-ID http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):
- http://id.zfmk.de/collection_ZFMK/1650/733377/90217 gets redirected to https://id.zfmk.de/collection_ZFMK/rdf/xml/CollectionSpecimen/1650/733377/90217/?shorturl=1 and
- analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, https://id.zfmk.de/collection_ZFMK/1650, which is unrelated to the ID; http://id.zfmk.de/collection_ZFMK/1650/733377/90217 itself has no related description (rdf:Description) and “hangs somewhat in the air”
- the website states a stable URL https://id.zfmk.de/collection_ZFMK/page/CollectionSpecimen/1650 but this very URL does not return any RDF
--Andreas Plank (talk) 12:29, 20 February 2020 (CET)
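Whether an ID is described at all can be checked mechanically once the RDF is available as N-Triples. The sketch below is an illustration only: the function name is hypothetical, and it assumes the triples were already produced by a converter of your choice (for example one of the Apache Jena command line tools).

```shell
#!/bin/bash
# Hypothetical helper: read N-Triples on stdin and succeed only if the
# given URI occurs as the subject of at least one triple.
is_described() {
  local id="$1"
  awk -v subject="<${id}>" '$1 == subject { found = 1 } END { exit !found }'
}
```

Used as `is_described 'http://id.zfmk.de/collection_ZFMK/1650/733377/90217' < specimen.nt` (the file name is a placeholder), a non-zero exit status would indicate that the requested CETAF-ID has no description of its own.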
purl.org/nhmuio (NHMUO)
(Pending) The requested RDF does not describe the requested CETAF-ID http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):
- http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 gets redirected to https://data.gbif.no/resolver/O:L:14 and
- analysing the RDF with Apache Jena’s rdfparse reveals that it describes something else, http://purl.org/gbifnorway/id/O:L:14, which is unrelated to the ID; http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3 itself has no related description (rdf:Description) and “hangs somewhat in the air”
--Andreas Plank (talk) 13:30, 20 February 2020 (CET)
No RDF but HTML
…
Fixed Issues
col.smns-bw.org (SMNS)
The requested RDF is returned as an HTML fragment instead of RDF. --Andreas Plank (talk) 14:38, 18 February 2020 (CET)
- ( Done) Seems fixed --Andreas Plank (talk) 10:27, 12 October 2022 (CEST)
For instance under Linux:
wget --header='Accept: application/rdf+xml' --content-on-error --output-document="col.smns-bw.org⁄object⁄S10000227722006.rdf" "http://col.smns-bw.org/object/S10000227722006"
file col.smns-bw.org⁄object⁄S10000227722006.rdf
# col.smns-bw.org⁄object⁄S10000227722006.rdf: HTML document, ISO-8859 text, with very long lines, with CRLF line terminators
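For a batch of downloads, the same check can be scripted instead of inspecting each file by hand. The following is a rough sketch only: the function name is hypothetical, and it assumes the provider uses the conventional rdf: prefix for the root element.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file contains an <rdf:RDF root element.
# Assumption: the conventional "rdf:" prefix is used; HTML error pages or
# HTML fragments will not contain it.
looks_like_rdfxml() {
  grep -q '<rdf:RDF' "$1"
}
```

A harvest loop could then report suspicious files with something like `looks_like_rdfxml "$f" || echo "not RDF: $f"`.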
herbarium.bgbm.org (BGBM)
(Done) Some RDF files contain invalid URI entries, i.e. there is a tab/space character in the URI in owl:sameAs, and this would break the whole import of data. The error log of the triple store loader (tdbloader2) shows something like:
Bad URI: < http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2> Code: 0/ILLEGAL_CHARACTER in SCHEME: The character violates the grammar rules for URIs/IRIs.
ERROR Bad character in IRI (space): <[space]...>
… see for instance line 63:
62 <rdf:Description rdf:about="http://www.wikidata.org/entity/Q6382619">
63 <owl:sameAs rdf:resource=" http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/a86596ea-6f4d-4b97-bf6f-8d492c0fc8b2" />
64 <owl:sameAs rdf:resource="http://viaf.org/viaf/233473288" />
65 </rdf:Description>
The following objects were detected:
- http://herbarium.bgbm.org/data/rdf/B100000580 --Andreas Plank (talk) 16:21, 30 January 2020 (CET) Done --Andreas Plank (talk) 11:45, 3 February 2020 (CET)
- http://herbarium.bgbm.org/data/rdf/B100000503 --Andreas Plank (talk) 16:21, 30 January 2020 (CET) Done --Andreas Plank (talk) 11:45, 3 February 2020 (CET)
- http://herbarium.bgbm.org/data/rdf/B100000627 --Andreas Plank (talk) 16:21, 30 January 2020 (CET) Done --Andreas Plank (talk) 11:45, 3 February 2020 (CET)
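Until such entries are corrected at the source, the stray whitespace could be removed locally before loading. This is only a sketch under the assumption that the whitespace always sits directly after the opening quote of rdf:resource or rdf:about; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: remove spaces/tabs that directly follow the opening
# quote of an rdf:resource or rdf:about attribute value (stdin to stdout).
strip_uri_leading_whitespace() {
  sed -E 's/(rdf:(resource|about)=")[[:space:]]+/\1/g'
}
```

For example, `strip_uri_leading_whitespace < B100000580.rdf > B100000580-fixed.rdf` (placeholder file names) would leave all other content untouched.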
snsb.info (SNSB)
Done: Mistakes in naming RDF elements/properties: they are sometimes mixed up with CSPP element names, e.g. dwc:kindOfMaterial, where “kindOfMaterial” is only the CSPP element name, not the designated property (code) term. I found the following mistakes (assuming the prefixes PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> and PREFIX dcterms: <http://purl.org/dc/terms/>); for instance, the following elements are mistaken and do not resolve:
- <dwc:kindOfMaterial> => <dcterms:type>
- <dwc:collectionDate> => <dcterms:created>
- <dwc:sourceLink> => <dcterms:publisher>
Perhaps there are more RDF elements to fix. --Andreas Plank (talk) 15:22, 8 October 2020 (CEST)
- Done --Andreas Plank (talk) 10:25, 14 October 2020 (CEST)
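To see whether any of the three mistaken element names listed above still occurs in harvested files, a simple search is enough. The sketch below is an illustration with a hypothetical function name; it only covers the three names from the list and would not find other mistaken elements.

```shell
#!/bin/bash
# Hypothetical helper: list lines (with line numbers) that still use one of
# the three mistaken element names. Reads the given files, or stdin if none.
find_mistaken_elements() {
  grep -n -E '</?dwc:(kindOfMaterial|collectionDate|sourceLink)[ />]' "$@"
}
```

No output (and a non-zero exit status from grep) means none of the three names was found.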
data.biodiversitydata.nl (Naturalis)
Some RDF files contain invalid URI entries, that is, they are not URL-encoded, e.g. <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/L 0934036`"> with bare spaces or accent characters. URIs with spaces are numerous (≈278.900); URIs with accent characters are few. The error messages look like:
[line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200934036[`]...>
[line: …, col: 80] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200799429%20%20%20%20[`]...>
[line: …, col: 68] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L%20%200979378[`]...>
[line: …, col: 63] Illegal character in IRI (codepoint 0x60, '`'): <http://data.biodiversitydata.nl/naturalis/specimen/L.4305564[`]...>
These URI entries have to be fixed, otherwise the data would not be imported; for the import, this is fixed manually by find and replace. --Andreas Plank (talk) 12:30, 8 July 2020 (CEST)
Done in any case: https://github.com/infinite-dao/glean-cetaf-rdfs/blob/main/bin/fixRDF_before_validateRDFs.sh checks for URL errors and tries to fix them --Andreas Plank (talk) 15:09, 10 October 2022 (CEST)
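As an illustration of what such a find-and-replace could look like, here is a minimal sed sketch that percent-encodes bare spaces and backticks inside rdf:about values. It is not taken from the script linked above; the function name is hypothetical, and whether a backtick should be encoded as %60 or dropped altogether is an assumption that depends on what the provider intended.

```shell
#!/bin/bash
# Hypothetical helper (stdin to stdout): percent-encode spaces (%20) and
# backticks (%60) inside rdf:about="..." attribute values only.
percent_encode_rdf_about() {
  sed -E \
    -e ':a' -e 's/(rdf:about="[^" ]*) /\1%20/' -e 'ta' \
    -e ':b' -e 's/(rdf:about="[^"`]*)`/\1%60/' -e 'tb'
}
```

Each loop replaces one offending character per pass and repeats until the attribute value is clean, so text outside rdf:about values is left as it is.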
data.rbge.org.uk (RBGE)
The XML embedded in the RDF contains a bare “&”, which is not properly escaped in XML (the proper escape is &amp;). This affects many RDF files, e.g. http://data.rbge.org.uk/herb/E00011206, and seems to be a generic problem --Andreas Plank (talk) 16:33, 16 March 2020 (CET)
- it will be fixed during the harvesting routine, but the provided XML should be valid, including the escaped ampersand &amp; --Andreas Plank (talk) 16:33, 16 March 2020 (CET)
- Done in any case: https://github.com/infinite-dao/glean-cetaf-rdfs/blob/main/bin/fixRDF_before_validateRDFs.sh checks for & errors and tries to fix them --Andreas Plank (talk) 15:24, 10 October 2022 (CEST)
id.luomus.fi (LUOMUS)
The requested RDF does not describe the requested CETAF-ID http://id.luomus.fi/GL.749 itself; the ID “hangs somewhat in the air” (from a descriptive point of view):
- http://id.luomus.fi/GL.749 gets redirected to http://id.luomus.fi/GL.749?format=RDFXML and
- analysing the RDF with Apache Jena’s rdfparse reveals that it describes <http://id.luomus.fi/GL.749?format=RDFXML> <http://purl.org/dc/terms/subject> <http://id.luomus.fi/GL.749> just to be related, but http://id.luomus.fi/GL.749 itself has no related description (rdf:Description); there are two descriptions, http://tun.fi/MY.275076 and http://tun.fi/MY.881682, which do not relate to http://id.luomus.fi/GL.749. So the CETAF-ID http://id.luomus.fi/GL.749 “hangs somewhat in the air” because it is not described.
--Andreas Plank (talk) 12:10, 20 February 2020 (CET)
- Done --Andreas Plank (talk) 16:19, 10 October 2022 (CEST)
specimens.kew.org (RBGK)
The requested RDF is returned as HTML instead of RDF --Andreas Plank (talk) 14:32, 18 February 2020 (CET)
- fixing seems in progress, which is good, but some IDs from the GBIF API return no specimen but a 404 page (which is possibly an old data record) --Andreas Plank (talk) 10:56, 16 July 2020 (CEST)
- seems Done --Andreas Plank (talk) 16:26, 10 October 2022 (CEST)
For instance under Linux:
wget --header='Accept: application/rdf+xml' --content-on-error --output-document="specimens.kew.org⁄herbarium⁄1.000.rdf" "http://specimens.kew.org/herbarium/1.000"
head specimens.kew.org⁄herbarium⁄1.000.rdf
# <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
# <html><head>
# <title>404 Not Found</title>
# </head><body>
# <h1>Not Found</h1>
# <p>The requested URL /herbarium/1.000 was not found on this server.</p>
# </body></html>
<?xml version="1.0" encoding="UTF-8"?>
but is actually encoded as ISO-8859, contradicting the declared XML encoding of UTF-8. This should be fixed so that all non-ASCII characters are properly mapped. --Andreas Plank (talk) 13:16, 11 August 2020 (CEST)
- Done --Andreas Plank (talk) 16:26, 10 October 2022 (CEST)
For instance under Linux:
#!/bin/bash
cetaf_uri='http://specimens.kew.org/herbarium/K000001999'
wget --quiet --output-file="${cetaf_uri##*/}.log" --header='Accept: application/rdf+xml' --content-on-error --output-document="${cetaf_uri##*/}.rdf" "${cetaf_uri}"
# download quietly RDF into file 'K000001999.rdf'
/path/to/local/downloaded/apache-jena-3.14.0/bin/rdfxml --validate 'K000001999.rdf' # validate rdf via Apache-Jena command line tool
# K000001999.rdf :: 12:34:44 ERROR riot :: [line: 37, col: 67] Invalid byte 2 of 3-byte UTF-8 sequence.
file 'K000001999.rdf' # show file generic properties
# K000001999.rdf: XML 1.0 document, ISO-8859 text, with very long lines
Comments (AP 2020-08-11 12:51:26):
- K000001999.rdf contains no UTF-8 but ISO-8859 encoded characters
- a manual workaround would be:
iconv -f ISO_8859-1 -t UTF-8 K000001999.rdf
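Because converting a file that is already valid UTF-8 would garble it, the iconv workaround is best applied only to files that actually fail a UTF-8 check. The following is a small sketch with a hypothetical function name; it relies on iconv exiting with a non-zero status when the input is not valid in the declared source encoding.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file is valid UTF-8.
# iconv is asked to read the file as UTF-8 and fails on invalid byte sequences.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

A guarded conversion could then look like `is_valid_utf8 K000001999.rdf || iconv -f ISO_8859-1 -t UTF-8 K000001999.rdf > K000001999-utf8.rdf`, where the output file name is a placeholder.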
(Done) The requested RDF has a dc:relation nesting mistake: dc:relation is meant to appear only inside rdf:Description, e.g.:
<rdf:Description rdf:about="..." ><!-- data --><dc:relation><!-- related rdf:Description nests here --></dc:relation><!-- data --></rdf:Description>
Perhaps developing the RDF via TriG format (see Questions, problem solutions and further discussions (Guide of best practices)) helps here? --Andreas Plank (talk) 13:30, 16 July 2020 (CEST)
The following example compares the actual RDF (left) and the output of the Linux diff command line tool (right) for the file downloaded with:
wget --header='Accept: application/rdf+xml' --content-on-error --output-document="K000001005.rdf" "http://specimens.kew.org/herbarium/K000001005"
The </rdf:Description> (of the CETAF-ID) ending in line 36 should end much later and must envelop all the <dc:relation> and all other elements accordingly:
35 <dwc:locationRemarks>in umbrosis.</dwc:locationRemarks>
36 </rdf:Description>
37 <!-- Image associated with the specimen -->
38 <dc:relation>
39 <rdf:Description rdf:about="http://www.kew.org/herbcatimg/588771.jpg">
40 <dc:identifier rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/>
41 <dc:type rdf:resource="http://purl.org/dc/dcmitype/Image"/>
42 <dc:subject rdf:resource="http://specimens.kew.org/herbarium/K000001005"/>
43 <dc:format>image/jpeg</dc:format>
44 <dc:description xml:lang="en">Image of herbarium specimen</dc:description>
45 <dc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
46 </rdf:Description>
47 </dc:relation>
48 <dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/>
49 </rdf:RDF>
Using diff to illustrate it, the </rdf:Description> in line 36 (counting from line 33 on) moves to the very bottom, before </rdf:RDF>:
--- K000001005.rdf 2020-07-16 10:25:35.236116113 +0200
+++ K000001005-fixed.rdf 2020-07-16 10:40:30.246263344 +0200
@@ -33,7 +33,6 @@
 <dwc:recordNumber>0</dwc:recordNumber>
 <dwc:country>Bahia</dwc:country>
 <dwc:locationRemarks>in umbrosis.</dwc:locationRemarks>
-</rdf:Description>
 <!-- Image associated with the specimen -->
 <dc:relation>
 <rdf:Description rdf:about="http://www.kew.org/herbcatimg/588771.jpg">
@@ -46,4 +45,5 @@
 </rdf:Description>
 </dc:relation>
 <dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/588771.jpg"/>
+</rdf:Description>
 </rdf:RDF>
Done --Andreas Plank (talk) 12:23, 10 August 2020 (CEST)