Posted to users@jena.apache.org by Michael Brunnbauer <br...@netestate.de> on 2015/03/31 11:06:13 UTC

tdbloader2 issues

hi all,

tdbloader2 will not accept IRIs with CR or LF like this one from the Wikidata
RDF dump:

 <http://freital.de/index.phtml?La=1&object=tx|530.4535.1&NavID=530.81&sub=0\n>

But it will happily accept IRIs with |{}\\^`"

I guess there is no chance that the Semantic Web community agrees on what a
valid ntriples/nquads file looks like?

Also, tdbloader2 seems to be gradually slowed down from 100k triples/s to
< 1000 triples/s on a normal disk drive by random access after ca. 10 million
triples. Is this unavoidable? I made this change to tdbloader2 but I think it
is not relevant during the data phase:

-    SORT_ARGS="--buffer-size=50%"
+    SORT_ARGS="--buffer-size=2048M"

I have tried with Jena 2.13.0 and 2.11.1.

Can a TDB generated with Jena 2.13.0 be used with Fuseki 1.1.1?

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbloader2 issues

Posted by Andy Seaborne <an...@apache.org>.
On 31/03/15 12:12, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Tue, Mar 31, 2015 at 10:25:32AM +0100, Andy Seaborne wrote:
>>> Also, tdbloader2 seems to be gradually slowed down from 100k triples/s to
>>> < 1000 triples/s on a normal disk drive by random access after ca. 10 million
>>> triples. Is this unavoidable? I made this change to tdbloader2 but I think it
>>> is not relevant during the data phase:
>>>
>>> -    SORT_ARGS="--buffer-size=50%"
>>> +    SORT_ARGS="--buffer-size=2048M"
>>>
>>> I have tried with Jena 2.13.0 and 2.11.1.
>>
>> What's the machine it's running on?  OS?
>
> Xeon E5502 with 48GB RAM, Linux 3.4.105 with glibc 2.19 and jdk-8u31-linux-x64.
>
>> As this is the data phase, tdbloader2 is, roughly, streaming the parser to
>> disk, allocating nodeids (which is a bad access pattern).  What size are the
>> node-related files?
>
> I have it running right now at
>
> "INFO  Add: 138,800,000 Data (Batch: 15,792 / Avg: 9,656)"
>
> -rw-r--r-- 1 java java 7070640000 Mar 31 13:10 data-triples.17513
> -rw-r--r-- 1 java java 2021654528 Mar 31 13:10 node2id.dat
> -rw-r--r-- 1 java java   16777216 Mar 31 13:10 node2id.idn
> -rw-r--r-- 1 java java 3858513162 Mar 31 13:10 nodes.dat

Wow.  That looks like the unique node/triple ratio is quite high.  I take 
it the data has a lot of content-like literals in it, or autogenerated URIs.

Lots of unique nodes can slow things down because of all the node writing.
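
As a rough pre-check, the unique-node/triple ratio can be estimated before 
loading.  A minimal sketch only (naive whitespace term-splitting, not a 
conformant N-Triples parser, so lines whose literals contain spaces are 
skipped):

```python
def node_triple_ratio(lines):
    """Rough unique-node/triple ratio of N-Triples input.

    Naive splitting: assumes subject, predicate, object contain no
    internal whitespace, so literals with spaces are not counted.
    """
    nodes, triples = set(), 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("."):
            line = line[:-1].rstrip()  # drop the trailing " ."
        parts = line.split()
        if len(parts) != 3:
            continue  # comment, malformed line, or literal with spaces
        triples += 1
        nodes.update(parts)
    return len(nodes) / triples if triples else 0.0
```

A high ratio (many new terms per triple) predicts heavy node-table writing 
during the data phase.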

>
>> Does tdbloader do better? (sometimes it does, sometimes it doesn't).
>
> I will try it if tdbloader2 fails, but I guess it will work now because
> I switched to an SSD for the tdb dir.

SSD+database => :-)

>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbloader2 issues

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Tue, Mar 31, 2015 at 10:25:32AM +0100, Andy Seaborne wrote:
> >Also, tdbloader2 seems to be gradually slowed down from 100k triples/s to
> >< 1000 triples/s on a normal disk drive by random access after ca. 10 million
> >triples. Is this unavoidable? I made this change to tdbloader2 but I think it
> >is not relevant during the data phase:
> >
> >-    SORT_ARGS="--buffer-size=50%"
> >+    SORT_ARGS="--buffer-size=2048M"
> >
> >I have tried with Jena 2.13.0 and 2.11.1.
> 
> What's the machine it's running on?  OS?

Xeon E5502 with 48GB RAM, Linux 3.4.105 with glibc 2.19 and jdk-8u31-linux-x64.

> As this is the data phase, tdbloader2 is, roughly, streaming the parser to
> disk, allocating nodeids (which is a bad access pattern).  What size are the
> node-related files?

I have it running right now at 

"INFO  Add: 138,800,000 Data (Batch: 15,792 / Avg: 9,656)"

-rw-r--r-- 1 java java 7070640000 Mar 31 13:10 data-triples.17513
-rw-r--r-- 1 java java 2021654528 Mar 31 13:10 node2id.dat
-rw-r--r-- 1 java java   16777216 Mar 31 13:10 node2id.idn
-rw-r--r-- 1 java java 3858513162 Mar 31 13:10 nodes.dat

> Does tdbloader do better? (sometimes it does, sometimes it doesn't).

I will try it if tdbloader2 fails, but I guess it will work now because
I switched to an SSD for the tdb dir.

Regards,

Michael Brunnbauer


Re: tdbloader2 issues

Posted by Andy Seaborne <an...@apache.org>.
On 31/03/15 10:06, Michael Brunnbauer wrote:
...
> Also, tdbloader2 seems to be gradually slowed down from 100k triples/s to
> < 1000 triples/s on a normal disk drive by random access after ca. 10 million
> triples. Is this unavoidable? I made this change to tdbloader2 but I think it
> is not relevant during the data phase:
>
> -    SORT_ARGS="--buffer-size=50%"
> +    SORT_ARGS="--buffer-size=2048M"
>
> I have tried with Jena 2.13.0 and 2.11.1.

What's the machine it's running on?  OS?

As this is the data phase, tdbloader2 is, roughly, streaming the parser 
to disk, allocating nodeids (which is a bad access pattern).  What size 
are the node-related files?

Does tdbloader do better? (sometimes it does, sometimes it doesn't).

 >
 > Can a TDB generated with Jena 2.13.0 be used with Fuseki 1.1.1?

Yes.

 >
 > Regards,
 >
 > Michael Brunnbauer
 >

	Andy

Re: NT issues (was: Re: tdbloader2 issues)

Posted by james anderson <ja...@dydra.com>.
good afternoon;

On 2015-04-01, at 16:17, Andy Seaborne <an...@apache.org> wrote:

> Thanks for that.
> JENA-911 created.
> 
> Each of the large public dumps has had quality issues.  I'm sure wikidata will fix their process if someone helps them.  (Freebase did.)
> 
> I understand it's frustrating but fixing it in the parser/loader is not a real fix, only a limited workaround, because that data can be passed on to systems which can't cope.  That's what standards are for!!
> 
> 
> (anyone know who is involved?)

for wikidata, our feedback led to #133 (https://github.com/Wikidata/Wikidata-Toolkit/issues/133).
we had attempted to load their core dataset in the hope of working with their temporal data, and with a thought to hosting the full dataset, but the invalid iri terms have slowed that endeavour down.

> 
> The RDF 1.1 WG took some time to look at original NT - the <>-grammar rule allows junk IRIs and, if you assume some IRI parsing (java.net.URI is not bad), then even things like \n (which was an NL, not the characters "\" and "n" as the wikidata people are using it) do not get through.  The original NT grammar was specific for test cases and is open and loose by design.
> 
> Please do feed back to wikidata and we can hope it gets fixed at source.

see above.

> 
> (Ditto DBpedia for that matter)
> 
> 	Andy
> 
> Related: JENA-864
> 
> NFC and NFKC are two normalization requirements (warnings, not errors) but they seem to be more of a hindrance than a help, so I'm suggesting removing the checking.  The IRIs are legal even if not NFC - just not in the form preferred by W3C.
> 
> On 01/04/15 14:11, Michael Brunnbauer wrote:
>> 
>> Hello Andy,
>> 
>> [tdbloader2 disk access pattern]
>>> Lots of unique nodes can slow things down because of all the node writing.
>> 
>> And there is no way to convert this algorithm to sequential access?
>> 
>> [tdbloader2 parser]
>>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.
>>> 
>>> Could you provide a set of data with one feature per NTriple line,marking in
>>> a comment what you expect, and I'll check each one and add them to the test
>>> suite.
>> 
>> See attachment. I would consider all triples in it illegal according to the
>> n triples spec.
>> 
>> If I allow these characters that RFC 1738 calls "unsafe", why then not allow
>> CR, LF and TAB? And why then allow \\ but not \", which seems to be sanctioned
>> by older versions of the spec:
>> 
>>  http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>> 
>> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
>> IRIs, e.G.:
>> 
>> <http://www.wikidata.org/entity/P1348v> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
>> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D> <http://www.wikidata.org/entity/P18v> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg> .
>> 
>> This trial and error cleaning of data dumps with self made scripts and days
>> between each try is very straining and probably a big deterrent for newcomers.
>> I had it with DBpedia and now I have it with Wikidata all over again (with
>> new syntax problems).
>> 
>> Regards,
>> 
>> Michael Brunnbauer
>> 
> 



---
james anderson | james@dydra.com | http://dydra.com






Re: NT issues (was: Re: tdbloader2 issues)

Posted by james anderson <ja...@dydra.com>.
good afternoon;

On 2015-04-01, at 16:30, Michael Brunnbauer <br...@netestate.de> wrote:

> 
> Hello Andy,
> 
> it would just be great to have a mode for tdbloader[2] where invalid
> triples/quads are simply ignored.

somehow that seems like a bad idea.

there are already tools which one could use to that end.
in the case of the core wikidata dataset, rapper (which i do not hereby elevate to the role of nt conformance arbiter, but anyway) rejects several thousand statements and can be used to reduce the dataset to those which are valid.

$ rapper -i ntriples -o ntriples wikidata-statements.nt > wikidata-statements-clean.nt 2> wikidata-statements-errors.txt
$ ls -l wikidata-statements*
-rw-r--r-- 1 root root 38770855255 Apr  1 16:00 wikidata-statements-clean.nt
-rw-r--r-- 1 root root     1540120 Apr  1 16:00 wikidata-statements-errors.txt
-rw-r--r-- 1 root root 38772450070 Mar 28 08:15 wikidata-statements.nt
$ wc -l wikidata*
  233096736 wikidata-statements-clean.nt
       9627 wikidata-statements-errors.txt
  233106288 wikidata-statements.nt

from which it looks like some of the errors do not cause it to suppress the statement.

we would be reluctant to host something in that condition as a service, as one never knows which relations have been eliminated and how central they might be to the dataset’s utility.


best regards, from berlin,

> 
> Regards,
> 
> Michael Brunnbauer
> 
> On Wed, Apr 01, 2015 at 03:17:08PM +0100, Andy Seaborne wrote:
>> Thanks for that.
>> JENA-911 created.
>> 
>> Each of the large public dumps has had quality issues.  I'm sure wikidata
>> will fix their process if someone helps them.  (Freebase did.)
>> 
>> I understand it's frustrating but fixing it in the parser/loader is not a
>> real fix, only a limited workaround, because that data can be passed on to
>> systems which can't cope.  That's what standards are for!!
>> 
>> 
>> (anyone know who is involved?)
>> 
>> The RDF 1.1 WG took some time to look at original NT - the <>-grammar rule
>> allows junk IRIs and, if you assume some IRI parsing (java.net.URI is not
>> bad), then even things like \n (which was an NL, not the characters "\" and
>> "n" as the wikidata people are using it) do not get through.  The
>> original NT grammar was specific for test cases and is open and loose by
>> design.
>> 
>> Please do feed back to wikidata and we can hope it gets fixed at source.
>> 
>> (Ditto DBpedia for that matter)
>> 
>> 	Andy
>> 
>> Related: JENA-864
>> 
>> NFC and NFKC are two normalization requirements (warnings, not errors) but
>> they seem to be more of a hindrance than a help, so I'm suggesting removing
>> the checking.  The IRIs are legal even if not NFC - just not in the form
>> preferred by W3C.
>> 
>> On 01/04/15 14:11, Michael Brunnbauer wrote:
>>> 
>>> Hello Andy,
>>> 
>>> [tdbloader2 disk access pattern]
>>>> Lots of unique nodes can slow things down because of all the node writing.
>>> 
>>> And there is no way to convert this algorithm to sequential access?
>>> 
>>> [tdbloader2 parser]
>>>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.
>>>> 
>>>> Could you provide a set of data with one feature per NTriple line,marking in
>>>> a comment what you expect, and I'll check each one and add them to the test
>>>> suite.
>>> 
>>> See attachment. I would consider all triples in it illegal according to the
>>> n triples spec.
>>> 
>>> If I allow these characters that RFC 1738 calls "unsafe", why then not allow
>>> CR, LF and TAB? And why then allow \\ but not \", which seems to be sanctioned
>>> by older versions of the spec:
>>> 
>>> http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>>> 
>>> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
>>> IRIs, e.G.:
>>> 
>>> <http://www.wikidata.org/entity/P1348v> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
>>> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D> <http://www.wikidata.org/entity/P18v> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg> .
>>> 
>>> This trial and error cleaning of data dumps with self made scripts and days
>>> between each try is very straining and probably a big deterrent for newcomers.
>>> I had it with DBpedia and now I have it with Wikidata all over again (with
>>> new syntax problems).
>>> 
>>> Regards,
>>> 
>>> Michael Brunnbauer
>>> 
> 









Re: NT issues (was: Re: tdbloader2 issues)

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

it would just be great to have a mode for tdbloader[2] where invalid
triples/quads are simply ignored.

Regards,

Michael Brunnbauer
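
Pending such a mode, a pre-filter pass can approximate it outside the 
loader.  A sketch under stated assumptions (line-oriented input, one triple 
per line; it only checks <...> spans against the N-Triples IRIREF 
production, and may misfire on literals containing angle brackets):

```python
import re

# IRIREF per N-Triples (RDF 1.1) production [8]; tokenizer-level check only,
# not RFC 3986 IRI validation.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>')
ANGLE = re.compile(r'<[^>\n]*>')  # candidate <...> spans on a line

def split_invalid(lines):
    """Partition N-Triples lines into (kept, rejected) by IRI validity."""
    kept, rejected = [], []
    for line in lines:
        spans = ANGLE.findall(line)
        if all(IRIREF.fullmatch(s) for s in spans):
            kept.append(line)
        else:
            rejected.append(line)
    return kept, rejected
```

The same idea underlies the rapper pipeline shown earlier in the thread; 
keeping the rejected lines in a side file makes it possible to audit what 
was dropped.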

On Wed, Apr 01, 2015 at 03:17:08PM +0100, Andy Seaborne wrote:
> Thanks for that.
> JENA-911 created.
> 
> Each of the large public dumps has had quality issues.  I'm sure wikidata
> will fix their process if someone helps them.  (Freebase did.)
> 
> I understand it's frustrating but fixing it in the parser/loader is not a
> real fix, only a limited workaround, because that data can be passed on to
> systems which can't cope.  That's what standards are for!!
> 
> 
> (anyone know who is involved?)
> 
> The RDF 1.1 WG took some time to look at original NT - the <>-grammar rule
> allows junk IRIs and, if you assume some IRI parsing (java.net.URI is not
> bad), then even things like \n (which was an NL, not the characters "\" and
> "n" as the wikidata people are using it) do not get through.  The
> original NT grammar was specific for test cases and is open and loose by
> design.
> 
> Please do feed back to wikidata and we can hope it gets fixed at source.
> 
> (Ditto DBpedia for that matter)
> 
> 	Andy
> 
> Related: JENA-864
> 
> NFC and NFKC are two normalization requirements (warnings, not errors) but
> they seem to be more of a hindrance than a help, so I'm suggesting removing
> the checking.  The IRIs are legal even if not NFC - just not in the form
> preferred by W3C.
> 
> On 01/04/15 14:11, Michael Brunnbauer wrote:
> >
> >Hello Andy,
> >
> >[tdbloader2 disk access pattern]
> >>Lots of unique nodes can slow things down because of all the node writing.
> >
> >And there is no way to convert this algorithm to sequential access?
> >
> >[tdbloader2 parser]
> >>>>But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.
> >>
> >>Could you provide a set of data with one feature per NTriple line,marking in
> >>a comment what you expect, and I'll check each one and add them to the test
> >>suite.
> >
> >See attachment. I would consider all triples in it illegal according to the
> >n triples spec.
> >
> >If I allow these characters that RFC 1738 calls "unsafe", why then not allow
> >CR, LF and TAB? And why then allow \\ but not \", which seems to be sanctioned
> >by older versions of the spec:
> >
> >  http://www.w3.org/2001/sw/RDFCore/ntriples/#character
> >
> >I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
> >IRIs, e.G.:
> >
> ><http://www.wikidata.org/entity/P1348v> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
> ><http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D> <http://www.wikidata.org/entity/P18v> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg> .
> >
> >This trial and error cleaning of data dumps with self made scripts and days
> >between each try is very straining and probably a big deterrent for newcomers.
> >I had it with DBpedia and now I have it with Wikidata all over again (with
> >new syntax problems).
> >
> >Regards,
> >
> >Michael Brunnbauer
> >


NT issues (was: Re: tdbloader2 issues)

Posted by Andy Seaborne <an...@apache.org>.
Thanks for that.
JENA-911 created.

Each of the large public dumps has had quality issues.  I'm sure 
wikidata will fix their process if someone helps them.  (Freebase did.)

I understand it's frustrating but fixing it in the parser/loader is not 
a real fix, only a limited workaround, because that data can be passed 
on to systems which can't cope.  That's what standards are for!!


(anyone know who is involved?)

The RDF 1.1 WG took some time to look at original NT - the <>-grammar 
rule allows junk IRIs and, if you assume some IRI parsing (java.net.URI 
is not bad), then even things like \n (which was an NL, not the 
characters "\" and "n" as the wikidata people are using it) do not get 
through.  The original NT grammar was specific for test cases and is 
open and loose by design.

Please do feed back to wikidata and we can hope it gets fixed at source.

(Ditto DBpedia for that matter)

	Andy

Related: JENA-864

NFC and NFKC are two normalization requirements (warnings, not errors) 
but they seem to be more of a hindrance than a help, so I'm suggesting 
removing the checking.  The IRIs are legal even if not NFC - just not in 
the form preferred by W3C.

On 01/04/15 14:11, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> [tdbloader2 disk access pattern]
>> Lots of unique nodes can slow things down because of all the node writing.
>
> And there is no way to convert this algorithm to sequential access?
>
> [tdbloader2 parser]
>>>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.
>>
>> Could you provide a set of data with one feature per NTriple line,marking in
>> a comment what you expect, and I'll check each one and add them to the test
>> suite.
>
> See attachment. I would consider all triples in it illegal according to the
> n triples spec.
>
> If I allow these characters that RFC 1738 calls "unsafe", why then not allow
> CR, LF and TAB? And why then allow \\ but not \", which seems to be sanctioned
> by older versions of the spec:
>
>   http://www.w3.org/2001/sw/RDFCore/ntriples/#character
>
> I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
> IRIs, e.G.:
>
> <http://www.wikidata.org/entity/P1348v> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
> <http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D> <http://www.wikidata.org/entity/P18v> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg> .
>
> This trial and error cleaning of data dumps with self made scripts and days
> between each try is very straining and probably a big deterrent for newcomers.
> I had it with DBpedia and now I have it with Wikidata all over again (with
> new syntax problems).
>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbloader2 issues

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

[tdbloader2 disk access pattern]
>Lots of unique nodes can slow things down because of all the node writing.

And there is no way to convert this algorithm to sequential access?

[tdbloader2 parser]
> >>But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.
> 
> Could you provide a set of data with one feature per NTriple line, marking in
> a comment what you expect, and I'll check each one and add them to the test
> suite.

See attachment. I would consider all triples in it illegal according to the
N-Triples spec.

If I allow these characters that RFC 1738 calls "unsafe", why then not allow
CR, LF and TAB? And why then allow \\ but not \", which seems to be sanctioned 
by older versions of the spec:

 http://www.w3.org/2001/sw/RDFCore/ntriples/#character

I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
IRIs, e.g.:

<http://www.wikidata.org/entity/P1348v> <http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
<http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D> <http://www.wikidata.org/entity/P18v> <http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg> .

This trial-and-error cleaning of data dumps with self-made scripts, and days
between each try, is very wearing and probably a big deterrent for newcomers.
I had it with DBpedia and now I have it with Wikidata all over again (with
new syntax problems).

Regards,

Michael Brunnbauer


Re: tdbloader2 issues

Posted by Andy Seaborne <an...@apache.org>.
On 31/03/15 12:44, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Tue, Mar 31, 2015 at 01:06:49PM +0200, Michael Brunnbauer wrote:
>>> The spec says
>>> [8] 	IRIREF 	::= 	'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
>>>
>>> so no \n escapes, just \u and \U
>
> \\ is accepted - but \" not.

The NT/NQ parser is more permissive than the standard (people have 
dubious data already loaded, so it's sort of tricky to change too much 
retrospectively).  IIRC \n was legal syntax in original NT as a newline 
escape, but illegal because the IRI is bad.  There are two levels - pure 
tokenization, and whether the IRI follows the IRI rules.

You can check data before loading using "riot", and it should generate 
warnings on bad IRIs that pass the quick and pragmatic tokenization.

>
>> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.

Could you provide a set of data with one feature per NTriple 
line, marking in a comment what you expect, and I'll check each one and 
add them to the test suite.

	Andy

>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbloader2 issues

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Tue, Mar 31, 2015 at 01:06:49PM +0200, Michael Brunnbauer wrote:
> > The spec says
> > [8] 	IRIREF 	::= 	'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
> > 
> > so no \n escapes, just \u and \U

\\ is accepted - but \" not.

> But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.

Regards,

Michael Brunnbauer


Re: tdbloader2 issues

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Tue, Mar 31, 2015 at 10:24:25AM +0100, Andy Seaborne wrote:
> \ is the escape character.
> 
> The spec says
> [8] 	IRIREF 	::= 	'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
> 
> so no \n escapes, just \u and \U

But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.

> The RFC 3896 does not allow newlines.

You mean RFC 3986? It also does not seem to allow any of  " { } | ^ `

> Maybe they mean \\n or %0A.

Doesn't the expression above exclude a literal \ in an IRI?

Regards,

Michael Brunnbauer


Re: tdbloader2 issues

Posted by Andy Seaborne <an...@apache.org>.
On 31/03/15 10:06, Michael Brunnbauer wrote:
>
> hi all,
>
> tdbloader2 will not accept IRIs with CR or LF like this one from the Wikidata
> RDF dump:
>
>   <http://freital.de/index.phtml?La=1&object=tx|530.4535.1&NavID=530.81&sub=0\n>
>
> But it will happily accept IRIs with |{}\\^`"
>
> I guess there is no chance that the Semantic Web community agrees on how a
> valid ntriples/nquads file looks like?

\ is the escape character.

The spec says
[8] 	IRIREF 	::= 	'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

so no \n escapes, just \u and \U

The RFC 3896 does not allow newlines.

Maybe they mean \\n or %0A.

I'm afraid that it's not a toolkit issue.

Could you feed it back to wikidata?

	Andy
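
The IRIREF production above translates directly into a regular 
expression.  A sketch of a tokenizer-level check (it validates the token 
shape only, not RFC 3986 IRI syntax):

```python
import re

# N-Triples (RDF 1.1) production [8]:
#   IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# where UCHAR is \uXXXX or \UXXXXXXXX.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>')

def valid_iriref(token):
    """True if token matches the IRIREF production exactly."""
    return IRIREF.fullmatch(token) is not None
```

By this rule, IRIs containing a literal \n (backslash then "n"), |, or " 
all fail at the token level; only \u and \U escapes pass.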



Re: tdbloader2 issues

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Michael

N-Triples and N-Quads are explicitly defined as 1 triple per line, so CR or
LF characters in an RDF term will always be invalid in these formats.

The on-disk TDB format has been stable since TDB 0.9.0 and so can be used
with any Jena tooling that supports TDB 0.9.0 or higher (though we
recommend you always use the latest versions wherever possible since our
support model for past versions is "please upgrade")

Note that 2.13.0 does provide for some low-level customisation of the
storage, which requires the store to always be read with the same setup as
it was created with.  However, if you are using the stock command line tools
this won't affect you.

Rob

On 31/03/2015 10:06, "Michael Brunnbauer" <br...@netestate.de> wrote:

>
>hi all,
>
>tdbloader2 will not accept IRIs with CR or LF like this one from the
>Wikidata
>RDF dump:
>
> 
> <http://freital.de/index.phtml?La=1&object=tx|530.4535.1&NavID=530.81&sub=0\n>
>
>But it will happily accept IRIs with |{}\\^`"
>
>I guess there is no chance that the Semantic Web community agrees on how a
>valid ntriples/nquads file looks like?
>
>Also, tdbloader2 seems to be gradually slowed down from 100k triples/s to
>< 1000 triples/s on a normal disk drive by random access after ca. 10
>million
>triples. Is this unavoidable? I made this change to tdbloader2 but I
>think it
>is not relevant during the data phase:
>
>-    SORT_ARGS="--buffer-size=50%"
>+    SORT_ARGS="--buffer-size=2048M"
>
>I have tried with Jena 2.13.0 and 2.11.1.
>
>Can a TDB generated with Jena 2.13.0 be used with Fuseki 1.1.1?
>
>Regards,
>
>Michael Brunnbauer
>