Posted to users@jena.apache.org by Jean-Marc Vanel <je...@gmail.com> on 2020/04/24 14:17:32 UTC

Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

How to reproduce with 3.14.0

bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
  --verbose http://dbpedia.org/resource/User_guide

echo "
CONSTRUCT {
 <http://dbpedia.org/resource/User_guide>
  ?P ?O . }
WHERE { GRAPH ?G {
 <http://dbpedia.org/resource/User_guide>
  ?P ?O . } }
LIMIT
# 30 # OK
35 # KO !!!
" > /tmp/const.ql

bin/tdbquery --debug --loc=TDB --query /tmp/const.ql

And here is the *stack*:

16:14:23 ERROR BindingTDB           :: get1(?O)
java.lang.StringIndexOutOfBoundsException: String index out of range: 39
at java.lang.String.charAt(String.java:658)
at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
at org.apache.jena.tdb.lib.NodeLib.decode(NodeLib.java:120)
at org.apache.jena.tdb.lib.NodeLib.fetchDecode(NodeLib.java:97)
at org.apache.jena.tdb.store.nodetable.NodeTableNative.readNodeFromTable(NodeTableNative.java:182)
at org.apache.jena.tdb.store.nodetable.NodeTableNative._retrieveNodeByNodeId(NodeTableNative.java:108)
at org.apache.jena.tdb.store.nodetable.NodeTableNative.getNodeForNodeId(NodeTableNative.java:67)
at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:128)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:126)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:104)
at org.apache.jena.sparql.core.Var.lookup(Var.java:85)
at org.apache.jena.sparql.core.Var.lookup(Var.java:80)
at org.apache.jena.sparql.core.Substitute.substitute(Substitute.java:117)
at org.apache.jena.sparql.core.Substitute.substitute(Substitute.java:73)
at org.apache.jena.sparql.modify.TemplateLib.subst(TemplateLib.java:177)
at org.apache.jena.sparql.modify.TemplateLib$1.apply(TemplateLib.java:83)
at org.apache.jena.sparql.modify.TemplateLib$1.apply(TemplateLib.java:73)
at org.apache.jena.atlas.iterator.Iter$2.next(Iter.java:352)
at org.apache.jena.ext.com.google.common.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1340)
at org.apache.jena.sparql.engine.QueryExecutionBase.execConstruct(QueryExecutionBase.java:219)
at org.apache.jena.sparql.engine.QueryExecutionBase.execConstruct(QueryExecutionBase.java:207)
at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:162)
at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:83)
at arq.query.lambda$queryExec$0(query.java:225)
at org.apache.jena.system.Txn.exec(Txn.java:77)
at org.apache.jena.system.Txn.executeRead(Txn.java:115)
at arq.query.queryExec(query.java:222)
at arq.query.exec(query.java:153)
at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at tdb.tdbquery.main(tdbquery.java:33)

NOTE: the problem does not occur with apache-jena-3.10.0-SNAPSHOT!?


Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
 Chroniques jardin
<http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>

Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Yeah guys,

sorry, I'm dumb and didn't scroll down far enough to see Andy's last inline
comment pointing to TDB as the source of the encoding issue.

Anyway, Andy has already spotted the cause, so as usual it will be fixed
soon, I think.



Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

Posted by Andy Seaborne <an...@apache.org>.
JENA-1890, PR#735

On 25/04/2020 08:34, Lorenz Buehmann wrote:
> Hi,
> 
> I tried with cURL + riot CLI tools manually and can't reproduce the
> parsing issue, neither with RDF/XML nor with Turtle.

The problem is in TDB. In fact, the use of \u is not directly part of the
problem. The parser step works and the database is loaded correctly.


Encoding of URI terms in TDB1 (not TDB2) was added in JENA-1793 (Jena 3.14.0),
using "_" as the hex marker: like %XX, but written as _XX. It allows illegal
URIs (spaces :-() to be handled by the database.

The decoder is also more general: it can decode multibyte codepoints
written as %xx%xx, but (bug) it gets bytes and chars mixed up at one point.

When all the characters before the _ are single-byte in UTF-8 it works,
but "사용_" has multi-byte characters before the _. The decoder then
indexes into the string and can run off the end.

     Andy
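
To make the byte/char mix-up concrete, here is a minimal Python sketch. It is not the Jena code: encode_hex, decode_hex_buggy and decode_hex_fixed are hypothetical names, and the idea that a literal "_" is itself stored as _5F is an assumption about the encoding. Under that assumption the stored form of the Korean URI (without angle brackets) is 39 characters long, and a decoder that locates the marker by byte offset but reads the two hex digits by character index runs off the end of the string, which would be consistent with the index 39 in the stack trace.

```python
MARKER = "_"


def encode_hex(s, unsafe=" _"):
    """Escape each unsafe ASCII character as MARKER plus two hex digits."""
    return "".join(f"{MARKER}{ord(c):02X}" if c in unsafe else c for c in s)


def decode_hex_buggy(s):
    """Walks the UTF-8 bytes, but reads the hex digits from the str."""
    data = s.encode("utf-8")
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != ord(MARKER):
            out.append(data[i])
            i += 1
        else:
            # Bug: i is a byte offset, but s is indexed by character.
            out.append(int(s[i + 1] + s[i + 2], 16))
            i += 3
    return out.decode("utf-8")


def decode_hex_fixed(s):
    """Same loop, but the hex digits are read from the bytes as well."""
    data = s.encode("utf-8")
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != ord(MARKER):
            out.append(data[i])
            i += 1
        else:
            out.append(int(data[i + 1:i + 3].decode("ascii"), 16))
            i += 3
    return out.decode("utf-8")


uri = "http://ko.dbpedia.org/resource/사용_설명서"
stored = encode_hex(uri)  # the marker sits after two 3-byte characters
print(len(stored), len(stored.encode("utf-8")))
print(decode_hex_fixed(stored) == uri)
try:
    decode_hex_buggy(stored)
except IndexError as err:
    print("buggy decoder:", err)
```

With an all-ASCII URI the byte offset and the character index coincide, so the buggy variant happens to work; it only breaks once a multi-byte character precedes the marker.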


Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

Posted by Jean-Marc Vanel <je...@gmail.com>.
As Andy stated, this is not a parsing issue.
Neither riot nor rapper <http://librdf.org/raptor/rapper.html> reports
anything.
The issue is in how TDB reads the URI back once it has been stored in
TDB.




Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Hi,

I tried with cURL + riot CLI tools manually and can't reproduce the
parsing issue, neither with RDF/XML nor with Turtle.

curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf


I know that a few years ago DBpedia (or rather its Virtuoso backend) had
some issues with serialization, but those were fixed a long time ago.

Also, I don't understand what you mean by "suspicious". The parser
easily converts the escaped URIs to UTF-8, as expected:

riot --check /tmp/test.ttl

<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Handleiding> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://cs.dbpedia.org/resource/Uživatelská_příručka> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://wikidata.dbpedia.org/resource/Q1057179> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q1057179> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://ko.dbpedia.org/resource/사용_설명서> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://es.dbpedia.org/resource/Guía_del_usuario> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://ja.dbpedia.org/resource/マニュアル> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://it.dbpedia.org/resource/Manuale> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://rdf.freebase.com/ns/m.04mqbf> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://fr.dbpedia.org/resource/Mode_d'emploi> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://yago-knowledge.org/resource/User_guide> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://de.dbpedia.org/resource/Gebrauchsanleitung> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://id.dbpedia.org/resource/Manual_pengguna> .
<http://dbpedia.org/resource/User_guide> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/User_guide> .


Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

Posted by Jean-Marc Vanel <je...@gmail.com>.
On Fri, 24 Apr 2020 at 22:17, Andy Seaborne <an...@apache.org> wrote:

>
> On 24/04/2020 15:17, Jean-Marc Vanel wrote:
> > How to reproduce with 3.14.0
> >
> > bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
> >    --verbose http://dbpedia.org/resource/User_guide
>
> Did the log say anything?
>

No, nothing special, not even with --debug.

> As this is a remote URL, did it all arrive and parse without warnings?

No warning.

> Was the database fresh or was there data in it to start with?

The database was fresh, of course.


> > echo "
> > CONSTRUCT {
> >   <http://dbpedia.org/resource/User_guide>
> >    ?P ?O . }
> > WHERE { GRAPH ?G {
> >   <http://dbpedia.org/resource/User_guide>
> >    ?P ?O . } }
> > LIMIT
> > # 30 # OK
> > 35 # KO !!!
> > " > /tmp/const.ql
> >
> > bin/tdbquery --debug --loc=TDB --query /tmp/const.ql
> >
> > And here is the *stack*:
> >
> > 16:14:23 ERROR BindingTDB           :: get1(?O)
> > java.lang.StringIndexOutOfBoundsException: String index out of range: 39
> > at java.lang.String.charAt(String.java:658)
> > at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
> > at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
>
> If the load was clean, the database is intact and it is a bug in Jena's
> decoding of a URI. The data has a lot of \u-encoded terms, but it's a URI
> in the object position that is causing the problem.  (I don't see why these
> are encoded - it's not necessary.)
>

Indeed, these URIs look suspect:

<http://fr.dbpedia.org/resource/Mode_d\u0027emploi> ,
<http://es.dbpedia.org/resource/Gu\u00EDa_del_usuario> .

<http://ja.dbpedia.org/resource/\u30DE\u30CB\u30E5\u30A2\u30EB> ,
<http://cs.dbpedia.org/resource/U\u017Eivatelsk\u00E1_p\u0159\u00EDru\u010Dka> ,
<http://ko.dbpedia.org/resource/\uC0AC\uC6A9_\uC124\uBA85\uC11C> .
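
A small aside on the escapes themselves: Turtle allows \uXXXX and \UXXXXXXXX numeric escapes inside <...> IRIs, and a parser turns them into the corresponding characters. The Python sketch below is only an illustration (the helper name unescape_iri is made up, and it handles just these two escape forms); it shows that the escaped Korean IRI above names the same IRI as the one written with the characters directly.

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits)
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_iri(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


escaped = r"http://ko.dbpedia.org/resource/\uC0AC\uC6A9_\uC124\uBA85\uC11C"
print(unescape_iri(escaped))
print(unescape_iri(r"http://fr.dbpedia.org/resource/Mode_d\u0027emploi"))
```

So the escaped form is unusual but equivalent, and by itself it says nothing about whether the data is broken.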



Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

Posted by Andy Seaborne <an...@apache.org>.

On 24/04/2020 15:17, Jean-Marc Vanel wrote:
> How to reproduce with 3.14.0
> 
> bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
>    --verbose http://dbpedia.org/resource/User_guide

Did the log say anything?

As this is a remote URL, did it all arrive and parse without warnings?

Was the database fresh or was there data in it to start with?

> echo "
> CONSTRUCT {
>   <http://dbpedia.org/resource/User_guide>
>    ?P ?O . }
> WHERE { GRAPH ?G {
>   <http://dbpedia.org/resource/User_guide>
>    ?P ?O . } }
> LIMIT
> # 30 # OK
> 35 # KO !!!
> " > /tmp/const.ql
> 
> bin/tdbquery --debug --loc=TDB --query /tmp/const.ql
> 
> And here is the *stack*:
> 
> 16:14:23 ERROR BindingTDB           :: get1(?O)
> java.lang.StringIndexOutOfBoundsException: String index out of range: 39
> at java.lang.String.charAt(String.java:658)
> at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
> at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)

If the load was clean, the database is intact and it is a bug in Jena's
decoding of a URI. The data has a lot of \u-encoded terms, but it's a URI
in the object position that is causing the problem.  (I don't see why these
are encoded - it's not necessary.)

     Andy
