Posted to users@jena.apache.org by Alexandru Todor <to...@inf.fu-berlin.de> on 2011/09/30 17:17:07 UTC

Suspected Bug(s): dealing with UTF8 IRIs in HTTP Sparql Queries

Hi,

I maintain the German language DBpedia endpoint, and have gotten some 
mails from users complaining that they don't get any results from the 
endpoint when they query for resources like:

http://de.dbpedia.org/resource/München

This is the code they sent me:

    String queryString = "SELECT ?o WHERE { "
            + "<http://de.dbpedia.org/resource/München> "
            + "<http://purl.org/dc/terms/subject> ?o }";
    Query query = QueryFactory.create(queryString);
    QueryExecution qexec =
            QueryExecutionFactory.sparqlService("http://de.dbpedia.org/sparql", query);
    try {
        ResultSet results = qexec.execSelect();
        for (; results.hasNext();) {
            QuerySolution s = results.nextSolution();
            System.out.println(s.toString());
        }
    }
    finally {
        qexec.close();
    }

I tried the code and it works for any IRI that contains only ASCII 
characters (so only for plain URIs), but when the IRI contains non-ASCII 
characters it returns no results. I've tried a couple of variations; each 
returns no results but also doesn't throw any kind of exception. It's just 
as if the data wasn't there.

Then I tried an alternative method and used QueryEngineHTTP to execute 
the query, and it worked. However, QueryEngineHTTP messes up the UTF-8 
encoding, so for example in the returned results you get MÃ¼nchen instead 
of München. My guess is that QueryEngineHTTP treats the SPARQL results as 
ISO-8859-1 instead of UTF-8; turning the strings back into bytes as 
ISO-8859-1 and decoding those bytes as UTF-8 fixed this.
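
For illustration, the workaround described above could look roughly like 
the sketch below. This is not the code that was actually used: the class 
and method names are made up, and it assumes the result strings are UTF-8 
bytes that were decoded as ISO-8859-1 somewhere along the way. Unicode 
escapes are used so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jena.
public class MojibakeRepair {

    // Turn the string back into the bytes it was (wrongly) decoded from,
    // then decode those bytes as UTF-8.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "M\u00C3\u00BCnchen" is "MÃ¼nchen": the UTF-8 bytes of "München"
        // read as ISO-8859-1.
        String garbled = "M\u00C3\u00BCnchen";
        String fixed = repair(garbled);
        // Compare against "München" written with an escape for ü.
        System.out.println(fixed.equals("M\u00FCnchen")); // prints "true"
    }
}
```

This only helps when the garbling happened exactly once and in exactly 
this way; applied to a correctly decoded string, it would damage it.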

Kind Regards,
Alexandru Todor

Research Associate
AG Corporate Semantic Web
Freie Universität Berlin






Re: Suspected Bug(s): dealing with UTF8 IRIs in HTTP Sparql Queries

Posted by Andy Seaborne <an...@apache.org>.
On 01/10/11 21:48, Andy Seaborne wrote:
> On 30/09/11 17:08, Alexandru Todor wrote:
>> Hi,
>>
>> Seems to be a Virtuoso issue with the RDF/XML serializer. Both the Greek
>> and German endpoints seem to have the garbled data in the XML files and
>> this issue only arises when choosing RDF/XML as output. Thanks for the
>> tip, I'll report the issue to the Virtuoso devs.
>
> Could you also report that
>
> 1/ asking for N-triples does not return N-triples. It returns something
> Turtle-ish.
>
> 2/ The SPARQL XML results has the same encoding problem as RDF/XML.
>
> These have somewhat slowed down the bug hunting.
>
>> There's still the problem with QueryExecutionFactory.sparqlService
>> returning no results.
>
> Yes - I found it (in turning queries into strings). I need to do some
> careful testing to make sure the fix does not break something elsewhere
> that incorrectly depends on the effect.
>
> Andy

Fix committed to Apache SVN.

	Andy


Re: Suspected Bug(s): dealing with UTF8 IRIs in HTTP Sparql Queries

Posted by Andy Seaborne <an...@apache.org>.
On 30/09/11 17:08, Alexandru Todor wrote:
> Hi,
>
> Seems to be a Virtuoso issue with the RDF/XML serializer. Both the Greek
> and German endpoints seem to have the garbled data in the XML files and
> this issue only arises when choosing RDF/XML as output. Thanks for the
> tip, I'll report the issue to the Virtuoso devs.

Could you also report that

1/ asking for N-triples does not return N-triples.  It returns something 
Turtle-ish.

2/ The SPARQL XML results have the same encoding problem as RDF/XML.

These have somewhat slowed down the bug hunting.

> There's still the problem with QueryExecutionFactory.sparqlService
> returning no results.

Yes - I found it (in turning queries into strings).  I need to do some 
careful testing to make sure the fix does not break something elsewhere 
that incorrectly depends on the effect.

	Andy





Re: Suspected Bug(s): dealing with UTF8 IRIs in HTTP Sparql Queries

Posted by Alexandru Todor <to...@inf.fu-berlin.de>.
Hi,

Seems to be a Virtuoso issue with the RDF/XML serializer. Both the Greek 
and German endpoints seem to have the garbled data in the XML files and 
this issue only arises when choosing RDF/XML as output. Thanks for the 
tip, I'll report the issue to the Virtuoso devs.

There's still the problem with QueryExecutionFactory.sparqlService 
returning no results.

Kind Regards,
Alexandru



Re: Suspected Bug(s): dealing with UTF8 IRIs in HTTP Sparql Queries

Posted by Andy Seaborne <an...@apache.org>.
On 30/09/11 16:17, Alexandru Todor wrote:
> Hi,
>
> I maintain the German language DBpedia endpoint, and have gotten some
> mails from users complaining that they don't get any results from the
> endpoint when they query for resources like:
>
> http://de.dbpedia.org/resource/München

This message and your message are ISO-8859-1.

ü = 0xFC in ISO-8859-1, which is the same as its Unicode code point 
(U+00FC), and 0xC3 0xBC in UTF-8.
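
A small sketch to check those byte values, using only the standard Java 
library. The class name and the hex helper are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for inspecting the encoded bytes of "ü".
public class UmlautBytes {

    // Format bytes as space-separated hex, e.g. "0xC3 0xBC".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String u = "\u00FC"; // ü, code point U+00FC
        System.out.println(hex(u.getBytes(StandardCharsets.ISO_8859_1))); // 0xFC
        System.out.println(hex(u.getBytes(StandardCharsets.UTF_8)));      // 0xC3 0xBC
    }
}
```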

I tried http://de.dbpedia.org/resource/München in my browser and got 
redirected to

http://de.dbpedia.org/data/M%C3%BCnchen.xml

which returns RDF/XML in UTF-8, but it contains e.g. on line 3:

rdf:resource="http://de.dbpedia.org/resource/München"

as displayed in Firefox.  That looks corrupt to me.

> This is the code they sent me:
>
> String queryString= "SELECT ?o WHERE
> {<http://de.dbpedia.org/resource/München>
> <http://purl.org/dc/terms/subject> ?o }";
> Query query = QueryFactory.create(queryString);
> QueryExecution qexec =
> QueryExecutionFactory.sparqlService("http://de.dbpedia.org/sparql", query);
> try {
> ResultSet results = qexec.execSelect();
> for (; results.hasNext();) {
> QuerySolution s = results.nextSolution();
> System.out.println(s.toString());
> }
> }
> finally {
> qexec.close();
> }
>
> I tried the code and it works for any IRI that contains no UTF8 chars
> (so only for URIs), but when you have UTF8 chars it returns no result.
> I've tried a couple of variations and it returns no result but also
> doesn't throw any kind of exception, it's just as if the data wasn't there.
>
> Then I proceeded to try an alternative method and used QueryEngineHTTP
> to execute the query and it worked. However, QueryEngineHTTP messes up
> the UTF8 encoding, so for example in the returned results you get
> MÃ¼nchen instead of München . My guess is that QueryEngineHTTP encodes
> the SPARQL results in ISO-8859-1 instead of UTF8, so decoding the
> strings as ISO-8859-1 and re-encoding it as UTF8 fixed this.

The code seems to do:

URLEncoder.encode(s, "UTF-8")

but at that point it's still working with strings.  Something lower level 
(Sun networking) does the string-to-bytes conversion.
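
To show what that call produces on its own, here is a minimal standalone 
sketch. It is not the Jena code; the class and method names are made up. 
It percent-encodes the same text with two charsets for comparison.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not taken from Jena.
public class EncodeIri {

    // Wrap URLEncoder.encode so callers need not handle the checked exception.
    public static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "M\u00FCnchen"; // "München", with ü written as an escape
        System.out.println(encode(name, "UTF-8"));      // M%C3%BCnchen
        System.out.println(encode(name, "ISO-8859-1")); // M%FCnchen
    }
}
```

Either way the result is a Java String made only of ASCII characters, so 
a later string-to-bytes step should give the same bytes in any 
ASCII-compatible charset.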

	Andy
