Posted to users@jena.apache.org by Ziya Akar <zi...@gmail.com> on 2012/04/27 10:56:50 UTC

Illegal character problem when querying remote dataset

Hi,

I have a problem when querying a remote dataset: execution is interrupted
with the following exception.

com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 7))
 at [row,col {unknown-source}]: [61661,133]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:679)
at com.hp.hpl.jena.sparql.resultset.XMLInputStAX$ResultSetStAX.getOneSolution(XMLInputStAX.java:464)
at com.hp.hpl.jena.sparql.resultset.XMLInputStAX$ResultSetStAX.hasNext(XMLInputStAX.java:217)
        ...
        ...

How can I handle this problem?


Best regards,

Ziya

Re: Illegal character problem when querying remote dataset

Posted by Andy Seaborne <an...@apache.org>.
On 27/04/12 19:35, Sean K wrote:
> I had a similar problem.
>
> When I stored lat long coordinates into a RDF file  (32°37'34.57"N for

Your email is ISO-8859-1 ....

> example), the jena libraries could not load them.

Did you declare the encoding as ISO-8859-1 in the XML?

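Byte B0 (the degree sign in ISO-8859-1) is not valid on its own in UTF-8,
where it can only appear as a continuation byte inside a multi-byte
sequence (the degree sign itself is C2 B0 in UTF-8).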
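
A small sketch of the byte-level difference, using only the JDK. The
class name DegreeSignBytes is made up for illustration and is not part of
Jena. It prints the bytes of the degree sign in each encoding and shows
what happens when the single ISO-8859-1 byte is read back as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class DegreeSignBytes {
    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String degree = "\u00B0"; // DEGREE SIGN
        System.out.println(hex(degree.getBytes(StandardCharsets.ISO_8859_1))); // B0
        System.out.println(hex(degree.getBytes(StandardCharsets.UTF_8)));      // C2 B0

        // A lone B0 byte is malformed UTF-8; the lenient String constructor
        // substitutes the replacement character U+FFFD instead of failing.
        String misread = new String(new byte[] {(byte) 0xB0}, StandardCharsets.UTF_8);
        System.out.println(misread.equals("\uFFFD")); // true
    }
}
```

So a file written as ISO-8859-1 but read as UTF-8 either fails a strict
parser or silently loses the character.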

>
> In particular when I have a process receiving data or scanning data on
> other sites or endpoints, I thought that XML would complain about xml
> delimiters only.

XML parsing is done by Xerces both for RDF/XML and SPARQL result sets.

XML has more rules than just delimiters.  Xerces implements them.

(By the way, the issue of illegal chars becoming legal in XML 1.1 is
stated as one of the reasons for the lack of take-up of XML 1.1 --
people are relying on strict checking and don't want to change to a more
permissive set of characters, as the rest of their workflow relies on
XML checking.)

> What would be the best strategies in populating a RDF graph store?
> What kind of filtering should be done?

Check all data before loading.

Turtle is only required to be UTF-8.
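
One possible pre-load check, sketched with the JDK only: decode the bytes
strictly as UTF-8 and reject the input if that fails. The class and
method names (Utf8Check, isValidUtf8) are hypothetical, and this checks
the encoding only, not the RDF syntax.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String coordinate = "32\u00B037'34.57\"N";
        byte[] asUtf8 = coordinate.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = coordinate.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(asUtf8));   // true
        System.out.println(isValidUtf8(asLatin1)); // false
    }
}
```

For a real file the bytes could come from Files.readAllBytes; for large
inputs a streaming decoder would be a better fit than this in-memory
sketch.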

Remember, even if it works for you, passing it on to another app may go 
wrong otherwise.  That's standards for you!

	Andy



Re: Illegal character problem when querying remote dataset

Posted by Sean K <sk...@gmail.com>.
I had a similar problem.

When I stored latitude/longitude coordinates in an RDF file
(32°37'34.57"N for example), the Jena libraries could not load them.

In particular, for a process that receives or scans data from other
sites or endpoints, I had assumed that XML would only complain about
XML delimiters.

What would be the best strategy for populating an RDF graph store?
What kind of filtering should be done?




Re: Illegal character problem when querying remote dataset

Posted by Andy Seaborne <an...@apache.org>.
What is the remote query service?  Is it DBpedia by any chance?

A similar problem was reported quite recently.

If it's a Fuseki endpoint, you can force the output format with 
"&output=json" but that's non-standard.

	Andy



Re: Illegal character problem when querying remote dataset

Posted by Damian Steer <d....@bristol.ac.uk>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 27/04/12 09:56, Ziya Akar wrote:
> Hi,

Hi Ziya,

> i have a problem when querying a remote dataset and my execution is
> interrupted.
> 
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character 
> ((CTRL-CHAR, code 7)) at [row,col {unknown-source}]: [61661,133] at
> com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)

Ugh.

Certain characters are illegal in XML, and simply can't be written.
Your remote dataset contains such a character unfortunately, so some
results can't be transferred over the network in the standard way
(that is, the SPARQL XML result format).

> How can i handle this problem?

The remote data is probably broken (these characters are typically
useless), so report the issue to the upstream provider. Their server
is also producing broken XML, and ought to be fixed (although this
wouldn't help you, it would just produce a 500 error).

You could try asking for an alternate result serialisation like JSON.
I don't think there's a convenient way to do that via the query
execution factory (?), so you'd need to write more code (although we
can help, of course).
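
As a rough sketch of what "more code" might look like, the query can be
sent over plain HTTP with an Accept header asking for SPARQL JSON
results. This bypasses Jena's query execution classes altogether and
assumes Java 11 or later for java.net.http. The endpoint URL, the class
name SparqlJsonRequest and the sample query are placeholders, and whether
a given server honours the Accept header is an assumption.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class SparqlJsonRequest {
    // Build a GET request for a SPARQL query, asking for JSON results.
    static HttpRequest build(String endpoint, String query) {
        String url = endpoint + "?query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder endpoint and query; substitute the real ones.
        HttpRequest request = build("http://example.org/sparql",
                "SELECT * WHERE { ?s ?p ?o } LIMIT 10");
        System.out.println(request.uri());
        // To send it (network access needed):
        // HttpClient.newHttpClient()
        //     .send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The JSON text that comes back still has to be parsed, either by a JSON
library or by whatever JSON result set reader the Jena version in use
provides.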

You could also try filtering the bad characters out before XML
processing. Once again, more work I'm afraid.
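
A minimal sketch of such a filter, assuming the text is already
available as a Java String. It keeps only code points allowed by the
Char production of XML 1.0 (tab, LF, CR, U+0020 to U+D7FF, U+E000 to
U+FFFD, and U+10000 to U+10FFFF). The class name XmlCharFilter is made
up, and wiring it in front of the result set parser is not shown.

```java
// Hypothetical helper class, for illustration only.
final class XmlCharFilter {
    // True if the code point matches the Char production of XML 1.0.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copy the input, dropping every code point that XML 1.0 forbids.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlCharFilter::isLegalXml10)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Code 7 (BEL) is the character reported in the exception above.
        String dirty = "bell" + (char) 7 + "here";
        System.out.println(strip(dirty)); // bellhere
    }
}
```

Note that this silently changes the data, so it is a workaround rather
than a fix for the upstream problem.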

Damian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk+aZSsACgkQAyLCB+mTtymDiACggsbxtdhaifaWOmky9tnl9S6Z
+O8AnA/D0knRRj3IFNNuWF0otrAT3n6N
=NIk9
-----END PGP SIGNATURE-----