Posted to users@jena.apache.org by Olivier Rossel <ol...@gmail.com> on 2012/10/29 16:56:20 UTC

Surviving a UTF-8 parsing error.

Sometimes my CONSTRUCTs retrieve wrongly encoded UTF-8 content.
When Xerces/Jena parses such XML data, it returns an XML parsing error.

Is it a common issue?
Could we imagine a workaround, so the parsing does not fail on UTF-8
encoding errors?

Maybe pre-parse and fix any UTF-8 inconsistencies before the XML parsing...

Re: Surviving a UTF-8 parsing error.

Posted by "Dan B." <da...@kempt.net>.
Andy Seaborne wrote:
> On 29/10/12 15:56, Olivier Rossel wrote:
>> Sometimes my CONSTRUCTs retrieve wrongly encoded UTF-8 content.
>> When Xerces/Jena parses such XML data,  it returns an XML parsing error.
>>
>> Is it a common issue?
>> Could we imagine a workaround, so the parsing does not fail on UTF-8
>> encoding errors?
>>
>> May be preparse and fix any UTF-8 inconsistencies before the XML parsing...
>>
>
> The conversion from bytes to chars is done inside Xerces and is not recoverable.
>
> Testing first is better - there is a command riotcmd.utf8 that checks a file.
>
> (The non-RDF/XML parsers use java conversion but the issue remains - it's not recoverable albeit because the standard decoders buffer and don't say where the encoding problem was).

Also, note that rejecting invalid UTF-8 sequences is recommended
(required, actually, in some specifications) for security.

(You don't want an input validator and a later input-processing stage
to interpret invalid UTF-8 byte sequences differently, so the usual
rule is that an invalid UTF-8 byte sequence must result in an error.)
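
To illustrate the point above, here is a minimal Java sketch (not taken from Jena or Xerces; the class and method names are hypothetical) contrasting a lenient decode, which silently substitutes U+FFFD, with a strict decode, which raises an error. The sample bytes 0xC0 0xAF are an overlong encoding of "/", which a conforming UTF-8 decoder must reject.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictUtf8 {

    /** Strict decode: any malformed byte sequence raises CharacterCodingException. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Lenient decode: malformed sequences are silently replaced with U+FFFD. */
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', then an overlong (invalid) encoding of '/', then 'b'.
        byte[] bad = {(byte) 'a', (byte) 0xC0, (byte) 0xAF, (byte) 'b'};

        System.out.println("lenient: " + decodeLenient(bad));
        try {
            System.out.println("strict: " + decodeStrict(bad));
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected (" + e + ")");
        }
    }
}
```

If one component decodes leniently and another strictly, the two can disagree about what the input contains, which is the situation the rule is meant to prevent.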


Daniel

Re: Surviving a UTF-8 parsing error.

Posted by Andy Seaborne <an...@apache.org>.
On 29/10/12 15:56, Olivier Rossel wrote:
> Sometimes my CONSTRUCTs retrieve wrongly encoded UTF-8 content.
> When Xerces/Jena parses such XML data,  it returns an XML parsing error.
>
> Is it a common issue?
> Could we imagine a workaround, so the parsing does not fail on UTF-8
> encoding errors?
>
> May be preparse and fix any UTF-8 inconsistencies before the XML parsing...
>

The conversion from bytes to chars is done inside Xerces and is not 
recoverable.

Testing first is better - there is a command riotcmd.utf8 that checks a 
file.

(The non-RDF/XML parsers use Java's own conversion, but the issue 
remains: it's not recoverable, in this case because the standard 
decoders buffer and don't say where the encoding problem was.)
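
For anyone who wants to test a file before parsing and also learn where the problem is, a pre-check can be sketched with java.nio directly. This is only a rough sketch under stated assumptions: it is not how riotcmd.utf8 is implemented, the class and method names are hypothetical, and it reads the whole file into memory. It drives a CharsetDecoder by hand in REPORT mode so that the input buffer's position marks the first malformed sequence.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
class Utf8Check {

    /**
     * Returns the byte offset of the first malformed UTF-8 sequence,
     * or -1 if the whole input is well-formed.
     */
    static int firstInvalidOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(4096);

        while (true) {
            // endOfInput = true: a truncated trailing sequence counts as malformed.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error, the buffer position is at the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            // Overflow: we only validate, so discard decoded chars and continue.
            out.clear();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        int offset = firstInvalidOffset(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "invalid UTF-8 at byte offset " + offset);
    }
}
```

Knowing the offset makes it possible to inspect or correct the source data by hand, rather than having a parser guess at a repair.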

	Andy