Posted to c-users@xerces.apache.org by Stephen Collyer <sc...@netspinner.co.uk> on 2008/06/22 13:40:16 UTC
SAX2 parser: encoding="UTF-8" breaks validation
I have a SAX2 parser which is exhibiting odd behaviour.
If I give it some XML with an XML declaration like:
<?xml version="1.0" encoding="UTF-8" ?>
it fails with an "Invalid document structure" error.
If I remove the encoding declaration, then it parses correctly.
Can anyone suggest what the problem is? I'm assuming
that this is some interaction between the validator and
the encoding, but I'm baffled as to what, precisely.
This is occurring with Xerces-C 2.7.0.
--
Regards
Steve Collyer
Netspinner Ltd
Re: SAX2 parser: encoding="UTF-8" breaks validation
Posted by David Bertoni <db...@apache.org>.
Stephen Collyer wrote:
> David Bertoni wrote:
>> Stephen Collyer wrote:
>>> I have a SAX2 parser which is exhibiting odd behaviour.
>>>
>>> If I give it some XML with an XML declaration like:
>>>
>>> <?xml version="1.0" encoding="UTF-8" ?>
>>>
>>> it fails with an "Invalid document structure" error.
>>> If I remove the encoding declaration, then it parses correctly.
>> This is quite strange, since the parser will assume the encoding is
>> UTF-8 without an encoding declaration. The only case where I could
>> imagine this might happen is with a UTF-16 document with an encoding
>> declaration that indicates a byte-oriented encoding. You can verify
>> this by looking at a binary dump of the XML stream.
>
> Dave, thanks for that - I suspect I know what the problem is.
> I am, in fact, handing Xerces a UTF-16 document with an encoding
> declaration that says UTF-8 - is that what you mean by a
> "byte-oriented encoding", i.e. a variable-length encoding?
Yes. UTF-8 is byte-oriented. OK, well it's octet-oriented, but either
way, they are bytes in C++. UTF-16 is also a variable-length encoding,
but it uses 16-bit code units.
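To make the code-unit distinction concrete, here is a minimal sketch that counts how many code units a single Unicode code point needs in each encoding. The function names `utf8Units` and `utf16Units` are made up for illustration and are not part of Xerces or Qt.

```cpp
#include <cstddef>
#include <cstdint>

// Number of 8-bit code units (bytes) needed to encode one code point in UTF-8.
// Assumes cp is a valid Unicode scalar value (no range checking is done).
std::size_t utf8Units(std::uint32_t cp)
{
    if (cp < 0x80)    return 1;   // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes
}

// Number of 16-bit code units needed to encode one code point in UTF-16:
// one unit inside the BMP, a surrogate pair (two units) above it.
std::size_t utf16Units(std::uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}
```

For example, U+0041 ("A") takes 1 UTF-8 unit and 1 UTF-16 unit, U+20AC (the euro sign) takes 3 and 1, and U+1D11E (musical G clef) takes 4 and 2. Both encodings are variable-length; only the unit size differs.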
>
> The reason for this is that I am receiving a document in UTF-8 with
> a declaration that indicates UTF-8, but I'm transcoding it to UTF-16
> early on to make it fit in a Qt QString (I'm using the Trolltech Qt
> libs). However, of course, if I hand that off to Xerces, the encoding
> declaration no longer matches the true encoding, which I guess is the
> cause of the problem. This only dawned on me after I'd read your comment.
Unfortunately, it's not a very good error message. If you want to, you
can create a Jira issue so we can possibly fix it one of these days.
>
> The only way I can see to fix this is to edit the declaration in code.
> Or can I tell Xerces to ignore it somehow? Advice appreciated.
The easiest way is to set the encoding on the InputSource to "UTF-16",
which will force the parser to use that encoding.
Dave
Re: SAX2 parser: encoding="UTF-8" breaks validation
Posted by Stephen Collyer <sc...@netspinner.co.uk>.
David Bertoni wrote:
> Stephen Collyer wrote:
>> I have a SAX2 parser which is exhibiting odd behaviour.
>>
>> If I give it some XML with an XML declaration like:
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>>
>> it fails with an "Invalid document structure" error.
>> If I remove the encoding declaration, then it parses correctly.
> This is quite strange, since the parser will assume the encoding is
> UTF-8 without an encoding declaration. The only case where I could
> imagine this might happen is with a UTF-16 document with an encoding
> declaration that indicates a byte-oriented encoding. You can verify
> this by looking at a binary dump of the XML stream.
Dave, thanks for that - I suspect I know what the problem is.
I am, in fact, handing Xerces a UTF-16 document with an encoding
declaration that says UTF-8 - is that what you mean by a
"byte-oriented encoding", i.e. a variable-length encoding?
The reason for this is that I am receiving a document in UTF-8 with
a declaration that indicates UTF-8, but I'm transcoding it to UTF-16
early on to make it fit in a Qt QString (I'm using the Trolltech Qt
libs). However, of course, if I hand that off to Xerces, the encoding
declaration no longer matches the true encoding, which I guess is the
cause of the problem. This only dawned on me after I'd read your comment.
The only way I can see to fix this is to edit the declaration in code.
Or can I tell Xerces to ignore it somehow? Advice appreciated.
--
Regards
Steve Collyer
Netspinner Ltd
Re: SAX2 parser: encoding="UTF-8" breaks validation
Posted by David Bertoni <db...@apache.org>.
Stephen Collyer wrote:
> I have a SAX2 parser which is exhibiting odd behaviour.
>
> If I give it some XML with an XML declaration like:
>
> <?xml version="1.0" encoding="UTF-8" ?>
>
> it fails with an "Invalid document structure" error.
> If I remove the encoding declaration, then it parses correctly.
This is quite strange, since the parser will assume the encoding is
UTF-8 without an encoding declaration. The only case where I could
imagine this might happen is with a UTF-16 document with an encoding
declaration that indicates a byte-oriented encoding. You can verify
this by looking at a binary dump of the XML stream.
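If a hex dump is inconvenient, the same check can be done in code. The sketch below looks at the first few bytes of the stream, using a subset of the auto-detection patterns from Appendix F of the XML 1.0 specification (UCS-4 and EBCDIC are left out). The function name `sniffXmlEncoding` and the returned labels are made up for illustration; this is not how Xerces itself is implemented.

```cpp
#include <cstddef>
#include <string>

// Guess the encoding family of an XML stream from its leading bytes.
std::string sniffXmlEncoding(const unsigned char* p, std::size_t n)
{
    // Byte order marks.
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8 (BOM)";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE (BOM)";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE (BOM)";

    // No BOM: look for "<?" encoded as 16-bit units.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x3C && p[2] == 0x00 && p[3] == 0x3F) return "UTF-16BE";
    if (n >= 4 && p[0] == 0x3C && p[1] == 0x00 && p[2] == 0x3F && p[3] == 0x00) return "UTF-16LE";

    // "<?xm" as single bytes: UTF-8 or another ASCII-compatible encoding.
    if (n >= 4 && p[0] == 0x3C && p[1] == 0x3F && p[2] == 0x78 && p[3] == 0x6D)
        return "byte-oriented (e.g. UTF-8)";

    return "unknown";
}
```

A stream that starts with `3C 00 3F 00` (or `00 3C 00 3F`) while its declaration says encoding="UTF-8" is the mismatch described above.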
>
> Can anyone suggest what the problem is? I'm assuming
> that this is some interaction between the validator and
> the encoding, but I'm baffled as to what, precisely.
Encoding detection and parsing happen at a lower level than validation.
That's also not an error from the validation code -- it's an
indication that the parser has found something wrong with the
fundamental structure of the XML document.
Can you post more details of what your code looks like, and how the
parser is configured? Also, if you can post a trivial document that
reproduces the error, that would help.
Dave
Re: SAX2 parser: encoding="UTF-8" breaks validation
Posted by Jason Stewart <ja...@gmail.com>.
Ah, thanks.
Cheers, jas.
On Mon, Jun 23, 2008 at 12:50 AM, Michael Glavassevich
<mr...@ca.ibm.com> wrote:
> Nope. The correct spelling is "UTF-8" [1] which is what Stephen has in his
> document.
>
> [1] http://www.iana.org/assignments/character-sets
Re: SAX2 parser: encoding="UTF-8" breaks validation
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Nope. The correct spelling is "UTF-8" [1] which is what Stephen has in his
document.
[1] http://www.iana.org/assignments/character-sets
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
"Jason Stewart" <ja...@gmail.com> wrote on 06/22/2008 01:47:30 PM:
> Hey Stephen,
>
> I believe "UTF8" is the correct spelling. Can someone else confirm
> that? Or is this a red herring?
>
> Cheers, jas.
>
> On Sun, Jun 22, 2008 at 5:10 PM, Stephen Collyer
> <sc...@netspinner.co.uk> wrote:
> > I have a SAX2 parser which is exhibiting odd behaviour.
> >
> > If I give it some XML with an XML declaration like:
> >
> > <?xml version="1.0" encoding="UTF-8" ?>
> >
> > it fails with an "Invalid document structure" error.
> > If I remove the encoding declaration, then it parses correctly.
> >
> > Can anyone suggest what the problem is? I'm assuming
> > that this is some interaction between the validator and
> > the encoding, but I'm baffled as to what, precisely.
> >
> > This is occurring with Xerces-C 2.7.0.
> >
> > --
> > Regards
> >
> > Steve Collyer
> > Netspinner Ltd
Re: SAX2 parser: encoding="UTF-8" breaks validation
Posted by Jason Stewart <ja...@gmail.com>.
Hey Stephen,
I believe "UTF8" is the correct spelling. Can someone else confirm
that? Or is this a red herring?
Cheers, jas.
On Sun, Jun 22, 2008 at 5:10 PM, Stephen Collyer
<sc...@netspinner.co.uk> wrote:
> I have a SAX2 parser which is exhibiting odd behaviour.
>
> If I give it some XML with an XML declaration like:
>
> <?xml version="1.0" encoding="UTF-8" ?>
>
> it fails with an "Invalid document structure" error.
> If I remove the encoding declaration, then it parses correctly.
>
> Can anyone suggest what the problem is? I'm assuming
> that this is some interaction between the validator and
> the encoding, but I'm baffled as to what, precisely.
>
> This is occurring with Xerces-C 2.7.0.
>
> --
> Regards
>
> Steve Collyer
> Netspinner Ltd
>