Posted to c-users@xerces.apache.org by Stephen Collyer <sc...@netspinner.co.uk> on 2008/06/22 13:40:16 UTC

SAX2 parser: encoding="UTF-8" breaks validation

I have a SAX2 parser which is exhibiting odd behaviour.

If I give it some XML with an XML declaration like:

<?xml version="1.0" encoding="UTF-8" ?>

it fails with an "Invalid document structure" error.
If I remove the encoding declaration, then it parses correctly.

Can anyone suggest what the problem is? I'm assuming
that this is some interaction between the validator and
the encoding, but I'm baffled as to what, precisely.

This is occurring with Xerces-C 2.7.0.

-- 
Regards

Steve Collyer
Netspinner Ltd

Re: SAX2 parser: encoding="UTF-8" breaks validation

Posted by David Bertoni <db...@apache.org>.
Stephen Collyer wrote:
> David Bertoni wrote:
>> Stephen Collyer wrote:
>>> I have a SAX2 parser which is exhibiting odd behaviour.
>>>
>>> If I give it some XML with an XML declaration like:
>>>
>>> <?xml version="1.0" encoding="UTF-8" ?>
>>>
>>> it fails with a "Invalid document structure" error.
>>> If I remove the encoding element, then it parses correctly.
>> This is quite strange, since the parser will assume the encoding is
>> UTF-8 without an encoding declaration.  The only case where I could
>> imagine this might happen is with a UTF-16 document with an encoding
>> declaration that indicates a byte-oriented encoding.  You can verify
>> this by looking at a binary dump of the XML stream.
> 
> Dave, thanks for that - I suspect I know what the problem is.
> I am, in fact, handing Xerces a UTF-16 document with an encoding
> that says UTF-8 - is that what you mean by a "byte oriented encoding"
> i.e a variable length encoding ?
Yes.  UTF-8 is byte-oriented.  OK, well it's octet-oriented, but either 
way, they are bytes in C++.  UTF-16 is also a variable-length encoding, 
but it uses 16-bit code units.
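
To make the difference in code units concrete, here is a minimal sketch
that encodes a single code point both ways. The function names
encodeUtf8 and encodeUtf16 are made up for illustration and are not part
of Xerces or Qt; the sketch assumes a valid Unicode scalar value and
does no error checking.

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// The result holds 1 to 4 code units of 8 bits each.
// Assumes cp is a valid scalar value (no surrogates, <= 0x10FFFF).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16.
// The result holds 1 or 2 code units of 16 bits each.
std::u16string encodeUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        // Code points above the BMP become a surrogate pair.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

For example, U+20AC takes three code units in UTF-8 but one in UTF-16,
while U+1F600 takes four in UTF-8 and two (a surrogate pair) in UTF-16.
Both encodings are variable-length; only the size of the code unit
differs.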

> 
> The reason for this is that I am receiving a document in UTF-8 with
> a decln that indicates UTF-8, but I'm transcoding it to UTF-16 early
> on to make it fit in a Qt QString (I'm using the Trolltech Qt libs).
> However, of course, if I hand that off to Xerces, the encoding decln
> no longer matches the true encoding, which I guess is the cause of
> the problem. This only dawned on me after I'd read your comment.
Unfortunately, it's not a very good error message.  If you want to, you 
can create a Jira issue so we can possibly fix it one of these days.

> 
> The only way I can see to fix this is to edit the decln in code.
> Or can I tell Xerces to ignore it somehow ? Advice appreciated.
The easiest way is to set the encoding on the InputSource to "UTF-16", 
which will force the parser to use that encoding.

Dave

Re: SAX2 parser: encoding="UTF-8" breaks validation

Posted by Stephen Collyer <sc...@netspinner.co.uk>.
David Bertoni wrote:
> Stephen Collyer wrote:
>> I have a SAX2 parser which is exhibiting odd behaviour.
>>
>> If I give it some XML with an XML declaration like:
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>>
>> it fails with a "Invalid document structure" error.
>> If I remove the encoding element, then it parses correctly.
> This is quite strange, since the parser will assume the encoding is
> UTF-8 without an encoding declaration.  The only case where I could
> imagine this might happen is with a UTF-16 document with an encoding
> declaration that indicates a byte-oriented encoding.  You can verify
> this by looking at a binary dump of the XML stream.

Dave, thanks for that - I suspect I know what the problem is.
I am, in fact, handing Xerces a UTF-16 document with an encoding
declaration that says UTF-8. Is that what you mean by a "byte-oriented
encoding", i.e. a variable-length encoding?

The reason for this is that I am receiving a document in UTF-8 with
a declaration that indicates UTF-8, but I'm transcoding it to UTF-16
early on to make it fit in a Qt QString (I'm using the Trolltech Qt
libs). However, of course, if I hand that off to Xerces, the encoding
declaration no longer matches the true encoding, which I guess is the
cause of the problem. This only dawned on me after I'd read your comment.

The only way I can see to fix this is to edit the declaration in code.
Or can I tell Xerces to ignore it somehow? Advice appreciated.
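
For reference, the "edit the declaration in code" idea could look
roughly like the sketch below. It works on a std::u16string rather than
a QString so that it needs only the standard library, and the function
name retargetDeclaredEncoding is made up for illustration. It is a
simplified scan, not a full XML declaration parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace the encoding name in the XML declaration
// of a document held as UTF-16. Returns the input unchanged when there
// is no XML declaration or no encoding pseudo-attribute in it.
std::u16string retargetDeclaredEncoding(std::u16string doc,
                                        const std::u16string& newName) {
    const std::u16string declStart = u"<?xml";
    const std::size_t npos = std::u16string::npos;

    // Skip a leading byte order mark, if present.
    std::size_t start =
        (!doc.empty() && doc[0] == static_cast<char16_t>(0xFEFF)) ? 1 : 0;

    if (doc.compare(start, declStart.size(), declStart) != 0) return doc;

    // "<?xml" must be followed by whitespace, otherwise this is some
    // other processing instruction such as "<?xml-stylesheet".
    std::size_t after = start + declStart.size();
    if (after >= doc.size()) return doc;
    char16_t c = doc[after];
    if (c != u' ' && c != u'\t' && c != u'\r' && c != u'\n') return doc;

    std::size_t declEnd = doc.find(u"?>", start);
    if (declEnd == npos) return doc;

    std::size_t key = doc.find(u"encoding", start);
    if (key == npos || key > declEnd) return doc;

    // Locate the quoted value; either quote character is allowed.
    std::size_t open = doc.find_first_of(u"\"'", key);
    if (open == npos || open > declEnd) return doc;
    std::size_t close = doc.find(doc[open], open + 1);
    if (close == npos || close > declEnd) return doc;

    doc.replace(open + 1, close - open - 1, newName);
    return doc;
}
```

Calling it with u"UTF-16" as the new name turns encoding="UTF-8" into
encoding="UTF-16" and leaves the rest of the document untouched.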

-- 
Regards

Steve Collyer
Netspinner Ltd

Re: SAX2 parser: encoding="UTF-8" breaks validation

Posted by David Bertoni <db...@apache.org>.
Stephen Collyer wrote:
> I have a SAX2 parser which is exhibiting odd behaviour.
> 
> If I give it some XML with an XML declaration like:
> 
> <?xml version="1.0" encoding="UTF-8" ?>
> 
> it fails with a "Invalid document structure" error.
> If I remove the encoding element, then it parses correctly.
This is quite strange, since the parser will assume the encoding is 
UTF-8 without an encoding declaration.  The only case where I could 
imagine this might happen is with a UTF-16 document with an encoding 
declaration that indicates a byte-oriented encoding.  You can verify 
this by looking at a binary dump of the XML stream.
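
If a hex dump is not handy, the same check can be sketched in a few
lines of code. The function below is only an illustration (the name
sniffXmlEncoding is made up and it is not how Xerces does its own
detection); it looks at the first bytes for a byte order mark or for
the pattern that "<?" produces in UTF-16. UTF-32 and EBCDIC are left
out to keep it short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess the encoding family of an XML stream from
// its first bytes. Simplified: UTF-32 and EBCDIC are not considered.
std::string sniffXmlEncoding(const unsigned char* p, std::size_t n) {
    // Byte order marks.
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8 (BOM)";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "UTF-16BE (BOM)";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "UTF-16LE (BOM)";

    // No BOM: "<?" shows up with interleaved zero bytes in UTF-16.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x3C && p[2] == 0x00 && p[3] == 0x3F)
        return "UTF-16BE (no BOM)";
    if (n >= 4 && p[0] == 0x3C && p[1] == 0x00 && p[2] == 0x3F && p[3] == 0x00)
        return "UTF-16LE (no BOM)";

    // "<?xm" as plain bytes: UTF-8 or another ASCII-compatible encoding.
    if (n >= 4 && p[0] == 0x3C && p[1] == 0x3F && p[2] == 0x78 && p[3] == 0x6D)
        return "byte-oriented (ASCII-compatible)";

    return "unknown";
}
```

If this reports one of the UTF-16 results for a buffer whose
declaration says encoding="UTF-8", the declaration and the actual bytes
disagree.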

> 
> Can anyone suggest what the problem is ? I'm assuming
> that this is some interaction between the validator and
> the encoding, but I'm baffled as to what, precisely.
Encoding detection and parsing happen at a lower level than validation.
That's also not an error from the validation code -- it's an
indication that the parser has found something wrong with the
fundamental structure of the XML document.

Can you post more details of what your code looks like, and how the 
parser is configured?  Also, if you can post a trivial document that 
reproduces the error, that would help.

Dave

Re: SAX2 parser: encoding="UTF-8" breaks validation

Posted by Jason Stewart <ja...@gmail.com>.
Ah, thanks.

Cheers, jas.

On Mon, Jun 23, 2008 at 12:50 AM, Michael Glavassevich
<mr...@ca.ibm.com> wrote:
> Nope. The correct spelling is "UTF-8" [1] which is what Stephen has in his
> document.
>
> [1] http://www.iana.org/assignments/character-sets

Re: SAX2 parser: encoding="UTF-8" breaks validation

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Nope. The correct spelling is "UTF-8" [1] which is what Stephen has in his
document.

[1] http://www.iana.org/assignments/character-sets

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

"Jason Stewart" <ja...@gmail.com> wrote on 06/22/2008 01:47:30
PM:

> Hey Stephen,
>
> I believe "UTF8" is the correct spelling. Can someone else confirm
> that? Or is this a red herring.
>
> Cheers, jas.
>
> On Sun, Jun 22, 2008 at 5:10 PM, Stephen Collyer
> <sc...@netspinner.co.uk> wrote:
> > I have a SAX2 parser which is exhibiting odd behaviour.
> >
> > If I give it some XML with an XML declaration like:
> >
> > <?xml version="1.0" encoding="UTF-8" ?>
> >
> > it fails with a "Invalid document structure" error.
> > If I remove the encoding element, then it parses correctly.
> >
> > Can anyone suggest what the problem is ? I'm assuming
> > that this is some interaction between the validator and
> > the encoding, but I'm baffled as to what, precisely.
> >
> > This is occuring with Xerces-c 2.7.0.
> >
> > --
> > Regards
> >
> > Steve Collyer
> > Netspinner Ltd

Re: SAX2 parser: encoding="UTF-8" breaks validation

Posted by Jason Stewart <ja...@gmail.com>.
Hey Stephen,

I believe "UTF8" is the correct spelling. Can someone else confirm
that? Or is this a red herring.

Cheers, jas.

On Sun, Jun 22, 2008 at 5:10 PM, Stephen Collyer
<sc...@netspinner.co.uk> wrote:
> I have a SAX2 parser which is exhibiting odd behaviour.
>
> If I give it some XML with an XML declaration like:
>
> <?xml version="1.0" encoding="UTF-8" ?>
>
> it fails with a "Invalid document structure" error.
> If I remove the encoding element, then it parses correctly.
>
> Can anyone suggest what the problem is ? I'm assuming
> that this is some interaction between the validator and
> the encoding, but I'm baffled as to what, precisely.
>
> This is occuring with Xerces-c 2.7.0.
>
> --
> Regards
>
> Steve Collyer
> Netspinner Ltd
>