Posted to soap-dev@xml.apache.org by Glen Daniels <gd...@allaire.com> on 2001/03/21 06:47:35 UTC

Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Hi folks:

While we were out at Microsoft, we discovered that their ASP.NET
implementation of SOAP always sends a byte-order mark in front of the XML.
Perhaps because we use Readers down in the SOAPTransport.receive() method
and in TransportMessage.getEnvelopeReader(), we ended up getting confused
somewhere because of this, and the XML wouldn't parse.

I hacked the code to use InputStreams instead of Readers in various places
(I can give you a manifest, but not just at the moment), so we could hand
the XML parser an InputSource constructed directly from an InputStream, and
this worked fine for getting us up and running (I assume the parser ate the
byte-order mark happily).  However, I was a bit worried how this might
affect the whole MIME processing system, so I didn't check this in.
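A minimal sketch of the InputStream approach described above, not the actual Apache SOAP change. It assumes the incoming BOM is the UTF-8 one (EF BB BF) and uses the JDK's built-in JAXP parser; the class and method names are hypothetical. Handing the parser raw bytes lets it do its own encoding detection, whereas Java's UTF-8 decoder leaves U+FEFF in the character stream when the bytes go through a Reader first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BomParseSketch {
    // Prepends the UTF-8 byte-order mark (EF BB BF) to the encoded XML.
    static byte[] withUtf8Bom(String xml) {
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    // Parses from a byte stream so the parser sees (and consumes) the BOM itself.
    static String rootName(byte[] bytes) {
        try {
            InputSource src = new InputSource(new ByteArrayInputStream(bytes));
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(src);
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("XML did not parse", e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = withUtf8Bom("<?xml version=\"1.0\"?><Envelope/>");
        System.out.println(rootName(wire)); // prints "Envelope"
    }
}
```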

I am not an expert on character encoding, so if possible I would love for
one of you guys who is savvy about this area (Wouter and Sanjiva, I think)
to take a quick look at my changes, which I'll post tomorrow sometime, and
see what you think.  For now, any commentary on the whole byte-order mark /
Reader issue would also be appreciated.

--Glen

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by Scott Nichol <sn...@computer.org>.

> On Thu, Mar 22, 2001 at 03:33:10PM -0500, Scott Nichol wrote:
> > > Insertion of a BOM, even at the level where there is no MIME multipart
> > > envelope, but with an HTTP POST content with a text/xml content-type, is
> > > a serious bug.
> >
> > Quoting the XML 1.0 spec section 4.3.3
> >
> > >>>>
> > Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC
> > 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either
> > the markup or the character data of the XML document. XML processors must be able to use this
> > character to differentiate between UTF-8 and UTF-16 encoded documents.
> > <<<<
> >
> > Doesn't this mean that it is a bug to *not* include a BOM when using UTF-16?
>
> Quote:
>
> This is an encoding signature, not part of either the markup or the character data of the XML document
>
> I interpret that as: the BOM is used to identify the binary encoding of the XML character content.
> It is NOT part of the XML content itself. It's useful for files or datastreams that have no other
> way of specifying the encoding. If the encoding is already specified (e.g. MIME encoding type
> specified in a multipart MIME envelope or in the Content-Type header of HTTP), it's a bug to include it.
>
> I believe that I read in some SOAP or SOAP-related spec that the BOM may not be a part of the content,
> but I haven't had the time or energy to search for it yet...
>
> bfn, Wouter

I understand your interpretation.  I do not have sufficient grasp of the
full XML spec to know whether I should agree with it ;-).  I checked the
SOAP 1.1 spec and found no reference to BOM.  However, in interpreting the
XML spec, I am somewhat swayed by HTML 4.01, which recommends (but does not
require) the BOM to be transported when UTF-16 is used.  The reason stated
there is to detect the situation where the sender has not spit out data in
network byte order.  While this is just a recommendation, it does refer to
data being transported within a stream (HTTP) that specifies the character
set.

Scott

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by Sanjiva Weerawarana <sa...@watson.ibm.com>.

Is this something to ask the xerces-j-dev folks? They must have a good
grip on the details of this.

Sanjiva.

----- Original Message -----
From: "Wouter Cloetens" <wo...@mind.be>
To: <so...@xml.apache.org>
Sent: Sunday, March 25, 2001 6:44 PM
Subject: Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)


> HMM! Good point from both of you. I guess that, if the content-type HTTP
> header indicates utf-16 as the charset, the BOM should indeed be present.
>
> For Apache-SOAP, I suppose the impact is that we need to support incoming
> utf-16 encoded SOAP envelope parts with a leading BOM. On the way out,
> 2.1 currently only supports utf-8, so it's not an issue there. I was
> planning, though, to make it easier to specify other encodings for outgoing
> requests (client) or responses (server), so that's an issue to keep in
> mind.
>
> bfn, Wouter
>
> On Fri, Mar 23, 2001 at 10:12:55AM -0500, Scott Nichol wrote:
>
> > I understand your interpretation.  I do not have sufficient grasp of the
> > full XML spec to know whether I should agree with it ;-).  I checked the
> > SOAP 1.1 spec and found no reference to BOM.  However, in interpreting the
> > XML spec, I am somewhat swayed by HTML 4.01, which recommends (but does not
> > require) the BOM to be transported when UTF-16 is used.  The reason stated
> > there is to detect the situation where the sender has not spit out data in
> > network byte order.  While this is just a recommendation, it does refer to
> > data being transported within a stream (HTTP) that specifies the character
> > set.
> >
> > Scott
>
>
> On Fri, Mar 23, 2001 at 03:32:55PM +0000, John Colgrave wrote:
>
> > I interpret this as saying that if UTF-16 is used then the BOM must be
> > used but it is consumed by the XML processor and not regarded as either
> > markup or character data as seen by the application.
> >
> > Even if the UTF-16 encoding is declared, either externally (MIME etc.)
> > or in a text declaration the BOM must be present.
> >
> > RFC 2376 shows the following for the case of text/xml with UTF-16
> > Charset (section 6.2):
> >
> > Content-type: text/xml; charset="utf-16"
> >
> > {BOM}<?xml version='1.0' encoding='utf-16'?>
> >
> > In all of the scenarios involving UTF-16 in RFC 2376, the BOM is used
> > whether the Content-type declaration includes charset="utf-16" or not.
> > --
> > Regards,
> >
> > John Colgrave
> > colgrave@hursley.ibm.com

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by Wouter Cloetens <wo...@mind.be>.

HMM! Good point from both of you. I guess that, if the content-type HTTP
header indicates utf-16 as the charset, the BOM should indeed be present.

For Apache-SOAP, I suppose the impact is that we need to support incoming
utf-16 encoded SOAP envelope parts with a leading BOM. On the way out,
2.1 currently only supports utf-8, so it's not an issue there. I was
planning, though, to make it easier to specify other encodings for outgoing
requests (client) or responses (server), so that's an issue to keep in
mind.

bfn, Wouter

On Fri, Mar 23, 2001 at 10:12:55AM -0500, Scott Nichol wrote:

> I understand your interpretation.  I do not have sufficient grasp of the
> full XML spec to know whether I should agree with it ;-).  I checked the
> SOAP 1.1 spec and found no reference to BOM.  However, in interpreting the
> XML spec, I am somewhat swayed by HTML 4.01, which recommends (but does not
> require) the BOM to be transported when UTF-16 is used.  The reason stated
> there is to detect the situation where the sender has not spit out data in
> network byte order.  While this is just a recommendation, it does refer to
> data being transported within a stream (HTTP) that specifies the character
> set.
>
> Scott


On Fri, Mar 23, 2001 at 03:32:55PM +0000, John Colgrave wrote:

> I interpret this as saying that if UTF-16 is used then the BOM must be
> used but it is consumed by the XML processor and not regarded as either
> markup or character data as seen by the application.
> 
> Even if the UTF-16 encoding is declared, either externally (MIME etc.)
> or in a text declaration the BOM must be present.
> 
> RFC 2376 shows the following for the case of text/xml with UTF-16
> Charset (section 6.2):
> 
> Content-type: text/xml; charset="utf-16"
> 
> {BOM}<?xml version='1.0' encoding='utf-16'?>
> 
> In all of the scenarios involving UTF-16 in RFC 2376, the BOM is used
> whether the Content-type declaration includes charset="utf-16" or not.
> -- 
> Regards,
> 
> John Colgrave
> colgrave@hursley.ibm.com

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by John Colgrave <co...@hursley.ibm.com>.

Wouter Cloetens wrote:
> 
> On Thu, Mar 22, 2001 at 03:33:10PM -0500, Scott Nichol wrote:
> > > Insertion of a BOM, even at the level where there is no MIME multipart
> > > envelope, but with an HTTP POST content with a text/xml content-type, is
> > > a serious bug.
> >
> > Quoting the XML 1.0 spec section 4.3.3
> >
> > >>>>
> > Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC
> > 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either
> > the markup or the character data of the XML document. XML processors must be able to use this
> > character to differentiate between UTF-8 and UTF-16 encoded documents.
> > <<<<
> >
> > Doesn't this mean that it is a bug to *not* include a BOM when using UTF-16?
> 
> Quote:
> 
> This is an encoding signature, not part of either the markup or the character data of the XML document
> 
> I interpret that as: the BOM is used to identify the binary encoding of the XML character content.
> It is NOT part of the XML content itself. It's useful for files or datastreams that have no other
> way of specifying the encoding. If the encoding is already specified (e.g. MIME encoding type
> specified in a multipart MIME envelope or in the Content-Type header of HTTP), it's a bug to include it.
> 
> I believe that I read in some SOAP or SOAP-related spec that the BOM may not be a part of the content,
> but I haven't had the time or energy to search for it yet...
> 
> bfn, Wouter

I interpret this as saying that if UTF-16 is used then the BOM must be
used but it is consumed by the XML processor and not regarded as either
markup or character data as seen by the application.

Even if the UTF-16 encoding is declared, either externally (MIME etc.)
or in a text declaration the BOM must be present.

RFC 2376 shows the following for the case of text/xml with UTF-16
Charset (section 6.2):

Content-type: text/xml; charset="utf-16"

{BOM}<?xml version='1.0' encoding='utf-16'?>

In all of the scenarios involving UTF-16 in RFC 2376, the BOM is used
whether the Content-type declaration includes charset="utf-16" or not.
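As a rough illustration of the {BOM} in the RFC 2376 example, here is a sketch using the JDK's standard charsets (the class and method names are hypothetical). Java's "UTF-16" charset writes a big-endian BOM when encoding and consumes the BOM when decoding, which matches the reading that the BOM is an encoding signature rather than document content.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomSketch {
    // Encoding with the generic UTF-16 charset emits FE FF followed by big-endian code units.
    static byte[] encode(String xml) {
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    // Decoding with the generic UTF-16 charset uses the BOM to pick the byte order and drops it.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("<?xml version='1.0' encoding='utf-16'?><a/>");
        System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF); // prints "FE FF"
        System.out.println(decode(bytes).charAt(0));                        // prints "<"
    }
}
```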
-- 
Regards,

John Colgrave
colgrave@hursley.ibm.com

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by Wouter Cloetens <wo...@mind.be>.

On Thu, Mar 22, 2001 at 03:33:10PM -0500, Scott Nichol wrote:
> > Insertion of a BOM, even at the level where there is no MIME multipart
> > envelope, but with an HTTP POST content with a text/xml content-type, is
> > a serious bug.
> 
> Quoting the XML 1.0 spec section 4.3.3
> 
> >>>>
> Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC
> 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either
> the markup or the character data of the XML document. XML processors must be able to use this
> character to differentiate between UTF-8 and UTF-16 encoded documents.
> <<<<
> 
> Doesn't this mean that it is a bug to *not* include a BOM when using UTF-16?

Quote:

This is an encoding signature, not part of either the markup or the character data of the XML document

I interpret that as: the BOM is used to identify the binary encoding of the XML character content.
It is NOT part of the XML content itself. It's useful for files or datastreams that have no other
way of specifying the encoding. If the encoding is already specified (e.g. MIME encoding type
specified in a multipart MIME envelope or in the Content-Type header of HTTP), it's a bug to include it.

I believe that I read in some SOAP or SOAP-related spec that the BOM may not be a part of the content,
but I haven't had the time or energy to search for it yet...

bfn, Wouter

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by Scott Nichol <sn...@computer.org>.

> Insertion of a BOM, even at the level where there is no MIME multipart
> envelope, but with an HTTP POST content with a text/xml content-type, is
> a serious bug.

Quoting the XML 1.0 spec section 4.3.3

>>>>
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC
10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
(the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either
the markup or the character data of the XML document. XML processors must be able to use this
character to differentiate between UTF-8 and UTF-16 encoded documents.
<<<<

Doesn't this mean that it is a bug to *not* include a BOM when using UTF-16?

Scott

Re: Byte-order marks and XML parsing (ATTN Wouter and Sanjiva)

Posted by Wouter Cloetens <wo...@mind.be>.

Glen,

Yes, using a Reader will completely break the MIME functionality.

All that Readers/Writers do, compared with InputStreams/OutputStreams,
is apply a character set translation between the bytes of the
latter and the (16-bit Unicode) characters of the former. This needs to be
done at the level of the individual MIME parts, never applied to a MIME
envelope. The MIME encoder/decoder uses the Content-Type header of the
MIME part to determine which character set translation (if any, only allowed
for text/* parts) needs to be applied.
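The per-part decoding described above can be sketched roughly as follows. This is not the Apache SOAP MIME code: PartCharsetSketch, charsetOf and readerFor are made-up names, the parameter parsing is deliberately naive (it does not handle quoted semicolons), and the ISO-8859-1 fallback is an assumption for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class PartCharsetSketch {
    // Picks the charset parameter out of a Content-Type value such as
    // text/xml; charset="utf-16". Falls back to ISO-8859-1 (an assumption).
    static Charset charsetOf(String contentType) {
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                return Charset.forName(name);
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // The byte-to-character translation is applied to one part's body only,
    // never to the enclosing multipart envelope.
    static Reader readerFor(byte[] partBody, String contentType) {
        return new InputStreamReader(new ByteArrayInputStream(partBody), charsetOf(contentType));
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=\"utf-16\""));
        System.out.println(charsetOf("text/xml"));
    }
}
```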

Insertion of a BOM, even at the level where there is no MIME multipart
envelope, but with an HTTP POST content with a text/xml content-type, is
a serious bug. If you were to write a normal servlet and request a
Reader from the servlet SDK, it too would mangle the BOM through the default
character set translation of Java (using the default ISO-8859-1 if no
charset parameter is specified in the Content-Type). So this bug is very
dangerous. It probably breaks a lot more than just Java implementations
too.
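To make the mangling concrete, here is a small sketch, assuming the sender's BOM is the UTF-8 one (EF BB BF); the thread does not say which encoding was on the wire, and the names are hypothetical. Decoded as ISO-8859-1 the three BOM bytes become three junk characters ahead of the markup, and even a UTF-8 decode in Java leaves a U+FEFF character in front of the first '<'.

```java
import java.nio.charset.StandardCharsets;

public class BomManglingSketch {
    // UTF-8 BOM followed by the four ASCII bytes of "<a/>".
    static final byte[] WIRE = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'
    };

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String latin1 = decodeLatin1(WIRE);
        String utf8 = decodeUtf8(WIRE);
        // ISO-8859-1 maps each byte to one char: U+00EF U+00BB U+00BF, then "<a/>".
        System.out.println(latin1.length()); // prints "7"
        // Java's UTF-8 decoder keeps the BOM as a single U+FEFF char.
        System.out.println(utf8.length());   // prints "5"
        System.out.printf("%04X%n", (int) utf8.charAt(0)); // prints "FEFF"
    }
}
```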

Realising that it's probably a hard thing to get the implementers to 
withdraw their product immediately, pending release of a fix, I guess we
should stuff a workaround into the code anyway. This can be done easily
in org.apache.soap.transport.TransportMessage.read(), at line 239:

 ByteArrayDataSource ds = new ByteArrayDataSource(bytes,
                                                  contentType);

We can insert a check for the BOM at the start of the "bytes" byte array
and strip it off (by doing an arraycopy, or by giving the ByteArrayDataSource
a ByteArrayInputStream positioned past the BOM, or, more efficiently, by
hacking ByteArrayDataSource to accept a byte array along with an offset and
length).
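A minimal sketch of the check-and-strip step, using the array-copy variant. BomStripper and its methods are hypothetical names, not existing Apache SOAP code, and the sketch only recognises the UTF-8 and UTF-16 marks (it does not try to tell a UTF-16LE mark apart from a UTF-32LE one).

```java
import java.util.Arrays;

public class BomStripper {
    // Returns the length in bytes of a leading UTF-8 or UTF-16 BOM, or 0 if there is none.
    static int bomLength(byte[] bytes) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            return 3; // UTF-8 signature
        }
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return 2; // UTF-16 big-endian or little-endian mark
            }
        }
        return 0;
    }

    // Returns the input unchanged when there is no BOM, otherwise a copy without it.
    static byte[] stripBom(byte[] bytes) {
        int n = bomLength(bytes);
        return n == 0 ? bytes : Arrays.copyOfRange(bytes, n, bytes.length);
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(stripBom(wire).length); // prints "4"
    }
}
```

Note that stripping a UTF-16 mark also throws away the byte order, so a caller would have to remember which mark it saw and decode as big- or little-endian accordingly. Returning bomLength as an offset instead of copying corresponds to the offset-and-length variant mentioned above.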

Poor guys, you searched that long and hard for a way to work around this?
;-) You should call me next time. I guess my MIME support code is about
the worst understood part of v2.1. It's the code with the most source
code documentation, but I guess an updated overall guide to how the whole
thing works is in order (more up to date than my initial note in this
forum, see http://workspot.net/~zombie/soap/).

I'd do it myself, but I don't have the software of that particular vendor
available. What are they called again? Microsoft?  Not familiar with that
company. Do they have a Linux version? ;-)

bfn, Wouter 

On Wed, Mar 21, 2001 at 12:47:35AM -0500, Glen Daniels wrote:
> While we were out at Microsoft, we discovered that their ASP.NET
> implementation of SOAP always sends a byte-order mark in front of the XML.
> Perhaps because we use Readers down in the SOAPTransport.receive() method
> and in TransportMessage.getEnvelopeReader(), we ended up getting confused
> somewhere because of this, and the XML wouldn't parse.
> 
> I hacked the code to use InputStreams instead of Readers in various places
> (I can give you a manifest, but not just at the moment), so we could hand
> the XML parser an InputSource constructed directly from an InputStream, and
> this worked fine for getting us up and running (I assume the parser ate the
> byte-order mark happily).  However, I was a bit worried how this might
> affect the whole MIME processing system, so I didn't check this in.
> 
> I am not an expert on character encoding, so if possible I would love for
> one of you guys who is savvy about this area (Wouter and Sanjiva, I think)
> to take a quick look at my changes, which I'll post tomorrow sometime, and
> see what you think.  For now, any commentary on the whole byte-order mark /
> Reader issue would also be appreciated.