Posted to commits@xerces.apache.org by ar...@locus.apache.org on 2000/07/31 21:00:50 UTC
cvs commit: xml-xerces/c/src/internal XMLReader.cpp
aruna1 00/07/31 12:00:50
Modified: c/src/internal XMLReader.cpp
Log:
Fixed BOM in UTF-8 files
Revision Changes Path
1.20 +15 -2 xml-xerces/c/src/internal/XMLReader.cpp
Index: XMLReader.cpp
===================================================================
RCS file: /home/cvs/xml-xerces/c/src/internal/XMLReader.cpp,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- XMLReader.cpp 2000/07/25 22:33:05 1.19
+++ XMLReader.cpp 2000/07/31 19:00:48 1.20
@@ -55,7 +55,7 @@
*/
/*
- * $Id: XMLReader.cpp,v 1.19 2000/07/25 22:33:05 aruna1 Exp $
+ * $Id: XMLReader.cpp,v 1.20 2000/07/31 19:00:48 aruna1 Exp $
*/
// ---------------------------------------------------------------------------
@@ -1331,11 +1331,24 @@
break;
}
- case XMLRecognizer::US_ASCII :
case XMLRecognizer::UTF_8 :
{
+ // If there's a utf-8 BOM (0xEF 0xBB 0xBF), skip past it.
+ // Don't move to char buf - no one wants to see it.
+ // Note: this causes any encoding= declaration to override
+ // the BOM's attempt to say that the encoding is utf-8.
+
// Look at the raw buffer as short chars
const char* asChars = (const char*)fRawByteBuf;
+
+ if (fRawBytesAvail > XMLRecognizer::fgUTF8BOMLen &&
+ XMLString::compareNString( asChars
+ , XMLRecognizer::fgUTF8BOM
+ , XMLRecognizer::fgUTF8BOMLen) == 0)
+ {
+ fRawBufIndex += XMLRecognizer::fgUTF8BOMLen;
+ asChars += XMLRecognizer::fgUTF8BOMLen;
+ }
//
// First check that there are enough bytes to even see the
Re: cvs commit: xml-xerces/c/src/internal XMLReader.cpp
Posted by Andy Heninger <an...@jtcsv.com>.
Dean Roddey asks:
> What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
> form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
> are no components of it larger than a byte.
That was pretty much my first reaction also. Checking with the ICU folks,
though, it turns out that UTF-8 allows a BOM, and, if it is found, it
should be ignored. It doesn't affect the data that follows in any way,
except to confirm that the encoding is really utf-8 and not ascii or
latin-1 or whatever. The utf-8 BOM is three bytes (0xEF 0xBB 0xBF), and is
nothing more than the UTF-16 BOM character (U+FEFF) as it appears when
encoded as UTF-8.
Pretty silly, especially since we already have an encoding declaration to
tell us what the encoding is. But it seems that Microsoft is generating
utf-8 encoded XML with a BOM, and we need to be able to swallow it.
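
For anyone who wants to check the three-byte claim, here is a minimal C++ sketch that encodes a code point as UTF-8 by hand. The helper name encodeUtf8 is made up for this illustration and is not part of Xerces; feeding it U+FEFF should give the same 0xEF 0xBB 0xBF sequence that the commit's comment mentions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Xerces API): encode one Unicode scalar
// value as UTF-8 and return the resulting bytes.
std::string encodeUtf8(std::uint32_t cp)
{
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (U+FEFF lands here)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(0xFEFF) returns a three-byte string, which is why the byte order mark survives a UTF-16 to UTF-8 conversion as a three-byte prefix even though UTF-8 has no byte order to mark.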
Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com
Re: cvs commit: xml-xerces/c/src/internal XMLReader.cpp
Posted by Arundhati Bhowmick <ar...@hyperreal.org>.
There were cases where the XML files had a BOM and the encoding specified
was utf-8. In those situations the parser was unable to recognize the files.
This change causes the UTF-8 BOM to be completely ignored for any ASCII family
encoding.
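
Roughly, the skip amounts to the following standalone sketch. The names kUtf8Bom, kUtf8BomLen and utf8BomLength are invented for this example; the real code uses XMLRecognizer::fgUTF8BOM and friends on the reader's raw byte buffer. One deliberate difference, stated as an assumption: the sketch accepts a buffer of exactly three bytes (>=), whereas the committed check uses a strict >.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The three UTF-8 BOM bytes named in the commit's comment.
// These names are illustrative, not the Xerces constants.
static const unsigned char kUtf8Bom[] = { 0xEF, 0xBB, 0xBF };
static const std::size_t kUtf8BomLen = sizeof(kUtf8Bom);

// Return how many leading bytes of the raw buffer should be skipped:
// kUtf8BomLen if the buffer starts with a UTF-8 BOM, otherwise 0.
std::size_t utf8BomLength(const unsigned char* buf, std::size_t avail)
{
    assert(buf != nullptr || avail == 0);

    if (avail >= kUtf8BomLen &&
        std::memcmp(buf, kUtf8Bom, kUtf8BomLen) == 0)
    {
        return kUtf8BomLen;
    }
    return 0;
}
```

A caller would advance its raw-buffer index by the returned value before looking for the XML declaration, so the BOM never reaches the character buffer.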
Andy H had a valid question though - should the BOM override the XML encoding
declaration, or should the declaration override the BOM, or should it be an
error if they conflict?
Right now the encoding declaration overrides the BOM.
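
To make the three options concrete, here is a small hypothetical sketch. None of it is Xerces code: BomPolicy, isUtf8Name and resolveEncoding are names invented for this mail, and the parser does not expose such a switch. DeclarationWins corresponds to the current behaviour described above.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical policy switch covering the three options under discussion.
enum class BomPolicy { DeclarationWins, BomWins, ErrorOnConflict };

// Case-insensitive check for the encoding name "UTF-8".
bool isUtf8Name(const std::string& name)
{
    const std::string ref = "utf-8";
    if (name.size() != ref.size())
        return false;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(name[i])) != ref[i])
            return false;
    }
    return true;
}

// Decide which encoding to use, given whether a UTF-8 BOM was seen and
// what the encoding= declaration said (empty string if there was none).
std::string resolveEncoding(bool sawUtf8Bom,
                            const std::string& declared,
                            BomPolicy policy)
{
    // No declaration, or a declaration that already says UTF-8:
    // there is nothing to resolve.
    if (declared.empty() || isUtf8Name(declared))
        return "UTF-8";

    // A non-UTF-8 declaration without a BOM is simply honoured.
    if (!sawUtf8Bom)
        return declared;

    // BOM says UTF-8, declaration says something else.
    switch (policy) {
    case BomPolicy::DeclarationWins:
        return declared;
    case BomPolicy::BomWins:
        return "UTF-8";
    case BomPolicy::ErrorOnConflict:
        throw std::runtime_error("UTF-8 BOM conflicts with encoding declaration");
    }
    return declared; // not reached for valid policy values
}
```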
Arundhati
Dean Roddey wrote:
> What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
> form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
> are no components of it larger than a byte.
--
Arundhati Bhowmick
IBM -- XML Technology Group (Silicon Valley)
Re: cvs commit: xml-xerces/c/src/internal XMLReader.cpp
Posted by Dean Roddey <dr...@charmedquark.com>.
What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
are no components of it larger than a byte.
--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com
"You young, and you gotcha health. Whatchoo wanna job fer?"