Posted to commits@xerces.apache.org by ar...@locus.apache.org on 2000/07/31 21:00:50 UTC

cvs commit: xml-xerces/c/src/internal XMLReader.cpp

aruna1      00/07/31 12:00:50

  Modified:    c/src/internal XMLReader.cpp
  Log:
  Fixed BOM in UTF-8 files
  
  Revision  Changes    Path
  1.20      +15 -2     xml-xerces/c/src/internal/XMLReader.cpp
  
  Index: XMLReader.cpp
  ===================================================================
  RCS file: /home/cvs/xml-xerces/c/src/internal/XMLReader.cpp,v
  retrieving revision 1.19
  retrieving revision 1.20
  diff -u -r1.19 -r1.20
  --- XMLReader.cpp	2000/07/25 22:33:05	1.19
  +++ XMLReader.cpp	2000/07/31 19:00:48	1.20
  @@ -55,7 +55,7 @@
    */
   
   /*
  - * $Id: XMLReader.cpp,v 1.19 2000/07/25 22:33:05 aruna1 Exp $
  + * $Id: XMLReader.cpp,v 1.20 2000/07/31 19:00:48 aruna1 Exp $
    */
   
   // ---------------------------------------------------------------------------
  @@ -1331,11 +1331,24 @@
               break;
           }
   
  -        case XMLRecognizer::US_ASCII :
           case XMLRecognizer::UTF_8 :
           {
  +            // If there's a utf-8 BOM  (0xEF 0xBB 0xBF), skip past it.
  +            //   Don't move to char buf - no one wants to see it.
  +            //   Note: this causes any encoding= declaration to override
  +            //         the BOM's attempt to say that the encoding is utf-8.
  + 
               // Look at the raw buffer as short chars
               const char* asChars = (const char*)fRawByteBuf;
  +
  +            if (fRawBytesAvail > XMLRecognizer::fgUTF8BOMLen &&
  +                XMLString::compareNString(  asChars
  +                                            , XMLRecognizer::fgUTF8BOM
  +                                            , XMLRecognizer::fgUTF8BOMLen) == 0)
  +            {
  +                fRawBufIndex += XMLRecognizer::fgUTF8BOMLen;
  +                asChars      += XMLRecognizer::fgUTF8BOMLen;
  +            }
   
               //
               //  First check that there are enough bytes to even see the

Re: cvs commit: xml-xerces/c/src/internal XMLReader.cpp

Posted by Andy Heninger <an...@jtcsv.com>.


Dean Roddey asks

> What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
> form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
> are no components of it larger than a byte.

That was pretty much my first reaction also.  Checking with the ICU folks,
though, it turns out that UTF-8 allows a BOM, and, if one is found, it
should be ignored.  It doesn't affect the data that follows in any way,
except to confirm that the encoding is really UTF-8 and not ASCII or
Latin-1 or whatever.  The UTF-8 BOM is three bytes (0xEF 0xBB 0xBF), and is
nothing more than the UTF-16 BOM character, U+FEFF, as it appears when
encoded as UTF-8.
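
To make that concrete, here is a minimal sketch (not Xerces code) that pushes
U+FEFF through the standard three-byte UTF-8 pattern.  The helper name
encodeUtf8ThreeByte is made up for illustration, and it assumes the code
point is in the range U+0800..U+FFFF.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper, for illustration only: encode a code point in the
// range U+0800..U+FFFF using the three-byte UTF-8 pattern
//   1110xxxx 10xxxxxx 10xxxxxx
std::array<unsigned char, 3> encodeUtf8ThreeByte(std::uint32_t cp)
{
    return {
        static_cast<unsigned char>(0xE0 | (cp >> 12)),          // top 4 bits
        static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),  // middle 6 bits
        static_cast<unsigned char>(0x80 | (cp & 0x3F))          // low 6 bits
    };
}
```

Calling encodeUtf8ThreeByte(0xFEFF) yields the bytes 0xEF 0xBB 0xBF, which is
the sequence the commit's comment refers to.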

Pretty silly, especially since we already have an encoding declaration to
tell us what the encoding is.  But it seems that Microsoft is generating
utf-8 encoded XML with a BOM, and we need to be able to swallow it.


Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com

Re: cvs commit: xml-xerces/c/src/internal XMLReader.cpp

Posted by Arundhati Bhowmick <ar...@hyperreal.org>.

There were cases where XML files began with a BOM and the specified encoding
was utf-8. In those situations the parser was unable to recognize the files.

This change causes the UTF-8 BOM to be completely ignored for any ASCII-family
encoding.
Andy H had a valid question though: should the BOM override the XML encoding
declaration, should the declaration override the BOM, or should it be an
error if they conflict?

Right now the encoding declaration overrides the BOM.
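
Outside of the Xerces classes, the skip step could be sketched roughly as
below.  This is only an illustration under stated assumptions: kUtf8Bom and
utf8BomSkip are hypothetical names, not the XMLRecognizer or XMLReader
members touched by the commit.

```cpp
#include <cstddef>
#include <cstring>

// The three bytes of the UTF-8 BOM (U+FEFF encoded as UTF-8).
static const unsigned char kUtf8Bom[] = { 0xEF, 0xBB, 0xBF };

// Hypothetical helper: returns how many leading bytes of 'buf' should be
// skipped so that decoding starts after a UTF-8 BOM, or 0 if there is none.
std::size_t utf8BomSkip(const unsigned char* buf, std::size_t avail)
{
    if (avail >= sizeof(kUtf8Bom) &&
        std::memcmp(buf, kUtf8Bom, sizeof(kUtf8Bom)) == 0)
    {
        return sizeof(kUtf8Bom);
    }
    return 0;
}
```

A caller would advance its raw-buffer index by the returned count before
looking for the XML declaration, so the BOM never reaches the character
buffer.  Note that this sketch tests with >=, so a buffer holding only the
BOM is also skipped; the committed code tests with >.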

Arundhati

Dean Roddey wrote:

> What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
> form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
> are no components of it larger than a byte.

--


Arundhati Bhowmick
IBM -- XML Technology Group (Silicon Valley)

Re: cvs commit: xml-xerces/c/src/internal XMLReader.cpp

Posted by Dean Roddey <dr...@charmedquark.com>.

What is this UTF-8 BOM stuff? I've never heard of such a thing. Given the
form of UTF-8, why would it need a BOM? It's a multi-byte encoding, so there
are no components of it larger than a byte.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

