Posted to j-users@xerces.apache.org by Shekhar Karani <2k...@sun20.datamatics.com> on 2003/03/28 12:27:16 UTC

UTF-8 Encoding

Hi

I am using Xerces 2.2.1 to parse XML documents. One of the documents
contains a byte with hex value B6, which the parser rejects as invalid
UTF-8 with the error "Invalid byte 1 of UTF-8 byte stream". However, the
editor XML SPY version 5 accepts this character.

Please let me know what I need to do in my code to accept this
character.

The mailing list archives are not accessible, so I am not sure whether
this question has already been asked there.

Thanks
Shekhar

Re: UTF-8 Encoding

Posted by Michael Glavassevich <mr...@engmail.uwaterloo.ca>.

Hi Shekhar,

B6 as a byte sequence is not a valid UTF-8 character: in binary it is
10110110, which matches the continuation-byte pattern 10xxxxxx and so
cannot begin a character. Your document must therefore be in a different
encoding (perhaps ISO Latin 1 as Andy suggested), or there was some error
encoding it as UTF-8. In UTF-8, Unicode characters 0x00-0x7F are encoded as
single bytes identical to their ASCII values. Unicode characters above 0x7F
are encoded as multi-byte sequences, as shown below:

Unicode               UTF-8 Byte Sequence
-------------------   -----------------------------------
0000 0000-0000 007F   0xxxxxxx
0000 0080-0000 07FF   110xxxxx 10xxxxxx
0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The x's denote the bits of the character's Unicode number. You can get
more info on UTF-8 at
http://www.ietf.org/rfc/rfc2279.txt?number=2279.
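
To make the table concrete, here is a minimal Java sketch. It is not from
the original thread, and the class and method names are made up for
illustration. It classifies a byte purely by the leading-bit patterns in
the table (it does not check for overlong or otherwise illegal sequences)
and shows what the character U+00B6 looks like when it is correctly
encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces: classifies a byte by the
// leading-bit patterns of UTF-8.
class Utf8ByteKind {
    static String kind(int b) {
        b &= 0xFF;                                      // treat as unsigned byte
        if ((b & 0x80) == 0x00) return "single";        // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation";  // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead-2";        // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead-3";        // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead-4";        // 11110xxx
        return "invalid";
    }

    // Returns the UTF-8 bytes of a string as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte x : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(kind(0xB6));         // continuation
        System.out.println(utf8Hex("\u00B6"));  // C2 B6
    }
}
```

A lone B6 is therefore a continuation byte with no lead byte in front of
it, which is consistent with the parser complaining about "byte 1".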

I hope that helps.

-----------------------------
Michael Glavassevich
mrglavas@engmail.uwaterloo.ca
4B Computer Engineering
University of Waterloo

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: UTF-8 Encoding

Posted by Shekhar Karani <2k...@sun20.datamatics.com>.

The document encoding is "UTF-8"
<?xml version="1.0" encoding="UTF-8"?>

Shekhar

Re: UTF-8 Encoding

Posted by Andy Clark <an...@apache.org>.

Shekhar Karani wrote:
> I am using the xerces 2.2.1 to parse XML documents. One of the XML
> documents has a hex character B6. This character is being treated as an
> invalid UTF-8 character by the parser. The parser gives the error
> "Invalid byte 1 of UTF-8 byte stream". However, the editor XML SPY
> version 5, accepts this character.

What is the encoding of the document? If it is really
ISO Latin 1 ("ISO-8859-1") and you do NOT specify this
in the XML Declaration at the top of the document (e.g.
"<?xml version='1.0' encoding='...'?>"), then your
document is in error.

If XML Spy accepts it without the encoding declaration,
then it is not following the XML specification, which
dictates that the encoding of the document is assumed
to be UTF-8 in the absence of the XML declaration.

-- 
Andy Clark * andyc@apache.org

Re: UTF-8 Encoding

Posted by Shekhar Karani <2k...@sun20.datamatics.com>.

Thanks a lot guys. I will surely try this out.

Shekhar

Re: UTF-8 Encoding

Posted by Andy Clark <an...@apache.org>.

Michael Glavassevich wrote:
> If you absolutely cannot alter your input document, you can try setting
> your own character reader on the input source. This will force the parser
> to use your own reader. If you have an InputStream to the document you can
> easily get one for ISO-8859-1 using an InputStreamReader.

Michael is right. If you know the actual encoding of the
document, then you can follow this approach and it will
always work because the parser will not try to perform
any auto-detection. For example:

   InputStream stream = /* ... */;
   Reader reader = new InputStreamReader(stream, "ISO-8859-1");

   InputSource source = new InputSource(reader);
   // NOTE: Also set the system id so that the parser can
   //       resolve relative URIs.

However, in general, you should let the parser do the
auto-detection of the character encoding. But if you're
stuck in the situation where someone has given you a
document that is not well-formed because the specified
encoding is wrong, then use this method to work around
the problem.
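
For reference, a self-contained sketch of the same workaround follows. It
assumes the JAXP DocumentBuilder API rather than any Xerces-specific
class; the class name, the method name, and the system id used in main()
are hypothetical. The sample input is an in-memory document that declares
UTF-8 but actually contains the ISO-8859-1 byte B6.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
class ReaderOverrideDemo {
    // Decodes the bytes as ISO-8859-1 ourselves and hands the parser a
    // Reader, so the parser never has to interpret the raw bytes.
    static Document parseAsLatin1(InputStream stream, String systemId) throws Exception {
        Reader reader = new InputStreamReader(stream, StandardCharsets.ISO_8859_1);
        InputSource source = new InputSource(reader);
        source.setSystemId(systemId); // lets the parser resolve relative URIs
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Claims UTF-8 in the declaration, but is really ISO-8859-1.
        byte[] bytes = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>\u00B6</doc>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseAsLatin1(new ByteArrayInputStream(bytes),
                "file:///example/doc.xml"); // made-up system id
        String text = doc.getDocumentElement().getTextContent();
        System.out.println(Integer.toHexString(text.charAt(0)));
    }
}
```

Because the character stream is supplied by the caller, the declared
encoding is not used to decode the input in this sketch.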

-- 
Andy Clark * andyc@apache.org

Re: UTF-8 Encoding

Posted by Michael Glavassevich <mr...@engmail.uwaterloo.ca>.

Hi Shekhar,

Setting the encoding on the input source allows the parser to skip encoding
auto-detection. However, once it reads the encoding from the XML declaration,
it will create a new character reader if the previous encoding (either
auto-detected or supplied by the user) is different from the one specified
in the document, so you don't gain anything by doing this.

You should try changing the value of the encoding in the actual document.
<?xml version="1.0" encoding="iso-8859-1"?>

If you absolutely cannot alter your input document, you can try setting
your own character reader on the input source. This will force the parser
to use your own reader. If you have an InputStream to the document, you can
easily get a reader for ISO-8859-1 using an InputStreamReader.
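
If a copy of the file can be rewritten, a related option (not proposed in
this thread, and offered only as a sketch) is to leave the declaration
alone and re-encode the bytes so that the UTF-8 declaration becomes true.
This assumes the content really is ISO-8859-1; the class and method names
below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the input bytes are ISO-8859-1.
class Latin1ToUtf8 {
    // Decodes the bytes as ISO-8859-1 text and re-encodes that text as UTF-8.
    static byte[] transcode(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] out = transcode(new byte[] { (byte) 0xB6 });
        for (byte b : out) {
            System.out.printf("%02X ", b & 0xFF); // C2 B6
        }
        System.out.println();
    }
}
```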

-----------------------------
Michael Glavassevich
mrglavas@engmail.uwaterloo.ca
4B Computer Engineering
University of Waterloo

Re: UTF-8 Encoding

Posted by Shekhar Karani <2k...@sun20.datamatics.com>.

I tried setting the InputSource encoding to ISO-8859-1, but this did not
work. The parser is still giving the same error.

I just thought I should clarify that I have an XML document, given to me by
a client, that I need to parse. The XML document has its encoding set to
UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

I need to parse this document, but the character with hex value B6 present
in it is not being accepted by the parser. I need to override the encoding
set in the XML document through my code, but setting the InputSource
encoding to ISO-8859-1 is not doing the trick.

Thanks
Shekhar

RE: UTF-8 Encoding

Posted by Ragunath Marudhachalam <rm...@circuitvision.com>.

Yes.

OutputFormat format = new OutputFormat(document); // document is the DOM to serialize
format.setEncoding("ISO-8859-1");

This will set the encoding to ISO-8859-1 instead of UTF-8. UTF-8 is the
default encoding that is used when you create a document without specifying
any encoding.

If you are not using serialization, then try setting the encoding on the
InputSource.

Ragu
CircuitVision

Re: UTF-8 Encoding

Posted by Shekhar Karani <2k...@sun20.datamatics.com>.

Will doing that in my code override the encoding declared in the XML document?

Shekhar

RE: UTF-8 Encoding

Posted by Ragunath Marudhachalam <rm...@circuitvision.com>.

Set the encoding to "ISO-8859-1".

Ragu 
CircuitVision 
