Posted to j-users@xerces.apache.org by Shekhar Karani <2k...@sun20.datamatics.com> on 2003/03/28 12:27:16 UTC
UTF-8 Encoding
Hi
I am using Xerces 2.2.1 to parse XML documents. One of the XML
documents contains a character with hex value B6. The parser treats this
character as invalid UTF-8 and reports the error
"Invalid byte 1 of UTF-8 byte stream". However, the editor XML SPY
version 5 accepts this character.
Please let me know what I need to do in my code to accept this
character.
The mailing list archives are not accessible, so I am not sure
whether this question has already been asked there.
Thanks
Shekhar
Re: UTF-8 Encoding
Posted by Michael Glavassevich <mr...@engmail.uwaterloo.ca>.
Hi Shekhar,
A lone B6 byte is not a valid UTF-8 sequence, so your document must
be in a different encoding (perhaps ISO Latin 1, as Andy suggested), or
there was some error encoding it as UTF-8. In UTF-8, Unicode characters
0x00-0x7F are encoded as single bytes identical to their ASCII values.
Unicode characters above 0x7F are encoded as multi-byte sequences, as shown below:
Unicode (hex)          UTF-8 byte sequence (binary)
-------------------    -----------------------------------
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The x's denote the bits of the character's Unicode number.
You can get more info on UTF-8 at
http://www.ietf.org/rfc/rfc2279.txt?number=2279.
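To make the table concrete, here is a minimal Java sketch (not part of the original thread) that checks whether a byte sequence is well-formed UTF-8 using the JDK's CharsetDecoder in strict mode. The class and method names are made up for illustration. A lone B6 has the bit pattern 10110110, which matches the 10xxxxxx continuation form and so cannot start a character, while the character U+00B6 itself is encoded as the two bytes C2 B6.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] lone = { (byte) 0xB6 };                               // continuation byte on its own
        byte[] encoded = "\u00B6".getBytes(StandardCharsets.UTF_8);  // U+00B6 encoded as UTF-8

        System.out.println(isValidUtf8(lone));     // false
        System.out.println(isValidUtf8(encoded));  // true
        System.out.printf("%02X %02X%n", encoded[0] & 0xFF, encoded[1] & 0xFF); // C2 B6
    }
}
```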
I hope that helps.
-----------------------------
Michael Glavassevich
mrglavas@engmail.uwaterloo.ca
4B Computer Engineering
University of Waterloo
---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org
Re: UTF-8 Encoding
Posted by Shekhar Karani <2k...@sun20.datamatics.com>.
The document encoding is "UTF-8"
<?xml version="1.0" encoding="UTF-8"?>
Shekhar
Re: UTF-8 Encoding
Posted by Andy Clark <an...@apache.org>.
Shekhar Karani wrote:
> I am using the xerces 2.2.1 to parse XML documents. One of the XML
> documents has a hex character B6. This character is being treated as an
> invalid UTF-8 character by the parser. The parser gives the error
> "Invalid byte 1 of UTF-8 byte stream". However, the editor XML SPY
> version 5, accepts this character.
What is the encoding of the document? If it is really
ISO Latin 1 ("ISO-8859-1") and you do NOT specify this
in the XML Declaration at the top of the document (e.g.
"<?xml version='1.0' encoding='...'?>"), then your
document is in error.
If XML Spy accepts it without the encoding declaration,
then it is not following the XML specification, which
dictates that the encoding of the document is assumed
to be UTF-8 in the absence of an encoding declaration.
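A small sketch of this point, assuming the JAXP parser bundled with a current JDK (it is derived from Xerces, but its exact error text and exception type may differ from Xerces 2.2.1). The class and method names are illustrative only. The same bytes, containing a single B6, are rejected when the document has no encoding declaration or declares UTF-8, and are accepted when it declares ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DeclaredEncodingDemo {
    /** Returns prolog + "<a>", the single byte 0xB6, "</a>" as raw bytes. */
    public static byte[] docWithB6(String prolog) {
        // Encoding the string as ISO-8859-1 turns U+00B6 into the one byte B6.
        return (prolog + "<a>\u00B6</a>").getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Parses the bytes as a byte stream; returns the root text, or null if parsing fails. */
    public static String parseRootText(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // fatal errors are still thrown, just not printed
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            return null; // malformed input surfaces as one of these, depending on the parser version
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String none = "";
        String utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        String latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

        System.out.println(parseRootText(docWithB6(none)) == null);             // true: UTF-8 assumed, B6 rejected
        System.out.println(parseRootText(docWithB6(utf8)) == null);             // true: declared UTF-8, B6 rejected
        System.out.println("\u00B6".equals(parseRootText(docWithB6(latin1))));  // true: B6 decoded as U+00B6
    }
}
```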
--
Andy Clark * andyc@apache.org
Re: UTF-8 Encoding
Posted by Shekhar Karani <2k...@sun20.datamatics.com>.
Thanks a lot guys. I will surely try this out.
Shekhar
Re: UTF-8 Encoding
Posted by Andy Clark <an...@apache.org>.
Michael Glavassevich wrote:
> If you absolutely cannot alter your input document, you can try setting
> your own character reader on the input source. This will force the parser
> to use your own reader. If you have an InputStream to the document you can
> easily get one for ISO-8859-1 using an InputStreamReader.
Michael is right. If you know the actual encoding of the
document, then you can follow this approach and it will
always work because the parser will not try to perform
any auto-detection. For example:
    InputStream stream = /* ... */;
    Reader reader = new InputStreamReader(stream, "ISO-8859-1");

    InputSource source = new InputSource(reader);
    // NOTE: Also set the system id so that the parser can
    //       resolve relative URIs.
However, in general, you should let the parser do the
auto-detection of the character encoding. But if you're
stuck in the situation where someone has given you a
document that is not well-formed because the specified
encoding is wrong, then use this method to work around
the problem.
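For completeness, a self-contained version of the workaround might look like the sketch below. It assumes the JAXP parser that ships with the JDK rather than Xerces 2.2.1 specifically, and the class name, method names, sample document and system id are all hypothetical. The sample is labelled UTF-8 but actually contains the Latin-1 byte B6. Because the parser is handed a Reader, it never decodes the bytes itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ReaderOverrideDemo {
    /** A document that declares UTF-8 but really contains the ISO-8859-1 byte B6. */
    public static byte[] mislabeledSample() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>\u00B6</a>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Parses the stream as ISO-8859-1 regardless of what its XML declaration says. */
    public static Document parseAsLatin1(InputStream stream, String systemId) {
        try {
            Reader reader = new InputStreamReader(stream, StandardCharsets.ISO_8859_1);
            InputSource source = new InputSource(reader);
            source.setSystemId(systemId); // lets the parser resolve relative URIs
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(mislabeledSample());
        // "file:///hypothetical/client.xml" is a placeholder system id.
        Document doc = parseAsLatin1(in, "file:///hypothetical/client.xml");
        String text = doc.getDocumentElement().getTextContent();
        System.out.println("\u00B6".equals(text)); // true
    }
}
```

The trade-off is the one described above: the declared encoding is bypassed for every document parsed this way, so this is best reserved for input known to be mislabelled.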
--
Andy Clark * andyc@apache.org
Re: UTF-8 Encoding
Posted by Michael Glavassevich <mr...@engmail.uwaterloo.ca>.
Hi Shekhar,
Setting the encoding on the input source allows the parser to skip encoding
auto-detection. However, once it reads the encoding from the XML declaration,
it will create a new character reader if the previous encoding (either
auto-detected or supplied by the user) differs from the one specified
in the document, so you don't gain anything by doing this.
You should try changing the value of the encoding in the actual document.
<?xml version="1.0" encoding="iso-8859-1"?>
If you absolutely cannot alter your input document, you can try setting
your own character reader on the input source. This will force the parser
to use your reader. If you have an InputStream for the document, you can
easily create a reader for ISO-8859-1 using an InputStreamReader.
-----------------------------
Michael Glavassevich
mrglavas@engmail.uwaterloo.ca
4B Computer Engineering
University of Waterloo
Re: UTF-8 Encoding
Posted by Shekhar Karani <2k...@sun20.datamatics.com>.
I tried setting the InputSource encoding to ISO-8859-1, but this did not work. The parser is still giving the same error.
I just thought I should clarify that I have an XML document given to me by a client that I need to parse. The XML document has its encoding set to UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
I need to parse this document, but the character with hex value B6 present in the XML document is not being accepted by the parser. I need to override the encoding set in the XML document through code, but setting the InputSource encoding to ISO-8859-1 is not doing the trick.
Thanks
Shekhar
RE: UTF-8 Encoding
Posted by Ragunath Marudhachalam <rm...@circuitvision.com>.
yes.
OutputFormat format = new OutputFormat(document); // document: the DOM Document to serialize
format.setEncoding("ISO-8859-1");
This will set the encoding to ISO-8859-1 instead of UTF-8. UTF-8 is the
default encoding that is set when you create a document without specifying any
encoding.
If you are not using serialization, then try setting the encoding on the
InputSource.
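As a rough illustration of the serialization side only, the sketch below uses the standard JAXP Transformer with OutputKeys.ENCODING instead of the Xerces OutputFormat class shown above, so that it runs on a plain JDK. That substitution, and the class and method names, are assumptions made for the example. Note that this controls how a DOM is written out. It does not change how an incoming document is decoded.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeLatin1Demo {
    /** Builds a tiny DOM: <a> containing the single character U+00B6. */
    public static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("a");
            root.setTextContent("\u00B6");
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the DOM using the requested output encoding. */
    public static byte[] serialize(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(sample(), "ISO-8859-1");
        // Decode with the same encoding that was requested, just to display the result.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```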
Ragu
CircuitVision
Re: UTF-8 Encoding
Posted by Shekhar Karani <2k...@sun20.datamatics.com>.
Will doing that in my code override the encoding declared in the XML document?
Shekhar
RE: UTF-8 Encoding
Posted by Ragunath Marudhachalam <rm...@circuitvision.com>.
Set the encoding to "ISO-8859-1".
Ragu
CircuitVision