Posted to j-users@xerces.apache.org by Michael Glavassevich <mr...@ca.ibm.com> on 2009/04/24 01:23:23 UTC

Re: doubt about utf8 and characters method in DefaultHandler (SaxParser)

Hi Raimon,

Raimon Bosch <ra...@gmail.com> wrote on 04/23/2009 06:59:42 PM:

> I see that the characters method always interprets the characters as
> 16-bit characters, because it takes an array of type char. How does Xerces
> manage non-16-bit characters? For example, in UTF-8 there are a lot of
> characters between 16 and 32 bits.
>
> If I find a char outside the 16-bit UTF-8 range, can I assume that it
> is not a UTF-8 character?

UTF-8 and UTF-16 are character encodings [1], representing the characters
defined by Unicode as sequences of bytes. These encodings have a
representation for every character in Unicode. Like any of the other
encodings they're decoded into Java chars on input so it's all the same to
the consumer of the SAX API regardless of what the document's encoding was.
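
As a minimal sketch of this, assuming the JAXP parser bundled with the JDK
(the names EncodingDemo and parseText are made up for illustration): the same
document is serialized once as UTF-8 and once as UTF-16, and the characters
callback should receive the same Java chars either way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EncodingDemo {
    // Hypothetical helper: encodes the document with the given charset,
    // parses the bytes with SAX, and returns all text passed to characters().
    static String parseText(String xmlBody, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>" + xmlBody;
        byte[] bytes = xml.getBytes(charset);
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser has already decoded the bytes into chars here.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(bytes), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String body = "<doc>caf\u00E9 \u20AC</doc>"; // e-acute and the euro sign
        String fromUtf8 = parseText(body, StandardCharsets.UTF_8);
        String fromUtf16 = parseText(body, StandardCharsets.UTF_16);
        System.out.println("same chars from both encodings: " + fromUtf8.equals(fromUtf16));
    }
}
```

The handler never sees the document's bytes, only the decoded chars, so it
does not need to know which encoding the document used.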

Thanks.

[1] http://en.wikipedia.org/wiki/Character_encoding

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: doubt about utf8 and characters method in DefaultHandler (SaxParser)

Posted by Nathan Beyer <nd...@apache.org>.
I like to break characters into two concepts: character sets and
character encodings.

Unicode is a character set.
UTF-8, UTF-16, etc. are encodings of the Unicode set.

ISO-8859-1 is a character set and a character encoding.
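
A small Java sketch of that distinction, using only java.nio.charset (the
class name CharsetDemo and the helper hex are made up for illustration): one
Unicode character is turned into bytes by three encodings, and ISO-8859-1
cannot encode a character that is outside its own character set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Hypothetical helper: the bytes of `text` in the given encoding, as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // True if the charset has a byte representation for the character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // one character, code point U+00E9
        System.out.println("UTF-8:      " + hex(eAcute, StandardCharsets.UTF_8));      // C3 A9
        System.out.println("UTF-16BE:   " + hex(eAcute, StandardCharsets.UTF_16BE));   // 00 E9
        System.out.println("ISO-8859-1: " + hex(eAcute, StandardCharsets.ISO_8859_1)); // E9

        // The euro sign (U+20AC) is in Unicode but not in ISO-8859-1.
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, '\u20AC')); // false
        System.out.println(canEncode(StandardCharsets.UTF_8, '\u20AC'));      // true
    }
}
```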




Re: doubt about utf8 and characters method in DefaultHandler (SaxParser)

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
keshlam@us.ibm.com wrote on 04/23/2009 08:02:18 PM:

> > UTF-8 and UTF-16 are character encodings [1], representing the
> > characters defined by Unicode as sequences of bytes. These encodings
> > have a representation for every character in Unicode. Like any of
> > the other encodings they're decoded into Java chars on input so it's
> > all the same to the consumer of the SAX API regardless of what the
> > document's encoding was.
>
> More specifically: Characters too large to represent in a single Java
> char will take two chars; that's how UTF-16 works. (UTF-8 is
> similar, except that it takes one, two, or three bytes

or four bytes (0x10000 - 0x10FFFF).

> to cover the
> same range of values rather than UTF-16's two or four.)
>
> Yes, this means that full Unicode string manipulation in Java is
> more complex than just moving individual chars around. Luckily, most
> alphabetic languages fit in the Basic Multilingual Plane (U+0000 to
> U+FFFF), where every character is a single char. (The code units
> 0xD800 to 0xDFFF are reserved as surrogates, which are used in pairs
> to represent characters above U+FFFF.)
>
> Note that this is general Java behavior, nothing unique to Xerces.
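
The byte counts discussed above can be sketched in Java as follows. This is
only an illustration: the class Utf8LengthDemo and its methods are made-up
names, and utf8Length assumes it is given a valid code point that is not a
surrogate.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Bytes UTF-8 needs for one code point, by range.
    // Assumes a valid, non-surrogate code point (not checked here).
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4; // 0x10000 - 0x10FFFF
    }

    // Bytes UTF-16 needs: one 16-bit char, or two chars (a surrogate pair).
    static int utf16Length(int codePoint) {
        return 2 * Character.charCount(codePoint);
    }

    // Cross-check against the JDK's own UTF-8 encoder.
    static int jdkUtf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1D11E};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes (JDK: %d), UTF-16 %d bytes%n",
                    cp, utf8Length(cp), jdkUtf8Length(cp), utf16Length(cp));
        }
    }
}
```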


Re: doubt about utf8 and characters method in DefaultHandler (SaxParser)

Posted by ke...@us.ibm.com.
> UTF-8 and UTF-16 are character encodings [1], representing the 
> characters defined by Unicode as sequences of bytes. These encodings
> have a representation for every character in Unicode. Like any of 
> the other encodings they're decoded into Java chars on input so it's
> all the same to the consumer of the SAX API regardless of what the 
> document's encoding was.

More specifically: Characters too large to represent in a single Java char
will take two chars; that's how UTF-16 works. (UTF-8 is similar, except
that it takes one, two, or three bytes to cover the same range of values
rather than UTF-16's two or four.)

Yes, this means that full Unicode string manipulation in Java is more
complex than just moving individual chars around. Luckily, most
alphabetic languages fit in the Basic Multilingual Plane (U+0000 to
U+FFFF), where every character is a single char. (The code units 0xD800
to 0xDFFF are reserved as surrogates, which are used in pairs to
represent characters above U+FFFF.)

Note that this is general Java behavior, nothing unique to Xerces.
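
A short sketch of what "two chars" looks like with the standard
java.lang.Character methods. The class SurrogateDemo and the method
countCodePoints are made-up names; the parameters simply mirror the
(char[] ch, int start, int length) shape of the SAX characters callback.

```java
class SurrogateDemo {
    // Counts Unicode characters (code points) in a slice of a char array,
    // treating each surrogate pair as one character.
    static int countCodePoints(char[] ch, int start, int length) {
        return Character.codePointCount(ch, start, length);
    }

    public static void main(String[] args) {
        // 'A' followed by U+1D11E (musical G clef), which needs two chars.
        char[] ch = "A\uD834\uDD1E".toCharArray();

        System.out.println(ch.length);                         // 3 chars
        System.out.println(countCodePoints(ch, 0, ch.length)); // 2 characters

        // The two chars of the pair combine into a single code point.
        if (Character.isHighSurrogate(ch[1]) && Character.isLowSurrogate(ch[2])) {
            int codePoint = Character.toCodePoint(ch[1], ch[2]);
            System.out.printf("U+%X%n", codePoint);            // U+1D11E
        }
    }
}
```

Code that only copies or appends the chars it receives does not need any of
this; it matters when counting, slicing, or inspecting individual characters.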