Posted to j-users@xerces.apache.org by Nathan Beyer <nd...@apache.org> on 2009/04/24 02:02:21 UTC

Re: doubt about utf8 and characters method in DefaultHandler (SaxParser)

I like to break this down into two concepts: character sets and
character encodings.

Unicode is a character set.
UTF-8, UTF-16, etc. are encodings of the Unicode character set.

ISO-8859-1 is both a character set and a character encoding.
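
As a rough illustration of the distinction, here is a minimal sketch that
takes one Unicode character (U+00E9) and prints the bytes that three
different encodings produce for it. The class name EncodingDemo and the
helper hex() are made up for this example; the charset constants come from
java.nio.charset.StandardCharsets (Java 7 and later).

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One character from the Unicode character set:
        // U+00E9 LATIN SMALL LETTER E WITH ACUTE.
        String s = "\u00E9";

        // The character (code point) is the same in every case;
        // only the byte sequence chosen by the encoding differs.
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A9
        System.out.println("UTF-16BE:   " + hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 E9
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E9
    }
}
```

The single ISO-8859-1 byte happens to equal the Unicode code point because
the first 256 Unicode code points match ISO-8859-1.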


On Thu, Apr 23, 2009 at 6:23 PM, Michael Glavassevich
<mr...@ca.ibm.com> wrote:
> Hi Raimon,
>
> Raimon Bosch <ra...@gmail.com> wrote on 04/23/2009 06:59:42 PM:
>
>> I see that the characters method always interprets the characters as
>> 16-bit characters, because it takes an array of type char. How does
>> Xerces manage the non-16-bit characters? For example, in UTF-8 there are
>> a lot of characters between 16 and 32 bits.
>>
>> If I find a char outside the 16-bit UTF-8 range, can I assume that it is
>> not a UTF-8 character?
>
> UTF-8 and UTF-16 are character encodings [1], representing the characters
> defined by Unicode as sequences of bytes. These encodings have a
> representation for every character in Unicode. Like any of the other
> encodings, they're decoded into Java chars on input, so it's all the same to
> the consumer of the SAX API regardless of what the document's encoding was.
>
> Thanks.
>
> [1] http://en.wikipedia.org/wiki/Character_encoding
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
>
>
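
To make the quoted point about decoding concrete, here is a minimal sketch
of what a handler sees for a character outside the 16-bit range. A Java
char is a UTF-16 code unit, so such a character arrives in characters() as
a surrogate pair (two chars) no matter how the document was encoded. The
class names CodePointDemo and TextCollector are made up for this example,
and it assumes the parser returned by SAXParserFactory.newInstance(), which
may be the JDK's bundled parser rather than Apache Xerces itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class CodePointDemo {
    // Hypothetical handler: accumulates all text, since a parser is free
    // to deliver character data in more than one characters() call.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses the given XML bytes and returns the collected character data.
    static String parseText(byte[] xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
        // In Java source it is written as the surrogate pair D834 DD1E;
        // in the UTF-8 document it becomes a four-byte sequence.
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>\uD834\uDD1E</a>";
        String text = parseText(doc.getBytes(StandardCharsets.UTF_8));

        System.out.println("chars:       " + text.length());                          // 2
        System.out.println("code points: " + text.codePointCount(0, text.length()));  // 1
        System.out.println("code point:  U+"
                + Integer.toHexString(text.codePointAt(0)).toUpperCase());             // U+1D11E
    }
}
```

So a char value in the surrogate range (D800 to DFFF) does not mean the
input was not UTF-8; it is half of a character above U+FFFF, and methods
such as codePointAt() put the two halves back together.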
