Posted to j-users@xerces.apache.org by Takumi Fujiwara <tr...@yahoo.com> on 2002/08/26 17:46:01 UTC

Default decoding in Neko

Hi, 
Could someone please tell me why the default encoding in Neko is Windows-1252 instead of UTF-8? I want to parse pages like yahoo.jp, yahoo.co.jp, dk.yahoo.com, it.yahoo.com, and hk.yahoo.com.
I know I can change it if I want; I just want to understand why "Windows-1252" was chosen instead of UTF-8.
Thank you.
Sam




Re: Default decoding in Neko

Posted by Andy Clark <an...@apache.org>.

Takumi Fujiwara wrote:
> I know I can change it if I want, I just want to understand why 
> "Windows-1252" is chosen instead of UTF-8?

If the default were UTF-8, the reader would throw
an exception on many pages. Almost any page that
contains an ISO Latin 1 character above the ASCII
range would make the UTF-8 reader fail.

ISO Latin 1 and Windows-1252 are safe defaults
because every possible byte is accepted. The UTF-8
reader, however, expects any byte with the high bit
set to be part of a well-formed multi-byte UTF-8
sequence, and when it is not, it throws an
exception.
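
To make the failure mode concrete, here is a minimal
sketch using the JDK's own charset decoders rather
than the reader classes inside Xerces or NekoHTML.
The class and method names (StrictDecodeDemo,
decodeStrict, decodes) are made up for illustration.
The decoder is configured to report malformed input,
which is an assumption standing in for the strict
behaviour described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of NekoHTML or Xerces.
final class StrictDecodeDemo {

    /** Decodes strictly: malformed input throws instead of being replaced. */
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Returns true if the bytes decode cleanly in the given charset. */
    static boolean decodes(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" encoded as ISO Latin 1: 'é' is the single byte 0xE9.
        byte[] latin1Page = {'c', 'a', 'f', (byte) 0xE9};

        // In UTF-8, 0xE9 announces a three-byte sequence that never arrives.
        System.out.println(decodes(latin1Page, StandardCharsets.UTF_8));          // false
        System.out.println(decodes(latin1Page, StandardCharsets.ISO_8859_1));     // true
        System.out.println(decodes(latin1Page, Charset.forName("windows-1252"))); // true
    }
}
```

The same four bytes are rejected by the strict UTF-8
decoder and accepted by both single-byte encodings.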

As long as the page specifies its encoding using
the http-equiv meta tag, then NekoHTML will change
to the correct reader and everything will be fine.
But, if it does *not*, then we need to use a "safe"
encoding. Therefore, I chose Windows-1252.
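
The detect-then-fall-back idea can be sketched
roughly as follows. This is not NekoHTML's actual
code: the class EncodingSniffer, the 2048-byte
prescan limit, and the regular expression are all
assumptions chosen to keep the example short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; illustrates the idea only.
final class EncodingSniffer {

    // Fallback used when the page declares nothing usable.
    static final String SAFE_DEFAULT = "windows-1252";

    // Finds charset=... in a <meta http-equiv="Content-Type" ...> tag.
    // Simplification: assumes http-equiv appears before the charset.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?"
            + "[^>]*charset\\s*=\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    static Charset sniff(byte[] page) {
        // ISO Latin 1 maps every byte to a character, so this
        // prescan cannot fail whatever the real encoding is.
        int n = Math.min(page.length, 2048);
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);

        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the default.
            }
        }
        return Charset.forName(SAFE_DEFAULT);
    }

    public static void main(String[] args) {
        byte[] declared = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = "<html><body>hello</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(sniff(declared));   // UTF-8
        System.out.println(sniff(undeclared)); // windows-1252
    }
}
```

A page that declares its encoding gets the matching
reader. Anything else is read with the single-byte
default, which may show the wrong characters but
will not abort the parse.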

-- 
Andy Clark * andyc@apache.org
