Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2010/01/29 13:15:57 UTC

Character encodings on the web

Hi,

Interesting graph from Google about the relative usage of different
character encodings:

    http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

It's interesting to see that the Unicode entry only lists the UTF-8
encoding. Are the other Unicode encodings so infrequent?

I think we can use this data as a guideline when optimizing the
encoding detection code in Tika.
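
As a rough illustration of that idea (a minimal sketch only, not Tika's
actual detection code): since UTF-8 is the most common encoding in the
graph, a detector could try a strict UTF-8 decode first and only fall
back to other guesses when that fails. The class name
PriorOrderedGuesser and the single ISO-8859-1 fallback are hypothetical
choices made for this example; a real detector would go on to apply
statistical checks for the other legacy encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika.
final class PriorOrderedGuesser {

    // Returns true if the bytes decode in the given charset without any
    // malformed or unmappable sequences.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Try the most common web encoding (UTF-8) first. Anything that is
    // not valid UTF-8 falls through to ISO-8859-1, which is an assumed
    // placeholder here: it accepts every byte sequence, so it only
    // stands in for the more careful checks a real detector would do.
    static Charset guess(byte[] data) {
        if (decodesCleanly(data, StandardCharsets.UTF_8)) {
            return StandardCharsets.UTF_8;
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

Note that pure ASCII input is also valid UTF-8, so this sketch reports
UTF-8 for it.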

BR,

Jukka Zitting

Re: Character encodings on the web

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 29, 2010, at 4:15am, Jukka Zitting wrote:

> Hi,
>
> Interesting graph from Google about the relative usage of different
> character encodings:
>
>    http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
>
> It's interesting to see that the Unicode entry only lists the UTF-8
> encoding. Are the other Unicode encodings so infrequent?

Yes, they are infrequent. In fact, I've never seen a page encoded in
any of the other Unicode transformation formats (such as UTF-16 or
UTF-32).

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g