Posted to dev@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2010/11/06 20:03:22 UTC
Charset detection algorithm
Hi all,
See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue
I'm currently working on, which has to do with the charset detection
algorithm.
There's the HTML5 proposal, where the priority is:
- charset from Content-Type response header
- charset from HTML <meta http-equiv content-type> element
- charset detected from page contents
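As a rough illustration of that priority order, here is a minimal Python sketch. It is not Tika or Nutch code: the function names, the regular expressions, the 8192-byte scan limit, and the `detect` callback (a stand-in for whatever content-based detector is available) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical helpers for illustration only; not part of Tika or Nutch.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)
_META_RE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*>', re.IGNORECASE)


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lowercased."""
    if not value:
        return None
    match = _CHARSET_RE.search(value)
    return match.group(1).lower() if match else None


def charset_from_meta(html: bytes) -> Optional[str]:
    """Look for <meta http-equiv=content-type> near the start of the page."""
    match = _META_RE.search(html[:8192])  # assumed scan limit
    if not match:
        return None
    return charset_from_content_type(match.group(0).decode("ascii", "replace"))


def pick_charset(
    header_value: Optional[str],
    html: bytes,
    detect: Callable[[bytes], Optional[str]] = lambda data: None,
) -> Optional[str]:
    """Apply the three sources in priority order: header, meta, detection."""
    return (charset_from_content_type(header_value)
            or charset_from_meta(html)
            or detect(html))
```

Note that this sketch trusts the response header unconditionally whenever it names a charset, which is exactly the behaviour in question when servers send the wrong value.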
Reinhard Schwab proposed a variation on the HTML5 approach, which
makes sense to me; in my web crawling experience, too many servers lie
for us to blindly trust the response header contents.
I've got a slight modification to Reinhard's approach, as described in
a comment on the above issue:
https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832
I'm interested in comments.
Thanks!
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g