Posted to dev@nutch.apache.org by "Laurent Hervaud (JIRA)" <ji...@apache.org> on 2017/09/19 10:54:01 UTC

[jira] [Created] (NUTCH-2421) parse-html to prioritize HTML5 charset definitions

Laurent Hervaud created NUTCH-2421:
--------------------------------------

             Summary: parse-html to prioritize HTML5 charset definitions
                 Key: NUTCH-2421
                 URL: https://issues.apache.org/jira/browse/NUTCH-2421
             Project: Nutch
          Issue Type: Improvement
          Components: parser
            Reporter: Laurent Hervaud
            Priority: Minor


NUTCH-1733 added support for HTML5 charset definitions.
In some cases, web sites declare multiple meta elements with different charsets:
    <meta charset="utf-8">
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
(e.g. http://www.edga.fr/)
In this case the second charset (iso-8859-1) is the one detected.
Could parse-html give priority to the HTML5 charset definition instead?
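
The prioritization asked about above could look roughly like the following Java sketch. It is not the parse-html code: the class name MetaCharsetSniffer, the method sniff, and both regular expressions are hypothetical and only illustrate checking the HTML5 form before the http-equiv form, regardless of their order in the page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: picks a charset from meta tags. */
public class MetaCharsetSniffer {

  // HTML5 form: <meta charset="utf-8">
  private static final Pattern HTML5_META = Pattern.compile(
      "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  // Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
  private static final Pattern HTTP_EQUIV_META = Pattern.compile(
      "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?"
          + "[^>]*charset\\s*=\\s*([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  /**
   * Returns the declared charset name, preferring the HTML5 declaration
   * over the http-equiv one, or null if neither is present.
   */
  public static String sniff(String head) {
    // Look for the HTML5 declaration first, wherever it appears.
    Matcher m = HTML5_META.matcher(head);
    if (m.find()) {
      return m.group(1);
    }
    // Fall back to the http-equiv declaration only if no HTML5 one exists.
    m = HTTP_EQUIV_META.matcher(head);
    if (m.find()) {
      return m.group(1);
    }
    return null;
  }
}
```

With the two meta elements quoted above, sniff would return utf-8; a page carrying only the http-equiv element would still yield iso-8859-1.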




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)