Posted to dev@nutch.apache.org by "Laurent Hervaud (JIRA)" <ji...@apache.org> on 2017/09/19 10:54:01 UTC
[jira] [Created] (NUTCH-2421) parse-html to prioritize HTML5 charset definitions
Laurent Hervaud created NUTCH-2421:
--------------------------------------
Summary: parse-html to prioritize HTML5 charset definitions
Key: NUTCH-2421
URL: https://issues.apache.org/jira/browse/NUTCH-2421
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: Laurent Hervaud
Priority: Minor
NUTCH-1733 added support for HTML5 charset definitions.
In some cases, a website declares multiple meta elements with different charsets:
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
(example: http://www.edga.fr/)
In this case, the second charset (iso-8859-1) is the one detected.
Should parse-html give priority to the HTML5 charset definition instead?
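
To illustrate the proposed priority, here is a minimal sketch. It is not the actual parse-html code: the class MetaCharsetSniffer, its sniff method, and both regular expressions are hypothetical and written only for this example. It checks for the HTML5 form first and falls back to the http-equiv form only when no HTML5 declaration is found, regardless of the order in which the two appear in the page.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
class MetaCharsetSniffer {

  // HTML5 form: <meta charset="utf-8">
  // Simplification: assumes charset is the first attribute of the tag.
  private static final Pattern HTML5_CHARSET = Pattern.compile(
      "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  // Legacy form:
  // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  // [^>]* keeps the match inside a single tag.
  private static final Pattern HTTP_EQUIV_CHARSET = Pattern.compile(
      "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?"
          + "[^>]*charset\\s*=\\s*([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  // Returns the declared charset in lower case, or null if none is found.
  // The HTML5 declaration wins whenever it is present.
  static String sniff(String htmlHead) {
    Matcher html5 = HTML5_CHARSET.matcher(htmlHead);
    if (html5.find()) {
      return html5.group(1).toLowerCase(Locale.ROOT);
    }
    Matcher legacy = HTTP_EQUIV_CHARSET.matcher(htmlHead);
    if (legacy.find()) {
      return legacy.group(1).toLowerCase(Locale.ROOT);
    }
    return null;
  }
}
```

With the two meta elements quoted above, this sketch returns utf-8 rather than iso-8859-1. A real implementation would also have to handle attributes in other orders and content supplied as raw bytes.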
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)