Posted to dev@nutch.apache.org by "Piotr Kosiorowski (JIRA)" <ji...@apache.org> on 2006/03/09 22:18:39 UTC

[jira] Closed: (NUTCH-91) empty encoding causes exception

     [ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]
     
Piotr Kosiorowski closed NUTCH-91:
----------------------------------

    Fix Version: 0.7.2-dev
                 0.8-dev
     Resolution: Fixed

Committed with a small extension. Thanks.

> empty encoding causes exception
> -------------------------------
>
>          Key: NUTCH-91
>          URL: http://issues.apache.org/jira/browse/NUTCH-91
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Michael Nebel
>      Fix For: 0.7.2-dev, 0.8-dev

>
> I found some sites where the header says: "Content-Type: text/html; charset=". This causes an exception in the HtmlParser. My suggestion:
> Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> ===================================================================
> --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (revision 279397)
> +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (working copy)
> @@ -120,7 +120,7 @@
>        byte[] contentInOctets = content.getContent();
>        InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>        String encoding = StringUtil.parseCharacterEncoding(contentType);
> -      if (encoding!=null) {
> +      if (encoding!=null && !"".equals(encoding)) {
>          metadata.put("OriginalCharEncoding", encoding);
>          if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
>            metadata.put("CharEncodingForConversion", encoding);
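
For illustration only, the guard in the patch above can be reproduced outside Nutch. The sketch below is not the Nutch code: CharsetGuard and parseCharset are hypothetical names standing in for StringUtil.parseCharacterEncoding, whose actual behaviour is not shown in this message. It assumes a simple ";"-separated Content-Type value and reports an empty "charset=" as absent (null), so a caller only needs the null check.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: extracts the charset parameter
// from a Content-Type header value and treats an empty value as missing.
final class CharsetGuard {

    private static final String KEY = "charset=";

    private CharsetGuard() {
    }

    // Returns the charset name, or null when the header is null,
    // has no charset parameter, or has an empty one ("charset=").
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith(KEY)) {
                String value = p.substring(KEY.length()).trim();
                // Drop optional surrounding double quotes: charset="utf-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1).trim();
                }
                // Same intent as the patched condition:
                // encoding != null && !"".equals(encoding)
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

With this sketch, CharsetGuard.parseCharset("text/html; charset=") returns null, while "text/html; charset=UTF-8" yields "UTF-8". Normalizing the empty value in the parser is a design alternative to checking for "" at the call site, as the patch does; which one Nutch ended up with is not stated here.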

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira