Posted to dev@nutch.apache.org by "Michael Nebel (JIRA)" <ji...@apache.org> on 2005/09/10 14:36:30 UTC

[jira] Created: (NUTCH-91) empty encoding causes exception

empty encoding causes exception
-------------------------------

         Key: NUTCH-91
         URL: http://issues.apache.org/jira/browse/NUTCH-91
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Michael Nebel


I found some sites where the header says: "Content-Type: text/html; charset=" (an empty charset value). This causes an exception in the HtmlParser. My suggestion:

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (revision 279397)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (working copy)
@@ -120,7 +120,7 @@
       byte[] contentInOctets = content.getContent();
       InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
       String encoding = StringUtil.parseCharacterEncoding(contentType);
-      if (encoding!=null) {
+      if (encoding!=null && !"".equals(encoding)) {
         metadata.put("OriginalCharEncoding", encoding);
         if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
           metadata.put("CharEncodingForConversion", encoding);
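
For illustration only, the same guard can be sketched as a standalone Java class. The class CharsetGuard and its methods parseCharset and usableCharset are hypothetical names and are not part of Nutch; parseCharset is a simplified stand-in for StringUtil.parseCharacterEncoding, whose real behaviour is not reproduced here. The sketch treats a missing, empty, or unsupported charset the same way: it returns null so the caller can fall back to a default.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not Nutch code: shows the "null or empty" guard in isolation.
final class CharsetGuard {

    // Simplified stand-in for StringUtil.parseCharacterEncoding (assumption):
    // returns the raw value after "charset=", which may be the empty string.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        int start = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + "charset=".length());
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.trim();
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1).trim();
        }
        return value;
    }

    // Returns a charset name that is safe to use, or null if the header
    // gives none, gives an empty one, or names an unsupported charset.
    static String usableCharset(String contentType) {
        String encoding = parseCharset(contentType);
        if (encoding == null || encoding.isEmpty()) {
            return null; // same condition as the patched if-statement
        }
        try {
            return Charset.isSupported(encoding) ? encoding : null;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass: syntactically invalid name.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(usableCharset("text/html; charset="));      // null
        System.out.println(usableCharset("text/html; charset=UTF-8")); // UTF-8
        System.out.println(usableCharset("text/html"));                // null
    }
}
```

Returning null for all three failure cases keeps the decision about a fallback encoding in the caller, which is roughly what the patched condition does by skipping the metadata.put calls.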


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Closed: (NUTCH-91) empty encoding causes exception

Posted by "Piotr Kosiorowski (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]
     
Piotr Kosiorowski closed NUTCH-91:
----------------------------------

    Fix Version: 0.7.2-dev
                 0.8-dev
     Resolution: Fixed

Committed with a small extension. Thanks.

> empty encoding causes exception
> -------------------------------
>
>          Key: NUTCH-91
>          URL: http://issues.apache.org/jira/browse/NUTCH-91
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Michael Nebel
>      Fix For: 0.7.2-dev, 0.8-dev

>
> I found some sites where the header says: "Content-Type: text/html; charset=" (an empty charset value). This causes an exception in the HtmlParser. My suggestion:
> Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> ===================================================================
> --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (revision 279397)
> +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (working copy)
> @@ -120,7 +120,7 @@
>        byte[] contentInOctets = content.getContent();
>        InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>        String encoding = StringUtil.parseCharacterEncoding(contentType);
> -      if (encoding!=null) {
> +      if (encoding!=null && !"".equals(encoding)) {
>          metadata.put("OriginalCharEncoding", encoding);
>          if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
>            metadata.put("CharEncodingForConversion", encoding);
