Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/11/03 12:56:18 UTC

[jira] Closed: (NUTCH-387) host normalization in Generator$Selector

     [ http://issues.apache.org/jira/browse/NUTCH-387?page=all ]

Andrzej Bialecki  closed NUTCH-387.
-----------------------------------

    Fix Version/s: 0.9.0
       Resolution: Fixed
         Assignee: Andrzej Bialecki 

Fixed in rev. 470767 - thanks!

> host normalization in Generator$Selector
> ----------------------------------------
>
>                 Key: NUTCH-387
>                 URL: http://issues.apache.org/jira/browse/NUTCH-387
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>         Environment: nutch trunk since revision 449088
>            Reporter: Johannes Zillmann
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>
> the host normalization in Generator$Selector#reduce at line 177 seems broken:
> String host = new URL(url.toString()).getHost();
> ...
> try {
>     host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>     host = new URL(host).getHost().toLowerCase();
> } catch (Exception e) {
>     LOG.warn("Malformed URL: '" + host + "', skipping");
> }
> With the default configuration the basic normalizer is called, which does 'new URL(host)'.
> The line below it also calls 'new URL(host)'.
> Since URL#getHost() always returns the host without a protocol, a MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
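
The failure described above can be reproduced outside Nutch with plain java.net.URL. The following is a minimal sketch, not the change committed in rev. 470767: the class name HostNormalizationDemo and the helpers parsesAsUrl and normalizedHost are hypothetical, and wrapping the bare host in an "http://" URL is only one assumed way to hand a parseable string to a URL-based normalizer.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class HostNormalizationDemo {

    // Returns true if java.net.URL accepts the string as an absolute URL.
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            // A bare host such as "www.example.com" ends up here ("no protocol").
            return false;
        }
    }

    // Hypothetical workaround (an assumption, not the actual Nutch fix):
    // re-attach a scheme to the bare host before parsing it a second time.
    static String normalizedHost(String url) throws MalformedURLException {
        String host = new URL(url).getHost();
        // In Nutch, a URL-based normalizer would be applied to this wrapped form.
        URL wrapped = new URL("http://" + host + "/");
        return wrapped.getHost().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws Exception {
        String host = new URL("http://WWW.Example.COM/index.html").getHost();
        System.out.println("bare host parses as URL: " + parsesAsUrl(host));
        System.out.println("normalized host: "
                + normalizedHost("http://WWW.Example.COM/index.html"));
    }
}
```

Because getHost() drops the scheme, passing its result straight back to 'new URL(...)' fails for every input, which matches the behaviour reported in the issue.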

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira