Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2006/10/20 10:20:36 UTC

[jira] Commented: (NUTCH-387) host normalization in Generator$Selector

    [ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ] 
            
Otis Gospodnetic commented on NUTCH-387:
----------------------------------------

This indeed looks wrong.
My guess is that the new URL(....) line just needs to be removed, but I'm not sure, so I'll let somebody else make the actual change.

> host normalization in Generator$Selector
> ----------------------------------------
>
>                 Key: NUTCH-387
>                 URL: http://issues.apache.org/jira/browse/NUTCH-387
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>         Environment: nutch trunk since revision 449088
>            Reporter: Johannes Zillmann
>
> the host normalization in Generator$Selector#reduce at line 177 seems broken:
> String host = new URL(url.toString()).getHost();
> ...
> try {
>   host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>   host = new URL(host).getHost().toLowerCase();
> } catch (Exception e) {
>   LOG.warn("Malformed URL: '" + host + "', skipping");
> }
> With the default configuration, the basic normalizer is called, which does 'new URL(host)'.
> 'new URL(host)' is also called on the line below.
> Since url.getHost() always returns the host without a protocol, a MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
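
The failure described above can be reproduced outside Nutch with plain java.net.URL. The sketch below is only an illustration, not the Generator code: the class name HostNormalizationSketch and the helpers parsesAsUrl and hostOf are hypothetical, and hostOf shows just one possible way to get a lower-cased host (parse the full URL once, never re-parse the bare host). It does not involve URLNormalizers at all.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical stand-alone sketch; not part of Nutch.
class HostNormalizationSketch {

    // Returns true if the string can be parsed by new URL(String).
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            // A bare host such as "www.example.com" has no protocol and ends up here.
            return false;
        }
    }

    // Assumed alternative: take the host from the full URL and lower-case it,
    // without feeding the bare host back into new URL(...).
    static String hostOf(String fullUrl) {
        try {
            return new URL(fullUrl).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Malformed URL: '" + fullUrl + "'", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://WWW.Example.com/index.html";
        String host = hostOf(url);
        System.out.println("host: " + host);
        System.out.println("bare host parses as URL: " + parsesAsUrl(host));
        System.out.println("full URL parses as URL: " + parsesAsUrl(url));
    }
}
```

Because the bare host never parses, the catch block in the quoted code would run for every record, which matches the reporter's observation that the job continues but the normalization never takes effect.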
