Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2006/10/20 10:20:36 UTC
[jira] Commented: (NUTCH-387) host normalization in Generator$Selector
[ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ]
Otis Gospodnetic commented on NUTCH-387:
----------------------------------------
This indeed looks wrong.
My guess is that the new URL(....) line just needs to be removed, but I'm not sure, so I'll let somebody else make the actual change.
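A minimal sketch of what dropping that line could look like, using only java.net.URL. The class and method names here are hypothetical, and the Nutch normalizer call is deliberately left out, since whether it can accept a bare host depends on the configured normalizers. This is an illustration of the idea, not the actual patch:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: derive a lower-cased host key
// from a full URL with a single parse, instead of re-parsing the host.
class HostKeySketch {
    /** Returns the lower-cased host of the URL, or null if it cannot be parsed. */
    static String hostKey(String url) {
        try {
            // Parse the full URL once; getHost() drops protocol, port and path.
            return new URL(url).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            // The caller decides whether to log and skip the entry.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostKey("http://WWW.Example.COM/index.html"));
    }
}
```

The point is only that the host is already available after the first parse, so lower-casing it needs no second URL object.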
> host normalization in Generator$Selector
> ----------------------------------------
>
> Key: NUTCH-387
> URL: http://issues.apache.org/jira/browse/NUTCH-387
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Environment: nutch trunk since revision 449088
> Reporter: Johannes Zillmann
>
> The host normalization in Generator$Selector#reduce at line 177 seems broken:
>   String host = new URL(url.toString()).getHost();
>   ...
>   try {
>     host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>     host = new URL(host).getHost().toLowerCase();
>   } catch (Exception e) {
>     LOG.warn("Malformed URL: '" + host + "', skipping");
>   }
> With the default configuration the basic normalizer is called, which does 'new URL(host)'.
> The line below the normalize call also does 'new URL(host)'.
> Since url.getHost() always returns the host without the protocol, a MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
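The failure described in the report can be reproduced outside Nutch with plain java.net.URL. The following is a small sketch under that assumption; the class name and the example URL are made up for illustration, and no Nutch classes are involved:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class: shows that the result of getHost() is not
// itself a string that java.net.URL will parse.
class HostOnlyUrlDemo {
    /** Returns true if java.net.URL accepts the string, false if it throws. */
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // A string without a protocol prefix ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        // First parse succeeds: the input is a complete URL.
        String host = new URL("http://www.example.com/index.html").getHost();
        // Second parse is attempted on the bare host, as in the reported code.
        System.out.println("host = " + host + ", parses again = " + parses(host));
    }
}
```

Because the bare host carries no protocol, the second parse is rejected, which matches the warning being logged for every entry while the job carries on.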