Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/11/03 12:56:18 UTC
[jira] Closed: (NUTCH-387) host normalization in Generator$Selector
[ http://issues.apache.org/jira/browse/NUTCH-387?page=all ]
Andrzej Bialecki closed NUTCH-387.
-----------------------------------
Fix Version/s: 0.9.0
Resolution: Fixed
Assignee: Andrzej Bialecki
Fixed in rev. 470767 - thanks!
> host normalization in Generator$Selector
> ----------------------------------------
>
> Key: NUTCH-387
> URL: http://issues.apache.org/jira/browse/NUTCH-387
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Environment: nutch trunk since revision 449088
> Reporter: Johannes Zillmann
> Assigned To: Andrzej Bialecki
> Fix For: 0.9.0
>
>
> The host normalization in Generator$Selector#reduce at line 177 seems broken:
>   String host = new URL(url.toString()).getHost();
>   ...
>   try {
>     host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>     host = new URL(host).getHost().toLowerCase();
>   } catch (Exception e) {
>     LOG.warn("Malformed URL: '" + host + "', skipping");
>   }
> With the default configuration the basic normalizer is called, which does 'new URL(host)'.
> The line below it also calls 'new URL(host)'.
> Since url.getHost() always returns the host without a protocol, a MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
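
The failure described in the report can be reproduced outside Nutch with plain java.net.URL. The sketch below is only an illustration: the class HostNormalizationDemo and its methods parses and normalizedHost are hypothetical names, not Nutch code, and prefixing "http://" before parsing is just one possible workaround, not necessarily what rev. 470767 does.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class HostNormalizationDemo {

    // Returns true if java.net.URL accepts the string,
    // false if the constructor throws MalformedURLException.
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Hypothetical workaround (an assumption, not the committed fix):
    // give the bare host a protocol before parsing it again, then lowercase it.
    // Returns null if the prefixed string still cannot be parsed.
    static String normalizedHost(String bareHost) {
        try {
            return new URL("http://" + bareHost).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        // Same first step as in the report: take the host from a full URL.
        String host = new URL("http://Www.Example.COM/index.html").getHost();

        // The bare host has no protocol, so 'new URL(host)' is rejected.
        System.out.println("bare host parses:     " + parses(host));
        // With a protocol in front, the same host is accepted.
        System.out.println("prefixed host parses: " + parses("http://" + host));
        System.out.println("normalized host:      " + normalizedHost(host));
    }
}
```

Because the original code catches Exception and only logs a warning, a rejected host leaves the variable at its un-normalized value, which is why the job continues without failing.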