Posted to dev@nutch.apache.org by "Johannes Zillmann (JIRA)" <ji...@apache.org> on 2006/10/18 11:41:34 UTC
[jira] Created: (NUTCH-387) host normalization in Generator$Selector
host normalization in Generator$Selector
-----------------------------------------
Key: NUTCH-387
URL: http://issues.apache.org/jira/browse/NUTCH-387
Project: Nutch
Issue Type: Bug
Components: generator
Environment: nutch trunk since revision 449088
Reporter: Johannes Zillmann
The host normalization in Generator$Selector#reduce at line 177 seems broken:
String host = new URL(url.toString()).getHost();
...
try {
host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
host = new URL(host).getHost().toLowerCase();
} catch (Exception e) {
LOG.warn("Malformed URL: '" + host + "', skipping");
}
With the default configuration, the basic normalizer is called, and it does 'new URL(host)'.
The line below the normalize call also does 'new URL(host)'.
Since URL.getHost() always returns the host without the protocol, a MalformedURLException is always thrown.
The job continues as usual, though, because the exception is caught.
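The failure described above can be reproduced outside Nutch with java.net.URL alone. The following is a minimal sketch; the class name BareHostDemo, the helper parses, and the example URL are made up for illustration and are not part of the Nutch code.

```java
import java.net.MalformedURLException;
import java.net.URL;

class BareHostDemo {
    /** Returns true if java.net.URL accepts the string, false if it throws MalformedURLException. */
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // A string without a protocol (such as a bare host name) ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        // getHost() returns only the host part, without "http://".
        String host = new URL("http://www.example.com/index.html").getHost();
        System.out.println("host: " + host);
        // Feeding the bare host back into new URL(...) fails.
        System.out.println("bare host parses: " + parses(host));
        // Prefixing a protocol makes it parse again.
        System.out.println("with protocol parses: " + parses("http://" + host));
    }
}
```

This is the same pattern as in the reported snippet: a value obtained from getHost() is passed to code that expects a full URL.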
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-387) host normalization in Generator$Selector
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ]
Otis Gospodnetic commented on NUTCH-387:
----------------------------------------
This indeed looks wrong.
My guess is that the new URL(....) line just needs to be removed, but I'm not sure, so I'll let somebody else make the actual change.
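According to the report, the normalizer itself also parses its input as a URL, so dropping only the second line might still leave a bare host being passed to it. One possible alternative, sketched below, is to normalize the full URL string first and extract the host afterwards. This is only an illustration and not necessarily the change that was committed: the class HostCountSketch, the method hostKey, and the UnaryOperator stand-in for the Nutch normalizers are all hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.function.UnaryOperator;

class HostCountSketch {
    /**
     * Normalizes the complete URL (protocol included), then extracts the
     * lower-cased host. The normalizer argument is a hypothetical stand-in
     * for a call such as normalizers.normalize(url, scope).
     */
    static String hostKey(String url, UnaryOperator<String> normalizer)
            throws MalformedURLException {
        String normalized = normalizer.apply(url);
        // The input still carries its protocol, so parsing does not fail on a bare host.
        return new URL(normalized).getHost().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws MalformedURLException {
        // Identity normalizer, used here only to keep the sketch self-contained.
        System.out.println(hostKey("http://WWW.Example.COM/a/b.html", s -> s));
    }
}
```

With this ordering, a MalformedURLException would indicate a genuinely malformed URL rather than a side effect of the missing protocol.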
[jira] Closed: (NUTCH-387) host normalization in Generator$Selector
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-387?page=all ]
Andrzej Bialecki closed NUTCH-387.
-----------------------------------
Fix Version/s: 0.9.0
Resolution: Fixed
Assignee: Andrzej Bialecki
Fixed in rev. 470767 - thanks!