Posted to dev@nutch.apache.org by "Johannes Zillmann (JIRA)" <ji...@apache.org> on 2006/10/18 11:41:34 UTC

[jira] Created: (NUTCH-387) host normalization in Generator$Selector

host normalization in Generator$Selector 
-----------------------------------------

                 Key: NUTCH-387
                 URL: http://issues.apache.org/jira/browse/NUTCH-387
             Project: Nutch
          Issue Type: Bug
          Components: generator
         Environment: nutch trunk since revision 449088
            Reporter: Johannes Zillmann


The host normalization in Generator$Selector#reduce at line 177 seems broken:

String host = new URL(url.toString()).getHost();
...
try {
    host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
    host = new URL(host).getHost().toLowerCase();
} catch (Exception e) {
    LOG.warn("Malformed URL: '" + host + "', skipping");
}

With the default configuration the basic normalizer is called, which itself does 'new URL(host)'.
The line after the normalize call also does 'new URL(host)'.
Since url.getHost() always returns the host without a protocol, a MalformedURLException is thrown every time.
The job continues as usual, though, because the exception is caught.
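
The failure described above can be reproduced outside Nutch with a minimal sketch. The class name HostParseDemo and the helper hostOf are hypothetical and exist only for this illustration; the only library call relied on is java.net.URL, which rejects a string that has no protocol.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class HostParseDemo {

    /** Returns the host of a URL string, or null if the string cannot be parsed as a URL. */
    static String hostOf(String s) {
        try {
            return new URL(s).getHost();
        } catch (MalformedURLException e) {
            // A bare host such as "www.example.com" ends up here: it has no protocol.
            return null;
        }
    }

    public static void main(String[] args) {
        // First step, as in the report: take the host of a full URL.
        String host = hostOf("http://www.example.com/index.html");
        System.out.println(host);
        // Second step: feed that bare host back into new URL(...), which fails.
        System.out.println(hostOf(host));
    }
}
```

Parsing the full URL yields the bare host, and parsing that bare host again fails, which matches the behaviour reported for the try block.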

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-387) host normalization in Generator$Selector

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ] 
            
Otis Gospodnetic commented on NUTCH-387:
----------------------------------------

This indeed looks wrong.
My guess is that the new URL(....) line just needs to be removed, but I'm not sure, so I'll let somebody else make the actual change.
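
One possible direction, sketched under assumptions and not taken from the actual fix: hand the normalizer a full URL and extract the host only afterwards, so that every 'new URL(...)' call sees a protocol. The class HostCountKey, the method normalizedHost, and the UnaryOperator parameter are hypothetical stand-ins; in Nutch the normalizer role would be played by normalizers.normalize(..., URLNormalizers.SCOPE_GENERATE_HOST_COUNT).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Nutch.
final class HostCountKey {

    /**
     * Normalizes the full URL first, then extracts the lowercased host.
     * Returns null if the normalized string is not a valid URL.
     * The normalizer argument is an assumed stand-in for the Nutch normalizer chain.
     */
    static String normalizedHost(String url, UnaryOperator<String> normalizer) {
        String normalized = normalizer.apply(url);
        try {
            return new URL(normalized).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            return null;
        }
    }
}
```

With an identity normalizer, a URL such as "http://WWW.Example.COM/a/b" gives the host in lower case, and the exception path is reached only for input that really is malformed.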


        

[jira] Closed: (NUTCH-387) host normalization in Generator$Selector

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-387?page=all ]

Andrzej Bialecki  closed NUTCH-387.
-----------------------------------

    Fix Version/s: 0.9.0
       Resolution: Fixed
         Assignee: Andrzej Bialecki 

Fixed in rev. 470767 - thanks!

