Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/07/18 14:50:57 UTC

[jira] [Issue Comment Edited] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

    [ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066982#comment-13066982 ] 

Julien Nioche edited comment on NUTCH-1044 at 7/18/11 12:50 PM:
----------------------------------------------------------------

I can confirm the issue. The solution is not straightforward and needs a bit of thinking.

{quote}
The new CrawlDatum's score isn't set anywhere after the creation so it's 1.0f as can be seen on the line 122 of CrawlDatum.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
{quote}

The score is set by the initialScore() method of the ScoringFilters. See line 81 of OPICScoringFilter, which sets it to 0 by default because it expects the score to be modified later, when the contributions from the inlinks are received.

There are several ways in which a URL can get a score:
* by specifying the parameter 'db.score.injected' when injecting (default value = 1.0)
* by passing it in the seed list for each individual URL as the value of the metadata key 'nutch.score'
* from inlinks (depends on the score of the source, the number of links, etc.)
* from redirection, which is currently broken

The default value of the score in CrawlDatum is 1.0 but this could be changed to 0.0. It also has a constructor 

{code}
CrawlDatum(int status, int fetchInterval, float score) 
{code}

which allows its score to be specified. This constructor is used by the Fetcher when redirects are refetched immediately; however, the calls to initialScore() currently reset the score to 0 straight away.

We should probably change initialScore() in OPICScoringFilter so that by default it leaves existing scores as they are, and change the default value in CrawlDatum to 0.0. Using the CrawlDatum constructor above in the Fetcher code, with the score of the source of the redirect, would then fix the issue.
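
To make the proposal concrete, here is a rough sketch of the idea. It deliberately does not use the real Nutch classes: Datum, initialScoreResetting(), initialScorePreserving() and redirectDatum() are hypothetical stand-ins invented for this illustration, and the actual CrawlDatum, OPICScoringFilter and Fetcher code may differ in detail.

```java
// Hypothetical stand-ins for illustration only; these are NOT the Nutch classes.
final class RedirectScoreSketch {

    /** Minimal stand-in for CrawlDatum: holds only a score, defaulting to 0.0f as proposed. */
    static final class Datum {
        private float score;

        Datum() { this(0.0f); }
        Datum(float score) { this.score = score; }

        float getScore() { return score; }
        void setScore(float score) { this.score = score; }
    }

    /** Behaviour described as current: the score is always reset to 0. */
    static void initialScoreResetting(Datum datum) {
        datum.setScore(0.0f);
    }

    /** Proposed behaviour: leave whatever score the datum already carries. */
    static void initialScorePreserving(Datum datum) {
        // Intentionally empty: an existing score is kept as it is.
    }

    /** What the fetcher could do on a redirect: create the new datum with the source's score. */
    static Datum redirectDatum(Datum source) {
        return new Datum(source.getScore());
    }

    public static void main(String[] args) {
        Datum source = new Datum(0.5f);

        Datum resetting = redirectDatum(source);
        initialScoreResetting(resetting);

        Datum preserving = redirectDatum(source);
        initialScorePreserving(preserving);

        System.out.println("resetting:  " + resetting.getScore());  // prints "resetting:  0.0"
        System.out.println("preserving: " + preserving.getScore()); // prints "preserving: 0.5"
    }
}
```

With the resetting variant the score carried over from the source of the redirect is lost, whereas the preserving variant keeps it, which is the effect the change is meant to have.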

I will need to look into this, make sure that it has no negative effect, and check in the code the cases where the redirection is obtained from a meta refresh tag.

Thanks for reporting it. 

> Redirected URLs and possibly all of their outlinked URLs have invalid scores.
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-1044
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1044
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, parser
>    Affects Versions: 1.3
>            Reporter: Nutch User - 1
>            Assignee: Julien Nioche
>            Priority: Critical
>             Fix For: 1.4
>
>
> 1.: http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
> 2.: http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
> Please note that URLs redirected by meta refresh also have invalid scores. For such URLs a CrawlDatum is created on lines 157-177 of ParseOutputFormat.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup). The new CrawlDatum's score isn't set anywhere after creation, so it is 1.0f, as can be seen on line 122 of CrawlDatum.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
> It is a separate question whether the redirected URL's score should simply be passed to the new URL, or whether the redirection should be considered as a link, in which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' + 1).
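
The two options raised in the issue description can be compared with a small piece of arithmetic. This is only an illustration: the class and method names are made up for this note and are not part of Nutch, and the input values are arbitrary.

```java
// Illustrative arithmetic only; the class and method names are made up for this note.
final class RedirectScoreOptions {

    /** Option 1: the redirect target simply inherits the source's score. */
    static float passThrough(float originalScore) {
        return originalScore;
    }

    /** Option 2: the redirect counts as one extra link next to the page's outlinks. */
    static float asExtraLink(float originalScore, int numberOfOutlinks) {
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        float originalScore = 1.0f; // arbitrary example value
        int numberOfOutlinks = 3;   // arbitrary example value

        System.out.println(passThrough(originalScore));                   // prints 1.0
        System.out.println(asExtraLink(originalScore, numberOfOutlinks)); // prints 0.25
    }
}
```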
