Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/09/22 23:46:24 UTC

[jira] Closed: (NUTCH-332) doubling score caused by page internal anchors.

     [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Andrzej Bialecki  closed NUTCH-332.
-----------------------------------

    Fix Version/s: 0.8.1
       Resolution: Fixed

Patch applied to branch-0.8 and to trunk. Thanks!

> doubling score caused by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.8.1, 0.9.0
>
>         Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks to other pages but several links to itself (for example, a set of page-internal anchors), its score is distributed to its outlinks. Since all of these outlinks point back to the page itself, the page's score is doubled.
> I'm not sure, but this may also cause a never-ending fetch loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
> So the fetched status may be overwritten with unfetched.
> In that case we fetch the page again every time and also double its score every time, which produces very high scores for no reason.
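
The doubling described in the report can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions: the class and method names (ScoreDoublingSketch, updatedScore, stripFragment) are hypothetical and are not Nutch code, the score is assumed to be split evenly across outlinks and added on top of the page's existing score, and the skipSelfLinks flag shows one possible way to avoid the effect, not necessarily what scoreDoubling.patch does.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only; not Nutch code. All names here are hypothetical.
class ScoreDoublingSketch {

    /**
     * Returns the page's score after one update cycle in a simplified model:
     * the score is split evenly across the outlinks, and every share whose
     * target is the page itself is added back on top of the old score.
     */
    static double updatedScore(String pageUrl, double score,
                               List<String> outlinks, boolean skipSelfLinks) {
        if (outlinks.isEmpty()) {
            return score;
        }
        double share = score / outlinks.size();
        double updated = score;
        for (String target : outlinks) {
            // A page-internal anchor differs from the page URL only by its fragment.
            boolean selfLink = stripFragment(target).equals(pageUrl);
            if (selfLink && !skipSelfLinks) {
                updated += share; // the page receives its own contribution back
            }
        }
        return updated;
    }

    /** Drops the "#fragment" part of a URL, if there is one. */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String page = "http://example.com/faq.html";
        List<String> anchors = Arrays.asList(
                page + "#q1", page + "#q2", page + "#q3", page + "#q4");

        // Every outlink is a self-link, so the whole score flows back: prints 2.0
        System.out.println(updatedScore(page, 1.0, anchors, false));

        // Ignoring self-links leaves the score unchanged: prints 1.0
        System.out.println(updatedScore(page, 1.0, anchors, true));
    }
}
```

In this model, repeating the update cycle without skipping self-links doubles the score each time, which matches the runaway scores the reporter describes.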

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira