Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/07/28 08:16:14 UTC

[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.

     [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Stefan Groschupf updated NUTCH-332:
-----------------------------------

    Attachment: scoreDoubling.patch

A patch to solve this problem. 

This is an example page:
http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several internal anchors, which cause the problem in this case.

What happens is:
foo.com/a.html points to foo.com/a.html#chapter1.
We normalize foo.com/a.html#chapter1 to:
foo.com/a.html

As a result, foo.com/a.html contributes its entire score to itself.
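
The effect described above can be illustrated with a small sketch. This is not the attached patch and does not use any Nutch classes; the class SelfLinkFilter and the methods stripFragment and distribute are hypothetical names chosen for illustration, and the fragment stripping is a simplified stand-in for the real URL normalization. The idea it shows is only this: after normalization, drop any outlink that resolves to the page itself before distributing the score.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch and not the attached patch.
final class SelfLinkFilter {

    // Simplified stand-in for URL normalization: drop everything from '#' onward.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Split `score` evenly across the outlinks that do not point back to `fromUrl`.
    // Links that normalize to the source page are skipped, so the page never
    // contributes score to itself.
    static Map<String, Float> distribute(String fromUrl, List<String> outlinks, float score) {
        String self = stripFragment(fromUrl);
        List<String> targets = new ArrayList<>();
        for (String link : outlinks) {
            String normalized = stripFragment(link);
            if (!normalized.equals(self)) {
                targets.add(normalized);
            }
        }
        Map<String, Float> contributions = new LinkedHashMap<>();
        for (String target : targets) {
            // Several links to the same target accumulate their shares.
            contributions.merge(target, score / targets.size(), Float::sum);
        }
        return contributions;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("foo.com/a.html#chapter1");
        links.add("foo.com/a.html#chapter2");
        // Every link normalizes to foo.com/a.html, so nothing is distributed.
        System.out.println(distribute("foo.com/a.html", links, 1.0f));
    }
}
```

With only anchor links on the page, the returned map is empty, so the page's score is not fed back into itself.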


> doubling score causes by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8-dev
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.8-dev
>
>         Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks to other pages but several links to itself (for example, a set of anchors), its score is distributed to those outlinks. But all of these outlinks point back to the page itself, so the page score is doubled.
> I'm not sure, but this may also cause a never-ending fetch loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
> So the fetched status may be overwritten with unfetched.
> In that case we fetch the page again every time and double its score every time, which produces very high scores for no reason.
