Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/07/28 08:16:14 UTC
[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]
Stefan Groschupf updated NUTCH-332:
-----------------------------------
Attachment: scoreDoubling.patch
A patch to solve this problem.
This is an example page:
http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several anchors that cause the problem in this case.
What happens is:
foo.com/a.html points to foo.com/a.html#chapter1
We normalize foo.com/a.html#chapter1 to:
foo.com/a.html
As a result, foo.com/a.html contributes all of its score to foo.com/a.html, i.e. to itself.
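To make the effect concrete, here is a minimal sketch of one way to avoid it: strip the fragment from each outlink and drop any outlink that then equals the source page before scores are distributed. This is not the content of scoreDoubling.patch and does not use Nutch classes; the class name SelfLinkFilter and the methods stripFragment and scoringTargets are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates filtering self-links
// that only differ from the source page by a "#fragment".
public class SelfLinkFilter {

    // Remove everything from the first '#' onward; URLs without '#' are returned unchanged.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Return the normalized outlinks that should receive a score contribution,
    // skipping every outlink that normalizes to the source page itself.
    static List<String> scoringTargets(String fromUrl, List<String> outlinks) {
        String from = stripFragment(fromUrl);
        List<String> targets = new ArrayList<>();
        for (String link : outlinks) {
            String to = stripFragment(link);
            if (!to.equals(from)) {
                targets.add(to);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://foo.com/a.html#chapter1",
                "http://foo.com/a.html#chapter2",
                "http://foo.com/b.html");
        // Only b.html remains; the two in-page anchors are dropped.
        System.out.println(scoringTargets("http://foo.com/a.html", outlinks));
    }
}
```

With this filter, a page whose only outlinks are its own anchors ends up with an empty target list, so it passes no score back to itself.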
> doubling score causes by page internal anchors.
> -----------------------------------------------
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.8-dev
>
> Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks to other pages but several links to itself (e.g. a set of in-page anchors), its score is distributed to those outlinks. But all of these outlinks point back to the page itself, so the page's score is doubled.
> I'm not sure, but this may also cause a never-ending fetch loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
> So the fetched status may be overwritten with unfetched.
> In that case we fetch the page again every time and also double its score every time, which produces very high scores for no reason.
--
This message is automatically generated by JIRA.