Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/07/28 08:14:13 UTC
[jira] Created: (NUTCH-332) doubling score causes by page internal
anchors.
doubling score causes by page internal anchors.
-----------------------------------------------
Key: NUTCH-332
URL: http://issues.apache.org/jira/browse/NUTCH-332
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
When a page has no outlinks to other pages but several links to itself (e.g. a set of in-page anchors), the score of the page is distributed to its outlinks. But all of these outlinks point back to the page itself, so the page's score is doubled.
I'm not sure, but this may also cause a never-ending fetch loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
So the fetched status may be overwritten with unfetched.
In that case we would fetch the page again every time and double its score every time, which leads to very high scores for no reason.
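
To make the arithmetic concrete, here is a toy model of the effect. It is only a sketch under stated assumptions, not Nutch code: the class and method names are hypothetical, and it assumes the page keeps its current score, splits the same amount evenly over its outlinks, and that every outlink normalizes back to the page itself.

```java
/** Hypothetical toy model of the effect described above; this is not Nutch code. */
final class ScoreDoublingDemo {

    /**
     * One simplified update round. The page keeps its current score and
     * splits the same amount evenly over its outlinks. Assumption: every
     * outlink is an in-page anchor that normalizes back to the page itself.
     */
    static float updateOnce(float pageScore, int selfLinks) {
        float received = 0.0f;
        for (int i = 0; i < selfLinks; i++) {
            received += pageScore / selfLinks; // each share comes straight back
        }
        return pageScore + received;
    }

    /** Repeats the update, as would happen if the page were refetched each cycle. */
    static float simulate(float initialScore, int selfLinks, int rounds) {
        float score = initialScore;
        for (int r = 0; r < rounds; r++) {
            score = updateOnce(score, selfLinks);
        }
        return score;
    }

    public static void main(String[] args) {
        for (int rounds = 0; rounds <= 5; rounds++) {
            System.out.println(rounds + " round(s): " + simulate(1.0f, 4, rounds));
        }
    }
}
```

With four self-links and an initial score of 1.0, this model gives 2.0 after one round, 4.0 after two and 8.0 after three, so under these assumptions the score grows exponentially if the page really is refetched every cycle.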
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-332) doubling score causes by page internal
anchors.
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]
Stefan Groschupf updated NUTCH-332:
-----------------------------------
Attachment: scoreDoubling.patch
A patch to solve this problem.
Here is an example page:
http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several anchors, which cause the problem in this case.
What happens is:
foo.com/a.html points to foo.com/a.html#chapter1.
We normalize foo.com/a.html#chapter1 to foo.com/a.html.
As a result, foo.com/a.html contributes its whole score to foo.com/a.html, i.e. to itself.
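
The steps above suggest one possible guard: after normalization, drop any outlink whose target equals the source page before distributing score. The sketch below illustrates that idea only. It is not the content of scoreDoubling.patch, the class and method names are hypothetical, and the fragment stripping is a plain string operation standing in for Nutch's real URL normalizers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the attached patch and not a Nutch API. */
final class SelfLinkFilter {

    /** Removes a "#fragment" suffix, a stand-in for real URL normalization. */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /**
     * Returns the normalized outlink targets that differ from the source
     * page, so that in-page anchors receive no share of the page's score.
     */
    static List<String> targetsExcludingSelf(String fromUrl, List<String> outlinks) {
        String from = stripFragment(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String target = stripFragment(link);
            if (!target.equals(from)) {
                kept.add(target); // only links to other pages are kept
            }
        }
        return kept;
    }
}
```

For a page foo.com/a.html that links to foo.com/a.html#chapter1 and foo.com/b.html, only foo.com/b.html would remain as a score target. A page that consists solely of in-page anchors ends up with an empty list and so contributes nothing to itself.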
> doubling score causes by page internal anchors.
> -----------------------------------------------
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.8-dev
>
> Attachments: scoreDoubling.patch
[jira] Assigned: (NUTCH-332) doubling score causes by page internal
anchors.
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]
Andrzej Bialecki reassigned NUTCH-332:
---------------------------------------
Assignee: Andrzej Bialecki
> doubling score causes by page internal anchors.
> -----------------------------------------------
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Assigned To: Andrzej Bialecki
> Priority: Blocker
> Fix For: 0.9.0
>
> Attachments: scoreDoubling.patch
[jira] Updated: (NUTCH-332) doubling score causes by page internal
anchors.
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]
Sami Siren updated NUTCH-332:
-----------------------------
Fix Version/s: 0.9
(was: 0.8)
> doubling score causes by page internal anchors.
> -----------------------------------------------
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.9
>
> Attachments: scoreDoubling.patch
[jira] Closed: (NUTCH-332) doubling score causes by page internal
anchors.
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]
Andrzej Bialecki closed NUTCH-332.
-----------------------------------
Fix Version/s: 0.8.1
Resolution: Fixed
Patch applied to branch-0.8 and to trunk. Thanks!
> doubling score causes by page internal anchors.
> -----------------------------------------------
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Assigned To: Andrzej Bialecki
> Priority: Blocker
> Fix For: 0.8.1, 0.9.0
>
> Attachments: scoreDoubling.patch