Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/07/28 08:14:13 UTC

[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.

doubling score causes by page internal anchors.
-----------------------------------------------

                 Key: NUTCH-332
                 URL: http://issues.apache.org/jira/browse/NUTCH-332
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8-dev
            Reporter: Stefan Groschupf
            Priority: Blocker
             Fix For: 0.8-dev


When a page has no outlinks to other pages but several links to itself (for example, a set of in-page anchors), the page's score is distributed to its outlinks. But all of these outlinks point back to the page itself, so the page's score is doubled.
I'm not sure, but this may also cause a never-ending fetch loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
So the fetched status may be overwritten with unfetched.
In that case we would fetch the page again on every cycle and double its score each time, which produces very high scores for no reason.
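
The doubling described above can be illustrated with a toy model. This is only a minimal sketch, not Nutch code: the class SelfLinkScoreSketch, the method distribute, and the even split of a page's score across its outlinks are assumptions made for illustration. The skipSelfLinks flag shows one possible way to avoid the effect; it is not necessarily what a fix in Nutch would do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of score distribution; not taken from Nutch.
class SelfLinkScoreSketch {

    // Splits the page's score evenly across its outlinks (an assumed scheme)
    // and adds each share to the target's score. When skipSelfLinks is true,
    // outlinks whose target equals the source page are dropped before the split.
    static Map<String, Double> distribute(String fromUrl, double score,
                                          List<String> outlinks,
                                          boolean skipSelfLinks) {
        Map<String, Double> scores = new HashMap<>();
        scores.put(fromUrl, score); // the page starts with its own score

        List<String> targets = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (skipSelfLinks && toUrl.equals(fromUrl)) {
                continue; // ignore links that point back to the page itself
            }
            targets.add(toUrl);
        }
        if (targets.isEmpty()) {
            return scores; // nothing to distribute
        }

        double share = score / targets.size();
        for (String toUrl : targets) {
            // A self-link adds its share on top of the page's own score.
            scores.merge(toUrl, share, Double::sum);
        }
        return scores;
    }
}
```

With four outlinks that all resolve to the page itself and a starting score of 1.0, this model returns 2.0 for the page when self-links are kept and 1.0 when they are skipped.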

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Stefan Groschupf updated NUTCH-332:
-----------------------------------

    Attachment: scoreDoubling.patch

A patch to solve this problem. 

This is an example page:
http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several anchors, which cause the problem in this case.

What happens is:
foo.com/a.html points to foo.com/a.html#chapter1
We normalize foo.com/a.html#chapter1 to:
foo.com/a.html

As a result, foo.com/a.html contributes all of its score to foo.com/a.html.
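
The normalization step above can be sketched as follows. This is a simplified illustration, not Nutch's URL normalizer: the class FragmentNormalizeSketch and the methods stripFragment and isSelfLink are hypothetical names, and real normalization may do considerably more than cut off the fragment.

```java
// Hypothetical helper showing how an in-page anchor collapses to its own page.
class FragmentNormalizeSketch {

    // Removes the "#fragment" part of a URL string, if there is one.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // True when the outlink, after fragment removal, is the source page itself.
    static boolean isSelfLink(String fromUrl, String toUrl) {
        return stripFragment(toUrl).equals(stripFragment(fromUrl));
    }
}
```

Under these assumptions, isSelfLink("http://foo.com/a.html", "http://foo.com/a.html#chapter1") is true, so such an outlink could be detected and left out of score distribution.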



        

[jira] Assigned: (NUTCH-332) doubling score causes by page internal anchors.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Andrzej Bialecki  reassigned NUTCH-332:
---------------------------------------

    Assignee: Andrzej Bialecki 

> doubling score causes by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: scoreDoubling.patch

        

[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Sami Siren updated NUTCH-332:
-----------------------------

    Fix Version/s: 0.9
                       (was: 0.8)

> doubling score causes by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.9
>
>         Attachments: scoreDoubling.patch

        

[jira] Closed: (NUTCH-332) doubling score causes by page internal anchors.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Andrzej Bialecki  closed NUTCH-332.
-----------------------------------

    Fix Version/s: 0.8.1
       Resolution: Fixed

Patch applied to branch-0.8 and to trunk. Thanks!

> doubling score causes by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.8.1, 0.9.0
>
>         Attachments: scoreDoubling.patch