Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/07/17 15:53:14 UTC

[jira] Created: (NUTCH-321) Scoring API deficiency

Scoring API deficiency
----------------------

                 Key: NUTCH-321
                 URL: http://issues.apache.org/jira/browse/NUTCH-321
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.8-dev
            Reporter: Andrzej Bialecki 
             Fix For: 0.8-dev


Currently the method ScoringFilter.updateDbScore() doesn't use the "old" value from the existing CrawlDB. Instead it uses the value taken from the fetchlist in the current segment, which is a snapshot of the "old" value taken at the moment the fetchlist was generated.

The problem with this approach is that if/when we add the ability to interleave generate/fetch/update cycles, the initial score in the CrawlDatum instance that comes from the current segment could already be outdated if another updatedb, run in the meantime, changed the DB score.

For this reason we should always assume that the value from CrawlDB, if it exists, represents the most recent version of the CrawlDatum before the update, and use this instance as the base.
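
To illustrate the stale-snapshot problem, here is a minimal, self-contained sketch. It does not use the real Nutch classes: Datum is a hypothetical stand-in for CrawlDatum that models only the score, and the "add up inlink contributions" rule is an assumption made purely for illustration, not the behaviour of any particular scoring plugin.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CrawlDatum; only the score is modeled.
final class Datum {
    final float score;
    Datum(float score) { this.score = score; }
}

final class ScoreBaseSketch {
    // Prefer the CrawlDB entry when it exists; otherwise fall back to the
    // fetchlist snapshot that came with the current segment.
    static Datum chooseBase(Datum fromDb, Datum fromSegment) {
        return fromDb != null ? fromDb : fromSegment;
    }

    // Assumed scoring rule for illustration: base score plus the sum of
    // the inlink contributions.
    static float updatedScore(Datum fromDb, Datum fromSegment, List<Float> inlinkScores) {
        float score = chooseBase(fromDb, fromSegment).score;
        for (float s : inlinkScores) {
            score += s;
        }
        return score;
    }

    public static void main(String[] args) {
        Datum snapshot = new Datum(1.0f); // taken when the fetchlist was generated
        Datum current = new Datum(2.0f);  // changed by another updatedb in the meantime

        // With a DB entry present, the update builds on 2.0, not the stale 1.0.
        System.out.println(updatedScore(current, snapshot, Arrays.asList(0.5f, 0.25f)));

        // With no DB entry, the snapshot is the only base available.
        System.out.println(updatedScore(null, snapshot, Arrays.asList(0.5f)));
    }
}
```

If the snapshot were used as the base in the first call, the change made by the interleaved updatedb would be silently discarded.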

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (NUTCH-321) Scoring API deficiency

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-321?page=all ]

Andrzej Bialecki  updated NUTCH-321:
------------------------------------

    Attachment: patch.txt

Proposed improvements. If there are no objections I'll commit them shortly.

NOTE: this changes the API, but since v. 0.8 is still unreleased I feel it's the right time to do it.



        

[jira] Closed: (NUTCH-321) Scoring API deficiency

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-321?page=all ]

Andrzej Bialecki  closed NUTCH-321.
-----------------------------------

    Resolution: Fixed

Patch applied to trunk/ .

NOTE: this requires a (trivial) change in any custom scoring plugin. To accommodate future support for interleaved fetching cycles, you should most likely use the "old" CrawlDatum as the basis for the initial score to be updated, instead of the "datum" (which is a snapshot of the value at the time the fetchlist was generated).
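
As a rough idea of what the plugin change might look like, here is a self-contained sketch. MiniDatum and MiniScoringFilter are hypothetical, simplified stand-ins and not the real Nutch CrawlDatum or ScoringFilter; check the committed patch for the actual signature. Only the parameter names "old" and "datum" are taken from the note above, and the additive scoring rule is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for CrawlDatum (score only).
final class MiniDatum {
    private float score;
    MiniDatum(float score) { this.score = score; }
    float getScore() { return score; }
    void setScore(float score) { this.score = score; }
}

// Hypothetical, simplified shape of the scoring hook; not the real interface.
interface MiniScoringFilter {
    void updateDbScore(String url, MiniDatum old, MiniDatum datum, List<MiniDatum> inlinked);
}

final class AdditiveScoringFilter implements MiniScoringFilter {
    @Override
    public void updateDbScore(String url, MiniDatum old, MiniDatum datum, List<MiniDatum> inlinked) {
        // Before the change a plugin would have started from datum.getScore().
        // Now the CrawlDB value ("old") is the base whenever it exists.
        float base = (old != null) ? old.getScore() : datum.getScore();

        // Assumed rule for illustration: add the scores carried by inlinks.
        float adjust = 0.0f;
        for (MiniDatum in : inlinked) {
            adjust += in.getScore();
        }
        datum.setScore(base + adjust);
    }
}

final class PluginMigrationDemo {
    public static void main(String[] args) {
        MiniDatum old = new MiniDatum(2.0f);   // current CrawlDB value
        MiniDatum datum = new MiniDatum(1.0f); // stale fetchlist snapshot
        new AdditiveScoringFilter().updateDbScore(
                "http://example.com/", old, datum, Arrays.asList(new MiniDatum(0.5f)));
        System.out.println(datum.getScore());
    }
}
```

Falling back to "datum" when "old" is null covers URLs that are not yet in the CrawlDB.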

