Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/08/17 01:23:14 UTC

[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

     [ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]

Stefan Groschupf updated NUTCH-348:
-----------------------------------

    Attachment: sortPatchV1.patch

What do people think about this kind of solution?

> Generator is building fetch list using *lowest* scoring URLs
> ------------------------------------------------------------
>
>                 Key: NUTCH-348
>                 URL: http://issues.apache.org/jira/browse/NUTCH-348
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Chris Schneider
>         Attachments: sortPatchV1.patch
>
>
> Ever since revision 391271, when the CrawlDatum key was replaced by a FloatWritable key, the Generator.Selector.reduce method has been outputting the *lowest* scoring URLs! The CrawlDatum class has a Comparator that essentially treats higher scoring CrawlDatum objects as "less than" lower scoring CrawlDatum objects, so the higher scoring ones would appear first in a sequence file sorted using this as the key.
> When a FloatWritable based on the score itself (as returned from scfilters.generatorSortValue) became the sort key, the value should have been negated in Generator.Selector.map to preserve that ordering. Curiously, a comment stating this intent appears immediately before the FloatWritable is set:
>       // sort by decreasing score
>       sortValue.set(sort);
> It seems like the simplest way to fix this is to just negate the score, and this seems to work for me:
>       // sort by decreasing score
>       // 2006-08-15 CSc REALLY sort by decreasing score
>       sortValue.set(-sort);
> Unfortunately, this means that any crawls that have been done using Generator.java after revision 391271 should be discarded, as they were focused on fetching the lowest scoring unfetched URLs in the crawldb, essentially pointing the crawler 180 degrees from its intended direction.
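
To illustrate the ordering problem described in the issue, here is a minimal standalone sketch. It does not use Nutch or Hadoop classes; SortKeyDemo and selectFirst are hypothetical names, and the sketch assumes the framework sorts float keys in ascending order and that the reducer keeps the first entries it receives. Under those assumptions, an un-negated score key selects the lowest scores, while a negated key selects the highest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortKeyDemo {

    /**
     * Hypothetical stand-in for the map/sort/reduce path: build one sort key
     * per score, sort the keys ascending (assumed framework behavior), and
     * keep the first `limit` entries, returned as the original scores.
     */
    static List<Float> selectFirst(float[] scores, int limit, boolean negate) {
        float[] keys = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Corresponds to sortValue.set(sort) vs. sortValue.set(-sort)
            keys[i] = negate ? -scores[i] : scores[i];
        }
        Arrays.sort(keys); // ascending order of the sort key

        List<Float> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, keys.length); i++) {
            // Undo the negation to report the actual score
            selected.add(negate ? -keys[i] : keys[i]);
        }
        return selected;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 2.0f, 1.0f, 0.1f};
        System.out.println(selectFirst(scores, 2, false)); // [0.1, 0.5]
        System.out.println(selectFirst(scores, 2, true));  // [2.0, 1.0]
    }
}
```

With the key left as-is, the two lowest-scoring entries are selected; with the key negated, the two highest-scoring entries come first, which matches the intent of the "sort by decreasing score" comment.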

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira