Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2009/10/09 14:44:31 UTC

[jira] Closed: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-707.
-----------------------------------

    Resolution: Fixed
      Assignee: Andrzej Bialecki 

> Generation of multiple segments in multiple runs returns only 1 segment
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-707
>                 URL: https://issues.apache.org/jira/browse/NUTCH-707
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.9.0
>         Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
>            Reporter: Michael Chan
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.1
>
>         Attachments: GeneratorDiff
>
>
> To generate multiple segments, generator.update.crawldb is set to true and -topN is defined to be the size of the segments. However, only one segment of size N is generated.
> For example, I tried it with a db that contains 10,000+ links according to the dump. With generator.update.crawldb set to true and -topN set to 5, only 1 segment of size 5 is produced.
> It seems to me that the problem is caused by the generation time being recorded incorrectly. Selector.map assigns the generation time to every URL, even though reduce collects only N of them. That is perfectly fine if the generator is run only once and the db isn't updated. But when the generator is run again within genDelay, all the remaining URLs are excluded. So I suggest that the generation time be assigned in reduce rather than in map.
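
The behaviour described in the report can be illustrated with a small simulation. This is only a sketch in plain Python, not Nutch or Hadoop code: the names Entry, eligible, generate_stamp_in_map, generate_stamp_in_reduce and make_db are hypothetical, the GEN_DELAY value is an illustrative assumption, and the write-back of the generation time into the crawldb is collapsed into a single in-memory step.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative delay in seconds (an assumption, not read from any Nutch config).
GEN_DELAY = 7 * 24 * 3600


@dataclass
class Entry:
    """Hypothetical stand-in for one crawldb record."""
    url: str
    score: float
    generated_at: Optional[int] = None  # None means "never generated"


def eligible(entry: Entry, now: int) -> bool:
    """A URL is skipped if it was stamped as generated less than GEN_DELAY ago."""
    return entry.generated_at is None or now - entry.generated_at >= GEN_DELAY


def generate_stamp_in_map(db: List[Entry], top_n: int, now: int) -> List[str]:
    """Reported behaviour: every candidate is stamped before the top N are picked."""
    candidates = [e for e in db if eligible(e, now)]
    for e in candidates:  # "map" phase stamps all candidates
        e.generated_at = now
    candidates.sort(key=lambda e: e.score, reverse=True)
    return [e.url for e in candidates[:top_n]]  # "reduce" keeps only N


def generate_stamp_in_reduce(db: List[Entry], top_n: int, now: int) -> List[str]:
    """Suggested behaviour: only the N URLs that are actually selected get stamped."""
    candidates = [e for e in db if eligible(e, now)]
    candidates.sort(key=lambda e: e.score, reverse=True)
    selected = candidates[:top_n]
    for e in selected:  # "reduce" phase stamps only what it emits
        e.generated_at = now
    return [e.url for e in selected]


def make_db(n: int) -> List[Entry]:
    """Build n fake records with distinct, descending scores."""
    return [Entry(f"http://example.com/{i}", float(n - i)) for i in range(n)]


if __name__ == "__main__":
    db = make_db(20)
    print(len(generate_stamp_in_map(db, 5, now=0)))      # first run
    print(len(generate_stamp_in_map(db, 5, now=60)))     # second run within GEN_DELAY

    db = make_db(20)
    print(len(generate_stamp_in_reduce(db, 5, now=0)))   # first run
    print(len(generate_stamp_in_reduce(db, 5, now=60)))  # second run within GEN_DELAY
```

In this sketch the second call to generate_stamp_in_map comes back empty because all 20 records were stamped on the first run, while generate_stamp_in_reduce keeps yielding a fresh batch of 5 on each run until the db is exhausted.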

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.