Posted to user@nutch.apache.org by Uroš Gruber <ur...@sir-mag.com> on 2006/09/01 09:57:41 UTC

Re: bug or feature

Uroš Gruber wrote:
> Andrzej Bialecki wrote:
>> Uroš Gruber wrote:
>>> Hi,
>>>
>>> I've made some changes in CrawlDbReader to read the fetchlist made
>>> by the generate command. At first I thought I had a problem with
>>> this script because some URLs from inject were missing. Then I tested
>>> with only 6 URLs: I manually checked the files produced by inject and
>>> by generate, and generate put only 3 URLs in the fetchlist.
>>>
>>> I don't quite understand this. As far as I understand the generate
>>> command, it collects URLs from the crawldb, does some sorting by
>>> score, and puts them into the crawl_generate directory (I sketch
>>> this step below).
>>
>> Are you running in local mode, or in map-reduce mode with several
>> tasktrackers? What is the number of reduce tasks in this "generate" job?
>>
> I'm running in local mode with mapred.reduce.tasks at its default (1)
> and mapred.map.tasks set to 2.
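
For reference, here is my mental model of that selection step, written
out as a minimal self-contained sketch. The Entry class and its field
names are hypothetical stand-ins (CrawlDatum carries similar fields);
this is not the actual Generator code, just the shape of it as I
understand it:

// A self-contained paraphrase of the selection step: skip entries not
// yet due for fetch, sort the rest by score (highest first), and keep
// at most topN. Entry is a hypothetical stand-in for a crawldb record.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class GenerateSketch {

  static class Entry {
    String url;
    float score;
    long fetchTime; // next scheduled fetch time, in millis

    Entry(String url, float score, long fetchTime) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
    }
  }

  static List<Entry> select(List<Entry> crawlDb, long curTime, int topN) {
    List<Entry> due = new ArrayList<Entry>();
    for (Entry e : crawlDb) {
      // Entries whose fetch time is still in the future are skipped
      // silently; one way a 6-URL crawldb yields a 3-URL fetchlist.
      if (e.fetchTime > curTime) {
        continue;
      }
      due.add(e);
    }
    // "Sorting by score": highest-scoring URLs go first.
    Collections.sort(due, new Comparator<Entry>() {
      public int compare(Entry a, Entry b) {
        return Float.compare(b.score, a.score);
      }
    });
    // Truncating to topN is the other common way URLs go missing.
    return due.subList(0, Math.min(topN, due.size()));
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    List<Entry> db = new ArrayList<Entry>();
    db.add(new Entry("http://a.example/", 1.0f, now - 1000L));
    db.add(new Entry("http://b.example/", 2.0f, now - 1000L));
    db.add(new Entry("http://c.example/", 0.5f, now + 86400000L)); // not due
    System.out.println(select(db, now, 100).size() + " of " + db.size()
        + " selected");
  }
}

A future fetch time or a topN cut would be the obvious explanation for
3 of 6, but the debugging below suggests the selection itself is fine: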
>
Debugging through the map and reduce jobs (Generator$Selector [line: 147] -
reduce, Generator$Selector [line: 99] - map) looks OK, and the job collects
all URLs from the CrawlDB. I can't figure out why data is lost when it is
moved from /tmp to crawl/segments/***/crawl_generate
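
To narrow down where the records vanish, my plan is to count them in
each part file directly with a small standalone reader, run against
both the /tmp output and the final crawl_generate directory. This is
only a sketch against the plain Hadoop SequenceFile API; the path is
an example, and the intermediate /tmp files may of course already be
gone once the job cleans up after itself:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Counts the records in one SequenceFile part file, so the /tmp output
// and the final crawl_generate output can be compared step by step.
public class CountGenerated {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    // Instantiate whatever key/value classes the file was written with,
    // so this works no matter what the job emitted.
    Writable key = (Writable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long count = 0;
    while (reader.next(key, value)) {
      count++;
    }
    reader.close();

    System.out.println(count + " records in " + part);
  }
}

Run it with a part file path as the only argument, e.g.
crawl/segments/***/crawl_generate/part-00000.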

If anyone could point me in the right direction on where to look, I'd
appreciate it.

regards

Uros