Posted to user@nutch.apache.org by Hannu Väisänen <hv...@joyx.joensuu.fi> on 2009/08/24 11:39:48 UTC

shouldFetch rejects all files

I am using Nutch to index some directories on my hard disk. It used to
work but now Nutch rejects all files.

The file logs/hadoop.log contains this line

DEBUG crawl.Generator - -shouldFetch rejected [file name here] fetchTime=1253697537652, curTime=1251105859942

for every file in the directories I want to index.


How can I start to debug the problem?

Re: shouldFetch rejects all files

Posted by Hannu Väisänen <hv...@joyx.joensuu.fi>.

On Mon, Aug 24, 2009 at 01:15:53PM +0300, Doğacan Güney wrote:
> 2009/8/24 Hannu Väisänen <hv...@joyx.joensuu.fi>
> > DEBUG crawl.Generator - -shouldFetch rejected [file name here]
> > fetchTime=1253697537652, curTime=1251105859942
> 
> fetchTime is ahead of curTime, that's why it is rejected.
> I would suggest playing around with conf options in nutch-site.xml.
> Depending on which scheduler you use, you should modify
> db.fetch.schedule.adaptive.* or db.fetch.interval.default.

If I use Nutch (version 1.0) to index some directories on my hard disk
like this

bin/nutch crawl urls -dir crawl -depth 300 >&crawl.log

how many times should it fetch a file in one run?

If I set db.fetch.interval.default and db.fetch.interval.max to 1
second, Nutch seems to fetch files again and again and again... And if
the numbers are too big, Nutch rejects all files.

Obviously I don't understand how the scheduler works. (-:
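
One way to see why both extremes misbehave is a toy model. It assumes
that the crawl command runs up to -depth generate/fetch/update rounds
and that each fetch pushes the file's next fetch time forward by the
configured interval. The function name and the 10-second round
duration are made up for illustration; this is not Nutch code.

```python
def count_fetches(depth, interval_s, round_duration_s=10):
    """Count how often one file is fetched over `depth` crawl rounds.

    Hypothetical model: the file is fetched in a round only if its
    next fetch time has already passed when the round starts.
    """
    now = 0            # seconds since the crawl started
    next_fetch = 0     # a new file is due immediately
    fetches = 0
    for _ in range(depth):
        if next_fetch <= now:
            fetches += 1
            next_fetch = now + interval_s   # schedule the next fetch
        now += round_duration_s             # time spent on this round
    return fetches

# Interval of 1 second: the file is due again in every round.
print(count_fetches(depth=300, interval_s=1))        # 300
# Interval of 30 days: fetched once, then rejected in later rounds.
print(count_fetches(depth=300, interval_s=2592000))  # 1
```

Under these assumptions a 1-second interval makes the file due again
in every round, while a long interval means it is fetched once and
then rejected until the interval has elapsed.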

Re: shouldFetch rejects all files

Posted by Doğacan Güney <do...@gmail.com>.

2009/8/24 Hannu Väisänen <hv...@joyx.joensuu.fi>

> I am using Nutch to index some directories on my hard disk. It used to
> work but now Nutch rejects all files.
>
> File logs/hadoop.log has this
>
> DEBUG crawl.Generator - -shouldFetch rejected [file name here]
> fetchTime=1253697537652, curTime=1251105859942
>
> for every file in directories I want to index.
>
>
> How can I start to debug the problem?
>

fetchTime is ahead of curTime; that's why it is rejected. After fetching a
file, Nutch sets a next fetch time (i.e. the next time it will fetch the
file), and won't fetch it until that time.
I would suggest playing around with conf options in nutch-site.xml.
Depending on which scheduler you use, you should modify
db.fetch.schedule.adaptive.* or db.fetch.interval.default.
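
To make the numbers in the log line concrete: both values look like
milliseconds since the Unix epoch, so they can be compared directly.
The Python sketch below only illustrates the comparison described
above. It is not Nutch's actual code, and the helper names to_utc and
should_fetch are hypothetical.

```python
from datetime import datetime, timezone

# Values copied from the DEBUG line in logs/hadoop.log
# (assumed to be milliseconds since the Unix epoch).
FETCH_TIME_MS = 1253697537652
CUR_TIME_MS = 1251105859942

def to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def should_fetch(fetch_time_ms, cur_time_ms):
    """Simplified model: a file is due only once its fetch time has passed."""
    return fetch_time_ms <= cur_time_ms

# Distance between the scheduled fetch time and "now", in days.
gap_days = (FETCH_TIME_MS - CUR_TIME_MS) / 1000 / 86400

print("curTime  :", to_utc(CUR_TIME_MS).isoformat())
print("fetchTime:", to_utc(FETCH_TIME_MS).isoformat())
print("gap in days: %.3f" % gap_days)
print("due now?", should_fetch(FETCH_TIME_MS, CUR_TIME_MS))
```

The gap works out to just under 30 days. If db.fetch.interval.default
is 2592000 seconds (30 days, which I believe is the stock value, but
check conf/nutch-default.xml), that would mean the files were fetched
a few minutes before this generate step and are simply not due yet.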

-- 
Doğacan Güney