Posted to user@nutch.apache.org by Hannu Väisänen <hv...@joyx.joensuu.fi> on 2009/08/24 11:39:48 UTC
shouldFetch rejects all files
I am using Nutch to index some directories on my hard disk. It used to
work but now Nutch rejects all files.
File logs/hadoop.log has this
DEBUG crawl.Generator - -shouldFetch rejected [file name here] fetchTime=1253697537652, curTime=1251105859942
for every file in directories I want to index.
How can I start to debug the problem?
Re: shouldFetch rejects all files
Posted by Hannu Väisänen <hv...@joyx.joensuu.fi>.
On Mon, Aug 24, 2009 at 01:15:53PM +0300, Doğacan Güney wrote:
> 2009/8/24 Hannu Väisänen <hv...@joyx.joensuu.fi>
> > DEBUG crawl.Generator - -shouldFetch rejected [file name here]
> > fetchTime=1253697537652, curTime=1251105859942
>
> fetchTime is ahead of curTime, that's why it is rejected.
> I would suggest playing around with conf options in nutch-site.xml.
> Depending on which scheduler you use, you should modify
> db.fetch.schedule.adaptive.* or db.fetch.interval.default.
If I use Nutch (version 1.0) to index some directories on my hard disk
like this
bin/nutch crawl urls -dir crawl -depth 300 >&crawl.log
how many times should it fetch a file in one run?
If I set db.fetch.interval.default and db.fetch.interval.max to 1
second, Nutch seems to fetch files again and again and again... And if
the numbers are too big, Nutch rejects all files.
Obviously I don't understand how the scheduler works. (-:
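The behaviour described above can be reproduced with a toy model. This is
only a sketch, not Nutch code: it assumes that each depth level of the crawl
command is one generate/fetch/update round, that a round takes 60 seconds,
and that a file is due whenever its next fetch time is not ahead of the
current time. The function count_fetches and its parameters are made up for
illustration.

```python
def count_fetches(interval_s, rounds, round_duration_s):
    """Count how often one file is fetched over several crawl rounds.

    Toy model only: time is in seconds and starts at 0, and the file is
    due immediately in the first round.
    """
    cur_time = 0
    fetch_time = 0
    fetches = 0
    for _ in range(rounds):
        if fetch_time <= cur_time:
            # The file is due: fetch it and schedule the next fetch.
            fetches += 1
            fetch_time = cur_time + interval_s
        # Assumed wall-clock time that passes before the next round.
        cur_time += round_duration_s
    return fetches


# Interval of 1 second: the file is due again in every round.
print(count_fetches(interval_s=1, rounds=300, round_duration_s=60))

# Interval of 30 days: the file is fetched once, then rejected.
print(count_fetches(interval_s=30 * 24 * 3600, rounds=300, round_duration_s=60))
```

Under these assumptions, any interval shorter than one round makes the file
due in every round, while an interval longer than the whole run means one
fetch followed by rejections.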
Re: shouldFetch rejects all files
Posted by Doğacan Güney <do...@gmail.com>.
2009/8/24 Hannu Väisänen <hv...@joyx.joensuu.fi>
> I am using Nutch to index some directories on my hard disk. It used to
> work but now Nutch rejects all files.
>
> File logs/hadoop.log has this
>
> DEBUG crawl.Generator - -shouldFetch rejected [file name here]
> fetchTime=1253697537652, curTime=1251105859942
>
> for every file in directories I want to index.
>
>
> How can I start to debug the problem?
>
fetchTime is ahead of curTime, that's why it is rejected. After fetching a
file, Nutch sets a next fetch time (i.e. the next time it will fetch the
file), and won't fetch it again until that time.
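To make the comparison concrete, here is a small sketch that uses the two
values from the log line. It assumes both are milliseconds since the Unix
epoch. The should_fetch function is a simplified stand-in written for
illustration, not Nutch's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the log line in the original post.
fetch_time_ms = 1253697537652
cur_time_ms = 1251105859942


def should_fetch(fetch_time_ms, cur_time_ms):
    """Simplified check: a file is due once its fetch time has passed."""
    return fetch_time_ms <= cur_time_ms


def as_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# How long until the file becomes due again.
wait = timedelta(milliseconds=fetch_time_ms - cur_time_ms)

print("should fetch:", should_fetch(fetch_time_ms, cur_time_ms))
print("curTime:     ", as_utc(cur_time_ms))
print("fetchTime:   ", as_utc(fetch_time_ms))
print("time to wait:", wait)
```

The wait comes out just under 30 days, which would be consistent with a
fetch interval of about 30 days having been applied a few minutes before
this run. That is an inference from the numbers; check the value of
db.fetch.interval.default in your own configuration.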
I would suggest playing around with conf options in nutch-site.xml.
Depending on which scheduler you use, you should modify
db.fetch.schedule.adaptive.* or db.fetch.interval.default.
--
Doğacan Güney