Posted to user@nutch.apache.org by 高睿 <ga...@163.com> on 2013/03/10 14:29:20 UTC

How to prevent re-crawling?

 Hi,

Background: I have several article list urls in seed.txt. Currently, the nutch crawl command crawls both the list urls and the article urls every time.
I want to prevent re-crawling of the urls (article urls) that have already been crawled, but I still want to re-crawl the urls in seed.txt (the article list urls).
Do you have any ideas about this?

Regards,
Rui

Re: Re: How to prevent re-crawling?

Posted by feng lu <am...@gmail.com>.
Yes, the "nutch crawl" command itself does not change the 'fetchInterval'.
Currently the fetch interval is determined by these factors (see the config
sketch below):

1. the db.fetch.interval.default property in nutch-site.xml - the default
number of seconds between re-fetches of a page
2. the nutch.fetchInterval metadata read by the nutch inject process - allows
you to set a custom fetch interval for a specific URL
3. the adaptive fetch schedule class, if you use it - it can continuously
monitor a site and crawl updates [0]
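
For reference, the first and third of these would roughly look like the
following in nutch-site.xml (an untested sketch; the values are only examples
and db.fetch.interval.default is in seconds, i.e. 30 days below):

<property>
  <name>db.fetch.interval.default</name>
  <!-- default number of seconds between re-fetches, here 30 days -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <!-- switch from the default schedule to the adaptive one for factor 3 -->
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>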

Another method that may meet your requirement is to add all the article list
urls to a seed list and set a custom fetchInterval for them, then set
db.fetch.interval.default to a long time and inject the article list urls
into the crawldb (see the seed file sketch below). This method requires that
you know all the list pages in advance; otherwise the fetchInterval of a
newly discovered article list url will be set to db.fetch.interval.default.

[0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
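
For example, the seed file for the article list urls could look like the
following (untested; the urls and paths are placeholders, the metadata is
separated from the url by a tab, and nutch.fetchInterval is in seconds, here
one hour):

urls/seed.txt (article list pages, re-fetched every hour):

http://www.example.com/news/list-1	nutch.fetchInterval=3600
http://www.example.com/news/list-2	nutch.fetchInterval=3600

With db.fetch.interval.default set to a long time in nutch-site.xml, inject
them into the crawldb:

bin/nutch inject crawl/crawldb urls/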


On Mon, Mar 11, 2013 at 12:26 AM, 高睿 <ga...@163.com> wrote:

> OK, thanks.
> I'll try the 2nd approach.
> I'm using the 'nutch crawl' command, and it seems 'fetchInterval' doesn't
> really take effect. Maybe I should build my own script based on the basic
> commands.
>
> At 2013-03-10 22:36:03,"feng lu" <am...@gmail.com> wrote:
> >Hi
> >
> >Maybe you can add the article urls that are already crawled to a seed file.
> >Next, set db.injector.update to true and set the nutch.fetchInterval metadata
> >of each url to a long time. Finally, use the bin/nutch inject command to
> >update the fetchInterval of each url (the article urls).
> >
> >Or you can extend AbstractFetchSchedule and override the setFetchSchedule
> >method, using a urlfilter to match the article urls and set a long
> >fetchInterval for them.
> >
> >
> >On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <ga...@163.com> wrote:
> >
> >>  Hi,
> >>
> >> Background: I have several article list urls in seed.txt. Currently, the
> >> nutch crawl command crawls both the list urls and the article urls every
> >> time.
> >> I want to prevent re-crawling of the urls (article urls) that have
> >> already been crawled, but I still want to re-crawl the urls in seed.txt
> >> (the article list urls).
> >> Do you have any ideas about this?
> >>
> >> Regards,
> >> Rui
> >>
> >
> >
> >
> >--
> >Don't Grow Old, Grow Up... :-)
>



-- 
Don't Grow Old, Grow Up... :-)

Re: Re: How to prevent re-crawling?

Posted by 高睿 <ga...@163.com>.
OK, thanks.
I'll try the 2nd approach.
I'm using the 'nutch crawl' command, and it seems 'fetchInterval' doesn't really take effect. Maybe I should build my own script based on the basic commands.
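
Something along these lines is what I have in mind, built from the basic
commands (an untested sketch; the paths, depth and topN are placeholders):

#!/bin/sh
# rough crawl cycle using the individual nutch commands
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
URLS=urls
DEPTH=3

bin/nutch inject $CRAWLDB $URLS
i=1
while [ $i -le $DEPTH ]; do
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000
  SEGMENT=$SEGMENTS/`ls $SEGMENTS | tail -1`   # newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
  i=`expr $i + 1`
done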

At 2013-03-10 22:36:03,"feng lu" <am...@gmail.com> wrote:
>Hi
>
>Maybe you can add the article urls that are already crawled to a seed file.
>Next, set db.injector.update to true and set the nutch.fetchInterval metadata
>of each url to a long time. Finally, use the bin/nutch inject command to
>update the fetchInterval of each url (the article urls).
>
>Or you can extend AbstractFetchSchedule and override the setFetchSchedule
>method, using a urlfilter to match the article urls and set a long
>fetchInterval for them.
>
>
>On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <ga...@163.com> wrote:
>
>>  Hi,
>>
>> Background: I have several article list urls in seed.txt. Currently, the
>> nutch crawl command crawls both the list urls and the article urls every
>> time.
>> I want to prevent re-crawling of the urls (article urls) that have
>> already been crawled, but I still want to re-crawl the urls in seed.txt
>> (the article list urls).
>> Do you have any ideas about this?
>>
>> Regards,
>> Rui
>>
>
>
>
>-- 
>Don't Grow Old, Grow Up... :-)

Re: How to prevent re-crawling?

Posted by feng lu <am...@gmail.com>.
Hi

Maybe you can add the article urls that are already crawled to a seed file.
Next, set db.injector.update to true and set the nutch.fetchInterval metadata
of each url to a long time. Finally, use the bin/nutch inject command to
update the fetchInterval of each url (the article urls).
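
For example (untested; the url, paths and interval are placeholders, the
interval is in seconds, roughly one year), in nutch-site.xml:

<property>
  <name>db.injector.update</name>
  <value>true</value>
</property>

in the seed file, one already-crawled article url per line, with the metadata
separated by a tab:

http://www.example.com/news/article-123.html	nutch.fetchInterval=31536000

then re-run the injector to update the existing crawldb entries:

bin/nutch inject crawl/crawldb urls/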

Or you can extend AbstractFetchSchedule and override the setFetchSchedule
method, using a urlfilter to match the article urls and set a long
fetchInterval for them.
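
A rough (untested) sketch of that class could look like the following; the
package, class name and url regex are placeholders, and I use a plain regex
here instead of a URLFilter plugin to keep it short. You would then point
db.fetch.schedule.class at this class in nutch-site.xml:

package org.example.nutch;

import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Gives article pages a very long fetch interval so they are effectively
 * fetched only once, while list pages keep the normal schedule.
 */
public class ArticleFetchSchedule extends AbstractFetchSchedule {

  // placeholder pattern: article pages match, list pages do not
  private static final Pattern ARTICLE = Pattern.compile(".*/article/.*");

  // roughly one year, in seconds
  private static final int VERY_LONG_INTERVAL = 365 * 24 * 3600;

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime,
        prevModifiedTime, fetchTime, modifiedTime, state);
    if (ARTICLE.matcher(url.toString()).matches()) {
      // push the next fetch of article pages far into the future
      datum.setFetchInterval(VERY_LONG_INTERVAL);
    }
    datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000L);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}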


On Sun, Mar 10, 2013 at 9:29 PM, 高睿 <ga...@163.com> wrote:

>  Hi,
>
> Background: I have several article list urls in seed.txt. Currently, the
> nutch crawl command crawls both the list urls and the article urls every
> time.
> I want to prevent re-crawling of the urls (article urls) that have
> already been crawled, but I still want to re-crawl the urls in seed.txt
> (the article list urls).
> Do you have any ideas about this?
>
> Regards,
> Rui
>



-- 
Don't Grow Old, Grow Up... :-)