
Posted to user@nutch.apache.org by Joe Zhang <sm...@gmail.com> on 2012/12/03 01:48:44 UTC

scheduled recrawling

Dear List,

Could any of you point me to some resources on how to schedule recrawling
jobs so that I can constantly monitor some site(s)? The docs on this
topic seem quite thin.

I was thinking about cron jobs. But I vaguely remember seeing something
about configuring Nutch.

Thanks.

Joe.

Re: scheduled recrawling

Posted by Joe Zhang <sm...@gmail.com>.

Thanks.


RE: scheduled recrawling

Posted by Markus Jelsma <ma...@openindex.io>.


 
 
-----Original message-----
> From:Joe Zhang <sm...@gmail.com>
> Sent: Tue 04-Dec-2012 04:58
> To: user@nutch.apache.org
> Subject: Re: scheduled recrawling
> 
> Let me make sure I understand:
> 
> Let's say I write the following command:
> 
> bin/nutch crawl urls  -solr http://localhost:8983/solr/ -dir blahblah/
> -depth 5 -topN 1000
> 
> into a shell script mycrawl.sh, and I set up a cron job to run mycrawl.sh
> every day.
> 
> How does db.default.fetch.interval in nutch-default.xml affect the behavior
> of the crawl? Does it mean existing URLs in the crawldb won't be refetched
> until 30 days later (assuming I didn't change that value)?

Correct.
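
To put numbers on that, here is a minimal sketch. It assumes the stock value of db.fetch.interval.default from nutch-default.xml (2592000 seconds) and a cron job that runs once a day; the variable names are only illustrative.

```shell
#!/bin/sh
# The fetch interval is expressed in seconds; 30 days works out to:
interval_days=30
interval_seconds=$((interval_days * 24 * 60 * 60))
echo "$interval_seconds"

# With a daily cron job, this many runs pass before a page fetched today
# becomes eligible for refetch again (assuming the default schedule).
cron_period_days=1
runs_until_refetch=$((interval_days / cron_period_days))
echo "$runs_until_refetch"
```

So the daily runs in between would still pick up newly discovered URLs, but already-fetched pages would be skipped until their interval has elapsed.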



Re: scheduled recrawling

Posted by Joe Zhang <sm...@gmail.com>.

Let me make sure I understand:

Let's say I write the following command:

bin/nutch crawl urls  -solr http://localhost:8983/solr/ -dir blahblah/
-depth 5 -topN 1000

into a shell script mycrawl.sh, and I set up a cron job to run mycrawl.sh
every day.

How does db.default.fetch.interval in nutch-default.xml affect the behavior
of the crawl? Does it mean existing URLs in the crawldb won't be refetched
until 30 days later (assuming I didn't change that value)?


RE: scheduled recrawling

Posted by Markus Jelsma <ma...@openindex.io>.

No, the crawl command executes the individual commands in order. You can use the readdb command to inspect the state of a record and when it's eligible for fetch again. 
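
A rough sketch of what that inspection could look like, assuming the crawl was started with -dir blahblah/ (so the crawldb sits in blahblah/crawldb) and that you run it from the Nutch install directory. The function names and the NUTCH_BIN and CRAWLDB variables are made up for this example.

```shell
#!/bin/sh
# Assumed locations; adjust to your install and to the -dir you crawl into.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
CRAWLDB="${CRAWLDB:-blahblah/crawldb}"

# Summary of the whole crawldb.
crawldb_stats() {
  "$NUTCH_BIN" readdb "$CRAWLDB" -stats
}

# Stored record for a single URL, passed as the first argument.
crawldb_record() {
  "$NUTCH_BIN" readdb "$CRAWLDB" -url "$1"
}
```

Calling crawldb_record with one of your seed URLs should print that record, including its fetch time, which is the earliest moment the generator will select it again.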
 

Re: scheduled recrawling

Posted by Joe Zhang <sm...@gmail.com>.

I use the single "nutch crawl" command line to do my crawling, and I use a
cron job to control the schedule. Is that why I don't see the relevance of
what you describe?

The described scenario would only make sense at the micro
generate/fetch/update level, correct?


RE: scheduled recrawling

Posted by Markus Jelsma <ma...@openindex.io>.

I accidentally sent my email too soon; here's a new reply:

------------

Well, by default Nutch uses the DefaultFetchSchedule (see API docs). This scheduler relies on a few settings (db.fetch.interval.*; see nutch-default for defaults and descriptions). Out of the box this means each page will be eligible for refetch every 30 days.

You can set another (or custom) scheduler via db.fetch.schedule.class. Another shipped scheduler is org.apache.nutch.crawl.AdaptiveFetchSchedule (see API docs). This scheduler can refetch frequently changing pages more often and less frequently changing pages less often. Again, see nutch-default for defaults and descriptions.

There's also the MimeAdaptiveFetchSchedule, which extends AdaptiveFetchSchedule and lets you change the increments and decrements depending on MIME type. The reasoning is that non-HTML URLs commonly change much less often than HTML.
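
As a hedged illustration of those settings, the sketch below writes a small fragment to a scratch file for review instead of touching conf/nutch-site.xml directly. The file name fetch-schedule-fragment.xml and the one-day interval are assumptions picked for the example; the properties would need to be merged by hand into the <configuration> element of your nutch-site.xml.

```shell
#!/bin/sh
# Write the overrides to a scratch file; nothing in conf/ is modified.
cat > fetch-schedule-fragment.xml <<'EOF'
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds; 86400 = 1 day (illustrative, the shipped value is 30 days) -->
  <value>86400</value>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
EOF
```

With the adaptive scheduler, the db.fetch.schedule.adaptive.* properties in nutch-default then govern how quickly the interval shrinks or grows, so it is worth reading their descriptions before changing anything.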


RE: scheduled recrawling

Posted by Markus Jelsma <ma...@openindex.io>.

Well, by default Nutch uses the DefaultFetchSchedule (see API docs). This scheduler relies on a few settings (db.fetch.interval.*; see nutch-default for defaults and descriptions). Out of the box this means each page will be eligible for refetch every 30 days.

You can set another (or custom) scheduler  
 

Re: scheduled recrawling

Posted by Joe Zhang <sm...@gmail.com>.

Sorry, I meant that I didn't see a need to configure Nutch at all, given the use of
cron jobs. What would be the proper scenarios for needing to configure Nutch,
and how?


Re: scheduled recrawling

Posted by Lewis John Mcgibbney <le...@gmail.com>.

what scenario?

Is there a problem with Nutch?




-- 
Lewis

Re: scheduled recrawling

Posted by Joe Zhang <sm...@gmail.com>.

My application is monitoring a particular site. Thus I envision periodically
running the same nutch crawl command through a cron job. I just don't see
why I need to set up any scheduling within Nutch. Could you explain the
scenario?


RE: scheduled recrawling

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - Nutch will crawl when the crawl cycle is started. So you must either run it continuously or invoke it via a cron job. You can check the Javadocs for the FetchSchedule implementations for more information on scheduling.
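
For the cron route, a wrapper along these lines might be a reasonable starting point. It is only a sketch: the install path /opt/nutch, the lock directory, the crawl directory name, the Solr URL, and the "run" argument are all assumptions for illustration, and the crawl options are placeholders to adapt.

```shell
#!/bin/sh
# mycrawl.sh: one crawl cycle, intended to be started by cron.
# Every path and value below is an assumption; adjust to your setup.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"
CRAWL_DIR="${CRAWL_DIR:-blahblah}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"
LOCK_DIR="${LOCK_DIR:-/tmp/mycrawl.lock}"

run_crawl() {
  # mkdir is atomic, so it doubles as a simple lock: if the previous crawl
  # is still running, skip this run instead of starting a second one.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "previous crawl still running, skipping" >&2
    return 0
  fi
  (
    cd "$NUTCH_HOME" &&
    bin/nutch crawl urls -solr "$SOLR_URL" -dir "$CRAWL_DIR" -depth 5 -topN 1000
  )
  rc=$?
  rmdir "$LOCK_DIR"
  return "$rc"
}

# Only start a crawl when invoked with "run", so that sourcing this file
# has no side effects.
if [ "${1:-}" = "run" ]; then
  run_crawl
fi

# Example crontab entry (daily at 02:00):
#   0 2 * * * /path/to/mycrawl.sh run >> /var/log/mycrawl.log 2>&1
```

The lock matters because a crawl can take longer than the cron period. Cron also starts jobs with a minimal environment, so variables such as JAVA_HOME may need to be set inside the script.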

 
 