Posted to user@nutch.apache.org by Max Stricker <st...@gmail.com> on 2011/08/19 19:01:01 UTC
force recrawl
Hi,
How do you manage recrawling with current Nutch versions (1.2 or 1.3)?
I have seen some scripts in the wiki for older versions, but none for 1.2 or above.
Generally db.fetch.interval.default is fine for my use case, but there are situations where I need to
force a recrawl. How would you do this? bin/nutch crawl does not seem to have an option for this.
Ideally, I would like to force a recrawl only for URLs matching a defined regex.
Any ideas?
Re: force recrawl
Posted by lewis john mcgibbney <le...@gmail.com>.
If you only want to recrawl that one page, I'm sure this could
easily be set up by writing a bash script that passes the -adddays argument
to your commands. This could then be scheduled and run as a cron job.
Please someone correct me if I am wrong.
On Fri, Aug 26, 2011 at 10:22 PM, Radim Kolar <hs...@sendmail.cz> wrote:
> It would be nice to have command which will alter database refetch times in
> specified URLs. With configuration like that:
>
> ^http://www\.google\.com/?$ 1d # fetch google homepage daily
>
> I am willing to help with sponsoring development and testing of such thing.
>
--
*Lewis*
Re: force recrawl
Posted by Radim Kolar <hs...@sendmail.cz>.
It would be nice to have a command that alters the database refetch times
for specified URLs, with a configuration like this:
^http://www\.google\.com/?$ 1d # fetch google homepage daily
I am willing to help by sponsoring the development and testing of such a thing.
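
To make the proposed rule format concrete, here is a minimal sketch of how such a file could be read. No such command exists in Nutch as far as this thread shows; the function name interval_for_url and the "first matching rule wins" behaviour are assumptions made purely for illustration, and nothing here touches the crawldb.

```shell
#!/bin/sh
# Hypothetical helper: look up the refetch interval for a URL in a rules file
# whose lines look like:  <regex> <interval> # optional comment
# The first matching rule wins (an assumption, not Nutch behaviour).
interval_for_url() {
  rules_file=$1
  url=$2
  while read -r pattern interval rest; do
    # Skip blank lines and lines that are only a comment.
    case $pattern in
      ''|'#'*) continue ;;
    esac
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      printf '%s\n' "$interval"
      return 0
    fi
  done < "$rules_file"
  return 1   # no rule matched
}

# Demo with the rule from the message above.
rules_demo=$(mktemp)
printf '%s\n' '^http://www\.google\.com/?$ 1d # fetch google homepage daily' > "$rules_demo"
interval_for_url "$rules_demo" "http://www.google.com/"   # prints 1d
rm -f "$rules_demo"
```

A real implementation would still have to translate the interval into a new fetch time in the crawldb, which this sketch does not attempt.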
Re: force recrawl
Posted by lewis john mcgibbney <le...@gmail.com>.
Correct.
There should be comprehensive documentation on the wiki for these parameters
(and many more).
On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma <ma...@openindex.io> wrote:
> addDays is not a crawl switch but a generator switch. You cannot use the
> crawl command.
>
> > But if I use
> > bin/nutch crawl urls -dir crawl -depth 2 -topN 50
> > addDays does not have any effect.
> > Has anyone a nutch crawl script that can also be used to force a recrawl?
> >
> > > Well, actually. You can! I seem to have forgotten the -addDays switch
> > > of the generator. It adds #days to the current time to force URL's with
> > > fetch times in the future to be eligible for fetch.
> >
>
--
*Lewis*
Re: force recrawl
Posted by Markus Jelsma <ma...@openindex.io>.
addDays is not a crawl switch but a generator switch, so you cannot use it
with the crawl command.
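
Since the switch belongs to the generator, a forced recrawl has to run the individual steps instead of bin/nutch crawl. The following is only a rough sketch of one such round. The directory layout (crawl/crawldb, crawl/segments), the default values, the DRY_RUN variable and the function names are assumptions, and the flag is written -adddays because that is how I recall the generator's usage message spelling it; please check the output of bin/nutch generate on your version.

```shell
#!/bin/sh
# Sketch of one forced-recrawl round with the individual Nutch 1.x commands.
# All paths and defaults below are assumptions; adjust them to your setup.
NUTCH=${NUTCH:-bin/nutch}        # path to the nutch launcher script
CRAWL_DIR=${CRAWL_DIR:-crawl}    # same directory as "-dir crawl" in the thread
ADD_DAYS=${ADD_DAYS:-31}         # days added to "now" by the generator
TOP_N=${TOP_N:-50}
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the commands, 0 = run them

# Print a command and, unless in dry-run mode, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

recrawl_round() {
  run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" \
      -topN "$TOP_N" -adddays "$ADD_DAYS" || return 1
  if [ "$DRY_RUN" = "1" ]; then
    segment="$CRAWL_DIR/segments/NEWEST_SEGMENT"   # placeholder for display only
  else
    # Segment directories are named by timestamp, so the last one is the newest.
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
  fi
  run "$NUTCH" fetch "$segment" || return 1
  # Separate parse step as in the 1.3 layout; on 1.2 the fetcher may already parse.
  run "$NUTCH" parse "$segment" || return 1
  run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
}

recrawl_round
```

With the defaults this only prints the four commands. Setting DRY_RUN=0 would execute them; link inversion and indexing are left out of the sketch.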
> But if I use
> bin/nutch crawl urls -dir crawl -depth 2 -topN 50
> addDays does not have any effect.
> Has anyone a nutch crawl script that can also be used to force a recrawl?
>
> > Well, actually. You can! I seem to have forgotten the -addDays switch of
> > the generator. It adds #days to the current time to force URL's with
> > fetch times in the future to be eligible for fetch.
>
Re: force recrawl
Posted by jasimop <st...@gmail.com>.
But if I use
bin/nutch crawl urls -dir crawl -depth 2 -topN 50
addDays does not have any effect.
Does anyone have a Nutch crawl script that can also be used to force a recrawl?
> Well, actually. You can! I seem to have forgotten the -addDays switch of the
> generator. It adds #days to the current time to force URL's with fetch times
> in the future to be eligible for fetch.
--
View this message in context: http://lucene.472066.n3.nabble.com/force-recrawl-tp3268654p3268779.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: force recrawl
Posted by Markus Jelsma <ma...@openindex.io>.
Well, actually, you can! I seem to have forgotten the -addDays switch of the
generator. It adds the given number of days to the current time, so that URLs
whose fetch time lies in the future become eligible for fetching.
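
The effect of the switch can be illustrated with plain arithmetic. This is a simplified model, not Nutch code: the function name is_due is made up, and the 30-day interval is only an example value.

```shell
#!/bin/sh
# Simplified model of the generator's check: a URL counts as due when its
# scheduled fetch time is not later than "now" shifted forward by ADD_DAYS.
# is_due FETCH_TIME NOW ADD_DAYS   (times in seconds since the epoch)
is_due() {
  fetch_time=$1
  now=$2
  add_days=$3
  shifted_now=$((now + add_days * 86400))   # 86400 seconds per day
  [ "$fetch_time" -le "$shifted_now" ]
}

now=$(date +%s)
next_fetch=$((now + 30 * 86400))   # page scheduled for refetch in 30 days

is_due "$next_fetch" "$now" 0  && echo "due" || echo "not due"   # prints: not due
is_due "$next_fetch" "$now" 31 && echo "due" || echo "not due"   # prints: due
```

In other words, passing a number of days larger than the remaining wait makes every scheduled URL look overdue to the generator.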
> I was going to ask the SAME question :-) I think it is a PITA that you
> can't force a recrawl. Wonder if could be accomplished by altering the
> codebase?
>
Re: force recrawl
Posted by webdev1977 <we...@gmail.com>.
I was going to ask the SAME question :-) I think it is a PITA that you can't
force a recrawl. I wonder if it could be accomplished by altering the codebase?
--
View this message in context: http://lucene.472066.n3.nabble.com/force-recrawl-tp3268654p3268748.html
Sent from the Nutch - User mailing list archive at Nabble.com.