Posted to user@nutch.apache.org by Max Stricker <st...@gmail.com> on 2011/08/19 19:01:01 UTC

force recrawl

Hi,


How do you manage recrawling with current Nutch versions (1.2 or 1.3)?
I have seen some scripts in the wiki for older versions, but none for 1.2 or above.
Generally db.fetch.interval.default seems OK for my use case, but there are situations where I need to
force a recrawl. How would you do this? bin/nutch crawl does not seem to have an option for this.
Ideally, I would like to force a recrawl only for URLs matching a defined regex.

Any ideas?

Re: force recrawl

Posted by lewis john mcgibbney <le...@gmail.com>.
If you only wish to recrawl that one page, I'm sure this could
easily be set up by writing a bash script that specifies the -adddays argument
with your commands. This could then be set up and run as a cron job?

Please someone correct me if I am wrong.

On Fri, Aug 26, 2011 at 10:22 PM, Radim Kolar <hs...@sendmail.cz> wrote:

> It would be nice to have a command that alters the database refetch times of
> specified URLs, with a configuration like this:
>
> ^http://www\.google\.com/?$  1d   # fetch google homepage daily
>
> I am willing to help by sponsoring the development and testing of such a thing.
>



-- 
*Lewis*

Re: force recrawl

Posted by Radim Kolar <hs...@sendmail.cz>.
It would be nice to have a command that alters the database refetch times
of specified URLs, with a configuration like this:

^http://www\.google\.com/?$  1d   # fetch google homepage daily

I am willing to help by sponsoring the development and testing of such a thing.
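
To illustrate the idea, here is a minimal shell sketch of how such a rule file could be read. It is only an assumption about what the proposed format might mean, not an existing Nutch feature: the function name interval_for, the "pattern interval # comment" layout, and the treatment of a bare number as seconds are all hypothetical. How the resulting interval would be written back into the crawldb is left open, since the thread does not identify a command for that.

```shell
# Hypothetical helper, not part of Nutch.
# interval_for URL RULES_FILE: print the refetch interval (in seconds) of the
# first rule whose regex matches URL; return 1 if no rule matches.
interval_for() {
  url=$1
  rules=$2
  while read -r pattern interval _rest; do
    # Skip blank lines and lines that are only a comment.
    case $pattern in ''|'#'*) continue ;; esac
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      case $interval in
        *d) echo $(( ${interval%d} * 86400 )) ;;  # "1d" means one day
        *)  echo "$interval" ;;                   # assumption: plain seconds
      esac
      return 0
    fi
  done < "$rules"
  return 1
}

# Sample rule file containing the line from the message above.
rules=$(mktemp)
cat > "$rules" <<'EOF'
^http://www\.google\.com/?$  1d   # fetch google homepage daily
EOF

interval_for "http://www.google.com/" "$rules"   # prints 86400
```

The first matching rule wins in this sketch; a real implementation would have to decide how overlapping patterns are resolved.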

Re: force recrawl

Posted by lewis john mcgibbney <le...@gmail.com>.
Correct

There should be comprehensive documentation on the wiki for these parameters
(and many more)

On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> -adddays is not a crawl switch but a generator switch. You cannot use it with
> the crawl command.
>
> > But if I use
> > bin/nutch crawl urls -dir crawl -depth 2 -topN 50
> > -adddays does not have any effect.
> > Does anyone have a Nutch crawl script that can also be used to force a recrawl?
> >
> > > Well, actually, you can! I seem to have forgotten the -adddays switch of
> > > the generator. It adds the given number of days to the current time so that
> > > URLs with fetch times in the future become eligible for fetching.
>



-- 
*Lewis*

Re: force recrawl

Posted by Markus Jelsma <ma...@openindex.io>.
-adddays is not a crawl switch but a generator switch. You cannot use it with the crawl
command.
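
Since the switch belongs to the generator, the crawl has to be run as separate steps. Below is a minimal sketch of one such round. The wrapper name recrawl_once, the NUTCH variable and the crawl/crawldb and crawl/segments layout are assumptions made for this example; the parse step is needed only if your fetcher is not configured to parse while fetching.

```shell
# Hypothetical wrapper around one generate/fetch/parse/updatedb round.
# NUTCH, recrawl_once and the directory layout are assumptions of this sketch.
NUTCH=${NUTCH:-bin/nutch}

recrawl_once() {
  crawl_dir=$1   # e.g. crawl
  add_days=$2    # days to shift the generator's clock forward
  top_n=$3       # maximum number of URLs to select

  $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" \
      -topN "$top_n" -adddays "$add_days" || return 1

  # Segment directories are named by timestamp, so the newest sorts last.
  segment=$(ls -d "$crawl_dir"/segments/* | tail -1)

  $NUTCH fetch "$segment" || return 1
  $NUTCH parse "$segment" || return 1   # skip if the fetcher already parses
  $NUTCH updatedb "$crawl_dir/crawldb" "$segment"
}

# Example call (commented out so that sourcing this file runs nothing):
# recrawl_once crawl 31 50
```

A call such as "recrawl_once crawl 31 50" would treat every URL due within the next 31 days as eligible and select at most 50 of them. Pick a value for the second argument that exceeds your configured fetch interval if you want everything to be eligible.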

> But if I use
> bin/nutch crawl urls -dir crawl -depth 2 -topN 50
> -adddays does not have any effect.
> Does anyone have a Nutch crawl script that can also be used to force a recrawl?
> 
> > Well, actually, you can! I seem to have forgotten the -adddays switch of
> > the generator. It adds the given number of days to the current time so that
> > URLs with fetch times in the future become eligible for fetching.

Re: force recrawl

Posted by jasimop <st...@gmail.com>.
But if I use
bin/nutch crawl urls -dir crawl -depth 2 -topN 50
-adddays does not have any effect.
Does anyone have a Nutch crawl script that can also be used to force a recrawl?


> Well, actually, you can! I seem to have forgotten the -adddays switch of the
> generator. It adds the given number of days to the current time so that URLs
> with fetch times in the future become eligible for fetching.



--
View this message in context: http://lucene.472066.n3.nabble.com/force-recrawl-tp3268654p3268779.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: force recrawl

Posted by Markus Jelsma <ma...@openindex.io>.
Well, actually, you can! I seem to have forgotten the -adddays switch of the
generator. It adds the given number of days to the current time so that URLs
with fetch times in the future become eligible for fetching.
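
To make the effect concrete, here is a small shell sketch of the eligibility check as described above. It is an illustration only, not Nutch's actual code: the function name is_due and the sample timestamps are made up, and all times are assumed to be in epoch seconds.

```shell
# Illustration only, not Nutch code: a URL is eligible when its stored fetch
# time is not later than the current time shifted forward by add_days.
# is_due is a hypothetical helper; all times are epoch seconds.
is_due() {
  fetch_time=$1
  cur_time=$2
  add_days=$3
  [ "$fetch_time" -le $(( cur_time + add_days * 86400 )) ]
}

now=1300000000                       # arbitrary sample "current time"
next_fetch=$(( now + 30 * 86400 ))   # URL scheduled 30 days from now

if is_due "$next_fetch" "$now" 0; then
  echo "due without shift"
else
  echo "not due without shift"
fi

if is_due "$next_fetch" "$now" 31; then
  echo "due with 31 days added"
else
  echo "not due with 31 days added"
fi
```

With these sample values the first check reports "not due without shift" and the second reports "due with 31 days added".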

> I was going to ask the SAME question :-)  I think it is a PITA that you
> can't force a recrawl.  I wonder if it could be accomplished by altering the
> codebase?

Re: force recrawl

Posted by webdev1977 <we...@gmail.com>.
I was going to ask the SAME question :-)  I think it is a PITA that you can't
force a recrawl.  I wonder if it could be accomplished by altering the codebase?

--
View this message in context: http://lucene.472066.n3.nabble.com/force-recrawl-tp3268654p3268748.html
Sent from the Nutch - User mailing list archive at Nabble.com.