Posted to user@nutch.apache.org by Mike Pountney <Mi...@semantico.com> on 2010/09/03 11:17:41 UTC

Dynamically changing the URL retry interval

I'd like to refetch pages more often when I know that they change frequently.

Does anyone know of a way to set a lower retry interval on a set of pages matched by a regex? 

Thanks in advance,

Mike


Re: Dynamically changing the URL retry interval

Posted by Julien Nioche <li...@gmail.com>.
Hi Mike,

Using the Adaptive Fetch Schedule is definitely a good option.
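
As a rough illustration of the idea behind an adaptive schedule, here is a simplified model in Python: the interval shrinks when a page was found modified and grows when it was not, clamped between a minimum and a maximum. The function name, rates and bounds are assumptions chosen for the sketch; it is not Nutch's actual implementation.

```python
def next_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=365 * 24 * 3600.0):
    """Return the next fetch interval in seconds (simplified model).

    All default values here are illustrative assumptions.
    """
    if modified:
        # Page changed since the last fetch: come back sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page unchanged: back off and come back later.
        interval *= (1.0 + inc_rate)
    # Keep the interval inside the allowed range.
    return max(min_interval, min(max_interval, interval))


# Start from 30 days and replay a hypothetical fetch history.
interval = 30 * 24 * 3600.0
for changed in (True, True, False):
    interval = next_interval(interval, changed)
    print(f"modified={changed} -> next fetch in {interval / 86400:.1f} days")
```

With a model like this, pages that keep changing drift towards the minimum interval without any per-URL configuration, which is why it fits the "frequently changing pages" case.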

You can also specify a custom nutch.fetchInterval per URL of the seedlist
via their metadata e.g.

http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

but this works only for the URLs known at the injection step. Writing a
custom map-reduce job to set a custom fetch interval for the URLs matching a
regex would not be too difficult.
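
For the injection-time case, a small script could generate the seed list and pick the interval by regex. The sketch below is only an illustration: the rule patterns, the default interval and the function names are hypothetical, and it assumes the metadata fields are separated by real tab characters, which is what \t in the example above appears to denote.

```python
import re

DAY = 24 * 3600  # seconds in one day

# Hypothetical rules: (pattern, fetch interval in seconds). First match wins.
RULES = [
    (re.compile(r"^https?://[^/]+/news/"), 1 * DAY),  # fast-changing section
    (re.compile(r"\.pdf$"), 90 * DAY),                # rarely changing files
]
DEFAULT_INTERVAL = 30 * DAY  # 2592000 seconds, as in the example above


def interval_for(url, rules=RULES, default=DEFAULT_INTERVAL):
    """Return the interval of the first rule matching the URL, else the default."""
    for pattern, interval in rules:
        if pattern.search(url):
            return interval
    return default


def seed_line(url):
    """Format one seed-list line: URL, a tab, then the fetch interval metadata."""
    return f"{url}\tnutch.fetchInterval={interval_for(url)}"


for url in ("http://www.nutch.org/", "http://www.example.com/news/today.html"):
    print(seed_line(url))
```

The same interval_for decision is what a custom job would apply to each URL already in the crawl database; the Hadoop wiring is left out here.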

In Nutch 2.0 we should be able to do that on a host basis (see
https://issues.apache.org/jira/browse/NUTCH-882).

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 3 September 2010 10:52, Mike Pountney <Mi...@semantico.com> wrote:

> To partially answer my own question, I've just found the Adaptive Fetch
> Schedule blog post by Pascal Dimassimo:
>
> http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
>
> I think this covers what I need to do, but any other advice in this area
> would be appreciated.
>
> Cheers,
>
> Mike
>
> --
> Mike Pountney
>
> Information Systems Manager, Semantico Limited
> <ma...@semantico.com> <tel:+44 1273 358 209>
> Registered in England and Wales no. 03841410, VAT no. GB-744614334.
> Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.
>
> Check out all our latest news and thinking on our blog;
> - http://blogs.semantico.com/discovery-blog/
>
> Follow Semantico on Twitter;
> - http://twitter.com/semantico
>
>

Re: Dynamically changing the URL retry interval

Posted by Mike Pountney <Mi...@semantico.com>.
To partially answer my own question, I've just found the Adaptive Fetch Schedule blog post by Pascal Dimassimo:

http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

I think this covers what I need to do, but any other advice in this area would be appreciated.

Cheers,

Mike

On 3 Sep 2010, at 10:17, Mike Pountney wrote:

> 
> I'd like to refetch pages that I know change frequently more often.
> 
> Does anyone know of a way to set a lower retry interval on a set of pages matched by a regex? 
> 
> Thanks in advance,
> 
> Mike
> 

--
Mike Pountney

Information Systems Manager, Semantico Limited
<ma...@semantico.com> <tel:+44 1273 358 209>
Registered in England and Wales no. 03841410, VAT no. GB-744614334.
Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.

Check out all our latest news and thinking on our blog;
- http://blogs.semantico.com/discovery-blog/

Follow Semantico on Twitter;
- http://twitter.com/semantico